Arun Kalyanasundaram

Hewlett Packard Bangalore, India


[email protected]


ABSTRACT

Identifying the most influential nodes in a network is a well known problem which has received significant attention in the research community. However, as networks grow larger, an interesting variation of the problem becomes relevant, wherein the aim is to maximize the influence not on the whole network but only on a subset of its nodes. We approach the subset specific top-k influential problem as a standalone problem and show that, unlike traditional approaches, a search for subset specific top-k influentials can be terminated earlier based on a parameter γ, thus allowing a trade-off between efficiency and effectiveness. For social networks, this parameter has a behavioral interpretation: it captures the ease of influencing nodes in the network. This work makes three key contributions. First, we propose an iterative network pruning algorithm to find subset specific top-k influentials and compare its performance to subset-adapted existing algorithms for various values of γ on real world data sets. Second, we extend the existing analytical framework for top-k influential detection to incorporate γ. Third, we analyze our algorithm under our analytical framework and show that the influence spread function continues to be sub-modular. Though our work has been motivated by online social networks, we believe that it is useful in other domains where diffusion over networks is considered.

1. INTRODUCTION

Many processes can be modeled as diffusion over networks; examples include information diffusion in web-based networks, sensor networks, distribution networks, and epidemiology [8][4][9]. An algorithmic problem related to diffusion in social networks was identified in [5] as follows: given an underlying network, what set of individuals (seed set) should be 'infected'/'seeded' with some information to trigger a large scale cascade in the network? In other words, how do we find the set of nodes to seed with some information so as to maximize the spread of the seeded information in the network? This is often referred to as the top-k influential detection problem. Though the problem was identified in the context of viral marketing and advertising [5], it also has applications in other processes on the web and in distribution and sensor networks for early cascade detection [9].

In this work, we address the subset-specific version of this problem, where the aim is to find the top-k influentials for some given subset of the nodes. This problem is motivated by the observation that as online social networks grow in size and/or density, it is often desirable to find the nodes which maximize the spread of influence to a subset of the nodes. For example, a viral marketing campaign on Facebook (800+ million users1) or Twitter (200+ million users2) may want to target a small subset of users by demographics (e.g. nationality, age, sex, etc.), or local businesses may want to focus only on people living in a certain geographical area. In another use case, the given problem may manifest itself in the design of a political (commercial) campaign which aims to focus only on nodes which are supporters/detractors (fans/critics). Given the popularity of large scale data mining and sentiment analysis tools, it is often possible to segregate nodes into these categories.
Once such segregation is possible, the design of such targeted campaigns becomes the next challenge, and the application of the proposed problem statement becomes relevant. The subset-specific top-k influential problem potentially offers the opportunity to improve the efficiency of traditional approaches. Many algorithms for top-k influential detection are notoriously inefficient, and their run-times grow rapidly as the size and/or density of the network increases. This challenge becomes particularly extreme when the network structure changes faster than the algorithm can complete, and hence the results may no

Categories and Subject Descriptors
F.2.2 [Analysis of Algorithms and Problem Complexity]: Non-numerical Algorithms and Problems

General Terms
Algorithms, Performance, Experimentation

Keywords
influentials, social networks, information diffusion

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WebSci 2012, June 22–24, 2012, Evanston, Illinois, USA. Copyright 2012 ACM 978-1-4503-1228-8 ...$10.00.

1 http://www.facebook.com/notes/statspotting/facebooknow-has-more-than-800-million-activeusers/204500822949549
2 http://www.bbc.co.uk/news/business-12889048

Figure 1: The set of influentials, A, may not lie in D0; the shaded nodes form the set D0.

Figure 2: Different sub-graphs for D0 = {v4, v6}.

longer be valid. One approach suggested in existing literature [8] is to explicitly allow each node v to have an associated weight zv capturing how important it is that v be activated. However, we show how the subset specific top-k influential problem is an opportunity to design more efficient algorithms. We propose an iterative pruning algorithm and compare its performance to subset adapted existing algorithms. We introduce a parameter, γ, which can be tuned to allow a trade-off between the efficiency and effectiveness (influence spread) of our algorithm. In social networks, this parameter has a behavioral interpretation: it can be used to specify how easy it is to influence a particular user, a concept earlier introduced in [15]. Further, we show that our proposed iterative pruning algorithm ensures that the influence spread function continues to be sub-modular across iterations, thus enabling the evaluation of performance guarantees.

The rest of this paper is organized as follows. In section 2, we motivate the problem we are addressing and state it formally. In section 3, we review related work and its application to our problem. In section 4, we extend the framework proposed in [8] for our problem, propose a subset-adapted version of the greedy algorithm and discuss our iterative pruning algorithm. In section 5, we analyze our proposed algorithm and derive a lower bound for the influence spread function. In section 6, we provide results of simulations of our algorithms on real world data sets, compare them with existing algorithms and explain the conditions under which our algorithms perform better. Proofs of sub-modularity of the influence spread function used in our approach are in section 7. We conclude in section 8 with a summary and potential directions for future work.

2. PROBLEM STATEMENT

Formally, our problem can be stated as follows: given a weighted graph G(V, E), w : E → R, and a destination set D0 ⊆ V, find the top-k nodes in V which maximize the spread of influence on D0. Alternately: given a weighted graph G(V, E), w : E → R, and a destination set D0 ⊆ V, find the top-k nodes in V which, when initialized with some information, maximize the reach of this information to the nodes in D0 under some diffusion model.

We note three salient properties of this problem. (i) The top-k influential nodes for a given destination set D0 may lie outside D0. Figure 1 shows three sample graphs where the shaded nodes form the set D0 and the set of influentials, A, may lie outside D0. Alternately, we may want the seed set A to lie outside D0, for example, if we do not want to seed the information extrinsically to D0 but rather have it reach D0 via social influence. (ii) Maximizing the spread of influence to a 'given subset of the nodes' is different from maximizing it over a 'given sub-graph'. Consider the graph G in Figure 2(a), in which the destination set D0 consists of v4 and v6, i.e., we want to maximize the influence on v4 and v6. Figure 2(b) shows the sub-graph induced by v4 and v6. Finding the top-2 influentials in this sub-graph will return v4 and v6; however, this ignores the potential flow of information to v4 and v6 via other nodes in G. Figure 2(c) shows another approach where we consider the sub-graph which includes all nodes which are no more than 1 hop away from the destination set, together with the corresponding edges. Figure 2(d) shows the same approach extended to include all nodes which are no more than 2 hops away from the destination set. In this example, when we consider nodes which are no more than 3 hops away from the destination set, we recover the original graph G of Figure 2(a). (iii) When D0 = V, the problem statement reduces to the traditional top-k influential detection problem.
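The influence spread on a destination set can be estimated by direct Monte Carlo simulation of a diffusion model. The following is a minimal Python sketch under the ICM (discussed in section 3) with a uniform propagation probability p; the toy graph, parameter values and function name are illustrative and not part of the paper's implementation.

```python
import random
from collections import deque

def icm_spread_on_subset(adj, p, seeds, d0, n_sim=1000, rng=None):
    """Estimate sigma(A)_D0: the expected number of nodes of d0
    activated when `seeds` are activated under the ICM."""
    rng = rng or random.Random(0)
    total = 0
    for _ in range(n_sim):
        active, frontier = set(seeds), deque(seeds)
        while frontier:
            v = frontier.popleft()
            for u in adj.get(v, ()):
                # each edge gets one independent activation attempt
                if u not in active and rng.random() < p:
                    active.add(u)
                    frontier.append(u)
        total += len(active & d0)
    return total / n_sim

# Toy graph: a path v1 - v2 - v3 with D0 = {v3}
adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2"]}
print(icm_spread_on_subset(adj, 0.5, {"v1"}, {"v3"}))  # roughly p^2 = 0.25
```

Seeding v1 can influence v3 only through two successive edge activations, so the estimate converges to p² = 0.25 as the number of simulations grows.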

3. RELATED WORK

The problem of finding the most influential nodes in online social networks was proposed in [5] in the context of viral marketing strategies in social networks. This problem has been shown to be NP-hard, and various approximation algorithms have been suggested in the literature [8]. Broadly, these algorithms may be classified into two categories: structural and dynamical. Structural approaches rely primarily on the structural properties of the underlying graph (degree, centrality, etc.) to determine the most influential nodes. These approaches often use measures of centrality from the field of social network analysis [14]; for example, degree centrality, betweenness centrality and eigenvector centrality may be used as proxies for influence. Dynamical approaches are 'richer' in that, besides the network structure, they account for potential flows and their interference in the network structure. However, this richness in the model comes at the cost of reduced efficiency. Dynamical approaches for determining the most influential nodes aim to capture the dynamics of the diffusion process on the underlying network, i.e., they assume an underlying diffusion model which can be parameterized with empirical data from a real world implementation. For example, the probability that a given node will receive information from a neighbor on Twitter may be determined using non-structural parameters like the frequency of tweeting and/or how often a user reads tweets, as has been done in [12]. In [8], an analysis framework based on the sub-modularity of the influence spread functions was proposed and was used to analyze the problem of finding the top-k influential nodes using the Independent Cascade Model (ICM) [6] and the Linear Threshold Model (LTM) [7] as the underlying diffusion models. Based on this analysis, it was shown that the optimal solution for influence maximization can be approximated by the greedy algorithm [8] to within a factor of 63% for both diffusion models.

The LTM is a deterministic model in which a node v weighs the state of each of its neighbors u by a weight bv,u and compares the result to a threshold θv: if the weighted fraction of v's active neighbors exceeds θv, v sets its state to active. Unlike the LTM, the ICM is a stochastic model: at each time step t, an active node v is given a single chance to activate each of its neighbors u, with an independent success probability pv,u. Even though the greedy algorithm has been shown to be effective within 63% of the optimal solution, it has a notoriously high run-time, and various optimizations have been proposed in the literature. In [9], a gain in efficiency is obtained by further exploiting the sub-modularity of the influence spread function, using the fact that as the greedy algorithm iteratively grows the set A, the marginal gain due to any node cannot increase between iterations. Based on this observation, [9] proposes an algorithm called CELF (Cost-Effective Lazy Forward selection) which generates efficiency gains by calculating the marginal gain due to only a sub-set of the nodes in each iteration. Since CELF does not calculate the marginal gain due to all nodes in each iteration, it is more efficient than the greedy algorithm. In [3], the efficiency of the greedy algorithm is improved by a degree discount heuristic which discounts the gain in influence due to a new node u if the set A already contains a neighbor of u. In addition, they do not take into account the multi-hop propagation of influence (called indirect influence), which makes their algorithm much simpler and leads to efficiency gains. In [10], an alternate approach is proposed where the information diffusion process is modeled as a co-operative game and the concept of the Shapley Value [13] is used to find the (approximate) marginal contribution of a node to a group / set; based on this, they find the top-k influential nodes. From an efficiency perspective, their approach is interesting because the run-time of their algorithm is independent of the value of k, the desired cardinality of the seed set. A recent work [1] proposes efficient algorithms for the top-k influential detection problem; they show that their techniques, when modified appropriately, can be applied to the problem of subset specific influential detection, but they do not show any performance guarantees.

Our work is unique in that it proposes an algorithm for subset specific influential detection which enables a trade-off between efficiency and effectiveness using a parameter, γ, which can either be set externally or measured empirically based on the domain of the underlying network. Furthermore, despite iteratively pruning the graph to improve efficiency, we are able to maintain the sub-modularity of the influence spread function, thus enabling the potential of offering performance guarantees. We also believe that the extension of the analytical framework that we propose has standalone value for the analysis of other algorithms in this domain.
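The lazy forward selection idea behind CELF [9] applies to any monotone sub-modular set function, not only influence spread. The sketch below is our illustration rather than the implementation of [9]; it uses a simple set-coverage function as a stand-in for the influence spread, keeps marginal gains in a heap, and re-evaluates a gain only when a stale entry reaches the top.

```python
import heapq

def celf(candidates, gain_fn, k):
    """Lazy greedy (CELF-style) selection for a monotone sub-modular
    set function. gain_fn(S) returns the function value for the set S."""
    selected, best_val = [], 0.0
    # initial marginal gains, stored as (-gain, node, iteration_computed)
    heap = [(-gain_fn([v]), v, 0) for v in candidates]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_gain, v, it = heapq.heappop(heap)
        if it == len(selected):
            # gain is up to date for the current seed set: select v
            selected.append(v)
            best_val += -neg_gain
        else:
            # re-evaluate the marginal gain lazily and push it back
            g = gain_fn(selected + [v]) - best_val
            heapq.heappush(heap, (-g, v, len(selected)))
    return selected, best_val

# coverage function: value = number of distinct targets covered
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
value = lambda S: len(set().union(*[covers[v] for v in S])) if S else 0
print(celf(list(covers), value, 2))  # (['a', 'b'], 4.0)
```

Only the stale entries that surface at the top of the heap are re-evaluated; on large graphs this avoids recomputing the marginal gain of every node in every iteration, which is exactly the source of CELF's speed-up over plain greedy.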

4. APPROACH

In this section, we first summarize the framework proposed in [8] and then extend it for our problem. The key quantity for analysis is the influence spread function σ(A), which quantifies, in expectation, the influence on G when a seed set of nodes, A, is activated extrinsically. The key observation of [8] is that the simulation of the diffusion dynamics is equivalent to a sampling of the edges in G, since the influence probabilities of the edges are independent of one another and of any previous history. Hence, instead of analyzing the 'set of active nodes' after simulation, we can analyze the 'sub-graphs of G(V, E)' obtained by sampling each edge vw ∈ E with probability pv,w. For a particular sub-graph X obtained by sampling G, the 'set of nodes reachable from A in X' represents the 'set of active nodes' that would have been activated in G with A as the seed set in some run of the diffusion simulation. Thus, if R(A, X) denotes the set of nodes reachable from each node of the set A in one sample X of G, it also denotes the set of active nodes when the nodes in A are activated extrinsically in one simulation of the ICM. The cardinality of this set, σX(A), is the number of nodes reachable from A in the sub-graph X (i.e., the number of nodes activated in one simulation sample). Hence,

σX(A) = |∪_{v ∈ A} R(v, X)| = |R(A, X)|    (1)

In [8] it is shown that σX(A) is sub-modular. Since the expected number of nodes activated, σ(·), is the weighted average of the number of nodes activated over all outcomes, σX(·), σ(·) is also sub-modular:

σ(A) = Σ_{outcomes X} Prob[X]·σX(A)    (2)

Therefore σ(A) is sub-modular, since it is a non-negative linear combination of the sub-modular functions σX(A), as shown in (2). Using earlier results from [11], [8] showed that for sub-modular functions the performance of the greedy algorithm is within a factor of 63% of the optimal algorithm. For our problem, we want to find the influentials for a given destination set, D0. We define the expected influence spread on D0 as σ(A)D0; in other words, σ(A)D0 is the expected number of nodes activated in the destination set D0 when the initial seed set is A. For a given live-edge graph X, σX(A)D0 is simply the number of nodes that are reachable from A and belong to the set D0. Hence,

σX(A)D0 = |R(A, X) ∩ D0|    (3)

σ(A)D0 = Σ_{outcomes X} Prob[X]·σX(A)D0    (4)
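Equations (3) and (4) suggest a direct estimator: sample live-edge sub-graphs X by keeping each edge with probability p, then average |R(A, X) ∩ D0| over the samples. A minimal Python sketch of this estimator follows; the function name, toy graph and parameter values are our illustration, not the paper's code.

```python
import random
from collections import deque

def estimate_sigma_d0(edges, p, A, d0, n_samples=2000, seed=0):
    """Estimate sigma(A)_D0 via live-edge sampling, per eqs. (3)-(4):
    keep each edge with probability p, then average |R(A, X) ∩ D0|."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # sample a live-edge sub-graph X of the undirected graph
        live = {}
        for v, w in edges:
            if rng.random() < p:
                live.setdefault(v, []).append(w)
                live.setdefault(w, []).append(v)
        # R(A, X): nodes reachable from A in X, found by BFS
        reach, frontier = set(A), deque(A)
        while frontier:
            v = frontier.popleft()
            for w in live.get(v, ()):
                if w not in reach:
                    reach.add(w)
                    frontier.append(w)
        total += len(reach & d0)
    return total / n_samples

edges = [("v1", "v2"), ("v2", "v3")]
est = estimate_sigma_d0(edges, 0.5, {"v1"}, {"v3"})
print(round(est, 2))  # close to 0.25
```

On this path graph, v3 is reachable from v1 only when both edges are live, so the estimate approaches p² = 0.25, matching a direct simulation of the ICM; this is the equivalence the framework of [8] formalizes.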

4.1 Subset-Adapted Greedy

Since our approach is closest to that proposed in [8] and [9], we first modify the greedy algorithm proposed in [8] for the subset-specific influential detection problem. The pseudocode of the subset-adapted greedy approach is shown in Algorithm 1.

Algorithm 1: Pseudocode - Subset Adapted Greedy: General greedy [8] modified for subset specific influential detection.
Input: 1. G(V, E): An undirected graph. 2. k: Desired cardinality of A (the seed set of influentials). 3. D0: The destination set, such that D0 ⊆ V.
Output: A: The set of top-k influentials. σ(A)D0: The expected influence spread in D0.
1  A ← {}, N ← 10000
2  G(V′, E′) ← Sub-graph induced with D0 in G(V, E)
3  while (|A| < k) do
4    foreach v in V′ \ A do
5      δv ← 0
6      for X = 1 to N do
7        δv += |R(A ∪ {v}, X) ∩ D0|
8      end
9    end
10   A ← A ∪ {v : max(δv)}
11   σ(A)D0 ← max(δv)/N
12 end
13 Output A, σ(A)D0

4.2 Iterative Pruning

The Iterative Pruning approach is based on the notion of computing L(u, A), the likelihood that a particular node u would be influenced due to a seed set A. Similar to σ(A), this quantity is defined in expectation. However, we now have a finer resolution for analyzing the spread of influence due to a seed set A.

We introduce a system parameter γu ∈ [0, 1] which sets the threshold at which a node u should be considered 'influenced' in expectation. Thus, we consider a node u influenced if the likelihood that it will be influenced by a particular seed set A exceeds γu, i.e., L(u, A) exceeds γu, where γu could be drawn from some distribution, F(u), over the set of nodes. For readers familiar with the LTM [7][8], γu is similar to θu, the threshold for a node u to become active. However, there are two important differences. First, the LTM is a deterministic model and thus θu is a simple threshold, whereas we use a stochastic diffusion model (ICM) and thus γu is an expectation threshold. Second, whereas θu is compared only with the immediate neighborhood of u, γu incorporates the potential influence that can reach u from all over the network. The key idea in our approach is to identify a set of nodes ψ that are most likely to have been influenced by a given set A. The influence propagation to elements of ψ is annulled while evaluating the increase in influence due to a new node v being added to A. In other words, we de-prioritize the spread of influence to nodes in ψ, and we do this by pruning the graph G to get rid of all paths that lead ONLY to nodes in ψ. It is this graph pruning which enables significant efficiency gains in our approach (Algorithm 2). Equation (5) gives the invariant that holds for all nodes in ψ:

∀u ∈ ψ, L(u, A) ≥ γu    (5)

In Algorithm 2, as we grow the seed set A from A0 (empty) to Ak (containing the top-k influentials), we add to ψ the set of nodes that can be considered influenced due to Ai. The network is then pruned by first removing edges vu ∈ E such that u ∈ ψ and v ∈ V. Next, all edges wv are removed if v is not connected to any node in the set D0 \ ψ. This iterative pruning is repeated until no further removal of edges is possible.

Figure 3 explains the pruning approach on a small graph where the destination subset consists of v4 and v6. Initially both A and ψ are empty, as shown in Figure 3(a). In the first iteration, v3 is chosen as the most influential node and added to A. At this stage, we also add v4 to ψ since L(v4, A) exceeds γv4. Furthermore, we prune the graph to remove all edges that lead to v4 alone, leading to the graph in Figure 3(b). In the next iteration, v5 is chosen as the most influential node and added to A. Also, v6 is added to ψ since L(v6, A) exceeds γv6, and all edges that lead to v6 are removed, resulting in the graph in Figure 3(c). Algorithm 3 gives the pseudocode of the pruning process.

Algorithm 2: Pseudocode - Iterative Pruning (IPr)
Input: 1. G(V, E): An undirected graph. 2. D0: The destination set, such that D0 ⊆ V. 3. k: Desired cardinality of A. 4. ∀u ∈ D0, γu: The threshold at which a node u is considered 'influenced' in expectation.
Output: A: The set of top-k influentials. σ(A)D0: The expected influence spread in D0.
1  A ← {}, ψ ← {}, N ← 10000, i ← 0
2  G0(V′, E′0) ← Sub-graph induced with D0 in G(V, E)
3  while (|A| < k) ∧ (E′i ≠ ∅) do
4    foreach v in V′ \ (A ∪ ψ) do
5      δv ← 0
6      for X = 1 to N do
7        δv += |{∪_{w ∈ A∪{v}} R(w, X)} ∩ D0|
8        ∀u ∈ D0, L(u, A ∪ {v}) += LX(u, A ∪ {v})
9      end
10   end
11   δmax ← max(δv)/N
12   σ(A)D0 ← δmax + Σ_{i=0}^{|A|−1} ǫi(A)
13   A ← A ∪ {v : (δv = δmax)}
14   ∀u ∈ D0, L(u, A) ← L(u, A)/N
15   foreach u in D0 \ ψ do
16     if γu ≤ L(u, A) then
17       ψ ← ψ ∪ {u}
18     end
19   end
20   E′i+1 ← RemoveEdges(Gi(V′, E′i), D0, ψ)
21   i ← i + 1
22 end
23 Output A, σ(A)D0

Since our underlying diffusion model is stochastic (ICM), multiple instances of activating the initial set A may lead to different sets of nodes in G being activated. In the algorithm implementation, we run the ICM with the same initial set A multiple times and capture the set of resulting active nodes for each simulation. The sum, over nodes, of the fraction of times each node occurs in the 'sets of active nodes' is the expectation
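To make the control flow of Algorithm 2 concrete, the following simplified Python sketch mimics IPr on a toy graph: it estimates the likelihoods L(u, A ∪ {v}) by repeated ICM simulation, moves destination nodes whose likelihood exceeds γ into ψ, and then prunes their incident edges. It deliberately omits the ǫi(A) correction of line 12 and the full reachability pruning of Algorithm 3, and all names and parameter values are illustrative rather than the paper's implementation.

```python
import random
from collections import deque

def simulate(adj, p, seeds, n_sim, rng):
    """Run the ICM n_sim times; return per-node activation frequencies."""
    counts = {}
    for _ in range(n_sim):
        active, frontier = set(seeds), deque(seeds)
        while frontier:
            v = frontier.popleft()
            for u in adj.get(v, ()):
                if u not in active and rng.random() < p:
                    active.add(u)
                    frontier.append(u)
        for u in active:
            counts[u] = counts.get(u, 0) + 1
    return {u: c / n_sim for u, c in counts.items()}

def iterative_pruning(adj, p, d0, k, gamma, n_sim=500, seed=0):
    """Sketch of IPr: greedy seed selection on a graph that is pruned
    as destination nodes cross the 'influenced' threshold gamma."""
    rng = random.Random(seed)
    g = {v: list(ws) for v, ws in adj.items()}   # working (pruned) copy
    A, psi = [], set()
    for _ in range(k):
        best, best_gain, best_like = None, -1.0, None
        for v in sorted(set(g) - set(A) - psi):
            like = simulate(g, p, A + [v], n_sim, rng)  # L(u, A ∪ {v})
            gain = sum(like.get(u, 0.0) for u in d0 - psi)
            if gain > best_gain:
                best, best_gain, best_like = v, gain, like
        if best is None:
            break
        A.append(best)
        # move newly 'influenced' destination nodes into psi ...
        psi |= {u for u in d0 - psi if best_like.get(u, 0.0) >= gamma}
        # ... and drop all edges incident to them (first pruning step)
        for u in psi:
            for w in g.get(u, ()):
                if u in g.get(w, []):
                    g[w].remove(u)
            g[u] = []
    return A, psi

# Toy graph in the spirit of Figure 3, with D0 = {v4, v6}
adj = {"v1": ["v3"], "v2": ["v3"], "v3": ["v1", "v2", "v4"],
       "v4": ["v3"], "v5": ["v6"], "v6": ["v5"]}
A, psi = iterative_pruning(adj, 0.5, {"v4", "v6"}, 2, gamma=0.2)
print(A, sorted(psi))
```

In this tiny example the chosen influentials happen to lie inside D0 itself; on larger graphs, as Figure 1 illustrates, they need not.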

Figure 3: Pruning the graph to block the spread of influence to already influenced nodes (ψ). (a) G = G0, A = ∅, D0 = {v4, v6}, ψ = ∅. (b) G = G1, A = {v3}, D0 = {v4, v6}, ψ = {v4}. (c) G = G2, A = {v3, v5}, D0 = {v4, v6}, ψ = {v4, v6}.

that a set of nodes will be active due to a seed set A, which is the influence spread σ(A) due to A. Note that the above approach may be generalized to top-k influential detection for the whole graph, but it is especially suitable for subset specific influential detection, since we need to track L(u, A) only for nodes belonging to the destination set, D0. As the size of D0 grows, so does the space complexity of our algorithm. In section 5.2, we show that the additional runtime overhead due to iterative pruning is negligible compared to the overall reduction in runtime complexity.

Algorithm 2 consists of three phases. The first phase generates the sub-graph induced by D0 on G. The sub-graph is initialized by starting with D0 and then including all nodes (and corresponding edges) which have paths to nodes in D0. The second phase (lines 4-19) finds the node with the maximum marginal contribution. In the first iteration, the algorithm behaves exactly like the subset adapted greedy (Algorithm 1); subsequent iterations, however, operate on a pruned graph, Gi. Since the purpose of pruning is only to improve the algorithm's efficiency, the actual influence spread for a given seed set A in any iteration should be computed on the graph G0, which has no pruned edges. Line 12 of Algorithm 2 shows the corresponding computation step; its explanation is postponed to section 5. The final phase is the pruning process given in Algorithm 3 (invoked in line 20 of Algorithm 2). Algorithm 3 consists of two outer loops, shown in lines 2 and 8. The first loop removes all edges incident on nodes in ψ and also populates a set, S, with the nodes connected to ψ. The second loop iteratively prunes the edges of any node in S from which no node in the set D0 \ ψ is reachable.

Algorithm 3: Pseudocode - RemoveEdges
Input: 1. G(V, E): The graph G with the set of nodes V and edges E. 2. D0: The destination set, such that D0 ⊆ V. 3. ψ: The set of already influenced nodes, ψ ⊆ D0.
Output: The set of edges after the pruning operation.
1  S ← {}
2  foreach v in ψ do
3    foreach uv in E do
4      E ← E \ {uv}
5      S ← S ∪ {u}
6    end
7  end
8  foreach v in S do
9    if ¬ isReachable(v, D0 \ ψ) then
10     foreach uv in E do
11       E ← E \ {uv}
12       S ← S ∪ {u}
13     end
14   end
15 end
16 return E

5. ANALYSIS OF ITERATIVE PRUNING

Recall that R(A, X) is the set of nodes reachable from A in a sample sub-graph X of G. We introduce LX(u, A), defined as the likelihood that a node u would be influenced by a seed set A in a given sub-graph X. Therefore LX(u, A) indicates whether the node u is reachable from A in the sub-graph X:

LX(u, A) = 1 if u ∈ R(A, X), and 0 otherwise    (6)

In Section 4.2, we defined L(u, A); its significance can be understood with reference to σ(A). While σ(A) is the expected number of nodes activated in G, L(u, A) captures the expectation that a node u is active when A is the seed set. Hence L(u, A) is the weighted average of the likelihood of activation of a node u over all outcomes (sample sub-graphs):

L(u, A) = Σ_{outcomes X} Prob[X]·LX(u, A)    (7)

We now have a finer resolution for analyzing the spread of influence due to a seed set A. Specifically, we relate σ(A) to L(u, A) as follows:

σX(A) = Σ_{u ∈ V} LX(u, A)    (8)

Substituting (8) in (2), we get

σ(A) = Σ_{outcomes X} Prob[X] · Σ_{u ∈ V} LX(u, A) = Σ_{u ∈ V} Σ_{outcomes X} Prob[X]·LX(u, A)    (9)

Substituting (7) in (9), we get

σ(A) = Σ_{u ∈ V} L(u, A)    (10)
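Identity (10) can be checked numerically: estimating σ(A) and every L(u, A) from the same set of ICM simulations makes the two sides agree by construction. A small self-contained sketch follows; the toy graph and names are ours, not the paper's.

```python
import random
from collections import deque

def activation_stats(adj, p, seeds, n_sim=2000, seed=1):
    """Estimate sigma(A) and L(u, A) from the same ICM simulations,
    illustrating eq. (10): sigma(A) = sum over u of L(u, A)."""
    rng = random.Random(seed)
    counts, total_active = {}, 0
    for _ in range(n_sim):
        active, frontier = set(seeds), deque(seeds)
        while frontier:
            v = frontier.popleft()
            for u in adj.get(v, ()):
                if u not in active and rng.random() < p:
                    active.add(u)
                    frontier.append(u)
        total_active += len(active)
        for u in active:
            counts[u] = counts.get(u, 0) + 1
    sigma = total_active / n_sim
    L = {u: c / n_sim for u, c in counts.items()}
    return sigma, L

adj = {"v1": ["v2", "v3"], "v2": ["v1"], "v3": ["v1"]}
sigma, L = activation_stats(adj, 0.3, {"v1"})
print(abs(sigma - sum(L.values())) < 1e-9)  # True: equal by construction
```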

5.1 Influence Spread Function

The key challenge introduced by pruning the graph is that the iterative evaluation of the influence spread function σ(A) now takes place over a changing graph. We use σGi(A) and LGi(u, A) to denote the influence spread and the likelihood of influence computed on the graph Gi. Since Gi+1 is obtained from Gi by pruning edges, the influence spread in Gi+1 cannot be higher than in Gi. Hence,

σGi(A) = σGi+1(A) + ǫi(A)    (11)

LGi(u, A) = LGi+1(u, A) + ∆i(A)    (12)

where ǫi ∈ R+ and ∆i ∈ R+. The quantity ǫi(A) in (11) is the expected number of nodes that would be influenced by the seed set A if the edges pruned in the ith iteration were not pruned. Although the graph is iteratively pruned, the influence spread due to a given seed set in any iteration i should be computed on the original graph (G0), which has no pruned edges. Therefore we compute the function σG0(A) by applying (11) recursively over the pruning iterations:

σG0(A) = σGk(A) + Σ_{i=0}^{k−1} ǫi(A)    (14)

Since we are interested in maximizing the spread of influence to the destination set, D0, we focus on the function σ(A)D0. Using (10), the influence spread on the graphs Gi and Gi+1 can be written as

σGi(A) = Σ_{u ∈ V\ψi} L(u, A)    (15)

σGi+1(A) = Σ_{u ∈ V\ψi+1} L(u, A)    (16)

Since ψi ⊆ ψi+1, subtracting (16) from (15) and using (11), we get

ǫi(A) ≥ Σ_{u ∈ ψi+1\ψi} L(u, A)    (17)

We use (17) in our implementation of the pseudocode to compute ǫi(A). Therefore the influence spread computed by using (17) in (14) is a lower bound.

5.2 Runtime Complexity

Let G(V, E) be the sub-graph induced by a given destination set, D0. The runtime complexity of the subset adapted greedy (Algorithm 1) is O(k·|V|·|E|·N). The runtime complexity of Iterative Pruning (Algorithm 2) depends on the number of edges pruned. Suppose Gi(V, Ei) is the graph obtained after the ith iteration; then the runtime complexity of Algorithm 2 is O(|V|·N·Σ_{i=0}^{l} |Ei|), where l = k when |ψ| < |D0| and l ≤ k when |ψ| = |D0| (if all nodes in D0 are considered influenced, the algorithm can terminate even before k nodes are added to the seed set). Hence, the iterative pruning algorithm can be up to k − 1 times faster than the subset adapted greedy approach.

The pruning in Algorithm 2 introduces an additional runtime overhead; however, we show that it does not impact the overall runtime complexity. (i) Computation of L(u, A): this is done as part of the computation of influence spread using (10) and hence incurs no additional overhead. (ii) Computation of ǫi(A): this can be obtained from L(u, A) as shown in (17) and hence the additional overhead is only a constant factor. (iii) Threshold comparison: this is done for each element in D0 in every iteration of A, and hence incurs an additional overhead of O(k·|D0|). (iv) Pruning operation: its runtime can never exceed O(|E|) (the total number of edges in the sub-graph G). In addition, the pruning operation needs to compute the reachability of every node in D0 from any node v ∈ V; this can be pre-computed when inducing the sub-graph and stored in a data structure retrievable in constant time. The sum of the runtime overheads from (i) to (iv) is O(k·|D0| + |E|), which is negligible compared to the overall runtime complexity of Algorithm 2.
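The pruning operation (Algorithm 3) can be rendered directly in Python. The sketch below is our adjacency-set rendering of the two loops: the first strips every edge incident to ψ, the second repeatedly strips the edges of any touched node that can no longer reach D0 \ ψ. The function name and toy graph are illustrative, and reachability is checked with a plain BFS rather than the pre-computed structure mentioned in point (iv).

```python
from collections import deque

def remove_edges(adj, d0, psi):
    """Sketch of Algorithm 3 (RemoveEdges) on an adjacency-set graph."""
    g = {v: set(ws) for v, ws in adj.items()}
    targets = d0 - psi

    def can_reach_target(v):
        # BFS over the surviving edges from v towards d0 \ psi
        seen, frontier = {v}, deque([v])
        while frontier:
            x = frontier.popleft()
            if x in targets:
                return True
            for y in g[x] - seen:
                seen.add(y)
                frontier.append(y)
        return False

    # first loop (lines 2-7): strip psi's edges, remember their neighbours
    S = set()
    for v in psi:
        for u in g[v]:
            g[u].discard(v)
            S.add(u)
        g[v] = set()
    # second loop (lines 8-15): prune nodes cut off from d0 \ psi
    while S:
        v = S.pop()
        if not can_reach_target(v):
            for u in g[v]:
                g[u].discard(v)
                S.add(u)
            g[v] = set()
    return g

adj = {"v1": {"v3"}, "v2": {"v3"}, "v3": {"v1", "v2", "v4"},
       "v4": {"v3"}, "v5": {"v6"}, "v6": {"v5"}}
pruned = remove_edges(adj, d0={"v4", "v6"}, psi={"v4"})
print(sorted(v for v, ws in pruned.items() if ws))  # ['v5', 'v6']
```

Once v4 enters ψ, the whole branch v1, v2, v3 leads only to ψ and is pruned away, while the component containing the remaining target v6 survives.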

6. EXPERIMENTS

The goal of our experiments is to evaluate the performance (influence spread) and efficiency (run-time) of our algorithm (a) as compared to other existing approaches, and (b) as a function of the system parameter, γu. Our results show that the run-time of our algorithm is significantly lower than that of the subset adapted greedy algorithm, with a minimal reduction in influence spread. In our experiments, we choose γu = γ to be a constant for all nodes. This allows us to evaluate our algorithm for different values of γ and to use γ as a tunable parameter for a trade-off between performance and efficiency. In practice, γu could follow a distribution, F(u), over the set of nodes, in which case the mean or variance of the distribution could act as a tunable parameter.

The likelihood L(u, A) that a particular node u would be influenced by a seed set A depends on the number of activation paths from A to u and on the lengths of those paths. We observe that L(u, A) can be written as a function of the propagation probability p when the diffusion model used is the ICM. For example, suppose a node u has a single activation path which is one hop away from one of the nodes v in A; then L(u, A) is equal to the weight of edge uv, which is proportional to p. Similarly, if the node u has an activation path which is two hops away from A, then L(u, A) is proportional to p^2. As shown in section 4.2, a node u is considered influenced if L(u, A) exceeds γu; hence the value of γ should be correlated with the diffusion probability p. We therefore explore the variation of γ in this neighborhood using the following equation:

γ = min(c·p^α, 1)    (18)

where α and c are positive integers. For α = 1 and c ≥ 1, the size of the influenced set, ψ, tends to zero. For α > 1, we set c = 1 and find the minimum value of α at which |ψ| = |D0| at the end of the first iteration of IPr. Based on this, we derive the following values of γ to be used in our experiments: {p^4, p^3, p^2, p, 2p, 4p}.
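For p = 0.1, the γ grid above follows directly from (18); a small, purely illustrative check:

```python
# gamma values from eq. (18) with the (c, alpha) pairs that
# give the experimental grid {p^4, p^3, p^2, p, 2p, 4p}
p = 0.1
settings = [(1, 4), (1, 3), (1, 2), (1, 1), (2, 1), (4, 1)]  # (c, alpha)
gammas = [min(c * p ** alpha, 1.0) for c, alpha in settings]
print([round(g, 4) for g in gammas])  # [0.0001, 0.001, 0.01, 0.1, 0.2, 0.4]
```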

Figure 4: Running times of different algorithms on the HEPT network. k = 30, |D0| = 152 and m = 3.

which are two hop away from D0 . In principle, this process continues until no more nodes can be added. For ease of implementation, we limit the number of hops to a constant m ∈ Z+ and we use m = 2 and m = 3 in our experiments. This choice is informed by recent work which have empirically validated that diffusion cascades and influence are most prominent at such short depths [2][4]. For a given dataset, we use the same value of m and D0 for all algorithms. We do not use m in our analysis and is only used for ease of implementation. Hence when a different value of m is chosen, the actual results only vary by a constant factor while the relative comparison with different algorithms remains the same. We use |D0 | = 152 and m = 3 for the HEPT network and |D0 | = 133 and m = 2 for the SMRE network. The complete network data including the chosen set D0 can be downloaded from https:// sourceforge.net/projects/pein/files/latest/download

6.1 Setup


[Figure 5: Running times (in seconds) of different algorithms on the SMRE network. k = 30, |D0| = 133 and m = 2]

We use two real-world academic co-authorship networks for all our experiments. Each node in a co-authorship graph represents an author, and if two or more authors co-authored a paper, then an undirected edge exists between each pair of nodes representing those authors. If two authors u, v collaborated on t_uv papers, then the weight of edge uv in the ICM is 1 − (1 − p)^t_uv [8]. The first dataset, consisting of 15233 nodes and 58891 edges, is from the "High Energy Physics Theory" (HEPT) section of the e-print arXiv^3, with papers published between 1991 and 2003. This is the same dataset used in [8] and [3]. The second dataset, consisting of 1336 nodes and 2200 edges, is obtained from the conference on software maintenance and re-engineering^4 (SMRE). The SMRE network is chosen because it is a sparse graph with multiple connected components. The destination set D0 is chosen randomly from the set of nodes in each of the datasets. Once we have chosen D0, we need to induce the sub-graph on which to evaluate and compare the algorithms. The process of inducing the subgraph starts with adding nodes which are one hop away from nodes in D0. The process is then repeated by adding nodes

3 http://www.arXiv.org
4 http://www.informatik.uni-trier.de/~ley/db/conf/csmr/ Data retrieved from: http://www.aisee.com/gdl/rm.htm
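The edge-weight rule above (an edge between authors who co-wrote t_uv papers gets weight 1 − (1 − p)^t_uv, i.e. the probability that at least one of t_uv independent trials with probability p succeeds) can be written out directly; the function name is our own illustration.

```python
# ICM edge weight from the number of co-authored papers, as in [8].
# p = 0.1 is the propagation probability used in our experiments.
def edge_weight(t_uv, p=0.1):
    return 1.0 - (1.0 - p) ** t_uv

print(round(edge_weight(1), 3))  # 0.1  (single joint paper)
print(round(edge_weight(3), 3))  # 0.271 (three joint papers)
```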

We use the ICM with a constant propagation probability p = 0.1 in all our experiments. All algorithms are implemented in the Java programming language and run on a 2.99 GHz AMD Athlon II X2 B24 dual-core processor with 2 GB of RAM. The influence spread for a given seed set is determined by averaging over 10000 iterations, with edge outcomes chosen randomly every time. It has been found that with 10000 iterations the quality of approximation is comparable to that with 30000 or more iterations [8]. For a given dataset, the same sub-graph is used to compare the performance and efficiency of the following algorithms / heuristics:
(i) Subset Adapted Greedy: Algorithm 1.
(ii) Subset Adapted CELF: the CELF [9] optimization applied to Algorithm 1.
(iii) Iterative Pruning (IPr): Algorithm 2.
(iv) Iterative Pruning with CELF: the CELF optimization applied to Algorithm 2.
(v) Degree: a heuristic which selects the k highest-degree nodes from the induced sub-graph.
(vi) Random: a naive approach which randomly selects k nodes from the induced sub-graph.
The results with the CELF optimization are reported only in the run-time comparisons, since the influence spread remains unchanged.

6.2 Results

We find that for low values of γ there is an immense gain in efficiency with a negligible drop in influence spread. For higher values of γ, fewer nodes are added to the influenced set ψ, and hence fewer edges in the graph are pruned, which does not yield any significant reduction in run-time. Figures 4 and 5 show the running times of the different algorithms for the HEPT and SMRE datasets respectively. The running time of IPr when γ = p^4 is about 96% and 73% lower than subset adapted greedy on HEPT and SMRE respectively. Since the SMRE dataset is a sparse network, there are fewer edges to prune, and hence the efficiency gain is not as high as on the HEPT dataset. The running time of IPr with CELF when γ = p^4 is about 52% and 38% lower than subset adapted CELF on HEPT and SMRE respectively. In CELF [9], the marginal gain (δ_v) due to each node v is stored in a priority queue, and a node's δ_v is re-computed only if it is greater than the value of its predecessor in the queue. Since the marginal gain of most nodes does not drop significantly after an iteration, the recomputation is done for only a few nodes. But this may

[Figure 6: Influence spreads of different algorithms on the HEPT network (x-axis: seed set size |A|; y-axis: influence spread on D0). |D0| = 152 and m = 3]

[Figure 7: Influence spreads of different algorithms on the SMRE network (x-axis: seed set size |A|; y-axis: influence spread on D0). |D0| = 133 and m = 2]

[Figure 8: Variation in |ψ| for different values of γ using Iterative Pruning on the HEPT network (x-axis: seed set size |A|; y-axis: size of influenced set |ψ|).]

[Figure 9: Variation in |ψ| for different values of γ using Iterative Pruning on the SMRE network (x-axis: seed set size |A|; y-axis: size of influenced set |ψ|).]

not be the case when the graph is pruned, because there is a steep drop in the marginal gain of several nodes due to the removal of certain edges. This is why the efficiency gains of IPr with CELF are not as significant as those of IPr over subset adapted greedy. An improvement in this direction is a scope for future work. When γ = 4p, the run-time is slightly higher (about 0.3% - 2%) than that of the subset adapted algorithms, which is the overhead due to threshold comparison.

Figures 6 and 7 show the influence spreads of the various algorithms for the HEPT and SMRE networks respectively. The influence spread of Iterative Pruning with γ = p^4 is about 10% and 21% lower than that of the subset adapted greedy algorithm for k = 30 on HEPT and SMRE respectively. However, when γ = 2p the reduction in influence spread is only about 2% and 10% on HEPT and SMRE respectively, while the gain in efficiency is about 20%. Hence the parameter γ can be used to tune the algorithm for a trade-off between performance and efficiency. It is important to note that the influence spread computed by our implementation of the IPr algorithm is only a lower bound, as shown in Section 5.1. Therefore the performance of IPr can be better than the results obtained above. The influence spread of the Random approach is unacceptably low. The Degree heuristic performs reasonably well for seed set sizes less than five; however, the overall drop in influence spread compared to subset adapted greedy is above 65%.

We observe that in IPr, an increase in efficiency always comes at the cost of a decrease in influence spread. For example, in the HEPT network, the efficiency gain between γ = p^3 and γ = p^2 is about 45%, with a performance drop of about 4%. However, when the same comparison is made between γ = p^4 and γ = p^3, there is almost no change in either the influence spread or the run-times. Finding a relation between the two is an important scope for future work, which would allow a choice to be made between a desired efficiency and an acceptable performance. Note that when evaluating the marginal gain due to a node v, we do not account for v itself being activated, since it has been influenced extrinsically and not due to social influence; the latter is normally accounted for in other approaches such as greedy [8]. However, when the same methodology is used with the subset

specific problem, we see that the ratio of influence spread on D0 to k is as low as 1:3. This is the same behavior observed in [1], where the influence spread within a target subset of the DBLP network is measured and the ratio was found to be much lower, at 1:100. In general, we find that the influence spread saturates beyond a certain value of k, and this depends on several factors, including the set of nodes in the target subset D0. It is also possible that some of the nodes in the seed set A actually belong to D0. In our experiments, we found that only 4 of the 30 nodes in the seed set were in D0 for the SMRE dataset, while none of the nodes in A belonged to D0 for the HEPT dataset. Figures 8 and 9 show the change in the size of the influenced set ψ with increasing size of the seed set A, using IPr for different values of γ on the HEPT and SMRE networks respectively. We observe that (a) the slope of the curve rises steadily with decreasing γ, and (b) the increase in the size of the influenced set follows an almost linear relation with the increase in seed set size. While the former is as expected, the latter comes somewhat as a surprise, since it implies that most of the influentials are roughly equally capable in terms of the number of nodes that they can 'influence'. However, a generalization cannot be made unless larger seed sets and several other datasets are considered.
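The CELF lazy-evaluation idea discussed in the run-time comparisons above (a priority queue of cached marginal gains, where a stale entry is re-evaluated and pushed back instead of recomputing every candidate each round) can be sketched as follows. The function names and the modular toy gain function are our own illustration.

```python
import heapq

def celf_select(candidates, k, gain):
    """Lazy-forward (CELF-style) greedy seed selection sketch.

    gain(v, A) returns the marginal spread of adding v to seed set A.
    Submodularity guarantees a node's cached gain can only shrink as A
    grows, so the top fresh entry of the heap is a safe greedy pick."""
    A = []
    # Max-heap via negated gains; the third field records the round in
    # which the cached gain was computed.
    heap = [(-gain(v, A), v, 0) for v in candidates]
    heapq.heapify(heap)
    rnd = 0
    while len(A) < k and heap:
        neg, v, at = heapq.heappop(heap)
        if at == rnd:            # gain is fresh for the current seed set
            A.append(v)
            rnd += 1
        else:                    # stale entry: recompute and reinsert
            heapq.heappush(heap, (-gain(v, A), v, rnd))
    return A

# Toy usage with a modular (hence trivially submodular) gain function.
vals = {'a': 3, 'b': 5, 'c': 1}
print(celf_select(vals, 2, lambda v, A: vals[v]))  # ['b', 'a']
```

Under graph pruning, as noted above, many cached gains drop steeply at once, so more entries go stale and the lazy evaluation saves less recomputation.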

7. PROOF OF SUB-MODULARITY

In this section, we analyze the behavior of σ(A) when evaluated over a graph which is pruned in each iteration. When the underlying graph is fixed, [8] showed the sub-modularity of σ(A) by proving its monotonicity and diminishing returns. Thus, we know that (19) and (20) hold,

σ(A ∪ {v}) ≥ σ(A)    (19)

σ(S ∪ {v}) − σ(S) ≥ σ(T ∪ {v}) − σ(T), S ⊆ T    (20)

In our case, we investigate whether the influence spread function continues to be sub-modular even when the underlying graph is pruned. In other words, we show that σ_{G_i}(A) satisfies the properties of monotonicity and diminishing returns by proving that (21) and (22) hold true respectively,

σ_{G_{i+1}}(A ∪ {v}) ≥ σ_{G_i}(A)    (21)

σ_{G_i}(S ∪ {v}) − σ_{G_i}(S) ≥ σ_{G_{i+1}}(T ∪ {v}) − σ_{G_{i+1}}(T), S ⊆ T    (22)

7.1 Monotonicity

Let ψ_i be the set of influenced nodes in G_i. Since in a given graph G_i we can use (10) to compute σ_{G_i}(A), we have,

σ_{G_i}(A) = Σ_{u ∈ V \ ψ_i} L(u, A) + Σ_{u ∈ ψ_i} L(u, A)    (23)

In a given graph, the likelihood that a node u would be influenced by a seed set A is independent of the likelihood of another node v being influenced by A. Therefore,

Σ_{u ∈ S} L(u, A) = L(S, A)    (24)

Using (24) in (23), we get,

σ_{G_i}(A) = L(V \ ψ_i, A) + L(ψ_i, A)    (25)

Substituting (25) in (21), we have the following relation to prove,

L(V \ ψ_{i+1}, A ∪ {v}) + L(ψ_{i+1}, A ∪ {v}) ≥ L(V \ ψ_i, A) + L(ψ_i, A)    (26)

Since ψ_i ⊆ ψ_{i+1}, we have (27) and (28),

L(V \ ψ_{i+1}, A) = L(V \ ψ_i, A) − L(ψ_{i+1} \ ψ_i, A)    (27)

L(ψ_{i+1} \ ψ_i, A) = L(ψ_{i+1}, A) − L(ψ_i, A)    (28)

Substituting (27) in (26) and using (28), we get,

L(V \ ψ_i, A ∪ {v}) − L(V \ ψ_i, A) ≥ L(ψ_i, A) − L(ψ_i, A ∪ {v})    (29)

Since ψ_i ⊆ V, we have,

L(V \ ψ_i, A) = L(V, A) − L(ψ_i, A)    (30)

Using (30) in (29), we get,

L(V, A ∪ {v}) − L(V, A) ≥ 0    (31)

The function L(u, A) is a non-decreasing function of the cardinality of A, and since |A ∪ {v}| > |A|, (31) is valid. This completes the proof of monotonicity of σ_{G_i}.

7.2 Diminishing Returns

Using (11) in (22), we have the following relation to prove,

σ_{G_i}(S ∪ {v}) − σ_{G_i}(S) ≥ [σ_{G_i}(T ∪ {v}) − σ_{G_i}(T)] − [ε_i(T ∪ {v}) − ε_i(T)]    (32)

The expression σ_{G_i}(A ∪ {v}) − σ_{G_i}(A) is the marginal contribution of v, given a seed set A in a graph G_i. Since S ⊆ T, the marginal contribution of v in G_i with seed set S cannot be lower than that with seed set T. Therefore, for (32) to hold, the following inequality should hold,

ε_i(T ∪ {v}) − ε_i(T) ≥ 0    (33)

Substituting (11) in (33), we get,

σ_{G_i}(T ∪ {v}) − σ_{G_{i+1}}(T ∪ {v}) ≥ σ_{G_i}(T) − σ_{G_{i+1}}(T)

∴ σ_{G_i}(T ∪ {v}) − σ_{G_i}(T) ≥ σ_{G_{i+1}}(T ∪ {v}) − σ_{G_{i+1}}(T)    (34)

The LHS and RHS of (34) are the marginal gains due to v in G_i and G_{i+1} respectively. Since E_{i+1} ⊆ E_i, the marginal gain due to a node v in G_{i+1} cannot be higher than in G_i. Therefore, (34) is valid, and this proves the property of diminishing returns of σ_{G_i}.

8. CONCLUSION

We addressed the problem of maximizing the spread of influence to a given subset in a social network. We believe that as online social networks increase in size, this problem will be of increasing importance. Our proposed approach is independent of how the sub-set is specified. In social networks, for example, demographic information, attitude towards a

product, current location, etc. may be used to create the subset that needs to be targeted. Our work proposes an iterative pruning algorithm for finding the top-k influential nodes for maximizing the spread of influence to a given subset in a social network. We have evaluated this algorithm on two real-world data sets and shown that the gain in efficiency due to iterative graph pruning is significant for an acceptable drop in performance when compared to the existing state of the art. More importantly, our algorithm has a tunable parameter, γ, which enables it to be tuned for this performance vs. efficiency trade-off, which can be optimized based on domain requirements. We extended the existing analytical framework for top-k influential detection to incorporate γ. Our analysis of the network pruning approach under this analytical framework shows that the influence spread function continues to be sub-modular. We note that our work has applications beyond social networks. It may be used in epidemiology, when the population to be considered for vaccination is a subset of a larger population in which the disease may spread by contagion. One obvious direction in which this work may be extended is the design of algorithms which are more efficient than the one proposed here. A second area for future investigation is to extend the proposed framework and algorithm to other diffusion models and / or to consider directed underlying graphs. Finally, the diffusion model we have considered is strictly progressive, in which a node, once influenced, does not become uninfluenced. Extension of this work to non-progressive models of diffusion may also be interesting.

9. REFERENCES

[1] C. C. Aggarwal, A. Khan, and X. Yan. On flow authority discovery in social networks. In Proceedings of the eleventh SIAM international conference on Data mining, SDM '11, pages 522–533. SIAM / Omnipress, 2011.
[2] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM '11, pages 65–74. ACM, 2011.
[3] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In Proceedings of the fifteenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pages 199–208. ACM, 2009.
[4] N. A. Christakis and J. H. Fowler. The spread of obesity in a large social network over 32 years. The New England Journal of Medicine, 357(4):370–379, July 2007.
[5] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '01, pages 57–66. ACM, 2001.
[6] J. Goldenberg, B. Libai, and E. Muller. Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth. Marketing Letters, 3(12):211–223, Aug. 2001.
[7] M. Granovetter. Threshold Models of Collective Behavior. American Journal of Sociology, 83(6):1420–1443, 1978.
[8] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '03, pages 137–146. ACM, 2003.
[9] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In Proceedings of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '07, pages 420–429. ACM, 2007.
[10] R. Narayanam and Y. Narahari. A shapley value-based approach to discover influential nodes in social networks. IEEE T. Automation Science and Engineering, 8(1):130–147, 2011.
[11] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, Dec. 1978.
[12] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman. Influence and passivity in social media. In WWW (Companion Volume) '11, pages 113–114, 2011.
[13] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, Annals of Mathematics Studies, 2:28, 1953.
[14] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[15] D. J. Watts and P. S. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer Research, 34(4):441–458, 2007.