
Path-Constrained Influence Maximization in Heterogeneous Information Networks

Fangbo Tao, Tsinghua University, Beijing, China

Xiangnan Kong, University of Illinois at Chicago, Chicago, IL, USA

Philip S. Yu, University of Illinois at Chicago, Chicago, IL, USA


ABSTRACT

Influence maximization in information networks has become an important research topic in recent years, where the task is to find a small set of influential nodes as seeding nodes such that the information dissemination in the network can be maximized. Influence maximization has a large number of applications, e.g., viral marketing, epidemic spreading and academic idea propagation. Current research focuses on homogeneous information networks, i.e., networks that consist of one single type of nodes. However, in many real-world applications, heterogeneous information networks, which consist of multiple types of nodes, are more prevalent. In order to analyze the influence diffusion process in such networks, we address the problem of how to find the most important nodes that can maximize the influence diffusion within a heterogeneous information network. This problem is challenging and different from previous works on influence maximization in that there are multiple types of nodes and the propagation paths are constrained. In other words, we consider a generalized version of the influence diffusion process where the diffusion involves multiple types of nodes and the propagation starts from a particular type of seeding nodes, follows some constrained paths and ends with certain types of target nodes. Moreover, the diffusion process can involve multiple path constraints, which are potentially infinite. We study a new problem of influence maximization in heterogeneous information networks with path constraints. We then derive a solution called Automata-based Heterogeneous Diffusion (AHD) to solve the influence maximization problem with complex constraints in the diffusion process. Empirical studies on real-world tasks demonstrate the effectiveness and efficiency of the proposed approach.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

General Terms

Algorithm, Performance, Experimentation

Keywords Heterogeneous Information Networks, Influence Maximization, Influence Diffusion

1. INTRODUCTION

Information networks have become ubiquitous and increasingly important in many real-world applications. Examples include social networks like Facebook, co-author networks like DBLP, etc. Such networks, i.e., graphs of relationships and interactions within a group of individuals, play an important role in spreading information, ideas and influence among their members. For example, in academic networks, research ideas can be diffused in co-author networks; viruses can spread within human epidemic networks; purchase behaviors can propagate in consumer networks. Such diffusion processes within information networks have a long history of study. In particular, we are interested in studying the influence maximization problem within information networks. The task is to find a small set of individuals as seeding nodes such that the information dissemination in the network can be maximized. For example, mobile phone marketers may want to find a small number of influential customers and give them mobile phones for free, such that the product can be greatly promoted through the powerful 'word-of-mouth' effect within a consumer network. Motivated by the tremendous number of real-world applications, influence maximization has been extensively studied in the literature [19, 5, 6, 12, 2].

Conventional research on influence maximization focuses on homogeneous networks, where the networks only consist of one single type of objects/entities. However, in many real-world networks, there usually exist multiple types of objects/entities and multiple types of links/relationships among them. For example, an academic network, such as the ACM publication network shown in Figure 1(a), usually involves multiple types of entities, e.g., papers (P), authors (A), affiliations (F), areas/index terms (I), proceedings/venues (V) and conferences (C); an epidemic network, as shown in Figure 1(b), often involves multiple species, e.g., humans (H), cats (C), dogs (D) and swine (S). Such networks are called heterogeneous information networks, which are more general and prevalent in a wide range of application domains. In this paper, we study the problem of influence maximization in heterogeneous information networks. This problem

exists in many real-world scenarios. For example, in publication networks, as shown in Figure 1(a), people may want to find the conferences that have the most influence among authors. In epidemic networks, as shown in Figure 1(b), researchers may want to analyze the propagation of a virus, like swine flu, and select a group of animals, like swine, that can potentially affect the largest number of human beings, to run screening tests of infection.

Formally, the influence maximization problem in heterogeneous information networks can be described as follows: when analyzing the information propagation process in heterogeneous information networks, how can we pick a k-sized set of seeding nodes of a particular type that maximizes the influence on certain types of target nodes? Finding the optimal solution to influence maximization is NP-hard even in the scenario of homogeneous information networks. Despite its value and significance, the influence maximization problem has not been studied in the context of heterogeneous information networks. This problem is more challenging than conventional influence maximization problems, because homogeneous networks are special cases of heterogeneous networks. If we consider influence maximization and heterogeneous information networks as a whole, the major challenges are summarized as follows:

Multiple types of nodes and links: Conventional research on homogeneous information networks assumes that there only exists one type of nodes and links. Thus, the influence diffusion process is only considered within the same type of nodes, e.g., finding influential authors in a co-author network. However, in the context of heterogeneous information networks, the diffusion process may involve multiple types of nodes and links. The seeding nodes and target nodes can be of very different types. For example, in the ACM publication network, people may be interested in finding a small set of influential papers, which have the greatest influence/impact on researchers. In this scenario, the seeding nodes are papers, while the target nodes are authors that can potentially be influenced by the papers. Other examples include finding a set of influential authors in some research areas or conferences. In all these application scenarios, the seeding nodes and target nodes are of different types. Moreover, even in application scenarios with the same type of seeding nodes and target nodes, the meaning is still different from that in homogeneous networks, because such influence includes both direct links and indirect paths through other types of nodes. Conventional approaches for homogeneous networks cannot be directly applied to heterogeneous networks, because they disregard the subtlety of different types of nodes and links.

Figure 1: Examples of Heterogeneous Information Network Schema. (a) ACM Publication Network. (b) Epidemic Network.

Path-constrained diffusion process: In homogeneous information networks, the diffusion process only involves one single type of nodes and links, thus the propagation path is usually unconstrained, i.e., the propagation process can potentially follow any links to influence any nodes linked in the network. However, in heterogeneous information networks, there are multiple types of nodes and links. Usually only a few types of links and nodes are involved in the diffusion process. Moreover, the propagation process is always constrained to a number of valid diffusion paths. For example, in epidemic networks, as shown in Figure 1(b), one type of virus can only spread through certain paths and infect certain species. The swine flu virus can only spread through paths like swine-to-swine (SS), swine-to-human (SH), human-to-human (HH) and human-to-swine (HS), whereas infection paths like dog-to-human (DH) and human-to-dog (HD) are considered invalid. For the rabies virus, however, dog-to-human (DH) would be a valid and important spreading path. Conventional influence maximization approaches cannot be directly applied to these applications, because they disregard path constraints within the network diffusion process.

Complex Diffusion Patterns: In homogeneous networks, the diffusion pattern is relatively simple. For example, in paper citation networks, the diffusion pattern can be written as PP*, representing the set of diffusion paths {PP, PPP, ...}. However, in heterogeneous information networks, the diffusion patterns can be very complex, involving different types of nodes and links. For example, if we consider the influence of some authors on other authors in an academic publication network, as shown in Figure 1(a), the diffusion patterns can be written as AP*A, APVPA, APIPA, etc. Here AP*A denotes a set of diffusion paths like author-paper-author (APA), APPA, APPPA, etc., involving multiple types of nodes and links. The influence diffusion process in heterogeneous information networks often involves a large number of complex paths (potentially infinite).

In this paper, we study the problem of path-constrained influence maximization in heterogeneous information networks and propose a novel solution called Automata-based Heterogeneous Diffusion (AHD) to handle the complex constraints in the diffusion process. The proposed method is general and flexible enough to handle heterogeneous information networks with various complex path constraints. A greedy algorithm with near-optimal guarantees, based upon submodular functions, is then proposed to solve the above problems efficiently and effectively. Empirical studies show that the proposed approach can efficiently and effectively find influential nodes under complex path constraints.

2. PROBLEM DEFINITION

In this section, we briefly define the problem of path-constrained influence maximization in heterogeneous information networks.

2.1 Heterogeneous Information Networks

Heterogeneous information networks are a special kind of networks which contain multiple types of nodes or edges.

Figure 2: ACM Publication Network Schema

DEFINITION 1. Information Network: An information network is a directed graph G = (V, E) with a node type mapping function φ : V → A and an edge type mapping function ψ : E → R, where each node v ∈ V belongs to one particular node type φ(v) ∈ A and each edge e ∈ E belongs to one particular edge type ψ(e) ∈ R.

Different from traditional networks, we distinguish node types and edge types in information networks. In a typical heterogeneous information network, there exist multiple node types or multiple edge types, i.e., |A| > 1 or |R| > 1. Here we use the concept of network schema [16] to describe the meta structure of heterogeneous information networks:

DEFINITION 2. Network Schema: A network schema is a meta template for a heterogeneous network G = (V, E) with the node type mapping function φ : V → A and the edge type mapping function ψ : E → R. It is a directed graph defined upon object types A, with edges as relations from R, denoted by T_G = (A, R).

We have shown two network schemas in Figure 1, which correspond to an epidemic network and the ACM publication network.
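As an illustration of Definitions 1 and 2, the following minimal sketch stores a typed graph and derives its schema T_G = (A, R). The toy nodes, the relation names (`writes`, `cites`, `published_in`) and the function `network_schema` are hypothetical and only serve the example; they are not taken from the paper's datasets.

```python
# Hypothetical toy network: phi maps each node to its type,
# and each edge carries its relation name (the role of psi).
phi = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}
edges = [
    ("a1", "p1", "writes"),
    ("a2", "p2", "writes"),
    ("p2", "p1", "cites"),
    ("p1", "v1", "published_in"),
]

def network_schema(phi, edges):
    """Return T_G = (A, R): the node types and the typed relations between them."""
    node_types = set(phi.values())
    relations = {(phi[u], rel, phi[v]) for u, v, rel in edges}
    return node_types, relations

node_types, relations = network_schema(phi, edges)

# Heterogeneous means |A| > 1 or |R| > 1.
edge_types = {rel for _, rel, _ in relations}
is_heterogeneous = len(node_types) > 1 or len(edge_types) > 1
```

Here the schema contains three node types and three relations, so the toy network counts as heterogeneous under the definition above.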

2.2 Path-Constrained Diffusion Framework

Since there are multiple types of nodes in heterogeneous information networks, we focus on studying the influence of one particular type of nodes on other specific types of nodes. For example, in the ACM publication network, people may want to mine the crucial affiliations that have the greatest influence on conferences, or mine some important papers that have the greatest influence on other papers. We define the concepts of Seeding Type Set and Target Type Set to describe the influencing types and influenced types as follows.

DEFINITION 3. Seeding Type Set and Target Type Set: The seeding type set S ⊆ A is the subset of the type set A that contains the types of seeding nodes. The target type set T ⊆ A is the subset of the type set A that contains the types of influenced nodes. We also denote the sets of all nodes with types in S and T as V_S and V_T, where V_S = {u | u ∈ V, φ(u) ∈ S} and V_T = {u | u ∈ V, φ(u) ∈ T}.

One typical example is S = {A} and T = {A, P} in the ACM publication network, which denotes the application scenario of mining several influential authors that can influence other authors and papers remarkably.

In heterogeneous information networks, a diffusion process with certain type constraints, i.e., seeding type set S and target type set T, should also comply with certain path constraints. For example, we can analyze the diffusion process of academic ideas in the ACM publication network. If we consider the influence from conferences to authors, we will find a reasonable path: Conf-Venue-Paper-Author (CVPA). If we consider a diffusion process of academic ideas from authors to papers through citations, we will find a reasonable path: Author-Paper-Paper-...-Paper (AP*). Here * is a symbol that represents one or more of the preceding element. From the above examples, we can find that (1) in heterogeneous information networks, the diffusion process always complies with specific diffusion paths; (2) there are two types of diffusion paths: paths without an iterative diffusion process, e.g., CVPA and APA, and paths with an iterative diffusion process, e.g., AP* and (AP)*A. Motivated by these customized requirements in heterogeneous information networks, we propose the concept of diffusion path to describe the path constraints. Here, the two types of paths are defined as iterable diffusion path and non-iterable diffusion path.

DEFINITION 4. Iterable Diffusion Path: An iterable diffusion path P is represented by a sequence of node types, asterisks '*' and parentheses '()'. An asterisk indicates one or more of the preceding element, and parentheses indicate the scope of the operator '*'. An iterable diffusion path is a path defined on the network schema T_G = (A, R) with several iterative diffusion parts.

DEFINITION 5. Non-Iterable Diffusion Path: A non-iterable diffusion path P is represented by a sequence of node types A_1 A_2 ... A_l. It is a path defined on the network schema T_G = (A, R) without iterative parts. The length of such a path is fixed.

Here are some examples of reasonable diffusion paths on the ACM publication network: APA, AP*, VPAPV, etc. The diffusion process of a specific scenario may include multiple diffusion paths. For example, when we analyze the diffusion process of academic ideas from authors to authors, different paths can represent different aspects of such influence. These aspects include influence from co-authors ((AP)*A), influence from communication in the same venue (APVPA), and influence from paper citations (AP*A). Thus, we get some meaningful diffusion paths in these scenarios: (AP)*A, AP*A and APVPA. All these paths can cause diffusion of academic ideas. Formally, we propose the concept of diffusion schema as follows.

DEFINITION 6. Diffusion Schema: A diffusion schema D consists of several diffusion paths. It can contain both non-iterable diffusion paths and iterable diffusion paths. We denote D = {P_1, P_2, ..., P_n}, where P_i denotes the i-th diffusion path in the diffusion schema D.

With all the related concepts introduced, the problem of influence maximization in heterogeneous information networks

can be formally defined as follows:

DEFINITION 7. Influence maximization in heterogeneous information networks: Picking a k-sized set of nodes N ⊆ V_S, which can maximize the number of influenced nodes in V_T. The diffusion process should comply with the constraints of paths in the diffusion schema D.

Figure 3: Single path evaluation process. The path P*I of the diffusion schema D is translated to a DFA D with states 1, 2 and 3 (TRANSLATE), which is then converted to the transition matrix M (MATRIXING):

State | P | I | A | F | V | C
1 | 2 | 0 | 0 | 0 | 0 | 0
2 | 2 | 3 | 0 | 0 | 0 | 0
3 | 0 | 0 | 0 | 0 | 0 | 0
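Because diffusion paths are written like regular expressions over node types, a candidate type sequence can be checked against a diffusion schema with an ordinary regex engine. The sketch below is an illustration only, under two assumptions: node types are single uppercase letters, and the paper's '*' (one or more repetitions) is mapped to '+' in Python's `re` syntax. The helper names `path_to_regex` and `complies` are hypothetical.

```python
import re

def path_to_regex(path):
    """Translate a diffusion path such as '(AP)*A' into a compiled Python regex.

    Assumption: the paper's '*' means ONE or more repetitions,
    which corresponds to '+' in Python's re syntax.
    """
    return re.compile(path.replace(" ", "").replace("*", "+"))

def complies(type_sequence, schema):
    """True if the whole node-type sequence matches at least one path in the schema."""
    return any(path_to_regex(p).fullmatch(type_sequence) for p in schema)

# Author-to-author diffusion schema used as an example in Section 2.2.
D = ["(AP)*A", "AP*A", "APVPA"]

ok_coauthor = complies("APAPA", D)   # matches (AP)*A
ok_citation = complies("APPPA", D)   # matches AP*A
ok_venue = complies("APVPA", D)      # matches the non-iterable path
not_valid = complies("AVA", D)       # no path allows author-venue-author
```

Enumerating sequences this way is only practical for inspection; the automata construction of Section 3 avoids enumerating paths during the diffusion itself.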

2.3 Challenges

The cascade model in this paper is based on the Independent Cascade Model (IC Model) [5, 6, 8]. The definition of the IC Model is as follows: We start with an initial set of active nodes S_0, and the process unfolds in discrete steps according to the following randomized rule. When node v first becomes active in step t, it is given a single chance to activate each currently inactive neighbor w; it succeeds with a probability p_{v,w}, a parameter of the system, independently of the history thus far. (If w has multiple newly activated neighbors, their attempts are sequenced in an arbitrary order.) If v succeeds, then w will become active in step t + 1; but whether or not v succeeds, it cannot make any further attempts to activate w in subsequent rounds. Again, the process runs until no more activations are possible.

In the heterogeneous context, the IC model should also comply with complex type constraints and path constraints. As a result, the approaches [8, 12] used to solve this problem in the homogeneous context are not applicable in the heterogeneous context. Influence maximization in heterogeneous information networks is a non-trivial task due to the following challenges:

Challenge 1: Since there is a tremendous number of possible k-sized sets, how can we choose an influential set of nodes without enumerating all these combinations?

Challenge 2: There are multiple types of nodes in heterogeneous information networks, and we also define S and T in our problem. The traditional problem does not consider these type constraints. How can we comply with these constraints in the heterogeneous context?

Challenge 3: In our problem, the diffusion process should comply with diffusion paths. When node u is activated, it can only choose several neighbors to influence, instead of the whole neighbor set as the IC Model does. Moreover, the available neighbor set of node u varies during the diffusion process. For example, suppose we are analyzing the influence from authors to authors with a diffusion schema D = {APPA}, which means an author whose paper cites another author's paper may be influenced by that author. In this diffusion process, if a paper is activated, we do not know which P in the path APPA it belongs to. If this paper belongs to the former P, it should influence all its paper neighbors which cite this paper. If this paper belongs to the latter P, it should influence all its author neighbors who wrote this paper. How can we handle these path constraints in our problem?

Challenge 4: Since the diffusion schema may have multiple paths, should we evaluate these paths one by one and record the diffusion statuses for each path? It is even more challenging that iterable diffusion paths like (AP)*A can expand to infinitely many non-iterable diffusion paths like APA, APAPA, etc. How can we handle this kind of path?
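The Independent Cascade process described in this section can be sketched as a single randomized simulation. This is a minimal illustration of the plain (homogeneous) IC Model, without type or path constraints; the adjacency-list layout, the `prob` dictionary and the function name `independent_cascade` are assumptions made for the example.

```python
import random

def independent_cascade(graph, prob, seeds, rng):
    """Run one IC simulation and return the set of nodes active at the end.

    graph: dict node -> list of out-neighbours
    prob:  dict (v, w) -> activation probability p_{v,w}
    rng:   a random.Random instance, so that runs can be reproduced
    """
    active = set(seeds)
    frontier = list(seeds)            # nodes that became active in the previous step
    while frontier:
        next_frontier = []
        for v in frontier:
            for w in graph.get(v, []):
                # v gets a single chance to activate each still-inactive neighbour w
                if w not in active and rng.random() < prob[(v, w)]:
                    active.add(w)
                    next_frontier.append(w)
        frontier = next_frontier      # only newly activated nodes try in the next step
    return active

# Hypothetical toy graph.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
```

Each node enters the frontier at most once, so every edge is attempted at most once and the loop terminates when a step activates no new node.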

3. THE PROPOSED METHOD

In this section, we propose a novel solution, called AHD (Automata-based Heterogeneous Diffusion), to effectively and efficiently mine an influential node set. We address Challenge 1 in Section 3.1 by adapting the greedy hill-climbing strategy to heterogeneous information networks. We then address Challenges 2 and 3 in Section 3.2 by introducing finite automata into our solution. Finally, we address Challenge 4 in Section 3.3 by combining multiple path constraints using automata methods.

3.1 Greedy Hill-Climbing

In order to address Challenge 1 discussed in Section 2.3, we use the greedy hill-climbing strategy [4], which always picks the node with the largest marginal gain. We start with the empty placement S_0 = ∅ and iteratively, in step k, add the node s_k with the maximum marginal gain:

\[ s_k = \arg\max_{s \in V_S \setminus S_{k-1}} \bigl( R(S_{k-1} \cup \{s\}) - R(S_{k-1}) \bigr) \]

Here R(S) denotes the influence of node set S. If the influence function R is non-negative, monotone and submodular, then, as Nemhauser, Wolsey, and Fisher [3, 14] pointed out, the greedy hill-climbing strategy approximates the optimum to within a factor of (1 − 1/e). The submodularity of R(S) under the IC model in the homogeneous context has been proved by previous work [8]. In order to make use of this greedy strategy, we prove the submodularity of the IC Model with type constraints and path constraints. The detailed proof is provided in Appendix A.
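The greedy selection rule can be sketched as follows. The influence function is passed in as a callable, and the example uses a simple set-coverage function as a stand-in for R(S), since coverage is also non-negative, monotone and submodular. The function names and the `reach` data are hypothetical; in the paper's setting R(S) would instead be estimated by simulation.

```python
def greedy_hill_climbing(candidates, influence, k):
    """Pick up to k seeds, each time adding the candidate with the largest marginal gain."""
    chosen = []
    current = influence(chosen)
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for s in candidates:
            if s in chosen:
                continue
            gain = influence(chosen + [s]) - current   # R(S ∪ {s}) - R(S)
            if gain > best_gain:
                best, best_gain = s, gain
        if best is None:                               # fewer than k candidates
            break
        chosen.append(best)
        current += best_gain
    return chosen

# Stand-in influence function: number of distinct target nodes covered (toy data).
reach = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6, 7}, "s4": {1}}

def coverage(seeds):
    return len(set().union(*(reach[s] for s in seeds)))

selected = greedy_hill_climbing(list(reach), coverage, 2)
```

On this toy input the first pick is `s3` (gain 4) and the second is `s1` (gain 3), for a total coverage of 7.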

3.2 Evaluating Influence Using Automata

In this section, we address Challenges 2 and 3 discussed in Section 2.3 by introducing finite automata into our AHD solution. First, we discuss the method of evaluating the influence of a particular node set. Even in the simple IC Model, it is not easy to evaluate the influence function R(S) exactly, or to evaluate it in polynomial time. However, as in previous works [9, 8, 12, 2], by repeatedly simulating the cascade process and sampling R(S), we can compute arbitrarily close approximations to R(S).
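The sampling estimate can be written down explicitly. In the sketch below, X_i(S) is an assumed notation for the number of influenced target nodes in the i-th of m independent simulations, so that 0 ≤ X_i(S) ≤ |V_T|; the error bound is the standard Hoeffding inequality and is not a result stated in the paper.

```latex
\hat{R}_m(S) = \frac{1}{m} \sum_{i=1}^{m} X_i(S),
\qquad
\Pr\bigl[\, |\hat{R}_m(S) - R(S)| \ge \epsilon \,\bigr]
\le 2 \exp\!\left( -\frac{2 m \epsilon^{2}}{|V_T|^{2}} \right)
```

Under these assumptions, the number of simulations needed for a fixed absolute error ε grows with |V_T|^2 / ε^2, which is why the estimate can be made arbitrarily close at the cost of more runs.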

Figure 4: Combination of Multiple Paths. The paths P1: P*I and P2: PV of the diffusion schema D are translated to the DFAs D1 and D2 (TRANSLATE), combined into the NFA N (COMBINE), and converted to a single DFA D (CONVERT).

In this part, we consider the situation in which the diffusion schema D contains only one diffusion path. This path may be a non-iterable diffusion path or an iterable diffusion path. Because of the type and path constraints in our problem, and because an iterable diffusion path may expand to infinitely many simple paths, the cascade process cannot be simulated effectively and efficiently with traditional methods. Finite automata are used to realize the complex path and type constraints. First, we give the definitions of DFA (deterministic finite automaton) and transition matrix. We also give an automata property proved by automata researchers [7].

DEFINITION 8. DFA: A deterministic finite automaton D contains a finite set of states denoted by Q = {ŝ1, ŝ2, ..., ŝl}, a finite set of input symbols denoted by Σ, a transition function δ(ŝ, a), in which ŝ ∈ Q and a ∈ Σ, a start state q ∈ Q and a set of final states F ⊆ Q. A DFA is denoted D = {Q, Σ, δ, q, F}. The transition result of δ(ŝ, a) contains zero or one state.

DEFINITION 9. Transition Matrix: A transition matrix M is a matrix form of a DFA D = {Q, Σ, δ, q, F} with l rows and n columns, where l = |Q| and n = |Σ|. The element M_{i,j} denotes the next state number if state ŝi gets symbol j. We have M_{i,j} ∈ [0, l]. If M_{i,j} = 0, it denotes that the transition δ(ŝi, j) is not allowed in the DFA D.

Automata Property 1: Each regular expression can be translated to a DFA which provides equivalent path constraints.
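To make Definition 9 concrete, the following sketch builds the transition matrix for the path P*I over the six ACM node types. The dictionary encoding of δ and the function name `transition_matrix` are assumptions for illustration; states are numbered from 1 so that 0 can mean "transition not allowed", as in the definition.

```python
SYMBOLS = ["P", "I", "A", "F", "V", "C"]   # input symbols = node types of the ACM schema

def transition_matrix(num_states, delta, symbols=SYMBOLS):
    """Build M with one row per state (1-based) and one column per symbol.

    delta: dict (state, symbol) -> next state.
    Pairs missing from delta become 0, i.e. the transition is not allowed.
    """
    return [[delta.get((s, a), 0) for a in symbols]
            for s in range(1, num_states + 1)]

# DFA for the path P*I: start state 1, final state 3.
delta = {(1, "P"): 2, (2, "P"): 2, (2, "I"): 3}
M = transition_matrix(3, delta)
```

Row 1 allows only P (moving to state 2), row 2 allows P (staying in state 2) and I (moving to the final state 3), and row 3 allows nothing.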

Examples of a DFA and its transition matrix M are shown in Figure 3. In this example, S = {P}, T = {I} and D = {P*I}. We introduce the generation process as follows.

TRANSLATE STEP: Because both iterable diffusion paths and non-iterable diffusion paths are kinds of regular expressions, we can easily translate them to a DFA. In our example, the generated DFA D has a start state ŝ1 and a final state set F = {ŝ3}. The input symbol set Σ consists of node types, here Σ = {P, I, A, F, V, C} = A. The transition function δ contains δ(ŝ1, P) = {ŝ2}, δ(ŝ2, P) = {ŝ2} and δ(ŝ2, I) = {ŝ3}, while other transitions are not allowed in D.

MATRIXING STEP: Then, we transform the DFA D to an equivalent transition matrix M. M has |Q| rows and |Σ| columns. The elements in M denote the results of the transition function δ.

In the above steps, we have represented the type constraints and path constraints by a transition matrix M. In order to evaluate the path-constrained influence of a particular node set, we combine the graph G = (V, E) and the transition matrix M to simulate the diffusion process. Note that Σ in the DFA is equivalent to A in the graph; the node types are the bridge that combines the DFA D and the graph G. In order to make the process clear, we propose the concepts of node-state tuple and tuple pool:

DEFINITION 10. Node-State Tuple and Tuple Pool: A node-state tuple is a tuple with two elements (u, ŝ). Here u is a currently diffusing node in the graph and ŝ is the current state in the DFA. This tuple denotes the current diffusion status in a diffusion process. The tuple pool Q is a set consisting of node-state tuples. It denotes all the available diffusion statuses in the diffusion process.

When we evaluate the influence of a particular node set S, we first set Q = ∅. For every node u ∈ S, we put a node-state tuple (u, ŝ_{M_{1,φ(u)}}) in the pool Q. Here ŝ_{M_{1,φ(u)}} denotes the DFA state which is the result of the transition function on the start state in D with φ(u) as its input symbol. During the process, for each tuple (u, ŝk) ∈ Q, we attempt all of u's neighbors. For each neighbor node v of u, if M_{k,φ(v)} > 0 and the activating attempt from u to v succeeds, we add a new tuple (v, ŝ_{M_{k,φ(v)}}) into Q. With all the neighbors attempted, we delete (u, ŝk) from the pool Q. The process runs until no more node-state tuples exist in the tuple pool Q.

3.3 Combination of Multiple Paths

In this section, we consider that the diffusion schema can have multiple paths. We address Challenge 4 by combining multiple DFAs together.

In Section 3.2, we have demonstrated that if |D| = 1, a DFA can realize the type and path constraints, so that we can combine the transition matrix M and the graph G to evaluate the influence. In this section, we generalize the method discussed in Section 3.2 to the situation with multiple paths. We provide the definition of NFA and two properties from automata theory [7] as follows:

DEFINITION 11. NFA: A nondeterministic finite automaton N can be denoted by N = {Q, Σ, δ, q, F}, where Q, Σ, q, F in N have the same meanings as in a DFA, with Σ = A. The result of δ(ŝ, a) in the NFA N can contain more than one state.

Automata Property 2: Several DFAs can be combined into a single NFA without information loss or redundancy.
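The tuple-pool evaluation of Section 3.2, which is reused unchanged once several paths have been merged into one DFA, can be sketched as a single simulated diffusion. Two points are assumptions of this sketch rather than statements from the paper: a `seen` set prevents the same node-state tuple from being added twice (so that cycles terminate), and a node is counted as influenced when it is reached in a final state. The function name and the toy graph are hypothetical.

```python
import random

def path_constrained_cascade(graph, phi, prob, M, col, finals, seeds, rng):
    """One simulated diffusion; returns the nodes reached in a final DFA state.

    graph: node -> list of neighbours     phi:  node -> node type
    prob:  (u, v) -> activation probability
    M:     transition matrix, rows are states 1..l, entry 0 = not allowed
    col:   node type -> column index of M
    """
    pool, seen, influenced = [], set(), set()
    for u in seeds:
        state = M[0][col[phi[u]]]             # transition out of the start state (row 1)
        if state > 0:
            pool.append((u, state))
            seen.add((u, state))
    while pool:
        u, k = pool.pop()                     # take a tuple out of the pool
        if k in finals:
            influenced.add(u)
        for v in graph.get(u, []):
            nxt = M[k - 1][col[phi[v]]]       # state after reading the type of v
            # Assumption: each (node, state) tuple enters the pool at most once.
            if nxt > 0 and (v, nxt) not in seen and rng.random() < prob[(u, v)]:
                seen.add((v, nxt))
                pool.append((v, nxt))
    return influenced

# Toy example for D = {P*I}: S = {P}, T = {I}.
col = {t: j for j, t in enumerate(["P", "I", "A", "F", "V", "C"])}
M = [[2, 0, 0, 0, 0, 0], [2, 3, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
phi = {"p1": "P", "p2": "P", "a1": "A", "i1": "I"}
graph = {"p1": ["p2", "a1"], "p2": ["i1", "p1"]}
```

In the toy graph, the author neighbour `a1` is never attempted because the matrix entry for (state 2, type A) is 0, while the index term `i1` is reachable through `p1`, `p2`, `i1`.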

Figure 5: Optimization on DFA structure. The DFA D is optimized (OPTIMIZE) into the DFA Do.

Table 1: The # state in the original DFA model and # state* in the optimized DFA

Diffusion Schema | # state | # state*
{APJPA, APGPA} | 9 | 6
{JP*J} | 4 | 3
{GP*A} | 4 | 3
{FAP*} | 4 | 3
{IPAF, IP*AF} | 8 | 3
{AA*} | 3 | 2
{ACA, ATA, AA*} | 7 | 5

Figure 6: Dataset Network Schemas. (a) DBLP dataset. (b) ACM dataset. (c) SGD dataset.

Automata Property 3: An NFA can be converted to a DFA without information loss or redundancy.

With the example shown in Figure 4, we demonstrate the combination process. In this example, S = {P}, T = {I, V} and D = {P*I, PV}.

TRANSLATE STEP: This step is the same as in the single path situation. Here we translate multiple paths to multiple DFAs.

COMBINE STEP: Following Property 2, we combine all the DFAs into one NFA. Here we can simply combine the start states together and combine the final states together. The combined NFA is shown in Figure 4 as NFA N. Note that the result of the transition function may contain more than one state. In the example, we have δ(ŝ1, P) = {ŝ2, ŝ3}.

CONVERT STEP: Following Property 3, we can convert the NFA N to an equivalent DFA D. In this example, we have a start state ŝ1 and a final state set F = {ŝ4, ŝ5}. The next steps are similar to the single path situation discussed in Section 3.2.

By using finite automata to represent the constraints of paths in the diffusion schema D, we avoid enumerating different paths one by one. Moreover, we also combine iterable diffusion paths and non-iterable diffusion paths into a single DFA, so that the complex type constraints and path constraints are greatly simplified.

3.4 Optimization of DFA

A typical DFA can always be optimized to its most simplified form. The optimized DFA has the fewest states while retaining the same ability to express the type constraints and path constraints. We use the table filling algorithm [7] to minimize the size of the DFA. An example is shown in Figure 5. In this example, we have S = {P}, T = {I, V} and D = {P*I, PV}. We transform the combined DFA D, which has 5 states, to an optimized DFA Do, which only has 4 states. More optimization examples are shown in Table 1.
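The CONVERT and optimization steps for D = {P*I, PV} can be sketched with the textbook subset construction followed by state merging. Two caveats apply: the minimization below uses partition refinement rather than the table filling algorithm cited in the text (both merge equivalent states), and the toy NFA keeps the two final states distinct, which is an assumption chosen so that the converted DFA has 5 states as in the example. All names are hypothetical.

```python
def subset_construction(nfa, start, finals, symbols):
    """Convert an NFA (dict (state, symbol) -> set of states) to a DFA over frozensets."""
    start_set = frozenset([start])
    dfa, states, todo = {}, {start_set}, [start_set]
    while todo:
        cur = todo.pop()
        for a in symbols:
            nxt = frozenset(t for s in cur for t in nfa.get((s, a), ()))
            if not nxt:
                continue                  # no transition: the path constraint is violated
            dfa[(cur, a)] = nxt
            if nxt not in states:
                states.add(nxt)
                todo.append(nxt)
    dfa_finals = {q for q in states if q & finals}
    return states, dfa, start_set, dfa_finals

def minimize(states, dfa, finals, symbols):
    """Partition refinement; returns a dict mapping each DFA state to its block id."""
    block = {q: int(q in finals) for q in states}
    while True:
        # Signature: own block plus the block reached on each symbol (-1 = not allowed).
        sig = {q: (block[q],) + tuple(block.get(dfa.get((q, a)), -1) for a in symbols)
               for q in states}
        ids = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new_block = {q: ids[sig[q]] for q in states}
        if len(ids) == len(set(block.values())):
            return new_block              # no block was split: the partition is stable
        block = new_block

symbols = ["P", "I", "V"]
# Combined NFA for {P*I, PV}: shared start state 1; 4 and 5 are final.
nfa = {(1, "P"): {2, 3}, (2, "P"): {2}, (2, "I"): {4}, (3, "V"): {5}}
states, dfa, start, dfa_finals = subset_construction(nfa, 1, {4, 5}, symbols)
blocks = minimize(states, dfa, dfa_finals, symbols)
num_states, num_optimized = len(states), len(set(blocks.values()))
```

For this input the subset construction yields 5 DFA states and the refinement merges the two final states, leaving 4 blocks, which agrees with the 5-to-4 reduction described for Figure 5.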

4. EXPERIMENT

4.1 Experimental Setup

Data Sets: In order to evaluate the performance of the proposed approach on influence maximization, we tested our method on 3 real-world heterogeneous information network datasets with different network schemas, as shown in Figure 6.

• DBLP dataset: The first benchmark dataset [15] is a bibliographic network collected from the DBLP website, which contains 20 conferences covering 4 research areas, i.e., database, data mining, machine learning and information retrieval. In this network, there are three types of nodes, i.e., author (A), conference (C) and term (T), as shown in Figure 6(a). In total, the network includes 42,000 nodes and 1 million edges.

• ACM dataset: The second dataset is collected from the ACM website [1]. We collected all conference proceedings from 14 representative conferences (C) in computer science. The dataset involves 6 types of nodes, including conferences (C), proceedings (V), papers (P), authors (A), affiliations (F) and research areas/index terms (I). The whole network has 32,000 nodes and 2 million edges, including the citation links among papers, as shown in Figure 6(b).

• SGD dataset: The last dataset [11] contains various types of information concerning the yeast organism Saccharomyces cerevisiae, including genes (G), journals (J), authors (A) and papers (P). The whole network contains 120,000 nodes and 1.2 million edges, as shown in Figure 6(c).

In the DBLP dataset, we utilize the weights on the edges and normalize the weights to get the diffusion probabilities. Since there are no weights on the edges in the ACM and SGD datasets, we set 5% as the default diffusion probability on all edges.

Comparative Method: In order to demonstrate the effectiveness of the proposed approach, we compare our method against four baseline methods, summarized as follows:

Footnote [1]: http://portal.acm.org/

Table 2: Case Study: Influence Node Discovery

(a) Influential Affiliations in ACM publication network (D = {FAP*})

Rank | Homo | Inf. | Semi-Hetero | Inf. | AHD | Inf.
1 | Cornell | 70.7 | NUS | 30.8 | IBM | 227.234
2 | AT&T | 56.4 | WIS | 20.9 | Microsoft | 149.12
3 | UCB | 121.2 | CMU | 117.1 | MIT | 132.866
4 | Purdue | 29.1 | Google | 43.5 | Stanford | 121.604
5 | UWater | 33.7 | Stanford | 121.1 | UCB | 119.008
6 | Harvard | 26.0 | UIUC | 69.5 | CMU | 118.406
7 | Princeton | 57.1 | IBM | 229.2 | Yahoo! | 87.89
8 | Stanford | 122.0 | Columbia | 28.6 | AT&T | 70.302
9 | Microsoft | 150.6 | Umass | 49.5 | Cornell | 69.586
10 | MIT | 133.0 | UCB | 119.008 | UIUC | 69.076
Avg. | | 80.0 | | 82.9 | | 116.5

(b) Influential authors in DBLP dataset (D = {AC})

Rank | Homo | Inf. | Semi-Hetero | Inf. | AHD | Inf.
1 | Wei Wang | 0.46 | Surajit Chaudhuri | 0.51 | Philip S. Yu | 1.16
2 | Laks V. S. Lakshmanan | 0.30 | Philip S. Yu | 1.15 | Jiawei Han | 0.90
3 | Philip S. Yu | 1.14 | H. V. Jagadish | 0.57 | Christos Faloutsos | 0.69
4 | Wei-Ying Ma | 0.34 | Rakesh Agrawal | 0.57 | H. V. Jagadish | 0.57
5 | Hector Garcia-Molina | 0.55 | Jiawei Han | 0.91 | Rakesh Agrawal | 0.56
6 | Raghu Ramakrishnan | 0.52 | Zheng Chen | 0.31 | Hans-Peter Kriegel | 0.55
7 | Qiang Yang | 0.45 | Divesh Srivastava | 0.56 | Surajit Chaudhuri | 0.53
8 | Jiawei Han | 0.90 | Christos Faloutsos | 0.69 | Divesh Srivastava | 0.55
9 | Gerhard Weikum | 0.50 | W. Bruce Croft | 0.49 | Hector Garcia-Molina | 0.55
10 | Hans-Peter Kriegel | 0.53 | Nick Koudas | 0.42 | Raghu Ramakrishnan | 0.51
Avg. | | 0.57 | | 0.62 | | 0.66

(c) Influential journals in SGD dataset (D = {JPPJ})

Rank | Homo | Inf. | Semi-Hetero | Inf. | AHD | Inf.
1 | Biochem | 1.14 | Genetics | 2.65 | Cell | 4.50
2 | J Cell Biol | 2.73 | Science | 2.63 | Mol Cell | 4.26
3 | FEBS Lett | 1.30 | J Bacteriol | 1.56 | J Biol | 4.16
4 | Nucleic | 1.72 | Biochem | 1.17 | EMBO J | 3.36
5 | Mol Gen | 1.67 | Biochim | 1.16 | Proc Natl | 3.29
6 | Genetics | 2.63 | RNA | 1.20 | Genes Dev | 2.92
7 | Yeast | 1.89 | FEBS Lett | 1.32 | Nature | 2.82
8 | EMBO J | 3.26 | Cell | 4.54 | J Cell Biol | 2.74
9 | Mol Cell | 1.77 | J Biol | 4.14 | Genetics | 2.69
10 | Gene | 1.62 | Nature | 2.86 | Science | 2.69
Avg. | | 1.97 | | 2.32 | | 3.34

(d) Additional Information on the journals in SGD dataset

• Homogeneous IC Model (Homo) [4]: We first compare with conventional IC model based upon the greedy hill-climbing. In this model, information propagates without paths constraints. Nodes that maximize the number of total influenced nodes are chosen as seeding nodes. This method is originally designed for homogeneous information networks which cannot differentiate the type information on each node. Here we convert the heterogenous information networks to homogeneous information networks by ignoring type information on each node. • Semi-Heterogeneous IC Model (Semi-Hetero): We then compare with another baseline method that is modified from homogeneous IC model. In this method, the node type information is considered, i.e., nodes that maximize the number of influenced nodes with type VT are chosen as seeding nodes. But constraints on propagation paths are not considered in this baseline. • Maximum Degree (Degree): This algorithm choose the nodes with maximum degrees as influential nodes. • Random: Randomly choose a set of nodes as influential nodes. • Automata-based Heterogeneous Diffusion (AHD): The proposed method in this paper. AHD can consider the node type information as well as the path constraints in the diffusion process. All algorithms are implemented in Java and tested on machines with Intel Core i5 processors with 4GB RAM.

4.2 Case Study: Influential Node Discovery In our first experiment, we study the performance of our method in a simple case of influence maximization problem, i.e., we set k = 1 and focus on the performance of influence estimation step. Here, the problem of influence maximization degenerates to the problem of finding the most influential node. We show case study results on all dataset in Table 2 . In detail, for ACM dataset, we set the diffusion path constraint as D = {F AP ∗ }, the seeding node type as

Rank by AHD 1 2 3 4 5 6 7 8 9 10

Journal Cell Mol Cell J Biol EMBO J Proc Natl Genes Dev Nature J Cell Biol Genetics Science

Impact Factor 39.191 6.188 5.328 13.871 9.771 14.198 36.01 9.58 3.889 31.364

# Paper 1112 3394 6101 1544 2197 754 693 1113 1677 610

Product Score 43580.392 21002.072 32506.128 21416.824 21466.887 10705.292 24954.93 10662.54 6521.853 19132.04

S = {F } and the target node type as T = {P }. In other words, the task here is to list the top-10 affiliation nodes (A) that have most influence on paper nodes (P). The results are listed on Table 1(a). Here the “Inf.” denotes the influence score obtained from path-constrained diffusion process. In DBLP dataset, we set the diffusion path constraint as D = {AC}, S = {A} and T = {C}. This task is to list the top-10 authors that have the largest influence on conferences. The result is listed on Table 1(b). In SGD dataset, we set D = {JP P J}, S = {J} and T = {J}. This task is to list the top-10 journals that have most influence on other journals. The result is listed on Table 1(c). The results of our approach outperform those of Homo and Semi-Hetero. In ACM dataset, we found the affiliations “IBM” and “Microsoft” are ranked as top-2 in our AHD approach followed by four famous universities. This result is reasonable when we consider the influence of affiliations on papers, because “IBM” and “Microsoft” both have very large research groups than then computer science departments of universities. However, in the results of Homo method, where the node types are ignored, “IBM” is out of the top-10 list. In the results of Semi-Hetero method, where the path constraints are ignored, “IBM”’s rank is 7. In DBLP dataset, we can also that the result of AHD is more reasonable when we consider the influence of authors on the 14 conferences. The top ranked authors by AHD all have published papers in most of the 14 conferences. In SGD dataset, we list ranking result of each method as well as some informations about the journals, i.e., the impact factors, the paper numbers in the SGD dataset and the product of impact factor and paper number for each journal. In the result of our AHD approach, we can see that the journals either have high impact factors with few papers, e.g., “Cell” and “Nature”, or have moderate impact factors with many papers, e.g., “Mol Cell Biol” and “J Biol Chem”. 
It is reasonable that both journals with high impact factors and journals with many published papers can exert great influence on other journals. Thus the ranking fits the product of these two properties very well.

Based upon the results above, we find that the ranking results are greatly improved by considering node type information and path constraints in our approach. The comparison with the Semi-Hetero and Homo methods shows that both path constraints and node types play important roles in influence estimation. A diffusion process without path constraints leads to unreasonable results.

Figure 7: Performance of influence maximization with different sizes of seeding nodes (x-axis: Chosen Size; y-axis: Influence; methods: Homo, Semi-Hetero, Degree, Random, AHD). Panels: (a) D = {PGPGP, PGP} (SGD dataset); (b) D = {JPJ} (SGD dataset); (c) D = {CATC} (DBLP dataset); (d) D = {CVPAPI} (ACM dataset); (e) D = {CVPI} (ACM dataset); (f) D = {PIP} (ACM dataset).
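As a quick arithmetic check of the product scores reported in Table 2(d), the following Python sketch recomputes impact factor × number of papers from the table values and compares the resulting order with the AHD ranking. The variable and function names are ours, and the top-5 overlap is only an illustrative comparison, not a measure used in the original evaluation.

```python
# Journal data copied from Table 2(d): (journal, impact factor, number of papers),
# listed in the order of the AHD ranking.
journals = [
    ("Cell", 39.191, 1112),
    ("Mol Cell", 6.188, 3394),
    ("J Biol", 5.328, 6101),
    ("EMBO J", 13.871, 1544),
    ("Proc Natl", 9.771, 2197),
    ("Genes Dev", 14.198, 754),
    ("Nature", 36.01, 693),
    ("J Cell Biol", 9.58, 1113),
    ("Genetics", 3.889, 1677),
    ("Science", 31.364, 610),
]


def product_score(impact_factor, papers):
    """Product of impact factor and paper count, as in the last column of Table 2(d)."""
    return impact_factor * papers


# Scores in AHD rank order, rounded to three decimals like the table.
scores = [(name, round(product_score(f, n), 3)) for name, f, n in journals]

# The same journals re-sorted by product score, highest first.
by_product = sorted(scores, key=lambda item: item[1], reverse=True)

# How many of the AHD top-5 journals are also in the top-5 by product score.
ahd_top5 = {name for name, _ in scores[:5]}
product_top5 = {name for name, _ in by_product[:5]}
overlap = len(ahd_top5 & product_top5)

for name, score in by_product:
    print(f"{name:12s} {score:10.3f}")
print("top-5 overlap:", overlap)
```

On these values, four of the five journals ranked highest by AHD are also among the five highest product scores, which is consistent with the observation that the ranking follows the product of the two properties.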

4.3 Performance of Influence Maximization

In this subsection, we evaluate the influence maximization performance of our approach. Figure 7 shows the results under different diffusion schemas on each dataset. Our AHD method achieves higher influence than the baseline methods on all tasks and for all sizes of the seeding node set.
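The seeding sets of increasing size evaluated here come from a greedy hill-climbing loop, which can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the Java implementation used in the experiments: `greedy_seeds`, `toy_reach` and `toy_spread` are hypothetical names, and the spread function is a toy stand-in (size of the union of fixed reach sets) rather than an estimate from the path-constrained diffusion process.

```python
from typing import Callable, Hashable, Iterable, List, Set


def greedy_seeds(candidates: Iterable[Hashable],
                 spread: Callable[[Set[Hashable]], float],
                 k: int) -> List[Hashable]:
    """Pick up to k seeds, each time adding the candidate with the largest marginal gain."""
    chosen: List[Hashable] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = spread(set(chosen))
        # Marginal gain of v is spread(chosen + v) - spread(chosen).
        best = max(remaining, key=lambda v: spread(set(chosen) | {v}) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy stand-in for an influence estimate (an assumption for illustration):
# each candidate seed reaches a fixed set of target nodes.
toy_reach = {
    "a1": {"p1", "p2", "p3"},
    "a2": {"p3", "p4"},
    "a3": {"p1", "p2"},
}


def toy_spread(seeds):
    """Spread of a seed set = number of distinct target nodes reached."""
    reached = set()
    for s in seeds:
        reached |= toy_reach[s]
    return len(reached)


seeds = greedy_seeds(toy_reach, toy_spread, k=2)
print(seeds, toy_spread(set(seeds)))
```

In this toy instance the loop first picks `a1` (gain 3) and then `a2` (gain 1), skipping `a3`, whose targets are already covered. Any spread estimator with the same call shape could be substituted for `toy_spread`.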

4.4 Customized Influence Pattern

Different from conventional influence maximization methods, our AHD model can provide customized results by using different diffusion schemas. We study the behavior of our method under different diffusion schemas through two groups of results: a) we first analyze the influence under different type constraints on the seeding and target nodes; b) we then show the influence under different path constraints.

a) Type constraints on nodes: We choose 3 different path constraints in the ACM dataset. In this experiment, we use the AHD method to find the top authors in the ACM academic network based upon their influence on paper nodes (T = {P}), conference nodes (T = {C}) and research area nodes (T = {I}) separately. All three tasks have the same seeding node type S = {A}, while the target types and path constraints differ. Task 1: T = {P} and D = {AP∗}; Task 2: T = {C} and D = {APVC}; Task 3: T = {I} and D = {API}. The results are shown in Table 3.

Table 3: AHD’s results under different target node types

Rank | Task 1 (T = {P}) | Task 2 (T = {C}) | Task 3 (T = {I})
1 | ChengXiang Zhai | Jiawei Han | Jiawei Han
2 | Jiawei Han | Philip Yu | Philip Yu
3 | W. Bruce Croft | ChengXiang Zhai | Wei-Ying Ma
4 | Philip Yu | Christos Faloutsos | ChengXiang Zhai
5 | Christos Faloutsos | W. Bruce Croft | Zheng Chen

Table 3 shows that AHD produces different rankings when focusing on different target node types. In Task 1, where we study the influence of each author on paper nodes, “ChengXiang Zhai” ranks first. However, in Task 2, where we study the influence of authors on conferences, “Jiawei Han” ranks first. These results are reasonable: in the ACM dataset, ChengXiang Zhai’s papers receive more citations than Jiawei Han’s, while Jiawei Han’s papers are distributed more broadly across different conferences. In Task 3, where we consider authors’ influence on research areas, authors with broader research areas rank higher. For example, “Jiawei Han” has published papers in 10 research areas indicated by the ACM index terms, while “Philip Yu” and “Wei-Ying Ma” have each published papers in 8 research areas. So the ranking is also reasonable when we consider influence on areas/index terms. In general, our diffusion schema with node type constraints can evaluate the influence of authors from different aspects.

b) Path constraints: The above part shows that results should differ under different target node types. In real-world applications, even with the same target node types, the results can still differ if we consider different aspects of the influence from nodes in V_S to nodes in V_T. Such aspects can be described through diffusion path constraints. We choose two tasks in the ACM dataset to illustrate the different results under different path settings. Task 3: D = {FAP∗}. Task 4: D = {FAPIP}. The results are shown in Table 4.

Table 4: AHD’s results under different path constraints

(a) Rank by AHD method

Rank | {FAP∗} | {FAPIP}
1 | IBM | IBM
2 | Microsoft | CMU
3 | MIT | MIT
4 | Stanford | Stanford
5 | UCB | Microsoft
6 | CMU | UCB
7 | Yahoo! | Cornell
8 | AT&T | AT&T
9 | Cornell | UIUC
10 | UIUC | Yahoo!

(b) Rank by #Paper

Rank | Affiliation | #P
1 | IBM | 2327
2 | Microsoft | 1569
3 | MIT | 1555
4 | Stanford | 1460
5 | UCB | 1414
6 | CMU | 1329
7 | AT&T | 929
8 | Yahoo! | 858
9 | Cornell | 822
10 | UIUC | 801

(c) Rank by #Field

Rank | Affiliation | #I
1 | IBM | 56
2 | UCB | 55
3 | CMU | 53
4 | Stanford | 51
5 | MIT | 50
6 | Microsoft | 48
7 | AT&T | 45
8 | Cornell | 45
9 | U.Maryland | 43
10 | AT&T | 42

Path FAP∗ means that we mainly consider citations as the way academic ideas spread, while path FAPIP means that we mainly consider communication within the same field. Table 4(b) shows that universities like CMU and Stanford have fewer papers than affiliations like Microsoft Research, while Table 4(c) shows that these universities cover more fields than Microsoft Research. This result makes sense because universities usually have many departments covering wider research areas than companies, whose research centers usually concentrate on fields related to their business. From the results in Table 4(a), we can see that the influence under path FAP∗ is similar to the ranking in Table 4(b), and the influence under path FAPIP is similar to the ranking in Table 4(c). In general, different diffusion paths lead to customized influence results.

4.5 Scalable Performance

We test the scalability of our approach in two dimensions. First, we compare the running time of our approach with that of Homo and Semi-Hetero for different numbers of chosen seeding nodes. Then we show that the execution time of our approach grows slowly when the size of the diffusion schema increases.

Figure 8 shows two tasks on the DBLP and SGD datasets. These two examples demonstrate that our approach scales much better than the Homo and Semi-Hetero algorithms. The diffusion process of these two algorithms is much slower because most of the diffusion paths explored in the homogeneous context are meaningless in the heterogeneous context. Our approach follows only the meaningful paths and eliminates the rest.

Figure 8: Running time performance in log scale (x-axis: Chosen Size; y-axis: Running Time (Millisecond); methods: Homo, Semi-Hetero, AHD). Panels: (a) DBLP with constraints (D = {APA, ATA, AA∗}); (b) SGD with constraints (D = {APJPA}).

Figure 9 shows three tasks on DBLP, SGD and ACM. These tasks evaluate the execution time for increasing diffusion schema sizes. We add five different paths into the diffusion schema one by one for each task. The paths for the DBLP dataset are TACT, TAT, TAAT, TCAT and TAAAT. The paths for the ACM dataset are FAPL, FAP, FAPP, FAPV and FAPLP. The paths for the SGD dataset are APJ, APG, APP, APP∗ and APA. The results demonstrate that our AHD method can evaluate influence under multiple paths almost as fast as under a single path.

Figure 9: Speed performance with different diffusion schema sets (x-axis: Diffusion Schema Size; y-axis: Running Time (Millisecond); series: DBLP Task, ACM Task, SGD Task).
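One way to see why adding paths to the diffusion schema need not multiply the cost is that a set of path constraints over node-type letters can be merged into a single deterministic automaton, so that following an edge advances one state regardless of how many paths the schema contains. The Python sketch below illustrates this with the five DBLP paths listed above. It is a simplification under stated assumptions: it handles only fixed-length paths (no ∗ operator), it uses a plain trie as the automaton, and the class name `PathAutomaton` is hypothetical rather than the construction used in AHD.

```python
class PathAutomaton:
    """Deterministic automaton (a trie) accepting a finite set of node-type paths."""

    def __init__(self, paths):
        # transitions[state][node_type] -> next state; state 0 is the start state.
        self.transitions = [{}]
        self.accepting = set()
        for path in paths:
            state = 0
            for node_type in path:
                nxt = self.transitions[state].get(node_type)
                if nxt is None:
                    # Create a new state for a prefix not seen before.
                    nxt = len(self.transitions)
                    self.transitions.append({})
                    self.transitions[state][node_type] = nxt
                state = nxt
            self.accepting.add(state)

    def step(self, state, node_type):
        """Return the next state, or None if no legal path continues this way."""
        return self.transitions[state].get(node_type)

    def accepts(self, type_sequence):
        """True if the sequence of node types is exactly one of the schema paths."""
        state = 0
        for node_type in type_sequence:
            state = self.step(state, node_type)
            if state is None:
                return False
        return state in self.accepting


# The five DBLP paths from the scalability experiment.
dblp_paths = ["TACT", "TAT", "TAAT", "TCAT", "TAAAT"]
automaton = PathAutomaton(dblp_paths)

for candidate in ["TAT", "TAAAT", "TA", "TCT"]:
    print(candidate, automaton.accepts(candidate))
```

Because the paths share prefixes, the five constraints collapse into one small state table, and checking whether a partial diffusion path can still be extended costs a single dictionary lookup per edge.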

5. RELATED WORKS

To the best of our knowledge, this paper is the first work on influence maximization for heterogeneous information networks. Some research has been done in related areas.

Influence maximization, or information diffusion, deals with the problem of finding a subset of seeding nodes that can maximize the influence within the network. It has received tremendous attention from both the academic and industry communities. Traditional works on influence maximization focus on homogeneous networks. Several diffusion models have been proposed for networks with various properties. Three representative models are the linear threshold model [19], the independent cascade model [5, 6] and the decreasing cascade model [9]. Kempe et al. [8, 9] proved submodularity for these three models, which means that greedy hill-climbing algorithms [4] can approximate the optimum solution to within a factor of (1 − 1/e) [3, 14]. Several works have also been proposed to optimize the greedy algorithm [12, 2]. Liu et al. [13] utilize heterogeneous information to analyze topic-level influence. However, influence maximization in heterogeneous information networks has not been studied yet.

Heterogeneous information networks have attracted much attention recently [17, 18, 16, 10, 11, 1]. For example, Sun et al. [17, 18, 16] studied the clustering problem and the top-k similarity problem in heterogeneous information networks. Chen et al. [1] studied OLAP in the heterogeneous context. Several efforts have been made to utilize meta paths in heterogeneous networks to solve problems [16, 10, 11]. While previous works focus on paths with fixed length, our work extends meta paths to more general paths. This extension strengthens the ability of paths to represent actions in heterogeneous networks.

6. CONCLUSION

In this paper, we study the problem of influence maximization in heterogeneous information networks. Conventional approaches to influence maximization target homogeneous networks and assume that all nodes in the network have the same type. In heterogeneous information networks, by contrast, there exist several types of nodes and edges, and the diffusion process is path-constrained. We formulated this new problem and derived a new cascade model which has type constraints and path constraints in the diffusion process. The definition of our new problem is more general than the traditional one in homogeneous networks. Finally, we gave a solution named Automata-based Heterogeneous Diffusion, which utilizes automata theory to solve this problem. Empirical studies on real-world tasks demonstrated the effectiveness and efficiency of our approach.

7. REFERENCES

[1] C. Chen, X. Yan, F. Zhu, J. Han, and P. S. Yu. Graph OLAP: Towards online analytical processing on graphs. In ICDM, pages 103–112, 2008.
[2] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD, pages 199–208, 2009.
[3] G. Cornuejols, M. Fisher, and G. Nemhauser. Location of bank accounts to optimize float. Management Science, 23(8):789–810, 1977.
[4] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, pages 57–66, 2001.
[5] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001.
[6] J. Goldenberg, B. Libai, and E. Muller. Using complex systems analysis to advance marketing theory development. Academy of Marketing Science Review, 1(9), 2001.
[7] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 1979.
[8] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.
[9] D. Kempe, J. Kleinberg, and E. Tardos. Influential nodes in a diffusion model for social networks. In ICALP, pages 1127–1138, 2005.
[10] N. Lao and W. W. Cohen. Fast query execution for retrieval models based on path-constrained random walks. In KDD, pages 881–888, 2010.
[11] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(2):53–67, 2010.
[12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, pages 420–429, 2007.
[13] L. Liu, J. Tang, J. Han, M. Jiang, and S. Yang. Mining topic-level influence in heterogeneous networks. In CIKM, pages 199–208, 2010.
[14] G. Nemhauser, L. Wolsey, and M. Fisher. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.
[15] Y. Sun, J. Han, J. Gao, and Y. Yu. iTopicModel: Information network-integrated topic modeling. In ICDM, pages 493–502, 2009.
[16] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. In VLDB, 2011.
[17] Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. RankClus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT, pages 565–576, 2009.
[18] Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD, pages 797–806, 2009.
[19] T. Valente. Network models of the diffusion of innovations. Computational Mathematical Organization Theory, 2(2):163–164, 1995.

APPENDIX

A. PROOF OF SUBMODULARITY

We use the name PCIC for the IC model in heterogeneous information networks with path constraints. Each edge in the network can be attempted at most once in the PCIC model, so we can determine the result of every attempt before the diffusion process starts. With all the coins flipped in advance, we want to prove that the PCIC model has an order-independent outcome in the resulting deterministic network. The edges in G whose coin flips result in successful activation are declared to be live edges; the remaining edges are declared to be blocked. If we fix the results of the coin flips and activate the seeding node set, the set of activated nodes in V_T at the end of the cascade process is unique. We prove it as follows. Different from the IC model, even if there exists a path consisting entirely of live edges from a seeding node to another node u, u will not be activated if all such live paths are illegal path instances. Conversely, as shown in Section 3.3, if there exists a live path from a seeding node to v that is also a legal path instance, v will eventually be activated. Activation therefore depends only on the existence of a live and legal path, not on the order in which activations are attempted, so with all the coins flipped in advance the PCIC model has an order-independent outcome.

Let X denote a fixed outcome of the coin flips, and let R(u, X) denote the set of nodes that can be reached from seeding node u through a live and legal path. We also use σ_X(·) to denote the number of influenced nodes under the fixed outcome X. First, we claim that for each fixed outcome X, the function σ_X(·) is submodular. Let S and T be two sets of nodes with S ⊆ T ⊆ V_S, and let v be a node in V_S. Consider the quantity σ_X(S ∪ {v}) − σ_X(S). Because the outcome is order-independent, we can separate the diffusion process of S ∪ {v} into two steps: step 1 diffuses from the seeding nodes in S, and step 2 diffuses from node v. Step 1 influences exactly σ_X(S) nodes, while step 2 adds zero or more nodes.

Let Extra(S, v, X) denote the number of nodes in R(v, X) that are not in the union ∪_{u∈S} R(u, X), so that σ_X(S ∪ {v}) − σ_X(S) = Extra(S, v, X). Since S ⊆ T, the union over T contains the union over S, and hence Extra(S, v, X) ≥ Extra(T, v, X). The submodularity inequality follows:

$$\sigma_X(S \cup \{v\}) - \sigma_X(S) \ge \sigma_X(T \cup \{v\}) - \sigma_X(T), \quad S \subseteq T.$$

The expected influence is the probability-weighted sum over all outcomes:

$$\sigma(A) = \sum_{\text{outcome } X} \mathrm{Prob}[X] \cdot \sigma_X(A).$$

Since a non-negative linear combination of submodular functions is also submodular, σ(·) is submodular. So we can use the hill-climbing algorithm as the top level of our solution, and the solution approximates the optimum to within a factor of (1 − 1/e).
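The diminishing-returns inequality for a fixed outcome X can be checked by brute force on a small instance. The Python sketch below is only an illustration of the argument, not part of the proof: the reach sets `R` are made-up values standing in for R(u, X), and the function names are ours.

```python
from itertools import combinations

# Hypothetical reach sets R(u, X) for one fixed coin-flip outcome X:
# the target nodes reachable from each seed through a live and legal path.
R = {
    "u1": {"t1", "t2"},
    "u2": {"t2", "t3"},
    "u3": {"t3", "t4", "t5"},
}


def sigma_X(seeds):
    """Number of target nodes influenced by the seed set under outcome X."""
    reached = set()
    for u in seeds:
        reached |= R[u]
    return len(reached)


def subsets(items):
    """Yield every subset of items as a set."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)


def is_submodular():
    """Check sigma_X(S + v) - sigma_X(S) >= sigma_X(T + v) - sigma_X(T) for all S within T."""
    for T in subsets(R):
        for S in subsets(T):
            for v in R:
                gain_S = sigma_X(S | {v}) - sigma_X(S)
                gain_T = sigma_X(T | {v}) - sigma_X(T)
                if gain_S < gain_T:
                    return False
    return True


result = is_submodular()
print("submodular on this instance:", result)
```

For example, adding `u2` to the empty set gains two target nodes, while adding it to {u1, u3} gains none, because t2 and t3 are already covered.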