The 15th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2011) 25 May 2011
LGM: Mining Frequent Subgraphs from Linear Graphs
Yasuo Tabei
ERATO Minato Project, Japan Science and Technology Agency
Joint work with Daisuke Okanohara (Preferred Infrastructure), Shuichi Hirose (AIST), Koji Tsuda (AIST)
Outline
• Introduction to linear graphs
  ★ Linear subgraph relation
  ★ Total order among edges
• Frequent subgraph mining from a set of linear graphs
• Experiments
  ★ Motif extraction from protein 3D structures
Linear graph (Davydov et al., 2004)
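• Labeled graph whose vertices are totally ordered
• Linear graph $g = (V, E, L^V, L^E)$
  ‣ $V \subset \mathbb{N}$: ordered vertex set
  ‣ $E \subseteq V \times V$: edge set
  ‣ $L^V : V \to \Sigma^V$: vertex labels
  ‣ $L^E : E \to \Sigma^E$: edge labels
Example:
[Figure: a linear graph with six vertices (1 to 6) carrying labels from {A, B, C} and edges labeled a, b, c]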
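The definition above can be sketched as a small data structure. This is only an illustration: the class name `LinearGraph`, its field names, and the example edges are assumptions made here (the edge endpoints of the slide's figure are not recoverable), and the convention that every edge is stored as (i, j) with i < j is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


@dataclass
class LinearGraph:
    """A labeled graph whose vertices are integers, hence totally ordered."""
    vertex_labels: Dict[int, str]                      # L^V : V -> Sigma^V
    edge_labels: Dict[Edge, str] = field(default_factory=dict)  # L^E : E -> Sigma^E

    def __post_init__(self) -> None:
        for (i, j) in self.edge_labels:
            if i not in self.vertex_labels or j not in self.vertex_labels:
                raise ValueError(f"edge {(i, j)} uses an unknown vertex")
            if not i < j:
                raise ValueError(f"edge {(i, j)} must satisfy i < j")

    @property
    def vertices(self) -> List[int]:
        """Vertices in their total order."""
        return sorted(self.vertex_labels)

    @property
    def edges(self) -> List[Edge]:
        """Edges sorted by left vertex, then right vertex."""
        return sorted(self.edge_labels)


# Hypothetical example: six ordered vertices and four labeled edges.
EXAMPLE = LinearGraph(
    vertex_labels={1: "A", 2: "B", 3: "A", 4: "B", 5: "C", 6: "A"},
    edge_labels={(1, 3): "a", (2, 4): "a", (1, 5): "b", (4, 6): "c"},
)
```

Storing vertices as integers makes the vertex order implicit, so no separate ordering structure is needed.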
Linear subgraph relation
• g1 is a linear subgraph of g2 if
  i) Conventional subgraph condition
    ★ Vertex labels are matched
    ★ All edges of g1 exist in g2 with the correct labels
  ii) The order of vertices is conserved
Example:
[Figure: g1 (vertices 1 A, 2 B, 3 A) ⊂ g2 (vertices 1 A, 2 A, 3 B, 4 B, 5 C, 6 A)]
Subgraph but not linear subgraph
• g1 is a subgraph of g2
  ★ Vertex labels are matched
  ★ All edges in g1 also exist in g2 with the correct labels
• g1 is not a linear subgraph of g2
  ★ The order of vertices is not conserved
[Figure: g1 (three vertices) and g2 (four vertices); g1 occurs in g2 only under a vertex mapping that does not preserve the vertex order]
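The difference between the two relations can be made concrete with a brute-force check. The sketch below is not the algorithm of the talk: it simply tries every injective vertex mapping (ordinary subgraph) or every order-preserving one (linear subgraph). The function name `embeddings`, the graph encoding as a pair (vertex labels, edge labels), and the three toy graphs are assumptions introduced for illustration; edges are treated as undirected and stored as (i, j) with i < j.

```python
from itertools import combinations, permutations


def embeddings(g1, g2, order_preserving=True):
    """Yield every mapping of g1's vertices into g2 that matches labels and edges.

    A graph is a pair (vertex_labels, edge_labels) where vertex_labels maps
    vertex -> label and edge_labels maps (i, j) with i < j -> label.
    With order_preserving=True only increasing mappings are tried, which is
    the linear subgraph relation; otherwise any injective mapping is allowed.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    v1 = sorted(labels1)
    v2 = sorted(labels2)
    # combinations of a sorted list are increasing tuples, i.e. order-preserving maps
    choose = combinations if order_preserving else permutations
    for image in choose(v2, len(v1)):
        m = dict(zip(v1, image))
        if any(labels1[v] != labels2[m[v]] for v in v1):
            continue
        if all(
            edges2.get((min(m[i], m[j]), max(m[i], m[j]))) == lab
            for (i, j), lab in edges1.items()
        ):
            yield m


# Hypothetical toy graphs.
g1 = ({1: "A", 2: "B"}, {(1, 2): "a"})
g2 = ({1: "B", 2: "A"}, {(1, 2): "a"})             # same edge, vertex order reversed
g3 = ({1: "A", 2: "A", 3: "B"}, {(2, 3): "a"})

print(list(embeddings(g1, g3)))                          # [{1: 2, 2: 3}]
print(list(embeddings(g1, g2)))                          # []
print(list(embeddings(g1, g2, order_preserving=False)))  # [{1: 2, 2: 1}]
```

Here g1 is a subgraph of g2 in the conventional sense, but not a linear subgraph, because the only label-preserving mapping swaps the two vertices.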
Total order among edges in a linear graph
• Compare the left vertices first. If they are identical, look at the right vertices
• $\forall e_1 = (i, j),\ e_2 = (k, l) \in E_g$: $e_1 <_e e_2$ if and only if (i) $i < k$, or (ii) $i = k$ and $j < l$
Example:
[Figure: two edges e1 = (i, j) and e2 = (k, l) of an example linear graph, compared under this order]
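The edge order is the lexicographic order on (left vertex, right vertex) pairs, which happens to be how Python compares tuples. A minimal sketch, assuming edges are stored as (i, j) tuples; the function name `edge_less` and the edge list are made up for illustration:

```python
def edge_less(e1, e2):
    """True if e1 precedes e2: compare left vertices, then right vertices."""
    (i, j), (k, l) = e1, e2
    return i < k or (i == k and j < l)


# Hypothetical edge list of a linear graph.
edges = [(2, 4), (1, 3), (1, 2), (3, 4)]

# Tuple comparison implements the same rule, so sorted() gives the total order.
ordered = sorted(edges)
print(ordered)  # [(1, 2), (1, 3), (2, 4), (3, 4)]
```

Because the order is total, every non-empty edge set has a unique largest edge, which is what the reduction map in the reverse search relies on.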
Frequent subgraph mining from linear graphs
• Enumerate all frequent subgraphs from a set of linear graphs
  ★ Subgraphs included in a set of linear graphs at least τ times (minimum support threshold)
  ★ Enumerate connected and disconnected subgraphs with a unified framework
• Use reverse search for an efficient enumeration (Avis and Fukuda, 1993)
  ★ Polynomial delay
  ★ gSpan = exponential delay
Enumeration of all linear subgraphs of a linear graph
• Before considering a mining algorithm, we first have to solve the problem of subgraph enumeration
• How to enumerate all subgraphs of the following linear graph without duplication?
[Figure: example linear graph]
Search lattice of all subgraphs
[Figure: search lattice of all subgraphs of the example linear graph]
Reverse search (Avis and Fukuda, 1993)
• To enumerate all subgraphs without
duplication, we need to define a search tree in the search lattice
• Reduction map f
  ★ Mapping from a child to its parent
  ★ Remove the largest edge
[Figure: the reduction map f applied to an example subgraph, removing its largest edge]
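A minimal sketch of such a reduction map, under the assumption that a subgraph is represented just by its set of edges (i, j). The names `reduce_pattern` and `path_to_root` are hypothetical, and whether the talk's map also discards vertices that become isolated is not recoverable from this text, so the sketch only removes the edge.

```python
def reduce_pattern(edges):
    """Reduction map f: return the parent, obtained by removing the largest edge.

    The empty edge set is the root and has no parent (None).
    """
    if not edges:
        return None
    return tuple(sorted(edges)[:-1])


def path_to_root(edges):
    """Apply f repeatedly until the empty subgraph is reached."""
    path = [tuple(sorted(edges))]
    while path[-1]:
        path.append(reduce_pattern(path[-1]))
    return path


# Hypothetical subgraph with three edges.
for node in path_to_root([(2, 3), (1, 2), (1, 4)]):
    print(node)
# ((1, 2), (1, 4), (2, 3))
# ((1, 2), (1, 4))
# ((1, 2),)
# ()
```

Every subgraph has exactly one parent, so following f from any node always ends at the same root; this is what turns the lattice into a tree.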
Search tree induced by the reduction map
• Applying the reduction map to every element induces a search tree
[Figure: search tree induced by the reduction map]
Inverting the reduction map ($f^{-1}$)
• When traversing the tree from the root, child nodes are created on demand
• In most cases, inverting the reduction map takes the following two steps:
  ★ Consider all candidate children
  ★ Keep the ones that the reduction map sends back to the current node
• However, in this particular case, the reduction map can be inverted explicitly
  ★ The pattern extension rule (parent to children) can be derived
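Both ways of inverting the map can be compared on a simplified setting: subgraphs are edge subsets of one fixed host graph, and the reduction map removes the largest edge. This is an illustration rather than the extension rule of the talk; `HOST_EDGES`, `parent`, `children_by_filtering`, and `children_explicit` are names assumed here.

```python
# Hypothetical host linear graph, given by its edges (i, j) with i < j.
HOST_EDGES = [(1, 2), (1, 4), (2, 3), (3, 4)]


def parent(sub):
    """Reduction map on a sorted tuple of edges: drop the largest edge."""
    return sub[:-1]


def children_by_filtering(sub, host_edges):
    """Generic inversion: build every candidate, keep those mapped back to sub."""
    candidates = [tuple(sorted(sub + (e,))) for e in host_edges if e not in sub]
    return sorted(c for c in candidates if parent(c) == sub)


def children_explicit(sub, host_edges):
    """Explicit inversion: only an edge larger than the current largest edge
    can be the one that the reduction map removes again."""
    return sorted(sub + (e,) for e in host_edges if not sub or e > sub[-1])


node = ((1, 4),)
print(children_by_filtering(node, HOST_EDGES))
# [((1, 4), (2, 3)), ((1, 4), (3, 4))]
print(children_explicit(node, HOST_EDGES))
# [((1, 4), (2, 3)), ((1, 4), (3, 4))]
```

The explicit rule never builds a candidate that would be rejected, which is the practical benefit of deriving an extension rule.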
Pattern extension rule
Traversing search tree from root
• Depth-first traversal for its memory efficiency
[Figure: depth-first traversal of the search tree, starting from the root]
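A depth-first traversal of such a tree can be sketched with a recursive generator. The setting is again simplified (subgraphs are edge subsets of one hypothetical host graph, children are obtained by adding an edge larger than all current ones), and `enumerate_subgraphs` is a name assumed here.

```python
def enumerate_subgraphs(host_edges):
    """Yield every edge subset of the host graph exactly once, depth first.

    Only the current root-to-node path is kept in memory; no table of
    already visited subgraphs is needed.
    """
    edges = sorted(host_edges)

    def visit(sub, start):
        yield sub
        # Children: extend with any edge that is larger than all edges in sub.
        for k in range(start, len(edges)):
            yield from visit(sub + (edges[k],), k + 1)

    yield from visit((), 0)


for sub in enumerate_subgraphs([(1, 2), (1, 3), (2, 3)]):
    print(sub)
# ()
# ((1, 2),)
# ((1, 2), (1, 3))
# ((1, 2), (1, 3), (2, 3))
# ((1, 2), (2, 3))
# ((1, 3),)
# ((1, 3), (2, 3))
# ((2, 3),)
```

With m edges the traversal visits 2^m nodes, but the stack never holds more than m + 1 frames.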
Frequent subgraph mining
• Basic idea: find all possible extensions of a current pattern in the graph database, and extend the pattern
• Occurrence list $L_G(g)$
  ★ Record every occurrence of a pattern g in the graph database G
  ★ Calculate the support of a pattern g from the occurrence list
• Use the anti-monotonicity of the support for pruning
Motif extraction from protein 3D structures
• Pairs of homologous proteins in thermophilic and mesophilic organisms
• Construct a linear graph from a protein
  ★ Use vertex order from N- to C-terminal
  ★ Assign vertex labels from {1,...,6}
  ★ Draw an edge between pairs of amino acid residues whose distance is within 5 Å
• # of data: 742, avg. # of vertices: 371, avg. # of edges: 496
• Rank the enumerated patterns by statistical significance (p-value)
  ★ Association to thermophilic/mesophilic labels
  ★ Fisher exact test
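The ranking step can be illustrated with a small Fisher exact test on a 2×2 table (pattern present/absent versus thermophilic/mesophilic). The slides do not say whether a one-sided or two-sided test was used; the sketch below assumes a one-sided test for enrichment in the first class, and the function name `fisher_one_sided` and all counts are made up.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    a: thermophilic proteins containing the pattern
    b: mesophilic proteins containing the pattern
    c: thermophilic proteins without the pattern
    d: mesophilic proteins without the pattern
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1)
    )
    return numerator / comb(n, col1)


# Hypothetical counts for three patterns: (a, b, c, d).
counts = {
    "pattern_1": (3, 0, 0, 3),
    "pattern_2": (2, 1, 1, 2),
    "pattern_3": (0, 3, 3, 0),
}
ranking = sorted(counts, key=lambda name: fisher_one_sided(*counts[name]))
print(ranking)  # ['pattern_1', 'pattern_2', 'pattern_3']
```

Patterns with the smallest p-value come first, so a threshold on the p-value simply truncates the ranked list.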
Runtime comparison
• Compared to gSpan
• Made gapped linear graphs and ran gSpan on them
• LGM is faster than gSpan
• Minimum support = 10
• 103 patterns whose p-value < 0.001
• Thermophilic (TATA), Mesophilic (pol II)
  ★ Share the function as DNA-binding proteins, but the thermostability is different
Mapping motifs in 3D structure
• Thermophilic (TATA), Mesophilic (pol II)
Summary
• Efficient subgraph mining algorithm for linear graphs
• Search tree is defined by the reverse search principle
• Patterns include disconnected subgraphs
• Enumeration has polynomial delay
• Interesting patterns from proteins