11 Graph Pattern Mining
Graph Mining
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
Aspirin
Internet
Co-author network
Frequent subgraphs
[Figure: graph dataset (A), (B), (C) with frequent patterns (1), (2); minimum support is 2]
[Figure: Example (II), graph dataset]
Apriori-based approach
Pattern-growth approach
Level-wise search
Simulate Apriori for frequent pattern discovery
Apriori-based approach
  AGM/AcGM: Inokuchi, et al. (PKDD'00)
  FSG: Kuramochi and Karypis (ICDM'01)
  PATH#: Vanetik and Gudes (ICDM'02, ICDM'04)
  FFSM: Huan, et al. (ICDM'03)
Pattern-growth approach
  MoFa: Borgelt and Berthold (ICDM'02)
  gSpan: Yan and Han (ICDM'02)
  Gaston: Nijssen and Kok (KDD'04)
Search order: breadth vs. depth
Generation of candidate subgraphs: apriori vs. pattern growth
Elimination of duplicate subgraphs: passive vs. active
Support calculation: embedding store or not
Discover order of patterns: path -> tree -> graph
Apriori-Based Approach
[Figure: k-edge frequent graphs G1, G2, ..., Gn are joined (JOIN) into (k+1)-edge candidate graphs]
AGM (Inokuchi, et al. PKDD'00) generates new graphs with one more node
FSG (Kuramochi and Karypis ICDM'01) generates new graphs with one more edge
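The level-wise, edge-growing search can be sketched as follows. This is a minimal illustration, not FSG itself: it assumes vertex labels are unique within each graph, so a graph is just a set of labeled edges and subgraph containment reduces to a subset test. The names `is_connected` and `level_wise_mining` are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """Return True if the undirected edge set forms a single connected piece."""
    edges = list(edges)
    if not edges:
        return False
    reached = set(edges[0])
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            # An edge with exactly one reached endpoint extends the component.
            if (u in reached) != (v in reached):
                reached.update((u, v))
                changed = True
    return all(u in reached and v in reached for u, v in edges)

def level_wise_mining(database, min_support):
    """Map each frequent connected edge set to its support (assumes unique labels)."""
    def support(pattern):
        return sum(1 for graph in database if pattern <= graph)

    # Level 1: frequent single edges.
    all_edges = set().union(*database)
    level = {frozenset([e]) for e in all_edges}
    level = {p for p in level if support(p) >= min_support}
    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern)
        # Join two k-edge patterns that share k-1 edges into a (k+1)-edge candidate.
        candidates = set()
        for p, q in combinations(level, 2):
            joined = p | q
            if len(joined) == len(p) + 1 and is_connected(joined):
                candidates.add(joined)
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

database = [
    frozenset({("a", "b"), ("b", "c"), ("a", "c")}),
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b"), ("c", "d")}),
]
frequent = level_wise_mining(database, min_support=2)
```

Real graph datasets repeat labels, so the subset test must be replaced by subgraph isomorphism testing and duplicate candidates must be detected through a canonical form.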
Represent graphs using canonical adjacency matrix (CAM)
Join two CAMs or extend a CAM to generate a new graph
Store the embeddings of CAMs
  All of the embeddings of a pattern in the database
  Can derive the embeddings of newly generated CAMs
[Figure: a k-edge graph grown into graphs G1, G2, ..., Gn, with a duplicate graph marked]
Fast support calculation; also used in later algorithms such as FFSM and Gaston
Theorem: Completeness
DFS Code
Flatten a graph into a sequence using depth-first search
[Figure: example graph on vertices 0 to 4]
e0: (0,1), e1: (1,2), e2: (2,0), e3: (2,3), e4: (3,1), e5: (2,4)
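A small sketch of how one such edge sequence can be produced. It assumes an unlabeled, undirected graph given as an adjacency dictionary, and it emits only (from, to) discovery indices; a full gSpan DFS code also carries vertex and edge labels, which are omitted here. The function name `dfs_code` and the neighbor visiting order are illustrative choices.

```python
def dfs_code(adj, start):
    """Return the edge sequence of one depth-first traversal of `adj`.

    Each edge is a pair of discovery indices. When a vertex is discovered,
    its backward edges (to already discovered vertices) are emitted before
    its forward edges (to newly discovered vertices).
    """
    index = {start: 0}   # vertex -> discovery index
    emitted = set()      # undirected edges already written to the code
    code = []

    def visit(v, parent):
        # Backward edges, ordered by the discovery index of the target.
        back = sorted(
            (index[u], u) for u in adj[v]
            if u in index and u != parent and frozenset((u, v)) not in emitted
        )
        for target_index, u in back:
            emitted.add(frozenset((u, v)))
            code.append((index[v], target_index))
        # Forward edges: discover each unvisited neighbor and recurse.
        for w in sorted(adj[v]):
            if w not in index:
                index[w] = len(index)
                emitted.add(frozenset((v, w)))
                code.append((index[v], index[w]))
                visit(w, v)

    visit(start, None)
    return code

# The six edges listed above, as an undirected adjacency dictionary.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3, 4], 3: [1, 2], 4: [2]}
code = dfs_code(adj, 0)
```

A different start vertex or neighbor order gives a different DFS code for the same graph, which is why a minimum DFS code is needed as the canonical form.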
Let Z be the set of DFS codes of all graphs. Two DFS codes a and b have the relation $a \le b$ (DFS lexicographic order in Z) if and only if one of the following conditions is true. Let $a = (x_0, x_1, \ldots, x_m)$ and $b = (y_0, y_1, \ldots, y_n)$:
(i) there exists t, $0 \le t \le \min(m, n)$, such that $x_k = y_k$ for all $k < t$, and $x_t < y_t$;
(ii) $x_k = y_k$ for all k such that $0 \le k \le m$, and $m \le n$.
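The two conditions translate directly into a comparison routine. The sketch below is illustrative: `dfs_lex_leq` is a hypothetical name, and the default edge order (plain tuple comparison) is only a stand-in, since the actual order on individual edges in gSpan distinguishes forward and backward edges and their labels.

```python
def dfs_lex_leq(a, b, edge_lt=lambda x, y: x < y):
    """Return True if DFS code a <= DFS code b in lexicographic order.

    a and b are sequences of edges; edge_lt is the strict order on edges
    (an assumption here: plain tuple comparison by default).
    """
    for x, y in zip(a, b):
        if x != y:
            # Condition (i): the first differing position decides.
            return edge_lt(x, y)
    # Condition (ii): a is a prefix of b (including a == b).
    return len(a) <= len(b)

prefix = [(0, 1), (1, 2)]
longer = [(0, 1), (1, 2), (2, 0)]
other = [(0, 1), (1, 3)]
```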
Let a be the minimum DFS code of a graph G and b be a non-minimum DFS code of G. For any DFS code d generated from b by one right-most extension,
(i) d is not a minimum DFS code,
(ii) min_dfs(d) cannot be extended from b, and
(iii) min_dfs(d) is either less than a or can be extended from a.
THEOREM [ RIGHT-EXTENSION ] The DFS code of a graph extended from a Non-minimum DFS code is NOT MINIMUM
Store embeddings
Separate the discovery of different types of graphs
Simple structures are easier to mine and duplication detection is much simpler
subgraphs
Motivation: Handling graph pattern explosion problem
Closed frequent graph
  A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
If some of G's subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)
Lossless compression: still ensures that the mining result is complete
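The closedness test can be illustrated as a post-filter over already mined frequent patterns. This sketch assumes each pattern is a set of uniquely labeled edges, so "supergraph" reduces to a proper-superset test; `closed_patterns` is a hypothetical helper, and CloseGraph itself prunes during the search rather than filtering afterwards.

```python
def closed_patterns(support):
    """Keep only patterns with no proper superpattern of equal support.

    `support` maps frozenset-of-edges patterns to their support counts.
    A supergraph with the same support as a frequent pattern is itself
    frequent, so checking within the frequent set is sufficient.
    """
    closed = {}
    for pattern, count in support.items():
        has_equal_super = any(
            pattern < other and support[other] == count for other in support
        )
        if not has_equal_super:
            closed[pattern] = count
    return closed

support = {
    frozenset({("a", "b")}): 3,
    frozenset({("b", "c")}): 2,
    frozenset({("a", "b"), ("b", "c")}): 2,
}
closed = closed_patterns(support)
```

Here the single edge b-c is dropped because the two-edge pattern containing it has the same support, while a-b is kept because its support is strictly higher.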
CLOSEGRAPH
A Pattern-Growth Approach
[Figure: a k-edge graph G is grown into (k+1)-edge graphs G1, G2, ..., Gn]
Under what condition can we stop searching their children, i.e., terminate early?
Suppose G and G' are frequent, and G is a subgraph of G'. If G' also occurs wherever G occurs in the graphs of the dataset, then we need not grow G, since none of G's children will be closed except those of G'.
[Figure: graph 1 and graph 2, with pattern 1 and pattern 2, over vertices a, b, c, d]
Experimental Result
Compounds from NCI/NIH
Among these 43,905 compounds, 423 belong to class CA, 1081 to class CM, and the remaining are in class CI
Discovered Patterns
[Figure: discovered patterns at 20%, 10%, and 5%]
[Figure: two plots with minimum support on the horizontal axis (0.06 to 0.1)]
Potentially exponential number of frequent patterns
The worst-case complexity vs. the expected probability
Ex.: Suppose Walmart has $10^4$ kinds of products
  The chance to pick up one product: $10^{-4}$
  The chance to pick up a particular set of 10 products: $10^{-40}$
  What is the chance for this particular set of 10 products to be frequent $10^3$ times in $10^9$ transactions?
Have we solved the NP-hard problem of subgraph isomorphism testing?
  No. But real graphs in bio/chemistry are not so bad
  A carbon has only 4 bonds and most proteins in a network have distinct labels
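The arithmetic behind the Walmart example can be checked with exact fractions. The sketch assumes products are picked independently and uniformly and that each transaction offers one chance for the set to appear; the final line applies Markov's inequality, which is an added step that only gives an upper bound, not the exact probability.

```python
from fractions import Fraction

num_products = 10**4     # kinds of products
set_size = 10            # size of the particular product set
transactions = 10**9     # number of transactions
threshold = 10**3        # occurrences needed to be "frequent"

p_one = Fraction(1, num_products)          # chance of one given product
p_set = p_one ** set_size                  # chance of the particular 10-product set
expected_count = transactions * p_set      # expected occurrences of the set
# Markov's inequality: P(count >= threshold) <= E[count] / threshold.
markov_bound = expected_count / threshold
```

Under these assumptions the expected number of occurrences is $10^{-31}$, so the chance of seeing the set $10^3$ times is at most $10^{-34}$.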
Graph Search
Querying graph databases: Given a graph database and a query graph, find all the graphs containing this query graph
[Figure: a query graph and a graph database]
Scalability Issue
Sequential scan
  Disk I/Os
  Subgraph isomorphism testing
DayLight: Daylight.com (commercial)
GraphGrep: Dennis Shasha, et al. PODS'02
Indexing Strategy
[Figure: query graph (Q), a substructure of Q, and graph (G)]
If graph G contains query graph Q, G should contain any substructure of Q
Remarks
  Index substructures of a query graph to prune graphs that do not contain these substructures
Indexing Framework
Cost Analysis
[Equation: query response time, with terms $T_{index}$ (fetch index) and $C_q$]
Path-based Approach
[Figure: graph database with graphs (a), (b), (c)]
Paths
0-length: C, O, N, S
1-length: C-C, C-O, C-N, C-S, N-N, S-O
2-length: C-C-C, C-O-C, C-N-C, ...
3-length: ...
0-edge: $S_{\text{C}} = \{a, b, c\}$, $S_{\text{N}} = \{a, b, c\}$
1-edge: $S_{\text{C-C}} = \{a, b, c\}$, $S_{\text{C-N}} = \{a, b, c\}$
2-edge: $S_{\text{C-N-C}} = \{a, b\}$, ...
Intersecting these sets, we obtain the candidate answers, graph (a) and graph (b), which may contain this query graph.
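The intersection step can be sketched with an inverted index from paths to graph identifiers. The index below holds only the sets listed above; `candidate_graphs` is a hypothetical helper, and the survivors are only candidates that still require subgraph isomorphism verification.

```python
def candidate_graphs(query_paths, path_index):
    """Intersect the posting sets of every path found in the query graph."""
    postings = [path_index.get(path, set()) for path in query_paths]
    if not postings:
        return set()
    return set.intersection(*postings)

# Inverted index: path -> graphs in the database that contain it.
path_index = {
    "C": {"a", "b", "c"},
    "N": {"a", "b", "c"},
    "C-C": {"a", "b", "c"},
    "C-N": {"a", "b", "c"},
    "C-N-C": {"a", "b"},
}
query_paths = ["C", "N", "C-C", "C-N", "C-N-C"]
candidates = candidate_graphs(query_paths, path_index)
```

A path that is missing from the index yields an empty posting set, which empties the whole candidate set.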
Only graph (c) contains this query graph. However, if we only index paths: C, C-C, C-C-C, C-C-C-C, we cannot prune graph (a) and (b).
Identify frequent structures in the database: subgraphs that appear quite often in the graph database
Prune redundant frequent structures to maintain a small set of discriminative structures
Create an inverted index between discriminative frequent structures and graphs in the database
[Figure: structures ($>10^6$), of which frequent ($\sim 10^5$), of which discriminative ($\sim 10^3$)]
All graphs contain structures: C, C-C, C-C-C
Why bother indexing these redundant frequent structures?
Only index structures that provide more information than existing structures
Discriminative Structures
Pinpoint the most useful frequent structures
Given a set of structures $f_1, f_2, \ldots, f_n$ and a new structure $x$, we measure the extra indexing power provided by $x$,
$P(x \mid f_1, f_2, \ldots, f_n), \quad f_i \subset x.$
When P is small enough, x is a discriminative structure and should be included in the index
Index discriminative frequent structures only
Reduce the index size by an order of magnitude
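One natural way to estimate this quantity from the database is as the fraction of graphs containing all of the indexed substructures that also contain x. This estimate, the helper name `conditional_support`, and the cutoff `max_p` are assumptions for illustration.

```python
def conditional_support(x_graphs, substructure_graph_sets):
    """Estimate P(x | f_1, ..., f_n) from sets of graph identifiers.

    x_graphs: graphs containing x.
    substructure_graph_sets: one set per indexed substructure f_i of x.
    """
    common = set.intersection(*substructure_graph_sets)
    if not common:
        return 0.0
    return len(x_graphs & common) / len(common)

def is_discriminative(x_graphs, substructure_graph_sets, max_p=0.5):
    """x is worth indexing when it prunes well beyond its substructures."""
    return conditional_support(x_graphs, substructure_graph_sets) <= max_p

graphs_f1 = {1, 2, 3, 4}
graphs_f2 = {1, 2, 3, 4, 5}
redundant_x = {1, 2, 3, 4}    # occurs wherever f1 and f2 both occur
selective_x = {1}             # occurs in only one of those graphs
```

With these sets the redundant structure has P = 1.0 and adds nothing to the index, while the selective one has P = 0.25 and would be kept.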
We cannot index (or even search) all substructures
Large structures will likely be indexed well by their substructures
Size-increasing support threshold
[Figure: minimum support threshold plotted as support vs. size]
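A size-increasing support threshold can be sketched as a non-decreasing function of structure size. The linear shape and the parameter values below are hypothetical choices for illustration; the only property taken from the text is that larger structures must meet a higher minimum support.

```python
def size_increasing_threshold(size, max_size, max_support):
    """Hypothetical linear threshold: grows from near 0 up to max_support."""
    # Integer ceiling of max_support * size / max_size, never below 1.
    return max(1, (max_support * size + max_size - 1) // max_size)

def is_frequent(size, support, max_size=10, max_support=100):
    """A structure is kept only if it meets the threshold for its size."""
    return support >= size_increasing_threshold(size, max_size, max_support)

thresholds = [size_increasing_threshold(s, 10, 100) for s in range(1, 11)]
```

Small structures are indexed almost regardless of support, while large ones are indexed only when they are genuinely common.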
Experimental Setting
dataset
GraphGrep: maximum length (edges) of paths is set at 10
gIndex: maximum size (edges) of structures is set at 10
[Figure: # of features vs. database size (1k to 16k)]
[Figure: # of candidates vs. query size]
[Figure: "From scratch" vs. "Incremental" index, 2k to 10k]
Frequent structures are stable to database updating
Index can be built based on a small portion of a graph database, but be used for the whole database
Graph-structure-based indexing and similarity search
  Structure-based index methods, e.g., gIndex, S-path index
  Use index to search for similar graph/network structures
Substructure indexing
  Key problem: What substructures as indexing features?
  gIndex [Yan, Yu & Han, SIGMOD'04]: Find frequent and discriminative subgraphs (by graph-pattern mining)
  S-path [Zhao & Han, VLDB'10]: Use decomposed shortest paths as basic indexing features
Neighborhood signatures of vertices are built to maintain indexing features: effective search space pruning ability
Processing (Query Decomposition): Decompose the query graph into a set of indexed shortest paths in S-Path
[Figure: network, query, and the neighborhood signature of v3]
[Figure: graph database with (a) caffeine, (b) diurobromine, (c) viagra, and a query graph]
Method 2: Form a set of subgraph queries from the original query graph and use the exact subgraph search
Precise Search
Approximate Search
Query relaxation measure
The number of edges that can be relabeled or missed; the positions of these edges are not fixed
Easy to index
Fast
Rough measure
Graph (G2)
Substructure
Feature-Graph Matrix
Rows: features; columns: graphs in database

      G1  G2  G3  G4  G5
f1     0   1   0   1   1
f2     0   1   0   0   1
f3     1   0   1   1   1
f4     1   0   0   0   1
f5     0   0   1   1   0
Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold
If we allow k edges to be relaxed, J is the maximum number of features that can be hit by k edges; this is the maximum coverage problem
NP-complete
Greedy approximation ratio: $1 - (1 - 1/k)^k$
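The standard greedy heuristic for maximum coverage can be sketched as follows: repeatedly relax the edge that hits the most features not yet hit. The mapping `edge_to_features` and the function name are hypothetical. Greedy is known to reach at least a $1 - (1 - 1/k)^k$ fraction of the optimum, so dividing the greedy value by that factor gives an upper bound on J.

```python
def greedy_feature_hits(edge_to_features, k):
    """Pick up to k edges greedily; return (chosen edges, features hit)."""
    remaining = dict(edge_to_features)
    hit = set()
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Choose the edge covering the most features that are still unhit.
        best = max(remaining, key=lambda e: len(remaining[e] - hit))
        chosen.append(best)
        hit |= remaining.pop(best)
    return chosen, hit

def upper_bound_on_misses(greedy_hits, k):
    """Upper bound on the optimum derived from the greedy guarantee."""
    ratio = 1 - (1 - 1 / k) ** k
    return greedy_hits / ratio

# Hypothetical query: each edge lists the features whose embeddings use it.
edge_to_features = {
    "e1": {"f1", "f2"},
    "e2": {"f2", "f3"},
    "e3": {"f4"},
}
chosen, hit = greedy_feature_hits(edge_to_features, k=2)
```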
Step 1. Index Construction
Select small structures as features in a graph database, and build the feature-graph matrix between the features and the graphs in the database
Framework (cont.)
Step 2. Feature Miss Estimation
Determine the indexed features belonging to the query graph
Calculate the upper bound of the number of features that can be missed for an approximate matching, denoted by J
  On the query graph, not the graph database
Framework (cont.)
Step 3. Query Processing
Use the feature-graph matrix to calculate the difference in the number of features between graph G and query Q, $F_G - F_Q$
If $F_G - F_Q > J$, discard G. The remaining graphs constitute a candidate answer set
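The filtering step can be sketched with a small feature-graph matrix. The matrix values, the query feature counts, and the helper name `filter_candidates` are illustrative assumptions; the sketch counts a miss whenever a graph has fewer occurrences of a feature than the query does, and discards graphs whose total misses exceed J.

```python
def filter_candidates(feature_graph, query_counts, max_misses):
    """Return graphs whose feature misses relative to the query are <= J."""
    kept = []
    for graph, counts in feature_graph.items():
        # Misses: how many query feature occurrences the graph lacks.
        misses = sum(max(0, q - g) for q, g in zip(query_counts, counts))
        if misses <= max_misses:
            kept.append(graph)
    return kept

# Illustrative matrix: graph -> counts of features f1..f5.
feature_graph = {
    "G1": [0, 0, 1, 1, 0],
    "G2": [1, 1, 0, 0, 0],
    "G3": [0, 0, 1, 0, 1],
    "G4": [1, 0, 1, 0, 1],
    "G5": [1, 1, 1, 1, 0],
}
query_counts = [1, 1, 1, 1, 1]   # the query graph has 5 features
candidates = filter_candidates(feature_graph, query_counts, max_misses=2)
```

Graphs that survive the filter still have to be verified against the query, since the feature count is only a necessary condition.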
Performance Study
Database
  Chemical compounds of Anti-AIDS Drug from NCI/NIH, randomly select 10,000 compounds
Query
  Randomly select 30 graphs with 16 and 20 edges as query graphs
Competitive algorithms
  Grafil: Graph Filter, our algorithm
  Edge: use edges only
  All: use all the features
[Figure: # of candidates vs. edge relaxation]
References (1)
T. Asai, et al. Efficient substructure discovery from large semi-structured data, SDM'02
C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules, ICDM'02
M. Deshpande, M. Kuramochi, and G. Karypis. Frequent Sub-structure Based Approaches for Classifying Chemical Compounds, ICDM'03
M. Deshpande, M. Kuramochi, and G. Karypis. Automated approaches for classifying structures, BIOKDD'02
L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds, KDD'98
C. Faloutsos, K. McCurley, and A. Tomkins. Fast Discovery of Connection Subgraphs, KDD'04
L. Holder, D. Cook, and S. Djoko. Substructure discovery in the subdue system, KDD'94
J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. Mining spatial motifs from protein structure graphs, RECOMB'04
J. Huan, W. Wang, and J. Prins. Efficient mining of frequent subgraph in the presence of isomorphism, ICDM'03
H. Hu, X. Yan, Y. Huang, J. Han, and X. J. Zhou. Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery, ISMB'05
A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data, PKDD'00
C. James, D. Weininger, and J. Delany. Daylight Theory Manual Daylight Version 4.82. Daylight Chemical Information Systems, Inc., 2003
G. Jeh and J. Widom. Mining the Space of Graph Properties, KDD'04
M. Koyuturk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in biological networks, Bioinformatics, 20:I200-I207, 2004
References (2)
M. Kuramochi and G. Karypis. Frequent subgraph discovery, ICDM'01
M. Kuramochi and G. Karypis. GREW: A Scalable Frequent Subgraph Discovery Algorithm, ICDM'04
B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45-87, 1981
S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference, KDD'04
J. Prins, J. Yang, J. Huan, and W. Wang. Spin: Mining maximal frequent subgraphs from graph databases, KDD'04
D. Shasha, J. T.-L. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching, PODS'02
J. R. Ullmann. An algorithm for subgraph isomorphism, J. ACM, 23:31-42, 1976
N. Vanetik, E. Gudes, and S. E. Shimony. Computing frequent graph patterns from semistructured data, ICDM'02
C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. Scalable mining of large disk-based graph databases, KDD'04
T. Washio and H. Motoda. State of the art of graph-based data mining, SIGKDD Explorations, 5:59-68, 2003
X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining, ICDM'02
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns, KDD'03
X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure-based Approach, SIGMOD'04
X. Yan, X. J. Zhou, and J. Han. Mining Closed Relational Graphs with Connectivity Constraints, KDD'05
X. Yan, P. S. Yu, and J. Han. Substructure Similarity Search in Graph Databases, SIGMOD'05
X. Yan, F. Zhu, J. Han, and P. S. Yu. Searching Substructures with Superimposed Distance, ICDE'06
M. J. Zaki. Efficiently mining frequent trees in a forest, KDD'02
P. Zhao and J. Han. On Graph Query Optimization in Large Networks, VLDB'10