Introduction to Data Mining
Saeed Salem
Department of Computer Science
North Dakota State University
cs.ndsu.edu/~salem
These slides are based on slides by Jiawei Han and Micheline Kamber, and Mohammed Zaki
Sequence Databases & Sequential Patterns
Sequential Pattern Mining Algorithms
The Apriori Property of Sequential Patterns
Given support threshold min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
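A minimal sketch of support counting on the table above, to illustrate the Apriori property. It assumes each sequence is a list of item sets; the names `contains` and `support` are illustrative and do not come from the slides.

```python
def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both are lists of sets. Each pattern element must be a subset of a
    distinct sequence element, and the order must be preserved.
    """
    pos = 0
    for element in pattern:
        # Greedily take the earliest sequence element that covers this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db.values())


# The sequence database from the table above.
db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
MIN_SUP = 2

print(support(db, [{"h"}]))              # 1, so <h> is infrequent
print(support(db, [{"h"}, {"b"}]))       # 1, so the super-sequence <hb> is infrequent too
print(support(db, [{"b", "d"}, {"c"}]))  # 2, so <(bd)c> is frequent
```

Since <h> falls below min_sup, no sequence containing it needs to be counted at all; that is the pruning the Apriori property allows.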
GSP—Generalized Sequential Pattern Mining
candidate sequence
generate candidate length-(k+1) sequences from
be found
Major strength: Candidate pruning by Apriori
GSP: Finding Length-1 Sequential Patterns
GSP: A Complete Example
Candidate Generate-and-test: Drawbacks
The SPADE Algorithm
The SPADE Algorithm
tidlist
Mine frequent subsequences with support = 3
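A rough sketch of the vertical id-list idea behind SPADE. The slide's own example data is not reproduced here, so the tiny database below is hypothetical, as are the function names `build_idlists`, `temporal_join`, and `support`. Each item maps to a list of (sequence id, element id) pairs, and the id-list of <x y> is obtained by joining the id-lists of <x> and <y>.

```python
from collections import defaultdict


def build_idlists(db):
    """Vertical format: item -> list of (sid, eid) occurrences."""
    idlists = defaultdict(list)
    for sid, sequence in db.items():
        for eid, element in enumerate(sequence):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(left, right):
    """Id-list of <x y>: occurrences of y strictly after some x in the same sequence."""
    first_left = {}
    for sid, eid in left:
        first_left[sid] = min(eid, first_left.get(sid, eid))
    return sorted(
        (sid, eid)
        for sid, eid in right
        if sid in first_left and eid > first_left[sid]
    )


def support(idlist):
    """Support is the number of distinct sequence ids in the id-list."""
    return len({sid for sid, _ in idlist})


# Hypothetical database: sid -> list of elements (item sets).
db = {
    1: [{"a"}, {"b"}, {"a", "b"}],
    2: [{"b"}, {"a"}],
    3: [{"a", "b"}, {"b"}],
}

idlists = build_idlists(db)
ab = temporal_join(idlists["a"], idlists["b"])
print(ab)           # [(1, 1), (1, 2), (3, 1)]
print(support(ab))  # 2
```

The join works on the id-lists alone, so support can be computed without rescanning the original sequences.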
Pros and Cons of SPADE
Pros:
Allows for different search strategies: breadth-first and depth-first search
Cons:
A huge set of candidates could be generated
Mining long sequential patterns
Needs an exponential number of short candidates
A length-100 sequential pattern needs about $10^{30}$ candidate sequences!
$\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
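The count can be checked directly. This short sketch sums the binomial coefficients with exact integer arithmetic:

```python
from math import comb

# Number of non-empty sub-patterns of a length-100 pattern.
total = sum(comb(100, i) for i in range(1, 101))

assert total == 2**100 - 1
print(f"{total:.3e}")  # 1.268e+30
```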
Prefix and Suffix (Projection)
Mining Sequential Patterns by Prefix Projections
Finding Seq. Patterns with Prefix <a>
Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
Having prefix <aa>;
…
Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
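A minimal sketch of how the six length-2 patterns with prefix <a> can be recovered from the database above. The threshold min_sup = 2 is an assumption (this slide does not state it), and the helper names `project` and `length2_with_prefix` are illustrative. Items inside an element are compared alphabetically so that each itemset extension is generated only once.

```python
from collections import Counter

db = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}
MIN_SUP = 2  # assumed threshold, not given on this slide


def project(sequence, item):
    """Suffix of `sequence` with respect to prefix <item>, or None if absent.

    Returns (partial, rest): `partial` holds the items that follow `item`
    inside its own element, and `rest` is the list of later elements.
    """
    for i, element in enumerate(sequence):
        if item in element:
            partial = {x for x in element if x > item}
            return partial, sequence[i + 1:]
    return None


def length2_with_prefix(db, item, min_sup):
    """Frequent items x giving <item x> and <(item x)>, as two sorted lists."""
    seq_ext, set_ext = Counter(), Counter()
    for sequence in db.values():
        projected = project(sequence, item)
        if projected is None:
            continue
        partial, rest = projected
        # <item x>: x occurs in some element after the first `item`.
        seq_ext.update(set().union(*rest))
        # <(item x)>: x shares an element with `item`, either in the first
        # matching element or in any later element that also holds `item`.
        together = set(partial)
        for element in rest:
            if item in element:
                together |= {x for x in element if x > item}
        set_ext.update(together)
    frequent = lambda counts: sorted(x for x, c in counts.items() if c >= min_sup)
    return frequent(seq_ext), frequent(set_ext)


seq_items, set_items = length2_with_prefix(db, "a", MIN_SUP)
print(seq_items)  # ['a', 'b', 'c', 'd', 'f']  -> <aa>, <ab>, <ac>, <ad>, <af>
print(set_items)  # ['b']                      -> <(ab)>
```

Under the assumed threshold, the result matches the six patterns listed above. Each of them then becomes the prefix of its own, smaller projected database.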
Completeness of PrefixSpan
SDB
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>
<aa>-proj. db … <af>-proj. db
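The length-1 patterns at the root of this partition can be reproduced with one scan of SDB. The sketch below assumes min_sup = 2, which this slide does not state. Each item is counted at most once per sequence.

```python
from collections import Counter

sdb = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}
MIN_SUP = 2  # assumed threshold, not given on this slide

counts = Counter()
for sequence in sdb.values():
    # Union of all elements: every item contributes once per sequence.
    counts.update(set().union(*sequence))

frequent_items = sorted(x for x, c in counts.items() if c >= MIN_SUP)
print(frequent_items)  # ['a', 'b', 'c', 'd', 'e', 'f']
print(counts["g"])     # 1, so <g> is dropped
```

Every sequential pattern starts with one of these frequent items, so partitioning the search by prefix and recursing into each projected database misses no pattern.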
PrefixSpan: An Example
Efficiency of PrefixSpan
Speed-up by Pseudo-projection
Performance on Data Set C10T8S8I8
Constraint-Based Seq.-Pattern Mining
patterns
How to explore efficient mining with constraints? Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum (S) >
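A small sketch of why an anti-monotone constraint such as value_sum(S) < 150 helps pruning. The item values below are hypothetical and assumed non-negative; the helper names are illustrative. Once a pattern violates the constraint, every extension violates it as well, so that branch of the search can be abandoned.

```python
# Hypothetical, non-negative item values (not from the slides).
VALUE = {"a": 60, "b": 50, "c": 70, "d": 20}
LIMIT = 150


def value_sum(pattern):
    """Sum of item values over all elements of a pattern (list of sets)."""
    return sum(VALUE[item] for element in pattern for item in element)


def satisfies(pattern):
    return value_sum(pattern) < LIMIT


def surviving_extensions(pattern):
    """Extend `pattern` by one single-item element; keep only valid results."""
    candidates = (pattern + [{item}] for item in sorted(VALUE))
    return [c for c in candidates if satisfies(c)]


ok = [{"a"}, {"b"}]           # value_sum = 110
bad = [{"a"}, {"b"}, {"c"}]   # value_sum = 180

print(surviving_extensions(ok))   # [[{'a'}, {'b'}, {'d'}]]
print(surviving_extensions(bad))  # []
```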
From Sequential Patterns to Structured Patterns
$\{\{i_1, i_2, \ldots, i_m\}, \ldots\}$
Seq. DB: Sequences of sets:
$\{\langle\{i_1, i_2\}, \ldots, \{i_m, i_n, i_k\}\rangle, \ldots\}$
Sets of Sequences:
$\{\{\langle i_1, i_2\rangle, \ldots, \langle i_m, i_n, i_k\rangle\}, \ldots\}$