Mining Sequential Patterns
Outline
What are sequence databases and sequential pattern mining
Methods for sequential pattern mining
Constraint-based sequential pattern mining
Periodicity analysis for sequence data
Sequence Databases
A sequence database consists of ordered elements or events
Transaction databases vs. sequence databases
A transaction database
TID   itemsets
10    a, b, d
20    a, c, d
30    a, d, e
40    b, e, f

A sequence database
SID   sequences
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
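To make the sequence database concrete, here is a minimal Python sketch of how the four sequences could be stored and how the support of a pattern could be counted. It assumes the usual containment rule (each pattern element must be a subset of a distinct, later element of the sequence) and single-letter items. The helper names `parse`, `contains`, and `support` are illustrative and do not come from the slides.

```python
def parse(text):
    """Turn 'a(abc)d' into [{'a'}, {'a','b','c'}, {'d'}] (single-letter items)."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (greedy left-to-right match)."""
    pos = 0
    for elem in pattern:
        # Advance to the next element of seq that covers this pattern element.
        while pos < len(seq) and not elem <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db.values())


# The sequence database from the table above.
db = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}

print(support(db, parse("(ab)c")))
print(support(db, parse("ab")))
```

Greedy matching is enough here because no gap or duration constraint is involved: taking the earliest possible match for each element never rules out a later one.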
Applications
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes), science & eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
Requirements for sequential pattern mining
A mining algorithm should:
find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient and scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints
[SDM03])
Pattern-Growth-based Approaches
FreeSpan
PrefixSpan
Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
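The candidate/support table above can be reproduced with a single pass over the data. The sketch below is a minimal illustration, assuming the five-sequence example database used in these slides (SIDs 10 to 50), min_sup = 2, and single-letter items. The function name `item_supports` is hypothetical.

```python
from collections import Counter

# The five-sequence example database used in these slides.
db = {
    10: "(bd)cb(ac)",
    20: "(bf)(ce)b(fg)",
    30: "(ah)(bf)abf",
    40: "(be)(ce)d",
    50: "a(bd)bcb(ade)",
}


def item_supports(db):
    """Count, for each item, the number of sequences that contain it."""
    counts = Counter()
    for text in db.values():
        # An item counts once per sequence, however often it occurs in it.
        counts.update(set(text) - set("()"))
    return counts


min_sup = 2
supports = item_supports(db)
frequent = sorted(item for item, sup in supports.items() if sup >= min_sup)

for item in sorted(supports):
    print(f"<{item}>  {supports[item]}")
print("frequent items:", frequent)
```

With min_sup = 2, the items g and h fall below the threshold, so only six items remain to build longer candidates from.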
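51 length-2 Candidates

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

       <a>    <b>      <c>      <d>      <e>      <f>
<a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                             <(cd)>   <(ce)>   <(cf)>
<d>                                      <(de)>   <(df)>
<e>                                               <(ef)>
<f>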
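The count of 51 length-2 candidates follows from the six frequent items: every ordered pair gives a two-element sequence (6 x 6 = 36), and every unordered pair of distinct items gives a one-element itemset (6 x 5 / 2 = 15). A short sketch, with the illustrative function name `length2_candidates`:

```python
from itertools import combinations, product


def length2_candidates(items):
    """Build all length-2 candidates from the frequent length-1 items."""
    # <xy>: x followed by y in a later element (x == y is allowed).
    sequences = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    # <(xy)>: x and y in the same element (order inside an element is irrelevant).
    itemsets = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return sequences, itemsets


sequences, itemsets = length2_candidates("abcdef")
print(len(sequences), len(itemsets), len(sequences) + len(itemsets))
```

Run on all eight original items instead, the same function would yield 64 + 28 = 92 candidates, which is the saving from pruning the two infrequent items first.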
min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

1st scan: 8 cand., 6 length-1 seq. pat. Sequences shown: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all. Sequences shown: <aa> <ab> ... <af> <ba> <bb> ... <ff> <(ab)> ... <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all. Sequences shown: <abb> <aab> <aba> <baa> <bab> ...
4th scan: 8 cand., 6 length-4 seq. pat. Sequences shown: <abba> <(bd)bc> ...
Bottlenecks
Scans the database multiple times
method
A sequence database is mapped to a large set of Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time
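The mapping to Item: <SID, EID> can be sketched as follows. This is a minimal illustration, not the full method: it assumes the EID is simply the 1-based position of an element within its sequence, and it uses two sequences from the earlier example database. The names `to_vertical` and `seq_join_support` are hypothetical.

```python
from collections import defaultdict

# Each sequence is a list of elements; each element is a string of single-letter items.
db = {
    10: ["a", "abc", "ac", "d", "cf"],   # <a(abc)(ac)d(cf)>
    20: ["ad", "c", "bc", "ae"],         # <(ad)c(bc)(ae)>
}


def to_vertical(db):
    """Map every item to its list of (SID, EID) occurrences."""
    vertical = defaultdict(list)
    for sid, seq in db.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                vertical[item].append((sid, eid))
    return vertical


def seq_join_support(vertical, x, y):
    """Support of <xy>: sequences where some x event precedes some y event."""
    sids = {
        sid_x
        for sid_x, eid_x in vertical[x]
        for sid_y, eid_y in vertical[y]
        if sid_x == sid_y and eid_x < eid_y
    }
    return len(sids)


vertical = to_vertical(db)
print(vertical["b"])
print(seq_join_support(vertical, "a", "b"))
```

Growing a pattern by one item then amounts to joining two such lists on SID and comparing EIDs, without rescanning the original sequences.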
J. Pei, J. Han, PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.
Prefix and Suffix (Prefix-Based Projection)
Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix
<a>      <(abc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
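A small sketch of prefix-based projection for one sequence may help. It assumes items inside an element are kept in alphabetical order, matches each prefix element at its first possible position, and writes the leftover items of the last matched element with a leading underscore. The function names `project` and `fmt` are illustrative only.

```python
def project(seq, prefix):
    """Return the suffix of seq with respect to a non-empty prefix, or None.

    seq and prefix are lists of elements; an element is a string of
    single-letter items in alphabetical order.
    """
    i = -1  # index of the element matched to the previous prefix element
    for elem in prefix:
        i = next(
            (k for k in range(i + 1, len(seq)) if set(elem) <= set(seq[k])),
            None,
        )
        if i is None:
            return None  # prefix does not occur in seq
    # Items of the matched element that come after the last prefix item.
    rest = "".join(x for x in seq[i] if x > max(prefix[-1]))
    partial = ["_" + rest] if rest else []
    return partial + seq[i + 1:]


def fmt(seq):
    """Render a list of elements in the <a(abc)d> notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"


s = ["a", "abc", "ac", "d", "cf"]  # <a(abc)(ac)d(cf)>
for prefix in (["a"], ["a", "a"], ["a", "b"]):
    print(fmt(prefix), "->", fmt(project(s, prefix)))
```

The underscore marks that the first element of the suffix shares its element with the last item of the prefix, which matters when the pattern is later extended inside that element.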
Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
...
The ones having prefix <f>
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets:
Having prefix <aa>;
...
Having prefix <af>
Completeness of PrefixSpan
SDB: the four-sequence database (SIDs 10, 20, 30, 40) shown above
Projected databases: <aa>-proj. db, ..., <af>-proj. db
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').
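The recursion in these steps can be sketched in Python. This is a simplified illustration under stated assumptions, not the authors' implementation: each projected database is kept as a list of (sequence index, element index) pairs instead of copied postfixes, items are single letters, and a pattern is extended either by a new element or by adding a larger item to its last element. The names `prefixspan`, `grow`, `parse`, and `fmt` are hypothetical.

```python
from collections import defaultdict


def parse(text):
    """Turn 'a(abc)d' into [{'a'}, {'a','b','c'}, {'d'}] (single-letter items)."""
    seq, i = [], 0
    while i < len(text):
        j = text.index(")", i) if text[i] == "(" else i
        seq.append(frozenset(text[i:j + 1].strip("()")))
        i = j + 1
    return seq


def fmt(pattern):
    parts = ("".join(sorted(e)) for e in pattern)
    return "<" + "".join(p if len(p) == 1 else f"({p})" for p in parts) + ">"


def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of frozensets."""
    patterns = {}

    def grow(prefix, proj):
        # proj holds (sid, i): element i of db[sid] is the earliest match of
        # the last element of prefix (i = -1 for the empty prefix).
        last = prefix[-1] if prefix else frozenset()
        s_ext = defaultdict(list)  # item starts a new element
        i_ext = defaultdict(list)  # item joins the last element
        for sid, i in proj:
            seq = db[sid]
            seen_s, seen_i = set(), set()
            for k in range(max(i, 0), len(seq)):
                if k > i:
                    for x in seq[k] - seen_s:
                        seen_s.add(x)
                        s_ext[x].append((sid, k))
                if prefix and last <= seq[k]:
                    for x in seq[k] - seen_i:
                        if x > max(last):
                            seen_i.add(x)
                            i_ext[x].append((sid, k))
        for ext, build in ((s_ext, lambda x: prefix + (frozenset(x),)),
                           (i_ext, lambda x: prefix[:-1] + (last | {x},))):
            for x in sorted(ext):
                if len(ext[x]) >= min_sup:  # each sid appears at most once
                    new = build(x)
                    patterns[new] = len(ext[x])
                    grow(new, ext[x])

    grow((), [(sid, -1) for sid in range(len(db))])
    return patterns


db = [parse(s) for s in
      ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]
result = {fmt(p): sup for p, sup in prefixspan(db, min_sup=2).items()}
print(sorted(p for p in result if p.startswith("<a") and len(p) == 4))
```

No candidate set is generated up front: only items that actually occur in a projected database are counted, and each recursive call works on a smaller projection.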
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
Can be improved by bi-level projections
Optimization in PrefixSpan
Single level vs. bi-level projection
Bi-level projection with 3-way checking may reduce the number and size of projected databases
Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear repeatedly in recursive projected databases
When the (projected) database can be held in main memory, use pointers to form projections
Each projection stores: a pointer to the sequence and the offset of the postfix

s = <a(abc)(ac)d(cf)>
Projection on <a>:    s|<a>: (pointer to s, 2)    postfix <(abc)(ac)d(cf)>
Projection on <ab>:   s|<ab>: (pointer to s, 4)   postfix <(_c)(ac)d(cf)>
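The pointer-plus-offset idea can be sketched as follows. The sketch assumes offsets are 1-based positions in the flattened item list of the sequence, which is consistent with the offsets 2 and 4 in the example for s. The names `flatten` and `postfix` are illustrative, and the "pointer" is simply an index into a Python list.

```python
def flatten(seq):
    """Store a sequence once as (item, element index) pairs."""
    return [(item, k) for k, element in enumerate(seq) for item in element]


def postfix(flat, offset):
    """Render the postfix that starts at a 1-based item offset, without copying
    anything at projection time."""
    start = offset - 1
    # The first element is partial if the previous item sits in the same element.
    partial = 0 < start < len(flat) and flat[start - 1][1] == flat[start][1]
    groups = []  # [element index, items] in order of appearance
    for item, k in flat[start:]:
        if groups and groups[-1][0] == k:
            groups[-1][1] += item
        else:
            groups.append([k, item])
    out = []
    for n, (_, items) in enumerate(groups):
        if n == 0 and partial:
            out.append(f"(_{items})")
        else:
            out.append(items if len(items) == 1 else f"({items})")
    return "<" + "".join(out) + ">"


sequences = [flatten(["a", "abc", "ac", "d", "cf"])]  # s = <a(abc)(ac)d(cf)>

# A pseudo-projected database is just a list of (pointer, offset) pairs.
proj_a = [(0, 2)]    # s|<a>
proj_ab = [(0, 4)]   # s|<ab>

for ptr, off in proj_a + proj_ab:
    print((ptr, off), postfix(sequences[ptr], off))
```

Each projection costs two integers per sequence, so repeated postfixes in deeper recursion levels are never duplicated in memory.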
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set fits in memory
Effect of Pseudo-Projection
Motivation: reduces the number of (redundant) patterns but attains the same expressive power
Using Backward Subpattern and Backward Superpattern pruning to prune redundant search space
Length constraint
Find patterns having at least 20 items
Aggregate constraint
Find patterns in which the average price of items is over $100
More Constraints
Regular expression constraint
Find patterns that start from the Yahoo homepage and search for hotels in the Washington DC area
Yahoo travel (WashingtonDC|DC) (hotel|motel|lodging)
Duration constraint
Find patterns within about 24 hours of a shooting
Gap constraint
Find purchasing patterns such that the gap between consecutive purchases is less than 1 month
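As an illustration of the gap constraint, the sketch below checks whether a timestamped purchase history contains a pattern with every consecutive gap below a limit. It assumes timestamps are day numbers, treats "1 month" as 30 days, and uses single items rather than itemsets. The function name `matches_with_gap` and the sample data are hypothetical.

```python
def matches_with_gap(events, pattern, max_gap):
    """True if pattern occurs in events with each consecutive gap < max_gap.

    events: list of (time, item) pairs sorted by time.
    pattern: list of items.
    """
    def search(start, p, prev_time):
        if p == len(pattern):
            return True
        for idx in range(start, len(events)):
            time, item = events[idx]
            if prev_time is not None and time - prev_time >= max_gap:
                break  # events are sorted, so later ones are even further away
            if item == pattern[p] and search(idx + 1, p + 1, time):
                return True
        return False

    return search(0, 0, None)


history = [
    (0, "computer"),
    (10, "CD-ROM"),
    (50, "computer"),
    (60, "CD-ROM"),
    (80, "digital camera"),
]
pattern = ["computer", "CD-ROM", "digital camera"]

print(matches_with_gap(history, pattern, max_gap=30))
print(matches_with_gap(history, pattern, max_gap=15))
```

Backtracking is needed here: the earliest match of "computer" and "CD-ROM" leaves a 70-day gap to the camera, while the later pair satisfies the constraint. A greedy earliest-match check, which is fine without constraints, would miss it.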
Sets of sequences: {{<i1, i2>, ..., <im, in, ik>}, ...}
Sets of trees: {t1, t2, ..., tn}
Sets of graphs (mining for frequent subgraphs): {g1, g2, ..., gn}
Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power consumption, etc.
Full periodicity
Every point in time contributes (precisely or approximately) to the periodicity
Methods
Full periodicity: FFT, other statistical analysis methods
Partial and cyclic periodicity: Variations of Apriori-like mining methods
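For full periodicity, a frequency-domain check can be sketched with the standard library alone. The code below uses a naive O(n^2) discrete Fourier transform as a stand-in for an FFT (same spectrum, slower to compute) and reports the period of the strongest frequency. The function name `dominant_period` and the synthetic series are assumptions for illustration.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest frequency, or None."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the constant component
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Synthetic series: 48 samples of a cycle that repeats every 12 samples.
series = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(dominant_period(series))
```

This only detects a periodicity to which every time point contributes; partial periodicity, where only some positions in the cycle are regular, is what the Apriori-like methods address.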
Summary
Sequential pattern mining is useful in many applications, e.g., weblog analysis, financial market prediction, and bioinformatics.
It is similar to frequent itemset mining, but takes ordering into account.
We have looked at different approaches that descend from two popular algorithms for mining frequent itemsets:
Candidate generation: AprioriAll and GSP
Pattern growth: FreeSpan and PrefixSpan