Introduction to Data Mining: Saeed Salem, Department of Computer Science, North Dakota State University, cs.ndsu.edu/~salem

The document provides an overview of sequential pattern mining. It discusses sequence databases and sequential patterns, as well as applications like customer shopping sequences. Several sequential pattern mining algorithms are described, including GSP, SPADE, and PrefixSpan. GSP uses an Apriori-based candidate generation approach, while SPADE adopts a vertical format and PrefixSpan employs prefix projections to efficiently mine patterns. The document highlights challenges in mining long sequential patterns from a large search space and number of candidate sequences.

Uploaded by

Ibrahim AlJarah
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction to Data Mining

Saeed Salem
Department of Computer Science
North Dakota State University
cs.ndsu.edu/~salem

These slides are based on slides by Jiawei Han and Micheline Kamber, and Mohammed Zaki
Sequence Databases & Sequential Patterns

 Transaction databases and time-series databases vs. sequence databases
 Frequent patterns vs. (frequent) sequential patterns
 Applications of sequential pattern mining
 Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital
camera, within 3 months.
 Medical treatments, natural disasters (e.g., earthquakes),
science & eng. processes, stocks and markets, etc.
 Telephone calling patterns, Weblog click streams
 DNA sequences and gene structures
2
What Is Sequential Pattern Mining?

 Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>

A sequence database

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
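The two claims on this slide can be checked with a short script. This is a minimal sketch, not part of the original slides: the encoding (a sequence as a list of strings, one string per element, one letter per item) and the names `is_subsequence` and `support` are assumptions made for illustration.

```python
# Each sequence is a list of elements; each element is a string of single-letter items.
db = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def is_subsequence(pattern, sequence):
    """Greedy left-to-right test: every pattern element must be a subset of
    some sequence element, and the matched elements must keep their order."""
    pos = 0
    for element in pattern:
        # Advance to the earliest sequence element that contains this pattern element.
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern element must match a strictly later element
    return True

def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())

print(is_subsequence(["a", "bc", "d", "c"], db[10]))  # True
print(support(["ab", "c"], db))                       # 2
```

Only sequences 10 and 30 contain an element with both a and b followed by a later c, so <(ab)c> reaches min_sup = 2.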
3
Challenges on Sequential Pattern Mining

 A huge number of possible sequential patterns are hidden in databases
 A mining algorithm should
 find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
 be highly efficient, scalable, involving only a small
number of database scans
 be able to incorporate various kinds of user-specific
constraints

4
Sequential Pattern Mining Algorithms

 Concept introduction and an initial Apriori-like algorithm


 Agrawal & Srikant. Mining sequential patterns, ICDE’95
 Apriori-based method: GSP (Generalized Sequential Patterns: Srikant
& Agrawal @ EDBT’96)
 Pattern-growth methods: FreeSpan & PrefixSpan (Han et
al.@KDD’00; Pei, et al.@ICDE’01)
 Vertical format-based mining: SPADE (Zaki@Machine Learning’01)
 Constraint-based sequential pattern mining (SPIRIT: Garofalakis,
Rastogi, Shim@VLDB’99; Pei, Han, Wang @ CIKM’02)
 Mining closed sequential patterns: CloSpan (Yan, Han & Afshar
@SDM’03)

5
The Apriori Property of Sequential Patterns

 A basic property: Apriori (Agrawal & Srikant’94)
 If a sequence S is not frequent
 Then none of the super-sequences of S is frequent
 E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent too

Given support threshold min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

6
GSP—Generalized Sequential Pattern Mining

 GSP (Generalized Sequential Pattern) mining algorithm
 proposed by Srikant and Agrawal, EDBT’96
 Outline of the method
 Initially, every item in DB is a candidate of length-1
 for each level (i.e., sequences of length-k) do
 scan database to collect support count for each candidate sequence
 generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
 repeat until no frequent sequence or no candidate can be found
 Major strength: Candidate pruning by Apriori
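The level-wise loop in the outline can be sketched as follows. This is a simplified illustration, not GSP's actual join-based candidate generation: here a length-(k+1) candidate is formed by extending a frequent length-k sequence with one frequent item, either as a new element or inside the last element. The function name `levelwise_mine` and the list-of-strings encoding are assumptions.

```python
def is_subsequence(pattern, sequence):
    """Greedy containment test; elements are strings of single-letter items."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def levelwise_mine(database, min_sup):
    """Generate-and-test, one database scan per level (length = number of items)."""
    items = sorted({i for seq in database for el in seq for i in el})
    candidates = [(i,) for i in items]  # every item is a length-1 candidate
    frequent = {}
    while candidates:
        # Scan the database once to count the support of every candidate.
        counts = {c: sum(is_subsequence(c, s) for s in database) for c in candidates}
        survivors = [c for c in candidates if counts[c] >= min_sup]
        frequent.update((c, counts[c]) for c in survivors)
        # Apriori: only frequent sequences and frequent items are extended.
        freq_items = [i for i in items if (i,) in frequent]
        candidates = []
        for c in survivors:
            for i in freq_items:
                candidates.append(c + (i,))                    # item starts a new element
                if i > c[-1][-1]:
                    candidates.append(c[:-1] + (c[-1] + i,))  # item joins the last element
    return frequent

db = [
    ["bd", "c", "b", "ac"],
    ["bf", "ce", "b", "fg"],
    ["ah", "bf", "a", "b", "f"],
    ["be", "ce", "d"],
    ["a", "bd", "b", "c", "b", "ade"],
]
patterns = levelwise_mine(db, min_sup=2)
print(patterns[("bd", "c", "b", "a")])  # support of <(bd)cba>
```

Because every frequent sequence has a frequent parent (drop its last item), extending only the survivors of each level still reaches every frequent sequence.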
7
GSP: Finding Length-1 Sequential Patterns

 Examine GSP using an example
 Initial candidates: all singleton sequences
 <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
 Scan database once, count support for candidates

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
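The single scan for length-1 candidates amounts to counting, for each item, the number of sequences in which it occurs at least once. A minimal sketch, assuming the same list-of-strings encoding as an illustration (the helper name `count_length1` is hypothetical):

```python
from collections import Counter

db = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

def count_length1(database):
    """One pass over the database; an item counts once per sequence."""
    counts = Counter()
    for seq in database.values():
        counts.update(set("".join(seq)))  # distinct items of this sequence
    return counts

counts = count_length1(db)
frequent = sorted(item for item, n in counts.items() if n >= 2)
print(dict(sorted(counts.items())))
print(frequent)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

With min_sup = 2, <g> and <h> are dropped and never appear in a longer candidate.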

8
GSP: A Complete Example

9
Candidate Generate-and-test: Drawbacks

 A huge set of candidate sequences is generated.
 Especially 2-item candidate sequences.
 Multiple scans of the database are needed.
 The length of each candidate grows by one at each database scan.
 Inefficient for mining long sequential patterns.
 A long pattern grows from short patterns.
 The number of short patterns is exponential in the length of the mined patterns.
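To see why 2-item candidates dominate, count them directly. With n frequent items there are n × n candidates of the form <xy> (two elements, including <xx>) and C(n, 2) candidates of the form <(xy)> (one element with two distinct items). The sketch below is an illustration only; the function name is an assumption.

```python
from math import comb

def length2_candidates(n):
    """Number of length-2 candidate sequences over n items:
    n*n ordered pairs <xy> plus C(n, 2) single-element pairs <(xy)>."""
    return n * n + comb(n, 2)

print(length2_candidates(6))  # 51: from the 6 frequent items of the GSP example
print(length2_candidates(8))  # 92: if all 8 items were extended without Apriori pruning
```

Pruning <g> and <h> after the first scan already removes 41 of the 92 possible length-2 candidates.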
10
The SPADE Algorithm

 SPADE (Sequential PAttern Discovery using Equivalence classes), developed by Zaki, 2001
 A vertical format sequential pattern mining method
 A sequence database is mapped to a large set of entries of the form
 Item: <SID, EID> (sequence ID, event ID)
 Sequential pattern mining is performed by
 growing the subsequences (patterns) one item at a time by Apriori candidate generation
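The vertical format and the way supports are obtained from it can be sketched as below. This is an illustration under assumptions: it uses the four-sequence database from the earlier slides (not the database of the SPADE figure), handles only joins between single items, and the function names are hypothetical rather than taken from Zaki's paper.

```python
from collections import defaultdict

db = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}

def vertical_format(database):
    """Map every item to its id-list of (SID, EID) pairs; EID is the 1-based element position."""
    idlists = defaultdict(list)
    for sid, seq in database.items():
        for eid, element in enumerate(seq, start=1):
            for item in element:
                idlists[item].append((sid, eid))
    return idlists

def temporal_join(left, right):
    """Id-list of <x y>: occurrences of y that come after some x in the same sequence."""
    first_left = {}
    for sid, eid in left:
        first_left[sid] = min(eid, first_left.get(sid, eid))
    return [(sid, eid) for sid, eid in right
            if sid in first_left and first_left[sid] < eid]

def equality_join(left, right):
    """Id-list of <(xy)>: x and y occur in the same element of the same sequence."""
    return sorted(set(left) & set(right))

def support(idlist):
    """Support is the number of distinct SIDs in an id-list."""
    return len({sid for sid, _ in idlist})

idlists = vertical_format(db)
print(support(temporal_join(idlists["a"], idlists["b"])))  # 4, support of <ab>
print(support(equality_join(idlists["a"], idlists["b"])))  # 2, support of <(ab)>
```

After the id-lists are built, supports come from joining lists, so the original horizontal database does not have to be rescanned for each candidate.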

11
The SPADE Algorithm

12
The SPADE Algorithm

[Figure: tidlist example; mine frequent subsequences with support = 3]

13
Pros and Cons of SPADE

Pros:
 Allows for different search strategies: Breadth-first and
depth-first search

Cons:
 A huge set of candidates could be generated
 Mining long sequential patterns
 Needs an exponential number of short candidates
 A length-100 sequential pattern needs about 10^30 candidate sequences:

$$\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$$

14
Prefix and Suffix (Projection)

 <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
 Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
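The projection table can be reproduced with a small helper. This is a minimal sketch under assumptions: the prefix is non-empty, items inside an element are listed alphabetically, and the names `project` and `show` are invented for illustration.

```python
def project(sequence, prefix):
    """Suffix of `sequence` with respect to `prefix`, using the earliest match.
    Returns None when the prefix does not occur in the sequence."""
    pos = 0
    for element in prefix:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    matched = sequence[pos - 1]  # element that matched the last prefix element
    # Items of that element that come after the prefix's last item form the "(_...)" part.
    rest = "".join(i for i in matched if i > max(prefix[-1]))
    head = ["_" + rest] if rest else []
    return head + sequence[pos:]

def show(seq):
    """Render a list of elements in the slide's angle-bracket notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in seq) + ">"

s = ["a", "abc", "ac", "d", "cf"]
print(show(project(s, ["a"])))       # <(abc)(ac)d(cf)>
print(show(project(s, ["a", "a"])))  # <(_bc)(ac)d(cf)>
print(show(project(s, ["a", "b"])))  # <(_c)(ac)d(cf)>
```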

15
Mining Sequential Patterns by Prefix Projections

 Step 1: find length-1 sequential patterns
 <a>, <b>, <c>, <d>, <e>, <f>
 Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
 The ones having prefix <a>;
 The ones having prefix <b>;
 …
 The ones having prefix <f>

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

16
Finding Seq. Patterns with Prefix <a>

 Only need to consider projections w.r.t. <a>
 <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
 Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
 Further partition into 6 subsets
 Having prefix <aa>;
 …
 Having prefix <af>

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

17
Completeness of PrefixSpan
SDB

SID   Sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

 Having prefix <a>: <a>-projected database
 <(abc)(ac)d(cf)>
 <(_d)c(bc)(ae)>
 <(_b)(df)cb>
 <(_f)cbc>
 Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
 Having prefix <aa>: <aa>-proj. db
 …
 Having prefix <af>: <af>-proj. db
 Having prefix <b>: <b>-projected database, …
 Having prefix <c>, …, <f>: …
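The recursive partitioning above can be sketched as a short pattern-growth miner. This is an illustration in the spirit of PrefixSpan, not the published implementation: each projected database is kept as a list of (sequence, element index) pointers instead of copied suffixes, and all names are assumptions.

```python
from collections import defaultdict

def prefixspan(database, min_sup):
    """Return {pattern: support}; a pattern is a tuple of element strings."""
    results = {}

    def grow(pattern, pointers):
        # pointers: (sequence, index of the element matching the pattern's last element)
        s_ext, i_ext = defaultdict(list), defaultdict(list)
        last = set(pattern[-1]) if pattern else set()
        for seq, e in pointers:
            seen_s, seen_i = set(), set()
            for j in range(max(e, 0), len(seq)):
                # Item x can join the last element if some element holds last + x.
                if pattern and last <= set(seq[j]):
                    for x in seq[j]:
                        if x > max(last) and x not in seen_i:
                            seen_i.add(x)
                            i_ext[x].append((seq, j))
                # Item x can start a new element if it occurs strictly later.
                if j > e:
                    for x in seq[j]:
                        if x not in seen_s:
                            seen_s.add(x)
                            s_ext[x].append((seq, j))
        for x, ptrs in sorted(s_ext.items()):
            if len(ptrs) >= min_sup:          # one pointer per sequence, so len = support
                new = pattern + (x,)
                results[new] = len(ptrs)
                grow(new, ptrs)
        for x, ptrs in sorted(i_ext.items()):
            if len(ptrs) >= min_sup:
                new = pattern[:-1] + (pattern[-1] + x,)
                results[new] = len(ptrs)
                grow(new, ptrs)

    grow((), [(seq, -1) for seq in database])
    return results

db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
results = prefixspan(db, min_sup=2)
length2_a = sorted(p for p in results
                   if sum(len(el) for el in p) == 2 and p[0][0] == "a")
print(length2_a)
```

The printed list contains the six length-2 patterns with prefix <a> from the slide: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>. No candidate is generated that does not occur in some projected sequence.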

18
PrefixSpan: An Example

19
Efficiency of PrefixSpan

 No candidate sequence needs to be generated


 Projected databases keep shrinking
 Major cost of PrefixSpan: constructing projected
databases
 Can be improved by pseudo-projections

20
Speed-up by Pseudo-projection

 Major cost of PrefixSpan: projection


 Postfixes of sequences often appear
repeatedly in recursive projected databases
 When (projected) database can be held in main
memory, use pointers to form projections
 Pointer to the sequence s = <a(abc)(ac)d(cf)>
 Offset of the postfix

s|<a>:  (pointer to s, 2), i.e. <(abc)(ac)d(cf)>
s|<ab>: (pointer to s, 4), i.e. <(_c)(ac)d(cf)>
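The pointer-and-offset idea can be sketched as follows. The sequence is stored once; each projection is only a (sequence id, offset) pair, where the offset is the 1-based position of the first item of the postfix, as on the slide. The flattened layout and the helper `render` are assumptions for illustration, and `render` exists only to display what a pointer denotes.

```python
db = {10: ["a", "abc", "ac", "d", "cf"]}

# Flatten each sequence once into (element index, item) pairs; projections index into this.
flat = {sid: [(k, item) for k, el in enumerate(seq) for item in el]
        for sid, seq in db.items()}

def render(sid, offset):
    """Display the postfix denoted by (sid, offset); nothing is copied into a projected db.
    Assumes 1 <= offset <= number of items in the sequence."""
    items = flat[sid][offset - 1:]
    # The postfix starts mid-element if the previous item belongs to the same element.
    partial = offset > 1 and flat[sid][offset - 2][0] == items[0][0]
    groups = {}
    for k, item in items:
        groups[k] = groups.get(k, "") + item
    out = []
    for n, el in enumerate(groups.values()):
        if n == 0 and partial:
            el = "_" + el
        out.append(el if len(el) == 1 else f"({el})")
    return "<" + "".join(out) + ">"

proj_a = (10, 2)   # s|<a>
proj_ab = (10, 4)  # s|<ab>
print(render(*proj_a))   # <(abc)(ac)d(cf)>
print(render(*proj_ab))  # <(_c)(ac)d(cf)>
```

Both projections refer to the same stored sequence, so recursion depth adds pairs of integers rather than copies of postfixes.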
21
Pseudo-Projection vs. Physical Projection

 Pseudo-projection avoids physically copying postfixes


 Efficient in running time and space when database
can be held in main memory
 However, it is not efficient when database cannot fit
in main memory
 Disk-based random accessing is very costly
 Suggested Approach:
 Integration of physical and pseudo-projection
 Swapping to pseudo-projection when the data set
fits in memory

22
Performance on Data Set C10T8S8I8

23
Constraint-Based Seq.-Pattern Mining

 Constraint-based sequential pattern mining
 Constraints: User-specified, for focused mining of desired patterns
 How to explore efficient mining with constraints? Optimization
 Classification of constraints
 Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
 Monotone: E.g., count(S) > 5, S ⊇ {PC, digital_camera}
 Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office, MS/Money}
 Convertible: E.g., value_avg(S) < 25, profit_sum(S) > 160, max(S)/avg(S) < 2, median(S) - min(S) > 5
 Inconvertible: E.g., avg(S) - median(S) = 0

24
From Sequential Patterns to Structured Patterns

 Sets, sequences, trees, graphs, and other structures
 Transaction DB: Sets of items
 {{i_1, i_2, …, i_m}, …}
 Seq. DB: Sequences of sets:
 {<{i_1, i_2}, …, {i_m, i_n, i_k}>, …}
 Sets of Sequences:
 {{<i_1, i_2>, …, <i_m, i_n, i_k>}, …}
 Sets of trees: {t_1, t_2, …, t_n}
 Sets of graphs (mining for frequent subgraphs):
 {g_1, g_2, …, g_n}
 Mining structured patterns in XML documents, bio-chemical structures, etc.
25
Ref: Mining Sequential Patterns
 R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT’96.
 H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI’97.
 M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 2001.
 J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE’01 (TKDE’04).
 J. Pei, J. Han, and W. Wang. Constraint-Based Sequential Pattern Mining in Large Databases. CIKM’02.
 X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM’03.
 J. Wang and J. Han. BIDE: Efficient Mining of Frequent Closed Sequences. ICDE’04.
 H. Cheng, X. Yan, and J. Han. IncSpan: Incremental Mining of Sequential Patterns in Large Database. KDD’04.
 J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE’99.
 J. Yang, W. Wang, and P. S. Yu. Mining asynchronous periodic patterns in time series data. KDD’00.

26
