Mining Sequential Patterns

Sequential pattern mining finds frequent subsequences in sequence databases. GSP is an early Apriori-based algorithm that scans the database multiple times to generate candidate sequences. PrefixSpan is a pattern-growth method that divides the search space based on frequent prefixes and projects databases to mine subsequences efficiently with fewer scans. It provides a complete set of sequential patterns by recursively projecting databases based on frequent prefixes.

Uploaded by

rprema79

Sequential Pattern Mining

Outline
What is sequence database and sequential pattern mining
Methods for sequential pattern mining
Constraint-based sequential pattern mining
Periodicity analysis for sequence data

Sequence Databases
A sequence database consists of ordered elements or events
Transaction databases vs. sequence databases

A transaction database
TID   itemsets
10    a, b, d
20    a, c, d
30    a, d, e
40    b, e, f

A sequence database
SID   sequences
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Applications
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera, within 3 months.

Medical treatments, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures

Subsequence vs. super sequence

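A sequence is an ordered list of events, denoted $\langle e_1 e_2 \ldots e_l \rangle$
Given two sequences $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ and $\beta = \langle b_1 b_2 \ldots b_m \rangle$, $\alpha$ is called a subsequence of $\beta$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \ldots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$
$\beta$ is a super sequence of $\alpha$
E.g. $\alpha$ = <(ab), d> and $\beta$ = <(abc), (de)>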
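The subsequence definition can be sketched in a few lines of Python. This is only an illustration: the helper names `parse` and `is_subsequence` are not from the slides, and the parser assumes single-letter items written in the slide notation without spaces or commas.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """True if every element of alpha is a subset of a distinct element of beta,
    with the matched elements of beta appearing in the same order."""
    j = 0
    for a in alpha:
        # Greedy matching: take the earliest remaining element of beta containing a.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


print(is_subsequence(parse("(ab)d"), parse("(abc)(de)")))   # True
print(is_subsequence(parse("(abc)(de)"), parse("(ab)d")))   # False
```

Greedy earliest matching is enough here because choosing the earliest matching element never rules out a later match.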

What Is Sequential Pattern Mining?

Given a set of sequences and support threshold, find the complete set of frequent subsequences

A sequence: <(ef)(ab)(df)cb>

A sequence database
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

An element may contain a set of items. Items within an element are unordered and we list them alphabetically.

<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
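As a hedged illustration of the support threshold, the sketch below counts how many sequences of the example database contain a pattern. The helper names (`parse`, `is_subsequence`, `support`) are illustrative and the parser assumes single-letter items.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """Greedy containment test of alpha in beta (element-wise subset, order kept)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


SDB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
       30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}
database = [parse(s) for s in SDB.values()]

min_sup = 2
print(support(parse("(ab)c"), database))              # 2 (sequences 10 and 30)
print(support(parse("(ab)c"), database) >= min_sup)   # True: a sequential pattern
```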

Challenges on Sequential Pattern Mining

A huge number of possible sequential patterns are hidden in databases
A mining algorithm should

find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small number of database scans
be able to incorporate various kinds of user-specific constraints

Studies on Sequential Pattern Mining

Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns [ICDE'95]

Apriori-based method: GSP (Generalized Sequential Patterns: Srikant & Agrawal [EDBT'96])
Pattern-growth methods: FreeSpan & PrefixSpan (Han et al. [KDD'00]; Pei et al. [ICDE'01])
Vertical format-based mining: SPADE (Zaki [Machine Learning'00])
Constraint-based sequential pattern mining (SPIRIT: Garofalakis, Rastogi, Shim [VLDB'99]; Pei, Han, Wang [CIKM'02])
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar [SDM'03])

Methods for sequential pattern mining

Apriori-based Approaches
GSP
SPADE

Pattern-Growth-based Approaches
FreeSpan
PrefixSpan

The Apriori Property of Sequential Patterns

A basic property: Apriori (Agrawal & Srikant'94)

If a sequence S is not frequent, then none of the super-sequences of S is frequent

E.g., <hb> is infrequent, and so are <hab> and <(ah)b>

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2

GSP: Generalized Sequential Pattern Mining

GSP (Generalized Sequential Pattern) mining algorithm
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
    scan database to collect support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can be found

Major strength: Candidate pruning by Apriori

Finding Length-1 Sequential Patterns

Initial candidates:
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>

Scan database once, count support for candidates

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
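The first scan can be reproduced with a short sketch. It assumes single-letter items and counts each item at most once per sequence; the names `DB` and `items_of` are illustrative only.

```python
from collections import Counter

DB = {10: "(bd)cb(ac)", 20: "(bf)(ce)b(fg)", 30: "(ah)(bf)abf",
      40: "(be)(ce)d", 50: "a(bd)bcb(ade)"}


def items_of(sequence):
    """Distinct items of a sequence written in slide notation."""
    return {ch for ch in sequence if ch.isalpha()}


counts = Counter()
for sequence in DB.values():
    counts.update(items_of(sequence))   # an item counts once per sequence

min_sup = 2
frequent = sorted(item for item, c in counts.items() if c >= min_sup)
print(dict(sorted(counts.items())))
# {'a': 3, 'b': 5, 'c': 4, 'd': 3, 'e': 3, 'f': 2, 'g': 1, 'h': 1}
print(frequent)   # ['a', 'b', 'c', 'd', 'e', 'f']
```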

Generating Length-2 Candidates

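51 length-2 Candidates

36 candidates of the form <xy>:

       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

15 candidates of the form <(xy)>:

<(ab)> <(ac)> <(ad)> <(ae)> <(af)>
<(bc)> <(bd)> <(be)> <(bf)>
<(cd)> <(ce)> <(cf)>
<(de)> <(df)>
<(ef)>

Without Apriori property, 8*8 + 8*7/2 = 92 candidates

Apriori prunes 44.57% of candidates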
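The candidate counts can be checked with a small sketch. The function name `length2_candidates` is illustrative; it enumerates the two candidate forms, <xy> for every ordered pair of items and <(xy)> for every unordered pair of distinct items.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates built from the given items."""
    as_sequence = [f"<{x}{y}>" for x, y in product(items, repeat=2)]   # <xy>
    as_element = [f"<({x}{y})>" for x, y in combinations(items, 2)]    # <(xy)>
    return as_sequence + as_element


with_apriori = length2_candidates("abcdef")       # only the 6 frequent items
without_apriori = length2_candidates("abcdefgh")  # all 8 items

pruned = 1 - len(with_apriori) / len(without_apriori)
print(len(with_apriori), len(without_apriori), f"{pruned:.2%}")
# 51 92 44.57%
```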

Finding Length-2 Sequential Patterns

Scan database one more time, collect support count for each length-2 candidate
There are 19 length-2 candidates which pass the minimum support threshold
They are length-2 sequential patterns

The GSP Mining Process

1st scan: 8 cand. 6 length-1 seq. pat.: <a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand. 19 length-2 seq. pat. 10 cand. not in DB at all: <aa> <ab> ... <af> <ba> <bb> ... <ff> <(ab)> ... <(ef)>
3rd scan: 46 cand. 19 length-3 seq. pat. 20 cand. not in DB at all: <abb> <aab> <aba> <baa> <bab> ...
4th scan: 8 cand. 6 length-4 seq. pat.: <abba> <(bd)bc> ...
5th scan: 1 cand. 1 length-5 seq. pat.: <(bd)cba>

Legend: Cand. cannot pass sup. threshold; Cand. not in DB at all

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

The GSP Algorithm

Take sequences in form of <x> as length-1 candidates
Scan database once, find F1, the set of length-1 sequential patterns
Let k=1; while Fk is not empty do
    Form Ck+1, the set of length-(k+1) candidates from Fk;
    If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
    Let k=k+1;
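The level-wise loop above can be sketched as follows. This is a simplified generate-and-test miner, not the GSP join itself: as an assumption made for brevity, candidates of length k+1 are formed by extending each frequent length-k pattern with one frequent item, either as a new element or inside the last element. Every frequent pattern has a frequent prefix one item shorter, so the result is still complete, but fewer candidates are pruned than in GSP. All function names are illustrative.

```python
def parse(text):
    """Parse slide notation such as '(bd)cb(ac)' into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def contains(sequence, pattern):
    """Greedy test: pattern (tuple of item tuples) is a subsequence of sequence."""
    j = 0
    for element in pattern:
        while j < len(sequence) and not sequence[j].issuperset(element):
            j += 1
        if j == len(sequence):
            return False
        j += 1
    return True


def extend(pattern, items):
    """Candidates one item longer that keep `pattern` as their prefix."""
    for x in items:
        yield pattern + ((x,),)                         # x starts a new element
        if x > pattern[-1][-1]:
            yield pattern[:-1] + (pattern[-1] + (x,),)  # x joins the last element


def level_wise(database, min_sup):
    items = sorted({x for seq in database for el in seq for x in el})
    candidates = [((x,),) for x in items]   # every item is a length-1 candidate
    frequent, frequent_items = {}, None
    while candidates:
        level = {}
        for cand in candidates:             # conceptually one database scan per level
            sup = sum(contains(seq, cand) for seq in database)
            if sup >= min_sup:
                level[cand] = sup
        if frequent_items is None:
            frequent_items = [p[0][0] for p in level]
        frequent.update(level)
        candidates = [c for p in level for c in extend(p, frequent_items)]
    return frequent


DB = ["(bd)cb(ac)", "(bf)(ce)b(fg)", "(ah)(bf)abf", "(be)(ce)d", "a(bd)bcb(ade)"]
patterns = level_wise([parse(s) for s in DB], min_sup=2)
print(patterns[(("b", "d"), ("c",), ("b",), ("a",))])   # 2, i.e. <(bd)cba>
```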

The GSP Algorithm

Benefits from the Apriori pruning
Reduces search space

Bottlenecks
Scans the database multiple times

Generates a huge set of candidate sequences

There is a need for more efficient mining methods

The SPADE Algorithm

SPADE (Sequential PAttern Discovery using Equivalent Class), developed by Zaki (2001)
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of Item: <SID, EID>
Sequential pattern mining is performed by growing the subsequences (patterns) one item at a time by Apriori candidate generation
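The vertical format can be illustrated with a rough sketch that maps each item to its list of (SID, EID) pairs and joins two such lists to obtain the support of a 2-sequence. This shows only the id-list idea; it is not Zaki's full algorithm (no equivalence classes), and the names `vertical`, `temporal_join` and `support` are illustrative.

```python
from collections import defaultdict

SDB = {10: "a(abc)(ac)d(cf)", 20: "(ad)c(bc)(ae)",
       30: "(ef)(ab)(df)cb", 40: "eg(af)cbc"}


def parse(text):
    """Parse slide notation into a list of item sets (single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def vertical(sdb):
    """Map each item to its id-list of (SID, EID) pairs; EIDs start at 1."""
    idlists = defaultdict(list)
    for sid, text in sdb.items():
        for eid, element in enumerate(parse(text), start=1):
            for item in sorted(element):
                idlists[item].append((sid, eid))
    return idlists


def temporal_join(first, second):
    """Id-list of <x y>: occurrences of y that follow some x in the same sequence."""
    earliest = {}
    for sid, eid in first:
        earliest[sid] = min(eid, earliest.get(sid, eid))
    return [(sid, eid) for sid, eid in second
            if sid in earliest and eid > earliest[sid]]


def support(idlist):
    """Number of distinct sequences in an id-list."""
    return len({sid for sid, _ in idlist})


idlists = vertical(SDB)
ab = temporal_join(idlists["a"], idlists["b"])
print(ab)            # [(10, 2), (20, 3), (30, 5), (40, 5)]
print(support(ab))   # 4
```

Support is computed from the id-lists alone, without rescanning the original sequences.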

The SPADE Algorithm

Bottlenecks of Candidate Generate-and-test

A huge set of candidates generated.
Especially 2-item candidate sequences.

Multiple scans of database in mining.
The length of each candidate grows by one at each database scan.

Inefficient for mining long sequential patterns.
A long pattern grows up from short patterns
An exponential number of short candidates

PrefixSpan (Prefix-Projected Sequential Pattern Growth)

PrefixSpan
Projection-based
But only prefix-based projection: fewer projections and quickly shrinking sequences

J. Pei, J. Han. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01.

Prefix and Suffix (Projection)

<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>

Prefix   Suffix (Prefix-Based Projection)
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
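The prefix/suffix table can be reproduced with a minimal sketch. It assumes single-letter items, projects on the first (leftmost) occurrence of the prefix, and uses illustrative helper names (`parse`, `project`, `show`).

```python
def parse(text):
    """Parse slide notation into a list of item sets (single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def project(sequence, prefix):
    """Suffix of `sequence` w.r.t. the first occurrence of `prefix`.

    Returns (partial, rest): `partial` holds the leftover items of the element
    that matched the last prefix element (the "(_...)" part), `rest` the
    elements after it. Returns None if the prefix does not occur."""
    j = -1
    for element in prefix:
        j += 1
        while j < len(sequence) and not element <= sequence[j]:
            j += 1
        if j == len(sequence):
            return None
    last_item = max(prefix[-1])
    partial = sorted(x for x in sequence[j] if x > last_item)
    return partial, sequence[j + 1:]


def show(suffix):
    """Format a projected suffix in the slide notation."""
    partial, rest = suffix
    parts = [f"(_{''.join(partial)})"] if partial else []
    for element in rest:
        items = "".join(sorted(element))
        parts.append(items if len(items) == 1 else f"({items})")
    return "<" + "".join(parts) + ">"


s = parse("a(abc)(ac)d(cf)")
for prefix in ("a", "aa", "ab"):
    print(prefix, show(project(s, parse(prefix))))
# a <(abc)(ac)d(cf)>
# aa <(_bc)(ac)d(cf)>
# ab <(_c)(ac)d(cf)>
```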

Mining Sequential Patterns by Prefix Projections

Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>

Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
...
The ones having prefix <f>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Finding Seq. Patterns with Prefix <a>

Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets
Having prefix <aa>;
...
Having prefix <af>

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Completeness of PrefixSpan
SDB
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a>: <a>-projected database
<(abc)(ac)d(cf)>
<(_d)c(bc)(ae)>
<(_b)(df)cb>
<(_f)cbc>

    Length-2 sequential patterns: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
    Having prefix <aa>: <aa>-proj. db
    ...
    Having prefix <af>: <af>-proj. db

Having prefix <b>: <b>-projected database
Having prefix <c>, ..., <f>

The Algorithm of PrefixSpan

Input: A sequence database S, and the minimum support threshold min_sup
Output: The complete set of sequential patterns
Method: Call PrefixSpan(<>, 0, S)
Subroutine PrefixSpan($\alpha$, l, $S|_\alpha$)
Parameters:
$\alpha$: a sequential pattern; l: the length of $\alpha$; $S|_\alpha$: the $\alpha$-projected database if $\alpha \ne$ <>; otherwise, the sequence database S

The Algorithm of PrefixSpan(2)

Method
1. Scan $S|_\alpha$ once, find the set of frequent items b such that:
   a) b can be assembled to the last element of $\alpha$ to form a sequential pattern; or
   b) <b> can be appended to $\alpha$ to form a sequential pattern.
2. For each frequent item b, append it to $\alpha$ to form a sequential pattern $\alpha'$, and output $\alpha'$;
3. For each $\alpha'$, construct the $\alpha'$-projected database $S|_{\alpha'}$, and call PrefixSpan($\alpha'$, l+1, $S|_{\alpha'}$).
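The three steps can be sketched in Python as below. This is an illustrative implementation under stated assumptions, not the authors' code: items are single letters, and each projected database is kept as a list of (sequence index, element index) pairs marking where the earliest match of the prefix ends, instead of copying suffixes. The names `prefixspan` and `mine` are chosen here for illustration.

```python
def parse(text):
    """Parse slide notation into a list of item sets (single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def prefixspan(database, min_sup):
    """Return {pattern: support}; a pattern is a tuple of sorted item tuples."""
    results = {}

    def mine(pattern, projected):
        # projected: (sequence index, element index where the earliest match of
        # `pattern` ends); the element index is -1 for the empty pattern.
        last = frozenset(pattern[-1]) if pattern else None
        found = {}   # (kind, item) -> {sequence index: earliest end position}
        for sid, pos in projected:
            seq = database[sid]
            for j in range(max(pos, 0), len(seq)):
                if j > pos:                          # case (b): <b> is appended
                    for x in seq[j]:
                        found.setdefault(("s", x), {}).setdefault(sid, j)
                if last is not None and last <= seq[j]:
                    for x in seq[j]:                 # case (a): b joins the last element
                        if x > max(last):
                            found.setdefault(("i", x), {}).setdefault(sid, j)
        for kind, x in sorted(found):
            ends = found[(kind, x)]
            if len(ends) < min_sup:
                continue
            if kind == "s":
                new = pattern + ((x,),)
            else:
                new = pattern[:-1] + (pattern[-1] + (x,),)
            results[new] = len(ends)                 # step 2: output the pattern
            mine(new, list(ends.items()))            # step 3: recurse on its projection

    mine((), [(sid, -1) for sid in range(len(database))])
    return results


SDB = ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]
patterns = prefixspan([parse(s) for s in SDB], min_sup=2)
print(sorted(p for p in patterns if p[0][0] == "a" and sum(map(len, p)) == 2))
```

With min_sup = 2 the printed length-2 patterns with prefix <a> are <(ab)>, <aa>, <ab>, <ac>, <ad> and <af>, matching the slide.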

Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases
Can be improved by bi-level projections

Optimization in PrefixSpan
Single level vs. bi-level projection
Bi-level projection with 3-way checking may reduce the number and size of projected databases

Physical projection vs. pseudo-projection

Pseudo-projection may reduce the effort of projection when the projected database fits in main memory

Parallel projection vs. partition projection

Partition projection may avoid the blowup of disk space

Scaling Up by Bi-Level Projection

Partition search space based on length-2 sequential patterns
Only form projected databases and pursue recursive mining over bi-level projected databases

Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear repeatedly in recursive projected databases

When (projected) database can be held in main memory, use pointers to form projections
Pointer to the sequence
Offset of the postfix

s = <a(abc)(ac)d(cf)>
<a>:   s|<a>:  (pointer to s, 2)   <(abc)(ac)d(cf)>
<ab>:  s|<ab>: (pointer to s, 4)   <(_c)(ac)d(cf)>
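The offsets in this example can be reproduced with a small sketch. It assumes offsets are 1-based positions in the flattened item list, which is consistent with the values 2 and 4 above, and it handles only the case where the new item starts a new element. The names `flatten` and `advance` are illustrative.

```python
def flatten(text):
    """Return [(item, element number)] for a sequence in slide notation."""
    flat, element, i = [], 0, 0
    while i < len(text):
        element += 1
        if text[i] == "(":
            j = text.index(")", i)
            flat += [(x, element) for x in text[i + 1:j]]
            i = j + 1
        else:
            flat.append((text[i], element))
            i += 1
    return flat


def advance(flat, offset, item):
    """New postfix offset after appending <item> as a new element.

    `offset` is the current 1-based postfix offset (1 means the whole
    sequence). Nothing is copied: the projection is just this integer.
    Returns None if the item does not occur in a later element."""
    prev_element = flat[offset - 2][1] if offset > 1 else 0
    for pos in range(offset, len(flat) + 1):
        x, element = flat[pos - 1]
        if x == item and element > prev_element:
            return pos + 1
    return None


s = flatten("a(abc)(ac)d(cf)")
after_a = advance(s, 1, "a")
after_ab = advance(s, after_a, "b")
print(after_a, after_ab)   # 2 4
```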

Pseudo-Projection vs. Physical Projection

Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when database can be held in main memory

However, it is not efficient when database cannot fit in main memory

Disk-based random accessing is very costly

Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set fits in memory

Performance on Data Set C10T8S8I8

Performance on Data Set Gazelle

Effect of Pseudo-Projection

CloSpan: Mining Closed Sequential Patterns

A closed sequential pattern $s$: there exists no superpattern $s'$ such that $s' \sqsupset s$, and $s'$ and $s$ have the same support

Motivation: reduces the number of (redundant) patterns but attains the same expressive power
Using Backward Subpattern and Backward Superpattern pruning to prune redundant search space
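The definition of a closed pattern can be illustrated by filtering a set of patterns after mining. This is a naive post-processing sketch, not CloSpan's search-space pruning, and closedness is judged only relative to the patterns supplied. The supports below were counted on the four-sequence example database used earlier; the helper names are illustrative.

```python
def parse(text):
    """Parse slide notation into a list of item sets (single-letter items)."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements


def is_subsequence(alpha, beta):
    """Greedy containment test of alpha in beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def closed_patterns(supports):
    """Keep patterns that have no super-pattern with the same support."""
    parsed = {p: parse(p) for p in supports}
    closed = {}
    for p, sup in supports.items():
        absorbed = any(q != p and supports[q] == sup
                       and is_subsequence(parsed[p], parsed[q])
                       for q in supports)
        if not absorbed:
            closed[p] = sup
    return closed


supports = {"a": 4, "ab": 4, "(ab)": 2, "(ab)c": 2}
print(closed_patterns(supports))   # {'ab': 4, '(ab)c': 2}
```

Here <a> is absorbed by <ab> and <(ab)> by <(ab)c>, since each pair has equal support.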

CloSpan: Performance Comparison with PrefixSpan

Constraints for Seq.-Pattern Mining

Item constraint
Find web log patterns only about online-bookstores

Length constraint
Find patterns having at least 20 items

Super pattern constraint

Find super patterns of "PC → digital camera"

Aggregate constraint
Find patterns in which the average price of items is over $100
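These four constraint types can be written as simple predicates over a mined pattern. The sketch below applies them as post-filters, which is only one way to use constraints and not the constraint-pushing approach of the cited papers. The prices and item names are hypothetical values for illustration.

```python
def seq(*elements):
    """Build a pattern (list of item sets) from lists of item names."""
    return [frozenset(e) for e in elements]


def items(pattern):
    return [x for element in pattern for x in element]


def is_subsequence(alpha, beta):
    """Greedy containment test of alpha in beta."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True


def item_constraint(pattern, allowed):
    """Every item of the pattern belongs to the allowed set."""
    return set(items(pattern)) <= set(allowed)


def length_constraint(pattern, min_items):
    """The pattern has at least `min_items` items."""
    return len(items(pattern)) >= min_items


def super_pattern_constraint(pattern, core):
    """The pattern is a super pattern of `core`."""
    return is_subsequence(core, pattern)


def average_price_constraint(pattern, prices, threshold):
    """The average price of the pattern's items exceeds `threshold`."""
    values = [prices[x] for x in items(pattern)]
    return sum(values) / len(values) > threshold


# Hypothetical prices, for illustration only.
PRICES = {"PC": 900, "CD-ROM": 20, "digital camera": 300}
pattern = seq(["PC"], ["CD-ROM"], ["digital camera"])

print(super_pattern_constraint(pattern, seq(["PC"], ["digital camera"])))  # True
print(average_price_constraint(pattern, PRICES, 100))                      # True
print(length_constraint(pattern, 20))                                      # False
```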

More Constraints
Regular expression constraint
Find patterns starting from Yahoo homepage, search for hotels in Washington DC area
Yahoo travel (WashingtonDC | DC) (hotel | motel | lodging)

Duration constraint
Find patterns within about 24 hours of a shooting

Gap constraint
Find purchasing patterns such that the gap between consecutive purchases is less than 1 month
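A gap constraint changes how containment is tested: the earliest match of each item is no longer always the right one, so the check needs backtracking. The sketch below assumes single-item events with a day number as timestamp; the function name and the purchase data are hypothetical.

```python
def occurs_with_max_gap(events, pattern, max_gap):
    """True if `pattern` (list of items) occurs in `events` with every gap
    between consecutive matched events strictly smaller than `max_gap`.

    `events` is a list of (day, item) pairs sorted by day."""
    def search(start, k, prev_day):
        if k == len(pattern):
            return True
        for i in range(start, len(events)):
            day, item = events[i]
            if prev_day is not None and day - prev_day >= max_gap:
                break   # events are sorted: later ones are even further away
            if item == pattern[k] and search(i + 1, k + 1, day):
                return True
        return False

    return search(0, 0, None)


# Hypothetical purchase history: (day, item).
purchases = [(0, "PC"), (10, "CD-ROM"), (50, "PC"),
             (70, "CD-ROM"), (95, "digital camera")]
wanted = ["PC", "CD-ROM", "digital camera"]

print(occurs_with_max_gap(purchases, wanted, max_gap=30))   # True
print(occurs_with_max_gap(purchases, wanted, max_gap=20))   # False
```

With a 30-day limit the match that starts at day 0 fails (day 10 to day 95 is too long), but the one using days 50, 70 and 95 succeeds, which is why the search must backtrack.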

From Sequential Patterns to Structured Patterns

Sets, sequences, trees, graphs, and other structures
Transaction DB: Sets of items
{{i1, i2, …, im}, …}

Seq. DB: Sequences of sets:

{<{i1, i2}, …, {im, in, ik}>, …}

Sets of Sequences:
{{<i1, i2>, …, <im, in, ik>}, …}

Sets of trees: {t1, t2, …, tn}
Sets of graphs (mining for frequent subgraphs):
{g1, g2, …, gn}

Mining structured patterns in XML documents,

Episodes and Episode Pattern Mining

Other methods for specifying the kinds of patterns
Serial episodes: A → B
Parallel episodes: A & B

Regular expressions: (A | B)C*(D → E)

Methods for episode pattern mining

Variations of Apriori-like algorithms, e.g., GSP

Database projection-based pattern growth

Similar to the frequent pattern growth without candidate generation

Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power consumption, etc.

Full periodicity
Every point in time contributes (precisely or approximately) to the periodicity

Partial periodicity: A more general notion
Only some segments contribute to the periodicity
Jim reads NY Times 7:00-7:30 am every week day

Cyclic association rules
Associations which form cycles

Methods
Full periodicity: FFT, other statistical analysis methods
Partial and cyclic periodicity: Variations of Apriori-like mining methods
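For full periodicity, the frequency-domain idea can be sketched with the standard library only. As an assumption for self-containment, a naive discrete Fourier transform stands in for an FFT (same result, slower), and the hourly power readings are synthetic data with a 24-hour cycle.

```python
import cmath
import math


def dominant_period(signal):
    """Period (in samples) of the strongest frequency component, or None
    if the signal is constant."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]   # remove the constant component
    best_k, best_magnitude = 0, 1e-9
    for k in range(1, n // 2 + 1):
        coefficient = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                          for t, x in enumerate(centered))
        if abs(coefficient) > best_magnitude:
            best_k, best_magnitude = k, abs(coefficient)
    return n / best_k if best_k else None


# Synthetic hourly power consumption over 10 days with a daily cycle.
readings = [50 + 20 * math.sin(2 * math.pi * t / 24) for t in range(240)]
print(dominant_period(readings))   # 24.0
```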

Summary
Sequential pattern mining is useful in many applications, e.g. weblog analysis, financial market prediction, bioinformatics, etc.
It is similar to frequent itemset mining, but with consideration of ordering.
We have looked at different approaches that are descended from two popular algorithms in mining frequent itemsets
Candidate generation: AprioriAll and GSP
Pattern growth: FreeSpan and PrefixSpan
