Unit - II
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
What Is Frequent Pattern Analysis?
◼ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
◼ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
◼ Motivation: Finding inherent regularities in data
◼ What products were often purchased together?— Beer and diapers?!
◼ What are the subsequent purchases after buying a PC?
◼ What kinds of DNA are sensitive to this new drug?
◼ Can we automatically classify web documents?
◼ Applications
◼ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
◼ Broad applications
Basic Concepts: Frequent Patterns
Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

◼ Find all the rules X → Y with minimum support and confidence
◼ support, s: probability that a transaction contains X ∪ Y
◼ confidence, c: conditional probability that a transaction having X also contains Y
(Venn diagram: customers buying beer, customers buying diaper, and customers buying both)
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
◼ Association rules: (many more!)
◼ Beer → Diaper (60%, 100%)
◼ Diaper → Beer (60%, 75%)
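A minimal Python sketch (function and variable names are ours, not from the slides) that reproduces the two rules above from the transaction table:

transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # Conditional probability that a transaction having X also contains Y
    return support(X | Y) / support(X)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> Beer -> Diaper, 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> Diaper -> Beer, 75%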
Closed Patterns and Max-Patterns
◼ A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
◼ Solution: Mine closed patterns and max-patterns instead
◼ An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
◼ An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
◼ Closed pattern is a lossless compression of freq. patterns
◼ Reducing the # of patterns and rules
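As a small sketch (the helper names are ours), closedness and maximality can be checked directly from the frequent itemsets and their supports; with the earlier example, {Beer, Diaper}:3 is closed and maximal, {Diaper}:4 is closed but not maximal, and {Beer}:3 is neither:

freq = {
    frozenset({"Beer"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Beer", "Diaper"}): 3,
}

def is_closed(X):
    # Closed: no frequent proper superset of X has the same support as X
    return all(not (X < Y and c == freq[X]) for Y, c in freq.items())

def is_max(X):
    # Max-pattern: no frequent proper superset of X exists at all
    return all(not X < Y for Y in freq)

for X in freq:
    print(sorted(X), freq[X], is_closed(X), is_max(X))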
Computational Complexity of Frequent Itemset Mining
◼ How many itemsets are potentially to be generated in the worst case?
◼ The number of frequent itemsets to be generated is sensitive to the minsup threshold
◼ When minsup is low, there exist potentially an exponential number of frequent itemsets
◼ The worst case: M^N, where M: # distinct items, and N: max length of transactions
◼ The worst-case complexity vs. the expected probability
◼ Ex. Suppose Walmart has 10^4 kinds of products
◼ The chance to pick up one product: 10^-4
◼ The chance to pick up a particular set of 10 products: ~10^-40
◼ What is the chance that this particular set of 10 products is frequent 10^3 times in 10^9 transactions?
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
The Downward Closure Property and Scalable Mining Methods
◼ The downward closure property of frequent patterns
◼ Any subset of a frequent itemset must be frequent
◼ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
◼ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
◼ Scalable mining methods: three major approaches
◼ Apriori (Agrawal & Srikant @VLDB’94)
◼ Frequent pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
◼ Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

C1 (after 1st scan):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (sup ≥ 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (sup ≥ 2):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
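A direct, unoptimized Python rendering of this pseudo-code (names are ours); run on the example TDB above, it reproduces L1 and L2 and additionally finds {B, C, E}:2 on the third pass:

from itertools import combinations

def apriori(transactions, min_support):
    # L1: count single items in one scan, keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(Lk), 1
    while Lk:
        # C(k+1): join Lk with itself; downward closure prunes any
        # candidate that has an infrequent k-subset
        candidates = {
            s1 | s2
            for s1 in Lk for s2 in Lk
            if len(s1 | s2) == k + 1
            and all(frozenset(sub) in Lk for sub in combinations(s1 | s2, k))
        }
        # One database scan: count the candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, c in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(s), c)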
Implementation of Apriori

Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
◼ Bottlenecks of the Apriori approach
◼ Breadth-first (i.e., level-wise) search
◼ Candidate generation and test
◼ Often generates a huge number of candidates
◼ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
◼ Depth-first search
◼ Avoid explicit candidate generation
◼ Major philosophy: Grow long patterns from short ones using local
frequent items only
◼ “abc” is a frequent pattern
◼ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
◼ “d” is a local frequent item in DB|abc → abcd is a frequent pattern
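This growth-by-projection philosophy can be sketched without the FP-tree data structure itself; the recursive miner below (a simplified illustration, not the actual FP-growth implementation) projects the database on each locally frequent item and extends the current pattern:

from collections import Counter

def pattern_growth(db, min_support, prefix=()):
    # Count items that are locally frequent in the current projected DB
    counts = Counter(item for t in db for item in t)
    # Fix an order over the local items so each pattern is produced once
    local = sorted(i for i, c in counts.items() if c >= min_support)
    for idx, item in enumerate(local):
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # DB|pattern: transactions containing item, restricted to items
        # that come later in the fixed order
        later = set(local[idx + 1:])
        projected = [t & later for t in db if item in t]
        yield from pattern_growth(projected, min_support, pattern)

# "abc" is frequent, and "d" is locally frequent in DB|abc -> "abcd" is frequent
db = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "b", "c"}, {"b", "e"}]
for pat, sup in pattern_growth(db, 2):
    print("".join(pat), sup)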
Construct FP-tree from a Transaction Database
◼ Frequent patterns can be partitioned into subsets according to the f-list:
◼ Patterns containing p
◼ …
◼ Pattern f
Find Patterns Having P From P-conditional Database

Header Table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2 ── p:2
│   │       └── b:1 ── m:1
│   └── b:1
└── c:1
    └── b:1 ── p:1

Conditional pattern bases:
item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1

Recursing on m: the m-conditional FP-tree is {} → f:3 → c:3 → a:3; the cond. pattern base of “cm” is (f:3), so the cm-conditional FP-tree is {} → f:3, and the am-conditional FP-tree is {} → f:3 → c:3.
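The FP-tree above can be built and queried with a compact sketch (class and helper names are ours; the input is the classic five-transaction example that produces this tree, and the frequency tie-break is chosen so the f-list comes out as the slide’s f-c-a-b-m-p):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_support, tie_order="fcabmp"):
    # Scan 1: count items; keep frequent ones in descending frequency
    # (assumes every frequent item appears in tie_order)
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    flist = sorted((i for i in freq if freq[i] >= min_support),
                   key=lambda i: (-freq[i], tie_order.index(i)))
    rank = {item: r for r, item in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)
    # Scan 2: insert each transaction in f-list order, sharing prefixes
    # and threading node-links through the header table
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, flist

def conditional_pattern_base(item, header):
    # Each node-link contributes its prefix path, weighted by the node count
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append(("".join(reversed(path)), node.count))
    return base

tdb = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
_, header, flist = build_fp_tree(tdb, 3)
print(flist)                                   # ['f', 'c', 'a', 'b', 'm', 'p']
print(conditional_pattern_base("m", header))   # [('fca', 2), ('fcab', 1)]
print(conditional_pattern_base("p", header))   # [('fcam', 2), ('cb', 1)]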
Benefits of the FP-tree Structure
◼ Completeness
◼ Preserve complete information for frequent pattern
mining
◼ Never break a long pattern of any transaction
◼ Compactness
◼ Reduce irrelevant info—infrequent items are gone
◼ Items in frequency descending order: the more
frequently occurring, the more likely to be shared
◼ Never be larger than the original database (not count
node-links and the count field)
The Frequent Pattern Growth Mining Method
◼ Idea: frequent pattern growth—recursively grow frequent patterns by pattern and database partition
◼ Method
◼ For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
◼ Repeat the process on each newly created conditional FP-tree
◼ Until the resulting FP-tree is empty, or it contains only one path—a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
Performance of FPGrowth in Large Datasets
(Figure: run time (sec.) vs. support threshold (%), 0–3%, on data set T25I20D10K, comparing D1 Apriori runtime and D1 FP-growth runtime; as the support threshold drops toward 0, Apriori’s runtime climbs steeply while FP-growth’s stays low.)
Advantages of the Pattern Growth Approach
◼ Divide-and-conquer:
◼ Decompose both the mining task and DB according to the
frequent patterns obtained so far
◼ Lead to focused search of smaller databases
◼ Other factors
◼ No candidate generation, no candidate test
◼ Compressed database: FP-tree structure
◼ No repeated scan of entire database
◼ Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
◼ A good open-source implementation and refinement of FPGrowth
◼ FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
Interestingness Measure: Correlations (Lift)
◼ play basketball ⇒ eat cereal [40%, 66.7%] is misleading
◼ The overall % of students eating cereal is 75% > 66.7%
◼ play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
◼ Measure of dependent/correlated events: lift(A, B) = P(A ∪ B) / (P(A) P(B))
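Plugging in counts consistent with the stated percentages (a cohort of 1000 students is our assumption: 600 play basketball, 750 eat cereal, 400 do both):

n, basketball, cereal, both = 1000, 600, 750, 400

lift_pos = (both / n) / ((basketball / n) * (cereal / n))
print(round(lift_pos, 2))   # 0.89 < 1: basketball and cereal are negatively correlated

not_cereal = basketball - both   # 200 players who do not eat cereal
lift_neg = (not_cereal / n) / ((basketball / n) * ((n - cereal) / n))
print(round(lift_neg, 2))   # 1.33 > 1: basketball and "not cereal" are positively correlated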
Are Lift and χ² Good Measures of Correlation?
Null-Invariant Measures
Comparison of Interestingness Measures
◼ Null-(transaction) invariance is crucial for correlation analysis
◼ Lift and χ² are not null-invariant w.r.t. the null-transactions of m and c
◼ 5 null-invariant measures, among them the Kulczynski measure (1927)
◼ Subtle: the null-invariant measures can disagree with each other
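A small sketch contrasting lift with the null-invariant Kulczynski measure, Kulc(A, B) = ½(P(A|B) + P(B|A)): adding null transactions (containing neither A nor B) swings lift wildly but leaves Kulc unchanged:

def lift(n_ab, n_a, n_b, n):
    # n_a, n_b: transactions containing A (resp. B); n_ab: both; n: total
    return (n_ab / n) / ((n_a / n) * (n_b / n))

def kulc(n_ab, n_a, n_b):
    # Average of the two confidences; the total n never enters
    return 0.5 * (n_ab / n_a + n_ab / n_b)

for n in (200, 10**5):   # same A/B counts, very different numbers of null transactions
    print(n, round(lift(1, 100, 100, n), 3), kulc(1, 100, 100))
# lift: 0.02 vs. 10.0 -- Kulc: 0.01 in both cases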
Analysis of DBLP Coauthor Relationships
Recent DB conferences, removing balanced associations, low sup, etc.
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
Research on Pattern Mining: A Road Map
Mining Multiple-Level Association Rules
Multi-level Association: Flexible Support and Redundancy Filtering
◼ Flexible min-support thresholds: Some items are more valuable but
less frequent
◼ Use non-uniform, group-based min-support
◼ E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …
◼ Redundancy Filtering: Some rules may be redundant due to
“ancestor” relationships between items
◼ milk ⇒ wheat bread [support = 8%, confidence = 70%]
◼ 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
◼ The first rule is an ancestor of the second rule
◼ A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor
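A sketch of the filter (the one-quarter market share of 2% milk is an assumed figure for illustration):

ancestor_support = 0.08        # milk => wheat bread
share_of_2pct_milk = 0.25      # assumed: 2% milk is about 1/4 of all milk sold
expected = ancestor_support * share_of_2pct_milk   # 0.02
actual = 0.02                  # observed support of "2% milk => wheat bread"
print(abs(actual - expected) <= 0.005)   # True -> the rule is redundant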
Mining Multi-Dimensional Association
◼ Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
◼ Multi-dimensional rules: ≥ 2 dimensions or predicates
◼ Inter-dimension assoc. rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
◼ Hybrid-dimension assoc. rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
◼ Categorical Attributes: finite number of possible values, no
ordering among values—data cube approach
◼ Quantitative Attributes: Numeric, implicit ordering among
values—discretization, clustering, and gradient approaches
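One common reduction (a sketch, not prescribed by the slide): turn each attribute-value pair of a tuple into an item, so multi-dimensional data becomes ordinary transactions that Apriori or FP-growth can mine unchanged:

rows = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "milk"},
]
transactions = [{f"{k}={v}" for k, v in row.items()} for row in rows]
print(sorted(transactions[0]))
# ['age=19-25', 'buys=coke', 'occupation=student'] -> mine as ordinary itemsets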
Advanced Frequent Pattern Mining
◼ Pattern Mining in Multi-Level, Multi-Dimensional Space
◼ Mining Multi-Level Association
◼ Mining Multi-Dimensional Association
◼ Mining Quantitative Association Rules
◼ Mining Rare Patterns and Negative Patterns
Mining Quantitative Associations
Static Discretization of Quantitative Attributes
Negative and Rare Patterns
◼ Rare patterns: Very low support but interesting
◼ E.g., buying Rolex watches
◼ Mining: Setting individual-based or special group-based
support threshold for valuable items
◼ Negative patterns
◼ Since it is unlikely that one buys Ford Expedition (an
SUV car) and Toyota Prius (a hybrid car) together, Ford
Expedition and Toyota Prius are likely negatively
correlated patterns
◼ Negatively correlated patterns that are infrequent tend to
be more interesting than those that are frequent
Defining Negative Correlated Patterns (I)
◼ Definition 1 (support-based)
◼ If itemsets X and Y are both frequent but rarely occur together, i.e.,
sup(X U Y) < sup (X) * sup(Y)
◼ Then X and Y are negatively correlated
◼ Problem: A store sold two needle packages A and B, each in 100 transactions, but only one transaction contained both A and B
◼ When there are in total 200 transactions, we have
s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
◼ When there are 10^5 transactions, we have
s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3, so s(A ∪ B) > s(A) × s(B)
◼ Where is the problem? Null transactions, i.e., the support-based definition is not null-invariant!
Defining Negative Correlated Patterns (II)
◼ Definition 2 (negative itemset-based)
◼ X is a negative itemset if (1) X = Ā ∪ B, where B is a set of positive items, Ā is a set of negative items, |Ā| ≥ 1, and (2) s(X) ≥ μ
◼ An itemset X is negatively correlated if …
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
Mining Colossal Frequent Patterns
◼ F. Zhu, X. Yan, J. Han, P. S. Yu, and H. Cheng, “Mining Colossal
Frequent Patterns by Core Pattern Fusion”, ICDE'07.
◼ We have many algorithms, but can we mine large (i.e., colossal) patterns, such as patterns of size around 50 to 100? Unfortunately, not!
◼ Why not? ― the curse of “downward closure” of frequent patterns
◼ The “downward closure” property
◼ Any sub-pattern of a frequent pattern is frequent.
◼ Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
◼ No matter using breadth-first search (e.g., Apriori) or depth-first
search (FPgrowth), we have to examine so many patterns
◼ Thus the downward closure property leads to explosion!
Colossal Patterns: A Motivating Example
Let’s make a set of 40 transactions:
T1 = 1 2 3 4 … 39 40
T2 = 1 2 3 4 … 39 40
…
T40 = 1 2 3 4 … 39 40
Then delete the items on the diagonal:
T1 = 2 3 4 … 39 40
T2 = 1 3 4 … 39 40
…
T40 = 1 2 3 … 39
Let the minimum support threshold σ = 20.
Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
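The example is easy to reproduce (a two-line sketch); with σ = 20, any size-20 itemset, e.g. {1, …, 20}, has support exactly 20 and is frequent, and there are C(40, 20) ≈ 1.4 × 10^11 such mid-sized frequent patterns:

transactions = [set(range(1, 41)) - {i} for i in range(1, 41)]   # T_i = {1..40} \ {i}
print(sum(set(range(1, 21)) <= t for t in transactions))         # 20 = support of {1..20}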
Colossal Pattern Set: Small but Interesting
Mining Colossal Patterns: Motivation and Philosophy
◼ Motivation: Many real-world tasks need mining colossal patterns
◼ Micro-array analysis in bioinformatics (when support is low)
(Figure: a transaction database D, a colossal pattern α, and the projected databases Dα1, …, Dαk of α’s core sub-patterns α1, …, αk.)
Robustness of Colossal Patterns
◼ Core Patterns
◼ Intuitively, for a frequent pattern α, a subpattern β is a τ-core pattern of α if β shares a similar support set with α, i.e., |Dα| / |Dβ| ≥ τ, where 0 < τ ≤ 1 and Dα denotes the set of transactions containing α
Colossal Patterns Correspond to Dense Balls
Idea of Pattern-Fusion Algorithm
Pattern-Fusion: The Algorithm
◼ A bounded-breadth pattern tree traversal
◼ It avoids explosion in the number of candidate patterns examined
Pattern-Fusion Leads to Good Approximation
Association Analysis
◼ Basic Concepts
◼ Apriori Algorithm
◼ FP-growth
◼ Multidimensional Associations
◼ Summary
Association Analysis
◼ Basic Concepts
◼ Apriori, FP Growth
◼ Correlation Analysis
End Unit - 2