Ch06 Frequent Itemsets
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Frequent Itemset Mining & Association Rules
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Association Rule Discovery
For example:
Finding communities in graphs (e.g., Twitter)
in s buckets Bi
…
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
$\mathrm{conf}(I \rightarrow j) = \dfrac{\mathrm{support}(I \cup j)}{\mathrm{support}(I)}$
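As a small worked example, the confidence of a rule can be computed directly on the eight baskets B1 to B8 listed above. This is a minimal sketch; the function names `support` and `confidence` are illustrative choices, not names from the slides.

```python
# Baskets B1..B8 as listed in the slides.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(lhs, rhs_item, baskets):
    """conf(I -> j) = support(I with j added) / support(I)."""
    lhs = set(lhs)
    return support(lhs | {rhs_item}, baskets) / support(lhs, baskets)

print(support({"m", "b"}, baskets))          # 4
print(confidence({"m", "b"}, "c", baskets))  # 0.5
```

Here {m, b} appears in B1, B3, B5, and B6, and two of those baskets also contain c, so the rule {m, b} → c has confidence 2/4.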
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Mining Association Rules
Step 1: Find all frequent itemsets I
(we will explain this next)
Step 2: Rule generation
For every nonempty proper subset A of I, generate a rule A → I \ A
Since I is frequent, A is also frequent
Variant 1: Single pass to compute the rule confidence
confidence(A,B→C,D) = support(A,B,C,D) / support(A,B)
Variant 2:
Observation: If A,B,C→D is below the confidence threshold, so is A,B→C,D
Can generate “bigger” rules from smaller ones!
Output the rules above the confidence threshold
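Step 2 can be sketched as follows. This is a plain Variant 1 style enumeration (it does not implement the pruning of Variant 2). The support table is an assumption for illustration: it holds the frequent itemsets of baskets B1 to B8 counted by hand with a support threshold of 3, and the 0.75 confidence threshold is likewise hypothetical.

```python
from itertools import combinations

# Assumed input: support counts of all frequent itemsets (threshold 3, baskets B1..B8).
support = {
    frozenset("b"): 6, frozenset("c"): 5, frozenset("j"): 4, frozenset("m"): 5,
    frozenset("bm"): 4, frozenset("bc"): 4, frozenset("cj"): 3,
}

def generate_rules(support, min_conf):
    """Return (A, I minus A, confidence) for each frequent I and nonempty proper subset A."""
    rules = []
    for itemset, sup in support.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # A is a subset of a frequent itemset, so its count is in the table.
                conf = sup / support[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

for lhs, rhs, conf in generate_rules(support, min_conf=0.75):
    print(",".join(sorted(lhs)), "->", ",".join(sorted(rhs)), f"conf={conf:.2f}")
# m -> b conf=0.80
# c -> b conf=0.80
# j -> c conf=0.75
```

The lookup `support[lhs]` never fails because every subset of a frequent itemset is itself frequent, which is the point made in Step 2.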
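Itemset  Count  Maximal (s = 3)  Closed
A        4      No               No
B        5      No               Yes
C        3      No               No
AB       4      Yes              Yes
AC       2      No               No
BC       3      Yes              Yes
ABC      2      No               Yes
Notes:
AB and BC are maximal: each is frequent, and its only superset, ABC, is not frequent.
C is not closed: its superset BC has the same count.
BC is closed: its only superset, ABC, has a smaller count.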
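The two columns of the table above can be recomputed from the counts alone. The sketch below assumes a support threshold of s = 3, which is the value consistent with the table entries; the helper names are illustrative.

```python
# Counts taken from the table; itemsets are written as strings of item letters.
counts = {"A": 4, "B": 5, "C": 3, "AB": 4, "AC": 2, "BC": 3, "ABC": 2}
s = 3  # assumed support threshold

def supersets(x):
    """All listed itemsets that strictly contain x."""
    return [y for y in counts if set(x) < set(y)]

def is_maximal(x):
    """Frequent, and no superset is frequent."""
    return counts[x] >= s and all(counts[y] < s for y in supersets(x))

def is_closed(x):
    """No superset has the same count."""
    return all(counts[y] < counts[x] for y in supersets(x))

print([x for x in counts if is_maximal(x)])  # ['AB', 'BC']
print([x for x in counts if is_closed(x)])   # ['B', 'AB', 'BC', 'ABC']
```

Checking all supersets rather than only immediate ones gives the same answer here, because a superset can never have a larger count than the set it contains.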
[Figure: storage cost of pair counts: 4 bytes per pair versus 12 bytes per occurring pair.]
[Figure: main-memory layout. Pass 1: item counts. Pass 2: frequent items; counts of pairs of frequent items (candidate pairs).]
[Figure: main-memory layout: counts of pairs of frequent items.]
with storing triples
Trick: re-number frequent items
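The renumbering trick can be illustrated as below. This is a sketch under assumptions: the item IDs are made up, and the flat triangular array used for the pair counts is one common layout that the fragment above does not spell out.

```python
def renumber(frequent_items):
    """Map frequent items to dense indices 0..m-1 and keep the reverse table."""
    old_of_new = sorted(frequent_items)
    new_of_old = {item: i for i, item in enumerate(old_of_new)}
    return new_of_old, old_of_new

def tri_index(i, j, m):
    """Slot of pair {i, j}, with 0 <= i < j < m, in a flat array of m*(m-1)//2 counts."""
    return i * m - i * (i + 1) // 2 + (j - i - 1)

# Hypothetical original item IDs that turned out to be frequent.
new_of_old, old_of_new = renumber([1007, 42, 9001, 77])
m = len(old_of_new)
pair_counts = [0] * (m * (m - 1) // 2)

# Count one occurrence of the pair {9001, 42} using the new numbers.
i, j = sorted((new_of_old[9001], new_of_old[42]))
pair_counts[tri_index(i, j, m)] += 1
print(new_of_old)   # {42: 0, 77: 1, 1007: 2, 9001: 3}
print(pair_counts)  # [0, 0, 1, 0, 0, 0]
```

Because only frequent items receive new numbers, the array has one slot per pair of frequent items rather than one per pair of all items.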
Example
generating Ck from Lk-1 and L1.
However, one can be more careful with candidate generation. For example, in C3 we know
{b,m,j} cannot be frequent since {m,j} is not frequent.
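The more careful candidate generation can be sketched as follows: a k-set is kept only if every one of its (k-1)-subsets is frequent. The set L2 below is an assumed example chosen to be consistent with the remark that {m,j} is not frequent, and enumerating all k-combinations of items is a simple, not an efficient, way to do this.

```python
from itertools import combinations

def candidates(prev_frequent, k):
    """Build C_k: k-sets all of whose (k-1)-subsets appear in L_{k-1}."""
    prev = {frozenset(s) for s in prev_frequent}
    items = sorted(set().union(*prev))
    return [
        set(c)
        for c in combinations(items, k)
        if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
    ]

# Assumed frequent pairs; {m, j} and {b, j} are deliberately absent.
L2 = [{"b", "c"}, {"b", "m"}, {"c", "j"}, {"c", "m"}]
C3 = candidates(L2, 3)
print(C3 == [{"b", "c", "m"}])  # True
```

With this L2, the triple {b,m,j} is rejected without being counted, because its subset {m,j} is missing from L2.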
Pass 2:
Only count pairs that hash to frequent buckets
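A compact sketch of the two PCY passes for pairs is given below. Several details are assumptions made for illustration: the hash function (crc32 of the joined item names), the number of buckets, and the extra requirement in Pass 2 that both items be frequent, which is carried over from A-Priori.

```python
from collections import Counter
from itertools import combinations
import zlib

def bucket_of(pair, n_buckets):
    """Deterministic hash of a pair to a bucket (crc32 is an arbitrary choice)."""
    return zlib.crc32("|".join(sorted(pair)).encode()) % n_buckets

def pcy_pairs(baskets, s, n_buckets):
    # Pass 1: count items and add 1 to the bucket of every pair in every basket.
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[bucket_of(pair, n_buckets)] += 1

    # Between passes: replace the bucket counts by one bit per bucket.
    bitmap = [count >= s for count in bucket_counts]
    frequent_items = {item for item, count in item_counts.items() if count >= s}

    # Pass 2: count only pairs of frequent items that hash to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in basket if item in frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket_of(pair, n_buckets)]:
                pair_counts[pair] += 1
    return {pair: count for pair, count in pair_counts.items() if count >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(sorted(pcy_pairs(baskets, s=3, n_buckets=5).items()))
# [(('b', 'c'), 4), (('b', 'm'), 4), (('c', 'j'), 3)]
```

The final answer does not depend on the hash function: a frequent pair always lands in a frequent bucket, so the bitmap can only discard pairs that are not frequent.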
PCY Algorithm – Between Passes
[Figure: main-memory layout. Pass 1: hash table for pairs. Pass 2: bitmap; counts of candidate pairs.]
[Figure: main-memory layout showing a first hash table, a second hash table, Bitmap 1, Bitmap 2, and counts of candidate pairs.]
[Figure: main-memory layout. Pass 1: first hash table; second hash table. Pass 2: Bitmap 1; Bitmap 2; counts of candidate pairs.]
[Figure: main memory holding baskets.]