Association Rules and Frequent Item Sets
Support of {Beer, Bread} = 2
Given a support threshold s, sets of items that appear
in at least s baskets are called frequent itemsets
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
Items = {milk, coke, pepsi, beer, juice}
Support threshold = 3 baskets
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}, {c,j}.
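The list of frequent itemsets above can be checked by brute force. The following is a minimal sketch, not an efficient algorithm: it enumerates every subset of every basket, which is only feasible for tiny examples like this one. The names `BASKETS` and `frequent_itemsets` are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import combinations

# Baskets from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]


def frequent_itemsets(baskets, threshold):
    """Count every non-empty subset of every basket and keep
    those whose support (number of baskets) is >= threshold."""
    support = Counter()
    for basket in baskets:
        items = sorted(basket)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                support[frozenset(subset)] += 1
    return {s: n for s, n in support.items() if n >= threshold}


frequent = frequent_itemsets(BASKETS, threshold=3)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

With a support threshold of 3 baskets, this reports four frequent single items and three frequent pairs, and no frequent triple.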
$$\mathrm{conf}(I \rightarrow j) = \frac{\mathrm{support}(I \cup j)}{\mathrm{support}(I)}$$
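The confidence of a rule I → j is the support of I together with j, divided by the support of I. As a small sketch, the code below computes it on the eight example baskets from earlier in the deck. The helper names `support` and `confidence` are illustrative, and the code assumes the left-hand side I occurs in at least one basket (otherwise the division fails).

```python
# Baskets from the earlier example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]


def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(lhs, j, baskets):
    """conf(lhs -> j) = support(lhs with j added) / support(lhs)."""
    return support(set(lhs) | {j}, baskets) / support(lhs, baskets)


conf_m_b = confidence({"m"}, "b", BASKETS)  # {m,b} is in 4 baskets, {m} in 5
conf_b_m = confidence({"b"}, "m", BASKETS)  # {m,b} is in 4 baskets, {b} in 6
print(conf_m_b, conf_b_m)
```

Note that confidence is not symmetric: the rule {m} → b has a different confidence from {b} → m because the denominators differ.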
Step 1: Find all frequent itemsets I
▪ (we will explain this next)
Itemset  Support  Maximal  Closed
A        4        No       No
B        5        No       Yes
C        3        No       No      (superset BC has the same count)
AB       4        Yes      Yes     (frequent, and its only superset, ABC, is not frequent)
ABC      2        No       Yes
The approach:
▪ We always need to generate all the itemsets
▪ But we would only like to count (keep track of) those
itemsets that in the end turn out to be frequent
Naïve approach to finding frequent pairs
Read file once, counting in main memory
the occurrences of each pair:
▪ From each basket of n items, generate its
n(n-1)/2 pairs by two nested loops
Fails if (#items)^2 exceeds main memory
▪ Remember: #items can be
100K (Wal-Mart) or 10B (Web pages)
▪ Suppose 10^5 items, counts are 4-byte integers
▪ Number of pairs of items: 10^5(10^5-1)/2 ≈ 5*10^9
▪ Therefore, 2*10^10 bytes (20 gigabytes) of memory needed
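The naïve approach and its memory estimate can be sketched as follows. This is an illustration only: `count_pairs` uses a Python dictionary rather than packed 4-byte integers, and `pair_count_memory_bytes` is a hypothetical helper that simply evaluates the n(n-1)/2 formula from the bullets above.

```python
from collections import Counter


def count_pairs(baskets):
    """Read the baskets once, counting each pair in main memory.
    Each basket of n items yields n(n-1)/2 pairs via two nested loops."""
    counts = Counter()
    for basket in baskets:
        items = sorted(basket)
        n = len(items)
        for a in range(n):
            for b in range(a + 1, n):
                counts[(items[a], items[b])] += 1
    return counts


def pair_count_memory_bytes(num_items, bytes_per_count=4):
    """Memory needed to keep one counter for every possible pair of items."""
    return num_items * (num_items - 1) // 2 * bytes_per_count


pair_counts = count_pairs([{"m", "c", "b"}, {"m", "b"}, {"c", "b", "j"}])
memory_needed = pair_count_memory_bytes(10**5)
print(pair_counts[("b", "m")], memory_needed)
```

For 10^5 items the helper returns just under 2*10^10 bytes, which matches the 20-gigabyte figure.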
Two approaches:
Approach 1: Count all pairs using a matrix
Note:
Approach 1 only requires 4 bytes per pair
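One common way to store a count for every pair in a matrix without wasting half of it is a one-dimensional triangular array. The sketch below assumes items have been renumbered 0..n-1 (an assumption not stated on the slide); `pair_index` is an illustrative name, and Python list entries are not 4 bytes each, so this shows only the indexing scheme.

```python
def pair_index(i, j, n):
    """Position of the unordered pair {i, j} (items numbered 0..n-1)
    in a flat array holding one count per pair, in the order
    (0,1), (0,2), ..., (0,n-1), (1,2), ..., (n-2,n-1)."""
    if i > j:
        i, j = j, i
    if i == j or i < 0 or j >= n:
        raise ValueError("need two distinct items in range 0..n-1")
    # Pairs whose smaller item is below i, plus the offset within row i.
    return i * (n - 1) - i * (i - 1) // 2 + (j - i - 1)


n = 4
counts = [0] * (n * (n - 1) // 2)  # one slot per pair
for basket in ([0, 1, 3], [1, 3], [0, 2, 3]):
    items = sorted(basket)
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            counts[pair_index(items[a], items[b], n)] += 1
print(counts)
```

Because the pair itself is implied by the array position, only the count needs to be stored for each pair.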
[Figure: main memory in Pass 1 and Pass 2; Pass 2 holds counts of pairs of frequent items (candidate pairs)]
▪ Use as many buckets as fit in memory
▪ Keep a count for each bucket into which
pairs of items are hashed
▪ For each bucket just keep the count, not the actual
pairs that hash to the bucket!
[Figure: main memory in Pass 1 and Pass 2; Pass 2 holds counts of pairs of frequent items (candidate pairs)]
FOR (each basket/transaction) :
    FOR (each item in the basket) :
        add 1 to item’s count;
    FOR (each pair of items) :                  (new in PCY)
        hash the pair to a bucket;              (new in PCY)
        add 1 to the count for that bucket;     (new in PCY)
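The pseudocode above covers Pass 1 of PCY. A minimal runnable sketch of both passes is given below, under some assumptions: items are strings, the bucket hash is `zlib.crc32` over the joined pair (any deterministic hash would do), and the bitmap is a plain list of booleans rather than a packed bit array. The function names are illustrative.

```python
import zlib
from collections import Counter
from itertools import combinations


def bucket_of(pair, num_buckets):
    """Hash a pair of item names to a bucket (hash choice is an assumption)."""
    return zlib.crc32("|".join(pair).encode("utf-8")) % num_buckets


def pcy_frequent_pairs(baskets, threshold, num_buckets):
    # Pass 1: count items, and count how many pairs hash to each bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(basket)
        for item in items:
            item_counts[item] += 1
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair, num_buckets)] += 1

    # Between passes: keep frequent items; summarize buckets as a bitmap.
    frequent_items = {i for i, n in item_counts.items() if n >= threshold}
    bitmap = [n >= threshold for n in bucket_counts]

    # Pass 2: count only candidate pairs, i.e. pairs of frequent items
    # that hash to a bucket whose bit is set.
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(items, 2):
            if bitmap[bucket_of(pair, num_buckets)]:
                pair_counts[pair] += 1
    return {p: n for p, n in pair_counts.items() if n >= threshold}


BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
result = pcy_frequent_pairs(BASKETS, threshold=3, num_buckets=7)
print(result)
```

The result does not depend on the hash function or the number of buckets: a frequent pair always lands in a bucket whose count is at least the pair's own support, so the bitmap can only filter out infrequent pairs.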
[Figure: main memory in PCY. Pass 1: hash table for pairs. Pass 2: bitmap and counts of candidate pairs.]
Support 1
Example credits to: PCY (Park-Chen-Yu) Algorithm with Solved Example | Big Data Analytics | #pcy #bigdata (youtube.com)
Item Freq
1 3
2 3
3 2
4 5
5 4
[Figure: main-memory layout showing the first hash table, the second hash table, Bitmap 1, Bitmap 2, and counts of candidate pairs]
[Figure: main memory in Pass 1 and Pass 2, showing the first and second hash tables, Bitmap 1, Bitmap 2, and counts of candidate pairs]
time we increase the size of itemsets baskets