Class Notes 10jan2023

The document summarizes the Apriori algorithm for finding frequent itemsets in transactional data. It discusses how to identify association rules between items that commonly occur together. The Apriori algorithm works in multiple passes over the data: it first finds all frequent items, then uses those to generate candidate item pairs, then finds frequent pairs to generate candidate triples, and so on. It exploits the fact that any subset of a frequent itemset must also be frequent, allowing it to prune the search space significantly.


Lecture 2: 10 January, 2023

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


January–April 2023
Market-Basket Analysis

Set of items I = {i1, i2, . . . , iN}

Set of transactions T = {t1, t2, . . . , tM}
  A transaction is a set t ⊆ I of items

Identify association rules X → Y
  X, Y ⊆ I, X ∩ Y = ∅
  If X ⊆ tj then it is likely that Y ⊆ tj

Two thresholds
  How frequently does X ⊆ tj imply Y ⊆ tj?
  How significant is this pattern overall?
Madhavan Mukund Lecture 2: 10 January, 2023 DMML Jan–Apr 2023 2 / 16


Setting thresholds

For Z ⊆ I, Z.count = |{tj | Z ⊆ tj}|

How frequently does X ⊆ tj imply Y ⊆ tj?
  Fix a confidence level c
  Want (X ∪ Y).count / X.count ≥ c

How significant is this pattern overall?
  Fix a support level s
  Want (X ∪ Y).count / M ≥ s

Given sets of items I and transactions T, with
confidence c and support s, find all valid
association rules X → Y
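These definitions can be checked on a toy transaction set. A minimal sketch (the function names `count`, `support`, `confidence` and the grocery items are my own illustration, not from the lecture):

```python
# A minimal sketch of the definitions above; transactions are sets of items.

def count(Z, T):
    """Z.count = number of transactions containing every item of Z."""
    return sum(1 for t in T if Z <= t)

def support(X, Y, T):
    """(X ∪ Y).count / M"""
    return count(X | Y, T) / len(T)

def confidence(X, Y, T):
    """(X ∪ Y).count / X.count"""
    return count(X | Y, T) / count(X, T)

T = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk"}]

# Rule {bread} -> {milk}: together in 2 of 4 transactions,
# and in 2 of the 3 transactions that contain bread
print(support({"bread"}, {"milk"}, T))     # 0.5
print(confidence({"bread"}, {"milk"}, T))  # 2/3
```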
Frequent itemsets

X → Y is interesting only if (X ∪ Y).count ≥ s·M

First identify all frequent itemsets
  Z ⊆ I such that Z.count ≥ s·M

Naïve strategy: maintain a counter for each Z
  For each tj ∈ T
    For each Z ⊆ tj
      Increment the counter for Z
  After scanning all transactions, keep Z with Z.count ≥ s·M

Need to maintain 2^|I| counters
  Infeasible amount of memory
  Can we do better?
Sample calculation

Let's assume a bound on each ti ∈ T
  No transaction has more than 10 items

Say N = |I| = 10^6, M = |T| = 10^9, s = 0.01

Number of possible subsets to count is
  Σ_{i=1}^{10} C(10^6, i)

A singleton subset {x} that is frequent is an item x that
appears in at least 10^7 transactions

In total, T contains at most 10^10 items
  At most 10^10 / 10^7 = 1000 items are frequent!
  How can we exploit this?
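The back-of-the-envelope numbers above can be verified directly (a sketch; variable names are mine, and the frequent-singleton threshold follows from s·M = 0.01 · 10^9 = 10^7):

```python
from math import comb

N, M, s = 10**6, 10**9, 0.01

# Number of possible subsets when no transaction has more than 10 items:
# the sum over i = 1..10 of C(10^6, i)
candidates = sum(comb(N, i) for i in range(1, 11))
print(f"about 10^{len(str(candidates)) - 1} counters")  # hopelessly many

# A frequent singleton appears in at least s*M transactions,
# and T contains at most 10*M items in total
threshold = int(s * M)        # 10^7
max_items_total = 10 * M      # 10^10
print(max_items_total // threshold)  # at most 1000 frequent items
```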
Apriori

Clearly, if Z is frequent, so is every subset Y ⊆ Z

We exploit the contrapositive

Apriori observation
  If Z is not a frequent itemset, no superset Y ⊇ Z can be
  frequent

For instance, in our earlier example, every frequent
itemset must be built from the 1000 frequent items
  In particular, for any frequent pair {x, y}, both {x} and
  {y} must be frequent

Build frequent itemsets bottom up, size 1, 2, . . .


Apriori algorithm
Fi: frequent itemsets of size i — Level i

F1: Scan T, maintain a counter for each x ∈ I
C2 = {{x, y} | x, y ∈ F1}: Candidates in level 2
F2: Scan T, maintain a counter for each X ∈ C2
C3 = {{x, y, z} | {x, y}, {x, z}, {y, z} ∈ F2}
F3: Scan T, maintain a counter for each X ∈ C3
. . .
Ck = subsets of size k, every (k−1)-subset is in Fk−1
Fk: Scan T, maintain a counter for each X ∈ Ck
. . .
Apriori algorithm
Ck = subsets of size k, every (k−1)-subset is in Fk−1

How do we generate Ck?
  Naïve: enumerate subsets of size k and check each one
    Expensive!

Observation: Any C′k ⊇ Ck will do as a candidate set

Items are ordered: i1 < i2 < · · · < iN
  List each itemset in ascending order — canonical representation

Merge two (k−1)-subsets if they differ only in the last element
  X = {i1, i2, . . . , ik−2, ik−1}
  X′ = {i1, i2, . . . , ik−2, i′k−1}
  Merge(X, X′) = {i1, i2, . . . , ik−2, ik−1, i′k−1}


Apriori algorithm

Merge(X, X′) = {i1, i2, . . . , ik−2, ik−1, i′k−1}
  X = {i1, i2, . . . , ik−2, ik−1}
  X′ = {i1, i2, . . . , ik−2, i′k−1}

C′k = {Merge(X, X′) | X, X′ ∈ Fk−1}

Claim: Ck ⊆ C′k
  Suppose Y = {i1, i2, . . . , ik−1, ik} ∈ Ck
  Then X = {i1, i2, . . . , ik−2, ik−1} ∈ Fk−1 and
  X′ = {i1, i2, . . . , ik−2, ik} ∈ Fk−1
  So Y = Merge(X, X′) ∈ C′k

Can generate C′k efficiently
  Arrange Fk−1 in dictionary order
  Split into blocks of itemsets that differ only in the last element
  Merge all pairs within each block
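The block-wise merge above can be sketched as follows (the name `gen_candidates` and the sorted-tuple representation are my own choices). Sorting Fk−1 groups itemsets sharing the first k−2 items into blocks; pairs within a block are merged, and a prune keeps only candidates all of whose (k−1)-subsets are frequent:

```python
from itertools import combinations

def gen_candidates(F_prev, k):
    """Generate candidate k-itemsets from F_{k-1} (itemsets as sorted tuples)."""
    F_prev = sorted(F_prev)   # dictionary order groups shared prefixes
    Fset = set(F_prev)
    candidates = set()
    for i, X in enumerate(F_prev):
        for X2 in F_prev[i + 1:]:
            if X[:-1] != X2[:-1]:   # left the block: prefixes no longer match
                break
            Y = X + (X2[-1],)       # Merge(X, X') in canonical order
            # Apriori prune: every (k-1)-subset of Y must be frequent
            if all(sub in Fset for sub in combinations(Y, k - 1)):
                candidates.add(Y)
    return candidates

F2 = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}
print(gen_candidates(F2, 3))  # {('a', 'b', 'c')}
```

Note that ("b", "c") and ("b", "d") do merge, but the candidate is pruned because ("c", "d") is not frequent.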
Apriori algorithm
C1 = {{x} | x ∈ I}
F1 = {Z | Z ∈ C1, Z.count ≥ s·M}
For k ∈ {2, 3, . . .}
  C′k = {Merge(X, X′) | X, X′ ∈ Fk−1}
  Fk = {Z | Z ∈ C′k, Z.count ≥ s·M}

When do we stop?
  k exceeds the size of the largest transaction
  Fk is empty

Next step: From frequent itemsets to association rules
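Putting the levels together, a compact sketch of the whole loop (function name and example data are mine). As noted on the earlier slide, any superset of Ck works as a candidate set, so this sketch counts the merged set C′k directly without the subset prune:

```python
from collections import Counter

def apriori(T, s):
    """All frequent itemsets (as sorted tuples) of transactions T at support s."""
    M = len(T)
    threshold = s * M
    # Level 1: scan T, one counter per item
    counts = Counter(x for t in T for x in t)
    F = {(x,) for x, c in counts.items() if c >= threshold}
    frequent = set(F)
    while F:
        # C'_k: merge frequent (k-1)-sets that differ only in the last element
        C = {X + (X2[-1],) for X in F for X2 in F
             if X[:-1] == X2[:-1] and X[-1] < X2[-1]}
        # Scan T again, one counter per candidate
        counts = Counter(Z for t in T for Z in C if set(Z) <= t)
        F = {Z for Z, c in counts.items() if c >= threshold}
        frequent |= F   # loop ends when Fk is empty
    return frequent

T = [{"bread", "milk"}, {"bread", "butter"},
     {"bread", "milk", "butter"}, {"milk", "butter"}]
print(sorted(apriori(T, 0.5)))
```

On this example every item and every pair meets the 50% threshold, but the triple appears in only one of four transactions, so the loop stops at level 3.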


Association rules
Given sets of items I and transactions T, with
confidence c and support s, find all valid association
rules X → Y
  X, Y ⊆ I, X ∩ Y = ∅
  (X ∪ Y).count / X.count ≥ c
  (X ∪ Y).count / M ≥ s

For a rule X → Y to be valid, X ∪ Y should be a
frequent itemset
  The Apriori algorithm finds all Z ⊆ I such that Z.count ≥ s·M


Association rules
Naïve strategy
  For every frequent itemset Z
    Enumerate all pairs X, Y ⊆ Z, X ∩ Y = ∅
    Check (X ∪ Y).count / X.count ≥ c

Can we do better?
  Sufficient to check all partitions of Z
  Suppose X, Y ⊆ Z, X → Y is a valid association rule,
  but X ∪ Y is a proper subset of Z
    X ∪ Y = Z′ ⊊ Z
    Z′ is also a frequent itemset (a priori)
    X, Y partition Z′
Association rules
Sufficient to check all partitions of Z

Suppose Z = X ⊎ Y, X → Y is a valid rule, and y ∈ Y
  What about (X ∪ {y}) → Y \ {y}?
  Know (X ∪ Y).count / X.count ≥ c
  Check (X ∪ Y).count / (X ∪ {y}).count
  X.count ≥ (X ∪ {y}).count, always
  The second fraction has a denominator that is no larger, so
  (X ∪ {y}) → Y \ {y} is also a valid rule

Observation: Can use the apriori principle again!


Apriori for association rules
If X → Y is a valid rule, and y ∈ Y,
(X ∪ {y}) → Y \ {y} must also be a valid rule

If X → Y is not a valid rule, and x ∈ X,
(X \ {x}) → Y ∪ {x} cannot be a valid rule

Start by checking rules with a single element on the right
  Z \ {z} → {z}

For X → {x, y} to be a valid rule, both
(X ∪ {x}) → {y} and (X ∪ {y}) → {x} must be valid

Explore partitions of each frequent itemset "level by level"
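This level-by-level exploration for one frequent itemset can be sketched as follows (the names `count` and `rules_from` are mine). Consequents start as single elements; only consequents of valid rules are grown, which is exactly the apriori pruning above:

```python
def count(Z, T):
    """Number of transactions containing every item of Z."""
    return sum(1 for t in T if set(Z) <= t)

def rules_from(Z, T, c):
    """Valid rules X -> Y with X ∪ Y = Z, for a frequent itemset Z (a tuple)."""
    zcount = count(Z, T)
    rules = []
    # Level 1: a single element on the right, Z \ {z} -> {z}
    valid_cons = []
    for y in Z:
        X = tuple(i for i in Z if i != y)
        if zcount / count(X, T) >= c:
            rules.append((X, (y,)))
            valid_cons.append((y,))
    k = 2
    while valid_cons and k < len(Z):
        # Candidate consequents: unions of valid (k-1)-element consequents;
        # an invalid rule's consequent never needs to be grown further
        cands = {tuple(sorted(set(Y1) | set(Y2)))
                 for Y1 in valid_cons for Y2 in valid_cons
                 if len(set(Y1) | set(Y2)) == k}
        valid_cons = []
        for Y in cands:
            X = tuple(i for i in Z if i not in Y)
            if zcount / count(X, T) >= c:
                rules.append((X, Y))
                valid_cons.append(Y)
        k += 1
    return rules

T = [{"bread", "milk", "butter"}, {"bread", "milk"},
     {"bread", "milk", "butter"}, {"milk", "butter"}]
for X, Y in rules_from(("bread", "milk"), T, 0.75):
    print(X, "->", Y)
```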


Association rules for classification
Classify documents by topic

Consider the table on the right:

  Words in document              Topic
  student, teach, school         Education
  student, school                Education
  teach, school, city, game      Education
  cricket, football              Sports
  football, player, spectator    Sports
  cricket, coach, game, team     Sports
  football, team, city, game     Sports

Items are regular words and topics

Documents are transactions — a set of words and one topic

Look for association rules of a special form
  {student, school} → {Education}
  {game, team} → {Sports}

Right hand side is always a single topic

Class Association Rules
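On this toy table, a class association rule can be checked by treating each row as a transaction of its words plus its topic (a sketch; the helper name `car` and the threshold values are my own choices for the example):

```python
# Rows of the table above: (words in document, topic)
docs = [
    ({"student", "teach", "school"}, "Education"),
    ({"student", "school"}, "Education"),
    ({"teach", "school", "city", "game"}, "Education"),
    ({"cricket", "football"}, "Sports"),
    ({"football", "player", "spectator"}, "Sports"),
    ({"cricket", "coach", "game", "team"}, "Sports"),
    ({"football", "team", "city", "game"}, "Sports"),
]
# Each document is one transaction: its words plus its topic
T = [words | {topic} for words, topic in docs]

def car(X, topic, T, s, c):
    """Is X -> {topic} a valid class association rule at support s, confidence c?"""
    xcount = sum(1 for t in T if X <= t)
    both = sum(1 for t in T if X <= t and topic in t)
    return both / len(T) >= s and both / xcount >= c

print(car({"student", "school"}, "Education", T, 0.25, 0.9))  # True
print(car({"game"}, "Sports", T, 0.25, 0.9))                  # False: 2 of 3
```

{game} → {Sports} fails here because "game" also occurs in an Education document, so its confidence is only 2/3.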


Summary

Market-basket analysis searches for correlated items across transactions
  Formalized as association rules

The Apriori principle helps us to efficiently
  identify frequent itemsets, and
  split these itemsets into valid rules

Class association rules — a simple supervised learning model
