Association Rule Mining (Under Construction!)
Part 2
GKGupta, December 2008

Bigger Example
Frequency of Items
Frequent Items
Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions. The frequent 1-itemsets, or L1, are given below. How many candidates are in C2? List them.
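As a quick check of the arithmetic, the sketch below computes the minimum support count (25% of 25 transactions, rounded up to 7) and enumerates the candidate pairs C2 from a hypothetical L1; the five item names are placeholders, since the actual L1 table is not reproduced in this text.

```python
import math
from itertools import combinations

num_transactions = 25
min_support = 0.25

# Minimum number of transactions an itemset must appear in to be frequent.
min_count = math.ceil(min_support * num_transactions)
print(min_count)  # 7

# Hypothetical frequent 1-itemsets (placeholders for the slide's actual L1).
L1 = ["A", "B", "D", "M", "T"]

# Candidate 2-itemsets C2: every pair of distinct frequent items.
C2 = list(combinations(sorted(L1), 2))
print(len(C2), C2)  # C(5, 2) = 10 candidate pairs
```

In general, a set of n frequent items yields n(n-1)/2 candidate pairs in C2.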
L2
The following pairs are frequent. Now find C3 and then L3 and the rules.
Rules
The full set of rules is given below. Could some rules be removed?
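The rules themselves are not reproduced in this text, but the following sketch shows how rules are typically derived from a frequent itemset: every non-empty proper subset becomes an antecedent, and a rule is kept only if its confidence meets a threshold. The itemset, support counts and 70% confidence threshold below are made-up illustrations, not the slide's data.

```python
from itertools import combinations

# Made-up support counts for illustration only.
support_count = {
    frozenset("AB"): 8,
    frozenset("A"): 10,
    frozenset("B"): 12,
}
min_confidence = 0.7

def rules_from_itemset(itemset, counts):
    """Generate rules X -> (itemset - X) whose confidence meets the threshold."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            confidence = counts[itemset] / counts[antecedent]
            if confidence >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), confidence))
    return rules

for lhs, rhs, conf in rules_from_itemset("AB", support_count):
    print(lhs, "->", rhs, f"(confidence {conf:.2f})")  # only A -> B survives here
```

Rules whose confidence falls below the chosen threshold are one obvious candidate for removal.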
Pruning
Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that all subsets of a frequent itemset must also be frequent: any candidate in Ck that has an infrequent (k-1)-subset can be discarded without counting it.
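As a concrete illustration (not from the slides), here is a minimal Python sketch of candidate generation followed by the subset-based prune. The simplified join used here produces a few extra candidates, which the prune step then removes.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Join (k-1)-itemsets from L(k-1) into k-item candidates, then prune by the subset rule."""
    prev_frequent = {frozenset(s) for s in prev_frequent}
    # Join step (simplified): union pairs of (k-1)-itemsets whose union has exactly k items.
    candidates = {a | b for a in prev_frequent for b in prev_frequent if len(a | b) == k}
    # Prune step: every (k-1)-subset of a surviving candidate must itself be frequent.
    return [c for c in candidates
            if all(frozenset(sub) in prev_frequent for sub in combinations(c, k - 1))]

# The L2 from the example on the next slide (with its erroneous pairs removed).
L2 = [{"A", "C"}, {"A", "P"}, {"C", "P"}, {"E", "P"}, {"E", "V"}, {"Q", "S"}, {"Q", "X"}]
print(generate_candidates(L2, 3))  # only {A, C, P} survives the prune
```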
Example
Suppose the items are A, B, C, D, E, F, ..., X, Y, Z.
Suppose L1 is A, C, E, P, Q, S, T, V, W, X.
Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}.
Are you able to identify errors in the L2 list?
What is C3?
How do we prune C3?
C3 is {A, C, P}, {E, P, V}, {Q, S, X}.
Hashing
The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and to reduce the size of the transaction database. When generating L1, the algorithm also generates all the 2-itemsets in each transaction, hashes them to a hash table, and keeps a count for each bucket.
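A minimal sketch of this hashing idea, under simple assumptions: each transaction's 2-itemsets are hashed into a small number of buckets and the bucket counts are accumulated; a pair can only be frequent if its bucket count reaches the minimum support count. The transactions below are borrowed from the table in the transaction-reduction example later in these slides purely as sample data, and the hash function, eight buckets and minimum count of 3 are illustrative choices, not the slide's.

```python
from itertools import combinations

transactions = [
    {"B", "M", "T", "Y"}, {"B", "M"}, {"T", "S", "P"}, {"A", "B", "C", "D"},
    {"A", "B"}, {"T", "Y", "E"}, {"A", "B", "M"}, {"B", "C", "D", "T", "P"},
    {"D", "T", "S"}, {"A", "B", "M"},
]
min_count = 3
num_buckets = 8

def bucket(pair):
    a, b = sorted(pair)
    # Arbitrary illustrative hash; DHP uses some such order-based formula.
    return (ord(a) * 10 + ord(b)) % num_buckets

# Accumulate the bucket count of every 2-itemset of every transaction.
# In DHP this happens during the same database scan that counts items for L1.
bucket_counts = [0] * num_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

# A candidate pair is kept for C2 only if its bucket count could support min_count.
def survives_hash_filter(pair):
    return bucket_counts[bucket(pair)] >= min_count

print(bucket_counts)
print(survives_hash_filter(("A", "B")))
```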
Example
Consider the transaction database in the first table below used
in an earlier example. The second table below shows all possible
2-itemsets for each transaction.
Hashing Example
The possible 2-itemsets from the last table are now hashed to the hash table shown below. The last column of the table is not required in the hash table, but we have included it to explain the technique.
Find C2
The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough to keep collisions rare, since collisions reduce the effectiveness of the hash table. This is what happened in the example: we had collisions in three of the eight rows of the hash table, which required us to check which pair in each of those rows was actually frequent.
Transaction Reduction
As discussed earlier, any transaction that does
not contain any frequent k-itemsets cannot
contain any frequent (k+1)-itemsets and such a
transaction may be marked or removed.
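A minimal sketch of this reduction, assuming the frequent k-itemsets are already known: any transaction that contains none of them is dropped before the next pass. The data and names below are illustrative.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions containing at least one frequent k-itemset."""
    frequent_k_itemsets = [set(s) for s in frequent_k_itemsets]
    return [t for t in transactions
            if any(s <= set(t) for s in frequent_k_itemsets)]

# Illustrative data: with frequent pairs {A, B} and {B, M}, transactions that
# contain neither pair cannot contribute to any frequent 3-itemset.
transactions = [{"B", "M", "T"}, {"T", "S", "P"}, {"A", "B", "C"}, {"D", "T"}]
print(reduce_transactions(transactions, [{"A", "B"}, {"B", "M"}]))
```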
Example
Frequent items (L1) are A, B, D, M, T. We are not able to use these to eliminate any transactions, since every transaction contains at least one of the items in L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce the transactions using these?

TID   Items bought
001   B, M, T, Y
002   B, M
003   T, S, P
004   A, B, C, D
005   A, B
006   T, Y, E
007   A, B, M
008   B, C, D, T, P
009   D, T, S
010   A, B, M
Partitioning
The set of transactions may be divided into a number of disjoint subsets, and each partition is then searched for frequent itemsets. These frequent itemsets are called local frequent itemsets.
How can information about local itemsets be used in finding frequent itemsets of the global set of transactions?
In the example on the next slide, we have divided a set of transactions into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?
Example
(The two partitions of the transaction database, shown as tables on the original slide, are not reproduced here.)
Partitioning
Phase 1
Divide n transactions into m partitions
Find the frequent itemsets in each partition
Combine all local frequent itemsets to form
candidate itemsets
Phase 2
Find global frequent itemsets
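A minimal two-phase sketch under simple assumptions: each partition is mined with a proportionally scaled support count, the local frequent itemsets are unioned into a global candidate set, and a single second pass over all transactions counts those candidates. The naive in-memory miner, the data and all names are illustrative, not from the slides.

```python
import math
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=2):
    """Naive local miner: count all itemsets up to max_size (adequate for a sketch)."""
    counts = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for items in combinations(sorted(t), size):
                counts[frozenset(items)] = counts.get(frozenset(items), 0) + 1
    return {s for s, c in counts.items() if c >= min_count}

def partitioned_frequent_itemsets(transactions, min_support, num_partitions=2):
    n = len(transactions)
    size = math.ceil(n / num_partitions)
    partitions = [transactions[i:i + size] for i in range(0, n, size)]

    # Phase 1: local frequent itemsets from each partition form the global candidates.
    candidates = set()
    for part in partitions:
        candidates |= frequent_itemsets(part, math.ceil(min_support * len(part)))

    # Phase 2: one scan of the whole database counts the candidates.
    global_min = math.ceil(min_support * n)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= set(t)) >= global_min}

transactions = [{"A", "B"}, {"B", "M"}, {"A", "B", "M"}, {"T", "S"}, {"A", "B"}, {"B", "T"}]
print(partitioned_frequent_itemsets(transactions, min_support=0.5))
```

Any itemset frequent in the whole database must be frequent in at least one partition, so the union of the local frequent itemsets is guaranteed to contain every global frequent itemset.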
Sampling
A random sample (usually chosen so that it fits in main memory) may be obtained from the overall set of transactions, and the sample is searched for frequent itemsets. These frequent itemsets are called sample frequent itemsets.
How can information about sample itemsets be
used in finding frequent itemsets of the global set
of transactions?
Sampling
This is not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing frequent itemsets.
The actual frequencies of the sample frequent
itemsets are then obtained.
More than one sample could be used to improve
accuracy.
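A minimal sketch of this approach under simple assumptions: a random sample is drawn, mined with a lowered threshold (the mining step itself is omitted and its result is simply assumed below), and the sample frequent itemsets are then counted against the full database to obtain their actual frequencies. All data and thresholds here are illustrative.

```python
import math
import random

transactions = [{"A", "B"}, {"B", "M"}, {"A", "B", "M"}, {"T", "S"}, {"A", "B"}, {"B", "T"}]
min_support = 0.5

# Draw a random sample small enough to fit in main memory.
sample = random.sample(transactions, k=4)

# Suppose mining the sample with a lowered threshold produced these itemsets.
sample_frequent = [{"A"}, {"B"}, {"A", "B"}, {"B", "M"}]

# Obtain the actual frequencies of the sample frequent itemsets from the full database.
min_count = math.ceil(min_support * len(transactions))
actual = {frozenset(s): sum(1 for t in transactions if s <= t) for s in sample_frequent}
print({s: c for s, c in actual.items() if c >= min_count})
```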
Subsets of ABCD
ABCD
ABC, ABD, ACD, BCD
AB, AC, AD, BC, BD, CD
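For reference, a short snippet (not from the slides) that enumerates the same subsets, from the full itemset down to the pairs, using itertools.

```python
from itertools import combinations

items = "ABCD"
subsets = [set(c) for size in range(len(items), 1, -1)
           for c in combinations(items, size)]
print(subsets)  # ABCD, its four 3-item subsets and its six 2-item subsets
```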
Closed Frequent Itemsets
Maximal Frequent Itemsets
Performance Evaluation of Algorithms
The FP-growth method was usually better than the best implementation of the Apriori algorithm.
CHARM was also usually better than Apriori. In some cases, CHARM was better than the FP-growth method.
Apriori was generally better than the other algorithms when the required support was high, since high support leads to a smaller number of frequent itemsets, which suits the Apriori algorithm.
At very low support, the number of frequent itemsets became large and none of the algorithms was able to handle large frequent sets gracefully.