
Association Rule Mining

Part 2
(under construction!)

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.

Bigger Example


Frequency of Items


Frequent Items
Assume 25% support. With 25 transactions, a frequent item must occur in at least 7 transactions. The frequent 1-itemsets, or L1, are given below. How many candidates are in C2? List them.
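To make the counting step concrete, here is a minimal Python sketch of how L1 can be computed; the transaction list is a made-up placeholder, not the table from the slides.

import math
from collections import Counter

# Hypothetical transactions; the actual table is shown on the slide.
transactions = [
    {"B", "M", "T"}, {"A", "B"}, {"A", "B", "M"}, {"T", "S"},
    {"B", "M"}, {"A", "C"}, {"B", "T"}, {"A", "B", "M"},
]

min_support = 0.25
# With 25 transactions this would be ceil(6.25) = 7.
min_count = math.ceil(min_support * len(transactions))

# Count each item once for every transaction it appears in.
item_counts = Counter(item for t in transactions for item in t)

# L1: the frequent 1-itemsets.
L1 = sorted(item for item, count in item_counts.items() if count >= min_count)
print(L1)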


L2
The following pairs are frequent. Now find C3, and then L3 and the rules.


Rules
The full set of rules is given below. Could some rules be removed?

Comment: Study the above rules carefully.


Improving the Apriori Algorithm


Many techniques have been proposed for improving its efficiency:
Pruning (already mentioned)
Hashing-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting


Pruning
Pruning can reduce the size of the candidate set Ck. We want to transform Ck into the set of frequent itemsets Lk. To reduce the work of checking, we may use the rule that all subsets of a candidate in Ck must also be frequent.


Example
Suppose the items are A, B, C, D, E, F, ..., X, Y, Z.
Suppose L1 is {A, C, E, P, Q, S, T, V, W, X}.
Suppose L2 is {A, C}, {A, F}, {A, P}, {C, P}, {E, P}, {E, G}, {E, V}, {H, J}, {K, M}, {Q, S}, {Q, X}.
Are you able to identify errors in the L2 list?
What is C3?
How to prune C3?
C3 is {A, C, P}, {E, P, V}, {Q, S, X}.
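As a rough Python sketch of how this C3 is generated from the corrected L2 above (the pairs containing items outside L1 removed) and then pruned:

from itertools import combinations

# L2 after removing the erroneous pairs (those containing items not in L1).
L2 = [("A", "C"), ("A", "P"), ("C", "P"), ("E", "P"), ("E", "V"), ("Q", "S"), ("Q", "X")]
L2_set = set(L2)

# Join step: combine pairs that share their first item to form candidate triples.
C3 = set()
for a in L2:
    for b in L2:
        if a < b and a[0] == b[0]:
            C3.add(tuple(sorted(set(a) | set(b))))
print(sorted(C3))   # {A, C, P}, {E, P, V}, {Q, S, X}

# Prune step: keep a candidate only if every 2-item subset is in L2.
pruned = [c for c in sorted(C3) if all(pair in L2_set for pair in combinations(c, 2))]
print(pruned)       # only {A, C, P} survives: {P, V} and {S, X} are not frequent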


Hashing
The direct hashing and pruning (DHP) algorithm attempts to generate large itemsets efficiently and to reduce the size of the transaction database. When generating L1, the algorithm also generates all the 2-itemsets for each transaction, hashes them into a hash table and keeps a count for each bucket.


Example
Consider the transaction database in the first table below used
in an earlier example. The second table below shows all possible
2-itemsets for each transaction.


Hashing Example
The possible 2-itemsets in the last table are now hashed to the hash table below. The last column shown in the table is not required in the hash table, but we have included it to explain the technique.


Hash Function Used


For each pair, a numeric value is obtained by first representing B by 1, C by 2, E by 3, J by 4, M by 5 and Y by 6. Each pair can then be represented by a two-digit number, for example (B, E) by 13 and (C, M) by 25.
The two-digit number is then reduced modulo 8 (dividing by 8 and taking the remainder). This is the bucket address.
A count of the number of pairs hashed to each bucket is kept. Buckets whose count reaches the support count have their bit set to 1, otherwise 0.
All pairs in rows (buckets) that have a zero bit are removed.
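A minimal sketch of this bucket counting, using the hash function described above; the transactions and the support count are placeholders, not the earlier table.

from itertools import combinations

code = {"B": 1, "C": 2, "E": 3, "J": 4, "M": 5, "Y": 6}

def bucket(pair):
    # Two-digit code for the pair, e.g. (B, E) -> 13, then modulo 8 gives the bucket.
    a, b = sorted(pair, key=code.get)
    return (10 * code[a] + code[b]) % 8

# Placeholder transactions.
transactions = [{"B", "C", "E"}, {"B", "M", "Y"}, {"C", "E", "J"}, {"B", "E"}]
min_count = 2

# Hash every 2-itemset of every transaction and count per bucket.
bucket_counts = [0] * 8
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

# Bit vector: 1 for buckets whose count reaches the support count, else 0.
bit_vector = [1 if c >= min_count else 0 for c in bucket_counts]

# A pair is kept as a C2 candidate only if it falls in a bucket with bit 1
# (the full DHP algorithm also requires both of its items to be in L1).
C2 = sorted({pair for t in transactions for pair in combinations(sorted(t), 2)
             if bit_vector[bucket(pair)] == 1})
print(bucket_counts, bit_vector, C2)

Note how a bucket collision can let an infrequent pair slip into C2; this is exactly the issue discussed on the next slide.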

Find C2
The major aim of the algorithm is to reduce the size of C2. It is therefore essential that the hash table is large enough that collisions are rare. Collisions reduce the effectiveness of the hash table. This is what happened in the example, where collisions in three of the eight rows of the hash table required us to work out which of the colliding pairs were actually frequent.


Transaction Reduction
As discussed earlier, any transaction that does
not contain any frequent k-itemsets cannot
contain any frequent (k+1)-itemsets and such a
transaction may be marked or removed.


Example
Frequent items (L1) are A, B, D, M, T. We are not able to use these to eliminate any transactions, since every transaction contains at least one of the items in L1. The frequent pairs (L2) are {A, B} and {B, M}. How can we reduce transactions using these? A short sketch of this reduction follows the table.
TID    Items bought
001    B, M, T, Y
002    B, M
003    T, S, P
004    A, B, C, D
005    A, B
006    T, Y, E
007    A, B, M
008    B, C, D, T, P
009    D, T, S
010    A, B, M
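A sketch of the reduction in Python, using the table above and the frequent pairs {A, B} and {B, M} (the table is typed in by hand here):

# Transactions from the table above.
transactions = {
    "001": {"B", "M", "T", "Y"}, "002": {"B", "M"}, "003": {"T", "S", "P"},
    "004": {"A", "B", "C", "D"}, "005": {"A", "B"}, "006": {"T", "Y", "E"},
    "007": {"A", "B", "M"}, "008": {"B", "C", "D", "T", "P"},
    "009": {"D", "T", "S"}, "010": {"A", "B", "M"},
}

# Frequent pairs (L2).
L2 = [{"A", "B"}, {"B", "M"}]

# A transaction containing no frequent 2-itemset cannot contain a frequent
# 3-itemset, so it can be dropped (or marked) before the next pass.
reduced = {tid: items for tid, items in transactions.items()
           if any(pair <= items for pair in L2)}

print(sorted(set(transactions) - set(reduced)))   # 003, 006, 008, 009 are removed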


Partitioning
The set of transactions may be divided into a
number of disjoint subsets. Then each partition is
searched for frequent itemsets. These frequent
itemsets are called local frequent itemsets.
How can information about local itemsets be
used in finding frequent itemsets of the global
set of transactions?
In the example on the next slide, we have divided a set of transactions into two partitions. Find the frequent itemsets for each partition. Are these local frequent itemsets useful?


Example

[Transaction tables for the two partitions]

Partitioning
Phase 1
Divide n transactions into m partitions
Find the frequent itemsets in each partition
Combine all local frequent itemsets to form
candidate itemsets
Phase 2
Find global frequent itemsets
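A rough sketch of the two phases in Python; the partitions, the support threshold and the brute-force itemset finder are all stand-ins for illustration only.

import math
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_size=3):
    # Brute-force frequent itemset finder; fine for a tiny example.
    items = sorted(set().union(*transactions))
    found = set()
    for k in range(1, max_size + 1):
        for cand in combinations(items, k):
            if sum(1 for t in transactions if set(cand) <= t) >= min_count:
                found.add(cand)
    return found

# Hypothetical transactions already split into two partitions.
partitions = [
    [{"A", "B"}, {"A", "B", "C"}, {"B", "C"}],
    [{"A", "C"}, {"A", "B", "C"}, {"A", "B"}],
]
min_support = 0.5

# Phase 1: local frequent itemsets in each partition form the candidate set.
# (Any globally frequent itemset must be locally frequent in at least one partition.)
candidates = set()
for part in partitions:
    candidates |= frequent_itemsets(part, math.ceil(min_support * len(part)))

# Phase 2: one scan of the full database to count every candidate globally.
all_t = [t for part in partitions for t in part]
global_min = math.ceil(min_support * len(all_t))
frequent = sorted(c for c in candidates
                  if sum(1 for t in all_t if set(c) <= t) >= global_min)
print(frequent)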


Sampling
A random sample (usually large enough to fit in
the main memory) may be obtained from the
overall set of transactions and the sample is
searched for frequent itemsets. These frequent
itemsets are called sample frequent itemsets.
How can information about sample itemsets be
used in finding frequent itemsets of the global set
of transactions?


Sampling
Sampling is not guaranteed to be accurate, but we sacrifice accuracy for efficiency. A lower support threshold may be used for the sample to reduce the chance of missing any frequent itemsets.
The actual frequencies of the sample frequent
itemsets are then obtained.
More than one sample could be used to improve
accuracy.
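A short sketch of the sampling approach, reusing the frequent_itemsets helper from the partitioning sketch above; the transactions and thresholds are placeholders.

import math
import random

# Placeholder transaction database.
transactions = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"B"}] * 20
min_support = 0.4
lowered_support = 0.3   # lower threshold on the sample to reduce the risk of misses

# Mine a random sample that fits in memory (frequent_itemsets as defined earlier).
sample = random.sample(transactions, 30)
sample_frequent = frequent_itemsets(sample, math.ceil(lowered_support * len(sample)))

# Verify the sample-frequent itemsets with one scan of the full database.
global_min = math.ceil(min_support * len(transactions))
confirmed = sorted(c for c in sample_frequent
                   if sum(1 for t in transactions if set(c) <= t) >= global_min)
print(confirmed)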


Problems with Association Rule Algorithms
Users are overwhelmed by the number of rules identified; how can the number of rules be reduced to those that are relevant to the user's needs?
The Apriori algorithm assumes sparsity, since the number of items in each record is quite small. Some applications produce dense data, which may also have
many frequently occurring items
strong correlations
many items in each record

Problems with Association Rules


Also consider:
AB → C (90% confidence)
and A → C (92% confidence)
Clearly the first rule is of no use. We should look for more complex rules only if they are better than simple rules.
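A small sketch of the check behind this, with made-up support counts: confidence(X → C) = support(X ∪ {C}) / support(X), and the longer rule is only worth keeping if its confidence beats the simpler rule with the same consequent.

# Made-up support counts chosen to reproduce the figures above.
support = {
    frozenset("A"): 100,
    frozenset("AB"): 50,
    frozenset("AC"): 92,
    frozenset("ABC"): 45,
}

def confidence(antecedent, consequent):
    a = frozenset(antecedent)
    return support[a | frozenset(consequent)] / support[a]

print(confidence("A", "C"))     # 0.92
print(confidence("AB", "C"))    # 0.90
# A -> C already has higher confidence, so AB -> C adds nothing and can be dropped.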


Top Down Approach


Algorithms considered so far were bottom up, i.e. they started by looking at each frequent item, then each pair, and so on.
Is it possible to design top-down algorithms that consider the largest group of items first and then find the smaller groups? Let us first look at the itemset ABCD, which can be frequent only if all of its subsets are frequent.


Subsets of ABCD

ABCD
3-itemsets: ABC, ABD, ACD, BCD
2-itemsets: AB, AC, AD, BC, BD, CD

Closed and Maximal Itemsets


A frequent closed itemset is a frequent itemset X such that no proper superset of X has the same support count as X.
A frequent itemset Y is maximal if it is not a proper subset of any other frequent itemset.
Therefore a maximal frequent itemset is also a closed itemset, but a closed itemset is not necessarily maximal.
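A sketch of the two definitions applied to a small made-up collection of frequent itemsets and their support counts (the counts echo the example on a later slide):

# Made-up frequent itemsets with their support counts.
frequent = {
    frozenset("B"): 10, frozenset("C"): 9, frozenset("D"): 9,
    frozenset("BC"): 8, frozenset("BD"): 8, frozenset("CD"): 9,
    frozenset("BCD"): 8,
}

def frequent_supersets(x):
    return [y for y in frequent if x < y]

# Closed: no frequent proper superset has the same support count.
closed = [x for x in frequent
          if all(frequent[y] < frequent[x] for y in frequent_supersets(x))]

# Maximal: no frequent proper superset at all.
maximal = [x for x in frequent if not frequent_supersets(x)]

print([set(x) for x in closed])    # {B}, {C, D} and {B, C, D} are closed
print([set(x) for x in maximal])   # {B, C, D} is the only maximal itemset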


Closed and Maximal Itemsets


The frequent maximal itemsets uniquely determine all frequent itemsets. Therefore the aim of an association rule algorithm is to find all the maximal frequent itemsets.


Closed and Maximal Itemsets


In the earlier example, we found that {B, D} and {B, C, D} had the same support of 8, while {C, D} had a support of 9. {C, D} is therefore a closed itemset but not maximal, since its superset {B, C, D} is also frequent. On the other hand, {B, C, D} is frequent and no superset of it is frequent. This itemset is therefore maximal as well as closed.


Closed and maximal itemsets


Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets


Performance Evaluation of Algorithms
The FP-growth method was usually better than the best implementation of the Apriori algorithm.
CHARM was also usually better than Apriori. In some cases, CHARM was better than the FP-growth method.
Apriori was generally better than the other algorithms if the required support was high, since high support leads to a smaller number of frequent items, which suits the Apriori algorithm.
At very low support, the number of frequent items became large and none of the algorithms was able to handle large frequent itemsets gracefully.

