
CHAPTER II : ASSOCIATION RULES

In this chapter we present one of the most commonly used models in data mining: association
rules.

2.1 INTRODUCTION:
The association rules model is one of the most important and well-researched models in data
mining. It was first introduced by Srikant and Agrawal in 1995 [8]. It aims to extract
interesting correlations, associations or causal structures from huge sets of data.
Association rules are widely used in areas such as market basket analysis, medical diagnosis,
telecommunication networks, risk management and inventory control [9].
Many companies accumulate voluminous amounts of data in their IT systems as a result of
day-to-day operations. For example, hypermarkets collect large volumes of data on consumer
purchases. Table 2.1 illustrates this type of data (note that we are interested in the list of
products purchased, not in quantities or prices).
Table 2.1 Example of shopping basket
TIQ Items
1 { Bread, Milk}
2 { Bread, Chocolate, Juice, Eggs}
3 { Milk, Chocolate, Juice, Lemonade }
4 { Bread, Milk, Chocolate, Juice }
5 { Bread, Milk, Chocolate, Lemonade }

Each line corresponds to a transaction and includes the ticket number and the list of products
purchased. Commercial companies are interested in analysing this type of data to gain a better
understanding of their customers' purchasing behaviour. For example, in a hypermarket,
decision-makers may want to know which products are often bought together by customers.
The association rules model can answer this question. It is recommended for problems
involving the search for hidden relationships between data in large databases. For example,
the following rule can be extracted from the table above:
{Bread} → {Milk}

This rule suggests that there is a strong relationship between the sale of bread and milk:
bread and milk are bought together in 3 of the 5 transactions.

2.2 BASIC CONCEPTS


The data in the shopping basket (previous example) can be presented in binary form:

Table 2.2 Shopping basket data in binary representation
TIQ Bread Milk Chocolate Juice Eggs Lemonade
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1

For the rest of the chapter, we will use the following notation: let I = {i1, i2, ..., id} be the set
of all items in the baskets, and let T = {t1, t2, ..., tn} be the set of transactions.

2.2.1 Transaction Volume


Definition 1 : Volume
The volume of a transaction is the number of items it contains.

For example, transaction no. 1 has a volume of 2, while all the others have a volume of 4.

2.2.2 Support of an Item Set


Definition 2 : Support of an item set
The support of a set of items X, written Supp(X), is the number of transactions containing X.

For example, the support of the set {Juice, Chocolate, Milk} is equal to 2 (transactions 3 and 4).

2.2.3 Association Rule


Definition 3 : Association rule
An association rule is an implication of the form X → Y, where X and Y are disjoint sets of
items. Note that an association rule denotes co-occurrence, not causality.

2.2.4 Support and Confidence of an association rule


The strength of an association rule is measured by its support and confidence. These two
values are calculated using the following expressions.
Definition 4 : Support of a rule
The support of a rule: Supp(X → Y) = Supp(X ∪ Y) / N, where N is the number of
transactions.

Definition 5 : Confidence of a rule


The confidence of a rule: Conf(X → Y) = Supp(X ∪ Y) / Supp(X)

Example: Consider the rule {Milk, Chocolate} → {Juice}.


The support of the set {Milk, Chocolate, Juice} is equal to 2.
The support of the rule is therefore 2/5, i.e. 0.4, or 40%.
The support of {Milk, Chocolate} is equal to 3.
The confidence of this rule is therefore 2/3, or 66.67%.
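These definitions translate directly into code. The following minimal Python sketch (the
helper names supp, rule_support and rule_confidence are ours, not part of the chapter)
recomputes the example above on the Table 2.1 data:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Chocolate", "Juice", "Eggs"},
    {"Milk", "Chocolate", "Juice", "Lemonade"},
    {"Bread", "Milk", "Chocolate", "Juice"},
    {"Bread", "Milk", "Chocolate", "Lemonade"},
]

def supp(itemset):
    # Definition 2: number of transactions containing the whole item set
    return sum(1 for t in transactions if itemset <= t)

def rule_support(X, Y):
    # Definition 4: Supp(X -> Y) = Supp(X u Y) / N
    return supp(X | Y) / len(transactions)

def rule_confidence(X, Y):
    # Definition 5: Conf(X -> Y) = Supp(X u Y) / Supp(X)
    return supp(X | Y) / supp(X)

X, Y = {"Milk", "Chocolate"}, {"Juice"}
print(rule_support(X, Y))     # 0.4, i.e. 40%
print(rule_confidence(X, Y))  # 0.666..., i.e. 66.67%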

Intuitively, a "good" association rule is one with both high support and high confidence.
Indeed, low support means that the rule is rarely observed in the data; this is why support is
often used to eliminate uninteresting rules.

2.3 SEARCHING FOR ASSOCIATION RULES


The problem of finding association rules in a database can be formulated as follows: given a
set of transactions T, find all the rules with support >= minsup and confidence >= minconf,
where minsup and minconf are user-defined thresholds for support and confidence.

2.3.1 Basic method:


A basic (brute-force) solution to the problem presented above is to:
1/ Enumerate all candidate association rules.
2/ Select those whose support and confidence are greater than or equal to the chosen
thresholds.
The computational cost of this approach is high. For a set of d items, the total number of
possible rules is:

R = 3^d - 2^(d+1) + 1 (1)

If d = 6 (the previous shopping basket example), we have R = 602. But if we apply thresholds
minsup = 20% and minconf = 50%, we can eliminate more than 80% of the rules (keeping
under 20% of them, i.e. about 120 association rules).
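As a quick sanity check of formula (1), the short Python sketch below (the function name
count_rules is ours) counts the rules X → Y with X and Y non-empty and disjoint, and
compares the result with the formula:

from math import comb

def count_rules(d):
    # Choose a non-empty antecedent X of size k, then any non-empty
    # subset of the d - k remaining items as the consequent Y.
    return sum(comb(d, k) * (2 ** (d - k) - 1) for k in range(1, d + 1))

d = 6
assert count_rules(d) == 3**d - 2**(d + 1) + 1 == 602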
A first step towards improving the performance of a rule search algorithm is to separate the
requirements on support and confidence. The definition of support shows that the support of a
rule X → Y depends only on the support of its item set X ∪ Y. For example, the following
rules all have the same support because they are derived from the same set
{Juice, Chocolate, Milk}:
{Juice, Chocolate} → {Milk}
{Juice, Milk} → {Chocolate}
{Juice} → {Milk, Chocolate}
{Milk} → {Juice, Chocolate}
If this item set is infrequent, then all the rules derived from it are infrequent as well. They can
therefore be eliminated even before their confidence is calculated.
A better strategy for finding association rules is therefore to split the process into two stages:
computing the frequent item sets, then generating rules from the sets found.
1. Frequent item set generation: the aim is to find all the item sets that satisfy the minsup
threshold.
2. Rule generation: the objective is to extract all high-confidence rules from the frequent item
sets found in the previous step. These rules are called strong rules.
This is the strategy applied by the Apriori algorithm (see next section).

2.3.2 Apriori algorithm
The Apriori algorithm (Agrawal & Srikant, 1995) [8] reduces the number of candidate sets
considered during the generation of frequent item sets. It is based on the following principle:
if a set of items is frequent, then all its subsets are also frequent. Conversely, if a set {a, b} is
infrequent, then its supersets (i.e. the sets of items containing {a, b}) are also infrequent. So
if we know that {a, b} is infrequent, we can eliminate a priori all the sets containing it. This
strategy is called support-based pruning.
This strategy is made possible by a key (anti-monotonicity) property of the support measure:
the support of a set of items is never greater than the support of any of its subsets.
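To illustrate support-based pruning, here is a minimal Python sketch (the function name and
data layout are ours): a candidate of size k is kept only if every one of its subsets of size
k - 1 is already known to be frequent.

from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    # Keep a candidate only if all of its (k-1)-subsets are frequent
    return [c for c in candidates
            if all(frozenset(sub) in frequent_prev
                   for sub in combinations(c, len(c) - 1))]

# Using the F2 sets of the worked example in section 2.3.3:
F2 = {frozenset(p) for p in [("F", "E"), ("A", "H"), ("A", "S"), ("H", "S")]}
C3 = [frozenset(s) for s in [("A", "H", "S"), ("F", "E", "A")]]
print(prune_candidates(C3, F2))
# {F, E, A} is discarded because its subset {F, A} is not frequent;
# only {A, H, S} survives and needs its support counted.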

2.3.3 Frequent Item Sets Generation


To introduce the Apriori algorithm, let us use the following notation: let Ck be the set of
candidate item sets of size k, and Fk the set of frequent item sets of size k.
The algorithm begins by determining the support of each individual item (initialisation of
F1). It then iteratively generates the frequent item sets of size k from the frequent item sets of
size k-1 obtained in the previous step.
Algorithm 2.1 Apriori, Part 1 : Frequent item sets generation
Algorithm Frequent item sets generation
Input : Data set T, support threshold minsup
Output : Frequent item sets
Begin
//Initialisation
k := 1;
F1 := { {i} / i ∈ I and Supp({i}) >= N x minsup };
//Iteration
k := 2;
While Fk-1 <> ∅
Do begin
Ck := Generate(Fk-1); //build the candidate sets of size k from Fk-1
For each transaction t ∈ T
Do begin
Ct := Extract(Ck, t); //extract the candidate sets of Ck contained in t
For each element c ∈ Ct
Do Supp(c) := Supp(c) + 1;
End;
Fk := { c / c ∈ Ck and Supp(c) >= N x minsup };
k := k + 1;
End;
Result := U Fk //union of all the Fk
End.
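To make the pseudocode concrete, here is a compact Python sketch of this first part. The
naming is ours, and generating candidates by pairwise unions of sets in Fk-1 is one common
way to realise the Generate step, not necessarily the only one:

from itertools import combinations
from collections import defaultdict

def frequent_itemsets(transactions, minsup):
    # Return every frequent item set with its support (a fraction of N)
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Initialisation: F1 = frequent individual items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s for s, c in counts.items() if c / n >= minsup}
    result = {s: counts[s] / n for s in current}

    k = 2
    while current:
        # Generate Ck by joining Fk-1 with itself ...
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # ... then apply support-based pruning
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Count supports in one pass over the transactions
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c for c in candidates if counts[c] / n >= minsup}
        result.update({c: counts[c] / n for c in current})
        k += 1
    return result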

Let's execute the algorithm on the following example with minsup=40%.

N° items
01 F, M, B, E
02 O, F, E
03 O, A, H, S
04 D, B, F, E
05 D, A, H, S
06 O, M, E
07 O, A, D, H, S

Iteration 1 : Construction of sets of frequent items of size 1: F1


At this iteration, we take each item separately and calculate its support. We can see that there
are 7 frequent item sets (F1) which meet the criterion supp >= 40%.
C1 : Candidate item sets of size 1
Set Support
{F} 3/7 = 42.8%
{M} 2/7 = 28.6%
{B} 2/7 = 28.6%
{E} 4/7 = 57.1%
{O} 4/7 = 57.1%
{A} 3/7 = 42.8%
{H} 3/7 = 42.8%
{S} 3/7 = 42.8%
{D} 3/7 = 42.8%

F1 : Frequent item sets of size 1
{F}, {E}, {O}, {A}, {H}, {S}, {D}

Iteration 2 : Construction of sets of frequent items of size 2: F2


At this stage, we take the result of the previous iteration (F1) and construct all the possible
item sets of size 2. By calculating the support of each of them, we find that there are 4
frequent item sets (which verify the criterion supp >= 40%).

C2 : Candidate item sets of size 2
Set Support
{F, E} 3/7 = 42.8%
{F, O} 1/7 = 14.3%
{F, A} 0%
{F, H} 0%
{F, S} 0%
{F, D} 1/7 = 14.3%
{E, O} 2/7 = 28.6%
{E, A} 0%
{E, H} 0%
{E, S} 0%
{E, D} 1/7 = 14.3%
{O, A} 2/7 = 28.6%
{O, H} 2/7 = 28.6%
{O, S} 2/7 = 28.6%
{O, D} 1/7 = 14.3%
{A, H} 3/7 = 42.8%
{A, S} 3/7 = 42.8%
{A, D} 2/7 = 28.6%
{H, S} 3/7 = 42.8%
{H, D} 2/7 = 28.6%
{S, D} 2/7 = 28.6%

F2 : Frequent item sets of size 2
{F, E}, {A, H}, {A, S}, {H, S}

Iteration 3 : Construction of sets of frequent items of size 3: F3


All possible sets of size 3 are constructed from the sets in F2 found in the previous step. We
calculate the support of each set. The result is a single frequent item set.

C3 : Candidate item sets of size 3
Set Support
{F, E, A} 0%
{F, E, H} 0%
{F, E, S} 0%
{A, H, F} 0%
{A, H, E} 0%
{A, H, S} 3/7 = 42.8%
{A, S, F} 0%
{A, S, E} 0%
{H, S, F} 0%
{H, S, E} 0%

F3 : Frequent item sets of size 3
{A, H, S}

Iteration 4 : Construction of sets of frequent items of size 4: F4

No sets of size 4 can be constructed, so this part of the algorithm stops here.
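Running the frequent_itemsets sketch given after Algorithm 2.1 on this data set reproduces
these results:

transactions = [
    {"F", "M", "B", "E"}, {"O", "F", "E"}, {"O", "A", "H", "S"},
    {"D", "B", "F", "E"}, {"D", "A", "H", "S"}, {"O", "M", "E"},
    {"O", "A", "D", "H", "S"},
]
freq = frequent_itemsets(transactions, minsup=0.4)
print(sorted(freq, key=len))
# 7 singletons, then {F,E}, {A,H}, {A,S}, {H,S}, and finally {A,H,S}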

2.3.4 Association Rules Generation


From the sets of frequent items obtained in the previous step, we can construct association
rules with a confidence greater than or equal to minconf. Here is the algorithm for this step:

Algorithm 2.2 Apriori, Part 2 : Association rules generation
Algorithm Association rules generation
Input : Frequent item sets (with their supports), the confidence threshold minconf.
Output : The set of association rules whose confidence is greater than or equal to minconf
Begin
For each frequent item set L (of size >= 2)
Do Begin
Generate all non-empty proper subsets of L;
For each such subset s
Do begin
Generate the rule s → (L – s);
Keep the rule if Conf(s → (L – s)) = Supp(L) / Supp(s) >= minconf;
End;
End
End.
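A minimal Python sketch of this second part (it reuses the support table returned by the
frequent_itemsets sketch above; the names are ours):

from itertools import combinations

def generate_rules(freq, minconf):
    # freq maps each frequent item set to its support (a fraction of N)
    rules = []
    for L, supp_L in freq.items():
        if len(L) < 2:
            continue
        # Every non-empty proper subset s of L is a candidate antecedent
        for size in range(1, len(L)):
            for s in combinations(L, size):
                s = frozenset(s)
                # Conf(s -> L - s) = Supp(L) / Supp(s); freq[s] exists
                # because every subset of a frequent set is frequent
                conf = supp_L / freq[s]
                if conf >= minconf:
                    rules.append((set(s), set(L - s), conf))
    return rules

rules = generate_rules(freq, minconf=0.8)
print(len(rules))  # 13, matching the tables below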

Let's apply this algorithm to our example with minconf=80%.


Association rules extracted from F2 (the frequent item sets of size 2) :
Rule Confidence
{E} → {F} 75%
{F} → {E} 100%
{A} → {H} 100%
{H} → {A} 100%
{A} → {S} 100%
{S} → {A} 100%
{H} → {S} 100%
{S} → {H} 100%

Association rules extracted from F3 (the frequent item sets of size 3) :

Rule Confidence
{A} → {H, S} 100%
{H} → {A, S} 100%
{S} → {A, H} 100%
{A, H} → { S } 100%
{A, S} → { H } 100%
{H, S} → { A } 100%

We can therefore see that 13 of the extracted association rules meet the confidence criterion
(>= 80%); only {E} → {F}, at 75%, is discarded.

CONCLUSION OF THE CHAPTER


In this chapter, we have introduced one of the most commonly used models in data mining:
association rules. We have presented its basic concepts (association rule, support,
confidence), discussed the problem of finding association rules, and presented the Apriori
algorithm together with a worked example.

EXERCISES
…….

