DataMining_Chapter2
In this chapter we present one of the most commonly used models in data mining: association
rules.
2.1 INTRODUCTION:
The association rules model is one of the most important and well-researched models in data
mining. It was first introduced by Srikant and Agrawal in 1995 [8]. It aims to extract
interesting correlations, associations or causal structures from huge sets of data.
Association rules are widely used in various areas such as market basket analysis, medical
diagnosis, telecommunication networks, risk management, inventory control, etc. [9].
Many companies accumulate voluminous amounts of data in their IT systems as a result of
day-to-day operations. For example, hypermarkets collect large amounts of data on consumer
purchases. Table 2.1 gives an illustration of this type of data (note that we are interested in the
list of products and not in quantities or prices).
Table 2.1 Example of shopping basket
TID Items
1 {Bread, Milk}
2 {Bread, Chocolate, Juice, Eggs}
3 {Milk, Chocolate, Juice, Lemonade}
4 {Bread, Milk, Chocolate, Juice}
5 {Bread, Milk, Chocolate, Lemonade}
Each line corresponds to a transaction and includes the ticket number (TID) and the list of
products purchased. Commercial companies are interested in analysing this type of data to gain
a better understanding of their customers' purchasing behaviour. For example, in a hypermarket,
decision-makers may want to know which products are often bought together by customers.
The association rules model can answer this question. It is recommended for problems involving
the search for hidden relationships in large databases. For example, the following rule can be
extracted from the table above:
{Bread} → {Milk}
This rule suggests that there is a strong relationship between the sale of bread and milk:
3 of the 5 transactions contain both products, and 3 of the 4 customers who buy bread also buy milk.
Table 2.2 Shopping basket data in binary representation
TID Bread Milk Chocolate Juice Eggs Lemonade
1 1 1 0 0 0 0
2 1 0 1 1 1 0
3 0 1 1 1 0 1
4 1 1 1 1 0 0
5 1 1 1 0 0 1
For the rest of the chapter, we will use the following notation: let I = {i1, i2, ..., id} be the set
of all items in the baskets, and let T = {t1, t2, ..., tn} be the set of transactions, each transaction
being a subset of I. The volume (width) of a transaction is the number of items it contains.
For example, transaction no. 1 has a volume of 2, while all the others have a volume of 4.
The support of a set of items X is the number of transactions that contain X; it is often expressed
as a fraction of the total number of transactions. For example, the support of the set
{Juice, Chocolate, Milk} is equal to 2, since it is contained in transactions 3 and 4. The support of
a rule X → Y is the support of X ∪ Y, and its confidence is the fraction of transactions containing
X that also contain Y, i.e. support(X ∪ Y) / support(X).
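To make these definitions concrete, here is a minimal Python sketch that computes the volume of a transaction and the support count of an item set on the data of Table 2.1 (the helper names width and support_count are illustrative choices, not part of the chapter):

transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Chocolate", "Juice", "Eggs"},
    3: {"Milk", "Chocolate", "Juice", "Lemonade"},
    4: {"Bread", "Milk", "Chocolate", "Juice"},
    5: {"Bread", "Milk", "Chocolate", "Lemonade"},
}

def width(transaction):
    # Volume (width) of a transaction: the number of items it contains
    return len(transaction)

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the given item set
    return sum(1 for t in transactions.values() if itemset <= t)

print(width(transactions[1]))                                       # 2
print(support_count({"Juice", "Chocolate", "Milk"}, transactions))  # 2
print(support_count({"Bread", "Milk"}, transactions))               # 3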
Intuitively, a “good association rule” is one that has both high support and high confidence.
Indeed, low support means that the rule is rarely observed in the data.
This is why support is often used to eliminate uninteresting rules.
If d = 6 (the example of the previous shopping basket), the total number of possible rules is
R = 3^d − 2^(d+1) + 1 = 602. But if we apply the thresholds minsup = 20% and minconf = 50%,
we can eliminate more than 80% of the rules (so we keep about 20% of the rules, or nearly 120
association rules).
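The figure R = 602 can be checked with the short Python sketch below, which compares the closed-form count R = 3^d − 2^(d+1) + 1 with a brute-force enumeration over all possible antecedent/consequent splits (an illustrative check, not part of the chapter):

from itertools import combinations

def rule_count_formula(d):
    # Total number of possible rules over d items: R = 3^d - 2^(d+1) + 1
    return 3 ** d - 2 ** (d + 1) + 1

def rule_count_brute_force(d):
    # For every item set of size k, each of its 2^k - 2 non-empty proper
    # subsets can serve as the antecedent X, the rest being the consequent Y
    count = 0
    for k in range(1, d + 1):
        for itemset in combinations(range(d), k):
            count += 2 ** len(itemset) - 2
    return count

print(rule_count_formula(6))      # 602
print(rule_count_brute_force(6))  # 602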
A first step towards improving the performance of a rule search algorithm is to separate the
requirements on support and confidence. The definition of support shows that the support of a
rule X → Y depends only on the support of its item set (X ∪ Y). For example, the following
rules have the same support because they all come from the same set {Juice, Chocolate, Milk}:
{Juice, Chocolate} → {Milk}
{Juice, Milk} → {Chocolate}
{Juice} → {Milk, Chocolate}
{Milk} → {Juice, Chocolate}
If this set of items is infrequent, then all the rules derived from it are infrequent. They can
therefore be eliminated even before their confidence is calculated.
So a better strategy for finding association rules would be to split the process into two stages:
calculating the sets of frequent items and generating rules from the sets calculated.
1. Generation of frequent item sets: the aim is to find all the item sets that satisfy the minsup
threshold.
2. Rule generation: the objective is to extract all high-confidence rules from the frequent item
sets found in the previous step. These rules are called strong rules.
This is the strategy applied by the Apriori algorithm (see next section).
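The two-stage strategy can be summarised by the Python skeleton below (the function names apriori_frequent_itemsets and generate_rules are illustrative placeholders; possible implementations are sketched later in this chapter):

def mine_association_rules(transactions, minsup, minconf):
    # Stage 1: find all item sets whose support is at least minsup
    frequent = apriori_frequent_itemsets(transactions, minsup)
    # Stage 2: keep only the rules whose confidence is at least minconf
    return generate_rules(frequent, minconf)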
2.3.2 Apriori algorithm
The Apriori algorithm (Agrawal & Srikant, 1995) [8] reduces the number of candidate sets
considered during the generation of frequent item sets. It is based on the following principle:
if a set of items is frequent, then all its subsets are also frequent. Conversely, if a set {a, b} is
infrequent, then all its supersets are also infrequent (i.e. all the sets of items containing {a, b}).
So if we know that {a, b} is infrequent, we can eliminate a priori all the sets containing it. This
strategy is called support-based pruning.
This strategy is made possible by a key property of the support measure: the support of a set
of items is never greater than the support of any of its subsets (support is anti-monotone).
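As an illustration of support-based pruning, here is a minimal Python sketch of the frequent-item-set stage (a simplified level-wise implementation written for this text, not the authors' original algorithm):

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    # transactions: list of sets of items; minsup: minimum support as a fraction.
    # Returns a dictionary mapping each frequent item set to its support count.
    min_count = minsup * len(transactions)

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent individual items
    items = {item for t in transactions for item in t}
    counts = {frozenset([item]): support_count(frozenset([item])) for item in items}
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-sets into k-sets, and keep
        # a candidate only if all its (k-1)-subsets are frequent (pruning)
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        counts = {c: support_count(c) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent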
Let's execute the algorithm on the following example with minsup=40%.
N° items
01 F, M, B, E
02 O, F, E
03 O, A, H, S
04 D, B, F, E
05 D, A, H, S
06 O, M, E
07 O, A, D, H, S
Candidate 2-itemset Support
{O, A} 2/7 = 28.6%
{O, H} 2/7 = 28.6%
{O, S} 2/7 = 28.6%
{O, D} 1/7 = 14.3%
{A, H} 3/7 = 42.8%
{A, S} 3/7 = 42.8%
{A, D} 2/7 = 28.6%
{H, S} 3/7 = 42.8%
{H, D} 2/7 = 28.6%
{S, D} 2/7 = 28.6%
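Running the sketch above on this seven-transaction example with minsup = 40% reproduces these counts and keeps only the frequent sets (an illustrative check):

transactions = [
    {"F", "M", "B", "E"},
    {"O", "F", "E"},
    {"O", "A", "H", "S"},
    {"D", "B", "F", "E"},
    {"D", "A", "H", "S"},
    {"O", "M", "E"},
    {"O", "A", "D", "H", "S"},
]

frequent = apriori_frequent_itemsets(transactions, minsup=0.4)
for itemset in sorted(frequent, key=len):
    print(set(itemset), f"{frequent[itemset]}/7")
# Frequent pairs: {A, H}, {A, S}, {H, S} and {F, E}; the only frequent
# 3-itemset is {A, H, S}, contained in 3 of the 7 transactions (42.8%)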
Algorithm 2.2 Apriori, Part 2: Association Rules Generation
Algorithm: Association rules generation
Input: the set F of frequent item sets and the confidence threshold minconf.
Output: the set of association rules whose confidence is greater than or equal to minconf.
Begin
For each frequent item set L in F
Do Begin
Generate all non-empty proper subsets of L
For each subset s
Do If support(L) / support(s) ≥ minconf
Then generate the rule s → (L – s)
End
End
For example, the rules generated from the frequent set {A, H, S} are:
Rule Confidence
{A} → {H, S} 100%
{H} → {A, S} 100%
{S} → {A, H} 100%
{A, H} → { S } 100%
{A, S} → { H } 100%
{H, S} → { A } 100%
In total, 13 association rules that meet the confidence criterion (minconf = 80%) are extracted
from the frequent item sets of this example.
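As a complement, here is a minimal Python sketch of the rule-generation stage described by Algorithm 2.2, reusing the frequent dictionary (item set → support count) computed above (an illustrative implementation, not the chapter's code):

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: dictionary mapping each frequent item set to its support count.
    # The confidence of s -> (L - s) is support(L) / support(s).
    rules = []
    for L, support_L in frequent.items():
        if len(L) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for k in range(1, len(L)):
            for subset in combinations(L, k):
                s = frozenset(subset)
                # every subset of a frequent set is itself frequent, so its
                # support is available in the dictionary
                confidence = support_L / frequent[s]
                if confidence >= minconf:
                    rules.append((set(s), set(L - s), confidence))
    return rules

rules = generate_rules(frequent, minconf=0.8)
print(len(rules))  # 13 rules on the example above with minconf = 80%
for antecedent, consequent, conf in rules:
    print(antecedent, "->", consequent, f"{conf:.0%}")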
In this chapter, we presented the association rules model and its basic measures (support and
confidence). We discussed the problem of finding association rules. We presented the Apriori
algorithm with an example.
EXERCISES
…….