
Chapter Four

Mining Association Rules in Large Databases

March 10, 2024 Data Mining: Concepts and Techniques 1


Outline
 Association Rule
 Frequent Pattern
 Association mining from frequent Pattern
 Issues to be considered?
 Classification of Frequent Pattern Mining
Mining Frequent Itemsets: the Key Step
 Algorithm to find Frequent Itemsets
 The Apriori Algorithm
 Generating Association Rules from Frequent Itemsets

Data Mining: Concepts and Techniques 2


Association Rule
• An association rule is a rule of the form X ⇒ Y with the
interestingness measures support and confidence, where
  • X and Y are simple or complex statements
  • A simple statement is a statement formed from a single attribute
    (say age, buys, or sex) and a value, related by a relational operator
• Example:

  • Buy(X, “Computer”) ⇒ Buy(X, “Printer”) [Supp = 25%, Conf = 95%]

  • This means that a person X who buys a computer also buys a printer.

  • 25% of the entire data set shows a person buying a computer and a printer
    (support). Of the tuples that contain a computer purchase, 95% also contain a
    printer purchase (confidence); see the sketch below.

March 10, 2024 Data Mining: Concepts and Techniques 3
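The arithmetic behind these two numbers is easy to reproduce. Below is a minimal Python sketch (our illustration, with invented transactions rather than data from the slides) that computes the support and confidence of a rule X ⇒ Y from a list of transactions.

```python
# Minimal sketch: support and confidence of a rule X -> Y over a transaction list.
# The transactions below are assumed example data, not from the slides.
transactions = [
    {"computer", "printer", "mouse"},
    {"computer", "printer"},
    {"computer", "scanner"},
    {"bread", "milk"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Fraction of transactions containing `lhs` that also contain `rhs`."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"computer", "printer"}, transactions))       # rule support: 0.5
print(confidence({"computer"}, {"printer"}, transactions))   # rule confidence: ~0.67
```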


Association Rule
• Example 2:
  • Buy(X, “Computer”) ⇒ Buy(X, “Antivirus”) [Supp = 2%, Conf = 60%]
 Rule support and confidence are two measures of rule
interestingness. They respectively reflect the usefulness and
certainty of discovered rules.
 A support of 2% means that 2% of all the transactions under
analysis show that computer and antivirus software are purchased
together.
 A confidence of 60% means that 60% of the customers who
purchased a computer also bought the Antivirus.
 Association rules are considered interesting if they satisfy both a
minimum support threshold and a minimum confidence
threshold. These thresholds can be set by users or domain
experts.

March 10, 2024 Data Mining: Concepts and Techniques 4


Association Rule
• A complex statement is usually represented as a conjunction of simple
statements
• Example:
  • Buy(X, “Computer”) ∧ Buy(X, “Printer”) ⇒ Buy(X, “Scanner”) [Supp =
    50%, Conf = 90%]
  • This means that a person X who buys a computer and a printer also buys a
    scanner.
  • 50% of the entire data set shows a person buying a computer, a printer and a
    scanner (support).
  • Of all transactions in which a person buys a computer and a printer, 90% of
    them also include a scanner (confidence)
• In order to mine such association rules, we need to discuss in depth frequent
patterns and the algorithms that extract them

March 10, 2024 Data Mining: Concepts and Techniques 5


Frequent Pattern
• Frequent patterns are patterns (such as itemsets, subsequences, or
substructures) that appear in a data set frequently.
• An itemset is a set of one or more items that appear together in a transaction
data set.
• An itemset is said to be a frequent itemset if its items appear frequently
together in the transaction data set.
  • For example, milk and bread may occur together frequently in
    transactions and hence form a frequent itemset.

• A subsequence refers to items that occur in transactions in a sequential order.

  • For example, buying a computer at time t0 may be followed by buying a
    digital camera at time t1, and buying a memory card at time t2.
• A subsequence that appears frequently is said to be a frequent
subsequence.

March 10, 2024 Data Mining: Concepts and Techniques 6


Frequent Pattern
 A sub structure refers to different structural forms of the data set, such as
sub-graphs, sub-trees, or sub-lattices, which may be combined with item
sets or subsequences.
 If a substructure occurs frequently, it is called a (frequent) structured
pattern.
 Finding such frequent patterns plays an essential role in mining
associations, correlations, classification, clustering, and other data mining
tasks as well.
 Thus, frequent pattern mining has become an important data mining task and
a focused theme in data mining research.

 This chapter is dedicated to methods of frequent itemset mining.

March 10, 2024 Data Mining: Concepts and Techniques 7


Frequent Pattern
• We look into the following questions:
  • How can we find frequent itemsets from large amounts of data, where
    the data are either transactional or relational?
  • How can we mine association rules in multilevel and multidimensional
    space?
  • Which association rules are the most interesting?
  • How can we help or guide the mining procedure to discover interesting
    associations or correlations?
  • How can we take advantage of user preferences or constraints to speed
    up the mining process?

March 10, 2024 Data Mining: Concepts and Techniques 8


Frequent Pattern

 Frequent itemset mining leads to the discovery of associations and correlations


among items in large transactional or relational data sets.

 With massive amounts of data continuously being collected and stored, many
industries are becoming interested in mining frequent itemset patterns from
their databases.

 The discovery of interesting correlation relationships among huge amounts of


business transaction records can help in many business decision-making
processes such as:
 market basket analysis, catalog design, cross-marketing, loss-leader
analysis and customer shopping behavior analysis.

March 10, 2024 Data Mining: Concepts and Techniques 9


Association mining from frequent Pattern

 Rule form: “Body (X) -> Head (Y) [support, confidence]”.

 Which is read as if body (X) then head (Y) will occur together in the
transaction with the stated support and confidence

 Rule support and confidence are two measures of rule interestingness. They
respectively reflect the usefulness and certainty of discovered rules.

 Typically, association rules are considered interesting if they satisfy both a


minimum support threshold and a minimum confidence threshold.

 Such thresholds can be set by users or domain experts.

March 10, 2024 Data Mining: Concepts and Techniques 10


Association mining from frequent Pattern

• Additional analysis can be performed to uncover interesting statistical
correlations between associated items.
• Let I = {I1, I2, …, Im} be a set of items.

• Let D, the task-relevant data set, be a set of database transactions where each
transaction T is a set of items such that T ⊆ I.
• Each transaction is associated with an identifier, called a TID (Transaction ID).
• Let A be a set of items.
• A transaction T is said to contain A if and only if A ⊆ T.
• An association rule is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I,
and A ∩ B = ∅.

March 10, 2024 Data Mining: Concepts and Techniques 11


Association mining from frequent Pattern

• The rule A ⇒ B holds in the transaction set D with support s, where s is the
percentage of transactions in D that contain A ∪ B (i.e., the union of itemsets A
and B, or say, both A and B).

• This is taken to be the probability P(A ∪ B), i.e.,
      support(A ⇒ B) = P(A ∪ B) = (number of transactions containing A ∪ B) / (total number of transactions in D)

• Support shows the probability that all the predicates in A and B are fulfilled together.
  • Count of tuples that contain both A and B, divided by the total number of tuples in
    the working data set

March 10, 2024 Data Mining: Concepts and Techniques 12


Association mining from frequent Pattern

• The rule A ⇒ B has confidence c in the transaction set D, where c is the
percentage of transactions in D containing A that also contain B.
• This is taken to be the conditional probability P(B|A).

• Confidence measures how often predicate B is fulfilled when predicate A is fulfilled.

• Count of tuples that contain both A and B together, divided by the total number of
tuples that contain A
• That is
      support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / (total number of transactions in D)

      confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

March 10, 2024 Data Mining: Concepts and Techniques 13


Association mining from frequent Pattern

• Rules that satisfy both a minimum support threshold (min_sup) and a minimum
confidence threshold (min_conf) are called strong.

• By convention, we write support and confidence values as percentages between
0% and 100%, rather than as fractions between 0 and 1.0 (i.e., multiplied by 100%).

[Diagram of the transaction space: C1 = A and B occur together; C2 = A occurs
without B; C3 = B occurs without A; C4 = others, occurring without A and B]

March 10, 2024 Data Mining: Concepts and Techniques 14


Association mining from frequent Pattern

 A set of items is referred to as an itemset.


 An itemset that contains k items is a k-itemset.
 The set {computer, antivirus software} is a 2-itemset.
 The occurrence frequency of an itemset is the number of transactions that
contain the itemset.
 This is also known as the frequency, support count, or count of the itemset.

 Note that the itemset support defined before is sometimes referred to as


relative support, whereas the occurrence frequency is called the absolute
support.
 If the relative support of an itemset I satisfies a prespecified minimum support
threshold (i.e., the absolute support of I satisfies the corresponding minimum
support count threshold), then I is a frequent itemset.

March 10, 2024 Data Mining: Concepts and Techniques 15


Association mining from frequent Pattern

• The set of frequent k-itemsets is commonly denoted by Lk.

• From the previous equation, we have

      confidence(A ⇒ B) = P(B | A)
                        = support(A ∪ B) / support(A)                 (relative support)
                        = support_count(A ∪ B) / support_count(A)     (absolute support)

• The above equation shows that the confidence of rule A ⇒ B can be easily
derived from the support counts of A and A ∪ B.
• That is, once the support counts of A, B, and A ∪ B are found, it is straightforward
to derive the corresponding association rules A ⇒ B and B ⇒ A and check whether
they are strong.
• Thus, the problem of mining association rules can be reduced to that of mining
frequent itemsets.
March 10, 2024 Data Mining: Concepts and Techniques 16
Association mining from frequent Pattern:
Support and Confidence example

 Consider the following 4 transactions.


Transaction ID Items Bought
2000 A,B,C
1000 A,C
4000 A,D
5000 B,E,F
 The support for the various item set can be computed and the result shows:

support(A) 3 support(B,E,F) 1 support(B,F) 1


support(A,C) 2 support(A,D) 1 support(E,F) 1
support(B) 2 support(A,B) 1 support(D) 1
support( C ) 2 support(B,C) 1 support(E) 1
support(A,B,C) 1 support(B,E) 1 support(F) 1

March 10, 2024 Data Mining: Concepts and Techniques 17
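As a cross-check of the support counts listed above, the following short Python sketch (ours, not part of the slides) recomputes them from the four transactions.

```python
# Recompute support counts for all 1-, 2- and 3-itemsets of the 4-transaction example.
from itertools import combinations

db = {2000: {"A", "B", "C"}, 1000: {"A", "C"}, 4000: {"A", "D"}, 5000: {"B", "E", "F"}}
items = sorted(set().union(*db.values()))

counts = {}
for k in (1, 2, 3):
    for itemset in combinations(items, k):
        c = sum(set(itemset) <= t for t in db.values())   # transactions containing it
        if c > 0:
            counts[itemset] = c

print(counts[("A",)], counts[("A", "C")], counts[("A", "B", "C")])   # 3 2 1
```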


Association mining from frequent Pattern:
Support and Confidence example

• The following are some of the association rules with support and confidence,
computed from the transaction and support tables on the previous slide:

  A ⇒ B (25%, 33.3%)        A B ⇒ C (25%, 100%)
  A ⇒ C (50%, 66.6%)        A C ⇒ B (25%, 50%)
  A ⇒ D (25%, 33.3%)        B C ⇒ A (25%, 100%)
  B ⇒ A (25%, 50%)          A ⇒ C B (25%, 33.3%)
  C ⇒ A (50%, 100%)         B ⇒ A C (25%, 50%)
  D ⇒ A (25%, 100%)         C ⇒ A B (25%, 50%)
  B ⇒ C (25%, 50%)          B E ⇒ F (25%, 100%)
  B ⇒ E (25%, 50%)          B F ⇒ E (25%, 100%)
  B ⇒ F (25%, 50%)          E F ⇒ B (25%, 100%)
  C ⇒ B (25%, 50%)          B ⇒ E F (25%, 50%)
  E ⇒ B (25%, 100%)         E ⇒ B F (25%, 100%)
  F ⇒ B (25%, 100%)         F ⇒ B E (25%, 100%)

March 10, 2024 Data Mining: Concepts and Techniques 18

Association mining from frequent Pattern

 In general, association rule mining can be viewed as a two-step process:


1. Find all frequent itemsets
2. Generate strong association rules from the frequent itemsets

Find all frequent itemsets

• By definition, each of these itemsets will occur at least as frequently as a
predetermined minimum support count, min_sup.
• Let the minimum support threshold be 50% for the previous example, which
consists of 4 transactions (i.e., a minimum support count of 2).
• This enables generation of the following frequent itemsets:
support(A) 3
support(A,C) 2
support(B) 2
support( C ) 2

March 10, 2024 19


Association mining from frequent Pattern

Generate strong association rules from the frequent
itemsets:
• At this step, we need to select the association rules
that satisfy both minimum support and minimum
confidence.
• Frequent itemsets from the previous step: support(A) 3, support(A,C) 2,
support(B) 2, support(C) 2

• In the example considered above, only two association rules are possible from
the frequent 2-itemset {A, C}: A ⇒ C
and C ⇒ A.
• Let the minimum confidence be 80%, i.e. (s, c) = (50%, 80%)
• Hence the rule which fulfills the condition is C ⇒ A (50%, 100%)
• Whereas A ⇒ C (50%, 66.6%) doesn’t fulfill the confidence requirement
and is filtered out (see the sketch below)
• As the second step is much less costly than the first, the overall performance of
mining association rules is determined by the first step

March 10, 2024 Data Mining: Concepts and Techniques 20
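A small sketch of this filtering step (the variable names are ours; the counts are the ones listed above, with 4 transactions in total):

```python
# Keep only rules over the frequent itemset {A, C} whose confidence reaches 80%.
support_count = {("A",): 3, ("C",): 2, ("A", "C"): 2}
n, min_conf = 4, 0.80

for lhs, rhs in [(("A",), ("C",)), (("C",), ("A",))]:
    sup = support_count[("A", "C")] / n                      # rule support
    conf = support_count[("A", "C")] / support_count[lhs]    # rule confidence
    verdict = "strong" if conf >= min_conf else "filtered out"
    print(lhs, "=>", rhs, f"support={sup:.0%} confidence={conf:.1%} ({verdict})")
```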


Classification of Frequent Pattern
Mining

 Frequent pattern mining can be classified in various ways, based on


different criteria, two of which are
1. Based on the levels of abstraction involved in the rule set:

2. Based on the number of data dimensions involved in the rule:

 Some other criterion may be


A. Based on the completeness of patterns to be mined:

B. Based on the types of values handled in the rule:

C. Based on the kinds of rules to be mined

Data Mining: Concepts and Techniques 21


Classification of Frequent Pattern
Mining

 Based on the levels of abstraction involved in the rule set:


 Based on the level of abstraction, we can classify frequent pattern mining as
single level and multiple level mining
 Multiple level frequent pattern mining for association rule can find rules at
differing levels of abstraction.
 For example, suppose that a set of association rules mined includes the following
rules where X is a variable representing a customer:
buys(X, “computer”) ⇒ buys(X, “HP printer”)
buys(X, “desktop computer”) ⇒ buys(X, “HP printer”)
 In the above Rules, the items bought are referenced at different levels of
abstraction (e.g., “computer” is a higher-level abstraction of “desktop computer”).
 If, instead, the rules within a given set do not reference items or attributes at
different levels of abstraction, then the set contains single-level association
rules.

March 10, 2024 Data Mining: Concepts and Techniques 22


Classification of Frequent Pattern
Mining
 Based on the number of data dimensions involved in the rule:
 Based on the number of data dimensions involved in the rule we can classify

frequent pattern mining as single dimensional or multidimensional


 If the items or attributes in an association rule reference only one dimension,
then it is a single-dimensional association rule.
buys(X, “computer”) ⇒ buys(X, “antivirus software”)
buys(X, “computer”) ⇒ buys(X, “HP printer”)
buys(X, “laptop computer”) ⇒ buys(X, “HP printer”)
 The above rules are single-dimensional association rules because they each
refer to only one attribute/dimension, buys.
 If a rule references two or more dimensions, such as the dimensions or
attributes age, income, and buys, then it is a multidimensional association
rule.
 The following rule is an example of a multidimensional rule:
age(X, “30...39”) ∧ income(X, “42K...48K”) ⇒ buys(X, “high resolution
TV”)
March 10, 2024 Data Mining: Concepts and Techniques 23
Mining Frequent Itemsets: the Key Step

 In order to mine association rule using frequent itemset from a database,


we should perform the following basic steps
1. Find the frequent itemsets:
  • the sets of items that have minimum support
  • Any subset of a frequent itemset is also frequent (the Apriori property); i.e., if {A, B} is a
    frequent itemset, both {A} and {B} are frequent
  • A number of algorithms have been suggested to find the set of closed or
    maximal frequent itemsets
2. Use the frequent itemsets to generate association rules that fulfill the
confidence criterion.

March 10, 2024 Data Mining: Concepts and Techniques 24


Algorithm to find Frequent Itemsets

 There are a number of algorithms to find frequent itemset in


mining association pattern from the data set
 Some of them are:
1. The apriori algorithm
2. Frequent pattern growth method
3. Vertical data format method

March 10, 2024 Data Mining: Concepts and Techniques 25


Algorithm to find Frequent Itemsets

1. The Apriori algorithm:

  • It iteratively finds frequent itemsets with cardinality from 1 to k (k-
    itemsets)

2. Frequent pattern growth method

  • Finds frequent itemsets using a divide-and-conquer method on a frequent
    pattern (FP) tree

March 10, 2024 Data Mining: Concepts and Techniques 26


Algorithm to find Frequent Itemsets

3. Vertical data format method

  • Usually the working data set is represented as a set of records, where each
    record is identified by a transaction id (TID) and its associated itemset.
  • This format is called the horizontal data format
  • The vertical data format represents each record as an item name
    together with the set of transaction ids containing that item.
  • This approach uses this format of the input data to discover all frequent
    patterns

• In this chapter we will discuss only the first approach (the Apriori algorithm)

March 10, 2024 Data Mining: Concepts and Techniques 27


The Apriori Algorithm

• Assume:
  • Lk is the set of all frequent k-itemsets, ordered
    lexicographically (i.e. the ith itemset in Lk is smaller than the jth
    itemset iff i < j)
  • Ck is the set of candidate k-itemsets, a superset of Lk.
  • li and lj are the ith and jth k-itemsets from a given Lk, and each of their
    elements is also sorted lexicographically.

March 10, 2024 Data Mining: Concepts and Techniques 28


The Apriori Algorithm

 The Apriori algorithm will have the following steps


 Initialization
 Join Step
 Prune Step
 Generation

March 10, 2024 Data Mining: Concepts and Techniques 29


The Apriori Algorithm

• Initialization
  • Generate all the frequent itemsets of cardinality 1
    (i.e. L1), in which the elements are sorted
    lexicographically.
  • Let L1 be {{i1}, {i4}, {i7}, {i9}, {i11}} (note the
    ordering)

March 10, 2024 Data Mining: Concepts and Techniques 30


The Apriori Algorithm

• Join Step:
  • Generate the candidate k-itemsets by joining Lk-1 with itself
    (i.e. Ck = Lk-1 ⋈ Lk-1) using the following procedure:

    • Take any two elements from Lk-1 that agree on all of their
      items except the last
    • Form a k-itemset by the union of the two (k-1)-
      itemsets
    • Repeat the procedure for all such pairs of elements

March 10, 2024 Data Mining: Concepts and Techniques 31


The Apriori Algorithm

 Join Step:
o Let’s assume L2 = {{i1,i4}, {i1,i9}, {i1,i11} , {i4,i9} , {i4,i11} , {i7,i9} ,
{i7,i11}}
o The candidate 3-itemsets are {{i1,i4,i9}, {i1,i4,i11}, {i1,i9,i11},
{i4,i9,i11}, {i7,i9,i11} } (Note each elements are sorted and the
elements of the elements are also sorted)
o Note {i9,i11} is a subset of some of the generated 3-itemsets but is not in L2.
o Since that 2-itemset is not frequent, the
3-itemsets having {i9,i11} as a subset cannot fulfill the
requirement to be frequent itemsets,
o which leads to the immediate removal of those three candidate 3-itemsets
in the next step
March 10, 2024 Data Mining: Concepts and Techniques 32
The Apriori Algorithm
• Prune Step:
  • Generate Ck from the candidate k-itemsets by pruning, a priori, those
    elements which have subsets that are not frequent
  • This can best be done by checking whether an element of the candidate k-itemsets
    has any (k-1)-itemset subset that is not frequent.
  • If such an element exists, it should be pruned as it is not frequent

March 10, 2024 Data Mining: Concepts and Techniques 33


The Apriori Algorithm

• Generation:
  • Generate Lk from Ck by eliminating elements which are
    not frequent
  • This can best be done by assigning a count to each k-
    itemset in Ck while scanning the entire database of transactions

March 10, 2024 Data Mining: Concepts and Techniques 34


The Apriori Algorithm

 Input:
 D, a database of transactions;

 Min_sup, the minimum support count threshold.

 Output:
 L, frequent itemsets in D.

March 10, 2024 Data Mining: Concepts and Techniques 35


The Apriori Algorithm

• Method:
1.  L1 = find_frequent_1-itemsets(D); // initialize
2.  for (k = 2; Lk-1 ≠ ∅; k++) {
3.    Ck = apriori_gen(Lk-1); // join and prune
4.    for each transaction t ∈ D { // scan D for counts
5.      Ct = subset(Ck, t); // get the subsets of t that are candidates
6.      for each candidate c ∈ Ct
7.        c.count++;
8.    }
9.    Lk = { c ∈ Ck | c.count ≥ min_sup } // generate
10. }
11. return L = ∪k Lk;

March 10, 2024 Data Mining: Concepts and Techniques 36
The Apriori Algorithm
procedure apriori_gen(Lk-1:frequent (k-1)-itemsets)
1. for each itemset l1 ∈ Lk-1 {
2.   for each itemset l2 ∈ Lk-1 {
3.     if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ . . . ∧ (l1[k-2] = l2[k-2]) ∧
       (l1[k-1] < l2[k-1]) then {
4.       c = l1 ⋈ l2; // join step: generate candidates
5. if (not (has_infrequent_subset(c, Lk-1))) then
6. add c to Ck;

7. else delete c; // prune step: remove unfruitful candidate


8. }
9. }
10. }
11. return Ck;
March 10, 2024 Data Mining: Concepts and Techniques 37
The Apriori Algorithm

procedure has_infrequent_subset (c: candidate k-itemset; Lk-1: frequent (k-


1)-itemsets); // use prior knowledge
1. for each (k-1)-subset s of c
2. if s ∉ Lk-1 then
3. return TRUE;
4. return FALSE;

March 10, 2024 Data Mining: Concepts and Techniques 38
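For readers who prefer running code, the sketch below is one possible Python rendering of the pseudocode above (our illustration, not the textbook's reference implementation). Itemsets are kept as sorted tuples so the lexicographic join condition of apriori_gen applies directly.

```python
from itertools import combinations

def apriori(D, min_sup):
    """D: list of transactions (sets of items); min_sup: minimum support count.
    Returns a dict mapping each frequent itemset (a sorted tuple) to its count."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in D:
        for i in t:
            counts[(i,)] = counts.get((i,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(L)
    k = 2
    while L:
        prev = sorted(L)
        # join step: combine (k-1)-itemsets that agree on all but the last item
        Ck = [l1 + (l2[-1],) for i, l1 in enumerate(prev)
              for l2 in prev[i + 1:] if l1[:-1] == l2[:-1]]
        # prune step: drop candidates having an infrequent (k-1)-subset
        Ck = [c for c in Ck if all(s in L for s in combinations(c, k - 1))]
        # scan D for candidate counts, then generate Lk
        counts = {c: sum(set(c) <= t for t in D) for c in Ck}
        L = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

# The 4-transaction database used in the worked example that follows (min count = 2):
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, 2))   # includes (2, 3, 5): 2
```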


The Apriori Algorithm — Example
(minimum support count = 2)

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for counts of candidate 1-itemsets, C1:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L1 (frequent 1-itemsets): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 = L1 ⋈ L1: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D for counts of C2:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L2 (frequent 2-itemsets): {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 = L2 ⋈ L2: {2 3 5}
(the other 3-itemsets {1 2 3}, {1 2 5} and {1 3 5}, each with support 1,
contain an infrequent 2-subset and so never survive as candidates)
Scan D for counts of C3: {2 3 5}: 2
L3 (frequent 3-itemsets): {2 3 5}: 2

March 10, 2024 Data Mining: Concepts and Techniques 39

• Exercise 1: Given support and confidence values of 50%
and 80%, find the frequent 3-itemsets with strong
association rules.
 Exercise 2:
 Let the database of transactions consist of the sets
{1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, and
{2,4}. Find a frequent 3 –itemset if min_sup = 3.

March 10, 2024 Data Mining: Concepts and Techniques 40


Problem
• Find all the frequent three-itemsets using the Apriori algorithm.
• Given: Min_sup = 20%
• (The transaction table for this exercise appeared as a figure on the original slide.)

Generating Association Rules
 Example of Apriori: Support threshold=50%,
Confidence= 60%
Table 1
Transaction List of items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
 Solution:
 Support threshold=50% => 0.5*6= 3 => min_sup=3
March 10, 2024 Data Mining: Concepts and Techniques 44
 1. Count Of Each Item
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
• 2. Prune Step: The table above shows that item I5 does not
meet min_sup = 3, thus it is deleted; only I1, I2, I3, and I4
meet the min_sup count.

March 10, 2024 Data Mining: Concepts and Techniques 45


Item Count
I1 4
I2 5
I3 4
I4 4

• 3. Join Step: Form 2-itemsets. From Table 1, find out
the occurrences of each 2-itemset.

March 10, 2024 Data Mining: Concepts and Techniques 46


• 4. Prune Step: The table below shows that itemsets
{I1, I4} and {I3, I4} do not meet min_sup,
so they are deleted.
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2

March 10, 2024 Data Mining: Concepts and Techniques 47


• 5. Join and Prune Step: Form 3-itemsets.
From Table 1, find out the occurrences of each 3-
itemset. From the table below, find out the 2-
itemset subsets which satisfy min_sup.
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3

March 10, 2024 Data Mining: Concepts and Techniques 48


• We can see that for itemset {I1, I2, I3} the subsets
{I1, I2}, {I1, I3}, {I2, I3} all occur in the
table above, thus {I1, I2, I3} is frequent.
• For itemset {I1, I2, I4} the subsets are
{I1, I2}, {I1, I4}, {I2, I4}; {I1, I4} is not
frequent, as it does not occur in the
table above, thus {I1, I2, I4} is not frequent,
hence it is deleted.

March 10, 2024 Data Mining: Concepts and Techniques 49


• Only {I1, I2, I3} is frequent.
• 6. Generate Association Rules: From the frequent
itemset discovered above, the associations could be:

Candidate 3-itemsets: {I1,I2,I3} (frequent);
{I1,I2,I4}, {I1,I3,I4}, {I2,I3,I4} (not frequent)

• {I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (3/4) * 100 = 75%
March 10, 2024 Data Mining: Concepts and Techniques 50
 {I1, I3} => {I2}
 Confidence = support {I1, I2, I3} / support {I1, I3} =
(3/ 3)* 100 = 100%
 {I2, I3} => {I1}
 Confidence = support {I1, I2, I3} / support {I2, I3} =
(3/ 4)* 100 = 75%
 {I1} => {I2, I3}
 Confidence = support {I1, I2, I3} / support {I1} = (3/
4)* 100 = 75%

March 10, 2024 Data Mining: Concepts and Techniques 51


• {I2} => {I1, I3}
• Confidence = support {I1, I2, I3} / support {I2} = (3/5) * 100 = 60%
 {I3} => {I1, I2}
 Confidence = support {I1, I2, I3} / support {I3} = (3/
4)* 100 = 75%
 This shows that all the above association rules are
strong if minimum confidence threshold is 60%.

March 10, 2024 Data Mining: Concepts and Techniques 52


Generating Association Rules from
Frequent Itemsets

• So far we have seen how frequent itemsets are generated from the
transactions in a database D
• These frequent itemsets are important for generating association
rule patterns that are strong
• A strong association rule is one that satisfies both minimum support and
minimum confidence
• An association rule formed from a frequent itemset automatically
fulfills the support criterion
• So the question is: how do we know that a given association pattern fulfills
the minimum confidence criterion?

March 10, 2024 Data Mining: Concepts and Techniques 53


Generating Association Rules from
Frequent Itemsets

• How?
• For each frequent itemset S, generate all its nonempty subsets.
• For every nonempty proper subset α of S, with β = S − α, output the
rule “α ⇒ β” if

      support_count(S) * 100 ≥ support_count(α) * min_conf

where min_conf is the minimum confidence threshold, given as a
percentage (see the sketch below).

March 10, 2024 Data Mining: Concepts and Techniques 54
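A small Python sketch of this test (our illustration; the support counts below are the ones from the I1, I2, I3 example earlier, with min_conf = 60%):

```python
from itertools import combinations

def gen_rules(S, support_count, min_conf):
    """S: frequent itemset (frozenset); support_count: dict frozenset -> count."""
    rules = []
    for r in range(1, len(S)):                      # every nonempty proper subset
        for alpha in combinations(sorted(S), r):
            alpha = frozenset(alpha)
            conf = support_count[S] / support_count[alpha]
            if conf >= min_conf:                    # confidence test from the slide
                rules.append((set(alpha), set(S - alpha), conf))
    return rules

counts = {frozenset(s): c for s, c in [({"I1"}, 4), ({"I2"}, 5), ({"I3"}, 4),
                                       ({"I1", "I2"}, 4), ({"I1", "I3"}, 3),
                                       ({"I2", "I3"}, 4), ({"I1", "I2", "I3"}, 3)]}
for lhs, rhs, conf in gen_rules(frozenset({"I1", "I2", "I3"}), counts, 0.60):
    print(lhs, "=>", rhs, f"{conf:.0%}")
```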


Mining Multi-Level Associations Rules
• Items often form a hierarchy within a dimension
• Multilevel association rule mining refers to mining association
patterns from items at different concept levels of a dimension
• Items at the lower levels are expected to have lower support.
• Rules regarding itemsets at appropriate levels could be quite useful

[Concept hierarchy: level 0: Food; level 1: milk, bread;
level 2: skim, 2%, wheat, white; level 3 (brands): Fraser, Sunset]

March 10, 2024 Data Mining: Concepts and Techniques 55


Mining Multi-Level Associations Rules

 One of the most common algorithm to extract multilevel association rule


is progressive deepening
 In Multilevel association pattern extraction, one may follow uniform
support or reduce support approach as we move down the concept
hierarchy for strong association extraction
 In Multilevel association pattern extraction, two association may be
extracted such that one pattern is more general and the other is specific
cases of it. In this case we may filter the specific one as the general
dominates the specific
 We will see progressive deepening multiple level association rule
mining, the idea of uniform and reduced support concept and rule
filtering

March 10, 2024 Data Mining: Concepts and Techniques 56


Mining Multi-Level Associations Rules

Progressive Deepening
• A top-down, progressive deepening approach:
  • First find high-level strong rules:

      milk ⇒ bread [20%, 60%].

  • Then find their lower-level “weaker” rules:

      2% milk ⇒ wheat bread [6%, 50%].

[Concept hierarchy: level 0: Food; level 1: milk, bread;
level 2: skim, 2%, wheat, white; level 3: Fraser, Sunset]

March 10, 2024 Data Mining: Concepts and Techniques 57


Mining Multi-Level Associations Rules

Progressive Deepening
• Variations of mining multiple-level association rules:
  • Level-crossed association rules:

      2% milk ⇒ wheat bread (level 2)

  • Association rules with multiple, alternative hierarchies:

      2% milk ⇒ bread (levels 2 and 1)

[Concept hierarchy: level 0: Food; level 1: milk, bread;
level 2: skim, 2%, wheat, white; level 3: Fraser, Sunset]

March 10, 2024 Data Mining: Concepts and Techniques 58


Mining Multi-Level Associations Rules

Uniform Support
• In this approach, the same (only one) minimum support threshold is used
to measure how frequent a pattern is at all levels

• With this approach, we do not need to examine itemsets containing
any item whose ancestors do not have minimum support, since
lower-level items do not occur more frequently than their ancestors
• Hence, it is computationally fast.

March 10, 2024 Data Mining: Concepts and Techniques 59


Mining Multi-Level Associations Rules

Uniform Support
• Uniform Support (limitations)
• If the support threshold is
  • too high → we miss important low-level associations
  • too low → we generate too many high-level associations, most of
    which may not be interesting

March 10, 2024 Data Mining: Concepts and Techniques 60


Mining Multi-Level Associations Rules

Uniform Support

Level 1              Milk
min_sup = 5%         [support = 10%]

Level 2              2% Milk             Skim Milk
min_sup = 5%         [support = 6%]      [support = 4%]

If adopting the same min_support across multiple levels, then throw away an
association T if any of its ancestor itemsets is infrequent (see the sketch below).
March 10, 2024 Data Mining: Concepts and Techniques 61
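A tiny sketch of this check (the hierarchy and support values are read off the figure above; the dictionary encoding is ours):

```python
# Uniform support: examine an item only if its ancestor meets the same min_sup.
parent = {"2% Milk": "Milk", "Skim Milk": "Milk"}            # concept hierarchy
support = {"Milk": 0.10, "2% Milk": 0.06, "Skim Milk": 0.04}
min_sup = 0.05                                               # same threshold at every level

for item in ("2% Milk", "Skim Milk"):
    if support[parent[item]] < min_sup:
        print(item, "-> skipped: ancestor infrequent")
    elif support[item] >= min_sup:
        print(item, "-> frequent")                           # 2% Milk
    else:
        print(item, "-> examined but infrequent")            # Skim Milk
```
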
Mining Multi-Level Associations Rules

Reduced Support
 In this approach, the algorithm will reduce the
required minimum support as we go down to the
lower concept levels (i.e. the minimum support
is reduced as the level number increases)
 There are different search strategies to implement
reduced support multilevel association rule (see the
text book):

March 10, 2024 Data Mining: Concepts and Techniques 62


Mining Multi-Level Associations Rules

Reduced Support

Level 1              Milk
min_sup = 5%         [support = 10%]

Level 2              2% Milk             Skim Milk
min_sup = 3%         [support = 6%]      [support = 4%]

If adopting a reduced min_support at lower levels, then examine only
those descendants whose ancestor's support is frequent/non-negligible
using the reduced min_support.
March 10, 2024 Data Mining: Concepts and Techniques 63
Mining Multi-Level Associations Rules

Redundancy Filtering
• Some rules may be redundant due to “ancestor” relationships
between items.
• Example
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

• Note: 2% milk is a milk, and a milk is either 2% milk or skim

• Hence, we may say the first rule is an ancestor of the second
rule.
• A rule is redundant if its support is close to the “expected” value,
based on the rule’s ancestor.

March 10, 2024 Data Mining: Concepts and Techniques 64


Mining Multi-Level Associations Rules

Redundancy Filtering
• What is expected?
• The expected support comes from the concept hierarchy: each ancestor
expects its descendants to cover some proportion of the entire data

• For example: 50% of milk is skim milk and 50% of milk is 2% milk.

• Example
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

• The expected support of the second rule is 50% of 8% = 4%; its actual
support of 2% is therefore not the expected value

• Hence, neither of the patterns is redundant

March 10, 2024 Data Mining: Concepts and Techniques 65


Mining Multi-Level Associations Rules

Redundancy Filtering
• Consider the two rules below:
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 8%, confidence = 72%]

• This suggests that all the milk is 2% milk, as both rules have the same
support
• Hence the 2nd (more specific) rule is the better rule to keep

March 10, 2024 Data Mining: Concepts and Techniques 66


Mining Multi-Level Associations Rules

Redundancy Filtering
• Consider these rules, with an expectation of 50% for each type of milk:
1. milk ⇒ wheat bread [support = 8%, confidence = 70%]
2. 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
3. Skim milk ⇒ wheat bread [support = 6%, confidence = 69%]
• This shows a significant difference between rules 1 and 2, but a match
between rules 1 and 3
• Hence the 3rd rule may be removed, as it doesn’t add extra
information given rule 1
• Rule 2 must be retained, as its support is not as expected

March 10, 2024 Data Mining: Concepts and Techniques 67


Mining Multi-Dimensional Association Rule

• So far we have seen association rules that are derived from a single
attribute such as buys (item sold)
• Such an association rule is called a single-dimensional or intra-
dimensional association rule because it contains a single distinct
predicate (e.g., buys) with multiple occurrences.
• Example:
      buys(X, “milk”) ⇒ buys(X, “bread”)
• However, we may need association rules that involve
predicates from two or more attributes
• Association rules that involve two or more dimensions or
predicates can be referred to as multidimensional association rules

March 10, 2024 Data Mining: Concepts and Techniques 68


Mining Multi-Dimensional Association Rule

• Multidimensional association rules with no repeated predicates are
called interdimensional association rules.
      age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)

• Multidimensional association rules with one or more repeated
predicates are called hybrid-dimensional association rules.
      age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)

March 10, 2024 Data Mining: Concepts and Techniques 69


Mining Multi-Dimensional Association Rule
• During single-dimensional association rule mining, we search for
frequent itemsets
• However, in multidimensional association rule mining we search for
frequent k-predicate sets.
• A k-predicate set is a set containing k conjunctive predicates
• For instance, the set of predicates {age, occupation, buys} is a 3-
predicate set
• For example, we may have a frequent 3-predicate set such as
{age=“30..39”, income=“1K..2K”, buys=“laptop”} with support 30%

• This means that 30% of all the tuples in the data set satisfy all three
predicates: age in 30..39, income in 1K..2K, and buys = laptop

March 10, 2024 Data Mining: Concepts and Techniques 70


Mining Multi-Dimensional Association Rule

• In a relational database, finding all frequent k-predicate sets

requires k table scans to join each of the k tables

 One more processing of the merged table may be required to check


confidence for a given association pattern

March 10, 2024 Data Mining: Concepts and Techniques 71
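One common way to reuse the ordinary frequent-itemset machinery here is to encode each (attribute, value) pair as an "item", so that a k-predicate set becomes a k-itemset. The sketch below illustrates the idea on a few invented rows; it is not an algorithm taken from the slides.

```python
# Encode each (attribute, value) pair of a relational row as an "item" string.
rows = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "30-39", "occupation": "engineer", "buys": "laptop"},
]
transactions = [{f"{a}={v}" for a, v in row.items()} for row in rows]

def support(predicate_set, db):
    return sum(predicate_set <= t for t in db) / len(db)

# Support of the 3-predicate set {age=19-25, occupation=student, buys=coke}:
print(support({"age=19-25", "occupation=student", "buys=coke"}, transactions))  # ~0.67
```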


Mining Multi-Dimensional Association Rule

 Data cube is well suited for mining as the cell of an n-dimensional


cuboid correspond to the predicate sets.
 Hence, mining from data cubes can be much faster.
()

(age) (income) (buys)

(age, income) (age,buys) (income,buys)

(age,income,buys)

March 10, 2024 Data Mining: Concepts and Techniques 72


Mining Multi-Dimensional Association Rule
 we use the notation Lk to refer to the set of frequent k-predicate sets.

 If the resulting task-relevant data are stored in a relational table or a


data warehouse, then the frequent itemset mining algorithms we have
discussed (such as Apriori algorithm) can be modified easily so as to
find all frequent predicate sets

March 10, 2024 Data Mining: Concepts and Techniques 73


Mining Multi-Dimensional Association Rule

• If an attribute used in a multidimensional association rule is

quantitative (numeric rather than nominal), it should be discretized before
starting frequent itemset extraction

 Techniques of quantitative attribute discretization for mining


multidimensional association rules can be categorized into two
basic approaches:
1. Static discretization
2. Dynamic discretization

March 10, 2024 Data Mining: Concepts and Techniques 74


Mining Multi-Dimensional Association Rule

• Static discretization refers to discretizing a quantitative attribute

using a predefined concept hierarchy
• Example: income may be replaced by discrete intervals such
as “0..200”, “201..1000”, “1001..5000”, and “above 5000”

• Such concept hierarchies for the quantitative attribute can be
generated in the preprocessing phase (see the sketch below).

March 10, 2024 Data Mining: Concepts and Techniques 75
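A minimal sketch of static discretization using the income intervals quoted above (the function name and boundary handling are our assumptions):

```python
def discretize_income(income):
    """Replace a numeric income by its predefined interval label."""
    if income <= 200:
        return "0..200"
    elif income <= 1000:
        return "201..1000"
    elif income <= 5000:
        return "1001..5000"
    return "above 5000"

print([discretize_income(v) for v in (150, 800, 4200, 9000)])
# ['0..200', '201..1000', '1001..5000', 'above 5000']
```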


Mining Multi-Dimensional Association Rule

 Dynamic discretization refers to discretizing quantitative


attribute dynamically based on its attribute distribution
 Usually binning is used for discretization
 These bins may be further combined during the mining process.

 The discretization process is dynamic and established so as to


satisfy some mining criteria, such as maximizing the confidence
of the rules mined.
 Sample example is given below:

March 10, 2024 Data Mining: Concepts and Techniques 76


Mining Multi-Dimensional Association Rule

 While numeric attributes are dynamically discretized, the


following should be taken into consideration: the confidence or
compactness of the rules mined should be maximized.
• 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
• Example of a 2-D quantitative association rule:

      age(X, “20…25”) ∧ income(X, “30K…40K”) ⇒ buys(X, “Laptop Computer”)
 Cluster “adjacent” association rules to form general rules using a
2-D grid.

March 10, 2024 Data Mining: Concepts and Techniques 77


Mining Multi-Dimensional Association Rule

Example:
age(X, “30-34”) ∧ income(X, “24K - 48K”) ⇒ buys(X, “high resolution TV”)

• In order to find such a 2D quantitative association rule, there are


a number of algorithms and one of the most common is
Association Rule Clustering Systems (ARCS)

March 10, 2024 Data Mining: Concepts and Techniques 78


ARCS (Association Rule Clustering System)
 How does ARCS work?
 This approach maps pairs of quantitative attributes onto a 2-D

grid for tuples satisfying a given categorical attribute condition.


 The grid is then searched for clusters of points from which the

association rules are generated.


 The following grid shows income versus age distribution of

some measurements and its values

March 10, 2024 Data Mining: Concepts and Techniques 79


ARCS (Association Rule Clustering System)
• Three steps are involved in ARCS; these are
1. binning,
2. finding frequent predicate sets,
3. clustering

March 10, 2024 Data Mining: Concepts and Techniques 80


ARCS (Association Rule Clustering System)
1. Binning
• Create bins so that each quantitative value maps into a single bin, and
find count information for each pair of bins (one per dimension) for each
categorical value

• At each grid point we have a
support count (i.e. how many
measurements fall at a given age and
income); see the sketch below

March 10, 2024 Data Mining: Concepts and Techniques 81
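A small sketch of the binning step: map the (age, income) values of tuples satisfying the categorical condition (here, buys = HDTV) onto grid cells and count per cell. The bin widths and sample tuples are assumptions made for illustration.

```python
from collections import Counter

tuples = [(34, 35_000), (35, 38_000), (34, 45_000), (35, 47_000), (52, 80_000)]

def cell(age, income, age_width=1, income_width=10_000):
    """Snap an (age, income) pair to the lower-left corner of its grid cell."""
    return (age // age_width * age_width, income // income_width * income_width)

grid = Counter(cell(a, i) for a, i in tuples)
print(grid)   # support count at each (age bin, income bin) grid point
```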


ARCS (Association Rule Clustering System)
2. Find frequent predicate sets
• Scan to find the frequent predicate sets (those satisfying minimum
support) that also satisfy minimum confidence.
• Strong association rules can then be generated from these predicate
sets, using a rule generation algorithm

March 10, 2024 Data Mining: Concepts and Techniques 82


ARCS (Association Rule Clustering System)
3. Clustering
• In this step, neighbouring strong association rules are merged to
make more general rules, by looking at the distribution of the strong
association rules in the 2-D space
• Consider the strong association rules shown below
      age(X, 34) ∧ income(X, “31K..40K”) ⇒ buys(X, “HDTV”)
      age(X, 35) ∧ income(X, “31K..40K”) ⇒ buys(X, “HDTV”)
      age(X, 34) ∧ income(X, “41K..50K”) ⇒ buys(X, “HDTV”)
      age(X, 35) ∧ income(X, “41K..50K”) ⇒ buys(X, “HDTV”)

March 10, 2024 Data Mining: Concepts and Techniques 83


ARCS (Association Rule Clustering System)

age(X, 34) ∧ income(X, “31K..40K”) ⇒ buys(X, “HDTV”)

age(X, 35) ∧ income(X, “31K..40K”) ⇒ buys(X, “HDTV”)
age(X, 34) ∧ income(X, “41K..50K”) ⇒ buys(X, “HDTV”)
age(X, 35) ∧ income(X, “41K..50K”) ⇒ buys(X, “HDTV”)

• These can be clustered into the more general rule (see the sketch below)

      age(X, “34..35”) ∧ income(X, “31K..50K”) ⇒ buys(X, “HDTV”)
March 10, 2024 Data Mining: Concepts and Techniques 84
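A toy sketch of that merge for the four rules above: because the strong cells form a complete rectangle in the (age, income) grid, their ranges can simply be combined into one bounding range (real ARCS clustering is more general than this).

```python
# Each strong per-cell rule is represented by its (age, income-bin) cell.
strong_cells = [(34, "31K..40K"), (35, "31K..40K"), (34, "41K..50K"), (35, "41K..50K")]

ages = [a for a, _ in strong_cells]
income_bins = sorted({b for _, b in strong_cells})
low = income_bins[0].split("..")[0]        # lower bound of the lowest bin
high = income_bins[-1].split("..")[1]      # upper bound of the highest bin

print(f'age(X, "{min(ages)}..{max(ages)}") ^ income(X, "{low}..{high}") -> buys(X, "HDTV")')
# age(X, "34..35") ^ income(X, "31K..50K") -> buys(X, "HDTV")
```
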
Limitations of ARCS

 Only quantitative attributes on Left Hand Side (LHS) of rules.


 Only 2 attributes on LHS. (2D limitation)
 An alternative to ARCS
 Non-grid-based
 equi-depth binning
 clustering based on a measure of partial completeness.
 “Mining Quantitative Association Rules in Large
Relational Tables” by R. Srikant and R. Agrawal.

March 10, 2024 Data Mining: Concepts and Techniques 85


Interestingness Measurements
 Objective measures
Two popular measurements:
 support; and

 confidence

 Subjective measures (Silberschatz & Tuzhilin,


KDD95)
A rule (pattern) is interesting if
 it is unexpected (surprising to the user); and/or

 actionable (the user can do something with it)

March 10, 2024 Data Mining: Concepts and Techniques 86
