Association Rule Mining

Association Rule Mining: Association rule mining is a procedure that aims to discover frequently occurring patterns, correlations, or associations in datasets stored in various kinds of repositories, such as relational databases, transactional databases, and other forms of data stores.
An association rule has 2 parts:

 an antecedent (if) and
 a consequent (then)

An antecedent is something that is found in the data, and a consequent is an item that is found in combination with the antecedent. Have a look at this rule, for instance:

“If a customer buys bread, he is 70% likely to also buy milk.”

In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it is a rule a retail store can use to target its customers better. If the above rule is the result of a thorough analysis of a dataset, it can be used not only to improve customer service but also to increase the company's revenue.

Association rules are created by thoroughly analysing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are identified:
1. Support: Support indicates how frequently the if/then itemset appears in the database.
2. Confidence: Confidence indicates how often the rule has been found to be true, i.e., the proportion of transactions containing the antecedent that also contain the consequent.

So, in a given transaction with multiple items, Association Rule Mining primarily tries to find the
rules that govern how or why such products/items are often bought together. For example, peanut
butter and jelly are frequently purchased together because a lot of people like to make PB&J
sandwiches.
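To make the two measures concrete, here is a minimal Python sketch; the ten-basket dataset is hypothetical, chosen so that the bread => milk rule from above comes out at exactly 70% confidence:

    # Hypothetical data: 10 baskets; 7 contain bread and milk, 3 contain bread only
    transactions = [{"bread", "milk"}] * 7 + [{"bread"}] * 3

    def support(itemset):
        # Fraction of all transactions that contain every item in the itemset
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(antecedent, consequent):
        # Of the transactions containing the antecedent, the fraction that
        # also contain the consequent
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))       # 0.7
    print(confidence({"bread"}, {"milk"}))  # 0.7 -> "70% likely to buy milk"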

What Is An Itemset?
A set of items together is called an itemset. An itemset may contain one or more items; if it has k items, it is called a k-itemset. An itemset that occurs frequently is called a frequent itemset. Thus, frequent itemset mining is a data mining technique to identify the items that often occur together.

For Example, Bread and butter, Laptop and Antivirus software, etc.

What Is A Frequent Itemset?


A set of items is called frequent if it satisfies minimum threshold values for support and confidence. Support measures the proportion of transactions in which the items are purchased together. Confidence measures how often the consequent items appear in transactions that already contain the antecedent items.
For the frequent itemset mining method, we consider only those itemsets that meet the minimum support and confidence requirements. Insights from these mining algorithms offer many benefits, including cost-cutting and improved competitive advantage.

There is a tradeoff between the time taken to mine the data and the volume of data to be mined. Efficient frequent mining algorithms uncover the hidden patterns of itemsets in a short time and with low memory consumption.

Frequent Pattern Mining (FPM)


The frequent pattern mining algorithm is one of the most important techniques of data mining to discover relationships between different items in a dataset. These relationships are represented in the form of association rules. It helps to find the regularities in data.

FPM has many applications in the field of data analysis, software bugs, cross-marketing, sale
campaign analysis, market basket analysis, etc.

Frequent itemsets discovered through Apriori have many applications in data mining tasks, such as finding interesting patterns in the database and sequence mining, with the mining of association rules being the most important of them.

A classic application of association rules is supermarket transaction data, that is, examining customer behavior in terms of the products purchased. Association rules describe how often the items are purchased together.

Association Rules
Association Rule Mining is defined as:
“Let I = { … } be a set of ‘n’ binary attributes called items. Let D = { … } be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items X and Y are called the antecedent and consequent of the rule, respectively.”
Learning of association rules is used to find relationships between attributes in large databases. An association rule, A => B, will be of the form “for a set of transactions, some value of itemset A determines the values of itemset B, under the condition that minimum support and confidence are met”.

Support and Confidence can be illustrated by the following example:

Bread => Butter [support = 2%, confidence = 60%]

The above statement is an example of an association rule. It means that 2% of all transactions contain both bread and butter, and that 60% of the customers who bought bread also bought butter.
Support and Confidence for itemsets A and B are given by the formulas:

Support(A => B) = (number of transactions containing both A and B) / (total number of transactions)
Confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)
Association rule mining consists of 2 steps:
1. Find all the frequent itemsets.
2. Generate association rules from the above frequent itemsets.
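As a sketch of this two-step pipeline in practice, the snippet below uses the open-source mlxtend Python library (one of several possible implementations; the basket data is made up for illustration):

    # pip install mlxtend pandas
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # Hypothetical baskets for illustration
    transactions = [
        ["bread", "milk", "butter"],
        ["bread", "butter"],
        ["milk", "butter"],
        ["bread", "milk"],
        ["bread", "milk", "butter"],
    ]

    # One-hot encode the transactions into a boolean DataFrame
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

    # Step 1: find all the frequent itemsets
    frequent = apriori(df, min_support=0.4, use_colnames=True)

    # Step 2: generate association rules from the frequent itemsets
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])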
Why Frequent Itemset Mining?
Frequent itemset or pattern mining is broadly used because of its wide range of applications: mining association rules and correlations, constraint-based mining of graph patterns, sequential pattern mining, and many other data mining tasks.

The Problem
When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a
distinctive list, depending on one’s needs and preferences. A housewife might buy healthy
ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these
buying patterns can help to increase sales in several ways. If there is a pair of items, X and Y,
that are frequently bought together:

[Image: product placement in a Tesco store, UK]


 Both X and Y can be placed on the same shelf, so that buyers of one item would
be prompted to buy the other.
 Promotional discounts could be applied to just one out of the two items.
 Advertisements on X could be targeted at buyers who purchase Y.
 X and Y could be combined into a new product, such as having Y in flavors of X.
While we may know that certain items are frequently bought together, the question is, how do we
uncover these associations?
Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to co-occur can help to improve patient care and medicine prescription.
Definition
Association rules analysis is a technique to uncover how items are associated with each other. There are three common ways to measure association.
Measure 1: Support. This says how popular an itemset is, as measured by the proportion of
transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4 out of 8,
or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice}
is 2 out of 8, or 25%.

Table 1. Example Transactions [table image not reproduced]


If you discover that sales of items beyond a certain proportion tend to have a significant impact
on your profits, you might consider using that proportion as your support threshold. You may
then identify itemsets with support values above this threshold as significant itemsets.
Measure 2: Confidence. This says how likely item Y is purchased when item X is purchased,
expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which
item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.
One drawback of the confidence measure is that it might misrepresent the importance of an
association. This is because it only accounts for how popular apples are, but not beers. If beers
are also very popular in general, there will be a higher chance that a transaction containing
apples will also contain beers, thus inflating the confidence measure. To account for the base
popularity of both constituent items, we use a third measure called lift.
Measure 3: Lift. This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is: Lift(X -> Y) = Confidence(X -> Y) / Support(Y). In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought.
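Since Table 1 itself is not reproduced here, the following sketch recomputes all three measures on a small hypothetical basket list, constructed to be consistent with the figures quoted in the text (support of {apple} = 50%, support of {apple, beer, rice} = 25%, confidence of {apple -> beer} = 75%, lift = 1):

    # Hypothetical stand-in for Table 1 (the original table is not reproduced)
    baskets = [
        {"apple", "beer", "rice"}, {"apple", "beer"}, {"apple", "beer", "rice"},
        {"apple"}, {"beer"}, {"beer", "rice"}, {"rice"}, {"beer", "milk"},
    ]

    def support(items):
        return sum(1 for b in baskets if items <= b) / len(baskets)

    def confidence(x, y):
        return support(x | y) / support(x)

    def lift(x, y):
        # Confidence corrected for the base popularity of the consequent
        return confidence(x, y) / support(y)

    print(support({"apple"}))                  # 0.5  -> 4 of 8 baskets
    print(support({"apple", "beer", "rice"}))  # 0.25 -> 2 of 8 baskets
    print(confidence({"apple"}, {"beer"}))     # 0.75 -> 3 of 4 apple baskets have beer
    print(lift({"apple"}, {"beer"}))           # 1.0  -> no association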

An Illustration
We use a dataset on grocery transactions from the arules R library. It contains actual transactions
at a grocery outlet over 30 days. The network graph below shows associations between selected
items. Larger circles imply higher support, while red circles imply higher lift:

[Image: associations between selected items, visualized using the arulesViz R library]


Several purchase patterns can be observed. For example:
 The most popular transaction was of pip and tropical fruits
 Another popular transaction was of onions and other vegetables
 If someone buys meat spreads, he is likely to have bought yogurt as well
 Relatively many people buy sausage along with sliced cheese
 If someone buys tea, he is likely to have bought fruit as well, possibly inspiring the
production of fruit-flavored tea
Recall that one drawback of the confidence measure is that it tends to misrepresent the
importance of an association. To demonstrate this, we go back to the main dataset to pick 3
association rules containing beer:
Table 2. Association measures for beer-related rules [table image not reproduced]
The {beer -> soda} rule has the highest confidence at 20%. However, both beer and soda appear
frequently across all transactions (see Table 3), so their association could simply be a fluke. This
is confirmed by the lift value of {beer -> soda}, which is 1, implying no association between
beer and soda.

Table 3. Support of individual items [table image not reproduced]


On the other hand, the {beer -> male cosmetics} rule has a low confidence, due to few purchases
of male cosmetics in general. However, whenever someone does buy male cosmetics, he is very
likely to buy beer as well, as inferred from a high lift value of 2.6. The converse is true for {beer
-> berries}. With a lift value below 1, we may conclude that if someone buys berries, he would
likely be averse to beer.
It is easy to calculate the popularity of a single itemset, like {beer, soda}. However, a business
owner would not typically ask about individual itemsets. Rather, the owner would be more
interested in having a complete list of popular itemsets. To get this list, one needs to calculate the
support values for every possible configuration of items, and then shortlist the itemsets that meet
the minimum support threshold.
In a store with just 10 items, the total number of possible configurations to examine would be a whopping 1023 (that is, 2^10 − 1 non-empty itemsets). This number grows exponentially with the number of items in the store.
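A quick way to sanity-check that count, and to see how fast it grows:

    from itertools import combinations

    # Count every non-empty itemset over 10 distinct items
    items = [f"item{i}" for i in range(10)]
    count = sum(1 for k in range(1, len(items) + 1)
                for _ in combinations(items, k))
    print(count)        # 1023, i.e. 2**10 - 1
    print(2**100 - 1)   # the count for 100 items: roughly 1.27e30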
Apriori Algorithm – Frequent Pattern Algorithms
Apriori was among the first algorithms proposed for frequent itemset mining; an earlier approach was improved by R. Agrawal and R. Srikant and came to be known as Apriori. This algorithm uses two steps, “join” and “prune”, to reduce the search space. It is an iterative approach to discover the most frequent itemsets.

Apriori relies on the following property:

 If P(I) < minimum support threshold, then I is not frequent.
 If P(I + A) < minimum support threshold, then I + A is not frequent, where A is any other item added to the itemset.
 If an itemset has support below the minimum support, then all of its supersets also fall below the minimum support, and thus can be ignored. This property is called the antimonotone property.

The steps followed in the Apriori algorithm of data mining are:
1. Join Step: This step generates (K+1)-itemset candidates from K-itemsets by joining the set of frequent K-itemsets with itself.
2. Prune Step: This step scans the database to count each candidate itemset. If a candidate does not meet the minimum support, it is regarded as infrequent and removed. This step is performed to reduce the size of the candidate itemsets.
Steps In Apriori
The Apriori algorithm is a sequence of steps to be followed to find the most frequent itemsets in the given database. This data mining technique applies the join and prune steps iteratively until no more frequent itemsets can be found. A minimum support threshold is either given in the problem or assumed by the user.

#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined: only those candidates with a count greater than or equal to min_sup are taken ahead to the next iteration, and the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemset candidates are generated by pairing the frequent 1-items with each other.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. The table will then contain only the 2-itemsets that meet min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration exploits the antimonotone property: every 2-itemset subset of a candidate 3-itemset must itself meet min_sup. If all 2-itemset subsets are frequent, the candidate superset may be frequent and is counted; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the 3-itemsets with each other and pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm stops when no more frequent itemsets can be found.
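The whole procedure can be condensed into a short, self-contained Python sketch. This is a minimal illustration of the join/prune loop rather than an optimized implementation, run here on the TABLE-1 transactions from the worked example below:

    from itertools import combinations

    def apriori(transactions, min_sup):
        # transactions: list of sets of items; min_sup: absolute count threshold
        def prune(candidates):
            # Prune step: keep only candidates whose count meets min_sup
            freq = {}
            for cand in candidates:
                count = sum(1 for t in transactions if cand <= t)
                if count >= min_sup:
                    freq[cand] = count
            return freq

        # Iteration 1: every individual item is a candidate 1-itemset
        items = {item for t in transactions for item in t}
        frequent = prune({frozenset([i]) for i in items})
        all_frequent = dict(frequent)

        k = 2
        while frequent:
            # Join step: union pairs of frequent (k-1)-itemsets into k-itemset candidates
            prev = list(frequent)
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}
            # Antimonotone property: every (k-1)-subset of a candidate must be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent
                                 for s in combinations(c, k - 1))}
            frequent = prune(candidates)
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    # TABLE-1 transactions from the worked example below; min_sup = 3
    T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
         {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
    for itemset, count in sorted(apriori(T, 3).items(), key=lambda kv: len(kv[0])):
        print(sorted(itemset), count)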

Example of Apriori: Support threshold = 50%, Confidence = 60%
TABLE-1
Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3

1. Count of each item, C1 =
TABLE-2
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, thus it is deleted; only I1, I2, I3, I4 meet the min_sup count.
TABLE-3
Item   Count
I1     4
I2     5
I3     4
I4     4
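These counts are easy to verify programmatically (using the same TABLE-1 transactions as in the sketch above):

    from collections import Counter

    T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
         {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
    c1 = Counter(item for t in T for item in t)
    print(c1)                                       # C1: I1:4, I2:5, I3:4, I4:4, I5:2
    print({i: n for i, n in c1.items() if n >= 3})  # after pruning with min_sup = 3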
3. Join Step: Form 2-itemsets. From TABLE-1, find out the occurrences of each 2-itemset.
TABLE-4
Itemset   Count
I1, I2    4
I1, I3    3
I1, I4    2
I2, I3    4
I2, I4    3
I3, I4    2
4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they are deleted.
TABLE-5
Itemset   Count
I1, I2    4
I1, I3    3
I2, I3    4
I2, I4    3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find out the occurrences of each 3-itemset; from TABLE-5, find out the 2-itemset subsets that meet min_sup.
For itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is a candidate, and it turns out to be frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, and {I2, I4}; since {I1, I4} does not occur in TABLE-5, it is not frequent, and thus {I1, I2, I4} is pruned. The same reasoning eliminates {I1, I3, I4} and {I2, I3, I4}.
TABLE-6
Candidate 3-itemsets
I1, I2, I3
I1, I2, I4
I1, I3, I4
I2, I3, I4
Only {I1, I2, I3} is frequent.
6. Generate Association Rules: From the frequent itemset discovered above, the associations could be:

{I1, I2} => {I3}
Confidence = support {I1, I2, I3} / support {I1, I2} = (3/4) * 100 = 75%

{I1, I3} => {I2}
Confidence = support {I1, I2, I3} / support {I1, I3} = (3/3) * 100 = 100%

{I2, I3} => {I1}
Confidence = support {I1, I2, I3} / support {I2, I3} = (3/4) * 100 = 75%

{I1} => {I2, I3}
Confidence = support {I1, I2, I3} / support {I1} = (3/4) * 100 = 75%

{I2} => {I1, I3}
Confidence = support {I1, I2, I3} / support {I2} = (3/5) * 100 = 60%

{I3} => {I1, I2}
Confidence = support {I1, I2, I3} / support {I3} = (3/4) * 100 = 75%

This shows that all the above association rules are strong if the minimum confidence threshold is 60%.
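These confidence values can be checked by generating every rule from the frequent itemset {I1, I2, I3} directly:

    from itertools import combinations

    T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
         {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]

    def sup(items):
        # Absolute support: number of transactions containing all the items
        return sum(1 for t in T if items <= t)

    freq = frozenset({"I1", "I2", "I3"})
    for r in range(1, len(freq)):
        for antecedent in combinations(sorted(freq), r):
            a = frozenset(antecedent)
            print(f"{sorted(a)} => {sorted(freq - a)}: "
                  f"confidence = {sup(freq) / sup(a):.0%}")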

The Apriori Algorithm: Pseudo Code

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets}
for (k = 2; Lk-1 is not empty; k++):
    Ck = candidates generated by joining Lk-1 with itself and pruning
    for each transaction t in the database:
        increment the count of every candidate in Ck contained in t
    Lk = candidates in Ck whose count meets min_sup
return the union of all Lk
Advantages
1. The algorithm is easy to understand.
2. The join and prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires heavy computation if the itemsets are very large and the minimum support is kept very low.
2. The entire database needs to be scanned multiple times.

Methods to Improve Apriori Efficiency


Many methods are available for improving the efficiency of the algorithm.
1. Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and its corresponding count. It uses a hash function for
generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in successive iterations. Transactions that do not contain any frequent items are marked or removed.
3. Partitioning: This method requires only two database scans to mine the frequent
itemsets. It says that for any itemset to be potentially frequent in the database, it should
be frequent in at least one of the partitions of the database.
4. Sampling: This method picks a random sample S from Database D and then searches for
frequent itemset in S. It may be possible to lose a global frequent itemset. This can be
reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any
marked start point of the database during the scanning of the database.
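As a small illustration of the transaction-reduction idea, assuming the frequent 1-items from a first pass are already known (as in the earlier sketch): a transaction holding fewer than two frequent items cannot contain any frequent 2-itemset, so it can be dropped from later scans.

    T = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
         {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
    frequent_items = {"I1", "I2", "I3", "I4"}  # from the first prune step

    # Keep only transactions that could still contain a frequent 2-itemset
    reduced = [t for t in T if len(t & frequent_items) >= 2]
    print(len(T), "->", len(reduced))  # {I4, I5} is dropped: only one frequent item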

Applications of Apriori Algorithm

Some fields where Apriori is used:


1. In the education field: extracting association rules from data on admitted students, based on their characteristics and specialties.
2. In the medical field: for example, analysis of a patients database.
3. In forestry: analysis of the probability and intensity of forest fires using forest fire data.
4. Apriori is used by many companies, such as Amazon in its recommender system and Google for the auto-complete feature.

Conclusion
The Apriori algorithm is a simple and widely used algorithm, though it scans the database once per iteration.

By pruning, it reduces the number of candidate itemsets considerably, providing good performance. Thus, data mining helps consumers and industries alike in the decision-making process.
