Association Rule Mining: "If a Customer Buys Bread, He's 70% Likely to Buy Milk."
Association rule mining is a technique for discovering frequently occurring patterns, correlations, or associations in datasets found in various kinds
of databases, such as relational databases, transactional databases, and other forms of repositories.
An association rule has two parts: an antecedent (if) and a consequent (then).
An antecedent is something that’s found in the data, and a consequent is an item that is found in
combination with the antecedent. Have a look at this rule, for instance:
“If a customer buys bread, he’s 70% likely to buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put, it
can be understood as a retail store’s rule for targeting its customers better. If the above
rule is the result of a thorough analysis of some datasets, it can be used not only to improve
customer service but also to improve the company’s revenue.
Association rules are created by thoroughly analysing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
identified:
1. Support: Support indicates how frequently the if/then relationship appears in the
database.
2. Confidence: Confidence indicates how often these if/then relationships have been
found to be true.
So, given a set of transactions each containing multiple items, Association Rule Mining primarily tries to find the
rules that govern how or why such products/items are often bought together. For example, peanut
butter and jelly are frequently purchased together because a lot of people like to make PB&J
sandwiches.
What Is An Itemset?
A set of items together is called an itemset. If an itemset has k items, it is called a k-itemset. An
itemset can consist of one or more items, and an itemset that occurs frequently is called a frequent
itemset. Thus, frequent itemset mining is a data mining technique for identifying the items that
often occur together.
For example: bread and butter, laptop and antivirus software, etc.
There is a tradeoff between the time taken to mine the data and the volume of data for frequent mining. A
frequent mining algorithm should be efficient enough to mine the hidden patterns of itemsets in
a short time and with low memory consumption.
Frequent pattern mining (FPM) has many applications in the field of data analysis: software bug detection, cross-marketing, sales
campaign analysis, market basket analysis, etc.
Frequent itemsets discovered through Apriori have many applications in data mining tasks, such
as finding interesting patterns in a database and finding sequences; mining
association rules is the most important of them.
Association rules apply to supermarket transaction data, that is, to examining customer
behavior in terms of the purchased products. Association rules describe how often the items are
purchased together.
Association Rules
Association Rule Mining is defined as:
“Let I = {i1, i2, …, in} be a set of ‘n’ binary attributes called items. Let D = {t1, t2, …, tm} be a set of transactions
called the database. Each transaction in D has a unique transaction ID and contains a subset of
the items in I. A rule is defined as an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅. The
sets of items X and Y are called the antecedent and consequent of the rule, respectively.”
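To make this notation concrete, here is a minimal Python sketch of the model; the item names and transaction IDs below are invented for illustration.

```python
# I: the set of all items; D: the database of transactions keyed by transaction ID.
I = {"bread", "milk", "butter", "beer"}
D = {
    "T1": {"bread", "milk"},
    "T2": {"bread", "butter"},
    "T3": {"bread", "milk", "butter"},
}

# A rule X -> Y is a pair of itemsets with X, Y ⊆ I and X ∩ Y = ∅.
X, Y = frozenset({"bread"}), frozenset({"milk"})
assert X <= I and Y <= I and not (X & Y)
```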
Learning of association rules is used to find relationships between attributes in large databases.
An association rule, A => B, will be of the form: “for a set of transactions, some value of itemset
A determines the values of itemset B under the condition that minimum support and
confidence are met”.
Bread => Butter [support = 2%, confidence = 60%]
The above statement is an example of an association rule. It means that 2% of all
transactions contain both bread and butter, and 60% of the customers who bought
bread also bought butter.
Support and confidence for itemsets A and B are given by:
Support(A => B) = (Number of transactions containing both A and B) / (Total number of transactions)
Confidence(A => B) = (Number of transactions containing both A and B) / (Number of transactions containing A)
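As a small sketch, these formulas translate directly into Python; the helper names and toy transactions below are my own, for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B together) divided by support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"bread", "butter"}, {"bread"}, {"milk"}, {"bread", "butter", "milk"}]
A, B = frozenset({"bread"}), frozenset({"butter"})
print(support(A | B, transactions))    # 0.5: half of all transactions contain both
print(confidence(A, B, transactions))  # 0.666...: 2 of the 3 bread transactions also have butter
```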
The Problem
When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a
distinctive list, depending on one’s needs and preferences. A housewife might buy healthy
ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these
buying patterns can help to increase sales in several ways: if there is a pair of items, X and Y,
that are frequently bought together, a retailer can, for example, place them on the same shelf, offer a promotional discount on one of them, or target buyers of X with advertising for Y.
An Illustration
We use a dataset on grocery transactions from the arules R library. It contains actual transactions
at a grocery outlet over 30 days. The network graph for this dataset shows associations between selected
items: larger circles imply higher support, while red circles imply higher lift.
Apriori says:
If P(I) < the minimum support threshold, then itemset I is not frequent; and if P(I ∪ A) < the minimum support threshold for any item A, then I ∪ A is not frequent either. In other words, every superset of an infrequent itemset is itself infrequent (the antimonotone property), so such supersets can be pruned without being counted.
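A small sketch of the pruning test this property licenses; the function name is my own.

```python
from itertools import combinations

def may_be_frequent(candidate, frequent_subsets):
    """A k-itemset can only be frequent if every (k-1)-subset is already frequent."""
    return all(frozenset(s) in frequent_subsets
               for s in combinations(candidate, len(candidate) - 1))

frequent_2 = {frozenset({"I1", "I2"}), frozenset({"I1", "I3"}), frozenset({"I2", "I3"})}
print(may_be_frequent({"I1", "I2", "I3"}, frequent_2))  # True: all 2-subsets are frequent
print(may_be_frequent({"I1", "I2", "I4"}, frequent_2))  # False: {I1, I4} is infrequent
```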
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The
algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined: only the candidates whose count is greater
than or equal to min_sup are taken ahead to the next iteration, and the others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-
itemset candidates are generated by joining the frequent 1-itemsets with each other.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will have
only 2-itemsets that meet min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration exploits the
antimonotone property: the 2-itemset subsets of each candidate
3-itemset must themselves meet min_sup. If all of a candidate's 2-itemset subsets are frequent, the candidate may be frequent and its occurrences are counted;
otherwise it is pruned outright.
#6) The next step makes 4-itemsets by joining 3-itemsets with each other, again pruning any candidate whose
subsets do not meet the min_sup criteria. The algorithm stops when no further frequent
itemsets can be generated (a compact code sketch of the whole loop follows below).
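Condensing steps #1–#6 into code, here is a self-contained Python sketch of the algorithm. It is an illustrative, unoptimized implementation; the function and variable names are my own, not from the source.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: count} for every itemset occurring in at least min_sup transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One database scan: keep only the candidates meeting min_sup.
        return {c: n for c in candidates
                if (n := sum(1 for t in transactions if c <= t)) >= min_sup}

    # Iteration 1: count each item as a 1-itemset candidate (steps #1-#2).
    items = {frozenset({i}) for t in transactions for i in t}
    frequent, k_frequent = {}, count(items)
    k = 1
    while k_frequent:
        frequent.update(k_frequent)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates (steps #3, #5, #6).
        candidates = {a | b for a in k_frequent for b in k_frequent if len(a | b) == k}
        # Prune step: antimonotone property -- drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in k_frequent for s in combinations(c, k - 1))}
        k_frequent = count(candidates)
    return frequent
```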
Example of Apriori: Support threshold = 50%, Confidence threshold = 60%
TABLE-1
Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Solution:
Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3
1. Count of Each Item: TABLE-2
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup = 3, so it is deleted; only
I1, I2, I3, and I4 meet the min_sup count.
TABLE-3
Item Count
I1 4
I2 5
I3 4
I4 4
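The count and prune steps so far can be reproduced in a few lines of Python; this is a sketch, with variable names of my own.

```python
from collections import Counter

transactions = [
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"},
]
min_sup = 3

counts = Counter(item for t in transactions for item in t)            # TABLE-2
frequent_1 = {item: n for item, n in counts.items() if n >= min_sup}  # TABLE-3
print(frequent_1)  # {'I1': 4, 'I2': 5, 'I3': 4, 'I4': 4} (order may vary); I5 (count 2) is pruned
```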
3. Join Step: Form 2-itemsets. From TABLE-1, find out the occurrences of each 2-itemset.
TABLE-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not meet min_sup, so
they are deleted.
TABLE-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find out the occurrences of each candidate 3-itemset,
and from TABLE-5, find out whether its 2-itemset subsets meet min_sup.
We can see that for itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all occur in TABLE-5,
and its own count in TABLE-1 is 3, which meets min_sup; thus {I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4}, and {I2, I4}. {I1, I4} is not frequent, as
it does not occur in TABLE-5; thus {I1, I2, I4} is not frequent, and hence it is deleted.
TABLE-6
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
Only {I1, I2, I3} is frequent.
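As a check, running the apriori sketch given earlier on the TABLE-1 transactions reproduces this result (min_sup = 3, as computed above).

```python
table_1 = [
    {"I1", "I2", "I3"},        # T1
    {"I2", "I3", "I4"},        # T2
    {"I4", "I5"},              # T3
    {"I1", "I2", "I4"},        # T4
    {"I1", "I2", "I3", "I5"},  # T5
    {"I1", "I2", "I3", "I4"},  # T6
]
frequent = apriori(table_1, min_sup=3)  # uses the sketch defined above
largest = max(len(s) for s in frequent)
print([sorted(s) for s in frequent if len(s) == largest])
# [['I1', 'I2', 'I3']] -- the only frequent 3-itemset, as found above
```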
6. Generate Association Rules: From the frequent itemset {I1, I2, I3} discovered above, the candidate
rules and their confidences are:
{I1, I2} => {I3}: Confidence = support {I1, I2, I3} / support {I1, I2} = (3/4) * 100 = 75%
{I1, I3} => {I2}: Confidence = support {I1, I2, I3} / support {I1, I3} = (3/3) * 100 = 100%
{I2, I3} => {I1}: Confidence = support {I1, I2, I3} / support {I2, I3} = (3/4) * 100 = 75%
{I1} => {I2, I3}: Confidence = support {I1, I2, I3} / support {I1} = (3/4) * 100 = 75%
{I2} => {I1, I3}: Confidence = support {I1, I2, I3} / support {I2} = (3/5) * 100 = 60%
{I3} => {I1, I2}: Confidence = support {I1, I2, I3} / support {I3} = (3/4) * 100 = 75%
This shows that all of the above association rules are strong, since each meets the minimum confidence threshold of
60%.
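These confidences can also be computed mechanically by enumerating every split of the frequent itemset into an antecedent and a consequent. A sketch, reusing the support counts from the tables above (variable names are my own):

```python
from itertools import combinations

# Support counts from TABLE-3, TABLE-5, and the frequent 3-itemset.
sup = {
    frozenset({"I1"}): 4, frozenset({"I2"}): 5, frozenset({"I3"}): 4,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 3, frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 3,
}
itemset = frozenset({"I1", "I2", "I3"})

for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        consequent = itemset - antecedent
        conf = sup[itemset] / sup[antecedent] * 100
        print(f"{sorted(antecedent)} => {sorted(consequent)}: {conf:.0f}%")
# ['I1'] => ['I2', 'I3']: 75%    ['I1', 'I2'] => ['I3']: 75%
# ['I2'] => ['I1', 'I3']: 60%    ['I1', 'I3'] => ['I2']: 100%
# ['I3'] => ['I1', 'I2']: 75%    ['I2', 'I3'] => ['I1']: 75%
```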
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large databases
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum support is
kept very low.
2. The entire database needs to be scanned repeatedly, once for every pass of candidate generation.
Conclusion
The Apriori algorithm is an efficient algorithm that scans the database once per iteration, pruning candidate itemsets as it goes.
This pruning reduces the number of itemsets that must be counted considerably, providing good performance.
Thus, data mining helps consumers and industries alike in the decision-making process.