Unit 4 - Data Mining - WWW - Rgpvnotes.in
Subject Name: Data Mining
Subject Code: CS-8003
Semester: 8th
Association rule mining is the data mining process of finding the rules that may govern
associations and causal structures between sets of items.
So in a given transaction with multiple items, it tries to find the rules that govern how or why
such items are often bought together. For example, peanut butter and jelly are often bought
together because a lot of people like to make PB&J sandwiches.
Also, surprisingly, diapers and beer are often bought together because, as it turns out, dads are
often tasked with the shopping while the moms stay home with the baby.
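To make the idea concrete, the short Python sketch below (not part of the original notes) counts how often items co-occur in a handful of made-up transactions and reports two standard rule metrics, support and confidence, for a candidate rule such as {peanut butter} -> {jelly}. The transaction data and function names are hypothetical and only for illustration.

transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Of the transactions containing the antecedent, the fraction that also contain the consequent.
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"peanut butter", "jelly"}, transactions))        # 0.4 (2 of 5 baskets)
print(confidence({"peanut butter"}, {"jelly"}, transactions))   # 1.0 (jelly always accompanies peanut butter here)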
Some typical applications of association rule mining are:
Basket data analysis - analyzing the association of items purchased in a single basket
or single purchase, as in the examples given above.
Cross marketing - working with other businesses that complement your own, not
competitors. For example, vehicle dealerships and manufacturers have cross-marketing
campaigns with oil and gas companies for obvious reasons.
Catalog design - the items in a business's catalog are often selected to
complement each other, so that buying one item leads to buying another. These
items are therefore often complements or closely related.
Apriori algorithm
Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search
strategy to count the support of itemsets and uses a candidate generation function which
exploits the downward closure property of support.
Apriori is a classic algorithm for learning association rules. Apriori is designed to operate
on databases containing transactions (for example, collections of items bought by customers, or
details of website visits). Other algorithms are designed for finding association rules
in data having no transactions (Winepi and Minepi), or having no timestamps (DNA
sequencing).
As is common in association rule mining, given a set of itemsets (for instance, sets of retail
transactions, each listing individual items purchased), the algorithm attempts to find subsets
which are common to at least a minimum number C of the itemsets. Apriori uses a “bottom up”
approach, where frequent subsets are extended one item at a time (a step known as candidate
generation), and groups of candidates are tested against the data. The algorithm terminates
when no further successful extensions are found.
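As a hedged illustration of the candidate generation step and the downward closure property mentioned above (this helper is a sketch, not code from the notes), frequent (k-1)-itemsets can be joined to form k-item candidates, and any candidate that has an infrequent (k-1)-subset can be pruned immediately:

from itertools import combinations

def generate_candidates(prev_frequent, k):
    # Join step: combine frequent (k-1)-itemsets whose union has exactly k items.
    prev = set(prev_frequent)          # frozensets of size k-1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # Prune step (downward closure): every (k-1)-subset of a surviving
    # candidate must itself be frequent.
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

This sketch is reused in the fuller implementation given after the pseudocode below.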
The purpose of the Apriori Algorithm is to find associations between different sets of data. It is
sometimes referred to as “Market Basket Analysis”. Each set of data has a number of items and
is called a transaction. The output of Apriori is sets of rules that tell us how often items are
contained in sets of data. A worked example is given below, after the pseudocode.
Algorithm Pseudocode
The pseudocode for the algorithm is given below for a transaction database T and a support
threshold of ε. Usual set-theoretic notation is employed, though note that T is a multiset. Ck is
the candidate set for level k. The Generate() algorithm is assumed to generate the candidate sets
from the large itemsets of the preceding level, heeding the downward closure
lemma. count[c] accesses a field of the data structure that represents candidate set c, which
is initially assumed to be zero. Many details are omitted below; usually the most important part
of the implementation is the data structure used for storing the candidate sets and counting
their frequencies.
Apriori(T, ε)
    L1 ← {large 1-itemsets}
    k ← 2
    while Lk-1 ≠ ∅
        Ck ← Generate(Lk-1)
        for transactions t ∈ T
            for candidates c ∈ Ck with c ⊆ t
                count[c] ← count[c] + 1
        Lk ← {c ∈ Ck : count[c] ≥ ε}
        k ← k + 1
    return ∪k Lk
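A runnable Python rendering of this pseudocode is sketched below. It is illustrative only (not the notes' own code), reuses the hypothetical generate_candidates helper sketched earlier, and takes the support threshold as a minimum transaction count rather than a fraction.

def apriori(transactions, min_count):
    # Returns every frequent itemset (as a frozenset) with its transaction count,
    # following the level-wise pseudocode above.
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        candidates = generate_candidates(frequent.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:                      # candidate is contained in the transaction
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result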
Example
A large supermarket tracks sales data by stock-keeping unit (SKU) for each item, and thus is able
to know what items are typically purchased together. Apriori is a moderately efficient way to
build a list of frequently purchased item pairs from this data. Let the database of transactions
consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, and {2,4}. Each number
corresponds to a product such as “butter” or “bread”. The first step of Apriori is to count up the
frequencies, called the support, of each member item separately. The table below shows the support of each item:
Item Support
1 3/7
2 6/7
3 4/7
4 5/7
We can define a minimum support level required to qualify as "frequent," which depends on the
context. For this case, let min support = 3/7. Therefore, all four items are frequent. The next step is to
generate a list of all pairs of the frequent items. Had any of the above items not been frequent,
it would not have been included as a member of any candidate pair. In this way,
Apriori prunes the tree of all possible sets. In the next step we again keep only those itemsets
(now pairs) that are frequent:
Item Support
{1,2} 3/7
{1,3} 1/7
{1,4} 2/7
{2,3} 3/7
{2,4} 4/7
{3,4} 3/7
The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3/7. The pairs
{1,3} and {1,4} do not. When we move on to generating the list of all triplets, we will not
consider any triplets that contain {1,3} or {1,4}:
Item Support
{2,3,4} 2/7
In the example, there are no frequent triplets — {2,3,4} has support of 2/7, which is below our
minimum, and we do not consider any other triplet because they all contain either {1,3} or
{1,4}, which were discarded after we calculated frequent pairs in the second table.
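For completeness, running the illustrative apriori sketch from earlier on this example database (with a minimum count of 3 out of 7 transactions, i.e. min support 3/7) reproduces the tables above: the four single items and the pairs {1,2}, {2,3}, {2,4}, and {3,4} come back as frequent, and no triplet does.

database = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]

frequent = apriori(database, min_count=3)   # 3 of 7 transactions = min support 3/7
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{count}/7")
# Prints the four single items and the pairs {1,2}, {2,3}, {2,4}, {3,4}; no frequent triplets.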
- Clusters in the rule antecedent are strongly associated with clusters of rules in the consequent
- Clusters in the antecedent occur together
- Clusters in the consequent occur together
FP-Growth Algorithm
Association rule mining (ARM) techniques are effective in extracting frequent patterns and
hidden associations among data items in various databases. These techniques are widely used
for learning behavior, predicting events and making decisions at various levels. The conventional
ARM techniques are, however, limited to databases comprising categorical data only, whereas
real-world databases, mostly in business and scientific domains, have attributes containing
quantitative data. Therefore, an improvised methodology called Quantitative Association Rule
Mining (QARM) is used, which helps in discovering hidden associations from real-world
quantitative databases. Research in this area presents exhaustive discussions of the trends in
QARM, systematically classifies the available techniques into categories based on the type of
computational methods they adopt, offers critical analyses and theoretical comparisons of the
various methods proposed so far, and enumerates some of the issues that need to be addressed
in future research.