Unit 4 - Data Mining - WWW - Rgpvnotes.in

The document discusses association rule mining to find interesting relationships between items in large datasets. It describes: 1) Association rule mining searches for relationships between items that are commonly purchased together. For example, rules may show that customers who buy milk also frequently buy bread. 2) The Apriori algorithm is described as the best-known algorithm for mining association rules. It uses an iterative approach, where frequent itemsets are extended one item at a time to generate candidate itemsets to test against the database. 3) Pseudocode provides an overview of the Apriori algorithm's steps of generating frequent itemsets, creating candidate itemsets, and scanning the database to determine truly frequent itemsets based on a minimum

Uploaded by

Rozy Vadgama
Copyright
© All Rights Reserved

Program: B.E.
Subject Name: Data Mining
Subject Code: CS-8003
Semester: 8th
Downloaded from www.rgpvnotes.in

DATA MINING AND WAREHOUSING IT- 8004(1)


(LECTURERS NOTES)
Unit IV: Mining Association Rules in Large Databases: Association Rule Mining, Single Dimensional
Boolean Association Rules, Multi-Level Association Rule, Apriori Algorithm, FP-Growth Algorithm,
Time series mining association rules, latest trends in association rules mining.

UNIT:-4 Mining Association Rules in Large Databases


INTRODUCTION
Association rule mining finds interesting association or correlation relationships
among a large set of data items. With massive amounts of data continuously being
collected and stored, many industries are becoming interested in mining association
rules from their databases. The discovery of interesting association relationships among
huge amounts of business transaction records can help in many business decision-making
processes, such as catalog design, cross-marketing, and loss-leader analysis.
A typical example of association rule mining is market basket analysis. This
process analyzes customer buying habits by finding associations between the different items that
customers place in their “shopping baskets”. The discovery of such associations can help retailers
develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket? Such information can lead
to increased sales by helping retailers do selective marketing and plan their shelf space. For
example, placing milk and bread in close proximity may further encourage the sale of these
items together within single visits to the store.

Association Rule Mining

Association rule mining searches for interesting relationships among items in a
given data set. This section provides an introduction to association rule mining. We
begin by presenting an example of market basket analysis, the earliest form of
association rule mining. The basic concepts of mining associations are then given, and we
present a road map to the different kinds of association rules that can be mined.

Association rule mining is the data mining process of finding the rules that may govern
associations and causal structures between sets of items.

So in a given transaction with multiple items, it tries to find the rules that govern how or why
such items are often bought together. For example, peanut butter and jelly are often bought
together because a lot of people like to make PB&J sandwiches.

Also, surprisingly, diapers and beer are often bought together because, as it turns out, dads are
often tasked with the shopping while the moms are left with the baby.

The main applications of association rule mining:

• Basket data analysis: analyzing the association of purchased items in a single basket
  or single purchase, as in the examples given above.
• Cross marketing: working with other businesses that complement your own, not
  competitors. For example, vehicle dealerships and manufacturers run cross-marketing
  campaigns with oil and gas companies, since their products are used together.
• Catalog design: the items in a business's catalog are often selected to
  complement each other, so that buying one item leads to buying another. These
  items are therefore often complements or closely related.

Single Dimensional Boolean Association Rules

Apriori algorithm
Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search
strategy to count the support of itemsets and uses a candidate generation function which
exploits the downward closure property of support.

Apriori is a classic algorithm for learning association rules. It is designed to operate
on databases containing transactions (for example, collections of items bought by customers, or
records of visits to a website). Other algorithms are designed for finding association rules
in data having no transactions (Winepi and Minepi), or having no timestamps (DNA
sequencing).
As is common in association rule mining, given a set of itemsets (for instance, sets of retail
transactions, each listing individual items purchased), the algorithm attempts to find subsets
which are common to at least a minimum number C of the itemsets. Apriori uses a “bottom up”
approach, where frequent subsets are extended one item at a time (a step known as candidate
generation), and groups of candidates are tested against the data. The algorithm terminates
when no further successful extensions are found.
The purpose of the Apriori Algorithm is to find associations between different sets of data. It is
sometimes referred to as “Market Basket Analysis”. Each set of data has a number of items and
is called a transaction. The output of Apriori is sets of rules that tell us how often items are
contained in sets of data. Here is an example:

Each line is a set of items:

alpha beta gamma
alpha beta theta
alpha beta epsilon
alpha beta theta

100% of sets with alpha also contain beta
25% of sets with alpha, beta also have gamma
50% of sets with alpha, beta also have theta
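
The three percentages above can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes; the function name `confidence` is our own choice for "fraction of the sets containing the left-hand items that also contain the right-hand items".

```python
# The four example sets from the text, one set of items per line.
transactions = [
    {"alpha", "beta", "gamma"},
    {"alpha", "beta", "theta"},
    {"alpha", "beta", "epsilon"},
    {"alpha", "beta", "theta"},
]

def confidence(antecedent, consequent, data):
    """Fraction of sets containing `antecedent` that also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    covered = [t for t in data if antecedent <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if consequent <= t) / len(covered)

print(confidence({"alpha"}, {"beta"}, transactions))           # 1.0
print(confidence({"alpha", "beta"}, {"gamma"}, transactions))  # 0.25
print(confidence({"alpha", "beta"}, {"theta"}, transactions))  # 0.5
```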


Apriori uses breadth-first search and a hash tree structure to count candidate item sets
efficiently. It generates candidate item sets of length k from item sets of length k-1. Then it
prunes the candidates which have an infrequent sub-pattern. According to the downward
closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans
the transaction database to determine the frequent item sets among the candidates.
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which
have spawned other algorithms. Candidate generation generates large numbers of subsets (the
algorithm attempts to load up the candidate set with as many as possible before each scan).
Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds
any maximal subset S only after all of its proper subsets.

Algorithm Pseudocode
The pseudocode for the algorithm is given below for a transaction database T and a support
threshold of ε. Usual set-theoretic notation is employed, though note that T is a multiset. C_k is
the candidate set for level k. The Generate() algorithm is assumed to generate the candidate sets
from the large itemsets of the preceding level, heeding the downward closure
lemma. count[c] accesses a field of the data structure that represents candidate set c, which
is initially assumed to be zero. Many details are omitted below; usually the most important part
of the implementation is the data structure used for storing the candidate sets and counting
their frequencies.

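Apriori(T, ε)
    L_1 <- {large 1-itemsets}
    k <- 2
    while L_(k-1) is not empty
        C_k <- Generate(L_(k-1))
        for transactions t in T
            C_t <- {c in C_k : c is a subset of t}
            for candidates c in C_t
                count[c] <- count[c] + 1
        L_k <- {c in C_k : count[c] >= ε}
        k <- k + 1
    return the union of all L_k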
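
The level-wise loop can be sketched in Python as follows. This is an illustrative sketch under our own assumptions, not the implementation the notes refer to: the threshold is taken as an absolute count (`min_count`), plain dictionaries stand in for the hash tree, and candidate generation is a simple join of (k-1)-itemsets followed by the downward-closure prune. The data is the alpha/beta example from earlier in this unit.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset(itemset): count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Generate: join two (k-1)-itemsets, then prune any candidate that
        # has an infrequent (k-1)-subset (downward closure).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in current for s in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Scan the database once to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

data = [
    {"alpha", "beta", "gamma"},
    {"alpha", "beta", "theta"},
    {"alpha", "beta", "epsilon"},
    {"alpha", "beta", "theta"},
]
result = apriori(data, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With `min_count=2`, gamma and epsilon are dropped at level 1, so no larger candidate containing them is ever generated or counted.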

Example

A large supermarket tracks sales data by stock-keeping unit (SKU) for each item, and thus is able
to know what items are typically purchased together. Apriori is a moderately efficient way to
build a list of frequently purchased item pairs from this data. Let the database of transactions
consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, and {2,4}. Each number
corresponds to a product such as “butter” or “bread”. The first step of Apriori is to count the
frequency, called the support, of each member item separately. The following table shows
these supports:

Item    Support
1       3/7
2       6/7
3       4/7
4       5/7

We can define a minimum support level to qualify as “frequent,” which depends on the
context. For this case, let min support = 3/7. Therefore, all four items are frequent. The next
step is to generate a list of all pairs of the frequent items. Had any of the above items not been
frequent, it would not have been included as a member of any candidate pair. In this way,
Apriori prunes the tree of all possible sets. In the next step we again keep only those itemsets
(now pairs) that are frequent:

Itemset    Support
{1,2}      3/7
{1,3}      1/7
{1,4}      2/7
{2,3}      3/7
{2,4}      4/7
{3,4}      3/7

The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the minimum support of 3/7. The pairs
{1,3} and {1,4} do not. When we move on to generating the list of all triplets, we will not
consider any triplets that contain {1,3} or {1,4}:

Itemset    Support
{2,3,4}    2/7

In the example, there are no frequent triplets — {2,3,4} has support of 2/7, which is below our
minimum, and we do not consider any other triplet because they all contain either {1,3} or
{1,4}, which were discarded after we calculated frequent pairs in the second table.
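
The tables in this example can be reproduced by direct counting. The sketch below is only a check of the arithmetic, not an efficient implementation: it uses exact fractions and enumerates pairs and triplets by brute force. The helper name `support` is our own.

```python
from fractions import Fraction
from itertools import combinations

transactions = [{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {3, 4}, {2, 4}]
min_support = Fraction(3, 7)

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(hits, len(transactions))

items = sorted(set().union(*transactions))
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= min_support]

# A triplet is a candidate only if all three of its pairs are frequent.
candidate_triplets = [
    t for t in combinations(items, 3)
    if all(p in frequent_pairs for p in combinations(t, 2))
]

print(frequent_pairs)      # [(1, 2), (2, 3), (2, 4), (3, 4)]
print(candidate_triplets)  # [(2, 3, 4)]
print(support((2, 3, 4)))  # 2/7
```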

Multi-Level Association Rule

Rules involving more than one dimension or predicate:

buys(X, "IBM Laptop Computer") -> buys(X, "HP Inkjet Printer")
(Single dimensional)

age(X, "20..25") and occupation(X, "student") -> buys(X, "HP Inkjet Printer")
(Multidimensional, inter-dimension association rule: no predicate is repeated)

age(X, "20..25") and buys(X, "IBM Laptop Computer") -> buys(X, "HP Inkjet Printer")
(Multidimensional, hybrid-dimension association rule: the predicate buys is repeated)

• Attributes can be categorical or quantitative
• Quantitative attributes are numeric and incorporate a hierarchy (age, income, ...)
• Numeric attributes must be discretized
• There are 3 different approaches to mining multidimensional association rules:
   o Using static discretization of quantitative attributes
   o Using dynamic discretization of quantitative attributes
   o Using distance-based discretization with clustering

Mining using Static Discretization

• Discretization is static and occurs prior to mining
• Discretized attributes are treated as categorical
• The Apriori algorithm is used to find all frequent k-predicate sets
• Every subset of a frequent predicate set must be frequent
• If the 3-D cuboid (age, income, buys) is frequent in a data cube, then (age, income),
  (age, buys), and (income, buys) are frequent as well
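
A small sketch of static discretization follows. The customer records and the bin boundaries are hypothetical, chosen only for illustration. Each numeric value is mapped to a fixed interval before mining, and each record becomes a set of (predicate, value) items that can be counted like ordinary categorical items.

```python
from itertools import combinations

# Hypothetical customer records: (age, income in thousands, item bought).
records = [
    (22, 35, "laptop"),
    (24, 38, "laptop"),
    (23, 32, "laptop"),
    (41, 70, "printer"),
    (45, 36, "laptop"),
]

def age_bin(age):
    # Fixed interval chosen before mining (static discretization).
    return "20..25" if 20 <= age <= 25 else "other"

def income_bin(income):
    return "30K..40K" if 30 <= income <= 40 else "other"

# Each record becomes a set of (predicate, value) pairs treated as categorical items.
rows = [
    frozenset({("age", age_bin(a)), ("income", income_bin(i)), ("buys", b)})
    for a, i, b in records
]

def count(predicate_set):
    """Number of records that satisfy every predicate in the set."""
    return sum(1 for r in rows if predicate_set <= r)

three = frozenset({("age", "20..25"), ("income", "30K..40K"), ("buys", "laptop")})
print(count(three))  # 3
for pair in combinations(sorted(three), 2):
    print(pair, count(frozenset(pair)))
```

Every 2-predicate subset is counted at least as often as the 3-predicate set, which is the subset property stated in the list above.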

Mining using Dynamic Discretization

• Known as mining quantitative association rules
• Numeric attributes are dynamically discretized
• Consider rules of the type

  Aquan1 ∧ Aquan2 -> Acat
  (2-D quantitative association rules)

  age(X, "20...25") ∧ income(X, "30K...40K") -> buys(X, "Laptop Computer")

• ARCS (Association Rule Clustering System) is an approach for mining quantitative
  association rules

Distance-based Association Rule


2-step mining process:

• Perform clustering to find the intervals of the attributes involved
• Obtain association rules by searching for groups of clusters that occur together

The resultant rules must satisfy:

• Clusters in the rule antecedent are strongly associated with clusters in the rule
  consequent
• Clusters in the antecedent occur together
• Clusters in the consequent occur together
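
The two steps can be illustrated with a deliberately simple sketch. The clustering rule used here (start a new cluster whenever the gap between neighboring sorted values exceeds `max_gap`) is our own assumption, standing in for whatever clustering method is actually used, and the (age, income) records are hypothetical.

```python
from collections import Counter

def cluster_intervals(values, max_gap):
    """Group 1-D values into clusters; a gap larger than max_gap starts a new one.
    Returns one (low, high) interval per cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    intervals = []
    low = prev = ordered[0]
    for v in ordered[1:]:
        if v - prev > max_gap:
            intervals.append((low, prev))
            low = v
        prev = v
    intervals.append((low, prev))
    return intervals

def interval_of(value, intervals):
    """Return the interval that contains `value`."""
    return next(iv for iv in intervals if iv[0] <= value <= iv[1])

# Hypothetical (age, income in thousands) records.
records = [(21, 30), (22, 32), (24, 31), (40, 70), (41, 72), (63, 33)]

# Step 1: cluster each attribute to find its intervals.
age_iv = cluster_intervals([a for a, _ in records], max_gap=5)
income_iv = cluster_intervals([i for _, i in records], max_gap=5)
print(age_iv)     # [(21, 24), (40, 41), (63, 63)]
print(income_iv)  # [(30, 33), (70, 72)]

# Step 2: count how often an age cluster and an income cluster occur together.
together = Counter(
    (interval_of(a, age_iv), interval_of(i, income_iv)) for a, i in records
)
for pair, n in together.most_common():
    print(pair, n)
```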

Fp Growth Algorithm

FP-growth (frequent pattern growth) is an improvement on the Apriori algorithm. It is used for
finding frequent itemsets in a transaction database without candidate generation.

FP-growth represents the frequent items in a frequent pattern tree, or FP-tree.
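
To make the FP-tree concrete, here is a minimal sketch of the tree construction only, using two passes over the data. The mining step that extracts patterns from the tree, and the header links between nodes of the same item, are left out. The class name `Node`, the function names, and the five transactions are our own hypothetical choices.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count every item and keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}

    # Pass 2: insert each transaction with its frequent items ordered by
    # descending count (ties broken by item name) so shared prefixes merge.
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, freq

def show(node, depth=0):
    """Print the tree with one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Hypothetical transactions, for illustration only.
data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "c"]]
root, freq = build_fp_tree(data, min_count=2)
show(root)
```

Item d occurs only once, so it is dropped in the first pass. The three transactions that start with a, b share a single path in the tree, which is where the compression comes from.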


Advantages of FP growth algorithm:-
1. Faster than apriori algorithm
2. No candidate generation
3. Only two passes over dataset

Disadvantages of FP growth algorithm:-


1. FP tree may not fit in memory
2. FP tree is expensive to build
Fp growth algorithm example
Consider the following database(D)

Let minimum support = 3%


Time series mining association rules

With increasing concern about environmental problems and their impact on daily life, there is a
new requirement for analyzing environmental changes and their effects. One study on this topic
uses Western Pacific events and a basic background database as its data source to find
associations between different marine parameters. An improved Apriori algorithm is used to
discover knowledge in massive spatio-temporal data. There are two main steps. First, according
to the degree of variation at each point, the study area is divided into many spatio-temporal
transaction zones. Second, the improved Apriori algorithm is applied for spatio-temporal data
mining. Because the mining algorithm requires it, the quantitative attributes are transformed
into qualitative attributes: the concept generalization method divides the original attribute data
into several levels. The Apriori algorithm can then be used to discover potential
associations between marine parameters within the given time frame.
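
The transformation from quantitative readings to qualitative items can be sketched as below. Everything specific here is an assumption made for illustration: the parameter names, the six readings, and the cut points are invented, and a plain three-level mapping stands in for the concept generalization method. Each time step becomes one transaction, after which support and confidence are computed in the usual way.

```python
# Hypothetical monthly readings for two marine parameters (values are made up).
sea_temp = [26.1, 27.4, 29.0, 29.3, 27.0, 25.8]
chlorophyll = [0.30, 0.22, 0.11, 0.10, 0.24, 0.33]

def to_level(value, low_cut, high_cut):
    """Generalize a number to one of three qualitative levels."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"

# One transaction per time step, holding one qualitative item per parameter.
transactions = [
    {"temp=" + to_level(t, 26.5, 28.5), "chl=" + to_level(c, 0.15, 0.28)}
    for t, c in zip(sea_temp, chlorophyll)
]

def support(itemset):
    """Fraction of time steps whose transaction contains the whole itemset."""
    return sum(1 for tr in transactions if itemset <= tr) / len(transactions)

def confidence(antecedent, consequent):
    """Support of both sides together divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"temp=high", "chl=low"}))
print(confidence({"temp=high"}, {"chl=low"}))
```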

latest trends in association rules mining

Association rule mining (ARM) techniques are effective in extracting frequent patterns and
hidden associations among data items in various databases. These techniques are widely used
for learning behavior, predicting events, and making decisions at various levels. Conventional
ARM techniques, however, are limited to databases comprising categorical data only, whereas
real-world databases, mostly in business and scientific domains, have attributes containing
quantitative data. Therefore, an improved methodology called Quantitative Association Rule
Mining (QARM) is used to discover hidden associations in real-world
quantitative databases. Recent work on QARM surveys the research trends, classifies the
available techniques into categories based on the computational methods they adopt,
critically analyzes and compares the proposed methods, and enumerates issues that need
to be addressed in future research.
