DATA MINING UNIT-II NOTES
Unit-II
Association rule mining finds association rules, frequent patterns, subsequences, or
correlation relationships among large sets of data items that satisfy a predefined
minimum support and confidence in a given database.
A typical example of frequent itemset mining is market basket analysis. This process
analyzes customer buying habits by finding associations between the different items
that customers place in their “shopping baskets”. The discovery of these associations
can help retailers develop marketing strategies by gaining insight into which items are
frequently purchased together by customers.
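As a small illustration (the baskets and the candidate rule below are assumed, not taken from these notes), support and confidence can be computed directly from transaction data:

```python
# Toy market-basket data (assumed for illustration).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Rule {bread} => {milk}:
# support = support(bread and milk together), confidence = support / support(antecedent).
s = support({"bread", "milk"})   # 2/4 = 0.5
c = s / support({"bread"})       # 0.5 / 0.75 = 2/3
```

A rule is reported as strong when both values meet the predefined minimum support and minimum confidence thresholds.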
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The sample size of S is chosen so that the search for frequent itemsets in S
can be done in main memory, and so only one scan of the transactions in S is
required overall. Because we are searching for frequent itemsets in S rather than in D,
it is possible that we will miss some of the global frequent itemsets.
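A minimal sketch of this idea follows (the data, sample size, and the heuristic of lowering the threshold on the sample are all assumed; lowering min_sup on S is a common way to reduce the chance of missing globally frequent itemsets):

```python
import random

# Toy transaction database D (assumed data).
D = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"A"}, {"B"}]

random.seed(0)
S = random.sample(D, k=3)   # random sample S, small enough for main memory

min_sup = 0.5
lowered = 0.8 * min_sup     # lowered threshold on S (heuristic, assumed factor)

# One scan over S: count 1-itemset supports and keep those above the threshold.
items = {i for t in S for i in t}
freq_in_S = {i for i in items
             if sum(i in t for t in S) / len(S) >= lowered}
```

The itemsets found in S are then verified against the full database D, since some global frequent itemsets may still be missed.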
Dynamic itemset counting (adding candidate itemsets at different points during
a scan): A dynamic itemset counting technique was proposed in which the database is
partitioned into blocks marked by start points. In this variation, new candidate
itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan. The
technique uses the count-so-far as the lower bound of the actual count. If the count-
so-far passes the minimum support, the itemset is added into the frequent itemset
collection and can be used to generate longer candidates. This leads to fewer database
scans than with Apriori for finding all the frequent itemsets.
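A heavily simplified sketch of the count-so-far idea is given below (toy data assumed; real DIC additionally tracks candidate states and wraps around the database so that candidates added mid-scan are eventually counted over every block, which this sketch omits):

```python
# Toy transaction database (assumed); fixed-size blocks, and block
# boundaries act as the "start points" described above.
transactions = [
    {"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"},
    {"A", "B"}, {"A", "C"},
]
min_sup = 3          # absolute support threshold
block_size = 2

counts = {frozenset([i]): 0 for t in transactions for i in t}
frequent = set()

for start in range(0, len(transactions), block_size):
    for t in transactions[start:start + block_size]:
        for cand in list(counts):
            if cand <= t:
                counts[cand] += 1
    # Start point: the count-so-far is a lower bound on the true count,
    # so any itemset already meeting min_sup is frequent and can be
    # combined with other frequent itemsets into longer candidates.
    for cand, c in list(counts.items()):
        if c >= min_sup and cand not in frequent:
            frequent.add(cand)
            for other in frequent:
                new = cand | other
                if len(new) == len(cand) + 1 and new not in counts:
                    counts[new] = 0   # begins counting mid-scan
```

Because candidates can start counting mid-scan instead of waiting for the next full pass, fewer complete database scans are needed than in Apriori.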
We first consider I5, which is the last item in L, rather than the first. The reason for
starting at the end of the list will become apparent as we explain the FP-tree mining
process. I5 occurs in two branches of the FP-tree shown in the figure.
Given Min_sup=50%
Support(A) = 3/4 =75%
Support(B) = 1/4 =25%
Support(C) = 2/4 =50%
Support(D) = 2/4 =50%
Support(E) = 2/4 =50%
So items A, C, D, and E satisfy the 50% min_sup and are therefore frequent; item B (25%) is not.
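These supports can be reproduced with a hypothetical four-transaction database that is consistent with the counts above (the actual transactions are not listed in the notes):

```python
# Hypothetical transactions consistent with the item counts above:
# A appears in 3 of 4, B in 1, and C, D, E each in 2.
transactions = [
    {"A", "C", "D"},
    {"A", "E"},
    {"A", "C", "E"},
    {"B", "D"},
]
min_sup = 0.5

items = {i for t in transactions for i in t}
support = {i: sum(i in t for t in transactions) / len(transactions)
           for i in items}
frequent = sorted(i for i, s in support.items() if s >= min_sup)
# frequent == ['A', 'C', 'D', 'E']
```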
Faculty: Mr. D. Krishna, Associate Professor CSE Dept
There are two such compact representations of frequent itemsets:
1. Maximal Frequent Itemsets:
An itemset is maximal frequent if none of its immediate supersets is
frequent.
2. Closed Frequent Itemsets:
An itemset is closed if none of its immediate supersets has exactly the same
support count.
Closed itemsets provide a minimal representation of itemsets without
losing their support information.
Maximal frequent itemset:
Maximal frequent itemset is a frequent itemset for which none of its immediate
supersets are frequent.
All maximal frequent itemsets are closed frequent itemsets but the converse is not
true.
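A brute-force sketch with assumed toy transactions shows both definitions in action, including the fact that every maximal frequent itemset is also closed:

```python
from itertools import combinations

# Toy transaction database (assumed for illustration).
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"A", "B"}]
min_count = 2
items = sorted({i for t in transactions for i in t})

def count(s):
    """Number of transactions containing every item in s."""
    return sum(s <= t for t in transactions)

# Enumerate all frequent itemsets with their support counts.
frequent = {frozenset(c): count(set(c))
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if count(set(c)) >= min_count}

# Maximal: no frequent proper superset exists.
maximal = {s for s in frequent
           if not any(s < t for t in frequent)}

# Closed: no proper superset has exactly the same support count.
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}
```

Here {A} is closed but not maximal (its superset {A, B} is still frequent, though with lower support), illustrating that the converse inclusion does not hold.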
[Figure: the set of frequent itemsets, with the maximal frequent itemsets on its border]
This rule contains three predicates (age, occupation, and buys), each of which
occurs only once in the rule; such rules are called inter-dimensional association
rules.
The rules with repeated predicates or containing multiple occurrences of some
predicates are called hybrid-dimension association rules.
For example: age(X, "20...29") ∧ buys(X, "Laptop") ⇒ buys(X, "Printer")
This rule contains two predicates (age and buys), with buys occurring twice, so it
is a hybrid-dimension rule.
The cells of an n-dimensional cuboid correspond to the predicate sets.
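A small sketch evaluating the hybrid-dimension rule above over hypothetical customer records (the attribute names and all data values are assumed):

```python
# Hypothetical customer records (assumed data).
customers = [
    {"age": 25, "buys": {"Laptop", "Printer"}},
    {"age": 23, "buys": {"Laptop"}},
    {"age": 34, "buys": {"Laptop", "Printer"}},
    {"age": 28, "buys": {"Laptop", "Printer"}},
]

# Rule: age(X, "20...29") ^ buys(X, "Laptop") => buys(X, "Printer")
antecedent = [c for c in customers
              if 20 <= c["age"] <= 29 and "Laptop" in c["buys"]]
both = [c for c in antecedent if "Printer" in c["buys"]]

support = len(both) / len(customers)       # 2/4 = 0.5
confidence = len(both) / len(antecedent)   # 2/3
```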
The second step is to find the support of each candidate. The first step must be
optimized, because generating the candidate predicate sets is the computationally
hard part (the underlying problem is NP-complete) and its cost dominates.
Sequential Pattern Mining:
A sequence database consists of ordered elements or events. For example, a customer
first buys bread, then eggs and cheese, and then milk. This forms a sequence
consisting of three ordered events. We consider an event or a subsequence frequent
if its support, which is the number of sequences that contain this event or
subsequence, meets a user-defined minimum support threshold. A sequential pattern
mining algorithm finds all such frequent patterns in the input sequences.
The task of sequential pattern mining is a data mining task specialized for
analyzing sequential data, to discover sequential patterns. More precisely, it
consists of discovering interesting subsequences in a set of sequences, where the
interestingness of a subsequence can be measured in terms of various criteria such as
its occurrence frequency, length, and profit. Sequential pattern mining has
numerous real-life applications because data is naturally encoded
as sequences of symbols in many fields such as bioinformatics, e-learning, market
basket analysis, texts, and webpage click-stream analysis. Consider the task
of sequential pattern mining with an example. Consider the following sequence
database, representing the purchases made by customers in a retail store.
This database contains four sequences. Each sequence represents the items
purchased by a customer at different times. A sequence is an ordered list of itemsets
(sets of items bought together). For example, in this database, the first sequence (SID
1) indicates that a customer bought some items a and b together, then purchased an
item c, then purchased items f and g together, then purchased an item g, and then
finally purchased an item e.
Traditionally, sequential pattern mining has been used to find subsequences that
appear often in a sequence database, i.e. that are common to several sequences.
Those subsequences are called the frequent sequential patterns.
For example, in the context of our example, sequential pattern mining can be used
to find the sequences of items frequently bought by customers. This can be useful to
understand the behavior of customers in order to make marketing decisions.
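A minimal sketch of computing the support of a sequential pattern follows; SID 1 is taken from the description above, while the other three sequences and the pattern are assumed for illustration:

```python
def contains(sequence, pattern):
    """True if pattern (an ordered list of itemsets) is a subsequence of
    sequence: each pattern itemset is contained in some later element."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

# Sequence database: SID 1 is from the text; SIDs 2-4 are assumed.
db = [
    [{"a", "b"}, {"c"}, {"f", "g"}, {"g"}, {"e"}],   # SID 1
    [{"a", "d"}, {"c"}, {"b"}, {"a", "b", "e", "f"}],
    [{"a"}, {"b"}, {"f"}, {"e"}],
    [{"b"}, {"f", "g"}],
]

# Support of the pattern <{a}, {f}> = number of sequences containing it.
pattern = [{"a"}, {"f"}]
support = sum(contains(s, pattern) for s in db)   # 3
```

A pattern is a frequent sequential pattern when this support meets the user-defined minimum support threshold.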
Essay Questions
(b) List all of the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and item_i denotes variables
representing items (e.g., "A", "B", etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]