DM Unit 3
Mining Frequent Patterns: Associations and Correlations, Basic Concepts, Efficient and
Scalable Frequent Itemset Mining Methods, Mining various kinds of Association Rules, From
Association Mining to Correlation Analysis, Constraint-Based Association Mining
Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. The correlation coefficient's value varies between +1 and -1, and a value of ±1 indicates a perfect degree of association between the two variables.
As the correlation coefficient's value approaches 0, the relationship between the two variables becomes weaker. The sign of the coefficient indicates the direction of the relationship: a + sign indicates a positive relationship, and a - sign indicates a negative relationship.
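As a quick illustration, the following minimal Python sketch computes the Pearson correlation coefficient for two made-up variables (the data values are hypothetical, purely to show the +1/-1 interpretation):

from statistics import mean

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of std. deviations.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours_studied = [1, 2, 3, 4, 5]        # hypothetical data
exam_score = [52, 58, 61, 70, 74]      # hypothetical data
print(pearson(hours_studied, exam_score))  # close to +1: strong positive relationship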
Example 1:
Here is a dataset consisting of six transactions in an hour. Each transaction is a combination of 0s
and 1s, where 0 represents the absence of an item and 1 represents the presence of it.
We can find multiple rules in this scenario. For example, in a transaction of wine, chips, and bread, if wine and chips are bought, then customers also buy bread:
{wine, chips} => {bread}
In order to select the interesting rules out of multiple possible rules from this small business
scenario, we will be using the following measures:
Support
Confidence
Lift
Conviction
Support
Support of item x is nothing but the ratio of the number of transactions in which item x appears to
the total number of transactions.
i.e., Support(x) = (Number of transactions in which x appears) / (Total number of transactions)
Confidence
Confidence (x => y) signifies the likelihood of the item y being purchased when item x is
purchased. This method takes into account the popularity of item x.
i.e., Confidence(x => y) = Support(x ∪ y) / Support(x)
Lift
Lift (x => y) is nothing but the 'interestingness' or the likelihood of the item y being purchased when item x is sold. Unlike confidence (x => y), this measure takes into account the popularity of item y.
i.e., Lift(x => y) = Confidence(x => y) / Support(y)
Conviction
Conviction of a rule can be defined as follows:
Conviction(x => y) = (1 - Support(y)) / (1 - Confidence(x => y))
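To make the four measures concrete, here is a minimal Python sketch. The six transactions below are a hypothetical reconstruction (the original transaction table is not shown); they happen to be consistent with the frequency tables of Example 2 later in this unit, but treat the exact numbers as illustrative only:

# Hypothetical transactions (one possible reconstruction; the original table
# was not shown). Each transaction is a set of purchased items.
transactions = [
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "bread", "milk"},
    {"wine", "chips", "milk"},
    {"wine", "bread", "milk"},
    {"chips"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # Likelihood of buying y given that x was bought.
    return support(x | y) / support(x)

def lift(x, y):
    # Confidence adjusted for the popularity of y; 1 means independence.
    return confidence(x, y) / support(y)

def conviction(x, y):
    # Ratio of expected to observed frequency of incorrect predictions.
    c = confidence(x, y)
    return float("inf") if c == 1 else (1 - support(y)) / (1 - c)

x, y = {"wine", "chips"}, {"bread"}
print(support(x | y), confidence(x, y), lift(x, y), conviction(x, y))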
Example 2:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items includes milk, bread, and butter, and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table.
For example, the rule {butter, bread} => {milk} has a confidence of 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Confidence can be interpreted as an estimate of the conditional probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), or the ratio of the observed support to that expected if X and Y were independent. The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)), and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions.
Market Basket Analysis
This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Example:
If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as computers.
Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined:
We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
2. Based on the levels of abstraction involved in the rule set:
Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rules, where X is a variable representing a customer:
buys(X, "computer") => buys(X, "HP printer") (1)
buys(X, "laptop computer") => buys(X, "HP printer") (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., "computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example:
buys(X, "computer") => buys(X, "antivirus software")
If a rule references two or more dimensions, such as the dimensions age, income, and
buys,then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "high resolution TV")
Therefore, structured pattern mining can be considered as the most general form of frequent
pattern mining.
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process, consisting of join and prune actions, is followed in Apriori, as sketched below.
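A minimal Python sketch of this level-wise search follows. It assumes transactions are given as sets of items; it is an illustration of the join/prune idea, not the authors' original implementation:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets with support count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        # Support count: number of transactions containing the itemset.
        return sum(itemset <= t for t in transactions)

    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets, found by one scan of the database.
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    frequent, k = set(Lk), 2
    while Lk:
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # One full scan per level to keep candidates meeting minimum support.
        Lk = {c for c in candidates if count(c) >= min_sup}
        frequent |= Lk
        k += 1
    return frequent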
Example:
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent
1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying
minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. To generate the set of candidate 3-itemsets, C3, the join step first gives C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the latter four candidates cannot possibly be frequent, so they are pruned, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
Example 2
Step 1: Create a frequency table of all the items that occur in all transactions
Item Frequency
Wine 4
Chips 4
Bread 4
Milk 5
Step 2: Find the significant items, i.e., those whose frequency meets the support threshold. Here all four items qualify:
Item Frequency
Wine 4
Chips 4
Bread 4
Milk 5
Step 3: From the significant items, make possible pairs irrespective of the order
Item Frequency
Wine, Chips 3
Wine, Bread 3
Wine, Milk 4
Chips, Bread 2
Chips, Milk 3
Bread, Milk 4
Step 4: Again, find the significant items based on the support threshold
Item Frequency
Wine, Milk 4
Bread, Milk 4
Step 5: Now, make a set of three items that are bought together based on the significant items from
Step 4
Item Frequency
Wine, Bread, Milk 3
{Wine, Bread, Milk} is the only significant three-item set we have got from the given data. But in real-world scenarios, we would have dozens of items to build rules from, and we might have to generate four- or five-item sets.
Example 3:
TID List of item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
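Feeding this table to the apriori() sketch given earlier, with min_sup = 2 as in the worked steps above, reproduces the same frequent itemsets:

D = [
    {"I1", "I2", "I5"},        # T100
    {"I2", "I4"},              # T200
    {"I2", "I3"},              # T300
    {"I1", "I2", "I4"},        # T400
    {"I1", "I3"},              # T500
    {"I2", "I3"},              # T600
    {"I1", "I3"},              # T700
    {"I1", "I2", "I3", "I5"},  # T800
    {"I1", "I2", "I3"},        # T900
]
for itemset in sorted(apriori(D, min_sup=2), key=len):
    print(sorted(itemset))
# The 3-itemsets printed are {I1, I2, I3} and {I1, I2, I5}, matching step 7.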
For many applications, it is difficult to find strong associations among data items at low or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense knowledge.
Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher
level, more general concepts. Data can be generalized by replacing low-level concepts within the
data by their higher-level concepts, or ancestors, from a concept hierarchy.
The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with
level 0 at the root node for all.
Here, Level 1 includes computer, software, printer & camera, and computer accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus software
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
With reduced minimum support at lower levels, each level of abstraction has its own minimum support threshold: the deeper the level of abstraction, the smaller the corresponding threshold.
For example, if the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively, then "computer," "laptop computer," and "desktop computer" are all considered frequent.
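As a sketch of the idea, the snippet below generalizes transactions up a hypothetical two-level concept hierarchy and mines each level with its own threshold, reusing the apriori() sketch from earlier; the hierarchy, transactions, and threshold values are all illustrative assumptions:

# Hypothetical concept hierarchy: child concept -> parent concept.
hierarchy = {
    "laptop computer": "computer",
    "desktop computer": "computer",
    "office software": "software",
    "antivirus software": "software",
}
transactions = [
    {"laptop computer", "antivirus software"},
    {"desktop computer", "office software"},
    {"laptop computer", "office software"},
]

def generalize(transaction):
    # Replace each item by its higher-level concept (ancestor), if any.
    return {hierarchy.get(item, item) for item in transaction}

# Level 1 (general concepts) uses the higher threshold; here min_sup = 2.
level1 = apriori([generalize(t) for t in transactions], min_sup=2)
# Level 2 (specific items) uses the reduced threshold; here min_sup = 1.
level2 = apriori(transactions, min_sup=1)
print(level1, level2)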
Association rules that involve two or more dimensions or predicates can be referred to as multidimensional association rules.
The support-confidence framework can be supplemented with a correlation measure, leading to correlation rules of the form A => B [support, confidence, correlation]. That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.
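For example, lift itself can serve as a simple correlation measure between itemsets A and B: a value below 1 suggests negative correlation, 1 suggests independence, and a value above 1 suggests positive correlation. A one-function sketch, reusing the support() helper and hypothetical transactions from the earlier measures example:

def correlation_lift(A, B):
    # lift(A, B) = P(A and B) / (P(A) * P(B)); assumes support() from earlier.
    return support(A | B) / (support(A) * support(B))

print(correlation_lift({"wine"}, {"milk"}))  # > 1 here: positively correlated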
The most general constraint is the minimum support threshold. When a constraint can be pushed into the mining phase, its inclusion can significantly reduce the exploration space, because it defines a boundary inside the search-space lattice beyond which exploration is not needed.
The importance of constraints is that they lead to the generation of only those association rules that are interesting to users. The rule space is thereby reduced, so that only rules satisfying the constraints remain.
Constraint-based clustering, similarly, discovers clusters that satisfy user-specified preferences or constraints. Depending on the characteristics of the constraints, constraint-based clustering may adopt different approaches.
The constraints can include the following:
Knowledge type constraints − These specify the type of knowledge to be mined, such as association or correlation.
Data constraints − These specify the set of task-relevant data.
Dimension/level constraints − These specify the desired dimensions (or attributes) of the data, or levels of the concept hierarchies, to be used in mining.
Interestingness constraints − These specify thresholds on statistical measures of rule interestingness, such as support, confidence, and correlation.
Rule constraints − These specify the form of rules to be mined. Such constraints may be expressed as metarules (rule templates), as the maximum or minimum number of predicates that can occur in the rule antecedent or consequent, or as relationships among attributes, attribute values, and/or aggregates.
These constraints can be specified using a high-level declarative data mining query language and user interface. This form of constraint-based mining allows users to describe the rules that they would like to uncover, thereby making the data mining process more effective. In addition, a sophisticated mining query optimizer can be used to exploit the constraints specified by the user, thereby making the mining process more efficient. Constraint-based mining encourages interactive exploratory mining and analysis; a small filtering sketch follows.
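As an illustration, the sketch below filters candidate rules with a hypothetical rule constraint (a bound on antecedent length plus a required consequent item); practical systems push such constraints deep into the mining phase rather than post-filtering, but the effect on the rule space is the same:

def satisfies_constraints(antecedent, consequent):
    # Hypothetical rule constraint: at most two predicates in the antecedent
    # and "milk" required in the consequent.
    return len(antecedent) <= 2 and "milk" in consequent

rules = [({"wine", "chips"}, {"bread"}), ({"bread"}, {"milk"})]
interesting = [r for r in rules if satisfies_constraints(*r)]
print(interesting)  # only ({'bread'}, {'milk'}) satisfies the constraint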
Applications of the Apriori Algorithm
Used in forest departments to understand the intensity and probability of forest fires.
Used by Google and other search engines for their auto-complete features.
Used by healthcare departments to analyze patient databases and predict which patients might develop high blood pressure, diabetes, or other common diseases.
Used to categorize students based on their specialties and performance to improve their academic performance.
Used by e-commerce websites in their recommendation systems to provide a better user experience.