
Unit 3 Data Mining and Warehousing Semester V

Unit 3
Mining Frequent Patterns: Associations and Correlations, Basic Concepts, Efficient and
Scalable Frequent Itemset Mining Methods, Mining various kinds of Association Rules, From
Association Mining to Correlation Analysis, Constraint-Based Association Mining

Correlation Analysis in Data Mining


Correlation analysis is a statistical method used to measure the strength of the linear relationship
between two variables and compute their association. Correlation analysis calculates the level of
change in one variable due to the change in the other. A high correlation points to a strong relationship
between the two variables, while a low correlation means that the variables are weakly related.
Researchers use correlation analysis to analyze quantitative data collected through research methods
like surveys and live polls for market research. They try to identify relationships, patterns, significant
connections, and trends between two variables or datasets. There is a positive correlation between two
variables when an increase in one variable leads to an increase in the other. On the other hand, a
negative correlation means that when one variable increases, the other decreases and vice-versa.

Correlation is a bivariate analysis that measures the strength of association between two variables and
the direction of the relationship. In terms of the strength of the relationship, the correlation coefficient's
value varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the
two variables.

As the correlation coefficient value goes towards 0, the relationship between the two variables will be
weaker. The coefficient sign indicates the direction of the relationship; a + sign indicates a positive
relationship, and a - sign indicates a negative relationship.
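As a concrete illustration, the short Python sketch below computes Pearson's correlation coefficient, the most common measure behind this kind of analysis. The variable names and sample figures are invented for the example.

import math

def pearson_r(xs, ys):
    # Pearson's r = covariance(x, y) / (std_dev(x) * std_dev(y))
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical survey data: advertising spend vs. sales
spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 46, 52]
print(round(pearson_r(spend, sales), 3))  # close to +1, i.e. a strong positive correlation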

3.1 Association Rule Mining:


Association rule mining is a popular and well-researched method for discovering interesting
relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of
interestingness.
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:
The problem of association rule mining is defined as:

Let I = {i1, i2, ..., in} be a set of binary attributes called items.

Let D = {t1, t2, ..., tm} be a set of transactions called the database.

Each transaction t in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and
consequent (right-hand side or RHS) of the rule, respectively.

Example 1:
Here is a dataset consisting of six transactions in an hour. Each transaction is a combination of 0s
and 1s, where 0 represents the absence of an item and 1 represents the presence of it.

We can find multiple rules from this scenario. For example, in a transaction of wine, chips, and
bread, if wine and chips are bought, then customers also buy bread.
{wine, chips} => {bread}
In order to select the interesting rules out of multiple possible rules from this small business
scenario, we will be using the following measures:
• Support
• Confidence
• Lift
• Conviction
Support
The support of an item x is the ratio of the number of transactions in which x appears to
the total number of transactions.

i.e., Support(x) = (Number of transactions in which x appears) / (Total number of transactions)

Confidence
Confidence (x => y) signifies the likelihood of the item y being purchased when item x is
purchased. This method takes into account the popularity of item x.
i.e., Confidence(x => y) = Support(x ∪ y) / Support(x)

Lift
Lift (x => y) is the ‘interestingness’ of the rule, i.e., the likelihood of item y being
purchased when item x is sold. Unlike confidence (x => y), this measure also takes the
popularity of item y into account.
i.e., Lift(x => y) = Support(x ∪ y) / (Support(x) × Support(y))

Lift (x => y) = 1 means that there is no correlation within the itemset.

Lift (x => y) > 1 means that there is a positive correlation within the itemset, i.e., the products
x and y are more likely to be bought together.
Lift (x => y) < 1 means that there is a negative correlation within the itemset, i.e., the products
x and y are unlikely to be bought together.

Conviction
The conviction of a rule is defined as:
Conviction(x => y) = (1 − Support(y)) / (1 − Confidence(x => y))

Its value range is [0, +∞).

Conv(x => y) = 1 means that x has no relation with y.
The greater the conviction, the higher the interest in the rule.
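The four measures above can be computed directly from a list of transactions. The Python sketch below is one minimal way to do so; the function names and the set-based transaction representation are choices made here for illustration, not part of any particular library.

# Sketch: computing support, confidence, lift and conviction from a list of transactions.
# Transactions are plain Python sets; itemsets x and y are sets of item names.

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # likelihood of y given x: support of x and y together, divided by support of x
    return support(x | y, transactions) / support(x, transactions)

def lift(x, y, transactions):
    # observed co-occurrence relative to what independence of x and y would predict
    return support(x | y, transactions) / (support(x, transactions) * support(y, transactions))

def conviction(x, y, transactions):
    conf = confidence(x, y, transactions)
    if conf == 1:
        return float("inf")  # the rule never makes an incorrect prediction
    return (1 - support(y, transactions)) / (1 - conf)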

Example 2:
To illustrate the concepts, we use a small example from the supermarket domain. The set of
items is I = {milk, bread, butter, beer}, and a small database containing the items (1 codes
presence and 0 codes absence of an item in a transaction) is shown in the table below.

An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if
butter and bread are bought, customers also buy milk.

Example database with 4 items and 5 transactions

Transaction ID milk bread butter beer


1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 1 1 0
5 0 1 0 0

3.1.1 Important concepts of Association Rule Mining:



The support supp(X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset. In the example database, the itemset
{milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all
transactions (1 out of 5 transactions).

The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0
in the database, which means that for 100% of the transactions containing
butter and bread the rule is correct (100% of the times a customer buys butter and bread,
milk is bought as well). Confidence can be interpreted as an estimate of the conditional
probability P(Y|X), the probability of finding the RHS of the rule in transactions under
the condition that these transactions also contain the LHS.

The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),

or the ratio of the observed support to that expected if X and Y were independent. The

rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)).

The rule {milk, bread} => {butter} has a conviction of

(1 − 0.4) / (1 − 0.5) = 1.2,

and can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
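Assuming the support, confidence, lift and conviction helpers sketched under Example 1 are in scope, the worked values above can be reproduced on this 5-transaction database:

# Reproducing the worked values on the milk/bread/butter/beer example (5 transactions).
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]
x, y = {"milk", "bread"}, {"butter"}
print(support({"milk", "bread", "butter"}, transactions))       # 0.2
print(confidence({"butter", "bread"}, {"milk"}, transactions))  # 1.0
print(lift(x, y, transactions))        # 0.2 / (0.4 * 0.4) = 1.25
print(conviction(x, y, transactions))  # (1 - 0.4) / (1 - 0.5) = 1.2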

3.2 Market basket analysis:

This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.

Example:


If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as computers.

3.3 Frequent Pattern Mining:

Frequent pattern mining can be classified in various ways, based on the following criteria:
1. Based on the completeness of patterns to be mined:

We can mine the complete set of frequent itemsets, the closed frequent itemsets, and
the maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets,
near-match frequent itemsets, top-k frequent itemsets, and so on.

2. Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction.

For example, suppose that a set of association rules mined includes the following
rules, where X is a variable representing a customer:

buys(X, "computer") => buys(X, "HP printer") (1)

buys(X, "laptop computer") => buys(X, "HP printer") (2)

In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:


If the items or attributes in an association rule reference only one dimension, then it is
a single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")

If a rule references two or more dimensions, such as the dimensions age, income, and
buys, then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "high resolution TV")

4. Based on the types of values handled in the rule:


If a rule involves associations between the presence or absence of items, it is a
Boolean association rule.
If a rule describes associations between quantitative items or attributes, then it is
a quantitative association rule.

5. Based on the kinds of rules to be mined:


Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.
Association rule mining can generate a large number of rules, many of which are
redundant or do not indicate a correlation relationship among itemsets.
The discovered associations can be further analyzed to uncover statistical
correlations, leading to correlation rules.

6. Based on the kinds of patterns to be mined:


Many kinds of frequent patterns can be mined from different kinds of data sets.
Sequential pattern mining searches for frequent subsequences in a sequence data set, where
a sequence records an ordering of events.
For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.
Structured pattern mining searches for frequent substructures in a structured data
set. Single items are the simplest form of structure.
Each element of an itemset may contain a subsequence, a subtree, and so on.

Therefore, structured pattern mining can be considered as the most general form of frequent
pattern mining.


3.4 Efficient Frequent Itemset Mining Methods:

3.4.1 Finding Frequent Itemsets Using Candidate Generation:TheApriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item and collecting those items that satisfy minimum support. The resulting
set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori consisting of join and prune action.
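Before working through the example below, here is a compact Python sketch of this level-wise search. It is a simplified illustration, not the full textbook pseudocode: the join step here merges any two frequent k-itemsets whose union has k+1 items, and the prune step applies the Apriori property.

from itertools import combinations

def apriori(transactions, min_support_count):
    # Level-wise search: frequent k-itemsets (Lk) are used to build candidate (k+1)-itemsets.
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}      # candidate 1-itemsets, C1
    frequent = {}
    while current:
        # one scan of the database to count the current candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support_count}   # Lk
        frequent.update(level)
        # join step: merge frequent k-itemsets that differ in exactly one item;
        # prune step: keep a candidate only if all of its k-subsets are frequent
        candidates = set()
        for a in level:
            for b in level:
                union = a | b
                if len(union) == len(a) + 1:
                    if all(frozenset(s) in level for s in combinations(union, len(a))):
                        candidates.add(union)
        current = candidates
    return frequent   # maps each frequent itemset to its support count

Applied to the nine-transaction database below with a minimum support count of 2, this sketch returns, among others, the frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5}.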


Example:

TID List of item IDs


T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.


Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent
1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying
minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune
step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2
is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 ⋈
L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the
Apriori property that all subsets of a frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4.
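To make step 6 concrete, the following sketch applies the join and prune steps to L2. The join here simply unions pairs of 2-itemsets, which is slightly looser than the textbook lexicographic join, but the prune step removes the extra candidates and leaves exactly {I1, I2, I3} and {I1, I2, I5}.

from itertools import combinations

L2 = [frozenset(s) for s in [{"I1", "I2"}, {"I1", "I3"}, {"I1", "I5"},
                             {"I2", "I3"}, {"I2", "I4"}, {"I2", "I5"}]]

# Join step: union pairs of frequent 2-itemsets whose union has three items.
joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune step (Apriori property): drop any candidate that has an infrequent 2-subset.
C3 = [c for c in joined if all(frozenset(s) in L2 for s in combinations(c, 2))]
print(sorted(sorted(c) for c in C3))  # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]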


Example 2


Step 1: Create a frequency table of all the items that occur in all transactions
Item Frequency
Wine 4
Chips 4
Bread 4
Milk 5

Step 2: Find the significant items based on the support threshold


Support threshold (minimum support count) = 3

Item Frequency
Wine 4
Chips 4
Bread 4
Milk 5

Step 3: From the significant items, make possible pairs irrespective of the order
Item Frequency
Wine, Chips 3
Wine, Bread 3
Wine, Milk 4
Chips, Bread 2
Chips, Milk 3
Bread, Milk 4
Step 4: Again, find the significant items based on the support threshold
Item Frequency
Wine, Milk 4
Bread, Milk 4
Step 5: Now, make a set of three items that are bought together based on the significant items from
Step 4

Item Frequency
Wine, Bread, Milk 3

{Wine, Bread, Milk} is the only significant itemset we have obtained from the given data. But in real-
world scenarios, we would have dozens of items to build rules from, and we might then have to go on
to four- or five-item itemsets.

Example 3:

3.4.2 Generating Association Rules from Frequent Itemsets:


Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.


Example:
TID List of item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
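A sketch of this rule-generation step in Python: for every frequent itemset l and every nonempty proper subset s, the candidate rule s => (l − s) is kept if its confidence meets the threshold. It assumes a dictionary mapping each frequent itemset to its support count, such as the one produced by the Apriori sketch earlier; the minimum-confidence value is arbitrary.

from itertools import combinations

def generate_rules(frequent, min_confidence):
    # frequent: dict mapping frozenset(itemset) -> support count; because every subset of a
    # frequent itemset is itself frequent, frequent[lhs] is always available.
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                rhs = itemset - lhs
                conf = count / frequent[lhs]   # supp(l) / supp(s)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

# For the nine-transaction database above with min sup = 2 and min_confidence = 0.7,
# the rule {I1, I5} => {I2} (confidence 2/2 = 1.0) would be among the rules kept.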

3.5 Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among data items at
low or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent
common sense knowledge.
Therefore, data mining systems should provide capabilities for mining association
rules at multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.

Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.


Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at the concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no more
frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher
level, more general concepts. Data can be generalized by replacing low-level concepts within the
data by their higher-level concepts, or ancestors, from a concept hierarchy.


The concept hierarchy has five levels, referred to as levels 0 to 4, starting with
level 0 at the root node for "all" (the most general concept).

Here, Level 1 includes computer, software, printer & camera, and computer accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus software
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.

3.5.1 Approaches For Mining Multilevel Association Rules:

1. Uniform Minimum Support:


The same minimum support threshold is used when mining at each level of abstraction.
When a uniform minimum support threshold is used, the search procedure is simplified.
The method is also simple in that users are required to specify only one minimum support
threshold.
The uniform support approach, however, has some difficulties. It is unlikely that items at
lower levels of abstraction will occur as frequently as those at higher levels of
abstraction.
If the minimum support threshold is set too high, it could miss some meaningful
associations occurring at low abstraction levels. If the threshold is set too low, it may
generate many uninteresting associations occurring at high abstraction levels.

2. Reduced Minimum Support:


Each level of abstraction has its own minimum support threshold.

The deeper the level of abstraction, the smaller the corresponding threshold is.

For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
In this way, "computer," "laptop computer," and "desktop computer" are all considered
frequent.

3. Group-Based Minimum Support:


Because users or experts often have insight as to which groups are more important than others,
it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support
thresholds when mining multilevel rules.
For example, a user could set up the minimum support thresholds based on product price, or
on items of interest, such as by setting particularly low support thresholds for laptop computers
and flash drives in order to pay particular attention to the association patterns containing items
in these categories.
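As a toy illustration of the reduced-support idea (all thresholds, items, and level assignments below are invented for the example), each item's observed support can be compared against the threshold of its own abstraction level:

# Hypothetical per-level minimum support thresholds: the deeper the level, the lower the threshold.
min_support = {1: 0.05, 2: 0.03}          # level 1: 5%, level 2: 3%
level_of = {"computer": 1, "laptop computer": 2, "desktop computer": 2}

def is_frequent(item, observed_support):
    # an item is kept if its support meets the threshold of the level it belongs to
    return observed_support >= min_support[level_of[item]]

print(is_frequent("computer", 0.10))          # True  (10% >= 5%)
print(is_frequent("laptop computer", 0.04))   # True  (4% >= 3%)
print(is_frequent("desktop computer", 0.02))  # False (2% < 3%)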

3.6 Mining Multidimensional Association Rules from Relational Databases and Data
Warehouses:

A single-dimensional or intradimensional association rule contains a single distinct predicate
(e.g., buys) with multiple occurrences, i.e., the predicate occurs more than once within the
rule.

buys(X, "digital camera") => buys(X, "HP printer")

Association rules that involve two or more dimensions or predicates can be referred
to as multidimensional association rules.


age(X, "20…29") ^ occupation(X, "student") => buys(X, "laptop")
The above rule contains three predicates (age, occupation, and buys), each of which occurs
only once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called interdimensional
association rules.
We can also mine multidimensional association rules with repeated predicates, which
contain multiple occurrences of some predicates. These rules are called hybrid-dimensional
association rules. An example of such a rule is the following, where the predicate buys is
repeated:
age(X, "20…29") ^ buys(X, "laptop") => buys(X, "HP printer")

3.7 Mining Quantitative Association Rules:


Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process so as to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to mine quantitative association rules having
two quantitative attributes on the left-hand side of the rule and one categorical attribute on
the right-hand side of the rule. That is
Aquan1 ^ Aquan2 => Acat,
where Aquan1 and Aquan2 are tests on quantitative attribute intervals, and
Acat tests a categorical attribute from the task-relevant data.
Such rules have been referred to as two-dimensional quantitative association rules,
because they contain two quantitative dimensions.
For instance, suppose you are curious about the association relationship between pairs of
quantitative attributes, like customer age and income, and the type of television (such as
high-definition TV, i.e., HDTV) that customers like to buy.
An example of such a 2-D quantitative association rule is
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "HDTV")
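A minimal sketch of the discretization step behind such rules follows. The bin edges, attribute names, and static binning are simplifications chosen here for illustration; true quantitative association mining discretizes dynamically to optimize the resulting rules.

# Sketch: binning quantitative attributes into interval labels before mining.
def bin_value(value, edges):
    # return the half-open interval [lo, hi) containing the value, as an interval label
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}...{hi - 1}"
    return "other"

customer = {"age": 34, "income": 45000}
record = {
    "age": bin_value(customer["age"], [20, 30, 40, 50]),
    "income": bin_value(customer["income"], [30000, 42000, 49000, 60000]),
}
print(record)  # {'age': '30...39', 'income': '42000...48999'}
# Ordinary association mining can then be applied to these categorical interval labels.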

3.8 From Association Mining to Correlation Analysis:


A correlation measure can be used to augment the support-confidence
framework for association rules. This leads to correlation rules of the form
A => B [support, confidence, correlation]

That is, a correlation rule is measured not only by its support and confidence but
also by the correlation between itemsets A and B. There are many different
correlation measures from which to choose. In this section, we study various
correlation measures to determine which would be good for mining large data sets.

Lift is a simple correlation measure that is given as follows. The occurrence of
itemset A is independent of the occurrence of itemset B if P(A ∪ B) =
P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events.
This definition can easily be extended to more than two itemsets.

The lift between the occurrence of A and B can be measured by computing
lift(A, B) = P(A ∪ B) / (P(A)P(B)).

If lift(A, B) is less than 1, then the occurrence of A is negatively correlated
with the occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated,
meaning that the occurrence of one implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no
correlation between them.
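A tiny sketch of this lift-based independence test; the probabilities below are hypothetical.

def correlation_from_lift(p_a, p_b, p_ab):
    # lift = P(A and B) / (P(A) * P(B))
    lift = p_ab / (p_a * p_b)
    if lift > 1:
        return lift, "positively correlated"
    if lift < 1:
        return lift, "negatively correlated"
    return lift, "independent"

# Hypothetical probabilities: P(A) = 0.6, P(B) = 0.75, P(A and B) = 0.4
print(correlation_from_lift(0.6, 0.75, 0.4))  # (0.888..., 'negatively correlated')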

3.9 Constraint-Based Association Mining


A data mining procedure can uncover thousands of rules from a given data set, most of which
turn out to be irrelevant or uninteresting to the users. Users often have a good sense of which
"direction" of mining may lead to interesting patterns and of the "form" of the patterns or rules
they would like to discover.
Therefore, a good heuristic is to have the users specify such intuition or expectations as
constraints that restrict the search space. This strategy is called constraint-based mining.
Constraint-based algorithms use constraints to reduce the search space in the frequent itemset
generation step (the association rule generation step is the same as in exhaustive algorithms).


The most common constraint is the minimum support threshold. Including a constraint in the
mining phase can significantly reduce the exploration space, because it defines a boundary
inside the search space lattice beyond which exploration is not needed.
The importance of constraints is clear: they produce only association rules that are
interesting to users. The method is straightforward; the rule space is reduced so that the
remaining rules satisfy the constraints.
Constraint-based clustering, similarly, discovers clusters that satisfy user-defined preferences or
constraints. Depending on the characteristics of the constraints, constraint-based clustering can
adopt rather different approaches.
The constraints can include the following −
Knowledge type constraints − These specify the type of knowledge to be mined, such as
association or correlation.
Data constraints − These specify the set of task-relevant data.
Dimension/level constraints − These specify the desired dimensions (or attributes) of the data,
or levels of the concept hierarchies, to be used in mining.
Interestingness constraints − These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and correlation.
Rule constraints − These specify the form of rules to be mined. Such constraints can be
expressed as metarules (rule templates), as the maximum or minimum number of predicates that
can appear in the rule antecedent or consequent, or as relationships among attributes, attribute
values, and/or aggregates.
The above constraints can be specified using a high-level declarative data mining query
language and user interface. This form of constraint-based mining allows users to describe the
rules that they would like to uncover, thereby making the data mining process more effective.
Furthermore, a sophisticated mining query optimizer can be used to exploit the constraints
specified by the user, thereby making the mining process more efficient. Constraint-based
mining encourages interactive exploratory mining and analysis.
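As a small illustration of how such constraints can prune the rule space (the rules, thresholds, and constraint parameters below are invented for the example), candidate rules can be filtered against a rule-length constraint, interestingness thresholds, and an item constraint:

# Sketch: post-filtering candidate rules with simple user-specified constraints.
# Each rule is (antecedent_set, consequent_set, support, confidence); all values are illustrative.
rules = [
    ({"laptop"}, {"HP printer"}, 0.04, 0.70),
    ({"milk", "bread", "butter", "jam"}, {"beer"}, 0.01, 0.55),
    ({"antivirus software"}, {"laptop"}, 0.03, 0.65),
]

def satisfies(rule, max_antecedent_len=2, min_support=0.02, min_confidence=0.6, required_item=None):
    lhs, rhs, supp, conf = rule
    if len(lhs) > max_antecedent_len:                 # rule constraint: limit antecedent length
        return False
    if supp < min_support or conf < min_confidence:   # interestingness constraints
        return False
    if required_item and required_item not in (lhs | rhs):  # item (data) constraint
        return False
    return True

print([r for r in rules if satisfies(r, required_item="laptop")])
# keeps the two rules that mention "laptop", are short enough, and meet both thresholds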

Applications of Apriori Algorithm

• Used in forest departments to understand the intensity and probability of forest fires.
• Used by Google and other search engines for their auto-complete features.
• Used in healthcare to analyze patient databases and predict which patients might develop
high blood pressure, diabetes, or other common diseases.
• Used to categorize students based on their specialties and performance to improve their
academic performance.
• Used by e-commerce websites in their recommendation systems to provide a better user
experience.

Prepared by Dr. Pallavi Tawde

