
FUNDAMENTALS OF DATA SCIENCE

SMD COLLEGE, HOSAPETE

UNIT 3: Mining Frequent Patterns

PREPARED BY:
KUMAR MUTTURAJ
BCA DEPT


Basic Concepts:
Frequent pattern mining in data mining is the process of identifying patterns or associations
within a dataset that occur frequently. This is typically done by analyzing large datasets to
find items or sets of items that appear together frequently.
Frequent pattern mining is an essential task in data mining that aims to uncover
recurring patterns or itemsets in a given dataset. It involves identifying collections
of items that frequently occur together in a transactional or relational database. This
process can offer valuable insight into the relationships and associations among
different items or attributes within the data.

There are several different algorithms used for frequent pattern mining, including:
1. Apriori algorithm: This is one of the most commonly used algorithms for frequent
pattern mining. It uses a “bottom-up”, level-wise approach to identify frequent
itemsets and then generates association rules from those itemsets.
2. ECLAT algorithm: This algorithm uses a depth-first search over vertical
transaction-id lists to identify frequent itemsets, as sketched below. It is
particularly efficient for datasets with a large number of items.
3. FP-growth algorithm: This algorithm uses a compression technique (an FP-tree) to
find frequent patterns efficiently. It is particularly efficient for datasets with a
large number of transactions.
Frequent pattern mining has many applications, such as Market Basket Analysis,
Recommender Systems, Fraud Detection, and many more.
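To make the ECLAT idea concrete, the following is a minimal, illustrative Python sketch; the toy transactions and min_support value are assumptions for illustration, not part of this unit. ECLAT works on a vertical layout: each item is mapped to the set of transaction IDs (a tid-list) in which it occurs, and itemsets are extended depth-first by intersecting tid-lists.

```python
def eclat(prefix, items, min_support, results):
    """Depth-first search over vertical tid-lists (the core of ECLAT)."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            # Record the frequent itemset together with its support count
            results[frozenset(prefix + [item])] = len(tids)
            # Extend the prefix: intersect this tid-list with the rest
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items
                      if len(tids & other_tids) >= min_support]
            eclat(prefix + [item], suffix, min_support, results)

# Toy transactions; the vertical layout maps item -> set of transaction IDs
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"milk", "butter"}, {"bread", "milk", "butter"}]
vertical = {}
for tid, basket in enumerate(transactions):
    for item in basket:
        vertical.setdefault(item, set()).add(tid)

results = {}
eclat([], sorted(vertical.items()), min_support=2, results=results)
print(results)  # every itemset occurring in at least 2 transactions
```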

Frequent Itemset Mining Methods:


Association mining searches for frequent itemsets in the data set. Frequent pattern
mining usually uncovers interesting associations and correlations between itemsets in
transactional and relational databases. In short, frequent mining shows which items
appear together in a transaction or relationship.

Need of Association Mining: Frequent mining enables the generation of association rules
from a transactional dataset. If two items X and Y are frequently purchased together,
then it is good to put them together in stores, or to provide some discount offer on one
item on purchase of the other. This can really increase sales. For example, it is likely
to find that if a customer buys milk and bread, he/she also buys butter. So the
association rule is {milk, bread} => {butter}, and the seller can suggest that a customer
who buys milk and bread also buy butter.

Advantages of using frequent item sets and association rule mining include:
1. Efficient discovery of patterns: Association rule mining algorithms are efficient at
discovering patterns in large datasets, making them useful for tasks such as market
basket analysis and recommendation systems.
2. Easy to interpret: The results of association rule mining are easy to understand and
interpret, making it possible to explain the patterns found in the data.
3. Can be used in a wide range of applications: Association rule mining can be used in a
wide range of applications such as retail, finance, and healthcare, which can help to
improve decision-making and increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with many items
and transactions, which makes them suitable for big-data scenarios.
Disadvantages of using frequent item sets and association rule mining include:
1. Large number of generated rules: Association rule mining can generate a large
number of rules, many of which may be irrelevant or uninteresting, which can make
it difficult to identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in its
ability to detect complex relationships between items, and it only considers the co-
occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions
increases, the number of candidate item sets also increases, which can make the
algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum
support and confidence threshold must be set before the association rule mining
process, which can be difficult and requires a good understanding of the data.

Apriori algorithm:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding
frequent itemsets in a dataset for Boolean association rules. The algorithm is named
Apriori because it uses prior knowledge of frequent itemset properties. It applies an
iterative, level-wise search in which frequent k-itemsets are used to find frequent
(k+1)-itemsets.

To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used, which helps by reducing the
search space.

Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the
Apriori algorithm is the anti-monotonicity of the support measure. Apriori assumes that:
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Consider the following dataset; we will find the frequent itemsets and generate
association rules for them.

A set of transactions consistent with every support count used in the steps below (the
classic example from Han and Kamber's textbook) is:

TID     Items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3

minimum support count is 2

minimum confidence is 50%

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset,
called C1 (the candidate set).

(II) Compare each candidate itemset's support count with the minimum support count (here
min_support = 2; if the support count of a candidate itemset is less than min_support,
remove it). This gives us the itemset L1.

Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition
for joining two itemsets of Lk-1 is that they have (k-2) elements in common.
 Check whether all subsets of each itemset are frequent and, if not, remove that
itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are
frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate set (C2) support counts with the minimum support count (here
min_support = 2; if the support count of a candidate itemset is less than min_support,
remove it). This gives us the itemset L2.

Step-3:
o Generate candidate set C3 using L2 (join step). The condition for joining two
itemsets of Lk-1 is that they have (k-2) elements in common, so here, for L2,
the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
o Check whether all subsets of these itemsets are frequent and, if not, remove
that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3},
which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so
remove that itemset. Similarly check every itemset.)
o Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate set (C3) support counts with the minimum support count (here
min_support = 2; if the support count of a candidate itemset is less than min_support,
remove it). This gives us the itemset L3.

Step-4:
o Generate candidate set C4 using L3 (join step). The condition for joining two
itemsets of Lk-1 (with k = 4) is that they have (k-2) elements in common, so
here, for L3, the first 2 elements (items) should match.
o Check whether all subsets of these itemsets are frequent (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no itemset in C4.
o We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that we need to calculate the confidence
of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also
bought butter.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)

So here, taking one of the frequent itemsets as an example, we will show the rule
generation.
Itemset {I1, I2, I3} // from L3

So the rules can be:
{I1, I2} => {I3} // confidence = sup({I1, I2, I3})/sup({I1, I2}) = 2/4 = 50%
{I1, I3} => {I2} // confidence = sup({I1, I2, I3})/sup({I1, I3}) = 2/4 = 50%
{I2, I3} => {I1} // confidence = sup({I1, I2, I3})/sup({I2, I3}) = 2/4 = 50%
{I1} => {I2, I3} // confidence = sup({I1, I2, I3})/sup({I1}) = 2/6 = 33%
{I2} => {I1, I3} // confidence = sup({I1, I2, I3})/sup({I2}) = 2/7 = 28%
{I3} => {I1, I2} // confidence = sup({I1, I2, I3})/sup({I3}) = 2/6 = 33%
Since the minimum confidence is 50%, the first 3 rules can be considered strong
association rules.
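As a summary of the whole procedure, here is a minimal Python sketch of level-wise Apriori, assuming the nine transactions reconstructed above; it reproduces L1 through L3 and the rule confidences computed in the steps.

```python
from itertools import combinations

# The nine transactions consistent with the support counts used above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUPPORT = 2       # minimum support count
MIN_CONFIDENCE = 0.5  # minimum confidence (50%)

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise search: frequent k-itemsets (Lk) build candidates C(k+1)
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support_count({i}) >= MIN_SUPPORT]
frequent = list(level)
k = 1
while level:
    k += 1
    # Join step: unite itemsets of the previous level that differ in one item
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step (Apriori property): drop candidates with an infrequent subset
    prev = set(level)
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    level = [c for c in candidates if support_count(c) >= MIN_SUPPORT]
    frequent.extend(level)

# Rule generation: confidence(A => B) = sup(A ∪ B) / sup(A)
for itemset in (f for f in frequent if len(f) > 1):
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = support_count(itemset) / support_count(antecedent)
            if conf >= MIN_CONFIDENCE:
                print(sorted(antecedent), "=>", sorted(itemset - antecedent),
                      f"confidence = {conf:.0%}")
```

With the minimum confidence of 50%, this prints the three strong rules derived from {I1, I2, I3} above, together with the strong rules that come from the frequent 2-itemsets.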

Frequent Pattern Growth:

Frequent Pattern Growth (FP-Growth) is a popular algorithm in data mining for finding
frequent patterns (sets of items that frequently occur together) in transactional
databases. It improves upon the Apriori algorithm by using an FP-tree (Frequent Pattern
tree) structure to store frequent itemsets in a compressed and structured form.

Suppose we have a transactional database where each transaction consists of items
purchased by customers:

Transaction 1: {bread, milk, butter}
Transaction 2: {bread, coffee}
Transaction 3: {bread, butter}
Transaction 4: {bread, milk, coffee}
Transaction 5: {bread, milk}
And let's say our minimum support threshold is 2 (meaning we are interested in itemsets
that appear in at least 2 transactions).

Step 1: Counting item frequencies


First, we count the frequency of each item:
 bread: 5
 milk: 3
 butter: 2
 coffee: 2
Since the minimum support threshold is 2, all items meet this criterion.

Step 2: Building the FP-Tree

Next, we construct the FP-tree:

1. Transaction 1: {bread, milk, butter}
o Insert items into the tree: bread -> milk -> butter
2. Transaction 2: {bread, coffee}
o Insert items into the tree: bread -> coffee
3. Transaction 3: {bread, butter}
o Insert items into the tree: bread -> butter
4. Transaction 4: {bread, milk, coffee}
o Insert items into the tree: bread -> milk -> coffee
5. Transaction 5: {bread, milk}
o Insert items into the tree: bread -> milk

The FP-tree structure will look like this:

FP-tree:
- bread (5)
  - milk (3)
    - butter (1)
    - coffee (1)
  - butter (1)
  - coffee (1)

Here, the number in parentheses is the count of transactions whose path passes through
that node.
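The tree-building step can be sketched in a few lines of Python. This is an illustrative fragment rather than a full FP-Growth implementation: it only builds and prints the prefix tree shown above, with items inserted in descending frequency order.

```python
class FPNode:
    """A node of the FP-tree: an item, a count, and child nodes."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, ordered_items):
    """Insert one transaction (items already in frequency order)."""
    node = root
    for item in ordered_items:
        child = node.children.setdefault(item, FPNode(item))
        child.count += 1   # shared prefixes accumulate their counts
        node = child

# The five example transactions, with items ranked by overall frequency
order = {"bread": 0, "milk": 1, "butter": 2, "coffee": 3}
baskets = [{"bread", "milk", "butter"}, {"bread", "coffee"},
           {"bread", "butter"}, {"bread", "milk", "coffee"},
           {"bread", "milk"}]
root = FPNode(None)
for basket in baskets:
    insert(root, sorted(basket, key=order.get))

def show(node, depth=0):
    """Print the tree with the same indentation as the diagram above."""
    for child in node.children.values():
        print("  " * depth + f"- {child.item} ({child.count})")
        show(child, depth + 1)

show(root)
```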

Step 3: Mining frequent itemsets

From the FP-tree, we can generate frequent itemsets. Starting from the bottom of the
tree (the least frequent items), we obtain:
 {bread}: 5 transactions
 {milk}: 3 transactions
 {butter}: 2 transactions
 {coffee}: 2 transactions
 {bread, milk}: 3 transactions
 {bread, butter}: 2 transactions
 {bread, coffee}: 2 transactions

Thus, the frequent itemsets meeting the minimum support threshold (2 transactions) are
{bread}, {milk}, {butter}, {coffee}, {bread, milk}, {bread, butter}, and {bread, coffee}.

This example demonstrates the basic process of building an FP-tree from transactional
data and using it to efficiently mine frequent itemsets without generating candidate
itemsets explicitly, which is the key advantage of the FP-Growth algorithm.
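In practice, FP-Growth is usually taken from a library rather than hand-coded. As one illustration, the third-party mlxtend library (an assumption here; any frequent-pattern library would do) exposes FP-Growth over a one-hot encoded DataFrame. Note that mlxtend expresses minimum support as a fraction, so a count of 2 out of 5 transactions becomes 0.4.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

baskets = [["bread", "milk", "butter"], ["bread", "coffee"],
           ["bread", "butter"], ["bread", "milk", "coffee"],
           ["bread", "milk"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# min_support is a fraction: 2 of 5 transactions = 0.4
print(fpgrowth(onehot, min_support=0.4, use_colnames=True))
```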

Mining Association Rules:

Association rule learning is a type of unsupervised learning technique that checks for
the dependency of one data item on another and maps them accordingly, so that the
results can be made more profitable. It tries to find interesting relations or
associations among the variables of a dataset, applying different rules to discover
interesting relations between variables in the database.

Association rule learning is one of the very important concepts of machine learning,
and it is employed in market basket analysis, web usage mining, continuous production,
etc. Market basket analysis is a technique used by various big retailers to discover
the associations between items. We can understand it by taking the example of a
supermarket, where all products that are frequently purchased together are placed
together.

For example, if a customer buys bread, he most likely will also buy butter, eggs, or
milk, so these products are stored on the same shelf or mostly nearby.

Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. FP-Growth Algorithm

Association rule learning works on the concept of if-then statements, such as “if A
then B”.

Here the “if” element is called the antecedent, and the “then” statement is called the
consequent. A relationship in which an association is found between two single items
is known as single cardinality; as the number of items in a rule increases, the
cardinality increases accordingly. So, to measure the associations between thousands of
data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It
is defined as the fraction of the transactions T that contain the itemset X. For
transactions T, it can be written as:

Support(X) = Freq(X) / |T|

Confidence
Confidence indicates how often the rule has been found to be true, or how often the
items X and Y occur together in the dataset given that X has already occurred. It is
the ratio of the number of transactions that contain both X and Y to the number of
transactions that contain X:

Confidence(X => Y) = Freq(X ∪ Y) / Freq(X)

Lift
Lift is the strength of a rule, defined by the formula below. It is the ratio of the
observed support to the support expected if X and Y were independent of each other:

Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible cases:
o If Lift = 1: The occurrence of the antecedent and that of the consequent are
independent of each other.
o If Lift > 1: It indicates the degree to which the two itemsets are dependent on
each other.
o If Lift < 1: It tells us that one item is a substitute for the other, meaning one
item has a negative effect on the occurrence of the other.
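All three metrics can be computed directly from their definitions. The following is a minimal sketch; the toy baskets are assumptions chosen only to illustrate the formulas.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """How often Y appears in transactions that already contain X."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """Observed support of X ∪ Y relative to X and Y being independent."""
    return confidence(X, Y, transactions) / support(Y, transactions)

baskets = [["bread", "milk", "butter"], ["bread", "coffee"],
           ["bread", "butter"], ["bread", "milk", "coffee"],
           ["bread", "milk"]]
print(confidence({"milk"}, {"butter"}, baskets))  # 0.33: one of three milk baskets has butter
print(lift({"bread"}, {"milk"}, baskets))         # 1.0: bread is in every basket
```

Because bread appears in every basket, knowing that a basket contains bread says nothing extra about milk, which is why the lift comes out to exactly 1.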

Applications of Association Rule Learning:

It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of
association rule mining. This technique is commonly used by big retailers to
determine the association between items.

o Medical Diagnosis: With the help of association rules, patients can be treated more
effectively, as the rules help in identifying the probability of illness for a
particular disease.

o Protein Sequence: Association rules help in determining the synthesis of artificial
proteins.

o It is also used for catalog design, loss-leader analysis, and many other
applications.
