
Pattern Mining

Pattern mining focuses on identifying rules that describe data patterns, with market-basket analysis being a key application that reveals item associations in transactions. The Apriori algorithm is a method used to calculate association rules between items, utilizing support, confidence, and lift to analyze relationships and improve sales strategies. While the algorithm is straightforward and effective for large datasets, it can be computationally expensive and requires multiple database scans.


Pattern mining concentrates on identifying rules that describe specific patterns within the data. Market-basket analysis, which identifies items that typically occur together in purchase transactions, was one of the first applications of data mining. For example, supermarkets used market-basket analysis to identify items that were often purchased together; for instance, a store featuring a fish sale would also stock up on tartar sauce. Although testing for such associations has long been feasible and is often simple to see in small data sets, data mining has enabled the discovery of less apparent associations in immense data sets. Of most interest is the discovery of unexpected associations, which may open new avenues for marketing or research. Another important use of pattern mining is the discovery of sequential patterns; for example, sequences of errors or warnings that precede an equipment failure may be used to schedule preventative maintenance or may provide insight into a design flaw.

The Apriori algorithm is used to calculate association rules between objects, that is, how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning method that analyzes patterns such as "people who bought product A also bought product B."

The primary objective of the Apriori algorithm is to create association rules between different objects. An association rule describes how two or more objects are related to one another. The Apriori algorithm is also called frequent pattern mining. Generally, you operate the Apriori algorithm on a database that consists of a huge number of transactions. To understand the Apriori algorithm, suppose you go to Big Bazar and buy different products; analyzing such transactions helps customers buy their products with ease and increases the sales performance of Big Bazar. In this tutorial, we will discuss the Apriori algorithm with examples.

Components of Apriori algorithm

The Apriori algorithm comprises the following three components:

1. Support
2. Confidence
3. Lift
Let's take an example to understand this concept.

As discussed above, you need a huge database containing a large number of transactions. Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolates, because customers frequently buy these two items together.

Out of 4,000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates. Using this data, we will find the support, confidence, and lift.

Support

Support refers to the default popularity of any product. It is calculated by dividing the number of transactions containing that product by the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions containing Biscuits) / (Total transactions)

= 400/4000 = 10 percent.

Confidence

Confidence refers to the likelihood that a customer who bought biscuits also bought chocolates. It is calculated by dividing the number of transactions containing both biscuits and chocolates by the number of transactions containing biscuits.

Hence,

Confidence = (Transactions containing both Biscuits and Chocolates) / (Transactions containing Biscuits)

= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift

Consider the above example; lift refers to the increase in the likelihood of selling chocolates when biscuits are sold. It is calculated by dividing the confidence of the rule by the support of the consequent (here, Support (Chocolates) = 600/4000 = 15 percent):

Lift (Biscuits -> Chocolates) = Confidence (Biscuits -> Chocolates) / Support (Chocolates)

= 50/15 ≈ 3.33

It means that customers who buy biscuits are about 3.3 times more likely to buy chocolates than customers in general. If the lift value is below one, the two items are unlikely to be bought together; the larger the value, the better the combination.
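The numbers above can be checked with a few lines of Python. This is an illustrative sketch of our own, using the transaction counts quoted in the example and the standard definitions of support, confidence, and lift:

```python
total_transactions = 4000
biscuit_count = 400       # transactions containing Biscuits
chocolate_count = 600     # transactions containing Chocolates
both_count = 200          # transactions containing both items

support_biscuits = biscuit_count / total_transactions        # 0.10 (10 percent)
support_chocolates = chocolate_count / total_transactions    # 0.15 (15 percent)
support_both = both_count / total_transactions                # 0.05

# Confidence of the rule Biscuits -> Chocolates
confidence = support_both / support_biscuits                  # 0.50 (50 percent)

# Lift: confidence divided by the support of the consequent (Chocolates)
lift = confidence / support_chocolates                        # about 3.33

print(f"support = {support_biscuits:.0%}, confidence = {confidence:.0%}, lift = {lift:.2f}")
```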

How does the Apriori Algorithm work in Data Mining?

We will understand this algorithm with the help of an example

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk,
Apple}. The database comprises six transactions where 1 represents the presence of
the product and 0 represents the absence of the product.

Transaction ID Rice Pulse Oil Milk Apple

t1 1 1 1 0 0

t2 0 1 1 1 0
t3 0 0 0 1 1

t4 1 1 0 1 0

t5 1 1 1 0 1

t6 1 1 1 1 1

The Apriori algorithm makes the following assumptions:

○ All subsets of a frequent itemset must be frequent.

○ All supersets of an infrequent itemset must be infrequent.
○ Fix a threshold support level. In our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in the transactions. Now, shortlist the frequency table to include only those products that meet the threshold support level. We get the following frequency table.

Product Frequency (Number of transactions)

Rice (R) 4

Pulse(P) 5
Oil(O) 4

Milk(M) 4

The above table shows the products frequently bought by customers.

Step 2

Create pairs of the frequent products: RP, RO, RM, PO, PM, OM. You will get the following frequency table.

Itemset Frequency (Number of transactions)

RP 4

RO 3

RM 2

PO 4

PM 3

OM 2

Step 3
Apply the same threshold support of 50 percent and keep only the itemsets that meet it. In our case, that means a count of 3 or more.

Thus, we get RP, RO, PO, and PM.

Step 4

Now, look for sets of three products that the customers buy together. We get the following combinations.

1. RP and RO give RPO


2. PO and PM give POM

Step 5

Calculate the frequency of these two itemsets, and you will get the following frequency table.

Itemset Frequency (Number of transactions)

RPO 3

POM 2

If you apply the threshold assumption (a support count of at least 3), you can figure out that the customers' set of three products is RPO.

We have considered an easy example to discuss the apriori algorithm in data mining.
In reality, you find thousands of such combinations.
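The pair and triple counts in Steps 2 to 5 can be verified with a short brute-force check. This is an illustrative sketch of our own; it simply tests every 2- and 3-item combination against the transaction table rather than performing the Apriori join and prune steps (which are described later in this document):

```python
from itertools import combinations

# The six transactions from the table above (R = Rice, P = Pulse, O = Oil, M = Milk, A = Apple)
transactions = [
    {"R", "P", "O"},            # t1
    {"P", "O", "M"},            # t2
    {"M", "A"},                 # t3
    {"R", "P", "M"},            # t4
    {"R", "P", "O", "A"},       # t5
    {"R", "P", "O", "M", "A"},  # t6
]
min_count = 3  # 50 percent of 6 transactions

items = sorted(set().union(*transactions))
for k in (2, 3):
    for candidate in combinations(items, k):
        count = sum(set(candidate) <= t for t in transactions)  # transactions containing the candidate
        if count >= min_count:
            print(candidate, count)   # e.g. ('O', 'P', 'R') 3  -> the frequent set RPO
```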

How to improve the efficiency of the Apriori


Algorithm?
There are various methods used to improve the efficiency of the Apriori algorithm.
Hash-based itemset counting

In hash-based itemset counting, a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent, so it is excluded from further consideration.

Transaction Reduction

In transaction reduction, a transaction that does not contain any frequent itemset is not useful in subsequent scans and can be removed.

Apriori Algorithm in data mining


We have already discussed an example of the apriori algorithm related to the
frequent itemset generation. Apriori algorithm has many applications in data mining.

The primary requirements to find the association rules in data mining are given
below.

Use Brute Force

List all the candidate rules and find the support and confidence levels for each individual rule. Afterward, eliminate the rules whose values fall below the threshold support and confidence levels.

The two-step approach

The two-step approach is a better option for finding association rules than the Brute Force method.

Step 1

In this article, we have already discussed how to create the frequency table and
calculate itemsets having a greater support value than that of the threshold support.

Step 2

To create association rules, you need to use a binary partition of the frequent
itemsets. You need to choose the ones having the highest confidence levels.

In the above example, you can see that RPO was the frequent itemset. Now, we find all the rules that can be formed from RPO:

RP -> O, RO -> P, PO -> R, O -> RP, P -> RO, R -> PO

You can see that there are six different combinations. In general, for a frequent itemset with n elements, there are 2^n - 2 candidate association rules.
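As an illustration of this enumeration (our own sketch, not code from the tutorial), the snippet below lists every binary partition of the frequent itemset {R, P, O} as a candidate rule and computes its confidence from the support counts of the worked example:

```python
from itertools import combinations

def candidate_rules(frequent_itemset):
    """Yield (antecedent, consequent) pairs from every binary partition of the itemset."""
    items = tuple(sorted(frequent_itemset))
    for r in range(1, len(items)):                       # proper, non-empty subsets
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

# Support counts taken from the worked example above (keys are sorted item tuples).
support = {("O", "P", "R"): 3, ("O", "P"): 4, ("O", "R"): 3, ("P", "R"): 4,
           ("O",): 4, ("P",): 5, ("R",): 4}

itemset = ("O", "P", "R")   # the frequent itemset {R, P, O}
for antecedent, consequent in candidate_rules(itemset):
    confidence = support[itemset] / support[antecedent]
    print(f"{antecedent} -> {consequent}: confidence = {confidence:.0%}")
```

Running it prints the six candidate rules listed above together with their confidences, so the ones with the highest confidence can be selected.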

Advantages of Apriori Algorithm

○ It can be used to compute large itemsets.


○ Simple to understand and apply.


Disadvantages of Apriori Algorithms

○ The Apriori algorithm is an expensive method for finding support, since the calculation has to pass through the whole database repeatedly.
○ Sometimes a huge number of candidate itemsets is generated, which makes the computation even more expensive.

Apriori Algorithm – Frequent Pattern Algorithms

Apriori was one of the first algorithms proposed for frequent itemset mining. It was introduced by R. Agrawal and R. Srikant as an improvement over earlier approaches and came to be known as Apriori. This algorithm uses two steps, "join" and "prune", to reduce the search space. It is an iterative approach to discover the most frequent itemsets.

Apriori says:

● If P(I) < minimum support threshold, then itemset I is not frequent.

● If P(I + A) < minimum support threshold, then I + A is not frequent, where A also belongs to the itemset.
● If an itemset has a support value less than the minimum support, then all of its supersets will also fall below the minimum support, and thus can be ignored. This property is called the antimonotone property.

The steps followed in the Apriori Algorithm of data mining are:

1. Join Step: This step generates (K+1)-itemset candidates from K-itemsets by joining the frequent K-itemsets with one another.
2. Prune Step: This step scans the count of each candidate itemset in the database. If a candidate itemset does not meet minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets. (A small sketch of both steps follows this list.)
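A compact sketch of the join and prune steps (our own illustration, not code from the source): candidates of size k+1 are built by joining frequent k-itemsets that share their first k-1 items, and a candidate is pruned if any of its k-item subsets is not frequent.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join frequent k-itemsets (given as sorted tuples) and prune by the Apriori property."""
    frequent = set(frequent_k)
    k = len(next(iter(frequent)))
    candidates = set()
    for a in frequent:
        for b in frequent:
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:      # join step
                candidate = a + (b[k - 1],)
                # prune step: every k-item subset of the candidate must be frequent
                if all(subset in frequent for subset in combinations(candidate, k)):
                    candidates.add(candidate)
    return candidates

# Example: the frequent 2-itemsets from the worked example later in this text
print(apriori_gen({("I1", "I2"), ("I1", "I3"), ("I2", "I3"), ("I2", "I4")}))
# -> {('I1', 'I2', 'I3')}   (('I2', 'I3', 'I4') is generated by the join but pruned
#                            because its subset ('I3', 'I4') is not frequent)
```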

Steps In Apriori

Apriori algorithm is a sequence of steps to be followed to find the most frequent


itemset in the given database. This data mining technique follows the join and the
prune steps iteratively until the most frequent itemset is achieved. A minimum
support threshold is given in the problem or it is assumed by the user.

#1) In the first iteration of the algorithm, each item is taken as a 1-itemsets
candidate. The algorithm will count the occurrences of each item.

#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.


#3) Next, the frequent 2-itemsets with min_sup are discovered. For this, in the join step the 2-itemsets are generated by combining the frequent 1-itemsets with one another.

#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will contain only the 2-itemsets that meet min_sup.

#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration follows the antimonotone property: the 2-itemset subsets of each candidate 3-itemset must meet min_sup. If all 2-itemset subsets are frequent, the superset is kept as a candidate; otherwise it is pruned.

#6) The next step forms 4-itemsets by joining 3-itemsets with one another and pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm stops when no further frequent itemsets can be generated.
Example of Apriori: Support threshold=50%, Confidence= 60%

TABLE-1

Transaction List of items

T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Solution:

Support threshold=50% => 0.5*6= 3 => min_sup=3


1. Count Of Each Item

TABLE-2

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2

2. Prune Step: TABLE-2 shows that item I5 does not meet min_sup=3, thus it is deleted; only I1, I2, I3, I4 meet the min_sup count.

TABLE-3

Item Count

I1 4

I2 5

I3 4

I4 4

3. Join Step: Form 2-itemsets. From TABLE-1, find out the occurrences of each 2-itemset.

TABLE-4

Item Count

I1,I2 4
I1,I3 3

I1,I4 2

I2,I3 4

I2,I4 3

I3,I4 2

4. Prune Step: TABLE-4 shows that itemsets {I1, I4} and {I3, I4} do not meet min_sup, thus they are deleted.

TABLE-5

Item Count

I1,I2 4

I1,I3 3

I2,I3 4

I2,I4 3

5. Join and Prune Step: Form 3-itemsets. From TABLE-1, find out the occurrences of each 3-itemset. From TABLE-5, find out the 2-itemset subsets that meet min_sup.


We can see that for itemset {I1, I2, I3}, the subsets {I1, I2}, {I1, I3}, and {I2, I3} all occur in TABLE-5, thus {I1, I2, I3} is frequent.

We can see that for itemset {I1, I2, I4}, the subset {I1, I4} is not frequent, as it does not occur in TABLE-5; thus {I1, I2, I4} is not frequent, hence it is deleted.

TABLE-6
Item

I1,I2,I3

I1,I2,I4

I1,I3,I4

I2,I3,I4

Only {I1, I2, I3} is frequent.

6. Generate Association Rules: From the frequent itemset discovered above the
association could be:

{I1, I2} => {I3}

Confidence = support {I1, I2, I3} / support {I1, I2} = (3/ 4)* 100 = 75%

{I1, I3} => {I2}

Confidence = support {I1, I2, I3} / support {I1, I3} = (3/ 3)* 100 = 100%

{I2, I3} => {I1}

Confidence = support {I1, I2, I3} / support {I2, I3} = (3/ 4)* 100 = 75%

{I1} => {I2, I3}

Confidence = support {I1, I2, I3} / support {I1} = (3/ 4)* 100 = 75%

{I2} => {I1, I3}

Confidence = support {I1, I2, I3} / support {I2} = (3/ 5)* 100 = 60%

{I3} => {I1, I2}

Confidence = support {I1, I2, I3} / support {I3} = (3/ 4)* 100 = 75%

This shows that all the above association rules are strong if minimum confidence
threshold is 60%.
The Apriori Algorithm: Pseudo Code

Ck: Candidate itemset of size k

Lk: Frequent itemset of size k
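Only these symbol definitions of the pseudocode survive in this copy, so the following Python sketch is our reconstruction of the standard Apriori loop they refer to (illustrative, not the original pseudocode): Ck holds the candidate k-itemsets, Lk the frequent ones, and the loop alternates candidate generation with a database scan until no candidates remain.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: build Ck from L(k-1)
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        # Scan the database and count candidate occurrences
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent

# The six transactions of the worked example (TABLE-1), min_sup = 3
transactions = [frozenset(t) for t in (
    {"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
    {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"})]
for itemset, count in sorted(apriori(transactions, min_sup=3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```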

Advantages

1. Easy to understand algorithm


2. Join and Prune steps are easy to implement on large itemsets in
large databases

Disadvantages

1. It requires high computation if the itemsets are very large and the
minimum support is kept very low.
2. The entire database needs to be scanned.

Methods To Improve Apriori Efficiency


Many methods are available for improving the efficiency of the algorithm.

1. Hash-Based Technique: This method uses a hash-based structure


called a hash table for generating the k-itemsets and its corresponding
count. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of
transactions scanning in iterations. The transactions which do not
contain frequent items are marked or removed.
3. Partitioning: This method requires only two database scans to mine
the frequent itemsets. It says that for any itemset to be potentially
frequent in the database, it should be frequent in at least one of the
partitions of the database.
4. Sampling: This method picks a random sample S from Database D
and then searches for frequent itemset in S. It may be possible to lose a
global frequent itemset. This can be reduced by lowering the min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate
itemsets at any marked start point of the database during the scanning
of the database.

Frequent Pattern Growth Algorithm

This algorithm is an improvement to the Apriori method. A frequent pattern is


generated without the need for candidate generation. FP growth algorithm
represents the database in the form of a tree called a frequent pattern tree or FP
tree.

This tree structure will maintain the association between the itemsets. The database
is fragmented using one frequent item. This fragmented part is called “pattern
fragment”. The itemsets of these fragmented patterns are analyzed. Thus with this
method, the search for frequent itemsets is reduced comparatively.

FP Tree

Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of
the database. The purpose of the FP tree is to mine the most frequent pattern. Each
node of the FP tree represents an item of the itemset.

The root node represents null while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Frequent Pattern Algorithm Steps

The frequent pattern growth method lets us find the frequent pattern without
candidate generation.

Let us see the steps followed to mine the frequent pattern using frequent
pattern growth algorithm:

#1) The first step is to scan the database to find the occurrences of the itemsets in
the database. This step is the same as the first step of Apriori. The count of
1-itemsets in the database is called support count or frequency of 1-itemset.

#2) The second step is to construct the FP tree. For this, create the root of the tree.
The root is represented by null.

#3) The next step is to scan the database again and examine the transactions.
Examine the first transaction and find out the itemset in it. The itemset with the max
count is taken at the top, the next itemset with lower count and so on. It means that
the branch of the tree is constructed with transaction itemsets in descending order of
count.

#4) The next transaction in the database is examined. The itemsets are ordered in
descending order of count. If any itemset of this transaction is already present in
another branch (for example in the 1st transaction), then this transaction branch
would share a common prefix to the root.

This means that the common itemset is linked to the new node of another itemset in
this transaction.

#5) Also, the count of the itemset is incremented as it occurs in the transactions.
Both the common node and new node count is increased by 1 as they are created
and linked according to transactions.

#6) The next step is to mine the created FP Tree. For this, the lowest node is
examined first along with the links of the lowest nodes. The lowest node represents
the frequency pattern length 1. From this, traverse the path in the FP Tree. This path
or paths are called a conditional pattern base.

Conditional pattern base is a sub-database consisting of prefix paths in the FP tree


occurring with the lowest node (suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the
path. The itemsets meeting the threshold support are considered in the Conditional
FP Tree.

#8) Frequent Patterns are generated from the Conditional FP Tree.
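The tree construction just described can be sketched compactly in Python. This is our own illustration (the class and variable names are assumptions, not from the source): items in each transaction are ordered by their global counts and inserted along a shared-prefix path whose node counts are incremented.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: count items and keep only the frequent ones
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}

    root = FPNode(None)   # the root represents null
    # Pass 2: insert each transaction, items sorted by descending global count
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:                          # start a new branch
                child = node.children[item] = FPNode(item, parent=node)
            child.count += 1                           # shared prefix: just increment
            node = child
    return root
```

Mining the tree (steps #6 to #8) then proceeds bottom-up from the least frequent item, collecting conditional pattern bases by following the parent links.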

Example Of FP-Growth Algorithm

Support threshold=50%, Confidence= 60%

Table 1

Transaction List of items

T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Solution:

Support threshold=50% => 0.5*6= 3 => min_sup=3

1. Count of each item

Table 2

Item Count

I1 4
I2 5

I3 4

I4 4

I5 2

2. Sort the itemset in descending order.

Table 3

Item Count

I2 5

I1 4

I3 4

I4 4

3. Build FP Tree


1. Considering the root node null.


2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1},
{I2:1}, {I3:1}, where I2 is linked as a child to root, I1 is linked to I2 and I3
is linked to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked
to I2 and I4 is linked to I3. But this branch would share I2 node as
common as it is already used in T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is
linked as a child to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is
created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to
the root node, hence it will be incremented by 1. Similarly I1 will be
incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2},
{I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4},
{I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.

4. Mining of FP-tree is summarized below:

1. The lowest node item I5 is not considered as it does not have a min
support count, hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches, {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2, I1, I3:1} and {I2, I3:1}. This forms the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and a conditional FP-tree is constructed from it. It will contain {I2:2, I3:2}; I1 is not considered as it does not meet the min support count.
4. This path will generate all combinations of frequent patterns :
{I2,I4:2},{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node
FP-tree: {I2:4} and frequent patterns are generated: {I2, I1:4}.

Item | Conditional Pattern Base | Conditional FP-tree | Frequent Patterns Generated

I4 | {I2,I1,I3:1}, {I2,I3:1} | {I2:2, I3:2} | {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}

I3 | {I2,I1:3}, {I2:1} | {I2:4, I1:3} | {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}

I1 | {I2:4} | {I2:4} | {I2,I1:4}

The diagram given below depicts the conditional FP tree associated with the
conditional node I3.

Advantages Of FP Growth Algorithm

1. This algorithm needs to scan the database only twice when compared
to Apriori which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it
faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent
patterns.

Disadvantages Of FP-Growth Algorithm

1. FP Tree is more cumbersome and difficult to build than Apriori.


2. It may be expensive.
3. When the database is large, the algorithm may not fit in the shared
memory.

FP Growth vs Apriori

Pattern Generation: FP Growth generates patterns by constructing an FP tree, whereas Apriori generates patterns by extending the items into singletons, pairs, and triplets.

Candidate Generation: FP Growth has no candidate generation, whereas Apriori uses candidate generation.

Process: FP Growth is faster than Apriori; its runtime increases linearly with the number of itemsets. Apriori is comparatively slower; its runtime increases exponentially with the number of itemsets.

Memory Usage: FP Growth saves a compact version of the database in memory, whereas Apriori saves the candidate combinations in memory.

ECLAT

The above method, Apriori and FP growth, mine frequent itemsets using horizontal
data format. ECLAT is a method of mining frequent itemsets using the vertical data
format. It will transform the data in the horizontal data format into the vertical format.

For Example, Apriori and FP growth use:

Transaction List of items

T1 I1,I2,I3

T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5
T6 I1,I2,I3,I4

The ECLAT will have the format of the table as:

Item Transaction Set

I1 {T1,T4,T5,T6}

I2 {T1,T2,T4,T5,T6}

I3 {T1,T2,T5,T6}

I4 {T2,T3,T4,T6}

I5 {T3,T5}

This method forms 2-itemsets, 3-itemsets, and so on up to k-itemsets in the vertical data format, increasing k by 1 until no further candidate itemsets are found. Some optimization techniques, such as diffsets, are used along with ECLAT.

This method has an advantage over Apriori, as it does not require scanning the database to find the support of (k+1)-itemsets: each item's transaction set already carries the count of its occurrences (its support). The bottleneck comes when there are many transactions, since intersecting the sets then takes huge memory and computational time.
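A minimal sketch of the vertical-format idea (ours, not from the source): each item maps to the set of transaction IDs (its tidset) that contain it, and the support of any k-itemset is simply the size of the intersection of its members' tidsets, so no further database scans are needed.

```python
# Vertical (item -> tidset) representation of the table above
tidsets = {
    "I1": {"T1", "T4", "T5", "T6"},
    "I2": {"T1", "T2", "T4", "T5", "T6"},
    "I3": {"T1", "T2", "T5", "T6"},
    "I4": {"T2", "T3", "T4", "T6"},
    "I5": {"T3", "T5"},
}

# Support of {I1, I2, I3} is the size of the intersection of the three tidsets
common = tidsets["I1"] & tidsets["I2"] & tidsets["I3"]
print(sorted(common), len(common))   # ['T1', 'T5', 'T6'] -> support count 3
```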

Conclusion

The Apriori algorithm is used for mining association rules. It works on the principle,
“the non-empty subsets of frequent itemsets must also be frequent”. It forms
k-itemset candidates from (k-1) itemsets and scans the database to find the frequent
itemsets.

Frequent Pattern Growth Algorithm is the method of finding frequent patterns without
candidate generation. It constructs an FP Tree rather than using the generate and
test strategy of Apriori. The focus of the FP Growth algorithm is on fragmenting the
paths of the items and mining frequent patterns.
https://www.softwaretestinghelp.com/weka-explorer-tutorial/

INTRODUCTION:

1. Frequent item sets are a fundamental concept in association rule mining, which is a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify relationships between items in a dataset that occur frequently together.
2. A frequent item set is a set of items that occur together frequently in a
dataset. The frequency of an item set is measured by the support count,
which is the number of transactions or records in the dataset that contain the
item set. For example, if a dataset contains 100 transactions and the item set
{milk, bread} appears in 20 of those transactions, the support count for {milk,
bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to
find frequent item sets and generate association rules. These algorithms work
by iteratively generating candidate item sets and pruning those that do not
meet the minimum support threshold. Once the frequent item sets are found,
association rules can be generated by using the concept of confidence, which
is the ratio of the number of transactions that contain the item set and the
number of transactions that contain the antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks
such as market basket analysis, cross-selling and recommendation systems.
However, it should be noted that association rule mining can generate a large
number of rules, many of which may be irrelevant or uninteresting. Therefore,
it is important to use appropriate measures such as lift and conviction to
evaluate the interestingness of the generated rules.

Association Mining searches for frequent items in the data set. In frequent mining, interesting associations and correlations between item sets in transactional and relational databases are usually found. In short, frequent mining shows which items appear together in a transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a transactional dataset. If two items X and Y are frequently purchased together, then it is good to put them together in stores or to provide a discount offer on one item on purchase of the other item. This can really increase sales. For example, it is likely that if a customer buys Milk and Bread, he/she also buys Butter. So the association rule is {Milk, Bread} => {Butter}, and the seller can suggest that the customer buy Butter if he/she buys Milk and Bread.
Important Definitions :
Important Definitions :

● Support : It is one of the measures of interestingness. This tells about the


usefulness and certainty of rules. 5% Support means total 5% of transactions
in the database follow the rule.

Support(A -> B) = Support_count(A ∪ B) / Total number of transactions

● Confidence: A confidence of 60% means that 60% of the customers who


purchased a milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)


If a rule satisfies both minimum support and minimum confidence, it is a strong rule.

● Support_count(X): Number of transactions in which X appears. If X is A union


B then it is the number of transactions in which A and B both are present.
● Maximal Itemset: An itemset is maximal frequent if it is frequent and none of its supersets are frequent.
● Closed Itemset: An itemset is closed if none of its immediate supersets has the same support count as the itemset itself.
● K-Itemset: An itemset that contains K items is a K-itemset. An itemset is frequent if its support count is greater than or equal to the minimum support count.

Example On finding Frequent Itemsets – Consider the given dataset with given
transactions.

● Let's say the minimum support count is 3.

● The relation that holds is: maximal frequent => closed frequent => frequent.

1-frequent itemsets:
{A} = 3 // not closed due to {A, C}; not maximal
{B} = 4 // not closed due to {B, D}; not maximal
{C} = 4 // not closed due to {C, D}; not maximal
{D} = 5 // closed, since no immediate superset has the same count; not maximal

2-frequent itemsets:
{A, B} = 2 // not frequent (support count < minimum support count), so ignore
{A, C} = 3 // not closed due to {A, C, D}
{A, D} = 3 // not closed due to {A, C, D}
{B, C} = 3 // not closed due to {B, C, D}
{B, D} = 4 // closed but not maximal due to {B, C, D}
{C, D} = 4 // closed but not maximal due to {B, C, D}

3-frequent itemsets:
{A, B, C} = 2 // not frequent, so ignore
{A, B, D} = 2 // not frequent, so ignore
{A, C, D} = 3 // maximal frequent
{B, C, D} = 3 // maximal frequent

4-frequent itemsets:
{A, B, C, D} = 2 // not frequent, so ignore
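The listing above can be reproduced with a short script. Since the transaction table itself is not included in this copy, the five transactions in the sketch below are a hypothetical reconstruction chosen to match the quoted support counts, and the helper names are ours:

```python
from itertools import combinations

# Hypothetical transactions chosen so that the support counts match the listing above.
transactions = [{"A", "B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "C", "D"}, {"B", "C", "D"}, {"B", "D"}]
min_sup = 3

def support_count(itemset):
    return sum(set(itemset) <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = {frozenset(c): support_count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support_count(c) >= min_sup}

for itemset, count in frequent.items():
    supersets = [s for s in frequent if itemset < s]
    closed = all(frequent[s] != count for s in supersets)   # no frequent superset with equal support
    maximal = not supersets                                  # no frequent superset at all
    print(sorted(itemset), count, "closed" if closed else "", "maximal" if maximal else "")
```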
ADVANTAGES OR DISADVANTAGES:
Advantages of using frequent item sets and association rule mining include:

1. Efficient discovery of patterns: Association rule mining algorithms are efficient


at discovering patterns in large datasets, making them useful for tasks such
as market basket analysis and recommendation systems.
2. Easy to interpret: The results of association rule mining are easy to
understand and interpret, making it possible to explain the patterns found in
the data.
3. Can be used in a wide range of applications: Association rule mining can be
used in a wide range of applications such as retail, finance, and healthcare,
which can help to improve decision-making and increase revenue.
4. Handling large datasets: These algorithms can handle large datasets with
many items and transactions, which makes them suitable for big-data
scenarios.

Disadvantages of using frequent item sets and association rule mining include:

1. Large number of generated rules: Association rule mining can generate a


large number of rules, many of which may be irrelevant or uninteresting,
which can make it difficult to identify the most important patterns.
2. Limited in detecting complex relationships: Association rule mining is limited in
its ability to detect complex relationships between items, and it only considers
the co-occurrence of items in the same transaction.
3. Can be computationally expensive: As the number of items and transactions
increases, the number of candidate item sets also increases, which can make
the algorithm computationally expensive.
4. Need to define the minimum support and confidence threshold: The minimum
support and confidence threshold must be set before the association rule
mining process, which can be difficult and requires a good understanding of
the data.
In data mining, the process of rating the usefulness and importance of patterns found is known as pattern evaluation. It is essential for drawing insightful conclusions from enormous volumes of data. Data mining professionals can assess patterns to establish the applicability and validity of newly acquired knowledge, facilitating informed decision-making and generating practical results.

Several metrics and criteria, including support, confidence, and lift, are used in this evaluation method to statistically evaluate the patterns' sturdiness and dependability. In this post, we will be looking at pattern evaluation methods in data mining. Let's begin.

Understanding Pattern Evaluation

In the field of data mining, the objective is to draw useful information and insights from vast amounts of data. Finding patterns, trends, and correlations in data allows for the discovery of hidden information that can help with decision-making and problem-solving. An essential step in this process is pattern evaluation, which involves systematically evaluating the identified patterns to ascertain their utility, importance, and quality.

It acts as a filter to distinguish useful patterns from noise or unimportant connections, and it is a crucial phase in the data mining workflow. Pattern evaluation and pattern discovery go hand in hand, since the assessment standards and metrics adopted are frequently shaped by the aims and purposes of the mining operation.

Types of Patterns in Data Mining

Association rules

Association rules are core patterns in data mining, used to find connections or correlations between objects in a collection. These rules show co-occurrence patterns, which aid in revealing concealed dependencies or linkages. An association rule, for instance, may show in a market basket study that consumers who buy diapers also frequently buy infant formula. Businesses might conduct customized marketing campaigns or optimize product placement with the help of these analytics.

In assessing association rules, support and confidence metrics are essential. Support describes how frequently an item set appears in a dataset, indicating how frequently a rule is true. Confidence, on the other hand, describes the conditional probability of the consequent given the antecedent. Higher support levels signify stronger relationships, while confidence gauges the rule's dependability or correctness.

Sequential Patterns

Data mining also uses sequential patterns, which concentrate on the time sequencing of transactions or occurrences. These patterns help analysts comprehend trends in behavior across time by pointing out repeated sequences or trends in sequential data. Sequential patterns, for instance, might identify the most typical user pathways on a website when examining online clickstreams.

Specific sequence assessment measures are applied to examine sequential patterns. These metrics express how important or interesting a sequence pattern is. Sequence length, frequency, and predictive metrics, including predictive accuracy and predictive power, are typical assessment criteria. These assessment metrics assist analysts in locating significant and useful patterns within sequential data, producing insightful information.


Evaluation Methods for Association Rules

Support-Confidence Framework

In data mining, one of the most widely used methods for evaluating association rules is the support-confidence framework. Support measures how frequently a rule is true by describing the frequency or recurrence of an item set in a dataset. It is determined by dividing the number of transactions that contain the itemset by the total number of transactions. Confidence represents the conditional likelihood of the consequent item given the antecedent item. It is calculated as the proportion of transactions containing both the antecedent and the consequent to transactions containing the antecedent.

Lift and Conviction Measures

Lift and conviction are additional assessment metrics used to rate the strength and interestingness of association rules. Lift quantifies how dependent the antecedent and consequent elements of a rule are. It is calculated as the ratio of the rule's observed support to the support that would be expected if the antecedent and consequent were independent. When the lift value exceeds 1, there is a positive correlation between the components; when it is below 1, there is a negative correlation or independence.

Conviction, on the other hand, indicates how strongly the antecedent implies the consequent: it compares how often the consequent would be expected to be absent under independence with how often the rule actually fails. It is calculated as the ratio of the complement of the consequent's support to the complement of the rule's confidence. Conviction values larger than 1 imply strong links between the items, while values closer to 1 suggest weaker relationships.
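In the notation used earlier for support and confidence, these two measures can be written as follows (our formalization of the prose above):

Lift(A -> B) = Confidence(A -> B) / Support(B) = Support(A ∪ B) / (Support(A) * Support(B))

Conviction(A -> B) = (1 - Support(B)) / (1 - Confidence(A -> B))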
Evaluation Methods for Sequential Patterns

Sequential Pattern Evaluation

Evaluation of sequential patterns entails determining the importance and applicability of patterns found in sequential data. The Sequential Pattern Growth algorithm is one often-employed technique for assessing sequential patterns. It finds sequential patterns by gradually extending them from shorter to longer sequences, making sure that each extension is still common in the dataset. This technique allows analysts to quickly find and assess sequential patterns of various durations and complexity.

Episode Evaluation

Episode evaluation is another assessment technique used in the study of sequential patterns. The term "episode" refers to a group of related events that take place in a predetermined time frame or sequence. In medical research, for instance, episodes could stand for groups of symptoms that frequently coexist in a given condition.

Measuring the importance and recurrence of certain event combinations is the main goal of episode assessment. By examining episodes, analysts can obtain insight into the patterns of how events occur together and can find significant temporal or associational correlations in the sequential data.

Conclusion

The lift and conviction measures for association rules, the sequential pattern growth algorithm, and episode assessment for sequential patterns are only a few of the strategies used in data mining's pattern evaluation methodologies. These techniques enable analysts to evaluate the importance, dependability, and interestingness of patterns found in datasets.

The correct assessment technique must be used to ensure the extraction of valuable insights, enable informed decision-making, and assist organizations in optimizing their operations using the data's trustworthy patterns and relationships.

Pattern Evaluation Methods in Data Mining

In a world where data is becoming increasingly abundant, the ability to extract insights from large datasets is becoming a critical skill. This is where data mining comes in: a field that combines statistics, machine learning, and computer science to discover patterns and insights in large datasets.

However, not all patterns are created equal. Some patterns may be spurious or meaningless, while others may be highly predictive and useful. This is where pattern evaluation methods come in - a set of techniques used to assess the quality and usefulness of patterns discovered through data mining. Let's dive into pattern evaluation methods in data mining and their importance.

Pattern Evaluation in Data Mining

Whenever a pattern is discovered in data mining, it must be evaluated to determine its reliability. Patterns may be assessed using a variety of metrics, depending on the context. Algorithms for pattern evaluation methods in data mining may be tested in several ways:

Accuracy
The accuracy of a data mining model may be defined as the extent to which it correctly predicts the values of the target variable. After the model has been trained, it is evaluated with a separate test dataset to determine how well it performs. The most common approach is to track the proportion of predictions that are correct; this proportion is the so-called "accuracy rate." A model is not considered reliable if it achieves 100% accuracy on the data used for training but only 50% accuracy on the data used for testing: such a model is overfitting and cannot be used for analyzing new data. A respectable model should achieve roughly 80% or higher accuracy on both the training data and the test data to be considered credible, so that it can be applied to newly collected data.

Accuracy is not the only thing to consider for data mining models. For numeric predictions, error measures are also used: the mean absolute error (MAE) is calculated by averaging the absolute errors, and the root mean squared error (RMSE) by taking the square root of the mean squared error.

Clustering Accuracy

This statistic determines how well the patterns discovered by the algorithm can be used to cluster newly collected data accurately. The usual approach is to apply the detected patterns to a set of data that has already been tagged with known cluster labels. Accuracy may then be determined by examining the degree to which the predicted labels agree with the existing labels.

The effectiveness of a clustering algorithm can be evaluated using various criteria, including the following:

■ Clustering quality can be determined using internal indices, which do not depend on any data from the outside world. The Dunn index is one of the most commonly used internal indices.
■ Stability quantifies how well the clustering maintains its integrity in the face of changes in the data. A strategy is considered stable when it consistently produces the same clustering results over a wide variety of data samples.
■ External indices evaluate how well the algorithm's clusters align with an external reference. Tools such as the Rand index and the Jaccard coefficient can be used if the true labels are known.
■ One of the most important practical indicators of an algorithm's effectiveness is how quickly it can cluster the data appropriately.

Classification Accuracy

This metric evaluates how well the patterns found by the algorithm can be used to label new data. Typically, this is accomplished by applying the identified patterns to a dataset already classified with known class labels. Accuracy may be calculated by checking how well the predicted labels match the true ones. For classification models, a standard performance measure is classification accuracy, which is simply the proportion of times the model gets its predictions right. While classification accuracy is a simple and straightforward statistic, it can be deceiving in some circumstances.
Measures of classification model performance such as precision and recall are more illuminating on unbalanced data sets. Precision measures the model's accuracy when it predicts a class, while recall indicates how many instances of that class it correctly identified. Seeing where a model does well or poorly makes it easier to find its weaknesses. An additional instrument for evaluating classification models is the confusion matrix: a table that describes the proportion of correct and incorrect predictions made by the model for each class.

Visual Examination

With this method, arguably the most common one, the data miner visually examines the data to decide whether or not the patterns make sense. Visual analysis involves plotting the data and then examining the emerging pattern. This method is used when the data is simple enough to be shown straightforwardly, and it is also frequently used for categorically presented information. The process of determining patterns in data by visually examining it is referred to as "visual inspection" in data mining. One can look at the raw data or at a graph or plot to accomplish this. This approach is commonly used for identifying irregularities and patterns that do not conform to the norm.

Running Time

The time it takes to train a model and produce predictions is a frequent metric for evaluating the effectiveness of a machine learning algorithm, though it is by no means the only one. Here, the time it takes for the algorithm to analyze the data and identify patterns is quantified, usually in seconds or minutes. This type of assessment is referred to as a "running time pattern."

Measuring the execution time of an algorithm requires attention to several factors. The first consideration is how long it takes for the data to be loaded into memory. Second, you must think about the time it takes to pre-process the data. Last, you must factor in the time necessary to train the model and generate forecasts.

Algorithm execution time tends to grow with data size, because a larger data set requires more processing from the learning algorithm. While most algorithms can handle enormous datasets, some perform better than others. It is essential to consider the dataset being utilized while comparing algorithms, as different kinds of data may require different algorithms. Hardware can also play a role in how long something takes to run.

Support

A pattern's strength is measured by the percentage of records in the whole data set that contain the pattern. Support-based pattern evaluation methods are frequently included in data mining and machine learning programs; they aim to find intriguing and valuable patterns in the data. To aid decision-making, it becomes necessary to evaluate the supported patterns of an association method to see whether any are of interest.

Several approaches exist for gauging the efficacy of a specific support pattern. Using a support metric, which counts how many times a particular pattern appears in a dataset, is a typical method. Employing a lift metric, which compares the actual frequency of a pattern to its predicted frequency, is another popular strategy.

Confidence

In pattern evaluation methods in data mining, the quality of identified patterns is judged through a process known as pattern assessment. Standard methods for making this assessment include counting the number of occurrences of a pattern in a given data set and comparing that number to the number of occurrences expected under the data's normal distribution. A pattern is considered to inspire high confidence if the frequency with which it is observed is far higher than expected by chance alone. One may also measure the reliability of a pattern by the proportion of times it is validated as suitable.

Lift

A pattern's lift is the ratio of the actual number of successes to the projected number of successes.

A lift pattern can be plotted as the true positive rate (TPR) against the false positive rate (FPR). TPR measures how well a model correctly classifies positive examples, whereas FPR measures how often negative examples are wrongly labeled as positive. While a TPR of 100% and an FPR of 0% would be ideal, this is rarely the case in the real world; a good model's lift pattern should lie well away from the diagonal.

The model's performance drops significantly when the lift pattern stays too close to the diagonal. Numerous issues, such as skewed data, inadequate feature selection, and model overfitting, might contribute to this. In that case, we may infer that the model identifies a comparable proportion of cases as positive and negative, and that the TPR and FPR are nearly identical.

Prediction

A pattern's predictive accuracy may be estimated from the number of times it is validated. Prediction pattern evaluation measures a model's predictive ability, that is, how well it can extrapolate from historical data. Evaluating a model's performance, or comparing several models, is possible and valuable with prediction pattern evaluation methods.
To assess a prediction pattern, it is common practice to divide the data set into a training set and a test set. The training set is used to teach the model how to behave, and the test set is used to evaluate how well it did; the prediction error is computed to assess the performance of the model. Evaluating prediction patterns makes it possible to enhance the precision of prediction models: by using a test set, predictive models may be modified to better suit the data, for example by adding additional features to the data set or by adjusting the model parameters.

Precision

Data from a wide range of sources can be analyzed with the help of precision pattern evaluation methods. This technique may be used to assess the reliability of data and to spot trends and patterns within the information at hand. Errors in the data may be detected, and their root causes investigated, with the help of precision pattern evaluation. The effect of those inaccuracies on the reliability of the data as a whole may also be estimated using this technique.

Pattern evaluation methods in data mining can significantly benefit from precision pattern evaluation. This strategy may be used to refine data quality and spot trends and patterns.

Bootstrapping

Training the model on resampled data and then testing it on the actual data are the steps that make up this methodology. It may be used to obtain a performance distribution for the model, which can shed light on the model's stability. To gauge a model's precision, statisticians employ a resampling method called "bootstrapping." The process entails training the model on a randomly selected (with replacement) sample of the original dataset. After training, the model is put through its paces on a separate data set. Several iterations are performed, and the model's average accuracy is determined.
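A minimal sketch of this procedure (ours; fit and predict are placeholder callables standing in for whatever model is being evaluated):

```python
import random

def bootstrap_accuracy(data, labels, fit, predict, rounds=10, seed=0):
    """Average accuracy over models trained on bootstrap samples and tested on out-of-bag records."""
    rng = random.Random(seed)
    n = len(data)
    accuracies = []
    for _ in range(rounds):
        sample_idx = [rng.randrange(n) for _ in range(n)]   # draw n records with replacement
        in_bag = set(sample_idx)
        out_of_bag = [i for i in range(n) if i not in in_bag]
        if not out_of_bag:
            continue
        model = fit([data[i] for i in sample_idx], [labels[i] for i in sample_idx])
        correct = sum(predict(model, data[i]) == labels[i] for i in out_of_bag)
        accuracies.append(correct / len(out_of_bag))
    return sum(accuracies) / len(accuracies)
```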

https://www.geeksforgeeks.org/multilevel-association-rule-in-data-mining
/

https://codinginfinite.com/fp-growth-algorithm-explained-with-numerical
-example/

https://www.geeksforgeeks.org/ml-frequent-pattern-growth-algorithm/

6. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.

Formula for Euclidean distance:

distance = ((p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2)^(1/2)

Now,
distance = ((22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2)^(1/2)
= ((2)^2 + (1)^2 + (6)^2 + (2)^2)^(1/2)
= (4 + 1 + 36 + 4)^(1/2)
= 45^(1/2) ≈ 6.708

b. Manhattan distance:
d = |p1-q1| + |p2-q2| + ... + |pn-qn|
d = |22-20| + |1-0| + |42-36| + |10-8|
= 2 + 1 + 6 + 2
= 11

Formula for Minkowski distance:

distance = (|p1-q1|^q + |p2-q2|^q + ... + |pn-qn|^q)^(1/q)

With q = 3:
= (|22-20|^3 + |1-0|^3 + |42-36|^3 + |10-8|^3)^(1/3)
= ((2)^3 + (1)^3 + (6)^3 + (2)^3)^(1/3)
= (8 + 1 + 216 + 8)^(1/3)
= 233^(1/3) ≈ 6.153
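These three results can be verified with a short script (ours, for checking only):

```python
p = (22, 1, 42, 10)
q = (20, 0, 36, 8)

euclidean = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5         # sqrt(45)  ~ 6.708
manhattan = sum(abs(a - b) for a, b in zip(p, q))                  # 11
minkowski = sum(abs(a - b) ** 3 for a, b in zip(p, q)) ** (1 / 3)  # 233^(1/3) ~ 6.153

print(euclidean, manhattan, minkowski)
```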
