
UNIT 4: FREQUENT ITEMSETS AND CLUSTERING

Mining frequent item sets


Mining frequent item sets is a fundamental task in data mining, particularly in the context of association rule mining. The goal is to discover sets of items that frequently appear together in a dataset. This task is widely used in various applications, such as market basket analysis, where retailers aim to identify associations between products that customers frequently purchase together.

Here's a detailed explanation of mining frequent item sets:

1. Transaction Data:

The process begins with a dataset containing transactions. Each transaction consists of a set of items. For example, consider a dataset of customer transactions in a grocery store where each transaction lists the items purchased by a customer.

2. Support Count:

The support count of an itemset is the number of transactions in which the itemset appears. The support count is a crucial parameter, and it helps filter out infrequent itemsets. A support threshold value is defined, and itemsets that meet or exceed this threshold are considered frequent.

3. Support Threshold:

Users typically set a minimum support threshold, denoted as min_support, to filter out itemsets that are not frequent. This threshold is a user-defined parameter and depends on the specific application and dataset.

4. Frequent Itemsets:

An itemset is considered frequent if its support count is greater than or equal to the specified
minimum support threshold. Frequent itemsets represent sets of items that co-occur frequently in
the dataset.

5. Example:

Let's consider a small example with transactions:

T1: {bread, milk, eggs}

T2: {bread, butter, cheese}

T3: {milk, butter}

T4: {bread, milk, butter, cheese}

Suppose we set min_support to 3 (meaning an itemset must appear in at least 3 transactions to be considered frequent).

6. Support Count Calculation:

Calculate the support count for each itemset:

{bread}: 3 (T1, T2, T4)

{milk}: 3 (T1, T3, T4)

{butter}: 3 (T2, T3, T4)

{cheese}: 2 (T2, T4)

7. Frequent Itemsets:

Filter out itemsets that do not meet the minimum support threshold:

Frequent itemsets: {bread}, {milk}, and {butter}. {cheese} is not considered frequent because its support count (2) is below the min_support threshold (3).
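As a quick illustration (not part of the original example), the support counting and filtering of steps 6 and 7 can be reproduced with a short Python sketch; the transactions and min_support value are the ones given above.

# Transactions T1-T4 from the example above
transactions = [
    {"bread", "milk", "eggs"},              # T1
    {"bread", "butter", "cheese"},          # T2
    {"milk", "butter"},                     # T3
    {"bread", "milk", "butter", "cheese"},  # T4
]
min_support = 3  # an itemset must appear in at least 3 transactions

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

# Step 6: support count of every single-item itemset
items = set().union(*transactions)
counts = {item: support_count({item}, transactions) for item in items}
# counts -> bread: 3, milk: 3, butter: 3, cheese: 2, eggs: 1

# Step 7: keep only the frequent itemsets
frequent = {item for item, c in counts.items() if c >= min_support}
print(frequent)  # {'bread', 'milk', 'butter'}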

8. Association Rules (Optional):

Once frequent itemsets are identified, association rules can be generated. These rules express relationships between items in the frequent itemsets, providing insights into associations between different products.

Mining frequent itemsets is an essential step in discovering patterns and associations within large datasets, and it forms the basis for more advanced data mining tasks like association rule mining.

Association rules are a crucial concept in data mining and are often used to discover
interesting relationships or patterns within datasets. These rules identify associations between
items in a transaction dataset, revealing co-occurrence patterns that can provide valuable insights.
The most common measure used to evaluate the strength of an association rule is confidence.

Let's break down the concept of association rules with an example:

1. Transaction Data:

We start with a dataset of transactions, where each transaction contains a set of items. Continuing with the grocery store example:

T1: {bread, milk, eggs}

T2: {bread, butter, cheese}

T3: {milk, butter}

T4: {bread, milk, butter, cheese}

2. Support Count:

We've previously calculated support counts for each itemset. For the sake of this example, let's use a min_support of 2; under this threshold the frequent itemsets include {bread}, {milk}, {butter}, and pairs such as {bread, milk}, {bread, butter}, and {milk, butter}.
3. Association Rule Format:

Association rules are typically written in the form "A => B," where A and B are itemsets. For example, {bread} => {butter} represents an association rule.

4. Confidence:

Confidence is a measure of the strength of an association rule. It is defined as the probability of itemset B occurring in a transaction given that itemset A is present. Mathematically, confidence is calculated as follows:

Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)

5. Example Association Rules:

Rule 1: {bread} => {butter}, confidence = Support_count({bread, butter}) / Support_count({bread}) = 2/3

Rule 2: {milk} => {butter}, confidence = Support_count({milk, butter}) / Support_count({milk}) = 2/3

Rule 3: {bread, milk} => {butter}, confidence = Support_count({bread, milk, butter}) / Support_count({bread, milk}) = 1/2

6. Interpretation:

Interpretation of the rules:

Rule 1: If a customer buys bread, there is a 2/3 chance they will also buy butter.

Rule 2: If a customer buys milk, there is a 2/3 chance they will also buy butter.

Rule 3: If a customer buys both bread and milk, there is a 1/2 chance they will also buy butter.

7. Setting Confidence Threshold:

Analysts often set a minimum confidence threshold to filter out weak rules. For example, a confidence threshold of 0.6 might be used to only consider rules with a confidence of 60% or higher.

Association rules are powerful tools for discovering hidden patterns in data, and they find applications in various fields, such as market basket analysis, recommendation systems, and more. The interpretation of these rules is crucial for understanding customer behaviour and making informed decisions.
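As a rough sketch using the same four transactions as above, the confidence of the three rules can be computed as follows; the rule list is hard-coded for brevity.

transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "butter", "cheese"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "cheese"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Confidence(A => B) = support_count(A ∪ B) / support_count(A)
rules = [
    ({"bread"}, {"butter"}),
    ({"milk"}, {"butter"}),
    ({"bread", "milk"}, {"butter"}),
]
for antecedent, consequent in rules:
    conf = support_count(antecedent | consequent) / support_count(antecedent)
    print(antecedent, "=>", consequent, "confidence =", round(conf, 2))
# prints 0.67, 0.67 and 0.5 for the three rules interpreted above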

Apriori Algorithm
 The Apriori algorithm was given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules.
 The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
 It applies an iterative approach, or level-wise search, where frequent k-itemsets are used to find frequent (k+1)-itemsets.
 To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
 Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure.
 Apriori assumes that:
o All subsets of a frequent itemset must be frequent (Apriori property).
o If an itemset is infrequent, all its supersets will be infrequent.

Consider the following dataset; we will find the frequent itemsets and generate association rules for them.

TID   Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

minimum support count is 2

minimum confidence is 60%
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (candidate set):
{I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
(II) Compare the candidate set (C1) support counts with the minimum support count (here min_support = 2; if the support count of a candidate item is less than min_support, remove it). This gives us the itemset L1; here every item meets the threshold, so L1 = {I1}, {I2}, {I3}, {I4}, {I5}.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so for K=2 every pair of frequent items is a candidate.
 Check whether all subsets of each itemset are frequent, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L2: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}.

Step-3:
(I) Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
 Check whether all subsets of these itemsets are frequent, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
 Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L3: {I1, I2, I3}, {I1, I2, I5}.

Step-4:
 Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that they should have (K-2) elements in common, so here, for L3, the first 2 elements (items) should match.
 Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
 We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A->B) = Support_count(A∪B)/Support_count(A)

So here, by taking any frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100 = 50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100 = 50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100 = 50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100 = 33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100 = 28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100 = 33%
So if the minimum confidence is 50%, the first 3 rules can be considered strong association rules.
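As a rough, illustrative sketch of the full level-wise procedure (join, prune, count), the Python code below runs Apriori with min_support = 2; the nine transactions are hard-coded to be consistent with the support counts used in the walkthrough (sup(I1) = 6, sup(I2) = 7, sup(I1, I2, I3) = 2, and so on). The join here simply unions (k-1)-itemsets, which is a simpler but less efficient stand-in for the prefix-matching join described in the steps.

from itertools import combinations

# Nine transactions consistent with the support counts used above
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_support):
    items = sorted(set().union(*transactions))
    # L1: frequent single items
    L = [{frozenset([i]) for i in items
          if support_count({i}, transactions) >= min_support}]
    k = 2
    while L[-1]:
        prev = L[-1]
        # Join step: merge (k-1)-itemsets that differ in a single item
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates that meet the minimum support
        L.append({c for c in candidates
                  if support_count(c, transactions) >= min_support})
        k += 1
    return [lk for lk in L if lk]  # drop the final empty level

for k, lk in enumerate(apriori(transactions, min_support=2), start=1):
    print("L%d:" % k, [sorted(s) for s in lk])
# L1, L2 and L3 match the itemsets derived step by step above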

Consider another example of the Apriori algorithm as follows:

Frequent Item Sets: These are the itemsets that satisfy the minimum support level. The minimum support level has been set to 33%. With 12 transactions, 33% corresponds to a frequency of 4, since 4/12 = 0.333, i.e. 33%. So we cannot take into consideration items with a frequency of less than 4.
Now we start to build the association rules:
Handling Larger Datasets in Main Memory:
The A-Priori Algorithm is fine as long as the step with the greatest requirement for main
memory – typically the counting of the candidate pairs C2 – has enough memory that it can
be accomplished without thrashing (repeated moving of data between disk and main
memory).
Several algorithms have been proposed to cut down on the size of candidate set C2. Here,
we consider the PCY Algorithm, which takes advantage of the fact that in the first pass of
A-Priori there is typically lots of main memory not needed for the counting of single
items.
Then we look at the Multistage Algorithm, which uses the PCY trick and also inserts extra
passes to further reduce the size of C2.

The Algorithm of Park, Chen, and Yu
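The details of the PCY Algorithm are not reproduced here; as a rough sketch of the idea mentioned above, the spare main memory on the first pass is used as an array of buckets into which every pair in every basket is hashed and counted, and a pair is counted on the second pass only if both of its items are frequent and it hashes to a frequent bucket. The bucket count and the use of Python's built-in hash are arbitrary choices for illustration.

from itertools import combinations
from collections import Counter

def pcy_frequent_pairs(baskets, min_support, num_buckets=50):
    # Pass 1: count single items and hash every pair into a bucket
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # Between the passes this would be summarized as a bitmap: one bit per bucket
    frequent_bucket = [c >= min_support for c in bucket_counts]

    # Pass 2: count only pairs of frequent items that hashed to a frequent bucket
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            if (pair[0] in frequent_items and pair[1] in frequent_items
                    and frequent_bucket[hash(pair) % num_buckets]):
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}

For example, pcy_frequent_pairs([{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}], 2) returns the two pairs that occur at least twice.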


The Multistage Algorithm:
The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce further the
number of candidate pairs. The tradeoff is that Multistage takes more than two passes to find the frequent pairs.
An outline of the Multistage Algorithm is shown in following Fig.

Figure : The Multistage Algorithm uses additional hash tables to reduce the
number of candidate pairs
Pass 1: The first pass of Multistage is the same as the first pass of PCY: count the individual items, and hash each pair of items in each basket to a bucket of the first hash table, counting the buckets; the bucket counts are then summarized as a bitmap of frequent buckets.
HOME WORK:

(Plagiarism, Biomarkers, Related Concepts like web pages, blogs, tweets, etc.)

The Multihash Algorithm:


Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a single pass. This
variation of PCY is called the Multihash Algorithm. Instead of using two different hash tables on two successive
passes, use two hash functions and two separate hash tables that share main memory on the first pass, as
suggested by Fig. 6.7.
The danger of using two hash tables on one pass is that each hash table has half as many buckets as the one
large hash table of PCY. As long as the average count of a bucket for PCY is much lower than the support
threshold, we can operate two half-sized hash tables and still expect most of the buckets of both hash tables to
be infrequent. Thus, in this situation we might well choose the multihash approach.

Figure : The Multihash Algorithm uses several hash tables in one pass
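As a rough sketch of the first-pass difference (everything else proceeds as in PCY), the code below keeps two half-sized bucket arrays and hashes every pair with two hash functions; on the second pass a pair is counted only if both of its buckets are frequent. The two hash functions here are ad hoc stand-ins chosen only for illustration.

from itertools import combinations

def multihash_pass1(baskets, min_support, num_buckets=25):
    # Two half-sized hash tables share the memory that one PCY table would use
    table1 = [0] * num_buckets
    table2 = [0] * num_buckets

    def h1(pair):
        return hash(pair) % num_buckets

    def h2(pair):
        return hash(pair[::-1]) % num_buckets  # stand-in for an independent hash

    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            table1[h1(pair)] += 1
            table2[h2(pair)] += 1

    bitmap1 = [c >= min_support for c in table1]
    bitmap2 = [c >= min_support for c in table2]

    def is_candidate(pair):
        # On pass 2, a pair is counted only if BOTH of its buckets are frequent
        return bitmap1[h1(pair)] and bitmap2[h2(pair)]

    return is_candidate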
Limited-Pass Algorithms:
The algorithms for frequent itemsets discussed so far use one pass for each size of itemset we investigate. If
main memory is too small to hold the data and the space needed to count frequent itemsets of one size, there
does not seem to be any way to avoid k passes to compute the exact collection of frequent itemsets.
In this section we explore some algorithms that have been proposed to find all or most frequent itemsets using at
most two passes.
An algorithm called SON uses two passes, gets the exact answer, and lends itself to implementation by
MapReduce or another parallel computing regime.
Finally, Toivonen’s Algorithm uses two passes on average, gets an exact answer, but may, rarely, not terminate
in any given amount of time.

The Algorithm of Savasere, Omiecinski, and Navathe (The SON Algorithm and MapReduce):
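The SON section itself is not elaborated above, so the following is only a rough sketch consistent with the earlier description (two passes, exact answer, naturally parallelizable with MapReduce): split the baskets into chunks, mine each chunk with a proportionally lowered threshold (the map side), take the union of the per-chunk frequent itemsets as candidates, and make a second pass to count every candidate exactly (the reduce side). The apriori() helper from the Apriori sketch above is reused.

def son_frequent_itemsets(baskets, min_support, num_chunks=4):
    chunk_size = (len(baskets) + num_chunks - 1) // num_chunks
    chunks = [baskets[i:i + chunk_size] for i in range(0, len(baskets), chunk_size)]

    # Pass 1 ("map"): mine each chunk with the threshold scaled to the chunk size.
    # An itemset frequent in the whole data must be frequent in at least one chunk,
    # so no true frequent itemset can be missed.
    candidates = set()
    for chunk in chunks:
        local_support = max(1, min_support * len(chunk) // len(baskets))
        for level in apriori(chunk, local_support):  # apriori() from the sketch above
            candidates |= level

    # Pass 2 ("reduce"): count every candidate over the full dataset
    def support_count(itemset):
        return sum(1 for t in baskets if itemset <= t)

    return {c for c in candidates if support_count(c) >= min_support}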
Toivonen's algorithm:
Toivonen's algorithm is a two-pass algorithm used for finding frequent itemsets in a
data stream. It offers a good balance between accuracy and memory usage
compared to simpler single-pass approaches. Here's a breakdown of the algorithm
with an example:

Concept
Toivonen's algorithm leverages the concept of a "negative border." An itemset is in the negative border of a sample if it is not frequent in the sample, but all of its immediate subsets (itemsets with one element fewer) are frequent in the sample.

Steps:
1. Sample and Frequent Itemset Discovery (Pass 1):
o Take a small sample of the data stream (S).
o Use an existing frequent itemset mining algorithm (like Apriori) to find
all frequent itemsets (F) within the sample (S).
o Identify all immediate subsets of the frequent itemsets in the sample
(F').
o Find all itemsets in the negative border of the sample (N): these are
itemsets that are not frequent in the sample (S) but all their immediate
subsets (F') are frequent.
o Lower the minimum support threshold for the sample in proportion to the
sample's size (if the sample contains a fraction p of the baskets, use
roughly p × s, or slightly less, e.g. 0.9 × p × s) to reduce the chance of
missing itemsets that are frequent in the whole dataset but narrowly miss
the threshold in the sample.
2. Full Data Scan and Counting (Pass 2):
o Scan the entire data stream.
o For each itemset encountered:
 If the itemset is in the set of frequent itemsets from the sample
(F), increment its count.
 If the itemset is in the negative border from the sample (N), also
increment its count.
3. Frequent Itemset Determination:
o After scanning the entire stream, analyze the counts:
 Any itemset from the sample (F) whose count meets or exceeds the
original minimum support threshold (s) is considered frequent in
the entire data stream.
 There are two possibilities for itemsets in the negative border
(N):
 If no members of the negative border are frequent in the
entire data stream, then all itemsets from the sample (F)
that were marked frequent are truly frequent in the data
stream.
 If some members of the negative border are frequent in the
entire data stream, we cannot be sure that no other frequent
itemsets were missed; in that case the algorithm gives no
answer and must be repeated with a new sample (which is why,
rarely, it may not terminate in a bounded amount of time).
Example:

Imagine a data stream containing transactions from a grocery store (basket IDs and
purchased items). Let's say the minimum support threshold (s) is 3 (an itemset
needs to appear in at least 3 baskets to be considered frequent).

Pass 1:
 Sample: Basket 1 (Bread, Milk), Basket 2 (Milk, Eggs), Basket 3 (Bread,
Eggs, Cereal)
 Frequent itemsets in the sample (F): {Bread, Milk}, {Milk, Eggs}
 Immediate subsets (F'): {Bread}, {Milk}, {Eggs}
 Negative border (N): {Cereal} (not frequent in the sample, but all subsets are
frequent)
 Lowered support threshold for the sample: the full threshold s scaled down
in proportion to the sample's share of the data.
Pass 2:
 We scan the entire data stream, keeping track of counts for itemsets in F and
N.
Result Analysis:
 Analyze the final counts after scanning the entire stream.
 If none of the itemsets in N are frequent in the entire data stream, then
{Bread, Milk} and {Milk, Eggs} (from F) are guaranteed to be frequent in the
whole data stream.
 If some member of N turns out to be frequent in the whole stream, the
result cannot be certified; the two passes are then repeated with a new
sample.
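As a rough sketch of the two passes and of how the negative border is built (the apriori() helper from the Apriori sketch above is reused; the sampling fraction and the 0.9 safety factor are illustrative choices):

import random

def negative_border(frequent_in_sample, all_items):
    # Itemsets not frequent in the sample whose immediate subsets all are
    freq = set(frequent_in_sample) | {frozenset()}  # the empty set counts as frequent
    border = set()
    for fs in freq:
        for item in all_items - set(fs):
            cand = frozenset(fs | {item})
            if cand in freq:
                continue
            if all(cand - {i} in freq for i in cand):
                border.add(cand)
    return border

def toivonen(baskets, min_support, sample_fraction=0.25, safety=0.9):
    # Pass 1: mine a random sample with a lowered threshold, then build N
    sample = [b for b in baskets if random.random() < sample_fraction]
    sample_support = max(1, int(safety * sample_fraction * min_support))
    F = set().union(*apriori(sample, sample_support))
    all_items = set().union(*baskets)
    N = negative_border(F, all_items)

    # Pass 2: count every itemset in F and N over the full dataset
    def count(itemset):
        return sum(1 for t in baskets if itemset <= t)

    if any(count(s) >= min_support for s in N):
        return None  # a negative-border itemset is frequent: repeat with a new sample
    return {s for s in F if count(s) >= min_support}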
Benefits of Toivonen's Algorithm:
 Reduced Memory Usage: Compared to single-pass approaches that might
need to store all candidate itemsets, Toivonen's algorithm uses just two sets
(F and N) from the sample, reducing memory requirements.
 Guaranteed No False Positives: an itemset is reported as frequent only if
its count over the full data meets the support threshold.
 No False Negatives on Success: when the algorithm produces an answer (no
negative-border itemset is frequent in the full data), the reported
collection is exactly the set of frequent itemsets.
Drawbacks:
 Two-Pass Processing: Requires scanning the data stream twice, potentially
increasing processing time compared to single-pass algorithms.
 Sample Size Selection: Choosing the right sample size is crucial for
balancing accuracy and memory usage.

Toivonen's algorithm provides a valuable approach for efficient frequent itemset mining in data streams, offering a good balance between accuracy, memory usage, and processing time.
