
Unit II

What Is An Itemset?

A set of items together is called an itemset. An itemset that contains k items is called a k-itemset. An itemset that occurs frequently is called a frequent itemset. Thus, frequent itemset mining is a data mining technique to identify the items that often occur together.

For Example, Bread and butter, Laptop and Antivirus software, etc.

What Is A Frequent Itemset?

A set of items is called frequent if it satisfies a minimum threshold value for support and confidence. Support indicates how often the items are purchased together in a single transaction, while confidence indicates how often the consequent items are purchased when the antecedent items have been purchased.

In frequent itemset mining, we consider only those itemsets that meet the minimum support and confidence requirements. Insights from these mining algorithms offer many benefits, such as cost reduction and improved competitive advantage.

There is a tradeoff between the time taken to mine the data and the volume of data handled in frequent mining. An efficient frequent mining algorithm aims to mine the hidden patterns of itemsets in a short time and with low memory consumption.

PROBLEM

When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive
list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner,
while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in
several ways. If there is a pair of items, X and Y, that are frequently bought together:

 Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
 Promotional discounts could be applied to just one out of the two items.
 Advertisements on X could be targeted at buyers who purchase Y.
 X and Y could be combined into a new product, such as having Y in flavors of X.

While we may know that certain items are frequently bought together, the question is, how do we uncover these associations?

Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to co-occur (be comorbid) can help to improve patient care and medicine prescription.

Definition
Association rules analysis is a technique to uncover how items are associated with each other. Association rule mining, at a basic level, involves the use of machine learning models to analyze data for patterns, or co-occurrences, in a database. It identifies frequent if-then associations, which themselves are the association rules.

When applied to supermarket transaction data, association rule mining examines customer behaviour in terms of the purchased products: association rules describe how often the items are purchased together.

An association rule has two parts:


 antecedent (if): the condition being tested. An antecedent is an item found within the data.
 consequent (then): the outcome that occurs if the condition is met. A consequent is an item found in combination with the antecedent.

We can understand it by taking the example of a supermarket, where all products that are purchased together are placed together.

For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these products are stored on the same shelf or mostly nearby.

Measures of the effectiveness of association rules

Association rules are created by searching data for frequent if-then patterns and by measuring the strength of a given association rule using the criteria support and confidence, in order to identify the most important relationships. Support is an indication of how frequently the items appear in the data.

Table 1. Example Transactions


If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits,
you might consider using that proportion as your support threshold. You may then identify itemsets with
support values above this threshold as significant itemsets.

Measure 1: Support.

Support is defined as the fraction of the transactions T that contain the itemset X. For an itemset X over a set of transactions T, it can be written as:

Support(X) = (number of transactions containing X) / (total number of transactions)

This says how popular an itemset is, as measured by the proportion of transactions in which the itemset appears. In Table 1, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.

Measure 2: Confidence.

Confidence refers to how often a given rule turns out to be true in practice, i.e. how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X -> Y) = Support({X, Y}) / Support({X})

This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. It is measured by the proportion of transactions with item X in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.

One drawback of the confidence measure is that it might misrepresent the importance of an association. This
is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general,
there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the
confidence measure. To account for the base popularity of both constituent items, we use a third measure
called lift.
Measure 3: Lift.

A third metric, called lift, compares the confidence of a rule with its expected confidence, i.e. how often the if-then statement would be expected to hold if X and Y were independent. It is the ratio of the observed support of X and Y together to the support expected if X and Y were independent of each other:

Lift(X -> Y) = Confidence(X -> Y) / Support({Y}) = Support({X, Y}) / (Support({X}) * Support({Y}))

This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought.

It has three possible ranges of values:

o If Lift = 1: the occurrences of the antecedent and the consequent are independent of each other.
o Lift > 1: the two itemsets are positively dependent on each other; the larger the lift, the stronger the dependence.
o Lift < 1: one item is a substitute for the other, which means one item has a negative effect on the occurrence of the other.
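To make these three measures concrete, here is a small Python sketch that computes them for an eight-transaction basket. The transaction contents below are assumed for illustration (they reproduce the quoted numbers for {apple} and {apple -> beer}, but they are not necessarily the exact contents of Table 1):

# Minimal sketch; the transactions are assumed example data
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "mango"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "mango"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of X and Y together divided by support of X
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # Confidence corrected for the base popularity of the consequent
    return confidence(antecedent, consequent) / support(consequent)

print(support({"apple"}))                  # 0.5  -> 50%
print(support({"apple", "beer", "rice"}))  # 0.25 -> 25%
print(confidence({"apple"}, {"beer"}))     # 0.75 -> 75%
print(lift({"apple"}, {"beer"}))           # 1.0  -> no association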

Support focuses on identifying frequent itemsets, while confidence measures the strength of association
between antecedent and consequent itemsets in association rules.

Association rules are calculated from itemsets, which are made up of two or more items. If rules were built by analyzing all possible itemsets, there could be so many rules that they would hold little meaning. For that reason, association rules are typically created from itemsets that are well represented in the data.

A rule may appear to show a strong relationship in a data set because its items occur very often, yet the full rule may hold far less often when applied. This would be a case of high support but low confidence.

Conversely, a rule might not particularly stand out in a data set, yet further analysis shows that whenever it applies, it holds very frequently. This would be a case of high confidence and low support. Using these measures helps analysts separate causation from correlation and allows them to properly value a given rule.

Association rule mining consists of two steps:

1. Find all the frequent itemsets.
2. Generate association rules from the above frequent itemsets (a rough sketch of this step follows below).
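For the second step, a rough Python sketch of rule generation from a single frequent itemset is given below; it assumes a support() function such as the one sketched earlier, and the minimum confidence value is an illustrative choice:

# Rough sketch: turn one frequent itemset into association rules.
# `support` is any function returning the support of an itemset.
from itertools import combinations

def rules_from_itemset(itemset, support, min_conf=0.6):
    itemset = frozenset(itemset)
    rules = []
    # Try every non-empty proper subset as the antecedent X; the rest is Y
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_conf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

# Example call: rules_from_itemset({"apple", "beer"}, support)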

Frequent Pattern Mining (FPM)

The frequent pattern mining algorithm is one of the most important techniques of data mining, used to discover relationships between different items in a dataset. These relationships are represented in the form of association rules. It helps to find the regularities in data.

FPM has many applications in the field of data analysis, software bug detection, cross-marketing, sales campaign analysis, market basket analysis, etc.

Why Frequent Itemset Mining?

Frequent itemset or pattern mining is broadly used because of its wide applications in mining association rules, correlations, and constraint-based graph patterns built on frequent patterns, sequential patterns, and many other data mining tasks.

Frequent itemsets discovered through Apriori have many applications in data mining tasks; finding interesting patterns in the database, finding sequences, and mining association rules are the most important of them.

In frequent itemset mining, the process of discovering sets of items that frequently occur together in a dataset,
there are two main approaches: one with candidate generation and one without candidate generation. These
approaches are typically associated with two popular algorithms:

Apriori (with candidate generation) and FP-growth (without candidate generation).

With Candidate Generation (e.g., Apriori):

Apriori Algorithm: This algorithm is one of the earliest and most well-known frequent itemset mining
algorithms. It uses candidate generation as a key step.

Candidate Itemsets: In Apriori, candidate itemsets are generated at each iteration. The algorithm starts with
single items and gradually generates larger itemsets by joining frequent itemsets from the previous level.
These candidate itemsets are then pruned based on the Apriori property, which states that if an itemset is
infrequent, none of its supersets can be frequent.

Support Counting: The support (frequency) of these candidate itemsets is then counted in the dataset to
identify frequent itemsets (those that meet a predefined support threshold).

Pros: Apriori is easy to understand and implement. It works well for a wide range of datasets and support
thresholds.

Cons: The generation of candidate itemsets and multiple passes over the dataset can make it inefficient for
large datasets.
Without Candidate Generation (e.g., FP-growth):

FP-growth Algorithm: FP-growth (Frequent Pattern growth) is an alternative approach to frequent itemset
mining that does not rely on candidate generation.

FP-tree Structure: FP-growth builds an FP-tree (Frequent Pattern tree) from the dataset, which encodes
itemset frequencies compactly.

Conditional FP-trees: After building the FP-tree, it extracts frequent itemsets directly from the tree structure
using a recursive process, without generating candidate itemsets explicitly.
Apriori Algorithm – Frequent Pattern Algorithms

The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later improved by R. Agrawal and R. Srikant and came to be known as Apriori. This algorithm uses two steps, "join" and "prune", to reduce the search space. It is an iterative approach to discover the most frequent itemsets.

Apriori says that an itemset I is not frequent if:

 P(I) < minimum support threshold, then I is not frequent.

 P(I + A) < minimum support threshold, then I + A is not frequent,
where A also belongs to the itemset.

This follows directly from how the Apriori algorithm works. If the support of an itemset I (P(I)) is less than the minimum support threshold, then I is not considered frequent. Similarly, if the support of an itemset I + A (P(I + A)), where A is another item, is less than the minimum support threshold, then I + A is not considered frequent.

Let's illustrate this statement with a simple example:

Suppose we have a transaction dataset containing shopping baskets, and we want to find frequent itemsets
with a minimum support threshold of 3.

Transaction Dataset:

Transaction 1: {Milk, Bread, Eggs}


Transaction 2: {Milk, Bread}
Transaction 3: {Milk, Eggs}
Transaction 4: {Bread, Eggs}
Transaction 5: {Milk, Bread, Eggs}

Now, let's consider two itemsets, I and A:

Itemset I = {Milk, Bread}


Item A = {Eggs}
First, we need to calculate the support for these itemsets:

P(I) = Support({Milk, Bread}) = 3 (Transactions 1, 2, and 5 contain {Milk, Bread})

P(A) = Support({Eggs}) = 4 (Transactions 1, 3, 4, and 5 contain {Eggs})
P(I + A) = Support({Milk, Bread, Eggs}) = 2 (Transactions 1 and 5 contain {Milk, Bread, Eggs})
Now, let's apply the statement:

P(I) < minimum support threshold: In our example, P(I) = 3, which is equal to the minimum support threshold of 3. Therefore, itemset I = {Milk, Bread} is frequent.

P(I + A) < minimum support threshold: P(I + A) = 2, which is less than the minimum support threshold of 3. Therefore, itemset I + A = {Milk, Bread, Eggs} is not frequent.

In this example, itemset I is frequent because its support meets the minimum support threshold, while its superset I + A is infrequent because its support falls below the threshold. Any itemset whose support is less than the minimum threshold is considered infrequent, and so are all of its supersets.
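These support counts can be checked with a few lines of Python over the five transactions listed above (a minimal sketch):

# Verify the support counts from the example above
transactions = [
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread"},
    {"Milk", "Eggs"},
    {"Bread", "Eggs"},
    {"Milk", "Bread", "Eggs"},
]

def support_count(itemset):
    # Number of transactions containing every item of the itemset
    return sum(set(itemset) <= t for t in transactions)

print(support_count({"Milk", "Bread"}))          # 3 -> frequent (>= 3)
print(support_count({"Eggs"}))                   # 4 -> frequent
print(support_count({"Milk", "Bread", "Eggs"}))  # 2 -> infrequent (< 3)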

If an itemset has a support value less than the minimum support, then all of its supersets will also fall below the minimum support and can thus be ignored. This property is called the antimonotone property.

The steps followed in the Apriori Algorithm of data mining are:

1. Join Step: This step generates (K+1)-itemset candidates from K-itemsets by joining the set of frequent K-itemsets with itself.
2. Prune Step: This step scans the count of each candidate itemset in the database. If a candidate itemset does not meet the minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.

Steps In Apriori

The Apriori algorithm is a sequence of steps to be followed to find the most frequent itemsets in the given database. This data mining technique follows the join and prune steps iteratively until the most frequent itemsets are found. A minimum support threshold is given in the problem or is assumed by the user.

#1) In the first iteration of the algorithm, each item is taken as a candidate 1-itemset. The algorithm counts the occurrences of each item.

#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose occurrence satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup are taken ahead to the next iteration; the others are pruned.

#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join step, the 2-itemsets are generated by forming groups of two from the frequent 1-itemsets.

#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table will contain only the 2-itemsets that meet min_sup.

#5) The next iteration forms 3-itemsets using the join and prune steps. This iteration follows the antimonotone property: the 2-itemset subsets of each candidate 3-itemset must meet min_sup. If all 2-itemset subsets are frequent, the candidate may be frequent; otherwise it is pruned.

#6) The next step makes 4-itemsets by joining 3-itemsets and pruning any candidate whose subsets do not meet the min_sup criteria. The algorithm stops when no more frequent itemsets can be found.

MINIMUM SUPPORT THRESHOLD

The minimum support threshold represents the minimum frequency or support count that an itemset must
meet to be considered "frequent" in the dataset.

Minimum Support Count = (Minimum Support Percentage / 100) * Total Number of Transactions
The Apriori algorithm helps in mining the frequent itemsets.
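For instance, with a minimum support of 30% over 8 transactions, the count works out as below (a trivial sketch; rounding the fractional count up to the next whole transaction is a common convention):

import math

min_support_percentage = 30
total_transactions = 8
# (Minimum Support Percentage / 100) * Total Number of Transactions, rounded up
min_support_count = math.ceil((min_support_percentage / 100) * total_transactions)
print(min_support_count)  # ceil(2.4) = 3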

Example of Apriori Algorithm


Step 1: Start with the transaction data in the database.
Step 2: Calculate the support/frequency of all items.
Step 3: Discard the items with support less than the minimum support of 2.
Step 4: Combine the remaining items into 2-itemsets.
Step 5: Calculate the support/frequency of these 2-itemsets.
Step 6: Discard the 2-itemsets with support less than 2.
Step 7: Combine the remaining items into 3-itemsets and calculate their support.
Step 8: Discard the 3-itemsets with support less than 2.

Result:
Only one 3-itemset is frequent (Eggs, Tea, Cold Drink), because this itemset meets the minimum support of 2.
The Apriori Algorithm: Pseudo Code

# A runnable Python version of the Apriori pseudo code.
# Each transaction in `dataset` is assumed to be a set of items.

def generate_candidate_1_itemsets(dataset):
    # Every distinct item becomes a candidate 1-itemset
    items = {item for transaction in dataset for item in transaction}
    return [frozenset([item]) for item in items]

def is_subset_of(candidate, transaction):
    # Checks whether the candidate itemset is contained in the transaction
    return candidate.issubset(transaction)

def count_support(Ck, dataset):
    # Count the support of each candidate itemset over the dataset
    support = {candidate: 0 for candidate in Ck}
    for transaction in dataset:
        for candidate in Ck:
            if is_subset_of(candidate, transaction):
                support[candidate] += 1
    return support

def prune_infrequent_itemsets(support, min_sup):
    # Keep only the itemsets whose support count meets the minimum support
    return [itemset for itemset, count in support.items() if count >= min_sup]

def generate_candidate_k_itemsets(Lk, k):
    # Join step: union pairs of frequent (k-1)-itemsets that produce k-itemsets
    candidates = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    return list(candidates)

def apriori(dataset, min_sup):
    # Initialize frequent itemsets with the frequent 1-itemsets
    Ck = generate_candidate_1_itemsets(dataset)
    Lk = prune_infrequent_itemsets(count_support(Ck, dataset), min_sup)
    frequent_itemsets = list(Lk)

    k = 2  # candidate itemset size
    while Lk:
        # Generate candidate itemsets of size k from the frequent (k-1)-itemsets
        Ck = generate_candidate_k_itemsets(Lk, k)
        # Count the support of each candidate itemset
        support = count_support(Ck, dataset)
        # Prune infrequent itemsets based on the minimum support
        Lk = prune_infrequent_itemsets(support, min_sup)
        # Add the frequent itemsets of size k to the result
        frequent_itemsets += Lk
        k += 1  # increment itemset size
    return frequent_itemsets

In this pseudo-code:

1. dataset represents the transaction database.


2. min_sup is the user-defined minimum support threshold.
3. generate_candidate_1_itemsets(dataset) generates the initial set of candidate 1-itemsets from the dataset.
4. prune_infrequent_itemsets(support, min_sup) filters out infrequent itemsets whose support count is below the minimum support.
5. generate_candidate_k_itemsets(Lk, k) generates candidate itemsets of size k from the frequent itemsets of size
k-1.
6. is_subset_of(candidate, transaction) checks if the candidate itemset is a subset of the transaction.
7. The algorithm iteratively increases the itemset size (k) until no more frequent itemsets can be found.
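A quick usage example for the apriori() function above, using the five-transaction Milk/Bread/Eggs dataset from the earlier example and min_sup = 3:

# Usage example for the apriori() function above
dataset = [
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread"},
    {"Milk", "Eggs"},
    {"Bread", "Eggs"},
    {"Milk", "Bread", "Eggs"},
]
for itemset in apriori(dataset, min_sup=3):
    print(set(itemset))
# Expected frequent itemsets: {Milk}, {Bread}, {Eggs},
# {Milk, Bread}, {Milk, Eggs}, {Bread, Eggs}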
Advantages

Easy to understand algorithm.

The Join and Prune steps are easy to implement on large itemsets in large databases.
The Apriori algorithm is the simplest and easiest-to-understand algorithm for mining frequent itemsets.
The Apriori algorithm is unsupervised, so it does not require labeled data.
The Apriori algorithm is an exhaustive algorithm, so it gives satisfactory results and mines all the rules within the specified confidence and support.

Disadvantages
It requires high computation if the itemsets are very large and the minimum support is kept
very low.
The entire database needs to be scanned.

Methods To Improve Apriori Efficiency

Many methods are available for improving the efficiency of the algorithm.

Hash-Based Technique: This method uses a hash-based structure called a hash table for generating the k-itemsets and their corresponding counts. It uses a hash function for generating the table (a sketch of this idea is given after this list of methods).

Transaction Reduction: This method reduces the number of transactions scanned in later iterations. Transactions that do not contain any frequent items are marked or removed.

Partitioning: This method requires only two database scans to mine the frequent itemsets. It says that for
any itemset to be potentially frequent in the database, it should be frequent in at least one of the partitions of
the database.

Sampling: This method picks a random sample S from Database D and then searches for frequent itemset in
S. It may be possible to lose a global frequent itemset. This can be reduced by lowering the min_sup.

Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the
database during the scanning of the database.
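Of these methods, the hash-based idea for candidate 2-itemsets can be sketched roughly as follows. While scanning, every pair in each transaction is hashed into a bucket; since a bucket's count is an upper bound on the support of every pair that hashes into it, any pair whose bucket count is below the minimum support can be discarded before candidate generation. The number of buckets and the hash function below are illustrative assumptions:

from itertools import combinations

def hash_filtered_pairs(transactions, min_sup_count, num_buckets=11):
    # Hash every 2-item combination of every transaction into a bucket
    bucket_counts = [0] * num_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    def may_be_frequent(pair):
        # A pair can only be frequent if its bucket count reaches min_sup_count
        return bucket_counts[hash(tuple(sorted(pair))) % num_buckets] >= min_sup_count

    return may_be_frequent

# Pairs for which may_be_frequent(...) is False can be skipped as candidates.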
PARTITIONING ALGORITHM
The partitioning algorithm finds the frequent itemsets on the basis of a partitioning of the database into n parts. It overcomes the memory problem for a large database that does not fit into main memory, because the small parts of the database easily fit into main memory. The algorithm works in two passes.

A partition concept has been proposed to increase the execution speed with minimum cost. For each itemset size, that is 1-itemsets, 2-itemsets, 3-itemsets, etc., a separate partition is created during data insertion into the table. Initially, a set of frequent 1-itemsets is found by scanning the database and getting the number of occurrences of each item from the partition of that particular item using a pointer; the items satisfying the minimum support count are included in the frequent 1-itemsets.

It should be noted that if the minimum support (as a fraction) for transactions in the whole database is min_sup, then the minimum support count for partitioned transactions is min_sup multiplied by the number of transactions in that partition. Consider a transaction dataset with the following transactions:

Now, let's say we want to mine association rules and set a minimum support threshold (min_sup) of 30% for the entire dataset of 8 transactions. This means that an itemset must appear in at least 30% of the 8 transactions to be considered frequent.

Calculate the minimum support count for the entire dataset:

Minimum Support Count = min_sup * Total number of transactions

Minimum Support Count = 0.30 * 8 transactions = 2.4 (rounded up to 3 for practical purposes)
So, for the entire dataset, an itemset must appear in at least 3 transactions to be considered frequent.

Now, let's consider that we want to partition this dataset into two subsets based on some criteria (e.g., geographical
location or time period). We have two partitions, Partition A and Partition B, and each has its own set of transactions:

Partition A:

As stated above, the minimum support count for each partition is determined from the number of transactions in that partition. Let's calculate the minimum support for each partition:

Minimum Support Count for Partition A:

Number of transactions in Partition A = 4 transactions


Minimum Support Count = min_sup * Number of transactions
Minimum Support Count = 0.30 * 4 transactions = 1.2 (rounded up to 2 for practical purposes)
So, for Partition A, an itemset must appear in at least 2 transactions to be considered frequent within that partition.

Identify the frequent items in Partition A:

Frequent Itemsets in Partition A:


{A, B}: Appears in transactions 1, 2, and 3 (support count = 3) [Frequent]
{A, C}: Appears in transactions 1 and 3 (support count = 2) [Frequent]
{B, C}: Appears in transactions 1 and 4 (support count = 2) [Frequent]
So, in Partition A, the frequent itemsets are {A, B}, {A, C}, and {B, C}.

Minimum Support Count for Partition B:

Number of transactions in Partition B = 4 transactions


Minimum Support Count = min_sup * Number of transactions
Minimum Support Count = 0.30 * 4 transactions = 1.2 (rounded up to 2 for practical purposes)

Minimum Support Count for Partition B: 2

Now, let's identify the frequent items in Partition B:

Frequent Itemsets in Partition B:

{A, B}: Appears in transactions 5 and 7 (support count = 2) [Frequent]


{C, D}: Appears in transactions 6, 7, and 8 (support count = 3) [Frequent]
{B, C, D}: Appears in transactions 7 and 8 (support count = 2) [Frequent]
So, in Partition B, the frequent itemsets are {A, B}, {C, D}, and {B, C, D}.

To obtain the final set of frequent itemsets for the entire dataset, we combine the frequent itemsets from both Partition A and Partition B. Since each partition has its own frequent itemsets based on its local minimum support count, the union of these locally frequent itemsets forms the set of global candidates; their supports are then counted once more over the whole dataset to keep only the itemsets that meet the overall threshold. Here are the frequent itemsets for the entire dataset:

Frequent Itemsets for the Entire Dataset:

{A, B}: Appears in transactions 1, 2, 3, 5, and 7 (support count = 5) [Frequent]


{A, C}: Appears in transactions 1, 3, and 7 (support count = 3) [Frequent]
{B, C}: Appears in transactions 1, 4, 7, and 8 (support count = 4) [Frequent]
{C, D}: Appears in transactions 6, 7, and 8 (support count = 3) [Frequent]
{B, C, D}: Appears in transactions 7 and 8 (support count = 2) [Not frequent for the entire dataset, since 2 < 3]
These are the final frequent itemsets for the entire dataset: {A, B}, {A, C}, {B, C}, and {C, D} meet the overall minimum support count of 3 (30% of 8 transactions), while {B, C, D} is only locally frequent in Partition B and is dropped after the global count.
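A hedged sketch of the two-phase partition idea follows: phase one mines each partition with a local minimum support count proportional to the partition's size, and phase two counts the union of all locally frequent itemsets once more over the whole database to keep only the globally frequent ones. The mine_local parameter stands for any frequent-itemset miner (for example, the apriori() function defined earlier):

import math

def partition_mining(partitions, min_sup_fraction, mine_local):
    # Phase 1: locally frequent itemsets of each partition become global candidates
    candidates = set()
    for part in partitions:
        local_min_count = math.ceil(min_sup_fraction * len(part))
        candidates |= set(mine_local(part, local_min_count))

    # Phase 2: one more scan of the whole database to count the candidates globally
    all_transactions = [t for part in partitions for t in part]
    global_min_count = math.ceil(min_sup_fraction * len(all_transactions))
    frequent = []
    for c in candidates:
        count = sum(set(c) <= t for t in all_transactions)
        if count >= global_min_count:
            frequent.append((set(c), count))
    return frequent

# Usage idea: partition_mining([partition_a, partition_b], 0.30, apriori)
# where partition_a and partition_b are lists of transaction sets.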

Dynamic Itemset Counting (DIC) is also used to reduce the number of database scans. It is based upon the downward closure property and adds candidate itemsets at different points of time during the scan. Dynamic blocks are formed from the database, marked by start points, and unlike the earlier Apriori technique, the set of candidates changes dynamically during the database scan. Unlike Apriori, it does not wait until the end of one level's scan to start the next level; it starts counting new candidates at the start point attached to each dynamic partition of the candidate sets.

In this way it reduces the number of database scans needed to find the frequent itemsets, by adding new candidates at any point of time during the run. However, it generates a large number of candidates, and computing their frequencies is the performance bottleneck, while the database scans themselves take only a small part of the runtime.

Example for the Dynamic Itemset Counting (DIC) algorithm:

Itemsets are marked in four different ways as they are counted:

 Solid box: confirmed frequent itemset - an itemset we have finished counting and
exceeds the support threshold minsupp
 Solid circle: confirmed infrequent itemset - we have finished counting and it is
below minsupp
 Dashed box: suspected frequent itemset - an itemset we are still counting that
exceeds minsupp
 Dashed circle: suspected infrequent itemset - an itemset we are still counting that is
below minsupp

DIC Algorithm

Itemset counting typically refers to the process of counting the occurrences of specific combinations of items or attributes in a dataset.

Algorithm:

1. Mark the empty itemset with a solid square. Mark all the 1-itemsets with dashed circles.
Leave all other itemsets unmarked.
2. While any dashed itemsets remain:
1. Read M transactions (if we reach the end of the transaction file, continue from the
beginning). For each transaction, increment the respective counters for the itemsets
that appear in the transaction and are marked with dashes.
2. If a dashed circle's count exceeds minsupp, turn it into a dashed square. If any
immediate superset of it has all of its subsets as solid or dashed squares, add a new
counter for it and make it a dashed circle.
3. Once a dashed itemset has been counted through all the transactions, make it solid
and stop counting it.

Itemset lattices: An itemset lattice contains all of the possible itemsets for a transaction database. Each itemset in the lattice points to all of its supersets. When represented graphically, an itemset lattice can help us to understand the concepts behind the DIC algorithm.
 Example: minsupp = 25% and M = 2.

TID A B C
T1 1 1 0
T2 1 0 0
T3 0 1 1
T4 0 0 0
Transaction Database
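DIC itself interleaves counting with the scan, so the sketch below is not the DIC procedure; it simply counts every itemset of the lattice for this tiny database so that the final solid-box (frequent) and solid-circle (infrequent) outcomes can be checked. With minsupp = 25% of 4 transactions, the minimum count is 1:

from itertools import combinations

# Transaction database from the example (items present in each transaction)
transactions = [{"A", "B"}, {"A"}, {"B", "C"}, set()]
min_count = 1  # 25% of 4 transactions

items = sorted({i for t in transactions for i in t})
for r in range(1, len(items) + 1):
    for itemset in combinations(items, r):
        count = sum(set(itemset) <= t for t in transactions)
        status = "frequent (solid box)" if count >= min_count else "infrequent (solid circle)"
        print(set(itemset), count, status)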
Applications Of Apriori Algorithm

Some fields where Apriori is used:

In Education Field: Extracting association rules in data mining of admitted students through characteristics and
specialties.
In the Medical field: For example, analysis of the patients’ database.
In Forestry: Analysis of probability and intensity of forest fire with the forest fire data.
Apriori is used by many companies like Amazon in the Recommender System and by Google for the auto-complete
feature.

Conclusion

The Apriori algorithm is a classic and widely used algorithm for mining frequent itemsets.
It reduces the size of the candidate itemsets considerably by pruning, providing good performance.
Thus, data mining helps consumers and industries in the decision-making process.

Shortcomings Of Apriori Algorithm

Using Apriori requires the generation of candidate itemsets. These candidates may be very numerous if the number of items in the database is huge.
Apriori needs multiple scans of the database to check the support of each itemset generated, and this
leads to high costs.
These shortcomings can be overcome using the FP growth algorithm.

Frequent Pattern Growth Algorithm

This algorithm is an improvement to the Apriori method. A frequent pattern is generated without
the need for candidate generation. FP growth algorithm represents the database in the form of a
tree called a frequent pattern tree or FP tree.

This tree structure will maintain the association between the itemsets. The database is
fragmented using one frequent item. This fragmented part is called “pattern fragment”. The
itemsets of these fragmented patterns are analyzed. Thus with this method, the search for
frequent itemsets is reduced comparatively.
Frequent pattern growth (FP-Growth) is the mining of itemsets, subsequences, and
substructures that appear frequently in a dataset.

A frequent itemset refers to the most common items bought together. A subsequence in which
items are bought by a customer in a particular order is called a frequent sequential pattern.

A Substructure refers to different structural forms, such as subgraphs and subtrees, that are
combined with frequent structural patterns to analyze and find relations with different items of
frequent itemsets.

FP Tree

Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the database.
The purpose of the FP tree is to mine the most frequent pattern. Each node of the FP tree
represents an item of the itemset.

The root node represents null while the lower nodes represent the itemsets. The association of
the nodes with the lower nodes that is the itemsets with the other itemsets are maintained while
forming the tree.

Frequent Pattern Algorithm Steps


The frequent pattern growth method lets us find the frequent pattern without candidate
generation.

Let us see the steps followed to mine the frequent pattern using frequent pattern growth
algorithm:

#1) The first step is to scan the database to find the occurrences of the itemsets in the database.
This step is the same as the first step of Apriori. The count of 1-itemsets in the database is called
support count or frequency of 1-itemset.

#2) The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.

#3) The next step is to scan the database again and examine the transactions. Examine the first
transaction and find out the itemset in it. The itemset with the max count is taken at the top, the
next itemset with lower count and so on. It means that the branch of the tree is constructed with
transaction itemsets in descending order of count.

#4) The next transaction in the database is examined. Its itemsets are ordered in descending
order of count. If any itemset of this transaction is already present in another branch (for example
from the 1st transaction), then this transaction's branch shares a common prefix starting from the root.

This means that the common itemset is linked to the new node of the remaining itemset in this
transaction.

#5) Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked according to
transactions.

#6) The next step is to mine the created FP tree. For this, the lowest node is examined first,
along with the links of the lowest nodes. The lowest node represents the frequency pattern of
length 1. From this, traverse the path in the FP tree. This path or paths are called a conditional
pattern base.

Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring
with the lowest node (suffix).

#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.

#8) Frequent Patterns are generated from the Conditional FP Tree.

Summary:

Generating Frequent Pattern-Growth


To generate frequent pattern-growth, we do the following:

1. In the first scan of the database, we find the support of each item.
2. If the frequency of an item is less than the specified minimum support, we simply delete it from the
database.
3. The set of frequent items is sorted in descending order of support.
4. Then, an FP-tree is constructed.
5. The initial node of the FP-tree is called the root node. It is labeled null.
6. The items in the transactions are processed as the nodes are created.
7. For each transaction, we walk the tree from the root in item order, creating and expanding child
nodes as needed.
8. If a branch shares a common prefix with an existing path, we increment the counts of the common prefix nodes.
9. We repeat the process for each transaction until we reach the end of the transactions.
10. To facilitate tree traversal, we build an item header table so that each item points to its occurrences in
the tree via a chain of node-links (a construction sketch in Python follows this list).
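The construction described above can be sketched in Python as follows. This is a minimal FP-tree build only (no mining of conditional pattern bases), and the class and variable names are illustrative:

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item stored at this node (None for the root)
        self.count = 1            # how many transactions share this prefix
        self.parent = parent
        self.children = {}        # item -> child FPNode

def build_fp_tree(transactions, min_sup_count):
    # Pass 1: count item frequencies and drop infrequent items
    freq = Counter(item for t in transactions for item in t)
    freq = {item: c for item, c in freq.items() if c >= min_sup_count}

    root = FPNode(None, None)             # the root is labeled null
    header_table = defaultdict(list)      # item -> list of nodes (node-links)

    # Pass 2: insert each transaction, items ordered by descending frequency
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: increment count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header_table[item].append(child) # node-link for traversal
            node = node.children[item]
    return root, header_table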

Example Of FP-Growth Algorithm

Frequent itemset: {k, e, o}

Advantages Of FP Growth Algorithm

This algorithm needs to scan the database only twice when compared to Apriori which scans the
transactions for each iteration.
The pairing of items is not done in this algorithm and this makes it faster.
The database is stored in a compact version in memory.
It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages Of FP-Growth Algorithm

FP Tree is more cumbersome and difficult to build than Apriori.


It may be expensive.
When the database is large, the FP tree may not fit into main memory.

FP Growth vs Apriori

Pattern Generation
FP Growth generates patterns by constructing an FP tree.
Apriori generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation
FP Growth involves no candidate generation.
Apriori uses candidate generation.

Process
The FP Growth process is faster as compared to Apriori; its runtime increases linearly with the number of itemsets.
The Apriori process is comparatively slower; its runtime increases exponentially with the number of itemsets.

Memory Usage
FP Growth saves a compact version of the database (the FP tree) in memory.
Apriori saves the candidate combinations in memory.

Conclusion

The Apriori algorithm is used for mining association rules. It works on the principle, “the non-empty subsets
of frequent itemsets must also be frequent”. It forms k-itemset candidates from (k-1) itemsets and scans the
database to find the frequent itemsets.

Frequent Pattern Growth Algorithm is the method of finding frequent patterns without candidate generation.
It constructs an FP Tree rather than using the generate and test strategy of Apriori. The focus of the FP Growth
algorithm is on fragmenting the paths of the items and mining frequent patterns.

Classification and Prediction

There are two forms of data analysis that can be used for extracting models describing important classes or to
predict future data trends. These two forms are as follows −

Classification
Prediction

Classification models predict categorical class labels; and prediction models predict continuous valued
functions. For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer
equipment given their income and occupation.

What is classification?

Following are examples of cases where the data analysis task is classification −
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and
which are safe.

A marketing manager at a company needs to analyze a customer profile to predict whether the customer will buy a new
computer.
In both of the above examples, a model or classifier is constructed to predict the categorical labels. These
labels are risky or safe for the loan application data and yes or no for the marketing data.

What is prediction?

Following are examples of cases where the data analysis task is prediction −
Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his
company. In this example we are interested in predicting a numeric value. Therefore the data analysis task is an
example of numeric prediction. In this case, a model or predictor will be constructed that predicts a
continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.

Classification Examples:

Email Spam Detection: Classify emails as either spam or not spam (ham). Features may include the sender,
subject line, and content of the email.

Image Classification: Classify images into categories, such as recognizing whether an image contains a cat
or a dog.

Sentiment Analysis: Determine the sentiment (positive, negative, or neutral) of a piece of text, such as a
product review or a social media post.

Medical Diagnosis: Classify medical images, like X-rays or MRIs, to diagnose diseases or conditions like
cancer.

Credit Risk Assessment: Determine whether a loan applicant is high, medium, or low risk based on their
financial history, credit score, and other relevant factors.

Species Identification: Identify the species of a plant or animal based on characteristics like size, shape, and
color.

Fraud Detection: Detect fraudulent transactions in financial data by classifying transactions as either
legitimate or fraudulent.

Face Recognition: Classify faces in images or video frames to recognize individuals.

Language Identification: Determine the language of a given text snippet or document.

Prediction Examples:

Stock Price Prediction: Predict the future price of a stock based on historical price data, trading volume, and
other financial indicators.

Weather Forecasting: Predict future weather conditions like temperature, precipitation, and wind speed
based on historical weather data and atmospheric variables.

Sales Forecasting: Predict future sales for a product or service based on historical sales data, seasonality, and
marketing efforts.

Demand Forecasting: Predict the demand for a product in a supply chain to optimize inventory and
production planning.

Energy Consumption Prediction: Predict future energy consumption in a building or a city based on
historical data and weather conditions.

Traffic Flow Prediction: Predict traffic congestion and flow patterns on roads to optimize transportation
systems.

Healthcare Outcome Prediction: Predict patient outcomes or disease progression based on medical data,
such as patient records and test results.

Crop Yield Prediction: Predict agricultural crop yields based on factors like soil quality, weather conditions,
and farming practices.

Customer Lifetime Value Prediction: Predict the future value of a customer to a business based on their
past behavior and purchase history.

How Does Classification Works?

With the help of the bank loan application that we have discussed above, let us understand the working of
classification. The Data Classification process includes two steps −
Building the Classifier or Model
Using Classifier for Classification

Building the Classifier or Model

This step is the learning step or the learning phase.


In this step the classification algorithms build the classifier.
The classifier is built from the training set made up of database tuples and their associated class labels.
Each tuple that constitutes the training set belongs to a predefined class, determined by its class label.
These tuples can also be referred to as samples, objects or data points.

Using Classifier for Classification


In this step, the classifier is used for classification. Here the test data is used to estimate the
accuracy of classification rules. The classification rules can be applied to the new data tuples if
the accuracy is considered acceptable.
Example of building a classification model in data mining. In this example, Python and the popular scikit-
learn library are used to create a model that predicts whether an email is spam or not based on certain features.

Problem: Email Spam Classification

Dataset: You have a dataset of emails, each labeled as "spam" or "not spam," and you want to build a model
that can classify new, unlabeled emails as spam or not spam based on the email's content and other features.

Here are the steps involved:

Data Preparation:

Load your email dataset into a pandas DataFrame.


Preprocess the text data, which may include tasks like tokenization, removing stopwords, and
stemming/lemmatization.
Convert the target labels (spam or not spam) into numerical values, e.g., 1 for spam and 0 for not spam.

Feature Engineering:

Extract relevant features from the text data. Common text features include word frequency counts (using
techniques like TF-IDF), email length, number of links, etc.
Select the most important features if you have many, using techniques like feature selection.

Data Splitting:

Split your dataset into a training set and a testing set. This allows you to train the model on one portion of the
data and evaluate its performance on another.

Choose a Classification Model:

Select a classification algorithm. In this example, we'll use a simple one: Logistic Regression.

Model Training:

Train the chosen model on the training data. In Python with scikit-learn, it can be as simple as:

from sklearn.linear_model import LogisticRegression

# Create a Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)
Model Evaluation:

Use the testing data to evaluate the model's performance. Common metrics for classification include accuracy,
precision, recall, F1-score, and confusion matrix.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print a classification report and confusion matrix
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
Model Tuning and Optimization:

Depending on the model's performance, you may need to fine-tune hyperparameters or consider more
advanced models.

Deployment:

If the model meets your requirements and performs well, you can deploy it in a production environment to
classify incoming emails.

This is a simplified example, but it illustrates the general steps involved in building a classification model in
data mining. Real-world problems may involve more complex data preprocessing, feature engineering, and
model selection. Additionally, handling imbalanced datasets and conducting cross-validation are important
considerations in classification tasks.
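Putting the steps together, a minimal end-to-end sketch might look like the following. The file name spam_dataset.csv and its columns text and label are assumptions made for illustration:

# Minimal end-to-end sketch (assumed file/column names: spam_dataset.csv, text, label)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Data preparation: load emails and map labels to 1 (spam) / 0 (not spam)
df = pd.read_csv("spam_dataset.csv")
y = (df["label"] == "spam").astype(int)

# Feature engineering: TF-IDF word frequencies from the email text
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["text"])

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))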

Classification rules are the decision criteria or patterns that a machine learning classifier, such as Logistic
Regression, Decision Trees, or Random Forests, learns from the training data to assign class labels to new or
unseen data points. These rules are essentially the "if-then" conditions that help the classifier make predictions.

Let's take an example with a simplified rule learned from a Logistic Regression classifier for email spam
classification:

Example Classification Rule (Logistic Regression):

If the sum of the frequencies of the words "buy," "now," and "discount" in an email is greater than 0.5, then
classify the email as "spam."
In this example, the classifier has learned a rule based on the presence and frequency of specific words in an
email. If the condition is met (the sum of the word frequencies exceeds 0.5), the email is classified as "spam."
Otherwise, it's classified as "not spam."

The specific rules learned by a classifier can vary significantly depending on the algorithm, the features used,
and the complexity of the problem. These rules are determined during the training phase when the classifier
adjusts its parameters to best fit the training data and minimize prediction errors.
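With a Logistic Regression model trained on word-frequency features, the learned coefficients can be inspected to see which words push a prediction towards "spam"; this is roughly where such if-then style rules come from. The sketch below assumes a fitted model and vectorizer like those in the earlier example:

import numpy as np

def top_spam_words(model, vectorizer, n=5):
    # Words with the largest positive coefficients push the prediction toward "spam"
    words = np.array(vectorizer.get_feature_names_out())
    top = np.argsort(model.coef_[0])[-n:][::-1]
    return list(zip(words[top], model.coef_[0][top]))

# e.g. top_spam_words(model, vectorizer) might surface words like "buy" or "discount"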
Data preparation for Classification and Prediction / Issues

The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities −

Data Cleaning − Data cleaning involves removing the noise and treating missing values.
The noise is removed by applying smoothing techniques, and the problem of missing values is
solved by replacing a missing value with the most commonly occurring value for that attribute.

Relevance Analysis − The database may also have irrelevant attributes. Correlation analysis is
used to know whether any two given attributes are related.

Data Transformation and Reduction − The data can be transformed by any of the following
methods.

Normalization − The data is transformed using normalization. Normalization involves scaling
all values for a given attribute in order to make them fall within a small specified range.
Normalization is used when, in the learning step, neural networks or methods involving
measurements are used (a small sketch of min-max normalization is given at the end of this section).
Generalization − The data can also be transformed by generalizing it to the higher concept. For
this purpose we can use the concept hierarchies.

Note − Data can also be reduced by some other methods such as wavelet transformation,
binning, histogram analysis, and clustering.
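As a small illustration of the normalization step mentioned above, min-max normalization rescales an attribute into a chosen range such as [0, 1]; a minimal sketch:

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Rescale each value of an attribute into [new_min, new_max]
    old_min, old_max = min(values), max(values)
    return [
        new_min + (v - old_min) * (new_max - new_min) / (old_max - old_min)
        for v in values
    ]

print(min_max_normalize([20, 35, 50, 80]))  # [0.0, 0.25, 0.5, 1.0]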

Comparison of Classification and Prediction Methods

Here are the criteria for comparing the methods of classification and prediction −

Accuracy − The accuracy of a classifier refers to its ability to predict the class label
correctly, and the accuracy of a predictor refers to how well a given predictor can guess the
value of the predicted attribute for new data.

Speed − This refers to the computational cost in generating and using the classifier or predictor.

Robustness − It refers to the ability of the classifier or predictor to make correct predictions from
given noisy data.

Scalability − Scalability refers to the ability to construct the classifier or predictor efficiently
given a large amount of data.

Interpretability − It refers to the extent to which the classifier or predictor can be understood.


For additional reference:
https://vikramuniv.ac.in/files/wp-content/uploads/MCA-IV_DataMining17_PartitionAlgo_Keerti_Dixit.pdf
