Let's understand this with an example. Suppose we are a marketing manager, and we have a tempting new product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?
Clustering splits the data into several subsets; each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Clustering falls under the category of unsupervised machine learning and is one of the problems that machine learning algorithms solve.
A good clustering has the following properties:
o The intra-cluster similarity is high, which means that the data present inside a cluster is similar to one another.
o The inter-cluster similarity is low, which means that the data in one cluster is not similar to the data in other clusters.
What is a Cluster?
o A cluster is a subset of similar objects.
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high
density of objects.
Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out the important characteristics that differentiate distinct groups.
A good clustering algorithm should satisfy the following requirements:
1. Scalability:
Scalability in clustering implies that as we boost the amount of data objects, the
time to perform clustering should approximately scale to the complexity order of
the algorithm. For example, if we perform K-means clustering, we know it is O(n), where n is the number of objects in the data. If we raise the number of data objects 10-fold, then the time taken to cluster them should also increase approximately 10 times, i.e., there should be a linear relationship. If that is not the case, then
there is some error with our implementation process.
The clustering algorithm should be scalable; if it is not, we cannot get appropriate results on large datasets.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only spherical clusters of small size.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or erroneous. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional data spaces.
Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
TID Items
1 Bread, Milk
Agglomerative Hierarchical Clustering
1. Determine the similarity between individual data points and all other clusters (i.e., compute the proximity matrix).
2. Consider each data point as an individual cluster.
3. Combine similar clusters.
4. Recalculate the proximity matrix for each cluster.
5. Repeat step 3 and step 4 until you get a single cluster.
Let's understand this concept with the help of a graphical representation using a dendrogram.
The demonstration below shows how the actual algorithm works. No calculations have been done here; all the proximities among the clusters are assumed.
Step 1:
Consider each of the data points P, Q, R, S, T, and V as an individual cluster and calculate the distances between them. The initial clusters are [(P), (Q), (R), (S), (T), (V)].
Step 2:
Now, merge the comparable clusters into a single cluster. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step; likewise, clusters S and T are merged. Finally, we get the clusters [(P), (QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two
closest clusters [(ST), (V)] together to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters (QR) and (STV) are comparable and are combined to form a new cluster. We are now left with [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster
[(PQRSTV)]
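A dendrogram like the one described can be produced with SciPy's hierarchical clustering utilities. The sketch below is only illustrative: the 2-D coordinates chosen for P, Q, R, S, T, and V are invented here, since the text above assumes the proximities rather than computing them.

# A minimal sketch of agglomerative clustering with a dendrogram, using SciPy.
# The coordinates below are assumptions made for illustration only.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

labels = ["P", "Q", "R", "S", "T", "V"]
points = np.array([
    [1.0, 1.0],    # P
    [4.0, 4.2],    # Q
    [4.3, 4.0],    # R
    [8.0, 8.1],    # S
    [8.2, 7.9],    # T
    [12.0, 3.0],   # V
])

# linkage() computes the proximity matrix and performs the repeated merging;
# "single" linkage always merges the two closest clusters.
Z = linkage(points, method="single")

dendrogram(Z, labels=labels)
plt.title("Dendrogram for the example points")
plt.show()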
K-Means Clustering
Input:
K: the number of clusters into which the dataset has to be divided
D: a dataset containing n objects
Output:
A set of K clusters
Method:
1. Randomly select K objects from the dataset as the initial cluster centers (means).
2. (Re)assign each object to the cluster whose mean it is most similar to.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat steps 2 and 3 until the cluster assignments no longer change.
Figure: K-means clustering flowchart
Example: Suppose we want to group the visitors to a website using just their age, as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change occurs between Iteration-3 and Iteration-4, so the algorithm has converged and stops here; the final clusters have centroids 20.50 and 48.89.
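The walkthrough above can be reproduced with scikit-learn as a rough sketch. The initial centroids 16 and 22 are fixed to match the note above, so KMeans performs the same reassign-and-update iterations.

# A minimal sketch of the age example with scikit-learn's KMeans.
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66], dtype=float).reshape(-1, 1)

init_centroids = np.array([[16.0], [22.0]])               # C1 = 16, C2 = 22
km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(ages)

print("Final centroids:", km.cluster_centers_.ravel())    # approx. 20.50 and 48.89
print("Cluster labels:", km.labels_)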
BIRCH Clustering Algorithm
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is claimed by its inventors to be the "first clustering algorithm proposed in the database area to handle 'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10-year Test of Time Award in 2006.
Basic clustering algorithms like K means and agglomerative clustering are the most
commonly used clustering algorithms. But when performing clustering on very large
datasets, BIRCH and DBSCAN are the advanced clustering algorithms useful for
performing precise clustering on large datasets. Moreover, BIRCH is very useful
because of its easy implementation. BIRCH first clusters the dataset into small summaries and then clusters those summaries; it does not cluster the dataset directly. That is why BIRCH is often used with other clustering algorithms: after making the summary, the summary can be clustered further by another clustering algorithm.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a
summary of the dataset that the other clustering algorithm can now use. However,
BIRCH has one major drawback: it can only process metric attributes. A metric
attribute is an attribute whose values can be represented in Euclidean space, i.e.,
no categorical attributes should be present. The BIRCH clustering algorithm consists
of two stages:
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense
regions called Clustering Feature (CF) entries. Formally, a Clustering Feature
entry is defined as an ordered triple (N, LS, SS) where 'N' is the number of
data points in the cluster, 'LS' is the linear sum of the data points, and 'SS' is
the squared sum of the data points in the cluster. A CF entry can be
composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of
the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster.
Every entry in a CF tree contains a pointer to a child node, and a CF entry
made up of the sum of CF entries in the child nodes. Optionally, we can refine
these clusters.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on this CF (clustering feature) tree and uses the tree-structured summary to create clusters.
In the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that have several sub-clusters are called CF subclusters, and these CF subclusters are situated in non-terminal (non-leaf) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds the necessary information about the given data for further hierarchical clustering. This avoids the need to work with the whole input data. Each cluster of data points in the tree is represented as a CF given by three numbers (N, LS, SS).
The BIRCH algorithm mainly follows four phases: loading the data into memory by building the CF tree, condensing, global clustering, and refining.
In condensing, it resets and resizes the data for a better fit into the CF tree. In global clustering, it sends the CF tree for clustering using an existing clustering algorithm. Finally, refining fixes the problem of CF trees where points with the same value are assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of
summary statistics to represent a larger set of data points. These summary
statistics constitute a CF and represent a sufficient substitute for the actual data for
clustering purposes.
NOTE: The count N, the linear sum LS, and the squared sum SS are sufficient to recover the mean and variance of the cluster: the mean is LS/N, and the variance is SS/N - (LS/N)^2.
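As a small illustration of that note, the sketch below recovers the mean and variance of a one-dimensional cluster from nothing but its (N, LS, SS) triple; the three sample points are made up for the example.

# Mean and variance recovered purely from the Clustering Feature (N, LS, SS).
points = [2.0, 4.0, 6.0]           # assumed sample cluster

N  = len(points)                   # number of data points
LS = sum(points)                   # linear sum
SS = sum(x * x for x in points)    # squared sum

mean     = LS / N                  # 4.0
variance = SS / N - (LS / N) ** 2  # 2.67 (population variance)

print(N, LS, SS, mean, variance)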
CF Tree
The building process of the CF Tree can be summarized in the following steps:
Step 1: For each given record, BIRCH compares the location of that record with the
location of each CF in the root node, using either the linear sum or the mean of the
CF. BIRCH passes the incoming record to the root node CF closest to the incoming
record.
Step 2: The record then descends down to the non-leaf child nodes of the root
node CF selected in step 1. BIRCH compares the location of the record with the
location of each non-leaf CF. BIRCH passes the incoming record to the non-leaf node
CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node
CF selected in step 2. BIRCH compares the location of the record with the location of
each leaf. BIRCH tentatively passes the incoming record to the leaf closest to the
incoming record.
Step 4: Perform one of the following two actions, (i) or (ii):
(i) If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account for the new data point.
(ii) If the radius of the chosen leaf, including the new record, exceeds the threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.
If step 4(ii) is executed and the leaf node already contains the maximum number of leaf entries, L, then the leaf node is split into two leaf nodes. If the parent node is full, split the parent node,
and so on. The most distant leaf node CFs are used as leaf node seeds, with the
remaining CFs being assigned to whichever leaf node is closer. Note that the radius
of a cluster may be calculated even without knowing the data points, as long as we
have the count n, the linear sum LS, and the squared sum SS. This allows BIRCH to
evaluate whether a given data point belongs to a particular sub-cluster without
scanning the original data set.
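The following sketch illustrates that last remark: the step-4 radius test can be evaluated from the CF statistics alone, without touching the original points. The helper names and the example points are assumptions made for illustration, not part of the original BIRCH description.

# Radius test using only the CF statistics (N, LS, SS); SS is the sum of squared
# norms and LS is the vector linear sum, as defined for a CF entry above.
import numpy as np

def radius(N, LS, SS):
    # root-mean-square distance of the cluster's points from its centroid
    centroid = LS / N
    return np.sqrt(max(SS / N - centroid @ centroid, 0.0))

def absorbs(N, LS, SS, point, T):
    # True if adding `point` to this CF keeps the radius within threshold T
    N2, LS2, SS2 = N + 1, LS + point, SS + point @ point
    return radius(N2, LS2, SS2) <= T

pts = np.array([[1.0, 1.0], [1.2, 0.8]])                 # an assumed leaf with two points
N, LS, SS = len(pts), pts.sum(axis=0), (pts ** 2).sum()
print(absorbs(N, LS, SS, np.array([1.1, 1.1]), T=0.5))   # True: assign to this leaf
print(absorbs(N, LS, SS, np.array([9.0, 9.0]), T=0.5))   # False: form a new leaf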
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-means, the optimal number of clusters (k) need not be input by the user, as the algorithm can determine it.
It uses available memory to derive the finest possible sub-clusters while minimizing
I/O costs. It is also an incremental method that does not require the whole data set
in advance.
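As a usage sketch, scikit-learn's implementation exposes the three tunable parameters as threshold, branching_factor, and n_clusters (these are sklearn's parameter names, not necessarily the original paper's notation); the data below is synthetic.

# A minimal BIRCH sketch with scikit-learn on two synthetic blobs.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(100, 2))])

model = Birch(threshold=0.5,        # max radius of a leaf sub-cluster (CF entry)
              branching_factor=50,  # max CF entries per node
              n_clusters=2)         # global clustering step; None keeps the raw sub-clusters
labels = model.fit_predict(X)
print(np.bincount(labels))          # roughly 100 points in each of the two clusters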
DBSCAN Clustering
Clustering analysis, or simply clustering, is an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have, in some sense, different properties. It comprises many different methods based on different notions of distance or similarity, e.g., K-Means (distance between points), affinity propagation (graph distance), mean-shift (distance between points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers), spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach: first we calculate similarities, and then we use them to cluster the data points into groups or batches.
Here we will focus on the Density-based spatial clustering of applications
with noise (DBSCAN) clustering method.
Clusters are dense regions in the data space, separated by regions of the lower
density of points. The DBSCAN algorithm is based on this intuitive notion of
“clusters” and “noise”. The key idea is that for each point of a cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points.
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are
suitable only for compact and well-separated clusters. Moreover, they are also
severely affected by the presence of noise and outliers in the data.
Real-life data, however, may contain irregularities: clusters can be of arbitrary shape, and the data may contain noise.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two points is lower than or equal to 'eps', then they are considered neighbors. If the eps value is chosen too small, then a large part of the data will be considered outliers. If it is chosen very large, then the clusters will merge and the majority of the data points will be in the same cluster. One way to find the eps value is based on the k-distance graph (see the sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1. MinPts must be chosen to be at least 3.
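As mentioned for eps, a k-distance graph can guide the choice. The sketch below plots, for synthetic data, each point's distance to its k-th nearest neighbor in sorted order; eps is then read off near the "elbow" of the curve. The dataset and the value of k are assumptions for illustration.

# A minimal k-distance graph for choosing eps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (20, 2))])        # some scattered noise points

k = 4                                               # often set equal to MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)     # +1: each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])                  # distance to the k-th real neighbor

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()                                          # eps is roughly the elbow of this curve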
The DBSCAN algorithm works in the following steps:
1. Find all the neighbor points within eps and identify the core points, i.e., the points with more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new
cluster.
3. Find recursively all its density-connected points and assign them to the same
cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighborhood and both points a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points
that do not belong to any cluster are noise.
DBSCAN(dataset, eps, MinPts):
    C = 1                                      # cluster index
    for each unvisited point p in dataset:
        mark p as visited
        N = points within distance eps of p    # find neighbors
        if |N| < MinPts: mark p as noise; continue
        for each point p' in N:                # p is a core point: grow cluster C
            if p' is unvisited:
                mark p' as visited; N' = points within distance eps of p'
                if |N'| >= MinPts: N = N U N'  # p' is also a core point
            if p' is not yet in any cluster: add p' to cluster C
        C = C + 1
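For comparison, the same procedure is available off the shelf; the sketch below runs scikit-learn's DBSCAN on synthetic data, where points labelled -1 correspond to the noise of step 4.

# A minimal DBSCAN sketch with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(4, 0.3, (100, 2)),
               rng.uniform(-2, 6, (15, 2))])        # scattered noise

db = DBSCAN(eps=0.5, min_samples=4).fit(X)          # eps and MinPts as described above
print("cluster labels found:", set(db.labels_))     # e.g. {0, 1, -1}; -1 marks noise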
CURE Algorithm
CURE(Clustering Using Representatives)
CURE Architecture
Idea: A random sample, say of size s, is drawn from the given data. This random sample is divided into p partitions, each of size s/p. Each partition is partially clustered into s/(pq) clusters. Outliers are discarded or eliminated from these partially clustered partitions. The partially clustered partitions are then clustered again to obtain the final clusters, and finally the data on disk is labelled with the cluster it belongs to.
Apriori Algorithm
The primary objective of the Apriori algorithm is to create association rules between different objects. An association rule describes how two or more objects are related to one another. The Apriori algorithm is also used for frequent pattern mining. Generally, you operate the Apriori algorithm on a database that consists of a huge number of transactions. Let's understand the Apriori algorithm with the help of an example: suppose you go to Big Bazar and buy different products. The algorithm helps the customers buy their products with ease and increases the sales performance of Big Bazar. In this tutorial, we will discuss the Apriori algorithm with examples.
Introduction
We take an example to understand the concept better. You must have noticed that
the Pizza shop seller makes a pizza, soft drink, and breadstick combo together. He
also offers a discount to his customers who buy these combos. Have you ever wondered why he does so? He thinks that customers who buy pizza also buy soft drinks and breadsticks, so by making combos, he makes it easy for the customers. At the same time, he also increases his sales performance.
Similarly, when you go to Big Bazar, you will find biscuits, chips, and chocolates
bundled together. It shows that the shopkeeper makes it comfortable for the
customers to buy these products in the same place.
The above two examples are the best examples of Association Rules in Data Mining.
It helps us to learn the concept of apriori algorithms.
Apriori algorithm helps the customers to buy their products with ease and increases
the sales performance of the particular store.
The Apriori algorithm has the following three main components:
1. Support
2. Confidence
3. Lift
As discussed above, you need a huge database containing a large number of transactions. Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolates, because customers frequently buy these two items together.
Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolates, and 200 transactions contain both Biscuits and Chocolates. Using this data, we will find out the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support by dividing the number of transactions containing that product by the total number of transactions. Hence, we get
Support(Biscuits) = (Transactions containing Biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought Biscuits also bought Chocolates. To get the confidence, you divide the number of transactions comprising both Biscuits and Chocolates by the number of transactions containing Biscuits.
Hence,
Confidence(Biscuits -> Chocolates) = (Transactions containing both Biscuits and Chocolates) / (Transactions containing Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought Biscuits also bought Chocolates.
Lift
Considering the above example, lift refers to the increase in the ratio of the sale of Chocolates when you sell Biscuits. The lift is calculated here as
Lift(Biscuits -> Chocolates) = Confidence(Biscuits -> Chocolates) / Support(Biscuits)
= 50/10 = 5
It means that the probability of people buying both Biscuits and Chocolates together is five times higher than that of purchasing Biscuits alone. If the lift value is below one, it means that people are unlikely to buy both items together. The larger the value, the better the combination.
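The sketch below simply re-checks the Big Bazar numbers in Python, using the formulas exactly as the text computes them (in particular, lift is divided by the support of Biscuits to mirror the 50/10 = 5 calculation above).

# Support, confidence, and lift for the Biscuits/Chocolates example.
total    = 4000    # all transactions
biscuits = 400     # transactions containing Biscuits
both     = 200     # transactions containing both Biscuits and Chocolates

support_biscuits = biscuits / total                  # 0.10 -> 10 percent
confidence       = both / biscuits                   # 0.50 -> 50 percent
lift             = confidence / support_biscuits     # 5.0, as in the text

print(support_biscuits, confidence, lift)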
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk,
Apple}. The database comprises six transactions where 1 represents the presence
of the product and 0 represents the absence of the product.
Transaction Rice Pulse Oil Milk Apple
t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1
Step 1
Make a frequency table of all the products that appear in all the transactions. Now, shortlist the frequency table to include only those products with a threshold support level of over 50 percent. We get the following frequency table.
Product Frequency
Rice (R) 4
Pulse (P) 5
Oil (O) 4
Milk (M) 4
The above table indicates the products frequently bought by customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, and OM. You will get the following frequency table.
Itemset Frequency
RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3
Apply the same threshold support of 50 percent and consider only the itemsets that occur in more than 50 percent of the transactions, i.e., more than 3 times in our case. This leaves us with RP and PO.
Step 4
Now, look for sets of three products that the customers buy together. Based on the above pairs, we get the two combinations RPO and POM.
Step 5
Calculate the frequency of these two itemsets, and you will get the following frequency table.
Itemset Frequency
RPO 3
POM 2
If you apply the threshold assumption, you can figure out that the customers' set of three products is RPO.
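The counts used in steps 1 to 5 can be verified directly from the transaction table; the short sketch below does exactly that, and nothing in it goes beyond the table given above.

# Re-counting the Big Bazar itemsets from the 0/1 transaction table.
from itertools import combinations

products = ["R", "P", "O", "M", "A"]    # Rice, Pulse, Oil, Milk, Apple
rows = [
    [1, 1, 1, 0, 0],   # t1
    [0, 1, 1, 1, 0],   # t2
    [0, 0, 0, 1, 1],   # t3
    [1, 1, 0, 1, 0],   # t4
    [1, 1, 1, 0, 1],   # t5
    [1, 1, 1, 1, 1],   # t6
]
transactions = [{p for p, flag in zip(products, row) if flag} for row in rows]

def count(itemset):
    # number of transactions containing every product of the itemset
    return sum(set(itemset) <= t for t in transactions)

print({p: count(p) for p in products})                          # step 1 counts
print({"".join(c): count(c) for c in combinations("RPOM", 2)})  # step 2 pair counts
print(count("RPO"), count("POM"))                               # step 5: 3 and 2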
Transaction Reduction
The primary requirements for finding association rules in data mining are given below.
One option (the brute-force method) is to analyze all possible rules, find the support and confidence levels for each individual rule, and afterward eliminate the rules whose values are less than the threshold support and confidence levels.
The two-step approach described below is a better option for finding association rules than the brute-force method.
Step 1
In this article, we have already discussed how to create the frequency table and
calculate itemsets having a greater support value than that of the threshold
support.
Step 2
To create the association rules, you need to consider every binary partition of the frequent itemsets and choose the rules having the highest confidence levels.
In the above example, you can see that the RPO combination was the frequent itemset. Now, we find all the rules using RPO: R -> PO, P -> RO, O -> RP, RP -> O, RO -> P, and PO -> R.
You can see that there are six different combinations. In general, if a frequent itemset has n elements, there will be 2^n - 2 candidate association rules.
FP Growth Algorithm
The FP-Growth algorithm was proposed by Han et al. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns, named the frequent-pattern tree (FP-tree). In his
study, Han proved that his method outperforms other popular methods for mining
frequent patterns, e.g. the Apriori Algorithm and the TreeProjection. In some later
works, it was proved that FP-Growth performs better than other methods,
including Eclat and Relim. The popularity and efficiency of the FP-Growth Algorithm
contribute to many studies that propose variations to improve its performance.
What is FP Growth Algorithm?
The FP-Growth Algorithm is an alternative way to find frequent item sets without
using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the use of a special data
structure named frequent-pattern tree (FP-tree), which retains the item set
association information.
Using this strategy, the FP-Growth reduces the search costs by recursively looking
for short patterns and then concatenating them into the long frequent patterns.
In large databases, holding the FP tree in the main memory is impossible. A strategy
to cope with this problem is to partition the database into a set of smaller databases
(called projected databases) and then construct an FP-tree from each of these
smaller databases.
FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores
quantitative information about frequent patterns in a database. Each transaction is
read and then mapped onto a path in the FP-tree. This is done until all transactions
have been read. Different transactions with common subsets allow the tree to
remain compact because their paths overlap.
A frequent Pattern Tree is made with the initial item sets of the database. The
purpose of the FP tree is to mine the most frequent pattern. Each node of the FP
tree represents an item of the item set.
The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other
item sets, are maintained while forming the tree.
The FP-tree structure is defined as follows:
1. One root labelled as "null", with a set of item-prefix subtrees as its children, and a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the
path reaching the node;
o Node-link: links to the next node in the FP-tree carrying the same item
name or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the
item name.
Additionally, the frequent-item-header table can have the count support for an item.
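To make the structure concrete, here is a hedged Python sketch of the node and header-table layout just described. The class and field names mirror the description (item-name, count, node-link); they are illustrative choices, not any particular library's API.

# A small sketch of FP-tree nodes and the frequent-item-header table.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FPNode:
    item: Optional[str]                   # item-name; None for the "null" root
    count: int = 0                        # transactions represented by this path portion
    parent: Optional["FPNode"] = None
    children: Dict[str, "FPNode"] = field(default_factory=dict)
    node_link: Optional["FPNode"] = None  # next node carrying the same item name

root = FPNode(item=None)                  # the "null" root
header_table: Dict[str, FPNode] = {}      # item-name -> a node carrying that item

def insert(transaction):
    # insert a transaction whose items are already ordered by descending count
    node = root
    for item in transaction:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item=item, parent=node)
            node.children[item] = child
            child.node_link = header_table.get(item)   # chain nodes of the same item
            header_table[item] = child
        child.count += 1
        node = child

insert(["I2", "I1", "I3"])                # e.g. T1 from the example, in sorted order
insert(["I2", "I3", "I4"])                # T2
print(root.children["I2"].count)          # 2: both transactions share the I2 prefix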
The below diagram is an example of a best-case scenario that occurs when all
transactions have the same itemset; the size of the FP-tree will be only a single
branch of nodes.
The worst-case scenario occurs when every transaction has a unique item set. So
the space needed to store the tree is greater than the space used to store the
original data set because the FP-tree requires additional space to store pointers
between nodes and the counters for each item. The diagram below shows how a
worst-case scenario FP-tree might appear. As you can see, the tree's complexity
grows with each transaction's uniqueness.
Algorithm by Han
The original algorithm to construct the FP-Tree defined by Han is given below:
1. The first step is to scan the database to find the occurrences of the itemsets
in the database. This step is the same as the first step of Apriori. The count of
1-itemsets in the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the
tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find the items in it. The item with the maximum count is placed at the top, followed by the items with lower counts; that is, the branch of the tree is constructed from the transaction's items in descending order of count.
4. The next transaction in the database is examined. Its items are ordered in descending order of count. If any items of this transaction are already present in another branch, then this transaction's branch shares a common prefix with that branch, starting from the root.
This means that new nodes for the remaining items of this transaction are linked below the common prefix.
5. Also, the count of each item is incremented as it occurs in the transactions: both common nodes and new nodes have their counts increased by 1 as transactions are inserted and linked.
6. The next step is to mine the created FP Tree. For this, the lowest node is
examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From this, traverse the path in the FP Tree. These paths are called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the
FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path.
The itemsets meeting the threshold support are considered in the Conditional
FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
Using this algorithm, the FP-tree is constructed in two database scans. The first scan
collects and sorts the set of frequent items, and the second constructs the FP-Tree.
Example
Table 1:
Transaction Items
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
The items are then sorted in descending order of count, and items below the minimum support count (3) are removed.
Item Count
I2 5
I1 4
I3 4
I4 4
Build FP Tree
The FP-tree is built from the sorted transactions as described in the algorithm above. Mining of the FP-tree then proceeds as follows:
1. The lowest node item, I5, is not considered as it does not have a min support
count. Hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2,I1,I3:1} and {I2,I3:1}; this forms the conditional pattern base.
3. The conditional pattern base is considered a transaction database, and an FP
tree is constructed. This will contain {I2:2, I3:2}, I1 is not considered as it
does not meet the min support count.
4. This path will generate all combinations of frequent patterns : {I2,I4:2},
{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-
tree: {I2:4} and frequent patterns are generated: {I2, I1:4}.
The diagram given below depicts the conditional FP tree associated with the
conditional node I3.
FP-Growth Algorithm
After constructing the FP-Tree, it's possible to mine it to find the complete set of
frequent patterns. Han presents a group of lemmas and properties to do this job
and then describes the following FP-Growth Algorithm.
Algorithm 2: FP-Growth
Procedure FP-growth(Tree, a)
{
    if Tree contains a single prefix path then
    {
        // Mining a single prefix-path FP-tree
        let P be the single prefix-path part of Tree;
        let Q be the multipath part with the top branching node replaced by a null root;
        for each combination (denoted as ß) of the nodes in the path P do
            generate pattern ß ∪ a with support = minimum support of nodes in ß;
        let freq_pattern_set(P) be the set of patterns so generated;
    }
    else
        let Q be Tree;
    for each item ai in Q do
    {
        // Mining a multipath FP-tree
        generate pattern ß = ai ∪ a with support = ai.support;
        construct ß's conditional pattern base and then ß's conditional FP-tree Tree_ß;
        if Tree_ß ≠ Ø then
            call FP-growth(Tree_ß, ß);
        let freq_pattern_set(Q) be the set of patterns so generated;
    }
    return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
}
When the FP-tree contains a single prefix path, the complete set of frequent
patterns can be generated in three parts:
The resulting patterns for a single prefix path are the enumerations of its subpaths
with minimum support. After that, the multipath Q is defined, and the resulting
patterns are processed. Finally, the combined results are returned as the frequent
patterns found.
Advantages of the FP Growth Algorithm
o This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions once for each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.
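As a quick usage sketch, an off-the-shelf FP-Growth implementation can mine Table 1 directly. This assumes the third-party mlxtend package is available; it is only a convenience for checking the results, not the algorithm listing above.

# Mining Table 1 with mlxtend's FP-Growth implementation.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I3"],        # T1
    ["I2", "I3", "I4"],        # T2
    ["I4", "I5"],              # T3
    ["I1", "I2", "I4"],        # T4
    ["I1", "I2", "I3", "I5"],  # T5
    ["I1", "I2", "I3", "I4"],  # T6
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# a minimum support count of 3 out of 6 transactions corresponds to min_support = 0.5
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))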
Apriori vs. FP Growth
o Apriori generates frequent patterns by making candidate itemsets (single itemsets, double itemsets, triple itemsets, and so on), whereas FP Growth generates an FP-tree for making frequent patterns.
o Apriori uses candidate generation, where frequent subsets are extended one item at a time, whereas FP Growth generates a conditional FP-tree for every item in the data.
o Since Apriori scans the database in each step, it becomes time-consuming when the number of items is large, whereas the FP-tree requires only one database scan in its beginning steps, so it consumes less time.
o Apriori saves a converted version of the database in memory, whereas FP Growth saves a set of conditional FP-trees for every item in memory.