
Clustering in Data Mining

Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same cluster are similar to one another.

Clustering splits data into several subsets. Each of these subsets contains data points that are similar to each other, and these subsets are called clusters.

Let's understand this with an example. Suppose we are a marketing manager, and we have a tempting new product to sell. We are sure the product would bring enormous profit, as long as it is sold to the right people. So how can we tell who is best suited for the product from our company's huge customer base? Once the customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product. Clustering, which falls under unsupervised machine learning, is exactly this kind of problem.

Clustering uses only the input data to determine patterns, anomalies, or similarities within it.

A good clustering algorithm aims to obtain clusters in which:

o The intra-cluster similarity is high, meaning the data points inside a cluster are similar to one another.
o The inter-cluster similarity is low, meaning each cluster holds data that is not similar to the data in other clusters.

What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object outside it.
o A connected region of a multidimensional space with a comparatively high
density of objects.

What is clustering in Data Mining?


o Clustering is the method of converting a group of abstract objects into
classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of
significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.

Important points:
o The data objects of a cluster can be considered as one group.
o In cluster analysis, we first partition the data set into groups based on data similarities and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out the useful features that distinguish different groups.

Applications of cluster analysis in data mining:


o Clustering analysis is widely used in many applications, such as data analysis, market research, pattern recognition, and image processing.
o It helps marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
o It helps in classifying documents on the web for information discovery.
o Clustering is also used in outlier-detection applications such as detecting credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to analyze the characteristics of each cluster.
o In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
o It helps identify areas of similar land use in an earth observation database and groups of houses in a city according to house type, value, and geographical location.

Why is clustering used in data mining?


Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years, and their widespread use in a broad range of applications including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized: an algorithm may give excellent results on one type of data set but fail or perform poorly on another. Although many efforts have been made to design algorithms that perform well in all situations, no such algorithm has been found so far. Many clustering tools have been proposed, but each has its own advantages and disadvantages and cannot handle all real situations.

1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow roughly in line with the complexity order of the algorithm. For example, k-means clustering runs in time roughly linear in the number of objects n (for a fixed number of clusters and iterations), so if we increase the number of data objects tenfold, the time taken to cluster them should also increase by about ten times. If that is not the case, there is probably an error in our implementation. An algorithm that does not scale cannot produce appropriate results on large data sets.

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find arbitrary shape clusters. They should
not be limited to only distance measurements that tend to discover a spherical
cluster of small sizes.
4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on
intervals (numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Few algorithms are
sensitive to such data and may result in poor quality clusters.

6. High dimensionality:

The clustering tools should be able to handle not only high-dimensional data but also low-dimensional data.

Association Rule
Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

TID Items

1 Bread, Milk

2 Bread, Diaper, Beer, Eggs

3 Milk, Diaper, Beer, Coke

4 Bread, Milk, Diaper, Beer

5 Bread, Milk, Diaper, Coke


Before we start defining the rule, let us first see the basic definitions.
Support Count (σ) – The frequency of occurrence of an itemset.
Here σ({Milk, Bread, Diaper}) = 2
Frequent Itemset – An itemset whose support is greater than or equal to the minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any two itemsets.
Example: {Milk, Diaper} -> {Beer}
Rule Evaluation Metrics –
o Support(s) –
The fraction of transactions that contain all items in both X and Y. It is a measure of how frequently the collection of items occurs together as a percentage of all transactions.
Support(X -> Y) = σ(X ∪ Y) / |T|
o Confidence(c) –
The ratio of the number of transactions that contain all items in both X and Y to the number of transactions that contain all items in X. It measures how often the items in Y appear in transactions that also contain the items in X.
Conf(X -> Y) = Supp(X ∪ Y) / Supp(X)
o Lift(l) –
The lift of the rule X -> Y is the confidence of the rule divided by the support of Y, i.e. the confidence we would expect if the itemsets X and Y were independent of each other.
Lift(X -> Y) = Conf(X -> Y) / Supp(Y)
A lift value near 1 indicates that X and Y appear together about as often as expected, a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Larger lift values indicate a stronger association.
Example – From the above table, for the rule {Milk, Diaper} -> {Beer}:

s = σ({Milk, Diaper, Beer}) / |T|
= 2/5
= 0.4

c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper})
= 2/3
= 0.67

l = Supp({Milk, Diaper, Beer}) / (Supp({Milk, Diaper}) * Supp({Beer}))
= 0.4 / (0.6 * 0.6)
= 1.11
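
These figures can be reproduced with a short, self-contained Python sketch (not part of the original example; the helper name support is illustrative):

# Recompute support, confidence, and lift of {Milk, Diaper} -> {Beer}
# from the transaction table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Milk", "Diaper"}, {"Beer"}
supp = support(X | Y)               # 2/5 = 0.4
conf = support(X | Y) / support(X)  # 0.4 / 0.6 ≈ 0.67
lift = conf / support(Y)            # 0.67 / 0.6 ≈ 1.11

print(f"support={supp:.2f}, confidence={conf:.2f}, lift={lift:.2f}")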
Association rules are very useful for analyzing retail datasets. The data is collected using bar-code scanners in supermarkets. Such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. The manager can then see whether certain groups of items are consistently purchased together and use this information to adjust store layouts, cross-selling, and promotions.

Hierarchical clustering in data mining


Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical clustering starts by treating each data point as an individual cluster. The endpoint is a set of clusters, where each cluster is distinct from the others and the objects within each cluster are broadly similar to one another.

There are two types of hierarchical clustering

o Agglomerative Hierarchical Clustering


o Divisive Clustering

Agglomerative hierarchical clustering


Agglomerative clustering is one of the most common types of hierarchical clustering, used to group similar objects into clusters. It is also known as AGNES (Agglomerative Nesting). In agglomerative clustering, each data point initially acts as an individual cluster, and at each step clusters are merged in a bottom-up fashion. At each iteration, the closest clusters are combined until only one cluster remains.

Agglomerative hierarchical clustering algorithm

1. Consider each data point as an individual cluster.
2. Compute the similarity (proximity) between each cluster and all other clusters, i.e. build the proximity matrix.
3. Combine the most similar clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat steps 3 and 4 until only a single cluster remains.
Let’s understand this concept with the help of graphical representation using a
dendrogram.

The following demonstration shows how the algorithm works in practice. No distances are actually computed below; the proximities among the clusters are simply assumed.

Let's suppose we have six different data points P, Q, R, S, T, V.

Step 1:

Consider each letter (P, Q, R, S, T, V) as an individual cluster and find the distance between each individual cluster and all other clusters.

Step 2:

Now, merge the most similar clusters into single clusters. Let's say cluster Q and cluster R are similar to each other, so we merge them in this step; likewise S and T are merged. We are left with the clusters [(P), (QR), (ST), (V)].

Step 3:

Here, we recalculate the proximities as per the algorithm and combine the two closest clusters, (ST) and (V), to obtain the clusters [(P), (QR), (STV)].

Step 4:

Repeat the same process. The clusters (QR) and (STV) are now the closest and are combined to form a new cluster. We have [(P), (QRSTV)].

Step 5:

Finally, the remaining two clusters are merged together to form a single cluster
[(PQRSTV)]
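
The same bottom-up merging can be visualized with a dendrogram. Below is a small illustrative Python sketch (assuming SciPy and Matplotlib are installed); the six 2-D points standing in for P, Q, R, S, T, V are made up purely for demonstration.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["P", "Q", "R", "S", "T", "V"]
points = np.array([[0.0, 0.0], [2.0, 2.0], [2.2, 2.1],
                   [5.0, 5.0], [5.1, 5.2], [8.0, 1.0]])

# 'single' linkage merges, at each step, the two clusters with the
# smallest minimum pairwise distance.
Z = linkage(points, method="single")

dendrogram(Z, labels=labels)
plt.title("Agglomerative clustering dendrogram")
plt.show()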

Divisive Hierarchical Clustering


Divisive hierarchical clustering is exactly the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, all the data points start out in a single cluster, and in every iteration the data points that are not similar to the rest are split off into separate clusters. The separated data points are treated as individual clusters, so in the limit we are left with N clusters (one per data point).
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more
information.
o It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering


o It tends to break large clusters.
o It is difficult to handle clusters of different sizes and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge or split has been performed, it can never be undone.

Partitioning Method (K-Mean) in Data Mining


Partitioning Method: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It is up to the data analyst to specify the number of clusters to be generated. In the partitioning method, given a database D containing N objects, the method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). In this article, we will see the working of the K-Means algorithm in detail.

K-Means (a centroid-based technique): The K-Means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among the data objects inside a group (intra-cluster) is high, while the similarity of data objects with objects outside the cluster (inter-cluster) is low. The similarity of a cluster is determined with respect to the mean value of the cluster; K-Means is a squared-error-based algorithm. At the start, K objects from the dataset are chosen at random, each representing a cluster mean (centre). The remaining data objects are assigned to the nearest cluster based on their distance from the cluster mean, and the new mean of each cluster is then recalculated from the added data objects.

Algorithm: K-Means

Input:
K: The number of clusters in which the dataset has to be divided

D: A dataset containing N number of objects

Output:

A dataset of K clusters

Method:

1. Randomly choose K objects from the dataset D as the initial cluster centres C.

2. (Re)assign each object to the cluster whose mean it is most similar to.

3. Update the cluster means, i.e., recalculate the mean of each cluster with its currently assigned objects.

4. Repeat steps 2 and 3 until no assignment changes.

Figure – K-means clustering flowchart.

Example: Suppose we want to group the visitors to a website using just their age as follows:

16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66

Initial Cluster:

K=2

Centroid(C1) = 16 [16]

Centroid(C2) = 22 [22]

Note: These two points are chosen randomly from the dataset.

Iteration-1:

C1 = 16.33 [16, 16, 17]

C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-2:

C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]

C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-3:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]

Iteration-4:

C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]

C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]

There is no change between iteration 3 and iteration 4, so we stop. Therefore, K-Means gives us the two clusters (16-29) and (36-66).
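
The hand computation above can be reproduced with a short scikit-learn sketch (assuming the library is installed); with K = 2 on this one-dimensional data it typically converges to the same two groups.

import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

# n_init and random_state only control the random restarts; the result
# should match the clusters (16-29) and (36-66) worked out by hand.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)

print("centroids:", km.cluster_centers_.ravel())
print("labels:   ", km.labels_)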

BIRCH in Data Mining


BIRCH (balanced iterative reducing and clustering using hierarchies) is an
unsupervised data mining algorithm that performs hierarchical clustering over large
data sets. With modifications, it can also be used to accelerate k-means clustering
and Gaussian mixture modeling with the expectation-maximization algorithm. An
advantage of BIRCH is its ability to incrementally and dynamically cluster incoming,
multi-dimensional metric data points to produce the best quality clustering for a
given set of resources (memory and time constraints). In most cases, BIRCH only
requires a single scan of the database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the
database area to handle 'noise' (data points that are not part of the underlying
pattern) effectively", beating DBSCAN by two months. The BIRCH algorithm received
the SIGMOD 10 year test of time award in 2006.

Basic clustering algorithms like K-Means and agglomerative clustering are the most commonly used clustering algorithms. But when clustering very large datasets, BIRCH and DBSCAN are advanced algorithms that can perform precise clustering at scale. Moreover, BIRCH is popular because it is easy to implement. BIRCH does not cluster the dataset directly: it first condenses the data into small summaries, and then the summaries are clustered. For this reason BIRCH is often used together with other clustering algorithms; after the summary is built, it can be clustered by any other clustering algorithm.

It is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure from whose leaves the centroids are read off. These centroids can either be the final cluster centroids or the input to another clustering algorithm such as Agglomerative Clustering.

Problem with Previous Clustering Algorithm


Previous clustering algorithms performed less effectively over very large databases and did not adequately consider the case in which a dataset is too large to fit in main memory. Furthermore, most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally for each clustering decision and do not perform heuristic weighting based on the distance between data points. As a result, maintaining high clustering quality while minimizing the cost of additional I/O (input/output) operations involved a lot of overhead.

Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values can be represented in Euclidean space, i.e., no categorical attributes should be present. The BIRCH clustering algorithm consists of two stages:

1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense
regions called Clustering Feature (CF) entries. Formally, a Clustering Feature
entry is defined as an ordered triple (N, LS, SS) where 'N' is the number of
data points in the cluster, 'LS' is the linear sum of the data points, and 'SS' is
the squared sum of the data points in the cluster. A CF entry can be
composed of other CF entries. Optionally, we can condense this initial CF tree
into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of
the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster.
Every entry in a CF tree contains a pointer to a child node, and a CF entry
made up of the sum of CF entries in the child nodes. Optionally, we can refine
these clusters.

Due to this two-step process, BIRCH is also called Two-Step Clustering.

Algorithm
The tree structure of the given data is built by the BIRCH algorithm called the
Clustering feature tree (CF tree). This algorithm is based on the CF (clustering
features) tree. In addition, this algorithm uses a tree-structured summary to create
clusters.
In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that contain several sub-clusters are called CF sub-clusters, and these CF sub-clusters are situated in non-terminal (internal) CF nodes.

The CF tree is a height-balanced tree that gathers and manages clustering features and holds the necessary information about the given data for further hierarchical clustering. This prevents the need to work with the whole input data. The clustering feature of a set of data points is represented by three numbers (N, LS, SS).

o N = number of items in subclusters


o LS = vector sum of the data points
o SS = sum of the squared data points

The BIRCH algorithm proceeds in four main phases.

o Scanning data into memory.


o Condense data (resize data).
o Global clustering.
o Refining clusters.
Two of these phases (condensing the data and refining the clusters) are optional; they come into play when more precision is required. Scanning the data is much like loading data into a model: after loading, the algorithm scans the whole data set and fits it into the CF tree.

In condensing, it resets and resizes the data for better fitting into the CF tree. In
global clustering, it sends CF trees for clustering using existing clustering
algorithms. Finally, refining fixes the problem of CF trees where the same valued
points are assigned to different leaf nodes.

Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of
summary statistics to represent a larger set of data points. These summary
statistics constitute a CF and represent a sufficient substitute for the actual data for
clustering purposes.

A CF is a set of three summary statistics representing a set of data points in a single cluster. These statistics are as follows:

o Count [The number of data values in the cluster]


o Linear Sum [The sum of the individual coordinates. This is a measure of the
location of the cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the
spread of the cluster]

NOTE: The mean and variance of the cluster can be computed directly from the count, the linear sum, and the squared sum.
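
As an illustration (not from the original text), the sketch below computes the CF triple for a made-up cluster and recovers the centroid and radius from N, LS, and SS alone; note that some formulations keep SS per coordinate, whereas here it is kept as a single scalar.

import numpy as np

points = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]])  # made-up cluster

N = len(points)            # count
LS = points.sum(axis=0)    # linear sum (per coordinate)
SS = (points ** 2).sum()   # squared sum (a single scalar here)

centroid = LS / N
# radius = sqrt(mean squared distance of the points from the centroid)
radius = np.sqrt(SS / N - np.sum(centroid ** 2))

print("N =", N, "LS =", LS, "SS =", SS)
print("centroid =", centroid, "radius =", round(float(radius), 3))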

CF Tree
The building process of the CF Tree can be summarized in the following steps, such
as:

Step 1: For each given record, BIRCH compares the location of that record with the
location of each CF in the root node, using either the linear sum or the mean of the
CF. BIRCH passes the incoming record to the root node CF closest to the incoming
record.

Step 2: The record then descends down to the non-leaf child nodes of the root
node CF selected in step 1. BIRCH compares the location of the record with the
location of each non-leaf CF. BIRCH passes the incoming record to the non-leaf node
CF closest to the incoming record.

Step 3: The record then descends down to the leaf child nodes of the non-leaf node
CF selected in step 2. BIRCH compares the location of the record with the location of
each leaf. BIRCH tentatively passes the incoming record to the leaf closest to the
incoming record.
Step 4: Perform one of the below points (i) or (ii):

1. If the radius of the chosen leaf, including the new record, does not exceed the
threshold T, then the incoming record is assigned to that leaf. The leaf and its
parent CF's are updated to account for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the
Threshold T, then a new leaf is formed, consisting of the incoming record only.
The parent CFs is updated to account for the new data point.

If step 4(ii) is executed and the leaf node already contains the maximum number of entries L, the leaf node is split into two leaf nodes. If the parent node is full, split the parent node,
and so on. The most distant leaf node CFs are used as leaf node seeds, with the
remaining CFs being assigned to whichever leaf node is closer. Note that the radius
of a cluster may be calculated even without knowing the data points, as long as we
have the count n, the linear sum LS, and the squared sum SS. This allows BIRCH to
evaluate whether a given data point belongs to a particular sub-cluster without
scanning the original data set.

Clustering the Sub-Clusters


Once the CF tree is built, any existing clustering algorithm may be applied to the
sub-clusters (the CF leaf nodes) to combine these sub-clusters into clusters. The
task of clustering becomes much easier as the number of sub-clusters is much less
than the number of data points. When a new data value is added, these statistics
may be easily updated, thus making the computation more efficient.

Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-Means, the final number of clusters (k) does not necessarily have to be supplied by the user; the algorithm can determine it.

o Threshold: The maximum radius that a sub-cluster in a leaf node of the CF tree may have; if adding a new point would push the radius of the closest sub-cluster past this threshold, a new sub-cluster is started.
o branching_factor: This parameter specifies the maximum number of CF
sub-clusters in each node (internal node).
o n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If it is set to None, the final clustering step is not performed and the intermediate sub-clusters are returned.
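
A minimal usage sketch with scikit-learn's Birch implementation (assumed available) is shown below; the parameter names match those described above, and the data is synthetic.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2-D.
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(100, 2))])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)

print("sub-cluster centroids found:", len(model.subcluster_centers_))
print("first ten labels:", labels[:10])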
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points
and existing clusters. It exploits the observation that the data space is not usually
uniformly occupied, and not every data point is equally important.

It uses available memory to derive the finest possible sub-clusters while minimizing
I/O costs. It is also an incremental method that does not require the whole data set
in advance.

DBSCAN Clustering
Clustering analysis, or simply clustering, is basically an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have, in some sense, different properties. It comprises many different methods based on different ways of measuring similarity and density.
E.g. K-Means (distance between points), Affinity propagation (graph distance),
Mean-shift (distance between points), DBSCAN (distance between nearest points),
Gaussian mixtures (Mahalanobis distance to centers), Spectral clustering (graph
distance), etc.

Fundamentally, all clustering methods use the same approach i.e. first we calculate
similarities and then we use it to cluster the data points into groups or batches.
Here we will focus on the Density-based spatial clustering of applications
with noise (DBSCAN) clustering method.

Density-Based Spatial Clustering Of Applications With Noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of
“clusters” and “noise”. The key idea is that for each point of a cluster, the
neighborhood of a given radius has to contain at least a minimum number of
points.
Why DBSCAN?

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are
suitable only for compact and well-separated clusters. Moreover, they are also
severely affected by the presence of noise and outliers in the data.

Real-life data may contain irregularities, like:

1. Clusters can be of arbitrary (for example, non-convex) shape.

2. Data may contain noise and outliers.

Given a data set containing non-convex clusters and outliers, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes.

Parameters Required For DBSCAN Algorithm

1. eps: It defines the neighborhood around a data point, i.e. if the distance between two points is less than or equal to 'eps' then they are considered neighbors. If the eps value is chosen too small, then a large part of the data will be considered outliers. If it is chosen very large, then clusters will merge and the majority of the data points will end up in the same cluster. One way to find a suitable eps value is the k-distance graph (see the sketch after this list).

2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and it should be at least 3.
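
One hedged way to apply the k-distance idea, assuming scikit-learn and Matplotlib are available: plot the sorted distance of every point to its MinPts-th nearest neighbour and read eps off the elbow of the curve.

import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

X = np.random.default_rng(42).normal(size=(300, 2))  # synthetic 2-D data
min_pts = 4                                          # example value (>= D + 1)

# n_neighbors = min_pts + 1 because, when querying the training points,
# the nearest neighbour of each point is the point itself (distance 0).
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted distance to the MinPts-th other neighbour; the bend ("elbow")
# of this curve is a reasonable candidate for eps.
k_dist = np.sort(distances[:, -1])
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts}-th nearest neighbour")
plt.show()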

In this algorithm, we have three types of data points.

Core point: A point is a core point if it has at least MinPts points within distance eps.
Border point: A point that has fewer than MinPts points within eps but lies in the neighborhood of a core point.
Noise or outlier: A point that is neither a core point nor a border point.

Steps Used In DBSCAN Algorithm

1. Find all the neighboring points within eps of every point, and identify the core points, i.e. the points with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.

3. Recursively find all density-connected points of each core point and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is density-reachable from a.

4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise.

Pseudocode For DBSCAN Clustering Algorithm

DBSCAN(dataset, eps, MinPts) {

   # cluster index
   C = 0

   for each unvisited point p in dataset {
      mark p as visited

      # find neighbors of p
      N = find the neighboring points of p within eps

      if |N| < MinPts:
         mark p as noise
      else:
         C = C + 1
         add p to cluster C
         # expand the cluster through density-reachable points
         for each point p' in N {
            if p' is not visited:
               mark p' as visited
               N' = find the neighboring points of p' within eps
               if |N'| >= MinPts:
                  N = N U N'
            if p' is not a member of any cluster:
               add p' to cluster C
         }
   }
}
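
For comparison with the pseudocode, here is a minimal usage sketch of scikit-learn's DBSCAN (assumed installed) on the standard two-moons data, which has the kind of non-convex clusters discussed above.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; the remaining labels are cluster indices.
labels = list(db.labels_)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:  ", labels.count(-1))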

CURE Algorithm
CURE(Clustering Using Representatives)

o It is a hierarchy-based clustering technique that adopts a middle ground between the centroid-based and the all-points extremes. Hierarchical clustering starts with single-point clusters and keeps merging clusters until the desired number of clusters is formed.

o It can identify both spherical and non-spherical clusters.

o It is useful for discovering groups and identifying interesting distributions in the underlying data.

o Instead of using a single centroid point, as most data mining algorithms do, CURE uses a set of well-scattered representative points to handle the clusters efficiently and to eliminate outliers.

Figure – representation of clusters and outliers.

Figure – the six steps of the CURE algorithm.

CURE Architecture
o Idea: A random sample, say 's', is drawn from the given data. This random sample is partitioned into, say, 'p' partitions, each of size s/p. Each partition is partially clustered into, say, s/pq clusters. Outliers are discarded or eliminated from these partially clustered partitions. The partially clustered partitions are then clustered again. Finally, the remaining data on disk is labelled with the resulting clusters.

Figure – representation of partitioning and clustering.

o Procedure:

1. Select a target sample of size, say 'gfg'.

2. Choose 'gfg' well-scattered points within a cluster.

3. These scattered points are shrunk towards the centroid (see the sketch after this list).

4. These points are used as representatives of the clusters in the 'Dmin' cluster-merging approach. In the Dmin (minimum distance) approach, the minimum distance between the scattered points inside the sample 'gfg' and the points outside the 'gfg' sample is calculated. The point with the least distance to a scattered point inside the sample, compared to other points, is merged into the sample.

5. After every such merge, new representative points are selected to represent the new cluster.

6. Cluster merging stops when the target number of clusters, say 'k', is reached.
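
The sketch below (an illustration, not a full CURE implementation) shows only the representative-point step from the procedure: pick well-scattered points for one cluster and shrink them toward the centroid by a factor alpha.

import numpy as np

def cure_representatives(cluster_points, num_rep=4, alpha=0.3):
    # Return `num_rep` scattered points shrunk toward the cluster centroid.
    centroid = cluster_points.mean(axis=0)

    # Greedily pick scattered points: start with the point farthest from the
    # centroid, then repeatedly add the point farthest from those chosen.
    first = np.argmax(np.linalg.norm(cluster_points - centroid, axis=1))
    reps = [cluster_points[first]]
    while len(reps) < min(num_rep, len(cluster_points)):
        dists = np.min(
            [np.linalg.norm(cluster_points - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_points[np.argmax(dists)])

    reps = np.array(reps)
    # Shrink toward the centroid to dampen the effect of outliers.
    return reps + alpha * (centroid - reps)

pts = np.random.default_rng(1).normal(size=(50, 2))  # one synthetic cluster
print(cure_representatives(pts))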


Apriori Algorithm
The Apriori algorithm is used to calculate association rules between objects, i.e. it describes how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning technique that analyzes, for example, whether people who bought product A also bought product B.

The primary objective of the Apriori algorithm is to create association rules between different objects; the association rule describes how two or more objects are related to one another. The Apriori algorithm is a classic algorithm for frequent pattern mining. Generally, you operate the Apriori algorithm on a database that consists of a huge number of transactions. Let's understand the algorithm with the help of an example: suppose you go to Big Bazar and buy different products. Mining such data helps customers buy their products with ease and increases the sales performance of the Big Bazar. In this tutorial, we will discuss the Apriori algorithm with examples.

Introduction
We take an example to understand the concept better. You must have noticed that the pizza shop seller offers a pizza, soft drink, and breadstick combo, and gives a discount to customers who buy these combos. Have you ever wondered why he does so? He thinks that customers who buy pizza also buy soft drinks and breadsticks, so by making combos he makes it easy for the customers. At the same time, he also increases his sales performance.

Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate
bundled together. It shows that the shopkeeper makes it comfortable for the
customers to buy these products in the same place.
The above two examples are the best examples of Association Rules in Data Mining.
It helps us to learn the concept of apriori algorithms.

What is Apriori Algorithm?


The Apriori algorithm is used for mining frequent itemsets and the relevant association rules. Generally, it operates on a database containing a huge number of transactions, for example, the items customers buy at a Big Bazar.

Apriori algorithm helps the customers to buy their products with ease and increases
the sales performance of the particular store.

Components of Apriori algorithm


The given three components comprise the apriori algorithm.

1. Support
2. Confidence
3. Lift

Let's take an example to understand this concept.

As discussed above, you need a huge database containing a large number of transactions. Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because customers frequently buy these two items together.

Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 transactions contain both Biscuits and Chocolate. Using this data, we will find the support, confidence, and lift.

Support
Support refers to the default popularity of any product. You find the support as a
quotient of the division of the number of transactions comprising that product by
the total number of transactions. Hence, we get

Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)

= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolate. To get the confidence, you divide the number of transactions that contain both biscuits and chocolate by the number of transactions that contain biscuits.

Hence,

Confidence = (Transactions containing both Biscuits and Chocolate) / (Total transactions involving Biscuits)

= 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits bought chocolates also.

Lift
Consider the above example; lift describes how much more likely chocolate is to be bought when biscuits are bought, compared with how often chocolate is bought overall. Consistent with the definition given earlier, lift is the confidence of the rule divided by the support of the consequent:

Lift = Confidence (Biscuits -> Chocolate) / Support (Chocolate)

= 50 / 15 ≈ 3.33

It means that people who buy biscuits are about 3.3 times more likely to buy chocolate than an average customer. If the lift value is below one, it indicates that people are unlikely to buy both items together; the larger the value, the stronger the combination.
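
As a quick arithmetic check of the hypothetical numbers above (a throwaway sketch, not library code):

total_txns = 4000
with_biscuits = 400
with_chocolate = 600
with_both = 200

support_biscuits = with_biscuits / total_txns      # 0.10 -> 10 percent
confidence = with_both / with_biscuits             # 0.50 -> 50 percent
lift = confidence / (with_chocolate / total_txns)  # 0.50 / 0.15 ≈ 3.33

print(support_biscuits, confidence, round(lift, 2))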

How does the Apriori Algorithm work in Data Mining?


We will understand this algorithm with the help of an example

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk,
Apple}. The database comprises six transactions where 1 represents the presence
of the product and 0 represents the absence of the product.

Transaction ID Rice Pulse Oil Milk Apple

t1 1 1 1 0 0
t2 0 1 1 1 0

t3 0 0 0 1 1

t4 1 1 0 1 0

t5 1 1 1 0 1

t6 1 1 1 1 1

The Apriori Algorithm makes the following assumptions:

o All subsets of a frequent itemset must be frequent.
o All supersets of an infrequent itemset must be infrequent.
o A threshold support level is fixed; in our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in the transactions. Now, shortlist only those products that meet the 50 percent threshold support level. We get the following frequency table.

Product Frequency (Number of transactions)

Rice (R) 4

Pulse(P) 5

Oil(O) 4

Milk(M) 4

The above table indicates the products frequently bought by the customers.

Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table.

Itemset Frequency (Number of transactions)

RP 4

RO 3

RM 2

PO 4

PM 3

OM 2

Step 3

Apply the same 50 percent threshold support, i.e. keep the itemsets whose support count is at least 3.

Thus, we get RP, RO, PO, and PM.

Step 4

Now, look for a set of three products that the customers buy together. We get the
given combination.

1. RP and RO give RPO


2. PO and PM give POM

Step 5

Calculate the frequency of the two itemsets, and you will get the given frequency
table.

Itemset Frequency (Number of transactions)

RPO 3

POM 2

Applying the threshold again, you can see that the only frequent set of three products is RPO.

We have considered an easy example to discuss the apriori algorithm in data mining. In reality, you find thousands of such combinations.
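
The frequency tables above can be recomputed directly from the six transactions with a brute-force sketch like the one below (illustrative only; item names are abbreviated to their first letters).

transactions = [
    {"R", "P", "O"},            # t1: Rice, Pulse, Oil
    {"P", "O", "M"},            # t2: Pulse, Oil, Milk
    {"M", "A"},                 # t3: Milk, Apple
    {"R", "P", "M"},            # t4: Rice, Pulse, Milk
    {"R", "P", "O", "A"},       # t5: Rice, Pulse, Oil, Apple
    {"R", "P", "O", "M", "A"},  # t6: all five products
]

def count(itemset):
    # Number of transactions containing every item of the itemset.
    return sum(set(itemset) <= t for t in transactions)

for itemset in ["R", "P", "O", "M",
                "RP", "RO", "RM", "PO", "PM", "OM",
                "RPO", "POM"]:
    print(itemset, count(itemset))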

How to improve the efficiency of the Apriori Algorithm?


There are various methods used to improve the efficiency of the Apriori algorithm.

Hash-based itemset counting

In hash-based itemset counting, a k-itemset whose corresponding hash-bucket count is below the support threshold cannot be frequent and is excluded from the candidates.

Transaction Reduction

In transaction reduction, a transaction that does not contain any frequent k-itemset is not valuable in subsequent scans and can be discarded.

Apriori Algorithm in data mining


We have already discussed an example of the apriori algorithm related to the
frequent itemset generation. Apriori algorithm has many applications in data
mining.

The primary approaches to finding association rules in data mining are given below.

Use Brute Force

Enumerate all possible rules and compute the support and confidence for each individual rule. Afterwards, eliminate the rules whose support or confidence is below the threshold levels.

The two-step approach

The two-step approach is a better option for finding association rules than the brute force method.

Step 1
In this article, we have already discussed how to create the frequency table and select the itemsets whose support is greater than the threshold support.

Step 2

To create association rules, you need to consider every binary partition of each frequent itemset and keep the rules with the highest confidence levels.

In the above example, you can see that the RPO combination was a frequent itemset. Now we list all the rules that can be formed from RPO:

RP -> O, RO -> P, PO -> R, O -> RP, P -> RO, R -> PO

You can see that there are six different combinations. In general, a frequent itemset with n elements yields 2^n - 2 candidate association rules, as enumerated in the sketch below.
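
A small sketch of that enumeration (illustrative only, here applied to the frequent itemset RPO) simply lists every non-trivial antecedent/consequent split.

from itertools import combinations

frequent_itemset = {"R", "P", "O"}

rules = []
for size in range(1, len(frequent_itemset)):
    for antecedent in combinations(sorted(frequent_itemset), size):
        consequent = frequent_itemset - set(antecedent)
        rules.append((set(antecedent), consequent))

for lhs, rhs in rules:
    print(lhs, "->", rhs)
print(len(rules), "candidate rules")  # 2**3 - 2 = 6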

Advantages of Apriori Algorithm


o It is used to calculate large itemsets.
o Simple to understand and apply.

Disadvantages of Apriori Algorithms


o Apriori algorithm is an expensive method to find support since the calculation
has to pass through the whole database.
o Sometimes, you need a huge number of candidate rules, so it becomes
computationally more expensive.

FP Growth Algorithm in Data Mining


In data mining, finding frequent patterns in large databases is very important and has been studied extensively in the past few years. Unfortunately, this task is computationally expensive, especially when many patterns exist.

The FP-Growth Algorithm, proposed by Han, is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure called the frequent-pattern tree (FP-tree) to store compressed, crucial information about frequent patterns. In his study, Han showed that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori algorithm and TreeProjection. Later work showed that FP-Growth also performs better than other methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth algorithm have led to many studies proposing variations to improve its performance.
What is FP Growth Algorithm?
The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To achieve this, it uses a divide-and-conquer strategy. The core of the method is the use of a special data structure called the frequent-pattern tree (FP-tree), which retains the itemset association information.

This algorithm works as follows:

o First, it compresses the input database, creating an FP-tree instance to represent frequent items.
o After this first step, it divides the compressed database into a set of
conditional databases, each associated with one frequent pattern.
o Finally, each such database is mined separately.

Using this strategy, the FP-Growth reduces the search costs by recursively looking
for short patterns and then concatenating them into the long frequent patterns.

In large databases, holding the FP tree in the main memory is impossible. A strategy
to cope with this problem is to partition the database into a set of smaller databases
(called projected databases) and then construct an FP-tree from each of these
smaller databases.

FP-Tree
The frequent-pattern tree (FP-tree) is a compact data structure that stores
quantitative information about frequent patterns in a database. Each transaction is
read and then mapped onto a path in the FP-tree. This is done until all transactions
have been read. Different transactions with common subsets allow the tree to
remain compact because their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The
purpose of the FP tree is to mine the most frequent pattern. Each node of the FP
tree represents an item of the item set.

The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other
item sets, are maintained while forming the tree.

Han defines the FP-tree as the tree structure given below:

1. One root is labelled as "null" with a set of item-prefix subtrees as children and
a frequent-item-header table.
2. Each node in the item-prefix subtree consists of three fields:
o Item-name: registers which item is represented by the node;
o Count: the number of transactions represented by the portion of the
path reaching the node;
o Node-link: links to the next node in the FP-tree carrying the same item
name or null if there is none.
3. Each entry in the frequent-item-header table consists of two fields:
o Item-name: the same as in the node;
o Head of node-link: a pointer to the first node in the FP-tree carrying the item name.

Additionally, the frequent-item-header table can have the count support for an item.
The best-case scenario occurs when all transactions have the same itemset; the FP-tree is then only a single branch of nodes.

The worst-case scenario occurs when every transaction has a unique itemset. The space needed to store the tree is then greater than the space used to store the original data set, because the FP-tree requires additional space to store pointers between nodes and the counters for each item; the tree's complexity grows with the uniqueness of the transactions.
Algorithm by Han
The original algorithm to construct the FP-Tree defined by Han is given below:

Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold ξ.

Output: FP-tree, the frequent-pattern tree of DB.

Method: The FP-tree is constructed as follows.

1. The first step is to scan the database to find the occurrences of the itemsets
in the database. This step is the same as the first step of Apriori. The count of
1-itemsets in the database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the
tree. The root is represented by null.
3. The next step is to scan the database again and examine the transactions.
Examine the first transaction and find out the itemset in it. The itemset with
the max count is taken at the top, and then the next itemset with the lower
count. It means that the branch of the tree is constructed with transaction
itemsets in descending order of count.
4. The next transaction in the database is examined. The itemsets are ordered
in descending order of count. If any itemset of this transaction is already
present in another branch, then this transaction branch would share a
common prefix to the root.
This means that the common itemset is linked to the new node of another
itemset in this transaction.
5. Also, the count of the itemset is incremented as it occurs in the transactions.
The common node and new node count are increased by 1 as they are
created and linked according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is
examined first, along with the links of the lowest nodes. The lowest node
represents the frequency pattern length 1. From this, traverse the path in the
FP Tree. This path or paths is called a conditional pattern base.
A conditional pattern base is a sub-database consisting of prefix paths in the
FP tree occurring with the lowest node (suffix).
7. Construct a Conditional FP Tree, formed by a count of itemsets in the path.
The itemsets meeting the threshold support are considered in the Conditional
FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.

Using this algorithm, the FP-tree is constructed in two database scans. The first scan
collects and sorts the set of frequent items, and the second constructs the FP-Tree.
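
For readers who want to experiment, the same construction and mining is available off the shelf. Below is a hedged sketch using the mlxtend library (assumed installed), applied to the transactions of the example that follows, with a 50% minimum support.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [
    ["I1", "I2", "I3"],
    ["I2", "I3", "I4"],
    ["I4", "I5"],
    ["I1", "I2", "I4"],
    ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3", "I4"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Each row of the result is a frequent itemset together with its support
# (the fraction of the six transactions that contain it).
print(fpgrowth(onehot, min_support=0.5, use_colnames=True))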

Example

Support threshold=50%, Confidence= 60%

Table 1:

Transaction List of items

T1 I1,I2,I3
T2 I2,I3,I4

T3 I4,I5

T4 I1,I2,I4

T5 I1,I2,I3,I5

T6 I1,I2,I3,I4

Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

Table 2: Count of each item

Item Count

I1 4

I2 5

I3 4

I4 4

I5 2

Table 3: Sort the itemset in descending order.

Item Count

I2 5

I1 4
I3 4

I4 4

Build FP Tree

Let's build the FP tree in the following steps:

1. Considering the root node null.


2. The first scan of transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child of the root, I1 is linked to I2, and I3 is linked to I1.
3. T2: I2, I3, and I4 contain I2, I3, and I4, where I2 is linked to root, I3 is linked to
I2 and I4 is linked to I3. But this branch would share the I2 node as common
as it is already used in T1.
4. Increment the count of I2 by 1, and I3 is linked as a child to I2, and I4 is linked
as a child to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
root node. Hence it will be incremented by 1. Similarly I1 will be incremented
by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3},
{I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
Mining of FP-tree is summarized below:

1. The lowest node item, I5, is not considered as it does not have a min support
count. Hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches, {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2,I1,I3:1} and {I2,I3:1}; these form the conditional pattern base.
3. The conditional pattern base is considered a transaction database, and an FP
tree is constructed. This will contain {I2:2, I3:2}, I1 is not considered as it
does not meet the min support count.
4. This path will generate all combinations of frequent patterns : {I2,I4:2},
{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this generates a 2-node FP-tree {I2:4, I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-
tree: {I2:4} and frequent patterns are generated: {I2, I1:4}.

Item Conditional Pattern Base Conditional FP-tree Frequent Patterns Generated

I4 {I2,I1,I3:1},{I2,I3:1} {I2:2, I3:2} {I2,I4:2},{I3,I4:2},{I2,I3,I4:2}

I3 {I2,I1:3},{I2:1} {I2:4, I1:3} {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}

I1 {I2:4} {I2:4} {I2,I1:4}

The diagram given below depicts the conditional FP tree associated with the
conditional node I3.

FP-Growth Algorithm
After constructing the FP-Tree, it's possible to mine it to find the complete set of
frequent patterns. Han presents a group of lemmas and properties to do this job
and then describes the following FP-Growth Algorithm.

Algorithm 2: FP-Growth

Input: A database DB, represented by the FP-tree constructed according to Algorithm 1, and a minimum support threshold ξ.

Output: The complete set of frequent patterns.

Method: Call FP-growth (FP-tree, null).

1. Procedure FP-growth(Tree, α)
2. {
3. if Tree contains a single prefix path then
4. {
5. // Mining single prefix-path FP-tree
6. let P be the single prefix-path part of Tree;
7. let Q be the multipath part with the top branching node replaced by a null root;
8. for each combination (denoted as β) of the nodes in path P do
9. generate pattern β ∪ α with support = minimum support of nodes in β;
10. let freq_pattern_set(P) be the set of patterns so generated;
11. }
12. else let Q be Tree;
13. for each item ai in Q do
14. {
15. // Mining multipath FP-tree
16. generate pattern β = ai ∪ α with support = ai.support;
17. construct β's conditional pattern base and then β's conditional FP-tree Tree_β;
18. if Tree_β ≠ Ø then
19. call FP-growth(Tree_β, β);
20. let freq_pattern_set(Q) be the set of patterns so generated;
21. }
22. return (freq_pattern_set(P) ∪ freq_pattern_set(Q) ∪ (freq_pattern_set(P) × freq_pattern_set(Q)))
23. }

When the FP-tree contains a single prefix path, the complete set of frequent
patterns can be generated in three parts:

1. The single prefix-path P,


2. The multipath Q,
3. And their combinations (lines 01 to 03 and 14).

The resulting patterns for a single prefix path are the enumerations of its subpaths
with minimum support. After that, the multipath Q is defined, and the resulting
patterns are processed. Finally, the combined results are returned as the frequent
patterns found.

Advantages of FP Growth Algorithm


Here are the following advantages of the FP growth algorithm, such as:

o This algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions for each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm


This algorithm also has some disadvantages, such as:

o FP Tree is more cumbersome and difficult to build than Apriori.


o It may be expensive.
o The algorithm may not fit in the shared memory when the database is large.

Difference between Apriori and FP Growth Algorithm


Apriori and FP-Growth algorithms are the most basic FIM algorithms. There are
some basic differences between these algorithms, such as:

Apriori: generates frequent patterns by forming candidate itemsets (single itemsets, double itemsets, triple itemsets, and so on).
FP Growth: builds an FP-tree and generates frequent patterns from it.

Apriori: uses candidate generation, where frequent subsets are extended one item at a time.
FP Growth: generates a conditional FP-tree for every item in the data.

Apriori: since it scans the database at each step, it becomes time-consuming when the number of items is large.
FP Growth: requires only two database scans at the beginning, so it consumes less time.

Apriori: a converted version of the database is saved in memory.
FP Growth: a set of conditional FP-trees, one for every item, is saved in memory.

Apriori: uses a breadth-first search.
FP Growth: uses a depth-first search.
