Unit-4 (2)
CLUSTERING
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters consisting of similar data points; the objects with possible similarities remain in a group that has little or no similarity with another group."
It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with an unlabeled dataset. The clustering technique is widely used in various tasks. Some of the most common uses of this technique are:
Market Segmentation
Statistical data analysis
Social network analysis
Image segmentation
Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to provide recommendations based on the past search of products. Netflix also uses this technique to recommend movies and web series to its users as per their watch history.
Types of Clustering Methods:
1. Partitioning Clustering: The data is divided into non-hierarchical groups; it is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm.
2. Density-Based Clustering: This method connects highly dense regions into clusters, so arbitrarily shaped clusters can be formed as long as the dense regions are connected. DBSCAN is an example of this type.
3. Distribution Model-Based Clustering: In this, the data is divided based on the probability of how likely a data point is to belong to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian distribution. An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
4. Hierarchical Clustering: Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram. Examples are Agglomerative (bottom-up approach) & Divisive (top-down approach) clustering.
5. Fuzzy Clustering: Fuzzy clustering is a type of soft clustering method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which depend on its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Applications of Clustering:
Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide the cancerous and non-cancerous data points into different groups.
Search Engines: Search engines also work on the clustering technique. The search results appear based on the objects closest to the search query. This is done by grouping similar data objects into one group that is far from the dissimilar objects. The accuracy of the result for a query depends on the quality of the clustering algorithm used.
Customer Segmentation: It is used in market research to segment customers based on their choices and preferences.
Biology: It is used in the biology stream to classify different species of plants and animals using image recognition techniques.
Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.
K-MEANS CLUSTERING ALGORITHM
K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled dataset into different clusters: "It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group that has similar properties."
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between each data point and its corresponding cluster centroid. The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the cluster assignments no longer change. The value of k should be predetermined in this algorithm. The k-means clustering
algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an iterative process.
2. Assigns each data point to its closest k-center; the data points that are near a particular k-center form a cluster.
Example of K-Means Algorithm:
Data Points X Y
A1 2 10
A2 2 5
A3 8 4
B1 5 8
B2 7 5
B3 6 4
C1 1 2
C2 4 9
In the given example, there are 8 data points. Consider the initial centroids as A1, B1 & C1, so the value of K = 3. If initial centroids are given, we use them; otherwise we can select any of the data points as the initial centroids. The formula for calculating the distance between two points (x1, y1) and (x2, y2) is the Euclidean distance:

d = √((x2 − x1)² + (y2 − y1)²)

The initial centroids are A1: (2, 10), B1: (5, 8), C1: (1, 2). First, we calculate the distance from every data point to each centroid using the above formula. Once the distances have been calculated, we assign each data point to a cluster: for each point, we compare the distances and assign it to the cluster whose centroid is at the smallest distance.
After the 1st iteration, the data point A1 is assigned to the 1st cluster; A3, B1, B2, B3 and C2 are assigned to the 2nd cluster; and A2 and C1 are assigned to the 3rd cluster. Now we need to calculate the new centroids. The new centroids are cluster 1: (2, 10), cluster 2: (6, 6) and cluster 3: (1.5, 3.5). After recomputing the distances with the new centroids:
If we look at the above table, C2 was previously assigned to the 2nd cluster, but now it is assigned to the 1st cluster, which means a data point has moved from one cluster to another. So, we need to calculate the new centroids again. The new centroids are cluster 1: (3, 9.5), cluster 2: (6.5, 5.25) and cluster 3: (1.5, 3.5). Here, the new cluster assignment becomes the current assignment. After recomputing the distances with the new centroids:
If we look at the above table, B1 was previously assigned to the 2nd cluster, but now it is assigned to the 1st cluster, which means a data point has moved from one cluster to another. So, we need to calculate the new centroids again. The new centroids are cluster 1: (3.67, 9), cluster 2: (7, 4.33) and cluster 3: (1.5, 3.5). Here, the new cluster assignment becomes the current assignment. After recomputing the distances with the new centroids:
Finally, the current cluster assignment and the new cluster assignment are exactly the same, so the algorithm stops. A1, B1 and C2 belong to the 1st cluster; A3, B2 and B3 belong to the 2nd cluster; and A2 and C1 belong to the 3rd cluster.
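The same result can be reproduced programmatically. Below is a minimal sketch using Scikit-Learn's KMeans class, with the initial centroids fixed to A1, B1 and C1; the variable names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

# Data points from the worked example above.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "B1": (5, 8),
          "B2": (7, 5), "B3": (6, 4), "C1": (1, 2), "C2": (4, 9)}
X = np.array(list(points.values()), dtype=float)

# Fix the initial centroids to A1, B1 and C1 (so n_init must be 1).
init_centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)
kmeans = KMeans(n_clusters=3, init=init_centroids, n_init=1).fit(X)

for name, label in zip(points, kmeans.labels_):
    print(name, "-> cluster", label + 1)
print(kmeans.cluster_centers_)
# Expected grouping: {A1, B1, C2}, {A3, B2, B3}, {A2, C1}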
LIMITATIONS OF K-MEANS
Choosing the right number of clusters: K-means requires the user to specify the
number of clusters to be generated, which can be difficult to determine. A poor choice
of K can lead to suboptimal clustering results.
Outliers can skew results: K-means assumes that all data points are equally important
in the clustering process, which can lead to outliers skewing the results. Outliers may
be assigned to a cluster, leading to inaccurate cluster assignments for the other data
points.
Sensitive to outliers: K-means is sensitive to outliers or noise data, which can distort
the resulting clusters.
Assumes spherical clusters: K-means assumes that the clusters are spherical and have
equal variances, which is not always true in real-world scenarios.
Limited applicability to non-numerical data: K-means is designed for numerical data
and does not handle categorical or textual data well.
Lack of robustness: K-means can be sensitive to the distribution of the data, and the
resulting clusters may not be robust to small changes in the input data.
Difficulty in handling high-dimensional data: K-means is less effective in high-
dimensional data, where the "curse of dimensionality" makes it harder to identify
meaningful clusters.
K-Means Clustering:
K-means clustering is a very popular clustering algorithm which applied when we
have a dataset with labels unknown. The goal is to find certain groups based on some kind of
similarity in the data with the number of groups represented by K. This algorithm is generally
used in areas like market segmentation, customer segmentation, etc. But, it can also be used
to segment different objects in the images on the basis of the pixel values.
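As a brief illustration of pixel-based segmentation, the sketch below clusters the pixel values of a sample image into a few colour groups; the sample image from scikit-learn and the choice of 4 clusters are assumptions for the example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

image = load_sample_image("china.jpg") / 255.0   # array of shape (height, width, 3)
pixels = image.reshape(-1, 3)                    # one row per pixel

# Cluster the pixel colours and rebuild the image from the cluster centres.
kmeans = KMeans(n_clusters=4, n_init=10).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)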
Finally, clustering can be a useful preprocessing technique in machine learning for identifying patterns and structures in the data, grouping similar instances together, and selecting or engineering features.
distance trained using Expectation-Maximization (EM), and Euclidean distance
changed by the shortest distance algorithm.
DBSCAN (DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE)
DBSCAN groups together points that are closely packed and marks points that lie alone in low-density regions as noise. It uses two parameters: eps (ε), the radius of the neighbourhood around a point, and MinPts, the minimum number of points required within eps for a point to be a core point.
Core Point: A point which has at least MinPts points within eps distance.
Border Point: A point which has fewer than MinPts points within eps, but lies in the neighbourhood of a core point.
Noise or Outlier: A point which is neither a core point nor a border point.
Other Definitions:
A point X is directly density-reachable from a point Y w.r.t. epsilon and minPoints if X belongs to the neighbourhood of Y, i.e., dist(X, Y) <= epsilon, and Y is a core point.
A point X is density-reachable from Y if there is a chain of points such that X is directly density-reachable from P2, P2 from P3, and P3 from Y. The inverse of this is not necessarily valid.
A point X is density-connected to a point Y w.r.t. epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t. epsilon and minPoints.
DBSCAN Algorithm:
1. Find all the neighbouring points within eps and identify the core points, i.e., the points that have more than MinPts neighbours.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density-connected points and assign them to the same cluster as the core point.
Two points a and b are said to be density-connected if there exists a point c which has a sufficient number of points in its neighbourhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbour of c, c is a neighbour of d, and d is a neighbour of e, which in turn is a neighbour of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
Example:
Point X Y
S1 5 7
S2 8 4
S3 3 3
S4 4 4
S5 3 7
S6 6 7
S7 6 1
S8 5 5
Identify the neighbours of each point for ε = 3.5:
Point Neighbours within ε = 3.5
S1 S4, S5, S6, S8
S2 S8
S3 S4, S8
S4 S1, S3, S5, S8
S5 S1, S4, S6, S8
S6 S1, S5, S8
S7 (none)
S8 S1, S2, S3, S4, S5, S6
Identify the core points & noise points. Then check the direct density-reachability condition for each noise point: if a noise point lies within ε of a core point, convert it to a border point.
Point Core / Noise
S1 Core
S2 Noise → Border
S3 Core
S4 Core
S5 Core
S6 Core
S7 Noise → Noise
S8 Core
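The classification above can be checked with Scikit-Learn's DBSCAN. The value of MinPts is not stated in this excerpt; min_samples = 3 (counting the point itself) is an assumption that is consistent with the core/noise assignments in the table.

import numpy as np
from sklearn.cluster import DBSCAN

points = {"S1": (5, 7), "S2": (8, 4), "S3": (3, 3), "S4": (4, 4),
          "S5": (3, 7), "S6": (6, 7), "S7": (6, 1), "S8": (5, 5)}
X = np.array(list(points.values()), dtype=float)

db = DBSCAN(eps=3.5, min_samples=3).fit(X)
core = set(db.core_sample_indices_)

for i, name in enumerate(points):
    if i in core:
        kind = "core"
    elif db.labels_[i] != -1:
        kind = "border"   # non-core point within eps of a core point
    else:
        kind = "noise"
    print(name, kind)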
GAUSSIAN MIXTURES
A Gaussian mixture model (GMM) is a probabilistic model that assumes that the
instances were generated from a mixture of several Gaussian distributions whose parameters
are unknown. All the instances generated from a single Gaussian distribution form a cluster
that typically looks like an ellipsoid. Each cluster can have a different ellipsoidal shape, size,
density and orientation.
This generative process can be represented as a graphical model. This is a graph
which represents the structure of the conditional dependencies between random variables.
The circles represent random variables
The squares represent fixed values
The large rectangles are called plates: they indicate that their content is repeated
several times
The number indicated at the bottom right hand side of each plate indicates how many
times its content is repeated
Each variable z(i) is drawn from the categorical distribution with weights ϕ. Each variable x(i) is drawn from the normal distribution, with the mean and covariance matrix defined by its cluster z(i).
The solid arrows represent conditional dependencies.
The squiggly arrow from z(i) to x(i) represents a switch: depending on the value of z(i), the instance x(i) will be sampled from a different Gaussian distribution.
Shaded nodes indicate that the value is known, so in this case only the random variables x(i) have known values: they are called observed variables. The unknown random variables z(i) are called latent variables.
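The snippets that follow assume a GaussianMixture model gm has already been fitted to a dataset X. A minimal sketch of that fitting step is shown below; the toy dataset and the choice of 3 components are assumptions.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)  # toy 2-D data

gm = GaussianMixture(n_components=3, n_init=10, random_state=42)
gm.fit(X)

print(gm.weights_)        # mixing weights (the phi values)
print(gm.means_)          # per-cluster means
print(gm.covariances_)    # per-cluster covariance matrices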
Once the model has been fitted, the predict() method assigns each instance to a cluster (hard clustering), while the predict_proba() method returns the probability that the instance belongs to each cluster (soft clustering):
>>> gm.predict(X)
>>> gm.predict_proba(X)
It is a generative model, meaning we can actually sample new instances from it:
>>> X_new, y_new = gm.sample(6)
>>> X_new
>>> y_new
It is also possible to estimate the density of the model at any given location. This is
achieved using the score_samples() method: for each instance it is given, this method
estimates the log of the probability density function (PDF) at that location. The greater the
score, the higher the density:
>>> gm.score_samples(X)
If we compute the exponential of these scores, we get the value of the PDF at the
location of the given instances. These are not probabilities, but probability densities: they can
take on any positive value, not just between 0 and 1. To estimate the probability that an
instance will fall within a particular region, we would have to integrate the PDF over that
region.
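Continuing with the fitted model above, a one-line sketch of this exponentiation:

import numpy as np

pdf = np.exp(gm.score_samples(X))   # probability densities, not probabilities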
DIMENSIONALITY REDUCTION
Dimensionality reduction is the technique of reducing the number of input features (dimensions) in a dataset while retaining as much of the meaningful information as possible. It helps to overcome the curse of dimensionality, reduces storage and computation cost, and makes the data easier to visualize.
Disadvantages of Dimensionality Reduction:
Some data may be lost due to dimensionality reduction.
In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not known in advance.
Ridge Regression, etc.
consider those variables or features that show a high correlation with the target
variable.
7. Random Forest: Random Forest is a popular and very useful feature selection algorithm in machine learning. In this technique, we generate a large set of trees against the target variable and, with the help of the usage statistics of each attribute, find the subset of the most important features.
8. Factor Analysis: Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables; that is, variables within a group can have a high correlation among themselves but a low correlation with the variables of other groups.
9. Auto-Encoder: One of the popular methods of dimensionality reduction is the auto-encoder. In this, the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts:
Encoder: The function of the encoder is to compress the input to form the latent-space representation.
Decoder: The function of the decoder is to recreate the output from the latent-space representation.
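A minimal sketch of such an autoencoder is shown below; the 3-D toy data, the 2-D latent space, and the use of Keras are assumptions for illustration only.

import numpy as np
from tensorflow import keras

X_train = np.random.rand(100, 3)   # toy 3-D data (assumed)

encoder = keras.Sequential([keras.layers.Dense(2, input_shape=[3])])   # compress to 2-D
decoder = keras.Sequential([keras.layers.Dense(3, input_shape=[2])])   # reconstruct 3-D
autoencoder = keras.Sequential([encoder, decoder])

autoencoder.compile(loss="mse", optimizer="adam")
autoencoder.fit(X_train, X_train, epochs=20, verbose=0)  # target = input

codings = encoder.predict(X_train)  # the latent-space representation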
PCA Algorithm:
Step 1: Get the dataset and determine the number of features (n) and the number of samples (N).
Step 2: Calculate the mean of every feature.
Step 3: Calculate the covariance matrix of the whole dataset.
Step 4: Calculate the eigen values, eigen vectors and normalized eigen vectors of the covariance matrix.
Step 5: Derive the new dataset by projecting the data onto the normalized eigen vector(s) corresponding to the largest eigen value(s).
Example:
Given the data in Table, reduce the dimension from 2 to 1 using the Principal
Component Analysis (PCA) algorithm.
Step 1:
No. of features, n=2
No. of samples, N=4
(Figure: scatter plot of the given data points.)
Step 2: Calculate the mean of X1 and X2.
Step 3: Calculate the covariance matrix of the dataset.
Step 4: Calculate the Eigen Values, Eigen Vectors & Normalized Eigen Vector of the
Covariance Matrix
Step 5: Deriving New Dataset
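Since the numerical working is not reproduced in this excerpt, the sketch below shows the same steps in NumPy on an assumed 2-feature, 4-sample dataset (the values are illustrative, not necessarily those of the original table).

import numpy as np

X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]], dtype=float)  # assumed data (n=2, N=4)

mean = X.mean(axis=0)                     # Step 2: mean of each feature
cov = np.cov((X - mean).T)                # Step 3: covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)  # Step 4: eigen values / eigen vectors

# Step 5: project onto the eigenvector with the largest eigenvalue (2-D -> 1-D)
pc1 = eig_vecs[:, np.argmax(eig_vals)]
X_reduced = (X - mean) @ pc1
print(X_reduced)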
USING SCIKIT-LEARN
Scikit-Learn is a free software machine learning library for the Python programming
language. It features various classification, regression and clustering algorithms including
support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is
designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Scikit-Learn’s PCA class implements PCA using SVD decomposition just like we did
before. The following code applies PCA to reduce the dimensionality of the dataset down to
two dimensions:
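The code itself is not reproduced in this excerpt; a minimal sketch of what it would look like (the toy dataset is an assumption):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)             # toy 3-D dataset (assumed)

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)             # project onto the first two principal components

print(pca.explained_variance_ratio_)   # variance explained by each component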
RANDOMIZED PCA
If we set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a
stochastic algorithm called Randomized PCA that quickly finds an approximation of the first
d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³) for the full SVD approach, so it is dramatically faster than full SVD when d is much smaller than n:
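A sketch of the corresponding code, continuing with the dataset X from the previous example (d = 2 components is an illustrative choice):

from sklearn.decomposition import PCA

rnd_pca = PCA(n_components=2, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)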
KERNEL PCA
Kernel PCA is based on the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines.
A linear decision boundary in the high-dimensional feature space corresponds to a
complex nonlinear decision boundary in the original space. It turns out that the same trick can
be applied to PCA, making it possible to perform complex nonlinear projections for
dimensionality reduction. This is called Kernel PCA. It is often good at preserving clusters of
instances after projection, or sometimes even unrolling datasets that lie close to a twisted
manifold.
For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel:
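The code is not reproduced in this excerpt; a minimal sketch, where n_components = 2 and gamma = 0.04 are illustrative values:

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)   # X: the dataset from the earlier examples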