Clustering
Classifications of clustering
Clustering is classified into several types based on the approach used to group the data. These methods
differ in how they define clusters and the assumptions they make about the underlying data.
1. Partitioning Clustering
Partitioning methods divide the dataset into a predefined number of clusters, k, such that each data
point belongs to exactly one cluster.
Key Algorithms:
● K-Means Clustering:
o The most popular partitioning method.
o It tries to minimize the sum of squared distances between the data points and the centroid of
each cluster.
o The user specifies the number of clusters (k), and the algorithm iteratively assigns points to
the nearest cluster centroid.
● K-Medoids (PAM - Partitioning Around Medoids):
o Similar to K-means but uses medoids (actual data points) instead of centroids.
o It is more robust to outliers because it uses representative points from the dataset rather than
calculating a centroid.
Pros:
● Simple to implement and computationally efficient, so it scales to large datasets.
● Works well when clusters are compact and roughly spherical.
Cons:
● The number of clusters k must be specified in advance.
● Sensitive to the initial centroids and to outliers (especially K-means), and struggles with non-convex cluster shapes.
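As a quick illustration of partitioning clustering, the sketch below runs K-means with scikit-learn. It is only a sketch: the six toy points and the choice of k = 2 are assumptions made for the example, not data from these notes.

# A minimal K-means run with scikit-learn on made-up 2-D points (k = 2 assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])   # two obvious groups along the x-axis

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # centroid (mean) of each cluster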
2. Hierarchical Clustering
Hierarchical clustering builds a multilevel hierarchy of clusters either by merging smaller clusters
(agglomerative approach) or by splitting larger clusters (divisive approach). The result is a tree-like
structure called a dendrogram.
Pros:
● Does not require the number of clusters to be specified in advance; the dendrogram can be cut at any level.
● The dendrogram gives an interpretable picture of the cluster structure.
Cons:
● Computationally expensive for large datasets.
● Merges (or splits) cannot be undone once made, so early mistakes propagate.
3. Density-Based Clustering
In density-based clustering, clusters are defined as regions of high density separated by regions of low
density. This method is particularly useful for discovering clusters of arbitrary shape and for handling
noise in the data.
Key Algorithm:
● DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together points that lie in dense neighbourhoods (defined by a radius eps and a minimum number of points) and labels isolated points as noise.
Pros:
● Finds clusters of arbitrary shape and does not require the number of clusters in advance.
● Robust to noise and outliers.
Cons:
● Sensitive to the choice of the density parameters (eps and minimum points).
● Struggles when clusters have very different densities.
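A minimal DBSCAN sketch with scikit-learn is shown below; the toy points and the eps and min_samples values are illustrative assumptions.

# Minimal DBSCAN example with scikit-learn (toy data; eps and min_samples chosen arbitrarily).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])   # the last point is an isolated outlier

labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)   # noise points are labelled -1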
4. Grid-Based Clustering
Grid-based clustering divides the data space into a finite number of cells that form a grid structure, and the
clustering operations are performed on these grid cells.
Key Algorithm:
● STING (Statistical Information Grid): Divides the data space into hierarchical rectangular cells and clusters using statistical information stored in the cells.
Pros:
● Fast, since the clustering operations are performed on grid cells rather than on individual points.
Cons:
● Grid size is fixed, which might affect cluster quality.
● Does not perform well for high-dimensional data.
5. Model-Based Clustering
Model-based clustering assumes that the data is generated by a mixture of underlying probability
distributions, typically Gaussian distributions. Each cluster corresponds to a different probability
distribution, and the algorithm estimates the parameters of these distributions.
Key Algorithms:
● Gaussian Mixture Models (GMM): Model the data as a mixture of Gaussian distributions whose parameters are estimated with the Expectation-Maximization (EM) algorithm.
Pros:
● Provides soft (probabilistic) cluster assignments within a principled statistical framework.
Cons:
● Assumes a particular distributional form for the clusters and can be sensitive to initialization.
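A minimal model-based clustering sketch using scikit-learn's Gaussian mixture model is given below; the toy points and n_components = 2 are illustrative assumptions.

# Minimal Gaussian mixture (model-based clustering) example with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))        # hard cluster labels
print(gmm.predict_proba(X))  # per-cluster membership probabilities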
6. Fuzzy Clustering
Fuzzy clustering allows each data point to belong to multiple clusters with varying degrees of
membership. Unlike hard clustering, where each point belongs to one cluster, fuzzy clustering allows for
uncertainty in the cluster assignments.
Key Algorithm:
● Fuzzy C-Means (FCM): An extension of K-means in which each point has a degree of membership in every cluster, and centroids are computed as membership-weighted averages.
Pros:
● Useful in situations where data points can naturally belong to more than one cluster.
● Provides more nuanced clustering results.
Cons:
● Computationally more expensive than hard clustering, and both the fuzzifier parameter and the number of clusters must be chosen.
7. Subspace Clustering
In high-dimensional datasets, clusters may only exist in a subset of dimensions, making traditional
clustering methods less effective. Subspace clustering addresses this issue by identifying clusters in
lower-dimensional subspaces of the data.
Key Algorithms:
● CLIQUE (Clustering In QUEst): A grid- and density-based method that finds dense regions in subspaces of the data.
Pros:
● Can discover clusters that are invisible when all dimensions are considered at once.
Cons:
● The number of possible subspaces grows rapidly with dimensionality, which makes the search expensive.
Correlation and Distance Measures in Clustering
1. Correlation in Clustering
Correlation measures the relationship between variables, reflecting whether changes in one variable lead to
predictable changes in another.
In clustering, correlation is used when the goal is to group data based on the pattern or trend of variables. If
two variables are highly correlated, they likely belong to the same cluster.
● Pearson Correlation: Commonly used to measure linear relationships. Clustering algorithms like
hierarchical clustering can use correlation-based distances, where clusters are formed by grouping
variables with high correlation.
● Cosine Similarity: This measures the angle between two vectors (data points), often used when
clustering high-dimensional data like text or documents. If the angle is small (close to 0), the vectors
are considered highly similar and can be clustered together.
Application: Correlation is useful in cases like time series or gene expression data clustering, where you care
about how data points rise or fall together.
2. Distance in Clustering
Distance measures how far apart two data points are, often using spatial geometry. The smaller the distance,
the more similar the data points are, and they are likely to belong to the same cluster.
● Euclidean Distance: The most common distance measure, calculating the straight line between two
points in space. It is used in many clustering algorithms like K-means, where clusters form based on
proximity in space.
● Manhattan Distance: Measures the distance between two points along axes at right angles (like
walking in a city grid). It is useful when features are independent and vary greatly in magnitude.
● Mahalanobis Distance: This is used when the data has varying distributions; it adjusts the distance
based on the covariance of the data, making it ideal when the features are correlated.
Uses: Distance-based clustering (e.g., K-means or hierarchical clustering) is more effective when you're
interested in grouping data points that are spatially close together.
● Clustering in high-dimensional data may use similarity measures such as cosine similarity, which
capture correlation-like relationships rather than spatial proximity.
● Distances such as the Mahalanobis distance adjust for relationships (covariance) between
variables, leading to more meaningful clusters in data with complex interactions.
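The three distance measures can be computed with SciPy as follows; the two points and the covariance matrix here are illustrative assumptions.

# Euclidean, Manhattan, and Mahalanobis distances with SciPy (illustrative points).
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

cov = np.array([[2.0, 0.8],
                [0.8, 1.5]])          # assumed covariance of the two features
vi = np.linalg.inv(cov)               # Mahalanobis needs the inverse covariance matrix

print(euclidean(p, q))                # straight-line distance
print(cityblock(p, q))                # Manhattan ("city block") distance
print(mahalanobis(p, q, vi))          # distance adjusted for feature correlation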
Problem 1: Clustering with Pearson Correlation
Variable A: 3, 5, 7, 9
Variable B: 2, 4, 6, 8
Use Pearson correlation to determine if Variables A and B are highly correlated and should belong to the
same cluster. Solution: r = 1.
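The stated solution can be checked in one line with NumPy:

# Verifying Problem 1: Pearson correlation between Variables A and B.
import numpy as np

A = np.array([3, 5, 7, 9])
B = np.array([2, 4, 6, 8])
print(np.corrcoef(A, B)[0, 1])   # 1.0, so A and B are perfectly correlated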
Problem 2: Clustering with Euclidean Distance
Point A: (1, 2), Point B: (4, 6), Point C: (7, 1). Determine whether Points A and B or Points A and C should be
clustered together using Euclidean distance.
Problem 3: Clustering with Manhattan Distance
You are given three data points in a city grid: Point X: (1, 2), Point Y: (5, 5), Point Z: (3, 8). Cluster the
points using Manhattan distance.
Problem 4: Clustering with Mahalanobis Distance
Two data points A and B have two correlated features: Point A: (2, 3), Point B: (6, 7), with covariance matrix
Σ = [ 4  2
      2  3 ]
Compute the Mahalanobis distance between A and B.
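The arithmetic behind Problems 2–4 can be checked with SciPy; this is only a verification sketch, with the points and the covariance matrix taken from the problem statements above.

# Checking the distance computations behind Problems 2-4.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

# Problem 2: Euclidean distance
A, B, C = np.array([1, 2]), np.array([4, 6]), np.array([7, 1])
print(euclidean(A, B), euclidean(A, C))   # 5.0 vs about 6.08, so A and B cluster together

# Problem 3: Manhattan distance
X, Y, Z = np.array([1, 2]), np.array([5, 5]), np.array([3, 8])
print(cityblock(X, Y), cityblock(X, Z), cityblock(Y, Z))   # 7, 8, 5 -> Y and Z are closest

# Problem 4: Mahalanobis distance with the given covariance matrix
P, Q = np.array([2, 3]), np.array([6, 7])
cov = np.array([[4.0, 2.0], [2.0, 3.0]])
print(mahalanobis(P, Q, np.linalg.inv(cov)))   # sqrt(6), about 2.45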
K-means clustering
The K-means clustering algorithm is a popular unsupervised machine learning algorithm used to partition a
dataset into K clusters.
Algorithm:
● STEP 1. Initialization: Choose the number of clusters K and select K initial centroids (for example,
K randomly chosen data points).
● STEP 2. Assignment: Assign each data point to the cluster with the nearest centroid (typically
using Euclidean distance).
● STEP 3. Update Centroids: Once all data points are assigned to clusters, compute the new centroids
for each cluster by calculating the mean (average) of all the data points assigned to that cluster.
● STEP 4. Repeat: Repeat steps 2 and 3 until the centroids no longer change (or change very little),
or until a specified number of iterations is reached.
● STEP 5. Stopping Criteria: The algorithm converges when the centroids no longer change
or when the change in assignments becomes minimal.
● STEP 6. Visualization: Plot the data points clustered into K groups (for example, K = 3) using K-
means. Each colour represents a different cluster, and the crosses mark the centroids of the clusters.
Remark:
⮚ The algorithm minimizes the within-cluster variance (sum of squared distances from each
point to its cluster centroid).
⮚ K-means is sensitive to the initial placement of centroids, so using methods like k-means++
can help improve performance by initializing centroids more effectively.
⮚ The number of clusters K must be specified in advance, which can sometimes be a challenge.
Methods like the elbow method or silhouette score are used to choose the optimal K.
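The steps above can be sketched directly in NumPy. This is a bare-bones illustration, not a production implementation: the toy points, K = 2, and the simple "first K points" initialization are assumptions for the example (k-means++ would be preferable in practice, as noted in the remarks).

# Bare-bones K-means in NumPy, following STEPS 1-5 above (toy data, K = 2 assumed).
import numpy as np

def kmeans(X, k, max_iter=100):
    centroids = X[:k].astype(float).copy()                # STEP 1: initial centroids (first K points)
    for _ in range(max_iter):
        # STEP 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # STEP 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # STEPS 4-5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index of each point
print(centroids)   # final cluster centroids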
Problems:
1. Suppose we measure two variables X1 and X2 for four items A, B, C and D. The data are as follows:
Item   X1   X2
A
B      -1    1
C       1   -2
D      -3   -2
Use K-means to divide the items into two clusters, starting from the initial groups (AB) and (CD).
2. Use K-means clustering to cluster the following data into two groups. Assume the initial cluster
centroids are m1 = 4 and m2 = 11, and use Euclidean distance as the distance function:
{ 2, 4, 10, 12, 3, 20, 30, 11, 25 }.
3. Use the K-means clustering algorithm to divide the following points into two clusters:
X1: 1, 2, 2, 3, 4, 5
X2: 1, 1, 3, 2, 3, 5
Hierarchical Clustering
Algorithm:
Step 1. Initialization
● Start with each data point as a separate cluster. If you have N data points, initialize with N
clusters (each cluster containing one point).
Step 2. Compute the Distance Matrix
● Compute the distance (or similarity) between every pair of clusters. Use a distance metric like
Euclidean distance, Manhattan distance, or others depending on your data.
● Store the distances in a distance matrix.
Step 3. Merge the Closest Clusters
● Find the pair of clusters that are closest (have the smallest distance) and merge them into a
single cluster.
● This reduces the number of clusters by 1.
Step 4. Update the Distance Matrix
● After merging, update the distance matrix to reflect the new cluster distances.
● The distance between the new cluster and the remaining clusters is calculated using a linkage
criterion such as:
o Single Linkage (Minimum): Distance between two clusters is the minimum distance
between any pair of points in the two clusters.
o Complete Linkage (Maximum): Distance is the maximum distance between any pair
of points in the clusters.
o Average Linkage: Distance is the average of all pairwise distances between points in
the clusters.
o Centroid Linkage: Distance between the centroids (mean points) of the clusters.
Step 5. Repeat
● Repeat steps 3 and 4 until all data points are in a single cluster, or a predefined number of
clusters is reached.
Step 6. Build the Dendrogram
● During the merging process, keep track of the order in which clusters are merged.
● Construct a dendrogram (a tree-like diagram) that shows the hierarchical relationship
between clusters at different levels of similarity.
Step 7. Cut the Dendrogram
● To get a final clustering solution, you can "cut" the dendrogram at a specific height. This will
result in a specified number of clusters, depending on where the cut is made.
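A minimal sketch of the whole procedure using SciPy is shown below; the toy points, the choice of complete linkage, and the cut into two clusters are assumptions for illustration (matplotlib is needed only for the dendrogram plot).

# Agglomerative clustering with SciPy on made-up 2-D points (complete linkage assumed).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [9, 8]])

Z = linkage(X, method='complete')                  # merge history (Steps 2-5)
labels = fcluster(Z, t=2, criterion='maxclust')    # Step 7: cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                      # Step 6: draw the dendrogram
plt.show()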
Problems.
1. Consider the hypothetical distances between pairs of five objects as follows:
D = [  0   9   3   6  11
       9   0   7   5  10
       3   7   0   9   2
       6   5   9   0   8
      11  10   2   8   0 ]
Cluster the items using each of the following procedures: (a) the single linkage hierarchical
procedure, (b) the complete linkage hierarchical procedure. Draw the dendrograms and compare the
results in (a) and (b).
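For checking the hand-drawn dendrograms, the same problem can be run through SciPy; squareform converts the square distance matrix into the condensed form that linkage expects.

# Problem 1 with SciPy: single vs. complete linkage on the given distance matrix.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

condensed = squareform(D)                      # condensed (upper-triangular) distances
print(linkage(condensed, method='single'))     # merge order and heights for single linkage
print(linkage(condensed, method='complete'))   # merge order and heights for complete linkage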
2. Perform agglomerative hierarchical clustering with the complete-linkage technique on the points
{(1, 1), (1.5, 1.5), (5, 5), (3, 4), (4, 4), (3, 3.5)}.
Overlapping Clustering
This approach generally follows soft clustering methods like Fuzzy C-Means or Probabilistic Latent
Semantic Analysis (PLSA), where data points can belong to multiple clusters with varying degrees of
membership.
Step 2. Initialization
● Set Parameters: Choose the number of clusters k (may not be fixed if it’s adaptive), a
threshold for cluster assignment, and maximum iterations.
● Centroids: Randomly initialize k cluster centroids or select initial cluster centers based on a
heuristic (like k-means++ initialization).
Step 4. Membership Assignment
● For each data point, assign soft membership to each cluster based on similarity:
o Calculate the distance/similarity of the data point to each centroid.
o Convert the distance into a degree of membership to each cluster (e.g., using a
Gaussian function or a normalized similarity measure).
o Ensure that the membership for each data point across all clusters sums up to 1.
Step 6. Centroid Update
● Update the centroid of each cluster by calculating the weighted average of all points, weighted
by their degrees of membership in that cluster.
Step 7. Convergence Check
● Stopping Criteria: Check whether the centroids change by less than a defined threshold or whether
the maximum number of iterations is reached. If not, go back to Step 4.
Step 8. Final Cluster Assignment
● Assign each point to one or more clusters where its membership is above a certain threshold.
If the point has significant membership in multiple clusters, it belongs to those clusters
(allowing overlap).
Step 9. Output
● Final overlapping clusters, where each point may belong to multiple clusters based on its
degree of membership.
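A bare-bones Fuzzy C-Means sketch in NumPy is given below, following the soft-clustering steps above. It is only a sketch: the toy points, the number of clusters c = 2, the fuzzifier m = 2, and the 0.3 membership threshold are all assumptions for illustration, not values from these notes.

# Minimal Fuzzy C-Means in NumPy, following the overlapping-clustering steps above.
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)                         # memberships of each point sum to 1
    for _ in range(max_iter):
        um = u ** m
        centroids = (um.T @ X) / um.sum(axis=0)[:, None]      # membership-weighted average of points
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-10
        # standard FCM membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        new_u = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        delta = np.abs(new_u - u).max()
        u = new_u
        if delta < tol:                                       # stopping criterion: memberships stabilized
            break
    return u, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1],
              [3.0, 3.0]])                                    # the middle point sits between both groups
u, centroids = fuzzy_c_means(X, c=2)
print(np.round(u, 2))   # degree of membership of each point in each cluster
print(centroids)

# Overlapping assignment: a point joins every cluster where its membership exceeds a threshold.
overlapping = [np.where(row > 0.3)[0] for row in u]
print(overlapping)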