ML 5
Clustering Techniques
1. K-Means Clustering
2. K-Medoids Clustering
Similar to K-Means but uses actual data points (medoids) as cluster centers.
3. Hierarchical Clustering
Two approaches: agglomerative (bottom-up merging) and divisive (top-down splitting).
4. Density-Based Clustering (e.g., DBSCAN)
Groups densely packed points into clusters and labels points in sparse regions as noise.
5. Spectral Clustering
Uses the spectrum (eigenvalues) of similarity matrices to partition data.
Outlier Analysis
1. Isolation Forest
2. Local Outlier Factor (LOF)
Cluster Evaluation
1. Elbow Method: plots the sum of squared distances (inertia) against the number of clusters.
2. Extrinsic Methods
3. Intrinsic Methods
Explain the K-Means algorithm
K-Means Algorithm
K-Means is an unsupervised machine learning algorithm used for clustering data into k
distinct groups based on feature similarity. It minimizes the variance within clusters by
iteratively updating cluster centroids. Here's an explanation of the algorithm:
1. Initialize Centroids
Choose k and randomly select k points as the initial cluster centroids.
2. Assign Points to Clusters
For each data point in the dataset, compute the Euclidean distance to each centroid and assign the point to the nearest centroid.
3. Update Centroids
For each cluster, compute the mean of all points assigned to that cluster and use it as the new centroid.
4. Repeat
Repeat the assignment and update steps until the centroids stop changing significantly or a maximum number of iterations is reached.
5. Output
The algorithm produces k clusters, with each data point belonging to exactly one cluster.
Key Concepts
Distance Metric: Typically Euclidean distance, used to measure how close a point is to each centroid.
Centroid: The mean of all points assigned to a cluster; it represents the cluster's center.
Cluster Assignment: Each point is assigned to the cluster whose centroid is nearest.
Advantages
1. Simple to implement and computationally efficient.
Disadvantages
1. Sensitive to: the initial placement of centroids, the choice of k, and outliers.
2. Assumes clusters are roughly spherical and similar in size.
Applications
Customer segmentation
Image compression
Document clustering
K-Means Clustering
K-Means is an unsupervised machine learning algorithm used for dividing data into k
clusters. It groups similar data points into clusters and identifies the cluster centers
(centroids) to represent each group. The objective is to minimize the variance within each
cluster while maximizing the difference between clusters.
1. Initialize k Clusters
Decide the number of clusters (k) to divide the dataset into. For example, let's say k = 2 for simplicity.
2. Initialize Centroids
Randomly choose k points (or positions) as the initial centroids.
3. Assign Points to Clusters
Assign each data point to the nearest centroid based on the chosen distance metric (typically Euclidean distance).
4. Update Centroids
Compute the new centroid for each cluster by calculating the mean position of all points in the cluster.
5. Repeat
Reassign points to clusters and update centroids iteratively until the centroids stop changing significantly or a maximum number of iterations is reached.
Example
Dataset:
Suppose we have the following 2D data points: (1, 2), (2, 3), (6, 8), (7, 9).
Objective:
Cluster the data into k = 2 groups.
Steps:
1. Initialize Centroids:
Randomly select (1, 2) and (6, 8) as the initial centroids.
2. Assign Points to Clusters:
(1, 2) and (2, 3) are closer to the centroid (1, 2); (6, 8) and (7, 9) are both closer to (6, 8).
Result: Two clusters: {(1, 2), (2, 3)} and {(6, 8), (7, 9)}.
3. Update Centroids:
Compute the mean of each cluster:
Cluster 1 centroid: ((1 + 2)/2, (2 + 3)/2) = (1.5, 2.5)
Cluster 2 centroid: ((6 + 7)/2, (8 + 9)/2) = (6.5, 8.5)
4. Repeat:
Reassign points based on the updated centroids. If no further changes occur, the
algorithm stops.
Final Output:
Cluster 1: {(1, 2), (2, 3)} with centroid (1.5, 2.5). Cluster 2: {(6, 8), (7, 9)} with centroid (6.5, 8.5).
Key Concepts
Centroids: Represent the center of each cluster.
Convergence: When centroids stabilize and stop moving.
Applications
Market segmentation
Document clustering
Image compression
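For reference, here is a minimal Python sketch (assuming scikit-learn and NumPy are installed) that reproduces the worked example above; the cluster label numbering is arbitrary and may differ between runs.

```python
# K-Means on the four example points, clustered into k = 2 groups.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 3], [6, 8], [7, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)                 # e.g. [0 0 1 1]
print("Centroids:\n", kmeans.cluster_centers_)   # approx. (1.5, 2.5) and (6.5, 8.5)
```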
K-Means vs. K-Medoids
1. Outliers or Noise:
K-Means uses the mean to determine cluster centroids, which is sensitive to extreme values.
K-Medoids selects actual data points as centroids (medoids), making it more robust to outliers and noise.
2. Distance Metrics:
K-Medoids can work with arbitrary distance metrics, while K-Means relies on Euclidean distance.
K-Medoids Algorithm
K-Medoids is a clustering algorithm that minimizes the sum of dissimilarities between data
points and their cluster medoid (the most centrally located point within the cluster).
Steps in the K-Medoids Algorithm
1. Initialization:
Randomly select k data points from the dataset as the initial medoids.
2. Cluster Assignment:
Assign each data point to the nearest medoid based on a chosen distance metric
(e.g., Manhattan, Euclidean).
3. Update Medoids:
For each cluster, compute the total dissimilarity for every point in the cluster.
Replace the current medoid with the point that minimizes the total dissimilarity.
4. Repeat:
Repeat the cluster assignment and medoid update steps until the medoids stabilize
or a maximum number of iterations is reached.
5. Output:
The final k clusters, each represented by its medoid.
Key Concepts
Medoid:
A medoid is the most centrally located point in a cluster. It minimizes the total
distance to all other points in the cluster.
Dissimilarity Metric:
The algorithm can use any metric to measure distance (e.g., Euclidean, Manhattan,
or cosine).
Example
Dataset:
Points: A(2, 3), B(3, 4), C(5, 6), D(8, 8).
Suppose k = 2.
Steps:
1. Initialization:
Select A and D as the initial medoids.
2. Cluster Assignment:
Compute the distances of each point to the medoids and assign them to the nearest medoid:
Cluster 1 (medoid A): A, B.
Cluster 2 (medoid D): C, D.
3. Update Medoids:
Within each cluster, compute the total dissimilarity for every candidate point and keep the point with the lowest total as the new medoid (e.g., B could replace A in Cluster 1 if it gives an equal or lower total cost).
Advantages of K-Medoids
1. Robust to outliers and noise.
Disadvantages of K-Medoids
1. Computationally expensive for large datasets due to pairwise distance calculations.
Applications of K-Medoids
1. Market Segmentation: Analyzing customer preferences.
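Since scikit-learn has no built-in K-Medoids, here is a small NumPy-only sketch of the simple alternating variant (assign points, then re-pick medoids), not the full PAM swap procedure; the function name k_medoids and the Manhattan metric are illustrative choices.

```python
# A minimal alternating K-Medoids sketch on the example points A, B, C, D.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Manhattan distances (any dissimilarity metric could be used).
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)     # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # Pick the member with the smallest total distance to the rest.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):         # medoids stabilized
            break
        medoids = new_medoids
    return medoids, labels

X = np.array([[2, 3], [3, 4], [5, 6], [8, 8]])           # A, B, C, D
medoids, labels = k_medoids(X, k=2)
print("Medoid points:", X[medoids])
print("Labels:", labels)
```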
Hierarchical Clustering
Hierarchical clustering is a clustering technique that builds a hierarchy of clusters. It can be
visualized as a tree structure (dendrogram), where each node represents a cluster. Unlike K-
Means or K-Medoids, hierarchical clustering does not require specifying the number of
clusters upfront.
Agglomerative (bottom-up): Iteratively merges the closest clusters until all points form a single cluster or a stopping criterion is met.
Divisive (top-down): Starts with all points in one cluster and iteratively splits it into smaller clusters until each point becomes its own cluster.
Steps in Agglomerative Hierarchical Clustering
1. Initialize Clusters:
Treat each data point as its own cluster.
2. Compute the Distance Matrix:
Calculate the pairwise distances between all clusters.
3. Merge the Closest Clusters:
Combine the two clusters with the smallest distance into one cluster.
4. Update the Distance Matrix:
Recalculate distances between the new cluster and the remaining clusters using a linkage criterion (e.g., single-link, complete-link, average-link).
5. Repeat:
Continue merging clusters until only one cluster remains or a desired number of clusters is achieved.
Linkage Criteria
1. Single Linkage: the distance between two clusters is the minimum distance between any two points in the clusters.
2. Complete Linkage: the distance between two clusters is the maximum distance between any two points in the clusters.
3. Average Linkage: the distance between two clusters is the average of all pairwise distances between their points.
4. Centroid Linkage: the distance between two clusters is the distance between their centroids.
Example
Dataset:
Points: A(1, 2), B(2, 3), C(6, 7), D(7, 8).
Steps:
1. Initialize Clusters: Each point is its own cluster: {A}, {B}, {C}, {D}.
2. Compute the Distance Matrix (AB denotes the distance between A and B):
     A    B    C    D
A    −    AB   AC   AD
B    AB   −    BC   BD
C    AC   BC   −    CD
D    AD   BD   CD   −
3. Merge Closest Clusters: Combine {A} and {B} since their distance is smallest:
{A, B}, {C}, {D}.
4. Update Distance Matrix: Calculate new distances based on the linkage criterion.
5. Repeat: Continue merging the closest clusters until a single cluster or desired number of
clusters is formed.
6. Dendrogram: Plot a dendrogram to visualize the merging process and decide the number of clusters by cutting it at a specific level.
Advantages
1. No need to predefine the number of clusters.
Disadvantages
1. Computationally expensive for large datasets due to distance matrix calculation.
2. Sensitive to noise and outliers.
3. Results depend on the chosen linkage criterion.
Applications
1. Gene expression analysis in biology.
2. Customer segmentation.
3. Document clustering.
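A brief sketch (assuming scikit-learn) of agglomerative hierarchical clustering on the example points A(1, 2), B(2, 3), C(6, 7), D(7, 8), using average linkage and cutting the hierarchy at two clusters.

```python
# Agglomerative clustering on the four example points.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [2, 3], [6, 7], [7, 8]])   # A, B, C, D

agg = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = agg.fit_predict(X)
print("Cluster labels:", labels)   # A and B end up together, as do C and D
```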
Feature: Hierarchical Clustering vs. K-Means Clustering
Initialization: Hierarchical clustering does not require initial cluster centroids; K-Means requires k initial centroids (random or user-defined).
Number of Clusters: Hierarchical clustering can decide the number dynamically by cutting the dendrogram; K-Means requires predefining the number of clusters (k).
Distance Metrics: Hierarchical clustering supports multiple distance metrics (e.g., Euclidean, Manhattan); K-Means primarily uses Euclidean distance.
Scalability: Hierarchical clustering is computationally expensive for large datasets (O(n³)); K-Means is efficient for large datasets (O(n · k · i), where i is the number of iterations).
Robustness to Noise: Hierarchical clustering is sensitive to noise and outliers; K-Means is sensitive to outliers, which can skew centroids.
Flexibility: Hierarchical clustering can work with non-Euclidean distances and hierarchical structures; K-Means works best with Euclidean distance and flat cluster structures.
Usage Scenario: Hierarchical clustering suits small to medium datasets where hierarchy or relationships are important; K-Means suits large datasets where the number of clusters is predefined.
Hierarchical Clustering:
Advantages: no need to predefine the number of clusters; the dendrogram shows relationships between clusters at every level.
Disadvantages: computationally expensive (O(n³)); sensitive to noise and to the chosen linkage criterion.
K-Means Clustering:
Advantages: simple, fast, and scalable to large datasets.
Disadvantages: requires k in advance; sensitive to initialization and outliers; assumes roughly spherical clusters.
Applications
Key Takeaway
Use Hierarchical Clustering for smaller datasets where relationships between clusters
are important.
Use K-Means for larger datasets where speed and simplicity are essential.
Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters and can be divided into two types:
1. Agglomerative (bottom-up): starts with each point as its own cluster and repeatedly merges the closest clusters; continues until all points are merged into one cluster or a stopping condition is met.
2. Divisive (top-down): starts with all points in one cluster and iteratively splits clusters into smaller clusters until each point becomes its own cluster or a desired number of clusters is achieved.
Key Features:
Does not require specifying the number of clusters in advance.
Advantages: the dendrogram exposes the full cluster hierarchy and works with many distance metrics.
Disadvantages: O(n³) complexity makes it slow for large datasets; results depend on the linkage criterion.
How it Works:
1. Core Points: Points with a minimum number of neighbors (density threshold) within a
specified radius (ϵ).
2. Border Points: Points within the radius of a core point but not dense enough to be core
points themselves.
3. Noise Points: Points that are not core or border points (outliers).
Steps in DBSCAN:
1. Pick an unvisited point and check whether it is a core point; if so, start a new cluster.
2. Expand the cluster by adding all reachable points.
3. Label points that are not reachable from any core point as noise.
Key Features:
Does not require the number of clusters in advance; clusters may take arbitrary shapes.
Advantages:
Detects clusters of arbitrary shape and size and labels outliers as noise.
Disadvantages:
Struggles with datasets of varying densities; results are sensitive to the choice of ϵ and MinPts.
Comparison
Feature Hierarchical Clustering Density-Based Clustering
Scalability Less scalable for large datasets ( More scalable for dense data
3
O(n )). distributions.
Application When hierarchy matters (e.g., When dealing with irregularly shaped
Scenario phylogenetics). clusters or noise.
Applications
Hierarchical Clustering: gene expression analysis, customer segmentation, document clustering.
Density-Based Clustering: spatial data analysis (e.g., geospatial mapping, earthquake detection), anomaly detection, image processing.
Key Concepts:
1. Core Points:
A point is a core point if it has at least MinPts neighbors within a given radius (ϵ).
2. Border Points:
A point is a border point if it lies within the radius (ϵ) of a core point but does not
have enough neighbors to be a core point itself.
3. Noise Points:
Points that are neither core points nor border points are considered outliers or
noise.
Steps in the DBSCAN Algorithm
1. Step 1: Select an Unvisited Point
Count its neighbors within the radius ϵ.
2. Step 2: Check the Core Condition
If it has at least MinPts neighbors, it's a core point, and a new cluster is formed.
If not, mark the point as noise for now; it may later join a cluster as a border point.
3. Step 3: Form a Cluster
Assign the core point and its ϵ-neighbors to a new cluster.
4. Step 4: Expand the Cluster
Add all directly reachable points from the core point. Repeat for new core points until the cluster cannot grow further.
5. Step 5: Repeat
Move to the next unvisited point and repeat until all points are visited.
Example
Dataset:
Points in 2D space: A(1, 2), B(2, 3), C(5, 6), D(8, 8), E(9, 9).
Parameters:
ϵ = 2 (radius).
MinPts = 3 (minimum points required for a cluster).
Steps:
1. Identify Core Points:
With ϵ = 2 and MinPts = 3, no point has at least 3 points (including itself) within its radius, so no core points are found.
2. Cluster Formation:
Since there are no core points, all five points are labeled as noise. With a smaller MinPts (e.g., 2), {A, B} and {D, E} would form clusters and C would remain noise.
Applications of DBSCAN
1. Geospatial Analysis: grouping spatial data such as geospatial maps or earthquake epicenters.
2. Image Processing:
3. Anomaly Detection:
4. Astronomy:
Advantages of DBSCAN
1. Detects clusters of arbitrary shapes and sizes.
Disadvantages of DBSCAN
1. Struggles with datasets of varying densities.
Density-Based Clustering
Density-based clustering identifies clusters as dense regions separated by sparse areas in
the dataset. It is particularly useful for discovering clusters of arbitrary shapes and sizes and
for detecting outliers.
Overview:
DBSCAN groups points that are closely packed together (dense regions) and marks points in
sparse regions as outliers.
Key Parameters:
ϵ (the neighborhood radius) and MinPts (the minimum number of points required to form a dense region).
Steps:
1. Classify points: Core Points have at least MinPts neighbors within ϵ; Border Points lie within the ϵ-radius of a core point but have fewer than MinPts neighbors; the remaining points are Noise.
2. Expand clusters by connecting core points and their neighbors recursively.
Strengths:
Finds clusters of arbitrary shape and explicitly labels outliers as noise.
Weaknesses:
Struggles when cluster densities vary; results depend on the choice of ϵ and MinPts.
Overview:
OPTICS is an extension of DBSCAN that can handle datasets with varying densities. It creates
an ordering of points based on their density-reachability rather than directly assigning
clusters.
Key Concepts:
Core distance (the smallest radius that makes a point a core point) and reachability distance (how easily a point can be reached from a core point).
Steps:
1. For each point, compute its core distance and the reachability distances to its neighbors.
2. Produce an ordering of the points based on reachability, recording each point's reachability distance.
3. Extract clusters by identifying valleys in the reachability plot, representing dense regions.
Advantages:
Handles clusters of varying densities and does not rely on a single global density threshold.
Disadvantages:
Computationally expensive.
Requires manual interpretation of the reachability plot.
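As an illustration (assuming scikit-learn), the sketch below runs OPTICS on two blobs of different density and prints the reachability values in cluster order; in practice the reachability plot would be inspected visually for valleys. The blob parameters are arbitrary.

```python
# OPTICS ordering and reachability on two blobs of different density.
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # dense blob
               rng.normal(5, 1.0, (50, 2))])   # sparser blob

optics = OPTICS(min_samples=5).fit(X)
reachability = optics.reachability_[optics.ordering_]   # ordered reachability values
print("Sample reachability distances:", np.round(reachability[1:6], 3))
print("Cluster labels found:", np.unique(optics.labels_))
```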
Overview:
DENCLUE (DENsity-based CLUstEring) forms clusters based on the influence of data points
using mathematical density functions.
Key Concepts:
1. Density Function: Uses a kernel density function (e.g., Gaussian) to estimate data
density.
2. Attractive Regions: Regions with high density where clusters are formed.
Steps:
1. Estimate the overall density using the kernel density function.
2. Move each point along the density gradient toward its local density maximum (attractor).
3. Group points that converge to the same attractor into one cluster.
Advantages:
Has a solid mathematical foundation and can describe clusters of arbitrary shape compactly.
Disadvantages:
Sensitive to the choice of kernel and its bandwidth parameters.
Applications
DBSCAN: Spatial data analysis (e.g., geospatial mapping, earthquake detection).
1. Key Concepts
Density:
Clusters are formed by grouping points that are densely packed together. The density is
defined by:
A radius (ϵ): the neighborhood distance considered around each point.
A minimum number of points (MinPts): The minimum number of neighbors within ϵ to qualify as a dense region.
Point Classification:
Core Point: A point with at least MinPts neighbors within its ϵ-radius.
Border Point: A point that is not dense enough to be a core point but lies within ϵ-radius
of a core point.
Noise (Outlier): A point that is neither a core point nor a border point and does not
belong to any cluster.
Steps in the DBSCAN Algorithm
Step 1: Select an Unvisited Point
Count the neighbors within its ϵ-radius.
Step 2: Check the Core Condition
If the point has at least MinPts neighbors, it becomes a core point and starts forming a cluster.
If not, mark it as noise (for now). It may later become part of a cluster as a border point.
Step 3: Expand the Cluster
1. Add all points in the core point's ϵ-neighborhood to the cluster.
2. Recursively check each point in the neighborhood. If they are core points, expand the cluster further by including their neighbors.
3. Continue this process until no more points can be added to the cluster.
Step 4: Repeat
Move to the next unvisited point and repeat the process to form new clusters or mark
noise.
Example of Cluster Formation
Dataset:
Points:
P1(1, 2), P2(2, 3), P3(5, 5), P4(6, 6), P5(10, 10)
Parameters:
ϵ = 2 (radius).
MinPts = 3.
Steps:
1. Start with P1: its only neighbor within ϵ = 2 is P2, so it does not meet MinPts = 3 and is not a core point.
2. Move to P3: its only neighbor within ϵ = 2 is P4, so it is not a core point either.
Since no core points are found, the algorithm labels all points as noise.
Now suppose MinPts = 2:
P1 and P2 form a cluster because P1 is now a core point, and P2 is a border point.
P3 and P4 form another cluster.
P5 remains noise.
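The same example can be checked with scikit-learn's DBSCAN (a sketch, assuming scikit-learn is installed); note that its min_samples parameter counts the point itself, which matches MinPts = 2 above.

```python
# DBSCAN on P1..P5 with eps = 2 and min_samples = 2.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 3], [5, 5], [6, 6], [10, 10]])  # P1..P5

db = DBSCAN(eps=2, min_samples=2).fit(X)
print("Labels:", db.labels_)   # expected: [0 0 1 1 -1]; -1 marks P5 as noise
```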
Related density-based methods:
1. OPTICS:
Orders points by their reachability distances. Clusters are formed by identifying regions with low reachability distances.
2. DENCLUE:
Forms clusters by following the gradient of density functions. Points move towards
local density maxima, which act as attractors.
Key Concepts
1. Similarity Graph:
Represents the data points as a graph where:
Nodes correspond to individual data points.
Edges are weighted by the pairwise similarity between points.
2. Graph Laplacian:
A matrix representation of the graph that encodes connectivity and similarity between
points.
A common choice of edge weight is the Gaussian similarity w_ij = exp(−||x_i − x_j||² / (2σ²)).
Normalized Laplacian: L_norm = D^(−1/2) (D − W) D^(−1/2), where W is the weighted adjacency (similarity) matrix and D is the diagonal degree matrix.
Step 4: Cluster in the Spectral Embedding
Use k-Means (or another clustering algorithm) to cluster the rows of the matrix formed by the first k eigenvectors.
Step 5: Assign Clusters
Assign each original data point to a cluster based on the results of k -Means.
Example
Dataset:
Steps:
1. Similarity Matrix:
2. Graph Laplacian:
Compute the normalized Laplacian L_norm.
3. Eigenvectors:
Advantages
1. Can identify clusters of arbitrary shape.
Disadvantages
1. Computationally expensive for large datasets (computing eigenvectors).
Applications
1. Image segmentation.
2. Community detection in social networks.
3. Document clustering.
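A minimal sketch (assuming scikit-learn) of spectral clustering on two concentric circles, a non-convex shape that K-Means cannot separate; the nearest-neighbors affinity builds the similarity graph described above, and the neighbor count is an arbitrary choice.

```python
# Spectral clustering on two concentric circles.
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print("First few labels:", labels[:10])
```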
Aspect: K-Means Clustering vs. Spectral Clustering
Basic Idea: K-Means divides data into k clusters by minimizing the distance between data points and cluster centroids; spectral clustering uses the eigenvalues and eigenvectors of a similarity matrix to perform dimensionality reduction, then clusters the data in the reduced space.
Type of Clusters: K-Means assumes clusters are spherical and separable by centroids; spectral clustering can handle clusters of arbitrary shape and size.
Input Data: K-Means operates directly on raw feature vectors; spectral clustering operates on a similarity matrix or graph representation of the data.
Distance Metric: K-Means relies on Euclidean distance for optimization; spectral clustering relies on a user-defined similarity metric for graph construction.
Outlier Handling: K-Means does not inherently account for outliers, which can distort the centroids; spectral clustering can mitigate outliers through the graph construction.
Applications: K-Means suits simple clustering tasks like market segmentation, customer clustering, and grouping numerical data; spectral clustering suits complex tasks like image segmentation, graph-based clustering, and non-convex data structures.
Key Takeaways
1. K-Means is simple, fast, and effective for well-separated or spherical clusters but
struggles with complex data structures.
2. Spectral Clustering is more flexible and powerful for clustering arbitrary shapes but is
computationally demanding and relies on graph-based representations.
Characteristics of Outliers
1. Deviance from the Pattern: Outliers are significantly different from other data points in
the dataset.
2. Influence on Analysis: Outliers can skew statistical measures like mean and standard
deviation.
Types of Outliers
1. Global Outliers:
Data points that deviate significantly from the entire dataset.
Example: In a temperature dataset, a reading of 60°C when most are between 20°C
and 30°C.
2. Contextual Outliers:
Points that are outliers in a specific context but not globally.
3. Collective Outliers:
A subset of data points that are anomalous as a group but not individually.
Techniques for Outlier Analysis
1. Statistical Methods:
Z-Score Analysis: Points with a Z-score beyond a threshold (e.g., ±3) are outliers.
Tukey’s Fences: Uses interquartile range (IQR) to define bounds for detecting
outliers.
2. Machine Learning:
Supervised Learning: Training models with labeled data for anomalies (e.g., fraud
detection).
3. Distance-Based Methods:
Points that lie far from their nearest neighbors (e.g., beyond a chosen distance threshold) are flagged as outliers.
4. Density-Based Methods:
Example: Local Outlier Factor (LOF): Compares the density of a point with its
neighbors.
5. Isolation-Based Methods:
Example: Isolation Forest, which isolates outliers using only a few random partitions.
6. Domain-Specific Rules:
Thresholds or rules defined by domain experts (e.g., a maximum plausible value for a sensor reading).
Challenges in Outlier Analysis
Noise vs. Outliers: Distinguishing true anomalies from random noise can be complex.
Conclusion
Outlier analysis is crucial in understanding, cleaning, and interpreting data. Whether
identifying rare events or ensuring data integrity, it forms the backbone of anomaly
detection across various fields.
2. Border Points: Points that are on the boundary of clusters, often falling between dense
regions of the dataset. They may appear as outliers when the clusters are not well-
separated.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) inherently
detects and labels outliers as noise. It uses parameters like the radius (ϵ) and
minimum number of points (MinPts) to identify core points and outliers.
Points that do not meet the minimum density requirement are classified as noise or
outliers.
One approach to handle outliers is to use K-Medoids, where centroids are actual
data points and are less sensitive to outliers.
3. Isolation Forests:
A machine learning technique used for outlier detection that can be applied before
clustering. It isolates outliers by randomly partitioning the data, making it faster
than traditional distance-based methods.
Outliers can be detected and removed from the dataset before applying clustering
algorithms. Techniques such as Z-score, IQR (Interquartile Range), or Local Outlier
Factor (LOF) can help detect outliers and reduce their influence.
Cluster Misidentification: Outliers can alter the overall shape and distribution of
clusters, leading to inaccurate identification of the true cluster boundaries.
Reduced Accuracy: The presence of outliers can reduce the accuracy of clustering
models, as the algorithm may wrongly assign outliers to existing clusters or fail to form
meaningful clusters.
Conclusion
Outlier analysis in clustering is essential to ensure that the clustering algorithm accurately
identifies true patterns in the data. By detecting and handling outliers properly, the
clustering results are more reliable, leading to better insights and decision-making.
Outlier Analysis
Outlier analysis is the process of identifying data points in a dataset that significantly
deviate from the general pattern or distribution of the rest of the data. These points, known
as outliers, can be due to errors, anomalies, or rare but interesting phenomena. Outliers can
impact statistical models, skew results, and lead to incorrect conclusions, so detecting and
handling them appropriately is critical in data analysis and machine learning.
1. Global Outliers: Points that deviate significantly from the entire dataset (e.g., extreme
temperature values in a weather dataset).
2. Contextual Outliers: Points that are normal in some contexts but unusual in others (e.g.,
high temperatures during winter).
3. Collective Outliers: Groups of points that together form an anomaly, though individual
points may not appear unusual.
1. Distance Calculation:
Compute the distance between each data point and its neighbors (using a distance
metric like Euclidean distance).
2. Reachability Distance:
For each point p, compute its reachability distance from another point q as the maximum of the k-distance of q and the actual distance between p and q:
reach-dist_k(p, q) = max(k-distance(q), d(p, q))
3. Local Reachability Density (LRD):
The local reachability density of a point p is the inverse of the average reachability distance of p to its k-nearest neighbors:
LRD(p) = 1 / ( (1 / |Nk(p)|) · Σ_{q ∈ Nk(p)} reach-dist(p, q) )
4. LOF Score:
The LOF score of a point p compares its local density to the average local density of its neighbors. It is computed as:
LOF(p) = ( Σ_{q ∈ Nk(p)} LRD(q) / LRD(p) ) / |Nk(p)|
Example:
Consider a 2D dataset where most points are clustered around certain areas, but a few
points are far away from the cluster. LOF will assign high LOF scores to those far-away points,
indicating they are outliers.
Advantages of LOF
Local Sensitivity: LOF can detect outliers that are not globally abnormal but are outliers
in their local neighborhood. This is useful for data with varying density.
No Assumption of Distribution: Unlike statistical methods, LOF does not require the
data to follow a specific distribution (e.g., Gaussian).
Applications of LOF
Fraud Detection: Identifying anomalous transactions that do not conform to normal
behavior.
Anomaly Detection: In fields like sensor data analysis, medical diagnostics, and image
processing.
LOF is particularly effective in datasets where the distribution is not uniform, and outliers
may not necessarily be the most distant points but may still be isolated in a dense region.
The core idea behind the Isolation Forest algorithm is that outliers are easier to isolate than
normal data points because they are different from the majority of the data. In other words,
outliers have fewer neighbors and are more likely to be separated with fewer partitioning
steps compared to normal points.
How the Isolation Forest Model Works
1. Recursive Partitioning:
Each isolation tree repeatedly selects a random feature and a random split value, partitioning the data until points are isolated in their own leaves.
2. Isolation Score:
The isolation score of a point is determined by how many splits (or decisions) it
takes to isolate that point.
Outliers tend to be isolated with fewer splits because they are different from
the rest of the data.
Normal points take more splits because they are surrounded by other similar
points.
The isolation score is computed as s(x, n) = 2^(−E(h(x)) / c(n)), where:
h(x) is the path length from the root to the leaf node for point x,
E(h(x)) is the expected (average) path length for point x over the trees,
c(n) is a normalization factor based on the number of data points n.
3. Anomaly Scoring:
The anomaly score is assigned based on the average path length across all the
trees:
If the score is close to 1, the point is likely an anomaly (it was isolated with few splits).
If the score is close to 0, the point is a normal point (it requires more splits to isolate).
4. Ensemble Method:
The scores are averaged over many isolation trees, which makes the anomaly estimate stable and reduces variance.
Steps in the Isolation Forest Algorithm
1. Build Multiple Isolation Trees:
Randomly select a feature and a random split value for each tree, and recursively
partition the data.
2. Compute Path Lengths:
For each data point, compute how "isolated" it is by counting the number of splits required to isolate it within the trees.
3. Flag Outliers:
Points with high anomaly scores are flagged as outliers, while those with low scores are considered normal.
Advantages of Isolation Forest
1. Efficiency:
Outliers are isolated after only a few splits, so the trees remain shallow and fast to build.
2. Scalability:
Works well on large and high-dimensional datasets, since each tree is built from random splits (often on subsamples).
3. No Assumptions:
Isolation Forest does not assume any distribution for the data, making it versatile for a wide range of problems.
Applications of Isolation Forest
1. Fraud Detection:
Identifying transactions that deviate from a customer's typical behavior.
2. Network Intrusion Detection:
Identifying abnormal network activity or intrusions that deviate from regular traffic patterns.
3. Anomaly Detection in Time Series:
Identifying unusual patterns in time-series data, such as sensor data or stock prices.
Conclusion
The Isolation Forest is an effective and efficient outlier detection technique that works well
on high-dimensional data. It is based on the intuitive principle that outliers are easier to
isolate, and it has become widely used due to its simplicity, scalability, and performance in
large datasets.
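A brief sketch (assuming scikit-learn and NumPy) of the Isolation Forest model; the synthetic data, contamination setting, and tree count are illustrative only.

```python
# Isolation Forest: isolated points receive lower (more negative) scores.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8, 8], [-9, 7]]])

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
labels = iso.fit_predict(X)        # -1 = outlier, 1 = normal
scores = iso.score_samples(X)      # lower score = more anomalous
print("Outlier indices:", np.where(labels == -1)[0])
print("Scores of the two appended points:", np.round(scores[-2:], 3))
```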
Local Outlier Factor (LOF)
LOF evaluates the local density of each data point relative to its neighbors. If a point is
surrounded by points that are much denser (i.e., have similar local density), it is considered
to be a normal point. However, if the point is surrounded by points with lower densities, it is
deemed an outlier.
1. k-Nearest Neighbors:
For each point, LOF identifies its k-nearest neighbors based on a distance metric, typically Euclidean distance.
2. Reachability Distance:
The reachability distance between two points is defined as the maximum of the
distance between the points and the k -distance of the neighbor. This ensures that
the local density is consistent for points with similar reachability distances.
3. Local Reachability Density (LRD):
The local reachability density of a point p is the inverse of the average reachability distance between p and its k-nearest neighbors.
4. LOF Score:
The LOF score of a point is calculated by comparing its LRD with the LRDs of its
neighbors. The LOF score is the average of the ratios of the LRD of each neighbor to
the LRD of the point:
LOF(p) = (1 / |Nk(p)|) · Σ_{q ∈ Nk(p)} LRD(q) / LRD(p)
Outliers are those with LOF > 1, meaning their local density is lower than their
neighbors.
Advantages of LOF
1. Detects Local Outliers:
LOF is particularly useful for identifying outliers in datasets with varying densities.
Unlike other methods (e.g., K-Means), which may fail to detect outliers in non-uniformly distributed data, LOF can detect anomalies in both high- and low-density regions.
2. No Assumption of Distribution:
LOF does not require any assumption about the data distribution, making it applicable to a wide range of datasets.
3. Scalability:
The LOF algorithm can handle large datasets efficiently, especially with optimizations to the nearest neighbor search.
4. Robustness:
LOF is robust in detecting outliers even when they are close to clusters or when the
dataset is noisy.
Disadvantages of LOF
1. Sensitivity to Parameters (k):
The results depend strongly on the choice of k (the number of neighbors); a k that is too small makes the scores noisy, while a k that is too large can mask local outliers.
2. Computationally Expensive:
LOF requires calculating the distances between points and their neighbors, which
can be computationally intensive for large datasets. The algorithm has a time
complexity of O(n²), making it slower for very large datasets unless optimizations
like KD-Trees are used.
4. Interpretability:
While LOF provides an outlier score, it may not always be easy to interpret why a
point is flagged as an outlier, especially in complex datasets with high-dimensional
features.
Applications of LOF
Fraud Detection:
Identifying fraudulent transactions that deviate from typical patterns of behavior.
Network Security:
Detecting network intrusions or anomalous behavior by identifying outlier access
patterns.
Healthcare:
Detecting rare medical conditions or abnormalities in patient data.
Conclusion
LOF is a powerful and flexible method for outlier detection, especially in datasets with
varying densities. It is well-suited for applications where data points do not follow a global
distribution and local density variations are important for identifying anomalies. However, its
sensitivity to the choice of k and computational cost can limit its application in very large or
high-dimensional datasets.
i) Optimization of Clusters
Optimization of clusters refers to improving the quality and accuracy of clustering results by
selecting the best configuration for the clusters based on certain criteria. This process
involves finding an optimal number of clusters, the right algorithm parameters, and
enhancing the separation between clusters.
Elbow Method: A popular method to optimize the number of clusters by plotting the
sum of squared distances (inertia) against the number of clusters. The optimal number
is usually at the "elbow" point, where the inertia starts to decrease at a slower rate.
Silhouette Score: Measures the quality of clusters by calculating how similar points
within a cluster are to each other and how distinct clusters are. Higher silhouette scores
indicate well-separated and dense clusters.
Cluster Validity Indexes: Various indexes (e.g., Davies-Bouldin index, Dunn index) help
assess the clustering results by comparing the intra-cluster distance and inter-cluster
distance, guiding the choice of the optimal cluster configuration.
Optimization ensures that the clusters are meaningful, with a good balance between
cohesion (points within a cluster) and separation (clusters from each other).
ii) K-Medoids
K-Medoids is a clustering algorithm similar to K-Means but uses actual data points as the
centroids (medoids) instead of the mean of points in a cluster. It is more robust to outliers
because the medoid is less affected by extreme points.
Working:
Each data point is assigned to the nearest medoid based on a distance metric
(commonly Euclidean or Manhattan distance).
For each cluster, a new medoid is selected as the point that minimizes the sum of
distances to all other points in the cluster.
Advantages:
Can work with non-continuous data, as it uses actual data points as medoids.
Disadvantages:
Computationally more expensive than K-Means (due to medoid calculation).
iii) Evaluation Metrics for Clustering
Silhouette Score: Measures how similar a point is to its own cluster compared to other clusters. Scores range from -1 (poor) to +1 (good).
Davies-Bouldin Index: Measures the average similarity ratio of each cluster with the
cluster most similar to it. Lower values indicate better clustering.
Adjusted Rand Index (ARI): Measures the similarity between two data clusterings,
adjusted for chance.
Evaluation metrics help determine the optimal number of clusters, cohesiveness of the
clusters, and how well-separated they are. These metrics guide the choice of the clustering
method and its parameters.
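For illustration (assuming scikit-learn), the sketch below computes the intrinsic metrics (Silhouette, Davies-Bouldin) and the extrinsic metric (ARI) mentioned above on a toy K-Means result; the blob dataset and parameter values are arbitrary.

```python
# Intrinsic and extrinsic clustering metrics on a synthetic dataset.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette score:    ", round(silhouette_score(X, labels), 3))        # intrinsic
print("Davies-Bouldin index:", round(davies_bouldin_score(X, labels), 3))    # intrinsic
print("Adjusted Rand index: ", round(adjusted_rand_score(y_true, labels), 3))  # extrinsic
```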
Agglomerative Hierarchical Clustering is a bottom-up approach for hierarchical clustering.
In this method, each data point is initially considered as its own individual cluster. The
algorithm then progressively merges the closest clusters based on a chosen distance metric
until all data points belong to a single cluster.
1. Initialization: Start with each data point as its own separate cluster.
2. Distance Calculation: Calculate the distance (similarity) between every pair of clusters. Common distance metrics include Euclidean distance, Manhattan distance, or others.
3. Merge Clusters: Identify the two closest clusters and merge them into a single cluster.
4. Repeat: Repeat the process of calculating distances and merging clusters until only one
cluster remains, or until a predefined number of clusters is achieved.
Linkage Methods:
The way the distance between clusters is calculated during the merging process can vary:
Single Linkage: The distance between two clusters is defined as the minimum distance
between any two points in the clusters.
Complete Linkage: The distance between two clusters is defined as the maximum
distance between any two points in the clusters.
Average Linkage: The distance between two clusters is the average of all pairwise
distances between points in the clusters.
Ward's Method: Minimizes the variance of merged clusters, focusing on reducing the
sum of squared errors.
Advantages: does not require specifying the number of clusters in advance; the dendrogram reveals the full hierarchy of merges.
Disadvantages: computationally expensive (O(n³)) for large datasets; sensitive to noise and to the chosen linkage method.
ii) Dendrogram
A Dendrogram is a tree-like diagram that illustrates the arrangement of clusters in
hierarchical clustering. It visually represents the merging of clusters at different levels and
helps to understand the structure of data.
Key Features:
X-Axis: Represents the individual data points (the leaves of the dendrogram).
Y-Axis: Represents the distance (or dissimilarity) at which clusters are merged. The
higher the position of the merge, the less similar the clusters are.
Branches: The branches represent clusters being merged at various steps. The points
where branches meet correspond to the level of similarity (distance) at which the
clusters are combined.
How to Use:
Visualizing Cluster Similarity: The closer two clusters are joined (lower on the Y-axis),
the more similar they are. High branches indicate less similarity between merged
clusters.
Advantages:
Helps in selecting the optimal number of clusters by observing the structure of the
dendrogram.
Disadvantages:
For large datasets, dendrograms can become hard to interpret due to the excessive
number of branches.
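A minimal sketch (assuming SciPy and Matplotlib) that builds the linkage matrix, draws the dendrogram, and then cuts the tree into two flat clusters; Ward linkage and the four example points are just illustrative choices.

```python
# Build and plot a dendrogram, then cut the tree into 2 clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 3], [6, 7], [7, 8]])

Z = linkage(X, method="ward")          # also: "single", "complete", "average"
dendrogram(Z, labels=["A", "B", "C", "D"])
plt.ylabel("Merge distance")
plt.show()

# Cut the tree to obtain a flat clustering with 2 clusters.
print(fcluster(Z, t=2, criterion="maxclust"))
```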
Explain graph-based clustering
Graph-Based Clustering
Graph-Based Clustering is a method that treats the data as a graph, where each data point
is represented as a node and the relationships or similarities between the points are
represented as edges. The primary goal of graph-based clustering is to identify groups (or
clusters) of points that are highly connected or similar, based on the structure of the graph.
1. Graph Construction:
Nodes represent individual data points.
Edges represent relationships between data points, usually weighted based on their similarity (e.g., Euclidean distance, cosine similarity, etc.).
The graph may be fully connected or sparsely connected, depending on how the
similarities are defined.
2. Graph Representation:
Adjacency Matrix: A matrix representation of the graph where the element A[i, j]
represents the weight (similarity) of the edge between nodes i and j .
Laplacian Matrix: A matrix derived from the adjacency matrix, used for spectral
clustering and other graph-based methods.
3. Partitioning the Graph: The core idea is to find dense subgraphs or connected
components in the graph that represent clusters. Various methods can be used to
partition the graph:
Spectral Clustering: Uses eigenvalues of the graph Laplacian to partition the graph.
The graph is divided into clusters based on the eigenvectors of the Laplacian, where
each cluster corresponds to a connected component in the graph.
Minimum Cut: The graph is divided by finding a cut that minimizes the sum of the
edge weights that separate the clusters, ensuring the clusters are internally dense
and separated by minimal connections.
4. Clustering: Once the graph is partitioned, the nodes within each subgraph or
community are grouped together to form clusters.
Common graph-based clustering approaches include:
1. Spectral Clustering:
Uses the eigenvalues and eigenvectors of the graph Laplacian to partition the graph, as described above.
2. Minimum Cut Partitioning:
This approach divides the graph into two or more disjoint subgraphs by finding the
minimum cut, which minimizes the sum of edge weights that separate the
subgraphs.
The normalized cut and ratio cut are common variations used to ensure that
clusters are dense and well-separated.
3. Community Detection:
Identifies groups of nodes (communities) that are more densely connected to each other than to the rest of the graph, for example with modularity-based methods.
Advantages of Graph-Based Clustering
1. Handles Complex Data Structures:
Graph-based clustering methods are highly effective for non-Euclidean data (e.g., social
networks, web graphs, text data) where relationships between data points are more
important than their exact positions in space.
2. Flexibility:
The method can incorporate various similarity measures, such as distance, affinity, or
correlation, depending on the type of data.
Disadvantages of Graph-Based Clustering
1. Computational Cost:
Building the similarity graph and computing eigenvectors can be expensive for large datasets, which limits scalability.
2. Parameter Sensitivity:
Graph-based clustering methods, such as spectral clustering, can be sensitive to the
choice of similarity measure and the number of clusters k . The quality of clustering may
degrade if these parameters are not well chosen.
3. Memory Usage:
Storing the graph and performing computations on large graphs can be memory-
intensive. Efficient implementations and sparse matrices are often required for large-
scale applications.
4. Quality of Clustering:
The effectiveness of graph-based clustering depends heavily on the graph construction.
Poor choices for similarity measures or graph sparsity can lead to suboptimal clustering
results.
Applications of Graph-Based Clustering
Social Network Analysis: detecting communities of closely connected users.
Web Mining: Clustering web pages based on link structures or content similarities.
Biological Networks: Identifying groups of genes or proteins that are closely related in
biological systems, such as in protein-protein interaction networks.
Conclusion
Graph-based clustering is a powerful technique, particularly useful for handling complex
data where traditional clustering algorithms like K-Means might struggle. It excels in
identifying clusters that are non-linearly separable and can capture global data structures.
However, it may face challenges in terms of computational efficiency and scalability for very
large datasets.
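As a sketch of the graph-based idea (assuming scikit-learn and SciPy), the code below builds a k-nearest-neighbour similarity graph and takes its connected components as clusters; spectral clustering or community detection could be applied to the same graph instead, and the neighbour count of 5 is an arbitrary choice.

```python
# Graph-based clustering via a k-NN graph and its connected components.
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import connected_components

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Sparse k-NN connectivity graph over the data points.
A = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
A = A + A.T                                   # make the graph undirected

n_components, labels = connected_components(A, directed=False)
print("Number of connected components (clusters):", n_components)
```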
i) Elbow Method
The Elbow Method is a technique used to determine the optimal number of clusters in
clustering algorithms such as K-means. It helps identify the point where the within-cluster
sum of squares (WCSS) or inertia starts to diminish at a slower rate, forming an "elbow" in
the plot.
1. Run the clustering algorithm (e.g., K-means) for a range of k values (the number of
clusters) — typically starting from 1 up to a reasonable upper limit.
2. Calculate the Within-Cluster Sum of Squares (WCSS): This is the sum of squared
distances between each point and its cluster centroid, which measures the compactness
of the clusters.
3. Plot the Curve: Plot the WCSS values against the corresponding values of k.
4. Identify the Elbow: The "elbow" is the point where the WCSS starts to level off, and
adding more clusters does not result in significant improvement in WCSS. The k at this
point is considered the optimal number of clusters.
Example:
For a K-means clustering, if you plot the WCSS values for increasing values of k , the elbow is
where the curve begins to flatten, indicating the ideal number of clusters for the dataset.
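A short sketch (assuming scikit-learn and Matplotlib) of the Elbow Method: K-Means is run for k = 1..9 on a synthetic blob dataset and the WCSS (the model's inertia_) is plotted against k; the dataset and range of k are arbitrary.

```python
# Elbow Method: plot WCSS (inertia) against the number of clusters k.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 10)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```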
Extrinsic Method:
Extrinsic methods for evaluating clustering results are based on external criteria or ground
truth labels. These methods compare the results of clustering with actual class labels or
predefined clusters, which are not available in unsupervised learning but can be used if
ground truth data is available.
Example Metrics:
Adjusted Rand Index (ARI): Measures the similarity between the clustering results
and the ground truth labels.
Fowlkes-Mallows Index (FMI): Combines precision and recall to evaluate the
clustering quality based on ground truth labels.
Extrinsic methods are useful when you have a known ground truth to compare the clustering
results.
Intrinsic Method:
Intrinsic methods for evaluating clustering do not rely on external labels. Instead, they
evaluate the clustering based on the internal properties of the clusters, such as how
cohesive (tight) the clusters are and how well-separated they are from each other.
Example Metrics:
Silhouette Score: Measures how similar a point is to its own cluster compared to
other clusters.
Davies-Bouldin Index: Measures the average similarity ratio of each cluster with the
cluster most similar to it. Lower values indicate better clustering.
Intrinsic methods are widely used when no ground truth is available, as they focus on the
characteristics of the clustering itself.
Summary:
Elbow Method: A technique for determining the optimal number of clusters by plotting
the within-cluster sum of squares (WCSS) and looking for the "elbow" point.
Extrinsic Method: An evaluation metric for clustering that compares the clustering
results to external ground truth labels.
Intrinsic Method: An evaluation metric for clustering that assesses the quality of
clusters based on their internal structure, such as cohesion and separation, without
using external labels.
Optimization of Clusters in Clustering
Optimization of clusters in clustering refers to the process of improving the quality and
accuracy of the clusters formed by the clustering algorithm. The primary goal is to ensure
that the data points within a cluster are similar to each other (cohesion), while the clusters
themselves are well-separated (separation). Optimization techniques focus on refining the
number of clusters, the structure of clusters, and the parameters used in the clustering
process to enhance the clustering results.
1. Determining the Optimal Number of Clusters:
Challenge: One of the most significant issues in clustering is choosing the correct
number of clusters, k . If the number of clusters is chosen too low, important
patterns in the data may be overlooked. On the other hand, too many clusters may
result in overfitting or fragmentation of the data.
Optimization Techniques:
Elbow Method: Helps identify the optimal number of clusters by plotting the
sum of squared distances (inertia) for different k values. The "elbow" point
indicates where adding more clusters results in diminishing returns.
Silhouette Score: Measures how similar points are to their own cluster
compared to other clusters. A higher silhouette score indicates better
clustering, helping to decide the best number of clusters.
2. Cluster Initialization:
Challenge: Some algorithms, like K-Means, are sensitive to the initial starting points
(centroids). Poor initialization can lead to suboptimal cluster configurations.
Optimization Techniques:
Better Initialization: run the algorithm several times with different random starting centroids, or use K-Means++-style seeding to spread the initial centroids apart.
3. Distance Metric Selection:
Optimization Techniques:
Selecting the Right Distance Metric: Algorithms like K-Means and DBSCAN rely
heavily on the distance function. Optimizing the metric (e.g., using Manhattan
distance, cosine similarity, etc.) can lead to better clustering results for specific
datasets.
4. Handling Outliers and Noise:
Challenge: Outliers and noise can negatively impact clustering results, causing
misclassification and inaccurate cluster formation.
Optimization Techniques:
DBSCAN: A density-based algorithm that can detect outliers and classify them
as noise, helping optimize clustering for datasets with noise.
5. Non-Spherical Cluster Shapes:
Challenge: Clustering algorithms like K-Means assume that clusters are convex or spherical, which may not always hold true in real-world data. Non-globular clusters may not be effectively captured by algorithms like K-Means.
Optimization Techniques:
Use algorithms that do not assume spherical clusters, such as DBSCAN or spectral clustering, which can capture arbitrarily shaped clusters.
6. Evaluating and Refining Cluster Quality:
Optimization Techniques:
External Metrics: When ground truth is available, external metrics like Adjusted
Rand Index (ARI) and Normalized Mutual Information (NMI) can help evaluate
how well the clusters align with true labels.
2. Silhouette Score: Measures how well-separated and cohesive the clusters are. A high
silhouette score indicates that the clusters are both compact and well-separated.
3. Dimensionality Reduction: Techniques such as PCA can improve clustering results by optimizing the space in which the algorithm operates.
Conclusion
The optimization of clusters is a critical issue in clustering because it directly influences the
effectiveness of the algorithm and the quality of the final clusters. Optimization involves
determining the right number of clusters, selecting the appropriate distance metrics,
handling outliers, and evaluating the quality of the clusters. By employing various techniques
like the Elbow Method, Silhouette Score, and optimizing initialization, clustering results can
be improved, leading to meaningful and interpretable clusters.
When designing a K-Medoids clustering algorithm, choosing the optimal number of clusters
(k ) is a critical step, and it directly influences the quality of the clustering. Here are some
methods to determine the optimal number of clusters for K-Medoids:
1. Elbow Method
The Elbow Method is a popular technique to determine the optimal number of clusters by
plotting the cost function (e.g., total within-cluster dissimilarity or dissimilarity sum) against
the number of clusters k .
How it works:
Run the K-Medoids algorithm for different values of k (e.g., from 1 to a maximum
value).
For each k , calculate the total cost, which is the sum of the dissimilarities (or
distances) between each point and the medoid of its assigned cluster.
Look for the "elbow" point in the graph, where the rate of decrease in the cost
function slows down significantly. The k corresponding to the elbow is usually
chosen as the optimal number of clusters.
Why it works:
The total dissimilarity generally decreases as the number of clusters increases, but
at some point, the reduction slows down significantly. The elbow point indicates that
further increasing the number of clusters doesn't significantly improve the
clustering quality.
2. Silhouette Score
The Silhouette Score measures how similar an object is to its own cluster compared to other
clusters. A higher silhouette score indicates better-defined clusters.
How it works:
For each k , compute the average silhouette score of all points in the dataset.
The silhouette score ranges from -1 (bad clustering) to +1 (good clustering), with a
score close to 0 indicating overlapping clusters.
The optimal number of clusters is the k with the highest average silhouette score.
Why it works:
A higher silhouette score indicates that the points are closer to their own cluster and
far from other clusters, which is desirable for good clustering.
3. Gap Statistic
The Gap Statistic compares the performance of the clustering algorithm to a random
clustering result. It helps in determining the optimal k by measuring the gap between the
observed cost and the expected cost under a random clustering.
How it works:
For each k , calculate the clustering cost for the real data and for a random dataset
with similar characteristics.
The optimal k corresponds to the value of k that maximizes the gap between the
real and random clustering costs.
Why it works:
The gap statistic helps to avoid overfitting (too many clusters) and underfitting (too
few clusters) by finding the point where the clustering structure in the real data is
significantly different from random noise.
4. Cross-Validation
If you have a labeled dataset or can partition your data, you can use cross-validation to
evaluate the performance of different clustering solutions. The idea is to partition the data
into subsets and check how consistent the clustering is across different subsets.
How it works:
A stable solution with consistent clustering across folds suggests a good choice for
k.
Why it works:
Consistent cluster assignments across different subsets of the data indicate that the chosen k reflects real structure rather than noise.
5. Davies-Bouldin Index
The Davies-Bouldin Index is a measure that evaluates the average similarity ratio between
clusters. Lower values of this index indicate better clustering, where clusters are well-
separated and compact.
How it works:
For each k , calculate the Davies-Bouldin Index. The optimal k corresponds to the
value with the lowest Davies-Bouldin Index.
Why it works:
A low Davies-Bouldin Index means the clusters are compact and well separated, which is exactly what a good choice of k should produce.
6. Within-Cluster Sum of Dissimilarities (Inertia)
How it works:
The goal is to minimize inertia, and the optimal k is typically the one where the
inertia starts to level off (similar to the elbow method).
Why it works:
By minimizing inertia, you ensure that data points within each cluster are as similar
as possible to the medoid.
Conclusion
To choose the optimal number of clusters k when designing a K-Medoids clustering
algorithm, you can use a combination of the methods above, such as:
The Elbow Method or Gap Statistic for a general idea of the number of clusters.
The Silhouette Score or Davies-Bouldin Index to confirm that the chosen k yields compact, well-separated clusters.
Using these techniques together provides a robust approach to selecting the optimal
number of clusters for K-Medoids clustering.