Clustering Metrics in Machine Learning

Last Updated : 23 Jul, 2025

Clustering is a technique in Machine Learning that is used to group similar data points. While the algorithm performs its job, helping uncover the patterns and structures in the data, it is important to judge how well it functions. Several metrics have been designed to evaluate the performance of these clustering algorithms.

In this article, we will explore these metrics and see the mathematical concepts that lie behind them. After that, we will demonstrate their practical implementation using scikit-learn.

What is Clustering?

Clustering is an unsupervised machine-learning approach that is used to group comparable data points based on specific traits or attributes. Clustering algorithms do not require labelled data, which makes them ideal for finding patterns in large datasets. It is a widely used technique in applications like customer segmentation, image recognition, anomaly detection, etc.

There are multiple clustering algorithms, and each has its own way of grouping data points. Clustering metrics give a common basis for evaluating their results. Let us take a look at some of the most commonly used clustering metrics:

1. Silhouette Score

The Silhouette Score measures how well each data point fits its assigned cluster compared with the nearest other cluster. The score ranges from -1 to 1.

A score close to 1 means a point fits really well in its group (cluster) and is far from other groups.
A score close to 0 means the point is on the border between two clusters.
A score close to -1 means the point might be in the wrong cluster.

Silhouette Score (S) for a data point i is calculated as:

S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

where,

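a(i) is the average distance from i to the other data points in the same cluster.
b(i) is the smallest average distance from i to the data points of any other cluster, i.e. the average distance to the nearest neighboring cluster.

The Silhouette Score of the whole clustering is the mean of S(i) over all data points.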
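The definition can be checked numerically. The following is a minimal sketch, assuming Euclidean distances and a small made-up dataset; the helper name silhouette_manual is hypothetical and not part of scikit-learn. It computes S(i) point by point and compares the result with scikit-learn's silhouette_samples.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Toy data (illustrative only): two tight groups on a line
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette_manual(X, labels):
    """Compute S(i) for every point directly from the definition."""
    D = pairwise_distances(X)  # Euclidean distance between every pair of points
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a cluster with a single point gets S(i) = 0
        # a(i): mean distance to the other members of i's own cluster
        # (D[i, i] is 0, so dividing by n_same - 1 excludes the point itself)
        a = D[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_manual(X, labels)
reference = silhouette_samples(X, labels)
print(np.round(manual, 3))
print("matches scikit-learn:", np.allclose(manual, reference))
```

For the middle point of the first group (value 1.0), a(i) = 1 and b(i) = 10, so S(i) = 9 / 10 = 0.9, which is what the sketch returns for that point.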

2. Davies-Bouldin Index

The Davies-Bouldin Index (DBI) measures the average similarity between each cluster and the cluster most similar to it. Similarity here weighs how tight each cluster is (compactness) against how far apart the clusters are (separation).

Lower DBI = tighter, better-separated clusters
Higher DBI = looser, overlapping clusters

The minimum possible value is 0. A low score means:

Points in the same cluster are close to each other.
Different clusters are far apart from one another.

Davies-Bouldin Index (DB) is calculated as:

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{R_{ii} + R_{jj}}{R_{ij}} \right)

where,

k is the total number of clusters.
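R_{ii} is the compactness of cluster i, typically the average distance from its points to its centroid.
R_{jj} is the compactness of cluster j, defined in the same way.
R_{ij} is the dissimilarity (distance) between cluster i and cluster j, typically the distance between their centroids.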
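As a rough illustration of the formula, the sketch below evaluates it on a small made-up dataset. It assumes that compactness is the mean Euclidean distance of a cluster's points to its centroid and that dissimilarity is the distance between centroids, which is the convention scikit-learn's davies_bouldin_score follows; the helper name davies_bouldin_manual is hypothetical.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Toy data (illustrative only): three clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0],
              [5.0, 0.0], [5.0, 2.0],
              [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

def davies_bouldin_manual(X, labels):
    ids = np.unique(labels)
    k = len(ids)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # R_ii: mean distance from the points of cluster i to its centroid
    spread = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(ids)
    ])
    worst = np.zeros(k)
    for i in range(k):
        # (R_ii + R_jj) / R_ij for every other cluster j; keep the largest
        worst[i] = max(
            (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return worst.mean()  # average of the worst-case ratios over all clusters

db_manual = davies_bouldin_manual(X, labels)
db_reference = davies_bouldin_score(X, labels)
print(f"manual: {db_manual:.4f}, scikit-learn: {db_reference:.4f}")
```

Taking the maximum over j means each cluster is judged against its most confusable neighbor, so a single pair of overlapping clusters is enough to raise the index.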

3. Calinski-Harabasz Index (Variance Ratio Criterion)

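The Calinski-Harabasz Index is the ratio of between-cluster dispersion to within-cluster dispersion, adjusted for the number of clusters and data points.

It looks at:

How close the points are inside each cluster.
How far apart the clusters are.

A higher score is better, as it means the clusters are tight and well-separated. The index has no fixed upper bound, so it is mainly useful for comparing clusterings of the same dataset, for example when choosing the number of clusters.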

Calinski-Harabasz Index (CH) is calculated as:

CH = \frac{B}{W} \times \frac{N - K}{K - 1}

where,

B is the sum of squares between clusters.
W is the sum of squares within clusters.
N is the total number of data points.
K is the number of clusters.

Calculating the between-group sum of squares (B)

B = \sum_{k=1}^{K} n_k \times ||C_k - C||^2

where,

n_k is the number of observations in cluster 'k'
C_k is the centroid of cluster 'k'
C is the centroid of the dataset
K is the number of clusters

Calculating the within-group sum of squares (W)

W = \sum_{k=1}^{K} \sum_{i=1}^{n_k} ||X_{ik} - C_k||^2

where,

n_k is the number of observations in cluster 'k'
X_{ik} is the i-th observation of cluster 'k'
C_k is the centroid of cluster 'k'
K is the number of clusters
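The pieces B, W, N and K can be put together in a few lines. This is a minimal sketch on made-up data, with W summed over all clusters; the helper name calinski_harabasz_manual is hypothetical, and the result is compared with scikit-learn's calinski_harabasz_score.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

# Toy data (illustrative only): three clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0],
              [5.0, 0.0], [5.0, 2.0],
              [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1, 2, 2])

def calinski_harabasz_manual(X, labels):
    n_samples = X.shape[0]                 # N
    ids = np.unique(labels)
    n_clusters = len(ids)                  # K
    overall_centroid = X.mean(axis=0)      # C
    between = 0.0                          # B
    within = 0.0                           # W
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)    # C_k
        between += len(members) * np.sum((centroid - overall_centroid) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / within) * (n_samples - n_clusters) / (n_clusters - 1)

ch_manual = calinski_harabasz_manual(X, labels)
ch_reference = calinski_harabasz_score(X, labels)
print(f"manual: {ch_manual:.3f}, scikit-learn: {ch_reference:.3f}")
```

Note that the formula is undefined for K = 1 (division by zero) and for W = 0, so the sketch assumes at least two clusters with some within-cluster spread.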

4. Adjusted Rand Index (ARI)

The Adjusted Rand Index (ARI) helps us measure how accurate a clustering result is by comparing it to the true labels (ground truth).

It checks how well the pairs of points are grouped:

Are pairs that share a true label also placed in the same predicted cluster?
Are pairs with different true labels kept in different predicted clusters?

The score has an upper bound of 1 and can be negative:

1 means a perfect match: the clustering agrees with the true labels exactly, up to a renaming of the clusters.
A value near 0 means the agreement is no better than random labeling.
Below 0 means worse than random - very poor clustering.

Adjusted Rand Index (ARI) is calculated as:

ARI = \frac{RI - Expected_{RI}}{\max(RI) - Expected_{RI}}

where,

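RI is the Rand Index, the fraction of point pairs on which the two labelings agree.
Expected_{RI} is the expected value of the Rand Index under random labeling.
max(RI) is the maximum possible value of the Rand Index.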
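In practice the same correction is usually evaluated by counting pairs in the contingency table of the two labelings rather than by computing the Rand Index first. The sketch below uses that pair-counting form, which should agree with the formula above; the helper name ari_manual and the example labels are illustrative assumptions, and the result is compared with scikit-learn's adjusted_rand_score.

```python
from math import comb
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels for illustration
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 2, 2]

def ari_manual(y_true, y_pred):
    table = contingency_matrix(y_true, y_pred)  # counts per (true, predicted) pair
    n = int(table.sum())
    # Pairs of points that are together in both labelings
    together = sum(comb(int(v), 2) for v in table.ravel())
    # Pairs together in the true labels / in the predicted labels
    pairs_true = sum(comb(int(v), 2) for v in table.sum(axis=1))
    pairs_pred = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = pairs_true * pairs_pred / comb(n, 2)  # value expected by chance
    maximum = 0.5 * (pairs_true + pairs_pred)        # largest attainable value
    # Assumes maximum != expected (fails e.g. when both labelings are one cluster)
    return (together - expected) / (maximum - expected)

ari_hand = ari_manual(true_labels, pred_labels)
ari_reference = adjusted_rand_score(true_labels, pred_labels)
print(f"manual: {ari_hand:.4f}, scikit-learn: {ari_reference:.4f}")

# Cluster names do not matter: swapping 0 and 1 still gives a perfect score
print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because only co-membership of pairs is counted, the metric needs no matching between cluster ids and class ids, which is why it can be applied directly to K-Means output.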

5. Mutual Information (MI)

Mutual Information measures how much knowing one variable reduces uncertainty about another. In clustering, it quantifies how much information the predicted cluster labels share with the true labels. The more agreement there is, the higher the score.

Higher values mean better agreement between the clusters.
Zero means the two labelings are statistically independent.
The score is not normalized: its upper bound depends on the entropy of the labelings, so it is not confined to a 0-to-1 scale.

MI between true labels Y and predicted labels Z is calculated as:

MI(Y, Z) = \sum_{i} \sum_{j} p(y_i, z_j) \cdot \log \left( \frac{p(y_i, z_j)}{p(y_i) \cdot p(z_j)} \right)

where,

y_i is a true label.
z_j is a predicted label.
p(y_i, z_j) is the joint probability of y_i and z_j.
p(y_i) and p(z_j) are the marginal probabilities.
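The double sum can be evaluated from the contingency table of the two labelings. The following is a minimal sketch with made-up labels, assuming the natural logarithm (the base scikit-learn's mutual_info_score uses) and treating terms with zero joint probability as zero; the helper name mutual_info_manual is hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

# Made-up labels for illustration
true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 0, 0, 2, 2]

def mutual_info_manual(y_true, y_pred):
    joint = contingency_matrix(y_true, y_pred).astype(float)
    joint /= joint.sum()                      # p(y_i, z_j)
    p_y = joint.sum(axis=1, keepdims=True)    # p(y_i), column vector
    p_z = joint.sum(axis=0, keepdims=True)    # p(z_j), row vector
    independent = p_y @ p_z                   # p(y_i) * p(z_j) for every (i, j)
    nonzero = joint > 0                       # skip empty cells: 0 * log(0) -> 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))

mi_hand = mutual_info_manual(true_labels, pred_labels)
mi_reference = mutual_info_score(true_labels, pred_labels)
print(f"manual: {mi_hand:.4f}, scikit-learn: {mi_reference:.4f}")

# Two identical two-class labelings (up to renaming) share ln(2) nats
print(mutual_info_manual([0, 0, 1, 1], [1, 1, 0, 0]), np.log(2))
```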

These clustering metrics help in evaluating the quality and performance of clustering algorithms, allowing for informed decisions when selecting the most suitable clustering solution for a given dataset.

Steps to Evaluate Clustering Using Sklearn

Let's consider an example using the Iris dataset and the K-Means clustering algorithm. We will calculate the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, Adjusted Rand Index, and Mutual Information to evaluate the clustering.

Import Libraries

Import the necessary libraries, including scikit-learn (sklearn).

Python

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.metrics import mutual_info_score, adjusted_rand_score

Load Your Data

Load or generate your dataset for clustering. The Iris dataset consists of 150 samples of iris flowers from three species (setosa, versicolor, and virginica), each described by four features: sepal length, sepal width, petal length, and petal width.

Python

# Example using a built-in dataset (e.g., Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data

Perform Clustering

Choose a clustering algorithm, such as K-Means, and fit it to your data.

K-Means is an unsupervised technique that groups data points into clusters based on similarity. It iteratively assigns data points to the nearest cluster center and updates the centroids until convergence.

Python

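# Fix n_init and random_state so that repeated runs give the same clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)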
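To make the assign-and-update loop concrete, here is a simplified sketch of the iteration in NumPy. It is not how scikit-learn implements KMeans (which adds smarter initialization and several restarts); the function name lloyd_kmeans, the random initialization, and the toy data are assumptions for illustration only.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain K-Means iteration: assign points, update centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (a center that lost all its points is left where it was)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

# Toy data: two well-separated pairs of points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
demo_labels, demo_centers = lloyd_kmeans(X_demo, k=2)
print(demo_labels)
print(demo_centers)
```

The cluster ids themselves are arbitrary and depend on the initialization, which is one reason label-based metrics such as ARI compare groupings rather than id values.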

Calculate Clustering Metrics

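Use the appropriate clustering metrics to evaluate the clustering results. The Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index need only the data and the predicted labels, while the Adjusted Rand Index and Mutual Information also need the true labels (iris.target).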

Python

# Calculate clustering metrics
silhouette = silhouette_score(X, kmeans.labels_)
db_index = davies_bouldin_score(X, kmeans.labels_)
ch_index = calinski_harabasz_score(X, kmeans.labels_)
ari = adjusted_rand_score(iris.target, kmeans.labels_)
mi = mutual_info_score(iris.target, kmeans.labels_)

# Print the metric scores
print(f"Silhouette Score: {silhouette:.2f}")
print(f"Davies-Bouldin Index: {db_index:.2f}")
print(f"Calinski-Harabasz Index: {ch_index:.2f}")
print(f"Adjusted Rand Index: {ari:.2f}")
print(f"Mutual Information (MI): {mi:.2f}")

Output:

Silhouette Score: 0.55
Davies-Bouldin Index: 0.66
Calinski-Harabasz Index: 561.63
Adjusted Rand Index: 0.73
Mutual Information (MI): 0.83

Interpreting the Metrics

Here's an interpretation of the metric scores obtained:

Silhouette Score (0.55)

This score compares how close data points are to their own cluster with how close they are to the nearest other cluster. A result of 0.55 indicates reasonable separation between the clusters, with room for improvement. Values closer to 1 suggest better-defined clusters.

Davies-Bouldin Index (0.66)

This index averages, over all clusters, the similarity between each cluster and the cluster most similar to it. A lower score is preferable, and 0.66 suggests fairly good separation between clusters.

Calinski-Harabasz Index (561.63)

This index is the ratio of between-cluster variance to within-cluster variance. Higher values suggest more distinct groups. The index has no fixed scale, so 561.63 is most meaningful when compared with the scores of other clusterings of the same data.
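Metrics that need no true labels are often read comparatively, by scoring several candidate cluster counts on the same data. The sketch below does this for Iris with K-Means; the range of k values and the n_init and random_state settings are arbitrary choices made for illustration, and the dictionary name results is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

X = load_iris().data

results = {}
for k in range(2, 7):
    # Same settings for every k so that only the number of clusters changes
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    results[k] = (
        silhouette_score(X, labels),         # higher is better
        davies_bouldin_score(X, labels),     # lower is better
        calinski_harabasz_score(X, labels),  # higher is better
    )

for k, (sil, db, ch) in results.items():
    print(f"k={k}: silhouette={sil:.2f}, DB={db:.2f}, CH={ch:.1f}")
```

The three metrics need not favor the same k, and none of them is guaranteed to recover the number of true classes, so the comparison is a guide rather than a definitive answer.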

Adjusted Rand Index (0.73)

This index compares the true class labels with the predicted cluster labels. A score of 0.73 shows that the clustering and the actual class labels correspond fairly well.

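Mutual Information (MI) (0.83)

This metric measures the agreement between the true class labels and the predicted cluster labels. A score of 0.83 indicates a substantial amount of shared information between the true labels and the clusters assigned by the algorithm. Because MI is not normalized, the value should be read against the entropy of the labels: for three balanced classes the maximum is ln 3, about 1.10 nats.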
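Since raw MI depends on the entropy of the labelings, a rescaled variant is often easier to interpret. The sketch below, on made-up labels, compares mutual_info_score with scikit-learn's normalized_mutual_info_score and adjusted_mutual_info_score, and reproduces the normalized value by dividing MI by the arithmetic mean of the two entropies (assumed here to be scikit-learn's default averaging). The helper name entropy is hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    mutual_info_score,
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
)

# Made-up labels for illustration: one point of class 0 is misplaced
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2]

def entropy(labels):
    """Shannon entropy of a labeling, in nats."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

mi = mutual_info_score(y_true, y_pred)              # raw MI, in nats
nmi = normalized_mutual_info_score(y_true, y_pred)  # rescaled to [0, 1]
ami = adjusted_mutual_info_score(y_true, y_pred)    # also corrected for chance

# Normalizing by hand: MI divided by the mean of the two entropies
nmi_manual = mi / ((entropy(y_true) + entropy(y_pred)) / 2)

print(f"MI={mi:.3f}, NMI={nmi:.3f}, NMI (manual)={nmi_manual:.3f}, AMI={ami:.3f}")
```

Raw MI can never exceed the smaller of the two entropies, which is why the same MI value can mean strong agreement on a two-class problem and weak agreement on a ten-class one.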


kesavare0
