Plotting Boundaries of Cluster Zone with Scikit-Learn
Clustering is a popular technique in machine learning for identifying groups within a dataset based on similarity. In Python, the scikit-learn package provides a range of clustering algorithms like KMeans, DBSCAN, and Agglomerative Clustering. A critical aspect of cluster analysis is visualizing the results, particularly when it comes to plotting the boundaries of cluster zones. This article will cover how to plot cluster boundaries using scikit-learn, focusing on the theory behind clustering and hands-on implementation.
Common Clustering Algorithms in Scikit-Learn
Scikit-learn offers a variety of clustering algorithms, each suitable for different data types and structures. Some of the most commonly used algorithms include:
- KMeans: Partitions the data into k clusters, where each cluster is represented by the mean of its points.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together closely packed points and marks points in low-density regions as outliers.
- Agglomerative Clustering: A hierarchical clustering technique that merges data points based on distance measures.
We'll focus on how to visualize the boundaries generated by these algorithms.
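Before diving in, here is a minimal sketch of the shared estimator API that all three algorithms follow; the parameter values are illustrative defaults, not tuned choices:
Python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic dataset just to exercise the three estimators
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Each estimator exposes fit_predict, which returns one label per sample
labels_km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # -1 marks outliers
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)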
Why Visualize Cluster Boundaries?
Visualizing the boundaries of cluster zones helps in understanding how the clustering algorithm has partitioned the data space. This is particularly important when:
- Validating the effectiveness of the clustering algorithm.
- Identifying data points that are misclassified or close to boundary regions.
- Understanding the separation between different clusters.
Plotting Cluster Boundaries with KMeans
The KMeans algorithm is one of the most widely used clustering methods. In this section, we'll use scikit-learn's KMeans to fit the data and visualize the boundaries between different clusters.
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points. The 0.02 step controls the resolution of the plotted boundaries: smaller steps give smoother contours at the cost of a larger grid.
Python
# Create a meshgrid for plotting decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Step 4: Fit KMeans Clustering
Python
# Set n_init explicitly to avoid the FutureWarning about its changing default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
Z_kmeans = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z_kmeans = Z_kmeans.reshape(xx.shape)
Step 5: Plot the Boundaries for KMeans
Python
# Define color map for cluster zones
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
# Plot decision boundaries for KMeans
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_kmeans, cmap=cmap_light, alpha=0.6)
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, edgecolor='k', cmap='viridis')
# Plot the cluster centers
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='x')
plt.title('KMeans Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: KMeans cluster zones with data points and red centroid markers]
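If you prefer crisp boundary lines rather than filled zones, plt.contour draws only the edges between cluster regions. This is a minimal sketch that reuses xx, yy, Z_kmeans, and the fitted kmeans model from the steps above:
Python
# Draw only the boundary lines between cluster zones
plt.figure(figsize=(8, 6))
plt.contour(xx, yy, Z_kmeans, colors='k', linewidths=1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, edgecolor='k', cmap='viridis')
plt.title('KMeans Clustering - Boundary Lines Only')
plt.show()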
Plotting Cluster Boundaries with Agglomerative Clustering
Agglomerative Clustering builds nested clusters by iteratively merging the closest clusters according to a distance measure. Its boundaries can be visualized in a similar way.
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points.
Python
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
Step 4: Fit Agglomerative Clustering
Python
agg_clustering = AgglomerativeClustering(n_clusters=3)
y_pred_agg = agg_clustering.fit_predict(X)
Step 5: Plot the Boundaries for Agglomerative Clustering
Unfortunately, AgglomerativeClustering does not provide a predict method for new points. However, we can approximate the cluster boundaries by fitting a nearest-neighbor classifier to the labels it assigned.
Python
from sklearn.neighbors import KNeighborsClassifier
# Use Nearest Neighbor Classifier to predict the cluster for each point in the meshgrid
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y_pred_agg)
Z_agg = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z_agg = Z_agg.reshape(xx.shape)
# Define the same light color map used for the KMeans plot
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
# Plot decision boundaries for Agglomerative Clustering
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_agg, cmap=cmap_light, alpha=0.6)
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y_pred_agg, s=50, edgecolor='k', cmap='viridis')
plt.title('Agglomerative Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: Agglomerative clustering cluster zones with data points]
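A jagged boundary is a common side effect of the 1-nearest-neighbor extension. As a hedged variant, a larger n_neighbors smooths the approximated boundaries; n_neighbors=15 below is an illustrative choice, not a tuned value:
Python
# Smoother boundaries via a larger neighborhood
knn_smooth = KNeighborsClassifier(n_neighbors=15)
knn_smooth.fit(X, y_pred_agg)
Z_smooth = knn_smooth.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_smooth, cmap=cmap_light, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_agg, s=50, edgecolor='k', cmap='viridis')
plt.title('Agglomerative Clustering - Smoothed Boundaries')
plt.show()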
Plotting Cluster Boundaries with DBSCAN
The DBSCAN algorithm is more flexible than KMeans because it doesn't require specifying the number of clusters in advance. It is particularly useful for datasets with irregularly shaped clusters and noisy observations. Here's how you can visualize the cluster boundaries using DBSCAN:
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points.
Python
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Step 4: Fit DBSCAN Clustering
Python
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_pred_dbscan = dbscan.fit_predict(X)
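Unlike KMeans, the number of clusters here is determined by eps and min_samples rather than set up front. A quick sketch to inspect what DBSCAN actually found before plotting:
Python
# DBSCAN picks the number of clusters itself and labels outliers as -1
labels = set(y_pred_dbscan)
n_clusters_found = len(labels) - (1 if -1 in labels else 0)
n_outliers = list(y_pred_dbscan).count(-1)
print(f"Clusters found: {n_clusters_found}, outliers: {n_outliers}")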
Step 5: Plot the Boundaries for DBSCAN
Like Agglomerative Clustering, DBSCAN has no predict method, so we again approximate the boundaries with a nearest-neighbor classifier. DBSCAN also labels outliers as -1, so we exclude those points before fitting the classifier.
Python
from sklearn.neighbors import KNeighborsClassifier
# DBSCAN labels outliers as -1; exclude them before fitting the classifier
filtered_X = X[y_pred_dbscan != -1]
filtered_labels = y_pred_dbscan[y_pred_dbscan != -1]
# Use a nearest-neighbor classifier to assign each meshgrid point to a cluster
knn_dbscan = KNeighborsClassifier(n_neighbors=1)
knn_dbscan.fit(filtered_X, filtered_labels)
Z_dbscan = knn_dbscan.predict(np.c_[xx.ravel(), yy.ravel()])
Z_dbscan = Z_dbscan.reshape(xx.shape)
# Plot decision boundaries for DBSCAN Clustering
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_dbscan, cmap=ListedColormap(('red', 'green', 'blue')), alpha=0.6)  # simple 3-color map for the zones
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y_pred_dbscan, s=50, edgecolor='k', cmap='plasma')
plt.title('DBSCAN Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: DBSCAN cluster zones with data points and outliers]
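One caveat: the nearest-neighbor extension assigns every meshgrid point to some cluster, even in regions far from any data. As a hedged refinement, you can mask grid points beyond a distance cutoff so empty regions stay uncolored; the 0.5 cutoff below matches eps here but is an illustrative choice:
Python
# Mask grid points farther than a cutoff from any clustered sample
distances, _ = knn_dbscan.kneighbors(np.c_[xx.ravel(), yy.ravel()])
Z_masked = np.ma.masked_where(distances.ravel() > 0.5, Z_dbscan.ravel()).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_masked, cmap=ListedColormap(('red', 'green', 'blue')), alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_dbscan, s=50, edgecolor='k', cmap='plasma')
plt.title('DBSCAN Clustering - Boundaries Masked Beyond the Cutoff')
plt.show()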
Best Practices for Visualizing Cluster Zones
- Use High-Quality Data: Ensure your data is suitable for clustering, and apply preprocessing such as normalization where needed.
- Choose the Right Algorithm: Different clustering algorithms have different strengths. For example, DBSCAN works well with noise, while KMeans is suitable for spherical clusters.
- Consider the Dimensionality: Cluster boundaries are easy to visualize in 2D, but for high-dimensional data, a dimensionality reduction technique like PCA may be required first (see the sketch after this list).
- Test Different Parameters: Vary parameters such as the number of clusters (n_clusters for KMeans) or eps for DBSCAN to find the best clustering solution.
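To illustrate the dimensionality point above, here is a minimal sketch (assuming a hypothetical 10-feature synthetic dataset) that projects the data to 2D with PCA, clusters in the reduced space, and then plots boundaries exactly as in the earlier sections:
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 10-dimensional data reduced to 2D for visualization
X_hd, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=42)
X_2d = PCA(n_components=2).fit_transform(X_hd)

kmeans_2d = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_2d)

# Same meshgrid-and-predict recipe as before, in the PCA space
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = kmeans_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans_2d.labels_, s=50, edgecolor='k')
plt.title('KMeans on PCA-Reduced Data - Cluster Boundaries')
plt.show()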
Conclusion
In this article, we explored how to visualize cluster boundaries using three popular algorithms in scikit-learn: KMeans, DBSCAN, and Agglomerative Clustering. Visualizing cluster zones is a powerful way to understand the performance of a clustering algorithm and gain insights into your dataset’s structure. By following the techniques outlined here, you can create insightful visualizations that highlight the decision boundaries of clustering algorithms.