Plotting Boundaries of Cluster Zone with Scikit-Learn
Clustering is a popular technique in machine learning for identifying groups within a dataset based on similarity. In Python, the scikit-learn package provides a range of clustering algorithms like KMeans, DBSCAN, and Agglomerative Clustering. A critical aspect of cluster analysis is visualizing the results, particularly when it comes to plotting the boundaries of cluster zones. This article will cover how to plot cluster boundaries using scikit-learn, focusing on the theory behind clustering and hands-on implementation.
Common Clustering Algorithms in Scikit-Learn
Scikit-learn offers a variety of clustering algorithms, each suitable for different data types and structures. Some of the most commonly used algorithms include:
- KMeans: Partitions the data into k clusters, where each cluster is represented by the mean of its points.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Groups together closely packed points and marks points in low-density regions as outliers.
- Agglomerative Clustering: A hierarchical clustering technique that merges data points based on distance measures.
We'll focus on how to visualize the boundaries generated by these algorithms.
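Before diving in, here is a minimal sketch of the shared estimator API that all three algorithms follow; the parameter values are illustrative defaults, not tuned choices:
Python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic dataset just to exercise the three estimators
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Each estimator exposes fit_predict, which returns one label per sample
labels_km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # -1 marks outliers
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)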
Why Visualize Cluster Boundaries?
Visualizing the boundaries of cluster zones helps in understanding how the clustering algorithm has partitioned the data space. This is particularly important when:
- Validating the effectiveness of the clustering algorithm.
- Identifying data points that are misclassified or close to boundary regions.
- Understanding the separation between different clusters.
Plotting Cluster Boundaries with KMeans
The KMeans algorithm is one of the most widely used clustering methods. In this section, we'll use scikit-learn's KMeans to fit the data and visualize the boundaries between different clusters.
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points. The 0.02 step controls the resolution of the plotted boundaries: smaller steps give smoother contours at the cost of a larger grid.
Python
# Create a meshgrid for plotting decision boundaries
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Step 4: Fit KMeans Clustering
Python
# Set n_init explicitly to avoid the FutureWarning about its changing default
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
Z_kmeans = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z_kmeans = Z_kmeans.reshape(xx.shape)
Step 5: Plot the Boundaries for KMeans
Python
# Define color map for cluster zones
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
# Plot decision boundaries for KMeans
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_kmeans, cmap=cmap_light, alpha=0.6)
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, edgecolor='k', cmap='viridis')
# Plot the cluster centers
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='x')
plt.title('KMeans Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: KMeans cluster zones with data points and red centroid markers]
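If you prefer crisp boundary lines rather than filled zones, plt.contour draws only the edges between cluster regions. This is a minimal sketch that reuses xx, yy, Z_kmeans, and the fitted kmeans model from the steps above:
Python
# Draw only the boundary lines between cluster zones
plt.figure(figsize=(8, 6))
plt.contour(xx, yy, Z_kmeans, colors='k', linewidths=1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, edgecolor='k', cmap='viridis')
plt.title('KMeans Clustering - Boundary Lines Only')
plt.show()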
Plotting Cluster Boundaries with Agglomerative Clustering
Agglomerative Clustering builds nested clusters by iteratively merging the closest clusters according to a distance measure. Its boundaries can be visualized in a similar way.
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points.
Python
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
Step 4: Fit Agglomerative Clustering
Python
agg_clustering = AgglomerativeClustering(n_clusters=3)
y_pred_agg = agg_clustering.fit_predict(X)
Step 5: Plot the Boundaries for Agglomerative Clustering
Unfortunately, AgglomerativeClustering does not provide a predict method for new points. However, we can approximate the cluster boundaries by fitting a nearest-neighbor classifier to the labels it assigned.
Python
from sklearn.neighbors import KNeighborsClassifier
# Use Nearest Neighbor Classifier to predict the cluster for each point in the meshgrid
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y_pred_agg)
Z_agg = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z_agg = Z_agg.reshape(xx.shape)
# Define the same light color map used for the KMeans plot
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
# Plot decision boundaries for Agglomerative Clustering
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_agg, cmap=cmap_light, alpha=0.6)
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y_pred_agg, s=50, edgecolor='k', cmap='viridis')
plt.title('Agglomerative Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: Agglomerative clustering cluster zones with data points]
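A jagged boundary is a common side effect of the 1-nearest-neighbor extension. As a hedged variant, a larger n_neighbors smooths the approximated boundaries; n_neighbors=15 below is an illustrative choice, not a tuned value:
Python
# Smoother boundaries via a larger neighborhood
knn_smooth = KNeighborsClassifier(n_neighbors=15)
knn_smooth.fit(X, y_pred_agg)
Z_smooth = knn_smooth.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_smooth, cmap=cmap_light, alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_agg, s=50, edgecolor='k', cmap='viridis')
plt.title('Agglomerative Clustering - Smoothed Boundaries')
plt.show()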
Plotting Cluster Boundaries with DBSCAN
The DBSCAN algorithm is more flexible than KMeans because it doesn't require specifying the number of clusters in advance. It is particularly useful for datasets with irregularly shaped clusters and noisy observations. Here's how you can visualize the cluster boundaries using DBSCAN:
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from matplotlib.colors import ListedColormap
Step 2: Generate Synthetic Data
We'll use make_blobs to generate a dataset with 3 clusters for demonstration.
Python
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
Step 3: Create a Meshgrid for Plotting Decision Boundaries
We need a meshgrid that spans the entire range of our data points.
Python
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Step 4: Fit DBSCAN Clustering
Python
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_pred_dbscan = dbscan.fit_predict(X)
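Unlike KMeans, the number of clusters here is determined by eps and min_samples rather than set up front. A quick sketch to inspect what DBSCAN actually found before plotting:
Python
# DBSCAN picks the number of clusters itself and labels outliers as -1
labels = set(y_pred_dbscan)
n_clusters_found = len(labels) - (1 if -1 in labels else 0)
n_outliers = list(y_pred_dbscan).count(-1)
print(f"Clusters found: {n_clusters_found}, outliers: {n_outliers}")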
Step 5: Plot the Boundaries for DBSCAN
Like Agglomerative Clustering, DBSCAN has no predict method, so we again approximate the boundaries with a nearest-neighbor classifier. DBSCAN also labels outliers as -1, so we exclude those points before fitting the classifier.
Python
from sklearn.neighbors import KNeighborsClassifier
# DBSCAN labels outliers as -1; exclude them before fitting the classifier
filtered_X = X[y_pred_dbscan != -1]
filtered_labels = y_pred_dbscan[y_pred_dbscan != -1]
# Use a nearest-neighbor classifier to assign each meshgrid point to a cluster
knn_dbscan = KNeighborsClassifier(n_neighbors=1)
knn_dbscan.fit(filtered_X, filtered_labels)
Z_dbscan = knn_dbscan.predict(np.c_[xx.ravel(), yy.ravel()])
Z_dbscan = Z_dbscan.reshape(xx.shape)
# Plot decision boundaries for DBSCAN Clustering
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_dbscan, cmap=ListedColormap(('red', 'green', 'blue')), alpha=0.6)  # simple 3-color map for the zones
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], c=y_pred_dbscan, s=50, edgecolor='k', cmap='plasma')
plt.title('DBSCAN Clustering - Cluster Boundaries')
plt.show()
Output:
[Figure: DBSCAN cluster zones with data points and outliers]
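One caveat: the nearest-neighbor extension assigns every meshgrid point to some cluster, even in regions far from any data. As a hedged refinement, you can mask grid points beyond a distance cutoff so empty regions stay uncolored; the 0.5 cutoff below matches eps here but is an illustrative choice:
Python
# Mask grid points farther than a cutoff from any clustered sample
distances, _ = knn_dbscan.kneighbors(np.c_[xx.ravel(), yy.ravel()])
Z_masked = np.ma.masked_where(distances.ravel() > 0.5, Z_dbscan.ravel()).reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z_masked, cmap=ListedColormap(('red', 'green', 'blue')), alpha=0.6)
plt.scatter(X[:, 0], X[:, 1], c=y_pred_dbscan, s=50, edgecolor='k', cmap='plasma')
plt.title('DBSCAN Clustering - Boundaries Masked Beyond the Cutoff')
plt.show()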
Best Practices for Visualizing Cluster Zones
- Use High-Quality Data: Ensure your data is suitable for clustering, and apply preprocessing such as normalization where needed.
- Choose the Right Algorithm: Different clustering algorithms have different strengths. For example, DBSCAN works well with noise, while KMeans is suitable for spherical clusters.
- Consider the Dimensionality: Cluster boundaries are easy to visualize in 2D, but for high-dimensional data, a dimensionality reduction technique like PCA may be required first (see the sketch after this list).
- Test Different Parameters: Vary parameters such as the number of clusters (n_clusters for KMeans) or eps for DBSCAN to find the best clustering solution.
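To illustrate the dimensionality point above, here is a minimal sketch (assuming a hypothetical 10-feature synthetic dataset) that projects the data to 2D with PCA, clusters in the reduced space, and then plots boundaries exactly as in the earlier sections:
Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 10-dimensional data reduced to 2D for visualization
X_hd, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=42)
X_2d = PCA(n_components=2).fit_transform(X_hd)

kmeans_2d = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_2d)

# Same meshgrid-and-predict recipe as before, in the PCA space
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = kmeans_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=kmeans_2d.labels_, s=50, edgecolor='k')
plt.title('KMeans on PCA-Reduced Data - Cluster Boundaries')
plt.show()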
Conclusion
In this article, we explored how to visualize cluster boundaries using three popular algorithms in scikit-learn: KMeans, DBSCAN, and Agglomerative Clustering. Visualizing cluster zones is a powerful way to understand the performance of a clustering algorithm and gain insights into your dataset’s structure. By following the techniques outlined here, you can create insightful visualizations that highlight the decision boundaries of clustering algorithms.