How to Master the Popular DBSCAN Clustering Algorithm for Machine Learning
Abhishek Sharma, September 8, 2020
Overview
DBSCAN clustering is an underrated yet super useful clustering algorithm for
unsupervised learning problems
Learn how DBSCAN clustering works, why you should learn it, and how to
implement DBSCAN clustering in Python
Introduction
Mastering unsupervised learning opens up a broad range of avenues for a data scientist. There
is so much scope in the vast expanse of unsupervised learning and yet a lot of beginners in
machine learning tend to shy away from it. In fact, I’m sure most newcomers will stick to
basic clustering algorithms like K-Means clustering and hierarchical clustering.
While there’s nothing wrong with that approach, it does limit what you can do when faced
with clustering projects. And why limit yourself when you can expand your learning,
knowledge, and skillset by learning the powerful DBSCAN clustering algorithm?
Clustering is an essential technique in machine learning and is used widely across domains
and industries (think about Uber’s route optimization, Amazon’s recommendation system,
Netflix’s customer segmentation, and so on). This article is for anyone who wants to add an
invaluable algorithm to their budding machine learning skillset – DBSCAN clustering!
Here, we’ll learn about the popular and powerful DBSCAN clustering algorithm and how you
can implement it in Python. There’s a lot to unpack so let’s get rolling!
Table of Contents
Why do we need DBSCAN Clustering?
What Exactly is DBSCAN Clustering?
Reachability and Connectivity
Parameter Selection
Implementing DBSCAN Clustering in Python
Why do we need DBSCAN Clustering?
Clustering is an unsupervised learning technique where we try to group the data points
based on specific characteristics. There are various clustering algorithms with K-Means
and Hierarchical being the most used ones. Some of the use cases of clustering algorithms
include:
Document Clustering
Recommendation Engine
Image Segmentation
Market Segmentation
Search Result Grouping
Anomaly Detection
All these problems use the concept of clustering to reach their end goal. Therefore, it is
crucial to understand the concept of clustering. But here’s the issue with these two clustering
algorithms.
K-Means and Hierarchical Clustering both fail to create clusters of arbitrary shapes, and they cannot form clusters based on varying densities. That's why we need DBSCAN clustering.
Let’s try to understand it with an example. Here we have data points densely present in the
form of concentric circles:
We can see three different dense clusters in the form of concentric circles with some noise
here. Now, let’s run K-Means and Hierarchical clustering algorithms and see how they
cluster these data points.
You might be wondering why there are four colors in the graph. As I said earlier, this data contains noise, so I have treated the noise as a separate cluster, represented by the purple color. Sadly, both algorithms failed to cluster the data points, and they were not able to properly detect the noise present in the dataset. Now, let's take a look at the results from DBSCAN clustering.
Awesome! DBSCAN is not just able to cluster the data points correctly, but it also perfectly
detects noise in the dataset.
What Exactly is DBSCAN Clustering?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It groups ‘densely grouped’ data points into a single cluster. It can identify clusters in large
spatial datasets by looking at the local density of the data points. The most exciting feature
of DBSCAN clustering is that it is robust to outliers. It also does not require the number of
clusters to be told beforehand, unlike K-Means, where we have to specify the number of
centroids.
DBSCAN requires only two parameters: epsilon and minPoints. Epsilon is the radius of the
circle to be created around each data point to check the density and minPoints is the
minimum number of data points required inside that circle for that data point to be classified
as a Core point.
In higher dimensions, the circle becomes a hypersphere: epsilon becomes the radius of that hypersphere, and minPoints is the minimum number of data points required inside that hypersphere.
Here, we have some data points represented by grey color. Let’s see how DBSCAN clusters
these data points.
DBSCAN creates a circle of epsilon radius around every data point and classifies the points into Core points, Border points, and Noise. A data point is a Core point if the circle around it contains at least ‘minPoints’ number of points. If the number of points is less than minPoints, it is classified as a Border point, and if there are no other data points within the epsilon radius around a data point, it is treated as Noise.
The above figure shows us a cluster created by DBSCAN with minPoints = 3. Here, we draw a
circle of equal radius epsilon around every data point. These two parameters help in creating
spatial clusters.
All the data points with at least 3 points (including themselves) inside their circle are considered Core points, represented by the red color. All the data points with fewer than 3 but more than 1 point (including themselves) inside their circle are considered Border points, represented by the yellow color. Finally, data points with no point other than themselves inside their circle are considered Noise, represented by the purple color.
For locating data points in space, DBSCAN uses Euclidean distance, although other distance measures can also be used (like great-circle distance for geographical data). It also needs only a single scan through the dataset, whereas other algorithms may require multiple passes.
Reachability and Connectivity
These are the two concepts DBSCAN is built on. In terms of reachability and connectivity, two points can be:
Directly Density-Reachable
Density-Reachable
Density-Connected
A point X is directly density-reachable from point Y w.r.t epsilon and minPoints if X lies within the epsilon radius of Y and Y is a Core point. A point X is density-reachable from point Y w.r.t epsilon and minPoints if there is a chain of points between Y and X such that each point in the chain is directly density-reachable from the previous one.
A point X is density-connected to point Y w.r.t epsilon and minPoints if there exists a point O such that both X and Y are density-reachable from O w.r.t epsilon and minPoints. Here, both X and Y are density-reachable from O; therefore, we can say that X is density-connected to Y.
Parameter Selection
DBSCAN is sensitive to the values of epsilon and minPoints, so they have to be chosen carefully. A common rule of thumb for minPoints is:
minPoints >= Dimensions + 1.
It does not make sense to take minPoints as 1, because that would result in every point being a separate cluster. Therefore, it must be at least 3. Generally, it is taken as twice the number of dimensions, but domain knowledge also helps decide its value.
The value of epsilon can be decided from the K-distance graph. The point of maximum curvature (the elbow) in this graph gives the value of epsilon. If the chosen value of epsilon is too small, a larger number of clusters will be created and more data points will be labelled as noise; if it is too big, various small clusters will merge into one big cluster and we will lose detail.
Implementing DBSCAN Clustering in Python
Step 1- Let's start by importing the necessary libraries.
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import matplotlib
Step 2-
Here, I am creating a dataset with only two features so that we can visualize it easily. For creating the dataset, I have written a function PointsInCircum which takes the radius and number of data points as arguments and returns an array of data points which, when plotted, form a circle. We do this with the help of sine and cosine curves.
np.random.seed(42)
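The PointsInCircum helper itself is not reproduced in the text above. A minimal sketch of such a function, assuming it places n points on a circle of radius r using sine and cosine and adds a small random jitter (the jitter range is an assumption):
def PointsInCircum(r, n=100):
    # n points on a circle of radius r; the random offset gives the
    # circle some thickness so the clusters look more realistic
    return [(math.cos(2 * math.pi / n * i) * r + np.random.normal(-30, 30),
             math.sin(2 * math.pi / n * i) * r + np.random.normal(-30, 30))
            for i in range(1, n + 1)]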
Step 3-
One circle won't be sufficient to see the clustering ability of DBSCAN. Therefore, I have created three concentric circles of different radii. I will also add noise to this data so that we can see how different clustering algorithms deal with noise.
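The construction code for this dataset is not preserved above. A sketch under assumed radii, point counts, and noise range, building a DataFrame named df with the two feature columns 0 and 1 that the later snippets rely on:
# Three concentric circles of decreasing radius (values are assumptions)
circles = (PointsInCircum(500, 1000)
           + PointsInCircum(300, 700)
           + PointsInCircum(100, 300))
# Random noise points scattered across the feature space
noise = [(np.random.randint(-600, 600), np.random.randint(-600, 600)) for _ in range(300)]
df = pd.DataFrame(circles + noise)   # columns default to 0 and 1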
Step 4-
Let’s plot these data points and see how they look in the feature space. Here, I use the scatter
plot for plotting these data points. Use the following syntax:
plt.figure(figsize=(10,10))
plt.scatter(df[0],df[1],s=15,color='grey')
plt.title('Dataset',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()
Perfect! This is an excellent dataset to test our clustering algorithms on.
1. K-Means Clustering
We'll start with K-Means because it is the easiest of the clustering algorithms.
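The K-Means fitting code is not preserved in the text above. A minimal sketch, assuming four clusters (three circles plus one for the noise) and a fixed random_state for reproducibility:
from sklearn.cluster import KMeans

# Fit K-Means on the two feature columns with 4 clusters
k_means = KMeans(n_clusters=4, random_state=42)
k_means.fit(df[[0, 1]])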
df['KMeans_labels']=k_means.labels_
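To reproduce the plot being discussed, the scatter-plot pattern from Step 4 can be reused with the points colored by their cluster labels (the colormap is an assumption); the same pattern works later for the hierarchical and DBSCAN labels as well:
plt.figure(figsize=(10, 10))
plt.scatter(df[0], df[1], c=df['KMeans_labels'], s=15, cmap='viridis')
plt.title('K-Means Clustering', fontsize=20)
plt.xlabel('Feature 1', fontsize=14)
plt.ylabel('Feature 2', fontsize=14)
plt.show()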
Here, K-means failed to cluster the data points into four clusters. Also, it didn’t work well
with noise. Therefore, it is time to try another popular clustering algorithm, i.e., Hierarchical
Clustering.
2. Hierarchical Clustering
For this article, I am performing Agglomerative Clustering, but there is also another type of hierarchical clustering algorithm known as Divisive Clustering.
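The fitting code is not preserved here. A minimal sketch using scikit-learn's AgglomerativeClustering, again assuming four clusters and naming the model object model to match the line below:
from sklearn.cluster import AgglomerativeClustering

# Bottom-up (agglomerative) hierarchical clustering with 4 clusters
model = AgglomerativeClustering(n_clusters=4)
model.fit(df[[0, 1]])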
Here, I am taking labels from the Agglomerative Clustering model and plotting the results
using a scatter plot. It is similar to what I did with KMeans.
df['HR_labels']=model.labels_
Sadly, the hierarchical clustering algorithm also failed to cluster the data points properly.
3. DBSCAN Clustering
Now, it’s time to implement DBSCAN and see its power. Import DBSCAN from
sklearn.cluster. Let’s first run DBSCAN without any parameter optimization and see the
results.
from sklearn.cluster import DBSCAN
dbscan=DBSCAN()
dbscan.fit(df[[0,1]])
Here, epsilon is 0.5, and min_samples or minPoints is 5. Let’s visualize the results from this
model:
df['DBSCAN_labels']=dbscan.labels_
Interesting! All the data points are now purple, which means they are being treated as noise. This is because the default value of epsilon is very small and we didn't optimize the parameters. Therefore, we need to find suitable values for epsilon and minPoints and then train our model again.
For epsilon, I am using the K-distance graph. For plotting a K-distance Graph, we need the
distance between a point and its nearest data point for all data points in the dataset. We obtain
this using NearestNeighbors from sklearn.neighbors.
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(df[[0,1]])
distances, indices = nbrs.kneighbors(df[[0,1]])
The distances array contains, for every data point in the dataset, the distance to the point itself (zero) and the distance to its nearest neighbouring data point.
Let's plot our K-distance graph and find the value of epsilon.
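The plotting code is not preserved in the extracted text. A sketch that sorts each point's nearest-neighbour distance in ascending order and plots the curve (figure size and labels are assumptions):
# Column 1 of `distances` holds the nearest-neighbour distance for each point
distances = np.sort(distances, axis=0)
distances = distances[:, 1]
plt.figure(figsize=(10, 6))
plt.plot(distances)
plt.title('K-Distance Graph', fontsize=20)
plt.xlabel('Data points sorted by distance', fontsize=14)
plt.ylabel('Epsilon', fontsize=14)
plt.show()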
The optimum value of epsilon is at the point of maximum curvature in the K-Distance Graph,
which is 30 in this case. Now, it’s time to find the value of minPoints. The value of minPoints
also depends on domain knowledge. This time I am taking minPoints as 6:
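The fitting step for the tuned model is not shown in the text above. A minimal sketch, naming the model dbscan_opt so the lines below work unchanged, with eps and min_samples taken from the discussion above:
# DBSCAN with eps from the K-distance graph elbow and min_samples = minPoints
dbscan_opt = DBSCAN(eps=30, min_samples=6)
dbscan_opt.fit(df[[0, 1]])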
df['DBSCAN_opt_labels']=dbscan_opt.labels_
df['DBSCAN_opt_labels'].value_counts()
The most amazing thing about DBSCAN is that it separates noise from the dataset pretty
well. Here, 0, 1 and 2 are the three different clusters, and -1 is the noise. Let’s plot the results
and see what we get.
DBSCAN amazingly clustered the data points into three clusters, and it also detected noise in
the dataset represented by the purple color.
One important thing to note here is that, although DBSCAN clusters points based on density, it struggles when the clusters have widely varying densities, since a single epsilon and minPoints setting cannot suit both dense and sparse clusters. Also, as the dimensionality of the data increases, it becomes difficult for DBSCAN to form clusters and it falls prey to the Curse of Dimensionality.
End Notes
In this article, I explained the DBSCAN clustering algorithm in depth and showed how useful it is in comparison with other clustering algorithms. Also, note that a more recent extension of this algorithm, known as HDBSCAN, combines hierarchical clustering with regular DBSCAN; it is often faster and more robust than DBSCAN.