DBSCAN for Clustering Data by Location and Density in R
Last Updated: 19 Sep, 2024
Clustering is an important technique in data analysis used to group similar data points together. One of the most popular clustering algorithms is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Unlike other clustering methods such as K-Means, DBSCAN does not require the user to specify the number of clusters beforehand; instead, it forms clusters based on the density of the data points. In this article, we explain how to use DBSCAN to cluster data by location and density in R.
Understanding DBSCAN
DBSCAN classifies points into clusters based on their density. The key parameters are:
- eps (epsilon): The maximum distance between two points for them to be considered neighbors. Determine an appropriate eps value using a k-distance plot: plot the distance to the k-th nearest neighbor (where k is typically set to minPts) and look for a sharp increase in distance to select an optimal eps (see the sketch after this list). A small eps value may result in many small clusters or noise, while a large eps may merge distinct clusters into one.
- minPts (minimum points): The minimum number of points required to form a dense region (cluster). Choose minPts based on the expected size of clusters and the density of the data points. A small minPts value might result in fewer clusters with more points, while a large minPts might lead to many points being classified as noise.
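In practice, a k-distance plot can be drawn with the kNNdistplot() helper from the dbscan package. The sketch below assumes the two-column data frame named data that is created in Step 2 and minPts = 5; the eps value marked with the dashed line is only a visual guide, not a prescribed choice.
R
library(dbscan)

# Sorted distance of every point to its 5th nearest neighbor
# (k is usually set to minPts)
kNNdistplot(data, k = 5)

# Look for the "elbow" in the curve and use that distance as eps;
# here a candidate value of 1.5 is marked as a reference line
abline(h = 1.5, lty = 2)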
How Does DBSCAN Differentiate Between Points?
- Core Points: A point is a core point if it has at least minPts neighbors within a distance of eps. These points lie in a dense region and form the core of a cluster.
- Border Points: A point is a border point if it is within the eps distance of a core point but does not have enough neighbors to be a core point itself. Border points belong to a cluster but do not form a dense region on their own.
- Noise Points: Points that are neither core points nor border points are considered noise. They do not belong to any cluster because they are too far from other points. The short sketch after this list classifies points this way by hand.
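To make these definitions concrete, the sketch below classifies every point of a small toy dataset as core, border, or noise using only base R distance computations; the names pts, eps, and minPts and the chosen values are illustrative, not part of the original example.
R
set.seed(1)
# 20 tightly packed points plus one far-away outlier
pts <- cbind(x = c(rnorm(20, mean = 0, sd = 0.5), 5),
             y = c(rnorm(20, mean = 0, sd = 0.5), 5))
eps <- 1
minPts <- 4

# Count neighbors within eps (a point counts as its own neighbor,
# matching the convention used by the dbscan package)
d <- as.matrix(dist(pts))
n_neighbors <- rowSums(d <= eps)

core <- n_neighbors >= minPts
# Border: not core, but within eps of at least one core point
border <- !core & apply(d <= eps, 1, function(nb) any(nb & core))
noise <- !core & !border

table(ifelse(core, "core", ifelse(border, "border", "noise")))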
Now we implement DBSCAN for clustering data by location and density in the R programming language.
Step 1: Install and Load Required Packages
First, we will install and load the required packages.
R
install.packages("dbscan")
library(dbscan)
Step 2: Create a Sample Dataset
Now create a sample dataset.
R
set.seed(123)
data <- data.frame(
x = c(rnorm(50, mean = 2), rnorm(50, mean = 8)),
y = c(rnorm(50, mean = 2), rnorm(50, mean = 8))
)
plot(data$x, data$y, main = "Scatter Plot of Data", xlab = "X", ylab = "Y")
Output:
[Scatter plot of the sample data]
Step 3: Apply DBSCAN
Next, we need to decide on the eps and minPts values and apply DBSCAN to the data.
R
db <- dbscan(data, eps = 1.5, minPts = 5)
print(db$cluster)
Output:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[47] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[93] 2 2 2 2 2 2 2 2
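A quick way to summarize these assignments is to tabulate the cluster labels; in the dbscan package a label of 0 marks noise points. This short sketch assumes the db object created above.
R
# Number of points per cluster (a label of 0, if present, means noise)
table(db$cluster)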
Step 4: Visualize the Clusters
After running DBSCAN, we can visualize the clusters.
R
plot(data$x, data$y, col = db$cluster + 1, pch = 19, main = "DBSCAN Clustering",
xlab = "X", ylab = "Y")
legend("topright", legend = unique(db$cluster), col = unique(db$cluster) + 1, pch = 19)
Output:
[Scatter plot of the clustered data, colored by cluster label]
- Each point in the plot represents an observation from our dataset.
- Points are colored based on the cluster they belong to, as identified by the DBSCAN algorithm.
- If there are points that do not belong to any cluster (noise points), they are marked in a separate color corresponding to a cluster label of 0, the noise label used by the dbscan package.
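If noise points are present, it can be useful to pull them out for closer inspection; the minimal sketch below assumes the data and db objects from the previous steps.
R
# Rows labelled 0 are points DBSCAN did not assign to any cluster
noise_points <- data[db$cluster == 0, ]
nrow(noise_points)  # how many observations were flagged as noise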
3D Visualization for Spatial Data
For spatial data or any three-dimensional data, you can use 3D visualization:
R
install.packages("rgl")
library(rgl)
# Assuming data has a third dimension `z`
data$z <- rnorm(nrow(data), mean = 5)
# Apply DBSCAN
db <- dbscan(data[, c("x", "y", "z")], eps = 1.5, minPts = 5)
# 3D Scatter Plot
plot3d(data$x, data$y, data$z, col = db$cluster + 1, size = 5, xlab = "X",
ylab = "Y", zlab = "Z")
Output:
[3D scatter plot of the clusters]
Benefits and Limitations of DBSCAN for Clustering
Here are the main benefits and limitations of DBSCAN for clustering.
Benefits:
- Finds clusters of any shape, not just spherical ones.
- Identifies and excludes noise and outliers.
Limitations:
- Performance depends heavily on choosing eps and minPts.
- Computational cost can be high with very large or high-dimensional datasets.
- Struggles with clusters of varying densities.
- Performance can degrade in high-dimensional spaces.
- Requires optimization techniques for efficient processing of large datasets.
Practical tips for large or high-dimensional data:
- Utilize optimized libraries such as dbscan in R.
- Reduce the number of dimensions before clustering (see the sketch after this list).
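As a hedged illustration of the dimensionality-reduction tip above, the sketch below projects a hypothetical high-dimensional numeric matrix onto its first two principal components with prcomp() before running DBSCAN; the object name high_dim_data and the eps/minPts values are assumptions and would need tuning on real data.
R
library(dbscan)

# Hypothetical high-dimensional numeric data (10 noisy dimensions)
set.seed(42)
high_dim_data <- matrix(rnorm(200 * 10), ncol = 10)

# Project onto the first two principal components before clustering
pca <- prcomp(high_dim_data, scale. = TRUE)
reduced <- pca$x[, 1:2]

# DBSCAN on the reduced data; eps and minPts should still be tuned,
# e.g. with kNNdistplot(reduced, k = 5)
db_reduced <- dbscan(reduced, eps = 0.5, minPts = 5)
table(db_reduced$cluster)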
Conclusion
DBSCAN is a flexible and effective clustering algorithm for identifying clusters of varying shapes and handling noise in datasets. By understanding and tuning parameters like eps and minPts, and using tools like k-distance plots and 3D visualizations, we can uncover meaningful patterns in our data. With considerations for computational efficiency, DBSCAN becomes a valuable tool for diverse clustering tasks in R.