DBSCAN for Clustering Data by Location and Density in R

Last Updated : 19 Sep, 2024

Clustering is an important technique in data analysis for grouping similar data points. One popular clustering algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Unlike methods such as K-Means, DBSCAN does not require the number of clusters to be specified beforehand. Instead, it forms clusters from dense regions of points and labels points in sparse regions as noise. This article explains how to use DBSCAN to cluster data by location and density in R.

Understanding DBSCAN

DBSCAN classifies points into clusters based on their density. The key parameters are:

  1. eps (epsilon): The maximum distance between two points for them to be considered neighbors.
    • Determine an appropriate eps value using a k-distance plot: sort the points by the distance to their k-th nearest neighbor (where k is typically set to minPts, or to minPts - 1 when the point itself counts toward minPts, as in the dbscan package) and plot those distances. The "knee" where the curve starts to rise sharply is a reasonable candidate for eps.
    • A small eps value may leave most points as noise or split the data into many small clusters, while a large eps may merge distinct clusters into one.
  2. minPts (minimum points): The minimum number of points required to form a dense region (cluster).
    • Choose minPts based on the expected size of clusters and the density of data points.
    • A small minPts value lets even sparse groups of points qualify as clusters, so noise may be absorbed into clusters, while a large minPts might lead to many points being classified as noise.
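
The k-distance approach described above can be sketched with the kNNdist() and kNNdistplot() functions from the dbscan package. This is only an illustration under stated assumptions: the matrix pts is hypothetical synthetic data, min_pts = 5 is an arbitrary choice, and the dashed line simply marks the eps value used later in this article rather than a knee detected automatically.

```r
library(dbscan)

# Hypothetical sample data: two Gaussian blobs in 2D (synthetic, for illustration)
set.seed(123)
pts <- cbind(
  x = c(rnorm(50, mean = 2), rnorm(50, mean = 8)),
  y = c(rnorm(50, mean = 2), rnorm(50, mean = 8))
)

min_pts <- 5  # assumed value for illustration

# Distance from each point to its k-th nearest neighbor.
# k = min_pts - 1 because dbscan counts the point itself toward minPts.
kdist <- kNNdist(pts, k = min_pts - 1)
summary(kdist)

# Sorted k-distance curve; the "knee" of the curve suggests a candidate eps
kNNdistplot(pts, k = min_pts - 1)
abline(h = 1.5, lty = 2)  # reference line at the eps used later in the article
```

Reading the knee off the plot is a judgment call, so it is worth trying a few eps values around it and comparing the resulting clusters.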

How DBSCAN Differentiates Between Points

  • Core Points: A point is a core point if at least minPts points lie within a distance of eps of it (the dbscan package in R counts the point itself). These points are part of the dense region and form the core of a cluster.
  • Border Points: A point is a border point if it is within the eps distance of a core point but does not have enough neighbors to be a core point itself. Border points belong to a cluster but do not form a dense region on their own.
  • Noise Points: Points that are neither core points nor border points are considered noise. They do not belong to any cluster because they are not within eps of any core point.
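
To make the three point types concrete, the following sketch labels each point as core, border, or noise. It rests on several assumptions: pts is hypothetical synthetic data, eps = 1 and min_pts = 5 are illustrative values, and neighbor counts are computed with base R's dist() rather than with any dedicated helper from the package.

```r
library(dbscan)

# Hypothetical sample data: two Gaussian blobs in 2D (synthetic, for illustration)
set.seed(123)
pts <- cbind(
  x = c(rnorm(50, mean = 2), rnorm(50, mean = 8)),
  y = c(rnorm(50, mean = 2), rnorm(50, mean = 8))
)
eps <- 1      # illustrative value
min_pts <- 5  # illustrative value

db <- dbscan(pts, eps = eps, minPts = min_pts)

# Number of points within eps of each point (the count includes the point itself)
n_within <- rowSums(as.matrix(dist(pts)) <= eps)

# Core: enough neighbors. Noise: not core and left unassigned (label 0).
# Border: not core, but assigned to a cluster via a nearby core point.
is_core <- n_within >= min_pts
point_type <- ifelse(is_core, "core",
                     ifelse(db$cluster == 0, "noise", "border"))
print(table(point_type))
```

Border and noise points are both non-core; what separates them is whether DBSCAN attached them to a cluster.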

Now we will implement DBSCAN for clustering data by location and density in the R programming language.

Step 1: Install and Load Required Packages

First, we will install and load the required packages.

R
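# Install the package only if it is missing, then load it
if (!requireNamespace("dbscan", quietly = TRUE)) install.packages("dbscan")
library(dbscan)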

Step 2: Create a Sample Dataset

Now create a sample dataset: two groups of 50 points each, drawn from normal distributions centered at (2, 2) and (8, 8).

R
set.seed(123)
data <- data.frame(
  x = c(rnorm(50, mean = 2), rnorm(50, mean = 8)),
  y = c(rnorm(50, mean = 2), rnorm(50, mean = 8))
)

plot(data$x, data$y, main = "Scatter Plot of Data", xlab = "X", ylab = "Y")

Output:

[Figure: Scatter plot of the sample data]

Step 3: Apply DBSCAN

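Next, we need to decide on the eps and minPts values. Here we use eps = 1.5 and minPts = 5. The returned cluster vector holds one label per point, and the dbscan package uses the label 0 for noise.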

R
db <- dbscan(data, eps = 1.5, minPts = 5)
print(db$cluster)

Output:

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[47] 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[93] 2 2 2 2 2 2 2 2
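
A compact way to inspect the result is to count the points per cluster, and to re-run the algorithm with a few different eps values to see how sensitive the outcome is. The sketch below recreates the sample data so that it runs on its own; the alternative eps values (0.3 and 10) are arbitrary choices for illustration.

```r
library(dbscan)

# Same sample data as in Step 2
set.seed(123)
data <- data.frame(
  x = c(rnorm(50, mean = 2), rnorm(50, mean = 8)),
  y = c(rnorm(50, mean = 2), rnorm(50, mean = 8))
)

# Points per cluster for the parameters used above; label 0, if present, is noise
db <- dbscan(data, eps = 1.5, minPts = 5)
print(table(db$cluster))

# Re-run with illustrative eps values to see how the result changes
for (eps in c(0.3, 1.5, 10)) {
  res <- dbscan(data, eps = eps, minPts = 5)
  cat("eps =", eps,
      "| clusters:", length(setdiff(unique(res$cluster), 0)),
      "| noise points:", sum(res$cluster == 0), "\n")
}
```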

Step 4: Visualize the Clusters

After running DBSCAN, we can visualize the clusters.

R
# Add 1 to the labels so that noise (label 0) maps to color 1 (black)
plot(data$x, data$y, col = db$cluster + 1, pch = 19, main = "DBSCAN Clustering",
     xlab = "X", ylab = "Y")
cluster_ids <- sort(unique(db$cluster))
legend("topright", legend = cluster_ids, col = cluster_ids + 1, pch = 19)

Output:

[Figure: DBSCAN clustering of the sample data]

  • Each point in the plot represents an observation from our dataset.
  • Points are colored based on the cluster they belong to, as identified by the DBSCAN algorithm.
  • If there are points that do not belong to any cluster (noise points), the dbscan package gives them the cluster label 0, so with col = db$cluster + 1 they are drawn in a separate color (black).

3D Visualization for Spatial Data

For spatial data or any three-dimensional data, you can use 3D visualization:

R
install.packages("rgl")
library(rgl)

# Add a synthetic third dimension `z` for illustration
data$z <- rnorm(nrow(data), mean = 5)

# Apply DBSCAN
db <- dbscan(data[, c("x", "y", "z")], eps = 1.5, minPts = 5)

# 3D Scatter Plot
plot3d(data$x, data$y, data$z, col = db$cluster + 1, size = 5,
       xlab = "X", ylab = "Y", zlab = "Z")

Output:

[Figure: 3D scatter plot of the clustered data]

Benefits and Limitations of DBSCAN for Clustering

Here are the main benefits and limitations of DBSCAN for clustering, along with some practical tips.

  1. Benefits
    • Finds clusters of any shape, not just spherical.
    • Identifies and excludes noise and outliers.
  2. Limitations
    • Results depend heavily on the choice of eps and minPts.
    • Struggles with clusters of varying densities, because a single eps applies to the whole dataset.
    • Computational cost can be high with very large datasets, and performance can degrade in high-dimensional spaces.
  3. Practical tips
    • Utilize optimized libraries such as dbscan in R for efficient processing of large datasets.
    • Reduce the number of dimensions before clustering.
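
As a rough illustration of reducing dimensions before clustering, the sketch below projects data onto its first two principal components with base R's prcomp() and then runs DBSCAN on the projection. Everything here is an assumption made for the example: high_dim is hypothetical synthetic data, PCA is just one possible reduction method, and the eps and minPts values are untuned.

```r
library(dbscan)

# Hypothetical 10-dimensional data: two groups shifted along every axis
set.seed(123)
high_dim <- rbind(
  matrix(rnorm(50 * 10, mean = 0), ncol = 10),
  matrix(rnorm(50 * 10, mean = 6), ncol = 10)
)

# Project onto the first two principal components
pca <- prcomp(high_dim, center = TRUE, scale. = TRUE)
reduced <- pca$x[, 1:2]

# eps and minPts are illustrative and would need tuning on real data
db_reduced <- dbscan(reduced, eps = 1, minPts = 5)
print(table(db_reduced$cluster))
```

How many components to keep depends on the data; checking summary(pca) for the proportion of variance explained is a common starting point.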

Conclusion

DBSCAN is a flexible and effective clustering algorithm for identifying clusters of varying shapes and handling noise in datasets. By understanding and tuning parameters like eps and minPts, and using tools like k-distance plots and 3D visualizations, we can uncover meaningful patterns in our data. With considerations for computational efficiency, DBSCAN becomes a valuable tool for diverse clustering tasks in R.

