Data mining
36215
BSCS 5th Eve
Submitted to: Mr. Tauqeer Abbas
Key Concepts:
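Epsilon (ε): The maximum distance between two points for them to be considered
neighbors.
MinPts: The minimum number of points (counting the point itself) that must lie
within ε of a point for it to be a core point.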
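As a small illustration of the ε-neighborhood, the sketch below collects every point that lies within ε of a chosen point using plain Euclidean distance. The function name `neighbors` and the sample points are assumptions made for this example; the point itself is counted as its own neighbor.

```python
import math

def neighbors(points, index, eps):
    """Return indices of all points within eps of points[index] (itself included)."""
    px, py = points[index]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

# Sample 2D points, chosen for illustration
points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]

# Indices of the points within 2 units of (1, 2)
print(neighbors(points, 0, eps=2))
```

A point with no other point nearby gets a neighborhood containing only itself.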
Types of points:
1. Core points: A point is a core point if it has at least MinPts points within a
distance of ε.
2. Border points: A point is a border point if it has fewer than MinPts points
within ε, but is in the neighborhood of a core point.
3. Noise points: Points that are neither core points nor border points.
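The three point types can be sketched in a few lines of Python. This is a minimal illustration, not a full clustering routine: the function name `classify` and the sample points are hypothetical, and a point is assumed to count toward its own neighborhood (the convention scikit-learn documents for `min_samples`).

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    def region(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    hoods = [region(i) for i in range(len(points))]
    core = {i for i, hood in enumerate(hoods) if len(hood) >= min_pts}

    kinds = []
    for i, hood in enumerate(hoods):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in hood):
            kinds.append("border")   # not dense itself, but near a core point
        else:
            kinds.append("noise")
    return kinds

# Hypothetical points: four in a row plus one far away
sample = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 10)]
print(classify(sample, eps=1, min_pts=3))
```

Here the two middle points of the row are core points, the two end points are border points, and the distant point is noise.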
Example Dataset:
Consider the following 2D dataset:
(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)
DBSCAN Algorithm with Parameters:
ε = 2, MinPts = 2
Steps:
1. Starting with point (1, 2):
Look for points within ε = 2 distance. Points (1, 2), (2, 2), and (2, 3) are found
within this distance.
Since 3 points are found, which is at least MinPts = 2, these form a cluster.
(1, 2) becomes a core point, and (2, 2), (2, 3) are part of the same cluster.
2. Moving to point (8, 7):
Points (8, 7) and (8, 8) are within ε = 2 distance of each other. These form
another cluster.
3. Moving to point (25, 80):
This point does not have enough neighboring points within ε = 2 distance, so
it’s marked as noise.
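The walk-through above can be reproduced with a small from-scratch version of the algorithm. This is a simplified sketch in pure Python, not scikit-learn's implementation: the function name `dbscan` is an assumption, neighbors are found by brute force, and a point counts toward its own neighborhood.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    def region(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)    # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        hood = region(i)
        if len(hood) < min_pts:
            labels[i] = NOISE        # may be relabeled as border later
            continue
        cluster += 1                 # point i is core: start a new cluster
        labels[i] = cluster
        queue = deque(hood)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_hood = region(j)
            if len(j_hood) >= min_pts:
                queue.extend(j_hood) # j is also core: keep expanding
    return labels

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
print(dbscan(points, eps=2, min_pts=2))
```

With ε = 2 and MinPts = 2 the first three points share one label, (8, 7) and (8, 8) share another, and (25, 80) gets -1.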
Final Clusters:
Cluster 1: {(1, 2), (2, 2), (2, 3)}
Cluster 2: {(8, 7), (8, 8)}
Noise: {(25, 80)}
Visual Representation:
Cluster 1 would be points near (1, 2).
Cluster 2 would be points near (8, 7).
The point (25, 80) would stand alone as noise.
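import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
# DBSCAN clustering
db = DBSCAN(eps=2, min_samples=2)
labels = db.fit_predict(X)
# Plot the points, colored by cluster label (noise is -1)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("DBSCAN Clustering")
plt.show()
print(labels)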
Output:
The clusters are marked in different colors.
The point (25, 80) will be labeled -1, indicating it’s considered noise. The
printed labels are [0 0 0 1 1 -1].
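To read a label array, it can help to group the points by their label. The sketch below assumes a label array of the form `[0, 0, 0, 1, 1, -1]` for the six example points; the helper name `group_by_label` is hypothetical.

```python
from collections import defaultdict

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
labels = [0, 0, 0, 1, 1, -1]   # assumed label array; -1 marks noise

def group_by_label(points, labels):
    """Map each label to the list of points that carry it."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return dict(groups)

groups = group_by_label(points, labels)
noise = groups.pop(-1, [])     # separate the noise points from real clusters
print("clusters:", groups)
print("noise:", noise)
```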
Advantages of DBSCAN:
Can find clusters of arbitrary shapes.
Disadvantages:
Sensitive to the choice of ε and MinPts.
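The sensitivity to ε and MinPts can be seen by counting core points on the example dataset under a few settings. This is an illustrative sketch: the helper `core_points` and the parameter pairs tried are assumptions chosen for this example, and a point counts toward its own neighborhood.

```python
import math

points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]

def core_points(points, eps, min_pts):
    """Return the points that have at least min_pts points within eps."""
    return [p for p in points
            if sum(math.dist(p, q) <= eps for q in points) >= min_pts]

# Try a few (eps, min_pts) pairs and compare the core points found
for eps, min_pts in [(2, 2), (2, 3), (0.5, 2), (100, 2)]:
    found = core_points(points, eps, min_pts)
    print(f"eps={eps}, MinPts={min_pts}: {len(found)} core points")
```

Raising MinPts to 3 leaves (8, 7) and (8, 8) without a core point, so they become noise; a very small ε leaves no core points at all, while a very large ε makes every point a core point and merges everything into one cluster.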