Data Mining - Lecture 9

Data Mining and Business Intelligence
Clustering: Partitioning, Hierarchical, Density

By
Dr. Nora Shoaip

Lecture 8

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2023 - 2024
Outline
 The Basics
o What is Cluster Analysis?
o Requirements for Cluster Analysis
o Overview of methods
 Partitioning Methods
o K-Means
 Hierarchical Methods
o Agglomerative vs. Divisive
o Distance Measures
 Density-Based Methods
o DBSCAN

What is Cluster Analysis?

 Partitioning a set of data objects into subsets, or clusters
 Objects in a cluster are similar to one another, yet dissimilar to objects in other clusters
 Goal: discovery of previously unknown groups within the data
 Clusters are implicit classes
 Applications  business intelligence, image pattern recognition, web search, biology, security
 Clustering can also be used for pre-processing and for outlier detection
Requirements for Cluster Analysis
 Scalability : many methods handle only small data sets, or resort to sampling
 Handling different attribute types : most methods are limited to numerical data
 Discovering clusters with arbitrary shape : most methods find only spherical clusters
 Domain knowledge & input parameters : clustering results are sensitive to user-supplied parameters such as the number of clusters
 Handling noisy data : many methods are sensitive to noise
 Incremental clustering & insensitivity to input order : new data may require recomputing clusters from scratch, and results can depend on input order
 Handling high-dimensional data : most methods work well only in low dimensions
 Constraint-based clustering : little support for domain constraints
 Interpretability & usability : are the results comprehensible & usable?
Comparing Cluster Analysis Methods

 The partitioning criteria – flat or hierarchical?


 Separation of clusters – mutually exclusive or overlapping?
 Similarity measure – distance or connectivity/density?

Overview of Cluster Analysis Methods: Partitioning
Overview of Cluster Analysis Methods: Hierarchical
Overview of Cluster Analysis Methods: Density-based
Overview of Cluster Analysis Methods: Grid-based
Overview of Cluster Analysis Methods

Partitioning methods
— Find mutually exclusive clusters of spherical shape
— Distance-based
— May use mean or medoid to represent cluster center
— Effective for small- to medium-size data sets

Hierarchical methods
— Clustering is a hierarchy involving multiple levels
— Cannot correct erroneous merges/splits
— May consider object “linkages”

Density-based methods
— Can find arbitrarily shaped clusters
— Clusters are dense regions separated by low-density regions
— Each point must have a minimum number of points within its “neighborhood”
— May filter out outliers

Grid-based methods
— Use a multi-resolution grid data structure
— Fast processing time
Partitioning Methods
K-Means – A Centroid-Based Technique

Partitioning Methods K-Means
Cluster the eight points in the table using k-means. (15 points)
Assume that k = 3 and that initially the points are assigned to clusters as follows: C1 = {x1, x2, x3}, C2 = {x4, x5, x6}, C3 = {x7, x8}.

      A1   A2
x1     2   10
x2     2    5
x3     8    4
x4     5    8
x5     7    5
x6     6    4
x7     1    2
x8     4    9

• Apply the k-means algorithm until convergence (i.e., until the clusters do not change), using the Manhattan distance. (Hint: the Manhattan distance is d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xin - xjn|.) Make sure you clearly identify the final clustering and show your steps.
• Compute the silhouette coefficient for object x1. What is the meaning of the computed value?
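The iteration in this exercise can be checked with a short script. Below is a minimal sketch in pure Python (not from the slides; the variable names are my own); the initial assignment and the Manhattan distance follow the exercise statement.

```python
# K-means on the eight exercise points, using the Manhattan distance
# and the initial assignment given in the exercise.

points = {
    "x1": (2, 10), "x2": (2, 5), "x3": (8, 4), "x4": (5, 8),
    "x5": (7, 5), "x6": (6, 4), "x7": (1, 2), "x8": (4, 9),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def centroid(cluster):
    # Mean of the member points, coordinate by coordinate.
    xs = [points[n] for n in cluster]
    return tuple(sum(c) / len(c) for c in zip(*xs))

# Initial assignment: C1 = {x1,x2,x3}, C2 = {x4,x5,x6}, C3 = {x7,x8}.
clusters = [["x1", "x2", "x3"], ["x4", "x5", "x6"], ["x7", "x8"]]

while True:
    centers = [centroid(c) for c in clusters]
    new = [[] for _ in centers]
    for name, p in points.items():
        # Assign each point to the nearest centroid (Manhattan distance).
        best = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        new[best].append(name)
    if new == clusters:          # convergence: clusters did not change
        break
    clusters = new

print(clusters)  # → [['x1', 'x4', 'x8'], ['x3', 'x5', 'x6'], ['x2', 'x7']]
```

Running this, the clustering converges after three assignment rounds to C1 = {x1, x4, x8}, C2 = {x3, x5, x6}, C3 = {x2, x7}.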
Partitioning Methods K-Means

      x1      x2     x3     x4     x5     x6     x7     x8
    (2,10)  (2,5)  (8,4)  (5,8)  (7,5)  (6,4)  (1,2)  (4,9)
      5       1      7      5      5      5      5      5
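The silhouette question can also be checked numerically. The sketch below assumes the converged clustering C1 = {x1, x4, x8}, C2 = {x3, x5, x6}, C3 = {x2, x7} (one converged k-means result for the exercise data) and, for consistency with the exercise, measures distances with the Manhattan metric.

```python
# Silhouette coefficient for x1 under the Manhattan distance.
# Assumes the converged clustering C1={x1,x4,x8}, C2={x3,x5,x6}, C3={x2,x7}.

points = {
    "x1": (2, 10), "x2": (2, 5), "x3": (8, 4), "x4": (5, 8),
    "x5": (7, 5), "x6": (6, 4), "x7": (1, 2), "x8": (4, 9),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

own_mates = ["x4", "x8"]                          # other members of x1's cluster
other_clusters = [["x3", "x5", "x6"], ["x2", "x7"]]

# a(x1): mean distance to the other objects in x1's own cluster.
a = sum(manhattan(points["x1"], points[n]) for n in own_mates) / len(own_mates)
# b(x1): minimum, over the other clusters, of the mean distance to their objects.
b = min(sum(manhattan(points["x1"], points[n]) for n in c) / len(c)
        for c in other_clusters)
s = (b - a) / max(a, b)
print(a, b, round(s, 3))   # → 4.0 7.0 0.429
```

Under that assumed clustering, a(x1) = 4, b(x1) = 7, so s(x1) = 3/7 ≈ 0.43: the value is positive and moderately close to 1, meaning x1 is better matched to its own cluster than to its nearest neighboring cluster.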
Partitioning Methods K-Means

Factors to consider:
 Selection of k
 Selection of initial centroids
 Calculation of dissimilarity
 Calculation of cluster means
When does it fail?
 Clusters with very different sizes, or with concave (non-spherical) shapes
Hierarchical Methods
Agglomerative versus Divisive Clustering
 Hierarchical clustering  group data objects into a hierarchy or “tree” of clusters
 Agglomerative  bottom-up (merge) composition
 Initially, each object is its own cluster
 The two closest clusters are merged into a bigger cluster
 Iteratively merge until a termination condition holds or a single cluster is formed
 Divisive  top-down (split) composition
 Initially, all objects are in one big cluster
 Divide it into subclusters
 Recursively divide subclusters into even smaller subclusters
 Terminate when each object is in its own cluster or objects within clusters are similar “enough”
Hierarchical Methods
Agglomerative Clustering (AGNES)

Step 0: each object a, b, c, d, e starts in its own cluster.
Step 1: merge the two clusters at minimum distance into {a, b}.
Step 2: measure the distance between c, d, e and the individual elements in cluster {a, b}; merge any pair with minimum distance (single linkage), giving {d, e}.
Step 3: measure the distance between c and the individual elements in clusters {a, b} and {d, e}, as well as the distance between pairs in {a, b} and {d, e}; merge any pair with minimum distance (single linkage), giving {c, d, e}.
Step 4: merge the remaining clusters {a, b} and {c, d, e} into {a, b, c, d, e}.

Minimal spanning tree!
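The AGNES merge sequence can be sketched in code. The slides give no coordinates for a–e, so the distance matrix below is a hypothetical one, chosen so that single linkage reproduces the same merge order as the walk-through.

```python
# Single-linkage agglomerative clustering (AGNES) on five objects a..e.
# The pairwise distances are hypothetical, not from the slides.

dist = {
    ("a", "b"): 1, ("d", "e"): 2, ("c", "d"): 3, ("c", "e"): 4,
    ("b", "c"): 5, ("a", "c"): 6, ("b", "d"): 7, ("a", "d"): 8,
    ("b", "e"): 9, ("a", "e"): 10,
}

def d(p, q):
    return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

def single_link(c1, c2):
    # Single linkage: distance between the closest pair of members.
    return min(d(p, q) for p in c1 for q in c2)

clusters = [frozenset(x) for x in "abcde"]   # each object starts alone
merges = []
while len(clusters) > 1:
    # Find the pair of clusters at minimum single-link distance ...
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    # ... and merge them into one bigger cluster (bottom-up).
    merged = clusters[i] | clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    merges.append(sorted(merged))

print(merges)
# → [['a', 'b'], ['d', 'e'], ['c', 'd', 'e'], ['a', 'b', 'c', 'd', 'e']]
```

With these distances the merges come out as {a, b}, then {d, e}, then {c, d, e}, then everything, matching Steps 1 through 4 above.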


Hierarchical Methods
Divisive Clustering (DIANA)

Reading the same hierarchy top-down (Step 0 to Step 4): start with all objects in one cluster {a, b, c, d, e}, split it at the point of maximum distance into {a, b} and {c, d, e}, then split {c, d, e} into {c} and {d, e}, and continue until each object is in its own cluster.

How to divide a cluster is a challenge! Heuristic approaches may be used.
Density-based Methods
Density-Based Clustering Based on Connected Regions with High Density
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Density-based Methods DBSCAN

o an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ϵ-neighborhood of q
o objects q & m are density-connected if there is an object o such that q & m are both density-reachable from o
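These definitions translate almost directly into a minimal DBSCAN sketch. The data set and the parameter values (eps, min_pts) below are illustrative assumptions, not from the slides.

```python
# A minimal DBSCAN sketch in pure Python; data and parameters are illustrative.

def region_query(data, p, eps):
    # ϵ-neighborhood of point p (Euclidean distance; includes p itself).
    px, py = data[p]
    return [i for i, (x, y) in enumerate(data)
            if ((x - px) ** 2 + (y - py) ** 2) ** 0.5 <= eps]

def dbscan(data, eps, min_pts):
    labels = [None] * len(data)           # None = unvisited, -1 = noise
    cluster = 0
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:      # p is not a core object
            labels[p] = -1                # tentatively mark as noise
            continue
        labels[p] = cluster               # start a new cluster at core object p
        seeds = list(neighbors)
        while seeds:                      # expand via density-reachability
            q = seeds.pop()
            if labels[q] == -1:           # border point, reachable from a core
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:   # q is itself a core object
                seeds.extend(q_neighbors)
        cluster += 1
    return labels

data = [(1, 1), (1, 2), (2, 1), (2, 2),      # dense region A
        (8, 8), (8, 9), (9, 8), (9, 9),      # dense region B
        (5, 15)]                             # isolated point
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)   # → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each point with at least min_pts neighbors in its ϵ-neighborhood is a core object; clusters grow by following density-reachability from core objects, and the isolated point, reachable from no core object, is labeled noise (-1).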
