
Clustering

Concepts & Methods


Clustering
• Cluster: A collection of data objects
• Clustering: process of grouping a set of data objects into multiple
groups or clusters so that
– objects within a cluster have high similarity
– but are very dissimilar to objects in other clusters
• Unsupervised learning
– no predefined classes (i.e., learning by observations)
• Purpose: summarize data – improves ease of interpretation
– at the cost of some loss of detailed information
Clustering
• Applications
– As a stand-alone tool to get insight into data distribution
– Business intelligence – market segmentation, CRM
– Image recognition – handwritten character recognition
– Web search
– As a preprocessing step for other algorithms
– Outlier detection - credit card fraud

• Challenges
– Scalability
– Ability to deal with different types of attributes
– Discovery of clusters with arbitrary shape
– Requirements for domain knowledge to determine input parameters
– Ability to deal with noisy data
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-
exclusive (e.g., one document may belong to more than one class)
• Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-
based (e.g., density or contiguity)
• Clustering space
– Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)
Major Clustering Approaches
• Partitioning approach
– Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach
– Create a hierarchical decomposition of the set of data (or objects)
using some criterion
– Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach
– Based on connectivity and density functions; can find arbitrarily
shaped clusters
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach
– Based on a multiple-level granularity structure; fast processing time
– Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches
• Density-based approach
– Clusters are dense regions in the data space, separated by regions of lower density of
points
– Goal is to identify dense regions, measured by the number of objects close to a given point
– Any point x in the data set whose ϵ-neighborhood contains at least Min_Pts objects is
marked as a core point
– x is a border point if its ϵ-neighborhood contains fewer than Min_Pts objects, but x
belongs to the ϵ-neighborhood of some core point z
– If a point is neither a core nor a border point, then it is called a noise point or an outlier
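A minimal NumPy sketch (not from the slides) that applies the core/border/noise rule above; the tiny dataset, eps, and min_pts values are illustrative assumptions, and a point is counted in its own neighborhood here:

```python
import numpy as np

def classify_points(X, eps=0.5, min_pts=3):
    """Label each point as 'core', 'border', or 'noise' using the
    eps-neighborhood / Min_Pts rule described above."""
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point's neighbors are all points within eps (the point itself included)
    neighbors = dists <= eps
    counts = neighbors.sum(axis=1)
    core = counts >= min_pts
    labels = []
    for i in range(len(X)):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():      # within eps of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Illustrative dataset: a small dense blob plus one far-away outlier
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2], [8.0, 8.0]])
print(classify_points(X))   # ['core', 'core', 'core', 'core', 'noise']
```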
Major Clustering Approaches
• Grid-based approach
1. Partitioning the data space into a finite number of cells
2. Calculating the cell density for each cell
3. Sorting of the cells according to their densities
4. Identifying cluster centres (cells with the highest density)
5. Traversal of neighbor cells
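A rough Python sketch of these five steps, assuming 2-D data; cell_size and density_threshold are illustrative parameters:

```python
import numpy as np

def grid_density_clusters(X, cell_size=1.0, density_threshold=3):
    """Return groups of dense grid cells, following the five steps above."""
    # Step 1: partition the data space into cells (one grid index per point)
    cells = np.floor(X / cell_size).astype(int)
    # Step 2: cell density = number of points falling into each cell
    keys, counts = np.unique(cells, axis=0, return_counts=True)
    # Step 3: sort cells by density (densest first)
    order = np.argsort(-counts)
    # Step 4: keep only sufficiently dense cells as candidate cluster centres
    dense = {tuple(keys[i]) for i in order if counts[i] >= density_threshold}
    # Step 5: traverse neighbouring dense cells and merge them into clusters
    clusters, visited = [], set()
    for seed in dense:
        if seed in visited:
            continue
        stack, cluster = [seed], set()
        while stack:
            c = stack.pop()
            if c in visited:
                continue
            visited.add(c)
            cluster.add(c)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in visited:
                        stack.append(nb)
        clusters.append(cluster)
    return clusters
```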
Partitioning Algorithms
• Given a data set, D, of n objects, and k, the number of clusters
to form, a partitioning algorithm organizes the objects into k
partitions (k<=n), where each partition represents a cluster
• Sum of squared distances is minimized (where $c_i$ is the centroid
or medoid of cluster $C_i$):

  $E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, c_i)^2$

• Find a partition of k clusters that optimizes the chosen
partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partition around medoids): Each cluster is
represented by one of the objects in the cluster
k-Means Clustering - Illustration
• k=3
• Step 1: Initialize cluster centers
– randomly pick three points C1, C2 and C3
• Step 2: Assign observations to the closest
cluster center
– For each point compute distance to each cluster
center
– Assign each point to the clusters based on the
minimum distance to the cluster center
• Step 3: Revise cluster centers as mean of
assigned observations
• Step 4: Repeat Step 2 and Step 3 until
convergence
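A minimal NumPy sketch of these four steps (Lloyd's algorithm); the random initialisation and convergence test are illustrative choices, not the slides' own code:

```python
import numpy as np

def k_means(X, k=3, max_iter=100, seed=0):
    """k-means following Steps 1-4 above."""
    rng = np.random.default_rng(seed)
    # Step 1: initialise cluster centres by picking k random points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centre
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: revise centres as the mean of the assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop once the centres no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```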
k-Means: Comments
• Choosing the right k
– Elbow method
• Centroid Initialisation
– k-means++

• Sensitive to outliers
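Assuming scikit-learn is available, a short sketch of the elbow method combined with k-means++ initialisation; the synthetic dataset and the range of k values are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic placeholder data: replace with the dataset to be clustered
X = np.random.default_rng(0).normal(size=(200, 2))

# Elbow method: track the within-cluster sum of squares (inertia) as k grows
# and look for the "elbow" where further increases in k give little improvement.
for k in range(1, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```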
k-Medoids Clustering Method
• Instead of taking the mean value of the objects in a cluster as
a reference point, an actual object (the medoid) is used
• Absolute-error criterion is used:
  $E = \sum_{i=1}^{k} \sum_{p \in C_i} dist(p, o_i)$, where $o_i$ is the medoid of cluster $C_i$
[Figure: side-by-side scatter plots comparing k-means and k-medoids cluster assignments]
k-Medoids Clustering Method
• Find representative objects (medoids) in clusters

• PAM (Partitioning Around Medoids)


– Starts from an initial set of medoids and iteratively replaces
one of the medoids by one of the non-medoids if it
improves the total distance of the resulting clustering

– PAM works effectively for small data sets, but does not
scale well for large data sets (due to the computational
complexity)
– O(k(n-k)²) for each iteration, where n = # data objects, k = # clusters
k-Medoids Algorithm: Illustration
• K = 2
• Arbitrarily choose k objects as initial medoids
• Assign each remaining object to the nearest medoid (Total Cost = 20)
• Do loop until no change:
  – Randomly select a non-medoid object, O_random
  – Compute the total cost of swapping a medoid with O_random (Total Cost = 26)
  – Swap the medoid and O_random if the quality is improved
[Figure: scatter plots showing the initial medoid selection, the assignment of the remaining objects, and a candidate swap]
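A compact Python sketch of the PAM loop illustrated above: start from arbitrary medoids, score each configuration with the absolute-error (total distance) criterion, and keep swapping a medoid with a non-medoid whenever the swap reduces the total cost. Purely illustrative:

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Absolute-error criterion: sum of distances from each object
    to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(X, k=2, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as initial medoids
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:                              # "do loop until no change"
        improved = False
        for m in list(medoids):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = [o if idx == m else idx for idx in medoids]
                # Swap only if the total cost (quality) improves
                if total_cost(X, np.array(candidate)) < total_cost(X, np.array(medoids)):
                    medoids, improved = candidate, True
    return medoids
```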
Hierarchical Clustering
• Group data objects into a hierarchy
• Produces a set of nested clusters organized as a hierarchical tree
• Dendrogram – a tree structure representing the sequence of merging
decisions
• Useful for data summarization and visualization
• Does not require the number of clusters k as an input
• Needs a termination condition
• Methods – agglomerative, divisive, BIRCH, Chameleon

[Figure: dendrogram – error on the vertical axis, data objects on the horizontal axis]
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative: bottom-up (merging) fashion
1. Start with the points as individual clusters
2. At each step, merge the closest pair of clusters until only one
cluster (or k clusters) left

– Divisive: top-down (splitting) fashion


1. Start with one, all-inclusive cluster
2. At each step, split a cluster until each cluster contains a point (or
there are k clusters)
– Requires at most n iterations
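A from-scratch sketch of the agglomerative (bottom-up) procedure, here using MIN (single-link) distance between clusters and merging until k clusters remain; purely illustrative:

```python
import numpy as np

def agglomerative_min_linkage(X, k=2):
    """Start with singleton clusters and repeatedly merge the closest
    pair (MIN / single linkage) until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]              # step 1: singletons
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > k:                             # step 2: merge closest pair
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # MIN linkage: distance between the two closest members
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best[2]:
                    best = (a, b, dist)
        a, b, _ = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```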
Agglomerative vs. Divisive
How to Define Inter-Cluster Similarity

• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective
function
– Ward’s Method uses squared error
(ESS)
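Assuming SciPy is available, the similarity definitions above map onto its linkage method names roughly as shown in this sketch (the data are synthetic placeholders):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))

# MIN -> 'single', MAX -> 'complete', Group Average -> 'average',
# Distance Between Centroids -> 'centroid', Ward's (squared error) -> 'ward'
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per method
```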
Ward’s Method - Example
• Five customers – A, B, C, D, E
• Ratings provided – 2, 5, 9, 10, 15, respectively, on a 20-point scale
• Cluster customers based on ratings
• Stage 1:
– Five singleton clusters (one customer each), ESS = 0
– No loss of information since there is no
clustering
• Stage 2:
– Combining C and D as they are closest; Centroid = 9.5
– Four cluster solution: {A, B, {C,D}, E}
– ESS = 0 + 0 + [(9-9.5)² + (10-9.5)²] + 0 = 0.5
• Stage 3:
– ESS for the solution {{A,B}, {C,D}, E} = 5.0
– ESS for the solution {A, B, {C,D,E}} = 20.7
– ESS for the solution {{A,E}, {C,D}, B} = 85.0
– ESS for the solution {A, {B,E}, {C,D}} = 50.5
– ESS for the solution {A, {B,C,D}, E} = 14.0
– ESS for the solution {{A,C,D}, B, E} = 38.0
Ward’s Method - Example
• Stage 4:
– ESS for the solution {{A,B,C,D}, E} = 41.0
– ESS for the solution {{A,B,E}, {C,D}} = 93.2
– ESS for the solution {{A,B}, {C,D,E}} = 25.2

• Stage 5:
– ESS for the solution {{A,B, C,D,E}} = 98.8
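As a quick sanity check of the arithmetic above, a short Python sketch that recomputes the error sum of squares (ESS) for a few of the candidate partitions:

```python
import numpy as np

ratings = {"A": 2, "B": 5, "C": 9, "D": 10, "E": 15}

def ess(partition):
    """ESS = squared deviations from each cluster's mean, summed over clusters."""
    total = 0.0
    for cluster in partition:
        vals = np.array([ratings[c] for c in cluster])
        total += ((vals - vals.mean()) ** 2).sum()
    return total

print(ess([("A",), ("B",), ("C", "D"), ("E",)]))   # stage 2: 0.5
print(ess([("A", "B"), ("C", "D"), ("E",)]))       # stage 3: 5.0
print(ess([("A", "B"), ("C", "D", "E")]))          # stage 4: ~25.2
print(ess([("A", "B", "C", "D", "E")]))            # stage 5: ~98.8
```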
Ward’s Method - Comments
• Clusters from previous stage are never taken apart
• Less sensitive to outliers
• Produces spherical, tightly bound clusters – biased towards globular
clusters
• Can be used to decide the value of k
