Cluster Analysis

Cluster analysis is an unsupervised machine learning technique used to group unlabeled data points into clusters based on similarities. The goal is to maximize similarities within clusters and maximize differences between clusters. It is used in various domains like information retrieval, finance, biology, and marketing to group related documents, stocks with similar price movements, genes with similar functionality, and customers with similar characteristics respectively. The quality and interpretation of clustering results depend on the chosen similarity measure and number of clusters.


Cluster Analysis

Basic Concepts and Algorithms


What is Cluster Analysis?
 Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
 Cluster analysis
– Grouping a set of data objects into clusters

– Intra-cluster distances are minimized, while inter-cluster distances are maximized
Applications of Cluster Analysis

 Understanding
– Information Retrieval
 Group related documents for browsing
– Finance
 Group stocks with similar price fluctuations
– Biology
 Group genes and proteins that have similar functionality (e.g., clustering gene expression data)
– Marketing
 Help marketers discover distinct groups in their customer bases, and develop targeted marketing programs

Example: discovered stock clusters and their industry groups
1. Technology1-DOWN: Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN
2. Technology2-DOWN: Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
3. Financial-DOWN: Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
4. Oil-UP: Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Applications of Cluster Analysis

 Image segmentation
– Goal: Break up the image
into meaningful or
perceptually similar regions

 Summarization
– Reduce the size of large
data sets

Clustering precipitation in
Australia
Notion of a Cluster can be Ambiguous

How many clusters? The same set of points can reasonably be grouped into two clusters, four clusters, or six clusters.

Clustering results are crucially dependent on the measure of similarity (or distance) between
“points” to be clustered
Measure the Quality of Clustering

 Quality of clustering:
– There is usually a separate “quality” function that
measures the “goodness” of a cluster.
– It is hard to define “similar enough” or “good
enough”
 The answer is typically highly subjective

Considerations for Cluster Analysis
 Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
 Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
 Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
 Clustering space (partial versus complete)
– Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
 Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Types of Clusters

 Well-separated clusters
 Center-based clusters
 Contiguous clusters
 Density-based clusters
 Property or Conceptual
 Described by an Objective Function
Types of Clusters: Well-Separated

 Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every point in the cluster than to any
point not in the cluster.

3 well-separated clusters
Types of Clusters: Center-Based

 Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster
– The center of a cluster is often a centroid

4 center-based clusters
Types of Clusters: Contiguity-Based

 Contiguous Cluster (Nearest neighbor or Transitive)


– A cluster is a set of points such that a point in a cluster is
closer (or more similar) to one or more other points in the
cluster than to any point not in the cluster.
Types of Clusters: Density-Based

 Density-based
– A cluster is a dense region of points, which is separated by
low-density regions, from other regions of high density.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters
Types of Clusters: Conceptual Clusters

 Shared Property or Conceptual Clusters


– Finds clusters that share some common property or represent
a particular concept.

2 Overlapping Circles
Characteristics of the Input Data Are Important

 Type of proximity or density measure


– This is a derived measure, but central to clustering
 Sparseness
– Dictates type of similarity
– Adds to efficiency
 Attribute type
– Dictates type of similarity
 Type of Data
– Dictates type of similarity
– Other characteristics, e.g., autocorrelation
 Dimensionality
 Noise and Outliers
 Type of Distribution
Similarity and Dissimilarity
Type of data in clustering analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & addition
– Ratio attribute: all 4 properties
Types of Attributes

 There are different types of attributes


– Nominal
 Examples: ID numbers, eye color, zip codes
– Ordinal
 Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
– Interval
 Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
– Ratio
 Examples: temperature in Kelvin, length, time, counts
Attribute Type | Description | Examples | Operations

Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test

Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests

Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests

Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Attribute Level | Transformation | Comments

Nominal | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?

Ordinal | An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. | An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Interval | new_value = a * old_value + b, where a and b are constants | The Fahrenheit and Celsius temperature scales differ in where their zero value is and in the size of a unit (degree).

Ratio | new_value = a * old_value | Length can be measured in meters or feet.
Similarity and Dissimilarity

 Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
 Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
 Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.
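The slide's table of per-attribute measures did not survive conversion. As a minimal sketch only, the standard choices can be written out in Python; the function names and the 1/(1+d) similarity for interval attributes are illustrative assumptions, not the slide's exact table.

def nominal_proximity(p, q):
    # Nominal: objects either match or they don't
    d = 0 if p == q else 1
    return d, 1 - d                      # (dissimilarity, similarity)

def ordinal_proximity(p, q, n_levels):
    # Ordinal: treat values as ranks 0 .. n_levels-1, then normalize
    d = abs(p - q) / (n_levels - 1)
    return d, 1 - d

def interval_proximity(p, q):
    # Interval/ratio: absolute difference; 1/(1+d) is one common similarity
    d = abs(p - q)
    return d, 1 / (1 + d)

# Example: taste scores 7 and 3 on a 1-10 ordinal scale (ranks 6 and 2)
print(ordinal_proximity(6, 2, 10))       # (0.444..., 0.555...)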


Distance Measures

 Manhattan Distance

dist_{L1}(p, q) = \sum_{k=1}^{n} |p_k - q_k|

 Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
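As a quick illustration of the two formulas, here is a plain-Python sketch (nothing library-specific; the function names are mine):

import math

def manhattan_distance(p, q):
    # L1 distance: sum of absolute coordinate differences
    return sum(abs(pk - qk) for pk, qk in zip(p, q))

def euclidean_distance(p, q):
    # L2 distance: square root of the sum of squared differences
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

p1, p2 = (0, 2), (2, 0)
print(manhattan_distance(p1, p2))   # 4
print(euclidean_distance(p1, p2))   # 2.828...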
Manhattan Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Actual points in 2D

Manhattan distance is the sum of the absolute values of the differences of the coordinates.

Distance between p1 and p2: (x1, y1) = (0, 2), (x2, y2) = (2, 0)
d = |0 – 2| + |2 – 0| = 2 + 2 = 4

Distance between p1 and p3: (x1, y1) = (0, 2), (x3, y3) = (3, 1)
d = |0 – 3| + |2 – 1| = 3 + 1 = 4

Distance between p1 and p4: (x1, y1) = (0, 2), (x4, y4) = (5, 1)
d = |0 – 5| + |2 – 1| = 5 + 1 = 6

L1 distance matrix:
L1   p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0
Euclidean Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Actual points in 2D

Euclidean distance matrix:
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0

Distance Matrix
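Both matrices above can be reproduced in a few lines; this sketch assumes NumPy and SciPy are available (the library choice is not part of the slides):

import numpy as np
from scipy.spatial.distance import cdist

pts = np.array([[0, 2],    # p1
                [2, 0],    # p2
                [3, 1],    # p3
                [5, 1]])   # p4

print(cdist(pts, pts, metric="cityblock"))               # L1 (Manhattan) matrix
print(np.round(cdist(pts, pts, metric="euclidean"), 3))  # L2 (Euclidean) matrix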
Clustering Algorithms

 K-means and its variants

 Hierarchical clustering

 Density-based clustering
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset
Typical methods: k-means, k-medoids, CLARANS

Original Points A Partitional Clustering


Hierarchical Clustering

A set of nested clusters organized as a hierarchical tree

[Figure: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]
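The dendrogram figures did not survive conversion. As a hedged sketch, a nested clustering like the one pictured can be built and drawn with SciPy; the choice of SciPy and of single linkage is my assumption, not something stated on the slide.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])   # p1..p4 from the distance slides

# Agglomerative clustering: repeatedly merge the two closest clusters.
# 'single' linkage uses the minimum pairwise distance between clusters.
Z = linkage(pts, method="single", metric="euclidean")

dendrogram(Z, labels=["p1", "p2", "p3", "p4"])
plt.show()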
K-Means : Partitioning approach

 An iterative clustering algorithm


 Each cluster is associated with a centroid (center point)
 Each point is assigned to the cluster with the closest centroid
 Number of clusters, K, must be specified

Initialize: pick K random points as cluster centers
Repeat:
  1. Assign each data point to the closest cluster center
  2. Change each cluster center to the average of its assigned points
Until the centroids don't change
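A minimal NumPy sketch of the pseudocode above (random initialization, a single run, no handling of empty clusters); it is meant to mirror the two iterative steps, not to stand in for a library implementation:

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K random points as cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 1: assign each data point to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each cluster center to the average of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Until the centroids don't change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

labels, centers = kmeans(np.random.rand(200, 2), k=3)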
K-means clustering Example

Iteration 1
Iterative Step 1: Assign data points to the closest cluster center.
Iterative Step 2: Change the cluster center to the average of the assigned points.
K-means clustering Example

Iteration 2
Iterative Step 1: Assign data points to the closest cluster center.
Iterative Step 2: Change the cluster center to the average of the assigned points.
Repeat until convergence.
K-means clustering Example

Iteration 3
Iterative Step 1: Assign data points to the closest cluster center.
Iterative Step 2: Change the cluster center to the average of the assigned points.
Repeat until convergence.
K-means Clustering – Details

 Initial centroids are often chosen randomly.


– Clusters produced vary from one run to another.

 The centroid is (typically) the mean of the points in the


cluster.

 ‘Closeness’ is measured by Euclidean distance, cosine


similarity, correlation, etc.
– K-means will converge for common similarity measures mentioned
above.
K-means Clustering – Details

 Most of the convergence happens in the first few


iterations.
– Often the stopping condition is changed to ‘Until relatively few
points change clusters’

 Complexity is O( n * K * I )
– n = number of points,
– K = number of clusters,
– I = number of iterations
Two different K-means Clusterings
[Figure: the same set of original points and two different K-means clusterings of them — an optimal clustering and a sub-optimal clustering]
Importance of Choosing Initial Centroids
[Figure: snapshots of K-means at iterations 1 through 6 for one choice of initial centroids]
Importance of Choosing Initial Centroids

[Figure: the K-means clustering reached at iteration 6]
Importance of Choosing Initial Centroids …

[Figure: snapshots of K-means at iterations 1 through 5 for a different choice of initial centroids]
Importance of Choosing Initial Centroids …

[Figure: the K-means clustering reached at iteration 5 for the second choice of initial centroids]
Evaluating K-means Clusters

 Most common measure is Sum of Squared Error (SSE)


– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

x is a data point in cluster Ci and mi is the representative point for cluster Ci
– One can show that mi corresponds to the center (mean) of the cluster

– Given two clusters, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
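Translated directly into code, SSE is a short reduction over clusters (a sketch; scikit-learn's KMeans reports the same quantity as its inertia_ attribute):

import numpy as np

def sse(X, labels, centers):
    # For each cluster i, sum the squared Euclidean distances of its points
    # to the representative point (centroid) m_i, then sum over all clusters.
    return sum(np.sum((X[labels == i] - c) ** 2) for i, c in enumerate(centers))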
Solutions to Initial Centroids Problem

 Multiple runs
– Helps, but probability is not on your side
 Sample and use hierarchical clustering to determine initial
centroids
 Select more than k initial centroids and then select among
these initial centroids
– Select most widely separated
 Postprocessing
 Bisecting K-means
– Not as susceptible to initialization issues
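In practice, the first two remedies are built into common libraries. For example, scikit-learn's KMeans combines multiple runs with a spread-out ("k-means++") initialization; the parameter values below are illustrative, not prescribed by the slide.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)          # placeholder data

# n_init re-runs K-means with different initial centroids and keeps the run
# with the lowest SSE; init="k-means++" picks widely separated initial centroids.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)                  # SSE of the best run
print(km.cluster_centers_)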
Pre-processing and Post-processing

 Pre-processing
– Normalize the data
– Eliminate outliers

 Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
Limitations of K-means

 K-means has problems when clusters have different
– Sizes
– Densities
– Non-globular shapes

 K-means has problems when the data contains


outliers.
Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)


Overcoming K-means Limitations

Original Points K-means Clusters


Overcoming K-means Limitations

Original Points K-means Clusters


Density-Based Clustering Methods

 Clustering based on density (local cluster criterion), such as density-


connected points
 Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
 Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

DBSCAN: Density Based Spatial Clustering of Applications with Noise
 Locates regions of high density that are separated by regions of low density.
 In the center-based approach, the density of a point is the number of points within a specified radius, Eps, of that point.
 A cluster is defined as a maximal set of density-connected points.

In the center-based approach, we can classify a point as being
1) in the interior of a dense region (core)
2) on the edge of a dense region (border)
3) in a sparsely occupied region (noise)

[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN

 DBSCAN is a density-based algorithm.


– Density = number of points within a specified radius (Eps)

– A point is a core point if it has more than a specified number of points (MinPts) within Eps
 These are points that are at the interior of a cluster

– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

– A noise point is any point that is not a core point or a border point.
DBSCAN: Core, Border, and Noise Points
DBSCAN: The Algorithm

– Label all points as core, border or noise.

– Eliminate noise points.

– Put an edge between all core points that are within


Eps of each other.

– Make each group of connected core points into a


separate cluster.

– Assign each border point to one of the clusters of its


associated core points.
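A minimal usage sketch with scikit-learn's DBSCAN (the parameter values are placeholders; eps and min_samples play the roles of Eps and MinPts above):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)                 # placeholder data

db = DBSCAN(eps=0.1, min_samples=5).fit(X)

labels = db.labels_                        # cluster id per point; -1 marks noise
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points; other clustered points are border points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "noise points")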

DBSCAN Algorithm

 Eliminate noise points


 Perform clustering on the remaining points
DBSCAN Algorithm

 Time Complexity
– O(N × time to find points in the Eps-neighbourhood)
– where N is the number of points
– Worst case O(N²)
– KD-trees allow efficient retrieval of all points within a given distance of a specified point, reducing this to O(N log N)

 Space Complexity
– O(N)
DBSCAN: Core, Border and Noise Points

[Figure: original points and their DBSCAN point types (core, border, noise), with Eps = 10 and MinPts = 4]

When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise
• Can handle clusters of different shapes and sizes
DBSCAN: Sensitive to Parameters

DBSCAN online Demo:


http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
