Cluster Analysis Introduction
Introduction to Cluster Analysis
Edgar M. Adina
Instructor
What is Cluster Analysis?
Cluster: a collection of data objects
o Similar to one another within the same cluster
o Dissimilar to the objects in other clusters
Cluster analysis
o Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
o create thematic maps in GIS by clustering feature spaces
o detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
o Document classification
o Cluster web log data to discover groups of similar access patterns
Some specific applications
City-planning: Identifying groups of houses according to their house type, value,
and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation
database
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
Illustration: Thematic Maps
Illustration: Web Usage Mining
Clustering Approaches
Partitioning Algorithms
o Partition the objects into k clusters, minimizing some objective function
Hierarchical Algorithms
o Create a hierarchical decomposition of the set of objects
Density-based
o Find clusters based on connectivity and density functions
Other methods
o Grid-based
o Neural Networks
o Graph-theoretical methods, and many others…
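As an illustration of the partitioning approach listed above, here is a minimal sketch of k-means, one well-known partitioning algorithm. The slides do not name a specific algorithm, so the choice of k-means, the function names `kmeans` and `sse`, the sample points, and the hand-picked initial centers are all assumptions made for this example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Alternate assignment and update steps, starting from the given centers."""
    centers = [list(c) for c in centers]
    k = len(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

def sse(points, labels, centers):
    """Objective function: sum of squared distances to the assigned center."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical sample data: two visually separate groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
# Assumption: initial centers are picked by hand, one from each group.
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels, centers, sse(points, labels, centers))
```

The result depends on the initial centers; practical implementations usually try several random starts and keep the partition with the lowest objective value.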
Good Clustering
A good clustering method will produce high quality clusters with
o high intra-class similarity
o low inter-class similarity
The quality of a clustering result depends on both the similarity measure
used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover
some or all of the hidden patterns.
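One rough way to check the two quality criteria above is to compare the average distance between objects in the same cluster with the average distance between objects in different clusters. The sketch below assumes Euclidean distance as the (dis)similarity measure; the function name `intra_inter_distances` and the sample data are hypothetical.

```python
import math
from itertools import combinations

def intra_inter_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for (p, a), (q, b) in combinations(list(zip(points, labels)), 2):
        # Same label: the pair counts toward intra-class similarity.
        # Different labels: the pair counts toward inter-class similarity.
        (intra if a == b else inter).append(math.dist(p, q))
    # Assumes at least one pair of each kind exists; otherwise this divides by zero.
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical sample data with an assumed cluster assignment.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels = [0, 0, 0, 1, 1, 1]
intra, inter = intra_inter_distances(points, labels)
print(intra, inter)
```

A small within-cluster distance corresponds to high intra-class similarity, and a large between-cluster distance corresponds to low inter-class similarity.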
Requisites
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input
parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
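Data matrix (two modes): n objects described by p variables

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (one mode): pairwise dissimilarities between the n objects

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]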
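The two structures above are related: a dissimilarity matrix can be computed from a data matrix once a distance function is chosen. The following sketch assumes Euclidean distance and plain Python lists; the function name `dissimilarity_matrix` and the sample rows are hypothetical.

```python
import math

def dissimilarity_matrix(data):
    """Build the n-by-n matrix of pairwise distances d(i, j) from an n-by-p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # diagonal stays 0: d(i, i) = 0
    for i in range(n):
        for j in range(i):
            # Only the lower triangle needs computing; symmetry gives the rest.
            d[i][j] = d[j][i] = math.dist(data[i], data[j])
    return d

# Hypothetical data matrix: 3 objects (rows), 2 variables (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d = dissimilarity_matrix(data)
for row in d:
    print(row)
```

The slide shows only the lower triangle because d(i, j) = d(j, i); the sketch fills both halves so that either index order can be used for lookup.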
Quality of Clustering (Metrics)
Dissimilarity/similarity metric: similarity is expressed in terms of a distance
function d(i, j), which is typically a metric.
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-
scaled, Boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and
data semantics.
It is hard to define “similar enough” or “good enough”
o the answer is typically highly subjective.
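To make the idea of per-variable weights concrete, here is a sketch of a weighted Minkowski distance for interval-scaled variables. The choice of this distance family, the function name `weighted_minkowski`, and the example weights are assumptions for illustration; the slides do not prescribe a particular formula at this point.

```python
def weighted_minkowski(x, y, weights, q=2):
    """Weighted Minkowski distance: (sum_f w_f * |x_f - y_f|**q) ** (1/q).

    q=1 gives a weighted Manhattan distance, q=2 a weighted Euclidean distance.
    """
    total = sum(w * abs(a - b) ** q for a, b, w in zip(x, y, weights))
    return total ** (1.0 / q)

x, y = (0.0, 0.0), (3.0, 4.0)
print(weighted_minkowski(x, y, weights=(1, 1), q=2))  # unweighted Euclidean
print(weighted_minkowski(x, y, weights=(1, 1), q=1))  # unweighted Manhattan
print(weighted_minkowski(x, y, weights=(1, 0), q=2))  # second variable ignored
```

Setting a weight to zero removes that variable from the comparison, which shows how strongly the choice of weights can change which objects count as "similar".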
Sample Distance Functions
Measuring Similarity
Types of Data for CA
Interval-scaled variables:
Binary variables:
where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$.
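The formula above defines m_f as the mean of variable f over all n objects. A common use of this mean is to standardize interval-scaled variables before computing distances. The sketch below computes m_f per column and then, as an assumption not recoverable from the slide text, divides by the mean absolute deviation to obtain standardized scores. The function names and sample data are hypothetical.

```python
def column_means(data):
    """m_f = (x_1f + x_2f + ... + x_nf) / n for each variable f."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

def standardize(data):
    """Assumed standardization: z_if = (x_if - m_f) / s_f,
    where s_f is the mean absolute deviation of variable f."""
    n = len(data)
    means = column_means(data)
    # s_f = (|x_1f - m_f| + ... + |x_nf - m_f|) / n; zero for a constant column,
    # which would make the division below fail.
    devs = [sum(abs(x - m) for x in col) / n for col, m in zip(zip(*data), means)]
    return [[(x - m) / s for x, m, s in zip(row, means, devs)] for row in data]

# Hypothetical data matrix: two variables on very different scales.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means = column_means(data)
z = standardize(data)
print(means)
print(z)
```

After standardization both columns have the same scores, so neither variable dominates a distance computation merely because of its measurement unit.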