CE175-4C
Database Management in Construction
Introduction to Cluster Analysis
Edgar M. Adina
Instructor
What is Cluster Analysis?
Cluster: a collection of data objects
o Similar to one another within the same cluster
o Dissimilar to the objects in other clusters
Cluster analysis
o Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms
General Applications of Clustering
Pattern Recognition
Spatial Data Analysis
o create thematic maps in GIS by clustering feature spaces
o detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
o Document classification
o Cluster Weblog data to discover groups of similar access patterns
Some specific applications
City-planning: Identifying groups of houses according to their house type, value,
and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along
continental faults
Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation
database
Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost
Illustration: Thematic Maps
Illustration: Web Usage Mining
Clustering Approaches
Partitioning Algorithms
o Find k partitions that minimize an objective function
Hierarchical Algorithms
o Create a hierarchical decomposition of the set of objects
Density-based
o Find clusters based on connectivity and density functions
Other methods
o Grid-based
o Neural Networks
o Graph-theoretical methods, and many others…
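As an illustration of what "an objective function" can mean for a partitioning algorithm, the sketch below scores a given partition by its within-cluster sum of squared distances to each cluster centroid. This particular objective is an assumption chosen for illustration (the slides do not name one), and `within_cluster_sse`, `points`, and the label lists are hypothetical names.

```python
def within_cluster_sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid.

    points: list of equal-length numeric tuples
    labels: list of cluster ids, one per point
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(p[k] for p in members) / len(members) for k in range(dim)]
        total += sum((p[k] - centroid[k]) ** 2 for p in members for k in range(dim))
    return total


# Hypothetical data: two tight groups far apart along the first coordinate.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = within_cluster_sse(points, [0, 0, 1, 1])  # groups nearby points together
bad = within_cluster_sse(points, [0, 1, 0, 1])   # mixes the two groups
print(good, bad)
```

A partitioning algorithm would search over label assignments for a low value of such a score; here the first labelling scores far lower than the second.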
Good Clustering
A good clustering method will produce high quality clusters with
o high intra-class similarity
o low inter-class similarity
The quality of a clustering result depends on both the similarity measure
used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover
some or all of the hidden patterns.
Requisites
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input
parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to handle high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Data Structures
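Data matrix (two modes): n objects by p variables
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity matrix (one mode): n by n, lower triangular
$$
\begin{bmatrix}
0 & & & & \\
d(2,1) & 0 & & & \\
d(3,1) & d(3,2) & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$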
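The relationship between the two structures can be sketched in a few lines: each entry d(i, j) of the dissimilarity matrix is computed from rows i and j of the data matrix. The sketch assumes Euclidean distance as the distance function (the slides do not fix one here), and `euclidean`, `dissimilarity_matrix`, and `X` are hypothetical names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dissimilarity_matrix(data, dist=euclidean):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i,0), ..., d(i,i-1) followed by the diagonal 0,
    mirroring the one-mode layout (indices are 0-based here).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0] for i in range(n)]


# Hypothetical data matrix: 3 objects (rows) by 2 variables (columns).
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(X)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0.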
Quality of Clustering (Metrics)
Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function d(i, j), which is typically a metric
There is a separate “quality” function that measures the “goodness” of a cluster.
The definitions of distance functions are usually very different for interval-
scaled, Boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and
data semantics.
It is hard to define “similar enough” or “good enough”
o the answer is typically highly subjective.
Sample Distance Functions
Measuring Similarity
Types of Data for CA
Interval-scaled variables:
Binary variables:
Nominal, ordinal, and ratio variables:
Variables of mixed types:
Interval-Valued Data
Standardize data
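o Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$
o Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$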
Using mean absolute deviation is more robust than using
standard deviation
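A minimal sketch of this standardization for a single variable f, using the mean absolute deviation as the scale. The function name `standardize` and the sample values are hypothetical; the zero-deviation check is an added safeguard, not something the slides specify.

```python
def standardize(column):
    """Standardize one variable using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                          # mean of the variable
    s_f = sum(abs(x - m_f) for x in column) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in column]


# Hypothetical measurements of one interval-scaled variable.
values = [2.0, 4.0, 6.0, 8.0]
z = standardize(values)
print(z)
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so outliers remain visible after standardization.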
Binary Variables
A contingency table for binary data:
                       Object j
                    1        0        sum
Object i     1      a        b        a + b
             0      c        d        c + d
            sum   a + c    b + d       p
Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$d(i, j) = \frac{b + c}{a + b + c}$$
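The two coefficients can be sketched directly from the contingency counts. The function names and the sample vectors `i_obj` and `j_obj` are hypothetical; returning 0.0 when a + b + c = 0 in the Jaccard case is an assumed convention that the slides do not address.

```python
def contingency(x, y):
    """Count (a, b, c, d) for two equal-length 0/1 vectors.

    a: both 1, b: x is 1 and y is 0, c: x is 0 and y is 1, d: both 0.
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Dissimilarity for symmetric binary variables: (b + c) / (a + b + c + d)."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_dissimilarity(x, y):
    """Dissimilarity for asymmetric binary variables: (b + c) / (a + b + c)."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        return 0.0  # assumed convention: two all-zero objects are identical
    return (b + c) / (a + b + c)


# Hypothetical binary objects described by six variables.
i_obj = [1, 0, 1, 0, 0, 0]
j_obj = [1, 0, 1, 0, 1, 0]
counts = contingency(i_obj, j_obj)
sm = simple_matching(i_obj, j_obj)
jd = jaccard_dissimilarity(i_obj, j_obj)
print(counts, sm, jd)
```

The Jaccard form ignores d, the joint absences, which is why it suits asymmetric variables where a shared 0 carries little information.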
Binary Variables
Russell and Rao coefficient: $J(i,j) = \frac{a}{a + b + c + d}$
Bravais coefficient: $C(i,j) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(d + b)(d + c)}}$
Yule association coefficient: $Q(i,j) = \frac{ad - bc}{ad + bc}$
Hamming distance (number of mismatches): $H(i,j) = b + c$
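The coefficients on this slide can be sketched as functions of the contingency counts a, b, c, d. Two assumptions are made loudly here: the Bravais coefficient is taken to be the phi coefficient, with a square root over the product in the denominator, and the Hamming distance is taken in its usual sense as the number of mismatching positions, b + c. All function names are hypothetical, and zero denominators are not handled.

```python
import math


def russell_rao(a, b, c, d):
    """Share of variables on which both objects are 1."""
    return a / (a + b + c + d)


def bravais_phi(a, b, c, d):
    """Phi coefficient (assumed form of the Bravais coefficient)."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (d + b) * (d + c))


def yule_q(a, b, c, d):
    """Yule's association coefficient Q."""
    return (a * d - b * c) / (a * d + b * c)


def hamming(a, b, c, d):
    """Number of positions where the two objects differ (assumed definition)."""
    return b + c


# Hypothetical contingency counts.
a, b, c, d = 3, 1, 1, 3
print(russell_rao(a, b, c, d), bravais_phi(a, b, c, d), yule_q(a, b, c, d), hamming(a, b, c, d))
```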
Dissimilarity in Binary Variables
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
o gender is a symmetric attribute
o the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
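$$d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$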
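The three values in this example can be reproduced with a short script. It uses only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4), maps Y and P to 1 and N to 0 as stated, and applies the Jaccard form (b + c) / (a + b + c). The names `records`, `encode`, and `jaccard` are hypothetical.

```python
# Asymmetric attributes only: Fever, Cough, Test-1, Test-2, Test-3, Test-4.
records = {
    "Jack": ["Y", "N", "P", "N", "N", "N"],
    "Mary": ["Y", "N", "P", "N", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "N"],
}


def encode(values):
    """Map Y and P to 1, and N to 0."""
    return [1 if v in ("Y", "P") else 0 for v in values]


def jaccard(x, y):
    """Jaccard dissimilarity (b + c) / (a + b + c) for 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)


coded = {name: encode(vals) for name, vals in records.items()}
d_jack_mary = jaccard(coded["Jack"], coded["Mary"])
d_jack_jim = jaccard(coded["Jack"], coded["Jim"])
d_jim_mary = jaccard(coded["Jim"], coded["Mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

Jack and Mary come out as the most similar pair, and Jim and Mary as the least similar.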
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states,
e.g., red, yellow, blue, green
Method 1: Simple matching
o m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
Method 2: use a large number of binary variables
o creating a new binary variable for each of the M nominal states
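Both methods can be sketched briefly. `nominal_dissimilarity` implements the simple matching formula (p - m) / p, and `one_hot` shows the binary-variable encoding of Method 2. The function names and sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Method 1: (p - m) / p, with m matches out of p nominal variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


# Hypothetical objects described by three nominal variables.
obj_i = ["red", "small", "round"]
obj_j = ["red", "large", "round"]
d_ij = nominal_dissimilarity(obj_i, obj_j)

colors = ["red", "yellow", "blue", "green"]
encoded = one_hot("blue", colors)
print(d_ij, encoded)
```

After the encoding in Method 2, the binary-variable coefficients can be applied to the new variables.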
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
o replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
o map the range of each variable onto [0, 1] by replacing the i-th object in
the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
o compute the dissimilarity using methods for interval-scaled variables
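The rank mapping can be sketched as follows, assuming the ordered list of states for the variable is known in advance and has at least two states. The function name and the grade labels are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by (r - 1) / (M - 1), where r is its rank in 1..M."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_f = len(ordered_states)  # number of ordered states, assumed >= 2
    return [(rank[v] - 1) / (m_f - 1) for v in values]


# Hypothetical ordinal variable with three ordered states.
grades = ["fair", "good", "excellent", "fair"]
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
print(z)
```

The resulting values lie in [0, 1] and can be passed to any distance function for interval-scaled variables.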
Ratio-scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale,
approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
o treat them like interval-scaled variables—not a good choice!
(why?—the scale can be distorted)
o apply logarithmic transformation yif = log(xif)
o treat them as continuous ordinal data and treat their ranks as
interval-scaled
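A small sketch of why the logarithmic transformation helps: values that grow exponentially in t become evenly spaced after taking logs, so differences between them behave like those of an interval-scaled variable. The constants `A` and `B` and the sample points are hypothetical.

```python
import math


def log_transform(values):
    """Apply y = log(x) to strictly positive ratio-scaled values."""
    if any(x <= 0 for x in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(x) for x in values]


# Hypothetical exponential growth x = A * exp(B * t) sampled at t = 0..4.
A, B = 2.0, 0.5
x = [A * math.exp(B * t) for t in range(5)]
y = log_transform(x)

# Raw gaps grow with t; gaps between consecutive logs are all close to B.
raw_gaps = [x[k + 1] - x[k] for k in range(4)]
log_gaps = [y[k + 1] - y[k] for k in range(4)]
print(raw_gaps)
print(log_gaps)
```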
Mixed Variables
A database may contain all six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio
One may use a weighted formula to combine their effects
$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
o f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
o f is interval-based: use the normalized distance
o f is ordinal or ratio-scaled: compute ranks $r_{if}$, set
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
and treat $z_{if}$ as interval-scaled
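The weighted combination can be sketched as below. Several details are assumptions rather than statements from the slides: the per-variable weight (called `delta` in the code) is set to 0 when either value is missing and to 1 otherwise, the normalized distance for interval variables is taken as the absolute difference divided by the variable's range, and ordinal or ratio-scaled variables are expected to be already mapped to [0, 1] and passed as interval variables with range 1. The names `mixed_dissimilarity`, `kinds`, and `ranges` are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-variable dissimilarities.

    kinds[f]  : "nominal" (also used for binary) or "interval"
    ranges[f] : max minus min of variable f (used for "interval" only)
    A value of None marks a missing measurement (assumed: weight 0).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        if x[f] is None or y[f] is None:
            continue  # delta = 0: this variable does not contribute
        if kind == "nominal":
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif kind == "interval":
            d_f = abs(x[f] - y[f]) / ranges[f]  # assumed normalization
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d_f   # delta = 1
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no variable is available for both objects")
    return numerator / denominator


# Hypothetical objects: colour (nominal), cost (interval), rank already in [0, 1].
kinds = ["nominal", "interval", "interval"]
ranges = [None, 40.0, 1.0]
obj_i = ["red", 10.0, 0.5]
obj_j = ["blue", 30.0, 0.5]
obj_k = ["red", None, 0.5]

d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
d_ik = mixed_dissimilarity(obj_i, obj_k, kinds, ranges)
print(d_ij, d_ik)
```

Since every per-variable term lies in [0, 1], the combined d(i, j) also lies in [0, 1] regardless of the mix of variable types.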