Cluster Analysis Introduction

Cluster analysis is an unsupervised machine learning technique used to group similar data objects together. It can be applied to various domains such as image processing, web usage mining, and spatial data analysis. This document discusses different types of data that can be used for cluster analysis, including interval-scaled, binary, nominal, ordinal and ratio variables. It also covers distance measures, data structures, evaluation metrics, and approaches for handling mixed variable types. The goal of cluster analysis is to maximize similarity within clusters and minimize similarity between clusters.

Uploaded by

Erza Lee

CE175-4C

Database Management in Construction

Introduction to Cluster Analysis

Edgar M. Adina
Instructor
What is Cluster Analysis?
 Cluster: a collection of data objects
o Similar to one another within the same cluster
o Dissimilar to the objects in other clusters
 Cluster analysis
o Grouping a set of data objects into clusters
 Clustering is unsupervised classification: no predefined classes
 Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms
General Applications of Clustering
 Pattern Recognition
 Spatial Data Analysis
o create thematic maps in GIS by clustering feature spaces
o detect spatial clusters and explain them in spatial data mining
 Image Processing
 Economic Science (especially market research)
 WWW
o Document classification
o Cluster Weblog data to discover groups of similar access patterns
Some specific applications
 City planning: identifying groups of houses according to their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along continental faults
 Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
 Land use: identifying areas of similar land use in an earth observation database
 Insurance: identifying groups of motor insurance policy holders with a high average claim cost
Illustration: Thematic Maps
Illustration: Web Usage Mining
Clustering Approaches
 Partitioning Algorithms
o Find k partitions, minimizing some objective function
 Hierarchical Algorithms
o Create a hierarchical decomposition of the set of objects
 Density-based
o Find clusters based on connectivity and density functions
 Other methods
o Grid-based
o Neural Networks
o Graph-theoretical methods, and many others…
Good Clustering
 A good clustering method will produce high quality clusters with
o high intra-class similarity
o low inter-class similarity
 The quality of a clustering result depends on both the similarity measure
used by the method and its implementation.
 The quality of a clustering method is also measured by its ability to discover
some or all of the hidden patterns.
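One rough way to make "high intra-class similarity, low inter-class similarity" concrete is to compare average distances within and between clusters. The sketch below assumes Euclidean distance and a hypothetical, hand-labelled toy data set; the function names are illustrative and not taken from the slides.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two hand-labelled clusters of 2-D points.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    d = [dist(p, q)
         for points in clusters.values()
         for p, q in combinations(points, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    d = [dist(p, q)
         for points_a, points_b in combinations(clusters.values(), 2)
         for p in points_a
         for q in points_b]
    return sum(d) / len(d)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

A small intra-cluster average next to a large inter-cluster average suggests well-separated clusters under the chosen distance; it says nothing about whether that distance is the right one for the data.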
Requisites
 Scalability
 Ability to deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to determine input parameters
 Ability to deal with noise and outliers
 Insensitivity to the order of input records
 Ability to handle high dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
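Data Structures
Data matrix (two modes)
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode)
$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$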
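The relationship between the two structures can be sketched in a few lines: each entry d(i, j) of the dissimilarity matrix is computed from rows i and j of the data matrix. The example assumes Euclidean distance and a hypothetical 4 x 2 data matrix; indices are 0-based here, whereas the slide counts from 1.

```python
from math import dist

# Hypothetical data matrix: n = 4 objects (rows), p = 2 variables (columns).
X = [
    [1.0, 2.0],
    [4.0, 6.0],
    [1.0, 2.0],
    [0.0, 0.0],
]

def dissimilarity_matrix(X):
    """Return the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i); the diagonal entry d(i, i) is 0.
    Euclidean distance is an assumption; any dissimilarity could be used.
    """
    return [[dist(X[i], X[j]) for j in range(i + 1)] for i in range(len(X))]

D = dissimilarity_matrix(X)
for row in D:
    print(["%.2f" % value for value in row])
```

Only the lower triangle is stored because d(i, j) = d(j, i) for a symmetric dissimilarity, which is why the slide's matrix has one mode.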
Quality of Clustering (Metrics)
 Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance
function, which is typically metric: d(i, j)
 There is a separate “quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are usually very different for interval-
scaled, Boolean, categorical, ordinal and ratio variables.
 Weights should be associated with different variables based on applications and
data semantics.
 It is hard to define “similar enough” or “good enough”
o the answer is typically highly subjective.
Sample Distance Functions
Measuring Similarity
Types of Data for CA
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Interval-Valued Data
 Standardize data
o Calculate the mean absolute deviation:
$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$
where
$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$
o Calculate the standardized measurement (z-score):
$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$
 Using the mean absolute deviation is more robust than using the standard deviation
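The standardization above can be sketched for a single variable f as follows. The measurements are hypothetical, and the guard against a constant variable is an addition not discussed on the slide.

```python
def standardize(values):
    """Return z-scores using the mean absolute deviation s_f as the scale."""
    n = len(values)
    m_f = sum(values) / n                        # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("variable is constant; z-scores are undefined")
    return [(x - m_f) / s_f for x in values]

# Hypothetical measurements for one variable f.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
print([round(v, 3) for v in z])
```

Because deviations are not squared, a single extreme value inflates s_f less than it would inflate the standard deviation, so an outlier's z-score stays comparatively large and remains detectable.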
Binary Variables
 A contingency table for binary data

                      Object j
                   1      0      sum
Object i    1      a      b      a+b
            0      c      d      c+d
            sum    a+c    b+d    p

 Simple matching coefficient (invariant, if the binary variable is symmetric):
$$ d(i,j) = \frac{b + c}{a + b + c + d} $$
 Jaccard coefficient (noninvariant if the binary variable is asymmetric):
$$ d(i,j) = \frac{b + c}{a + b + c} $$
Binary Variables

Russell and Rao coefficient: $J(i,j) = \frac{a}{a + b + c + d}$

Bravais coefficient: $C(i,j) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(d + b)(d + c)}}$

Yule association coefficient: $Q(i,j) = \frac{ad - bc}{ad + bc}$

Hamming distance: $H(i,j) = b + c$
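As a rough illustration, the contingency counts a, b, c, d and three of the coefficients on this slide can be computed from two 0/1 vectors as below. The vectors are hypothetical, the function names are illustrative, and the square root in the Bravais denominator is assumed from the standard form of that coefficient.

```python
from math import sqrt

def contingency(x, y):
    """Count a, b, c, d for equal-length 0/1 vectors x (object i) and y (object j)."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def russell_rao(a, b, c, d):
    return a / (a + b + c + d)

def bravais(a, b, c, d):
    # Assumes every marginal total is non-zero; otherwise the value is undefined.
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (d + b) * (d + c))

def yule_q(a, b, c, d):
    # Assumes ad + bc is non-zero.
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical objects described by six binary variables.
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 1, 1, 0]
counts = contingency(x, y)
print(counts, russell_rao(*counts), bravais(*counts), yule_q(*counts))
```

Note that these three are similarity or association measures (larger means more alike), unlike the dissimilarities d(i, j) defined for the simple matching and Jaccard coefficients.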


Dissimilarity in Binary Variables
Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

o gender is a symmetric attribute
o the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0
$$ d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$
$$ d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$
$$ d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$
Nominal Variables
 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: Simple matching
o m: # of matches, p: total # of variables
$$ d(i,j) = \frac{p - m}{p} $$
 Method 2: use a large number of binary variables
o creating a new binary variable for each of the M nominal states
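Both methods can be sketched briefly. The objects, the attribute values, and the function names below are hypothetical and chosen only for illustration.

```python
def simple_matching_dissimilarity(x, y):
    """Method 1: d(i, j) = (p - m) / p for two objects with nominal values."""
    p = len(x)                                  # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)  # number of matches
    return (p - m) / p

def to_binary_variables(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == state else 0 for state in states]

# Hypothetical objects described by three nominal variables.
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
d_ij = simple_matching_dissimilarity(obj_i, obj_j)

colors = ["red", "yellow", "blue", "green"]
blue_encoded = to_binary_variables("blue", colors)
print(d_ij, blue_encoded)
```

After Method 2, the resulting binary variables can be compared with the coefficients defined for binary data.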
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
o replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
o map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$
o compute the dissimilarity using methods for interval-scaled variables
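The rank mapping can be sketched as follows, assuming the ordered list of states is known in advance and has at least two entries. The medal states and the function name are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by z = (r - 1) / (M_f - 1), with rank r in 1..M_f."""
    M_f = len(ordered_states)
    if M_f < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[value] - 1) / (M_f - 1) for value in values]

# Hypothetical ordinal variable with M_f = 3 states, lowest first.
states = ["bronze", "silver", "gold"]
z = ordinal_to_unit_interval(["gold", "bronze", "silver"], states)
print(z)  # prints: [1.0, 0.0, 0.5]
```

The resulting z values lie in [0, 1] regardless of how many states a variable has, so variables with different numbers of states contribute on a comparable scale.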


Ratio-scaled Variables
 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
o treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
o apply logarithmic transformation: $y_{if} = \log(x_{if})$
o treat them as continuous ordinal data and treat their ranks as interval-scaled
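A small numeric sketch shows why the logarithmic transformation helps. The values are hypothetical and grow by a constant factor of 10; the natural logarithm is an assumption, since the slide does not fix a base.

```python
from math import log

# Hypothetical ratio-scaled measurements growing exponentially.
x = [2.0, 20.0, 200.0, 2000.0]
y = [log(value) for value in x]  # y_if = log(x_if)

raw_gaps = [later - earlier for earlier, later in zip(x, x[1:])]
log_gaps = [later - earlier for earlier, later in zip(y, y[1:])]

print(raw_gaps)                           # gaps grow tenfold at each step
print([round(gap, 4) for gap in log_gaps])  # gaps are all equal to log(10)
```

On the raw scale the largest values dominate any distance computation; after the transformation, equal growth factors become equal differences, which is what interval-scaled methods expect.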
Mixed Variables
 A database may contain all the six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
 One may use a weighted formula to combine their effects
$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$
o f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
o f is interval-based: use the normalized distance
o f is ordinal or ratio-scaled
 compute ranks $r_{if}$ and
 treat $z_{if}$ as interval-scaled, where $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
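A minimal sketch of the weighted formula is given below. Several details are assumptions, since the slide does not define them: the indicator is taken to be 0 when a value is missing (None) or when an asymmetric binary variable is 0 for both objects, and 1 otherwise; the "normalized distance" is taken to be the absolute difference divided by the variable's range; ordinal variables are assumed to be already converted to z values in [0, 1], so they are passed as interval variables with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities for objects x and y.

    kinds[f] is "nominal", "binary", "asymmetric" or "interval";
    ranges[f] is max - min of variable f (used for "interval" only).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # assumed: indicator is 0 when a value is missing
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue  # assumed: indicator is 0 for a 0/0 asymmetric match
        if kind in ("nominal", "binary", "asymmetric"):
            d_f = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d_f = abs(xf - yf) / ranges[f]  # assumed range normalization
        else:
            raise ValueError(f"unknown variable kind: {kind}")
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables")
    return numerator / denominator

# Hypothetical objects: color, a test result, age, and an ordinal z value.
kinds = ["nominal", "asymmetric", "interval", "interval"]
ranges = [None, None, 50.0, 1.0]
obj_i = ["red", 1, 30.0, 0.5]
obj_j = ["blue", 0, 40.0, 1.0]
d_ij = mixed_dissimilarity(obj_i, obj_j, kinds, ranges)
print(round(d_ij, 3))  # (1 + 1 + 0.2 + 0.5) / 4
```

Every per-variable term lies in [0, 1], so the combined d(i, j) also lies in [0, 1] no matter how the variable types are mixed.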
