Cluster Analysis in Construction

Cluster analysis is an unsupervised machine learning technique used to group similar data objects together. It can be applied to various domains such as image processing, web usage mining, and spatial data analysis. This document discusses different types of data that can be used for cluster analysis, including interval-scaled, binary, nominal, ordinal and ratio variables. It also covers distance measures, data structures, evaluation metrics, and approaches for handling mixed variable types. The goal of cluster analysis is to maximize similarity within clusters and minimize similarity between clusters.

Uploaded by

Erza Lee

CE175-4C

Database Management in Construction

Introduction to Cluster Analysis

Edgar M. Adina
Instructor
What is Cluster Analysis?
Cluster: a collection of data objects
o Similar to one another within the same cluster
o Dissimilar to the objects in other clusters
Cluster analysis
o Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications
o As a stand-alone tool to get insight into data distribution
o As a preprocessing step for other algorithms
General Applications of Clustering
• Pattern Recognition
• Spatial Data Analysis
o create thematic maps in GIS by clustering feature spaces
o detect spatial clusters and explain them in spatial data mining
• Image Processing
• Economic Science (especially market research)
• WWW
o document classification
o cluster weblog data to discover groups of similar access patterns
Some specific applications
City planning: identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Marketing: helping marketers discover distinct groups in their customer bases, and then using this knowledge to develop targeted marketing programs
Land use: identifying areas of similar land use in an earth observation database
Insurance: identifying groups of motor insurance policy holders with a high average claim cost
Illustration: Thematic Maps
Illustration: Web Usage Mining
Clustering Approaches
• Partitioning algorithms
o Find k partitions that minimize an objective function
• Hierarchical algorithms
o Create a hierarchical decomposition of the set of objects
• Density-based methods
o Find clusters based on connectivity and density functions
• Other methods
o Grid-based
o Neural networks
o Graph-theoretical methods, and many others
Good Clustering
• A good clustering method will produce high-quality clusters with
o high intra-class similarity
o low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
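One rough way to make "high intra-class similarity, low inter-class similarity" concrete is to compare the average distance between objects that share a cluster with the average distance between objects in different clusters. The sketch below assumes numeric points, Euclidean distance, and a hypothetical helper named `mean_intra_inter`; it is an illustration only, not a quality measure prescribed by the slides.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+

def mean_intra_inter(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one pair of points shares a label and at least one
    pair does not; otherwise one of the averages is undefined.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Two tight, well-separated groups (made-up data for illustration).
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(points, labels)
print(intra, inter)
```

A small within-cluster average relative to the between-cluster average suggests compact, well-separated clusters under the chosen distance.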
Requisites
• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
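Data Structures
Data matrix (two modes)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]

Dissimilarity matrix (one mode)

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]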
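The relation between the two structures can be sketched in code: each row of the data matrix is one object, and the dissimilarity matrix stores one value d(i, j) per pair of objects. The example below assumes numeric data and Euclidean distance; the function names `euclidean` and `dissimilarity_matrix` are illustrative, not taken from the slides.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric rows."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dissimilarity_matrix(data, dist=euclidean):
    """Build the lower-triangular dissimilarity matrix as a list of rows.

    Row i holds d(i, 0), ..., d(i, i - 1) followed by the diagonal 0,
    mirroring the one-mode layout (indices here start at 0).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0] for i in range(n)]

# Data matrix with n = 3 objects and p = 2 variables (made-up values).
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0 for a distance function.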
Quality of Clustering (Metrics)
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
• A separate "quality" function measures the "goodness" of a cluster.
• The definitions of distance functions usually differ greatly for interval-scaled, Boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define "similar enough" or "good enough"
o the answer is typically highly subjective.
Sample Distance Functions
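For interval-scaled variables, a widely used family is the Minkowski distance, which gives the Manhattan distance for q = 1 and the Euclidean distance for q = 2. The sketch below is offered as an assumption about what such sample functions look like in code; the name `minkowski` and the parameter `q` are illustrative.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length numeric vectors.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    if q < 1:
        raise ValueError("q must be at least 1 for the result to be a metric")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i = [1.0, 2.0]
j = [4.0, 6.0]
print(minkowski(i, j, q=1))  # Manhattan: |1 - 4| + |2 - 6|
print(minkowski(i, j, q=2))  # Euclidean: sqrt(3^2 + 4^2)
```

Both special cases satisfy the usual metric properties: d(i, j) >= 0, d(i, i) = 0, d(i, j) = d(j, i), and d(i, j) <= d(i, k) + d(k, j).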
Measuring Similarity
Types of Data for Cluster Analysis
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-Valued Data
• Standardize data
o Calculate the mean absolute deviation:

\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]

where

\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

o Calculate the standardized measurement (z-score):

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

• Using the mean absolute deviation is more robust than using the standard deviation
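The standardization steps can be sketched for a single variable f as follows. The function name `standardize` and the sample values are assumptions made for illustration.

```python
def standardize(values):
    """Standardize one variable using the mean absolute deviation.

    m is the mean, s is the mean absolute deviation, and each
    measurement x becomes z = (x - m) / s.
    """
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - m) / s for x in values]

# Made-up measurements of one interval-scaled variable.
x_f = [2.0, 4.0, 4.0, 6.0]
z_f = standardize(x_f)
print(z_f)
```

Because deviations are not squared, a single outlier inflates s less than it would inflate the standard deviation, so the z-score of the outlier stays more visible.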
Binary Variables
• A contingency table for binary data

                  Object j
                  1       0       sum
Object i   1      a       b       a + b
           0      c       d       c + d
           sum    a + c   b + d   p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

\[ d(i, j) = \frac{b + c}{a + b + c + d} \]

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

\[ d(i, j) = \frac{b + c}{a + b + c} \]
Binary Variables

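Russell and Rao coefficient: \( J(i,j) = \frac{a}{a + b + c + d} \)

Bravais coefficient: \( C(i,j) = \frac{ad - bc}{\sqrt{(a + b)(a + c)(d + b)(d + c)}} \)

Yule association coefficient: \( Q(i,j) = \frac{ad - bc}{ad + bc} \)

Hamming distance: \( H(i,j) = b + c \), the number of variables on which i and j differ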
Dissimilarity in Binary Variables
Example

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

o gender is a symmetric attribute
o the remaining attributes are asymmetric binary
• let the values Y and P be set to 1, and the value N be set to 0

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
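The three values in this example can be reproduced with a short script. It encodes Y and P as 1 and N as 0, uses only the six asymmetric attributes (fever, cough, Test-1 to Test-4), and applies the Jaccard formula (b + c) / (a + b + c). The dictionary layout and the function name are assumptions made for illustration.

```python
# Asymmetric attributes in order: fever, cough, test-1, test-2, test-3, test-4.
patients = {
    "jack": [1, 0, 1, 0, 0, 0],
    "mary": [1, 0, 1, 0, 1, 0],
    "jim":  [1, 1, 0, 0, 0, 0],
}

def jaccard_dissimilarity(x, y):
    """(b + c) / (a + b + c): joint absences (0, 0) are ignored.

    Undefined when both vectors are all zeros.
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    mismatches = sum(1 for u, v in zip(x, y) if u != v)  # b + c
    return mismatches / (a + mismatches)

for first, second in [("jack", "mary"), ("jack", "jim"), ("jim", "mary")]:
    d = jaccard_dissimilarity(patients[first], patients[second])
    print(first, second, round(d, 2))
```

Jim and Mary come out as the least similar pair, while Jack and Mary are the most similar.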
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
o m: # of matches, p: total # of variables

\[ d(i, j) = \frac{p - m}{p} \]

Method 2: use a large number of binary variables
o create a new binary variable for each of the M nominal states
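Both methods can be sketched briefly. `nominal_dissimilarity` implements the simple matching formula (p - m) / p, and `one_hot` shows one way to create a binary variable per nominal state; the function names and the sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Method 1: d(i, j) = (p - m) / p, with m matches out of p variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

# Two objects described by three nominal variables (made-up values).
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
print(nominal_dissimilarity(obj_i, obj_j))

colors = ["red", "yellow", "blue", "green"]
print(one_hot("yellow", colors))
```

After the one-hot step, the resulting 0/1 variables can be compared with the coefficients for binary data.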
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
o replace \( x_{if} \) by its rank \( r_{if} \in \{1, \ldots, M_f\} \)
o map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

o compute the dissimilarity using methods for interval-scaled variables
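The rank mapping can be sketched as follows, assuming the ordered list of states is known in advance. The function name `ordinal_to_unit_interval` and the medal example are illustrative.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M - 1).

    ordered_states lists the M states from lowest to highest; M must be
    at least 2, otherwise the denominator is zero.
    """
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    M = len(ordered_states)
    return [(rank[v] - 1) / (M - 1) for v in values]

order = ["bronze", "silver", "gold"]
z = ordinal_to_unit_interval(["bronze", "gold", "silver"], order)
print(z)
```

The resulting values can then be compared with any distance for interval-scaled variables, for example the absolute difference |z_i - z_j|.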

Ratio-scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as \( Ae^{Bt} \) or \( Ae^{-Bt} \)
• Methods:
o treat them like interval-scaled variables: not a good choice, because the scale can be distorted
o apply a logarithmic transformation \( y_{if} = \log(x_{if}) \)
o treat them as continuous ordinal data and treat their ranks as interval-scaled
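A small numeric sketch shows why the logarithmic transformation helps. For measurements that follow A e^{Bt}, raw differences grow with t, whereas differences of the logarithms stay constant at B. The constants A = 2 and B = 0.5 and the function name `log_transform` are assumptions chosen for illustration.

```python
from math import exp, log

def log_transform(values):
    """Return y = log(x) for each positive ratio-scaled measurement x."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [log(v) for v in values]

# Hypothetical measurements following A * e^(B t) with A = 2 and B = 0.5.
raw = [2.0 * exp(0.5 * t) for t in range(4)]
transformed = log_transform(raw)

# Differences between consecutive measurements, before and after the transform.
raw_steps = [b - a for a, b in zip(raw, raw[1:])]
log_steps = [b - a for a, b in zip(transformed, transformed[1:])]
print(raw_steps)
print(log_steps)
```

After the transformation, equal steps in t give equal distances, so the values can be handled like interval-scaled data.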
Mixed Variables
• A database may contain all six types of variables
o symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
• One may use a weighted formula to combine their effects

\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

o f is binary or nominal: \( d_{ij}^{(f)} = 0 \) if \( x_{if} = x_{jf} \), or \( d_{ij}^{(f)} = 1 \) otherwise
o f is interval-based: use the normalized distance
o f is ordinal or ratio-scaled: compute ranks \( r_{if} \) and treat \( z_{if} \) as interval-scaled, with

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
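The weighted formula can be sketched as below. Several details are assumptions not spelled out on the slide: the indicator delta is taken to be 0 when a value is missing (`None`) or when an asymmetric binary variable is 0 for both objects, and 1 otherwise; the normalized interval distance is taken to be |x_if - x_jf| divided by the range of variable f; and ordinal or ratio-scaled variables are assumed to have been converted to z-values in [0, 1] beforehand and passed as "interval" with range 1. The function name and the `kinds` labels are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities into a single d(i, j).

    kinds[f] is one of "symmetric", "asymmetric", "nominal", "interval".
    ranges[f] is max - min of variable f for interval variables (else None).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # delta = 0: missing value (assumption)
        if kind == "asymmetric" and xf == 0 and yf == 0:
            continue  # delta = 0: joint absence carries no information (assumption)
        if kind == "interval":
            d_f = abs(xf - yf) / ranges[f]
        else:
            d_f = 0.0 if xf == yf else 1.0
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables for this pair")
    return numerator / denominator

kinds = ["symmetric", "asymmetric", "nominal", "interval"]
ranges = [None, None, None, 40.0]
obj_i = ["M", 1, "red", 10.0]
obj_j = ["F", 0, "red", 30.0]
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))
```

Each variable contributes a value in [0, 1], so the combined dissimilarity also lies in [0, 1] regardless of how the variable types are mixed.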
