
What is Cluster Analysis?

 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: Observed earthquake epicenters should be clustered along continental faults



Quality: What Is Good Clustering?
 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns



Measure the Quality of Clustering

 Dissimilarity/Similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j)
 There is a separate “quality” function that measures the “goodness” of a cluster
 The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
 Weights should be associated with different variables based on applications and data semantics
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Ability to deal with noise and outliers
 Insensitive to order of input records
 Ability to handle high dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability



Data Structures
 Data matrix (two modes): n objects described by p variables

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode): n-by-n pairwise distances

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
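To make the relationship between the two structures concrete, here is a minimal sketch that derives a dissimilarity matrix from a data matrix, assuming Euclidean distance and NumPy; the sample values are made up for illustration:

```python
import numpy as np

# Data matrix ("two modes"): n objects (rows) by p variables (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 0.0]])

# Dissimilarity matrix ("one mode"): n-by-n table of pairwise distances.
# Only the lower triangle is computed, since d(i, j) = d(j, i) and d(i, i) = 0.
n = X.shape[0]
D = np.zeros((n, n))
for i in range(n):
    for j in range(i):
        D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))  # Euclidean d(i, j)

print(D + D.T)  # mirrored for display; the lower triangle alone suffices
```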



Types of data in cluster analysis

 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types



Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

 Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

 Using the mean absolute deviation is more robust than using the standard deviation: the deviations are not squared, so outliers have less influence
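A short NumPy sketch of this two-step standardization; the sample column of values is hypothetical:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # one variable f over n objects

m_f = x.mean()                    # m_f = (1/n)(x_1f + x_2f + ... + x_nf)
s_f = np.mean(np.abs(x - m_f))    # mean absolute deviation (not std dev)
z = (x - m_f) / s_f               # standardized measurement (z-score)
print(z)
```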



Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity between two data objects
 Some popular ones include the Minkowski distance:

$$d(i, j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

 If q = 1, d is the Manhattan distance:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$



Similarity and Dissimilarity Between Objects (Cont.)

 If q = 2, d is the Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

 Properties
 $d(i, j) \ge 0$ (non-negativity)
 $d(i, i) = 0$
 $d(i, j) = d(j, i)$ (symmetry)
 $d(i, j) \le d(i, k) + d(k, j)$ (triangle inequality)
 Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures
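The formulas above translate directly into a few lines of Python; this sketch (with made-up points) reproduces the Manhattan (q = 1) and Euclidean (q = 2) special cases:

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance: d(i, j) = (sum_f |x_if - x_jf|^q)^(1/q)."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])
print(minkowski(i, j, 1))  # Manhattan: 3 + 4 + 0 = 7.0
print(minkowski(i, j, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0
```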



Binary Variables
 A contingency table for binary data (object i vs. object j):

                 Object j
                 1        0        sum
    Object i  1  a        b        a + b
              0  c        d        c + d
            sum  a + c    b + d    p

 Distance measure for symmetric binary variables:

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

 Distance measure for asymmetric binary variables:

$$d(i, j) = \frac{b + c}{a + b + c}$$

 Jaccard coefficient (similarity measure for asymmetric binary variables):

$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
Dissimilarity between Binary Variables
 Example:

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack  M       Y      N      P       N       N       N
    Mary  F       Y      N      P       N       P       N
    Jim   M       Y      P      N       N       N       N

 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be set to 1, and the value N be set to 0

$$d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$
$$d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$
$$d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$
Nominal Variables

 A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Method 1: simple matching
 m: # of matches, p: total # of variables

$$d(i, j) = \frac{p - m}{p}$$

 Method 2: use a large number of binary variables
 create a new binary variable for each of the M nominal states
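A minimal sketch of simple matching (Method 1); the color/size/shape values and the function name are hypothetical:

```python
def simple_matching_dist(x, y):
    """d(i, j) = (p - m) / p, where m = # of matches and p = # of variables."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

i = ("red", "small", "round")
j = ("red", "large", "round")
print(simple_matching_dist(i, j))  # 1 mismatch out of 3 variables -> 0.333...
```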



Ordinal Variables

 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled variables:
 replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
 map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

 compute the dissimilarity using methods for interval-scaled variables
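A short sketch of this rank-to-[0, 1] mapping; the "low/medium/high" levels and the function name are made up for illustration:

```python
def ordinal_to_interval(values, levels):
    """Replace each value by its rank r_if in {1, ..., M_f}, then map to [0, 1]
    via z_if = (r_if - 1) / (M_f - 1)."""
    rank = {level: r + 1 for r, level in enumerate(levels)}
    M_f = len(levels)
    return [(rank[v] - 1) / (M_f - 1) for v in values]

z = ordinal_to_interval(["low", "high", "medium"], levels=["low", "medium", "high"])
print(z)  # [0.0, 1.0, 0.5] -- now usable with interval-scaled distance measures
```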



Ratio-Scaled Variables

 Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
 Methods:
 treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
 apply a logarithmic transformation: $y_{if} = \log(x_{if})$
 treat them as continuous ordinal data and treat their rank as interval-scaled
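A tiny sketch of the logarithmic option, with hypothetical exponential-scale readings:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])  # readings on an exponential scale
y = np.log(x)                             # y_if = log(x_if)
print(y)  # evenly spaced after the transform, so interval-scaled methods apply
```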



Variables of Mixed Types

 A database may contain all six types of variables:
 symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
 One may use a weighted formula to combine their effects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

 f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
 f is interval-based: use the normalized distance
 f is ordinal or ratio-scaled:
 compute ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$
 and treat $z_{if}$ as interval-scaled
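A minimal sketch of the weighted formula (essentially a Gower-style dissimilarity), assuming no missing values (so every indicator $\delta_{ij}^{(f)}$ is 1), one nominal and one interval variable, and a known range used to normalize the interval variable; the function name and sample objects are mine:

```python
def mixed_dissim(xi, xj, types, ranges):
    """d(i, j) = sum_f delta_f * d_f / sum_f delta_f over the p variables."""
    num = den = 0.0
    for f, t in enumerate(types):
        delta = 1.0  # indicator delta_ij^(f): 1 when both measurements exist
        if t in ("binary", "nominal"):
            d_f = 0.0 if xi[f] == xj[f] else 1.0
        else:  # "interval" (ordinal/ratio assumed already mapped to this form)
            d_f = abs(xi[f] - xj[f]) / ranges[f]  # normalized distance
        num += delta * d_f
        den += delta
    return num / den

i = ("red", 3.0)
j = ("blue", 7.0)
print(mixed_dissim(i, j, types=("nominal", "interval"), ranges=(None, 10.0)))
# (1.0 + 0.4) / 2 = 0.7
```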



Vector Objects

 Vector objects: keywords in documents, gene features in micro-arrays, etc.
 Broad applications: information retrieval, biologic taxonomy, etc.
 Cosine measure: $s(d_1, d_2) = \dfrac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}$
 A variant: the Tanimoto coefficient, $s(d_1, d_2) = \dfrac{d_1 \cdot d_2}{d_1 \cdot d_1 + d_2 \cdot d_2 - d_1 \cdot d_2}$
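Both measures are a few lines in NumPy; the keyword-count vectors below are hypothetical:

```python
import numpy as np

def cosine_sim(d1, d2):
    """Cosine measure: d1 . d2 / (||d1|| * ||d2||)."""
    return np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

def tanimoto(d1, d2):
    """Tanimoto coefficient: d1 . d2 / (d1 . d1 + d2 . d2 - d1 . d2)."""
    dot = np.dot(d1, d2)
    return dot / (np.dot(d1, d1) + np.dot(d2, d2) - dot)

d1 = np.array([5.0, 0.0, 3.0, 0.0, 2.0])  # keyword counts, document 1
d2 = np.array([3.0, 0.0, 2.0, 0.0, 1.0])  # keyword counts, document 2
print(cosine_sim(d1, d2), tanimoto(d1, d2))
```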



Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS (a minimal k-means sketch follows this list)
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
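To make the partitioning approach concrete, here is a minimal k-means sketch (not a reference implementation) that alternates nearest-center assignment with mean updates to reduce the sum of squared errors; the sample points are made up:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate nearest-center assignment and mean updates; each step can only
    lower the sum of squared errors. Assumes no cluster ever goes empty."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(iters):
        # Assign every object to its nearest center (squared Euclidean distance).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of the objects assigned to it.
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
labels, centers = kmeans(X, k=2)
print(labels)   # two tight groups
print(centers)
```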



Major Clustering Approaches (II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each cluster, and the aim is to find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Summary
 Cluster analysis groups objects based on their similarity
and has wide applications
 Measure of similarity can be computed for various types
of data
 Clustering algorithms can be categorized into partitioning
methods, hierarchical methods, density-based methods,
grid-based methods, and model-based methods
 Outlier detection and analysis are very useful for fraud
detection, etc. and can be performed by statistical,
distance-based or deviation-based approaches
 There are still many open research issues in cluster analysis



Problems and Challenges

 Considerable progress has been made in scalable clustering methods:
 Partitioning: k-means, k-medoids, CLARANS
 Hierarchical: BIRCH, ROCK, CHAMELEON
 Density-based: DBSCAN, OPTICS, DenClue
 Grid-based: STING, WaveCluster, CLIQUE
 Model-based: EM, Cobweb, SOM
 Frequent pattern-based: pCluster
 Constraint-based: COD, constrained-clustering
 Current clustering techniques do not address all of these requirements adequately; clustering remains an active area of research
