29. Measuring Data Similarity and Dissimilarity: Introduction
Similarity and
Dissimilarity
Introduction
(Dis)similarity Introduction
Similarity
• Numerical measure
of how alike two data
objects are
Dissimilarity
• Numerical measure
of how different two
data objects are
• Lower when objects
are more alike
• Minimum dissimilarity
is often 0
• Upper limit varies
(Dis)similarity Introduction
Proximity
• Closeness of data
objects
• Objects with
relationship are placed
together
• Proximity is associated
with Balance
(Dis)similarity Introduction
Similarity measures
• Clustering
• Nearest neighbour
search
• Classification
• Prediction
• Characterization
• Categorization
• Correlation Analysis
Data Mining
Similarity and
Dissimilarity
Data Matrix and
Dissimilarity Matrix
Data Matrix
Each dimension
describes one attribute.
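Data Matrix
• Represents n data points with p dimensions (an n × p matrix)
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$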
Dissimilarity Matrix
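• Covers the same n data points, but stores only the pairwise distances d(i, j)
• A triangular matrix, because d(i, j) = d(j, i) and d(i, i) = 0
$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$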
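As a rough illustration of how a dissimilarity matrix is derived from a data matrix, the sketch below builds the lower triangle row by row. The function names and the 3 × 2 data matrix are invented for this example, and Euclidean distance is used only as one possible choice of d(i, j).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity_matrix(data, dist):
    """Return the lower triangle (including the zero diagonal) as a list of rows.

    Row i holds d(i, 0), d(i, 1), ..., d(i, i).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(X, euclidean)
for row in D:
    print(row)
```

Only n(n − 1)/2 distinct off-diagonal values need to be stored, which is why the triangular layout is sufficient.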
Data Mining
Similarity and
Dissimilarity
Proximity Measure
Proximity Measure
Nominal Attributes
• A nominal attribute can take two or more values (states).
• Method 1 for proximity: simple matching
• m: number of matches (attributes on which i and j have the same value)
• p: total number of attributes
$$
d(i, j) = \frac{p - m}{p}
$$
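A minimal sketch of simple matching, d(i, j) = (p − m) / p, assuming each object is given as a tuple of nominal values in the same attribute order. The function name and the sample objects are hypothetical.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Simple matching dissimilarity for nominal attributes: (p - m) / p."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                        # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # number of matches
    return (p - m) / p


# Invented example: three nominal attributes, one mismatch.
a = ("red", "round", "small")
b = ("red", "square", "small")
print(simple_matching_dissimilarity(a, b))
```

Identical objects give 0, and objects that differ on every attribute give 1.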
Proximity Measure
Nominal attributes
Method 2: encode each nominal state as a separate binary attribute, then compare the binary vectors
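Contingency table for two objects i and j described by binary attributes:

| Object i \ Object j | 1 | 0 | sum |
|---|---|---|---|
| 1 | q | r | q + r |
| 0 | s | t | s + t |
| sum | q + s | r + t | p |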
Proximity Measure
Binary attributes
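Jaccard coefficient (similarity for asymmetric binary attributes)
$$
\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}
$$
The corresponding dissimilarity, used in the example, is
$$
d(i, j) = \frac{r + s}{q + r + s} = 1 - \mathrm{sim}_{\mathrm{Jaccard}}(i, j)
$$
Example
$$
d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$
$$
d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$
$$
d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$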
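The counts q, r, s, t and the two measures can be sketched as follows. The binary vectors for jack, mary and jim are an assumption: the slides do not list them, and these values were chosen only because they reproduce the three results 0.33, 0.67 and 0.75.

```python
def contingency(i, j):
    """Count (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def jaccard_similarity(i, j):
    """q / (q + r + s); undefined (ZeroDivisionError) if both vectors are all 0."""
    q, r, s, _ = contingency(i, j)
    return q / (q + r + s)


def asymmetric_binary_dissimilarity(i, j):
    """(r + s) / (q + r + s), i.e. 1 - Jaccard similarity."""
    q, r, s, _ = contingency(i, j)
    return (r + s) / (q + r + s)


# ASSUMED vectors (not given in the slides), picked to match the worked example.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))
```

Negative matches (t) are counted but deliberately left out of both ratios, which is what makes the measure suitable for asymmetric binary attributes.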
Data Mining
Similarity and
Dissimilarity:
Standardizing
Numeric Data
Standardizing Numeric Data
Standardization
Process of transforming
incoming data into a
common format
Purpose
Collaborative research
Large-scale analytics
Sharing of sophisticated
tools & methodologies
Standardizing Numeric Data
Z-score Standardization
x: raw score to be standardized
μ: mean of the population
σ: standard deviation of the population
$$
z = \frac{x - \mu}{\sigma}
$$
Standardizing Numeric Data
Z-Score example
An individual SAT score is 1100.
The mean SAT score is 1026.
The standard deviation is 209.
$$
z = \frac{1100 - 1026}{209} \approx 0.354
$$
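The SAT example can be checked with a few lines of code. This is a minimal sketch; the function name `z_score` is made up for the illustration.

```python
def z_score(x, mean, std):
    """Standardize a raw score: (x - mean) / std."""
    return (x - mean) / std


# Values from the SAT example: score 1100, mean 1026, standard deviation 209.
z = z_score(1100, 1026, 209)
print(round(z, 3))  # 0.354
```

A positive z means the score lies above the mean, here by roughly a third of a standard deviation.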
Standardizing Numeric Data
Alternative way for Standardization
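Calculate the mean absolute deviation
$$
s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)
$$
where
$$
m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)
$$
The standardized value is then
$$
z_{if} = \frac{x_{if} - m_f}{s_f}
$$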
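A minimal sketch of standardization with the mean absolute deviation, following the three formulas for m_f, s_f and z_if. The function name and the sample values are invented for illustration.

```python
def standardize_mad(values):
    """Standardize one attribute f using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                            # mean of attribute f
    s_f = sum(abs(x - m_f) for x in values) / n      # mean absolute deviation
    # Note: s_f is 0 when all values are equal, which raises ZeroDivisionError.
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values.
values = [2.0, 4.0, 6.0, 8.0]
print(standardize_mad(values))
```

Because deviations are not squared, outliers influence s_f less than they influence the standard deviation.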
Data Mining
Similarity and
Dissimilarity
Distance on
Numeric Data
Distance on Numeric Data
Minkowski Distance
$$
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
$$
Manhattan distance
• h = 1 (L1 norm)
• Hamming distance: number of bits that differ between two binary vectors
$$
d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
$$
Distance on Numeric Data
Euclidean distance
• h= 2
• L2 norm
$$
d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
$$
Distance on Numeric Data
Supremum distance
• h → ∞ (L∞ norm)
• Maximum difference between any
component (attribute) of the vectors
$$
d(i, j) = \max_{f} |x_{if} - x_{jf}|
$$
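The three special cases (h = 1, h = 2, h → ∞) can be covered by one function. The sketch below is only illustrative: the function names and the sample vectors are assumptions, and `math.inf` is used to request the supremum distance.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Hypothetical points and bit vectors.
a, b = (1, 2), (3, 5)
print(minkowski(a, b, 1))          # Manhattan
print(minkowski(a, b, 2))          # Euclidean
print(minkowski(a, b, math.inf))   # supremum

u, v = (1, 0, 1, 1), (1, 1, 0, 1)
print(hamming(u, v), minkowski(u, v, 1))
```

For 0/1 vectors the Hamming distance equals the Manhattan distance, which is why the two appear together.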
Minkowski Distance
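Manhattan (L1) distance for a data set of four objects, x1 to x4

| L1 | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | 0 | | | |
| x2 | 5 | 0 | | |
| x3 | 3 | 6 | 0 | |
| x4 | 6 | 1 | 7 | 0 |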
Minkowski Distance
Euclidean (L2) distance

| L2 | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | 0 | | | |
| x2 | 3.61 | 0 | | |
| x3 | 2.24 | 5.1 | 0 | |
| x4 | 4.24 | 1 | 5.39 | 0 |
Minkowski Distance
Supremum (L∞) distance

| L∞ | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| x1 | 0 | | | |
| x2 | 3 | 0 | | |
| x3 | 2 | 5 | 0 | |
| x4 | 3 | 1 | 5 | 0 |
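The coordinates of x1 to x4 do not appear in this text. As an assumption, the sketch below uses four two-dimensional points that happen to reproduce the three distance matrices; treat them as a plausible reconstruction, not as the original data set.

```python
import math

# ASSUMED coordinates (not given in the slides); they reproduce the matrices.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def lower_triangle(h):
    """Lower-triangular distance matrix for `points`, rounded to 2 decimals."""
    names = list(points)
    return [
        [round(minkowski(points[names[i]], points[names[j]], h), 2)
         for j in range(i + 1)]
        for i in range(len(names))
    ]


for label, h in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for name, row in zip(points, lower_triangle(h)):
        print(" ", name, row)
```

Comparing the three outputs shows that, for any pair of objects, the supremum distance is never larger than the Euclidean distance, which in turn is never larger than the Manhattan distance.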
Data Mining
Similarity and
Dissimilarity:
Attributes of
Mixed Type
Attributes of Mixed Type
Attributes of different
types
Nominal
Symmetric binary
Asymmetric binary
Numeric
Ordinal
Attributes of Mixed Type
Combine effect of attributes
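$$
d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}
$$
where $d_{ij}^{(f)}$ is the contribution of attribute f to the dissimilarity and $\delta_{ij}^{(f)}$ is its weight (indicator).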
Attributes of Mixed Type
Binary and nominal attributes
If f is binary or nominal:
• $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$
• $d_{ij}^{(f)} = 1$ otherwise
Attributes of Mixed Type
Ordinal attribute
Discrete / Continuous
Order is important
Replace each $x_{if}$ by its rank
$r_{if} \in \{1, \ldots, M_f\}$
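A hedged sketch of the combined measure for mixed attribute types. Several details are assumptions that the slides do not spell out: the indicator is set to 0 for missing values and for 0-0 matches on asymmetric binary attributes, numeric differences are divided by the attribute range, and ordinal ranks are mapped to [0, 1] with (r − 1) / (M_f − 1). The `spec` dictionaries and all names are hypothetical.

```python
def attribute_term(x, y, spec):
    """Return (delta, d) for one attribute f, given a hypothetical spec dict."""
    if x is None or y is None:
        return 0, 0.0                      # ASSUMPTION: missing values are skipped
    kind = spec["kind"]
    if kind in ("nominal", "binary"):
        return 1, (0.0 if x == y else 1.0)
    if kind == "asymmetric_binary":
        if x == 0 and y == 0:
            return 0, 0.0                  # ASSUMPTION: 0-0 matches are ignored
        return 1, (0.0 if x == y else 1.0)
    if kind == "numeric":
        # ASSUMPTION: normalise by the attribute range (max - min).
        return 1, abs(x - y) / (spec["max"] - spec["min"])
    if kind == "ordinal":
        # ASSUMPTION: ranks 1..M_f are mapped onto [0, 1].
        m_f = spec["levels"]
        return 1, abs((x - 1) / (m_f - 1) - (y - 1) / (m_f - 1))
    raise ValueError(f"unknown attribute kind: {kind}")


def mixed_dissimilarity(obj_i, obj_j, specs):
    """Weighted average of per-attribute dissimilarities."""
    num = 0.0
    den = 0
    for x, y, spec in zip(obj_i, obj_j, specs):
        delta, d = attribute_term(x, y, spec)
        num += delta * d
        den += delta
    return num / den   # raises ZeroDivisionError if no attribute is comparable


specs = [
    {"kind": "nominal"},
    {"kind": "ordinal", "levels": 3},
    {"kind": "numeric", "min": 0, "max": 100},
]
obj_i = ("A", 3, 40)   # invented objects
obj_j = ("B", 2, 20)
print(round(mixed_dissimilarity(obj_i, obj_j, specs), 4))
```

Each per-attribute term lies in [0, 1], so the combined value also stays in [0, 1] regardless of how many attribute types are mixed.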
Similarity and
Dissimilarity:
Cosine Similarity
Cosine Similarity
A document can be
represented by
thousands of
attributes.
How do we measure the
similarity between two
documents?
Cosine Similarity
Cosine Measure
$$
\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
$$
Example
Assume two vectors
• d1 = (5, 0, 3, 0, 2, 0,
0, 2, 0, 0)
• d2 = (3, 0, 2, 0, 1, 1,
0, 1, 0, 1)
Cosine Similarity
Example
d1 · d2 =
5*3 + 0*0 + 3*2 + 0*0 + 2*1 +
0*1 + 0*0 + 2*1 + 0*0 + 0*1
= 25
||d1|| = √42 ≈ 6.481
||d2|| = √17 ≈ 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94
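The worked example can be reproduced with a short function. This is a minimal sketch using the two vectors from the example; the function name is chosen for illustration, and the term-frequency interpretation of the vectors is an assumption.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Note: a zero vector makes the denominator 0 (ZeroDivisionError).
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

The measure depends only on the angle between the vectors, so multiplying either vector by a positive constant, for example by repeating a document twice, leaves the similarity unchanged.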
Cosine Similarity
Application
• Information retrieval
• Biological taxonomy
• Plagiarism detection
• Recommendation
Systems