29. Measuring Data Similarity and Dissimilarity: Introduction

The document discusses methods for measuring similarity and dissimilarity between data objects: proximity measures for nominal and binary attributes; standardization of numeric data using z-scores; Minkowski distances between numeric data objects (Manhattan, Euclidean, and supremum distance); and a weighted formula that combines per-attribute dissimilarities for attributes of mixed type. It also describes cosine similarity for comparing document vectors in applications such as information retrieval and recommendation systems.

Uploaded by

amna shahid

Data Mining

Similarity and Dissimilarity: Introduction
(Dis)similarity Introduction

Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies

Proximity
• Closeness of data objects
• Objects with a relationship are placed together
• Proximity is associated with balance

Similarity measures are used in
• Clustering
• Nearest neighbour search
• Classification
• Prediction
• Characterization
• Categorization
• Correlation analysis
Data Mining

Similarity and Dissimilarity: Data Matrix and Dissimilarity Matrix
Data Matrix
• Objects have the same fixed number of attributes
• Objects are represented as points in a multi-dimensional space
• Each dimension describes one attribute
• Represents n data points with p dimensions (an n × p matrix)

\[
\begin{bmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
\]
Dissimilarity Matrix
• n data points, but registers only the distances between them
• A triangular matrix

\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
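The triangular layout above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `dissimilarity_matrix` and `abs_diff_sum` and the three-object data matrix are assumptions chosen for the example, and any distance function could be plugged in.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dissimilarity_matrix(
    data: Sequence[Vector],
    dist: Callable[[Vector, Vector], float],
) -> List[List[float]]:
    """Return the lower triangle (diagonal included) of the n x n matrix.

    Row i holds d(i, 0), d(i, 1), ..., d(i, i), so only n(n+1)/2 values
    are stored instead of n*n.
    """
    return [[dist(data[i], data[j]) for j in range(i + 1)]
            for i in range(len(data))]


def abs_diff_sum(a: Vector, b: Vector) -> float:
    # Illustrative distance: sum of absolute attribute differences.
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [[1.0, 2.0], [3.0, 5.0], [2.0, 0.0]]

for row in dissimilarity_matrix(data, abs_diff_sum):
    print(row)
```

Storing only the lower triangle works because d(i, j) = d(j, i) and d(i, i) = 0 for the distances used in these slides.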
Data Mining

Similarity and Dissimilarity: Proximity Measure
Proximity Measure

Nominal Attributes
• A nominal attribute can take two or more values.
• Method 1 for proximity: simple matching
• m: number of matches (attributes on which i and j have the same value)
• p: total number of attributes

\[ d(i, j) = \frac{p - m}{p} \]
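As a rough sketch of simple matching, the ratio (p - m) / p can be computed directly from two tuples of nominal values. The function name `nominal_dissimilarity` and the sample objects are hypothetical, introduced only for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p for nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                  # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # number of matches
    return (p - m) / p


# Hypothetical objects described by (colour, shape, size).
a = ("red", "circle", "small")
b = ("red", "square", "small")

print(nominal_dissimilarity(a, b))  # one mismatch out of three attributes
print(nominal_dissimilarity(a, a))  # identical objects give 0.0
```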

Nominal attributes
• Method 2: use a large number of binary attributes
• Create a new binary attribute for each of the M nominal states
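Method 2 amounts to what is commonly called one-hot encoding. The sketch below is an assumption about how one might implement it by hand; the helper name `one_hot` and the colour values are made up for the example.

```python
def one_hot(values):
    """Encode a nominal attribute as one binary attribute per state."""
    states = sorted(set(values))  # the M distinct nominal states
    encoded = [[1 if v == s else 0 for s in states] for v in values]
    return states, encoded


# Hypothetical nominal attribute with M = 3 states.
colors = ["red", "green", "blue", "green"]
states, encoded = one_hot(colors)

print(states)   # ['blue', 'green', 'red']
for value, bits in zip(colors, encoded):
    print(value, bits)
```

Each encoded row contains exactly one 1, so the resulting attributes can then be compared with the binary proximity measures.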
Binary Attributes
• Binary attributes can take only two values.
• A contingency table for binary data:

                     Object j
                 1      0      sum
Object i   1     q      r      q+r
           0     s      t      s+t
           sum   q+s    r+t    p
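Binary attributes
• Jaccard coefficient:

\[ \mathrm{sim}_{Jaccard}(i, j) = \frac{q}{q + r + s} \]

• The Jaccard coefficient is the same as coherence:

\[ \mathrm{coherence}(i, j) = \frac{\mathrm{sup}(i, j)}{\mathrm{sup}(i) + \mathrm{sup}(j) - \mathrm{sup}(i, j)} = \frac{q}{(q + r) + (q + s) - q} \]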
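The contingency counts q, r, s, t and the Jaccard coefficient can be sketched as follows. The function names and the two 0/1 vectors are illustrative assumptions, not taken from the slides.

```python
def contingency(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t


def jaccard_similarity(i, j):
    """sim(i, j) = q / (q + r + s); the 0-0 matches (t) are ignored."""
    q, r, s, _ = contingency(i, j)
    if q + r + s == 0:
        raise ValueError("undefined when both vectors contain only zeros")
    return q / (q + r + s)


# Hypothetical binary objects.
i = [1, 0, 1, 1, 0]
j = [1, 1, 0, 1, 0]

print(contingency(i, j))         # (2, 1, 1, 1)
print(jaccard_similarity(i, j))  # 0.5
```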
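Dissimilarity in binary attributes

Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       Y       N        N        N        N

• Gender is a symmetric attribute and is left out; the remaining attributes are asymmetric binary.
• Y and P are coded as 1, N is coded as 0.
• The asymmetric binary dissimilarity is d(i, j) = (r + s) / (q + r + s).

\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]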
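The three values can be checked with a short script. This is a sketch under two assumptions: only the six asymmetric attributes (Fever, Cough, Test-1 to Test-4) are used, and Y and P are coded as 1 while N is coded as 0. The function name is hypothetical.

```python
def asymmetric_binary_dissimilarity(i, j):
    """d(i, j) = (r + s) / (q + r + s) for 0/1 vectors; 0-0 matches are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P coded as 1, N as 0).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```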
Data Mining

Similarity and Dissimilarity: Standardizing Numeric Data
Standardizing Numeric Data

Standardization
• Process of transforming incoming data into a common format

Purpose
• Collaborative research
• Large-scale analytics
• Sharing of sophisticated tools and methodologies

Z-score Standardization
• x: raw score to be standardized
• μ: mean of the population
• σ: standard deviation of the population

\[ z = \frac{x - \mu}{\sigma} \]
Z-score example
• Individual SAT score: 1100
• Mean SAT score: 1026
• Standard deviation: 209

\[ z = \frac{1100 - 1026}{209} \approx 0.354 \]
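The SAT example can be reproduced with a one-line function. The name `z_score` is an assumption made for this sketch; the numbers are the ones given on the slide.

```python
def z_score(x: float, mean: float, std: float) -> float:
    """Standardize a raw score: z = (x - mean) / std."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


# SAT example: score 1100, mean 1026, standard deviation 209.
print(round(z_score(1100, 1026, 209), 3))  # 0.354
```

A score of 1100 therefore lies about 0.35 standard deviations above the mean.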
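Alternative way to standardize
• Calculate the mean absolute deviation s_f of attribute f:

\[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) \]

where

\[ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

• The standardized value is then:

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]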
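A minimal sketch of this alternative, assuming a single attribute given as a list of numbers. The function name `mad_standardize` and the sample values are illustrative only.

```python
def mad_standardize(values):
    """Standardize one attribute using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the attribute
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; s_f is zero")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values: m_f = 5.0, s_f = 2.0.
print(mad_standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```

Because the deviations are not squared, s_f is less affected by outliers than the standard deviation.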
Data Mining

Similarity and Dissimilarity: Distance on Numeric Data
Distance on Numeric Data
Minkowski Distance

\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]

• i and j are p-dimensional data objects
• i = (xi1, xi2, …, xip)
• j = (xj1, xj2, …, xjp)
• h is the order

Manhattan distance
• h = 1
• For binary vectors it equals the Hamming distance: the number of bits that differ between the two vectors

\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]

Euclidean distance
• h = 2
• L2 norm

\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

Supremum distance
• h = ∞
• Maximum difference between any component (attribute) of the vectors

\[ d(i, j) = \max_{f} |x_{if} - x_{jf}| \]
Minkowski Distance
The data set

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1) distance

L1   x1     x2     x3     x4
x1   0
x2   5      0
x3   3      6      0
x4   6      1      7      0

Euclidean (L2) distance

L2   x1     x2     x3     x4
x1   0
x2   3.61   0
x3   2.24   5.1    0
x4   4.24   1      5.39   0

Supremum (L∞) distance

L∞   x1     x2     x3     x4
x1   0
x2   3      0
x3   2      5      0
x4   3      1      5      0
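The three matrices can be regenerated from the four points with a small Minkowski function. This is a sketch, not the slides' own code: the function name `minkowski` and the use of `math.inf` to select the supremum case are choices made for this example.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


# The data set from the slides.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for label, h in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for i, ni in enumerate(names):
        # Lower triangle only: distances to the points listed before ni, plus itself.
        row = [round(minkowski(points[ni], points[nj], h), 2)
               for nj in names[: i + 1]]
        print(" ", ni, row)
```

Changing only `h` switches between Manhattan, Euclidean, and supremum distance, which is the point of treating them as one family.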
Data Mining

Similarity and Dissimilarity: Attributes of Mixed Type
Attributes of Mixed Type

Attributes of different types
• Nominal
• Symmetric binary
• Asymmetric binary
• Numeric
• Ordinal
Combined effect of attributes
• A weighted formula combines the different types of attributes, where δ_ij^(f) is the weight of attribute f for the pair (i, j):

\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

Binary or nominal attributes
• If f is binary or nominal:

\[ d_{ij}^{(f)} = \begin{cases} 0 & \text{if } x_{if} = x_{jf} \\ 1 & \text{otherwise} \end{cases} \]
Ordinal attributes
• Can be discrete or continuous
• Order is important
• Replace each x_if by its rank r_if ∈ {1, ..., M_f}
• Map the rank onto the range [0,1]:

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

• Treat z_if as interval-scaled
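The pieces above can be combined in a short sketch. Several things here are assumptions rather than content from the slides: the weight δ is taken to be 0 when a value is missing (`None`) and 1 otherwise, numeric attributes are scaled by the attribute's range (a common textbook convention), and the names `mixed_dissimilarity`, `kinds`, and `info` are hypothetical.

```python
def mixed_dissimilarity(obj_i, obj_j, kinds, info):
    """Weighted combination of per-attribute dissimilarities.

    kinds[f] is "nominal", "ordinal", or "numeric".
    info[f] is M_f for ordinal attributes (ranks run from 1 to M_f),
    a (min, max) pair for numeric attributes, and unused for nominal ones.
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = obj_i[f], obj_j[f]
        if a is None or b is None:
            continue  # assumed delta = 0: attribute f does not contribute
        if kind == "nominal":
            d = 0.0 if a == b else 1.0
        elif kind == "ordinal":
            m_f = info[f]
            # z = (r - 1) / (M_f - 1), then treated as interval-scaled.
            d = abs((a - 1) / (m_f - 1) - (b - 1) / (m_f - 1))
        elif kind == "numeric":
            lo, hi = info[f]
            d = abs(a - b) / (hi - lo)  # assumed range normalization
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d       # delta = 1 for this attribute
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no attribute is available for both objects")
    return numerator / denominator


# Hypothetical objects: (grade code, rank out of 3, numeric score in 0..100).
kinds = ["nominal", "ordinal", "numeric"]
info = [None, 3, (0.0, 100.0)]
obj_a = ("A", 3, 45.0)
obj_b = ("B", 1, 22.0)

print(round(mixed_dissimilarity(obj_a, obj_b, kinds, info), 2))  # 0.74
```

Every per-attribute dissimilarity lies in [0,1], so the combined value also lies in [0,1] regardless of the mix of attribute types.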
Data Mining

Similarity and Dissimilarity: Cosine Similarity
Cosine Similarity

Cosine Similarity
• A document can be represented by thousands of attributes.
• How can the similarity between two documents be measured?

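Cosine Measure
• If d1 and d2 are two vectors (e.g., term-frequency vectors), then

\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]

where d1 · d2 is the dot product and ||d|| is the length (Euclidean norm) of vector d.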

Example
Assume two vectors
• d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
• d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

Example (continued)
d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25

||d1|| = 6.481
||d2|| = 4.123

cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
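The worked example can be verified with a few lines of Python. The function name `cosine_similarity` is an assumption for this sketch; the vectors are the two term-frequency vectors from the example.

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(sum(x * y for x, y in zip(d1, d2)))    # 25
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```

Because only the angle between the vectors matters, a long document and a short one with the same term proportions get a similarity of 1.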
Applications
• Information retrieval
• Biological taxonomy
• Gene feature mapping
• Plagiarism detection
• Recommendation systems
