29. Measuring Data Similarity and Dissimilarity: Introduction

The document discusses methods for measuring similarity and dissimilarity between data objects: proximity measures for nominal and binary attributes; standardizing numeric data using z-score normalization; distances between numeric data objects using Minkowski measures such as Manhattan, Euclidean, and supremum distance; and handling attributes of mixed types by combining per-attribute dissimilarities with a weighted formula. It also describes cosine similarity for comparing document vectors in applications such as information retrieval and recommendation systems.

Uploaded by

amna shahid


Data Mining

Similarity and Dissimilarity: Introduction
(Dis)similarity Introduction

Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0, 1]
(Dis)similarity Introduction

Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
(Dis)similarity Introduction
Proximity
• Closeness of data objects
• Objects with a relationship are placed together
• Proximity is associated with Balance
(Dis)similarity Introduction

Similarity measures
• Clustering
• Nearest neighbour
search
• Classification
• Prediction
• Characterization
• Categorization
• Correlation Analysis
Data Mining

Similarity and Dissimilarity: Data Matrix and Dissimilarity Matrix
Data Matrix
Data Matrix

• Objects have the same fixed number of attributes.
• Objects are represented as points in a multi-dimensional space.
• Each dimension describes one attribute.
Data Matrix
Data Matrix
• Represents n data points with p dimensions

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1k} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ik} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nk} & \cdots & x_{np}
\end{bmatrix}
$$
Data Matrix
Dissimilarity Matrix
• Stores only the pairwise distances d(i, j) between the n data points
• A triangular matrix

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Data Mining

Similarity and Dissimilarity: Proximity Measure
Proximity Measure

Nominal Attributes
• Nominal Attribute can take
two or more than two values.
• Method 1 for proximity
• Simple matching
• m: # of matches
• p: total # of variables

$$d(i, j) = \frac{p - m}{p}$$
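The simple matching rule can be sketched in a few lines of Python. This is only an illustration: the function name `simple_matching_dissimilarity` and the two sample objects are hypothetical and do not come from the slides.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Return d(i, j) = (p - m) / p for two equal-length tuples of nominal values."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)  # total number of attributes
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    m = sum(a == b for a, b in zip(obj_i, obj_j))  # number of matches
    return (p - m) / p


# Hypothetical objects described by three nominal attributes.
obj_a = ("red", "circle", "small")
obj_b = ("red", "square", "small")

# Two of the three attributes match, so d = (3 - 2) / 3.
print(simple_matching_dissimilarity(obj_a, obj_b))
```

Identical objects give a dissimilarity of 0, and objects that disagree on every attribute give 1.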
Proximity Measure

Nominal attributes
Method 2:
• Use a large number of binary attributes
• Create a new binary attribute for each of the M nominal states
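As a rough sketch of Method 2, a nominal value can be turned into M binary attributes, one per state. The helper name `one_hot` and the colour states below are assumptions made for illustration only.

```python
def one_hot(value, states):
    """Encode one nominal value as a list of 0/1 flags, one per possible state."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return [1 if value == s else 0 for s in states]


# Hypothetical nominal attribute with M = 3 states.
colour_states = ["red", "green", "blue"]

print(one_hot("green", colour_states))  # [0, 1, 0]
```

After this encoding, the binary-attribute proximity measures can be applied to the new attributes.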
Proximity Measure
Binary Attributes

• Binary attributes can take only two values.
• A contingency table for binary data

                    Object j
                    1      0      sum
Object i     1      q      r      q+r
             0      s      t      s+t
             sum    q+s    r+t    p
Proximity Measure
Binary attributes
Jaccard coefficient
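$$\mathrm{sim}_{\mathrm{Jaccard}}(i, j) = \frac{q}{q + r + s}$$

The Jaccard coefficient is the same as coherence:

$$\mathrm{coherence}(i, j) = \frac{\mathrm{sup}(i, j)}{\mathrm{sup}(i) + \mathrm{sup}(j) - \mathrm{sup}(i, j)} = \frac{q}{(q + r) + (q + s) - q}$$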
Proximity Measure
Dissimilarity in binary attributes
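Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N

Gender is a symmetric attribute and is left out. The remaining attributes are asymmetric binary, with Y and P coded as 1 and N coded as 0, and the dissimilarity is d(i, j) = (r + s) / (q + r + s).

$$d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33$$

$$d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67$$

$$d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75$$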
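The three values above can be reproduced with a short Python sketch. It assumes Gender is excluded and that Y and P are coded as 1 and N as 0; the function name and the choice to return 0.0 when q + r + s = 0 are illustrative assumptions, not part of the slides.

```python
def asymmetric_binary_dissimilarity(a, b):
    """Return (r + s) / (q + r + s) for two 0/1 vectors, ignoring 0-0 matches (t)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # 1 in a, 0 in b
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # 0 in a, 1 in b
    if q + r + s == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (Y/P -> 1, N -> 0).
patients = {
    "Jack": [1, 0, 1, 0, 0, 0],
    "Mary": [1, 0, 1, 0, 1, 0],
    "Jim":  [1, 1, 0, 0, 0, 0],
}

for first, second in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    d = asymmetric_binary_dissimilarity(patients[first], patients[second])
    print(first, second, round(d, 2))
```

The Jaccard similarity for the same pair is simply one minus this dissimilarity.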
Data Mining

Similarity and Dissimilarity: Standardizing Numeric Data
Standardizing Numeric Data

Standardization
Process of transforming incoming data into a common format

Purpose
Collaborative research
Large-scale analytics
Sharing of sophisticated
tools & methodologies
Standardizing Numeric Data

Z-score Standardization

x: raw score to be standardized
μ: mean of the population
σ: standard deviation of the population

$$z = \frac{x - \mu}{\sigma}$$
Standardizing Numeric Data
Z-Score example
An individual SAT score is 1100.
The mean SAT score is 1026.
The standard deviation is 209.

$$z = \frac{1100 - 1026}{209} \approx 0.354$$
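The SAT example can be checked with a minimal Python sketch of z = (x − μ) / σ. The function name `z_score` is a hypothetical helper introduced here.

```python
def z_score(x, mean, std):
    """Return the z-score (x - mean) / std of a raw score."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


# Values from the SAT example: score 1100, mean 1026, standard deviation 209.
z = z_score(1100, 1026, 209)
print(round(z, 3))  # 0.354
```

A positive z-score means the raw score lies above the mean, here by about a third of a standard deviation.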
Standardizing Numeric Data
Alternative way for Standardization
Calculate the mean absolute deviation

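$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

The standardized measurement is then

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$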
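A small Python sketch of this alternative standardization follows. The function name `standardize_with_mad` and the sample values are made up for illustration.

```python
def standardize_with_mad(values):
    """Standardize one attribute using the mean absolute deviation s_f."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    m_f = sum(values) / n                        # mean of the attribute
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical; s_f is zero")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values: mean is 5.0 and s_f is 2.0.
sample = [2.0, 4.0, 6.0, 8.0]
print(standardize_with_mad(sample))  # [-1.5, -0.5, 0.5, 1.5]
```

Because the deviations are not squared, an outlier inflates s_f less than it would inflate the standard deviation.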
Data Mining

Similarity and Dissimilarity: Distance on Numeric Data
Distance on Numeric Data
Minkowski Distance

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

p-dimensional data objects

• $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$
• $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$
h is the order
Distance on Numeric Data

Manhattan distance
• h = 1
• Hamming distance: for two binary vectors, the Manhattan distance is the number of bits that differ

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
Distance on Numeric Data

Euclidean distance

• h= 2
• L2 norm

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
Distance on Numeric Data

Supremum distance

• h= ∞
• Maximum difference between any
component (attribute) of the vectors
Minkowski Distance
The data set

point attribute 1 attribute 2

x1 1 2
x2 3 5
x3 2 0
x4 4 5
Minkowski Distance
Manhattan (L1) distance

L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Minkowski Distance
Euclidean (L2) distance

L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Minkowski Distance
Supremum (L∞) distance

L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
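The three dissimilarity matrices can be recomputed from the data set with a short Python sketch. The names `minkowski` and `distance_matrix` are hypothetical helpers, and the supremum case is handled by passing `math.inf` as the order h.

```python
import math


def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)


def distance_matrix(points, h):
    """Lower-triangular dissimilarity matrix as a {(row, column): distance} dict."""
    names = list(points)
    return {
        (names[i], names[j]): minkowski(points[names[i]], points[names[j]], h)
        for i in range(len(names))
        for j in range(i)
    }


# The data set from the slides.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

l1 = distance_matrix(points, 1)           # Manhattan
l2 = distance_matrix(points, 2)           # Euclidean
linf = distance_matrix(points, math.inf)  # supremum

for pair in l1:
    print(pair, l1[pair], round(l2[pair], 2), linf[pair])
```

Only the entries below the diagonal are stored, which mirrors the triangular layout of the dissimilarity matrix.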
Data Mining

Similarity and Dissimilarity: Attributes of Mixed Type
Attributes of Mixed Type

Attributes of different
types
Nominal

Symmetric binary

Asymmetric binary

Numeric

Ordinal
Attributes of Mixed Type
Combine effect of attributes

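Weighted formula to combine different types of attributes

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

Here $d_{ij}^{(f)}$ is the dissimilarity contributed by attribute f, and $\delta_{ij}^{(f)}$ is a 0/1 indicator that is 0 when attribute f should not count for the pair (for example, when a value is missing) and 1 otherwise.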
Attributes of Mixed Type

Binary attributes
f could be
• Binary
• Nominal

• $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$
• $d_{ij}^{(f)} = 1$ otherwise
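The weighted combination can be sketched in Python as follows. This is an illustration under stated assumptions: each attribute f contributes a dissimilarity d_f and a 0/1 indicator delta_f, the indicator is set to 0 when a value is missing (marked with `None`), and the function names and sample objects are hypothetical.

```python
def nominal_or_binary_d(x_if, x_jf):
    """Per-attribute dissimilarity for a nominal or binary attribute: 0 on a match, else 1."""
    return 0.0 if x_if == x_jf else 1.0


def combined_dissimilarity(per_attribute_d, indicators):
    """Weighted average of per-attribute dissimilarities using 0/1 indicators."""
    if len(per_attribute_d) != len(indicators):
        raise ValueError("need one indicator per attribute")
    denominator = sum(indicators)
    if denominator == 0:
        raise ValueError("no attribute contributes to the comparison")
    numerator = sum(delta * d for delta, d in zip(indicators, per_attribute_d))
    return numerator / denominator


# Hypothetical objects with three nominal attributes; None marks a missing value.
obj_i = ("red", "yes", None)
obj_j = ("red", "no", "large")

d_f = [nominal_or_binary_d(a, b) for a, b in zip(obj_i, obj_j)]
# Assumption: an attribute is skipped (indicator 0) when either value is missing.
delta_f = [0 if a is None or b is None else 1 for a, b in zip(obj_i, obj_j)]

print(combined_dissimilarity(d_f, delta_f))  # 0.5
```

Attributes of other types plug into the same combination once their own per-attribute dissimilarity is scaled to [0, 1].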
Attributes of Mixed Type
Ordinal attribute
Discrete or continuous
Order is important
Replace $x_{if}$ by its rank

$$r_{if} \in \{1, \ldots, M_f\}$$

Map to range [0, 1]

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

Treat $z_{if}$ as interval-scaled
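The rank mapping for ordinal attributes can be sketched in Python like this. The function name and the three-level grade scale are assumptions chosen for illustration.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (r - 1) / (M - 1), where r is its 1-based rank."""
    m_f = len(ordered_states)  # number of ordered states, M_f
    if m_f < 2:
        raise ValueError("need at least two ordered states")
    r_if = ordered_states.index(value) + 1  # rank in 1..M_f
    return (r_if - 1) / (m_f - 1)


# Hypothetical ordinal attribute, listed from lowest to highest.
grades = ["fair", "good", "excellent"]

print([ordinal_to_unit_interval(g, grades) for g in grades])  # [0.0, 0.5, 1.0]
```

Once mapped, the values can be compared with any of the numeric distances, for example the absolute difference of two z values.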
Data Mining

Similarity and Dissimilarity: Cosine Similarity
Cosine Similarity

Cosine Similarity
A document can be
represented by
thousands of
attributes.

How do we measure the similarity between two documents?
Cosine Similarity

Cosine Measure

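• If d1 and d2 are two vectors (e.g., term-frequency vectors), then

$$\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|}$$

where $d_1 \cdot d_2$ is the dot product and $\|d\|$ is the length (Euclidean norm) of vector d.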
Cosine Similarity

Example
Assume two vectors

• d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)

• d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
Cosine Similarity

Example
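$$d_1 \cdot d_2 = 5 \cdot 3 + 0 \cdot 0 + 3 \cdot 2 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 1 + 0 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 = 25$$

$$\|d_1\| = \sqrt{5^2 + 3^2 + 2^2 + 2^2} = \sqrt{42} \approx 6.481$$

$$\|d_2\| = \sqrt{3^2 + 2^2 + 1^2 + 1^2 + 1^2 + 1^2} = \sqrt{17} \approx 4.12$$

$$\cos(d_1, d_2) = \frac{25}{6.481 \times 4.12} \approx 0.94$$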
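The worked example can be verified with a minimal Python sketch of the cosine measure. The function name `cosine_similarity` is a hypothetical helper; the vectors are the ones from the example.

```python
import math


def cosine_similarity(u, v):
    """Return (u . v) / (||u|| * ||v||) for two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because the measure depends only on the angle between the vectors, scaling a document's term counts by a constant leaves the similarity unchanged.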
Cosine Similarity
Application
• Information retrieval

• Biological taxonomy

• Gene feature mapping

• Plagiarism detection

• Recommendation
Systems
