18CSE397T– Computational Data Analysis

Unit – 3: Session – 8: SLO – 2


OBJECT DISSIMILARITIES

Similarity and Dissimilarity


Distance and similarity measures are essential to many pattern recognition problems, such as
classification and clustering. Various distance/similarity measures are available in the literature
for comparing two data distributions. As the name suggests, a similarity measure quantifies how
close two distributions are. For multivariate data, more complex summary measures are needed to
answer this question.

Similarity Measure
 Numerical measure of how alike two data objects are.
 Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
 Numerical measure of how different two data objects are.
 Often ranges from 0 (objects are alike) to ∞ (objects are completely different).
Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
Here, p and q are the attribute values for two data objects.
 Nominal: d = 0 if p = q, d = 1 if p ≠ q;  s = 1 if p = q, s = 0 if p ≠ q.
 Ordinal (values mapped to integers 0 to n − 1, where n is the number of values):
   d = |p − q| / (n − 1);  s = 1 − |p − q| / (n − 1).
 Interval or Ratio: d = |p − q|;  s = 1 − |p − q| or s = 1 / (1 + |p − q|).
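The per-attribute proximities above can be sketched in code. This is a minimal illustration, not part of the original handout; the function names are our own.

```python
def nominal_proximity(p, q):
    """Nominal: similarity is 1 on an exact match, 0 otherwise; d = 1 - s."""
    s = 1 if p == q else 0
    return s, 1 - s

def ordinal_proximity(p, q, n):
    """Ordinal: values mapped to integers 0..n-1; d = |p - q| / (n - 1)."""
    d = abs(p - q) / (n - 1)
    return 1 - d, d

def interval_proximity(p, q):
    """Interval/ratio: d = |p - q|; here s = 1 / (1 + |p - q|)."""
    d = abs(p - q)
    return 1 / (1 + d), d
```

For example, two ordinal values at opposite ends of a 3-value scale (0 and 2, with n = 3) give d = 1 and s = 0.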

Common Properties of Dissimilarity Measures


Distance, such as the Euclidean distance, is a dissimilarity measure and has some well known
properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common
distance measures used to compare multivariate data. We will assume that the attributes are all continuous.
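The three metric properties can be checked numerically. The sketch below (our own illustration, not from the handout) verifies them for the Manhattan distance on a few sample points.

```python
from itertools import product

def manhattan(p, q):
    """L1 (city-block) distance between two points given as tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

points = [(0, 0), (1, 3), (4, 1), (2, 2)]
for p, q, r in product(points, repeat=3):
    assert manhattan(p, q) >= 0                      # non-negativity
    assert (manhattan(p, q) == 0) == (p == q)        # identity of indiscernibles
    assert manhattan(p, q) == manhattan(q, p)        # symmetry
    assert manhattan(p, r) <= manhattan(p, q) + manhattan(q, r)  # triangle inequality
```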
Euclidean Distance
Assume that we have measurements xik,  i = 1, … , N, on variables k = 1, … , p (also called
attributes).
The Euclidean distance between the ith and jth objects is
d_E(i, j) = ( Σ_{k=1..p} (x_ik − x_jk)^2 )^(1/2)

for every pair (i, j) of observations.

The weighted Euclidean distance is

d_WE(i, j) = ( Σ_{k=1..p} W_k (x_ik − x_jk)^2 )^(1/2)

If scales of the attributes differ substantially, standardization is necessary.
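A short sketch of the (weighted) Euclidean distance defined above; the optional `weights` argument, an assumption of this illustration, covers the weighted case with all weights defaulting to 1.

```python
import math

def euclidean(x_i, x_j, weights=None):
    """(Weighted) Euclidean distance between two p-dimensional observations."""
    if weights is None:
        weights = [1.0] * len(x_i)
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x_i, x_j)))
```

For instance, `euclidean([0, 0], [3, 4])` gives 5.0, the familiar 3-4-5 triangle.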

 Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.

With the measurement,  xik ,  i = 1, … , N,  k = 1, … , p, the Minkowski distance is


d_M(i, j) = ( Σ_{k=1..p} |x_ik − x_jk|^λ )^(1/λ),

where λ ≥ 1.  It is also called the L_λ metric.


 λ = 1 : L_1 metric, Manhattan or city-block distance.
 λ = 2 : L_2 metric, Euclidean distance.
 λ → ∞ : L_∞ metric, supremum distance.

lim_{λ→∞} ( Σ_{k=1..p} |x_ik − x_jk|^λ )^(1/λ) = max(|x_i1 − x_j1|, ..., |x_ip − x_jp|)

Note that λ and p are two different parameters: p is the (finite) dimension of the data matrix, while λ is the order of the distance.
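The three special cases can be handled by one function. This is a minimal sketch (our own, not from the handout); passing `math.inf` for λ implements the supremum limit above directly as a maximum.

```python
import math

def minkowski(x_i, x_j, lam):
    """Minkowski (L_lambda) distance; lam=math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x_i, x_j)]
    if lam == math.inf:
        return max(diffs)
    return sum(d ** lam for d in diffs) ** (1 / lam)
```

For the pair (0, 0) and (3, 4): λ = 1 gives 7 (Manhattan), λ = 2 gives 5 (Euclidean), and λ → ∞ gives 4 (supremum).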

Mahalanobis Distance
Let X be an N × p matrix. Then the ith row of X is

x_i^T = (x_i1, ..., x_ip)

The Mahalanobis distance is

d_MH(i, j) = ( (x_i − x_j)^T Σ^(−1) (x_i − x_j) )^(1/2)

where Σ is the p × p sample covariance matrix.
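A sketch of the Mahalanobis distance for the p = 2 case, with the 2×2 covariance matrix inverted by hand; this is our own illustration, and a general-p version would invert Σ with a linear-algebra library such as `numpy.linalg.inv`.

```python
import math

def mahalanobis_2d(x_i, x_j, cov):
    """Mahalanobis distance for p = 2; cov is the 2x2 sample covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c                      # determinant of Sigma
    inv = [[d / det, -b / det],
           [-c / det, a / det]]              # Sigma^{-1} by the 2x2 cofactor formula
    diff = [x_i[0] - x_j[0], x_i[1] - x_j[1]]
    # quadratic form (x_i - x_j)^T Sigma^{-1} (x_i - x_j)
    q = sum(diff[r] * inv[r][s] * diff[s] for r in range(2) for s in range(2))
    return math.sqrt(q)
```

With the identity matrix as covariance, the Mahalanobis distance reduces to the Euclidean distance.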

Self-check
Think About It!
Calculate the answers to these questions by yourself before checking the solutions.

   1. We have

      X = ( 1 3 1 2 4
            1 2 1 2 1
            2 2 2 2 2 )

       Calculate the Euclidean distances.

       Calculate the Minkowski distances (λ = 1 and λ → ∞ cases).

   2. We have

      X = ( 2  3
            10 7
            3  2 )

       Calculate the Minkowski distances (λ = 1, λ = 2, and λ → ∞ cases) between the first and
second objects.
       Calculate the Mahalanobis distance between the first and second objects.

Common Properties of Similarity Measures


Similarity measures have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,

2. s(p, q) = s(q, p) for all p and q,

where s(p, q) is the similarity between data objects p and q.
Similarity Between Two Binary Variables
The above similarity or distance measures are appropriate for continuous variables. However, for
binary variables a different approach is necessary.

 Simple Matching and Jaccard Coefficients

 Simple matching coefficient = (n_11 + n_00) / (n_11 + n_10 + n_01 + n_00).

 Jaccard coefficient = n_11 / (n_11 + n_10 + n_01).

Here n_11 is the number of attributes for which both p and q are 1, n_00 the number for which both
are 0, n_10 the number for which p = 1 and q = 0, and n_01 the number for which p = 0 and q = 1.
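Both coefficients follow directly from the four counts. A minimal sketch (our own illustration) for two equal-length binary vectors:

```python
def binary_counts(p, q):
    """Return (n11, n10, n01, n00) for two equal-length binary vectors."""
    n11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    return n11, n10, n01, n00

def smc(p, q):
    """Simple matching coefficient: matches over all attributes."""
    n11, n10, n01, n00 = binary_counts(p, q)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches, ignoring 0-0 matches."""
    n11, n10, n01, n00 = binary_counts(p, q)
    return n11 / (n11 + n10 + n01)
```

Note the design difference: the SMC counts 0-0 matches as agreement, while the Jaccard coefficient ignores them, which matters for sparse binary data.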
Self-check
Think About It!
Calculate the answer to the question yourself before checking the solution.
   1. Given data:

 p = 1 0 0 0 0 0 0 0 0 0
 q = 0 0 0 0 0 0 1 0 0 1

The frequency table is

              q = 1   q = 0
      p = 1     0       1
      p = 0     2       7
Calculate the Simple matching coefficient and the Jaccard coefficient.

