CS822-DataMining-Week4 (2)

The document discusses measuring data similarity and dissimilarity, explaining concepts such as similarity scores, dissimilarity metrics (Euclidean, Manhattan, Jaccard), and proximity measures for both numeric and nominal attributes. It also covers standardization techniques like Z-score and various distance metrics used in data analysis. Understanding these concepts is essential for improving the accuracy and efficiency of machine learning models.


1

CS822 Data Mining
Instructor: Dr. Muhammad Tahir

2
Measuring Data Similarity and
Dissimilarity


4
Similarity
• A numerical measure that indicates how alike two data
objects are.
• Often between 0 and 1, where 1 means identical and
0 means completely different.
• Example:
• Suppose we compare two text documents based on word
usage.
• Doc1: "Artificial intelligence is transforming industries."
• Doc2: "Industries are being transformed by artificial intelligence."
• Using cosine similarity, their similarity score might be
close to 1 because they contain the same words in a
different order.

5
Dissimilarity
• A numerical measure that indicates how different two
data objects are.
• The minimum dissimilarity is often 0 (identical objects),
but the upper limit varies depending on the method.
• Example:
• In a customer segmentation task, we compare two
customers based on their age:
• Customer A: Age 25
• Customer B: Age 50
• Using Euclidean distance, the dissimilarity is |50 - 25| =
25, meaning they are quite different in terms of age.
6
Dissimilarity …
Euclidean Distance (No Fixed Upper Limit)
• The straight-line distance between two points in a multi-
dimensional space.
• Example:
• Consider two points in a 2D space:
• A (2, 3)
• B (10, 15)
• Euclidean distance = √((10 − 2)² + (15 − 3)²) = √(64 + 144) = √208 ≈ 14.42
• Upper limit: No fixed value; it depends on the scale of the
data. In another dataset, distances might be in the hundreds
or thousands.
7
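The straight-line calculation above can be sketched in a few lines of Python. This is an illustrative example, not part of the original slides; the function name is ours, and only the standard library is used.

```python
import math

def euclidean_distance(p, q):
    """Straight-line (L2) distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Points from the slide: A(2, 3) and B(10, 15)
d = euclidean_distance((2, 3), (10, 15))
print(round(d, 2))  # sqrt(64 + 144) = sqrt(208) ≈ 14.42
```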
Dissimilarity …
Manhattan Distance (Upper Limit Depends on Data
Range)
• Measures the sum of absolute differences between
coordinates.
• Example:
• A (2,3) and B (10,15)
• Manhattan distance = |10 − 2| + |15 − 3| = 8 + 12 = 20
• Upper limit: Can be very high if data values are large (e.g.,
city distances in kilometers).

8
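The same pair of points can be checked with a minimal Python sketch (illustrative only; the function name is ours):

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (L1 / city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Points from the slide: A(2, 3) and B(10, 15)
print(manhattan_distance((2, 3), (10, 15)))  # |10-2| + |15-3| = 20
```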
Dissimilarity …
Jaccard Distance (Fixed Upper Limit = 1)
• Used for comparing binary or set-based data. It is calculated as:

  Jaccard Distance = 1 − |A ∩ B| / |A ∪ B|
• Example:
• Set A = {apple, banana, mango}
• Set B = {banana, mango, orange}
• Intersection = {banana, mango} (2 elements)
• Union = {apple, banana, mango, orange} (4 elements)
• Jaccard Similarity = 2/4 = 0.5
• Jaccard Distance = 1 - 0.5 = 0.5
• Upper limit: Always 1, because the maximum possible difference is total
dissimilarity (no overlap).

9
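The fruit-set example maps directly onto Python's set operations. A minimal sketch (not from the slides):

```python
def jaccard_distance(a, b):
    """1 minus the ratio of shared elements to total distinct elements."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

# Sets from the slide
A = {"apple", "banana", "mango"}
B = {"banana", "mango", "orange"}
print(jaccard_distance(A, B))  # 1 - 2/4 = 0.5
```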
Proximity
• A general term that can refer to either
similarity or dissimilarity between data
objects.
• Example:
• In a recommendation system (e.g., Netflix,
Spotify), proximity between users is measured to
suggest similar content:
• If User A and User B both watch sci-fi movies, their
proximity score is high, so they get similar movie
recommendations.

10
Data Matrix and Dissimilarity Matrix

11
Data Matrix
• A data matrix is a structured table where:
  • rows represent objects (instances) and
  • columns represent features (attributes).

Example of a Data Matrix

Object   Height (cm)   Weight (kg)   Age (years)
A        170           65            25
B        160           55            30
C        175           70            28

Each row = an entity (e.g., a person).
Each column = an attribute (e.g., height, weight, age).
12
Dissimilarity Matrix
• A dissimilarity matrix shows how different (or distant) each object is
from the others.
• Instead of raw data, it contains distance values (e.g., Euclidean,
Manhattan, Jaccard).
• It is often symmetric (the distance between A and B is the same as B
to A).
• Diagonal values (self-comparison) are usually 0 (an object is
identical to itself).

Example of a Dissimilarity Matrix (Using Euclidean Distance)

     A    B    C
A    0    10   8
B    10   0    5
C    8    5    0

A to B = 10 → They are more different.
B to C = 5 → They are more similar.
C to A = 8 → Moderate difference.
13
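The mechanics of building such a matrix can be sketched in Python. Note this is an illustration only: the 10/8/5 values on the slide are illustrative and are not derived from the earlier height/weight/age table, so the sketch computes actual Euclidean distances on that table instead.

```python
import math

def dissimilarity_matrix(data):
    """Pairwise Euclidean distances between the rows of a data matrix."""
    n = len(data)
    return [[math.dist(data[i], data[j]) for j in range(n)] for i in range(n)]

# Data matrix from the earlier slide: rows A, B, C with (height, weight, age)
data = [(170, 65, 25), (160, 55, 30), (175, 70, 28)]
for row in dissimilarity_matrix(data):
    print([round(v, 2) for v in row])
# The diagonal is 0 and the matrix is symmetric, e.g. d(A, B) = 15.0
```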
Proximity Measure for Nominal
Attributes

14
Proximity Measure for Nominal
Attributes
• Proximity measures (similarity and dissimilarity) for
nominal attributes are used when dealing with
categorical data, where values represent distinct
categories with no inherent numerical ordering.

15
Proximity Measure for Nominal
Attributes
Understanding Nominal Attributes

• Nominal attributes are qualitative and describe
different categories or labels. Examples include:
• Colors: {Red, Blue, Green}
• Car Brands: {Toyota, Ford, Honda}
• Job Titles: {Engineer, Doctor, Teacher}
• Since these values do not have a numeric meaning,
we use specific techniques to measure proximity
(similarity/dissimilarity).
16
Proximity Measure for Nominal
Attributes
Dissimilarity Measure for Nominal Attributes

The simplest method is the Simple Matching Coefficient (SMC):

  Dissimilarity = (number of mismatches) / (total number of attributes)

Example (Comparing two people based on their attributes)

Person   Car Brand   Eye Color   Job Title
A        Toyota      Brown       Engineer
B        Honda       Brown       Teacher

Total attributes = 3
Mismatches = 2 (Car Brand and Job Title are different)
Dissimilarity = 2/3 ≈ 0.67 (higher value = more dissimilar)
17
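The simple matching calculation works on any equal-length tuples of categorical values. A minimal Python sketch (illustrative; the function name is ours):

```python
def smc_dissimilarity(x, y):
    """Fraction of attribute positions where the two objects disagree."""
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)

# People from the slide: (Car Brand, Eye Color, Job Title)
person_a = ("Toyota", "Brown", "Engineer")
person_b = ("Honda", "Brown", "Teacher")
print(round(smc_dissimilarity(person_a, person_b), 2))  # 2/3 ≈ 0.67
```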
Proximity Measure for Nominal
Attributes

Similarity Measure for Nominal Attributes

Simple Matching Similarity (SMC):

  Similarity = (number of matches) / (total number of attributes)

Example:

Person   Car Brand   Eye Color   Job Title
A        Toyota      Brown       Engineer
B        Honda       Brown       Teacher

• Matches = 1 (Eye Color is the same)
• Similarity = 1/3 ≈ 0.33 (lower value = less similar)

18
Proximity Measure for Nominal
Attributes

Jaccard Similarity for Binary Nominal Data

If nominal attributes are binary (Yes/No, 1/0), Jaccard Similarity is
often used:

  Jaccard Similarity = (1-1 matches) / (total attributes − 0-0 matches)

In the Jaccard Similarity measure, 1-1 matches and 0-0 matches refer to
how binary (yes/no, 1/0) attributes align between two objects.
• 1-1 Match: Both objects have the attribute as "1" (yes, true, present, etc.).
• 0-0 Match: Both objects have the attribute as "0" (no, false, absent, etc.).

User   Likes Apple   Likes Samsung   Likes Sony
A      1             1               0
B      1             0               1

1-1 Matches = 1 (both users like the same product: Apple)
0-0 Matches = 0 (there is no product both users dislike)
Total Attributes − 0-0 Matches = 3
Jaccard Similarity = 1/3 ≈ 0.33
19
Proximity Measure for Nominal
Attributes
• A contingency table is a table that summarizes the
frequency of different combinations of two
categorical (or binary) variables.
• It helps in analyzing the relationship between two
variables.
• It allows us to compare two objects or variables.
• It helps in calculating similarity and dissimilarity.
• Used in data mining, statistics, and machine learning
to measure how similar or different two entities are.

20
Contingency Table for Binary Data

• A contingency table for binary data (data with only 1s (Yes) and
0s (No)) looks like this:

                 Object j = 1   Object j = 0   Total
Object i = 1     q              r              q + r
Object i = 0     s              t              s + t
Total            q + s          r + t          p (total data)

• What do these values mean?
  • q → Both objects are 1 (Yes, Yes)
  • r → Object i is 1, but j is 0 (Yes, No)
  • s → Object i is 0, but j is 1 (No, Yes)
  • t → Both objects are 0 (No, No)

21
Example: Sports Preferences of
Alice & Bob
Imagine Alice and Bob are being compared based on
whether they like certain sports (Yes = 1, No = 0):

Sport      Alice (i)   Bob (j)
Football   1           1
Cricket    1           0
Tennis     0           0

In the contingency table:
q = 1 (Football: both like it)
r = 1 (Cricket: Alice likes it, Bob does not)
s = 0 (no case where Bob likes a sport but Alice does not)
t = 1 (Tennis: neither likes it)

Contingency Table for Alice & Bob

                 Bob (j) = 1   Bob (j) = 0   Total
Alice (i) = 1    q = 1         r = 1         q + r = 2
Alice (i) = 0    s = 0         t = 1         s + t = 1
Total            q + s = 1     r + t = 2     p = 3

Now, let's use this table to understand similarity and
dissimilarity measures.
22
Symmetric vs. Asymmetric Binary
Distance
• When comparing two objects based on binary (Yes/No or
1/0) attributes, we use different distance measures
depending on whether both 1s and 0s matter
equally or not.

23
Symmetric Binary Distance
• This considers both agreements (1-1) and disagreements (1-0 or 0-1)
equally.
• It is useful when both presence and absence of an attribute are meaningful.
• Example: Comparing disease symptoms in two patients (having or not having a
symptom matters equally).
Formula:

  d(i, j) = (r + s) / (q + r + s + t)

where:
• q = both are 1 (1-1 match)
• r = Alice is 1, Bob is 0 (1-0 mismatch)
• s = Alice is 0, Bob is 1 (0-1 mismatch)
• t = both are 0 (0-0 match)
24
Asymmetric Binary Distance
• Here, 0-0 matches (both saying "No") are ignored
because only the presence of an attribute matters.
• Used when only positive occurrences (1s) are
meaningful and 0s are unimportant.
• Example: Diagnosing rare diseases (if both don’t
have the disease, it doesn’t matter, but if one does and
the other doesn’t, it does).
Formula:

  d(i, j) = (r + s) / (q + r + s)
25
Jaccard Similarity
• Jaccard Similarity is a measure of how similar two binary objects are,
considering only the presence (1s) of attributes and
ignoring 0s.

  sim(i, j) = q / (q + r + s)

• q = Both objects have 1 (1-1 match)
• r = Object i has 1, but object j has 0 (1-0 mismatch)
• s = Object i has 0, but object j has 1 (0-1 mismatch)
• t (0-0 matches) is ignored in Jaccard similarity

Jaccard is useful when only the presence of attributes
matters, such as in:
• Text analysis (words present in two documents)
• Market basket analysis (common items in shopping carts)
• Genetic similarity (shared mutations between species)

26
Example: Sports Preferences of
Alice & Bob
• Example: Sports Preferences of Alice & Bob

Sport      Alice (i)   Bob (j)
Football   1           1
Cricket    1           0
Tennis     0           0

From the contingency table: q = 1, r = 1, s = 0, t = 1, p = 3

• Symmetric Binary Distance Calculation:
  d(Alice, Bob) = (r + s) / (q + r + s + t) = (1 + 0) / 3 ≈ 0.33

• Asymmetric Binary Distance Calculation:
  d(Alice, Bob) = (r + s) / (q + r + s) = (1 + 0) / 2 = 0.5

• Jaccard Similarity:
  sim(Alice, Bob) = q / (q + r + s) = 1 / 2 = 0.5
  (50% similarity based on shared preferences)

27
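All three binary measures fall out of the contingency counts (q, r, s, t). A minimal Python sketch (illustrative; the helper name is ours):

```python
def binary_counts(x, y):
    """Contingency counts (q, r, s, t) for two equal-length binary vectors."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))  # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))  # 0-1 mismatches
    t = sum(a == 0 and b == 0 for a, b in zip(x, y))  # 0-0 matches
    return q, r, s, t

alice = [1, 1, 0]  # Football, Cricket, Tennis
bob   = [1, 0, 0]
q, r, s, t = binary_counts(alice, bob)
print(round((r + s) / (q + r + s + t), 2))  # symmetric distance ≈ 0.33
print((r + s) / (q + r + s))                # asymmetric distance = 0.5
print(q / (q + r + s))                      # Jaccard similarity = 0.5
```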
Standardizing Numeric Data with Z-
score

28
Standardizing Numeric Data with Z-
score
What is Standardization?
• Standardization is a technique used to transform
numerical data so that different features have a
common scale. This is crucial when:
• Data has different units (e.g., height in cm vs. weight in kg).
• Features have different ranges (e.g., one feature ranges from
1–1000, another from 0–1).
• Machine learning models are sensitive to magnitudes (e.g.,
distance-based algorithms like k-NN, K-means).

29
Standardizing Numeric Data with Z-
score
What is Z-score Standardization?
Z-score standardization transforms data to have:
• Mean (μ) = 0
• Standard deviation (σ) = 1
Formula for Z-score:

  Z = (X − μ) / σ
• X = Original value
• μ = Mean of the dataset
• σ = Standard deviation of the dataset
• This means each value is now represented in terms of how
many standard deviations it is away from the mean.
30
Example: Standardizing Exam Scores

• Raw Data (Math Exam Scores of 5 Students):

Student   Score (X)
A         80
B         60
C         75
D         90
E         85

• Step 1: Compute Mean (μ):
  μ = (80 + 60 + 75 + 90 + 85) / 5 = 390 / 5 = 78

• Step 2: Compute Standard Deviation (σ):
  σ = √[((80−78)² + (60−78)² + (75−78)² + (90−78)² + (85−78)²) / 5]
    = √[(4 + 324 + 9 + 144 + 49) / 5] = √106 ≈ 10.30

• Step 3: Compute Z-score for Each Student

Student   Score (X)   Z-score
A         80          (80 − 78) / 10.30 ≈ 0.19
B         60          (60 − 78) / 10.30 ≈ −1.75
C         75          (75 − 78) / 10.30 ≈ −0.29
D         90          (90 − 78) / 10.30 ≈ 1.17
E         85          (85 − 78) / 10.30 ≈ 0.68

Interpretation of Z-scores
• Positive Z-score (Z > 0) → Value is above the mean (e.g., Student D is
1.17 standard deviations above the mean).
• Negative Z-score (Z < 0) → Value is below the mean (e.g., Student B is
at −1.75, much lower than the mean).

31
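The worked example can be reproduced in Python. Note the slide's σ ≈ 10.30 implies the population standard deviation (division by n), so the sketch below uses `statistics.pstdev`; this is an inference from the slide's numbers, not something the slide states explicitly.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std dev, matching the slide
    return [(x - mu) / sigma for x in values]

scores = [80, 60, 75, 90, 85]  # students A through E
print([round(z, 2) for z in z_scores(scores)])
# [0.19, -1.75, -0.29, 1.17, 0.68]
```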
Commonly Used Distance Measures/Metrics
• Euclidean distance measures the straight-line
distance between two points in a multi-dimensional
space
• Manhattan distance is useful when the
dimensions in the data have different units of
measurement
• Chebyshev distance is ideal for applications where
the maximum difference between two dimensions is
more important than the individual differences.
• Mahalanobis distance takes into account the
covariance between variables. This is especially
useful in applications where the dimensions are
correlated
• Hamming distance is used to measure the
difference between two strings of equal length.
• The Haversine distance is used to calculate the
distance between two points on a sphere
• Cosine distance is derived from the cosine similarity between
two non-zero vectors of an inner product space (distance = 1 − similarity)
32
Commonly Used Distance
Measures/Metrics
• Understanding the strengths and weaknesses of each
distance metric is crucial in selecting the appropriate
metric for a given problem.
• By choosing the right distance metric, we can improve
the accuracy and efficiency of our machine learning
models.

May be included in the exam

33
Minkowski Distance
• Minkowski distance is a generalized distance metric
that includes Euclidean and Manhattan distances as
special cases. It is defined as:

  d(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)

• x and y are two points in n-dimensional space.
• p is the order of the Minkowski distance.
• |xᵢ − yᵢ| is the absolute difference between the coordinates of
the two points.
34
Special Cases of Minkowski Distance
• Minkowski Distance varies depending on the value
of p:
Manhattan Distance (p = 1) (city block, L1 norm)

  d(x, y) = Σᵢ |xᵢ − yᵢ|

• Interpretation: The sum of absolute differences
between coordinates.
• Use case: When movement is restricted to grid-based
paths (like city blocks).
• Example: Points: A(1, 2) and B(4, 6)
• Distance = |4 − 1| + |6 − 2| = 3 + 4 = 7

35
Special Cases of Minkowski Distance
• Euclidean Distance (p = 2) (L2 norm)

  d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

• Interpretation: The straight-line distance between two
points.
• Use case: When a direct path is possible.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = √((4 − 1)² + (6 − 2)²) = √(9 + 16) = √25 = 5

36
Special Cases of Minkowski Distance
• Chebyshev Distance (p → ∞) ("supremum"; Lmax norm, L∞ norm)

  d(x, y) = maxᵢ |xᵢ − yᵢ|

• Interpretation: The maximum absolute difference
along any dimension.
• Use case: When diagonal moves are allowed and have
the same cost as horizontal/vertical moves.
• Example: Points: A(1, 2) and B(4, 6)
• Distance = max(|4 − 1|, |6 − 2|) = max(3, 4) = 4
37
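All three special cases can be produced by one generalized function. A minimal Python sketch (illustrative; the function name is ours, and `math.inf` is used to select the Chebyshev case):

```python
import math

def minkowski_distance(x, y, p):
    """Minkowski distance of order p; p = math.inf gives Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

A, B = (1, 2), (4, 6)
print(minkowski_distance(A, B, 1))         # 7.0 (Manhattan)
print(minkowski_distance(A, B, 2))         # 5.0 (Euclidean)
print(minkowski_distance(A, B, math.inf))  # 4   (Chebyshev)
```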
Special Cases of Minkowski Distance
• Minkowski Distance generalizes multiple
distance metrics.
• Choice of p affects the result:
  • p = 1 (City block movement)
  • p = 2 (Straight-line movement)
  • p → ∞ (Max absolute difference)

38
Properties of Minkowski Distance
• Positive Definiteness
  • Distance is always non-negative: d(x, y) ≥ 0
  • Distance is zero only when the points are the same: d(x, y) = 0 if and only if x = y
• Symmetry
  • Distance is the same in both directions: d(x, y) = d(y, x)
• Triangle Inequality
  • The direct distance between two points is always less than or
    equal to the sum of the distances via a third point: d(x, z) ≤ d(x, y) + d(y, z)
  • Example: If going from A to C directly is 5 units, but going A →
    B → C is 6 units, then direct travel is the shortest path.

Since Minkowski Distance satisfies these three conditions, it is
considered a valid distance metric.

39
Ordinal Variables

40
Ordinal Variables
• Ordinal data is a type of categorical data where values are ranked or
ordered, but the differences between them are not necessarily equal.
• Unlike numerical data, you cannot perform meaningful arithmetic operations like
addition or subtraction.
Examples of Ordinal Data
1.Movie Ratings (e.g., 1 star, 2 stars, ..., 5 stars)
2.Education Level (e.g., High School < Bachelor's < Master's < PhD)
3.Customer Satisfaction (e.g., Very Dissatisfied < Dissatisfied < Neutral < Satisfied <
Very Satisfied)
4.Pain Level in Medical Surveys (e.g., No Pain < Mild Pain < Moderate Pain < Severe
Pain)
• Even though these values have an order, the difference between "Satisfied" and
"Very Satisfied" is not necessarily the same as between "Neutral" and "Satisfied."
41
Cosine Similarity with Ordinal Data:
Example
• Scenario: Movie Ratings
• Two users (A & B) rate three movies on a scale of 1
to 5 (where 1 = worst, 5 = best).

Movie     User A Ratings   User B Ratings
Movie 1   5                4
Movie 2   3                2
Movie 3   4                5

• Step 1: Compute the Dot Product
  A · B = 5×4 + 3×2 + 4×5 = 20 + 6 + 20 = 46
• Step 2: Compute Euclidean Norms
  • For User A: ||A|| = √(5² + 3² + 4²) = √50 ≈ 7.07
  • For User B: ||B|| = √(4² + 2² + 5²) = √45 ≈ 6.71
• Step 3: Compute Cosine Similarity
  cos(A, B) = 46 / (7.07 × 6.71) ≈ 0.97

• Cosine similarity = 0.97 → Very high
similarity between User A and User B's ratings.
• If the similarity were closer to 0, it would
mean their ratings are quite different.
• If it were negative, it would mean they
have opposite preferences.

42
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or phrase
in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...

• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

where · indicates the vector dot product and ||d|| is the length of vector d.

43
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.12

cos(d1, d2) = 25 / (6.481 × 4.12) ≈ 0.94

44
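The document-vector example above can be verified with a short Python sketch (illustrative only, using the standard library; the function name is ours):

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Term-frequency vectors from the slide
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 25 / (6.481 × 4.12) ≈ 0.94
```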
You are welcome

45
