CS822-DataMining-Week4 (2)
CS822 Data Mining
Instructor: Dr. Muhammad Tahir
Measuring Data Similarity and Dissimilarity
Similarity
• A numerical measure that indicates how alike two data
objects are.
• Often between 0 and 1, where 1 means identical and
0 means completely different.
• Example:
• Suppose we compare two text documents based on word
usage.
• Doc1: "Artificial intelligence is transforming industries."
• Doc2: "Industries are being transformed by artificial intelligence."
• Using cosine similarity, their similarity score might be close to 1 because they contain the same words in a different order.
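The document comparison above can be sketched in a few lines of Python. This is a minimal bag-of-words version: it lowercases, strips the final period, and counts words. Note that without stemming, "transforming" and "transformed" count as different words, so the raw score is moderate; normalizing word forms (e.g., stemming both to "transform") would push the score toward 1, as the slide suggests.

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine similarity between two texts using raw word counts."""
    a = Counter(doc1.lower().replace(".", "").split())
    b = Counter(doc2.lower().replace(".", "").split())
    words = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in words)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

doc1 = "Artificial intelligence is transforming industries."
doc2 = "Industries are being transformed by artificial intelligence."
score = cosine_similarity(doc1, doc2)  # ≈ 0.51 without stemming
```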
Dissimilarity
• A numerical measure that indicates how different two
data objects are.
• The minimum dissimilarity is often 0 (identical objects),
but the upper limit varies depending on the method.
• Example:
• In a customer segmentation task, we compare two
customers based on their age:
• Customer A: Age 25
• Customer B: Age 50
• Using Euclidean distance, the dissimilarity is |50 - 25| =
25, meaning they are quite different in terms of age.
Dissimilarity …
Euclidean Distance (No Fixed Upper Limit)
• The straight-line distance between two points in a multi-
dimensional space.
• Example:
• Consider two points in a 2D space:
• A (2, 3)
• B (10, 15)
• Euclidean distance = √((10 − 2)² + (15 − 3)²) = √(64 + 144) = √208 ≈ 14.42
• Upper limit: No fixed value—it depends on the scale of the
data. In another dataset, distances might be in the hundreds
or thousands.
Dissimilarity …
Manhattan Distance (Upper Limit Depends on Data
Range)
• Measures the sum of absolute differences between
coordinates.
• Example:
• A (2,3) and B (10,15)
• Manhattan distance = |10 − 2| + |15 − 3| = 8 + 12 = 20
• Upper limit: Can be very high if data values are large (e.g.,
city distances in kilometers).
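Both distances on the points A (2, 3) and B (10, 15) can be computed with a short sketch:

```python
import math

def euclidean(p, q):
    # Straight-line (L2) distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(a - b) for a, b in zip(p, q))

A, B = (2, 3), (10, 15)
d_euclid = euclidean(A, B)   # √208 ≈ 14.42
d_manhat = manhattan(A, B)   # 8 + 12 = 20
```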
Dissimilarity …
Jaccard Distance (Fixed Upper Limit = 1)
• Used for comparing binary or set-based data. It is calculated as:
• Jaccard Distance = 1 − |A ∩ B| / |A ∪ B|
• Example:
• Set A = {apple, banana, mango}
• Set B = {banana, mango, orange}
• Intersection = {banana, mango} (2 elements)
• Union = {apple, banana, mango, orange} (4 elements)
• Jaccard Similarity = 2/4 = 0.5
• Jaccard Distance = 1 - 0.5 = 0.5
• Upper limit: Always 1, because the maximum possible difference is total dissimilarity (no overlap).
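The fruit-set example maps directly onto Python's set operations:

```python
def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| for two sets.
    return len(a & b) / len(a | b)

A = {"apple", "banana", "mango"}
B = {"banana", "mango", "orange"}
sim = jaccard_similarity(A, B)   # 2/4 = 0.5
dist = 1 - sim                   # Jaccard distance = 0.5
```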
Proximity
• A general term that can refer to either
similarity or dissimilarity between data
objects.
• Example:
• In a recommendation system (e.g., Netflix,
Spotify), proximity between users is measured to
suggest similar content:
• If User A and User B both watch sci-fi movies, their
proximity score is high, so they get similar movie
recommendations.
Data Matrix and Dissimilarity Matrix
Data Matrix
• A data matrix is a structured table where:
• rows represent objects (instances), and
• columns represent attributes (features).
• Example:

Object | Height (cm) | Weight (kg) | Age (years)
A      | 170         | 65          | 25
Proximity Measure for Nominal Attributes
• Proximity measures (similarity and dissimilarity) for
nominal attributes are used when dealing with
categorical data, where values represent distinct
categories with no inherent numerical ordering.
Proximity Measure for Nominal Attributes …
Understanding Nominal Attributes
• Dissimilarity between objects i and j: d(i, j) = (p − m) / p, where p = total number of attributes and m = number of matches.
• Example (objects A and B compared on 3 nominal attributes):
• Total attributes (p) = 3
• Mismatches = 2 (Car Brand and Job Title are different)
• Dissimilarity = (3 − 1) / 3 = 2/3 ≈ 0.67 (higher value = more dissimilar)
Proximity Measure for Nominal Attributes …
Simple Matching Similarity (SMC):

Person | Car Brand | Eye Color | Job Title
A      | Toyota    | Brown     | Engineer
B      | Honda     | Brown     | Teacher

Example:
• Matches (m) = 1 (Eye Color is the same)
• Similarity = m / p = 1/3 ≈ 0.33 (lower value = less similar)
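The mismatch-ratio calculation above is easy to express in code. A minimal sketch, using the Toyota/Honda example:

```python
def nominal_dissimilarity(x, y):
    # d(i, j) = (p - m) / p : fraction of attributes that mismatch.
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

A = ("Toyota", "Brown", "Engineer")
B = ("Honda", "Brown", "Teacher")
d = nominal_dissimilarity(A, B)   # 2/3 ≈ 0.67
sim = 1 - d                       # 1/3 ≈ 0.33 (simple matching similarity)
```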
Proximity Measure for Nominal Attributes …
In the Jaccard Similarity measure, 1-1 matches and 0-0 matches refer to how binary (yes/no, 1/0) attributes align between two objects.
• 1-1 Match: Both objects have the same attribute as "1" (yes, true, present, etc.).
• 0-0 Match: Both objects have the same attribute as "0" (no, false, absent, etc.).
Example (users A and B over three products; B = 1, 0, 1):
• 1-1 Matches = 1 (both users like the same product: Likes Apple)
• 0-0 Matches = 0 (no product both users dislike)
• Total Attributes − 0-0 Matches = 3
• Jaccard Similarity = 1-1 Matches / (Total Attributes − 0-0 Matches) = 1/3 ≈ 0.33
Proximity Measure for Nominal Attributes …
• A contingency table is a table that summarizes the
frequency of different combinations of two
categorical (or binary) variables.
• It helps in analyzing the relationship between two
variables.
• It allows us to compare two objects or variables.
• It helps in calculating similarity and dissimilarity.
• Used in data mining, statistics, and machine learning
to measure how similar or different two entities are.
Contingency Table for Binary Data
Example: Sports Preferences of Alice & Bob
Imagine Alice and Bob are being compared based on whether they like certain sports (Yes = 1, No = 0):

Sport    | Alice (i) | Bob (j)
Football | 1         | 1
Cricket  | 1         | 0
Tennis   | 0         | 0

In the contingency table:
• q = 1 (Football: both like it)
• r = 1 (Cricket: Alice likes it, Bob doesn't)
• s = 0 (no case where Bob likes but Alice doesn't)
• t = 1 (Tennis: neither likes it)

Contingency Table for Alice & Bob:

                     | Object j (Bob) = 1 | Object j (Bob) = 0 | Total
Object i (Alice) = 1 | q = 1              | r = 1              | q + r = 2
Object i (Alice) = 0 | s = 0              | t = 1              | s + t = 1
Total                | q + s = 1          | r + t = 2          | p = 3

Now, let's use this table to understand similarity and dissimilarity measures.
Symmetric vs. Asymmetric Binary Distance
• When comparing two objects based on binary (Yes/No or
1/0) attributes, we use different distance measures
depending on whether both 1s and 0s matter
equally or not.
Symmetric Binary Distance
• This considers both agreements (1-1) and disagreements (1-0 or 0-1)
equally.
• It is useful when both presence and absence of an attribute are meaningful.
• Example: Comparing disease symptoms in two patients (having or not having a
symptom matters equally).
Formula:
d(i, j) = (r + s) / (q + r + s + t)
where:
• q = both are 1 (1-1 match)
• r = Alice is 1, Bob is 0 (1-0 mismatch)
• s = Alice is 0, Bob is 1 (0-1 mismatch)
• t = both are 0 (0-0 match)
Asymmetric Binary Distance
• Here, 0-0 matches (both saying "No") are ignored
because only the presence of an attribute matters.
• Used when only positive occurrences (1s) are
meaningful and 0s are unimportant.
• Example: Diagnosing rare diseases (if both don’t
have the disease, it doesn’t matter, but if one does and
the other doesn’t, it does).
Formula:
d(i, j) = (r + s) / (q + r + s)
Jaccard Similarity
• Jaccard Similarity is a measure of how similar two binary objects are, considering only the presence (1s) of attributes and ignoring 0s.
• Formula: sim(i, j) = q / (q + r + s), i.e., 1 minus the asymmetric binary distance.
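All three binary measures fall out of the same q, r, s, t counts. A sketch using the Alice/Bob sports data from the contingency-table example:

```python
def binary_contingency(x, y):
    # Count q (1-1), r (1-0), s (0-1), t (0-0) agreements between
    # two binary vectors of equal length.
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

alice = [1, 1, 0]   # Football, Cricket, Tennis
bob   = [1, 0, 0]
q, r, s, t = binary_contingency(alice, bob)   # q=1, r=1, s=0, t=1

symmetric = (r + s) / (q + r + s + t)   # 1/3 ≈ 0.33 (0-0 counts)
asymmetric = (r + s) / (q + r + s)      # 1/2 = 0.50 (0-0 ignored)
jaccard_sim = q / (q + r + s)           # 1/2 = 0.50
```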
Standardizing Numeric Data with Z-score
What is Standardization?
• Standardization is a technique used to transform
numerical data so that different features have a
common scale. This is crucial when:
• Data has different units (e.g., height in cm vs. weight in kg).
• Features have different ranges (e.g., one feature ranges from
1–1000, another from 0–1).
• Machine learning models are sensitive to magnitudes (e.g.,
distance-based algorithms like k-NN, K-means).
Standardizing Numeric Data with Z-score …
What is Z-score Standardization?
Z-score standardization transforms data to have:
• Mean (μ) = 0
• Standard deviation (σ) = 1
Formula for Z-score: Z = (X − μ) / σ
• X = Original value
• μ = Mean of the dataset
• σ = Standard deviation of the dataset
• This means each value is now represented in terms of how
many standard deviations it is away from the mean.
Example: Standardizing Exam Scores
Raw Data (Math Exam Scores of 5 Students):

Student | Score (X)
A       | 80
B       | 60
C       | 75
D       | 90
E       | 85

Step 1: Compute Mean (μ):
μ = (80 + 60 + 75 + 90 + 85) / 5 = 390 / 5 = 78
Step 2: Compute Standard Deviation (σ):
σ = √[((80−78)² + (60−78)² + (75−78)² + (90−78)² + (85−78)²) / 5] = √(530 / 5) = √106 ≈ 10.30
Step 3: Compute Z-score for Each Student:

Student | Score (X) | Z-score
A       | 80        | (80 − 78) / 10.30 ≈ 0.19
B       | 60        | (60 − 78) / 10.30 ≈ −1.75
C       | 75        | (75 − 78) / 10.30 ≈ −0.29
D       | 90        | (90 − 78) / 10.30 ≈ 1.17
E       | 85        | (85 − 78) / 10.30 ≈ 0.68

Interpretation of Z-scores
• Positive Z-score (Z > 0) → Value is above the mean (e.g., Student D scored 1.17 standard deviations above the mean).
• Negative Z-score (Z < 0) → Value is below the mean (e.g., Student B scored 1.75 standard deviations below the mean).
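The three steps above (mean, population standard deviation, per-value Z-score) can be sketched as:

```python
import math

scores = {"A": 80, "B": 60, "C": 75, "D": 90, "E": 85}

# Step 1: population mean.
mu = sum(scores.values()) / len(scores)

# Step 2: population standard deviation (divide by N, not N-1,
# matching the worked example's sigma ≈ 10.30).
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores.values()) / len(scores))

# Step 3: Z-score = how many standard deviations from the mean.
z = {name: (x - mu) / sigma for name, x in scores.items()}
# mu = 78, sigma ≈ 10.30, z["D"] ≈ 1.17
```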
Minkowski Distance
• Minkowski distance is a generalized distance metric that includes Euclidean and Manhattan distances as special cases. It is defined as:
d(i, j) = (|x₁ − y₁|ᵖ + |x₂ − y₂|ᵖ + … + |xₙ − yₙ|ᵖ)^(1/p)
• p = 1 gives Manhattan distance; p = 2 gives Euclidean distance.
Special Cases of Minkowski Distance
• Chebyshev Distance (p → ∞) ("supremum", Lmax norm, L∞ norm):
d(i, j) = max over all attributes f of |x_f − y_f|
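One function covers all three special cases, reusing the points A (2, 3) and B (10, 15) from the earlier examples:

```python
def minkowski(x, y, p):
    # Generalized Lp distance; p=1 → Manhattan, p=2 → Euclidean.
    if p == float("inf"):
        # Chebyshev distance: the largest coordinate difference dominates.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A, B = (2, 3), (10, 15)
d1 = minkowski(A, B, 1)               # 20 (Manhattan)
d2 = minkowski(A, B, 2)               # ≈ 14.42 (Euclidean)
dinf = minkowski(A, B, float("inf"))  # 12 (Chebyshev: max(8, 12))
```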
Properties of Minkowski Distance
• Positive Definiteness
• Distance is always positive: d(i, j) > 0 for i ≠ j
• Distance is zero only when the points are the same: d(i, i) = 0
• Symmetry
• Distance is the same in both directions: d(i, j) = d(j, i)
• Triangle Inequality
• The direct distance between two points is always less than or equal to the sum of the distances via a third point: d(i, j) ≤ d(i, k) + d(k, j)
• Example: If going from A to C directly is 5 units, but going A → B → C is 6 units, then direct travel is the shortest path.
Ordinal Variables
• Ordinal data is a type of categorical data where values are ranked or
ordered, but the differences between them are not necessarily equal.
• Unlike numerical data, you cannot perform meaningful arithmetic operations like
addition or subtraction.
Examples of Ordinal Data
1. Movie Ratings (e.g., 1 star, 2 stars, ..., 5 stars)
2. Education Level (e.g., High School < Bachelor's < Master's < PhD)
3. Customer Satisfaction (e.g., Very Dissatisfied < Dissatisfied < Neutral < Satisfied < Very Satisfied)
4. Pain Level in Medical Surveys (e.g., No Pain < Mild Pain < Moderate Pain < Severe Pain)
• Even though these values have an order, the difference between "Satisfied" and "Very Satisfied" is not necessarily the same as between "Neutral" and "Satisfied."
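A common way to make ordinal values usable in numeric distance measures (standard practice, though not spelled out on the slide) is to replace each label with its rank and scale to [0, 1] via z = (rank − 1) / (M − 1), where M is the number of levels. A sketch using the satisfaction scale above:

```python
# Map ordinal labels to ranks, then scale ranks to [0, 1] so ordinal
# attributes can be compared with numeric distance measures.
levels = ["Very Dissatisfied", "Dissatisfied", "Neutral",
          "Satisfied", "Very Satisfied"]
rank = {label: i + 1 for i, label in enumerate(levels)}
M = len(levels)

def ordinal_to_numeric(label):
    # z = (rank - 1) / (M - 1), giving 0.0 for the lowest level, 1.0 for the highest.
    return (rank[label] - 1) / (M - 1)

a = ordinal_to_numeric("Satisfied")   # 0.75
b = ordinal_to_numeric("Neutral")     # 0.50
dissimilarity = abs(a - b)            # 0.25
```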
Cosine Similarity with Ordinal Data: Example
Scenario: Movie Ratings
Two users (A & B) rate three movies on a scale of 1 to 5 (where 1 = worst, 5 = best):

Movie   | User A Rating | User B Rating
Movie 1 | 5             | 4
Movie 2 | 3             | 2
Movie 3 | 4             | 5

Step 1: Compute the Dot Product
A · B = (5 × 4) + (3 × 2) + (4 × 5) = 20 + 6 + 20 = 46
Step 2: Compute Euclidean Norms
• For User A: ||A|| = √(5² + 3² + 4²) = √50 ≈ 7.07
• For User B: ||B|| = √(4² + 2² + 5²) = √45 ≈ 6.71
Step 3: Compute Cosine Similarity
cos(A, B) = 46 / (7.07 × 6.71) ≈ 0.97
• Cosine similarity = 0.97 → Very high similarity between User A and User B's ratings.
• If the similarity were closer to 0, it would mean their ratings are quite different.
• If it were negative, it would mean they have opposite preferences.
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or phrase
in the document.
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = √42 ≈ 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = √17 ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
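The term-frequency calculation above can be checked with a small sketch:

```python
import math

def cosine(u, v):
    # cos(u, v) = (u · v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine(d1, d2)   # ≈ 0.94
```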
You are welcome