
Data Sets (2)

Measuring the Dispersion of Data


• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
  Population variance:

    \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2

  Sample variance:

    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]

N: population size; n: sample size


• Standard deviation s (or σ) is the square root of the variance s² (or σ²)
    \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2}

    s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]}
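The two forms of each formula are algebraically equivalent; the second (one-pass) form is what makes the computation scalable, since it needs only running sums. A minimal Python sketch checking this (the sample data is illustrative):

```python
import math

def sample_variance(xs):
    """Two-pass form: s^2 = (1/(n-1)) * sum((x - mean)^2)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_one_pass(xs):
    """Algebraic (scalable) form: s^2 = (1/(n-1)) * [sum(x^2) - (sum(x))^2 / n]."""
    n = len(xs)
    total, total_sq = sum(xs), sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative sample
assert math.isclose(sample_variance(xs), sample_variance_one_pass(xs))
print(sample_variance(xs), math.sqrt(sample_variance(xs)))  # variance s^2, std dev s
```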
Properties of Normal Distribution Curve

• The normal (distribution) curve


• From μ−σ to μ+σ: contains about 68% of the measurements
  (μ: mean, σ: standard deviation)
• From μ−2σ to μ+2σ: contains about 95% of the measurements
• From μ−3σ to μ+3σ: contains about 99.7% of the measurements

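These coverage figures follow from the normal CDF; a small sketch using only the standard library's error function:

```python
import math

def normal_coverage(k):
    """P(mu - k*sigma < X < mu + k*sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.4f}")  # 0.6827, 0.9545, 0.9973
```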
Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: the x-axis shows values; the y-axis shows frequencies (or frequencies per unit width)

• Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Histogram Analysis
• Histogram: graph display of tabulated frequencies, shown as bars
• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the
  value, not the height as in bar charts; a crucial distinction when the
  categories are not of uniform width
• The categories are usually specified as non-overlapping intervals of some
  variable. The categories (bars) must be adjacent

[Figure: histogram with x-axis values 10000–90000 and y-axis frequencies 0–40]
Histogram example: uneven width
Unit price ($)      | 40  | 43  | 47  | … | 74  | 75  | 78  | … | 115 | 117 | 120
Count of items sold | 275 | 300 | 250 | … | 360 | 515 | 540 | … | 320 | 270 | 350

[Figure: bar height = count of items sold / bin width, for the uneven bins
40–59, 60–99, and 100–120 (annotated totals 9000, 4350, and 2900); x-axis:
unit price, y-axis: frequency density]
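Because the bins have unequal widths, the bar height must be the frequency density (count ÷ bin width), so that the bar's area, not its height, represents the count. A minimal sketch; the bin totals here are the values annotated in the slide's figure, read off as an assumption:

```python
# Bin boundaries from the slide; totals as annotated in its figure (assumed).
bins = [(40, 59, 9000), (60, 99, 4350), (100, 120, 2900)]

for lo, hi, total in bins:
    width = hi - lo + 1        # number of unit prices the bin covers
    height = total / width     # frequency density: items sold per $1 of price
    print(f"{lo}-{hi}: width={width}, bar height={height:.1f}")
```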
Histogram example: even width
Unit price ($)      | 40  | 43  | 47  | … | 74  | 75  | 78  | … | 115 | 117 | 120
Count of items sold | 275 | 300 | 250 | … | 360 | 515 | 540 | … | 320 | 270 | 350
Histograms Often Tell More than Boxplots

◼ The two histograms shown on the left may have the same boxplot
  representation
  ◼ The same values for min, Q1, median, Q3, max
◼ But they have rather different data distributions

[Figure: two histograms with identical Q1, Q2 (median), and Q3 but different shapes]
Scatter plot
• Provides a first look at bivariate data, revealing clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data

Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
  • n data points with p dimensions
  • Two modes

    \begin{bmatrix}
      x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
      \cdots & \cdots & \cdots & \cdots & \cdots \\
      x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
      \cdots & \cdots & \cdots & \cdots & \cdots \\
      x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
    \end{bmatrix}

• Dissimilarity matrix
  • n data points, but registers only the distances
  • A triangular matrix
  • Single mode

    \begin{bmatrix}
      0      &        &        &        &   \\
      d(2,1) & 0      &        &        &   \\
      d(3,1) & d(3,2) & 0      &        &   \\
      \vdots & \vdots & \vdots &        &   \\
      d(n,1) & d(n,2) & \cdots & \cdots & 0
    \end{bmatrix}
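A minimal sketch of turning a data matrix into a lower-triangular dissimilarity matrix, assuming Euclidean distance as the measure (distance measures are defined on the following slides):

```python
import math

def euclidean(a, b):
    """L2 distance between two p-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(data, dist=euclidean):
    """Row i holds d(i, 0), ..., d(i, i-1); the zero diagonal and the
    symmetric upper half are left implicit."""
    return [[dist(data[i], data[j]) for j in range(i)] for i in range(len(data))]

for row in dissimilarity_matrix([(1, 2), (3, 5), (2, 0)]):
    print([round(d, 2) for d in row])
```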
Proximity Measure for Nominal Attributes

• Can take two or more states, e.g., red, yellow, blue, green (a
  generalization of a binary attribute)
• Simple matching
  • m: # of matches, p: total # of variables

    d(i, j) = \frac{p - m}{p}

• Note: for each attribute, we assume all its values are equally important
  • Otherwise, higher weights should be given to 'more important' values
    (detail omitted)
Distance measure for nominal attributes
• Consider the following data set
Person id | Gender | Language | Hair color
1         | M      | English  | brown
2         | F      | English  | black
3         | M      | Spanish  | brown

 d(1,2) = (3 − 1)/3 = 0.67
 d(1,3) = (3 − 2)/3 = 0.33

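A minimal sketch of the simple matching computation on this table:

```python
def nominal_distance(i, j):
    """d(i, j) = (p - m) / p, where m is the number of matching attributes."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# (Gender, Language, Hair color) rows from the table above.
persons = {1: ("M", "English", "brown"),
           2: ("F", "English", "black"),
           3: ("M", "Spanish", "brown")}
print(round(nominal_distance(persons[1], persons[2]), 2))  # 0.67
print(round(nominal_distance(persons[1], persons[3]), 2))  # 0.33
```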
Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a general distance measure

    d(i, j) = \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h}

  where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are
  two p-dimensional data objects, and h is the order (the distance so defined
  is also called the L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric

Metric
• Claim:
• Minkowski distance is a metric for any h
• The distance defined for nominal attributes is a
metric
• The distance defined for ordinal attributes is a
metric (later)
• The distance defined for mixed-type attributes is a metric (later)

Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between
    two binary vectors

    d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|

• h = 2: Euclidean (L2 norm) distance

    d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

• h → ∞: "supremum" (L_max norm, L_∞ norm) distance
  • This is the maximum difference between any component (attribute) of the
    vectors:

    d(i, j) = \max_{f} |x_{if} - x_{jf}|
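A minimal sketch covering all three special cases (passing h = math.inf selects the supremum distance):

```python
import math

def minkowski(a, b, h):
    """L-h norm distance between two p-dimensional points."""
    if h == math.inf:  # supremum: max difference over any component
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))            # Manhattan: 5
print(round(minkowski(x1, x2, 2), 2))  # Euclidean: 3.61
print(minkowski(x1, x2, math.inf))     # supremum: 3
```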
Example: Minkowski Distance
Dissimilarity Matrices

point | attribute 1 | attribute 2
x1    | 1           | 2
x2    | 3           | 5
x3    | 2           | 0
x4    | 4           | 5

Manhattan (L1)
L1 | x1   x2   x3   x4
x1 | 0
x2 | 5    0
x3 | 3    6    0
x4 | 6    1    7    0

Euclidean (L2)
L2 | x1    x2    x3    x4
x1 | 0
x2 | 3.61  0
x3 | 2.24  5.10  0
x4 | 4.24  1     5.39  0

Supremum (L∞)
L∞ | x1   x2   x3   x4
x1 | 0
x2 | 3    0
x3 | 2    5    0
x4 | 3    1    5    0
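As a sanity check, all three matrices can be reproduced with the minkowski sketch (and its math import) from the previous slide:

```python
points = [(1, 2), (3, 5), (2, 0), (4, 5)]  # x1..x4 from the table above
for label, h in (("L1", 1), ("L2", 2), ("Lsup", math.inf)):
    print(label)
    for i in range(len(points)):
        # Lower-triangular row: d(x_i, x_1), ..., d(x_i, x_i)
        print([round(minkowski(points[i], points[j], h), 2) for j in range(i + 1)])
```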
Ordinal Variables
• An ordinal variable has an order on its values
• Based on the order, we can assign a rank to each value
• Then we can treat it like an interval-scaled variable
  • For the ith object and fth attribute, replace x_{if} by its rank
    r_{if} \in \{1, \ldots, M_f\}
  • Map the rank of each variable onto [0, 1] by replacing r_{if} with

    z_{if} = \frac{r_{if} - 1}{M_f - 1}

  • Compute the dissimilarity using methods for interval-scaled variables

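A minimal sketch of this rank-and-normalize step (the level ordering is supplied by the caller):

```python
def ordinal_to_numeric(value, ordered_levels):
    """z_if = (r_if - 1) / (M_f - 1), with r_if taken from the level order."""
    rank = ordered_levels.index(value) + 1   # r_if in {1, ..., M_f}
    return (rank - 1) / (len(ordered_levels) - 1)

levels = ["very cold", "cold", "warm", "very warm"]
print([round(ordinal_to_numeric(v, levels), 2) for v in levels])
# [0.0, 0.33, 0.67, 1.0]  (the slide truncates 2/3 to 0.66)
```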
Ordinal Variables: example

 Consider the following data set:

day | temperature | rank | mapping
1   | very warm   | 4    | 1
2   | cold        | 2    | 0.33
3   | warm        | 3    | 0.66
4   | very cold   | 1    | 0

 Rank the four values:
   very cold: 1, cold: 2, warm: 3, very warm: 4
 Map to the [0, 1] interval:
   very cold: (1 − 1)/(4 − 1) = 0
   cold: (2 − 1)/(4 − 1) = 0.33
   warm: (3 − 1)/(4 − 1) = 0.66
   very warm: (4 − 1)/(4 − 1) = 1
 Dissimilarity matrix (on the mapped values):

   0
   0.67  0
   0.34  0.33  0
   1     0.33  0.66  0
Attributes of Mixed Type
• A database may contain all attribute types
  • Nominal, numeric, ordinal
• One may use a weighted formula to combine their effects:

    d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}

• \delta_{ij}^{(f)} is the indicator, and d_{ij}^{(f)} the contribution, of
  attribute f to the distance between objects i and j
  • \delta_{ij}^{(f)} = 0 if one of x_{if} and x_{jf} is missing; \delta_{ij}^{(f)} = 1 otherwise
• f is nominal: d_{ij}^{(f)} = 0 if x_{if} = x_{jf}; d_{ij}^{(f)} = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal: compute ranks r_{if} and then z_{if}, then treat z_{if} as numeric
Mixed attribute types: example

           nominal   ordinal       nominal   numeric
Car id  |  color  |  age (year) |  model  |  price ($)
1       |  black  |  > 10       |  Honda  |  22,000
2       |  red    |  5 – 10     |  Honda  |  30,000
3       |  grey   |  > 10       |  Buick  |  40,000
4       |  red    |  < 5        |  Ford   |  25,000

 d(1,2) = \frac{\delta_{12}^{color} d_{12}^{color} + \delta_{12}^{age} d_{12}^{age} + \delta_{12}^{model} d_{12}^{model} + \delta_{12}^{price} d_{12}^{price}}{\delta_{12}^{color} + \delta_{12}^{age} + \delta_{12}^{model} + \delta_{12}^{price}}

 \delta_{12}^{color} = \delta_{12}^{age} = \delta_{12}^{model} = \delta_{12}^{price} = 1
 d_{12}^{color} = 1, d_{12}^{model} = 0
 For age:
   Rank the values: '< 5' → 1, '5 – 10' → 2, '> 10' → 3
   Normalize to [0, 1]: 1 → 0, 2 → 0.5, 3 → 1
   d_{12}^{age} = 1 − 0.5 = 0.5
 d_{12}^{price} = (30000 − 22000)/(40000 − 22000) ≈ 0.44
 d(1,2) = (1×1 + 1×0.5 + 1×0 + 1×0.44)/4 = 0.485
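A minimal sketch of this mixed-type computation, with the per-attribute handling hard-coded to the slide's four columns (no values are missing, so every δ is 1 and the denominator is simply 4):

```python
# (color, age, model, price) rows; age is already mapped to [0, 1] as on the slide.
cars = {1: ("black", 1.0, "Honda", 22000),
        2: ("red",   0.5, "Honda", 30000),
        3: ("grey",  1.0, "Buick", 40000),
        4: ("red",   0.0, "Ford",  25000)}
prices = [c[3] for c in cars.values()]
price_range = max(prices) - min(prices)   # normalizes the numeric attribute

def mixed_distance(i, j):
    ci, cj = cars[i], cars[j]
    contributions = [
        0.0 if ci[0] == cj[0] else 1.0,     # color: nominal
        abs(ci[1] - cj[1]),                 # age: ordinal, already in [0, 1]
        0.0 if ci[2] == cj[2] else 1.0,     # model: nominal
        abs(ci[3] - cj[3]) / price_range,   # price: normalized numeric
    ]
    return sum(contributions) / len(contributions)

print(round(mixed_distance(1, 2), 3))
# ~0.486 (the slide rounds the price term to 0.44, giving 0.485)
```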
Cosine Similarity
• Objects viewed as vectors
• Similarity measures emphasize direction

[Figure: two pairs of vectors; the pair O1, O2 with the smaller angle between
them is the more similar]
Cosine Similarity
 Directions of vectors can be measured by the angle α between them:

    sim(O_1, O_2) = \cos\alpha = \frac{O_1 \cdot O_2}{\lVert O_1 \rVert\,\lVert O_2 \rVert}

  where · indicates the vector dot product and ||O|| is the length of vector O

 Let O_1 = (x_1, \ldots, x_p) and O_2 = (y_1, \ldots, y_p); then
   O_1 \cdot O_2 = \sum_{i=1}^{p} x_i y_i
   \lVert O_1 \rVert = \sqrt{\sum_{i=1}^{p} x_i^2}
   \lVert O_2 \rVert = \sqrt{\sum_{i=1}^{p} y_i^2}
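A minimal sketch of the definition (undefined for a zero vector, since its length is 0):

```python
import math

def cosine_similarity(o1, o2):
    """sim(O1, O2) = (O1 . O2) / (||O1|| * ||O2||)."""
    dot = sum(x * y for x, y in zip(o1, o2))
    norm1 = math.sqrt(sum(x * x for x in o1))
    norm2 = math.sqrt(sum(y * y for y in o2))
    return dot / (norm1 * norm2)
```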
Cosine Similarity: example
• A document can be represented by thousands of attributes, each recording the
  frequency of a particular word (such as a keyword) or phrase in the document
 Other vector objects: gene features in micro-arrays, …
 Applications: information retrieval, biological taxonomy, gene feature
  mapping, ...
Cosine Similarity: example

• Find the similarity between documents d1 and d2, where
  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• sim(d1, d2) = cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
  d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
  ||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 ≈ 6.481
  ||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = 17^0.5 ≈ 4.123
  sim(d1, d2) ≈ 0.94

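The same number falls out of the cosine_similarity sketch above:

```python
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```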
