Revision: Chapter 1-6
Applied Multivariate Statistics – Spring 2012
Overview
Cov, Cor, Mahalanobis, MV normal distribution
Visualization: Stars plot, mosaic plot with shading
Outlier: aq.plot
Missing values: Multiple Imputation, mice
MDS: Metric / non-metric
Dissimilarities: daisy
PCA
LDA
Two variables: Covariance and Correlation
Covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] \in (-\infty, \infty)$
Correlation: $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]$
Sample covariance: $\widehat{\mathrm{Cov}}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
Sample correlation: $r_{xy} = \widehat{\mathrm{Cor}}(x, y) = \frac{\widehat{\mathrm{Cov}}(x, y)}{\hat{\sigma}_x \hat{\sigma}_y}$
Correlation is invariant to changes in units; covariance is not
(e.g. kilogram vs. gram, meter vs. kilometer, etc.)
Scatterplot: Correlation is scale invariant
Intuition and pitfalls for correlation
Correlation = LINEAR relation
Covariance matrix / correlation matrix: table of pairwise values
True covariance matrix: $\Sigma_{ij} = \mathrm{Cov}(X_i, X_j)$
True correlation matrix: $C_{ij} = \mathrm{Cor}(X_i, X_j)$
Sample covariance matrix: $S_{ij} = \widehat{\mathrm{Cov}}(x_i, x_j)$ (diagonal: variances)
Sample correlation matrix: $R_{ij} = \widehat{\mathrm{Cor}}(x_i, x_j)$ (diagonal: 1)
R: Functions “cov”, “cor” in package “stats”
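A minimal sketch of these functions (using the built-in mtcars data as an assumed example):

# Sample covariance and correlation matrices
cov(mtcars[, c("mpg", "hp", "wt")])   # diagonal: variances
cor(mtcars[, c("mpg", "hp", "wt")])   # diagonal: 1
# Correlation is scale invariant, covariance is not:
cor(mtcars$mpg, mtcars$wt * 1000)     # same as cor(mtcars$mpg, mtcars$wt)
cov(mtcars$mpg, mtcars$wt * 1000)     # 1000 times cov(mtcars$mpg, mtcars$wt)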
Sq. Mahalanobis Distance: $MD^2(x) = (x - \mu)^T \Sigma^{-1} (x - \mu)$
Squared distance from the mean, measured in standard deviations IN THE DIRECTION OF x
Multivariate Normal Distribution: most common model choice
$f(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
Mahalanobis distance: Example
$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 25 & 0 \\ 0 & 1 \end{pmatrix}$
Point (0, 10): MD = 10
Mahalanobis distance: Example (continued)
Same $\mu$ and $\Sigma$ as above; point (10, 7): MD $\approx$ 7.3
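Both values can be checked with stats::mahalanobis(), which returns the squared distance:

mu    <- c(0, 0)
Sigma <- matrix(c(25, 0, 0, 1), nrow = 2)
sqrt(mahalanobis(c(0, 10), center = mu, cov = Sigma))   # 10
sqrt(mahalanobis(c(10, 7), center = mu, cov = Sigma))   # about 7.28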
Glyphplots: Stars
• Which cities are special?
• Which cities are like New Orleans?
• Seattle and Miami are quite far apart; how do they compare?
• R: Function “stars” in package “stats”
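A minimal sketch (the lecture’s city data is not reproduced here; the built-in mtcars data stands in):

# One star per observation, one ray per variable
stars(mtcars[1:9, c("mpg", "hp", "wt", "qsec")], main = "Glyphplot")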
Mosaic plot with shading
Shading marks cells whose observed count is surprisingly small or surprisingly large under independence; the p-value of the independence test is reported as well (here: highly significant).
R: Function “mosaic” in package “vcd”
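A minimal sketch using the built-in Titanic table (an assumed example, not the lecture’s data):

library(vcd)
# shade = TRUE colors cells by Pearson residuals, so surprisingly large or
# small observed counts stand out; a legend reports the p-value
mosaic(~ Sex + Survived, data = Titanic, shade = TRUE)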
Outliers: Theory of Mahalanobis Distance
Assume the data is multivariate normally distributed (d dimensions)
The squared Mahalanobis distance of a sample then follows a Chi-Square distribution with d degrees of freedom
Expected value: d
(“By definition”: the sum of squares of d standard normal random variables has a Chi-Square distribution with d degrees of freedom.)
Outliers: Check for multivariate outliers
Are there samples whose estimated Mahalanobis distance does not fit a Chi-Square distribution at all?
Check with a QQ-plot
Technical details:
- the Chi-Square distribution is still reasonably accurate for estimated Mahalanobis distances
- use robust estimates for $\mu$ and $\Sigma$
R: Function “aq.plot” in package “mvoutlier”
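A minimal sketch of the QQ-plot check with robust estimates via MASS::cov.rob (the iris data is an assumed stand-in):

library(MASS)
X   <- as.matrix(iris[, 1:4])
rob <- cov.rob(X)                 # robust estimates of mu and Sigma
d2  <- mahalanobis(X, center = rob$center, cov = rob$cov)
# Compare squared distances to Chi-Square quantiles (df = number of dimensions)
qqplot(qchisq(ppoints(nrow(X)), df = ncol(X)), d2,
       xlab = "Chi-Square quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1)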
Outliers: aq.plot
[Figure: aq.plot output; the outlier is easily detected]
Missing values: Problem of Single Imputation
Too optimistic: the imputation model (e.g. Y = a + bX) is only estimated, not the true model
Thus, imputed values carry some uncertainty
Single Imputation ignores this uncertainty
The coverage probability of confidence intervals is therefore wrong
Solution: Multiple Imputation
Incorporates both
- residual error
- model uncertainty (excluding model mis-specification)
R: Package “mice” for Multiple Imputation using chained equations
Multiple Imputation: MICE
1. Impute several times
2. Do the standard analysis for each imputed data set; get estimate and standard error
3. Aggregate the results
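A minimal sketch of this workflow (assuming the nhanes example data shipped with mice):

library(mice)
imp  <- mice(nhanes, m = 5, printFlag = FALSE)   # 1. impute several times
fits <- with(imp, lm(bmi ~ age))                 # 2. standard analysis per imputed data set
summary(pool(fits))                              # 3. aggregate with Rubin's rules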
Idea of MDS
Represent a high-dimensional point cloud in few (usually 2) dimensions, keeping the distances between points similar
Classical/Metric MDS: uses a clever projection
- guaranteed to find the optimal solution only for Euclidean distance
- fast
R: Function “cmdscale” in the base distribution
Non-metric MDS:
- squeeze the data onto the table = minimize STRESS
- conserve only ranks = allow monotonic transformations before reducing dimensions
- slow(er)
R: Function “isoMDS” in package “MASS”
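A minimal sketch of both variants (assuming the built-in USArrests data):

d   <- dist(scale(USArrests))
# Metric MDS: project to 2 dimensions
fit <- cmdscale(d, k = 2)
plot(fit, type = "n")
text(fit, labels = rownames(USArrests), cex = 0.7)
# Non-metric MDS: only ranks are conserved; prints the final STRESS
library(MASS)
nm <- isoMDS(d, k = 2)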
Distance: To scale or not to scale…
If variables are not scaled
- the variable with the largest range gets the most weight
- the distance depends on the scale
Scaling gives every variable equal weight
A similar alternative is re-weighting:
$d(i, j) = \sqrt{w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \ldots + w_p (x_{ip} - x_{jp})^2}$
Scale if
- variables measure different units (kg, meter, sec, …)
- you explicitly want every variable to have equal weight
Don’t scale if the units are the same for all variables
Most often: better to scale.
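A minimal sketch of the effect (hypothetical data with units kg and mm):

x <- data.frame(kg = c(60, 80, 65), mm = c(1700, 1800, 1710))
dist(x)          # dominated by mm, the variable with the largest range
dist(scale(x))   # after scaling, both variables get equal weight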
Dissimilarity for mixed data: Gower’s Dissimilarity
Idea: use a distance measure between 0 and 1 for each variable: $d_{ij}^{(f)}$
Aggregate: $d(i, j) = \frac{1}{p} \sum_{f=1}^{p} d_{ij}^{(f)}$
Binary (asymmetric/symmetric), nominal: use the methods discussed before
- asymmetric: one group is much larger than the other
Interval-scaled: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{R_f}$
  $x_{if}$: value for object i in variable f
  $R_f$: range of variable f over all objects
Ordinal: use normalized ranks; then treat like interval-scaled based on range
R: Function “daisy” in package “cluster”
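A minimal sketch (assuming the mixed-type flower data shipped with cluster):

library(cluster)
data(flower)                          # numeric, binary, nominal and ordinal columns
d <- daisy(flower, metric = "gower")
summary(d)                            # all dissimilarities lie between 0 and 1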
PCA: Goals
Goal 1: dimension reduction to a few dimensions while explaining most of the variance (use the first few PCs)
Goal 2: find a one-dimensional index that separates objects best (use the first PC)
PCA (Version 1): Orthogonal directions
• PC 1 is the direction of largest variance
• PC 2 is
  - perpendicular to PC 1
  - again of largest variance
• PC 3 is
  - perpendicular to PC 1 and PC 2
  - again of largest variance
• etc.
How many PCs: Blood Example
Rule 1: 5 PCs
Rule 2: 3 PCs
Rule 3: Elbow after PC 1 (?)
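A minimal sketch of PCA and the screeplot used for the elbow rule (assuming USArrests in place of the blood data):

pca <- prcomp(USArrests, scale. = TRUE)   # PCA on scaled variables
summary(pca)                              # proportion of variance explained per PC
screeplot(pca, type = "lines")            # look for the elbow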
Biplot: Shows info on samples AND variables
Approximately true:
• Data points: projection onto the first two PCs; distance in the biplot ~ true distance
• Projection of a sample onto an arrow gives the original (scaled) value of that variable
• Arrow length: variance of the variable
• Angle between arrows: correlation
The approximation is often crude, but good for a quick overview.
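A minimal sketch (same assumed data as above):

biplot(prcomp(USArrests, scale. = TRUE))   # samples as points, variables as arrows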
Supervised Learning: LDA
Bayes’ theorem: $P(C|X) = \frac{P(C) \, P(X|C)}{P(X)} \propto P(C) \, P(X|C)$
Prior / prevalence $P(C)$: fraction of samples in that class
For $P(X|C)$, find some estimate; model assumption: $X|C \sim N(\mu_C, \Sigma)$ (common covariance matrix)
Bayes rule: choose the class for which P(C|X) is maximal
(this rule is “optimal” if all types of error are equally costly)
Special case: two classes (0/1)
- choose c = 1 if P(C=1|X) > 0.5, or equivalently
- choose c = 1 if the posterior odds P(C=1|X) / P(C=0|X) > 1
In practice: estimate $P(C)$, $\mu_C$, $\Sigma$
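A minimal sketch with MASS::lda (assuming the built-in iris data):

library(MASS)
fit  <- lda(Species ~ ., data = iris)   # estimates priors, class means, common Sigma
pred <- predict(fit, iris)
mean(pred$class == iris$Species)        # training accuracy: too optimistic (see below)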
LDA: Orthogonal directions of best separation
1st Linear Discriminant = 1st Canonical Variable: direction of best class separation (contrast with the 1st Principal Component, the direction of largest variance)
Linear decision boundary
Classify to which class? Consider:
• Prior
• Mahalanobis distance to class center
(i.e. balance the prior against the Mahalanobis distance)
LDA: Quality of classification
Using the training data also as test data leads to overfitting
- too optimistic for the error on new data
Better: separate test data (split into training and test set)
Cross-validation (CV; e.g. “leave-one-out” cross-validation):
every row is the test case once; the rest form the training data
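A minimal sketch of leave-one-out CV, which is built into MASS::lda (same assumed iris example):

library(MASS)
cvfit <- lda(Species ~ ., data = iris, CV = TRUE)   # leave-one-out CV
mean(cvfit$class == iris$Species)                   # more honest accuracy estimate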