2. Principal Components Analysis (PCA)

2.1 Outline of technique
PCA is a technique for dimensionality reduction from $p$ dimensions to $K \le p$ dimensions. Let $x^T = (x_1, x_2, \ldots, x_p)$ be a random vector with mean $\mu$ and covariance matrix $\Sigma$. Generally we consider $x$ to be centred, i.e. $\mu = 0$ ($\bar{x} = 0$), or if not, we work with $x' = x - \mu$ explicitly. PCA aims to find a set of $K$ uncorrelated variables $y_1, y_2, \ldots, y_K$ representing the "most informative" $K$ linear combinations of $x$.

The procedure is sequential, i.e. $k = 1, 2, \ldots, K$, and the choice of $K$ is an important practical step of a PCA.

Here "information" will be interpreted as a percentage of the total variation (as previously defined) in $\Sigma$. The $K$ sample PCs that "best explain" the total variation in a sample covariance matrix $S$ may be similarly defined.
2.2 Formulation
PCs may be defined in terms of the population (using $\Sigma$) or in terms of a sample (using $S$). Let

$$y_1 = a_1^T x, \quad y_2 = a_2^T x, \quad \ldots, \quad y_p = a_p^T x$$

where $y_j = a_{1j} x_1 + a_{2j} x_2 + \cdots + a_{pj} x_p$ is a sequence of "standardized" linear combinations (SLCs) of the $x_i$ such that $a_j^T a_j = 1$ $\left(\sum_{i=1}^{p} a_{ij}^2 = 1\right)$ and $a_j^T a_k = 0$ $\left(\sum_{i=1}^{p} a_{ij} a_{ik} = 0\right)$ for $j \ne k$; i.e. $a_1, a_2, \ldots, a_p$ form an orthonormal set of $p$-vectors.

Equivalently, the $p \times p$ matrix $A$ formed from the columns $a_j$ satisfies $A^T A = I_p$ $(= A A^T)$, so by definition $A$ is an orthogonal matrix. Geometrically the transformation from the $x_j$ to the $y_j$ is a rotation in $p$-dimensional space that aligns the axes along successive directions of maximum variation; these are the principal axes of the ellipsoid determined by $\Sigma$, i.e. the columns of $A$.

We choose $a_1$ to maximize

$$\mathrm{Var}(y_1) = a_1^T \Sigma a_1 \qquad (1)$$

subject to the normalization condition $a_1^T a_1 = 1$. Then we choose $a_2$ to maximize

$$\mathrm{Var}(y_2) = a_2^T \Sigma a_2 \qquad (2)$$

subject to the conditions $a_2^T a_2 = 1$ (normalization) and $a_2^T a_1 = 0$ (orthogonality), so that $a_1, a_2$ are orthonormal vectors. We shall see that $y_2$ is uncorrelated with $y_1$:

$$\mathrm{Cov}(y_1, y_2) = \mathrm{Cov}\left(a_1^T x,\, a_2^T x\right) = a_1^T \Sigma a_2 = 0.$$

Subsequent PCs for $k = 3, 4, \ldots, p$ are chosen as the SLCs that have maximum variance subject to being uncorrelated with the previous PCs.

NB. Usually the PCs are taken to be "mean-corrected" linear transformations of the $x_i$, i.e.

$$y_j = a_j^T (x - \mu) \qquad (3)$$

emphasizing that the PCs can be considered as direction vectors in $p$-space, relative to the "centre" of the distribution, along which the spread is maximized. In any case $\mathrm{Var}(y_j)$ is the same whichever definition is used.
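As a small numerical illustration (not part of the original notes; the covariance matrix and sample size below are arbitrary), the following Python/NumPy sketch checks the two identities used above: for an SLC $a^T x$, $\mathrm{Var}(a^T x) = a^T \Sigma a$, and for two orthonormal directions, $\mathrm{Cov}(a_1^T x, a_2^T x) = a_1^T \Sigma a_2$.

# Sketch: sample variance/covariance of two SLCs vs the quadratic forms a^T Sigma a.
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 1.5], [1.5, 2.0]])      # an arbitrary covariance matrix
a1 = np.array([1.0, 1.0]) / np.sqrt(2.0)        # a standardized linear combination
a2 = np.array([1.0, -1.0]) / np.sqrt(2.0)       # unit length and orthogonal to a1

X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200000)
z1, z2 = X @ a1, X @ a2                         # the two SLCs evaluated on the sample

print(np.var(z1, ddof=1), a1 @ Sigma @ a1)      # sample Var(z1) vs a1^T Sigma a1
print(np.cov(z1, z2)[0, 1], a1 @ Sigma @ a2)    # sample Cov(z1, z2) vs a1^T Sigma a2 (= 0 here)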
2.3 Computation
To find the first PC we use the Lagrange multiplier technique for finding the maximum of a function $f(x)$ subject to an equality constraint $g(x) = 0$. We define the Lagrangean function

$$L(a_1) = a_1^T \Sigma a_1 - \lambda \left(a_1^T a_1 - 1\right) \qquad (4)$$

where $\lambda$ is a Lagrange multiplier. We need a result on vector differentiation:

Result
Let $x = (x_1, \ldots, x_n)$ and $\dfrac{d}{dx} = \left(\dfrac{\partial}{\partial x_1}, \ldots, \dfrac{\partial}{\partial x_n}\right)^T$. If $b$ $(n \times 1)$ and $A$ $(n \times n)$, symmetric, are given constant matrices, then

$$\frac{d}{dx}\left(b^T x\right) = b, \qquad \frac{d}{dx}\left(x^T A x\right) = 2 A x.$$

1st PC
Differentiating (4) using these results gives

$$\frac{dL}{da_1} = 2 \Sigma a_1 - 2 \lambda a_1 = 0$$

$$\Sigma a_1 = \lambda a_1 \qquad (5)$$

showing that $a_1$ should be chosen to be an eigenvector of $\Sigma$; say $a_1 = v$ with eigenvalue $\lambda$. Suppose the eigenvalues of $\Sigma$ are ranked in decreasing order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$. Then

$$\mathrm{Var}(y_1) = a_1^T \Sigma a_1 = \lambda\, a_1^T a_1 = \lambda \qquad (6)$$

since $a_1^T a_1 = 1$. Equivalently we observe that $\lambda = \max_a \dfrac{a^T \Sigma a}{a^T a}$, a ratio known as the Rayleigh quotient. Therefore, in order to maximize $\mathrm{Var}(y_1)$, $a_1$ should be chosen as the eigenvector $v_1$ corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma$.
2nd PC
The Lagrangean is

$$L(a_2) = a_2^T \Sigma a_2 - \lambda \left(a_2^T a_2 - 1\right) - \mu \left(a_2^T a_1\right) \qquad (7)$$

where $\lambda, \mu$ are Lagrange multipliers. Then

$$\frac{dL}{da_2} = 2\left(\Sigma - \lambda I_p\right) a_2 - \mu a_1 = 0 \qquad (8)$$

$$2 \Sigma a_2 = 2 \lambda a_2 + \mu a_1 \qquad (9)$$

After premultiplying by $a_1^T$ and using $a_1^T a_2 = a_2^T a_1 = 0$ and $a_1^T a_1 = 1$,

$$2\, a_1^T \Sigma a_2 - \mu = 0.$$

However

$$a_1^T \Sigma a_2 = a_2^T \left(\Sigma a_1\right) = \lambda_1\, a_2^T a_1 = 0 \qquad (10)$$

using (5) with $\lambda = \lambda_1$. Therefore $\mu = 0$ and

$$\Sigma a_2 = \lambda a_2 \qquad (11)$$

$$\lambda = \frac{a_2^T \Sigma a_2}{a_2^T a_2} \qquad (12)$$

Therefore $a_2$ is the eigenvector of $\Sigma$ corresponding to the second largest eigenvalue $\lambda_2$.
From (10) we see that $\mathrm{Cov}(y_1, y_2) = 0$, so that $y_1$ and $y_2$ are uncorrelated.
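In practice the whole computation reduces to an eigendecomposition of $\Sigma$ (or $S$). The following NumPy sketch, added here as an illustration with an arbitrary $3 \times 3$ covariance matrix, confirms (5) and (6) and checks that no randomly chosen direction beats the Rayleigh-quotient bound $\lambda_1$.

# Sketch: loading vectors are eigenvectors of Sigma; lambda_1 bounds the Rayleigh quotient.
import numpy as np

Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

evals, evecs = np.linalg.eigh(Sigma)      # eigh: ascending eigenvalues for symmetric matrices
order = np.argsort(evals)[::-1]           # re-order so lambda_1 >= lambda_2 >= ...
lam, A = evals[order], evecs[:, order]

a1 = A[:, 0]
print(np.allclose(Sigma @ a1, lam[0] * a1))   # Sigma a1 = lambda_1 a1, eq. (5)
print(lam[0], a1 @ Sigma @ a1)                # Var(y1) = lambda_1, eq. (6)

rng = np.random.default_rng(1)
u = rng.normal(size=(3, 1000))                # 1000 random directions as columns
rq = np.einsum('ij,ik,kj->j', u, Sigma, u) / np.einsum('ij,ij->j', u, u)
print(rq.max() <= lam[0] + 1e-12)             # no random direction exceeds the Rayleigh bound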
2.4 Example
The covariance matrix corresponding to scaled (standardized) variables $x_1, x_2$ is

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

(in fact a correlation matrix). Note $\Sigma$ has total variation $\mathrm{tr}\,\Sigma = 2$.

The eigenvalues of $\Sigma$ are the roots of $|\Sigma - \lambda I| = 0$:

$$\begin{vmatrix} 1 - \lambda & \rho \\ \rho & 1 - \lambda \end{vmatrix} = 0 \quad\Longleftrightarrow\quad (1 - \lambda)^2 - \rho^2 = 0.$$

Hence the roots are $\lambda = 1 + \rho$ and $\lambda = 1 - \rho$.

If $\rho > 0$ then $\lambda_1 = 1 + \rho$ and $\lambda_2 = 1 - \rho$ ($\lambda_1 > \lambda_2$). To find $a_1$ we substitute $\lambda_1$ into $\Sigma a_1 = \lambda a_1$. Let $a_1^T = (a_1, a_2)$; then

$$a_1 + \rho a_2 = (1 + \rho)\, a_1, \qquad \rho a_1 + a_2 = (1 + \rho)\, a_2$$

(n.b. only one independent equation), so $a_1 = a_2$. Applying the standardization

$$a_1^T a_1 = a_1^2 + a_2^2 = 1$$

we obtain $a_1^2 = a_2^2 = 1/2$, so

$$a_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.$$

Similarly

$$a_2 = \begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix},$$

so the PCs are

$$y_1 = a_1^T x = \frac{1}{\sqrt{2}}\,(x_1 + x_2), \qquad y_2 = a_2^T x = \frac{1}{\sqrt{2}}\,(x_1 - x_2).$$

These PCs explain respectively

$$\frac{100\,\lambda_1}{\lambda_1 + \lambda_2} = 50\,(1 + \rho)\,\% \qquad\text{and}\qquad \frac{100\,\lambda_2}{\lambda_1 + \lambda_2} = 50\,(1 - \rho)\,\%$$

of the total variation $\mathrm{tr}\,\Sigma = 2$. Notice that the PCs themselves do not depend on $\rho$, while the proportion of the total variation explained by each PC does.
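A quick numerical confirmation of this example, with $\rho = 0.6$ chosen arbitrarily for illustration (not a value from the notes):

# Sketch: eigenvalues, eigenvectors and explained proportions for the 2x2 correlation matrix.
import numpy as np

rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
lam, A = evals[order], evecs[:, order]

print(lam)                      # [1 + rho, 1 - rho] = [1.6, 0.4]
print(A)                        # columns proportional to (1,1)/sqrt(2) and (1,-1)/sqrt(2), up to sign
print(100 * lam / lam.sum())    # 50(1+rho)% = 80% and 50(1-rho)% = 20%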
2.5 PCA and spectral decomposition
Since $\Sigma$ (also $S$) is a real symmetric matrix, we know that it has the spectral decomposition (eigenanalysis)

$$\Sigma = A \Lambda A^T \qquad (13)$$

$$= \sum_{i=1}^{p} \lambda_i\, a_i a_i^T \qquad (14)$$

where the $a_i$ are the $p$ eigenvectors of $\Sigma$, which form the columns of the $(p \times p)$ orthogonal matrix $A$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, and $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ are the corresponding eigenvalues.

If some eigenvalues are not distinct, so $\lambda_k = \lambda_{k+1} = \cdots = \lambda_l = \lambda$, the eigenvectors are not unique, but we may choose an orthonormal set of eigenvectors to span a subspace of dimension $l - k + 1$ (cf. the major/minor axes of an ellipse $\dfrac{x^2}{a^2} + \dfrac{y^2}{b^2} = 1$ as $b \to a$). Such a situation arises with the equicorrelation matrix (see Class Exercise 1).

Summary
The transformation of a random $p$-vector $x$ (corrected for its mean $\mu$) to its set of principal components, a set of new variables contained in the $p$-vector $y$, is

$$y = A^T (x - \mu) \qquad (15)$$

where $A$ is an orthogonal matrix whose columns are the eigenvectors of $\Sigma$. Given a mean-centred data matrix $X$,

$$y_1 = X a_1, \;\ldots,\; y_p = X a_p$$

are the PC scores, where the score on the first PC, $y_1$, is the standardized linear combination (SLC) of $x$ having maximum variance, $y_2$ is the SLC having maximum variance subject to being uncorrelated with $y_1$, etc. We have seen that $\mathrm{Var}(y_1) = \lambda_1$, $\mathrm{Var}(y_2) = \lambda_2$, etc., and in fact

$$\mathrm{Cov}(y) = \mathrm{diag}(\lambda_1, \ldots, \lambda_p).$$
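The summary above translates directly into a few lines of code. The sketch below, on simulated data chosen only for illustration, transforms a mean-centred data matrix to PC scores and checks that their sample covariance matrix is $\mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.

# Sketch: y = A^T (x - xbar) applied row by row, then Cov(y) compared with diag(lambda).
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.array([[3.0, 1.2, 0.4],
                  [1.2, 2.0, 0.5],
                  [0.4, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)

S = np.cov(X, rowvar=False)                  # sample covariance matrix
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
lam, A = evals[order], evecs[:, order]

Y = (X - X.mean(axis=0)) @ A                 # PC scores: each row is A^T (x_r - xbar)
print(np.round(np.cov(Y, rowvar=False), 3))  # diagonal, with entries lambda_1, ..., lambda_p
print(np.round(lam, 3))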
2.6 Explanation of variance
The interpretation of the PCs ($y$) as components of variance "explaining" the total variation, i.e. the sum of the variances of the original variables ($x$), is clarified by the following result.

Result [A note on $\mathrm{trace}(\Sigma)$]
The sum of the variances of the original variables equals the sum of the variances of their PCs.

Proof
The sum of the diagonal elements of a $(p \times p)$ square matrix $\Sigma$ is known as the trace of $\Sigma$:

$$\mathrm{tr}(\Sigma) = \sum_{i=1}^{p} \sigma_{ii} \qquad (16)$$

We show from this definition that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$ whenever $AB$ and $BA$ are defined [i.e. $A$ is $(m \times n)$ and $B$ is $(n \times m)$]:

$$\mathrm{tr}(AB) = \sum_i (AB)_{ii} = \sum_i \sum_j a_{ij}\, b_{ji} \qquad (17)$$

$$= \sum_j \sum_i b_{ji}\, a_{ij} = \sum_j (BA)_{jj} \qquad (18)$$

$$= \mathrm{tr}(BA) \qquad (19)$$

The sum of the variances of the PCs is

$$\sum_i \mathrm{Var}(y_i) = \sum_i \lambda_i = \mathrm{tr}(\Lambda) \qquad (20)$$

Now $\Sigma = A \Lambda A^T$ is the spectral decomposition, so $\Lambda = A^T \Sigma A$, and the columns of $A$ form a set of orthonormal vectors, so

$$A^T A = A A^T = I_p.$$

Hence

$$\mathrm{tr}(\Sigma) = \mathrm{tr}\left(A \Lambda A^T\right) = \mathrm{tr}\left(\Lambda A^T A\right) = \mathrm{tr}(\Lambda) \qquad (21)$$

Since $\Sigma = \mathrm{Cov}(x)$, the sum of its diagonal elements is the sum of the variances $\sigma_{ii}$ of the original variables. Hence the result is proved.

Consequence (interpretation of PCs)
It is therefore possible to interpret

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} \qquad (22)$$

as the proportion of the total variation in the original data explained by the $i$th principal component, and

$$\frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} \qquad (23)$$

as the proportion of the total variation explained by the first $k$ PCs.

From a PCA on a $(10 \times 10)$ sample covariance matrix $S$, we could for example conclude that the first 3 PCs (out of a total of $p = 10$ PCs) account for 80% of the total variation in the data. This would mean that the variation in the data is largely confined to a 3-dimensional subspace described by the PCs $y_1, y_2, y_3$.
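A short sketch of the proportions (22)-(23) in code, using a hypothetical sample covariance matrix (the numbers are not from the notes), together with the trace identity $\mathrm{tr}(S) = \sum_i \lambda_i$ just proved:

# Sketch: explained-variance proportions and the trace identity.
import numpy as np

S = np.array([[4.0, 1.5, 0.5],
              [1.5, 2.0, 0.3],
              [0.5, 0.3, 1.0]])              # hypothetical sample covariance matrix
lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # eigenvalues in decreasing order

print(np.isclose(lam.sum(), np.trace(S)))    # sum of PC variances equals the total variation
prop = lam / lam.sum()                       # proportion explained by each PC, eq. (22)
cum = np.cumsum(prop)                        # proportion explained by the first k PCs, eq. (23)
print(np.round(prop, 3), np.round(cum, 3))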
2.7 Scale invariance
This, unfortunately, is a property that PCA does not possess!

In practice we often have to choose units of measurement for the individual variables $x_i$, and the amount of the total variation accounted for by a particular variable $x_i$ depends on this choice (tonnes, kg or grams).

In a practical study, the data vector $x$ often comprises physically incomparable quantities (e.g. height, weight, temperature), so there is no "natural scaling" to adopt. One possibility is to perform PCA on the correlation matrix (effectively choosing each variable to have unit sample variance), but this is still an implicit choice of scaling. The main point is that the results of a PCA depend on the scaling adopted.
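A rough demonstration of this sensitivity, on simulated data with an arbitrary covariance structure: rescaling one variable (e.g. changing its units by a factor of 1000) drastically changes the covariance-based PCA, whereas the correlation-based PCA is unaffected, because it fixes each variable at unit variance.

# Sketch: covariance-based vs correlation-based PCA under a change of units.
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 4.0]], size=2000)
X_rescaled = X * np.array([1000.0, 1.0])     # change the units of the first variable only

def top_proportion(C):
    # proportion of total variation carried by the first PC of the matrix C
    lam = np.sort(np.linalg.eigvalsh(C))[::-1]
    return lam[0] / lam.sum()

print(top_proportion(np.cov(X, rowvar=False)))               # covariance PCA, original units
print(top_proportion(np.cov(X_rescaled, rowvar=False)))      # very different after rescaling
print(top_proportion(np.corrcoef(X, rowvar=False)))          # correlation PCA
print(top_proportion(np.corrcoef(X_rescaled, rowvar=False))) # unchanged by the rescaling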
2.8 Principal component scores
For the $r$th individual ($r$th row of the sample), the sample PC transform of a data matrix $X$ takes the form

$$y_r' = A^T (x_r - \bar{x}) = A^T x_r' \qquad (24)$$

where the columns of $A$ are the eigenvectors of the sample covariance matrix $S$. Notice that the first component of $y_r'$ corresponds to the scalar product of the first column of $A$ with $x_r'$, etc.

The components of $y_r'$ are known as the (mean-corrected) principal component scores for the $r$th individual. The quantities

$$y_r = A^T x_r \qquad (25)$$

are the raw PC scores for that individual. Geometrically the PC scores are the coordinates of each data point with respect to the new axes defined by the PCs, i.e. w.r.t. a rotated frame of reference. The scores can provide qualitative information about individuals.
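A minimal sketch of (24) and (25) for one individual, on simulated data (the data and the choice of row are illustrative only); the two sets of scores differ by the constant shift $A^T \bar{x}$.

# Sketch: raw vs mean-corrected PC scores for a single row of X.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])
S = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(S)
A = evecs[:, np.argsort(evals)[::-1]]        # eigenvectors of S as columns, lambda descending

r = 0                                        # pick the first individual as an example
xbar = X.mean(axis=0)
raw_score = A.T @ X[r]                       # eq. (25)
centred_score = A.T @ (X[r] - xbar)          # eq. (24)
print(raw_score, centred_score)
print(np.allclose(raw_score - centred_score, A.T @ xbar))  # they differ by A^T xbar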
2.9 PC loadings (correlations)
The correlation $\rho(x_i, y_k)$ of the $k$th PC with the variable $x_i$ is known as the loading of the $i$th variable within the $k$th PC. The PC loadings are an aid to interpreting the PCs.

Since $y = A^T (x - \mu)$ we have

$$\mathrm{Cov}(x, y) = E\left[(x - \mu)\, y^T\right] = E\left[(x - \mu)(x - \mu)^T A\right] = \Sigma A \qquad (26)$$

and from the spectral decomposition

$$\Sigma A = \left(A \Lambda A^T\right) A = A \Lambda \qquad (27)$$

Post-multiplying $A$ by a diagonal matrix has the effect of scaling its columns, so that

$$\mathrm{Cov}(x_i, y_k) = \lambda_k\, a_{ik} \qquad (28)$$

is the covariance between the $i$th variable and the $k$th PC.

The correlation between $x_i$ and $y_k$,

$$\rho(x_i, y_k) = \frac{\mathrm{Cov}(x_i, y_k)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(y_k)}} = \frac{\lambda_k\, a_{ik}}{\sqrt{\sigma_{ii}}\,\sqrt{\lambda_k}} = a_{ik} \left(\frac{\lambda_k}{\sigma_{ii}}\right)^{1/2} \qquad (29)$$

can be interpreted as a weighting of the $i$th variable $x_i$ in the $k$th PC. (The relative magnitudes of the coefficients $a_{ik}$ themselves are another measure.)
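Formula (29) vectorizes neatly: the full loading matrix has $(i, k)$ entry $a_{ik}\sqrt{\lambda_k / s_{ii}}$. The sketch below computes it for the same hypothetical sample covariance matrix used earlier (an illustration, not part of the notes).

# Sketch: matrix of PC loadings rho(x_i, y_k) = a_ik * sqrt(lambda_k / s_ii), eq. (29).
import numpy as np

S = np.array([[4.0, 1.5, 0.5],
              [1.5, 2.0, 0.3],
              [0.5, 0.3, 1.0]])
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
lam, A = evals[order], evecs[:, order]

# scale column k of A by sqrt(lambda_k), then divide row i by sqrt(s_ii)
loadings = A * np.sqrt(lam)[np.newaxis, :] / np.sqrt(np.diag(S))[:, np.newaxis]
print(np.round(loadings, 3))                 # entry (i, k) is the correlation of x_i with y_k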
2.10 Perpendicular regression (bivariate case)
PCs constitute a rotation of axes. Consider the bivariate regression of $x_2$ ($y$) on $x_1$ ($x$). The usual linear regression estimate is the straight line that minimizes the sum of squares (SS) of the residuals in the direction of $y$. The line formed by the 1st PC instead minimizes the total SS of the perpendicular distances from the points to the line.

Let the $(n \times 2)$ data matrix $X$ contain in its $x_1$ and $x_2$ columns the centred data. Following a PCA, the second axis contains the PC scores orthogonal to the line representing the first PC:

$$y_2 = X a_2$$

Therefore the total SS of residuals perpendicular to $a_1$ is

$$|y_2|^2 = y_2^T y_2 = a_2^T X^T X a_2 = (n - 1)\, a_2^T S a_2 = (n - 1)\, \lambda_2$$

since $\lambda_2 = \min a^T S a$ subject to $a^T a = 1$ and orthogonality to $a_1$.
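The contrast between the two fitted lines can be seen numerically. The sketch below, on simulated centred bivariate data (parameters chosen only for illustration), compares the ordinary least-squares slope with the slope of the first PC and verifies that the perpendicular residual SS equals $(n-1)\lambda_2$.

# Sketch: OLS line (vertical residuals) vs first-PC line (perpendicular residuals).
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=1000)
X = X - X.mean(axis=0)                       # centre the data

# OLS of x2 on x1: minimizes the SS of vertical residuals
ols_slope = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

# First PC: minimizes the total SS of perpendicular distances
S = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(S)
a1 = evecs[:, np.argmax(evals)]
pc_slope = a1[1] / a1[0]
print(ols_slope, pc_slope)                   # the two slopes generally differ

# Residual SS perpendicular to a1 equals (n - 1) * lambda_2
a2 = evecs[:, np.argmin(evals)]
perp_ss = np.sum((X @ a2) ** 2)
print(perp_ss, (len(X) - 1) * evals.min())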
2.11 Exercise [Johnson & Wichern Example 8.1]
Find the PCs of the covariance matrix

$$\Sigma = \begin{pmatrix} 1 & -2 & 0 \\ -2 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix} \qquad (30)$$

and show that they account for amounts

$$\lambda_1 = 5.83, \qquad \lambda_2 = 2.00, \qquad \lambda_3 = 0.17$$

of the total variation in $\Sigma$. Compute the correlations $\rho(x_i, y_k)$ and try to interpret the PCs qualitatively.

Solution
The eigenvalues are the roots $\lambda$ of the characteristic equation $|\Sigma - \lambda I| = 0$:

$$(2 - \lambda)\left[(1 - \lambda)(5 - \lambda) - 4\right] = 0$$

so $\lambda_1 = 3 + 2\sqrt{2}$, $\lambda_2 = 2$, $\lambda_3 = 3 - 2\sqrt{2}$, i.e.

$$\lambda_1 = 5.83, \qquad \lambda_2 = 2, \qquad \lambda_3 = 0.17.$$

To find $a_1$ we solve the system $\Sigma a_1 = \lambda_1 a_1$. Set $a_1 \propto (1, \alpha, \beta)^T$; then

$$1 - 2\alpha = 3 + 2\sqrt{2}, \qquad 2\beta = \left(3 + 2\sqrt{2}\right)\beta,$$

so $\alpha = -(1 + \sqrt{2})$ and $\beta = 0$. Standardizing gives a unit-length vector

$$a_1^T = (.383, -.924, 0) \qquad (31)$$

Next set $a_2 = (\alpha, \beta, \gamma)^T$; we find $\alpha = \beta = 0$, $\gamma = 1$, so

$$a_2^T = (0, 0, 1) \qquad (32)$$

Finally, as $a_3$ is orthogonal to both $a_1$ and $a_2$,

$$a_3^T = (.924, .383, 0) \qquad (33)$$

The PCs are

$$y_1 = .383\,x_1 - .924\,x_2, \qquad y_2 = x_3, \qquad y_3 = .924\,x_1 + .383\,x_2.$$

The first PC $y_1$ accounts for a proportion

$$\frac{\lambda_1}{\lambda_1 + \lambda_2 + \lambda_3} = 5.83/8 = .73$$

of the total variation. The first two PCs $y_1, y_2$ account for a proportion

$$\frac{\lambda_1 + \lambda_2}{\lambda_1 + \lambda_2 + \lambda_3} = 7.83/8 = .98$$

of the total variation.

The $(3 \times 2)$ submatrix of $A$ corresponding to the first two PCs is

$$(a_1, a_2) = \begin{pmatrix} .383 & 0 \\ -.924 & 0 \\ 0 & 1 \end{pmatrix} \qquad (34)$$

The correlations of $x_1, x_2$ with the first PC $y_1$ are

$$\rho(x_1, y_1) = a_{11} \sqrt{\frac{\lambda_1}{\sigma_{11}}} = .383 \times \frac{\sqrt{5.83}}{\sqrt{1}} = .925$$

$$\rho(x_2, y_1) = a_{21} \sqrt{\frac{\lambda_1}{\sigma_{22}}} = -.924 \times \frac{\sqrt{5.83}}{\sqrt{5}} = -.998$$

In terms of correlations, both $x_1$ and $x_2$ contribute roughly equally (in magnitude) towards $y_1$.

According to the coefficients, $|a_{21}| = .924$ while $|a_{11}| = .383$ is much smaller in magnitude. This suggests that $x_2$ contributes more to $y_1$ than does $x_1$.
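The example can be verified in a few lines of NumPy; the sketch below is an added illustration (eigenvector signs are only determined up to a factor of $-1$, so the printed columns may differ from the hand solution by sign).

# Sketch: numerical check of the Example 8.1 eigenanalysis and first-PC loadings.
import numpy as np

Sigma = np.array([[1.0, -2.0, 0.0],
                  [-2.0, 5.0, 0.0],
                  [0.0, 0.0, 2.0]])
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
lam, A = evals[order], evecs[:, order]

print(np.round(lam, 2))                          # [5.83, 2.0, 0.17]
print(np.round(A, 3))                            # columns ~ (.383, -.924, 0), (0, 0, 1), (.924, .383, 0)
print(np.round(np.cumsum(lam) / lam.sum(), 2))   # [0.73, 0.98, 1.0]

loadings_first_pc = A[:, 0] * np.sqrt(lam[0]) / np.sqrt(np.diag(Sigma))
print(np.round(loadings_first_pc, 3))            # ~ (.925, -.998, 0)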