Principal Component Analysis
DIMENSIONALITY REDUCTION
The Curse of Dimensionality (CoD)
• Refers to various phenomena that
  • arise when analyzing and organizing data in high-dimensional spaces;
  • do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience;
  • arise when the number of data points is small (in a suitably defined sense) relative to the intrinsic dimension of the data.
CoD in Regression Models
• Case of linear regression models:
  • The number of parameters to be estimated equals the number of variables ($p$) plus one for the intercept.
  • So the number of parameters grows one-for-one with the number of features.
  • Provided $n > p$, the model is identifiable, and the parameters can be estimated.
  • Estimation precision, however, drops as $p$ grows relative to $n$ (see the simulation sketch below).
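A minimal simulation sketch of this effect (not from the slides; the setup and variable names are illustrative): fit OLS with a fixed $n$ and an increasing number of noise features, and watch the standard error of the first coefficient grow.

```r
# Sketch: OLS precision degrades as p grows relative to n
set.seed(1)
n <- 100
for (p in c(2, 10, 50, 90)) {
  X <- matrix(rnorm(n * p), n, p)
  y <- X[, 1] + rnorm(n)            # true model uses only the first feature
  fit <- lm(y ~ X)
  se1 <- summary(fit)$coefficients[2, "Std. Error"]
  cat(sprintf("p = %2d   SE(beta_1) = %.3f\n", p, se1))
}
```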
PRINCIPAL COMPONENT ANALYSIS
An Introduction
What is Principal Component Analysis (PCA)?
A method of data reduction
Starting with a large number of variables, PCA captures the variation in the data with a smaller number of components.
Principal Component Analysis
The new random variables
• are linear combinations of the original ones
• are uncorrelated with one another
  • their coefficient vectors are orthogonal in the original variable space
• capture all of the variability in the original data
• are called Principal Components
Principal Components
• The first principal component is a linear combination $\boldsymbol{u}_1'\boldsymbol{x}$ of the original $p$-variate vector variable $\boldsymbol{x}$, where $\boldsymbol{u}_1$ is so chosen that the variance of $\boldsymbol{u}_1'\boldsymbol{x}$ is maximized.
• The second principal component is a linear combination $\boldsymbol{u}_2'\boldsymbol{x}$ which has the greatest variance among all such linear combinations that are uncorrelated with $\boldsymbol{u}_1'\boldsymbol{x}$.
• The third principal component is a linear combination $\boldsymbol{u}_3'\boldsymbol{x}$ which has the greatest variance among all such linear combinations that are uncorrelated with $\boldsymbol{u}_1'\boldsymbol{x}$ and $\boldsymbol{u}_2'\boldsymbol{x}$.
• And so on …
Illustration with $p = 2$
[Figure: scatter of data on two original variables $X_1$ and $X_2$ with the first and second principal component axes (PC 1, PC 2) overlaid.]
Illustration with $p = 3$
Population Principal Components
• Let $\mathbf{X}$ be a $p$-dimensional random variable with $E(\mathbf{X}) = \mathbf{0}$ and $D(\mathbf{X}) = \boldsymbol{\Sigma}$.
• The first principal component $\boldsymbol{u}_1'\boldsymbol{x}$ is obtained by solving a constrained optimization problem:
  • Maximize $\boldsymbol{u}'\boldsymbol{\Sigma}\boldsymbol{u}$ subject to the constraint $\boldsymbol{u}'\boldsymbol{u} = 1$.
• To solve this problem, use Lagrange's method of undetermined multipliers:
  • Construct the Lagrangian $\varphi(\boldsymbol{u}) = \boldsymbol{u}'\boldsymbol{\Sigma}\boldsymbol{u} - \lambda(\boldsymbol{u}'\boldsymbol{u} - 1)$, $\lambda$ being the undetermined multiplier.
  • Solve the equation $\partial\varphi/\partial\boldsymbol{u} = \mathbf{0}$ to obtain the answer to your problem.
  • Observe that $\partial\varphi/\partial\boldsymbol{u} = \mathbf{0} \iff \boldsymbol{\Sigma}\boldsymbol{u} - \lambda\boldsymbol{u} = \mathbf{0} \iff (\boldsymbol{\Sigma} - \lambda\mathbf{I})\boldsymbol{u} = \mathbf{0}$.
• As $\boldsymbol{u} \neq \mathbf{0}$, $\boldsymbol{u}$ must be an eigenvector of $\boldsymbol{\Sigma}$ corresponding to the eigenvalue $\lambda$.
• As $\mathrm{var}(\boldsymbol{u}'\boldsymbol{x}) = \lambda$, which is to be maximized, it follows that the 1st PC $\boldsymbol{u}_1'\boldsymbol{x}$ is such that $\boldsymbol{u}_1$ is the normalized eigenvector of $\boldsymbol{\Sigma}$ corresponding to its largest eigenvalue, say, $\lambda_1$ (see the numerical check below).
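A quick numerical check of this result (a sketch, not from the slides; the matrix is arbitrary): no random unit vector should beat the leading eigenvector.

```r
# Check: the leading eigenvector of Sigma maximizes var(u'x) = u' Sigma u
Sigma <- matrix(c(4, 2, 2, 3), 2, 2)        # an arbitrary covariance matrix
lambda1 <- eigen(Sigma)$values[1]           # largest eigenvalue
set.seed(1)
rand_var <- replicate(1e4, {
  u <- rnorm(2); u <- u / sqrt(sum(u^2))    # a random unit vector
  drop(t(u) %*% Sigma %*% u)
})
c(lambda1 = lambda1, best_random = max(rand_var))   # lambda1 dominates
```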
The Second Principal Component
• WLOG, let us denote the eigenvalues of $\boldsymbol{\Sigma}$ by $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p > 0$.
  • Note that $\boldsymbol{\Sigma}$ is symmetric and positive definite, so its eigenvalues are real and positive.
• By the same reasoning as before, the second PC $\boldsymbol{u}_2'\boldsymbol{x}$ is such that $\boldsymbol{u}_2$ too is a normalized eigenvector of $\boldsymbol{\Sigma}$.
• Since $\boldsymbol{u}_1'\boldsymbol{x}$ and $\boldsymbol{u}_2'\boldsymbol{x}$ must be uncorrelated, i.e.,
$$\mathrm{cov}(\boldsymbol{u}_1'\boldsymbol{x}, \boldsymbol{u}_2'\boldsymbol{x}) = \boldsymbol{u}_1'\boldsymbol{\Sigma}\boldsymbol{u}_2 = \boldsymbol{u}_2'\boldsymbol{\Sigma}\boldsymbol{u}_1 = \lambda_1\boldsymbol{u}_2'\boldsymbol{u}_1 = 0,$$
it follows that $\boldsymbol{u}_2'\boldsymbol{u}_1 = 0$, i.e. $\boldsymbol{u}_2 \perp \boldsymbol{u}_1$, since $\lambda_1 \neq 0$.
• Thus the second PC $\boldsymbol{u}_2'\boldsymbol{x}$ is such that $\boldsymbol{u}_2$ is a normalized eigenvector of $\boldsymbol{\Sigma}$ corresponding to its second largest eigenvalue $\lambda_2$, with $\boldsymbol{u}_2 \perp \boldsymbol{u}_1$.
Remaining Principal Components
• By the same reasoning as before, it follows that
  • the 3rd PC $\boldsymbol{u}_3'\boldsymbol{x}$ is such that $\boldsymbol{u}_3$ is a normalized eigenvector of $\boldsymbol{\Sigma}$ corresponding to its third largest eigenvalue $\lambda_3$;
  • the 4th PC $\boldsymbol{u}_4'\boldsymbol{x}$ is such that $\boldsymbol{u}_4$ is a normalized eigenvector of $\boldsymbol{\Sigma}$ corresponding to its fourth largest eigenvalue $\lambda_4$;
  • …
  • the $p$-th PC $\boldsymbol{u}_p'\boldsymbol{x}$ is such that $\boldsymbol{u}_p$ is a normalized eigenvector of $\boldsymbol{\Sigma}$ corresponding to its smallest eigenvalue $\lambda_p$.
Computing the Components
• Data points $\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n$ are vectors in the $p$-dimensional space.
• Assumption: the mean vector of the data points is the null vector $\mathbf{0}$.
• The sample covariance matrix is $S = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i\boldsymbol{x}_i'$.
• The projection of a random vector $\boldsymbol{x}$ onto an axis (dimension) $\boldsymbol{u}$ is $\boldsymbol{u}'\boldsymbol{x}$.
• Choose $\boldsymbol{u}$ such that $\mathrm{var}(\boldsymbol{u}'\boldsymbol{x}) = \boldsymbol{u}'S\boldsymbol{u}$ is maximized subject to the constraint $\boldsymbol{u}'\boldsymbol{u} = 1$.
• This direction of $\boldsymbol{u}$ is the direction of the first Principal Component (see the sketch below).
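A minimal R sketch of this computation (the data are simulated; names are illustrative), checked against prcomp:

```r
# Sample PCs "by hand": eigendecomposition of the sample covariance matrix
set.seed(42)
X  <- matrix(rnorm(200 * 3), 200, 3)
Xc <- scale(X, center = TRUE, scale = FALSE)   # enforce the zero-mean assumption
S  <- crossprod(Xc) / nrow(Xc)                 # biased sample covariance matrix
eig <- eigen(S)
eig$vectors                                    # columns: PC directions u_1, ..., u_p
eig$values                                     # PC variances, in decreasing order
prcomp(X)$rotation                             # same directions, up to sign
```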
Population PCs: Important Properties
• The PCs are uncorrelated, that is, $\mathrm{cov}(\boldsymbol{u}_i'\boldsymbol{x}, \boldsymbol{u}_j'\boldsymbol{x}) = 0$ for all $i \neq j$.
• The variance of the $i$-th PC $\boldsymbol{u}_i'\boldsymbol{x}$ is $\lambda_i$, since
$$\mathrm{var}(\boldsymbol{u}_i'\boldsymbol{x}) = \boldsymbol{u}_i'\boldsymbol{\Sigma}\boldsymbol{u}_i = \boldsymbol{u}_i'\lambda_i\boldsymbol{u}_i = \lambda_i\boldsymbol{u}_i'\boldsymbol{u}_i = \lambda_i,$$
as $\boldsymbol{u}_i'\boldsymbol{u}_i = 1$.
Important Properties (contd.)
• The PCs are not scale-invariant.
• The covariances of the original variables with the vector of principal components $\boldsymbol{Y} = (\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_p)'(\boldsymbol{X} - \boldsymbol{\mu}) = \boldsymbol{U}'(\boldsymbol{X} - \boldsymbol{\mu})$ are given by
$$\mathrm{cov}(\boldsymbol{X}, \boldsymbol{Y}) = \boldsymbol{U}\boldsymbol{\Lambda},$$
where $\boldsymbol{U} = (\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_p)$ and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_p)$.
• In particular, $\mathrm{corr}(X_i, Y_j) = u_{ij}\lambda_j^{1/2}/\sigma_{ii}^{1/2}$, where $\sigma_{ii} = \mathrm{var}(X_i)$.
Principal Components of Normalized Random Variables
• If $Z_i = X_i/\sigma_{ii}^{1/2}$ for $i = 1, 2, \cdots, p$, that is, if the original random variables are normalized, then $D(\mathbf{Z}) = \boldsymbol{\rho}$, where $\boldsymbol{\rho}$ is the correlation matrix of the original random vector $\mathbf{X}$.
• PCs of $\mathbf{Z}$ can be determined in the same way as for $\mathbf{X}$, with $\boldsymbol{\Sigma}$ replaced by $\boldsymbol{\rho}$.
• In general, the $(\lambda_i, \boldsymbol{u}_i)$ pairs obtained from $\boldsymbol{\Sigma}$ are not the same as those derived from $\boldsymbol{\rho}$.
• In particular, the $\lambda_i$'s of $\boldsymbol{\rho}$ satisfy
$$\sum_{i=1}^{p} \lambda_i = p.$$
• In other words, the principal components obtained from $\boldsymbol{\Sigma}$ and $\boldsymbol{\rho}$ are in general different (see the sketch below).
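A brief R illustration of this point (a sketch; it borrows the USArrests data used later in these slides):

```r
# PCs from Sigma (covariance) vs. rho (correlation) generally differ
pca_cov <- prcomp(USArrests, scale. = FALSE)   # covariance-matrix PCA
pca_cor <- prcomp(USArrests, scale. = TRUE)    # correlation-matrix PCA
pca_cov$rotation[, 1]     # dominated by the variable with the largest variance
pca_cor$rotation[, 1]     # weights are far more balanced
sum(pca_cor$sdev^2)       # eigenvalues of the correlation matrix sum to p = 4
```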
Large Sample Properties of Sample PCs
• Let $\boldsymbol{\Lambda}$ be the diagonal matrix of the eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_p$ of $\boldsymbol{\Sigma}$.
• Let $\hat{\boldsymbol{\lambda}}' = (\hat{\lambda}_1, \hat{\lambda}_2, \cdots, \hat{\lambda}_p)$ be the vector of eigenvalues of the sample covariance matrix $S$, with $\hat{\boldsymbol{u}}_i$ denoting the eigenvector corresponding to $\hat{\lambda}_i$.
• As $n \to \infty$,
  • $\sqrt{n}(\hat{\boldsymbol{\lambda}} - \boldsymbol{\lambda}) \xrightarrow{L} \mathcal{N}_p(\mathbf{0}, 2\boldsymbol{\Lambda}^2)$; this implies that for each $i$, $\hat{\lambda}_i$ has an approximate $\mathcal{N}(\lambda_i, 2\lambda_i^2/n)$ distribution, independently of $\hat{\lambda}_j$, $j \neq i$;
  • $\sqrt{n}(\hat{\boldsymbol{u}}_i - \boldsymbol{u}_i) \xrightarrow{L} \mathcal{N}_p(\mathbf{0}, \mathbf{E}_i)$, where $\mathbf{E}_i = \lambda_i \sum_{k=1,\, k \neq i}^{p} \frac{\lambda_k}{(\lambda_k - \lambda_i)^2}\, \boldsymbol{u}_k\boldsymbol{u}_k'$;
  • each $\hat{\lambda}_i$ is distributed independently of the elements of $\hat{\boldsymbol{u}}_i$, for all $i$ (a simple application of the first result is sketched below).
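The normal approximation for $\hat{\lambda}_i$ yields, for instance, an approximate large-sample confidence interval for $\lambda_i$. A sketch in R (using USArrests, which appears later; the interval form follows directly from the $\mathcal{N}(\lambda_i, 2\lambda_i^2/n)$ approximation):

```r
# Approximate 95% CI for each lambda_i from lambda_hat_i ~ N(lambda_i, 2*lambda_i^2/n)
n <- nrow(USArrests)
lam_hat <- prcomp(USArrests, scale. = TRUE)$sdev^2   # sample eigenvalues
z <- qnorm(0.975)
round(cbind(lower = lam_hat / (1 + z * sqrt(2 / n)),
            upper = lam_hat / (1 - z * sqrt(2 / n))), 3)
```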
Scree Plot
• A plot of the eigenvalues of the sample dispersion matrix, that is, of the variances of the principal components (see the sketch below).
• Used to determine the number of variables (PCs) to retain.
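A one-line way to draw a scree plot in R (illustrated here with the USArrests data used later in these slides):

```r
# Scree plot: PC variances (eigenvalues) in decreasing order
screeplot(prcomp(USArrests, scale. = TRUE), type = "lines", main = "Scree Plot")
```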
Dimensionality Reduction
• From the scree plot:
  • If the scree plot contains an elbow (a sharp change in the slopes of adjacent line segments), that location might indicate a good number of principal components (PCs) to retain.
  • If detecting the elbow is too imprecise, start at the right-hand side of the scree plot and look at the points that lie (approximately) on a straight line. Discard the PCs lying on that line.
Dimensionality Reduction
By ignoring the components of lesser significance.
[Figure: bar chart of the percentage of variance explained by PC1 through PC10.]
You do lose some information, but if the discarded eigenvalues are small, you do not lose much.
– If there are $p$ dimensions in the original data, calculate all $p$ eigenvectors and eigenvalues.
– Choose only the first $k$ PCs; the final data set then has only $k$ dimensions.
Dimensionality Reduction
• By proportion of variation explained:
  • Compute
$$P_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j},$$
the proportion of variability explained by the first $k$ PCs, for $1 \leq k \leq p$. Note that $P_p = 1$.
  • If $P_k$ is sufficiently close to 1, retain the first $k$ PCs.
• Using the average of the eigenvalues:
  • The average-eigenvalue test (Kaiser-Guttman test) retains the PCs whose eigenvalues exceed the average eigenvalue.
  • For a $p \times p$ correlation matrix, the average value of the eigenvalues is 1.
• Both rules are illustrated in the sketch below.
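A short R sketch of both rules (illustrative only; again using USArrests):

```r
# Proportion of variation explained and the Kaiser-Guttman rule
lam <- prcomp(USArrests, scale. = TRUE)$sdev^2   # eigenvalues of the correlation matrix
P_k <- cumsum(lam) / sum(lam)                    # P_k for k = 1, ..., p
round(P_k, 3)                # retain the smallest k with P_k close enough to 1
which(lam > mean(lam))       # Kaiser-Guttman: keep eigenvalues above the average (1 here)
```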
PCA: Example
• Compute the principal components of the following two-dimensional dataset (shown on the next slide).
• Solution by hand:
  • The biased covariance matrix of the data is
$$S = \begin{pmatrix} 6.25 & 4.25 \\ 4.25 & 3.50 \end{pmatrix}.$$
PCA: Example
• Data:

  x1  x2
   1   2
   3   3
   3   5
   5   4
   5   6
   6   5
   8   7
   9   8

• Mean vector: $m = (5, 5)'$.
PCA: Example (contd.)
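The hand computation on this slide is not reproduced here; a minimal R sketch of the same steps (eigendecomposition of the biased covariance matrix above):

```r
# Worked example: PCs of the small two-dimensional dataset by eigendecomposition
x <- matrix(c(1, 2,  3, 3,  3, 5,  5, 4,
              5, 6,  6, 5,  8, 7,  9, 8),
            ncol = 2, byrow = TRUE)
xc <- scale(x, center = TRUE, scale = FALSE)   # center at the mean (5, 5)
S  <- crossprod(xc) / nrow(x)                  # biased covariance matrix shown above
eig <- eigen(S)
eig$values           # PC variances: approximately 9.34 and 0.41
eig$vectors          # columns: the PC directions u_1 and u_2
xc %*% eig$vectors   # PC scores of the 8 points
```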
Example: Iris Data (PCA with R)
Loadings:
               Comp.1  Comp.2  Comp.3  Comp.4
Sepal.Length    0.521   0.377   0.720   0.261
Sepal.Width     0.269   0.923   0.244   0.124
Petal.Length    0.580   0       0.142   0.801
Petal.Width     0.565   0       0.634   0.524
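These loadings match, up to sign (the table shows magnitudes, and entries below the print cutoff appear as 0), what princomp reports for the four numeric iris columns; a sketch of the call, assuming a correlation-matrix PCA:

```r
# Reproduce the loadings table (signs may differ; small loadings print blank)
irispca <- princomp(iris[, 1:4], cor = TRUE)
print(loadings(irispca), digits = 3)
```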
PLOTS FOR PCA
Illustration with data on the expression of 15 genes from 60 mice
The Scree Plots from the Dataset
PCA Score Plot
• Mice that have similar expression profiles are now clustered together. Just glancing at this plot, we can see that there are 3 clusters of mice.
• If 2 clusters of mice differ along PC1, like the blue and orange clusters in this plot, such differences are likely to be due to the genes that have heavy influences on PC1.
• If 2 clusters differ along PC2, like the red and blue clusters, then the genes that heavily influence PC2 are likely to be responsible.
• Differences among clusters along the PC1 axis are actually larger than similar-looking distances along the PC2 axis, because PC1 accounts for more of the variance.
PCA Loading Plot
PCA Loading Plot (contd.)
• These vectors are pinned at the origin of the PCs (PC1 = 0 and PC2 = 0).
• Their projected values on each PC show how much weight they have on that PC.
  • NPC2 and CHIT1 strongly influence PC1.
  • GBA and LCAT have more say in PC2.
• The angles between the vectors indicate how the variables correlate with one another.
  • When two vectors are close, forming a small angle, the two variables they represent are positively correlated. Example: UGT8 and NPC1.
  • If they meet each other at about 90°, they are not likely to be correlated. Example: NPC2 and GBA.
  • When they diverge and form a large angle (close to 180°), they are negatively correlated. Example: NPC2 and MAG.
PCA Biplot
PCA Biplot (contd.)
• A PCA biplot simply combines a PCA plot with a plot of loadings.
• The arrangement of axes is as follows:
• Bottom axis: PC1 score.
• Left axis: PC2 score.
• Top axis: loadings on PC1.
• Right axis: loadings on PC2.
PCA WITH R
Illustration with the USArrests dataset
The Data
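The slide shows a screenshot of the data; the dataset ships with base R:

```r
# USArrests: 1973 arrests per 100,000 residents (Murder, Assault, Rape)
# for the 50 US states, plus percent urban population (UrbanPop)
head(USArrests)
str(USArrests)
```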
Computation: The loadings
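The loadings shown on the slide are a screenshot; a sketch of the call that produces them (the rotation matrix of a prcomp fit):

```r
# The loadings: columns of the rotation matrix are the PC direction vectors
pca <- prcomp(USArrests)
pca$rotation
```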
Computation: The Transform
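The transformed data (the PC scores) are a screenshot on the slide; they live in the x component of the prcomp object, reusing pca from above:

```r
# The transform: PC scores of each state (centered data times the rotation matrix)
head(pca$x)
```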
Computation: The summary function
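A sketch of the summary call whose output the slide shows:

```r
# Standard deviation, proportion of variance, and cumulative proportion per PC
summary(pca)
```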
plot(prcomp(USArrests))
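For reference, this call draws a bar chart of the variances of the components, i.e. a scree plot:

```r
# plot() on a "prcomp" object shows the component variances
plot(prcomp(USArrests))
```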
The Biplot
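The biplot on the slide is a screenshot; a sketch of the call, reusing pca from above (scores and loading arrows overlaid, with the axis arrangement described earlier):

```r
# Biplot: PC1/PC2 scores for the states plus loading arrows for the variables
biplot(pca)
```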