Ch. 10 Principal Components Analysis (PCA)
This material is loosely related to Section 10B. I would encourage you to read the rest of Chapter 10, but very critically; factor analysis is very popular in social science but also very controversial.
Why use PCA? There are several uses (and abuses) for PCA. The most important use of PCA is probably in multiple regression. Suppose a response variable Y is to be regressed against a large number of covariates. Variable selection techniques are often not very effective, and there may be scientific interest in including information from most or all of the covariates. Retaining all covariates will likely lead to severe multicollinearity or non-identifiability of regression coefficients. Without remedy, standard errors will be unacceptably large, and predictions may be very inaccurate.
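For concreteness, severe multicollinearity can be diagnosed before turning to PCA by asking PROC REG for variance inflation factors. The following is only a minimal sketch, assuming a hypothetical data set FULL with covariates X1-X10 and response Y (these names are not from the notes):

PROC REG DATA=FULL;
  TITLE 'Checking for multicollinearity with variance inflation factors';
  MODEL Y = X1-X10 / VIF;   /* large VIFs (say, well above 10) signal inflated standard errors */
RUN;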
What does PCA do? Objectives:
1. To find a small set of linear combinations of the covariates which are uncorrelated with each other. This will avoid the multicollinearity problem.
2. To ensure that the linear combinations chosen have maximal variance. A good regression design chooses values of the covariates which are spread out.
Calculating Principal Components
Suppose $n$ independent observations are taken on $X_1, X_2, \ldots, X_k$, where the covariance between $X_i$ and $X_j$ is $\mathrm{Cov}(X_i, X_j) = \sigma_{ij} = \sigma^2 R_{ij}$ for $i, j = 1, 2, \ldots, k$; that is, the covariance matrix is $\Sigma = \sigma^2 R$, where $R$ is the correlation matrix. Let $\lambda_1 > \lambda_2 > \cdots > \lambda_k > 0$ be the eigenvalues of $R$ and let $z_1, z_2, \ldots, z_k$ be the corresponding eigenvectors, normalized so that $z_j^T z_j = 1$.
Calculating Principal Components
Define $W_1$ to be the first principal component. It will be a linear combination of the $X$s which has the largest possible variance:
$$W_1 = a_1^T X = \sum_{i=1}^{k} a_{1i} X_i, \qquad \text{where } a_1^T a_1 = 1.$$
$$\mathrm{Var}(W_1) = a_1^T \Sigma\, a_1 = \sigma^2 a_1^T R a_1.$$
Constrained maximization leads to the Lagrangian
$$L = a_1^T R a_1 + \lambda\,(1 - a_1^T a_1).$$
Calculating Principal Components
Solution: $a_1$ is a unit vector satisfying $R a_1 = \lambda a_1$, i.e. $a_1$ must be an eigenvector of $R$. Remember that we want to maximize
$$a_1^T R a_1 = \lambda\, a_1^T a_1 = \lambda,$$
where $\lambda$ is the eigenvalue which corresponds to the eigenvector $a_1$. Therefore, take $a_1 = z_1$. That is,
$$W_1 = z_1^T X = \sum_{i=1}^{k} z_{1i} X_i.$$
W1 has the largest variance among all linear combinations of the Xs.
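As a numerical illustration of this calculation, the SAS/IML sketch below extracts the eigenvalues and eigenvectors of a small made-up correlation matrix (the 3 x 3 matrix is hypothetical, not from the notes). CALL EIGEN returns the eigenvalues in descending order, so the first column of Z is $z_1$ and the first component has variance $\sigma^2 \lambda_1$:

PROC IML;
  /* hypothetical correlation matrix R for k = 3 covariates */
  R = {1.0 0.6 0.3,
       0.6 1.0 0.5,
       0.3 0.5 1.0};
  CALL EIGEN(lambda, Z, R);   /* eigenvalues (descending) in lambda, eigenvectors in the columns of Z */
  z1 = Z[, 1];                /* coefficients a1 = z1 of the first principal component W1 = z1` X */
  varW1 = lambda[1];          /* Var(W1) = sigma^2 * lambda1 (sigma^2 = 1 here) */
  PRINT lambda, z1, varW1;
QUIT;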
Calculating the Second Principal Component
Let $W_2$ be a second linear combination of the $X$s which has the largest possible variance,
$$W_2 = a_2^T X = \sum_{i=1}^{k} a_{2i} X_i,$$
but with $\mathrm{Corr}(W_1, W_2) = 0$, i.e. $a_2^T z_1 = 0$.
Now the constrained maximization leads to the Lagrangian
$$L = a_2^T R a_2 + \gamma_1\,(1 - a_2^T a_2) + \gamma_2\, a_2^T z_1.$$
Calculating the Second Principal Component
Solution:
$$(R - \gamma_1 I)\, a_2 + \tfrac{1}{2}\gamma_2 z_1 = 0, \qquad \text{with } a_2^T a_2 = 1,\; R z_1 = \lambda_1 z_1,\; a_2^T z_1 = 0.$$
Multiply through by $z_1^T$ to find that $\gamma_2 = 0$. Therefore $R a_2 = \gamma_1 a_2$, so $a_2$ is an eigenvector of $R$ (which cannot be $z_1$ because of the orthogonality condition). Taking $a_2 = z_2$ makes $\mathrm{Var}(W_2) = \sigma^2 \lambda_2$, which is as large as possible under the given constraints. That is,
$$W_2 = z_2^T X = \sum_{i=1}^{k} z_{2i} X_i.$$
W2 has the largest variance among all linear combinations of the Xs which are orthogonal to W1.
Calculating the Remaining Principal Components
The third, fourth, fifth, ... principal components follow from the same reasoning. $W_j$ is the linear combination of the $X$s which has the largest variance, subject to the constraint that $W_j$ is uncorrelated with $W_1, W_2, \ldots, W_{j-1}$. The constrained maximization problem is solved by setting
$$W_j = z_j^T X = \sum_{i=1}^{k} z_{ji} X_i.$$
The variance of $W_j$ is $\sigma^2 \lambda_j$.
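Continuing the hypothetical 3 x 3 example from the sketch above, the following verifies numerically that the eigenvectors diagonalize $R$: the covariance matrix of the component scores $W = Z^T X$ is $\sigma^2\,\mathrm{diag}(\lambda_1, \ldots, \lambda_k)$, so each $W_j$ has variance $\sigma^2 \lambda_j$ and distinct components are uncorrelated:

PROC IML;
  /* the same hypothetical correlation matrix as before */
  R = {1.0 0.6 0.3,
       0.6 1.0 0.5,
       0.3 0.5 1.0};
  CALL EIGEN(lambda, Z, R);
  covW = Z` * R * Z;   /* diagonal entries are the lambda_j; off-diagonal entries are zero up to rounding */
  PRINT covW, lambda;
QUIT;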
How many Principal Components should be used?
Remember that the objective is to use only the first few components. The usual technique is to look for where there is a sharp drop in the component variances. (Remember that a good regression design will have spread-out covariates, so the components with small variance, i.e. small eigenvalues, will be omitted.)
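A minimal sketch of this screening step, using made-up eigenvalues for k = 5 covariates (not values from any data set in these notes): each component accounts for the proportion $\lambda_j / k$ of the total variance, and the sharp drop after the second eigenvalue below would suggest retaining two components:

PROC IML;
  /* hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to k = 5) */
  lambda  = {2.9, 1.2, 0.5, 0.3, 0.1};
  prop    = lambda / sum(lambda);   /* proportion of total variance for each component */
  cumprop = cusum(prop);            /* cumulative proportion */
  PRINT lambda prop cumprop;
QUIT;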
Principal Components Regression
After identifying the principal components which account for most of the variance in $X_1, X_2, \ldots, X_k$ (often 2 to 4 of the components), these components can be used in regression.

e.g. Original data set:

x1  x2  x3  ...  xk   y
 2   3   1  ...   5   20
 4   3   3  ...   5   25
 .   .   .  ...   .    .
$$W_1 = \sum_{i=1}^{k} z_{1i} X_i, \qquad W_2 = \sum_{i=1}^{k} z_{2i} X_i, \qquad W_3 = \sum_{i=1}^{k} z_{3i} X_i$$
New data set:

x1  x2  x3  ...  xk   y    w1                      w2                      w3
 2   3   1  ...   5   20   2z11+3z12+...+5z1k      2z21+3z22+...+5z2k      ...
 4   3   3  ...   5   25   4z11+3z12+...+5z1k      4z21+3z22+...+5z2k      ...
 .   .   .  ...   .    .   ...                     ...                     ...
$$Y = \beta_0 + \sum_{i=1}^{3} \beta_i W_i + \epsilon$$
Advantage: $W_1$, $W_2$ and $W_3$ are orthogonal, so t-tests for coefficients are easy to interpret. No multicollinearity.
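One way to carry out this whole workflow in SAS is with PROC PRINCOMP, which writes component scores Prin1, Prin2, ... to an output data set that PROC REG can then use; the slides that follow use PROC FACTOR instead. This is only a rough sketch, assuming a hypothetical data set DAT with covariates X1-X5 and response Y:

PROC PRINCOMP DATA=DAT OUT=PCOUT;   /* adds the scores Prin1-Prin5 to PCOUT */
  VAR X1-X5;
RUN;

PROC REG DATA=PCOUT;
  TITLE 'Regression of Y on the first two principal components';
  MODEL Y = Prin1 Prin2;            /* orthogonal regressors, so no multicollinearity */
RUN;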
Using PROC FACTOR for PCA We will apply PCA to the data in swiss.txt which is a data set that can be found in the textbook Data Analysis and Regression: A Second Course in Statistics by F. Mosteller and J.W. Tukey (1977). These data were collected in about 1888 in the 47 French-speaking provinces of Switzerland. The variables are Fertility, Agriculture, Examination, Education, and Catholic as well as Infant.Mortality which we will treat as a response variable.
Using PROC FACTOR for PCA

DATA SWISS;
  INFILE 'swiss.txt' FIRSTOBS=2;
  INPUT FERT AGRI ARMYEXAM EDUC CATHOL INFMORT;
RUN;

PROC FACTOR DATA=SWISS PREPLOT PLOT ROTATE=VARIMAX NFACTORS=2 OUT=FACT SCREE;
  TITLE 'PCA for 1888 Swiss Data';
  VAR FERT AGRI ARMYEXAM EDUC CATHOL;
RUN;
Using PROC FACTOR for PCA
PREPLOT shows a factor plot before rotation; PLOT shows a factor plot after rotation. The components themselves are extracted by PROC FACTOR's default method (METHOD=PRINCIPAL), which is the calculation described above; ROTATE=VARIMAX then applies a varimax rotation to the retained components. NFACTORS controls the number of components that will be retained. SCREE gives a scree plot, which is useful for choosing the number of components: look for a sharp drop.
Principal Components Regression
Now regress $Y = \beta_0 + \beta_1 W_1 + \beta_2 W_2 + \epsilon$.

/* Since the data set FACT created by PROC FACTOR contains all the variables in DATA=SWISS plus the new variables Factor1 and Factor2, we can go straight to PROC REG. */
PROC REG DATA=FACT;
  TITLE 'Regression of INFMORT on Factor1 and Factor2';
  MODEL INFMORT = FACTOR1 FACTOR2;
RUN;