Factor Analysis Note 4
Chapter 420
Factor Analysis
Introduction
Factor Analysis (FA) is an exploratory technique applied to a set of observed variables that seeks to find
underlying factors (subsets of variables) from which the observed variables were generated. For example, an
individual’s response to the questions on a college entrance test is influenced by underlying variables such as
intelligence, years in school, age, emotional state on the day of the test, amount of practice taking tests, and so on.
The answers to the questions are the observed variables. The underlying, influential variables are the factors.
Factor analysis is carried out on the correlation matrix of the observed variables. A factor is a weighted average of
the original variables. The factor analyst hopes to find a few factors from which the original correlation matrix
may be generated.
Usually the goal of factor analysis is to aid data interpretation. The factor analyst hopes to identify each factor as
representing a specific theoretical factor. Therefore, many of the reports from factor analysis are designed to aid
in the interpretation of the factors.
Another goal of factor analysis is to reduce the number of variables. The analyst hopes to reduce the interpretation
of a 200-question test to the study of 4 or 5 factors. One of the most subtle tasks in factor analysis is determining
the appropriate number of factors.
Factor analysis has an infinite number of solutions. If a solution contains two factors, these may be rotated to form
a new solution that reproduces the correlation matrix equally well. Hence, one of the biggest
complaints about factor analysis is that the solution is not unique. Two researchers can find two different sets of
factors that are interpreted quite differently yet fit the original data equally well.
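The non-uniqueness can be checked numerically. The sketch below is an illustration only: the loading values and the 30-degree angle are hypothetical, and the helper names `reproduced` and `rotate` are not NCSS functions. It shows that an orthogonal rotation of a two-factor loading matrix leaves the reproduced correlations unchanged.

```python
import math

def reproduced(loadings):
    """Return L * L^T: off-diagonal entries are the correlations implied by
    the factors, diagonal entries are the communalities."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]

def rotate(loadings, theta):
    """Rotate a two-column loading matrix by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c - b * s, a * s + b * c] for a, b in loadings]

# Hypothetical loadings for three variables on two factors.
loadings = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]
rotated = rotate(loadings, math.radians(30))

before = reproduced(loadings)
after = reproduced(rotated)
# Largest difference between the two reproduced matrices (rounding error only).
max_diff = max(abs(before[i][j] - after[i][j])
               for i in range(3) for j in range(3))
```

The two loading matrices differ entry by entry, yet `max_diff` is at the level of floating-point rounding, which is why a fit criterion alone cannot choose between them.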
NCSS provides the principal axis method of factor analysis. The results may be rotated using varimax or
quartimax rotation. The factor scores may be stored for further analysis.
Many books are devoted to factor analysis. We suggest you obtain a book on the subject from an author in your
own field. An excellent introduction to the subject is provided by Tabachnick (1989).
© NCSS, LLC. All Rights Reserved.
NCSS Statistical Software NCSS.com
Technical Details
Mathematical Development
This section will document the basic formulas used by NCSS in performing a factor analysis. The following table
lists many of the matrices that are used in the discussion to follow.
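$$c_{ii} = \left(1 - \frac{1}{R^{ii}}\right) \frac{\sum_{k=1}^{p} \max_{j \ne k}\left(r_{jk}\right)}{\sum_{k=1}^{p}\left(1 - \frac{1}{R^{kk}}\right)}$$
where $R^{ii}$ is the ith diagonal element of $R^{-1}$ and $r_{jk}$ is an element of R. The value of U is then estimated by $1 - c_{ii}$.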
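A minimal numerical sketch of this initial estimate follows. It assumes the formula is read as the squared multiple correlation, 1 - 1/R^{ii}, rescaled so that the estimates sum to the sum of each column's largest off-diagonal correlation. The function name `initial_communalities` and the three-variable demonstration matrix are illustrative assumptions, not part of NCSS.

```python
import numpy as np

def initial_communalities(R):
    """Rescaled squared-multiple-correlation estimate of the communalities."""
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)
    # Squared multiple correlation of each variable with all the others.
    smc = 1.0 - 1.0 / np.diag(R_inv)
    # Largest off-diagonal correlation in each column (diagonal masked out).
    off_diag = R.copy()
    np.fill_diagonal(off_diag, -np.inf)
    col_max = off_diag.max(axis=0)
    # Rescale the SMCs so that they sum to the sum of the column maxima.
    return smc * col_max.sum() / smc.sum()

# Hypothetical correlation matrix: three variables, all correlations 0.5.
R_demo = np.array([[1.0, 0.5, 0.5],
                   [0.5, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])
c_demo = initial_communalities(R_demo)   # initial communalities
u_demo = 1.0 - c_demo                    # initial diagonal of U
```

For this equicorrelated matrix every estimate works out to 0.5, which matches a single common factor with loadings of sqrt(0.5) on each variable.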
Varimax Rotation
Varimax rotation is the most popular orthogonal rotation technique. In this technique, the axes are rotated to
maximize the sum of the variances of the squared loadings within each column of the loadings matrix.
Maximizing according to this criterion forces the loadings to be either large or small. The hope is that by rotating
the factors, you will obtain new factors that are each highly correlated with only a few of the original variables.
This simplifies the interpretation of the factor to a consideration of these two or three variables. Another way of
stating the goal of varimax rotation is that it clusters the variables into groups; each “group” is actually a new
factor.
Since varimax seeks to maximize a specific criterion, it produces a unique solution (except for differences in
sign). This has added to its popularity. Let the matrix $G = \{g_{ij}\}$ represent the rotated factors. The goal of varimax
rotation is to maximize the quantity:
$$Q_1 = \sum_{j=1}^{k} \left( \frac{p \sum_{i=1}^{p} g_{ij}^{4} - \left( \sum_{i=1}^{p} g_{ij}^{2} \right)^{2}}{p} \right)$$
This equation gives the raw varimax rotation. This rotation has the disadvantage of not spreading the variance
very evenly among the new factors. Instead, it tends to form one large factor followed by many small ones. To
correct this, NCSS uses the normalized-varimax rotation. The quantity maximized in this case is:
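$$Q_N = \sum_{j=1}^{k} \left( \frac{p \sum_{i=1}^{p} \left( \frac{g_{ij}}{c_i} \right)^{4} - \left( \sum_{i=1}^{p} \left( \frac{g_{ij}}{c_i} \right)^{2} \right)^{2}}{p^{2}} \right)$$
where $c_i$ is the square root of the communality of variable i.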
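The two criteria can be evaluated directly for a given loading matrix. The sketch below assumes the conventional reading of the formulas, in which the sum of squared loadings in each column is itself squared, and it keeps the divisors p (raw) and p squared (normalized). The function names and the two small loading matrices are hypothetical.

```python
import math

def raw_varimax(G):
    """Raw varimax criterion Q1 for a p-by-k loading matrix (list of rows)."""
    p, k = len(G), len(G[0])
    total = 0.0
    for j in range(k):
        sq = [G[i][j] ** 2 for i in range(p)]            # squared loadings
        total += (p * sum(s * s for s in sq) - sum(sq) ** 2) / p
    return total

def normalized_varimax(G):
    """Normalized varimax criterion QN: each row is divided by c_i first."""
    p, k = len(G), len(G[0])
    # c_i squared is the communality of variable i (assumed nonzero here).
    comm = [sum(g * g for g in row) for row in G]
    total = 0.0
    for j in range(k):
        sq = [G[i][j] ** 2 / comm[i] for i in range(p)]  # (g_ij / c_i)^2
        total += (p * sum(s * s for s in sq) - sum(sq) ** 2) / p ** 2
    return total

# Each variable loads on one factor only (simple structure).
simple = [[1.0, 0.0], [0.0, 1.0]]
# The same solution rotated by 45 degrees: loadings are spread evenly.
s = math.sqrt(0.5)
mixed = [[s, s], [-s, s]]

q_simple, q_mixed = raw_varimax(simple), raw_varimax(mixed)
qn_simple = normalized_varimax(simple)
```

The criterion is larger for the simple-structure matrix than for its 45-degree rotation, which is the behavior the rotation exploits when it searches for loadings that are either large or small.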
Quartimax Rotation
Quartimax rotation is similar to varimax rotation except that the variance of the squared loadings is maximized
within the rows of G rather than within the columns of G. This rotation is more likely to produce a “general” factor
than will varimax. Often, the results are quite similar. The quantity maximized for the quartimax is:
$$Q_N = \sum_{j=1}^{k} \left( \frac{\sum_{i=1}^{p} \left( \frac{g_{ij}}{c_i} \right)^{4}}{p} \right)$$
Miscellaneous Topics
Data Structure
The data for a factor analysis consist of two or more columns. We have created an artificial data set in which
each of the six variables (X1 - X6) was created using a weighted average of two original variables (V1 and V2)
plus a small random error. For example, X1 = 0.33 V1 + 0.65 V2 + error. Each variable had a different set of
weights (0.33 and 0.65 are the weights for X1) in the weighted average.
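A data set with this structure can be simulated as sketched below. Only the X1 weights (0.33 and 0.65) come from the text. The weights for X2 through X6, the row count, the error standard deviation, and the normal distributions are assumptions made for illustration, so the result will not reproduce the PCA2 values.

```python
import random

def make_dataset(n_rows=30, seed=0):
    """Build rows of six variables, each a weighted sum of V1 and V2 plus error."""
    rng = random.Random(seed)
    # (weight on V1, weight on V2) for X1..X6; only the first pair is from the text.
    weights = [(0.33, 0.65), (0.80, 0.10), (0.90, 0.05),
               (0.50, 0.50), (0.60, 0.40), (0.75, 0.20)]
    rows = []
    for _ in range(n_rows):
        v1 = rng.gauss(0.0, 1.0)   # underlying factor V1
        v2 = rng.gauss(0.0, 1.0)   # underlying factor V2
        rows.append([w1 * v1 + w2 * v2 + rng.gauss(0.0, 0.01)
                     for w1, w2 in weights])
    return rows

data = make_dataset()
```

Because every column depends on the same two underlying variables, a factor analysis of data like this should recover two dominant factors.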
Rows two and three of the data set were modified to be outliers so that their influence on the analysis could be
observed. Note that even though these two rows are outliers, their values on each of the individual variables are
not outliers. This shows one of the challenges of multivariate analysis: multivariate outliers are not necessarily
univariate outliers. In other words, a point may be an outlier in a multivariate space and yet you cannot detect it
by scanning the data one variable at a time.
This data is contained in the dataset PCA2. The data given below are the first few rows of this dataset.
Procedure Options
This section describes the options available in this procedure.
Variables Tab
This panel specifies the variables used in the analysis.
Input Variables
Variables
Designates the variables to be analyzed. If matrix input is selected, indicate the variables containing the matrix.
Note that for matrix input, the number of rows used is equal to the number of variables specified. Other rows will
be ignored.
Data Input Format
Indicates whether raw data is to be analyzed or if a previously summarized correlation or covariance matrix is to
be used.
• Regular Data
The data is to be input in its raw format.
• Lower-Triangular
The data is in a correlation or covariance matrix in lower-triangular format. This matrix could have been
created by a previous run of an NCSS program or from direct keyboard input.
• Upper-Triangular
The data is in a correlation or covariance matrix in upper triangular format. The number of rows used is equal
to the number of variables specified. This matrix could have been created by a previous run of an NCSS
program or from direct keyboard input.
• None
No missing value imputation. Rows with missing values in any of the selected variables are ignored.
• Average
The average-value imputation method is used. Each missing value is estimated by the average value of that
variable. The process is iterated as many times as is indicated in the second box.
• Multivariate Normal
The multivariate-normal method. Each missing value is estimated using a multiple regression of the missing
variable(s) on the variables that contain data in that row. This process is iterated as many times as indicated.
See the discussion of missing value imputation methods elsewhere in this chapter.
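The average-value method can be sketched in a few lines. This is a single-pass illustration under stated assumptions: missing values are represented by `None`, every column has at least one observed value, and the function name `impute_column_means` is hypothetical. The iteration that NCSS performs is not reproduced here.

```python
def impute_column_means(rows):
    """Replace each missing value (None) with the mean of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if value is None else value
             for j, value in enumerate(row)]
            for row in rows]

# Hypothetical data with one missing value in each column.
raw_rows = [[1.0, 2.0], [None, 4.0], [3.0, None]]
filled_rows = impute_column_means(raw_rows)
```

With the None option, by contrast, the second and third rows of `raw_rows` would simply be left out of the analysis.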
Maximum Iterations
This option specifies the number of iterations used by either Missing Value Imputation or Robust Covariance
Estimation. Robust estimation usually requires only four or five iterations to converge. Missing value imputation
may require as many as twenty iterations if there are a lot of missing values.
When using this option, it is better to specify too many iterations than too few. After considering the Percent
Change values in the Iteration Report, you can decide upon an appropriate number of iterations and re-run the
problem.
Factor Options
Factor Rotation
Specifies the type of rotation, if any, that should be used on the solution. If rotation is desired, either varimax or
quartimax rotation is available.
Number of Factors
This option specifies the number of factors to be used. On the first run, you would set this rather large (say eight
or so). After viewing the eigenvalues you would reset this appropriately and make a second run.
Communality Options
Communality Iterations
This option specifies how many iterations to use in estimating the communalities. Some authors suggest a value
of one here. Others suggest as many as four or five.
Reports Tab
The following options control the format of the reports.
Select Reports
Descriptive Statistics - Scores Report
These options let you specify which reports are displayed.
Report Options
Minimum Loading
Specifies the minimum absolute value that a loading can have and still remain in the Variable List report.
Precision
Specify the precision of numbers in the report. A single-precision number will show seven-place accuracy, while
a double-precision number will show thirteen-place accuracy. Note that the reports are formatted for single
precision. If you select double precision, some numbers may run into others. Also note that all calculations are
performed in double precision regardless of which option you select here. This is for reporting purposes only.
Variable Names
This option lets you select whether to display only variable names, variable labels, or both.
Plots Tab
These sections specify the pair-wise plots of the scores and loadings.
Select Plots
Scores Plot - Loadings Plot
These options let you specify which reports and plots are displayed. Click the plot format button to change the
plot settings.
Plot Options
Number Factors Plotted
You can limit the number of plots generated using this parameter. Usually, you will only have interest in the first
three or four factors.
Storage Tab
The factor scores and/or the correlation matrix may be stored on the current dataset for further analysis. This
group of options lets you designate which statistics (if any) should be stored and which columns should receive
these statistics. The selected statistics are automatically stored to the current dataset. Note that existing data are
replaced.
This report presents the progress of the robust iterations. The trace of the covariance matrix gives a measure of
what is happening at each iteration. When this value stabilizes, the program has converged. The percent change is
reported to let you determine how much the trace has changed. In this particular example, we see very little
change between iterations five and six. We would feel comfortable stopping at this point. A look at the
Descriptive Statistics section will let you see how much the means and standard deviations have changed.
A look at the Residual Section will let you see the robust weights that are assigned to each row. Those weights
that are near zero indicate observations whose influence has been removed by the robust procedure.
Correlation Section
            Variables
Variables   X1         X2         X3         X4         X5         X6
X1          1.000000   0.271780   0.127016   0.881604   0.814686   0.484907
X2          0.271780   1.000000   0.988909   0.683206   0.778649   0.973093
X3          0.127016   0.988909   1.000000   0.568933   0.677480   0.928454
X4          0.881604   0.683206   0.568933   1.000000   0.986945   0.831949
X5          0.814686   0.778649   0.677480   0.986945   1.000000   0.901975
X6          0.484907   0.973093   0.928454   0.831949   0.901975   1.000000
Phi=0.769781 Log(Det|R|)=-29.547320 Bartlett Test=773.15 DF=15 Prob=0.000000
Bar Chart of Absolute Correlation Section
Variables   X6
X1          ||||||||||
X2          ||||||||||||||||||||
X3          |||||||||||||||||||
X4          |||||||||||||||||
X5          |||||||||||||||||||
X6
Phi=0.769781 Log(Det|R|)=-29.547320 Bartlett Test=773.15 DF=15 Prob=0.000000
This report gives the correlations, along with statistics that test the overall correlation structure in the data. In
this example, we notice several high correlation values. The Gleason-Staelin redundancy measure, phi, is 0.770,
which is quite large. There is apparently some correlation structure in this data set that can be modeled. If all the
correlations are small (say less than 0.3), there would be no need for a factor analysis.
Correlations
The simple correlations between each pair of variables. Note that using the missing value imputation or robust
estimation options will affect the correlations in this report. When the above options are not used, the correlations
are constructed from those observations having no missing values in any of the specified variables.
Phi
This is the Gleason-Staelin redundancy measure of how interrelated the variables are. A zero value of ϕ means
that there is no correlation among the variables, while a value of one indicates perfect correlation among the
variables. This coefficient may have a value less than 0.5 even when there is obvious structure in the data, so care
should be taken when using it. This statistic is especially useful for comparing two or more sets of data. The
formula for computing ϕ is:
$$\phi = \sqrt{\frac{\sum_{i=1}^{p} \sum_{j=1}^{p} r_{ij}^{2} - p}{p(p - 1)}}$$
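As a check on this measure, the sketch below recomputes phi from the correlation matrix printed in the Correlation Section. It takes phi to be the square root of the average squared off-diagonal correlation, a reading that agrees with the Phi value printed in the report. The function name `gleason_staelin_phi` is illustrative.

```python
import math

def gleason_staelin_phi(R):
    """Square root of the mean squared off-diagonal correlation."""
    p = len(R)
    sum_sq = sum(r * r for row in R for r in row)   # includes the p ones on the diagonal
    return math.sqrt((sum_sq - p) / (p * (p - 1)))

# Correlation matrix from the Correlation Section report (X1..X6).
R = [
    [1.000000, 0.271780, 0.127016, 0.881604, 0.814686, 0.484907],
    [0.271780, 1.000000, 0.988909, 0.683206, 0.778649, 0.973093],
    [0.127016, 0.988909, 1.000000, 0.568933, 0.677480, 0.928454],
    [0.881604, 0.683206, 0.568933, 1.000000, 0.986945, 0.831949],
    [0.814686, 0.778649, 0.677480, 0.986945, 1.000000, 0.901975],
    [0.484907, 0.973093, 0.928454, 0.831949, 0.901975, 1.000000],
]
phi = gleason_staelin_phi(R)
```

The result agrees with the Phi=0.769781 line of the report to within the rounding of the printed correlations.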
Log(Det|R|)
This is the log (base e) of the determinant of the correlation matrix. If you used the covariance matrix, this is the
log (base e) of the determinant of the covariance matrix.
Bartlett Test, df, Prob
This is Bartlett’s sphericity test (Bartlett, 1950) for testing the null hypothesis that the correlation matrix is an
identity matrix (all correlations are zero). If you get a probability (Prob) value greater than 0.05, you should not
perform a factor analysis on the data. The test is valid for large samples (N>150). It uses a Chi-square distribution
with p(p-1)/2 degrees of freedom. Note that this test is only available when you analyze a correlation matrix. The
formula for computing this test is:
$$\chi^{2} = \frac{(11 + 2p - 6N)}{6} \log_e |R|$$
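The statistic can be recomputed from the values printed in the report: the chi-square value is (11 + 2p - 6N)/6 times the natural log of the determinant of R, with p(p-1)/2 degrees of freedom. In the sketch below, p = 6 and the log determinant are taken from the report, while N = 30 is an assumption obtained by back-solving the printed statistic, since the sample size is not stated in this section. The log determinant is passed in directly rather than recomputed, because the rounded correlations are too close to singular to give a reliable determinant.

```python
def bartlett_sphericity(log_det_r, p, n):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    chi_square = (11 + 2 * p - 6 * n) * log_det_r / 6
    df = p * (p - 1) // 2
    return chi_square, df

# log_det_r and p are from the report; n = 30 is an inferred assumption.
chi_square, df = bartlett_sphericity(log_det_r=-29.547320, p=6, n=30)
```

With these inputs the function gives approximately 773.15 on 15 degrees of freedom, the values shown on the report's summary line.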
Bar Chart of Absolute Correlation Section
This chart graphically displays the absolute values of the correlations. It lets you quickly find high and low
correlations.
Eigenvalues Section
Eigenvalues after Varimax Rotation
Individual Cumulative
No. Eigenvalue Percent Percent Scree Plot
1 3.288191 54.89 54.89 |||||||||||
2 2.701207 45.09 99.99 ||||||||||
3 0.001207 0.02 100.01 |
4 -0.000099 0.00 100.01 |
5 -0.000121 0.00 100.00 |
6 -0.000295 0.00 100.00 |
Eigenvalues
The eigenvalues of the R-U matrix. Often, these are used to determine how many factors to retain. (In this
example, we would retain the first two factors.)
One rule-of-thumb is to retain those factors whose eigenvalues are greater than one. When the full correlation
matrix is analyzed, as in PCA, the sum of the eigenvalues is equal to the number of variables; for the R-U matrix
the sum is somewhat smaller (about 5.99 here, with six variables). Hence, in this example, the first factor retains
the information contained in roughly 3.3 of the original variables.
Note that, unlike in PCA where all eigenvalues are positive, the eigenvalues may be negative in factor analysis.
Usually, these factors would be discarded and the analysis would be re-run.
Individual and Cumulative Percents
The first column gives the percentage of the total variation in the variables accounted for by this factor. The
second column is the cumulative total of the percentage. Some authors suggest that the user pick a cumulative
percentage, such as 80% or 90%, and keep enough factors to attain this percentage.
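The percentage columns and the eigenvalue-greater-than-one rule can be reproduced from the eigenvalues in the report. The sketch below assumes that each percentage is taken relative to the sum of all six eigenvalues, which matches the printed table. The variable names are illustrative.

```python
# Eigenvalues from the Eigenvalues Section report.
eigenvalues = [3.288191, 2.701207, 0.001207, -0.000099, -0.000121, -0.000295]

total = sum(eigenvalues)
individual_pct = [100.0 * value / total for value in eigenvalues]

# Running total of the individual percentages.
cumulative_pct = []
running = 0.0
for pct in individual_pct:
    running += pct
    cumulative_pct.append(running)

# Rule of thumb: keep the factors whose eigenvalue exceeds one.
n_retained = sum(1 for value in eigenvalues if value > 1.0)
```

Here `n_retained` is 2, and the first two cumulative percentages round to 54.89 and 99.99, so both rules point to a two-factor solution for this example.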
Scree Plot
This is a rough bar plot of the eigenvalues. It enables you to quickly note the relative size of each eigenvalue.
Many authors recommend it as a method of determining how many factors to retain.
The word scree, first used by Cattell (1966), is usually defined as the rubble at the bottom of a cliff. When using
the scree plot, you must determine which eigenvalues form the “cliff” and which form the “rubble.” You keep the
factors that make up the cliff. Cattell and Jaspers (1967) suggest keeping those that make up the cliff plus the first
factor of the rubble.
Interpretation of the Example
The first question that we would ask is how many factors should be kept. The scree plot shows that the first two
factors are indeed the largest. The cumulative percentages show that the first two factors account for 99.99%
of the variation.
Eigenvectors Section
Eigenvectors after Varimax Rotation
Factors
Variables Factor1 Factor2
X1 -0.303444 -0.662220
X2 -0.416551 0.378018
X3 -0.382768 0.491154
X4 -0.428167 -0.317929
X5 -0.448606 -0.204840
X6 -0.450912 0.185189
Eigenvector
The eigenvectors of the R-U matrix.
Bar Chart of Absolute Eigenvectors
This chart graphically displays the absolute values of the eigenvectors. It lets you quickly interpret the eigenvector
structure. By looking at which variables correlate highly with a factor, you can determine what underlying
structure it might represent.
Factor Loadings
These are the correlations between the variables and factors.
Bar Chart of Absolute Factor Loadings
This chart graphically displays the absolute values of the factor loadings. It lets you quickly interpret the
correlation structure. By looking at which variables correlate highly with a factor, you can determine what
underlying structure it might represent.
Communality Section
Communality after Varimax Rotation
Factors
Variables Factor1 Factor2 Communality
X1 0.000397 0.997586 0.997983
X2 0.935999 0.063793 0.999791
X3 0.988109 0.011476 0.999585
X4 0.228883 0.763139 0.992023
X5 0.353958 0.646114 1.000072
X6 0.780845 0.219099 0.999944
Communality
The communality is the proportion of the variation of a variable that is accounted for by the factors that are
retained. It is similar to the R-Squared value that would be achieved if this variable were regressed on the retained
factors. This table value gives the amount added to the communality by each factor.
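The arithmetic behind the table is a row sum: each variable's communality is the total of the amounts contributed by the retained factors. The sketch below uses the Factor1 and Factor2 columns of the report; the dictionary layout and variable names are illustrative.

```python
# Per-factor contributions (Factor1, Factor2) from the Communality Section report.
contributions = {
    "X1": (0.000397, 0.997586),
    "X2": (0.935999, 0.063793),
    "X3": (0.988109, 0.011476),
    "X4": (0.228883, 0.763139),
    "X5": (0.353958, 0.646114),
    "X6": (0.780845, 0.219099),
}

# Communality of each variable: the sum of its per-factor contributions.
communalities = {name: sum(parts) for name, parts in contributions.items()}
```

The sums match the Communality column, including the value for X5 that is slightly above one.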
Bar Chart of Communalities
This chart graphically displays the values of the communalities.
Interpretation
This report is provided to summarize the factor structure. Variables with an absolute loading greater than the
amount set in the Minimum Loading option are listed under each factor. Using this report, you can quickly see
which variables are related to each factor. Note that it is possible for a variable to have high loadings on several
factors, although varimax rotation makes this very unlikely.
Score Coefficients
These are the coefficients that are used to form the factor scores. The factor scores are the values of the factors for
a particular row of data. These score coefficients are similar to the eigenvectors. They have been scaled so that the
scores produced have a variance of one rather than a variance equal to the eigenvalue. This causes each of the
factors to have the same variance.
You would use these coefficients if you wanted to calculate the factor scores for new rows not included in your
original analysis.
Factor1 - Factor2
The factor scores are the values of the factors for a particular row of data. They have been scaled so they have a
variance of one.
This set of plots shows each factor plotted against every other factor.
This set of plots shows each of the factor loading columns plotted against each other.