Important Matrices For Multivariate Analysis
The most important matrix for any statistical procedure is the data matrix. The observations form the rows of the data matrix and the variables form the columns. The most important requirement for the data matrix is that its rows be statistically independent. That is, if we pick any single row of the data matrix, then we should not be able to predict any other row in the matrix. Practically speaking, statistical independence is guaranteed when each row of the matrix is an independent observation.

To illustrate the data matrix and the other important matrices in this section, let us consider a simple example. Sister Sal of the Benevolent Beatific Bounty of Saints Boniface and Bridget was the sixth grade teacher at The Most Sacred Kidney of Saint Zepherinus School. During her tenure there Sister Sal not only kept track of the students' grades but also wrote down her own rating of the chances that a student would eventually grow up to become an ax murderer. Below is a table of five of Sister Sal's students, their age in sixth grade, Sister Sal's ax murder rating, and their scores as adults on the Psychopathic-deviate scale of the Minnesota Multiphasic Personality Inventory (MMPI Pd).

Table 1.1. Follow up of Sister Sal's sixth grade class.

Student       Age   Rating   MMPI Pd
Abernathy      10      3        38
Beulah         12      4        34
Cutworth       20     10        74
Dinwitty       10      1        40
Euthanasia      8      7        64
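To make the data matrix concrete, here is a minimal sketch in Python with NumPy that builds the 5 x 3 data matrix from Table 1.1. The variable names (X, N, xbar) are our own bookkeeping for later snippets, not part of the original example.

    import numpy as np

    # Rows are students (independent observations); columns are the variables
    # Age, Rating, and MMPI Pd, in the order they appear in Table 1.1.
    X = np.array([
        [10,  3, 38],   # Abernathy
        [12,  4, 34],   # Beulah
        [20, 10, 74],   # Cutworth
        [10,  1, 40],   # Dinwitty
        [ 8,  7, 64],   # Euthanasia
    ], dtype=float)

    N = X.shape[0]          # sample size: 5 students
    xbar = X.mean(axis=0)   # column (variable) means: [12., 5., 50.]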
There is a second formula that expresses the CSSCP matrix in terms of the raw data matrix. This formula is used mostly for computational reasons. Let \(\bar{\mathbf{x}}\) denote a column vector of means and let \(N\) denote sample size. Then

\[
\mathrm{CSSCP} = \mathbf{X}^{t}\mathbf{X} - N\,\bar{\mathbf{x}}\bar{\mathbf{x}}^{t}.
\]

For the present example,

\[
\mathrm{CSSCP} =
\begin{pmatrix}
10 & 12 & 20 & 10 & 8 \\
 3 &  4 & 10 &  1 & 7 \\
38 & 34 & 74 & 40 & 64
\end{pmatrix}
\begin{pmatrix}
10 &  3 & 38 \\
12 &  4 & 34 \\
20 & 10 & 74 \\
10 &  1 & 40 \\
 8 &  7 & 64
\end{pmatrix}
- 5
\begin{pmatrix}
12 \\ 5 \\ 50
\end{pmatrix}
\begin{pmatrix}
12 & 5 & 50
\end{pmatrix}.
\]
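As a quick check on the arithmetic, the sketch below (continuing with the X, N, and xbar defined earlier) computes the CSSCP matrix with the computational formula and compares it with the definitional formula based on deviation scores.

    # Computational formula: CSSCP = X'X - N * xbar xbar'
    csscp = X.T @ X - N * np.outer(xbar, xbar)

    # Definitional formula using the matrix of deviation scores
    D = X - xbar
    csscp_check = D.T @ D

    print(csscp)
    # [[  88.   44.  180.]
    #  [  44.   50.  228.]
    #  [ 180.  228. 1272.]]
    print(np.allclose(csscp, csscp_check))   # True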
Covariance Matrix
A covariance matrix is a symmetric matrix in which each diagonal element equals the variance of a variable and each off-diagonal element is the covariance between the row variable and the column variable. The definition of the variance for variable X is

\[
V_X = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}{N-1}.
\]

The definition of a covariance between two variables, X and Y, is

\[
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{N}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{N-1}.
\]

You should verify that the covariance of a variable with itself equals the variance of the variable.

A covariance is a statistic that measures the extent to which two variables "vary together" or "covary." Covariances have two properties--magnitude and sign. Covariances that are close to 0, relative to the scale of measurement of the two variables, imply that the two variables are not related--i.e., one cannot predict scores on one variable by knowing scores on the other variable. Covariances that are large (either large and positive or large and negative) relative to the measurement scale of the variables indicate that the variables are related. In this case, one can predict scores on one variable from knowledge of scores on the other variable.

The sign of a covariance denotes the direction of the relationship. A positive covariance signifies a direct relationship: high scores on one variable are associated with high scores on the other variable, and conversely low scores on one variable are associated with low scores on the other variable. A negative covariance denotes an inverse relationship: high scores on one variable predict low scores on the other variable, and conversely low scores on the first variable are associated with high scores on the second variable. For example, the covariance between amount of time spent studying and grades would be positive, while the covariance between amount of time spent partying and grades would be negative.
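Continuing the NumPy sketch, the snippet below verifies numerically that the covariance of a variable with itself equals its variance, using the Age column from Table 1.1 (the column index and helper function are our own, introduced only for illustration).

    age = X[:, 0]   # Age column

    def cov(x, y):
        # Sample covariance with the N - 1 denominator used in the text.
        n = len(x)
        return np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

    print(cov(age, age))        # 22.0
    print(np.var(age, ddof=1))  # 22.0 -- cov(X, X) equals the variance of X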
In matrix terms, a covariance matrix equals the corrected sums of squares and cross products matrix in which each element is divided by (N - 1). Let C denote the covariance matrix and let D denote the matrix of deviation scores (each variable expressed as deviations from its mean). Then

\[
\mathbf{C} = \frac{1}{N-1}\,\mathrm{CSSCP} = \frac{1}{N-1}\,\mathbf{D}^{t}\mathbf{D}.
\]

For the present example,

\[
\mathbf{C} = \frac{1}{4}
\begin{pmatrix}
 88 &  44 &  180 \\
 44 &  50 &  228 \\
180 & 228 & 1272
\end{pmatrix}
=
\begin{pmatrix}
22 & 11   & 45 \\
11 & 12.5 & 57 \\
45 & 57   & 318
\end{pmatrix}.
\]
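The covariance matrix can be checked in one line against NumPy's built-in estimator; this simply continues the earlier sketch with the csscp computed above.

    C = csscp / (N - 1)
    print(C)
    # [[ 22.   11.   45. ]
    #  [ 11.   12.5  57. ]
    #  [ 45.   57.  318. ]]

    # np.cov with rowvar=False treats columns as variables and uses the same
    # N - 1 denominator, so it reproduces C exactly.
    print(np.allclose(C, np.cov(X, rowvar=False)))   # True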
Correlation Matrix
A correlation matrix is a special type of covariance matrix. A correlation matrix is a covariance matrix that has been calculated on variables that have previously been standardized to have a mean of 0 and a standard deviation of 1.0. Many texts refer to variables standardized in this way as Z scores. The generic formula for a correlation coefficient between variables X and Y is

\[
\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{s_X s_Y}
\]

where \(s_X\) and \(s_Y\) are, respectively, the standard deviations of variables X and Y.

Because a correlation is a specific form of a covariance, it has the same two properties--magnitude and sign--as a covariance. The sign indicates the direction of the relationship: positive correlations imply a direct relationship, and negative correlations imply an inverse relationship. Similarly, correlations close to 0 denote no statistical association or predictability between the two variables, while correlations that deviate from 0 in either direction (positive or negative) indicate stronger statistical associations and predictability.

The correlation coefficient has one important property that distinguishes it from other types of covariances: it has a mathematical lower bound of -1.0 and an upper bound of 1.0. This property permits correlation coefficients to be compared, while ordinary covariances usually cannot be compared. For example, if X and Y correlate .86 while X and Z correlate .32, then we can conclude that X is more strongly related to Y than to Z. However, if variables A and B have a covariance of 103.6 while A and C have a covariance of 12.8, then we cannot conclude anything about the relative magnitude of the relationships. The reason is that the magnitude of a covariance depends upon the measurement scale of the variables. If the measurement scale for variable C has a much lower variance than that for variable B, then A might actually be more strongly related to C than to B. The correlation coefficient avoids this interpretive problem by placing all variables on the same measurement scale--the Z score, with a mean of 0 and a standard deviation of 1.0.

The formula for a correlation matrix may also be written in matrix algebra. Let S denote a diagonal matrix of standard deviations. That is, the standard deviation for the first variable is on the first diagonal element, that for the second variable is on the second diagonal element, and so on. All off-diagonal elements are 0. Matrix S may be easily computed from the covariance matrix, C, by
taking the square root of the diagonal elements and setting all off-diagonal elements to 0. For the present example,

\[
\mathbf{S} =
\begin{pmatrix}
\sqrt{22} & 0 & 0 \\
0 & \sqrt{12.5} & 0 \\
0 & 0 & \sqrt{318}
\end{pmatrix}
=
\begin{pmatrix}
4.69 & 0 & 0 \\
0 & 3.54 & 0 \\
0 & 0 & 17.83
\end{pmatrix}.
\]

Let R denote the correlation matrix. Then the general formula for R is

\[
\mathbf{R} = \mathbf{S}^{-1}\mathbf{C}\,\mathbf{S}^{-1}.
\]

Although we do not need to know how to compute an inverse, the inverse of a diagonal matrix is quite easy to calculate--simply take the inverse of each diagonal element. For example,

\[
\mathbf{S}^{-1} =
\begin{pmatrix}
\frac{1}{4.69} & 0 & 0 \\
0 & \frac{1}{3.54} & 0 \\
0 & 0 & \frac{1}{17.83}
\end{pmatrix}.
\]

Consequently, the correlation matrix for the Sister Sal data is

\[
\mathbf{R} =
\begin{pmatrix}
\frac{1}{4.69} & 0 & 0 \\
0 & \frac{1}{3.54} & 0 \\
0 & 0 & \frac{1}{17.83}
\end{pmatrix}
\begin{pmatrix}
22 & 11 & 45 \\
11 & 12.5 & 57 \\
45 & 57 & 318
\end{pmatrix}
\begin{pmatrix}
\frac{1}{4.69} & 0 & 0 \\
0 & \frac{1}{3.54} & 0 \\
0 & 0 & \frac{1}{17.83}
\end{pmatrix}
=
\begin{pmatrix}
1.000 & .663 & .538 \\
.663 & 1.000 & .904 \\
.538 & .904 & 1.000
\end{pmatrix}.
\]
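A short continuation of the NumPy sketch reproduces R = S^-1 C S^-1 from the covariance matrix C computed earlier and compares the result with NumPy's correlation routine.

    # S^-1 is diagonal, so its elements are the reciprocals of the
    # standard deviations taken from the diagonal of C.
    S_inv = np.diag(1.0 / np.sqrt(np.diag(C)))
    R = S_inv @ C @ S_inv

    print(np.round(R, 3))
    # [[1.    0.663 0.538]
    #  [0.663 1.    0.904]
    #  [0.538 0.904 1.   ]]

    print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True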
If we plotted the scores on a single variable as dots along a number line, the mean would tell us where the dots are centered and the standard deviation would inform us about how spread out the dots are around the mean. Because classic multivariate analysis involves three types of matrices, it is important for us to take time to reflect on the meaning of these three matrices. Let us do that now.

A matrix or vector of means tells us where variables are located along the number lines in a multidimensional space. For example, a vector consisting of two means tells us where the "dots" in a scatterplot are centered, and a vector of three means tells us where the dots in a three dimensional space are centered. In techniques such as MANOVA, the multivariate analysis of variance, we will often compare a vector of means for one group to a vector from another group. Effectively this comparison is equivalent to asking whether the "dots" for one group are centered in the same place as the "dots" for the other group.

The matrix of standard deviations is a measure of the extent to which the dots in space are spread out around their center. If variable X has a standard deviation of 12, then we should expect most of the "dots" within + or - 24 units (two standard deviations) of the center of X and relatively few dots beyond 24 units away from the center of X. In many cases, it is convenient to think of standard deviations as "scaling factors" for the variables, analogous to currency conversions. For example, if variable X has a standard deviation of 2 and variable Y has a standard deviation of 5, then one unit of X is "worth" .4 units of Y and one unit of Y is "worth" 2.5 units of X.

Finally, the correlation matrix expresses the geometric shape of the dots in hyperspace when each variable is measured on the same scale (i.e., each variable has a standard deviation of 1.0). Specifically, the correlation matrix informs us about the extent to which the dots are spherical or elliptical in various dimensions. For example, the correlation matrix for two variables tells us whether the dots in a scattergram are circular (when the correlation is close to 0), elliptical (when the correlation differs from 0 but is not close to 1.0 or -1.0), or approach forming a straight line (when the correlation approaches 1.0 or -1.0). The correlation matrix also indicates the orientation of the dots. For example, a positive correlation for two variables implies that the dots are oriented from the "southwest towards the northeast," while a negative correlation denotes that the dots run from the "northwest towards the southeast."

To summarize, classic multivariate analysis uses summary statistics to inform us about three properties of the data points in hyperspace. The first property is location and is summarized by the means. The second property is spread or scale and is summarized by the standard deviations. The third property is shape and is summarized by the correlations.

A final comment is in order. We have seen how a covariance matrix is a function of the standard deviations and the correlation matrix. Consequently, we could logically conclude that only two types of matrices are needed for classic multivariate analysis--matrices of means and covariance matrices. The means would inform us about location and the covariance matrix would inform us about the spread and shape of the dots in hyperspace. Indeed, this is true, and many multivariate techniques are expressed in just this way. However, it is much easier for us humans to think in terms of correlations than it is in terms of covariances.
So it is best for us to conceptualize multivariate analysis in terms of standard deviations and correlations instead of covariances.
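The point that a covariance matrix is a function of the standard deviations and the correlation matrix can be illustrated by reversing the earlier computation; this sketch simply reconstructs C as S R S from the matrices computed in the previous snippets.

    # Scale (S) and shape (R) together recover the covariance matrix: C = S R S.
    S = np.diag(np.sqrt(np.diag(C)))
    print(np.allclose(S @ R @ S, C))   # True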