Multicollinearity

[Figure: overlapping circles for Y, X1, and X2]

ASIF TARIQ
Overview of the presentation
• What is the nature of multicollinearity?
• Is multicollinearity really a problem?
• What are the sources of multicollinearity?
• What are its practical consequences?
• How does one detect it?
• What remedial measures can be taken to alleviate the problem of multicollinearity?
CLRM Assumption
• One of the assumptions of the classical linear regression model (CLRM) is that there is
no multicollinearity among the regressors included in the regression model.
• That is, there is no exact linear relationship among explanatory variables, Xs, included in
a multiple regression.
What is multicollinearity?
• The term multicollinearity was introduced by Ragnar Frisch in 1934.
• Strictly speaking, collinearity refers to the existence of a single linear relationship, and multicollinearity
refers to the existence of more than one exact linear relationship. But this distinction is rarely
maintained in practice, and multicollinearity refers to both cases.
• Originally it meant the existence of a “perfect” or “exact” linear relationship among some or all
explanatory variables of a regression model.
• Today, however, the term multicollinearity is used in a broader sense to include the case of perfect
multicollinearity, as well as the case where the X variables are intercorrelated but not perfectly
(imperfect multicollinearity or near multicollinearity).
Variants of Multicollinearity
• Structural: occurs when a new regressor is generated from one or more of the existing regressors, e.g.
      Yi = b0 + b1X1i + b2X1i² + ui
• Perfect multicollinearity: for the k-variable regression involving the explanatory variables X1, X2, . . . , Xk,
an exact linear relationship is said to exist if the following condition is satisfied:
      λ1X1i + λ2X2i + · · · + λkXki = 0        (1)
where λ1, λ2, . . . , λk are constants such that not all of them are zero simultaneously.
• Assuming λ2 ≠ 0, Eq. (1) can be rewritten as:
      X2i = −(λ1/λ2)X1i − (λ3/λ2)X3i − · · · − (λk/λ2)Xki        (2)
• This shows how X2 is exactly linearly related to the other variables, i.e., how it can be derived from a linear
combination of the other X variables.
• In this situation, the coefficient of correlation between the variable X2 and the linear combination on the
right side of Eq. (2) is bound to be unity.
Imperfect multicollinearity
• For the k-variable regression involving the explanatory variables X1, X2, . . . , Xk, an imperfect (near) linear
relationship is said to exist if the following condition is satisfied:
      λ1X1i + λ2X2i + · · · + λkXki + vi = 0        (3)
where vi is a stochastic error term.
• Assuming λ2 ≠ 0, Eq. (3) can be rewritten as:
      X2i = −(λ1/λ2)X1i − (λ3/λ2)X3i − · · · − (λk/λ2)Xki − (1/λ2)vi        (4)
• This shows that X2 is not an exact linear combination of the other X’s, because it is also determined by the
stochastic error term vi.
A numerical example
      X2      X3      X3*
      10      50      52
      15      75      75
      18      90      97
      24     120     129
      30     150     152
• As we see from the table, X3i = 5X2i, so there is perfect collinearity between X2 and X3, i.e., r23 = 1.
• The variable X3* was created from X3 by simply adding to it the following random numbers: 2, 0, 7, 9, 2.
• Now there is no longer perfect collinearity between X2 and X3*, i.e., X3i* = 5X2i + vi.
• However, the two variables are highly correlated, because r23* = 0.9959.
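• The correlations above can be verified with a short Python sketch (illustrative; numpy assumed to be available):

```python
import numpy as np

# Data from the table above
X2      = np.array([10, 15, 18, 24, 30])
X3      = 5 * X2                          # perfect collinearity: X3 = 5*X2
X3_star = X3 + np.array([2, 0, 7, 9, 2])  # add the random numbers 2, 0, 7, 9, 2

# Pairwise (zero-order) correlation coefficients
r_23      = np.corrcoef(X2, X3)[0, 1]       # exactly 1: perfect collinearity
r_23_star = np.corrcoef(X2, X3_star)[0, 1]  # about 0.9959: near collinearity

print(f"r(X2, X3)  = {r_23:.4f}")
print(f"r(X2, X3*) = {r_23_star:.4f}")
```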
Why is multicollinearity a problem?
• Why does the classical linear regression model assume that there is no multicollinearity among the X’s?
• If multicollinearity is perfect in the sense of Eq. (1), the regression coefficients of the X variables are
indeterminate and their standard errors are infinite.
• If multicollinearity is less than perfect, as in Eq. (3), the regression coefficients, although
determinate, possess large standard errors (in relation to the coefficients themselves), which
means the coefficients cannot be estimated with great precision or accuracy.
Sources of multicollinearity
1. Tendency of economic variables to move together over time. Economic magnitudes are influenced by the
same factors and in consequence once these determining factors become operative, the economic variables
show the same broad pattern of behaviour over time.
• For example, income, consumption, savings, investment, prices, employment, tend to rise in periods
of economic expansion and decrease in periods of recession (common in time series data).
2. The use of lagged values of some explanatory variables as separate independent variables in the
relationship.
• For example, GDP may depend upon investment in the current year as well as investment in previous years.
3. Constraints on the model or in the population being sampled.
• For example, in the regression of electricity consumption (Y) on income (X2) and house size (X3),
there is a physical constraint in the population in that families with higher incomes generally
have larger homes than families with lower incomes.
Sources of multicollinearity
4. An overdetermined model. This happens when the model has more explanatory variables than the
number of observations, i.e., when n < k.
• This could happen in medical research where there may be a small number of patients about
whom information is collected on a large number of variables.
5. Model specification. For example, adding polynomial terms to a regression model, especially when the
range of the X variable is small.
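• A minimal numpy sketch of the overdetermined case (with made-up numbers): when n < k, the X'X matrix cannot have full rank, so the normal equations have no unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 5, 8                       # fewer observations (n) than regressors (k)
X = rng.normal(size=(n, k))       # hypothetical 5 x 8 design matrix

XtX = X.T @ X                     # 8 x 8, but its rank is at most n = 5
print("rank of X'X:", np.linalg.matrix_rank(XtX))          # 5, not 8
print("X'X invertible?", np.linalg.matrix_rank(XtX) == k)  # False -> no unique OLS solution
```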
Estimation in presence of perfect multicollinearity
• It was stated previously that in the case of perfect multicollinearity the regression coefficients remain
indeterminate and their standard errors are infinite.
• Using the deviation form, we can write the three-variable regression model as:
      yi = β̂1x1i + β̂2x2i + ûi        (5)
• The OLS estimator of β1 is:
      β̂1 = (Σyix1i Σx2i² − Σyix2i Σx1ix2i) / (Σx1i² Σx2i² − (Σx1ix2i)²)
• Suppose X2i = λX1i, where λ is a nonzero constant. Substituting x2i = λx1i into the expression above makes
both the numerator and the denominator equal to zero, giving 0/0, which is an indeterminate expression.
Note that in a similar fashion β̂2 is also indeterminate.
Estimation in presence of perfect multicollinearity
• What it means, then, is that there is no way of disentangling the separate influences of X1 and X2
from the given sample.
• In applied econometrics this problem is most damaging since the entire intent is to separate the
partial effects of each X upon the dependent variable.
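• The same point can be seen numerically. In the hypothetical sketch below, X2 = 2·X1 exactly, so X'X is rank-deficient and only the combined effect of X1 and X2 is identified:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

x1 = rng.normal(size=n)
x2 = 2.0 * x1                               # perfect collinearity: X2 is an exact multiple of X1
y  = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
XtX = X.T @ X

# X'X has rank 2 instead of 3, so (X'X)^(-1) does not exist and the separate
# influences of X1 and X2 cannot be disentangled from this sample.
print("rank of X'X:", np.linalg.matrix_rank(XtX), "out of", XtX.shape[0])

# lstsq still returns *a* solution, but it is only one of infinitely many;
# only the combination b1 + 2*b2 is identified.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("reported rank:", rank, "| one of the many solutions:", beta.round(3))
```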
Theoretical consequences
• Even if multicollinearity is very high, as in the case of near multicollinearity, the OLS estimators still
retain the property of BLUE.
• First, it is true that even in the case of near multicollinearity the OLS estimators are unbiased, because
unbiasedness is a multisample or repeated-sampling property.
• This means that, keeping the X values fixed in repeated sampling, E(β̂) = β as the number of samples increases.
• Second, it is also true that collinearity does not destroy the property of minimum variance: In the
class of all linear unbiased estimators, the OLS estimators have minimum variance; that is, they are
efficient. But this does not mean that the variance of an OLS estimator will necessarily be small (in
relation to the value of the estimator) in any given sample.
Practical consequences
1) Although BLUE, the OLS estimators have large variances and standard errors, making precise
estimation difficult.
2) Because of consequence 1, the confidence intervals tend to be much wider, leading to the acceptance
of the “zero null hypothesis” (i.e., the true population coefficient is zero) more readily*.
3) Also because of consequence 1, the t-ratio of one or more coefficients tends to be statistically
insignificant.
4) Although the t ratio of one or more coefficients is statistically insignificant, R2 can be very high.
5) The OLS estimators and their standard errors can be sensitive to small changes in the data.
      var(β̂1) = σu² / [Σx1i²(1 − r12²)]        (9)

      var(β̂2) = σu² / [Σx2i²(1 − r12²)]        (10)

where r12 is the coefficient of correlation between X1 and X2.
• It is apparent from Eqs. (9) and (10) that as r12 tends toward 1, that is, as collinearity increases, the
variances of the two estimators increase and in the limit when r12 = 1, they are infinite.
Variance Inflating Factor (VIF)
• The speed with which variances increase can be seen with the variance-inflating factor (VIF), which is
defined as:
      VIF = 1 / (1 − r12²)        (11)
• The VIF shows how the variance of an estimator is inflated by the presence of multicollinearity. As r12²
approaches 1, the VIF approaches infinity. That is, as the extent of collinearity increases, the variance of
an estimator increases, and in the limit it can become infinite.
• Equation (11) is valid only for a regression model with two explanatory variables. With a few simple
changes we can generalize it to gain insight into collinearity in the more general multiple regression
model with k explanatory variables:
      VIF = 1 / (1 − Rj²)        (12)
where Rj² is the R² from the auxiliary regression of Xj on all the other explanatory variables in the model.
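• Eq. (12) can be computed directly by running the auxiliary regression of Xj on the remaining regressors. A minimal numpy sketch (the data below are simulated for illustration):

```python
import numpy as np

def vif(X, j):
    """VIF for column j of X, via the auxiliary regression of X_j on the
    remaining columns (plus an intercept), as in Eq. (12)."""
    y_aux = X[:, j]
    X_aux = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, _, _, _ = np.linalg.lstsq(X_aux, y_aux, rcond=None)
    resid = y_aux - X_aux @ beta
    r2_j = 1.0 - resid.var() / y_aux.var()   # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2_j)

# Simulated regressors: x3 is close to a linear combination of x1 and x2
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = 0.5 * x1 + 0.5 * x2 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    print(f"VIF(X{j + 1}) = {vif(X, j):.1f}")
```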
Variance and Variance Inflating Factor (VIF)
• As can be readily seen, if there is no collinearity between X1 and X2 (i.e., r12² = 0), the VIF will be 1.
• Using the VIF, Eqs. (9) and (10) can be rewritten as:
      var(β̂1) = (σu² / Σx1i²) · VIF        (13)

      var(β̂2) = (σu² / Σx2i²) · VIF        (14)
• Eqs. (13) and (14) show that the variances of β̂1 and β̂2 are directly proportional to the VIF.
How variance increases with multicollinearity
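• A short Monte Carlo sketch (illustrative; the data-generating process below is hypothetical): as the correlation between X1 and X2 rises toward 1, the sampling standard deviation of β̂1 grows sharply, while its mean stays near the true value.

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 2000
true_b1, true_b2 = 2.0, 3.0

for rho in (0.0, 0.5, 0.9, 0.99):
    estimates = []
    for _ in range(reps):
        # Draw X1, X2 with correlation rho, then Y from the true model
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 + true_b1 * x1 + true_b2 * x2 + rng.normal(size=n)

        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])                    # keep the estimate of b1

    estimates = np.array(estimates)
    print(f"rho={rho:4.2f}  mean(b1_hat)={estimates.mean():5.2f}  "
          f"sd(b1_hat)={estimates.std():5.2f}")
```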
Wider Confidence Intervals
• Because of the large standard errors, the confidence intervals for the relevant population parameters tend
to be larger.
• The confidence interval is β̂ ± t · SE(β̂), where [SE(β̂)]² = σ̂u² / Σx².
“Insignificant” t-ratios
• Recall that to test the null hypothesis that, say, β2 = 0, we use the t-ratio, that is, t = β̂2 / SE(β̂2), and compare
the estimated t-value with the critical t-value from the t-table.
• But as we have seen, in cases of high collinearity the estimated standard errors increase dramatically,
thereby making the t-values smaller. Therefore, in such cases, one will increasingly accept the null
hypothesis that the relevant true population value is zero*.
A High R2 but Few Significant t-ratios
• In cases of high collinearity, it is possible to find that one or more of the partial slope coefficients are
individually statistically insignificant on the basis of the t-test.
• Indeed, this is one of the signals of multicollinearity—insignificant t values but a high overall R2 (and
a significant F value).
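• This symptom is easy to reproduce. In the hedged sketch below (simulated data; statsmodels assumed to be available), the two regressors are nearly identical, the overall fit and F test are excellent, yet the individual t-ratios are typically insignificant:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 30

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.001, size=n)      # X2 is almost identical to X1
y  = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

print(f"R-squared     : {res.rsquared:.3f}")            # high
print(f"F-test p-value: {res.f_pvalue:.4f}")            # model strongly significant
print("slope p-values:", np.round(res.pvalues[1:], 3))  # typically both insignificant
```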
Detection
• As J. Kmenta puts it:
• Multicollinearity is a question of degree and not of kind. The meaningful distinction is not between
the presence and the absence of multicollinearity, but between its various degrees.
• Since multicollinearity refers to the condition of the explanatory variables that are assumed to be
nonstochastic, it is a feature of the sample and not of the population.
• We do not have one unique method of detecting it or measuring its strength. What we have are some rules
of thumb, some informal and some formal methods.
Detection
1. High R2 but few significant t ratios:
• If R2 is high, say, more than 0.8, the F test in most cases will reject the hypothesis that the partial
slope coefficients are simultaneously equal to zero, but the individual t-tests will show that none or
very few of the partial slope coefficients are statistically different from zero.
2. High pair-wise correlations among regressors:
• Another suggested rule of thumb is that if the pair-wise or zero-order correlation coefficient
between two regressors is high, say, in excess of 0.8, then multicollinearity is a serious problem.
• The problem with this criterion is that, in models involving more than two explanatory variables,
the simple or zero-order correlations will not provide a foolproof guide to the presence of
multicollinearity.
• Of course, if there are only two explanatory variables, the zero-order correlation will suffice.
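• A quick way to apply this rule of thumb (a sketch with simulated regressors; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)   # highly correlated with x1
x3 = rng.normal(size=100)                    # unrelated regressor
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)          # zero-order correlation matrix
print(np.round(corr, 2))

# Flag pairs whose correlation exceeds 0.8 in absolute value
k = corr.shape[0]
for i in range(k):
    for j in range(i + 1, k):
        if abs(corr[i, j]) > 0.8:
            print(f"X{i + 1} and X{j + 1}: |r| = {abs(corr[i, j]):.2f} (above the 0.8 rule of thumb)")
```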
Detection
3. Examination of partial correlations:
• Because of the problem just mentioned in relying on zero-order correlations, Farrar and Glauber have
suggested that one should look at the partial correlation coefficients.
4. Variance inflation factor (VIF):
      VIF = 1 / (1 − Rj²)        (in the case of more than two regressors)
• As Rj² increases toward unity, that is, as the collinearity of Xj with the other regressors increases, the VIF
also increases, and in the limit it can be infinite.
• Rule of thumb: when VIF ≥ 10 (i.e., Rj² ≥ 0.9), there is severe multicollinearity involving the jth
explanatory variable.
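• In practice the VIFs can be obtained directly, for example with statsmodels' variance_inflation_factor; the sketch below uses simulated data and the VIF ≥ 10 cutoff from the rule of thumb above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(11)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)   # nearly a linear combination of x1 and x2

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # include the intercept

# VIF for each explanatory variable (skip column 0, the constant)
for j in range(1, X.shape[1]):
    v = variance_inflation_factor(X, j)
    flag = "  <-- severe (VIF >= 10)" if v >= 10 else ""
    print(f"VIF(X{j}) = {v:8.1f}{flag}")
```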
Detection
5. Tolerance (TOL): The inverse of the VIF is called tolerance (TOL). That is
      TOL = 1/VIF = 1 − r12²    or    TOL = 1/VIF = 1 − Rj²
• When r12² (or Rj²) = 1, i.e., perfect collinearity, TOL = 0; and when r12² (or Rj²) = 0, i.e., no collinearity,
TOL = 1.
• Because of the intimate connection between VIF and TOL, one can use them interchangeably.
• Obviously, the closer the value of TOL is to zero, the greater the degree of collinearity of the jth
explanatory variable with the other explanatory variables. On the other hand, the closer the value of TOL is to
1, the greater the evidence that the jth explanatory variable is not collinear with the other explanatory
variables.
• Rule of thumb: If the tolerance value is 0.10 or less, it indicates the presence of severe
multicollinearity.
Remedy
• What can be done if multicollinearity is serious? There are two broad choices:
• Do nothing, or
• Follow some rules of thumb, such as using a priori information, transforming the variables, dropping a
variable, or combining cross-sectional and time series data.
Do nothing
• The “do nothing” school of thought is expressed by Blanchard:
• “When students run their first ordinary least squares (OLS) regression, the first problem that they usually
encounter is that of multicollinearity. Many of them conclude that there is something wrong with OLS;
some resort to new and often creative techniques to get around the problem. But, we tell them, this is
wrong. Multicollinearity is God’s will, not a problem with OLS or statistical technique in general.”
• What Blanchard is saying is that multicollinearity is essentially a data deficiency problem, and sometimes
we have no choice over the data we have available for empirical analysis.
A priori information
• Suppose we consider the model:
Yi = b0 + b1X1i + b2X2i + ui
• As noted before, income and wealth variables tend to be highly collinear. But suppose a priori we believe
that β2 = 0.10β1. We can then run the following regression:
      Yi = b0 + b1X1i + 0.10b1X2i + ui = b0 + b1Xi + ui,    where Xi = X1i + 0.10X2i
• Once we obtain the estimate of β1, we can estimate β2 from the postulated relationship between β1 and β2.
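• A minimal sketch of how this restricted regression can be run (simulated income/wealth-style data; statsmodels assumed): the prior β2 = 0.10β1 is imposed by constructing the composite regressor Xi = X1i + 0.10X2i and estimating a single slope.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(21)
n = 100

x1 = rng.normal(size=n)                      # e.g. income
x2 = 10.0 * x1 + rng.normal(size=n)          # e.g. wealth, highly collinear with income
b1_true = 0.8
y = 2.0 + b1_true * x1 + 0.10 * b1_true * x2 + rng.normal(size=n)

# Impose the prior b2 = 0.10 * b1 via the composite regressor X = X1 + 0.10*X2
x_comb = x1 + 0.10 * x2
res = sm.OLS(y, sm.add_constant(x_comb)).fit()

b1_hat = res.params[1]            # estimate of b1
b2_hat = 0.10 * b1_hat            # recovered from the postulated relationship
print(f"b1_hat = {b1_hat:.3f}, b2_hat = {b2_hat:.3f}")
```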
Transformation of variables
• Suppose we have time series data on consumption expenditure, income, and wealth. One reason for high
multicollinearity between income and wealth in such data is that over time both the variables tend to move in the
same direction.
• First difference form: if the relation
      Yt = b0 + b1X1t + b2X2t + ut
holds at time t, it must also hold at time t − 1, because the origin of time is arbitrary anyway. Subtracting gives
the first difference form
      Yt − Yt−1 = b1(X1t − X1t−1) + b2(X2t − X2t−1) + vt,    where vt = ut − ut−1
which often reduces the severity of multicollinearity, because although the levels of X1 and X2 may be highly
correlated, their first differences need not be.
• Ratio transformation: consider the model
      Yt = b0 + b1X1t + b2X2t + ut        (15)
where Y is consumption expenditure in dollars, X1 is GDP, and X2 is total population. Since GDP and
population grow over time, they are likely to be correlated.
• One “solution” to this problem is to express the model on a per capita basis, i.e., by dividing Eq. (15) by X2t, to obtain:
      Yt/X2t = b0(1/X2t) + b1(X1t/X2t) + b2 + ut/X2t        (18)
• For instance, if the original disturbance term ut is serially uncorrelated, the error term vt obtained previously will
in most cases be serially correlated. Therefore, the remedy may be worse than the disease.
• Moreover, there is a loss of one observation due to the differencing procedure, and therefore the degrees of
freedom are reduced by one.
• Furthermore, the first differencing procedure may not be appropriate in cross-sectional data where there is no
logical ordering of the observations.
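• A short pandas sketch of the two transformations discussed above (the series below are hypothetical): the per capita form of Eq. (18) and the first difference form, which loses one observation.

```python
import pandas as pd

# Hypothetical annual data: consumption (Y), GDP (X1), population (X2)
df = pd.DataFrame({
    "Y":  [320.0, 345.0, 370.0, 410.0, 455.0],
    "X1": [500.0, 540.0, 585.0, 650.0, 720.0],
    "X2": [10.0, 10.2, 10.4, 10.7, 11.0],
})

# Per capita (ratio) transformation, as in Eq. (18)
per_capita = pd.DataFrame({
    "Y_per_cap":  df["Y"] / df["X2"],
    "inv_X2":     1.0 / df["X2"],
    "X1_per_cap": df["X1"] / df["X2"],
})

# First difference transformation; the first row becomes NaN (one observation lost)
first_diff = df.diff().dropna()

print(per_capita.round(2))
print(first_diff)
```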
Dropping a variable(s) and specification bias
• When faced with severe multicollinearity, one of the “simplest” remedies is to drop one of the collinear
variables. But in dropping a variable from the model we may be committing a specification bias or
specification error, which arises from an incorrect specification of the model used in the analysis.
• Thus, if economic theory says that income and wealth should both be included in the model explaining
consumption expenditure, dropping the wealth variable would constitute specification bias.
Combining cross-sectional and time series data
• A variant of the extraneous or a priori information technique is the combination of cross-sectional and
time series data, known as pooling the data.
Is Multicollinearity Necessarily Bad?
• It has been said that if the sole purpose of regression analysis is prediction or forecasting, then
multicollinearity is not a serious problem because the higher the R2, the better the prediction.
• But, if the objective of the analysis is not only prediction but also reliable estimation of the
parameters, serious multicollinearity will be a problem because we have seen that it leads to large
standard errors of the estimators.