What Is Multiple Linear Regression
The multiple linear regression model is yi = β0 + β1 xi,1 + β2 xi,2 + ... + βp-1 xi,p-1 + εi, for i = 1, ..., n.
Sometimes the dependent variable is also called the endogenous variable, prognostic variable, or regressand. The independent variables are also called exogenous variables, predictor variables, or regressors.
We have a regression with an intercept and the regressors HH SIZE and CUBED HH SIZE.
The population regression model is: y = β1 + β2 x2 + β3 x3 + u
It is assumed that the error u is independent with constant variance
(homoskedastic) - see EXCEL LIMITATIONS at the bottom.
We wish to estimate the regression line:
y = b1 + b2 x2 + b3 x3
The only change over one-variable regression is to include more than one column in
the Input X Range.
Note, however, that the regressors need to be in contiguous columns (here columns
B and C).
If this is not the case in the original data, then columns need to be copied to get the
regressors in contiguous columns.
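If you want to check the Excel results outside of Excel, the same fit can be reproduced with, for example, Python and statsmodels. The sketch below assumes your data sit in a hypothetical file regression_data.csv with columns named y, hh_size, and cubed_hh_size; those names and the file are placeholders, not part of this example, and the regressor columns do not need to be contiguous here.

```python
# A minimal sketch of the same regression fit using statsmodels.
# File name and column names below are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("regression_data.csv")       # hypothetical data file
X = data[["hh_size", "cubed_hh_size"]]          # regressors need not be contiguous
X = sm.add_constant(X)                          # adds the intercept column
y = data["y"]

model = sm.OLS(y, X).fit()
print(model.summary())                          # R-squared, overall F test, t-tests, 95% CIs
```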
Hitting OK we obtain the following output (explanations added in the right-hand column).

Regression Statistics                 Explanation
Multiple R           0.895828         R = square root of R2
R Square             0.802508         R2
Adjusted R Square    0.605016         Adjusted R2, used if more than one x variable
Standard Error
Observations         5

ANOVA table
              df     SS       MS       F        Significance F
Regression     2     1.6050   0.8025   4.0635   0.1975
Residual       2     0.3950   0.1975
Total          4     2.0
The ANOVA (analysis of variance) table splits the sum of squares into its
components.
Total sums of squares
= Residual (or error) sum of squares + Regression (or explained) sum of squares.
Thus Σi (yi - ybar)2 = Σi (yi - yhati)2 + Σi (yhati - ybar)2
where yhati is the value of yi predicted from the regression line
and ybar is the sample mean of y.
For example:
R2 = 1 - Residual SS / Total SS (general formula for R2)
= 1 - 0.3950 / 2.0
(from data in the ANOVA table)
= 0.8025
(which equals R2 given in the regression Statistics table).
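As a quick check of this arithmetic, the same calculation can be reproduced with a few lines of Python using the sums of squares from the ANOVA table above.

```python
# Reproducing R-squared from the ANOVA sums of squares reported above.
regression_ss = 1.6050
residual_ss = 0.3950
total_ss = regression_ss + residual_ss      # 2.0

r_squared = 1 - residual_ss / total_ss      # general formula: 1 - Residual SS / Total SS
print(round(r_squared, 4))                  # 0.8025, matching the Regression Statistics table
```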
The column labeled F gives the overall F-test of H0: β2 = 0 and β3 = 0 versus Ha: at least one of β2 and β3 does not equal zero.
Aside: Excel computes F as:
F = [Regression SS/(k-1)] / [Residual SS/(n-k)] = [1.6050/2] / [.39498/2] = 4.0635.
The column labeled significance F has the associated P-value.
Since 0.1975 > 0.05, we do not reject H0 at significance level 0.05.
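The same F statistic and its p-value ("Significance F") can also be computed outside Excel. The sketch below uses scipy together with the sums of squares from the ANOVA table, with n = 5 observations and k = 3 estimated coefficients as stated later on this page.

```python
# Overall F test from the ANOVA sums of squares, using scipy instead of Excel.
from scipy import stats

n, k = 5, 3
regression_ss, residual_ss = 1.6050, 0.3950

f_stat = (regression_ss / (k - 1)) / (residual_ss / (n - k))   # about 4.06 (table: 4.0635, from a less-rounded residual SS)
p_value = stats.f.sf(f_stat, k - 1, n - k)                     # about 0.1975
print(f_stat, p_value)
```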
              Coefficient   Lower 95%   Upper 95%
Intercept       0.89655      -2.3924      4.1855
HH SIZE         0.33647      -1.4823      2.1552
CUBED HH SIZE   0.00209      -0.0543      0.0585
Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE, and CUBED HH SIZE).
Then:
Column "t Stat" gives the computed t-statistic for H0: βj = 0 against Ha: βj ≠ 0.
This is the coefficient divided by the standard error. It is compared to a t with
(n-k) degrees of freedom where here n = 5 and k = 3.
Column "P-value" gives the p-value for test of H0: j = 0 against Ha: j 0..
This equals Pr{|t| > |t-Stat|}, where t is a t-distributed random variable with
n-k degrees of freedom and t-Stat is the computed value of the t-statistic
given in the previous column.
Note that this p-value is for a two-sided test. For a one-sided test divide this
p-value by 2 (also checking the sign of the t-Stat).
Columns "Lower 95%" and "Upper 95%" values define a 95% confidence
interval for βj.
Do not reject the null hypothesis at level .05 since the p-value is > 0.05.
We computed t = -1.569
The critical value is t_.025(2) = TINV(0.05,2) = 4.303. [Here n=5 and k=3 so
n-k=2].
So do not reject the null hypothesis at level .05, since |t| = |-1.569| = 1.569 < 4.303.
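The critical value and the two-sided p-value can also be obtained outside Excel. Here is a minimal scipy sketch using the degrees of freedom and the t statistic quoted above.

```python
# Critical value (Excel's TINV(0.05, 2)) and two-sided p-value via scipy.
from scipy import stats

df = 2                                    # n - k = 5 - 3
t_crit = stats.t.ppf(1 - 0.05 / 2, df)    # about 4.303, same as TINV(0.05, 2)
t_stat = -1.569                           # the computed t from above
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(t_crit, p_value)                    # |t| < t_crit, so do not reject at the 5% level
```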
For analyses beyond Excel's capabilities, more specialized software such as STATA, EVIEWS, SAS, LIMDEP, PC-TSP, ... is needed.
The models have similar "LINE" assumptions. The only real difference is that
whereas in simple linear regression we think of the distribution of errors at a fixed
value of the single predictor, with multiple linear regression we have to think of the
distribution of errors at a fixed set of values for all the predictors. All of the model
checking procedures we learned earlier are useful in the multiple linear regression
framework, although the process becomes more involved since we now have
multiple predictors. We'll explore this issue further in Lesson 7.
The use and interpretation of r2 (which we'll denote R2 in the context of
multiple linear regression) remains the same. However, with multiple linear
regression we can also make use of an "adjusted" R2 value, which is useful for
model building purposes. We'll explore this measure further in Lesson 10.
With a minor generalization of the degrees of freedom, we use t-tests and t-intervals for the regression slope coefficients to assess whether a predictor is significantly linearly related to the response, after controlling for the effects of all the other predictors in the model.
With a minor generalization of the degrees of freedom, we use prediction
intervals for predicting an individual response and confidence intervals for
estimating the mean response. We'll explore these further in Lesson 7.
Learning objectives and outcomes
Upon completion of this lesson, you should be able to do the following:
Some researchers (Colby, et al, 1987) wanted to find out if nestling bank swallows,
which live in underground burrows, also alter how they breathe. The researchers
conducted a randomized experiment on n = 120 nestling bank swallows. In an
underground burrow, they varied the percentage of oxygen at four different levels
(13%, 15%, 17%, and 19%) and the percentage of carbon dioxide at five different
levels (0%, 3%, 4.5%, 6%, and 9%). Under each of the resulting 5 × 4 = 20
experimental conditions, the researchers observed the total volume of air breathed
per minute for each of 6 nestling bank swallows. In this way, they obtained the
following data (babybirds.txt) on the n = 120 nestling bank swallows:
Response (y): percentage increase in "minute ventilation," (Vent), i.e., total
volume of air breathed per minute.
Potential predictor (x1): percentage of oxygen (O2) in the air the baby birds
breathe.
Potential predictor (x2): percentage of carbon dioxide (CO2) in the air the
baby birds breathe.
Here's a scatter plot matrix of the resulting data obtained by the researchers:
What does this particular scatter plot matrix tell us? Do you buy into the following
statements?
We assume that the εi have a normal distribution with mean 0 and constant variance σ2. These are the same assumptions that we used in simple regression with one x-variable.
The subscript i refers to the ith individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.
Estimates of the Model Parameters
The estimates of the coefficients are the values that minimize the sum of
squared errors for the sample. The exact formula for this is given in the next section
on matrix notation.
The letter b is used to represent a sample estimate of a coefficient.
Thus b0 is the sample estimate of β0, b1 is the sample estimate of β1, and so on.
MSE = SSE / (n - p) estimates σ2, the variance of the errors. In the formula, n = sample size, p = number of coefficients in the model (including the intercept), and SSE = sum of squared errors. Notice that for simple linear regression p = 2. Thus, we get the formula for MSE that we introduced in the context of one predictor.
S = √MSE estimates σ and is known as the regression standard error or the residual standard error.
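As a small illustration of these two formulas, here is a short Python sketch. The numbers plugged in at the end match the Excel example earlier on this page (SSE = 0.3950, n = 5, p = 3 coefficients), so MSE comes out to 0.1975, agreeing with the ANOVA table there.

```python
# A minimal sketch of MSE = SSE / (n - p) and S = sqrt(MSE).
import math

def regression_standard_error(sse: float, n: int, p: int) -> tuple[float, float]:
    """Return (MSE, S), where MSE = SSE / (n - p) and S = sqrt(MSE)."""
    mse = sse / (n - p)
    return mse, math.sqrt(mse)

mse, s = regression_standard_error(sse=0.3950, n=5, p=3)
print(mse, s)   # MSE = 0.1975
```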
In the case of two predictors, the estimated regression equation yields a
plane (as opposed to a line in the simple linear regression setting). For more than
two predictors, the estimated regression equation yields a hyperplane.
Interpretation of the Model Parameters
Each coefficient represents the change in the mean response, E(y), per
unit increase in the associated predictor variable when all the other predictors are
held constant.
For example, β1 represents the change in the mean response, E(y), per unit increase in x1 when x2, x3, ..., xp-1 are held constant.
The intercept term, β0, represents the mean response, E(y), when the predictors x1, x2, ..., xp-1 are all zero (which may or may not have any practical meaning).
Predicted Values and Residuals
A predicted value is calculated as ŷi = b0 + b1xi,1 + b2xi,2 + ... + bp-1xi,p-1, where the b values come from statistical software and the x-values are specified by us.
A residual (error) term is calculated as ei = yi - ŷi, the difference between an actual and a predicted value of y.
A plot of residuals versus predicted values ideally should resemble a
horizontal random band. Departures from this form indicate difficulties with the
model and/or data.
Other residual analyses can be done exactly as we did in simple regression.
For instance, we might wish to examine a normal probability plot (NPP) of the
residuals. Additional plots to consider are plots of residuals versus each x-variable
separately. This might help us identify sources of curvature or nonconstant variance.
We'll explore this further in Lesson 7.
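Here is a short sketch of how the fitted values, residuals, and the residuals-versus-fitted plot described above might be produced with statsmodels and matplotlib. The small x1, x2, and y arrays are made-up illustration data, not data from this lesson.

```python
# Sketch: fitted values, residuals, and a residuals-vs-fitted plot (made-up data).
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 6.8, 11.1, 10.7])

X = sm.add_constant(np.column_stack([x1, x2]))   # intercept plus two predictors
fit = sm.OLS(y, X).fit()

fitted = fit.fittedvalues        # the y-hat values
residuals = fit.resid            # e_i = y_i - y-hat_i

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()                       # ideally a horizontal random band around zero
```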
ANOVA Table
Source        df       SS      MS                      F
Regression    p - 1    SSR     MSR = SSR / (p - 1)     MSR / MSE
Error         n - p    SSE     MSE = SSE / (n - p)
Total         n - 1    SSTO
suggests x2 is not needed in a model with all the other predictors included. But,
this doesn't necessarily mean that both x1 and x2 are not needed in a model
with all the other predictors included. It may well turn out that we would do better
to omit either x1 or x2 from the model, but not both. How then do we
determine what to do? We'll explore this issue further in Lesson 6.
5.4 - A Matrix Formulation of the Multiple Regression Model
Note: This portion of the lesson is most important for those students who will
continue studying statistics after taking Stat 501. We will only rarely use the
material within the remainder of this course. It is, however, particularly important
for students who plan on taking Stat 502, 503, 504, or 505.
A matrix formulation of the multiple regression model
In the multiple regression setting, because of the potentially large number of
predictors, it is more efficient to use matrices to define the regression model and
the subsequent analyses. Here, we review basic matrix algebra, as well as learn
some of the more important multiple regression formulas in matrix form.
As always, let's start with the simple case first. Consider the following simple linear
regression function:
yi = β0 + β1xi + εi,   for i = 1, ..., n
If we actually let i = 1, ..., n, we see that we obtain n equations:
y1 = β0 + β1x1 + ε1
y2 = β0 + β1x2 + ε2
⋮
yn = β0 + β1xn + εn
Well, that's a pretty inefficient way of writing it all out! As you can see, there is a
pattern that emerges. By taking advantage of this pattern, we can instead formulate
the above simple linear regression function in matrix notation:
That is, instead of writing out the n equations, using matrix notation, our simple
linear regression function reduces to a short and simple statement:
Y = Xβ + ε
Now, what does this statement mean? Well, here's the answer:
X is an n × 2 matrix.
The Xβ in the regression function is an example of matrix multiplication. Now, there are some restrictions: you can't just multiply any two old matrices together. Two matrices can be multiplied
together only if the number of columns of the first matrix equals the number of
rows of the second matrix. Then, when you multiply the two matrices:
the number of rows of the resulting matrix equals the number of rows of the
first matrix, and
the number of columns of the resulting matrix equals the number of columns
of the second matrix.
For example, if A is a 2 × 3 matrix and B is a 3 × 5 matrix, then the matrix
multiplication AB is possible. The resulting matrix C = AB has 2 rows and 5
columns. That is, C is a 2 × 5 matrix. Note that the matrix multiplication BA is not
possible.
For another example, if X is an n × p matrix and β is a p × 1 column vector, then
the matrix multiplication Xβ is possible. The resulting matrix Xβ has n rows and 1
column. That is, Xβ is an n × 1 column vector.
Okay, now that we know when we can multiply two matrices together, how do we
do it? Here's the basic rule for multiplying A by B to get C = AB:
The entry in the ith row and jth column of C is the inner product (that is, element-by-element products added together) of the ith row of A with the jth column of B.
For example:
C = AB =
[ 1  9  7 ]   [ 3  2  1  5 ]   [ 90  101  106   88 ]
[ 8  1  2 ] × [ 5  4  7  3 ] = [ 41   38   27   59 ]
              [ 6  9  6  8 ]
That is, the entry in the first row and first column of C, denoted c11, is obtained
by:
c11 = 1(3) + 9(5) + 7(6) = 90
And, the entry in the first row and second column of C, denoted c12, is obtained
by:
c12 = 1(2) + 9(4) + 7(9) = 101
And, the entry in the second row and third column of C, denoted c23, is obtained
by:
c23 = 8(1) + 1(7) + 2(6) = 27
You might convince yourself that the remaining five elements of C have been
obtained correctly.
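You can also verify the whole product numerically, for instance with numpy:

```python
# Verifying the matrix multiplication example above with numpy.
import numpy as np

A = np.array([[1, 9, 7],
              [8, 1, 2]])
B = np.array([[3, 2, 1, 5],
              [5, 4, 7, 3],
              [6, 9, 6, 8]])

C = A @ B          # matrix multiplication
print(C)
# [[ 90 101 106  88]
#  [ 41  38  27  59]]
```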
Matrix addition
Recall that the Xβ + ε that appears in the regression function:
Y = Xβ + ε
is an example of matrix addition. Again, there are some restrictions: you can't just
add any two old matrices together. Two matrices can be added together
only if they have the same number of rows and columns. Then, to add two
matrices, simply add the corresponding elements of the two matrices. That is:
Add the entry in the first row, first column of the first matrix with the entry in
the first row, first column of the second matrix.
Add the entry in the first row, second column of the first matrix with the entry
in the first row, second column of the second matrix.
And, so on.
For example:
C = A + B =
[ 2  4  -1 ]   [ 7   5  2 ]   [  9  9   1 ]
[ 1  8   7 ] + [ 9  -3  1 ] = [ 10  5   8 ]
[ 3  5   6 ]   [ 2   1  8 ]   [  5  6  14 ]
That is, the entry in the first row and first column of C, denoted c11, is obtained
by:
c11 = 2 + 7 = 9
And, the entry in the first row and second column of C, denoted c12, is obtained
by:
c12 = 4 + 5 = 9
You might convince yourself that the remaining seven elements of C have been
obtained correctly.
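Again, a quick numpy check confirms the sum:

```python
# Verifying the matrix addition example above with numpy.
import numpy as np

A = np.array([[2, 4, -1],
              [1, 8,  7],
              [3, 5,  6]])
B = np.array([[7,  5, 2],
              [9, -3, 1],
              [2,  1, 8]])

print(A + B)       # element-by-element addition
# [[ 9  9  1]
#  [10  5  8]
#  [ 5  6 14]]
```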
That is, when you multiply a matrix by the identity, you get the same matrix back.
Definition of the inverse of a matrix
The inverse A-1 of a square (!!) matrix A is the unique matrix such that:
A-1 A = I = A A-1
That is, the inverse of A is the matrix A-1 that you have to multiply A by in order to
obtain the identity matrix I. Note that I am not just trying to be cute by including (!!)
in that first sentence. The inverse only exists for square matrices!
Now, finding inverses is a really messy venture. The good news is that we'll always
let computers find the inverses for us. In fact, we won't even know that Minitab is
finding inverses behind the scenes!
An example
Ugh! All of these definitions! Let's take a look at an example just to convince
ourselves that, yes, indeed the least squares estimates are obtained by the
following matrix formula:
b = [b0, b1, ..., bp-1]' = (X'X)-1 X'Y
Let's see if we can obtain the same answer using the above matrix formula. We
previously showed that:
X'X =
[ n       Σ xi  ]
[ Σ xi    Σ xi2 ]
where each sum runs over i = 1, ..., n.
Using the calculator function in Minitab, we can easily calculate some parts of this
formula:
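Outside Minitab, the same computation can be sketched with numpy. The x and y values below are hypothetical placeholders, not the data used in this example, and np.linalg.solve is used rather than explicitly forming the inverse, which is the numerically safer way to apply (X'X)-1 X'Y.

```python
# A sketch of b = (X'X)^(-1) X'Y for simple linear regression with numpy.
# The x and y values below are hypothetical illustration data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X = np.column_stack([np.ones_like(x), x])   # n x 2 design matrix: intercept column plus x
XtX = X.T @ X                               # the X'X matrix shown above
XtY = X.T @ y

b = np.linalg.solve(XtX, XtY)               # solves (X'X) b = X'Y without forming the inverse
print(b)                                    # [b0, b1]
```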
are linearly dependent, since (at least) one of the columns can be written as a
linear combination of another; namely, the third column is 4 × the first column. If
none of the columns can be written as a linear combination of the other columns,
then we say the columns are linearly independent.
Unfortunately, linear dependence is not always obvious. For example, the columns
in the following matrix A:
A =
[ 1  4  1 ]
[ 2  3  1 ]
[ 3  2  1 ]
are linearly dependent, because the first column plus the second column equals 5 × the third column.
Now, why should we care about linear dependence? Because the inverse of a
square matrix exists only if the columns are linearly independent. Since the vector
of regression estimates b depends on (X'X)-1, the parameter estimates b0, b1, and so
on cannot be uniquely determined if some of the columns of X are linearly
dependent! That is, if the columns of your X matrix (that is, two or more of your
predictor variables) are linearly dependent, or nearly so, you will run into trouble
when trying to estimate the regression equation.
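Here is a tiny numeric illustration of that trouble, with made-up values: when one column of X is an exact multiple of another, X'X is rank-deficient, so its inverse, and hence b, cannot be computed.

```python
# Hypothetical illustration: linearly dependent columns make X'X singular,
# so (X'X)^(-1) does not exist and b cannot be uniquely determined.
import numpy as np

x1 = np.array([4.0, 4.5, 5.0, 5.5, 6.0])
x2 = 2 * x1                                          # perfectly correlated with x1
X = np.column_stack([np.ones_like(x1), x1, x2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))                    # 2, not 3: rank-deficient
try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError as err:
    print("Cannot invert X'X:", err)                 # singular matrix
```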
For example, suppose for some strange reason we multiplied the predictor
variable soap by 2 in the dataset soapsuds.txt. That is, we'd have two predictor
variables, say soap1 (which is the original soap) and soap2 (which is 2 × the original
soap):
In short, the first moral of the story is "don't collect your data in such a way that the
predictor variables are perfectly correlated." And, the second moral of the story is "if
your software package reports an error message concerning high correlation among
your predictor variables, then think about linear dependence and how to get rid of
it."