Lecture Notes 4 to 7: OLS
In this lecture, we rewrite the multiple regression model in matrix form. A general
multiple regression model can be written as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i \quad \text{for } i = 1,\ldots,n.$$

In matrix form,

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}
+ \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$

$$Y = X\beta + u$$
We want to estimate β .
The strategy of the least squares approach is the same as in the bivariate linear
regression model. First, we calculate the sum of squared residuals and, second, find the set
of estimators that minimizes that sum. Thus, the minimization problem for the sum of
squared residuals in matrix form is

$$\min_{\beta}\; u'u = (Y - X\beta)'(Y - X\beta)$$

Notice that $u'u$ is a scalar (a single number, such as 10,000) because $u'$ is a $1 \times n$ matrix
and $u$ is an $n \times 1$ matrix, and the product of these two matrices is a $1 \times 1$ matrix (thus a
scalar). Then, we can take the first derivative of this objective function in matrix form. First,
we simplify the matrices:
$$u'u = (Y' - \beta'X')(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta,$$

where the last step uses the fact that $\beta'X'Y$ and $Y'X\beta$ are both scalars and transposes of
each other, so they are equal. Taking the first derivative with respect to $\beta$ and setting it
equal to zero gives the first order conditions:

$$-2X'Y + 2X'X\hat\beta = 0$$
$$X'X\hat\beta = X'Y$$

Notice that I have replaced $\beta$ with $\hat\beta$ because $\hat\beta$ satisfies the first order conditions, by definition. Solving for $\hat\beta$,

$$\hat\beta = (X'X)^{-1}X'Y \qquad (1)$$

This is the least squares estimator for the multiple linear regression model in matrix
form. We call it the Ordinary Least Squares (OLS) estimator.
Note that the first order conditions can also be written in matrix form as

$$X'(Y - X\hat\beta) = 0$$
This is the same as the k+1 first order conditions we derived in the previous
lecture note (on the simple regression model):

$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}) = 0$$
$$\sum_{i=1}^{n} x_{i1}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}) = 0$$
$$\vdots$$
$$\sum_{i=1}^{n} x_{ik}(y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_k x_{ik}) = 0$$
Example 4-1: The Bivariate Regression Model in Matrix Form

Consider the bivariate regression model

$$y_i = \beta_0 + \beta_1 x_i + u_i \quad \text{for } i = 1,\ldots,n,$$

or, in matrix form, $Y = X\beta + u$:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+ \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}$$

The OLS estimator is

$$\hat\beta = (X'X)^{-1}X'Y \qquad (2)$$

Its components are

$$X'X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
= \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}
= \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i=1}^{n} x_i^2 \end{pmatrix}$$

$$(X'X)^{-1} = \frac{1}{n\sum_{i=1}^{n} x_i^2 - n^2\bar{x}^2}
\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n} (x_i - \bar{x})^2}
\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}$$

$$X'Y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}
= \begin{pmatrix} n\bar{y} \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}$$

Therefore,

$$\hat\beta = (X'X)^{-1}X'Y
= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}
\begin{pmatrix} \sum x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}
\begin{pmatrix} n\bar{y} \\ \sum x_i y_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x})^2}
\begin{pmatrix} n\bar{y}\sum x_i^2 - n\bar{x}\sum x_i y_i \\ -n^2\bar{x}\bar{y} + n\sum x_i y_i \end{pmatrix}$$

$$= \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\begin{pmatrix} \bar{y}\sum x_i^2 - \bar{x}\sum x_i y_i \\ \sum x_i y_i - n\bar{x}\bar{y} \end{pmatrix}
= \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\begin{pmatrix} \bar{y}\sum x_i^2 - n\bar{y}\bar{x}^2 + n\bar{y}\bar{x}^2 - \bar{x}\sum x_i y_i \\ \sum (x_i - \bar{x})(y_i - \bar{y}) \end{pmatrix}$$

$$= \frac{1}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\begin{pmatrix} \bar{y}\sum (x_i - \bar{x})^2 - \bar{x}\left(\sum x_i y_i - n\bar{x}\bar{y}\right) \\ \sum (x_i - \bar{x})(y_i - \bar{y}) \end{pmatrix}
= \begin{pmatrix} \bar{y} - \hat\beta_1\bar{x} \\[6pt] \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \end{pmatrix}
= \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix}$$
This is what you studied in the previous lecture note.
End of Example 4-1
Unbiasedness of OLS
In this sub-section, we show the unbiasedness of OLS under the following assumptions.
Assumptions:
E 1 (Linear in parameters): Y = Xβ + u
E 2 (Zero conditional mean): E (u | X ) = 0
E 3 (No perfect collinearity): X has full column rank, k + 1 (no independent variable is an exact linear combination of the others).
The OLS estimator is

$$\hat\beta = (X'X)^{-1}X'Y$$

Substituting $Y = X\beta + u$ (E1),

$$\hat\beta = (X'X)^{-1}X'(X\beta + u) = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'u = \beta + (X'X)^{-1}X'u.$$

By taking the expectation on both sides of the equation and using E2, we have

$$E(\hat\beta) = \beta + (X'X)^{-1}X'E(u \mid X) = \beta.$$

Thus the OLS estimator is unbiased.
Next, we consider the variance of the estimators.

Assumption:
E 4 (Homoskedasticity): $\operatorname{Var}(u \mid X) = E(uu' \mid X) = \sigma^2 I_{n \times n}$

Therefore,
$$\operatorname{Var}(\hat\beta) = \operatorname{Var}[\beta + (X'X)^{-1}X'u]
= \operatorname{Var}[(X'X)^{-1}X'u]
= E[(X'X)^{-1}X'uu'X(X'X)^{-1}]$$
$$= (X'X)^{-1}X'E(uu')X(X'X)^{-1}
= (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} \qquad \text{(by E4: Homoskedasticity)}$$
$$\operatorname{Var}(\hat\beta) = \sigma^2 (X'X)^{-1} \qquad (3)$$
GAUSS-MARKOV Theorem: Under assumptions E1-E4, the OLS estimator $\hat\beta$ is the best linear unbiased estimator (BLUE) of $\beta$; that is, among all linear unbiased estimators, it has the smallest variance.
Example 4-2: Step by Step Regression Estimation by STATA
In this sub-section, I would like to show you how the matrix calculations we have studied
are used in econometrics packages. Of course, in practice you do not write the matrix
programs yourself: econometrics packages already have built-in programs.
The following are matrix calculations with STATA using a dataset called
NFIncomeUganda.dta. Here we want to estimate the following model:

$$\text{ln\_nfincome}_i = \beta_0 + \beta_1\,\text{female}_i + \beta_2\,\text{edu}_i + \beta_3\,\text{edusq}_i + u_i$$

All the variables are defined in Example 3-1. Descriptive information about the variables
is here:

. su;

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   ln_income |       648    12.81736    1.505715   7.600903   16.88356
First, we need to define matrices. In STATA, you can load specific variables (data) into
matrices. The command is called mkmat. Here we create a matrix, called y, containing
the dependent variable, ln_nfincome, and a matrix of independent variables, called x,
containing female, edu, edusq, and a constant (const).
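The mkmat commands themselves do not appear in the log below, so the following is only a sketch of what they might look like, inferred from the matrix and column names used in the output (y, x, and const are those names; the exact do-file lines are an assumption):

* sketch: build the constant and load the data into matrices y and x
gen const = 1
mkmat ln_nfincome, matrix(y)
mkmat female edu edusq const, matrix(x)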
. matrix xx=x'*x;
. mat list xx;
symmetric xx[4,4]
female edu edusq const
female 144
. matrix ixx=syminv(xx);
. mat list ixx;
symmetric ixx[4,4]
female edu edusq const
female .0090144
Here is X ′Y :
. matrix xy=x'*y;
. mat list xy;
xy[4,1]
ln_nfincome
female 1775.6364
edu 55413.766
edusq 519507.74
const 8305.6492
. ** Estimating b hat;
. matrix bhat=ixx*xy;
. mat list bhat;
bhat[4,1]
ln_nfincome
female -.59366458
edu .04428822
edusq .00688388
const 12.252496
symmetric ss[1,1]
ln_nfincome
ln_nfincome 1.8356443
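The matrix ss listed above appears to hold the estimated error variance, $s^2 = \hat u'\hat u/(n-k-1)$. A minimal sketch of how it could have been computed (the matrix name uhat is an assumption; n = 648 observations and k + 1 = 4 coefficients are inferred from the output above, so the degrees of freedom are 644):

* sketch: residuals and estimated error variance s^2 = u'u/(n-k-1)
matrix uhat = y - x*bhat
matrix ss = uhat'*uhat/644
mat list ss

The value 1.8356443 is the s^2 that underlies the standard errors in the built-in regress output that follows.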
------------------------------------------------------------------------------
ln_nfincome | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -.5936646 .1286361 -4.62 0.000 -.8462613 -.3410678
edu | .0442882 .0314153 1.41 0.159 -.0174005 .105977
edusq | .0068839 .0020818 3.31 0.001 .002796 .0109718
_cons | 12.2525 .1216772 100.70 0.000 12.01356 12.49143
------------------------------------------------------------------------------
end of do-file
Lecture 5: OLS Inference under Finite-Sample Properties
So far, we have obtained the OLS estimator and derived its mean E(β̂) and variance Var(β̂). But we need to know
the shape of the full sampling distribution of β̂ in order to conduct statistical tests, such
as t-tests. That distribution depends on the distribution of the errors. Thus, we make the following assumption (again, under finite-
sample properties).
Assumption
E 5 (Normality): $u \mid X \sim N(0_{n\times 1}, \sigma^2 I_{n\times n})$
Note that $N(0_{n\times 1}, \sigma^2 I_{n\times n})$ indicates a multivariate normal distribution of u with mean vector zero and variance-covariance matrix $\sigma^2 I_{n\times n}$.
Remember again that only assumptions E1-E3 are necessary to have unbiased OLS
estimators. In addition, assumption E4 is needed to show that the OLS estimators are the
best linear unbiased estimators (BLUE); this is the Gauss-Markov theorem. We need assumption
E5 to conduct statistical tests.

Assumptions E1-E5 are collectively called the Classical Linear Model (CLM)
assumptions. The model under all of assumptions E1-E5 is called the classical linear model.
Under the CLM assumptions, the OLS estimators are the minimum variance unbiased
estimators: they are the most efficient among all unbiased estimators, not only among linear unbiased estimators.
Normality of β̂

Under the CLM assumptions, the OLS estimator is normally distributed:

$$\hat\beta \sim N[\beta, \sigma^2(X'X)^{-1}]$$

Each individual OLS estimator is therefore also normally distributed:

$$\hat\beta_k \sim N[\beta_k, \sigma^2(X'X)^{-1}_{kk}]$$

where $(X'X)^{-1}_{kk}$ is the k-th diagonal element of $(X'X)^{-1}$. Let's denote the k-th diagonal
element of $(X'X)^{-1}$ as $S_{kk}$. Then,
$$\sigma^2(X'X)^{-1} = \sigma^2
\begin{pmatrix} S_{11} & \cdot & \cdots & \cdot \\ \cdot & S_{22} & & \cdot \\ \vdots & & \ddots & \vdots \\ \cdot & \cdot & \cdots & S_{kk} \end{pmatrix}
= \begin{pmatrix} \sigma^2 S_{11} & \cdot & \cdots & \cdot \\ \cdot & \sigma^2 S_{22} & & \cdot \\ \vdots & & \ddots & \vdots \\ \cdot & \cdot & \cdots & \sigma^2 S_{kk} \end{pmatrix}$$
This is the variance-covariance matrix of the OLS estimator. On the diagonal are the
variances of the OLS estimators; off the diagonal are the covariances between the
estimators. Because each OLS estimator is normally distributed, we can obtain a standard
normal variable from an OLS estimator by subtracting its mean and dividing by its
standard deviation:

$$z_k = \frac{\hat\beta_k - \beta_k}{\sqrt{\sigma^2 S_{kk}}}.$$
In practice σ² is unknown, so we replace it with its estimator

$$s^2 = \frac{\hat u'\hat u}{n - (k+1)},$$

where $\hat u'\hat u$ is the sum of squared residuals. (Remember that $\hat u'\hat u$ is the product of a
$1 \times n$ matrix and an $n \times 1$ matrix, which gives a single number.) Therefore, by replacing σ² with s², we
have

$$t_k = \frac{\hat\beta_k - \beta_k}{\sqrt{s^2 S_{kk}}}.$$

This ratio has a t-distribution with (n − k − 1) degrees of freedom. It has a t-distribution
because it is the ratio of a variable that has a standard normal distribution (the numerator)
to the square root of a variable that has a chi-squared distribution divided by its degrees of freedom, (n − k − 1).
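For example, the standard error of female in Example 4-2 can be reproduced from these formulas. The listing of ss gives $s^2 = 1.8356443$ and the listing of ixx gives $S_{kk} = 0.0090144$ for female, so

$$se(\hat\beta_{female}) = \sqrt{s^2 S_{kk}} = \sqrt{1.8356443 \times 0.0090144} \approx 0.1286,$$

which matches the Std. Err. column of the STATA output, and the t-statistic for H0: β_female = 0 is −0.5936646/0.1286361 ≈ −4.62.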
Testing a Hypothesis on β̂k

When we test the null hypothesis H0: βk = 0, the t-statistic is just the ratio of an OLS estimator over
its standard error:

$$t_{\hat\beta_k} = \frac{\hat\beta_k}{se(\hat\beta_k)}.$$

We may test the null hypothesis against a one-sided alternative or a two-sided alternative.
Testing Joint Hypotheses: The F-test

Consider the model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + u_i.$$

Sometimes we want to test whether a group of variables jointly has an effect on y.
Suppose we want to know whether the independent variables x3, x4, and x5 jointly have
an effect on y:

H0: β3 = β4 = β5 = 0.

The null hypothesis, therefore, poses the question of whether these three variables can be
excluded from the model. Thus the hypothesis is also called an exclusion restriction. The
model with the exclusion imposed is called the restricted model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + u_i.$$

On the other hand, the model without the exclusion is called the unrestricted model:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + u_i.$$
We can generalize this problem by changing the number of restrictions from three to q.
The joint significance of the q variables is measured by how much the sum of squared
residuals (SSR) increases when the q variables are excluded. Let us denote the SSR of the
restricted and unrestricted models by SSRr and SSRur, respectively. Of course, SSRur
is smaller than SSRr because the unrestricted model has more variables than the
restricted model. The question is how large the increase is relative to the original size of the SSR.
The F-statistic is defined as

$$F \equiv \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}.$$

The numerator measures the change in SSR, per restriction, in moving from the unrestricted
model to the restricted model. As with a percentage change, this change in SSR is divided by
the size of the SSR at the starting point, SSRur, standardized by its degrees of freedom.

The above definition is based on how much the models cannot explain, the SSRs. Instead,
we can measure the contribution of a set of variables by asking how much explanatory
power is lost by excluding the set of q variables.
The F-statistic can be re-defined as

$$F \equiv \frac{(R_{ur}^2 - R_r^2)/q}{(1 - R_{ur}^2)/(n - k - 1)}.$$
Again, because the unrestricted model has more variables, it has a larger R-squared than
the restricted model. (Thus the numerator is never negative.) The numerator measures
the loss in explanatory power, per restriction, when moving from the unrestricted
model to the restricted model. This change is divided by the variation in y left unexplained
by the unrestricted model, standardized by the degrees of freedom.

If the decrease in explanatory power is relatively large, then the set of q variables is
considered jointly significant in the model. (Thus these q variables should stay in the
model.)
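In STATA, this joint test can be carried out with the test command after regress. A minimal sketch, using the placeholder variable names y and x1-x5 from the model above:

* estimate the unrestricted model, then test the q = 3 exclusion restrictions
regress y x1 x2 x3 x4 x5
test x3 x4 x5
* test reports the F-statistic and its p-value for H0: b3 = b4 = b5 = 0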
Lecture 6: OLS Asymptotic Properties
Let Wn be an estimator of a parameter θ based on a sample of size n. Wn is a consistent
estimator of θ if, for every e > 0,

P(|Wn − θ| > e) → 0 as n → ∞.

This says that the probability that the absolute difference between Wn and θ is larger
than e goes to zero as n gets bigger. This means that the probability could be non-zero
while n is not large. For instance, let’s say that we are interested in finding the average
income of American people and take small samples randomly. Let’s assume that the
small samples include Bill Gates by chance. The sample mean income is way over the
population average. Thus, when sample sizes are small, the probability that the
difference between the sample and population averages is larger than e, which is any
positive number, can be non-zero. However, the difference between the sample and
population averages would be smaller as the sample size gets bigger (as long as the
sampling is properly done). As a result, as the sample size goes to infinity, the
probability that the difference between the two averages is bigger than e (no matter how
small e is) becomes zero.
plim (Wn) = θ.
Under the finite-sample properties, we say that Wn is unbiased, E(Wn) =θ. Under the
asymptotic properties, we say that Wn is consistent because Wn converges to θ as n gets
larger.
Recall that

$$\hat\beta = (X'X)^{-1}X'Y = \beta + (X'X)^{-1}X'u,$$

and that

$$\operatorname{plim}\hat\beta = \beta.$$
Next, we focus on the asymptotic inference for the OLS estimator. To obtain the
asymptotic distribution of the OLS estimator, we first derive the limit distribution of
$\sqrt{n}(\hat\beta - \beta)$. (Scaling by $\sqrt{n}$ gives a non-zero yet finite variance asymptotically; see Cameron and Trivedi. Also, $\sqrt{n}$ will be
squared later and becomes n, which is very convenient.) Starting from the expression for the OLS estimator,

$$\hat\beta = \beta + \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'u\right)$$

$$\sqrt{n}\,(\hat\beta - \beta) = \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{\sqrt{n}}X'u\right)$$
The limit distribution of $\sqrt{n}(\hat\beta - \beta)$ has mean zero because of the consistency of $\hat\beta$. Its
limit variance is obtained from

$$\sqrt{n}(\hat\beta - \beta)\cdot\sqrt{n}(\hat\beta - \beta)'
= \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{\sqrt{n}}X'u\right)\left(\frac{1}{\sqrt{n}}X'u\right)'\left(\frac{1}{n}X'X\right)^{-1}
= \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'uu'X\right)\left(\frac{1}{n}X'X\right)^{-1}$$

From E4, $E(uu') = \sigma^2 I$, and we assumed that the plim of $\frac{1}{n}X'X$ is Q. Thus, the limit variance is

$$Q^{-1}\,\sigma^2\left(\frac{1}{n}X'X\right)Q^{-1} = \sigma^2 Q^{-1}QQ^{-1} = \sigma^2 Q^{-1},$$

so that

$$\sqrt{n}\,(\hat\beta - \beta) \stackrel{d}{\sim} N[0, \sigma^2 Q^{-1}].$$
From this, we can obtain the asymptotic distribution of the OLS estimator itself by
dividing by $\sqrt{n}$ and rearranging:

$$\hat\beta \stackrel{a}{\sim} N[\beta, \sigma^2 n^{-1}Q^{-1}].$$
Example 6-1: Consistency of OLS Estimators in Bivariate Linear Estimation
Consider a bivariate model, $y_i = \beta_0 + \beta_1 x_i + u_i$, whose OLS slope estimator is

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
= \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})u_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Under the assumption of zero conditional mean (SLR 3: E(u|x) = 0), we can separate the
expectation of x and u:

$$E(\hat\beta_1) = \beta_1 + \frac{\sum_{i=1}^{n}(x_i - \bar{x})E(u_i)}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \beta_1.$$
Thus we need SLR 3 to show that the OLS estimator is unbiased.

Now, suppose we have a violation of SLR 3 and cannot show the unbiasedness of the
OLS estimator. We can still consider the consistency of the OLS estimator.
$$\operatorname{plim}\hat\beta_1 = \operatorname{plim}\beta_1 + \operatorname{plim}\frac{\sum_{i=1}^{n}(x_i - \bar{x})u_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\operatorname{plim}\hat\beta_1 = \beta_1 + \frac{\operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})u_i}{\operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\operatorname{plim}\hat\beta_1 = \beta_1 + \frac{\operatorname{cov}(x,u)}{\operatorname{var}(x)}$$

Thus, as long as the covariance between x and u is zero, the OLS estimator of the bivariate
model is consistent.
End of Example 6-1
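A quick way to see this consistency at work is a simulation in which x and u are generated independently, so that cov(x, u) = 0. The following is a minimal sketch (the seed, sample size, and coefficient values are made up for illustration; they are not from the notes):

* minimal simulation sketch; true model is y = 1 + 2x + u with cov(x,u) = 0
clear
set seed 12345
set obs 10000
gen x = rnormal()
gen u = rnormal()
gen y = 1 + 2*x + u
regress y x
* with a large n, the estimated slope should be very close to 2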
Lecture 7: OLS Further Issues
In this lecture, we discuss some practical issues related to OLS estimation, such as
functional forms and the interpretation of several types of variables. For details, please read
Wooldridge chapters 6 and 7.
Measurement Error in the Dependent Variable

Suppose the true model is

$$y_i^* = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i,$$

but the dependent variable is observed with a measurement error $e_0$, so that $y_i = y_i^* + e_0$ and
the estimating equation is $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + (u_i + e_0)$. Thus, if $e_0 + u$ satisfies
the OLS assumptions (such as $E(e_0 + u \mid X) = 0$), then the OLS estimators are unbiased (and
consistent). But the variance of the disturbance is larger, by $\operatorname{Var}(e_0)$, with the measurement
error than without it.
Note, however, that the measurement error in the dependent variable could be correlated
with the independent variables [$\operatorname{Cov}(x_k, e_0) \neq 0$]. In that case, the estimators will be biased.

Measurement Error in an Independent Variable

Next, let $x_k^*$ denote an independent variable that is observed only as $x_k$, measured with
error $e_k$, where $E(e_k) = 0$:

$$e_k = x_k - x_k^* \qquad (8\text{-}1)$$

Substituting $x_{ki}^* = x_{ki} - e_k$ into the true model gives

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k (x_{ki} - e_k) + u_i
= \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + (u_i - \beta_k e_k).$$
Assumption 1: Cov(xk, ek) = 0

Under this assumption, the error term $(u - \beta_k e_k)$ has zero mean and is uncorrelated with the
independent variables. Thus the estimators are unbiased (and consistent). The error variance,
however, is larger by $\beta_k^2 \sigma_{e_k}^2$.
The alternative, classical errors-in-variables (CEV), assumption is that the measurement error is
uncorrelated with the unobserved true variable: Cov(xk*, ek) = 0. In that case the observed
regressor $x_k = x_k^* + e_k$ is correlated with $e_k$, and therefore with the error term $(u - \beta_k e_k)$.
Thus, we have a problem analogous to the omitted variables problem, which gives inconsistent
estimators for all of the independent variables.
For a bivariate regression model it is easy to show the exact bias caused by the CEV assumption.
Suppose now that $x_1$ is the variable measured with error. In the bivariate regression
model, the least squares estimator can be written as

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)y_i}{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)^2}
= \beta_1 + \frac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(u_i - \beta_1 e_{1i})}{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)^2}.$$
$$\operatorname{plim}(\hat\beta_1) = \beta_1 + \frac{\operatorname{cov}(x_1, u - \beta_1 e_1)}{\operatorname{Var}(x_1)}
= \beta_1 + \frac{-\beta_1\sigma_{e_1}^2}{\operatorname{Var}(x_1^* + e_1)}
= \beta_1\left(1 - \frac{\sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2}\right)
= \beta_1\,\frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} < \beta_1$$
Thus, plim($\hat\beta_1$) is always closer to zero than $\beta_1$ (biased toward zero); this is known as the
attenuation bias. With additional regressors in the model, a similar result holds:

$$\operatorname{plim}(\hat\beta_1) = \beta_1\,\frac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} < \beta_1,$$

where $r_1^*$ is the population error in the equation

$$x_1^* = \alpha_0 + \alpha_1 x_2 + \cdots + \alpha_{k-1} x_k + r_1^*.$$

Again, the implication is the same as before: the estimated coefficient of the variable
with measurement error is biased toward zero (and therefore less likely to reject the null hypothesis).
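To get a feel for the size of the bias, take some assumed (purely illustrative) variances: if $\sigma_{x_1^*}^2 = 4$ and $\sigma_{e_1}^2 = 1$, then $\operatorname{plim}(\hat\beta_1) = \beta_1 \times 4/(4+1) = 0.8\,\beta_1$, so the estimated slope converges to only 80 percent of the true coefficient.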
Data Scaling
Many variables we use have units, such as monetary units and quantity units. The bottom
line is that rescaling the data does not change the substance of the results.
If you scale the dependent variable up or down by a constant c, the OLS estimators and their
standard errors are also scaled up or down by c, but the t-statistics are not. Thus the
significance levels remain the same as before scaling.

If you scale one independent variable up or down by c, the estimated coefficient of that
independent variable is scaled down or up by the same factor. Again, the t-statistic (and the
significance level) does not change.
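This is easy to verify directly in STATA; a minimal sketch (y and x are placeholder variable names, not from the notes):

* rescale the dependent variable by a factor of 1,000 and compare
gen y_scaled = 1000*y
regress y x
regress y_scaled x
* the coefficient and standard error on x are 1,000 times larger in the second
* regression, but the t-statistic and p-value are unchanged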
Logarithmic Forms
For a small change in x, the change in log(x) times 100, i.e., 100·∆log(x), is approximately the
percentage change in x, 100·∆x/x. Therefore, we can interpret the following cases using
percentage changes:

(1) log-log: $\log(y) = \beta_k \log(x_k) + \ldots$, so $\hat\beta_k = \dfrac{\Delta y / y}{\Delta x / x}$

(2) log-level: $\log(y) = \beta_k x_k + \ldots$, so $\hat\beta_k = \dfrac{\Delta y / y}{\Delta x}$

(3) level-log: $y = \beta_k \log(x_k) + \ldots$, so $\hat\beta_k = \dfrac{\Delta y}{\Delta x / x}$

(4) level-level: $y = \beta_k x_k + \ldots$, so $\hat\beta_k = \dfrac{\Delta y}{\Delta x}$
When a change in log is not small, the approximation between the change in log(x) and the
percentage change in x may not be accurate. For instance, the log-level model gives us

$$\log(\hat y') - \log(\hat y) = \hat\beta_k. \qquad (8\text{-}1)$$

If the change in log is small, there is no problem in interpreting this as "one unit of
xk changes y by (100·$\hat\beta_k$) percent," because $\log(\hat y') - \log(\hat y) \approx (\hat y' - \hat y)/\hat y$. But when the
change in log is not small, the approximation may not be accurate. Thus we need to
transform (8-1) as:

$$\log(\hat y') - \log(\hat y) = \log(\hat y'/\hat y) = \hat\beta_k$$
$$\hat y'/\hat y = \exp(\hat\beta_k)$$
$$\hat y'/\hat y - 1 = \exp(\hat\beta_k) - 1$$
$$(\hat y' - \hat y)/\hat y = \exp(\hat\beta_k) - 1$$
$$\%\Delta \hat y = 100\,[\exp(\hat\beta_k) - 1]$$
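For example, with an assumed estimate of $\hat\beta_k = 0.30$ in a log-level model, the approximation suggests that a one-unit increase in xk raises y by about 30 percent, while the exact figure is $100[\exp(0.30) - 1] \approx 35$ percent; the gap between the two grows as the coefficient gets larger.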
Quadratic Form
$$y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_1^2 + \ldots$$

$$\partial y / \partial x_1 = \hat\beta_1 + 2\hat\beta_2 x_1$$

Interpretation:

$\hat\beta_1 > 0$ and $\hat\beta_2 < 0$: "an increase in x1 increases y at a diminishing rate"

$\hat\beta_1 < 0$ and $\hat\beta_2 > 0$: "an increase in x1 decreases y at a diminishing rate"

Turning Point:

At the turning point, the first derivative of y with respect to x1 is zero:

$$\partial y / \partial x_1 = \hat\beta_1 + 2\hat\beta_2 x_1 = 0$$
$$x_1^* = -\hat\beta_1 / (2\hat\beta_2)$$
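For instance, with assumed estimates $\hat\beta_1 = 0.30$ and $\hat\beta_2 = -0.006$ (say, a quadratic in years of experience), y increases with x1 at a diminishing rate, and the turning point is at $x_1^* = 0.30/(2 \times 0.006) = 25$: beyond 25 units of x1 the predicted y starts to decline.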
Interaction Terms
$$y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 (x_1 x_2) + \cdots + \hat\beta_k x_k$$

The impact of x1 on y is

$$\partial y / \partial x_1 = \hat\beta_1 + \hat\beta_3 x_2$$
A Dummy Variable
$$y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots,$$

where x1 is a dummy variable that takes the value 0 or 1. The coefficient $\hat\beta_1$ measures the
difference in y between the two groups, holding the other variables fixed: "The group B, with
x1 = 1, has a lower or higher y than the base group, with x1 = 0."
Interaction Terms with Dummies
$$y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 (x_1 x_2) + \ldots,$$

where x1 is a dummy variable. The coefficient on the interaction term, $\hat\beta_3$, measures the
difference in the effect of x2 on y between a group with x1 = 0 and a group with x1 = 1, that is,
a difference in the slopes of x2:

$$\partial y / \partial x_2 = \hat\beta_2 \quad \text{when } x_1 = 0$$
$$\partial y / \partial x_2 = \hat\beta_2 + \hat\beta_3 \quad \text{when } x_1 = 1$$
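As a numerical illustration (with assumed coefficients): if $\hat\beta_2 = 0.08$ and $\hat\beta_3 = -0.02$, a one-unit increase in x2 is associated with a 0.08 increase in y for the x1 = 0 group, but only a 0.08 − 0.02 = 0.06 increase for the x1 = 1 group.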
Multicollinearity
From the previous lecture, we know that the variance of an OLS estimator can be written as

$$\operatorname{Var}(\hat\beta_k) = \frac{\sigma^2}{(1 - R_k^2)\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}$$

(see Wooldridge p. 94 or Greene p. 57), where $R_k^2$ is the R-squared in the regression of xk
on all of the other independent variables. In other words, $R_k^2$ is the proportion of the total
variation in xk that is explained by the other independent variables.
If two variables are perfectly correlated, then $R_k^2$ will be one for both of the perfectly
correlated variables, and the variances of those two estimators cannot be computed.
Obviously, you need to drop one of the two perfectly correlated variables. In practice,
STATA drops one of the perfectly correlated variables automatically. So if you see a
dropped variable in STATA output, you should suspect that you have included perfectly
correlated variables without realizing it.
Even if two or more variables are not perfectly correlated, if they are highly correlated
(high Rk2 ), the variance of estimators will be large. This problem is called
multicollinearity.
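A related diagnostic is the variance inflation factor, $VIF_k = 1/(1 - R_k^2)$, the factor by which $\operatorname{Var}(\hat\beta_k)$ is inflated relative to the case of no collinearity. In STATA it can be obtained after regress; a minimal sketch (y and x1-x3 are placeholder variable names):

* fit the model, then list the variance inflation factors for the regressors
regress y x1 x2 x3
estat vif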
The problem of multicollinearity can be avoided to some extent by collecting more data
and increasing the variation in the independent variables. For instance, think about a sample
of four individuals. All four could be male and unmarried; in that case, a gender
variable and a marital-status variable would be perfectly correlated. Suppose three of them
have college education. Then an education variable would be highly correlated with the
gender and marital-status variables. Of course, the correlations between these variables will
disappear (to some extent) as the sample size and the variation in the variables increase.
When in doubt, conduct an F-test on the variables that you suspect are causing multicollinearity.
A typical symptom of multicollinearity is

high joint significance with low individual significance
(a high F-statistic but low t-statistics)

A simple solution is to keep the variables in the model. If your main focus is on a variable that
is not part of the multicollinearity, then it is not a serious problem to have
multicollinearity in your model. You could drop one of the highly correlated variables, but
doing so may create an omitted variable problem. Remember that an omitted variable
problem can cause biases in all of the estimators. This could be a more serious problem
than multicollinearity.