OLS Estimates
ECON 30331/Evans
Suppose there is a sample with n observations and two variables (xi and yi). Then
$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n$

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
Throughout the semester when I write at the board, I will shorten the notation some and write
$\sum_{i=1}^{n} x_i \quad \text{as simply} \quad \sum_i x_i$
Result (1): $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$. The sum of deviations from means equals zero.

Proof: $\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}\bar{x} = \sum_{i=1}^{n}x_i - n\bar{x} = \sum_{i=1}^{n}x_i - n\left(\frac{1}{n}\sum_{i=1}^{n}x_i\right) = \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}x_i = 0$
Result (2): $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}x_i(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i$

Proof: $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}x_i(y_i - \bar{y}) - \sum_{i=1}^{n}\bar{x}(y_i - \bar{y})$

Given the result from above (the summation of deviations from means equals zero), $\sum_{i=1}^{n}\bar{x}(y_i - \bar{y}) = \bar{x}\sum_{i=1}^{n}(y_i - \bar{y}) = 0$, so

$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}x_i(y_i - \bar{y})$

By the same argument, expanding on $(y_i - \bar{y})$ instead gives

$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n}(x_i - \bar{x})y_i$
Result (3): $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i(x_i - \bar{x})$

Proof: This is the same proof as above. Expand the terms on the right-hand side of the equality:

$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x}) = \sum_{i=1}^{n}x_i(x_i - \bar{x}) - \sum_{i=1}^{n}\bar{x}(x_i - \bar{x})$

In the final term on the right, note that because $\bar{x}$ is a constant, you can take it outside the summation, and

$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i(x_i - \bar{x}) - \bar{x}\sum_{i=1}^{n}(x_i - \bar{x})$

And given Result (1) above, $\bar{x}\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, so

$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i(x_i - \bar{x})$
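These three identities are easy to check numerically. Below is a minimal sketch, assuming Python with NumPy and arbitrary simulated data (the seed and sample size are not part of the notes); each printed pair or triple of values agrees up to floating-point rounding.

```python
# Numerical check of Results (1)-(3) on made-up data; any sample works.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)
xbar, ybar = x.mean(), y.mean()

# Result (1): sum of deviations from the mean is ~0
print(np.sum(x - xbar))

# Result (2): the three expressions are all equal
print(np.sum((x - xbar) * (y - ybar)),
      np.sum(x * (y - ybar)),
      np.sum((x - xbar) * y))

# Result (3): the two expressions are equal
print(np.sum((x - xbar) ** 2), np.sum(x * (x - xbar)))
```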
Deriving the OLS estimates for the Bivariate Regression Model
Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
The residuals (εi) are unobserved, but for candidate values of β0 and β1, we can obtain an
estimate of the residual.
$SSR = \sum_{i=1}^{n}\hat{\varepsilon}_i^2 = \sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$
The OLS estimates are the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize SSR. Take the derivative of SSR with respect to each parameter and set it equal to zero:

(1) $\partial SSR/\partial\hat{\beta}_0 = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$

(2) $\partial SSR/\partial\hat{\beta}_1 = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)x_i = 0$
(1a) $\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$

Divide (1a) through by n and split the sum into three terms:

(1b) $\frac{1}{n}\sum_{i=1}^{n}y_i - \frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_0 - \frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_1 x_i = 0$

The first term is $\bar{y}$, the second is $\frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_0 = \hat{\beta}_0$, and the third is $\frac{1}{n}\sum_{i=1}^{n}\hat{\beta}_1 x_i = \hat{\beta}_1\bar{x}$, and therefore we can re-write (1b) as

(1c) $\bar{y} - \hat{\beta}_0 - \hat{\beta}_1\bar{x} = 0$, or equivalently, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$
(2a) $\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)x_i = 0$
Substitute (1c) into (2a), expand the terms in the summation, and, because $\hat{\beta}_1$ is a constant, bring it outside the summation:

(2d) $\sum_{i=1}^{n}(y_i - \bar{y})x_i - \hat{\beta}_1\sum_{i=1}^{n}(x_i - \bar{x})x_i = 0$
From Results (2) and (3) at the start of these notes,

$\sum_{i=1}^{n}(y_i - \bar{y})x_i = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$

$\sum_{i=1}^{n}(x_i - \bar{x})x_i = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x}) = \sum_{i=1}^{n}(x_i - \bar{x})^2$
Substitute these into (2d) and solve for $\hat{\beta}_1$:

(2g) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
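To make (2g) and (1c) concrete, here is a small sketch, assuming Python with NumPy and simulated data (the parameter values and seed are arbitrary choices), that computes the estimates from the formulas and checks them against NumPy's built-in degree-1 least-squares fit.

```python
# Compute the OLS estimates directly from (2g) and (1c), then compare with np.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 1.5 + 0.8 * x + rng.normal(size=100)

b1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)  # (2g)
b0_hat = y.mean() - b1_hat * x.mean()                                           # (1c)

b1_np, b0_np = np.polyfit(x, y, deg=1)   # degree-1 fit returns (slope, intercept)
print(b0_hat, b1_hat)
print(b0_np, b1_np)                      # should match the formula-based estimates
```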
Some useful properties of OLS estimates:
1. From (1c) above, note that $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}$. The OLS regression line passes through the means of x and y. OLS is sometimes referred to as a mean regression.
2. From (1a) above, note that $\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = 0$ and note further that $\hat{\varepsilon}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. Therefore $\sum_{i=1}^{n}\hat{\varepsilon}_i = 0$, which indicates that the sample mean of $\hat{\varepsilon}$ is equal to zero, or $\bar{\hat{\varepsilon}} = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i = 0$.
3. From (2a) above, recall that $\hat{\varepsilon}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, so (2a) can be written as $\sum_{i=1}^{n}\hat{\varepsilon}_i x_i = 0$.
4. Looking at the OLS estimate in (2g), divide the numerator and denominator by (n-1):

(2h) $\hat{\beta}_1 = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
$\hat{\beta}_1 = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2} = \frac{\hat{\rho}_{xy}\hat{\sigma}_x\hat{\sigma}_y}{\hat{\sigma}_x\hat{\sigma}_x} = \hat{\rho}_{xy}\frac{\hat{\sigma}_y}{\hat{\sigma}_x}$

If one knows the variances and the correlation coefficient, one can easily compute the OLS estimate of $\hat{\beta}_1$. (The numerical check below illustrates all four of these properties.)
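A minimal sketch of that check, assuming Python with NumPy and simulated data (names, seed, and parameter values are illustrative, not part of the notes):

```python
# Numerical check of properties 1-4 of the OLS fit.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = -1.0 + 2.5 * x + rng.normal(size=200)

b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
ehat = y - b0 - b1 * x                           # OLS residuals

print(np.isclose(y.mean(), b0 + b1 * x.mean()))  # property 1: line passes through the means
print(ehat.mean())                               # property 2: mean residual ~ 0
print(np.sum(ehat * x))                          # property 3: sum of ehat_i * x_i ~ 0
rho = np.corrcoef(x, y)[0, 1]
print(rho * y.std(ddof=1) / x.std(ddof=1), b1)   # property 4: rho * s_y / s_x equals b1
```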
Deriving the R2
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i + \hat{\varepsilon}_i) = \frac{1}{n}\sum_{i=1}^{n}\hat{y}_i + \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i = \bar{\hat{y}} + \bar{\hat{\varepsilon}}$
Remember that the sample average of $\hat{\varepsilon}$ is zero, so $\bar{y} = \bar{\hat{y}}$ (the sample mean of y equals the sample mean of predicted y).
(2) $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$
This is nothing more than a statement about how much movement there is in y in your sample.
Noting that $y_i = \hat{y}_i + \hat{\varepsilon}_i$ and $\bar{y} = \bar{\hat{y}}$, substitute these values into SST and expand the square:
(4) $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i + \hat{\varepsilon}_i - \bar{\hat{y}})^2 = \sum_{i=1}^{n}\left[(\hat{y}_i - \bar{\hat{y}})^2 + \hat{\varepsilon}_i^2 + 2\hat{\varepsilon}_i(\hat{y}_i - \bar{\hat{y}})\right]$

$= \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 + \sum_{i=1}^{n}\hat{\varepsilon}_i^2 + 2\sum_{i=1}^{n}\hat{\varepsilon}_i(\hat{y}_i - \bar{\hat{y}})$
Focus on the third term in the equality. Note a few things. First, since $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $\bar{\hat{y}} = \bar{y}$, and $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}$, it is easy to show that $(\hat{y}_i - \bar{\hat{y}}) = \hat{\beta}_1(x_i - \bar{x})$. Substitute this value into the third term:
(5) $2\sum_{i=1}^{n}\hat{\varepsilon}_i(\hat{y}_i - \bar{\hat{y}}) = 2\sum_{i=1}^{n}\hat{\varepsilon}_i\hat{\beta}_1(x_i - \bar{x}) = 2\hat{\beta}_1\sum_{i=1}^{n}\hat{\varepsilon}_i(x_i - \bar{x})$
In equation (5), we can take $\hat{\beta}_1$ outside the summation because it is the same value over
all i. Look at the notes for “Deriving the OLS Estimates for the Bivariate Regression
Model”. On the final page, we note some useful properties of the OLS estimates.
Condition 3 states that by construction, $\sum_{i=1}^{n}\hat{\varepsilon}_i x_i = 0$, and condition 2 states that $\sum_{i=1}^{n}\hat{\varepsilon}_i = 0$, which together mean that equation (5) above is zero. Therefore,

(6) $SST = \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 + \sum_{i=1}^{n}\hat{\varepsilon}_i^2$
The SST or the total variation in y has two separate parts. The first is
(7) $SSM = \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2$
Where SSM is defined as the sum of squared model. This is a measure of the variation in the predicted value of y.
The final term in equation (6) should look very familiar; it is none other than the objective function, or the sum of squared residuals (SSR).
(8) $SSR = \sum_{i=1}^{n}\hat{\varepsilon}_i^2$
(9) $SST = SSM + SSR$

…or the actual variation in y (SST) is a function of two components. The first is the variation predicted by the model (SSM), while the second is the variation that we cannot predict (SSR).

Dividing (9) through by SST,

$1 = \frac{SSM}{SST} + \frac{SSR}{SST}$

Or alternatively,

(10) $R^2 = \frac{SSM}{SST} = 1 - \frac{SSR}{SST}$
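The decomposition in (9) and the two expressions for R² in (10) can be checked directly. A short sketch, assuming Python with NumPy and simulated data (the particular values are arbitrary):

```python
# Decompose SST into SSM + SSR and compute R^2 both ways, as in equation (10).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=150)
y = 0.5 + 1.2 * x + rng.normal(size=150)

b1 = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x
ehat = y - yhat

SST = np.sum((y - y.mean()) ** 2)
SSM = np.sum((yhat - yhat.mean()) ** 2)
SSR = np.sum(ehat ** 2)

print(SST, SSM + SSR)             # equation (9): SST = SSM + SSR
print(SSM / SST, 1 - SSR / SST)   # equation (10): two equivalent ways to compute R^2
```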
n
Just a note about the textbook. The author calls the term ( yˆi yˆ ) 2 the SSE or sum of squared
i 1
SSE SSR
explained. PLEASE NOTE: The textbook definition of R2 is R 2 1 where the
SST SST
n
author defines SSE as sum of squared estimated. I do not like this abbreviation for ( yˆ yˆ )
i 1
i
2
.
Our definition SSM matches much better with the STATA prints out – SST is sum of squared
total, SSM is sum of squared model and SSR is sum of squared residuals – so we will use these
abbreviations.
Proof that β̂1 is an Unbiased Estimate
(1) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
(2) $\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^{n}y_i(x_i - \bar{x})$
Note further that the true relationship between y and x is given by the population regression line

(3) $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Using (2) and substituting the true value for yi, as given by the model in (3), into the estimate (1),
(4) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}y_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(\beta_0 + \beta_1 x_i + \varepsilon_i)(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
(5) $\hat{\beta}_1 = \frac{\beta_0\sum_{i=1}^{n}(x_i - \bar{x}) + \beta_1\sum_{i=1}^{n}x_i(x_i - \bar{x}) + \sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
In the first term in the numerator, note that β0 is a constant and can be pulled
outside the summation. As a result, we have the summation of a deviation from a
mean, which equals zero
n n
0 ( xi x ) 0 ( xi x ) 0 (0) 0
i 1 i 1
In the second term in the numerator, β1 is a constant and it can be pulled outside the summation. Recall also that $\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}x_i(x_i - \bar{x})$, so
$\sum_{i=1}^{n}\beta_1 x_i(x_i - \bar{x}) = \beta_1\sum_{i=1}^{n}x_i(x_i - \bar{x}) = \beta_1\sum_{i=1}^{n}(x_i - \bar{x})^2$
The first term in the numerator drops out, the second term (once divided by the denominator) reduces to β1, and therefore we can write the OLS estimate for β̂1 as
(6) $\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Equation (6) points out two important things. First, the estimate β̂1 is a function of the 'truth', that is, the true value of β1. Likewise, the estimated value of β̂1 is a function of the n people who were selected for this sample. The true source of randomness in the model is therefore the unknown residual εi. As a result, the properties of β̂1 will be a function of the properties we assume about εi. We typically make the following assumptions about εi:
1) $E(\varepsilon_i) = E(\varepsilon_i|x_i) = 0$
2) $V(\varepsilon_i) = V(\varepsilon_i|x_i) = \sigma_\varepsilon^2$
3) $Cov(\varepsilon_i, \varepsilon_j) = 0$ for all i≠j
The first assumption says that on average, the expected error is zero and that this
expectation does not depend on the value of x. The second assumption says that the
errors are “homoskedastic” or they have the same variance. Assumption (3) states that
errors are not correlated across observations. The second and third assumptions will be
relaxed throughout the semester.
Assumption (1) is the killer. If (1) is true, the model has very nice properties; if it is false, the model is useless.
Assumption (1) states that ε and x are independent. This says that the realization of x
conveys no information about the likely value of ε and therefore, the conditional
expectation E(εi|xi) provides the same information as the unconditional expectation E(εi).
Recall that $cov(x_i, \varepsilon_i) = E(x_i\varepsilon_i) - E(x_i)E(\varepsilon_i)$. Because $E(\varepsilon_i) = E(\varepsilon_i|x_i) = 0$, the second term drops out and $cov(x_i, \varepsilon_i) = E(x_i\varepsilon_i)$. Let's work with the right-hand side of this term: $E(x_i\varepsilon_i) = E(\varepsilon_i|x_i)x_i = E(\varepsilon_i)x_i = 0$, and hence $cov(x_i, \varepsilon_i) = 0$. In essence, by conditioning on x, we "fix" this value and $E(x_i\varepsilon_i)$ becomes $E(\varepsilon_i)x_i$, which equals zero by assumption (1).
A key result we will use time and time again throughout the semester is that if we maintain assumption (1) and we see $E(\varepsilon_i x_i)$, this reduces to $E(\varepsilon_i)x_i$, which will equal zero.
As we will see, if the value of x conveys information about ε then the model is sunk. We
will go over this in detail about two dozen times throughout the semester.
Let's also work with condition (2) a little. This states that the variance of εi is the same whether we know x or not. Recall the definition of variance: $Var(\varepsilon_i) = E[(\varepsilon_i - E(\varepsilon_i))^2]$. Because $E(\varepsilon_i) = 0$, the definition of the variance reduces to $Var(\varepsilon_i) = E[\varepsilon_i^2] = \sigma_\varepsilon^2$. Therefore, any time we see an $E[\varepsilon_i^2]$, this means $\sigma_\varepsilon^2$.
In the derivations below, we will also see a lot of terms of the form $E[\varepsilon_i^2 x_i^2]$. Given assumption (2), $E[\varepsilon_i^2 x_i^2] = E[\varepsilon_i^2|x_i]x_i^2 = E[\varepsilon_i^2]x_i^2 = \sigma_\varepsilon^2 x_i^2$.

Therefore, a key result we will use time and time again throughout the semester -- if we maintain assumption (2) and we see $E[\varepsilon_i^2 x_i^2]$, this reduces to $E[\varepsilon_i^2]x_i^2$, which equals $\sigma_\varepsilon^2 x_i^2$.
For now, let's concentrate on the case where (1) is true and see what that buys us.
We have established that β̂1 is a random variable. Any time you have a random variable,
the first two questions you need to ask are a) what is the expected value and b) what is
the variance. In this section, we will produce E[ β̂1]
First, start with the definition of β̂1 in equation (6) and take the expectation:
(7) $E[\hat{\beta}_1] = E\left[\beta_1 + \frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right] = E[\beta_1] + E\left[\frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right] = \beta_1 + \frac{E\left[\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})\right]}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
There is a lot going on in equation (7). First, note that E[a+b] = E[a] + E[b], so we can break apart the two big terms in the expectation. Second, note that the true value β1 is a fixed constant, so E[β1] = β1. Note also that because we assume x is "fixed", then $\sum_{i=1}^{n}(x_i - \bar{x})^2$ is not random and it too can be brought outside the expectation.
Therefore, the properties of E[β̂1] will be driven by the expectation $E\left[\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})\right]$.
Let’s work with this term. First, write out the terms in the summation under the
expectation
(8) $E\left[\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})\right] = E[\varepsilon_1(x_1 - \bar{x})] + E[\varepsilon_2(x_2 - \bar{x})] + E[\varepsilon_3(x_3 - \bar{x})] + \cdots + E[\varepsilon_n(x_n - \bar{x})]$
Consider one of these expectations, $E[\varepsilon_i(x_i - \bar{x})]$, for any i. Break this term apart to read $E[\varepsilon_i x_i] - E[\varepsilon_i\bar{x}]$. Note assumption (1) above states that $E(\varepsilon_i|x_i) = 0$. Looking at the first term of $E[\varepsilon_i x_i] - E[\varepsilon_i\bar{x}]$, we can easily write it as

$E[\varepsilon_i x_i] = E[\varepsilon_i|x_i]x_i = 0$

Therefore, if assumption (1) is correct, this term should be zero. The second term in $E[\varepsilon_i x_i] - E[\varepsilon_i\bar{x}]$ requires the definition of $\bar{x}$, which is
$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right)$
$E[\varepsilon_i\bar{x}] = E\left[\varepsilon_i\frac{x_1}{n} + \varepsilon_i\frac{x_2}{n} + \cdots + \varepsilon_i\frac{x_i}{n} + \cdots + \varepsilon_i\frac{x_n}{n}\right] = E\left[\varepsilon_i\frac{x_1}{n}\right] + E\left[\varepsilon_i\frac{x_2}{n}\right] + \cdots + E\left[\varepsilon_i\frac{x_n}{n}\right]$

Note that each term $E\left[\varepsilon_i\frac{x_j}{n}\right]$ for j≠i is 0, and the term $E\left[\varepsilon_i\frac{x_i}{n}\right]$ must be equal to zero by the same arguments as above. As a result, the far right-hand term in equation (7) is zero and therefore,
(9) $E[\hat{\beta}_1] = \beta_1$
The estimate β̂1 is an unbiased estimate of β1; that is, if one were to draw a large number of samples at random and estimate β̂1 each time, the average of all these estimates would be the true value β1.
Please note --- an unbiased estimate does not mean you have the correct estimate – it
simply means that you used a procedure that on average will give you the correct answer.
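One way to see what "unbiased" means in practice is a small simulation: repeat the sampling many times and average the estimates. A sketch, assuming Python with NumPy (the population parameters, sample size, and number of replications are arbitrary choices for the demonstration):

```python
# Monte Carlo illustration of unbiasedness: draw many samples from the same
# population model, estimate beta1_hat in each, and average the estimates.
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, n, reps = 2.0, 0.7, 100, 5000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)                 # assumption (1): E(eps | x) = 0
    y = beta0 + beta1 * x + eps
    estimates[r] = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

print(estimates.mean())   # close to the true beta1 = 0.7
print(estimates.std())    # individual estimates still vary around the truth
```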
Here is another way to think about how the correlation between x and ε would get you
into trouble
From equation (6), divide the numerator and denominator of the right hand term by (n-1)
(10) $\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n-1}\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \beta_1 + \frac{\frac{1}{n-1}\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})(x_i - \bar{x})}{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Notice also in the final term in (10), we use the fact that the numerator can be
written as
$\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x}) = \sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})(x_i - \bar{x})$
The numerator is nothing more than the sample covariance between xi and the ACTUAL error term εi. The denominator is the sample variance of x.
$\hat{\sigma}_{x\varepsilon} = \frac{1}{n-1}\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})(x_i - \bar{x})$

$\hat{\sigma}_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

(11) $\hat{\beta}_1 = \beta_1 + \frac{\hat{\sigma}_{x\varepsilon}}{\hat{\sigma}_x^2}$
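The trouble that equation (11) warns about is easy to see in a simulation: if the errors are correlated with x, the sample covariance term in (11) no longer averages to zero and β̂1 drifts away from the truth. A sketch, assuming Python with NumPy (the parameter values are illustrative, not from the notes):

```python
# What happens when assumption (1) fails: build an error term correlated with x
# and watch beta1_hat move away from the truth by roughly sigma_xe / sigma_x^2,
# as equation (11) predicts.
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, n, reps = 1.0, 0.7, 200, 5000

estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    eps = 0.5 * x + rng.normal(size=n)       # violates E(eps | x) = 0: cov(x, eps) > 0
    y = beta0 + beta1 * x + eps
    estimates[r] = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

print(estimates.mean())   # roughly beta1 + 0.5 = 1.2, not the true 0.7
```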
The Variance of β̂1
Demonstrating $Var(\hat{\beta}_1)$ is the most detailed and complicated derivation we will do all semester. In the end, it is a lot of algebra, but it simply exploits the properties of definitions and expectations we have already used.
(a) Recall the OLS estimate for the slope:

(1) $\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
(b) Recall also that the true underlying relationship between x and y is given by the
equation
(2) $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
(c) To analyze some of the properties of β̂1, we substituted the true value for yi, as
defined by equation (2) into the estimate (1). This substitution leads to the following
result:
(3) $\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
By definition,
(6) $\hat{\beta}_1 - \beta_1 = \frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{SST_x}, \quad \text{where } SST_x = \sum_{i=1}^{n}(x_i - \bar{x})^2$
The variable $SST_x$ is the sum of squared total for x, similar to the SST for y used in the construction of the R².
(7) $Var(\hat{\beta}_1) = E[(\hat{\beta}_1 - \beta_1)^2] = E\left[\left(\frac{\sum_{i=1}^{n}\varepsilon_i(x_i - \bar{x})}{SST_x}\right)^2\right] = E\left[\left(\frac{\sum_{i=1}^{n}\varepsilon_i x_i}{SST_x}\right)^2\right]$

where, in the far right term and in the rest of this derivation, $x_i$ is written as shorthand for the deviation from the mean, $(x_i - \bar{x})$.
(8) $Var(\hat{\beta}_1) = \frac{1}{SST_x^2}E\left[\left(\sum_{i=1}^{n}\varepsilon_i x_i\right)^2\right]$
Let's work with the expectation in the far right-hand term in equation (8) and expand the square:
(9) $\left(\sum_{i=1}^{n}\varepsilon_i x_i\right)^2 = (\varepsilon_1 x_1 + \varepsilon_2 x_2 + \cdots + \varepsilon_n x_n)^2 = \left[\varepsilon_1^2 x_1^2 + \varepsilon_2^2 x_2^2 + \cdots + \varepsilon_n^2 x_n^2 + 2\varepsilon_1 x_1\varepsilon_2 x_2 + 2\varepsilon_1 x_1\varepsilon_3 x_3 + \cdots + 2\varepsilon_{n-1}x_{n-1}\varepsilon_n x_n\right]$

(10) $E\left[\left(\sum_{i=1}^{n}\varepsilon_i x_i\right)^2\right] = E[\varepsilon_1^2 x_1^2] + E[\varepsilon_2^2 x_2^2] + \cdots + E[\varepsilon_n^2 x_n^2] + E[2\varepsilon_1 x_1\varepsilon_2 x_2] + E[2\varepsilon_1 x_1\varepsilon_3 x_3] + \cdots + E[2\varepsilon_{n-1}x_{n-1}\varepsilon_n x_n]$
Let's look at the terms in equation (10). Consider $E[\varepsilon_j^2 x_j^2]$ for any j = 1, 2, …, n. Recall from assumption (2) above that anytime we see $E[\varepsilon_j^2 x_j^2]$, this reduces to $E[\varepsilon_j^2]x_j^2 = \sigma_\varepsilon^2 x_j^2$ because $E[\varepsilon_i^2|x_i] = E[\varepsilon_i^2]$. Note also that we established above that any time we see $E[\varepsilon_i^2]$, this equals $\sigma_\varepsilon^2$. Therefore, the first n terms in the second line of equation (10), the $E[\varepsilon_j^2 x_j^2]$, equal $\sigma_\varepsilon^2 x_j^2$ for j = 1, 2, …, n.
Next, consider the expectation of the cross terms, $E[2\varepsilon_i\varepsilon_j x_i x_j]$. The 2 is a constant, so it can be brought outside the expectation. By assumption, $x_i$ and $x_j$ are also constants, so they can be brought outside the expectation as well. Therefore, $E[2\varepsilon_i\varepsilon_j x_i x_j] = 2x_i x_j E[\varepsilon_i\varepsilon_j]$. Recall above that we assumed $cov(\varepsilon_i, \varepsilon_j) = 0$, and the definition of $cov(\varepsilon_i, \varepsilon_j)$ is $cov(\varepsilon_i, \varepsilon_j) = E[\varepsilon_i\varepsilon_j] - E[\varepsilon_i]E[\varepsilon_j]$; since $E[\varepsilon_i] = E[\varepsilon_j] = 0$, $cov(\varepsilon_i, \varepsilon_j) = E[\varepsilon_i\varepsilon_j] = 0$. Therefore, all the expectations of the cross terms in (10) are zero. Combining these results,
(11) $Var(\hat{\beta}_1) = \frac{1}{SST_x^2}E\left[\left(\sum_{i=1}^{n}\varepsilon_i x_i\right)^2\right] = \frac{1}{SST_x^2}\left[\sigma_\varepsilon^2 x_1^2 + \sigma_\varepsilon^2 x_2^2 + \sigma_\varepsilon^2 x_3^2 + \cdots + \sigma_\varepsilon^2 x_n^2\right]$
(12) $\left[\sigma_\varepsilon^2 x_1^2 + \sigma_\varepsilon^2 x_2^2 + \sigma_\varepsilon^2 x_3^2 + \cdots + \sigma_\varepsilon^2 x_n^2\right] = \sigma_\varepsilon^2\sum_{i=1}^{n}x_i^2 = \sigma_\varepsilon^2\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sigma_\varepsilon^2 SST_x$
And therefore:
(13) $Var(\hat{\beta}_1) = \frac{1}{SST_x^2}\sigma_\varepsilon^2 SST_x = \frac{\sigma_\varepsilon^2}{SST_x} = \frac{\sigma_\varepsilon^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Notice that the definition of (13) includes $\sigma_\varepsilon^2$, which is the Var(εi). Unfortunately, we do not know $\sigma_\varepsilon^2$, so it must be estimated:
(14) $\hat{\sigma}_\varepsilon^2 = \frac{\sum_{i=1}^{n}\hat{\varepsilon}_i^2}{n-k-1} = \frac{SSR}{n-k-1}$
Where k is the number of x’s included in the model. Thus in the simple bivariate model, k=1
and the degrees of freedom in the denominator is n-2.
(15) $Est.Var(\hat{\beta}_1) = \frac{\hat{\sigma}_\varepsilon^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
As with all variances, the units of measure on (15) are in β̂1 squared units so we need to take the
square root. The square root of this variance is typically called the “Standard error”
(16) $se(\hat{\beta}_1) = \frac{\hat{\sigma}_\varepsilon}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{1/2}}$
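To close the loop, the sketch below, assuming Python with NumPy (the population values, sample size, and replication count are arbitrary choices for the demonstration), computes σ̂² and the standard error from (14)-(16) for one sample and then compares the result with the spread of β̂1 across repeated samples; the two numbers should be in the same ballpark.

```python
# Estimated standard error from (14)-(16) versus the Monte Carlo spread of beta1_hat.
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, sigma_eps, n = 1.0, 0.7, 2.0, 200

def beta1_hat(x, y):
    return np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)

# One sample: sigma2_hat = SSR/(n - k - 1) with k = 1, then se(beta1_hat)
x = rng.normal(size=n)
y = beta0 + beta1 * x + rng.normal(scale=sigma_eps, size=n)
b1 = beta1_hat(x, y)
b0 = y.mean() - b1 * x.mean()
ehat = y - b0 - b1 * x
sigma2_hat = np.sum(ehat ** 2) / (n - 2)                   # equation (14) with k = 1
se_b1 = np.sqrt(sigma2_hat / np.sum((x - x.mean()) ** 2))  # equations (15)-(16)
print(se_b1)

# Many samples: the standard deviation of beta1_hat should be close to se_b1
reps = 5000
draws = np.empty(reps)
for r in range(reps):
    xr = rng.normal(size=n)
    yr = beta0 + beta1 * xr + rng.normal(scale=sigma_eps, size=n)
    draws[r] = beta1_hat(xr, yr)
print(draws.std())
```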