Simple Linear Regression Model 3
Tobias Broer
Recap 1: The OLS estimator and its algebraic properties
I We derived β̂1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)² using the method of moments and OLS; this also implies β̂0 = ȳ − β̂1 x̄.
I We showed algebraic properties of OLS estimates that hold for any
sample, independently of the true model.
I And derived R² as a measure of goodness of fit.
I We also discussed changing the units of x and y , and using log(y), log(x).
Recap 2: Statistical properties of the OLS estimator
I Is β̂1 a “good” estimator, i.e. unbiased and “efficient” (low variance)?
I We made 5 assumptions:
  SLR.1: Model is linear (in parameters)
  SLR.2: Random Sampling
  SLR.3: Sample Variation in the Explanatory Variable
  SLR.4: Zero Conditional Mean, E[u|x] = 0
  SLR.5: Homoskedasticity, or Constant Variance, Var(u|x) = σ²
I We showed that SLR.1-4 imply E[β̂1] = β1. For wi = (xi − x̄)/SSTx:
    E(β̂1) = E[ Σᵢ (xi − x̄) yi / Σᵢ (xi − x̄)² ] = E[ β1 + Σᵢ wi ui ] = β1 + Σᵢ wi E(ui) = β1
Recap 3: Variance of the OLS estimator
I SLR.1-5 imply Var(β̂1) = σ²/SSTx
I Taking the xi as fixed:
    Var(β̂1) = Var( Σᵢ wi ui ) = Σᵢ Var(wi ui)   (2)
             = Σᵢ wi² Var(ui) = Σᵢ wi² σ² = σ² Σᵢ wi²   (3)
  where (2) follows from SLR.2 (independence of ui and uj for any i ≠ j), and (3) from SLR.5 (homoskedasticity) and fixed xi
I But: Σᵢ wi² = Σᵢ (xi − x̄)²/(SSTx)² = SSTx/(SSTx)² = 1/SSTx, so Var(β̂1) = σ²/SSTx
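A quick Monte Carlo sketch (not from the slides; all parameter values below are illustrative) of these two recap results: with the xi held fixed across replications, the average of β̂1 over many simulated samples should be close to β1 (unbiasedness under SLR.1-4), and its sampling variance close to σ²/SSTx (SLR.1-5).

```python
# Illustrative simulation: unbiasedness and variance of the OLS slope estimator.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 3.0, 50, 20_000

x = rng.uniform(0, 10, n)              # regressor values, held fixed across replications
sst_x = np.sum((x - x.mean()) ** 2)    # SST_x, the total sample variation in x
w = (x - x.mean()) / sst_x             # the weights w_i from the slides

estimates = np.empty(reps)
for r in range(reps):
    u = rng.normal(0.0, sigma, n)      # homoskedastic errors with E[u|x] = 0
    y = beta0 + beta1 * x + u
    estimates[r] = np.sum(w * y)       # beta1_hat = sum_i w_i y_i

print("mean of beta1_hat:       ", estimates.mean())   # close to beta1 = 2.0
print("simulated Var(beta1_hat):", estimates.var())    # close to sigma^2 / SST_x
print("theoretical Var:         ", sigma**2 / sst_x)
```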
Recap 4: Factors affecting the variance of β̂1
    Var(β̂1) = σ²/SSTx   (4)
I So the variance of β̂1 is larger the larger the error variance σ², and smaller the larger the total sample variation in x, SSTx.
Recap 5: Expectation and Variance of β̂0
I Under SLR.1-4, E[β̂0] = β0; under SLR.1-5,
    Var(β̂0) = σ² [ x̄² + (1/n) Σᵢ (xi − x̄)² ] / Σᵢ (xi − x̄)²
             = σ² [ x̄² + (1/n) Σᵢ (xi² − x̄²) ] / Σᵢ (xi − x̄)²
             = σ² Σᵢ xi² / ( n Σᵢ (xi − x̄)² )
I Note also that β̂0 = ȳ − β̂1 x̄ and Cov(β̂1, ū) = 0 imply Cov(β̂1, β̂0) = −x̄ Var(β̂1)
Recap 6: Unbiased Estimator of σ²
I The estimator σ̂² = SSR/(n − 2) is unbiased:
    E(σ̂²) = σ².   (5)
I In regression output, it is
    σ̂ = √σ̂² = √(SSR/(n − 2))   (6)
  that is reported; it yields the standard error of β̂1,
    se(β̂1) = σ̂ / √SSTx   (7)
  where both the numerator and the denominator are easily computed from the data.
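A minimal sketch of equations (5)-(7) as code, computing σ̂ and se(β̂1) from a single sample; the function name ols_se and its inputs are placeholders, not from the slides.

```python
# Compute sigma_hat = sqrt(SSR/(n-2)) and se(beta1_hat) = sigma_hat / sqrt(SST_x).
import numpy as np

def ols_se(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    sst_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / sst_x        # OLS slope
    b0 = y.mean() - b1 * x.mean()                  # OLS intercept
    ssr = np.sum((y - b0 - b1 * x) ** 2)           # sum of squared residuals
    sigma_hat = np.sqrt(ssr / (n - 2))             # eq. (6): sqrt of the unbiased variance estimator
    se_b1 = sigma_hat / np.sqrt(sst_x)             # eq. (7): standard error of the slope
    return b0, b1, sigma_hat, se_b1
```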
Recap 7: Where we stand
I We saw that the OLS estimator β̂1 is unbiased, and derived its variance
I Two tasks are left undone:
  I Is the OLS estimator ‘better’ than others? (postponed)
  I How can we test (probabilistic) hypotheses about it?
Interval estimation and hypothesis testing in the SLR
I The OLS estimators β̂1 , β̂0 are random variables; estimates differ across
different random samples.
I We showed: β̂1 , β̂0 are unbiased for β1 , β0 , derived their variance
I Does this allow us to give answers to questions like:
1. How far away from β1 is my estimate β̂1 likely to be?
2. How can I construct a range of values in which the true β1 lies with x
percent probability when constructed on repeated samples?
3. How confident can I be that β1 is different from 0?
I Answer: No! To answer these questions, we need the distribution of β̂
across repeated samples
I The mean and variance are just two features (“moments”) of that distribution; they usually do not characterise it fully.
I (Exception: the Normal distribution, entirely characterised by µ and σ 2 .)
The distribution of u
I NB: since β̂1, β̂0 are linear functions of ui, i = 1, …, n, their distribution depends on the distribution of u.
I So far we only assumed zero conditional mean (SLR.4: E[u|x] = 0) and homoskedasticity (SLR.5: Var[u|x] = σ² ∀x)
I Since u is the sum of many (independent, smaller) factors influencing y, it seems reasonable to assume it is normally distributed (assumption SLR.6):
    u ∼ N(0, σ²)
I This “saves” us: linear functions of normal random variables are normal
I So β̂1, β̂0 are normally distributed (conditional on the xi)!
I And we already know their mean and variance, so we know the whole distribution!
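A simulation sketch of this claim (illustrative parameters, not from the slides): with normal errors and the xi held fixed, the standardized slope (β̂1 − β1)/sd(β̂1) should behave like a standard normal across repeated samples. The snippet compares a few empirical quantiles with N(0, 1) quantiles.

```python
# Sampling distribution of the standardized OLS slope under normal errors.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 3.0, 40, 50_000

x = rng.uniform(0, 10, n)
sst_x = np.sum((x - x.mean()) ** 2)
sd_b1 = sigma / np.sqrt(sst_x)          # true sd(beta1_hat); sigma is known here

z = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    b1 = np.sum((x - x.mean()) * y) / sst_x
    z[r] = (b1 - beta1) / sd_b1

for q in (0.05, 0.50, 0.95):
    print(q, np.quantile(z, q), norm.ppf(q))   # empirical vs. N(0,1) quantiles agree closely
```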
The classical linear regression model (CLR)
I Adding the normality assumption SLR.6, u ∼ N(0, σ²), to SLR.1-5 gives the classical linear regression (CLR) model.
I Conditional on the xi, β̂1 ∼ Normal(β1, σ²/SSTx), and so
    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)
[Figure omitted. Source: Wooldridge]
Suppose we knew σ²: confidence intervals
With c = 1.96, a standard normal falls in [−c, c] with probability 0.95, so with probability 0.95:
    −c ≤ (β̂1 − β1)/sd(β̂1)  ∧  (β̂1 − β1)/sd(β̂1) ≤ c
  ⇔ β1 ≤ β̂1 + c · sd(β̂1)  ∧  β̂1 − c · sd(β̂1) ≤ β1
  ⇔ β1 ∈ [ β̂1 − c · sd(β̂1), β̂1 + c · sd(β̂1) ]   (9)
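A small sketch of eq. (9) as code; ci_known_sigma and its inputs are illustrative placeholders, not from the slides.

```python
# Confidence interval for beta1 when sigma (and hence sd(beta1_hat)) is known.
import numpy as np
from scipy.stats import norm

def ci_known_sigma(b1_hat, sigma, sst_x, alpha=0.05):
    sd_b1 = sigma / np.sqrt(sst_x)       # sd(beta1_hat) = sigma / sqrt(SST_x)
    c = norm.ppf(1 - alpha / 2)          # c = 1.96 for alpha = 0.05
    return b1_hat - c * sd_b1, b1_hat + c * sd_b1
```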
Known σ²: two-sided hypothesis
    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)
6 steps
1. H0: β1 = b, H1: β1 ≠ b (maintained: SLR.1-3, 6, known σ²)
2. Significance level α = 0.05; test statistic T = (β̂1 − b)/sd(β̂1).
3. Sampling distribution of T under H0: N(0, 1)
4. Critical value c = 1.96 s.t. Prob(|T| > c | β1 = b) = 0.05; this defines the rejection region T < −c ∨ T > c.
5. Calculate t, the value of T on the random sample at hand.
6. Reject H0 if t is in the rejection region.
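The six steps written as a small function (a sketch under the known-σ² assumption; the inputs are placeholders for a given sample and null value b).

```python
# Two-sided test of H0: beta1 = b against H1: beta1 != b with known sigma.
import numpy as np
from scipy.stats import norm

def two_sided_test(b1_hat, b, sigma, sst_x, alpha=0.05):
    sd_b1 = sigma / np.sqrt(sst_x)
    t_stat = (b1_hat - b) / sd_b1        # step 2: test statistic
    c = norm.ppf(1 - alpha / 2)          # step 4: critical value (1.96 at alpha = 0.05)
    reject = abs(t_stat) > c             # steps 5-6: reject if |T| > c
    return t_stat, c, reject
```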
The normal distribution
Known σ²: one-sided hypothesis
    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)
1. H0: β1 ≤ b, H1: β1 > b
2. Significance level α = 0.05; test statistic T = (β̂1 − b′)/sd(β̂1).
3. But: to derive the distribution of T and the critical value c, we need some numerical value b′ consistent with H0, i.e. b′ ≤ b
4. Which one to choose?
5. Note the definition of the significance level α: “the probability of a Type I error (rejecting H0 although it is true), equal to Prob(T ≥ c | β1 = b′ ≤ b), is at most α”
6. But: for any critical value c, Prob(T ≥ c | β1 = b′) is increasing in b′
7. So choose b′ = argmax_{b′ ≤ b} Prob(T ≥ c | β1 = b′) = b
8. In words: “Replace H0: β1 ≤ b with H0′: β1 = b to get the most stringent test compatible with H0”
Known σ²: one-sided hypothesis
    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)
6 steps
1. H0: β1 ≤ b, H1: β1 > b
2. Significance level α = 0.05; test statistic T = (β̂1 − b)/sd(β̂1).
   NB: b is the highest value of β1 compatible with H0, and it maximises the probability of rejection under H0 among β1 ≤ b
3. Sampling distribution of T with β1 = b: N(0, 1)
4. Critical value c = 1.645 s.t. Prob(T > c | β1 = b) = 0.05; this defines the rejection region T > c.
5. Calculate t, the value of T on the random sample.
6. Reject H0 if t is in the rejection region.
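The same logic for the one-sided test, again a hedged sketch with placeholder inputs; the only changes are the upper-tail critical value (1.645 at α = 0.05) and the one-sided rejection region.

```python
# One-sided test of H0: beta1 <= b against H1: beta1 > b with known sd(beta1_hat).
from scipy.stats import norm

def one_sided_test(b1_hat, b, sd_b1, alpha=0.05):
    t_stat = (b1_hat - b) / sd_b1
    c = norm.ppf(1 - alpha)              # 1.645 for alpha = 0.05
    return t_stat, c, t_stat > c         # reject H0 if T > c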
What if we don’t know σ²?
1. Again use σ̂² = Σᵢ ûi² / (N − 2)
2. Again, this gives the “standard error” of β̂1 as an estimator of its standard deviation, equal to se(β̂1) = σ̂ / √SSTx
3. But: σ̂² is a random variable itself! So T = (β̂1 − b)/se(β̂1) is NOT normally distributed
Recap: The χ² distribution
[Figure omitted. Source: Wikipedia]
Recap: The t distribution
[Figure omitted. Source: Wikipedia]
Recap: The t distribution
[Figure omitted. Source: Wooldridge]
THEOREM (t Distribution for Standardized Estimators)
Under the CLR assumptions (SLR.1-6),
    (β̂1 − β1)/se(β̂1) ∼ t_{n−2} = t_df,
i.e. the standardized estimator is distributed t_{n−2}, with n − 2 degrees of freedom.
I So it is replacing σ (an unknown constant) with σ̂ (an estimator that varies across samples) that takes us from the standard normal to the t distribution.
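A sketch of the resulting two-sided t test (placeholder inputs; se_b1 is computed as in eq. (7)).

```python
# t test for the slope: critical value and p-value from the t distribution with n-2 df.
from scipy.stats import t as t_dist

def t_test_slope(b1_hat, b, se_b1, n, alpha=0.05):
    t_stat = (b1_hat - b) / se_b1
    c = t_dist.ppf(1 - alpha / 2, df=n - 2)          # critical value from t_{n-2}
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    return t_stat, c, p_value, abs(t_stat) > c
```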
Testing hypotheses about β1
    lwage-hat = 1.142 + .0993 educ   (10)
                (.109)   (.0081)
[Regression output omitted. Source: Wooldridge]
The t distribution
[Figure omitted. Source: Wooldridge]
EXAMPLE: Return to Education Using WAGE2.DTA
    lwage-hat = 1.142 + .0993 educ   (12)
                (.109)   (.0081)
[Regression output omitted. Source: Wooldridge]
Three (equivalent) ways to test H0: β1 = 0 against H1: β1 ≠ 0:
1. Test the hypothesis explicitly as usual
2. Check whether the p-value < α. If so, reject; if not, do not reject.
3. Check the confidence interval: if 0 ∈ CI_{1−α}, do not reject; otherwise reject.
Proof:
    0 ∈ CI_{95} = [ β̂ − se(β̂) t^{N−2}_{1−α/2}, β̂ + se(β̂) t^{N−2}_{1−α/2} ]
  ⇔ β̂ − se(β̂) t^{N−2}_{1−α/2} < 0  ∧  0 < β̂ + se(β̂) t^{N−2}_{1−α/2}
  ⇔ β̂ < se(β̂) t^{N−2}_{1−α/2}  ∧  β̂ > −se(β̂) t^{N−2}_{1−α/2}
  ⇔ −t^{N−2}_{1−α/2} < (β̂ − 0)/se(β̂) < t^{N−2}_{1−α/2}
  ⇔ | (β̂ − 0)/se(β̂) | < t^{N−2}_{1−α/2}
So (β̂ − 0)/se(β̂) is not in the rejection region of H0 against H1.
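Worked numbers for the wage example, as a sketch: the coefficient .0993 and standard error .0081 are taken from eq. (10); the sample size n = 935 is the usual size of Wooldridge's WAGE2 data but is an assumption here (with several hundred degrees of freedom the t and normal critical values are essentially identical anyway). All three ways lead to the same conclusion.

```python
# H0: beta_educ = 0 in the WAGE2 regression, tested three equivalent ways.
from scipy.stats import t as t_dist

b1_hat, se_b1, n = 0.0993, 0.0081, 935                # n assumed; see lead-in
t_stat = b1_hat / se_b1                               # approx. 12.3 -> way 1: |t| > c, reject
c = t_dist.ppf(0.975, df=n - 2)                       # approx. 1.96
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)        # way 2: p-value far below 0.05, reject
ci = (b1_hat - c * se_b1, b1_hat + c * se_b1)         # way 3: 0 is not in the 95% CI, reject
print(t_stat, c, p_value, ci)
```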
Prediction Intervals
I More generally, suppose we would like to make a prediction ỹi about yi for a given xi (not necessarily in the sample).
I If we knew the true coefficients, a natural prediction would be ỹi = E[y|xi] = β0 + β1 xi.
I The true yi would equal ỹi + ui, so it would have a distribution around ỹi
I Problem: we don’t know β0, β1. A natural candidate for ỹi is the OLS prediction ŷi = β̂0 + β̂1 xi = ȳ + β̂1 (xi − x̄)
I Questions:
  1. Is ŷi on average correct, in the sense that E[ŷi] = E[y|xi], or E[ei] = E[yi − ŷi] = 0?
  2. If so, how large is the variance of the prediction error ei = yi − ŷi?
  3. Can we construct confidence intervals for yi?
I Two sources of uncertainty:
  1. Uncertainty about the true parameters β0 and β1 (we only have the estimates β̂0 and β̂1). More data will reduce this.
  2. Uncertainty about the error ui that will be drawn. More data will not reduce this.
Prediction Intervals
I More generally, there are two kinds of intervals we can be interested in.
1. The first is just a confidence interval for E (y |xi ) given xi .
2. The second is a confidence interval for y given xi : this also has the
(estimated) randomness from the error u - this would remain even if we
knew β0 and β1 !
Two confidence intervals
Prediction Intervals
I Write the prediction error ei for the OLS prediction ŷi = β̂0 + β̂1 xi as
    ei = yi − ŷi   (22)
       = β0 + β1 xi + ui − (β̂0 + β̂1 xi)   (23)
       = (β0 − β̂0) + (β1 − β̂1) xi + ui   (24)
       = [ E[y|x̄] − β1 x̄ − (ȳ − β̂1 x̄) ] + (β1 − β̂1) xi + ui   (25)
       = (E[y|x̄] − ȳ) + (β1 − β̂1)(xi − x̄) + ui   (26)
I (25) follows from
  1. the conditional expectation of y given the sample mean of x: E[y|x̄] = β0 + β1 x̄ + 0, implying β0 = E[y|x̄] − β1 x̄
  2. the average sample regression function: β̂0 = ȳ − β̂1 x̄
Prediction Intervals
I NB: 1/N and (xi − x̄)²/SSTx both converge to 0 as N grows large, so the part of the prediction-error variance that comes from estimating β0 and β1 vanishes in large samples, while the variance of the single draw ui does not.
Prediction Intervals
I The prediction-error variance is Var(ei) = σ² [ 1 + 1/N + (xi − x̄)²/SSTx ]. Thus, we can construct confidence intervals for yi | xi using the t-distribution:
    T = [ (ŷi − yi) / √( σ² [1 + 1/N + (xi − x̄)²/SSTx] ) ] / √(σ̂²/σ²)
      = (ŷi − yi) / √( σ̂² [1 + 1/N + (xi − x̄)²/SSTx] ) ∼ t_{n−2}   (28)
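A sketch of eq. (28) in code, also returning the narrower interval for E[y|x0] (which drops the leading 1 in the bracket, i.e. the uncertainty from the single draw of u); the slides display only the prediction-interval version, and the helper name is illustrative.

```python
# 95% interval for E[y|x0] and 95% prediction interval for y at x0, simple regression.
import numpy as np
from scipy.stats import t as t_dist

def prediction_intervals(x, y, x0, alpha=0.05):
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    sst_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * y) / sst_x
    b0 = y.mean() - b1 * x.mean()
    sigma_hat2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
    y0_hat = b0 + b1 * x0
    c = t_dist.ppf(1 - alpha / 2, df=n - 2)
    var_mean = sigma_hat2 * (1 / n + (x0 - x.mean()) ** 2 / sst_x)      # uncertainty about E[y|x0]
    var_pred = sigma_hat2 * (1 + 1 / n + (x0 - x.mean()) ** 2 / sst_x)  # adds the draw of u, eq. (28)
    ci_mean = (y0_hat - c * np.sqrt(var_mean), y0_hat + c * np.sqrt(var_mean))
    pred_int = (y0_hat - c * np.sqrt(var_pred), y0_hat + c * np.sqrt(var_pred))
    return y0_hat, ci_mean, pred_int
```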
Prediction Intervals
[Regression output omitted. Source: Wooldridge]
Prediction Intervals
I The prediction is 2.99. But we want the CI, which goes from 2.87 to 3.11.
This is fairly tight, but this is just for ŷ |xi , the average colGPA for those
with hsGPA = 3, ACT = 25, and skipped = 0.
I In other words: this misses the uncertainty from the error u - this would
remain even if we knew β0 and β1 !
I In most cases, except for very small sample sizes, σ̂² is by far the largest component. Remember that se(ŷ) shrinks to zero at the rate 1/√n, while σ̂² is the estimated variance of a single draw.
Prediction Intervals
[Regression output omitted. Source: Wooldridge]
Summary
ȳ and β̂1 are independent
I ȳ = β0 + β1 x̄ + ū = E[y|x̄] + ū
I Since both ū and β̂1 are normally distributed, it suffices to show that Cov(ū, β̂1) = 0
  1. β̂1 = β1 + Σᵢ wi ui, where wi = (xi − x̄)/SSTx
  2. Cov(ū, β̂1) = E[ ū (β̂1 − β1) ] = E[ (1/n) Σⱼ uⱼ · Σᵢ wi ui ]
     = 0 + (1/n) E[ Σᵢ wi ui² ]   (cross terms have zero expectation)
     = (1/n) σ² Σᵢ wi = 0   (since Σᵢ wi = 0)
I More restrictive question: when is the OLS estimator most efficient among linear unbiased estimators, i.e. the best linear unbiased estimator (“BLUE”)?
I We show that this is always the case under assumptions SLR.1-5
I Strategy of the proof:
  1. Impose linearity and unbiasedness on an arbitrary alternative linear estimator β̃1, and derive the restrictions this implies
  2. Show that the difference between the variances of β̃1 and β̂1 must be non-negative
When is the OLS estimator “BLUE”? The Gauss-Markov Theorem
I Linearity in yi: β̃1 = Σᵢ wi yi, for some weights {wi} that are functions of {xi}, so
    β̃1 = β0 Σᵢ wi + β1 Σᵢ wi xi + Σᵢ wi ui   (29)
I Unbiasedness: E[β̃1 | X] = β0 Σᵢ wi + β1 Σᵢ wi xi + 0 = β1
I This implies Σᵢ wi = 0, Σᵢ wi xi = 1, and hence Σᵢ wi (xi − x̄) = 1
I Now consider ∆V ≡ Var(β̃1) − Var(β̂1) = σ² ( Σᵢ wi² − 1/SSTx )
I Multiply 1/SSTx by [ Σᵢ wi (xi − x̄) ]² = 1:
    ∆V = σ² [ Σᵢ wi² − ( Σᵢ wi (xi − x̄) )² / SSTx ]
I Σᵢ wi (xi − x̄)/SSTx = γ̂wx, the regression coefficient of wi on (xi − x̄), so
    ∆V = σ² [ Σᵢ wi² − γ̂wx² SSTx ] = σ² Σᵢ [ wi² − (γ̂wx (xi − x̄))² ]
I But since Σᵢ wi γ̂wx (xi − x̄) = γ̂wx Σᵢ wi (xi − x̄) = γ̂wx · γ̂wx SSTx = Σᵢ (γ̂wx (xi − x̄))², we have
    ∆V = σ² Σᵢ [ wi − γ̂wx (xi − x̄) ]²
I This is just σ² times the sum of squared residuals from a regression of wi on (xi − x̄), so it is non-negative. QED!
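An illustration (not a proof, and not from the slides) of the Gauss-Markov conclusion: compare OLS with another linear unbiased estimator of β1, here the "endpoint" slope (y_n − y_1)/(x_n − x_1), which satisfies the linearity and unbiasedness restrictions above but has a larger variance. Parameters are illustrative.

```python
# Simulation: OLS vs. an alternative linear unbiased slope estimator (endpoints only).
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma, n, reps = 1.0, 2.0, 3.0, 50, 20_000

x = np.sort(rng.uniform(0, 10, n))       # fixed x, sorted so x[0], x[-1] are the extremes
sst_x = np.sum((x - x.mean()) ** 2)

ols, alt = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    ols[r] = np.sum((x - x.mean()) * y) / sst_x
    alt[r] = (y[-1] - y[0]) / (x[-1] - x[0])   # linear in y and unbiased, but inefficient

print("means:    ", ols.mean(), alt.mean())    # both close to beta1 (unbiased)
print("variances:", ols.var(), alt.var())      # OLS variance is smaller (BLUE)
```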