
Introduction to Econometrics

INFERENCE IN THE SIMPLE REGRESSION MODEL

Tobias Broer

March 20, 2024


Outline of today’s lecture

1. Recap from last session


2. Recap: Confidence intervals and hypothesis testing
3. The classical simple linear regression model
4. Confidence intervals and hypothesis testing in the CSLRM
5. The simple linear regression model in matrix notation
Recap 1: OLS 'mechanics'

- We derived the OLS estimator of β1,

    β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  =  Sample Covariance(xi, yi) / Sample Variance(xi)        (1)

  using the method of moments and OLS. This also implies β̂0 = ȳ − β̂1 x̄.
- We showed algebraic properties of OLS estimates that hold for any sample, independently of the true model.
- We derived R² as a measure of goodness of fit.
- We also discussed changing the units of x and y, and using log(y), log(x).
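As a concrete illustration of equation (1), a minimal Python sketch (not part of the original slides; the sample size and parameter values are made-up assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated sample from y = beta0 + beta1*x + u; all names and values are illustrative.
    n, beta0, beta1, sigma = 200, 1.0, 0.5, 2.0
    x = rng.normal(5.0, 2.0, n)
    u = rng.normal(0.0, sigma, n)
    y = beta0 + beta1 * x + u

    # OLS slope = sample covariance(x, y) / sample variance(x), as in equation (1)
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()
    print(beta1_hat, beta0_hat)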
Recap 2: Statistical properties of the OLS estimator

- Is β̂1 a "good" estimator, i.e. unbiased and "efficient" (low variance)?
- We made 5 assumptions:
  SLR.1: Model is linear (in parameters)
  SLR.2: Random sampling
  SLR.3: Sample variation in the explanatory variable
  SLR.4: Zero conditional mean
  SLR.5: Homoskedasticity, or constant variance: Var(u|x) = σ²
- We showed that SLR.1-4 imply E[β̂1] = β1. For wi = (xi − x̄)/SSTx:

    E(β̂1) = E[ Σ (xi − x̄) yi / Σ (xi − x̄)² ] = E[ β1 + Σ wi ui ] = β1 + Σ wi E(ui) = β1

  So β̂1 is an unbiased estimator of β1.

Recap 3: Variance of the OLS estimator

- SLR.1-5 imply Var(β̂1) = σ²/SSTx.
- Proof: take β̂1 = β1 + Σ wi ui, where wi = (xi − x̄)/SSTx.
- Taking the xi as fixed,

    Var(β̂1) = Var( Σ wi ui ) = Σ Var(wi ui)                         (2)
             = Σ wi² Var(ui) = Σ wi² σ² = σ² Σ wi²                   (3)

  where (2) follows from SLR.2 (independence of ui and uj for any i ≠ j), and (3) from SLR.5 (homoskedasticity) and fixed xi.
- But Σ wi² = Σ (xi − x̄)²/(SSTx)² = SSTx/(SSTx)² = 1/SSTx.
Recap 4: Factors affecting the variance of β̂1

    Var(β̂1) = σ²/SSTx                                               (4)

- A larger error variance σ² increases Var(β̂1).
- More variation in {xi} (a larger SSTx) decreases it.
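A Monte Carlo sketch of these two recaps (illustrative names and parameter values only): re-drawing the errors for a fixed set of x's, the simulated mean of β̂1 should be close to β1 and its simulated variance close to σ²/SSTx.

    import numpy as np

    rng = np.random.default_rng(1)
    n, beta0, beta1, sigma = 100, 1.0, 0.5, 2.0
    x = rng.normal(5.0, 2.0, n)          # keep x fixed across replications
    sst_x = np.sum((x - x.mean()) ** 2)

    reps = 5000
    slopes = np.empty(reps)
    for r in range(reps):
        u = rng.normal(0.0, sigma, n)    # homoskedastic normal errors (SLR.5/6)
        y = beta0 + beta1 * x + u
        slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / sst_x

    print("mean of beta1_hat:", slopes.mean(), "(true beta1 =", beta1, ")")
    print("var of beta1_hat: ", slopes.var(), "(sigma^2/SST_x =", sigma**2 / sst_x, ")")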
Recap 5: Expectation and variance of β̂0

- β̂0 = ȳ − β̂1 x̄ = β0 + (β1 − β̂1) x̄ + ū
- This implies
  1. E[β̂0] = β0
  2. Var(β̂0) = x̄² Var(β̂1) + Var(ū) − 2 x̄ Cov(β̂1, ū)
             = σ² x̄²/SSTx + σ²/n + 0
             = σ² [ x̄² + (1/n) Σ (xi − x̄)² ] / Σ (xi − x̄)²
             = σ² Σ xi² / [ n Σ (xi − x̄)² ]
  3. Note also that β̂0 = ȳ − β̂1 x̄ and Cov(β̂1, ū) = 0 imply Cov(β̂1, β̂0) = −x̄ Var(β̂1).
Recap 6: Unbiased estimator of σ²

Under Assumptions SLR.1 to SLR.5, and conditional on {x1, ..., xn},

    E(σ̂²) = σ².                                                     (5)

- In regression output, it is

    σ̂ = √σ̂² = √( SSR/(n − 2) )                                      (6)

  that is usually reported. This is an estimator of sd(u), the standard deviation of the population error.
- We just plug σ̂ in for σ:

    se(β̂1) = σ̂ / √SSTx                                              (7)

  where both the numerator and denominator are easily computed from the data.
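A short Python sketch of equations (5)-(7) on simulated data (all names and values are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(2)
    n, beta0, beta1, sigma = 200, 1.0, 0.5, 2.0
    x = rng.normal(5.0, 2.0, n)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)

    sst_x = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
    b0 = y.mean() - b1 * x.mean()

    resid = y - (b0 + b1 * x)
    sigma2_hat = np.sum(resid ** 2) / (n - 2)   # unbiased estimator of sigma^2, eq. (5)
    sigma_hat = np.sqrt(sigma2_hat)             # standard error of the regression, eq. (6)
    se_b1 = sigma_hat / np.sqrt(sst_x)          # se(beta1_hat), eq. (7)
    print(sigma_hat, se_b1)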
Recap 6: Units of measurement and functional form

- Consider the SLR model y = β0 + β1 x + u.
- When we want to know the coefficients β0', β1' for y' = ay, or x' = bx, we can just substitute y = y'/a or x = x'/b in our SLR model and solve for y' = β0' + β1' x' + u'.
- Instead of x and y we may want to use log(y) or log(x) if we think that the effects of x on y are more reasonably constant in 'percentage terms'.
- The coefficient β1 can then be interpreted as follows:

  Model        Dep. Var.   Indep. Var.   Interpretation of β1
  Level-Level  y           x             ∆y = β1 ∆x
  Level-Log    y           log(x)        ∆y = (β1/100) %∆x
  Log-Level    log(y)      x             %∆y = (100 β1) ∆x
  Log-Log      log(y)      log(x)        %∆y = β1 %∆x (an elasticity)
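- As a concrete illustration (anticipating the WAGE2.DTA example used later in these slides): in the log-level model lwage = β0 + β1 educ + u, an estimate β̂1 = .0993 says that one additional year of education is associated with roughly a 100 × .0993 ≈ 9.9 percent higher wage.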
Recap 7: Binary RHS variables

- We considered the case where x ∈ {0, 1}, with 1 corresponding, e.g., to 'treatment' in a randomised controlled trial (RCT).
- Individual treatment effect: tei = yi(1) − yi(0) = τate + [ui(1) − ui(0)], where yi(1) = α1 + ui(1) and τate = α1 − α0.
- Then yi = α0 + τate xi + ui, where ui = ui(0) + [ui(1) − ui(0)] xi, for τate the 'average treatment effect'.
- As long as E(ui|xi) = 0, regressing y on x gives an unbiased estimator of τate.
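A small simulation sketch (illustrative names and effect sizes, not from the slides): with a binary regressor, the OLS slope equals the treatment-control difference in sample means, which estimates τate when E(u|x) = 0.

    import numpy as np

    rng = np.random.default_rng(3)
    n, alpha0, tau_ate = 1000, 2.0, 0.7          # assumed control mean and treatment effect
    x = rng.integers(0, 2, n)                    # randomised binary treatment
    y = alpha0 + tau_ate * x + rng.normal(0.0, 1.0, n)

    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    diff_in_means = y[x == 1].mean() - y[x == 0].mean()
    print(slope, diff_in_means)                  # numerically identical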
Recap 8: The normal distribution and its ‘relatives’

[see recap slides]


Rest of today’s lecture

1. The classical simple linear regression model


2. Confidence intervals and hypothesis testing in the CSLRM
This Lecture

- We saw that the OLS estimator of β1 is unbiased, and derived its variance.
- Two tasks are left undone:
  - Is the OLS estimator 'better' than others? (postponed)
  - How can we test (probabilistic) hypotheses about it?
Interval estimation and hypothesis testing in the SLR

- The OLS estimators β̂1, β̂0 are random variables; estimates differ across different random samples.
- We showed that β̂1, β̂0 are unbiased for β1, β0, and derived their variances.
- Does this allow us to answer questions like:
  1. How far away from β1 is my estimate β̂1 likely to be?
  2. How can I construct a range of values in which the true β1 lies with x percent probability when constructed on repeated samples?
  3. How confident can I be that β1 is different from 0?
- Answer: No! To answer these questions, we need the distribution of β̂ across repeated samples.
- The mean and variance are just two features ("moments") of that distribution. They usually do not characterise it fully.
- (Exception: the Normal distribution, which is entirely characterised by µ and σ².)
The distribution of u

- NB: since β̂1, β̂0 are linear functions of ui, i = 1, ..., n, their distribution depends on the distribution of u.
- So far we only assumed zero conditional mean (SLR.4: E[u|x] = 0) and homoskedasticity (SLR.5: Var[u|x] = σ² for all x).
- Since u is the sum of many (independent, smaller) factors influencing y, it seems reasonable to assume it is normally distributed: u ∼ N(0, σ²).
- This "saves" us: linear functions of normal random variables are normal.
- So β̂1, β̂0 are normally distributed!
- And we already know their mean and variance, so we know the whole distribution!
The classical linear regression model (CLR)

- SLR.6: Normal errors.
  Conditional on x, the errors u are normally distributed:

    u ∼ N(0, σ²) for all x                                          (8)

- Note: SLR.6 implies SLR.4 (zero conditional mean) and SLR.5 (homoskedasticity).
- Motivated, e.g., by u being the sum of many independent smaller factors influencing y.
The classical linear regression model (CLR)

- Does SLR.6 help us to derive the distribution of β̂1? Yes!
- β̂1 = β1 + Σ (xi − x̄) ui / SSTx, where SSTx = Σ (xi − x̄)².
- Conditional on {xi}, each weight (xi − x̄)/SSTx is just a number.
- Under SLR.2 (random sampling) and SLR.6, Σ (xi − x̄) ui / SSTx is just a weighted sum of i.i.d. normal variables ui, so it also has a normal distribution.
- So β̂1 ∼ N(β1, σ²/SSTx) (since we already derived the expected value and variance).
THEOREM (Normal Sampling Distributions)

Under SLR.1, SLR.2, SLR.3 and SLR.6 (and conditional on the sample outcomes of the explanatory variables),

    β̂1 ∼ Normal[β1, Var(β̂1)]

and so

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

- Does this help us? Yes, but: we don't know σ².
- Postpone this problem: suppose first that we know σ².
Suppose we knew σ²: confidence intervals

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

- With known σ², we know sd(β̂1).
- So we know the probability p that x = (β̂1 − β1)/sd(β̂1) ∈ [−c, +c] from the normal tables, which tabulate the CDF F(x) for x ∼ N(0, 1).
The normal distribution

The tabulated normal distribution

[Table of the standard normal CDF. Source: Wooldridge]

- Gives the CDF F(z) = prob(x ≤ z) for x ∼ N(0, 1) and different values of z.
Suppose we knew σ²: confidence intervals

- Find c such that prob(x ∈ [−c, +c]) = p, for x = (β̂1 − β1)/sd(β̂1) ∼ N(0, 1).
- NB: symmetry implies 1 − F(c) = (1 − p)/2.
- E.g. p = 0.95 (95 percent):
  - Look for c s.t. prob(x ≤ c) = F(c) = 1 − (1 − p)/2 = 0.975.
  - This yields c = 1.96.
  - It implies prob(x > c) = 1 − F(c) = 0.025.
  - Symmetry implies prob(x > c) = prob(x < −c), which yields prob(x > c ∨ x < −c) = 0.05.
  - So we have p = prob(−c ≤ x ≤ c) = 1 − 0.05 = 0.95.
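The same lookup can be checked in one line (a sketch using scipy, not part of the slides):

    from scipy.stats import norm

    p = 0.95
    c = norm.ppf(1 - (1 - p) / 2)          # inverse CDF at 0.975
    print(c)                               # roughly 1.96
    print(norm.cdf(c) - norm.cdf(-c))      # roughly 0.95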
Suppose we knew σ²: confidence intervals

- We have identified c s.t. prob(−c ≤ (β̂1 − β1)/sd(β̂1) ≤ c) = p = 0.95.
- But this event depends on the true β1! OK, but (writing ∧ for "and"):
- We can write the event (β̂1 − β1)/sd(β̂1) ∈ [−c, +c] as

    −c ≤ (β̂1 − β1)/sd(β̂1)   ∧   (β̂1 − β1)/sd(β̂1) ≤ c
    ⇔ β1 ≤ β̂1 + c · sd(β̂1)   ∧   β̂1 − c · sd(β̂1) ≤ β1
    ⇔ β1 ∈ [β̂1 − c · sd(β̂1), β̂1 + c · sd(β̂1)]                      (9)

- CI95,β1 = [β̂1 − c · sd(β̂1), β̂1 + c · sd(β̂1)] is a "95 percent confidence interval" for β1.
- NB: across repeated samples, β1 will lie in the different CI95,β1's constructed in this fashion 95 percent of the time.
- It is wrong to say that β1 lies in any given CI95,β1 with a probability (because β1 is just a number, although unknown).
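A Monte Carlo sketch of this repeated-sampling interpretation (illustrative parameter values; σ is treated as known here, as on this slide): across many simulated samples, roughly 95 percent of the intervals constructed this way contain the true β1.

    import numpy as np

    rng = np.random.default_rng(4)
    n, beta0, beta1, sigma = 100, 1.0, 0.5, 2.0
    x = rng.normal(5.0, 2.0, n)
    sst_x = np.sum((x - x.mean()) ** 2)
    sd_b1 = sigma / np.sqrt(sst_x)         # sigma known => sd(beta1_hat) known
    c = 1.96

    covered = 0
    reps = 5000
    for r in range(reps):
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
        covered += (b1 - c * sd_b1 <= beta1 <= b1 + c * sd_b1)

    print(covered / reps)                  # close to 0.95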
Suppose we knew σ²: hypothesis testing

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

- Sometimes we want to know how likely it is that β1 = b, given our sample.
- But this makes no sense: β1 is just a number (albeit an unknown one).
- Classical statistics deals with this by testing "hypotheses", e.g. β1 = b.
- Idea: design rules for rejecting a hypothesis such that, across repeated samples, there is a particular probability of "false rejection".
Known σ²: hypothesis testing

- "Null hypothesis" H0 about, say, β1:
  - a simple hypothesis: β1 = b
  - or a composite hypothesis: β1 ≤ b (or ≥, of course)
- On the basis of a data sample, we want to "not reject" H0, or reject it in favor of H1: β1 ≠ b (two-sided test, for a simple hypothesis) or β1 > b (one-sided test).
- We can make two types of errors:
  - Type 1 error (reject a true H0)
  - Type 2 error (accept a wrong H0)
- We need strong evidence to reject the null. For this, choose a small significance level, defined as the probability of a Type 1 error across repeated samples, usually 10, 5 or 1 percent.
- Contrast this with the "power" of a test: one minus the probability of a Type 2 error.
- To test H0 we need a test statistic (i.e. a function of the random sample) and a critical value.
Hypothesis testing: 6 steps

1. Formulate H0 and H1 (and state maintained assumptions.)


2. Choose significance level α and test statistic T .
3. Derive sampling distribution of T under assumption that H0 is true.
4. Calculate critical value c that defines rejection region.
5. Calculate t, the value of T on the random sample at hand.
6. Reject H0 if t is in rejection region.
Suppose we knew σ²: two-sided hypothesis

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

The 6 steps:
1. H0: β1 = b, H1: β1 ≠ b (maintained: SLR.1-3, SLR.6, known σ²).
2. Significance level α = 0.05; test statistic T = (β̂1 − b)/sd(β̂1).
3. Sampling distribution of T under H0: N(0, 1).
4. Critical value c = 1.96 s.t. Prob(|T| > c | β1 = b) = 0.05; this defines the rejection region T < −c ∨ T > c.
5. Calculate t, the value of T for the random sample at hand.
6. Reject H0 if t is in the rejection region.
The normal distribution

[Figure]
Known σ²: one-sided hypothesis

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

1. H0: β1 ≤ b, H1: β1 > b.
2. Significance level α = 0.05; test statistic T = (β̂1 − b′)/sd(β̂1).
3. But: to derive the distribution of T and the critical value c, we need some numerical value b′ consistent with H0, i.e. b′ ≤ b.
4. Which one should we choose?
5. Note the meaning of the significance level α: "The probability of a Type 1 error (rejecting H0 although it is true), equal to Prob(T ≥ c | β1 = b′ ≤ b), is at most α."
6. But: for any critical value c, Prob(T ≥ c | β1 = b′) is increasing in b′.
7. So choose b′ = argmax Prob(T ≥ c | β1 = b′ ≤ b) = b.
8. In words: "Replace H0: β1 ≤ b with H0′: β1 = b to get the most stringent test compatible with H0."
Known σ²: one-sided hypothesis

    (β̂1 − β1)/sd(β̂1) ∼ Normal(0, 1)

The 6 steps:
1. H0: β1 ≤ b, H1: β1 > b.
2. Significance level α = 0.05; test statistic T = (β̂1 − b)/sd(β̂1).
   NB: b is the highest value of β1 compatible with H0; among all β1 ≤ b it gives the largest rejection probability, so controlling that probability at α controls the Type 1 error over the whole of H0.
3. Sampling distribution of T with β1 = b: N(0, 1).
4. Critical value c = 1.645 s.t. Prob(T ≤ c | β1 = b) = 0.95; this defines the rejection region T > c.
5. Calculate t, the value of T for the random sample.
6. Reject H0 if t is in the rejection region.
What if we don't know σ²?

1. Again use σ̂² = Σ ûi² / (N − 2).
2. Again, this gives the "standard error" of β̂1 as an estimator of its standard deviation, equal to se(β̂1) = σ̂ / √SSTx.
3. But: σ̂² is a random variable itself! So T = (β̂1 − b)/se(β̂1) is NOT normally distributed.
Recap: The χ² distribution

- Let Zi ∼ N(0, 1) for i = 1, ..., N be independent. Then X = Σ Zi² has a "chi-squared distribution with N degrees of freedom"; write X ∼ χ²_N.
- E(X) = N, Var(X) = 2N.

[Figure: χ² densities for different degrees of freedom. Source: Wikipedia]
Recap: The t distribution

- Let Z ∼ N(0, 1) and X ∼ χ²_N, with Z and X independent. Then T = Z / √(X/N) has a "t distribution with N degrees of freedom"; write T ∼ t_N.
- For N > 2, E(T) = 0 and Var(T) = N/(N − 2).
- Its shape is similar to the normal, but more spread out, with more mass in the tails.

[Figures: t densities for different degrees of freedom. Sources: Wikipedia, Wooldridge]
THEOREM (t Distribution for Standardized Estimators)

Under the CLM assumptions,

    (β̂1 − β1)/se(β̂1) ∼ t_{n−2} = t_{df}

- We will not prove this (now), but note:
  - (β̂1 − β1)/sd(β̂1) ∼ N(0, 1)
  - (n − 2) σ̂²/σ² ∼ χ²_{n−2} (proved later)
  - β̂1 and σ̂² are independent (proved later)
  - So (β̂1 − β1)/se(β̂1) = [(β̂1 − β1)/sd(β̂1)] / √(σ̂²/σ²), a standard normal divided by the square root of an independent χ²_{n−2} variable over its degrees of freedom (the true σ² conveniently cancels), which is distributed t_{n−2}.
- So it is replacing σ (an unknown constant) with σ̂ (an estimator that varies across samples) that takes us from the standard normal to the t distribution.
Testing hypotheses about β1

- The practicalities are the same as for any t-test.
- Statistical packages give us the OLS estimate of β1 and the standard error of the OLS estimate, se(β̂1).
- With these we can test a null hypothesis of the form
  1. H0: β1 = b against the alternative H1: β1 ≠ b (two-sided test)
  2. H0: β1 ≤ b against the alternative H1: β1 > b (one-sided test)
EXAMPLE: Return to Education Using WAGE2.DTA

    lwage-hat = 1.142 + .0993 educ                                  (10)
                (.109)  (.0081)

    n = 759, R² = .165                                              (11)

- Typical null hypothesis: no effect of X on Y.
- So H0: β1 = 0, against H1: β1 ≠ 0.
- Under H0, (β̂1 − 0)/se(β̂1) is distributed t_{n−2}.
- Pick the significance level (probability of a Type 1 error) at 5 percent.
The t distribution

[Figures: t distribution critical values. Source: Wooldridge]
EXAMPLE: Return to Education Using WAGE2.DTA

    lwage-hat = 1.142 + .0993 educ                                  (12)
                (.109)  (.0081)

    n = 759, R² = .165                                              (13)

- Typical null hypothesis: no effect of X on Y.
- So H0: β1 = 0, against H1: β1 ≠ 0.
- Under H0, (β̂1 − 0)/se(β̂1) is distributed t_{n−2}.
- Pick the significance level (probability of a Type 1 error) at 5 percent.
- Critical value for a t-statistic with n − 2 > 100 df, in a two-sided test at 5 percent significance (approximately equal to that of the standard normal): ±1.96.
- t-statistic: .0993/.0081 ≈ 12.26.
- So we reject H0.
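The same calculation in a few lines (a sketch; the estimate and standard error are taken from the regression output above, and scipy is used only to look up the critical value):

    from scipy.stats import t

    b1_hat, se_b1, n = 0.0993, 0.0081, 759
    t_stat = (b1_hat - 0) / se_b1          # roughly 12.26
    c = t.ppf(0.975, df=n - 2)             # two-sided 5% critical value, close to 1.96
    print(t_stat, c, abs(t_stat) > c)      # reject H0: beta1 = 0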
EXAMPLE: Return to Education Using WAGE2.DTA

    lwage-hat = 1.142 + .0993 educ                                  (14)
                (.109)  (.0081)

    n = 759, R² = .165                                              (15)

- Different null hypothesis: H0: β1 ≤ b = 0.05, against H1: β1 > b = 0.05.
- Note: H0 is the "complement" of H1.
- But which β1 should we use to construct the distribution of β̂1 under H0?
- As before, choose the most "stringent" β1 = b, which gives the fewest rejections of H0.
- Under H0, (β̂1 − b)/se(β̂1) is distributed t_{n−2}.
- Again, pick the significance level (probability of a Type 1 error) at 5 percent.
- Critical value for a t-statistic with more than 100 df, in a one-sided test at 5 percent significance (approximately equal to that of the standard normal): 1.645.
- t-statistic: (.0993 − 0.05)/.0081 ≈ 6.09.
- So we again reject H0.
The t distribution

[Figure. Source: Wooldridge]
EXAMPLE: Return to Education Using WAGE2.DTA

    lwage-hat = 1.142 + .0993 educ                                  (16)
                (.109)  (.0081)

    n = 759, R² = .165                                              (17)

- We could have written H0 as β1 = 0.05, against H1: β1 > 0.05.
- Note: this H0 is not the "complement" of H1. But again we only use evidence of β1 > 0.05 against H0.
- With this alternative H0, we don't need to choose a β1 to construct the test statistic.
- The remaining steps are the same as before.
Confidence interval for β1

- This works in exactly the same way as with a normally, or any other t-distributed, variable.

    lwage-hat = 1.142 + .0993 educ                                  (18)
                (.109)  (.0081)

    n = 759, R² = .165                                              (19)

- The 95 percent confidence interval is CI = .0993 ± 1.96 × .0081.
- Beware: we cannot say that the true β1 lies within this CI with 95 percent probability.
- Rather: if we constructed CIs in many independent samples as above, the true β1 would lie in them 95 percent of the time.
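The corresponding numbers (a quick sketch using the reported estimate and standard error):

    b1_hat, se_b1, c = 0.0993, 0.0081, 1.96
    ci = (b1_hat - c * se_b1, b1_hat + c * se_b1)
    print(ci)   # roughly (0.083, 0.115)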
P-value

    lwage-hat = 1.142 + .0993 educ                                  (20)
                (.109)  (.0081)

    n = 759, R² = .165                                              (21)

- The "p-value" is defined as the smallest significance level at which we can just reject H0: β1 = 0 against H1: β1 ≠ 0.
- We know that t = (β̂1 − 0)/se(β̂1) is distributed as T_{N−2} under H0.
- For t_obs, the t-statistic from the sample, the p-value p is:

    p = P(|T_{N−2}| ≥ |t_obs|)

- To find it,
  1. calculate t_obs = (β̂1 − 0)/se(β̂1);
  2. compute p = 1 − prob(−|t_obs| ≤ T_{N−2} ≤ |t_obs|) = 2 · prob(T_{N−2} ≥ |t_obs|) from the tabulated t distribution;
  3. here it is below 0.01 percent.
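A sketch of the same calculation with scipy (the estimate and standard error come from the output above):

    from scipy.stats import t

    b1_hat, se_b1, n = 0.0993, 0.0081, 759
    t_obs = (b1_hat - 0) / se_b1
    p_value = 2 * t.sf(abs(t_obs), df=n - 2)   # two-sided p-value
    print(t_obs, p_value)                      # p_value is essentially zero here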
Three ways of testing significance

"Test the significance of the estimate β̂ at the α percent level"
= "Test H0: β = 0 against a two-sided alternative at the α percent level"

Three ways:
1. Test the hypothesis explicitly, as usual.
2. Check whether p-value < α. If so, reject; if not, do not reject.
3. Check the confidence interval: if 0 ∈ CI_{1−α}, do not reject; otherwise reject.

Proof:

    0 ∈ CI_{1−α} = [β̂ − se(β̂) t^{N−2}_{1−α/2}, β̂ + se(β̂) t^{N−2}_{1−α/2}]
    ⇔ β̂ − se(β̂) t^{N−2}_{1−α/2} < 0   ∧   0 < β̂ + se(β̂) t^{N−2}_{1−α/2}
    ⇔ β̂ < se(β̂) t^{N−2}_{1−α/2}   ∧   β̂ > −se(β̂) t^{N−2}_{1−α/2}
    ⇔ −t^{N−2}_{1−α/2} < (β̂ − 0)/se(β̂) < t^{N−2}_{1−α/2}
    ⇔ |(β̂ − 0)/se(β̂)| < t^{N−2}_{1−α/2}

So (β̂ − 0)/se(β̂) is not in the rejection region of H0 against H1.
Remaining: Prediction Intervals

- So far: a confidence interval for the effect of an extra year of schooling on salaries.
- But how about: a confidence interval for my (or observation i's) salary after 2 additional years in school?
- Two differences:
  - A particular x (= xi + 2).
  - It includes uncertainty from the other factors that affect salaries (via u).
Prediction Intervals

- More generally, suppose we would like to make a prediction ỹi about yi for a given xi (not necessarily in the sample).
- If we knew the true coefficients, a natural prediction would be ỹi = E[y|xi] = β0 + β1 xi.
- The true yi would equal ỹi + ui, so it would have a distribution around ỹi.
- Problem: we don't know β0, β1. A natural candidate for ỹi is the OLS prediction ŷi = β̂0 + β̂1 xi = ȳ + β̂1 (xi − x̄).
- Questions:
  1. Is ŷi on average correct, in the sense that E[ŷi] = E[y|xi], or E[ei] = E[yi − ŷi] = 0?
  2. If so, how large is the variance of the error term ei = yi − ŷi?
  3. Can we construct confidence intervals for yi?
- Two sources of uncertainty:
  1. Uncertainty about the true parameters β0 and β1 (we only have estimates β̂0 and β̂1). More data will reduce this.
  2. Uncertainty about the error ui that will be drawn. More data will not reduce this.
Prediction Intervals

- More generally, there are two kinds of intervals we can be interested in:
  1. The first is just a confidence interval for E(y|xi) given xi.
  2. The second is a confidence interval for y given xi: this also includes the (estimated) randomness from the error u, which would remain even if we knew β0 and β1!

Two confidence intervals
Prediction Intervals

- Write the prediction error ei for the OLS prediction ŷi = β̂0 + β̂1 xi as

    ei = yi − ŷi                                                    (22)
       = β0 + β1 xi + ui − (β̂0 + β̂1 xi)                             (23)
       = (β0 − β̂0) + (β1 − β̂1) xi + ui                              (24)
       = [E[y|x̄] − β1 x̄ − (ȳ − β̂1 x̄)] + (β1 − β̂1) xi + ui           (25)
       = (E[y|x̄] − ȳ) + (β1 − β̂1)(xi − x̄) + ui                      (26)

- (25) follows from
  1. the conditional expectation of y given the sample mean of x, E[y|x̄] = β0 + β1 x̄ + 0, implying β0 = E[y|x̄] − β1 x̄;
  2. the average sample regression function, β̂0 = ȳ − β̂1 x̄.
Prediction Intervals

    ei = yi − ŷi = (E[y|x̄] − ȳ) + (β1 − β̂1)(xi − x̄) + ui            (27)

- Note: xi and x̄ are given (by assumption, and by the sample).
- E[y|x̄] and β1 are fixed numbers that we don't know.
- There are three sources of error in predicting yi by ŷi:
  1. ȳ (unbiased for E[y|x̄], but not typically equal to it)
  2. β̂1 (unbiased for β1, but not typically equal to it)
  3. ui
- Prediction intervals for E(y|xi) only include 1. and 2.
- Prediction intervals for y|xi include 1., 2. and 3.
Prediction Intervals

- Prediction error: ei = yi − ŷi = (E[y|x̄] − ȳ) + (β1 − β̂1)(xi − x̄) + ui
- Unbiasedness: E[ei] = E[yi − ŷi | xi] = 0, because
  1. E(ŷi) = E(y|x̄) + E(β̂1)(xi − x̄) = µy + β1 (xi − x̄)
  2. E(yi|xi) = E(y|x̄) + E(β1 (xi − x̄)|xi) + E(ui|xi) = µy + β1 (xi − x̄)
- Variance: Var(ei) = Var(yi|xi) + Var(ŷi) − 2 Cov(yi|xi, ŷi). We need the covariances of ui, ȳ and β̂1:
  - ui is independent of ȳ and β̂1 (by SLR.6 and "random sampling")
  - Shown in Lecture SLR 2: ȳ and β̂1 are independent
- This yields:
  1. Var(ŷi) = Var(ȳ) + Var[β̂1 (xi − x̄)] = σ²/N + (xi − x̄)² σ²/SSTx
  2. Var(ei) = Var(ui) + Var(ŷi) = σ² + σ²/N + (xi − x̄)² σ²/SSTx
             = σ² [1 + 1/N + (xi − x̄)²/SSTx]
- NB: 1/N and (xi − x̄)²/SSTx converge to 0 as N grows large.
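A sketch of these two variance formulas on simulated data (illustrative names and values; the true σ² is plugged in directly here, σ̂² would be used in practice):

    import numpy as np

    rng = np.random.default_rng(5)
    n, sigma = 100, 2.0
    x = rng.normal(5.0, 2.0, n)

    sst_x = np.sum((x - x.mean()) ** 2)
    x_new = 8.0                            # point at which we predict (illustrative)

    # Variance of y-hat (for a CI for E(y|x_new)) and of the prediction error
    # (for a PI for y|x_new), using the formulas above.
    var_yhat = sigma**2 * (1 / n + (x_new - x.mean()) ** 2 / sst_x)
    var_pred = sigma**2 * (1 + 1 / n + (x_new - x.mean()) ** 2 / sst_x)
    print(np.sqrt(var_yhat), np.sqrt(var_pred))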
Prediction Intervals

- ei = yi − ŷi = (E[y|x̄] − ȳ) + (β1 − β̂1)(xi − x̄) + ui
- To construct intervals, we need to know the (joint) distribution of ȳ, β̂1 and u, so we need the normality assumption SLR.6.
- Impose SLR.1, SLR.2 and SLR.6, so we know
  1. ȳ ∼ N(E[y|x̄], σ²/N)
  2. β̂1 ∼ N(β1, σ²/SSTx)
  3. ui ∼ N(0, σ²)
- Same problem as before: we don't know σ².
- Same solution as before: substitute σ̂² = SSR/(N − 2).
Prediction Intervals

- Thus, we can construct confidence intervals for yi|xi using the t-distribution:

    T = (ŷi − yi) / √( σ̂² [1 + 1/N + (xi − x̄)²/SSTx] ) ∼ t_{n−2}    (28)

  (the true σ² again conveniently cancels when we standardise with σ̂).
Prediction Intervals

- Computationally, you can construct PIs in different ways.
- One way to calculate the se of ŷi is to compute a linear combination of the coefficients and its standard error (Stata's "lincom" command is the simplest way).
- Use the data in GPA1.DTA:

    colGPA = β0 + β1 hsGPA + β2 ACT + β3 skipped + u

- colGPA is the grade point average in college, hsGPA that in high school, skipped the average number of lectures missed per week, and ACT an achievement test score.
- Get the 95% CI for the prediction when hsGPA = 3, ACT = 25, and skipped = 0.
Prediction Intervals

[Regression output for the GPA1 example. Source: Wooldridge]
Prediction Intervals

- The prediction is 2.99. But we want the CI, which goes from 2.87 to 3.11. This is fairly tight, but it is just for ŷ|xi, the average colGPA for those with hsGPA = 3, ACT = 25, and skipped = 0.
- In other words: this misses the uncertainty from the error u, which would remain even if we knew β0 and β1!
- In most cases, except for very small sample sizes, σ̂² is by far the largest component. Remember that se(ŷ) shrinks to zero at the rate 1/√n, while σ̂² is the estimated variance for a single draw.
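In Python, a rough analogue of the Stata prediction step might look like the sketch below; the file name and column names are assumptions based on the slide's description of GPA1.DTA, and statsmodels' get_prediction is used to obtain both the CI for the mean and the wider prediction interval for an individual observation.

    import pandas as pd
    import statsmodels.api as sm

    # Assumed file and column names, following the slide's description of GPA1.DTA
    df = pd.read_stata("GPA1.DTA")
    X = sm.add_constant(df[["hsGPA", "ACT", "skipped"]])
    res = sm.OLS(df["colGPA"], X).fit()

    x_new = pd.DataFrame({"const": [1.0], "hsGPA": [3.0], "ACT": [25.0], "skipped": [0.0]})
    pred = res.get_prediction(x_new).summary_frame(alpha=0.05)
    # mean_ci_* : 95% CI for E(colGPA|x)  (about 2.87 to 3.11 on the slide)
    # obs_ci_*  : wider 95% prediction interval for an individual colGPA
    print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])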
Prediction Intervals

[Further output and figures for the GPA1 example. Source: Wooldridge]
Summary
ȳ and β̂1 are independent

- ȳ = β0 + β1 x̄ + ū = E[y|x̄] + ū
- Since both ū and β̂1 are normally distributed, it suffices to show that Cov(ū, β̂1) = 0:
  1. β̂1 = β1 + Σ wi ui, where wi = (xi − x̄)/SSTx
  2. E[β̂1 ū] = β1 E[ū] + (1/n) E[ Σ_i wi ui · Σ_j uj ]
             = 0 + (1/n) E[ Σ wi ui² ]
             = (1/n) σ² Σ wi = 0
     (since E[ui uj] = 0 for i ≠ j and Σ wi = 0)
- Useful for prediction intervals.
- But also for calculating Var(β̂0).
Appendix
When is the OLS estimator best? - the Gauss-Markov Theorem

- More restrictive question: when is the OLS estimator the most efficient among linear unbiased estimators (the best linear unbiased estimator, "BLUE")?
- We show that this is always the case under assumptions SLR.1-5.
- Strategy of the proof:
  1. Impose the assumptions of linearity and unbiasedness on an arbitrary alternative linear estimator β̃1 and derive the restrictions they imply.
  2. Show that the difference between Var(β̃1) and Var(β̂1) must be non-negative.
When is the OLS estimator "BLUE"? - the Gauss-Markov Theorem

- Linearity in yi: β̃1 = Σ wi yi, for some weights {wi} that are functions of {xi}. Substituting the model,

    β̃1 = β0 Σ wi + β1 Σ wi xi + Σ wi ui                             (29)

- Unbiasedness: E[β̃1|X] = β0 Σ wi + β1 Σ wi xi + 0 = β1
- This implies Σ wi = 0, Σ wi xi = 1 and hence Σ wi (xi − x̄) = 1.
- Now consider ∆V ≡ Var(β̃1) − Var(β̂1) = σ² ( Σ wi² − 1/SSTx ).
- Multiply 1/SSTx by ( Σ wi (xi − x̄) )² = 1:

    ∆V = σ² [ Σ wi² − ( Σ wi (xi − x̄) )² / SSTx ]

- Let γ̂wx = Σ wi (xi − x̄) / SSTx, the regression coefficient of wi on (xi − x̄). Then

    ∆V = σ² [ Σ wi² − Σ ( γ̂wx (xi − x̄) )² ]

  since Σ ( γ̂wx (xi − x̄) )² = γ̂wx² SSTx = ( Σ wi (xi − x̄) )² / SSTx.
- But since Σ wi γ̂wx (xi − x̄) = γ̂wx Σ wi (xi − x̄) = γ̂wx² SSTx = Σ ( γ̂wx (xi − x̄) )², we have

    ∆V = σ² Σ [ wi − γ̂wx (xi − x̄) ]²

- This is just the sum of squared residuals from a regression of wi on (xi − x̄), so it is non-negative. QED!
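As a numerical illustration of the theorem (a simulation sketch, not part of the proof): compare the OLS slope with one particular alternative linear unbiased estimator, here the OLS slope computed from only the first half of the sample; both are approximately unbiased, but the alternative has the larger sampling variance, as Gauss-Markov predicts. All names and values are illustrative.

    import numpy as np

    rng = np.random.default_rng(6)
    n, beta0, beta1, sigma = 100, 1.0, 0.5, 2.0
    x = rng.normal(5.0, 2.0, n)

    def slope(xs, ys):
        return np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)

    ols, half = [], []
    for r in range(5000):
        y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
        ols.append(slope(x, y))                       # OLS on the full sample
        half.append(slope(x[: n // 2], y[: n // 2]))  # alternative linear unbiased estimator

    print(np.mean(ols), np.mean(half))   # both close to beta1 (unbiased)
    print(np.var(ols), np.var(half))     # OLS variance is smaller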
