
ECON6001: APPLIED ECONOMETRICS

S&W
Chapters 2 and 3
Introduction & Statistics
Review

Dr. Gedeon Lim

Copyright © 2019 Pearson Education Ltd. All Rights Reserved.


Overview
1. Introduction: causality, why econometrics?
2. Observational studies (are often not enough): causality and
counterfactuals
3. Review of probability theory & statistics



Brief Overview of the Course
• Economic theory suggests important relationships but:
– Doesn’t tell us the direction of cause and effect, nor
– The quantitative magnitudes of effects
• Econometrics is an “art/science” that plugs this gap:
– Goal: To build (a body of) evidence to make predictions about
(public/business) policy changes
– Method: By saying something about causality (but difficult to do so)



E.g.: Minimum Wage Debate
• Often ideological
• Econometrics brings quantitative evidence to bear
• In 1978, economists asked if minimum wages increased unemployment
among low-skilled workers
– 90% of surveyed economists agreed

• Evidence: Dozens of econometric studies since then estimate negative, zero,
and even positive employment effects
– Most, but not all, studies find relatively small negative employment effects or
zero effects

• In 2013, economists were asked: “Raising the federal minimum wage to $9 per hour
would make it noticeably harder for low-skilled workers to find employment.”
– Only 34% agreed (see https://www.igmchicago.org/igm-economic-experts-panel/)



Evidence-based policy: Role of Econometrics

• Many important/interesting questions in social sciences for
which econometrics can provide some evidence:
– Running example: What is the causal effect of reducing class size on
student achievement?
– Causal effect of schooling (MEcon degree) on wages?
– Causal effect of speaking Cantonese on renting an apartment?
• We typically start with a theory of change
– Use this to generate a hypothesis about policy impact/channels
• Econometrics helps us test the hypothesis and make predictions:
– We can then gauge the evidence and see if it changes our prior beliefs



Method: Using data to measure causal
effects
• Ideally, we would like an experiment
– What would be an experiment to estimate the effect of class size on
standardized test scores?

• But almost always we only have observational
(nonexperimental) data.
– returns to education
– cigarette prices
• Most of the course deals with difficulties arising from using
observational data to estimate causal effects
– confounding effects (omitted factors)
– simultaneous causality
– “correlation does not imply causation”



Objectives of ECON6001:
• Learn to conduct regression analysis (regressions) to:
– Estimate causal effects using observational data
– Predict – for which knowing causal effects is not necessary – including
forecasting using time series data;

• Focus on application – theory is used only to understand the
whys of the methods;
• Learn to evaluate regressions – to read/understand empirical
economics papers in other courses;
• Get hands-on experience with regressions in your problem sets.



Statistical Package: Stata
• Designed especially for econometric analysis; easy to use
• Comprehensive implementation of modern econometric tools
• Available
– at the computer classrooms, KKL 1009, 1102
– on an SEF server (accessible remotely, info to be distributed)

• (Stata) Tutorial (time and venue to be announced)


• You can use other packages, but the instructor and tutor won’t be
able to provide support



Our joint objective by midterms..



Problem sets, Midterms, and Finals
• Problem sets:
– 4 in total
– Posted every Monday, due the following Sunday

• In-class quizzes
• Midterms:
– Possibly September 13th/16th morning

• Finals:
– Date TBC

• Office hours: TBC


• Topics: see syllabus



Me, myself, and I
• Development economist
• Born and raised in Singapore
• Did my Ph.D. at Boston University
• Joined HKU in 2020



Go to www.menti.com and enter code:
3338 8006
• Q1: What did you major in during undergrad?
• Q2: Which parts of the world and/or which parts of China is
everyone from?



Overview
1. Introduction: Causality, why econometrics?
2. Observational studies (are often not enough): causality and
counterfactuals
3. Review of probability and statistical theory



Terminology

y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor
LHS RHS

• Explained/explanatory: most descriptive and applicable to
different contexts
• More often: dependent and independent. (No relation with
statistical independence.)
Observational Study: Class Size in California
Policy question: does class size count for student learning?
• Sample of 420 California school districts. For each district, we
have:
– Outcome/Dep. Var Y: Mean test score
– Treatment/Ind. Var X: Class size measure (student-teacher ratio, STR)



Observational Study: Class Size in California
• Divide districts into small (STR<20) and regular (STR>=20).
• Initial: Use estimation and hypothesis testing to compare test
scores between small and regular class size districts.

Class Size   Average score (Ȳ)   Standard deviation (sY)   n
Small        657.4               19.4                      238
Large        650.0               17.9                      182



We need to get some numerical evidence on whether
districts with low STRs have higher test scores – but how?
1. Compare average test scores in districts with low STRs to those
with high STRs (“estimation”)
2. Test null hypothesis that mean test scores are the same, against
alternative hypothesis that they differ (“hypothesis testing”)
3. Estimate an interval for the difference in mean test scores, high
v. low STR districts (“confidence interval”)



Initial data analysis: Compare districts with “small”
(STR < 20) and “large” (STR ≥ 20) class sizes:

Class Size   Average score (Ȳ)   Standard deviation (sY)   n
Small        657.4               19.4                      238
Large        650.0               17.9                      182

1. Estimation of Δ = difference between group means
2. Test the null hypothesis (H0) that Δ = 0
3. Construct a confidence interval for Δ
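These three steps can be carried out directly from the summary statistics in the table above. A minimal Python sketch (using the normal critical value 1.96 for a 95% interval):

```python
import math

# Summary statistics from the table above (California class-size data)
mean_s, sd_s, n_s = 657.4, 19.4, 238   # small-STR districts
mean_l, sd_l, n_l = 650.0, 17.9, 182   # large-STR districts

# 1. Estimation: difference between group means
delta = mean_s - mean_l

# 2. Hypothesis testing: t-statistic for H0: delta = 0
se = math.sqrt(sd_s**2 / n_s + sd_l**2 / n_l)  # standard error of the difference
t = delta / se

# 3. 95% confidence interval for delta
ci = (delta - 1.96 * se, delta + 1.96 * se)

print(round(delta, 1), round(t, 2), tuple(round(x, 1) for x in ci))
```

The 7.4-point difference is large relative to its standard error (t ≈ 4), so it is unlikely to be sampling noise alone — whether it is causal is the separate question taken up next.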



Causality
• The important question for a policymaker/economist is whether
this relationship is causal
– That is, does a smaller class size truly cause students to learn more?

• The goal of this class is to train you in econometrics, so that you
can critically assess statistical evidence
• And… to distinguish causality from correlation



Causality
• “Relationship” can take two forms:
– Correlation (or association): If a variable X is correlated (or associated)
with a variable Y, the variables “move together” in the data
– If X and Y are correlated (or associated), knowing the value of X helps
you predict the value of Y.
– Causation: If X is causally related to Y, then changing the value of X
would lead to a change in the value of Y.

“An action is said to cause an outcome if the outcome is the
direct result, or consequence, of that action”



Causality
• Consider a plan to reduce class sizes by 20%. Why is it
important whether the relationship between class size and
student outcomes is causal or correlational?
• Causality means that changes in a specific
factor/policy/treatment (X) lead to changes in an outcome (Y)
– Y may be a function of many factors other than X.
– We want to measure any changes in Y that are directly attributable to a
change in X, not to these other factors



Causality: Counterfactuals
• To do so, we have to think about the

Counterfactual: What would have happened to Y if X did not occur?

• What is the counterfactual for each of these examples?


– Does getting an MEcon (X) affect your job prospects (Y)?

– Did Lee Kuan Yew (X) have an effect on Singapore’s development (Y)?

– Does having only one child (X) affect a child’s development (Y)?

• Inferring causality is hard because we never see counterfactuals!
Instead, we “mimic” the counterfactual using data and statistics.
Randomized Experiments
• Conceptually, randomized experiments are the “gold standard,”
the ideal method for estimating the causal effect of a “treatment.”
• RCTs compare 2 groups similar in all ways except for
______________.
– A sample of people is obtained from a population.
– People are randomly assigned to the treatment or control group
– The treatment group is offered the treatment
– The control group is not offered the treatment because: _____________

• This isolates the impact of the treatment from the impact of other
factors.
– Only difference between groups should be the treatment
– Any difference in outcomes must be due to the treatment



Experiments vs Observational Studies
• Randomized experiments allow us to isolate the impact of one
factor from the impact of other factors
– What if you do not have a randomized experiment?
– In an observational study, the researcher analyzes data from a situation
over which he or she had no _______________________.

• We will spend much of this course learning how to extract
useful causal information from observational studies.



Does having only one child (X) affect a
child’s development (Y)?

Go to www.menti.com and enter code:
3338 8006

Extracting useful causal information from
observational studies and what comes
next…
• Eventually: regressions

• Before turning to regression, however, we will review the underlying
theory of estimation, hypothesis testing, and confidence intervals

• These concepts extend directly to regressions



Application to the California Test Score –
Class Size data

• Estimated slope = β̂₁ = −2.28

• Estimated intercept = β̂₀ = 698.9

• Estimated regression line: TestScore = 698.9 − 2.28 × STR
Overview
1. Introduction: Causality, why econometrics?
2. Observational Studies vs Causality/Counterfactuals
3. Review of probability and statistical theory
1. (Simple) Probability Theory
2. Estimation
3. (Hypothesis) Testing
4. Confidence Intervals



Review of Statistical Theory
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals

The probability framework for statistical inference


a) Population, random variable, and distribution
b) Moments of a distribution (mean, variance, standard deviation,
covariance, correlation)
c) Conditional distributions and conditional means
d) Distribution of a sample of data drawn randomly from a
population: Y1,…, Yn
(a) Population, random variable, and
distribution
Population
• The group or collection of all possible units of interest (school
districts)
• We will think of populations as infinitely large (∞ is an
approximation to “very big”)

Random variable Y (outcomes)


• Numerical summary of a random outcome (district average test
score, district STR)



Population distribution of Y
• The probabilities of different values of Y that occur in the
population, for ex. Pr[Y = 650] (when Y is discrete)
• OR: The probabilities of sets of these values, for ex. Pr[640 ≤ Y
≤ 660] (when Y is continuous).



(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(1 of 3)

mean = expected value (expectation) of Y
     = E(Y) = μY
     = long-run average value of Y over repeated realizations of Y

variance = E(Y – μY)² = σY²
         = measure of the squared spread of the distribution

standard deviation = √variance = σY


(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(2 of 3)
skewness = E[ (Y − μY)³ ] / σY³
         = measure of asymmetry of a distribution
• skewness = 0: distribution is symmetric
• skewness > (<) 0: distribution has long right (left) tail

kurtosis = E[ (Y − μY)⁴ ] / σY⁴
         = measure of mass in tails
         = measure of probability of large values
• kurtosis = 3: normal distribution
• kurtosis > 3: heavy tails (“leptokurtotic”)
(b) Moments of a population distribution: mean,
variance, standard deviation, covariance, correlation
(3 of 3)



2 random variables: joint distributions and
covariance
• Random variables X and Z have a joint distribution
• The covariance between X and Z is
cov(X,Z) = E[(X – μX)(Z – μZ)] = σXZ
• The covariance is a measure of the linear association between
X and Z
• cov(X,Z) > 0 means a positive relation between X and Z
• If X and Z are independently distributed, then cov(X,Z) = 0 (but
not vice versa!!)
• The covariance of a r.v. with itself is its variance:
cov(X, X) = E[ (X − μX)(X − μX) ] = E[ (X − μX)² ] = σX²



2 random variables: joint distributions and
covariance
• Random variables X and Z have a joint distribution



2 random variables: joint distributions and
covariance
The covariance between Test Score and STR is negative:

So is the correlation…
The correlation coefficient is defined in
terms of the covariance:
corr(X, Z) = cov(X, Z) / √( var(X) var(Z) ) = σXZ / (σX σZ) = rXZ

• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
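A quick illustration with hypothetical numbers (a made-up six-district sample of STR and test scores — the variables, not the values, follow the running example):

```python
import statistics

# Hypothetical data: student-teacher ratio (x) and test score (z)
x = [15, 17, 19, 21, 23, 25]
z = [660, 658, 655, 652, 649, 647]

mx, mz = statistics.mean(x), statistics.mean(z)
# sample covariance: average of cross-products of deviations
cov_xz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z)) / (len(x) - 1)
# correlation: the covariance rescaled into [-1, 1]
corr_xz = cov_xz / (statistics.stdev(x) * statistics.stdev(z))

print(round(cov_xz, 1), round(corr_xz, 3))  # negative, and close to -1
```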



The correlation coefficient measures linear
association



(c) Conditional distributions and
conditional means (1 of 3)
Conditional distributions
• The distribution of Y, given value(s) of some other random variable, X
• Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
• conditional mean = mean of conditional distribution
= E(Y | X = x) (important concept and notation)

• conditional variance = variance of conditional distribution


• Example: E(Test score|STR < 20) = the mean of test scores among
districts with small class sizes
The difference in means is the difference between the means of two
conditional distributions:
Conditional Expectations/Mean
• Before: Summarize distributions of single R.V. by mean,
variance
• Now: (Regressions will involve) 2 R.V. , we need a way to
similarly summarize conditional distributions
E[Y | X = x] = Σᵢ₌₁ᵏ yᵢ Pr(Y = yᵢ | X = x)



(c) Conditional distributions and
conditional means (2 of 3)
Δ = E(Test score|STR < 20) – E(Test score|STR ≥ 20)

Other examples of conditional means:


• Wages of all female workers (Y = wages, X = sex)
• Mortality rate of those given an experimental treatment (Y =
live/die; X = treated/not treated)
• If E(X |Z) = const, then corr(X,Z) = 0 (not necessarily vice
versa however) (E.g. Average wage for male vs female)

The conditional mean is a term for the familiar idea of the
group mean



Law of iterated expectations (Good to know)
E[Y] = EX[ E[Y | X] ]
• Relates conditional expectation and marginal mean (e.g. of Y)
• Overall (marginal) expectation of Y = average of conditional
expectations (over X)
• Mathematically: if X takes on l values x₁, …, xₗ, then

E[Y] = Σᵢ₌₁ˡ E[Y | X = xᵢ] Pr(X = xᵢ)
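A numerical sketch using the small/large class-size numbers from earlier (group shares 238/420 and 182/420, group means 657.4 and 650.0): the law of iterated expectations recovers the overall mean.

```python
# E[Y] = sum over x of E[Y | X = x] * Pr(X = x)
p_small, p_large = 238 / 420, 182 / 420        # group shares
ey_small, ey_large = 657.4, 650.0              # conditional (group) means

ey = ey_small * p_small + ey_large * p_large   # overall (marginal) mean of Y
print(round(ey, 2))
```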



(d) Distribution of a sample of data drawn
randomly from a population: Y1,…, Yn
Need to connect the observed sample data with the population we
want to draw conclusions about
• We assume simple random sampling
Units of analysis (district, entity) are drawn at random from the
population
Randomness and data
• Prior to sample selection, the value of Y is random because the
individual selected is random
• Once the individual is selected and the value of Y is observed, then Y
is just a number – not random
• The data set is (Y1, Y2,…, Yn), where Yi = value of Y for the ith
individual (district, entity) sampled
i.i.d. : Distribution of Y1,…, Yn under simple
random sampling
• Because individuals #1 and #2 are selected at random, the value
of Y1 has no information content for Y2. Thus:
– Y1 and Y2 are independently distributed
– Y1 and Y2 come from the same distribution, that is, Y1, Y2 are identically
distributed
– That is, under simple random sampling, Y1 and Y2 are independently and
identically distributed (i.i.d.).
– More generally, under simple random sampling, {Yi}, i = 1,…, n, are i.i.d.



Independence
• Definition: Two random variables X and Y are independent if the
probabilities of events involving the two r.v.s factor into products:

Pr(X ∈ A, Y ∈ B) = Pr(X ∈ A) Pr(Y ∈ B)

• Some useful results. If X and Y are independent:


– E[XY ] = E[X]E[Y ] (cov(X,Y) = E[XY] – E[X]E[Y])
– E[Y |X] = E[Y ]

• Very important property: If we want to know if X has a causal


effect on Y, need to know if (assignment) of treatment is
independent of any other characteristics
Recall: The mean as an important moment of a
population distribution
mean = expected value (expectation) of Y
     = E(Y) = μY
     = long-run average value of Y over repeated realizations of Y

variance = E(Y – μY)² = σY²
         = measure of the squared spread of the distribution

standard deviation = √variance = σY



This framework allows rigorous statistical inferences about
moments of population distributions using a sample of data
from that population…
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals
Estimation
Ῡ is the natural estimator of the mean. But:
a) What are the properties of Ῡ ?
b) Why should we use Ῡ rather than some other estimator?
– Y1 (the first observation)
– maybe unequal weights – not a simple average
– median(Y1, …, Yn)
The starting point is the sampling distribution of Ῡ …
Roadmap

1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals



(a) The sampling distribution of Ῡ

Ῡ is a random variable:
– The individuals in the sample are drawn at random. Thus the values of
(Y1, …, Yn) are random
– Thus functions of (Y1, …, Yn), such as Ῡ , are random: had a different
sample been drawn, they would have taken on a different value
– I.e. Ῡ has a probability distribution
Properties are determined by the sampling distribution of Ῡ
– The probability distribution associated with possible values of Ῡ for
different possible samples
– The mean and variance of Ῡ are the mean and variance of its sampling
distribution, E(Ῡ ) and var(Ῡ ).
– The concept of the sampling distribution underpins all of econometrics.



(a) The sampling distribution of Ῡ
Example: Suppose Y takes on 0 or 1 (a Bernoulli random variable) with
the probability distribution,
Pr[Y = 0] = .22, Pr(Y = 1) = .78
Then
E(Y ) = p × 1 + (1 – p) × 0 = p = .78
σY² = E[Y − E(Y)]² = p(1 − p)
    = .78 × (1 – .78) = 0.1716

The sampling distribution of Ῡ depends on n.


Consider n = 2. The sampling distribution of Ῡ is,
– Pr(Ȳ = 0) = .22² = .0484
– Pr(Ȳ = ½) = 2 × .22 × .78 = .3432
– Pr(Ȳ = 1) = .78² = .6084
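This small sampling distribution can be checked by enumerating all possible samples of size n = 2:

```python
from itertools import product

p = 0.78  # Pr(Y = 1), as in the slide's Bernoulli example
dist = {}
for y1, y2 in product([0, 1], repeat=2):        # all possible samples of size 2
    prob = (p if y1 else 1 - p) * (p if y2 else 1 - p)
    ybar = (y1 + y2) / 2                        # sample mean of this sample
    dist[ybar] = dist.get(ybar, 0.0) + prob

print({k: round(v, 4) for k, v in sorted(dist.items())})
```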
(a) The sampling distribution of Ῡ (3 of 3)

The sampling distribution of Ῡ when Y is Bernoulli ( p = .78):



Things we want to know about the sampling
distribution:
• What is the mean of Ῡ ?
– If E(Ῡ ) = 0.78 (true μ), then Ῡ is an unbiased estimator of μ

• What is the variance of Ῡ ?


– How does var(Ῡ) depend on n (famous 1/n formula)

• Does Ῡ become close to μ when n is large?


– Law of large numbers: Ῡ is a consistent estimator of μ

• Ῡ – μ appears bell shaped for n large…is this generally true?


– In fact, Ῡ – μ is approximately normally distributed for n large (Central
Limit Theorem)



The mean and variance of the sampling
distribution of Ῡ (1 of 3)
• General case – that is, for Yi i.i.d. from any distribution, not just
Bernoulli:
• mean: E(Ȳ) = E( (1/n) Σᵢ₌₁ⁿ Yᵢ ) = (1/n) Σᵢ₌₁ⁿ E(Yᵢ) = (1/n) Σᵢ₌₁ⁿ μY = μY

• variance: var(Ȳ) = E[ Ȳ − E(Ȳ) ]²
                   = E[ Ȳ − μY ]²
                   = E[ ( (1/n) Σᵢ₌₁ⁿ Yᵢ ) − μY ]²
                   = E[ (1/n) Σᵢ₌₁ⁿ (Yᵢ − μY) ]²
The mean and variance of the sampling
distribution of Ῡ (2 of 3)
so var(Ȳ) = E[ (1/n) Σᵢ₌₁ⁿ (Yᵢ − μY) ]²
          = E[ ( (1/n) Σᵢ₌₁ⁿ (Yᵢ − μY) ) × ( (1/n) Σⱼ₌₁ⁿ (Yⱼ − μY) ) ]
          = (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ E[ (Yᵢ − μY)(Yⱼ − μY) ]
          = (1/n²) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ cov(Yᵢ, Yⱼ)
          = (1/n²) Σᵢ₌₁ⁿ σY²          (i.i.d.: cov(Yᵢ, Yⱼ) = 0 for i ≠ j)
          = σY² / n
The mean and variance of the sampling
distribution of Ῡ (3 of 3)
E(Ȳ) = μY
var(Ȳ) = σY² / n

Implications:
1. Ȳ is an unbiased estimator of μY (that is, E(Ȳ) = μY)
2. var(Ȳ) is inversely proportional to n
   1. the spread of the sampling distribution is proportional to 1/√n
   2. Thus the sampling uncertainty associated with Ȳ is proportional
      to 1/√n (larger samples, less uncertainty, but square-root law)
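The 1/n formula is easy to check by simulation; a sketch drawing Bernoulli samples (p = .78, as in the earlier example; the seed is arbitrary):

```python
import random
import statistics

random.seed(0)
p, n, reps = 0.78, 25, 20_000
sigma2 = p * (1 - p)          # population variance of Bernoulli(p) = 0.1716

# Draw many samples of size n; the variance of the resulting sample
# means should be close to sigma2 / n.
means = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

print(round(statistics.pvariance(means), 5), round(sigma2 / n, 5))
```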



Ῡ is an unbiased estimator of the mean
• What are the desirable properties of Ῡ ?
1. Unbiasedness
2. Consistency
3. (Variance and) Efficiency



Unbiasedness: E(Ȳ) = μY
• Theoretical property (in the sense that you usually only get 1
sample, so you can’t observe this)
• Simply: if you take repeated, random samples from a population
you get many Ȳ’s and…
• the average of the Ȳ’s, (1/k) Σᵢ₌₁ᵏ Ȳᵢ, tends to μY



Sampling Distribution of Ῡ when n large
• For small sample sizes, distribution of Ῡ is complicated, but if n
is large, sampling distribution simplifies by LLN/CLT.
• 2 key results:
1. As n increases, the distribution of Ῡ becomes more tightly centered
around μY (the Law of Large Numbers)
2. Moreover, the distribution of (Ῡ – μY ) becomes normal (the Central
Limit Theorem)

• For this class, LLN and CLT are auxiliary results that you won’t
need to prove but need to understand the intuition for:
– LLN helps prove/show that our estimators are consistent
– CLT allows us to construct confidence intervals and do hypotheses
testing on our estimator



The Law of Large Numbers: Sequence of
Sample Means
• What can we say about sample mean as n gets large?
• Think about sequences of sample means with increasing n:
Ȳ₁ = Y₁
Ȳ₂ = (1/2)(Y₁ + Y₂)
Ȳ₃ = (1/3)(Y₁ + Y₂ + Y₃)
…
Ȳₙ = (1/n)(Y₁ + Y₂ + ⋯ + Yₙ)



The Law of Large Numbers:
An estimator is consistent if the probability that it falls within an
interval of the true population value tends to one as the sample size
increases.
If (Y1 , , Yn ) are i.i.d. and σ Y2 < ∞, then Y is a consistent estimator
of µY , that is,
Pr[| Y − µY | < µ ] → 1 as n → ∞
p
which can be written, Y → µY
p
(“Y → µY ” means “Y converges in probability to µY ”).
σ Y2
(the math : as n → ∞, var(Y=
) → 0, which implies that
n
Pr[|Y − µY | < ε ] → 1.)
Ῡ is a consistent estimator of the mean
• What are the desirable properties of Ῡ ?
1. Unbiasedness
2. Consistency
3. (Variance and) Efficiency



LLN: Consistency, Ȳ →p μY

• As n → ∞, Ȳ →p μY ( Pr[ |Ȳ − μY| < ε ] → 1 as n → ∞ )

• Intuition: Probability of Ȳ being far away from μY goes to 0 as n
gets big
• LLN by simulation
– Draw different sample sizes from an exponential distribution (E[Yᵢ] = 2)


LLN in action: Ȳ, n = 10



LLN in action: Ȳ, n = 30



LLN in action: Ȳ, n = 100



LLN in action: Ȳ, n = 1000



So far…
E(Ȳ) = μY
var(Ȳ) = σY² / n

• We have: Ȳ converges to μY as n gets large
• Can we say more? Specifically, can we approximate
Pr(a < Ȳₙ < b)?
• We analyze, again, the case where n is large
• We analyze, again, case of when n large



The Central Limit Theorem (CLT) (1 of 3)
If (Y1 , , Yn ) are i.i.d. and 0 < σ Y2 < ∞, then when n is large the
distribution of Y is well approximated by a normal distribution.
• Very powerful: average from a random sample from any population , when
standardized, has standard normal distribution
• Why? Normal distribution is extremely useful/common in statistics



The Central Limit Theorem (CLT) (1 of 3)
If (Y1 , , Yn ) are i.i.d. and 0 < σ Y2 < ∞, then when n is large the
distribution of Y is well approximated by a normal distribution.
σ Y2
− Y is approximately distributed N ( µY , ) (“normal distribution with
n
mean µY and variance σ /n”)
2
Y
Y − E (Y ) Y − µY
− That is, “standardized”
= Y = is approximately
var (Y ) σ Y / n
distributed as N (0, 1)

– The larger is n, the better is the approximation.
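A simulation sketch of the CLT, using an exponential population (so clearly non-normal): if the standardized sample mean is approximately N(0,1), about 95% of its draws should land inside ±1.96. The seed and sizes are arbitrary.

```python
import math
import random

random.seed(2)
mu = sigma = 2.0            # an exponential with mean 2 also has sd 2
n, reps = 100, 10_000

inside = 0
for _ in range(reps):
    ybar = sum(random.expovariate(1 / mu) for _ in range(n)) / n
    z = (ybar - mu) / (sigma / math.sqrt(n))  # standardized sample mean
    inside += abs(z) <= 1.96

print(inside / reps)  # close to 0.95
```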



Standardizing an r.v.
• Common to standardize random variable by subtracting
expectation and dividing by standard deviation:
Z = (X − E[X]) / √V[X]
• Possible to show that for any X,
– E[Z] = 0
– V[Z] = 1

• Sometimes called a z-score
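For instance (any list of numbers works; these values are made up):

```python
import statistics

x = [15, 17, 19, 21, 23, 25]
mx, sx = statistics.mean(x), statistics.pstdev(x)

z = [(xi - mx) / sx for xi in x]  # the z-scores

# After standardizing: mean 0 and standard deviation 1 (up to rounding)
print(round(statistics.mean(z), 6), round(statistics.pstdev(z), 6))
```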



The Central Limit Theorem (CLT) (2 of 3)
Sampling distribution of Ῡ when Y is Bernoulli, p = 0.78:



The Central Limit Theorem (CLT) (3 of 3)
Same example: sampling distribution of (Ȳ − E(Ȳ)) / √var(Ȳ):



Summary: The Sampling Distribution of Ῡ
For Y1 , , Yn i.i.d. with 0 < σ Y2 < ∞,
• The exact (finite sample) sampling distribution of Y has mean µY
(“Y is an unbiased estimator of µY ”) and variance σ Y2 /n
• Other than its mean and variance, the exact distribution of Ῡ is
complicated and depends on the distribution of Y (the population
distribution)
• When n is large, the sampling distribution simplifies:
p
− Y → µY (𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂𝐂 𝐢𝐢𝐢𝐢 𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩𝐩 𝐭𝐭𝐭𝐭 𝐭𝐭𝐭𝐭𝐭𝐭𝐭𝐭 𝐦𝐦𝐦𝐦𝐦𝐦𝐦𝐦 (𝐋𝐋𝐋𝐋𝐋𝐋)

Y − E (Y )
− is approximately N (0,1) (Converges in distribution
var(Y ) to normal dist. CLT)



1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals

Hypothesis Testing
The hypothesis testing problem (for the mean): based on evidence
we have, test whether a null hypothesis is true versus whether an
alternative hypothesis is true. That is, test
– H0: E(Y ) = μY,0 vs. H1: E(Y ) > μY,0 (1-sided, >)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) < μY,0 (1-sided, <)
– H0: E(Y ) = μY,0 vs. H1: E(Y ) ≠ μY,0 (2-sided)



Most common test is:
H0: E(Y) = 0 vs. H1: E(Y) ≠ 0
• Can we reject the null hypothesis H0?
• We determine this by fixing a significance level and computing
the p-value.
– A small p-value (typically ≤ 0.05) indicates strong evidence against the
null, so you reject the null hypothesis.
– A large p-value (> 0.05) indicates weak evidence against the null
hypothesis, so you fail to reject the null hypothesis.

• The significance level is a pre-specified probability of
incorrectly rejecting the null, when the null is true: usually 10%,
5%, or 1%
• I.e. How much we are willing to risk that we incorrectly reject
the null hypothesis (given null is true)
Some terminology for testing statistical
hypotheses (1 of 2)
p-value = assuming null hypothesis is true, what is the probability
of drawing a statistic (e.g. Ῡ ) at least as adverse to the null as the
value computed from your data?
Calculating the p-value based on Ῡ :

p-value = PrH0[ |Ȳ − μY,0| > |Ȳact − μY,0| ]

where Ȳact is the value of Ȳ actually observed (nonrandom)



Some terminology for testing statistical
hypotheses (2 of 2)
• To compute the p-value, you need to know the sampling
distribution of Ῡ (complicated if n small)
• Assuming n large (usually OK), we use the normal
approximation (CLT):
p-value = PrH0[ |Ῡ − μY,0| > |Ῡ act − μY,0| ]

        = PrH0[ |(Ῡ − μY,0)/(σY/√n)| > |(Ῡ act − μY,0)/(σY/√n)| ]

        = PrH0[ |(Ῡ − μY,0)/σῩ| > |(Ῡ act − μY,0)/σῩ| ]

        ≅ probability under left+right N(0,1) tails

where σῩ = std. dev. of the distribution of Ῡ = σY/√n.
Calculating the p-value with σY known:

• For large n, p-value = the probability that a N(0,1) random variable
  falls outside |(Ῡ act − μY,0)/σῩ|
• In practice, σῩ is unknown – it must be estimated
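As a sketch of the known-σY case (all numbers here are hypothetical, not from the slides), the two-sided p-value is just the mass in both N(0,1) tails beyond the standardized sample mean:

```python
import math
from statistics import NormalDist

# Hypothetical example: test H0: E(Y) = 0 vs. H1: E(Y) != 0,
# with the population std. dev. sigma_Y assumed known.
n = 100
y_bar_act = 0.31        # observed sample mean (made up for illustration)
mu_0 = 0.0              # value of E(Y) under the null
sigma_Y = 1.5           # population standard deviation, assumed known

sigma_ybar = sigma_Y / math.sqrt(n)           # std. dev. of the distribution of Y-bar
z = (y_bar_act - mu_0) / sigma_ybar           # standardized deviation from the null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # mass in both N(0,1) tails beyond |z|
```

Here p_value ≈ 0.039, so at the 5% significance level this hypothetical sample would reject H0.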
Estimator of the variance of Y:
sY² = [1/(n−1)] Σ i=1..n (Yi − Ῡ)²   “sample variance of Y”

Fact:
If (Y1,…,Yn) are i.i.d. and E(Y⁴) < ∞, then sY² →p σY²

Why does the law of large numbers apply?


• Because sY2 is a sample average; see Appendix 3.3
• Technical note: we assume E(Y4) < ∞ because here the average
is not of Yi, but of its square; see App. 3.3
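A quick simulation of the fact above (parameters are illustrative, not from the slides): the sample variance with the n − 1 divisor gets close to σY² as n grows.

```python
import random

random.seed(0)
mu_Y, sigma_Y = 10.0, 2.0    # hypothetical true mean and std. dev., so Var(Y) = 4
n = 50_000
ys = [random.gauss(mu_Y, sigma_Y) for _ in range(n)]

y_bar = sum(ys) / n
# sample variance of Y: note the n - 1 divisor
s2_Y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
# by the LLN argument in the text, s2_Y should be close to sigma_Y**2 = 4
```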
Computing the p-value with σY² estimated:

p-value = PrH0[ |Ῡ − μY,0| > |Ῡ act − μY,0| ]

        = PrH0[ |(Ῡ − μY,0)/(σY/√n)| > |(Ῡ act − μY,0)/(σY/√n)| ]

        ≅ PrH0[ |(Ῡ − μY,0)/(sY/√n)| > |(Ῡ act − μY,0)/(sY/√n)| ]   (large n)

so

p-value = PrH0[ |t| > |t act| ]   (σY² estimated)

        ≅ probability under normal tails outside |t act|

where t = (Ῡ − μY,0)/(sY/√n)   (the usual t-statistic)

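The same calculation with σY estimated by sY, on a small made-up sample (the data and null value are hypothetical; with n this small the normal approximation is rough, the point is just the mechanics):

```python
import math
from statistics import NormalDist, stdev

sample = [2.1, 1.4, 3.3, 0.9, 2.8, 1.7, 2.5, 3.0, 1.2, 2.2,
          2.6, 1.9, 3.1, 2.4, 1.6, 2.9, 2.0, 1.8, 2.7, 2.3]
mu_0 = 1.5                                   # null: E(Y) = 1.5

n = len(sample)
y_bar = sum(sample) / n
s_Y = stdev(sample)                          # sample std. dev. (n - 1 divisor)
t = (y_bar - mu_0) / (s_Y / math.sqrt(n))    # the usual t-statistic
# large-n normal approximation to the two-sided p-value
p_value = 2 * (1 - NormalDist().cdf(abs(t)))
reject_5pct = abs(t) > 1.96                  # decision at the 5% level
```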


What is the link between the p-value and the
significance level?
• The significance level is prespecified. For example, if 5%,
– you reject the null hypothesis if |t| ≥ 1.96.
– Equivalently, you reject if p ≤ 0.05.
– Often, it is better to communicate the p-value – the p-value contains more
information than the “yes/no” statement about whether the test rejects.

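A small check of the equivalence between the two decision rules under the normal approximation (the helper name is mine): |t| = 1.96 sits exactly at the 5% boundary.

```python
from statistics import NormalDist

def two_sided_p(t):
    """Two-sided p-value for a t-statistic under the N(0,1) approximation."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# at the boundary the p-value equals the significance level
boundary_p = two_sided_p(1.96)   # ~ 0.05
# |t| > 1.96 gives p < 0.05 (reject); |t| < 1.96 gives p > 0.05 (fail to reject)
```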


Roadmap

1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence Intervals



Confidence Intervals

• A 100(1−α)% confidence interval for a population parameter μY is an
  interval that contains the true value of μY 100(1−α)% of the time

• We call (1−α) the confidence level: e.g., a 95% confidence interval

• This rule is an estimator, just like the sample mean, but it produces
  two values instead of one: the upper and lower ends of the interval



Deriving the confidence interval
(Ῡ − E(Y)) / √var(Ῡ) is approximately N(0,1)

• By the CLT, we can derive a confidence interval (a, b) for μY such that:
  – Pr(a ≤ μY ≤ b) = 0.95

• t = (Ῡ − μY) / (sY/√n)

• Use the following fact, for large n:

  Pr(−1.96 ≤ (Ῡ − μY)/(sY/√n) ≤ 1.96) ≈ 0.95



Recall: The normal distribution



Confidence Intervals
A 95% confidence interval is constructed as the set of values
of μY not rejected by a hypothesis test with a 5% significance
level.
{μY : |(Ῡ − μY)/(sY/√n)| ≤ 1.96} = {μY : −1.96 ≤ (Ῡ − μY)/(sY/√n) ≤ 1.96}

                                 = {μY : −1.96 sY/√n ≤ Ῡ − μY ≤ 1.96 sY/√n}

                                 = {μY ∈ (Ῡ − 1.96 sY/√n, Ῡ + 1.96 sY/√n)}

This confidence interval relies on the large-n results that Ῡ is
approximately normally distributed and sY² →p σY².

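The resulting interval Ῡ ± 1.96·sY/√n is easy to compute from summary statistics (the numbers below are hypothetical):

```python
import math

# hypothetical summary statistics for a sample
y_bar, s_Y, n = 2.22, 0.66, 20

half_width = 1.96 * s_Y / math.sqrt(n)       # 1.96 * SE(Y-bar)
ci_95 = (y_bar - half_width, y_bar + half_width)
```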


Interpreting confidence intervals
• An often recited, but incorrect interpretation of a confidence
interval is the following:
– “I calculated a 95% confidence interval of [0.05,0.13], which means that
there is a 95% chance that the true difference in means is in that
interval.”
– WRONG

• The true value of the population mean, μY, is fixed.
  – It is either in the interval or it is not – NO ROOM FOR PROBABILITY.

• Randomness is in the interval itself: Ῡ ± 1.96 sY/√n

• Correct interpretation: across repeated random samples, 95% of the
  constructed confidence intervals will contain the true value.

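The correct interpretation can be seen in a simulation: fix μY, draw many samples, and count how often the constructed interval covers it (parameters are arbitrary).

```python
import math
import random
from statistics import mean, stdev

random.seed(1)
mu_Y, sigma_Y = 5.0, 2.0     # fixed true parameters (hypothetical)
n, n_reps = 100, 2000

covered = 0
for _ in range(n_reps):
    ys = [random.gauss(mu_Y, sigma_Y) for _ in range(n)]
    y_bar, s_Y = mean(ys), stdev(ys)
    half = 1.96 * s_Y / math.sqrt(n)
    # the interval is random; mu_Y is not
    if y_bar - half <= mu_Y <= y_bar + half:
        covered += 1

coverage = covered / n_reps   # should be close to 0.95
```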


Initial data analysis: Compare districts with “small”
(STR < 20) and “large” (STR ≥ 20) class sizes:

Class Size   Average score (Ῡ)   Standard deviation (sY)   n
Small        657.4               19.4                      238
Large        650.0               17.9                      182

1. Estimation of Δ = difference between group means


2. Test the null hypothesis (𝐻𝐻0 ) that Δ = 0
3. Construct a confidence interval for Δ



1. Estimation
Ῡsmall − Ῡlarge = (1/nsmall) Σ i=1..nsmall Yi − (1/nlarge) Σ i=1..nlarge Yi

                = 657.4 − 650.0
                = 7.4

Is this a large difference in a real-world sense?
– Standard deviation across districts = 19.1
– Difference between the 60th and 75th percentiles of the test score
  distribution is 667.6 − 659.4 = 8.2
– Is this a big enough difference to be important for school reform
  discussions, for parents, or for a school committee?



2. Hypothesis testing (1 of 2)
Difference−in−means test: compute the 𝑡𝑡−statistic,
t = (Ῡs − Ῡl) / SE(Ῡs − Ῡl) = (Ῡs − Ῡl) / √(ss²/ns + sl²/nl)

• where SE(Ῡs − Ῡl) is the “standard error” of Ῡs − Ῡl, the subscripts s
  and l refer to “small” and “large” STR districts, and

  ss² = [1/(ns − 1)] Σ i=1..ns (Yi − Ῡs)²   (etc.)



2. Hypothesis testing (2 of 2)
Compute the difference-of-means t-statistic:

Size     Ῡ       sY     n
small    657.4   19.4   238
large    650.0   17.9   182

t = (Ῡs − Ῡl) / √(ss²/ns + sl²/nl)
  = (657.4 − 650.0) / √(19.4²/238 + 17.9²/182)
  = 7.4 / 1.83
  = 4.05

|t| > 1.96, so reject (at the 5% significance level) the null
hypothesis that the two means are the same.

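The difference-of-means t-statistic can be reproduced directly from the summary statistics in the table:

```python
import math

# summary statistics from the class-size table
y_s, s_s, n_s = 657.4, 19.4, 238     # small classes (STR < 20)
y_l, s_l, n_l = 650.0, 17.9, 182     # large classes (STR >= 20)

se_diff = math.sqrt(s_s**2 / n_s + s_l**2 / n_l)   # SE(Ybar_s - Ybar_l)
t = (y_s - y_l) / se_diff
reject_5pct = abs(t) > 1.96
```

This gives se_diff ≈ 1.83 and t ≈ 4.05, matching the slide; since |t| > 1.96, the null of equal means is rejected at the 5% level.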


3. Confidence interval
A 95% confidence interval for the difference between the means is,

(Ῡs − Ῡl) ± 1.96 × SE(Ῡs − Ῡl) = 7.4 ± 1.96 × 1.83 = (3.8, 11.0)
Two equivalent statements:
1. The 95% confidence interval for Δ doesn’t include 0;
2. The hypothesis that Δ = 0 is rejected at the 5% level.

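And the matching 95% confidence interval for Δ, from the same summary statistics:

```python
import math

diff = 657.4 - 650.0                                   # Ybar_s - Ybar_l
se_diff = math.sqrt(19.4**2 / 238 + 17.9**2 / 182)     # SE of the difference
ci_95 = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
# rounds to (3.8, 11.0); 0 lies outside, consistent with rejecting Delta = 0
```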


Let’s go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?



But is this a causal effect?
• Below are mean characteristics of the two groups, with p-values
from a t-test of the null hypothesis that the two means are equal:
Small Regular Difference p-value
% receiving free lunch (low income) 41.6 48.7 -7.1 0.001
% English as second language 12.5 20.0 -7.5 <0.001
Average income (in 000’s of $) 16.3 14.0 2.3 <0.001

• What do the p-values tell us about these two groups of districts?



Summary:
From the two assumptions of:
1. simple random sampling of a population, that is, {Yi, i = 1,…,n} are i.i.d.
2. 0 < E(Y4) < ∞

We developed, for large n:
– Theory of estimation (sampling distribution of Ῡ)
– Theory of hypothesis testing (large-n distribution of the t-statistic and
  computing the p-value)
– Theory of confidence intervals (inverting the test statistic)

