Theory 2. Probability and Statistics (Textbook Chapters 2-3)
(SW Chapters 2, 3)
The California Test Score Data Set
Variables:
• 5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
• Student-teacher ratio (STR) = no. of students in the
district divided by no. of full-time equivalent teachers
Initial look at the data:
(You should already know how to interpret this table)
Do districts with smaller classes have higher test scores?
Scatterplot of test score v. student-teacher ratio
1. Estimation
$$\bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}}\sum_{i=1}^{n_{small}} Y_i \;-\; \frac{1}{n_{large}}\sum_{i=1}^{n_{large}} Y_i = 657.4 - 650.0 = 7.4$$
2. Hypothesis testing
$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)} \qquad \text{(remember this?)}$$
Compute the difference-of-means t-statistic:
Size    Ȳ       s_Y    n
small   657.4   19.4   238
large   650.0   17.9   182

$$t = \frac{657.4 - 650.0}{\sqrt{\dfrac{19.4^2}{238} + \dfrac{17.9^2}{182}}} = \frac{7.4}{1.83} \approx 4.05$$

|t| > 1.96, so reject (at the 5% significance level) the null
hypothesis that the two means are the same.
3. Confidence interval
(Ȳs – Ȳl) ± 1.96×SE(Ȳs – Ȳl)
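As a quick numerical check, here is a short Python sketch that reproduces the difference-of-means t-statistic and this 95% confidence interval from the summary statistics in the table above (the numbers come from the table; the variable names are ours):

```python
import math

# Summary statistics from the table above
y_s, s_s, n_s = 657.4, 19.4, 238   # small-STR districts
y_l, s_l, n_l = 650.0, 17.9, 182   # large-STR districts

# Standard error of the difference in means
se = math.sqrt(s_s**2 / n_s + s_l**2 / n_l)

t = (y_s - y_l) / se                                  # difference-of-means t-statistic
ci = (y_s - y_l - 1.96 * se, y_s - y_l + 1.96 * se)   # 95% confidence interval

print(f"t = {t:.2f}")                       # ~ 4.05, so reject H0 at the 5% level
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")   # ~ (3.8, 11.0)
```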
Review of Statistical Theory
Population
• The group or collection of all possible entities of interest
(school districts)
• We will think of populations as infinitely large (∞ is an
approximation to “very big”)
Random variable Y
• Numerical summary of a random outcome (district
average test score, district STR)
Population distribution of Y
(b) Moments of a population distribution: mean, variance,
standard deviation, covariance, correlation
Moments, ctd.
$$\text{skewness} = \frac{E\big[(Y - \mu_Y)^3\big]}{\sigma_Y^3}$$
= measure of asymmetry of a distribution
• skewness = 0: distribution is symmetric
• skewness > (<) 0: distribution has long right (left) tail

$$\text{kurtosis} = \frac{E\big[(Y - \mu_Y)^4\big]}{\sigma_Y^4}$$
= measure of mass in tails
= measure of probability of large values
• kurtosis = 3: normal distribution
• kurtosis > 3: heavy tails (“leptokurtotic”)
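These two moments are easy to estimate from data. A minimal Python sketch, using sample analogs of the formulas above on simulated data (the function names and data are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)   # simulated draws; swap in real data

def skewness(y):
    """Sample analog of E[(Y - mu)^3] / sigma^3."""
    d = y - y.mean()
    return (d**3).mean() / y.std()**3

def kurtosis(y):
    """Sample analog of E[(Y - mu)^4] / sigma^4 (= 3 for a normal)."""
    d = y - y.mean()
    return (d**4).mean() / y.std()**4

print(skewness(y))   # ~ 0: symmetric distribution
print(kurtosis(y))   # ~ 3: normal tails
```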
2 random variables: joint distributions and covariance
The covariance between Test Score and STR is negative:
so is the correlation…
The correlation coefficient is defined in terms of the
covariance:
$$\text{corr}(X,Z) = \frac{\text{cov}(X,Z)}{\sqrt{\text{var}(X)\,\text{var}(Z)}} = \frac{\sigma_{XZ}}{\sigma_X \sigma_Z} = r_{XZ}$$
• –1 ≤ corr(X,Z) ≤ 1
• corr(X,Z) = 1 means perfect positive linear association
• corr(X,Z) = –1 means perfect negative linear association
• corr(X,Z) = 0 means no linear association
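A quick Python sketch of this definition, checking the covariance-based formula against the built-in correlation (simulated data; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000)
z = 0.5 * x + rng.standard_normal(1_000)   # built-in positive linear association

cov_xz = np.cov(x, z)[0, 1]                # sample covariance sigma_XZ
r_xz = cov_xz / (x.std(ddof=1) * z.std(ddof=1))

print(r_xz, np.corrcoef(x, z)[0, 1])       # the two numbers should agree
```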
The correlation coefficient measures linear association
(c) Conditional distributions and conditional means
Conditional distributions
• The distribution of Y, given value(s) of some other
random variable, X
• Ex: the distribution of test scores, given that STR < 20
Conditional expectations and conditional moments
• conditional mean = mean of conditional distribution
= E(Y|X = x) (important concept and notation)
• conditional variance = variance of conditional distribution
• Example: E(Test scores|STR < 20) = the mean of test
scores among districts with small class sizes
The difference in means is the difference between the means
of two conditional distributions:
Δ = E(Test scores | STR < 20) – E(Test scores | STR ≥ 20)
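To make the conditional-mean idea concrete, here is a small Python sketch computing the two conditional means and their difference Δ (the data are simulated, not the California data; the cutoff 20 matches the example above):

```python
import numpy as np

rng = np.random.default_rng(2)
str_ = rng.uniform(14, 26, size=420)                  # simulated student-teacher ratios
scores = 700 - 2.0 * str_ + rng.normal(0, 15, 420)    # simulated district test scores

small = str_ < 20
print(scores[small].mean())     # estimate of E(Test scores | STR < 20)
print(scores[~small].mean())    # estimate of E(Test scores | STR >= 20)
print(scores[small].mean() - scores[~small].mean())   # difference in conditional means
```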
Conditional mean, ctd.
(d) Distribution of a sample of data drawn randomly
from a population: Y1,…, Yn
Distribution of Y1,…, Yn under simple random sampling
• Because individuals #1 and #2 are selected at random, the
value of Y1 has no information content for Y2. Thus:
o Y1 and Y2 are independently distributed
o Y1 and Y2 come from the same distribution, that is, Y1,
Y2 are identically distributed
o That is, under simple random sampling, Y1 and Y2 are
independently and identically distributed (i.i.d.).
o More generally, under simple random sampling, {Yi}, i
= 1,…, n, are i.i.d.
Estimation
Ȳ is the natural estimator of the mean. But:
(a) What are the properties of Ȳ?
(b) Why should we use Ȳ rather than some other estimator?
• Y1 (the first observation)
• maybe unequal weights – not the simple average
• median(Y1,…, Yn)
The starting point is the sampling distribution of Ȳ…
(a) The sampling distribution of Ȳ
Ȳ is a random variable, and its properties are determined by
the sampling distribution of Ȳ
• The individuals in the sample are drawn at random.
• Thus the values of (Y1,…, Yn) are random
• Thus functions of (Y1,…, Yn), such as Ȳ, are random: had
a different sample been drawn, they would have taken on
a different value
• The distribution of Ȳ over different possible samples of
size n is called the sampling distribution of Ȳ.
• The mean and variance of Ȳ are the mean and variance of
its sampling distribution, E(Ȳ) and var(Ȳ).
• The concept of the sampling distribution underpins all of
econometrics.
The sampling distribution of Ȳ, ctd.
Example: Suppose Y takes on 0 or 1 (a Bernoulli random
variable) with the probability distribution,
Pr(Y = 0) = .22, Pr(Y = 1) = .78
Then
E(Y) = p×1 + (1 – p)×0 = p = .78
σY² = E[Y – E(Y)]² = p(1 – p) [remember this?]
= .78×(1 – .78) = 0.1716
The sampling distribution of Ȳ depends on n.
Consider n = 2. The sampling distribution of Ȳ is,
Pr(Ȳ = 0) = .22² = .0484
Pr(Ȳ = ½) = 2×.22×.78 = .3432
Pr(Ȳ = 1) = .78² = .6084
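A quick Python check of this n = 2 sampling distribution, enumerating all outcomes (p = .78 as above):

```python
from itertools import product

p = 0.78
# Enumerate all (y1, y2) outcomes and accumulate Pr(Ybar = value)
dist = {}
for y1, y2 in product([0, 1], repeat=2):
    prob = (p if y1 else 1 - p) * (p if y2 else 1 - p)
    ybar = (y1 + y2) / 2
    dist[ybar] = dist.get(ybar, 0) + prob

print(dist)   # {0.0: 0.0484, 0.5: 0.3432, 1.0: 0.6084}
```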
The sampling distribution of Ȳ when Y is Bernoulli (p = .78):
Things we want to know about the sampling distribution:
The mean and variance of the sampling distribution of Ȳ
General case – that is, for Yi i.i.d. from any distribution, not
just Bernoulli:
mean:
$$E(\bar Y) = E\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i\Big) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\sum_{i=1}^{n} \mu_Y = \mu_Y$$
variance (the covariance terms drop out because the Yi are independent):
$$\text{var}(\bar Y) = \text{var}\Big(\frac{1}{n}\sum_{i=1}^{n} Y_i\Big) = \frac{1}{n^2}\sum_{i=1}^{n} \text{var}(Y_i) = \frac{n\sigma_Y^2}{n^2} = \frac{\sigma_Y^2}{n}$$
Mean and variance of sampling distribution of Ȳ, ctd.
E(Ȳ) = μY
var(Ȳ) = σY²/n
Implications:
1. Ȳ is an unbiased estimator of μY (that is, E(Ȳ) = μY)
2. var(Ȳ) is inversely proportional to n
• the spread of the sampling distribution is
proportional to 1/√n
• Thus the sampling uncertainty associated with Ȳ is
proportional to 1/√n (larger samples, less
uncertainty, but square-root law)
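A small simulation sketch of these two implications (the distribution, sample sizes, and names are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, reps = 5.0, 2.0, 50_000

for n in (10, 40, 160):
    # Draw `reps` samples of size n and compute Ybar for each
    ybars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    # E(Ybar) ~ mu; var(Ybar) ~ sigma^2/n, so the spread shrinks like 1/sqrt(n)
    print(n, ybars.mean().round(3), ybars.var().round(3), sigma**2 / n)
```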
The sampling distribution of Ȳ when n is large
The Law of Large Numbers:
An estimator is consistent if the probability that it falls
within a given interval around the true population value tends to one
as the sample size increases.
If (Y1,…,Yn) are i.i.d. and σY² < ∞, then Ȳ is a consistent
estimator of μY, that is,
Pr[|Ȳ – μY| < ε] → 1 as n → ∞
which can be written, Ȳ →p μY
(“Ȳ →p μY” means “Ȳ converges in probability to μY”).
(the math: as n → ∞, var(Ȳ) = σY²/n → 0, which implies that
Pr[|Ȳ – μY| < ε] → 1.)
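A one-look LLN sketch in Python (Bernoulli p = .78 to match the running example):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.78

for n in (10, 100, 10_000):
    ybar = rng.binomial(1, p, size=n).mean()
    print(n, ybar)   # Ybar wanders less and less from p = .78 as n grows
```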
The Central Limit Theorem (CLT):
If (Y1,…,Yn) are i.i.d. and 0 < σY² < ∞, then when n is
large the distribution of Ȳ is well approximated by a
normal distribution.
• Ȳ is approximately distributed N(μY, σY²/n) (“normal
distribution with mean μY and variance σY²/n”)
• √n(Ȳ – μY)/σY is approximately distributed N(0,1)
(standard normal)
• That is, “standardized” Ȳ = (Ȳ – E(Ȳ))/√var(Ȳ) = (Ȳ – μY)/(σY/√n) is
approximately distributed as N(0,1)
• The larger is n, the better is the approximation.
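A CLT sketch in Python: standardize Ȳ from the Bernoulli example and check a tail probability against N(0,1) (the sample sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, reps = 0.78, 400, 20_000
mu, sigma = p, np.sqrt(p * (1 - p))

ybars = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (ybars - mu) / sigma       # standardized Ybar

print((np.abs(z) > 1.96).mean())            # ~ 0.05 if the N(0,1) approximation is good
```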
Sampling distribution of Ȳ when Y is Bernoulli, p = 0.78:
Same example: sampling distribution of (Ȳ – E(Ȳ))/√var(Ȳ):
Summary: The Sampling Distribution of Ȳ
For Y1,…,Yn i.i.d. with 0 < σY² < ∞,
• The exact (finite sample) sampling distribution of Ȳ has
mean μY (“Ȳ is an unbiased estimator of μY”) and variance
σY²/n
• Other than its mean and variance, the exact distribution of
Ȳ is complicated and depends on the distribution of Y (the
population distribution)
• When n is large, the sampling distribution simplifies:
o Ȳ →p μY (Law of large numbers)
o (Ȳ – E(Ȳ))/√var(Ȳ) is approximately N(0,1) (CLT)
(b) Why Use Ȳ To Estimate μY?
• Ȳ is unbiased: E(Ȳ) = μY
• Ȳ is consistent: Ȳ →p μY
• Ȳ is the “least squares” estimator of μY; Ȳ solves,
$$\min_m \sum_{i=1}^{n} (Y_i - m)^2$$
Setting the derivative with respect to m to zero:
$$\frac{d}{dm}\sum_{i=1}^{n} (Y_i - m)^2 = \sum_{i=1}^{n} \frac{d}{dm}(Y_i - m)^2 = -2\sum_{i=1}^{n} (Y_i - m) = 0$$
so the minimizer satisfies ΣYi = nm, that is, m = Ȳ.
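A quick numeric check that the sample mean minimizes the sum of squared deviations (a sketch with arbitrary made-up data):

```python
import numpy as np

y = np.array([1.0, 4.0, 4.5, 7.0, 9.5])
grid = np.linspace(y.min(), y.max(), 10_001)            # candidate values of m
ssr = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)   # sum of squares at each m

print(grid[ssr.argmin()], y.mean())   # the minimizer matches Ybar (up to grid step)
```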
Why Use Ȳ To Estimate μY, ctd.
1. The probability framework for statistical inference
2. Estimation
3. Hypothesis Testing
4. Confidence intervals
Hypothesis Testing
The hypothesis testing problem (for the mean): make a
provisional decision, based on the evidence at hand, whether
a null hypothesis is true, or instead that some alternative
hypothesis is true. That is, test
H0: E(Y) = μY,0 vs. H1: E(Y) > μY,0 (1-sided, >)
H0: E(Y) = μY,0 vs. H1: E(Y) < μY,0 (1-sided, <)
H0: E(Y) = μY,0 vs. H1: E(Y) ≠ μY,0 (2-sided)
Some terminology for testing statistical hypotheses:
p-value = probability of drawing a statistic (e.g. Ȳ) at least as
adverse to the null as the value actually computed with your
data, assuming that the null hypothesis is true.
Calculating the p-value, ctd.
• To compute the p-value, you need to know the
sampling distribution of Ȳ, which is complicated if n is
small.
• If n is large, you can use the normal approximation (CLT):
Estimator of the variance of Y:
$$s_Y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar Y)^2 = \text{“sample variance of } Y\text{”}$$
Fact:
If (Y1,…,Yn) are i.i.d. and E(Y⁴) < ∞, then sY² →p σY²
Computing the p-value with σY² estimated:
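A sketch of that computation in Python, using the large-n normal approximation (the sample y and the null value μ0 are made up for illustration):

```python
import math
import numpy as np

y = np.array([652.0, 661.5, 648.2, 670.1, 655.3, 659.8])   # hypothetical sample
mu_0 = 650.0                                               # null value, H0: E(Y) = mu_0

n = len(y)
se = y.std(ddof=1) / math.sqrt(n)   # s_Y / sqrt(n): sigma_Y estimated by s_Y
t = (y.mean() - mu_0) / se          # t-statistic

# Two-sided p-value from the standard normal CDF: p = 2 * (1 - Phi(|t|))
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
print(t, p_value)
```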
What is the link between the p-value and the significance
level?
Comments on this recipe and the Student t-distribution
Comments on Student t distribution, ctd.
4. You might not know this. Consider the t-statistic testing
the hypothesis that two means (groups s, l) are equal:
$$t = \frac{\bar{Y}_s - \bar{Y}_l}{\sqrt{\dfrac{s_s^2}{n_s} + \dfrac{s_l^2}{n_l}}} = \frac{\bar{Y}_s - \bar{Y}_l}{SE(\bar{Y}_s - \bar{Y}_l)}$$
1. The probability framework for statistical inference
2. Estimation
3. Testing
4. Confidence intervals
Confidence Intervals
A 95% confidence interval for μY is an interval that contains
the true value of μY in 95% of repeated samples.
$$\Big\{\mu_Y: \Big|\frac{\bar Y - \mu_Y}{s_Y/\sqrt{n}}\Big| \le 1.96\Big\} = \Big\{\mu_Y: -1.96 \le \frac{\bar Y - \mu_Y}{s_Y/\sqrt{n}} \le 1.96\Big\}$$
$$= \Big\{\mu_Y: -1.96\frac{s_Y}{\sqrt{n}} \le \bar Y - \mu_Y \le 1.96\frac{s_Y}{\sqrt{n}}\Big\}$$
$$= \Big\{\mu_Y \in \Big(\bar Y - 1.96\frac{s_Y}{\sqrt{n}},\; \bar Y + 1.96\frac{s_Y}{\sqrt{n}}\Big)\Big\}$$
This confidence interval relies on the large-n results that Ȳ is
approximately normally distributed and sY² →p σY².
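And the corresponding computation as a Python sketch (again with a made-up sample):

```python
import math
import numpy as np

y = np.array([652.0, 661.5, 648.2, 670.1, 655.3, 659.8])   # hypothetical sample

half_width = 1.96 * y.std(ddof=1) / math.sqrt(len(y))      # 1.96 * s_Y / sqrt(n)
ci = (y.mean() - half_width, y.mean() + half_width)
print(ci)   # 95% CI for mu_Y under the large-n normal approximation
```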
Summary:
From the two assumptions of:
(1) simple random sampling of a population, that is,
{Yi, i =1,…,n} are i.i.d.
(2) 0 < E(Y⁴) < ∞
we developed, for large samples (large n):
• Theory of estimation (sampling distribution of Ȳ)
• Theory of hypothesis testing (large-n distribution of t-
statistic and computation of the p-value)
• Theory of confidence intervals (constructed by inverting
test statistic)
Are assumptions (1) & (2) plausible in practice? Yes
Let’s go back to the original policy question:
What is the effect on test scores of reducing STR by one
student/class?
Have we answered this question?