
ECON8822 - Cross-Section and

Panel Econometrics
Notes from Stefan Hoderlein’s lectures

Paul Anthony Sarkis


Boston College
Contents

1 Nonparametrics
  1.1 Quick refresh
  1.2 Kernel Density Estimation
    1.2.1 Introductory examples
    1.2.2 Density estimation
    1.2.3 Properties of Kernel Density Estimators
    1.2.4 Going beyond the univariate, first-order KDE
  1.3 Kernel Regression Estimation
    1.3.1 Nadaraya-Watson Estimator
    1.3.2 Local OLS estimator
    1.3.3 Bias and Variance
    1.3.4 General considerations
    1.3.5 Series/sieve regression
    1.3.6 Testing
    1.3.7 Applications

2 Treatment Effects
  2.1 Intuition
  2.2 Identification
    2.2.1 Joint full independence
    2.2.2 Unconfoundedness
    2.2.3 Regression Discontinuity Design (RDD)
    2.2.4 Endogeneity
  2.3 Distributional effects
    2.3.1 Distributional tests
    2.3.2 Quantile regression
    2.3.3 Including Instruments
  2.4 IV models with covariates
  2.5 Differences-in-Differences (DiD)
    2.5.1 Setup
    2.5.2 Estimation by sample means
    2.5.3 Estimation by regression
    2.5.4 Threats to validity

3 Qualitative Dependent Variables
  3.1 Motivation
    3.1.1 Probit
    3.1.2 Logit
    3.1.3 Nonparametric regression
  3.2 Estimation
    3.2.1 Maximum Likelihood
    3.2.2 Interpretation
    3.2.3 Testing
  3.3 Ordered Dependent Variable
    3.3.1 Ordered Probit
    3.3.2 Censored Regression (Tobit)
  3.4 Nonparametric Estimation of RC models
    3.4.1 Motivation
    3.4.2 Identification
    3.4.3 Estimation
    3.4.4 Application

4 Panel Data
  4.1 Multivariate Linear Model
    4.1.1 Setup
    4.1.2 Random Effects approach
    4.1.3 Fixed Effects approach
    4.1.4 Which approach to choose?
  4.2 Nonseparable Model
    4.2.1 Setup
    4.2.2 Binary Choice model
    4.2.3 Application

5 Big Data and Machine Learning
  5.1 Introduction
    5.1.1 Some definitions
    5.1.2 Statistical Decision Theory
    5.1.3 Dimensionality Curse
  5.2 Linear Methods for Regression
    5.2.1 Least-squares
    5.2.2 Extensions of Linear regression
    5.2.3 Shrinkage methods
  5.3 Tree-based methods
    5.3.1 Introduction
    5.3.2 Formal definition
    5.3.3 Bagging
    5.3.4 Random Forests
    5.3.5 Boosting
Chapter 1

Nonparametrics

1.1 Quick refresh

An econometric model's goal is to study the (causal) relationship between a dependent variable of interest and regressors. Regressors can play two roles in economic analysis: they either serve as controls or are of direct interest.

Assume a linear model such that:

Y = α + Xβ + U

where Y is the dependent variable, X the regressors, and U an unobserved error term.

1.2 Kernel Density Estimation

Density estimation might be interesting in its own right, when you need to identify the particular distribution of a random variable. Nevertheless, it is mostly studied as a fundamental building block for more complicated semi-/nonparametric models. Following the example in the previous section, suppose we want to estimate how Y is related to X, where

Y = m_Y(X) + U

Then, using the assumption that m_Y(·) is twice differentiable and bounded in its second-order derivative, as well as the assumption that E[U|X] = 0, we recover:

E[Y|X = x] = m_Y(x) = ∫ y · f_{Y|X}(y|x) dy

Moreover, from probability theory (Bayes' theorem):

∫ y · f_{Y|X}(y|x) dy = ∫ y · [f_{YX}(y, x) / f_X(x)] dy

where you have two density functions to estimate.

1.2.1 Introductory examples

Let X be a random variable that takes the value 1 with true probability p_0 and 0 otherwise. Think of how you would estimate the probability p_0.

One answer is to draw the random variable many times, getting a series {x_1, x_2, ...}, and to estimate p̂ as the number of times we actually observed 1 divided by the number of draws. Formally, if we perform n random draws,

p̂ = (1/n) Σ_{i=1}^n I{x_i = 1}

where I{·} is a function that takes the value 1 if the condition inside is true, 0 if not. For example, if one million draws are made and 333,333 of them have turned out to be ones, then p̂ = 333333/1000000 ≈ 1/3.

Now, let's assume X is actually a continuous variable that can take any real value on its support. Thinking about the previous example, how would you estimate the probability that the realization of X falls in a given interval of length h around a given x, or more formally, in [x − h/2, x + h/2]? This value h is called the bandwidth.

Again, we could use the same strategy and draw the random variable n times, counting the times x_i falls in the ball around x and comparing with the total number of draws:

P̂r[X ∈ B_{h/2}(x)] = (1/n) Σ_{i=1}^n I{x_i ∈ B_{h/2}(x)} = (1/n) Σ_{i=1}^n I{x − h/2 ≤ x_i ≤ x + h/2}

Is this type of estimator unbiased? We can check by looking at:

E[P̂r[X ∈ B_{h/2}(x)]] = E[I{x − h/2 ≤ X ≤ x + h/2}] = Pr[X ∈ B_{h/2}(x)]

which shows that it is indeed an unbiased estimator.

1.2.2 Density estimation

We have just seen how to estimate probabilities without making assumptions on


any structure; in this subsection, we will see how it relates to estimating a density
function.

First, think of what the pdf of X, denoted f_X(x), actually is. Loosely speaking, it measures how likely X is to take values near the exact point x. In a sense, this is close to what we just did; however, we are now looking at X near a point rather than in a set. The probability of being in a set is given by the cdf F_X(x). It turns out that as we reduce the size of the set more and more, the two concepts become closer and closer. Formally, as h tends to 0, the set B_{h/2}(x) will only contain x. Since f_X(x) is the derivative of F_X(x), we can write:

f_X(x) = lim_{h→0} Pr[X ∈ B_{h/2}(x)] / h = lim_{h→0} [F_X(x + h/2) − F_X(x − h/2)] / h

where you should recognize the last term from the previous subsection.

And in fact, you could estimate the pdf by using the estimator for the probability as seen above:

f̂_X(x) = P̂r[X ∈ B_{h/2}(x)] / h = (1/(nh)) Σ_{i=1}^n I{x − h/2 ≤ x_i ≤ x + h/2}

for a given h that is relatively small (more about this later). We now have our first density estimator; let's look at it in more detail.

The basic idea behind the estimator is to count how many observations fall in the neighborhood of x, relative to the total number of observations and the size of the neighborhood. Here we say "count" since our indicator function is rather naïve and does only that: it sets a weight of one for observations in the neighborhood, and 0 for observations outside of it. The weight-assignment function is called a kernel (hence the name kernel density estimator). In particular, the one used above is called a uniform kernel because it assigns a uniform weight to all observations within the neighborhood. In practice, this is a very bad kernel and it should rarely be used. The parameter h that defines the size of the neighborhood is called the bandwidth.
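The interval-counting estimator above takes only a few lines of code; here is a minimal sketch (the function name and simulation setup are ours, for illustration only):

```python
import numpy as np

def naive_density(x, sample, h):
    """Estimate f_X(x) by the share of draws in [x - h/2, x + h/2],
    divided by the bandwidth h (the uniform-kernel estimator above)."""
    sample = np.asarray(sample)
    inside = (x - h / 2 <= sample) & (sample <= x + h / 2)
    return inside.mean() / h

rng = np.random.default_rng(0)
draws = rng.standard_normal(100_000)

# True standard normal density at 0 is 1/sqrt(2*pi), about 0.3989.
est = naive_density(0.0, draws, h=0.2)
```

With many draws and a small bandwidth, `est` lands close to the true density at the evaluation point, as the unbiasedness argument above suggests.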

1.2.3 Properties of Kernel Density Estimators

Definition 1.1 (Standard Kernel). A standard kernel K : ℝ → ℝ₊ is a non-negative function such that:

• ∫ K(ψ) dψ = 1: the kernel integrates to one (it assigns a total weight of one).

• ∫ ψ K(ψ) dψ = 0: the kernel is symmetric around 0.

• ∫ K²(ψ) dψ = κ₂ < ∞: the "roughness" of the kernel is finite.

• ∫ ψ² K(ψ) dψ = µ₂ < ∞: the kernel has a finite second moment.

You should view these properties through the lens of what we actually use a kernel for. Since a kernel is essentially a "weight-assigning" function, it makes sense that it is symmetric (observations equally far from x in either direction should be penalized equally), that it is non-negative (although it might be interesting to assign negative weights to observations we really don't want), and that it eventually stops assigning weight beyond a certain distance.

Using this definition, we can then define a kernel density estimator.

Definition 1.2 (Rosenblatt-Parzen Kernel density estimator). A kernel density estimator for a given pdf f_X(x) is defined as:

f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x_i − x)/h)

where K(·) : ℝ → ℝ₊ is a standard kernel.

Interesting examples of kernels include:

• Uniform kernel: K(ψ) = I{|ψ| ≤ 1/2}

• Gaussian kernel: K(ψ) = (1/√(2π)) · exp{−ψ²/2}

• Epanechnikov kernel: K(ψ) = (3/4) · (1 − ψ²) · I{|ψ| ≤ 1}
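These three kernels and the Rosenblatt-Parzen estimator of Definition 1.2 can be sketched as follows (a hedged illustration; the names and the simulated sample are ours):

```python
import numpy as np

# The three kernels listed above.
def uniform(p):
    return np.where(np.abs(p) <= 0.5, 1.0, 0.0)

def gaussian(p):
    return np.exp(-0.5 * p**2) / np.sqrt(2 * np.pi)

def epanechnikov(p):
    return np.where(np.abs(p) <= 1.0, 0.75 * (1.0 - p**2), 0.0)

def kde(x, sample, h, kernel=gaussian):
    """Rosenblatt-Parzen estimator: (1/(n*h)) * sum_i K((x_i - x)/h)."""
    return kernel((np.asarray(sample) - x) / h).mean() / h

# Each kernel should integrate to (roughly) one on a fine grid.
grid = np.linspace(-3, 3, 6001)
dx = grid[1] - grid[0]
areas = [float((k(grid) * dx).sum()) for k in (uniform, gaussian, epanechnikov)]

rng = np.random.default_rng(1)
draws = rng.standard_normal(50_000)
f0 = kde(0.0, draws, h=0.3, kernel=epanechnikov)   # true value is about 0.3989
```

Checking the integral of each kernel numerically is a quick sanity test of the first property in Definition 1.1.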

As we did in parametric econometrics classes, we now derive the kernel density estimator's properties, such as its bias and variance.

Bias of the KDE

Assume random sampling of iid data. Then,

E[f̂_X(x)] = E[(1/(nh)) Σ_{i=1}^n K((x_i − x)/h)] = (1/(nh)) · n · E[K((X − x)/h)]
          = (1/h) ∫ K((ξ − x)/h) · f_X(ξ) dξ

Then, we perform a change of variables such that the term inside the kernel is ψ, meaning ξ = ψh + x and dξ = h · dψ. Replacing in the formula, we get:

E[f̂_X(x)] = (1/h) ∫ K(ψ) · f_X(ψh + x) · h dψ = ∫ K(ψ) · f_X(ψh + x) dψ
Further, let's use a second-order mean value expansion to recover f_X(x):

f_X(ψh + x) = f_X(x) + ψh f'_X(x) + ((ψh)²/2) f''_X(x_r)

where x_r is a remainder point such that x_r = x + λψh. This yields:

E[f̂_X(x)] = ∫ K(ψ) · [f_X(x) + ψh f'_X(x) + ((ψh)²/2) f''_X(x_r)] dψ
          = f_X(x) ∫ K(ψ) dψ + h f'_X(x) ∫ ψ K(ψ) dψ + (h²/2) ∫ K(ψ) ψ² f''_X(x_r) dψ

where, by the properties of the kernel, the first integral equals 1 and the second equals 0.
The last term is quite problematic since f''_X(x_r) cannot be taken out of the integral. However, f''_X(x) can be, so we add and subtract it; we will see later that the remainder is actually not very relevant:

(h²/2) ∫ K(ψ) ψ² f''_X(x_r) dψ = (h²/2) ∫ K(ψ) ψ² [f''_X(x_r) − f''_X(x)] dψ + (h²/2) ∫ K(ψ) ψ² f''_X(x) dψ
                               = R + (h²/2) f''_X(x) µ₂

where R is bounded by o(h²). Finally, we can write the expectation of our kernel density estimator as:

E[f̂_X(x)] = f_X(x) + (h²/2) f''_X(x) µ₂ + o(h²)

and the bias is given by the last two terms. From this equation, you can see that the bias is increasing in the bandwidth. This is intuitive, since a greater bandwidth implies more observations that are not related to x (global information) relative to the observations actually close to x (local information). Global information being more likely to introduce bias in the estimator, h is positively related to the bias. In the opposite direction, the bias disappears as h goes to 0. This suggests the estimator is more accurate when the bandwidth is very small; why not, then, make the bandwidth as small as possible? One could show by a similar derivation that the variance of the estimator is given by:

Var[f̂_X(x)] = (1/(nh)) f_X(x) κ₂ + o((nh)⁻¹)
which this time increases as h tends to 0. Again, this makes sense intuitively: reducing the bandwidth eventually reduces the number of observations in the neighborhood and thus increases the variance. This phenomenon is called the bias-variance trade-off.

Bias-variance trade-off

In order to have a sense of what the bias and variance look like over the whole distribution, we integrate them with respect to x:

∫ Bias[f̂_X(x)]² dx = c₁ · h⁴        ∫ Var[f̂_X(x)] dx = c₂ · (nh)⁻¹

This allows us to design an optimal measure of the trade-off, analogous to the mean squared error in the parametric case, defined as the Mean Integrated Squared Error: MISE(h) ≡ c₁ · h⁴ + c₂ · (nh)⁻¹. Now suppose we want to find the best bandwidth by minimizing the MISE:

∂MISE/∂h = 0 ⇔ 4 c₁ h³ − c₂ n⁻¹ h⁻² = 0 ⇔ h ∼ n^{−1/5}
meaning that h must be proportional to n^{−1/5}. Again, this makes a lot of sense since it implies that increasing the number of observations allows you to reduce the size of the bands: the more observations you have, the more likely it is that they fall around x, and thus the less need you have to keep wide bands.
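The first-order condition above can be checked numerically; a small sketch (the constants c₁ and c₂ are set to 1 purely for illustration):

```python
import numpy as np

def mise(h, n, c1=1.0, c2=1.0):
    """MISE(h) = c1 * h^4 + c2 / (n * h)."""
    return c1 * h**4 + c2 / (n * h)

def h_star(n, c1=1.0, c2=1.0):
    # First-order condition 4*c1*h^3 = c2/(n*h^2)  =>  h^5 = c2/(4*c1*n).
    return (c2 / (4 * c1 * n)) ** 0.2

n = 10_000
grid = np.linspace(0.01, 1.0, 100_000)
h_grid = grid[int(np.argmin(mise(grid, n)))]   # numerical minimizer of the MISE
```

The grid-search minimizer agrees with the closed-form h*, and multiplying n by 32 halves h*, confirming the n^{−1/5} rate.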

Asymptotics

The rate of convergence of the KDE is √(nh), where n is the number of observations and h the bandwidth. For the optimal bandwidth, we had h = n^{−1/5}, yielding a convergence rate of √(n · n^{−1/5}) = √(n^{4/5}) = n^{2/5}. Therefore, the nonparametric estimator has a slower rate of convergence than its parametric counterparts, the OLS and ML estimators.

1.2.4 Going beyond the univariate, first-order KDE

As we've seen, the KDE method is very appealing in how it gets around the lack of structure, but it creates a new trade-off between bias and variance. In order to further reduce the bias, one might be interested in extending the kernels beyond the univariate, second-order case.

Density derivatives

If f_X(x) is a differentiable function of x, one could also use the derivative of the kernel to estimate this object. In practice, to estimate an r-th order derivative, an r-th order kernel would set all moments up to the r-th one to 0, and keep the r-th one as a finite moment µ_r. This technique has the advantage of convergence at a rate closer to √n. However, such a kernel has potentially negative tails (meaning it is not a proper density), and the estimator can be very inefficient in small samples.

Multivariate Density estimation

Another way of achieving bias reduction is multivariate density estimation, where X is now a random vector in ℝᵏ. The kernel density estimator is then a product of univariate kernels. Obviously, this method captures additional variation in the function, but you need many more observations to get a useful result. To see why, recall the intuition behind the kernel function: it assigns weights to observations based on their distance to a point of interest, given a bandwidth. It turns out that increasing the dimension of the neighborhood (from an interval, to a square, to hypercubes) also increases the volume of the object, thus reducing the probability of observations being near the point of interest and increasing the need for observations. This problem is called the curse of dimensionality.
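A quick simulation illustrates the curse of dimensionality; the setup (uniform draws on the unit hypercube, a fixed-width neighborhood) is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def share_in_neighborhood(k, h=0.5, n=200_000):
    """Share of uniform draws on [0, 1]^k that fall within h/2 of the
    center point in every coordinate (a hypercube neighborhood).
    For this design the theoretical share is h^k."""
    x = rng.uniform(size=(n, k))
    inside = (np.abs(x - 0.5) <= h / 2).all(axis=1)
    return float(inside.mean())

shares = [share_in_neighborhood(k) for k in (1, 2, 5, 10)]
```

Even with a generous bandwidth, the share of usable observations collapses geometrically as the dimension k grows, which is exactly why multivariate KDE demands far larger samples.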

1.3 Kernel Regression Estimation

In the previous section, we were interested in estimating the distribution of a single variable X. However, in most economic applications, a more interesting object is the behavior of a variable Y conditional on X, and in particular its conditional mean. This is the mean regression model that we study here.

Recall our definition of a kernel density estimator for a true density f_X(x):

f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x_i − x)/h)

where K is a standard kernel (refer to …). Also recall the mean regression model:

E[Y|X = x] = m_Y(x) = ∫ y · [f_{XY}(x, y) / f_X(x)] dy
Our goal is to use kernel density estimators for both the distribution of X and the
joint distribution of X and Y . Formally, we look for:

m̂_Y(x) = ∫ y · [f̂_{XY}(x, y) / f̂_X(x)] dy

1.3.1 Nadaraya-Watson Estimator

Intuitively, we turn directly to our kernel density estimators. We already wrote the definition of our estimator in the univariate case of estimating f̂_X(x), but what is the KDE for the joint distribution of X and Y? For that we use a multivariate KDE with a product kernel. In particular, we get:

f̂_{XY}(x, y) = (1/(nh²)) Σ_{i=1}^n K((x_i − x)/h) · K((y_i − y)/h)

The two KDEs can be used in the mean regression estimator to get:

m̂_Y(x) = ∫ y · [f̂_{XY}(x, y) / f̂_X(x)] dy
        = [(nh²)⁻¹ Σ_{i=1}^n K((x_i − x)/h) · ∫ y K((y_i − y)/h) dy] / [(nh)⁻¹ Σ_{i=1}^n K((x_i − x)/h)]
        = (1/h) · [Σ_{i=1}^n K((x_i − x)/h) · ∫ y K((y_i − y)/h) dy] / [Σ_{i=1}^n K((x_i − x)/h)]

The only term that is not obvious here is the last term of the numerator. Let’s
look at it in detail.

Apply a change of variables so that ψ = (y − y_i)/h is the term inside the kernel (recall that since the kernel is symmetric, K((y_i − y)/h) = K((y − y_i)/h)). We then have y = ψh + y_i and dy = h dψ, and can write:

∫ y K((y_i − y)/h) dy = ∫ (ψh + y_i) K(ψ) h dψ

and, separating the two terms:

h² ∫ ψ K(ψ) dψ + h y_i ∫ K(ψ) dψ = h · y_i

from the properties of the kernel. Finally, plugging this expression back into the mean regression estimator, we get:

m̂_Y(x) = [Σ_{i=1}^n K((x_i − x)/h) · y_i] / [Σ_{i=1}^n K((x_i − x)/h)]

Definition 1.3 (Nadaraya-Watson estimator). For a given model of two variables Y and X such that Y = m_Y(X) + U, the Nadaraya-Watson estimator of the function m_Y(x) is defined as:

m̂_Y(x) = [Σ_{i=1}^n K((x_i − x)/h) · y_i] / [Σ_{i=1}^n K((x_i − x)/h)]

where K(·) : ℝ → ℝ₊ is a standard kernel.

Note that a kernel regression estimator is only a valid estimator for m(·) in a local
neighborhood of size h.
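Definition 1.3 takes only a few lines to implement; an illustrative sketch with a Gaussian kernel (the simulated regression function sin(x) is our own choice):

```python
import numpy as np

def nadaraya_watson(x, xi, yi, h):
    """m_hat(x) = sum_i K((x_i - x)/h) * y_i / sum_i K((x_i - x)/h),
    here with a Gaussian kernel (the 1/sqrt(2*pi) factor cancels)."""
    w = np.exp(-0.5 * ((np.asarray(xi) - x) / h) ** 2)
    return float((w * np.asarray(yi)).sum() / w.sum())

rng = np.random.default_rng(3)
xi = rng.uniform(-2, 2, 5_000)
yi = np.sin(xi) + 0.1 * rng.standard_normal(5_000)

m_hat = nadaraya_watson(1.0, xi, yi, h=0.15)   # true m(1) = sin(1), about 0.841
```

Note how the normalizing constant of the kernel drops out of the ratio, so only the relative weights matter.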

Nadaraya-Watson and OLS

Consider a model of Y and X such that Y = α + U, or in matrix form:

(Y_1, ..., Y_N)′ = (1, ..., 1)′ · α + (U_1, ..., U_N)′

By OLS, the estimation of α is now straightforward:

α̂ = (ι′ι)⁻¹ ι′Y = (Σ_i Y_i)/n = Ȳ

where ι is an n-dimensional vector of ones. This means that the OLS estimation is equivalent to fitting a constant (the average of Y) globally on the model. Now, if you consider the NW estimator, you should see a relation between the two. In fact, within the neighborhood of x, the two estimators are exactly the same. Hence, intuitively, you could see the NW estimator as fitting a constant locally for each x. To see that, reweight the data by K((x − X_i)/h)^{1/2}, so that (in the case of the uniform kernel) observations in the neighborhood get weight 1 while others get 0; the NW estimator is then simply the average of Y over observations inside the neighborhood.

1.3.2 Local OLS estimator

We have seen that the intuition behind the NW estimator is about fitting a constant locally on the model. Naturally, one could extend this line of reasoning and fit more complex models inside the kernel. In particular, a well-studied extension is to fit a line, i.e. a local linear model. This type of model is usually called local OLS. It is represented by the following model:

Y = m(x) ι̃ + h · m′(x) · (X_i − x)/h + U = X̃ β(x) + U

Note that adding terms to the polynomial used to fit the model locally does not change the value of the m(x) function, but rather adds information about higher-order derivatives of m(·) at the point x. For example, fitting a simple line locally gives the value of the function at x as well as the slope of m at x.

Definition 1.4 (Local OLS estimator). For a given model of two variables Y and X such that Y = m(x) ι̃ + h · m′(x) · (X_i − x)/h + U = X̃ β(x) + U, the local OLS estimator for the function m_Y(x) is defined as:

β̂(x) = (X̃′X̃)⁻¹ X̃′Ỹ
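A sketch of the local OLS (local linear) estimator, solving kernel-weighted normal equations; the Epanechnikov weights and the test function are our illustrative choices, not prescribed by the notes:

```python
import numpy as np

def local_linear(x, xi, yi, h):
    """Local OLS: weighted regression of y on (1, (x_i - x)/h) with
    Epanechnikov weights. Returns beta_hat(x) = (m_hat(x), h * m'_hat(x))."""
    xi, yi = np.asarray(xi), np.asarray(yi)
    psi = (xi - x) / h
    w = np.where(np.abs(psi) <= 1.0, 0.75 * (1.0 - psi**2), 0.0)
    X = np.column_stack([np.ones_like(psi), psi])
    WX = X * w[:, None]
    # Solve the weighted normal equations (X'WX) beta = X'Wy.
    return np.linalg.solve(X.T @ WX, WX.T @ yi)

rng = np.random.default_rng(4)
xi = rng.uniform(-2, 2, 5_000)
yi = np.sin(xi) + 0.1 * rng.standard_normal(5_000)

b0, b1 = local_linear(1.0, xi, yi, h=0.4)   # b0 estimates m(1), b1 estimates h * m'(1)
```

As the text notes, the intercept recovers m(x) while the slope coefficient carries derivative information, here scaled by h.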

1.3.3 Bias and Variance

The bias of the kernel regression estimator is given by:

E[β̂_0|X] − β_0 = E[m̂(x)|X] − m(x) = (h²/2) · µ₂ m''_Y(x) + o_p(h²)

which is very similar to the formula derived for the bias of the kernel density estimator. If we also look at the variance, we get:

Var[β̂_0|X] = Var[m̂(x)|X] = (1/(nh)) · κ₂ σ²_U / f_X(x) + o_p((nh)⁻¹)

which is slightly different from the kernel density counterpart. In fact, in the regression case, f_X(x) enters the denominator, meaning that increasing f_X(x) (the probability of finding observations at the point x) decreases the variance, while in density estimation f_X(x) enters the numerator, so the variance increases with the density at x.

1.3.4 General considerations

Asymptotic normality

Under regularity conditions similar to those for the density estimator, the (suitably centered and scaled) regression estimator tends in distribution to a standard normal as the number of observations increases to ∞. The rate of convergence is also √(nh) in this setting.

Curse of dimensionality

Again, in a similar way as the KDE, the kernel regression estimator faces the curse of dimensionality as the number of regressors k increases. The rate of convergence is then √(nhᵏ).

Higher-order bias reduction

Same as KDE.

Order of local polynomial

As discussed above, instead of fitting a constant or a line, one could go further and fit higher-order polynomials to the data in order to get more information on the shape of the fitted function. It turns out that, asymptotically, there is no cost to moving to the next odd order when estimating a given object of interest. For example, if you want to estimate m(·) up to its p-th derivative, you could use a local polynomial of order p + 1 or p + 3. You should remember that when it comes to local OLS, it is an odd world.

Moreover, fitting polynomials achieves bias reduction in the same way that higher-order kernels do, but without the cost of putting negative weight on some observations. This is why higher-order polynomials are generally considered more attractive than higher-order kernels.

Selection of bandwidth

There are two schools of thought when it comes to bandwidth selection in the case of kernel regression.

We have seen in the kernel density estimation section that we chose the bandwidth to minimize the mean integrated squared error. In the context of kernel regression, the MISE does not have an analytic expression, and we thus have to approximate it by the asymptotic MISE:

AMISE(h) = ∫ [((h²/2) · µ₂ m''_Y(x))² + (1/(nh)) · κ₂ σ²_U / f_X(x)] dx

and then minimize over h.

The second way to select an adequate bandwidth is to use cross-validation. Define the leave-one-out estimator m̂₋ⱼ as the standard local OLS estimator computed without the j-th observation. Now let the average prediction error (APE) be:

APE(h) = (1/n) Σ_{j=1}^n (Y_j − m̂₋ⱼ(X_j))²

It turns out that choosing h to minimize the APE is equivalent to minimizing the MISE.
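The leave-one-out criterion can be sketched as follows; for brevity we use a Nadaraya-Watson (local constant) fit rather than local OLS, and the simulated data are our own illustration:

```python
import numpy as np

def loo_ape(h, xi, yi):
    """Average prediction error of the leave-one-out fit:
    APE(h) = (1/n) * sum_j (y_j - m_hat_{-j}(x_j))^2,
    where m_hat_{-j} is a Gaussian-kernel Nadaraya-Watson fit
    computed without observation j."""
    psi = (xi[:, None] - xi[None, :]) / h
    w = np.exp(-0.5 * psi**2)
    np.fill_diagonal(w, 0.0)          # leave observation j out of its own fit
    m_loo = (w @ yi) / w.sum(axis=1)
    return float(np.mean((yi - m_loo) ** 2))

rng = np.random.default_rng(5)
xi = rng.uniform(-2, 2, 400)
yi = np.sin(xi) + 0.2 * rng.standard_normal(400)

grid = np.linspace(0.05, 1.5, 60)
apes = [loo_ape(h, xi, yi) for h in grid]
best = int(np.argmin(apes))
h_cv = grid[best]                     # cross-validated bandwidth
```

The APE curve is U-shaped in h: too small a bandwidth overfits the noise, too large a bandwidth oversmooths, and the cross-validated minimum lies in between.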

Choice of kernel

Use Epanechnikov.

1.3.5 Series/sieve regression

1.3.6 Testing

In nonparametrics, hypothesis testing is separate from the regression estimation. It is also difficult to perform since, generally, hypothesis testing looks at interesting features of the whole function, while nonparametric estimation focuses on local features of the data.

Omission of variables test

1.3.7 Applications

Binary Dependent Variable

Consider the case where Y, the dependent variable, only takes the values 1 or 0, such that:

Y = 1 if Xβ + U > 0, and Y = 0 otherwise,
where U is assumed to be independent of X, U ⊥ X. As outlined in the beginning
of this section, we look for an estimator for the expectation of Y given X = x.

Since Y is now a discrete random variable, we can write:

E[Y|X = x] = 1 · Pr[Xβ + U > 0|X = x] + 0 · Pr[Xβ + U ≤ 0|X = x]
           = Pr[Xβ + U > 0|X = x] = Pr[U > −xβ]
           = 1 − G(−xβ)

where G denotes the cdf of U.

This setting is problematic since it does not allow for point identification of β. To see that, note that we have two objects to estimate here: θ = {G(−Xβ), β}. However, we could also define β̃ = β/c and G̃(z) = G(cz), which would yield G̃(−X β̃) = G(−Xβ). This means that, observing the same data, one could estimate either G̃ or G, leaving β unidentified, or set identified (all vectors β̃). We say that β is identified up to scale.

In order to solve this issue, we can impose a restriction on the size of β such that we can single out one parameter from all proportional parameters. We call this restriction a normalization.

This normalization turns out not to affect the economic meaningfulness of the model. In fact, we have just seen that G(·) is perfectly identified, and identification of β, although an advantage, is not necessary. To see that, consider another object of interest in this model:

∇_x E[Y|X = x] = β · g(−xβ)

where g denotes the pdf of the distribution of U. Then, define the set of parameters we want to estimate as θ = {G(−Xβ), β g(−Xβ)}. Now let β̃ = β/c and G̃(z) = G(cz). From this we get that G(−Xβ) = G̃(−X β̃) and β g(−Xβ) = β̃ g̃(−X β̃). Therefore, we can write θ̃ = θ, meaning that whatever the value of c, our set of objects of interest will not change.
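A tiny numerical check of the scale-normalization argument; taking G to be the logistic cdf is purely our illustrative assumption:

```python
import math

def G(z):
    """An assumed error cdf for illustration: the logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

beta, c = 2.0, 5.0
beta_tilde = beta / c                 # rescaled coefficient beta/c
G_tilde = lambda z: G(c * z)          # correspondingly rescaled cdf

xs = [-1.0, -0.3, 0.0, 0.7, 2.5]
# Pr[Y = 1 | X = x] under (G, beta), and under (G_tilde, beta_tilde).
probs = [1 - G(-x * beta) for x in xs]
probs_tilde = [1 - G_tilde(-x * beta_tilde) for x in xs]
```

The two parameterizations produce identical choice probabilities at every x, so no amount of data on (Y, X) can tell them apart; this is exactly why a normalization is needed.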

Chapter 2

Treatment Effects

2.1 Intuition

Suppose the data follow the model:

Y = φ(D, A)

where φ(·) is a very general function of the data (not necessarily differentiable); D is the discrete (binary) variable indicating whether the treatment was administered (1) or not (0); and A is a potentially infinite-dimensional error.

We denote by Y₁ and Y₀ respectively the values of the outcome under each treatment status:

Y₁ = φ(1, A);    Y₀ = φ(0, A)

Note that the two variables are never observed together. The observations in the data are realized outcomes depending on the realization of the random variable A. Thus, the function φ that transforms the data can never be observed.

Ideally, we want to be able to recover the effect of the treatment, i.e. how the outcome changes when D moves from 0 to 1, for a given A = a (an individual). We call this the individual treatment effect:

Y1 − Y0 = φ(1, A) − φ(0, A)

which varies with A, across the population. However, knowing the effect for any single individual might not be that useful in practical terms. In fact, when designing a policy or evaluating programs, you might be interested in a subgroup of people, or in the population as a whole, but rarely in each individual. This is why
we might be more interested in the Average Treatment Effect (or ATE), defined
as:
ATE ≡ E[Y₁ − Y₀] = E[φ(1, A) − φ(0, A)]
the average of the individual treatment effect across all individuals. One could
also be directly looking at the subgroup of interest, say the average treatment
effect on the treated (ATT), i.e.

ATT ≡ E [Y1 − Y0 |D = 1] = E [φ(1, A) − φ(0, A)|D = 1]

Going further, one might be interested in identifying a subgroup on other charac-


teristics X, using the average treatment effect conditional on X (or CATE):

CATE ≡ E[Y₁ − Y₀|X = x] = E[φ(1, A) − φ(0, A)|X = x]

Finally, in the same line of reasoning, one could separate subpopulations in terms
of endogeneity of their response to the treatment, using estimators we’ll study
later like LATE, MTE, etc. All these estimators take the form:

E [Y1 − Y0 |Subpop.] = E [φ(1, A) − φ(0, A)|Subpop.]

To go back to our first object of interest, an alternative interpretation of the


average treatment effect can be found by rewriting the equation in terms of the
binary variable:

Y = φ(D, A) = Y0 + (Y1 − Y0 ) · D = α(A) + β(A) · D

where Y0 becomes a random intercept and Y1 − Y0 a random slope. Then, the ATE
is the average random slope of the model: ATE = E[β(A)].

2.2 Identification

If Y₁ and Y₀ were known for the whole population under study, there would not be a whole field dedicated to computing the ATE: averaging a simple subtraction would be quite easy. However, for any individual i, only one of the two outcomes can be observed at a given time; either the individual received the treatment (Y₁ᵢ is observed) or he did not (Y₀ᵢ is observed). Because of this fact, we will have to make some assumptions on the unobservables to make progress.

2.2.1 Joint full independence

In particular, the first assumption we ought to make is the so-called "joint full independence" of outcomes with respect to treatment. Formally, we write (Y₁, Y₀) ⊥ D, meaning that, jointly, Y₁ and Y₀ are fully independent of D. This also implies that A ⊥ D.

Intuitively, this assumption (denoted A2) means two things. First, that everything
not observed by the econometrician (A) is independent of the treatment D, i.e.
receiving the treatment or not does not change the unobserved variables that
might affect the outcome of the treatment. Second, the unobserved variable A
has no effect on the treatment being delivered or not, i.e. the treatment is purely
random, even on unobserved characteristics.

This assumption has an interesting implication for the regression of Y on D. Consider the regression separately for each treatment group. For the treated:

E[Y|D = 1] = E[Y₀ + (Y₁ − Y₀) · D|D = 1]
           = E[Y₀ + (Y₁ − Y₀)|D = 1]
           = E[Y₁|D = 1]
           = E[Y₁]   by the assumption of joint full independence.

Similarly, for the non-treated, you get E[Y|D = 0] = E[Y₀]. And thus:

ATE = E[Y₁ − Y₀] = E[Y₁] − E[Y₀] = E[Y|D = 1] − E[Y|D = 0]

In words, this assumption allows the econometrician to compute the ATE as the difference between the average outcome in the treatment group (E[Y|D = 1]) and the average outcome in the control group (E[Y|D = 0]). Remember that this can only be true if unobservables across the whole population are independent of the treatment.

Under the same assumption, we also get that E[Y|D = 0] = E[Y0] = E[Y0|D = 1],
and thus ATE = ATT.

2.2.2 Unconfoundedness

Although the previous assumption allows for some very interesting results, it
requires a lot of effort to ensure: it requires perfect randomization of the
treatment assignment. This setting is called a perfect experiment, but it is not so
common in research, as it is hard to randomize and/or to make sure that everyone
follows the instructions. Nevertheless, we can study a more realistic setting
where, conditional on some observables X, we have independence. Assume the
following model:

Y = φ(D, X, A) = α(A, X) + β(A, X) · D

the unconfoundedness assumption (A2′) requires that (Y1, Y0) ⊥ D|X, implying
that A ⊥ D|X, instead of the previous A ⊥ D. A weaker assumption (A2′′) would
be that only the conditional expectations of the potential outcomes are the same
regardless of the treatment once conditioned on X (more formally,
E[Yj|D, X] = E[Yj|X] for both j = 0, 1).


Using this assumption and following the same reasoning as with joint full in-
dependence, we can come up with the Conditional Average Treatment Effect
(CATE):

CATE(x) = E[Y1 − Y0|X = x] = E[Y1|X = x] − E[Y0|X = x]
        = E[Y|D = 1, X = x] − E[Y|D = 0, X = x]

and thus, the average treatment effect as:

ATE = ∫ CATE(x) dF(x) = ∫ (E[Y|D = 1, X = x] − E[Y|D = 0, X = x]) dF(x)

which in words is the expectation of the CATE over X.

Estimation

This equation for the ATE should really ring a bell if you have followed the
last chapter: both elements inside the integral can be estimated with
nonparametric (kernel) regression. However, applying this type of regression
directly to the problem throws you straight into the curse of dimensionality
(the expectations are conditional on both D and X, the latter potentially being
multidimensional as well).

A solution to this problem is to introduce a scalar variable linking the treatment
D to the observables X. Define the propensity score p(x) ≡ Pr[D = 1|X = x].
As X varies across the population, so does p(X); summarize it as the random
variable P ≡ p(X). Then, using the previous assumption A2′, we get:

CATE(p) = E[Y|D = 1, P = p] − E[Y|D = 0, P = p]

where P has only a single dimension.

In practice, this propensity score p can be estimated nonparametrically or
parametrically. The issue with nonparametric estimation is that it merely
displaces the curse of dimensionality. Using a parametric structure such as a
probit, a logit, or the like avoids the dimensionality issue.

Now, using the definition of the ATE, we have:

ATE = E[E[Y|D = 1, P] − E[Y|D = 0, P]]

which suggests the sample counterpart:

ÂTE = (1/n) Σi [m̂1(Pi) − m̂0(Pi)]

where m̂d(p) is a kernel regression estimate of E[Y|D = d, P = p].

Practical issues

Using the last chapter, we know how to estimate m(·). Nevertheless, the setting
derived just above is slightly different from before, in the sense that the object
of interest, ÂTE, is now an average over kernel regression estimators. Among
other things, this changes what the optimal bandwidth is. Since we are now
averaging, we can afford smaller bandwidths without worrying too much about
the effect on variance (averaging reduces variance). Because a small h is no
longer that costly, the cross-validation approach does not deliver the best
bandwidth anymore, so we’ll have to use different approaches. In particular, the
field has come up with two interesting ones: (1) propensity score matching and
(2) direct averages.

Propensity score matching is very intuitive and maps to a sort of nearest-neighbor
estimator. The idea is that for any individual i in the control group
(with propensity score pi), you find the individual i′ in the treatment group such
that i′ ∈ arg min j∈I1 |pi − pj|, where I1 is the set of individuals who received the
treatment. In words, you “match” every individual in the control group with at
least one individual in the treatment group, based on the proximity of their
propensity scores. Then, for each pair you compute the difference in outcomes,
and finally average over all pairs. The advantage of this estimator is that as n
increases, matched pairs become closer and closer in propensity score. However,
one disadvantage is that even with an infinite number of individuals, the bias of
this estimator will not vanish.
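As an illustration, here is a minimal Python sketch of one-to-one propensity score matching (not part of the original notes; the data-generating process, the homogeneous treatment effect of 1, and the use of the true propensity score in place of a first-stage estimate p̂ are assumptions made for the example):

```python
import bisect
import random

def psm_att(y, d, p):
    """Match each treated unit to the control unit with the nearest
    propensity score; average the outcome differences (an ATT estimate)."""
    controls = sorted((pi, yi) for yi, di, pi in zip(y, d, p) if di == 0)
    cps = [pi for pi, _ in controls]
    diffs = []
    for yi, di, pi in zip(y, d, p):
        if di == 1:
            j = bisect.bisect_left(cps, pi)
            # nearest of the two neighbouring control units
            cand = [k for k in (j - 1, j) if 0 <= k < len(controls)]
            k = min(cand, key=lambda k: abs(cps[k] - pi))
            diffs.append(yi - controls[k][1])
    return sum(diffs) / len(diffs)

# Synthetic data: homogeneous effect of 1, true propensity score known.
random.seed(1)
n = 20_000
x = [random.random() for _ in range(n)]
p = [0.3 + 0.4 * xi for xi in x]
d = [1 if random.random() < pi else 0 for pi in p]
y = [di + xi + random.gauss(0.0, 0.5) for di, xi in zip(d, x)]

print(round(psm_att(y, d, p), 1))  # close to 1.0
```

In practice, the list p would be replaced by a first-stage estimate p̂(Xi).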

The second approach, direct averages, uses a clever rewriting of the problem
such that the ATE is defined as:

ATE ≡ E[ (D − p(X)) · Y / (p(X) · (1 − p(X))) ]

which suggests the following simple sample counterpart:

ÂTE ≡ (1/n) Σi (Di − p̂(Xi)) · Yi / (p̂(Xi) · (1 − p̂(Xi)))

where p̂(·) can be any first-stage estimator of the propensity score (nonparametric,
probit, etc.).
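To make the sample counterpart concrete, here is a small Python sketch (an illustration, not from the notes; the randomized design with a known propensity score of 0.5 and the treatment effect of 2 are assumptions of the example):

```python
import random

def ipw_ate(y, d, p_hat):
    """Direct-average estimator:
    ATE_hat = (1/n) * sum_i (D_i - p_i) * Y_i / (p_i * (1 - p_i))."""
    n = len(y)
    return sum((di - pi) * yi / (pi * (1.0 - pi))
               for yi, di, pi in zip(y, d, p_hat)) / n

# Synthetic randomized experiment: treatment effect of 2, propensity
# score known and equal to 0.5 for everyone.
random.seed(0)
n = 50_000
d = [1 if random.random() < 0.5 else 0 for _ in range(n)]
y = [2.0 * di + random.gauss(0.0, 1.0) for di in d]
p = [0.5] * n

print(round(ipw_ate(y, d, p), 1))  # close to 2.0
```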

2.2.3 Regression Discontinuity Design (RDD)

The RDD is another setup used to analyze treatment effects conditional on
covariates. The idea is simple and intuitive: it relies on an existing
discontinuity in treatment selection (who gets it and who does not) to study
the effect of the treatment. In simpler words, if, along a dataset, the only
discontinuity is whether the treatment was received (all other variables are
continuous), then by studying the responses of people around the discontinuity,
you can identify the effect of the treatment.

For example, consider a situation in which students in a high school are selected
into an “honors” class based on their grade on some exam. The threshold is set
at 800 points, such that everyone (this is important) above the threshold goes into
the “honors” program, and everyone below does not. Then, assuming that people
close to the threshold (on both sides) are similar in ability, we can study
the effect of the “honors” program by looking at the average difference in
outcomes between people on the two sides of the threshold.

Model

Consider the following structural model:

Y = Y0 (X, A) + [Y1 (X, A) − Y0 (X, A)] · D where D = I{X ≥ c}

or in words, the total outcome Y is equal to the control outcome (Y0 ) plus the
difference between the treatment and control outcomes (Y1 − Y0 ), in case the
treatment was administered, which is the case if and only if X ≥ c. Then, we get:

• On the right side of the discontinuity:

  lim_{x→c⁺} E[Y|D = 1, X = x] = lim_{x→c⁺} E[Y1(x, A)|D = 1, X = x]
                               = lim_{x→c⁺} E[Y1(x, A)|X = x]   (by A2′)

• On the left side of the discontinuity:

  lim_{x→c⁻} E[Y|D = 0, X = x] = lim_{x→c⁻} E[Y0(x, A)|D = 0, X = x]
                               = lim_{x→c⁻} E[Y0(x, A)|X = x]   (by A2′)

Assume that the distribution of the unobservables A, conditional on the observables
X, is exactly the same within an infinitesimal neighborhood of the threshold c.
Formally, assume:

lim_{x→c⁺} f_{A|X}(a; x) = lim_{x→c⁻} f_{A|X}(a; x) = f_{A|X}(a; c)

Moreover, assume that the outcome Yj, conditional on both X and A, is the same
within an infinitesimal neighborhood of the threshold. Formally,

lim_{x→c⁺} Yj(x, a) = lim_{x→c⁻} Yj(x, a) = Yj(c, a)

for both Y0 and Y1.

Then, we have that:

lim_{x→c⁺} E[Y|D = 1, X = x] − lim_{x→c⁻} E[Y|D = 0, X = x]
  = E[Y1(c, A)|X = c] − E[Y0(c, A)|X = c]
  = E[Y1(c, A) − Y0(c, A)|X = c] = CATE(c)

This technique gives us the conditional average treatment effect for those around
the threshold. For that reason, it cannot be used to recover the global average
treatment effect (the ATE), even using the techniques developed above. One
should always keep in mind that the RDD model only applies in the neighborhood
of the discontinuity.
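As an illustration, a naive Python sketch of the sharp-RDD estimand compares mean outcomes in a small bandwidth on either side of the cutoff (not from the notes; the data-generating process, cutoff of 800, effect of 5, and bandwidth of 10 are assumptions of the example, and a local linear regression would further reduce the bias coming from the trend in x):

```python
import random

def rdd_cate(y, x, c, h):
    """Difference between the mean outcome just above and just below
    the cutoff c, inside bandwidth h (sharp RDD)."""
    above = [yi for yi, xi in zip(y, x) if c <= xi < c + h]
    below = [yi for yi, xi in zip(y, x) if c - h <= xi < c]
    return sum(above) / len(above) - sum(below) / len(below)

random.seed(2)
n = 100_000
x = [random.uniform(700, 900) for _ in range(n)]
# treatment effect of 5 at the cutoff c = 800, plus a smooth trend in x
y = [0.01 * xi + (5.0 if xi >= 800 else 0.0) + random.gauss(0, 1) for xi in x]

print(round(rdd_cate(y, x, c=800, h=10), 1))  # close to 5
```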

2.2.4 Endogeneity

All three of the previous methods to compute the average treatment effect or the
conditional average treatment effect rely on some version of assumption 2 (A2),
which is correct only in the case of conditional or unconditional exogeneity of
the treatment. However, in most applications, even if selection into the treatment
is perfectly random, individual compliance with the assigned treatment is not
guaranteed. For instance, in a training program for unemployed individuals, some
people could be randomly selected to participate but decide not to. To account
for that, we need a model that allows for endogenous selection.

Model

This model relies on two stages:

(2nd stage): Y = Y0 + ∆ · D
(1st stage): D = I{ψ(Z,V) > 0}

where ∆ ≡ Y1 − Y0 as in previous models, Z are instruments, V are first-stage
unobservables, and ψ(·) is a function that maps the space defined by (Z, V) to the
decision space (a positive number means the treatment is accepted, a negative
one that it is refused).

The first-stage equation describes the choice of participation in the treatment:
given some exogenous stimulus Z and unobservables V (that the individual ob-
serves, but not the econometrician), if ψ(Z,V) > 0, then the individual participates
in the program, else, he does not.

In the second stage, as in the previous sections, an outcome is realized based on
the individual’s decision D: if D = 1, then Y = Y1; otherwise Y = Y0. Recall that
Y0 and Y1 are also functions of observables X and unobservables A, as in the
previous sections. Moreover, the unobservables V and A might be correlated in
some way, for example if individuals have private information (inside V) about
the potential success of the program (within A).

The instrument Z can have one or more dimensions, but a major question in this
literature is whether Z should include at least one discrete or at least one
continuous variable. In Angrist and Imbens’ view, the most convincing instrument
is a single binary IV. In Heckman’s view, a continuous IV does the job well
enough.

Binary IV

The application of binary IVs comes with four definitions that should be
understood perfectly before going on. The graph below, as well as the definitions,
should provide enough information.
Definition 2.1 (Classification of individuals). There are four classes of individuals
in a given program evaluation framework. This classification relies on the individuals’
participation behavior (D), based on the binary instrument (Z).

• If an individual will participate in the program regardless of Z, then he is an
“always-taker”.
• If an individual will not participate in the program regardless of Z, then he is
a “never-taker”.
• If an individual will participate when Z = 0 but refuses to participate when
Z = 1, then he is a “defier”.
• If an individual will not participate when Z = 0 but accepts to participate when
Z = 1, then he is a “complier”.

Now, define D0 = I{ψ(0, V) > 0} and D1 = I{ψ(1, V) > 0}. We can write the
first-stage equation as:

D = (1 − Z)·D0 + Z·D1 = D0 + (D1 − D0)·Z

and thus the second-stage equation as:

Y = Y0 + D0·∆ + (D1 − D0)·∆·Z

Assume the binary instrumental variable Z is jointly independent of participation
and outcome (A2′′′), or formally, Z ⊥ (Y1, Y0, D1, D0). Then,

E[Y|Z = 1] = E[Y0 + D0·∆ + (D1 − D0)·∆·Z|Z = 1]
           = E[Y0 + D1·∆|Z = 1]
           = E[Y0 + D1·∆]          (by A2′′′)
           = E[Y0] + E[D1·∆]

and also,

E[Y|Z = 0] = E[Y0 + D0·∆ + (D1 − D0)·∆·Z|Z = 0]
           = E[Y0 + D0·∆|Z = 0]
           = E[Y0 + D0·∆]          (by A2′′′)
           = E[Y0] + E[D0·∆]

which implies that:

E [Y |Z = 1] − E [Y |Z = 0] = E [(D1 − D0 ) · ∆]

This last term can be simplified with assumptions about the presence of certain
types of individuals in the sample. Consider dividing the last term across the
groups defined above:

E[(D1 − D0)·∆] = 1 · E[∆|(D1 − D0) = 1] · Pr[(D1 − D0) = 1]      (compliers)
               − 1 · E[∆|(D1 − D0) = −1] · Pr[(D1 − D0) = −1]    (defiers)
               + 0 · E[∆|(D1 − D0) = 0] · Pr[(D1 − D0) = 0]      (others)

and assume that there are no defiers in the sample; formally, that
Pr[D1 − D0 = −1] = 0. Then, we get:

E[(D1 − D0)·∆] = E[∆|(D1 − D0) = 1] · Pr[(D1 − D0) = 1]

⇔ E[(D1 − D0)·∆] / Pr[(D1 − D0) = 1] = E[∆|(D1 − D0) = 1]
where the last term is the average treatment effect conditional on being a complier.
In order to compute it, we need to know the probability of being a complier. Using
the A2′′′ assumption, one can show that:

Pr [(D1 − D0 ) = 1] = E [D|Z = 1] − E [D|Z = 0]

Finally, using the implication above, we have:

LATE ≡ E[∆|(D1 − D0) = 1] = (E[Y|Z = 1] − E[Y|Z = 0]) / (E[D|Z = 1] − E[D|Z = 0])

which is called the Local Average Treatment Effect, but actually only means the
ATE for compliers.

This estimator has been heavily criticized because it depends on the instrument
chosen: the subpopulation of interest (the compliers) can change if Z is different.
For example, consider the unemployment training program, where the instrument
is a $500 coupon sent to selected individuals. With a higher coupon value, say
$1000, the set of compliers would surely change, making the estimator very
different.
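The LATE formula maps directly to a sample-analog (Wald) estimator. A Python sketch (illustrative only; the population shares of types, the homogeneous effect of 3, and the randomized binary instrument are assumptions of the example):

```python
import random

def late(y, d, z):
    """Wald estimator:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    def mean(v, group):
        sel = [vi for vi, zi in zip(v, z) if zi == group]
        return sum(sel) / len(sel)
    return (mean(y, 1) - mean(y, 0)) / (mean(d, 1) - mean(d, 0))

# Synthetic population: 20% always-takers, 20% never-takers, 60%
# compliers, no defiers; treatment effect of 3 for everyone.
random.seed(3)
n = 50_000
y, d, z = [], [], []
for _ in range(n):
    zi = 1 if random.random() < 0.5 else 0
    u = random.random()
    if u < 0.2:        # always-taker
        di = 1
    elif u < 0.4:      # never-taker
        di = 0
    else:              # complier
        di = zi
    y.append(3.0 * di + random.gauss(0, 1))
    d.append(di)
    z.append(zi)

print(round(late(y, d, z), 1))  # close to 3
```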

Continuous IV

Now, assume that the instrument is continuous. We have the following two-stage
model:
(2nd stage): Y = Y0 + ∆ · D
(1st stage): D = I{p(Z) > V }
where p(·) is the propensity score as used in the previous sections. First, note
that in this context, the threshold structure in the first stage of this model is the
analog of the no-defiers condition from the binary-IV case. Second, one can
assume without loss of generality that V ~ U[0, 1].

As in the previous subsection, start by looking at:

E[Y|Z = z] = E[Y0 + ∆·D|Z = z]
           = E[Y0] + E[∆·D|Z = z]
           = E[Y0] + E[E[∆|Z, V]·D|Z = z]
           = E[Y0] + E[E[∆|V]·I{p(z) > V}]   (by A2′′′)
           = E[Y0] + ∫₀^{p(z)} E[∆|V = v] dv

Then, by Leibniz’s rule:

∂z E[Y|Z = z] = E[∆|V = p(z)] · ∂z p(z)
⇔ ∂z E[Y|Z = z] / ∂z p(z) = E[∆|V = p(z)]

which is the analog of the LATE result in Angrist and Imbens’ work. In words,
the right-hand side is the marginal treatment effect for the subpopulation that is
indifferent between participating in the program or not at a given z. The
left-hand side is the instrumental-variables estimand at the point z. Writing p in
place of p(z), we get:

E[∆|V = p] = ∂p E[Y|P = p]
which we can use to get the global average treatment effect, as the integral of the
marginal treatment effect over all levels of p:

ATE = ∫₀¹ E[∆|V = p] dp = ∫₀¹ ∂p E[Y|P = p] dp = E[Y|P = 1] − E[Y|P = 0]

This strategy has also been heavily criticized, this time because it should be
impossible to observe propensity scores of exactly 1 and 0. Indeed, if one uses a
parametric model to estimate p(·), identification only comes from Z = ±∞. We
call this issue identification at infinity.

2.3 Distributional effects

As we have seen, the main underlying point of interest in the previous sections
was the estimation of an “average” treatment effect. While this is important, one
could also be interested in studying the impact of the treatment across the whole
distribution of outcomes, not only its mean. For example, whether a treatment
increases or decreases inequality can only be answered by looking at the entire
distribution. There are multiple ways to study these questions; three of them are
covered here: experimental settings (where everyone has to participate), quantile
regressions, and IV quantile regressions.

2.3.1 Distributional tests

Consider an experiment with perfect compliance, and assume that (Y0, Y1) ⊥ D
(assumption A2); we can then evaluate the distributional effects by comparing
the distribution of outcomes between the two groups (treatment and control). In
particular, we have:

FY1(y) = Pr[Y1 ≤ y] = Pr[Y1 ≤ y|D = 1] = Pr[Y ≤ y|D = 1] = FY(y|D = 1)

which is the cdf of outcomes considering only the treated individuals. In a similar
way, we can get that FY0(y) = FY(y|D = 0), the cdf of outcomes in the control
group. These two elements can be estimated using their frequentist analogs:

F̂Y1(y) = (1/n1) Σ_{i: Di=1} I{Yi ≤ y}   and   F̂Y0(y) = (1/n0) Σ_{i: Di=0} I{Yi ≤ y}
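These frequentist analogs are just empirical cdfs, which can be sketched in a couple of lines of Python (illustrative; the toy sample is an assumption of the example):

```python
import bisect

def ecdf(sample):
    """Return the empirical cdf  y -> (1/n) * #{i : Y_i <= y}."""
    s = sorted(sample)
    n = len(s)
    # bisect_right counts how many sorted observations are <= y
    return lambda y: bisect.bisect_right(s, y) / n

F1 = ecdf([3, 1, 2, 5])
print(F1(2))   # 0.5 -- two of the four observations are <= 2
print(F1(0))   # 0.0
print(F1(10))  # 1.0
```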

But how can we compare these two functions?

Comparing distributions

In order to compare both estimated distributions, we can proceed in two main
ways: first, we could look at the social welfare derived from each; second, we
could look at their stochastic order.

The social welfare analysis relies on computing (estimating) the value of social
utility, based on the utilitarian social welfare function defined as:

W(u, F) = ∫ u(y) f(y) dy

where u(·) is the utility derived from the outcome y. Once this object is estimated,
we want to check whether W(u, FY1) ≥ W(u, FY0). However, the utility function
is not known to the researcher, or even to the individuals themselves. We typically
assume that u′ ≥ 0 (non-decreasing utility) and u′′ ≤ 0 (concave utility), but
fundamentally, this problem cannot be solved.

An alternative way would be to rely on mathematical properties of distribution
rankings. In particular, we are interested in identifying the relation between the
two distributions among three types of rankings: equality (EQ), first-order
stochastic dominance (FOSD), and second-order stochastic dominance (SOSD).

Definition 2.2 (Distribution rankings). Let FY1 and FY0 be the cdfs of the treatment
and the control groups respectively.

• (EQ): The distributions are equal if and only if FY1(y) = FY0(y) for all y.
⇒ This property implies that for any utility function u(·),
W(u, FY1) = W(u, FY0).

• (FOSD): A distribution (FY1) first-order stochastically dominates another (FY0)
if and only if FY1(y) ≤ FY0(y) for all y.
⇒ This property implies that for any non-decreasing utility function u(·),
W(u, FY1) ≥ W(u, FY0).

• (SOSD): A distribution (FY1) second-order stochastically dominates another
(FY0) if and only if ∫_{−∞}^y FY1(x) dx ≤ ∫_{−∞}^y FY0(x) dx for all y.
⇒ This property implies that for any non-decreasing and concave utility
function u(·), W(u, FY1) ≥ W(u, FY0).

Kolmogorov-Smirnov Test

We’ve seen in the subsection just above several ways to compare two outcome
distributions theoretically. However, because one never actually observes these
distributions, we need a way to use the data to say something about the rankings
between them. One way to empirically test whether distributions are EQ, FOSD
or SOSD is the so-called Kolmogorov-Smirnov test. The test comes in three
versions, each one testing for a particular relation.
Definition 2.3 (Kolmogorov-Smirnov Test). Let Y1 and Y0 be the outcomes of a
randomized experiment (i = 1 defines the treatment group while i = 0 is the control
group).

If the null hypothesis is H0 : FY1 = FY0 (EQ hypothesis), then:

T_eq = ((N1 · N0)/N)^{1/2} · sup_y |F̂Y1(y) − F̂Y0(y)|

If the null hypothesis is H0 : FY1 FOSD FY0 (FOSD hypothesis), then:

T_fosd = ((N1 · N0)/N)^{1/2} · sup_y (F̂Y1(y) − F̂Y0(y))

If the null hypothesis is H0 : FY1 SOSD FY0 (SOSD hypothesis), then:

T_sosd = ((N1 · N0)/N)^{1/2} · sup_y ∫_{−∞}^y (F̂Y1(x) − F̂Y0(x)) dx

In words, each test looks for the value of y at which the difference (in the measure
relevant to the null) between the two distributions is greatest. If the variable Y is
continuous, then the distribution of Th (where h denotes the null hypothesis) is
known. However, if Y is not a continuous variable (for instance, if there is
positive probability mass at Y = 0), then one should use a bootstrap technique to
test the hypothesis.
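For the EQ and FOSD versions, the supremum of the ecdf difference is attained at one of the pooled sample points, so the statistics can be computed exactly by a finite search. A Python sketch (illustrative; the toy samples are assumptions of the example):

```python
import bisect

def ks_stats(y1, y0):
    """Compute T_eq and T_fosd with the ((N1*N0)/N)^(1/2) scaling;
    the sup over y is taken over the pooled sample points."""
    s1, s0 = sorted(y1), sorted(y0)
    n1, n0 = len(s1), len(s0)
    scale = (n1 * n0 / (n1 + n0)) ** 0.5
    diffs = [bisect.bisect_right(s1, y) / n1 - bisect.bisect_right(s0, y) / n0
             for y in s1 + s0]
    t_eq = scale * max(abs(dv) for dv in diffs)
    t_fosd = scale * max(diffs)
    return t_eq, t_fosd

# identical samples: both statistics are zero
print(ks_stats([1, 2, 3], [1, 2, 3]))  # (0.0, 0.0)
# y1 everywhere above y0: F_Y1 lies below F_Y0, so no FOSD violation
print(ks_stats([10, 11], [1, 2]))      # (1.0, 0.0)
```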

The bootstrap method is quite simple to grasp: by computing the test statistic on
resampled sets of observations a large enough number of times, one can estimate
the actual distribution of the test statistic under the null, and determine whether
the full-sample test statistic is statistically different. The method follows four
steps:

1. Compute Th in the original sample.

2. Imposing the null hypothesis, resample your data and compute the new
statistic Th^b. For example, if the null is the EQ hypothesis, treat both
groups as draws from the same pooled distribution.

3. Repeat step 2 B times.

4. Compute the p-value of the test as:

p = (1/B) Σ_{b=1}^B I{Th^b > Th}
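The four steps above can be sketched in Python for the EQ null (illustrative; pooled resampling implements the “treat both groups as the same” step, and the toy statistic and samples are assumptions of the example):

```python
import random

def bootstrap_pvalue(y1, y0, stat, B=500, seed=0):
    """Bootstrap p-value under the EQ null: labels are exchangeable, so
    both groups are resampled from the pooled data, and we count how
    often the resampled statistic exceeds the observed one."""
    rng = random.Random(seed)
    t_obs = stat(y1, y0)
    pooled = list(y1) + list(y0)
    hits = 0
    for _ in range(B):
        b1 = [rng.choice(pooled) for _ in range(len(y1))]
        b0 = [rng.choice(pooled) for _ in range(len(y0))]
        if stat(b1, b0) > t_obs:
            hits += 1
    return hits / B

# toy statistic: absolute difference in sample means
diff_means = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))

# two clearly different samples: the null of equal distributions is rejected
p = bootstrap_pvalue(list(range(100)), list(range(1000, 1100)), diff_means)
print(p)
```

In the notes’ setting, `stat` would be the relevant Kolmogorov-Smirnov statistic rather than this toy difference in means.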

2.3.2 Quantile regression

Recall that our main objective in this section is to depart from looking only at the
average effect of a program. In the previous subsection, we looked at how one
can study the entire distribution of outcomes. In this subsection and the following,
we will be interested in studying the effect on a given quantile of the distribution.
This analysis can be useful if a program is targeted towards a particular
subpopulation defined by a quantile, say the 25% poorest unemployed
individuals.

In order to do that, we need to assume that for any quantile θ, the θ-quantile of
the distribution of outcomes Y is linear in D (participation in the treatment) and
X (observable covariates), such that:

Qθ(Y|D, X) = α0·D + X·β0

Moreover, we will assume some version of our beloved assumption A2: either
the treatment is perfectly randomized, or it is randomized based on observables
X. In that case, (α0, β0) satisfy:

(α0, β0) = arg min_{α,β} E[ρθ(Y − αD − Xβ)]

where ρθ(x) ≡ x · (θ − I{x < 0}) acts as a weighting function. To see this,
consider the quantile θ = 0.25; then

ρ0.25(Y − αD − Xβ) = (Y − αD − Xβ) · (0.25 − I{(Y − αD − Xβ) < 0})

                   = −0.75 · (Y − αD − Xβ)   if Y − αD − Xβ < 0
                   = 0.25 · (Y − αD − Xβ)    if Y − αD − Xβ ≥ 0

meaning that negative residuals (observations below the fit, where
Y − αD − Xβ < 0) enter with weight 0.75, while positive residuals enter with
weight 0.25. Because observations below the fit are penalized three times more
heavily, the minimizer settles where only a quarter of the observations lie below
it; in this way, (α̂0, β̂0) yield the θ = 0.25 quantile parameters.

The following graph shows how the weighting function works for different quan-
tiles:

Now, in order to recover the estimated parameters (α̂0, β̂0), we minimize the
sample analog of the expectation:

(α̂0, β̂0) = arg min_{α,β} (1/N) Σ_{i=1}^N ρθ(Yi − αDi − Xiβ)

This estimator is called the Koenker-Bassett quantile regression estimator.
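As an illustration, the intercept-only special case of this estimator (no D, no X) reduces to minimizing the check function over a single location parameter, which recovers the sample θ-quantile. A Python sketch (not from the notes; the toy sample is an assumption of the example):

```python
def rho(theta, u):
    """Check function: rho_theta(u) = u * (theta - 1{u < 0})."""
    return u * (theta - (1.0 if u < 0 else 0.0))

def quantile_fit(y, theta):
    """Minimize sum_i rho_theta(Y_i - q) over q. The minimum is attained
    at a sample point, so searching the observed values is enough; this
    is the intercept-only special case of the Koenker-Bassett estimator."""
    return min(y, key=lambda q: sum(rho(theta, yi - q) for yi in y))

y = list(range(101))              # 0, 1, ..., 100
print(quantile_fit(y, 0.25))      # 25 -- the first quartile
print(quantile_fit(y, 0.5))       # 50 -- the median
```

With regressors, the same objective is minimized over (α, β) by linear programming rather than by this brute-force search.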

2.3.3 Including Instruments

Distributional tests

Quantile regression

2.4 IV models with covariates

2.5 Differences-in-Differences (DiD)

Often, there are reasons to believe that the treatment and control groups differ
in characteristics that are unobserved, and that could be correlated with the
outcomes, even after controlling for observed characteristics. In this particular
case, a direct comparison between groups is not possible.

For example, consider the following question: how do inflows of immigrants affect
the wages and the employment level of natives in local labor markets? To study
this question, Card (1990) uses the Mariel Boatlift of 1980 (a massive immigration
of Cubans through the Mariel harbor) as a natural experiment to measure the
effect on unemployment of a sudden influx of immigrants. In order to measure this
effect, he looked at unemployment data for Miami and four other cities (Atlanta,
Los Angeles, Houston and Tampa). As stressed above, these cities clearly differ
in characteristics that cannot be fully observed, and these differences determine,
at least partially, the dynamics of their labor forces.

2.5.1 Setup

As in the previous sections, we divide individuals into two groups: the treatment
group (D = 1) and the control group (D = 0). In addition, we distinguish the two
periods pre- and post-treatment. The pre-treatment period is indexed by T = 0,
while the post-treatment period is indexed by T = 1. Thus, our outcome variable
Y is now a function of the time period: Ydi(t) is the outcome of individual i at
time t after receiving d (treatment or control).

Again, following the same intuition as in the previous sections, we are interested
in measuring the effect of the treatment. In particular, consider the average
treatment effect on the treated, that is, the difference between the treated and
untreated outcomes at time 1 for a treated individual i: E[Y1i(1) − Y0i(1)|D = 1].
However, as we know, only one outcome can be observed, since an individual
cannot be subject to both the treatment and the control at the same time. In fact,
we can only observe the following variables:

Post-treatment (T = 1) Pre-treatment (T = 0)
Treatment (D = 1) Y1i (1) for all i : Di = 1 Y0i (0) for all i : Di = 1
Control (D = 0) Y0i (1) for all i : Di = 0 Y0i (0) for all i : Di = 0

This means that we are missing the potential outcome Y0i(1) for the treated, i.e.
what the outcome would have been had they not been treated. From there, we
can use three strategies.

Before = after identification

First, we could assume that in expectations, for the treated (D = 1), the outcome
of not being treated at time 1 is the same as the outcome of not being treated at
time 0, that is, we are assuming that without treatment, the average outcome does
not change with time. More formally, this assumption states: E [Y0 (1)|D = 1] =
E [Y0 (0)|D = 1]. Using that fact, we can recover the ATT as defined above:
E[Y1i(1) − Y0i(1)|D = 1] = E[Y1i(1)|D = 1] − E[Y0i(1)|D = 1]
                         = E[Y1i(1)|D = 1] − E[Y0i(0)|D = 1]   by assumption.
Obviously, one should first consider if this assumption would make sense in the
setting studied.

Treated = control identification

Second, we could assume that in expectations, for the treated, the outcome of not
having been treated at time 1 is the same as the outcome of not having been treated
at time 1 for the control. That is, we are assuming that without the treatment
being administered at time 1, both groups would be the same. Or more formally,
E [Y0 (1)|D = 1] = E [Y0 (1)|D = 0]. Again, using that fact, we can recover the ATT
as defined above:

E[Y1i(1) − Y0i(1)|D = 1] = E[Y1i(1)|D = 1] − E[Y0i(1)|D = 1]
                         = E[Y1i(1)|D = 1] − E[Y0i(1)|D = 0]   by assumption.

Obviously, one should first consider if this assumption would make sense in the
setting studied.

DiD identification

Finally, one could assume that, in expectation, the evolution of the untreated
outcome over time is the same for the treated and control groups. That is, had the
treated not received the treatment, their outcome would have evolved like that of
the control group. Formally, we are assuming that: E[Y0(1) − Y0(0)|D = 1] =
E[Y0(1) − Y0(0)|D = 0]. And using this assumption, we can recover the ATT as
defined above:

E[Y1i(1) − Y0i(1)|D = 1]
  = E[Y1i(1)|D = 1] − E[Y0i(1)|D = 1]
  = E[Y1i(1)|D = 1] − E[Y0i(0)|D = 1] − E[Y0i(1) − Y0i(0)|D = 1]
  = E[Y1i(1)|D = 1] − E[Y0i(0)|D = 1] − E[Y0i(1) − Y0i(0)|D = 0]   by assumption
  = {E[Y1i(1)|D = 1] − E[Y0i(0)|D = 1]} − {E[Y0i(1)|D = 0] − E[Y0i(0)|D = 0]}

where the first term in braces is the before/after difference for the treated, the
second is the before/after difference for the control, and their difference is the
difference-in-differences.
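The ATT above is a difference of two before/after differences, which maps directly to a simple sample-means estimator. A Python sketch (illustrative; the four-observation toy dataset is an assumption of the example):

```python
def did_att(y, d, t):
    """ATT_hat = [mean(Y | D=1, T=1) - mean(Y | D=1, T=0)]
               - [mean(Y | D=0, T=1) - mean(Y | D=0, T=0)]."""
    def cell_mean(dv, tv):
        sel = [yi for yi, di, ti in zip(y, d, t) if di == dv and ti == tv]
        return sum(sel) / len(sel)
    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# one observation per cell: treated go 10 -> 15, controls go 8 -> 11
y = [10, 15, 8, 11]
d = [1, 1, 0, 0]
t = [0, 1, 0, 1]
print(did_att(y, d, t))  # 2.0, i.e. (15 - 10) - (11 - 8)
```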

2.5.2 Estimation by sample means

Panel data

Repeated cross-sections

2.5.3 Estimation by regression

Panel data

Repeated cross-sections

2.5.4 Threats to validity

Compositional differences

In repeated cross-sections, we might worry that the composition of the population
under study changes across cross-sections. In order to test for that, we can
compare the distribution of (D, X) across all samples and check that they are
similar.

Non-parallel dynamics

For any type of data, we might also worry that the main assumption behind the
DiD estimator fails. In particular, if the dynamics of the outcome depend on
unobservables, the assumption fails. In order to test for that, we can run a
falsification test by applying the DiD analysis to the period before the treatment
(period −1 to 0). If the assumption holds, this placebo DiD estimator should not
be statistically different from 0.

Chapter 3

Qualitative Dependent Variables

3.1 Motivation

In some applications, the policy-relevant question relies on the analysis of a
binomial or multinomial variable. In that case, a linear regression can muddle the
results by not accounting for the admissible values of the dependent variable. For
example, if Y can only take the values 1 and 0, linear-regression fitted values
such as −0.3 or 1.5 would make no sense.

As a more general example, consider the question of female participation in the


labor force. If a woman participates, then the variable of interest Y = 1, if she
does not, then Y = 0. Assume that we believe that this participation depends
on other variables such as the number of kids below the age of 6, the age, the
level of education and the non-wife income of the household. Then, we want to
estimate E [Y |X], that is the expected value of Y given X where X contains all the
variables discussed above. Since the variable Y is discrete, we can decompose this
expectation into probabilities:

E [Y |X] = 1 · Pr [Y = 1|X] + 0 · Pr [Y = 0|X]

⇔ E [Y |X] = Pr [Y = 1|X]
which is a probability regression on X.

The linear probability model (using OLS) would be to estimate the probability as
a linear function of variables:

Pr [Y = 1|X] = X β

However, this model can give results inconsistent with the qualitative
interpretation of Y: the fitted probability can be negative or above one. One
solution to that problem is to assume that Y is generated by the following
latent-index specification:

Y = I{Xβ + U > 0}
Therefore we have that:

E[Y|X] = Pr[Y = 1|X] = Pr[Xβ + U > 0] = Pr[U > −Xβ] = 1 − Pr[U ≤ −Xβ]

We can then estimate this probability using the properties of the distribution of U.
Given that we are now estimating a probability with a probability, we escape the
issue of finding results lower than 0 or greater than 1. In the rest of the section,
we explore three possible ways to assume or estimate the distribution of U.

3.1.1 Probit

The probit model is the binary choice estimator assuming that U is normally
distributed with mean zero; since the scale of U is not identified in a binary
choice model, we normalize its variance to one. Define Φ(z) as the cdf of the
standard normal distribution:

Φ(z) = (1/√(2π)) ∫_{−∞}^z exp(−t²/2) dt

Since U is symmetric around zero, Pr[U > −Xβ] = Φ(Xβ). Therefore the
probability of Y = 1 becomes Pr[Y = 1|X] = Φ(Xβ) and, intuitively, the
probability that Y = 0 is 1 − Φ(Xβ).

3.1.2 Logit

In the same way, instead of using a normal distribution, one could use the logistic
distribution. Both distributions are quite similar: the logistic distribution has
slightly fatter tails, but its cdf also has a closed form. Denote by Λ(z) the
standard logistic cdf:

Λ(z) = exp(z) / (1 + exp(z))

By symmetry of the logistic distribution, the probability model yields
Pr[Y = 1|X] = Λ(Xβ) and Pr[Y = 0|X] = 1 − Λ(Xβ).

3.1.3 Nonparametric regression

Finally, instead of making assumptions on the distribution of U, one could use
nonparametric regression, which applies to binary dependent variables as well.
This approach can be interesting in settings where heterogeneity or endogeneity
plays a key role in the underlying model. In fact, by assuming a parametric form
for the error term U, one rules out the interpretation that the model is only a
reduced-form version of a heterogeneous population, or that one or more
variables are endogenous with respect to U. This chapter will not focus on this
method.

3.2 Estimation

3.2.1 Maximum Likelihood

Suppose that (Z1, ..., Zn) is a random sample drawn iid from a distribution with
density f(Z, β), where β is not observed. Since the data is iid, we can compute
the joint density of the sample as a product of individual densities. Formally,

f(Z1, ..., Zn; β) = ∏_{i=1}^n f(Zi, β)

This object is referred to as the likelihood of the sample: it is the ex-ante
probability of observing this exact draw from the joint distribution.

However, as stated above, since β is unknown, the likelihood of the sample cannot
be computed. For a given β = b, we could still compute a value of the likelihood.
Therefore, we could find the value of b that maximizes the likelihood of the sample,
or in other words, the value of b that would make this exact sample draw the
most likely ex-ante. This is the maximum likelihood estimator, β̂M L , defined as:
β̂_ML = arg max_b L(b) ≡ ∏_{i=1}^{n} f(Z_i, b)

Since the likelihood is non-negative and the logarithm is strictly increasing, this problem is equivalent to maximizing the log of the likelihood, conveniently called the log-likelihood, ℓ(b) = ln(L(b)). Therefore, we have:

β̂_ML = arg max_b ℓ(b) ≡ ∑_{i=1}^{n} ln f(Z_i, b)


It can then be proven that the maximum likelihood estimator is √n-CAN, meaning it converges in distribution to a normal distribution at a √n rate.
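The mechanics above can be sketched numerically. The following is a minimal illustration (not from the notes): it simulates a probit model under the convention Pr[Y = 1|X] = 1 − Φ(β₀ + β₁X) used in Section 3.1.1, then recovers β by maximizing the log-likelihood; the sample size, seed, and true values are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
true_b = np.array([0.5, -1.0])

# Convention from the text: Pr[Y = 1 | X] = 1 - Phi(b0 + b1*X)
p1 = 1.0 - norm.cdf(true_b[0] + true_b[1] * x)
y = (rng.uniform(size=n) < p1).astype(float)

def neg_loglik(b):
    """Negative log-likelihood: minus the sum of ln f(Z_i, b) over the sample."""
    p = np.clip(1.0 - norm.cdf(b[0] + b[1] * x), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Maximize the log-likelihood (i.e. minimize its negative)
b_ml = minimize(neg_loglik, x0=np.zeros(2)).x
```

With n = 5000 the estimates land close to the true (0.5, −1.0), illustrating the √n-consistency claim.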

3.2.2 Interpretation

Consider a simple probit model such that P ≡ Pr [Y = 1|X] = Φ(β0 + β1 X).

If X is a continuous variable, then we have that:


∂P/∂X = β₁ · φ(β₀ + β₁X)
where φ is the pdf associated with Φ, the standard normal cdf. Since a density is always positive, the derivative always has the same sign as β₁; however, they do not have the same value. More generally, the marginal effect of a regressor X_j is φ(x′β) · β_j.

If X is a discrete variable, the situation is slightly different as we cannot take


the derivative of the probability with respect to X. Nonetheless, we are also not
interested in infinitesimal changes in X, but rather discrete jumps. Therefore, we
want to know:
∆P(X) ≡ Φ(β0 + β1 ) − Φ(β0 )

for a change of X from 0 to 1. Again, the sign will be the same as that of β₁ since a cdf is monotone increasing, but the value will be different. More generally, the effect of a discrete regressor X_j moving from 0 to 1 is Φ(x′β + β_j) − Φ(x′β).

The interpretation of both formulas is the same as in the models we know; however, in these cases, the value of these effects depends on the value of the non-varying inputs (x′ in the general formulas above). In practice, these are evaluated at the average value of all other inputs.
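As an illustration, both formulas can be evaluated directly; the coefficient values and evaluation point below are hypothetical, with one continuous regressor X1 and one 0/1 dummy D.

```python
from scipy.stats import norm

# Hypothetical estimated coefficients for P = Phi(b0 + b1*X1 + b2*D)
b0, b1, b2 = 0.2, 0.8, -0.5
x1_bar, d_bar = 1.0, 0.4   # hypothetical sample averages of the other inputs

# Continuous regressor X1: marginal effect = phi(x'b) * b1, at the averages
index = b0 + b1 * x1_bar + b2 * d_bar
me_continuous = norm.pdf(index) * b1

# Discrete regressor D jumping from 0 to 1: Phi(x'b + b2) - Phi(x'b)
base = b0 + b1 * x1_bar
me_discrete = norm.cdf(base + b2) - norm.cdf(base)
```

Both effects carry the sign of their coefficient (here positive for X1, negative for D), but not its value, exactly as the formulas predict.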

3.2.3 Testing

Individual parameter testing

The z-statistic and its associated p-value displayed in any program like Stata have
the exact same interpretations as their analogs in the OLS case. A coefficient is significant if its p-value is lower than 0.05, i.e. its z-stat is higher than 1.96 in absolute value.

Multiple parameter testing

In order to test for the significance of multiple parameters, we use the likelihood
ratio test. The null hypothesis H0 is that r coefficients are equal to 0. We want
to test it against H1 , that at least one is non-zero. In order to test this, we define
a “restricted” model, where the r coefficients are set to 0 (hence the restriction),
while the unrestricted model includes these parameters freely. We denote the
two sets of parameters as β̂R and β̂U respectively. Finally, we denote as L(·) the
likelihood function, yielding the value of the likelihood of the model, for a given
set of parameters.

Then, the likelihood ratio test is based on comparing both likelihoods using the statistic R = L(β̂_U)/L(β̂_R), or equivalently, its log:

LR ≡ 2 · [ln L(β̂_U) − ln L(β̂_R)] ∼ χ²_r

We reject the null if LR > q₀.₉₅(χ²_r), implying that the likelihood of the unrestricted model is so much higher than that of the restricted one that the restrictions must not hold.
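A minimal sketch of the mechanics, with hypothetical maximized log-likelihood values:

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods of the two models
ll_unrestricted = -1180.4
ll_restricted = -1187.9
r = 2   # number of restrictions under H0

LR = 2.0 * (ll_unrestricted - ll_restricted)   # ~ chi2(r) under H0
critical = chi2.ppf(0.95, df=r)                # q_0.95 of chi2 with r dof
reject = LR > critical
```

Here LR = 15, well above the 5% critical value of roughly 5.99, so H0 would be rejected.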

3.3 Ordered Dependent Variable

3.3.1 Ordered Probit

Sometimes, the dependent variable is not only categorical but also ordered, mean-
ing that it follows a progression in intensity. Examples include the number of
cars in a household, the highest educational degree attained, votes, indices of
democracy, etc.

In that case, we will need to partition the estimated function such that each region corresponds to a particular value of Y. For example, suppose the number of cars in a household can be 0, 1 or 2. We need to estimate both a function and cutoffs so that:
Y = 0 if Y* ≤ c₁
Y = 1 if c₁ < Y* ≤ c₂
Y = 2 if c₂ < Y*

where Y is the observed decision of the household.

Suppose that Y* = Xβ + ε, where ε ∼ N(0, 1). Then, we can compute the following probabilities:

• Pr[Y = 0|X] = Pr[Y* ≤ c₁|X] = Pr[Xβ + ε ≤ c₁|X] = Pr[ε ≤ c₁ − Xβ|X] = Φ(c₁ − Xβ)

• Pr[Y = 1|X] = Pr[c₁ < Y* ≤ c₂|X] = Pr[c₁ − Xβ < ε ≤ c₂ − Xβ|X] = Φ(c₂ − Xβ) − Φ(c₁ − Xβ)

• Pr[Y = 2|X] = Pr[c₂ < Y*|X] = Pr[ε > c₂ − Xβ|X] = 1 − Φ(c₂ − Xβ)
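These three expressions can be computed directly for any number of categories; the index value and cutoffs below are arbitrary:

```python
from scipy.stats import norm

def ordered_probit_probs(xb, cutoffs):
    """Category probabilities for Y* = Xb + eps, eps ~ N(0,1), given cutoffs c1 < c2 < ..."""
    probs = [norm.cdf(cutoffs[0] - xb)]                       # Pr[Y = 0]
    for c_lo, c_hi in zip(cutoffs[:-1], cutoffs[1:]):         # middle categories
        probs.append(norm.cdf(c_hi - xb) - norm.cdf(c_lo - xb))
    probs.append(1.0 - norm.cdf(cutoffs[-1] - xb))            # top category
    return probs

p = ordered_probit_probs(xb=0.3, cutoffs=[-0.5, 1.0])
```

By construction the probabilities are positive and sum to one, since the cutoffs partition the whole real line for Y*.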

3.3.2 Censored Regression (Tobit)

3.4 Nonparametric Estimation of RC models

3.4.1 Motivation

Consider the following random coefficients model where X is a scalar:

Y = A + XB

such that (A, B) ⊥ X. In this model, recall that A and B are random variables (not
fixed parameters as we usually see in econometrics). In that sense, their value
will be different for any individual observation. That is why we are not interested
in point estimates of their value, but rather in their joint distribution f AB (·).

3.4.2 Identification

The main question of this subsection is therefore, how can we identify the joint
density of A and B, given the known (observed) conditional density of Y given X?

Characteristic functions

Before going further, let’s remind ourselves the definition and properties of charac-
teristic functions. The characteristic function is an alternative way (from pdfs and
cdfs) to describe a random variable X. In the same way that FX (·) = E [I{X ≤ x}]
completely determines the behavior and properties of X, the characteristic func-
tion, denoted as φ X (t), defined as:

φ X (t) ≡ E [exp(it X)]



where i = √(−1), is also fully informative of the behavior of X. In fact, the two approaches are equivalent in the sense that the characteristic function is a Fourier transform of the pdf of X, and vice-versa (Fourier transforms are bijective transformations).

We can also define the characteristic function of joint densities. Consider the random vector X = [X₁ X₂]′ with joint density f_X(·). Then the joint characteristic function φ_X(t), where t is a vector of the same dimension as X, say t = [t₁ t₂]′, is defined as:

φ_X(t) = E[exp(i · t′X)] = E[exp(i · (t₁X₁ + t₂X₂))] = F(f_X(·))(t)

Now, consider the characteristic function of the conditional distribution of Y given X, whose density is denoted as usual f_{Y|X}(·; x); this object is called the conditional characteristic function (ccf). As explained above, we define the ccf as:

φ_{Y|X}(t; x) = E[exp(itY)|X = x] = F(f_{Y|X}(·; x))(t)

Identifying f AB (·)

Knowing the distribution of Y given X, we want to recover the joint distribution of (A, B). Starting from the ccf of Y|X, one can show that f_AB(·) is identified:

φ_{Y|X}(t; x) = E[exp(it(A + XB))|X = x] = E[exp(it(A + xB))|X = x]
 = E[exp(it(A + xB))]  (by (A, B) ⊥ X)
 = E[exp(itA + itxB)]
 = E[exp(itA + isB)]  (setting s = tx)
 = E[exp(i(tA + sB))] = φ_AB(t, s)

which is exactly the definition of the characteristic function of the random vector [A B]′, evaluated at (t, s).

This means that we are one inverse Fourier transform away from the joint density of A and B. More formally:

f_AB(a, b) = F⁻¹(φ_AB(·, ·))(a, b), where φ_AB(t, s) = φ_{Y|X}(t; x) for s = tx
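A quick Monte Carlo check of the identity φ_{Y|X}(t; x) = φ_AB(t, tx), assuming purely for illustration that A and B are independent normals (so φ_AB has a known closed form); all numbers below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
A = rng.normal(0.0, 1.0, size=n)     # random intercept
B = rng.normal(2.0, 0.5, size=n)     # random slope, (A, B) independent of X

x, t = 1.5, 0.7
s = t * x

# Empirical ccf of Y = A + xB at t (X held fixed at x)
ccf_Y = np.mean(np.exp(1j * t * (A + x * B)))

# Analytic joint cf of two independent normals, evaluated at (t, s = tx):
# phi_AB(t, s) = exp(i(mu_A t + mu_B s) - (sig_A^2 t^2 + sig_B^2 s^2)/2)
cf_AB = np.exp(1j * (0.0 * t + 2.0 * s) - (1.0 * t**2 + 0.25 * s**2) / 2.0)
```

The two complex numbers agree up to Monte Carlo error, which is the content of the identification argument: the observable ccf of Y|X traces out φ_AB along the line s = tx.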




3.4.3 Estimation

3.4.4 Application

Chapter 4

Panel Data

4.1 Multivariate Linear Model

4.1.1 Setup

Recall a simple multivariate linear model (the one we use all the time) such
that: Y = α + X β + U where U is an unobservable term. Under the Gauss-
Markov assumptions, we have seen that the OLS estimator is BLUE, and the
model is easily identified. In that model, we assumed that each observation in
the data corresponded to a different “individual”. Now suppose that we observe a
population repeatedly over time and that the same linear model applies but to
each period, separately. We end up with a system of equations, the panel, indexed
by t, such that:

Y₁ = α + X₁β + U₁
⋮
Y_T = α + X_Tβ + U_T

where we observe the joint distribution of (X1, ..., XT ,Y1, ...,YT ), for each individual.
Note that this system describes the general population model; in reality, there are
N such systems of equations for N individuals, yielding N × T single equations.
Usually, T is way smaller than N, so we can keep using asymptotics of N → ∞,
however, if T is comparable to N, we could choose either way of performing

asymptotics. Note also that the usual unobservable U is now indexed by t, meaning
it is a time-process, or innovation term that follows a stochastic process.

In this new setup, there are two ways to look at the parameters of interest (α, β):

• Time variation: if a parameter is the same in all periods for a given individ-
ual, we say that it is time-invariant.

• Individual variation: if a parameter changes across individuals, we say that


it is individual-specific.

In the following sections, we will focus on models where there exists one param-
eter that is both time-invariant and individual-specific. To set this up, assume a
panel model where the intercept is the parameter in question:

Yt = Ã + Xt β + Ut , for all t = 1, ...,T

Now, we separate this individual-specific term into a population average and an individual deviation. To do that, define α = E[Ã] as the average individual effect across the population, and A = Ã − α as the individual-specific deviation from the mean. We can rewrite the general panel model as:

Yt = α + A + Xt β + Ut , for all t = 1, ...,T

We now have both time-and-population fixed parameters in (α, β) and a time-


invariant, individual specific parameter in A. Recall that both α and A are un-
known to the econometrician. If one is interested in estimating those parameters,
assumptions about their interpretation will be necessary, which the two following
sections will present.

4.1.2 Random Effects approach

The Random Effects (RE) approach is the traditional way of dealing with the
parameter A in panel data. It relies on the interpretation that A, as the individual
deviation from the mean intercept α, is purely random, in the sense that it is
not correlated with observable features of the individuals contained in X. That’s
where the “random effects” name comes from: conditional on X, the term A
is a random effect on Y . Following this intuition, one could set both random

unobservables A and U into one “total error” denoted V. Then, if you recall
the first-year econometrics class, we could apply the Feasible GLS method of
estimation.

Recap of GLS

FGLS in the RE model

The RE model works in the same way, provided we can specify the variance matrix of the total unobservable term V_t ≡ A + U_t. For that, we need a few assumptions. Let the unindexed variables denote the vector of t-indexed variables and ι_T be a T-dimensional vector of ones. The system of equations for one individual can then be rewritten as:

Y = Xβ + V, where V = Aι_T + U

and X is a matrix of width k + 1 (meaning there are k regressors and a constant α). We make a few further assumptions:

• Strict exogeneity of Ut , or formally E [Ut |X, A] = 0. This means that not only
Ut is independent of Xt and A, but also of any past (or future) realizations
of X. This assumption is analog to U being purely random in the simple
OLS case.

• Strict exogeneity of A, also written as E [A|X] = 0. This assumption is the


one leading us to the RE model in the first place.
⇒ These assumptions yield that Var [V |X] = Var [V] and Cov (A, Ut ) = 0
for all t = 1, ...,T.

• “Well-behaved” variance of U_t, meaning it is not autocorrelated (Cov(U_t, U_s) = 0 for any t ≠ s) and homoskedastic (Var[U_t] = σ_U² for all t).

These assumptions are crucial to identify the sample counterpart of the Ω matrix in the GLS estimation. In fact, writing Var[V] = Var[Aι_T + U] and denoting Var[A] = σ_A², we get a T × T matrix whose diagonal entries are Var[A + U_t] = σ_A² + σ_U² and whose off-diagonal entries are Cov(A + U_t, A + U_s) = σ_A² for t ≠ s:

Var[V] = σ_A² ι_T ι_T′ + σ_U² I_T

As stipulated in the review of the GLS estimation, we now need a way to recover
an estimator for this variance matrix, which we could then plug in the FGLS
estimation procedure.

To do that, perform a simple OLS on the model to recover V̂, the estimated residuals. In the expression for the variance matrix, we have seen that all off-diagonal terms are equal to σ_A². Therefore, we can use that fact to recover a sample average of those terms, which will be our estimator σ̂_A². Formally, we have:

σ̂_A² = [1/(T(T−1)/2)] ∑_{t} ∑_{s>t} (1/n) ∑_{i=1}^{n} V̂_{it} V̂_{is}

which is the average, over all off-diagonal terms (there are T(T−1)/2 of them), of the average value of these terms across the n individuals. Now that we have σ̂_A², we can use it to recover σ̂_U², using the fact that the diagonal of Ω̂ is composed only of σ̂_A² + σ̂_U² terms. Then,

σ̂_U² = (1/T) ∑_{t=1}^{T} (1/n) ∑_{i=1}^{n} V̂_{it}² − σ̂_A²

which is the average over all T diagonal terms of the average value of these terms across the n individuals.

Finally, we can compute Ω̂ as σ̂_A² ι_T ι_T′ + σ̂_U² I_T, and plug it into the feasible GLS method. Then, the RE estimator is:

β̂_RE = [∑_{i=1}^{n} X_i′ Ω̂⁻¹ X_i]⁻¹ [∑_{i=1}^{n} X_i′ Ω̂⁻¹ Y_i]
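The whole procedure (pooled OLS residuals, variance components, FGLS) can be sketched as follows; the data-generating values and sample sizes are arbitrary, and the loop is a minimal illustration rather than an efficient implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 2000, 4
sigA, sigU, alpha, beta = 1.0, 0.5, 1.0, 2.0

A = rng.normal(0, sigA, size=(n, 1))            # random effect, independent of X
X = rng.normal(size=(n, T))
Y = alpha + beta * X + A + rng.normal(0, sigU, size=(n, T))

# Step 1: pooled OLS to recover the residuals V_it
Z = np.column_stack([np.ones(n * T), X.ravel()])
b_ols = np.linalg.lstsq(Z, Y.ravel(), rcond=None)[0]
V = (Y.ravel() - Z @ b_ols).reshape(n, T)

# Step 2: variance components from the off-diagonal and diagonal moments
pairs = [(t, s) for t in range(T) for s in range(t + 1, T)]
sA2 = np.mean([np.mean(V[:, t] * V[:, s]) for t, s in pairs])
sU2 = np.mean(V ** 2) - sA2

# Step 3: FGLS with Omega-hat = sA2 * ii' + sU2 * I
Omega_inv = np.linalg.inv(sA2 * np.ones((T, T)) + sU2 * np.eye(T))
XtOX = np.zeros((2, 2)); XtOY = np.zeros(2)
for i in range(n):
    Xi = np.column_stack([np.ones(T), X[i]])
    XtOX += Xi.T @ Omega_inv @ Xi
    XtOY += Xi.T @ Omega_inv @ Y[i]
b_re = np.linalg.solve(XtOX, XtOY)
```

The estimated components come out close to σ_A² = 1 and σ_U² = 0.25, and b_re recovers (α, β).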

In the end, the RE model will give you, in exchange for strong assumptions, a
way to estimate α, a time-invariant effect on Y . This can be very interesting when
studying settings such as the effect of gender or race on outcomes. As we will see in the next section, the fixed effects estimator will not allow for estimating such effects.

4.1.3 Fixed Effects approach

The preceding approach relies heavily on the assumption that the individual effect A is purely random (not correlated with X); however, this view of the constant individual effect is less popular nowadays. Another way to estimate the model is to use the time-invariance property of the unobservables to cancel them out of the model. For that, we can use two methods: first-differencing and the fixed-effects transformation. Note that in these models the effects are not less random than before; we are still talking about the exact same effects, the naming convention is simply misleading in that sense. So remember: fixed effects ARE random variables.

First-differences

Recall that if we have only two time periods, t = 1, 2, the model is:

Y₁ = α + X₁β + A + U₁
Y₂ = α + X₂β + A + U₂

and using the first difference (in t) of the equations, we get:

Y₂ − Y₁ = (X₂ − X₁)β + (U₂ − U₁)  ⇔  ∆Y = ∆Xβ + ∆U
which has eliminated completely the time-invariant effects α and A. Moreover,
this new form is similar to an OLS model, and if the Gauss-Markov assumptions
are satisfied by ∆Y and ∆X, the estimator β̂ is identified and consistent. This
first-differencing strategy is useful in any context of correlation between X and A.
Nevertheless, as we can see with α, if any regressor included is also time-invariant,
this strategy will equally cancel it out.
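A minimal simulation, with A deliberately correlated with X (the case where RE fails but first-differencing still identifies β); all values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 5000, 1.5

# Individual effect A correlated with the regressor in both periods
A = rng.normal(size=n)
X1 = A + rng.normal(size=n)
X2 = A + rng.normal(size=n)
Y1 = 0.5 + beta * X1 + A + rng.normal(size=n)
Y2 = 0.5 + beta * X2 + A + rng.normal(size=n)

# First differences wipe out both the constant and A
dY, dX = Y2 - Y1, X2 - X1
beta_fd = (dX @ dY) / (dX @ dX)     # OLS on differenced data, no constant left
```

Despite Corr(X, A) ≠ 0, the differenced regression recovers β ≈ 1.5.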

Fixed-effects transformation

Another way of removing the individual-specific and time-invariant effect A is to consider only deviations from the average. Define the average of a variable over time as Ȳ = (1/T) ∑_{t=1}^{T} Y_t. Then Ā = A, so that A − Ā = 0. By applying this transformation to our complete model, we also remove α and A completely:
Y₁ − Ȳ = (α − ᾱ) + (X₁ − X̄)β + (A − Ā) + (U₁ − Ū)
Y₂ − Ȳ = (α − ᾱ) + (X₂ − X̄)β + (A − Ā) + (U₂ − Ū)

To simplify notation, we denote the variables transformed by this fixed-effects method with a double dot, e.g. Ẍ = X − X̄. In vector notation, the model can then be written as:

Ÿ = Ẍβ + Ü

If this model satisfies the classic Gauss-Markov assumptions, the estimators are identified and consistent. However, the asymptotic distribution is not exactly the same as in OLS: we have Σ = σ_U² (Ẍ′Ẍ)⁻¹, which is greater than the OLS equivalent σ_U² (X′X)⁻¹.

As in the RE model, we are interested in estimating the residual variance. Using the same reasoning, we average over the now T − 1 degrees of freedom per individual and over the n individuals to get:

σ̂_U² = [1/(n(T − 1))] ∑_{s} ∑_{i} Ü̂_{is}²
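The within transformation and the residual-variance formula can be sketched as follows; the data-generating values are arbitrary, with X again correlated with A:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T, beta = 2000, 5, 1.5
A = rng.normal(size=(n, 1))
X = 0.8 * A + rng.normal(size=(n, T))       # X correlated with A
U = rng.normal(0, 0.7, size=(n, T))
Y = 0.5 + beta * X + A + U

# Fixed-effects (within) transformation: subtract individual time-averages
Xd = X - X.mean(axis=1, keepdims=True)
Yd = Y - Y.mean(axis=1, keepdims=True)

beta_fe = (Xd.ravel() @ Yd.ravel()) / (Xd.ravel() @ Xd.ravel())
resid = Yd - beta_fe * Xd
sU2_hat = (resid ** 2).sum() / (n * (T - 1))    # divide by n(T-1), not nT
```

Dividing by n(T − 1) rather than nT is exactly the degrees-of-freedom correction above: demeaning removes one degree of freedom per individual.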

FD vs. FE

As we’ve seen, both first-differencing and the fixed-effects transformation deal perfectly with the presence of time-invariant, individual-specific unobservables. Nonetheless, in practice, the fixed-effects approach is used more than first-differencing. This preference is due to the fact that when variables might include measurement errors, the fixed-effects transformation will reduce the bias due to that issue. Thus, whenever variables are known to potentially suffer from errors-in-variables, the fixed-effects transformation will yield better estimators.

4.1.4 Which approach to choose?

The two models designed around the existence of an invariant effect turned out to be similar to simple methods like OLS or GLS. Nevertheless, the choice between them relies on a single assumption: are the individual unobservable, time-invariant deviations A correlated with the observables X? To sum up: if no correlation is assumed, the RE estimator is consistent and efficient; if correlation is present, the FE estimator is consistent and efficient. Moreover, if we were to use FE in place of RE, we would still get consistency, but lose efficiency. On the other hand, doing the opposite and using RE instead of FE would not yield a consistent estimator.

Using this intuition, we can derive a Hausman test to decide which model can be used. For that, define the Hausman test statistic as:

Γ̂ ≡ (β̂_FE − β̂_RE)′ [V̂ar(β̂_FE − β̂_RE)]⁻¹ (β̂_FE − β̂_RE) →d χ²_k
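A sketch of the test with hypothetical estimates; it uses the standard simplification that, under the RE assumptions, Var(β̂_FE − β̂_RE) = Var(β̂_FE) − Var(β̂_RE) because RE is efficient:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical FE and RE estimates with their variance matrices (k = 2 slopes)
b_fe = np.array([1.52, -0.31]); V_fe = np.array([[4e-4, 0.0], [0.0, 5e-4]])
b_re = np.array([1.48, -0.30]); V_re = np.array([[3e-4, 0.0], [0.0, 4e-4]])

d = b_fe - b_re
stat = d @ np.linalg.inv(V_fe - V_re) @ d      # Hausman statistic
p_value = 1.0 - chi2.cdf(stat, df=len(d))
reject_re = stat > chi2.ppf(0.95, df=len(d))   # reject -> use FE
```

A large statistic means the two estimators disagree by more than sampling noise allows under the RE assumptions, so one should fall back on FE.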


4.2 Nonseparable Model

In the case of nonseparable models such as binary choice models (probit, etc.),
estimation of panel data using fixed effects is not going to work out. Indeed, the
presence of the time-invariant effect A within a function g(·) implies that it is
impossible to remove it by subtracting out its average across individuals. However,
in the logit case, and only in the logit case, Chamberlain's approach could yield √n-CAN estimators, by considering only people that switch choices between
periods (i.e. have different Y1 and YT ). Although interesting as an approach, the
fact that it would work only within the logit setting is not satisfactory. This
section will look at how nonparametric estimation could help in generalizing the
fixed effects model.

4.2.1 Setup

Consider the following model:

Y₁ = φ(X₁, Z₁, A, U₁)
Y₂ = φ(X₂, Z₂, A, U₂)

where, as in the fixed-effects approach, A is correlated with either X_t or Z_t, and U_t is i.i.d. In the same way as we did in the FE model, we try to cancel out A. Since we are now in a more general setting, we call this generalized differencing.

Assumption 1

There exists ε > 0 such that:

U_t ⊥ (I{‖∆X‖ < ε} · ∆X, X₁) | A; I{‖∆Z‖ = 0} · ∆Z; Z₁

meaning that, conditional on A and the Z variables, the unobservable term U_t is independent of past X and of small increments of X (in short, of the process of X).

Assuming F(A|X) is time-invariant, we can write:

∂_ξ E[∆Y | ∆X = ξ, X₁ = x] |_{ξ=0} = E[∂_x φ(X₁, U₁, A) | ∆X = 0, X₁ = x]

4.2.2 Binary Choice model

4.2.3 Application

Chapter 5

Big Data and Machine Learning

5.1 Introduction

5.1.1 Some definitions

The statistics of big data are somewhat different from what we have seen in the
previous chapters of this class. For that reason, we will need to define (even
redefine) some concepts. First of all, what is big data? Consider a simple cross-
section dataset. Define n as the number of observations and p as the number of
variables observed within each record. Most often, we have worked with what we
call “tall” data, meaning we had a lot of observations n and not so many variables
p. In the same way, “wide” data is when p is bigger than n. Both types of dataset
are computationally demanding as they grow. Now, “big” data is a combination of
both types, with big n and big p. As you can imagine, it is very computationally
demanding, and uses different techniques than what we are used to.

In the definitions of ML, we will not use the usual regression terms. Instead, we
call X the input variables (or predictors) while Y will be the output variable (or
response). Nevertheless, we still differentiate in the same way variables that are
quantitative (continuous), qualitative (categorical or discrete) and ordered; usually,
discrete binary variables will be set to 0/1 or -1/1 (using dummies). When Y is
quantitative, the ML naming convention will define prediction of Y as a regression.

If Y is instead qualitative or ordered, prediction will be named classification. Note that this taxonomy of methods pertains to the type of Y only; whether some or all X are of one type or the other will not affect this taxonomy, even though it can call for different methods. Finally, it should be known that the
usage of Y in the prediction process is not mandatory. It is true that it is closer
in interpretation to what we have seen earlier in econometrics, but a fringe
literature exists in ML where the task is instead to describe how data is organized
or clustered. This is called unsupervised learning, whereas using Y (as we’ll mostly
do) is called supervised learning.

As was just described, one goal of statistical learning (whether machine learning
or simple techniques like OLS) is to predict the outcome Y , given an input X. For
that, we have to make the very general assumption that Y is defined as a function
f (·) of input variables and potentially a random error ε. These two objects are
unknown to the econometrician. Prediction is thus all about finding a “good
enough” function fˆ such that the predicted outcome Ŷ = fˆ(X) is the “closest” to
the actual outcomes Y . Another goal of statistical learning could be to understand
the relationship between Y and X, the form of f or related questions. These issues
fall under the definition of inference.

5.1.2 Statistical Decision Theory

Quantitative output

Let X ∈ ℝ^p denote a real-valued random input vector and Y ∈ ℝ a real-valued random output variable. Both are linked by the joint distribution Pr[X, Y]. As
described earlier, prediction is about finding a function f (X) for predicting Y ,
given values of X. For that, we need to define a loss function, denoted L(·), that
will indicate how “close” our function is to the reality, by penalizing errors in
prediction.

One potential loss function is the squared error function, defined as L(Y, f(X)) = [Y − f(X)]². The criterion associated with that function is to choose the f that minimizes the expectation of the loss function. In detail, we have:

EPE(f) = E[(Y − f(X))²] = E[ E[(Y − f(X))² | X] ]

From that last equation, we can see that minimizing the EPE is equivalent to minimizing the conditional expectation of the loss function, given X. In fact, we choose:

f(x) = arg min_c E[(Y − c)² | X = x]

which has the solution f(x) = E[Y|X = x]. Since expectations are not available in the data, we need to estimate them using different methods (OLS, nonparametrics, etc.).
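A quick numerical check that the mean minimizes expected squared loss (shown unconditionally for simplicity; conditioning on X = x works the same way within each cell):

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=100_000)   # any distribution works

# Brute-force search for the constant c minimizing the empirical E[(Y - c)^2]
grid = np.linspace(0.0, 6.0, 601)
losses = [np.mean((y - c) ** 2) for c in grid]
c_star = grid[int(np.argmin(losses))]
```

Since mean((y − c)²) = var(y) + (ȳ − c)², the minimizer is the grid point closest to the sample mean, matching the solution f(x) = E[Y|X = x].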

Categorical output

5.1.3 Dimensionality Curse

We have seen how the linear estimation model (OLS) has low variance but a
potentially high bias, while the nearest-neighbor was at the opposite of the
spectrum, having high variance but low bias. We have also seen that this issue
vanishes with bigger datasets (as n → ∞), the intuition being that as the number of observations increases, a given neighborhood around a point will contain more and more points, reducing variance and allowing a smaller neighborhood, thus reducing bias.
It turns out that this particular intuition falls completely flat when we consider
higher dimensional input variables. This is called the curse of dimensionality. It
can be understood and visualized in different ways.

One intuitive way is using hypercubes. Consider a p-dimensional hypercube


starting at a target point x. In one dimension, capturing 80% of the data requires
a bandwidth of 0.8, in two dimensions, the same bandwidth of 0.8 will capture
only 64% of the data, in three dimensions, only 51%, etc. In ten dimensions, the
volume covered by a 0.8 bandwidth in each dimension amounts to merely 11% of
the data.
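The hypercube arithmetic behind these numbers is a one-liner:

```python
# Fraction of uniform data captured by a per-dimension bandwidth of 0.8
fractions = {p: 0.8 ** p for p in (1, 2, 3, 10)}

# Conversely, the per-dimension side length needed to keep 80% of the data
side = {p: 0.8 ** (1.0 / p) for p in (1, 2, 3, 10)}
```

In ten dimensions the covered fraction drops to about 11%, while capturing 80% of the data would require a side length of nearly 0.98 in every dimension, i.e. the "neighborhood" is no longer local at all.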

5.2 Linear Methods for Regression

The linear regression model assumes that the relation between output Y and
inputs X is either linear or approximately linear. Formally, it is equivalent to

writing f(X) as:

f(X) = β₀ + ∑_{j=1}^{p} X_j β_j

where β is the vector of unknown parameters (which is what makes f unknown) and X_j could be anything from:

• observed quantitative inputs


• transformations of quantitative inputs
• basis expansions, yielding a polynomial representation
• coding of categorical variables
• interactions between inputs

5.2.1 Least-squares

Recall that the least-squares method picks the vector β = (β₀, ..., β_p) that minimizes the residual sum of squares (the criterion) given by RSS(β) = (Y − Xβ)′(Y − Xβ). The first-order condition yields:

∂RSS/∂β = 0 ⇔ −2X′(Y − Xβ) = 0

and the second-order condition yields:

∂²RSS/∂β∂β′ = 2X′X

Meaning that, provided X has full column rank (so that X′X is positive definite), we obtain the unique solution:

β̂ = (X′X)⁻¹X′Y
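A minimal sketch of the normal equations on simulated data (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant + p inputs
beta = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta + rng.normal(size=n)

# Full column rank guarantees X'X is positive definite, hence invertible
rank = np.linalg.matrix_rank(X)

# Solve the normal equations X'X b = X'Y instead of forming the inverse
b_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

Solving the linear system is numerically preferable to explicitly computing (X′X)⁻¹, though the two are mathematically identical.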

5.2.2 Extensions of Linear regression

The model we have just seen relies on a linear model. However, it can be that
the relationship between Y and X is actually not linear, so that f (X) is not a

linear function. In this subsection, we look at ways to move our model beyond linearity by augmenting and/or transforming the inputs X. Formally, we look at representations of f(X) such that:

f(X) = ∑_{m=1}^{M} β_m h_m(X)

In this equation, the h_m(X) are called basis functions; they are transformations of the inputs X. The global model is called a linear basis expansion, as it expands X using the h_m while still entering f(X) linearly.

Polynomial regression

Consider basis functions h_m(X) of the form X_j² or X_j X_k for j, k = 1, ..., p. We call this class of models polynomial regressions. The simplest model of this class, assuming a single input X, is a p-degree polynomial such that:

f(X) = β₀ + ∑_{j=1}^{p} β_j X^j + ε

It is still estimated by OLS and is quite flexible, however, higher order polynomials
(p ≥ 5) can lead to strange shapes due to boundary issues (much like in the
nonparametric case).

Step functions

Consider now basis functions h_m(X) that equal 1 when X falls in a given region and 0 otherwise, much like threshold-based indicator functions. This class of models is called step functions. A simple example would be the following:
h1 = I{X < ξ1 }; h2 = I{ξ1 ≤ X < ξ2 }; h3 = I{ξ2 ≤ X }
Then, the global model would fit a constant in each of the set of X values. It is
obvious in this model that the cutpoints or thresholds ξ are set by choice (vs.
data-driven) which leads to the question of how to choose them.

Note that since ∑_m h_m(X) = 1, we cannot include β₀ in the global model.

Regression Splines

One could also use both previous methods to first divide the domain of X into
contiguous, mutually exclusive intervals (step basis functions) and second, use
linear functions or even low-degree polynomials within each interval. The result
would not be a step function anymore but rather a piecewise linear or polynomial
function.

In the piecewise linear case, a basis expansion would look like:

h1 = I{X < ξ1 }; h2 = I{ξ1 ≤ X < ξ2 }; h3 = I{ξ2 ≤ X }

and then, for each region m: h_{m+3}(X) = X · h_m(X). This model yields 2 linear parameters × 3 regions = 6 total parameters. Note that the three lines resulting from this model are not guaranteed to connect: nothing imposes continuity. Since continuity can be a desirable property, we would need to add constraints so that the value of f(X) is the same at ξ_m⁺ and ξ_m⁻. For that, we need two constraints (one for each knot), such that:

β₁ + β₄ξ₁ = β₂ + β₅ξ₁  and  β₂ + β₅ξ₂ = β₃ + β₆ξ₂

which will fix two parameters and leave 4 free.

In general, piecewise polynomial functions that are constrained to be continuous,


but also constrained to have continuous derivatives are called order-M splines
with knots ξ j , j = 1, ..., K. In this case, one needs to select the order of the spline,
the number of the knots as well as their location to identify the model.
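Instead of imposing the continuity constraints by hand, one can build continuity into the basis itself with truncated power functions (x − ξ)₊; this standard equivalent parameterization has exactly the 6 − 2 = 4 free parameters counted above, sketched here on simulated data with arbitrary values:

```python
import numpy as np

def piecewise_linear_basis(x, knots):
    """Truncated-power basis {1, x, (x - k)_+}: continuous piecewise-linear by construction."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(-2, 2, size=400))
# True function: slope 0.5 left of 0, slope 2.0 to the right, continuous at 0
y = np.where(x < 0, 1.0 + 0.5 * x, 1.0 + 2.0 * x) + rng.normal(0, 0.1, size=x.size)

B = piecewise_linear_basis(x, knots=[0.0])
coef = np.linalg.lstsq(B, y, rcond=None)[0]   # (intercept, left slope, slope change)
```

The fitted coefficients are (≈1, ≈0.5, ≈1.5): the third coefficient is the change in slope at the knot, and the fit is continuous there without any explicit constraint.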

Natural Splines

Although regression splines allowed for flexible and interesting modeling over the
data, it still has the drawbacks of polynomial regression. In particular, recall that
around the boundaries, polynomial regression can yield weird erratic functions
that are not desired. In order to control for that issue, a natural spline will add
constraints beyond boundary knots. The intuition is that we reduce the order of
the polynomials beyond some boundaries (both left and right) to eliminate the
extravagant variation at the extremes of the data. By doing that, we estimate fewer parameters, which leaves us room to add more knots in the center of the data, such that the increased bias near the boundaries (from the lower-order polynomials) is balanced out by less bias within the boundaries (from the additional knots).

Smoothing Splines

Another way to control for too much variation in the regression is to punish it
within the objective function. Recall that the purpose of these linear methods
was to minimize the residual sum of squares. In addition to that, we could add
a term in the RSS that would penalize too much variation. This term will be
composed of a smoothing parameter λ, which is really a “punishment parameter”
that multiplies the integral of the second derivative (squared) of the function.
This intuition works because the second-order derivative of f (X) represents its
curvature. This process is called regularization. The new objective function is
defined as:

RSS(f, λ) = ∑_{i=1}^{n} (y_i − f(x_i))² + λ ∫ (f″(x))² dx
In the extreme cases, consider λ = 0, then f can be any function that interpolates
the data (goes exactly through all the points), regardless of its shape. If λ = ∞,
no second-order derivative will be tolerated, thus leaving only linear functions in
the choice set: we are back to the OLS case. It turns out that the solution to this
problem is the natural cubic spline, with knots at each value of xi . Intuitively, one
would worry about overparameterization in this context (since we have n knots),
however, the penalization of excessive variation in f keeps this issue in check.
Moreover, in a natural cubic spline, the order reduction that happens at the boundaries forces f to be linear beyond the boundary knots.

In vector form, we have:

RSS(f, λ) = (Y − Bβ)′(Y − Bβ) + λ β′Ω_n β

where B is the n × k matrix of all basis functions evaluated at the data, meaning B_{ij} = b_j(x_i), and Ω_n collects the integrals of second-derivative products, such that (Ω_n)_{jk} = ∫ b″_j(t) b″_k(t) dt. We can recover the optimal β easily here:

β̂ = (B′B + λΩ_n)⁻¹ B′Y


which is called the generalized Ridge regression estimator.

5.2.3 Shrinkage methods

Hoderlein’s presentation of the Ridge regression

Recall that the solution to the simple OLS model given by Y = Xβ + ε is β̂ = (X′X)⁻¹X′Y, under some regularity conditions. One of these conditions is that the matrix X′X be of full rank, or in other words, that there is no collinearity in the matrix. As we have seen in Lewbel's class, issues may also arise when the X′X matrix is close to being collinear; this is called the near-multicollinearity issue. One solution to this issue is to add a constant λ to the diagonal of X′X. By doing this, we shift the eigenvalues of the matrix away from 0, by λ, giving them a lower bound. With this, X′X + λI_p becomes positive definite, regardless of the relative size of p and n. This estimator is called the Ridge estimator:

β̂_R = (X′X + λI_p)⁻¹X′Y

From this formula, you can observe two things. First, since λ enters in the
“denominator”, the ridge estimator will be shrunk compared to the OLS analog for
any positive value of λ. As λ → 0, we get the actual OLS estimator. Second, this
reduction in the estimator is not perfect, in the sense that it creates a bias (recall
that OLS is BLUE). This bias is important because it comes with the advantage of
reducing variance. In fact, it can be shown that the variance of the ridge estimator
is always lower than that of the OLS estimator. Using the ridge estimator will therefore imply a different bias-variance tradeoff than OLS; so which one is better?

The Theobald theorem stipulates that there always exists a λ > 0 such that
MSE(β̂_R) < MSE(β̂_OLS). In words, one can always find a ridge estimator,
different from OLS, that beats OLS in the MSE metric.

To sum up, the intuition behind this estimator is that, by allowing some bias,
we can reduce variance and obtain an estimator with a lower MSE.
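A minimal sketch of this estimator in code (the near-collinear design is a made-up example to show the shrinkage effect, not anything from the notes):

```python
import numpy as np

def ridge(X, Y, lam):
    """Ridge estimator: (X'X + lam * I_p)^{-1} X'Y; lam = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 1e-4 * rng.standard_normal(n)   # near multicollinearity
beta_true = np.array([1.0, 1.0, -2.0, 0.5, 0.0])
Y = X @ beta_true + rng.standard_normal(n)

beta_ols = ridge(X, Y, lam=0.0)
beta_ridge = ridge(X, Y, lam=10.0)
```

Because ‖β̂_R(λ)‖₂ is non-increasing in λ, the ridge fit is always (weakly) shrunk relative to OLS; with the near-collinear pair of columns above, the OLS coefficients on those two columns blow up while the ridge ones stay moderate.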

Ridge regression as a shrinkage method

An alternative view of the ridge regression is in terms of “shrinkage” since, as
we’ve seen, the ridge estimator is a shrunk version of the OLS estimator. The
philosophy behind this alternative view is that when the number of regressors

(p) is high, we would like to select only a few of them to ideally improve the
prediction error, but also give a more palatable interpretation of the model. In
this sense, the ridge regression shrinks the coefficients by imposing a penalty on
their size. In fact, we can write the ridge estimator as the estimate that solves:
β̂_R = arg min_β ‖Y − Xβ‖₂² + λ ∑_{j=1}^p β_j²

where the first norm corresponds to the classic definition of the RSS, while the
second term penalizes the values taken by each parameter in the model. Thus, we
have an explicit regularization of the parameters within the function to optimize.

Ridge regression optimization problem

As stressed above, the ridge regression revolves around solving:

β̂_R = arg min_β ‖Y − Xβ‖₂² + λ‖β‖₂²

This problem can also be seen as a constrained optimization problem: minimize
the RSS subject to β lying in a p-dimensional ball ‖β‖₂² ≤ c (with c having a
one-to-one relation to λ).

In particular, we write:

β̂_R = arg min_{‖β‖₂² ≤ c} ‖Y − Xβ‖₂²

The graphical interpretation of this problem is intuitive and interesting:

In this graph, we can see in red the level curve representation of the objective
function (the RSS), while the blue circle (2D representation of the sphere) is the
constraint on parameters. Then, instead of finding the point that minimizes the
objective function (i.e. the eye of the objective), we look for the tangent point
between the objective and the constraint sphere, which is different from β̂, the
OLS estimator.

Lasso regression

The lasso regression is another shrinkage method like ridge, with a defining
difference in the norm used to constrain β. In the ridge setting, we constrained
β to a sphere by requiring its 2-norm to be lower than a given scalar c,
formally ‖β‖₂² ≤ c. The lasso, on the other hand, uses the 1-norm to do the
exact same thing. The 1-norm, denoted ‖β‖₁, is the sum of the absolute values
of the parameters.

Therefore, its formal definition is:

β̂_L = arg min_β ‖Y − Xβ‖₂² + λ ∑_{j=1}^p |β_j|

or in its constrained optimization form:

β̂_L = arg min_{‖β‖₁ ≤ t} ‖Y − Xβ‖₂²

In order to understand the difference between the two estimators more
intuitively, we go back to the graphical representation.

As you can see, the constraint region is now a diamond (whose 2D representation
is a tilted square). The interpretation of the difference between this solution
and the OLS has not changed. However, as the graph clearly shows, the lasso
estimator will tend to shrink some parameters all the way to 0 (β₁ in this
case). In particular, if the data is such that p > n (the OLS solution is not
unique and overfits), the lasso will ensure that only k < n regressors are left,
creating sparsity in the parameter vector.
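This sparsity can be seen in a small simulation. The coordinate-descent solver below is a standard way to compute the lasso, sketched here for the objective ‖Y − Xβ‖₂² + λ‖β‖₁ (function names and data are illustrative, not from the notes):

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, Y, lam, n_sweeps=200):
    """Coordinate descent for min_beta ||Y - X beta||_2^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = Y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2.0) / (X[:, j] @ X[:, j])
    return beta

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                   # only 3 active regressors
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

beta_lasso = lasso_cd(X, Y, lam=50.0)
```

Unlike ridge, the soft-thresholding update sets a coefficient exactly to zero whenever the correlation of its regressor with the partial residual falls below λ/2, which is precisely the corner-solution behavior seen in the diamond picture.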

5.3 Tree-based methods

5.3.1 Introduction

Tree-based methods take a completely different approach to estimation. Their
goal is to partition the feature space (input space) into a set of rectangles
and fit a simple model within each rectangle. For example, consider a regression
problem with a continuous response Y and inputs X₁ and X₂, each taking values
in [0, 1]. The following graph shows how a tree could be used to estimate Y
using the inputs. On the left panel, you can see the process of partitioning:
first, we split the data according to whether X₁ ≤ t₁ or not; then, on each side
of the split, we partition again, fit a constant, and so on, until the tree
meets a stopping criterion. The result of this method is shown in the right
panel.

5.3.2 Formal definition

Let the dataset contain N observations (x_i, y_i) for i = 1, ..., N, with x_i a
p-dimensional vector (x_{i1}, ..., x_{ip}). Given a partition of ℝ^p into M
regions denoted R₁, ..., R_M, fitting a regression tree is defined as finding
the function:

f(x) = ∑_{m=1}^M c_m · I{x ∈ R_m}

such that it minimizes the RSS given by ∑_{i=1}^N (y_i − f(x_i))². This is
equivalent to minimizing the RSS in each region, yielding as a result that

ĉ_m = (1/N_m) ∑_{x_i ∈ R_m} y_i

or in words, the average value of y_i in the region R_m. Note that this method
is useful only when the partition is given! Finding the exact partition that
would yield the lowest possible RSS is computationally infeasible. Instead, one
should use a so-called “greedy algorithm”.

Greedy algorithm

The greedy algorithm is a heuristic method for finding a good solution; it is
called greedy because it is short-sighted and will go for the nearest best
solution, without realizing that there could be a better global solution
reachable by making a less optimal choice right now. In the setting of growing
trees, the algorithm solves two optimization problems, one nested within the
other. These are called the inner and outer minimizations.

First, the algorithm starts with all the data and considers splitting a variable
j at a cutpoint s. This would create two regions R1 ( j, s) = {X |X j ≤ s} and
R2 ( j, s) = {X |X j > s}. Given these regions, the algorithm solves the tree as
was described in the previous section, meaning that it finds (c1, c2 ) such that it
minimizes the RSS. This process is the inner minimization. Then, the algorithm
chooses which pair ( j, s) would yield the lowest RSS, among all feasible pairs.
This is the outer minimization problem. All in all, this process can be formalized

as finding:

(j*, s*) = arg min_{(j,s)} [ min_{c₁} RSS(y_i, c₁; R₁(j, s)) + min_{c₂} RSS(y_i, c₂; R₂(j, s)) ]

The data is now partitioned into two regions, and we repeat this process within
each of them, increasing the size of the tree until a stopping criterion has
been met.
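One pass of this inner/outer search can be written directly. The sketch below is brute force on made-up data, scanning every observed value of every variable as a candidate cutpoint:

```python
import numpy as np

def best_split(X, y):
    """Outer minimization over (j, s); the inner minimization is solved in
    closed form, since the optimal constants are the region means."""
    best = (None, None, np.inf)                  # (j*, s*, RSS)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] <= s                  # R1(j, s); complement is R2
            if not left.any() or left.all():
                continue                         # both regions must be nonempty
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(200, 2))
# y jumps at X_1 = 0.5, so the search should split on feature 0 near 0.5.
y = np.where(X[:, 0] <= 0.5, 1.0, 5.0) + 0.1 * rng.standard_normal(200)

j_star, s_star, rss_star = best_split(X, y)
```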

Growing the tree

Which criterion should we choose? From what we know about statistics, growing
the tree indefinitely will cause overfitting, while not letting it grow enough
might cause it to miss important features. In fact, the size of a tree governs
the complexity of the model, hence one should let the data guide the size of
the tree.

One approach could be to evaluate the change in the total RSS of the model after
each split, and stop when the RSS decrease from a split is too small. However,
this raises two issues: first, the choice of the threshold is not guided by the
data, and it might not be optimal; second, this strategy is also short-sighted,
in the sense that a very small decrease in RSS at a given step might enable
bigger decreases in the following steps. Recall that the greedy algorithm is
already short-sighted in this way, so we might try to choose another, more
robust method.

Another strategy, usually preferred, is to first grow a large tree, stopping
only when some minimum node size (say 5 observations) is reached. Then, we use
a “pruning” method called cost-complexity pruning. Denote the big tree by T₀
and define T ⊂ T₀ as a sub-tree obtained by “pruning” T₀, meaning collapsing
some of its internal nodes. Use m = 1, ..., M as an index for terminal nodes,
so that the input space is partitioned into M regions. Finally, let |T| be the
number of terminal nodes (regions) of T. Now define,

• N_m = |{x_i ∈ R_m}|, the number of observations within region m.

• ĉ_m = (1/N_m) ∑_{x_i ∈ R_m} y_i, the constant fit in region m.

• Q_m(T) = (1/N_m) ∑_{x_i ∈ R_m} (y_i − ĉ_m)², the average RSS of the fit within region m.

The cost-complexity criterion of a tree T, denoted C_α(T), is defined as:

C_α(T) = ∑_{m=1}^{|T|} N_m Q_m(T) + α · |T|

where the first term is the RSS of the tree T and the second is the cost of the
number of regions.

As you can see, this criterion embodies a trade-off between the total RSS of the
model (its fit) and the number of regions it creates, for a given parameter α.
A higher α penalizes big trees, while a lower α allows for bigger trees.
Therefore, α is key in governing this trade-off and can be considered a tuning
parameter of the model. The idea behind this criterion is that, for a given α,
the optimal tree is the subtree T_α ⊆ T₀ that minimizes C_α(T). One can show
that this tree is unique for any value of α.
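To make the trade-off concrete, here is a toy computation of C_α(T) (the node counts N_m and average fits Q_m below are made-up numbers):

```python
def cost_complexity(regions, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|,
    where `regions` lists one (N_m, Q_m) pair per terminal node."""
    rss = sum(n_m * q_m for n_m, q_m in regions)
    return rss + alpha * len(regions)

# A deeper tree (four regions) fits better; a shallow one (two regions) is cheaper.
big_tree = [(10, 0.2), (15, 0.1), (5, 0.3), (20, 0.05)]   # total RSS = 6
small_tree = [(25, 0.6), (25, 0.4)]                       # total RSS = 25

c_big_0, c_small_0 = cost_complexity(big_tree, 0.0), cost_complexity(small_tree, 0.0)
c_big_10, c_small_10 = cost_complexity(big_tree, 10.0), cost_complexity(small_tree, 10.0)
```

With α = 0 the big tree wins on pure fit; at α = 10 the per-region charge tips the comparison toward the smaller tree.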

5.3.3 Bagging

The concept of bagging, or bootstrap aggregating, is linked to the concept of
bootstrapping. Recall that the main idea of bootstrapping is to assess the
accuracy of parameter estimates using multiple resamples of the data. Bagging
essentially averages a fit over a collection of bootstrap samples.

Formally, consider a set Z containing “training” data {(x₁, y₁), ..., (x_N, y_N)}
on which we fit a model, yielding our main estimate f̂(x). In order to perform
bagging, we first draw B bootstrap samples from Z, denoted Z^b for b = 1, ..., B.
In each sample, we fit the same model as before, this time yielding a different
estimate f̂^b(x). Then, the bagging estimate is the average of all bootstrap fits:

f̂_bag(x) = (1/B) ∑_{b=1}^B f̂^b(x)

In the tree regression setting, this estimator is interesting as each f̂^b(x)
will be a different tree, displaying different features such as a different
partition, but also a different size! An example is shown on the following page:
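A minimal bagging sketch, using one-split regression trees ("stumps") as the base learner for brevity (the data-generating process and function names are illustrative, not from the notes):

```python
import numpy as np

def fit_stump(X, y):
    """Fit a one-split regression tree; returns (j, s, c_left, c_right)."""
    best, best_rss = None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:        # exclude max so both sides nonempty
            left = X[:, j] <= s
            rss = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[~left] - y[~left].mean()) ** 2).sum()
            if rss < best_rss:
                best, best_rss = (j, s, y[left].mean(), y[~left].mean()), rss
    return best

def bagging_predict(X_train, y_train, X_new, B=50, seed=0):
    """Average the fits of B stumps grown on bootstrap samples Z^b."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)         # draw Z^b with replacement
        j, s, c_left, c_right = fit_stump(X_train[idx], y_train[idx])
        preds[b] = np.where(X_new[:, j] <= s, c_left, c_right)
    return preds.mean(axis=0)                    # f_hat_bag(x)

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 1.0, size=(150, 1))
y = np.where(X[:, 0] <= 0.5, 0.0, 4.0) + 0.2 * rng.standard_normal(150)
yhat = bagging_predict(X, y, np.array([[0.1], [0.9]]))
```

Each bootstrap stump picks a slightly different cutpoint, and the average smooths over that instability.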

5.3.4 Random Forests

The random forests method is a substantial modification of bagging. While
bagging averages many trees of the same type, which are correlated with one
another, random forests average trees that are deliberately randomized so that
they end up (nearly) uncorrelated. To see why this helps, note that any tree in
the bagging process has the same expected bias as any other. Thus, getting a
better model from bagging means reducing variance, which is achieved by
averaging over many trees. In that sense, random forests are not very different
from bagging, but de-correlating the trees yields an even lower variance than
averaging alone.

Starting with a data sample of size N, the procedure is defined as follows: for
b = 1, ..., B:

(a). Draw a bootstrap sample Z^b from the data.

(b). Grow a tree T_b using Z^b by repeating the following steps until a minimum
number of nodes n_min is attained:
i. Select m variables at random from the p variables.
ii. Pick the best variable/split pair among the m variables.
iii. Split the node into two daughter nodes and perform step (b).i again.

(c). Save the output tree as f̂^b(x).

Finally, we have:

f̂_rf(x) = (1/B) ∑_{b=1}^B f̂^b(x)
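A self-contained sketch of steps (a)-(c), simplified to single-split trees so the whole procedure fits in a few lines (a real forest would grow each tree deep; the data and names are illustrative):

```python
import numpy as np

def random_forest_predict(X_train, y_train, X_new, B=40, m=2, seed=5):
    """Average B single-split trees; each tree sees a bootstrap sample and
    may only split on m variables chosen at random from the p available."""
    rng = np.random.default_rng(seed)
    n, p = X_train.shape
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # (a) bootstrap sample Z^b
        Xb, yb = X_train[idx], y_train[idx]
        feats = rng.choice(p, size=m, replace=False)  # (b).i random variables
        best, best_rss = None, np.inf
        for j in feats:                               # (b).ii best pair (j, s)
            for s in np.unique(Xb[:, j])[:-1]:
                left = Xb[:, j] <= s
                rss = ((yb[left] - yb[left].mean()) ** 2).sum() \
                    + ((yb[~left] - yb[~left].mean()) ** 2).sum()
                if rss < best_rss:
                    best = (j, s, yb[left].mean(), yb[~left].mean())
                    best_rss = rss
        j, s, c_left, c_right = best                  # (c) save the tree f_hat^b
        preds[b] = np.where(X_new[:, j] <= s, c_left, c_right)
    return preds.mean(axis=0)                         # average of the f_hat^b

rng = np.random.default_rng(6)
X = rng.uniform(0.0, 1.0, size=(150, 3))
y = np.where(X[:, 0] <= 0.5, 0.0, 4.0) + 0.2 * rng.standard_normal(150)
yhat_rf = random_forest_predict(X, y, np.array([[0.1, 0.5, 0.5],
                                                [0.9, 0.5, 0.5]]))
```

Only the signal variable X₁ matters here, yet with m = 2 some trees never see it; the average still recovers the jump because enough trees do.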

5.3.5 Boosting

