8822 Lecture Notes
Panel Econometrics
Notes from Stefan Hoderlein’s lectures
Contents

1 Nonparametrics
   1.3.6 Testing
   1.3.7 Applications
2 Treatment Effects
   2.1 Intuition
   2.2 Identification
      2.2.2 Unconfoundedness
      2.2.4 Endogeneity
   2.5.1 Setup
   3.1 Motivation
      3.1.1 Probit
      3.1.2 Logit
   3.2 Estimation
      3.2.2 Interpretation
      3.2.3 Testing
   3.4.1 Motivation
   3.4.2 Identification
   3.4.3 Estimation
   3.4.4 Application
4 Panel Data
   4.1.1 Setup
   4.1.4 Which approach to choose?
   4.2.1 Setup
   4.2.3 Application
   5.1 Introduction
   5.2.1 Least-squares
   5.3.1 Introduction
   5.3.3 Bagging
   5.3.5 Boosting
Chapter 1
Nonparametrics
Density estimation might be interesting in its own right, when you need to identify
the particular distribution of a random variable. Nevertheless, it is mostly studied
as a fundamental building block for more complicated semi-/nonparametric mod-
els. Following the example in the previous section, suppose we want to estimate
how Y is related to X where
Y = mY (X) + U
Then we recovered, using the assumptions that m_Y(·) is twice differentiable with bounded second-order derivative and that E[U|X] = 0, that:

E[Y|X = x] = m_Y(x) = ∫_χ y · f_{Y|X}(y, x) dy

Moreover, from probability theory (Bayes' theorem):

∫_χ y · f_{Y|X}(y, x) dy = ∫_χ y · f_{YX}(y, x) / f_X(x) dy
where you have two density functions to estimate.
Let X be a random variable that can take the value of 1 with true probability p0
or 0 else. Think of how you would estimate the probability p0 .
One answer is to draw the random variable many times to get a series {x_1, x_2, ...}, then estimate p̂ as the number of times we actually observed a 1, divided by the number of draws. Formally, if we perform n random draws,
p̂ = (1/n) Σ_{i=1}^n I{x_i = 1}
where I{·} is a function that takes a value of 1 if the condition inside is true, 0 if
not. For example, if one million draws are made and 333 333 of them turn out to be ones, then p̂ = 333333/1000000 ≈ 1/3.
Now, let’s assume X is actually a continuous variable that can take any real value
on its support. Thinking about the previous example, how would you estimate the
probability that the realization of X falls in a given interval of length h around
a given x, or more formally, falls in [x − h/2, x + h/2]. This value h is called the
bandwidth.
Again, we could use the same strategy: draw the random variable n times, count the times x_i falls in the ball around x, and compare with the total number of draws:
P̂r[X ∈ B_{h/2}(x)] = (1/n) Σ_{i=1}^n I{x_i ∈ B_{h/2}(x)} = (1/n) Σ_{i=1}^n I{x − h/2 ≤ x_i ≤ x + h/2}
Is this type of estimator unbiased? We will check below by computing its expectation.
First, think of what the pdf of X, denoted f_X(x), actually is. Loosely speaking, it measures the probability that X takes the exact value x. In a sense, this is close to what we just did; however, we are looking for X to hit a point rather than a set. The probability of being in a set is given by the cdf F_X(x). It turns out that as we shrink the set more and more, the two concepts become closer and closer. Formally, as h tends to 0, the set B_{h/2}(x) will only contain x. Since f_X(x) is the derivative of F_X(x), we can write:
f_X(x) = lim_{h→0} Pr[X ∈ B_{h/2}(x)] / h = lim_{h→0} [F_X(x + h/2) − F_X(x − h/2)] / h
where you should recognize the last term from the previous subsection.
And in fact, you could estimate the pdf by using the estimator for the probability
as seen above:
f̂_X(x) = P̂r[X ∈ B_{h/2}(x)] / h = (1/(nh)) Σ_{i=1}^n I{x − h/2 ≤ x_i ≤ x + h/2}
for a given h that is relatively small (more on this later). We now have our first density estimator; let's look at it in more detail.
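To make the counting logic concrete, here is a minimal Python sketch of this naive (uniform-kernel) estimator; the data, names and evaluation point are all hypothetical:

```python
import numpy as np

def uniform_kde(x, data, h):
    """Share of observations falling in [x - h/2, x + h/2],
    divided by the bandwidth h."""
    inside = (data >= x - h / 2) & (data <= x + h / 2)
    return inside.mean() / h

# Hypothetical check: for N(0,1) draws, the density at 0 is about 0.399
rng = np.random.default_rng(0)
draws = rng.standard_normal(100_000)
print(uniform_kde(0.0, draws, h=0.2))
```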
The basic idea behind the estimator is to count how many observations fall in the neighborhood of x, relative to the total number of observations and the size of the neighborhood. We say "count" because our indicator function is rather naïve and does only that: it assigns a weight of one to observations inside the neighborhood and zero to observations outside it. The weight-assigning function is called a kernel (hence the name kernel density estimator). In particular, the one used above is called a uniform kernel because it assigns a uniform weight to all observations within the neighborhood. In practice, this is a rather poor kernel and it should rarely be used. The parameter h that defines the size of the neighborhood is called the bandwidth.
A kernel K(·) is assumed to satisfy the following properties:
• ∫ K(ψ) dψ = 1: the weights integrate to one;
• ∫ ψ·K(ψ) dψ = 0: the kernel is symmetric around 0;
• ∫ K²(ψ) dψ = κ₂ < ∞: finite "roughness";
• ∫ ψ²·K(ψ) dψ = µ₂ < ∞: finite second moment.
You should view these properties through the lens of what we actually use a kernel for. Since a kernel is essentially a "weight-assigning" function, it makes sense that it is symmetric (observations equally far off in either direction should be treated equally), that it is non-negative (although it might occasionally be interesting to assign negative weights to observations we really don't want) and that it stops assigning weight after a certain distance.
Interesting examples of kernels include:
• the uniform kernel, K(ψ) = (1/2) · I{|ψ| ≤ 1};
• the Epanechnikov kernel, K(ψ) = (3/4)·(1 − ψ²) · I{|ψ| ≤ 1};
• the Gaussian kernel, K(ψ) = (1/√(2π)) · exp(−ψ²/2).
As we did in the parametric econometrics classes, we now derive the kernel density estimator's properties, such as its bias and variance. Start with the expectation; since the draws are iid,

E[f̂_X(x)] = E[(1/h) · K((X − x)/h)] = (1/h) ∫ K((ξ − x)/h) · f_X(ξ) dξ
Then, we perform a change of variables such that the term inside the kernel is ψ,
meaning ξ = ψh + x and dξ = h · dψ. Replacing it in the bias formula we get:
E[f̂_X(x)] = (1/h) ∫ K(ψ) · f_X(ψh + x) · h dψ = ∫ K(ψ) · f_X(ψh + x) dψ
Further, let's use a second-order Taylor expansion with a mean-value remainder:

f_X(ψh + x) = f_X(x) + ψh·f′_X(x) + ((ψh)²/2)·f″_X(x_r)

where x_r = x + λψh for some λ ∈ [0, 1]. This yields:
E[f̂_X(x)] = ∫ K(ψ) · [ f_X(x) + ψh·f′_X(x) + ((ψh)²/2)·f″_X(x_r) ] dψ
= f_X(x) ∫ K(ψ) dψ + h·f′_X(x) ∫ ψ·K(ψ) dψ + (h²/2) ∫ K(ψ)·ψ²·f″_X(x_r) dψ

where the first integral equals 1 and the second equals 0 by the kernel properties.
The last term is problematic since f″_X(x_r) cannot be taken out of the integral (x_r depends on ψ). However, f″_X(x) can be, so we add and subtract it; we will see that the remainder is actually negligible.
(h²/2) ∫ K(ψ)·ψ²·f″_X(x_r) dψ = (h²/2) ∫ K(ψ)·ψ²·[f″_X(x_r) − f″_X(x)] dψ + (h²/2) ∫ K(ψ)·ψ²·f″_X(x) dψ
= R + (h²/2)·f″_X(x)·µ₂
where R is o(h²). Finally, we can write the expectation of our kernel density estimator as:
E[f̂_X(x)] = f_X(x) + (h²/2)·f″_X(x)·µ₂ + o(h²)
and the bias, Bias[f̂_X(x)], is given by the last two terms. From this equation, you can see that the bias increases with the bandwidth. This is intuitive since a greater bandwidth brings in more observations that are not related to x (global information) relative to the observations actually close to x (local information). Global information being more likely to distort the estimate at x, h is positively related to the bias. In the opposite direction, the bias vanishes as h goes to 0. Why not, then, make the bandwidth as small as possible? One can show by a similar calculation that the variance of the estimator is given by:
Var[f̂_X(x)] = (1/(nh)) · f_X(x)·κ₂ + o((nh)⁻¹)
which, this time, increases as h tends to 0. Intuitively this makes sense: shrinking the bandwidth eventually reduces the effective number of observations used at each point, and thus increases the variance. This tension is called the bias-variance trade-off.
Bias-variance trade-off
In order to have a sense of what the bias and variance look like over the whole
distribution, we integrate them with respect to x:
∫ Bias[f̂_X(x)]² dx = c₁·h⁴    and    ∫ Var[f̂_X(x)] dx = c₂·(nh)⁻¹
Asymptotics
The rate of convergence of the KDE is √(nh), where n is the number of observations and h the bandwidth. For the optimal bandwidth, which balances the squared bias (of order h⁴) against the variance (of order (nh)⁻¹), we had h ∝ n^(−1/5), yielding a rate of √(n·n^(−1/5)) = n^(2/5), slower than the parametric √n.
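As an illustration (not from the lecture), a minimal Python sketch of a rule-of-thumb bandwidth with the n^(−1/5) rate; the constant 1.06·σ̂ is Silverman's classical choice for a Gaussian kernel, an assumption beyond what the notes state:

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    """Bandwidth proportional to n^(-1/5), as required by the optimal
    bias-variance trade-off. The 1.06 * sigma constant is Silverman's
    rule for a Gaussian kernel (an assumption here)."""
    n = data.size
    return 1.06 * data.std(ddof=1) * n ** (-1 / 5)

# Hypothetical usage
rng = np.random.default_rng(0)
sample = rng.standard_normal(1_000)
print(rule_of_thumb_bandwidth(sample))
```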
As we've seen, the KDE method is very appealing in how it gets around the lack of structure, but it creates a new trade-off between bias and variance. To reduce the bias further, one might be interested in increasing the order of the kernel, as discussed next.
Density derivatives

If f_X(x) is a differentiable function of x, one can also use derivatives of the kernel to estimate derivatives of the density. In practice, to estimate an r-th order derivative one uses an r-th order kernel, which sets all moments up to the r-th to 0 and keeps the r-th moment finite, µ_r. This technique has the advantage of convergence at a rate closer to √n. However, such kernels put negative weight in their tails (so the estimate is not a proper density), and the estimator can behave poorly in small samples.
Recall our definition of a kernel density estimator for a true distribution fX (x):
f̂_X(x) = (1/(nh)) Σ_{i=1}^n K((x_i − x)/h)
where K is a standard kernel (refer to …). Also recall the mean regression model
of
E[Y|X = x] = m_Y(x) = ∫ y · f_{XY}(x, y)/f_X(x) dy
Our goal is to use kernel density estimators for both the distribution of X and the
joint distribution of X and Y . Formally, we look for:
m̂_Y(x) = ∫ y · f̂_{XY}(x, y)/f̂_X(x) dy
The two KDEs (a product-kernel estimator for f_{XY} and the estimator above for f_X) can be plugged into the mean regression estimator to get:

m̂_Y(x) = [ Σ_{i=1}^n K((x_i − x)/h) · (1/h) ∫ y·K((y_i − y)/h) dy ] / [ Σ_{i=1}^n K((x_i − x)/h) ]

The only term that is not obvious here is the integral in the numerator. Let's look at it in detail.
Apply a change of variables so that ψ is the term inside the kernel: y = ψh + y_i (recall that since the kernel is symmetric, K((y_i − y)/h) = K((y − y_i)/h)). We also have dy = h·dψ. Then we can write:
∫ y·K((y_i − y)/h) dy = ∫ (ψh + y_i)·K(ψ)·h dψ
and separating we have:
h² ∫ ψ·K(ψ) dψ + h·y_i ∫ K(ψ) dψ = h·y_i
from the properties of the kernel. Finally, plugging this expression back into the
mean regression estimator we get:
m̂_Y(x) = [ Σ_{i=1}^n K((x_i − x)/h) · y_i ] / [ Σ_{i=1}^n K((x_i − x)/h) ]
Note that a kernel regression estimator is only a valid estimator for m(·) in a local
neighborhood of size h.
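A minimal Python sketch of the NW estimator above, using the Epanechnikov kernel recommended later in this section (all names are illustrative):

```python
import numpy as np

def epanechnikov(psi):
    """Epanechnikov kernel: (3/4)(1 - psi^2) on [-1, 1], else 0."""
    return np.where(np.abs(psi) <= 1, 0.75 * (1.0 - psi ** 2), 0.0)

def nadaraya_watson(x, X, Y, h):
    """Kernel-weighted average of Y over the h-neighborhood of x."""
    w = epanechnikov((X - x) / h)
    return np.sum(w * Y) / np.sum(w)  # undefined if no observations near x
```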
Consider first regressing Y on a constant only; the OLS estimator is

α̂ = (ι′ι)⁻¹ ι′Y = Ȳ

where ι is an n-dimensional vector of ones. This means that this OLS estimation is equivalent to fitting a constant (the average of Y) globally on the model. Now, if you consider the NW estimator, you should see a relation between the two. In fact, within the neighborhood of x, the two estimators are exactly the same. Hence, intuitively, you can see the NW estimator as fitting a constant locally at every x. To see that, reweight the data by K((x − X_i)/h)^{1/2}: with the uniform kernel, observations in the neighborhood get weight 1 while the others get 0, and the NW estimator is simply the average of Y over the observations inside the neighborhood.
We have seen that the intuition behind the NW estimator is to fit a constant locally. Naturally, one could extend this line of reasoning and fit more complex models inside the kernel window. In particular, a well-studied extension is to fit a line, i.e., a local linear model. This type of model is usually called a local OLS model and is represented by:
Y = m(x)·ι̃ + h·m′(x)·(X_i − x)/h + U = X̃·β(x) + U
Note that increasing the order of the polynomial used to fit the model locally does not change the value of m(x) itself, but rather adds information about higher-order derivatives of m at the point x. For example, fitting a line locally gives the level of m at x as well as its slope at x.
Definition 1.4 (Local OLS estimator). For a given model of two variables Y and X such that Y = m(x)·ι̃ + h·m′(x)·(X_i − x)/h + U = X̃·β(x) + U, the local OLS estimator of the function m_Y(x) is defined as:

β̂(x) = (X̃′X̃)⁻¹ X̃′Ỹ
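A sketch of the local OLS estimator of Definition 1.4, implemented as kernel-weighted least squares on the design matrix X̃ = [1, (X_i − x)/h] (a standard way to carry out the reweighting described above; names are illustrative):

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Weighted OLS of Y on (1, (X - x)/h) with kernel weights.
    beta[0] estimates m(x); beta[1] estimates h * m'(x)."""
    psi = (X - x) / h
    w = np.where(np.abs(psi) <= 1, 0.75 * (1.0 - psi ** 2), 0.0)  # Epanechnikov
    Xt = np.column_stack([np.ones_like(X), psi])                  # X-tilde
    WX = Xt * w[:, None]                                          # weight each row
    beta = np.linalg.solve(Xt.T @ WX, WX.T @ Y)
    return beta
```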
which is very similar to the formula derived for the bias of the kernel density
estimator. If we also look at the variance, we get:
Var[β̂₀|X] = Var[m̂(x)|X] = (1/(nh)) · κ₂·σ²_U/f_X(x) + o_p((nh)⁻¹)
which is slightly different from the kernel density counterpart. In the regression case, f_X(x) enters the denominator, so a higher density of observations at x decreases the variance; in density estimation, f_X(x) enters the numerator, so the variance is actually higher where the density is higher.
Asymptotic normality
Under regularity conditions similar to those for the density estimator, the regression estimator (properly centered and scaled) converges in distribution to a normal as the number of observations goes to ∞. The rate of convergence is also √(nh) in this setting.
Curse of dimensionality
Again, as with the KDE, the kernel regression estimator faces the curse of dimensionality as the number of regressors k increases. The rate of convergence is then √(nhᵏ).
Same as KDE.
Order of local polynomial
Moreover, fitting local polynomials achieves bias reduction in the same way that higher-order kernels do, but without the cost of putting negative weight on some observations. This is why local polynomials of higher order are generally considered preferable to higher-order kernels.
Selection of bandwidth
There are two schools of thought when it comes to bandwidth selection in the case of kernel regression.
We have seen in the kernel density estimation section that we chose the bandwidth to minimize the mean integrated squared error. In the context of kernel regression, the MISE does not have an analytic expression, so we approximate it by the asymptotic MISE:

AMISE(h) = ∫ { [ (h²/2)·µ₂·m″_Y(x) ]² + (1/(nh))·κ₂·σ²_U/f_X(x) } dx

and then minimize over h.
It turns out that choosing h to minimize the APE (the average prediction error used in cross-validation) is asymptotically equivalent to minimizing the MISE.
Choice of kernel
Use Epanechnikov.
1.3.6 Testing
1.3.7 Applications
Consider the case where Y, the dependent variable, only takes the values 1 or 0, such that:

Y = 1 if Xβ + U > 0, and Y = 0 otherwise,

i.e. Y = I{Xβ + U > 0},
where U is assumed to be independent of X, U ⊥ X. As outlined in the beginning
of this section, we look for an estimator for the expectation of Y given X = x.
Since Y is now a discrete (Bernoulli) random variable, we can write:

E[Y|X = x] = Pr[Y = 1|X = x] = Pr[U > −xβ] = 1 − G(−xβ)

where G(·) is the cdf of U. Note that the index enters only through the product xβ, so β is at most identified up to scale: multiplying β by a constant and rescaling the distribution of U leaves the conditional probability unchanged.
In order to solve this issue, we can impose a restriction on the size of β such
that we can single out a parameter from all proportional parameters. We call this
restriction a normalization.
This normalization turns out not to affect the economic meaningfulness of the
model. In fact, we have just seen that G(·) is perfectly identified, but identification
of β, although an advantage, is not necessary. To see that, consider another object
of interest in this model:
∇_x E[Y|X = x] = β · g(−xβ)

where g denotes the pdf of the distribution of U. Then, define the set of objects we want to estimate as θ = {G(−Xβ), β·g(−Xβ)}. Now let β̃ = β/c and G̃(x) = G(cx), so that g̃(x) = c·g(cx). From this we get G(−Xβ) = G̃(−X β̃) and β·g(−Xβ) = β̃·g̃(−X β̃). Therefore θ̃ = θ: whatever the value of c, our set of objects of interest does not change.
Chapter 2
Treatment Effects
2.1 Intuition
Y = φ(D, A)
where φ(·) is a very general function of the data, not assumed to be differentiable; D is the discrete (binary) variable indicating whether the treatment was administered (1) or not (0); and A is a potentially infinite-dimensional error.
We denote Y1 and Y0 as respectively the values of the outcome for each different
treatment:
Y1 = φ(1, A); Y0 = φ(0, A)
Note that the two potential outcomes are never observed together for the same individual. The observations in the data are realized outcomes depending on the realization of the random variable A. Thus, the function φ that generates the data can never be observed directly.
Ideally, we want to recover the effect of the treatment, i.e., how the outcome changes when D switches from 0 to 1, for a given A = a (an individual). We call this the individual treatment effect:
Y1 − Y0 = φ(1, A) − φ(0, A)
which varies for any A, across the population. However, knowing the effect for
any individual might not be that useful in practical terms. In fact, when designing
a policy or evaluating programs, you might be interested only in a subgroup of
people, or the population as a whole, but rarely about each individual. This is why
we might be more interested in the Average Treatment Effect (or ATE), defined
as:
AT E ≡ E [Y1 − Y0 ] = E [φ(1, A) − φ(0, A)]
the average of the individual treatment effect across all individuals. One could also look directly at a subgroup of interest, say the average treatment effect on the treated (ATT), i.e.

ATT ≡ E[Y₁ − Y₀ | D = 1]
Finally, in the same line of reasoning, one could separate subpopulations in terms of the endogeneity of their response to the treatment, using estimators we'll study later such as LATE, MTE, etc. All these estimators start from writing the outcome as:

Y = Y₀ + (Y₁ − Y₀)·D

where Y₀ becomes a random intercept and Y₁ − Y₀ a random slope β(A). The ATE is then the average random slope of the model: ATE = E[β(A)].
2.2 Identification
If Y₁ and Y₀ were known for the whole population under study, there would not be a whole field dedicated to computing the ATE; averaging over a simple subtraction would be quite easy. However, for any individual i, only one of the outcomes can be observed at a given time: either the individual received the treatment (Y₁ᵢ is observed) or they did not (Y₀ᵢ is observed). Because of this, we will have to make assumptions on the unobservables to make progress.
In particular, the first assumption we ought to make is the so-called “joint full inde-
pendence” of outcomes with respect to treatment. Formally, we write: (Y1,Y0 ) ⊥ D,
meaning that jointly, Y1 and Y0 are fully independent from D. This also implies
that A ⊥ D.
Intuitively, this assumption (denoted A2) means two things. First, that everything
not observed by the econometrician (A) is independent of the treatment D, i.e.
receiving the treatment or not does not change the unobserved variables that
might affect the outcome of the treatment. Second, the unobserved variable A
has no effect on the treatment being delivered or not, i.e. the treatment is purely
random, even on unobserved characteristics.
2.2.2 Unconfoundedness
Although the previous assumption allows for some very interesting results, it
requires a lot of effort to ensure. In fact, the assumption requires perfect random-
ization of the treatment assignment. This setting is called a perfect experiment, but
it is not so common in research, as it is hard to randomize, and/or make sure that
everyone follows the instructions. Nevertheless, we can study a more realistic setting where independence holds conditional on some observables X: (Y₁, Y₀) ⊥ D | X. This conditional version of A2 is known as unconfoundedness.
Using this assumption and following the same reasoning as with joint full independence, we can come up with the Conditional Average Treatment Effect (CATE):

CATE(x) ≡ E[Y₁ − Y₀|X = x] = E[Y|D = 1, X = x] − E[Y|D = 0, X = x]

and, averaging over the distribution of X, the ATE:

ATE = ∫ ( E[Y|D = 1, X = x] − E[Y|D = 0, X = x] ) · f_X(x) dx
Estimation
This equation for the ATE should really ring a bell if you have followed the
last chapter. In fact, both elements within the integral can be estimated with
nonparametric (kernel) regression. However, applying this type of regression directly to the problem throws you straight into the curse of dimensionality (the expectations condition on both D and X, the latter potentially being multidimensional as well).
Practical issues
Using the results of the last chapter, we know how to estimate m(·). Nevertheless, the setting derived just above differs slightly from before in that the object of interest, the ATE, is now an average over kernel regression estimators. Among other things, this changes how we interpret the optimal bandwidth. Since we are now averaging, we can tolerate smaller bandwidths without being too scared of the effect on variance (averaging reduces variance). Because a smaller h is no longer that costly, the cross-validation approach no longer delivers the best bandwidth, so we have to use different approaches. In particular, the field has come up with two interesting ones: (1) propensity score matching and (2) direct averages.
Propensity score matching is very intuitive and maps to a sort of nearest-neighbor estimator. The idea is that for any individual i in the control group (with propensity score pᵢ), you find the individual i′ in the treatment group such that i′ ∈ arg min_{j∈I₁} |pᵢ − pⱼ|, where I₁ is the set of individuals who received the treatment. In words, you "match" every individual in the control group with at least one individual in the treatment group, based on the proximity of their propensity scores. Then, for each pair you compute the difference in their outcomes, and finally average over all pairs to get the ATE. The advantage of this estimator is that as n increases, the matches become closer and closer. One disadvantage, however, is that even with an infinite number of individuals, the bias of this estimator does not vanish.
The second approach of direct averages uses a clever rewriting of the problem
such that the ATE is defined as:
ATE ≡ E[ (D − p(X))·Y / (p(X)·(1 − p(X))) ]

which suggests the simple sample counterpart:

ÂTE ≡ (1/n) Σᵢ (Dᵢ − p̂(Xᵢ))·Yᵢ / (p̂(Xᵢ)·(1 − p̂(Xᵢ)))
where p̂(·) can be any first-stage estimator of the propensity score (non-parametric,
probit, etc.).
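A minimal sketch of the direct-average estimator, assuming a first-stage propensity score p̂(Xᵢ) has already been computed (array names are hypothetical):

```python
import numpy as np

def ipw_ate(Y, D, p_hat):
    """Sample counterpart of E[(D - p(X)) * Y / (p(X)(1 - p(X)))].
    p_hat comes from any first-stage estimator (probit, kernel, ...)."""
    return np.mean((D - p_hat) * Y / (p_hat * (1.0 - p_hat)))
```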
The RDD is another setup used to analyze treatment effects conditional on covariates. The idea is quite simple and intuitive: it relies on an existing discontinuity in treatment selection (who gets it and who does not) to study the effect of the treatment. In simpler words, if the only discontinuity in the data is whether a treatment was received or not (all other variables are continuous), then by studying the response of people around the discontinuity, you can identify the effect of the treatment.
For example, suppose an "honors" program is offered only to students scoring above a threshold. Assuming people close to the threshold (on both sides) are similar in ability, we can study the effect of the "honors" program by looking at the average difference in outcomes between people on the two sides of the threshold.
Model
Y = Y₀ + (Y₁ − Y₀)·D,  with D = I{X ≥ c}

or in words, the total outcome Y is equal to the control outcome (Y₀) plus the difference between the treatment and control outcomes (Y₁ − Y₀) in case the treatment was administered, which is the case if and only if X ≥ c.
Moreover, assume that the outcome Yⱼ, conditional on both X and A, is the same within the infinitesimal neighborhood of the threshold; formally, x ↦ E[Yⱼ(x, A)|X = x] is continuous at x = c.
Then, we have that:

lim_{x→c⁺} E[Y|D = 1, X = x] − lim_{x→c⁻} E[Y|D = 0, X = x]
= E[Y₁(c, A)|X = c] − E[Y₀(c, A)|X = c]
= E[Y₁(c, A) − Y₀(c, A)|X = c] = CATE(c)
This technique gives us the conditional average treatment effect based on being
around the threshold. For that reason, it cannot be used to recover the global aver-
age treatment effect (the AT E), even using the techniques developed above. One
should always keep in mind that the RDD model only applies for the neighborhood
of the discontinuity.
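A sketch of the simplest local estimate of CATE(c): compare average outcomes in an h-window on each side of the cutoff (one could instead run local OLS on each side, as in the nonparametrics chapter; names are illustrative):

```python
import numpy as np

def rdd_cate(Y, X, c, h):
    """Difference of mean outcomes just above and just below the cutoff c."""
    above = (X >= c) & (X < c + h)   # treated side
    below = (X < c) & (X >= c - h)   # control side
    return Y[above].mean() - Y[below].mean()
```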
2.2.4 Endogeneity
All three of the previous methods to compute the average treatment effect or the
conditional average treatment effect rely on some version of assumption 2 (A2)
which is correct only in the case of conditional or unconditional exogeneity of
the treatment. However, in most applications, while selection of the treatment
could be perfectly random, individual compliance with the selected treatment
is not guaranteed. In fact, if you consider the effect of a training program for
unemployed individuals, some people could be randomly selected to participate
in a program, but decide not to do it. In order to control for that, we need a model
that allows for endogenous selection.
Model
(2nd stage): Y = Y0 + ∆ · D
(1st stage): D = I{ψ(Z,V) > 0}
The first-stage equation describes the choice of participation in the treatment:
given some exogenous stimulus Z and unobservables V (that the individual ob-
serves, but not the econometrician), if ψ(Z,V) > 0, then the individual participates
in the program, else, he does not.
The instrument Z can have one or more dimensions, but a major question in this literature is whether Z should include at least one discrete variable or at least one continuous one. In Angrist and Imbens's view, the most convincing instrument is a single binary IV; in Heckman's view, a continuous IV does the job well enough.
Binary IV
The application of binary IVs comes with four definitions that should be understood perfectly before going on.

Definition 2.1 (Classification of individuals). There are four classes of individuals in a given program evaluation framework. The classification relies on the individual's participation behavior (D) as a function of the binary instrument (Z):
• compliers participate if and only if Z = 1;
• always-takers participate regardless of Z;
• never-takers do not participate, regardless of Z;
• defiers participate if and only if Z = 0.
Now, define D0 = I{ψ(0,V) > 0} and D1 = I{ψ(1,V) > 0}. We can write the
first-stage equation as:
D = (1 − Z)D0 + Z D1 = D0 + (D1 − D0 ) · Z
and thus the second-stage equation as:
Y = Y0 + D0 · ∆ + (D1 − D0 ) · ∆ · Z
which implies that:
E [Y |Z = 1] − E [Y |Z = 0] = E [(D1 − D0 ) · ∆]
This last term can be simplified with assumptions about the presence of some types of individuals in the sample. Dividing the last term across the groups defined above (the terms with D₁ − D₀ = 0 drop out):

E[(D₁ − D₀)·∆] = Pr[D₁ − D₀ = 1]·E[∆|D₁ − D₀ = 1] − Pr[D₁ − D₀ = −1]·E[∆|D₁ − D₀ = −1]

and assume that there are no defiers in the sample, formally, that Pr[D₁ − D₀ = −1] is equal to 0. Then, we get:

E[∆|D₁ − D₀ = 1] = ( E[Y|Z = 1] − E[Y|Z = 0] ) / Pr[D₁ − D₀ = 1]

which is the average treatment effect for the compliers, the Local Average Treatment Effect (LATE).
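In the data, Pr[D₁ − D₀ = 1] equals E[D|Z = 1] − E[D|Z = 0] under independence and no defiers, giving the familiar Wald ratio; a minimal sketch (names illustrative):

```python
import numpy as np

def wald_late(Y, D, Z):
    """LATE with a binary instrument:
    (E[Y|Z=1] - E[Y|Z=0]) / (E[D|Z=1] - E[D|Z=0])."""
    dy = Y[Z == 1].mean() - Y[Z == 0].mean()
    dd = D[Z == 1].mean() - D[Z == 0].mean()  # = Pr[complier] with no defiers
    return dy / dd
```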
This estimator has been heavily criticized because it depends on the instrument chosen: the subpopulation of interest (the compliers) can change if Z is different. For example, consider the unemployment training program, where the instrument is a $500 coupon sent to selected individuals. For a higher coupon value, say $1000, the set of compliers would surely change, making the estimator very different.
Continuous IV
Now, assume that the instrument is continuous. We have the following two-stage
model:
(2nd stage): Y = Y0 + ∆ · D
(1st stage): D = I{p(Z) > V }
where p(·) is the propensity score as used in the previous sections. First, note
that in this context, the no defiers condition in the binary IV case is equivalent to
the threshold structure in the first-stage of this model. Second, one could assume
wlog that V ∼ U[0, 1].
This strategy has also been heavily criticized, this time because it should be impossible to observe propensity scores of exactly 1 or 0. In fact, if one uses a parametric model to estimate p(·), identification would only come for Z = ±∞. We call this issue identification at infinity.
As we have seen, the main object of interest in the previous sections was an "average" treatment effect. While this is important, one could also be interested in the impact of the treatment across the whole distribution of outcomes, not only the mean. For example, questions such as whether a treatment increases or decreases inequality can only be answered by looking at the entire distribution. There are multiple ways to study such questions; three of them are covered here: experimental settings (where everyone has to participate), quantile regressions and IV quantile regressions.
With full participation in a randomized experiment, F_{Y₁}(y) = F_Y(y|D = 1),

which is the cdf of outcomes considering only the treated individuals. In a similar way, we get F_{Y₀}(y) = F_Y(y|D = 0), the cdf of outcomes in the control group. These two objects can be estimated by their frequency analogs:

F̂_{Y₁}(y) = (1/n₁) Σ_{i: Dᵢ=1} I{Yᵢ ≤ y}   and   F̂_{Y₀}(y) = (1/n₀) Σ_{i: Dᵢ=0} I{Yᵢ ≤ y}

where n₁ and n₀ are the sizes of the treatment and control groups.
Comparing distributions
In order to compare the two estimated distributions, there are two main routes: first, we can compare the social welfare derived from each; second, we can look at their stochastic ordering.
The social welfare analysis relies on computing (estimating) the value of social
utility, based on the utilitarian social welfare function defined as:
W(u, F) = ∫ u(y)·f(y) dy
where u(·) is the utility derived from the outcome y. Once this object is estimated, we want to check whether W(u, F_{Y₁}) ≥ W(u, F_{Y₀}). However, the utility function is not known to the researcher. We typically assume that u′ ≥ 0 (non-decreasing utility) and u″ ≤ 0 (concave utility), but fundamentally the choice of u cannot be pinned down.
Definition 2.2 (Distribution rankings). Let F_{Y₁} and F_{Y₀} be the cdfs of the treatment and the control groups respectively.
• (EQ): The distributions are equal if and only if F_{Y₁}(y) = F_{Y₀}(y) for all y.
⇒ This property implies that for any utility function u(·), W(u, F_{Y₁}) = W(u, F_{Y₀}).
• (FOSD): A distribution (F_{Y₁}) first-order stochastically dominates another (F_{Y₀}) if and only if F_{Y₁}(y) ≤ F_{Y₀}(y) for all y.
⇒ This property implies that for any non-decreasing utility function u(·), W(u, F_{Y₁}) ≥ W(u, F_{Y₀}).
• (SOSD): A distribution (F_{Y₁}) second-order stochastically dominates another (F_{Y₀}) if and only if ∫_{−∞}^{y} F_{Y₁}(x) dx ≤ ∫_{−∞}^{y} F_{Y₀}(x) dx for all y.
⇒ This property implies that for any non-decreasing and concave utility function u(·), W(u, F_{Y₁}) ≥ W(u, F_{Y₀}).
Kolmogorov-Smirnov Test
We have seen just above several ways to compare two outcome distributions theoretically. However, because one never actually observes these distributions, we need a way to use the data to say something about the rankings between them. One way to empirically test whether distributions are EQ, FOSD or SOSD is the so-called Kolmogorov-Smirnov test. This test is divided into three parts, each testing one particular relation.
Definition 2.3 (Kolmogorov-Smirnov Test). Let Y1 and Y0 be the outcomes of a
randomized experiment (i = 1 defines the treatment group while i = 0 is the control
group).
In words, this test looks at the value of y for which the difference between the two distributions (in the measure of interest, depending on the null) is greatest. If the variable Y is continuous, then the distribution of T_h (where h indexes the null hypothesis) is known. However, if Y is not continuous (there is positive probability mass at, say, Y = 0), then one should use a bootstrap technique to test the hypothesis.
The bootstrap method is quite simple to grasp: by computing the test statistic on resampled sets of observations a large number of times, one can estimate the actual distribution of the test statistic, and determine whether the full-sample test statistic is statistically different. The method follows four steps:
1. Compute the test statistic T_h on the original sample.
2. Assuming the null hypothesis, resample your data and compute the new statistic T_h^b. For example, if the null is the EQ hypothesis, pool both groups and resample as if they were the same.
3. Repeat step 2 a large number of times, B, to approximate the distribution of T_h under the null.
4. Reject the null if the original T_h falls in the upper tail of that distribution (e.g., above the 95th percentile).
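A sketch of the bootstrap for the EQ null: pool the two groups, resample, and recompute the sup-distance between the empirical cdfs (B and the seed are arbitrary choices; names are illustrative):

```python
import numpy as np

def ks_stat(y1, y0, grid):
    """KS statistic for the EQ null: sup over the grid of |F1 - F0|."""
    F1 = np.searchsorted(np.sort(y1), grid, side="right") / y1.size
    F0 = np.searchsorted(np.sort(y0), grid, side="right") / y0.size
    return np.max(np.abs(F1 - F0))

def bootstrap_pvalue(y1, y0, B=999, seed=0):
    """Resample under the EQ null (pool both groups), return a p-value."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([y1, y0])
    grid = np.sort(pooled)
    t_obs = ks_stat(y1, y0, grid)
    t_boot = np.empty(B)
    for b in range(B):
        s = rng.choice(pooled, size=pooled.size, replace=True)
        t_boot[b] = ks_stat(s[:y1.size], s[y1.size:], grid)
    return np.mean(t_boot >= t_obs)  # share of resamples at least as extreme
```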
Recall that our main objective in this section is to move beyond the average treatment effect when studying the effect of a program. In the previous subsection, we looked at the entire distribution of outcomes. In this subsection and the following one, we are interested in the effect on a given quantile of the distribution. This analysis is useful when a program targets a particular subpopulation defined by a quantile, say the poorest 25% of unemployed individuals.
In order to do that, we need to assume that for any quantile θ, the θ-quantile of the distribution of outcomes Y is linear in D (participation in the treatment) and X (observable covariates), such that:

Q_θ(Y|D, X) = α₀·D + X·β₀
Moreover, we will assume some version of our beloved assumption A2; that is, either the treatment is perfectly randomized, or it is randomized based on observables X. In that case, (α₀, β₀) satisfy:

(α₀, β₀) = arg min_{α,β} E[ρ_θ(Y − αD − Xβ)]

where ρ_θ(x) ≡ x·(θ − I{x < 0}) works as a weighting function. To see this, consider the quantile θ = 0.25; then

ρ_{0.25}(x) = 0.25·x for x ≥ 0  and  ρ_{0.25}(x) = −0.75·x for x < 0
meaning that observations below the fitted hyperplane (where Y − αD − Xβ < 0) are penalized with weight 0.75, while observations above it are only penalized with weight 0.25. To keep the heavier penalty rare, (α̂₀, β̂₀) is pushed toward the lower part of the distribution, leaving about 25% of observations below the fit: this yields the θ = 0.25 quantile parameters.
The weighting function works in the same way for other quantiles θ: residuals above the fitted hyperplane are weighted by θ and residuals below it by 1 − θ.
Now, in order to recover the estimated parameters (α̂₀, β̂₀), we minimize the sample analog of the expectation:

(α̂₀, β̂₀) = arg min_{α,β} (1/N) Σ_{i=1}^N ρ_θ(Yᵢ − αDᵢ − Xᵢβ)
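A sketch of this minimization using the check function directly; a derivative-free optimizer is used because ρ_θ is kinked at zero, and for serious work one would rely on a dedicated quantile regression routine (all names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def rho(u, theta):
    """Check function: rho_theta(u) = u * (theta - 1{u < 0})."""
    return u * (theta - (u < 0))

def quantile_reg(Y, D, X, theta):
    """Minimize the sample average of rho_theta(Y - alpha*D - X*beta)."""
    Z = np.column_stack([D, X])                     # regressors: D first, then X
    obj = lambda par: rho(Y - Z @ par, theta).mean()
    res = minimize(obj, x0=np.zeros(Z.shape[1]), method="Nelder-Mead")
    return res.x                                    # (alpha_hat, beta_hat, ...)
```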
Distributional tests
Quantile regression
Often, there are reasons to believe that the treatment and control groups differ
in characteristics that are unobserved, and that could be correlated with the
outcomes, even after controlling for observed characteristics. In this particular
case, a direct comparison between groups is not possible.
For example, consider the following question: how do inflows of immigrants affect the wages and employment of natives in local labor markets? To study this question, Card (1990) uses the Mariel Boatlift of 1980 (a massive migration of Cubans through the harbor of Mariel) as a natural experiment to measure the effect of a sudden influx of immigrants on unemployment. To measure this effect, he looked at unemployment data for Miami and four comparison cities (Atlanta, Los Angeles, Houston and Tampa). As stressed above, those cities clearly differ in characteristics that cannot be fully observed, and these differences will at least partially determine the dynamics of the labor force.
2.5.1 Setup
Again, following the same intuition as in the previous sections, we are interested in measuring the effect of the treatment. In particular, consider the average treatment effect on the treated, that is, the difference between the treatment and control outcomes at time 1 for a treated individual i: E[Y₁ᵢ(1) − Y₀ᵢ(1)|D = 1]. However, as we know, only one outcome can be observed, since an individual cannot be subject to both the treatment and the control at the same time. In fact, we can only observe the following variables:

                    Post-treatment (T = 1)        Pre-treatment (T = 0)
Treatment (D = 1)   Y₁ᵢ(1) for all i : Dᵢ = 1     Y₀ᵢ(0) for all i : Dᵢ = 1
Control (D = 0)     Y₀ᵢ(1) for all i : Dᵢ = 0     Y₀ᵢ(0) for all i : Dᵢ = 0
This means that we are missing the potential outcome Y0i (1) for the treated, i.e.
what would have been the outcome if they did not get treated. From there, we
can use three strategies.
First, we could assume that, in expectation, for the treated (D = 1), the outcome of not being treated at time 1 is the same as the outcome of not being treated at time 0; that is, we are assuming that without treatment, the average outcome does not change over time. More formally: E[Y₀(1)|D = 1] = E[Y₀(0)|D = 1]. Using that fact, we can recover the ATT as defined above:

E[Y₁ᵢ(1) − Y₀ᵢ(1)|D = 1] = E[Y₁ᵢ(1)|D = 1] − E[Y₀ᵢ(1)|D = 1]
= E[Y₁ᵢ(1)|D = 1] − E[Y₀ᵢ(0)|D = 1] by assumption.
Obviously, one should first consider if this assumption would make sense in the
setting studied.
Treated = control identification
Second, we could assume that in expectations, for the treated, the outcome of not
having been treated at time 1 is the same as the outcome of not having been treated
at time 1 for the control. That is, we are assuming that without the treatment
being administered at time 1, both groups would be the same. Or more formally,
E[Y₀(1)|D = 1] = E[Y₀(1)|D = 0]. Again, using that fact, we can recover the ATT as defined above:

E[Y₁ᵢ(1) − Y₀ᵢ(1)|D = 1] = E[Y₁ᵢ(1)|D = 1] − E[Y₀ᵢ(1)|D = 0] by assumption.
Obviously, one should first consider if this assumption would make sense in the
setting studied.
DiD identification
Finally, one could assume that in expectations, the effect of not receiving the
treatment at time 1, is the same for the treated and control groups. That is, if
the treated did not get the treatment, they would have seen the same evolution
in their outcome. Formally, we are assuming that: E [Y0 (1) − Y0 (0)|D = 1] =
E [Y0 (1) − Y0 (0)|D = 0]. And using this assumption, we can recover the ATT as
defined above:
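In sample terms (anticipating the estimation section below), this is computed by differencing group means; a minimal sketch with D the group dummy and T the period dummy (names hypothetical):

```python
import numpy as np

def did_att(Y, D, T):
    """(treated post - treated pre) - (control post - control pre)."""
    d_treated = Y[(D == 1) & (T == 1)].mean() - Y[(D == 1) & (T == 0)].mean()
    d_control = Y[(D == 0) & (T == 1)].mean() - Y[(D == 0) & (T == 0)].mean()
    return d_treated - d_control
```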
2.5.2 Estimation by sample means
Panel data
Repeated cross-sections
Panel data
Repeated cross-sections
Compositional differences
Non-parallel dynamics
For any type of data, we should also worry about how our main assumption for the DiD estimator could fail. In particular, if the dynamics of the outcome depend on unobservables, the assumption fails. To test for that, we can run a falsification test by applying a DiD analysis to the period before the first period (period −1 to 0). If our assumption holds, this placebo DiD estimator should not be statistically different from 0.
Chapter 3
3.1 Motivation
For a binary outcome Y ∈ {0, 1},

E[Y|X] = 1·Pr[Y = 1|X] + 0·Pr[Y = 0|X] = Pr[Y = 1|X]

so the conditional expectation is a probability regression on X.
The linear probability model (using OLS) would be to estimate the probability as
a linear function of variables:
Pr [Y = 1|X] = X β
However, this model can give results inconsistent with the qualitative interpretation of Y: the fitted probability can be negative or above one. One solution to this problem is to assume that Y depends on the inputs in the following way:

Y = I{Xβ + U > 0}
Therefore we have that:

Pr[Y = 1|X] = Pr[Xβ + U > 0|X] = Pr[U > −Xβ] = 1 − G(−Xβ)

where G(·) is the cdf of U. We can then estimate this probability using the properties of the distribution of U. Given that we are now estimating a probability with probabilities, we escape the issue of finding results lower than 0 or greater than 1. In the rest of the section, we explore three possible ways to assume or estimate the distribution of U.
3.1.1 Probit
The probit model is the binary choice estimator assuming that U is normally distributed with mean zero and variance v, normalized to v = 1 (the scale normalization discussed earlier). Define Φ(z) as the standard normal cdf:

Φ(z) = (1/√(2π)) ∫_{−∞}^{z} exp(−t²/2) dt

Therefore the probability of Y = 1 becomes Pr[Y = 1|X] = 1 − Φ(−Xβ) = Φ(Xβ), and the probability that Y = 0 is 1 − Φ(Xβ).
3.1.2 Logit
In the same way, instead of a normal distribution, one could use the logistic distribution. The two distributions are quite similar: the logistic has slightly fatter tails, but its cdf has a closed-form expression. Denote Λ(z) the standard logistic cdf:

Λ(z) = exp(z) / (1 + exp(z))

and thus the probability model yields Pr[Y = 1|X] = 1 − Λ(−Xβ) = Λ(Xβ) and Pr[Y = 0|X] = 1 − Λ(Xβ).
3.2 Estimation
Suppose that (Z1, ..., Zn ) is a random sample drawn iid from a distribution with
density f (Z, β), where β is not observed. Since the data is iid, we can compute
the joint density of the sample as a product of individual densities. Formally,
f(Z₁, ..., Zₙ; β) = ∏_{i=1}^n f(Zᵢ, β)
This object is referred to as the likelihood of the sample, implying that it is the
ex-ante probability of observing this exact draw of the joint distribution.
However, as stated above, since β is unknown, the likelihood of the sample cannot
be computed. For a given β = b, we could still compute a value of the likelihood.
Therefore, we could find the value of b that maximizes the likelihood of the sample,
or in other words, the value of b that would make this exact sample draw the
most likely ex-ante. This is the maximum likelihood estimator, β̂M L , defined as:
β̂_ML = arg max_b L(b) ≡ ∏_{i=1}^n f(Zᵢ, b)
It can then be proven that the maximum likelihood estimator is √n-CAN, meaning it is consistent and asymptotically normal, converging in distribution at a √n rate.
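A sketch of maximum likelihood for the logit model, writing the log-likelihood in a numerically stable form; scipy's generic optimizer stands in for the dedicated routines of a statistics package (names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(b, Y, X):
    """Logit: sum of Y*log(p) + (1-Y)*log(1-p), p = Lambda(Xb)."""
    xb = X @ b
    log_p = xb - np.logaddexp(0.0, xb)   # log Lambda(xb), stable
    log_1p = -np.logaddexp(0.0, xb)      # log(1 - Lambda(xb))
    return -(Y * log_p + (1 - Y) * log_1p).sum()

def logit_mle(Y, X):
    res = minimize(neg_log_likelihood, np.zeros(X.shape[1]),
                   args=(Y, X), method="BFGS")
    return res.x
```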
3.2.2 Interpretation
for a change of X_j from 0 to 1. Again, the sign of the effect is the same as the sign of β_j, since a cdf is monotone increasing, but its magnitude differs from β_j. More generally, the effect of a one-unit change in X_j is Φ(x′β + β_j) − Φ(x′β). The interpretation of these formulas is the same as in the models we know; however, the value of these effects depends on the level of the other inputs (x′ in the general formula above). In practice, they are evaluated at the average values of all other inputs.
3.2.3 Testing
The z-statistic and its associated p-value displayed by any program like Stata have exactly the same interpretation as their analogs in the OLS case: a coefficient is significant at the 5% level if its p-value is lower than 0.05, equivalently if its z-statistic is greater than 1.96 in absolute value.
In order to test for the significance of multiple parameters, we use the likelihood
ratio test. The null hypothesis H0 is that r coefficients are equal to 0. We want
to test it against H1 , that at least one is non-zero. In order to test this, we define
a “restricted” model, where the r coefficients are set to 0 (hence the restriction),
while the unrestricted model includes these parameters freely. We denote the
two sets of parameters as β̂R and β̂U respectively. Finally, we denote as L(·) the
likelihood function, yielding the value of the likelihood of the model, for a given
set of parameters.
Then, the likelihood ratio test is based on comparing the two likelihoods through the ratio R = L(β̂_U)/L(β̂_R), or equivalently its log:

LR ≡ 2 · [ ln L(β̂_U) − ln L(β̂_R) ] ~ χ²_r
We reject the null if LR > q.95 ( χr2 ), implying that the likelihood of the unrestricted
model is so high compared to the other that it must be different.
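A sketch of the test given the two maximized log-likelihoods (assumed already computed; names are illustrative):

```python
from scipy.stats import chi2

def lr_test(loglik_u, loglik_r, r):
    """LR = 2 * (lnL_U - lnL_R); reject H0 when LR exceeds the
    0.95 quantile of chi2(r), i.e. when the p-value is below 0.05."""
    LR = 2.0 * (loglik_u - loglik_r)
    return LR, chi2.sf(LR, df=r)  # statistic and p-value
```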
Sometimes, the dependent variable is not only categorical but also ordered, mean-
ing that it follows a progression in intensity. Examples include the number of
cars in a household, the highest educational degree attained, votes, indices of
democracy, etc.
In that case, we need to partition the range of the latent index so that each region corresponds to a particular value of Y. For example, suppose the number of cars in a household can be 0, 1 or 2. We need to estimate both the index and cutoffs c₁ < c₂ such that:

Y = 0 if Y* ≤ c₁;  Y = 1 if c₁ < Y* ≤ c₂;  Y = 2 if c₂ < Y*

where Y is the observed decision of the household and Y* = Xβ + U is the latent index. With U standard normal, the implied probabilities are:

Pr[Y = 0|X] = Φ(c₁ − Xβ)
Pr[Y = 1|X] = Φ(c₂ − Xβ) − Φ(c₁ − Xβ)
Pr[Y = 2|X] = 1 − Φ(c₂ − Xβ)
3.3.2 Censored Regression (Tobit)
3.4.1 Motivation
Y = A + XB
such that (A, B) ⊥ X. In this model, recall that A and B are random variables (not
fixed parameters as we usually see in econometrics). In that sense, their value
will be different for any individual observation. That is why we are not interested
in point estimates of their value, but rather in their joint distribution f AB (·).
3.4.2 Identification
The main question of this subsection is therefore, how can we identify the joint
density of A and B, given the known (observed) conditional density of Y given X?
Characteristic functions
Before going further, let's recall the definition and properties of characteristic functions. The characteristic function is an alternative way (to pdfs and cdfs) of describing a random variable X. In the same way that F_X(x) = E[I{X ≤ x}] completely determines the behavior and properties of X, so does the characteristic function φ_X(t), defined as:

φ_X(t) = E[exp(i·tX)]
We can also define the characteristic function of a joint density. Consider the random vector X = [X₁ X₂]′ with joint density f_X(·). Then the joint characteristic function φ_X(t), where t = [t₁ t₂]′ is a vector of the same dimension as X, is defined as:

φ_X(t) = E[exp(i·t′X)] = E[exp(i·(t₁X₁ + t₂X₂))] = F(f_X(·))(t)

where F denotes the Fourier transform.
Identifying f_AB(·)
This means that we are one inverse Fourier transform away from the joint density of A and B. More formally:

f_AB(a, b) = F⁻¹( F(f_{Y|X}(·; X)) )(a, b)
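In practice, the characteristic function of the data can be estimated by its empirical analog, which is the starting point for a (regularized) Fourier inversion; a minimal sketch:

```python
import numpy as np

def empirical_cf(data, t):
    """Empirical characteristic function: (1/n) * sum_i exp(i * t * x_i)."""
    return np.exp(1j * t * data).mean()
```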
3.4.3 Estimation
3.4.4 Application
Chapter 4
Panel Data
4.1.1 Setup
Recall a simple multivariate linear model (the one we use all the time) such
that: Y = α + X β + U where U is an unobservable term. Under the Gauss-
Markov assumptions, we have seen that the OLS estimator is BLUE, and the
model is easily identified. In that model, we assumed that each observation in
the data corresponded to a different “individual”. Now suppose that we observe a
population repeatedly over time and that the same linear model applies but to
each period, separately. We end up with a system of equations, the panel, indexed
by t, such that:
Y_1 = α + X_1 β + U_1
⋮
Y_T = α + X_T β + U_T
where we observe the joint distribution of (X1, ..., XT ,Y1, ...,YT ), for each individual.
Note that this system describes the general population model; in reality, there are
N such systems of equations for N individuals, yielding N × T single equations.
Usually, T is way smaller than N, so we can keep using asymptotics of N → ∞,
however, if T is comparable to N, we could choose either way of performing
asymptotics. Note also that the usual unobservable U is now indexed by t, meaning
it is a time-process, or innovation term that follows a stochastic process.
In this new setup, there are two dimensions along which to look at the parameters of interest (α, β):
• Time variation: if a parameter is the same in all periods for a given individual, we say that it is time-invariant.
• Individual variation: if a parameter is the same across individuals in a given period, we say that it is common; otherwise it is individual-specific.
In the following sections, we will focus on models where there exists one parameter that is both time-invariant and individual-specific. To set this up, assume a panel model where the intercept is the parameter in question:

Y_t = α + X_t β + A + U_t

where A is the individual-specific, time-invariant deviation from the mean intercept α.
The Random Effects (RE) approach is the traditional way of dealing with the
parameter A in panel data. It relies on the interpretation that A, as the individual
deviation from the mean intercept α, is purely random, in the sense that it is
not correlated with observable features of the individuals contained in X. That’s
where the “random effects” name comes from: conditional on X, the term A
is a random effect on Y. Following this intuition, one can combine the two random unobservables A and U into one "total error" denoted V. Then, recalling the first-year econometrics class, we can apply the Feasible GLS (FGLS) method of estimation.
Recap of GLS
The RE model works in the same way, provided we can specify the variance matrix
of the total unobservable term (Vt ≡ A + Ut ). For that, we need a few assumptions.
Let the unindexed variables denote the vector of t-indexed variables and ιT be
a T-dimensional vector of ones. The system of equations for one individual can
then be rewritten as:
Y = Xβ + V

where V = A·ι_T + U and X (stacking the X_t) is a T × (k + 1) matrix (k regressors plus a constant, whose coefficient is α). We make a few further assumptions:
• Strict exogeneity of U_t, formally E[U_t|X, A] = 0. This means that U_t is mean-independent not only of X_t and A, but also of all past and future realizations of X. This assumption is the analog of U being purely random in the simple OLS case.
These assumptions are crucial to identify the sample counterpart of the Ω matrix in the GLS estimation. Writing Var[V|X] = Var[A·ι_T + U|X] and denoting Var[A] = σ²_A and Var[U_t] = σ²_U, every diagonal element is Var[A + U_t] = σ²_A + σ²_U and every off-diagonal element is Cov(A + U_t, A + U_s) = σ²_A, so that:

Ω = Var[V|X] = σ²_A·ι_T ι_T′ + σ²_U·I_T
As stipulated in the review of the GLS estimation, we now need a way to recover
an estimator for this variance matrix, which we could then plug in the FGLS
estimation procedure.
To do that, perform a simple OLS on the model, to recover V̂, the estimated
residuals. In the equation for the variance matrix, we have seen that all off-
diagonal terms were equal to σA2 . Therefore, we could use that fact to recover
a sample average of those terms, which will be our estimator σ̂A2 . Formally, we
have:
σ̂²_A = (1 / (T(T−1)/2)) · Σ_t Σ_{s>t} (1/n) Σ_{i=1}^n V̂_it · V̂_is
which is the average over all off-diagonal terms (there are T(T − 1)/2) of the
average value of these terms across the n individuals. Now that we have σ̂A2 , we
can use it to recover σ̂U2 , using the fact that the diagonal of Ω̂ is composed only
of σ̂A2 + σ̂U2 . Then,
σ̂²_U = (1/T) Σ_{t=1}^T (1/n) Σ_{i=1}^n V̂²_it − σ̂²_A
which is the average over all T diagonal terms of the average value of these terms
across n individuals.
Finally, we can compute Ω̂ = σ̂²_A·ι_T ι_T′ + σ̂²_U·I_T, plug it into the feasible GLS method, and the RE estimator is:

β̂_RE = [ Σ_{i=1}^n X_i′ Ω̂⁻¹ X_i ]⁻¹ [ Σ_{i=1}^n X_i′ Ω̂⁻¹ Y_i ]
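A sketch of the full FGLS recipe: pooled-OLS residuals in, variance components and β̂_RE out (array shapes and names are assumptions):

```python
import numpy as np

def random_effects(Y, X, V_hat):
    """RE via FGLS with Omega = sA2 * J + sU2 * I.
    Y: (n, T) outcomes; X: (n, T, k) regressors; V_hat: (n, T) OLS residuals."""
    n, T = V_hat.shape
    outer = np.einsum('it,is->ts', V_hat, V_hat) / n   # (T, T) avg of V V'
    # off-diagonal average -> sigma_A^2; diagonal average -> sigma_A^2 + sigma_U^2
    sA2 = (outer.sum() - np.trace(outer)) / (T * (T - 1))
    sU2 = np.trace(outer) / T - sA2
    Oinv = np.linalg.inv(sA2 * np.ones((T, T)) + sU2 * np.eye(T))
    A = sum(X[i].T @ Oinv @ X[i] for i in range(n))
    b = sum(X[i].T @ Oinv @ Y[i] for i in range(n))
    return np.linalg.solve(A, b)
```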
In the end, the RE model will give you, in exchange for strong assumptions, a way to estimate α, a time-invariant effect on Y. This can be very interesting when studying settings such as the effect of gender or race on outcomes. As we will see in the next section, the fixed effects estimator will not allow for estimating such effects.
The preceding approach relies heavily on the assumption that the individual effect A is purely random (uncorrelated with X); however, this view of the constant individual effect is less popular now. Another way to estimate the model is to use the time-invariance of the unobservable to cancel it out of the model. For that, we can use two methods: first-differencing and the fixed effects transformation. Note that in these models the effects are no less random than before; we are still talking about exactly the same effects. The naming convention is simply bad in that sense, so remember: fixed effects ARE random variables.
First-differences
Recall that if we have only two time periods, t = 1, 2, the model is:

Y_1 = α + X_1 β + A + U_1
Y_2 = α + X_2 β + A + U_2

Subtracting the first equation from the second eliminates both α and A:

Y_2 − Y_1 = (X_2 − X_1)β + (U_2 − U_1)

so β can be estimated by OLS on the differenced data.
Fixed-effects transformation

Alternatively, subtract from each variable its individual time average (e.g., Ÿ_t = Y_t − (1/T) Σ_s Y_s); this again eliminates α and A, leaving:

Ÿ = Ẍ β + Ü
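A sketch of the within transformation followed by pooled OLS on the demeaned data (shapes assumed: Y is n × T, X is n × T × k; names are illustrative):

```python
import numpy as np

def within_transform(W):
    """Subtract each individual's time average; W is (n, T) or (n, T, k)."""
    return W - W.mean(axis=1, keepdims=True)

def fixed_effects(Y, X):
    """Pooled OLS on demeaned data; alpha and A drop out."""
    Yd = within_transform(Y).reshape(-1)              # stack to (n*T,)
    Xd = within_transform(X).reshape(-1, X.shape[2])  # stack to (n*T, k)
    return np.linalg.lstsq(Xd, Yd, rcond=None)[0]
```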
FD vs. FE
4.1.4 Which approach to choose?
The two models designed around the existence of an invariant effect turned out to
be similar to simple methods like OLS or GLS. Nevertheless, whether we choose
one or the other relied on a single assumption: are individual unobservable,
time-invariant deviations A correlated with observables X? To sum up, it was
established that if no correlation is assumed, the RE estimator will be a consistent
and efficient estimator; and if a correlation was present, the FE estimator was a
consistent and efficient estimator. Moreover, we can clearly see that if we were
to use FE in place of RE, we would still get consistency, without efficiency. On
the other hand, doing the opposite and using RE instead of FE would not yield a
consistent estimator.
Using this intuition, we can derive a Hausman test to decide which model can be used. For that, define the Hausman test statistic as:

Γ̂ ≡ (β̂_FE − β̂_RE)′ · V̂ar[β̂_FE − β̂_RE]⁻¹ · (β̂_FE − β̂_RE) →d χ²_k

Under the null that both estimators are consistent (the RE assumptions hold), Γ̂ is asymptotically χ² with k degrees of freedom; a large value of Γ̂ leads us to reject RE in favor of FE.
In the case of nonseparable models such as binary choice models (probit, etc.), fixed effects estimation of panel data does not work out. Indeed, the presence of the time-invariant effect A inside a function g(·) makes it impossible to remove it by subtracting out averages over time. However, in the logit case, and only in the logit case, Chamberlain's approach yields √n-CAN estimators by considering only people who switch choices between periods (i.e., have different Y₁ and Y_T). Although interesting as an approach, the fact that it works only in the logit setting is not satisfactory. This section looks at how nonparametric estimation can help generalize the fixed effects model.
4.2.1 Setup
Assumption 1
4.2.3 Application
Chapter 5
5.1 Introduction
The statistics of big data are somewhat different from what we have seen in the previous chapters of this class. For that reason, we will need to define (or redefine) some concepts. First of all, what is big data? Consider a simple cross-section dataset. Define n as the number of observations and p as the number of variables observed within each record. Most often, we have worked with what we call "tall" data, meaning many observations n and not so many variables p. In the same way, "wide" data has p bigger than n. Both types of dataset are computationally demanding as they grow. "Big" data combines both, with big n and big p; as you can imagine, it is very computationally demanding and uses different techniques than what we are used to.
In the definitions of ML, we will not use the usual regression terms. Instead, we
call X the input variables (or predictors) while Y will be the output variable (or
response). Nevertheless, we still differentiate in the same way variables that are
quantitative (continuous), qualitative (categorical or discrete) and ordered; usually,
discrete binary variables will be set to 0/1 or -1/1 (using dummies). When Y is
quantitative, the ML naming convention will define prediction of Y as a regression.
If Y is instead qualitative or ordered, prediction is called classification. Note that this taxonomy of methods depends on the type of Y only; whether some or all X are of one type or the other does not affect it, even though it can call for different methods. Finally, it should be known that using Y in the prediction process is not mandatory. Using Y is closer in interpretation to what we have seen earlier in econometrics, but a fringe literature exists in ML where the task is instead to describe how the data are organized or clustered. This is called unsupervised learning, whereas using Y (as we'll mostly do) is called supervised learning.
As was just described, one goal of statistical learning (whether machine learning
or simple techniques like OLS) is to predict the outcome Y , given an input X. For
that, we have to make the very general assumption that Y is defined as a function
f (·) of input variables and potentially a random error ε. These two objects are
unknown to the econometrician. Prediction is thus all about finding a “good
enough” function fˆ such that the predicted outcome Ŷ = fˆ(X) is the “closest” to
the actual outcomes Y . Another goal of statistical learning could be to understand
the relationship between Y and X, the form of f or related questions. These issues
fall under the definition of inference.
Quantitative output
Let X ∈ ℝᵖ denote a real-valued random input vector and Y ∈ ℝ a real-valued random output variable. Both are linked by the joint distribution Pr[X, Y]. As
described earlier, prediction is about finding a function f (X) for predicting Y ,
given values of X. For that, we need to define a loss function, denoted L(·), that
will indicate how “close” our function is to the reality, by penalizing errors in
prediction.
One potential loss function can be the squared error function, defined as L(Y, f (X)) =
[Y − f (X)]2 . The criterion associated with that function is to choose f that mini-
mizes the expectation of the loss function. In detail, we have:
From that last equation, we can see that minimizing the EPE is equivalent to minimizing the conditional expectation of the loss function, given X. In fact, we choose:

f(x) = arg min_c E[(Y − c)²|X = x]

which has the solution f(x) = E[Y|X = x]. Since expectations are not directly available in the data, we need to estimate them using different methods (OLS, nonparametrics, etc.).
Categorical output
We have seen how the linear estimation model (OLS) has low variance but potentially high bias, while the nearest-neighbor method is at the opposite end of the spectrum, with high variance but low bias. We have also seen that this issue fades with bigger datasets (as n → ∞): as the number of observations increases, a given neighborhood around a point contains more and more points, reducing variance and allowing a smaller neighborhood, thus reducing bias. It turns out that this intuition falls completely flat for higher-dimensional input variables. This is called the curse of dimensionality. It can be understood and visualized in different ways.
The linear regression model assumes that the relation between output Y and inputs X is linear or approximately linear. Formally, this is equivalent to writing f(X) as:

f(X) = β₀ + Σ_{j=1}^p X_j β_j

where β is the vector of unknown parameters (which is what makes f unknown) and X_j can be anything from raw quantitative inputs, transformations of inputs (log X, √X), basis expansions (X², X³), dummy codings of qualitative inputs, to interactions between variables.
5.2.1 Least-squares
Recall that the least-squares method picks the vector β = (β0, ..., βp ) such that it
minimizes the residual sum of squares (criterion) given by RSS(β) = (Y − X β)′(Y − X β).
The first-order condition yields:

∂RSS/∂β = 0 ⇔ −2X′(Y − X β) = 0

and the second-order condition yields:

∂²RSS/∂β ∂β′ = 2X′X

so that, when X′X is positive definite, the minimizer is the familiar β̂ = (X′X)⁻¹X′Y.
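As a quick numerical check (simulated data with hypothetical coefficients), solving the first-order condition recovers β:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + p inputs
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Solve the FOC X'(Y - X beta) = 0 directly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```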
The model we have just seen is linear. However, it can be that the relationship
between Y and X is actually not linear, so that f (X) is not a linear function. In
this subsection, we look at ways to move our model beyond linearity by augmenting
and/or transforming the inputs X. Formally, we look at representations of f (X)
such that:

f (X) = Σ_{m=1}^{M} β_m h_m(X)

In this equation the h_m(X) are called basis functions; they are transformations of
the inputs X. The global model is called a linear basis expansion: it expands X
through the h_m, but the h_m still enter f (X) linearly.
Polynomial regression
Here the basis functions are powers of the input, h_m(X) = X^m, so the model fits
a polynomial in X. It is still estimated by OLS and is quite flexible; however,
higher-order polynomials (degree 5 or more, say) can lead to strange shapes due to
boundary issues (much like in the nonparametric case).
Step functions
Consider now basis functions h_m(X) that take the value 1 when X falls in a given
region and 0 otherwise, i.e. threshold-based indicator functions. This class of
models is called step functions. A simple example would be the following:

h₁(X) = I{X < ξ₁};  h₂(X) = I{ξ₁ ≤ X < ξ₂};  h₃(X) = I{ξ₂ ≤ X}

Then, the global model fits a constant within each of these sets of X values. It is
obvious in this model that the cutpoints or thresholds ξ are set by choice (rather
than being data-driven), which raises the question of how to choose them.
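A minimal sketch (hypothetical cutpoints and data): regressing y on the three indicators recovers the within-region means.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.where(x < 3, 1.0, np.where(x < 7, 4.0, 2.0)) + rng.normal(0, 0.3, 300)

xi1, xi2 = 3.0, 7.0                      # cutpoints chosen by hand, not data-driven
H = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     x >= xi2]).astype(float)

# OLS on the indicator basis: each coefficient is the sample mean in its region
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(beta)   # roughly (1, 4, 2)
```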
Regression Splines
One could also use both previous methods to first divide the domain of X into
contiguous, mutually exclusive intervals (step basis functions) and second, use
linear functions or even low-degree polynomials within each interval. The result
would not be a step function anymore but rather a piecewise linear or polynomial
function.
Concretely, take the three indicator basis functions h₁, h₂, h₃ defined above and
then, for each region m, add h_{m+3}(X) = X · h_m(X). This model yields 2 linear
parameters × 3 regions = 6 total parameters. Note that the three lines resulting
from this model need not connect: nothing imposes continuity. Since continuity can
be a desirable property, we would need to add constraints so that the value of f (X)
is the same when approaching each knot from the left (ξ⁻) and from the right (ξ⁺).
For that, we need two constraints (one for each knot), such that:

β₁ + β₄ ξ₁ = β₂ + β₅ ξ₁  and  β₂ + β₅ ξ₂ = β₃ + β₆ ξ₂

which fixes two parameters and leaves four free.
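An equivalent way to impose these two continuity constraints is to reparameterize the model with the truncated power basis {1, X, (X − ξ₁)₊, (X − ξ₂)₊}, which has exactly the four free parameters left above. A minimal sketch (hypothetical knots and data, with kinks placed at the knots):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 300))
pos = lambda t: np.maximum(t, 0.0)             # truncated (positive part) function

xi1, xi2 = 3.0, 7.0                            # knots, chosen by hand
y_true = 0.5 * x + 1.5 * pos(x - xi1) - 2.0 * pos(x - xi2)  # piecewise linear truth
y = y_true + rng.normal(0, 0.3, 300)

# 4 free parameters: continuity at both knots is built into the basis
B = np.column_stack([np.ones_like(x), x, pos(x - xi1), pos(x - xi2)])
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
print(beta)   # roughly (0, 0.5, 1.5, -2.0)
```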
Natural Splines
Although regression splines allow for flexible and interesting modeling of the
data, they still inherit a drawback of polynomial regression. In particular, recall
that near the boundaries, polynomial regression can yield erratic functions that
are not desired. In order to control for that issue, a natural spline adds
constraints beyond the boundary knots. The intuition is that we reduce the order of
the polynomials beyond some boundaries (both left and right) to eliminate the
extravagant variation at the extremes of the data. By doing that, we estimate
fewer parameters, which frees us to add more knots in the center of the data, so
that the increased bias near the boundaries (from the lower-order polynomials) is
balanced out by less bias within the boundaries (from the additional knots).
Smoothing Splines
Another way to control for too much variation in the regression is to punish it
within the objective function. Recall that the purpose of these linear methods
was to minimize the residual sum of squares. In addition to that, we could add
a term in the RSS that would penalize too much variation. This term will be
composed of a smoothing parameter λ, which is really a “punishment parameter”
that multiplies the integral of the second derivative (squared) of the function.
This intuition works because the second-order derivative of f (X) represents its
curvature. This process is called regularization. The new objective function is
defined as:
RSS( f , λ) = Σ_{i=1}^{n} (y_i − f (x_i))² + λ ∫ ( f ″(x))² dx
In the extreme cases, consider λ = 0, then f can be any function that interpolates
the data (goes exactly through all the points), regardless of its shape. If λ = ∞,
no second-order derivative will be tolerated, thus leaving only linear functions in
the choice set: we are back to the OLS case. It turns out that the solution to this
problem is the natural cubic spline, with knots at each distinct value of x_i.
Intuitively, one would worry about overparameterization in this context (since we
have n knots); however, the penalization of excessive variation in f keeps this
issue in check. Moreover, in a natural cubic spline, the order reduction beyond the
boundary knots forces f to be linear at the extremes.
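To see the role of λ, here is a discrete analog of this penalized problem (a sketch, not the exact natural-cubic-spline solution): we replace ∫( f ″)² by squared second differences of the fitted values on a grid, which gives a closed-form smoother.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = np.linspace(0, 1, n)                       # equally spaced grid for simplicity
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Second-difference matrix D: (D f)_i = f_i - 2 f_{i+1} + f_{i+2},
# a discrete stand-in for the second derivative
D = np.diff(np.eye(n), n=2, axis=0)

for lam in (0.0, 10.0, 1e6):
    # minimize ||y - f||^2 + lam * ||D f||^2  =>  (I + lam D'D) f = y
    f_hat = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    print(f"lambda={lam:>9}: roughness = {np.sum((D @ f_hat) ** 2):.6f}")
```

At λ = 0 the solution interpolates the data exactly; as λ grows, the fit is forced toward a (discretely) linear function, mirroring the two extreme cases above.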
5.2.3 Shrinkage methods
Recall that the simple OLS model given by Y = X β + ε has the solution
β̂ = (X′X)⁻¹X′Y, under some regularity conditions. One of these conditions is that
the matrix X′X be of full rank, or in other words, that there is no perfect
collinearity in the matrix. As we have seen in Lewbel’s class, issues may also
arise when the X′X matrix is close to being collinear; this is called the
near-multicollinearity issue. One way to solve this issue is to add a constant λ
to the diagonal of X′X. By doing this, we shift the eigenvalues of the matrix away
from 0, by λ, giving them a lower bound. With this, X′X + λI becomes positive
definite, regardless of the relative size of p and n. This estimator is called the
ridge estimator:

β̂_R = (X′X + λI_p)⁻¹X′Y
From this formula, you can observe two things. First, since λ enters in the
“denominator”, the ridge estimator will be shrunk compared to the OLS analog for
any positive value of λ. As λ → 0, we get the actual OLS estimator. Second, this
reduction in the estimator does not come for free: it creates a bias (recall that
OLS is BLUE). This bias is acceptable because it comes with the advantage of
reducing variance. In fact, it can be shown that the variance of the ridge
estimator is always lower than that of the OLS estimator. Using the ridge
estimator will therefore imply a different bias-variance tradeoff than OLS; so
which one is better?
The Theobald theorem stipulates that there always exists a λ > 0 such that
MSE(β̂_R) < MSE(β̂_OLS). In words, one can always find a ridge estimator, different
from OLS, that beats OLS in the MSE metric.
To sum up, the intuition of this estimator is that, by allowing some bias, we can
reduce variance and get an estimator with a lower MSE.
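A minimal numpy sketch of the ridge closed form and of the shrinkage it produces (hypothetical near-collinear design):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 10
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)   # nearly collinear columns
y = X @ np.r_[2.0, -2.0, np.zeros(p - 2)] + rng.normal(size=n)

for lam in (0.0, 1.0, 10.0):
    # ridge estimator: (X'X + lam I)^{-1} X'y; lam = 0 is plain OLS
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(f"lambda={lam:5.1f}  ||beta|| = {np.linalg.norm(beta):8.3f}")
```

The norm of the estimated coefficient vector shrinks as λ increases, which is exactly the variance-reducing effect described above.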
When the number of regressors (p) is high, we would like to select only a few of
them, to ideally improve the prediction error, but also to give a more palatable
interpretation of the model. In this sense, the ridge regression shrinks the
coefficients by imposing a penalty on their size. In fact, we can write the ridge
estimator as the estimate that solves:
β̂_R = arg min_β ‖Y − X β‖₂² + λ Σ_{j=1}^{p} β_j²
where the first norm corresponds to the classic definition of the RSS, while the
second term penalizes the values taken by each parameter in the model. Thus, we
have an explicit regularization of the parameters within the function to optimize.
Equivalently, the same estimator solves the constrained problem:

β̂_R = arg min_β ‖Y − X β‖₂²  subject to  ‖β‖₂² ≤ c

for some bound c that corresponds to λ.

[Figure: level curves of the RSS objective together with the circular (2-norm) ridge constraint region]
In this graph, we can see in red the level curve representation of the objective
function (the RSS), while the blue circle (2D representation of the sphere) is the
constraint on parameters. Then, instead of finding the point that minimizes the
objective function (i.e. the eye of the objective), we look for the tangent point
between the objective and the constraint sphere, which is different from β̂, the
OLS estimator.
Lasso regression
The lasso regression is another shrinkage method like ridge, with a defining
difference in the norm used to constrain β. In the ridge setting, we constrained
β to a sphere by requiring its 2-norm to be lower than a given scalar c, formally:
‖β‖₂² ≤ c. The lasso, on the other hand, uses the 1-norm to do the exact same
thing. The 1-norm, denoted ‖β‖₁, is the sum of the absolute values of the
parameters.
Therefore, its formal definition is:

β̂_L = arg min_β ‖Y − X β‖₂² + λ Σ_{j=1}^{p} |β_j|
In order to understand more intuitively the difference between the two estimators,
we go back to the graphical representation.

[Figure: level curves of the RSS objective together with the diamond-shaped (1-norm) lasso constraint region]

As you can see, the constraint area is now a diamond (whose 2D representation is a
tilted square). The interpretation of the difference between this solution and the
OLS has not changed. However, as the graph clearly shows, the lasso estimator will
have a tendency to shrink some parameters all the way to 0 (β₁ in this case). In
particular, if the data are organized such that p > n (the OLS solution is not
unique and overfits), the lasso regression will ensure that only k < n regressors
are left, creating sparsity in the parameter vector.
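A minimal sketch of this sparsity effect using scikit-learn's Lasso (hypothetical design with p > n and only three truly relevant regressors; the penalty level alpha = 0.1 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 50, 100                                  # p > n: OLS is not unique
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                # only 3 truly relevant regressors
y = X @ beta_true + rng.normal(size=n)

lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
print("nonzero coefficients:", np.sum(lasso.coef_ != 0))   # far fewer than p
print("selected indices:", np.flatnonzero(lasso.coef_)[:10])
```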
5.3.1 Introduction
5.3.2 Formal definition
Let the dataset contain N observations (x_i, y_i) for i = 1, ..., N, with x_i a
p-dimensional vector (x_{i1}, ..., x_{ip}). Given a partition of the R^p space into
M regions denoted R₁, ..., R_M, fitting a regression tree is defined as finding the
function:

f (x) = Σ_{m=1}^{M} c_m · I{x ∈ R_m}

that minimizes the RSS given by Σ_{i=1}^{N} (y_i − f (x_i))². This is equivalent to
minimizing the RSS within each region, yielding as a result

ĉ_m = (1/N_m) Σ_{x_i ∈ R_m} y_i
or in words, the average value of yi in the region Rm . Note that this method is
useful only when the partition is given! Finding the exact partition that would
yield the lowest possible RSS is computationally infeasible. Instead, one should
use a so-called “greedy algorithm”.
Greedy algorithm
First, the algorithm starts with all the data and considers splitting a variable
j at a cutpoint s. This would create two regions R1 ( j, s) = {X |X j ≤ s} and
R2 ( j, s) = {X |X j > s}. Given these regions, the algorithm solves the tree as
was described in the previous section, meaning that it finds (c1, c2 ) such that it
minimizes the RSS. This process is the inner minimization. Then, the algorithm
chooses which pair ( j, s) would yield the lowest RSS, among all feasible pairs.
This is the outer minimization problem. All in all, this process can be formalized
as finding:

( j*, s*) = arg min_{( j,s)} [ min_{c₁} RSS(y, c₁; R₁( j, s)) + min_{c₂} RSS(y, c₂; R₂( j, s)) ]
The data is now partitioned in two regions, and we repeat this process in each of
them, increasing the size of the tree until a particular criterion has been met.
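A minimal sketch of one step of this greedy search (hypothetical data; the function scans all (j, s) pairs and returns the best one):

```python
import numpy as np

def best_split(X, y):
    """One step of the greedy algorithm: scan all (j, s) pairs and return the
    pair minimizing the summed within-region RSS. The inner minimization is
    just taking the mean of y in each region."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:       # candidate cutpoints
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 4.0, 5.0, 1.0) + rng.normal(0, 0.5, 200)
print(best_split(X, y))   # should recover j = 0 and a cutpoint near 4
```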
Which criterion to choose? From what we know about statistics, growing the
tree indefinitely will cause overfitting, while not allowing it to grow enough
might cause it to miss important features. In fact, the size of a tree will govern the
complexity of the model, hence one should try to let the data guide the size of
the tree.
One approach could be to evaluate the variation in the total RSS of the model after
each split, and stop before a split when the RSS decrease is too low. However, this
raises two issues: first, the choice of the threshold is not guided by the data, and
it might not be optimal; second, this strategy is short-sighted in the sense that a
very small decrease in RSS at a given step might be followed by much bigger
decreases in later steps. Recall that the greedy algorithm is already short-sighted
in this way, so we should try to choose another, more robust method.
Another strategy, usually preferred, is to first grow the tree until it is large enough
(say 5 nodes). Then, we would use a “pruning” method called cost-complexity
pruning. Denote the big tree as T0 and define T ⊂ T0 as a sub-tree which could
be obtained by “pruning” the tree T0 , meaning collapsing an internal node of
the tree. Use m = 1, ..., M as an index for terminal nodes, such that the input
space is partitioned in M regions. Finally, let |T | be the number of terminal nodes
(regions) of T. Now define

• Q_m(T) = (1/N_m) Σ_{x_i ∈ R_m} (y_i − ĉ_m)², the average RSS of the fit within
region m, where N_m is the number of observations falling in R_m.
The cost-complexity criterion of a tree T, denoted Cα(T), is defined as:

Cα(T) = Σ_{m=1}^{|T|} N_m Q_m(T) + α · |T|

where the first term is the RSS of the tree T and the second is the cost attached
to the number of regions.
As you can see, this criterion embodies a trade-off between the total RSS of the
model (its fit) and the number of regions it creates, for a given parameter α. A
higher α penalizes big trees, while a lower α allows for bigger trees. Therefore,
α, which governs this trade-off, can be considered a tuning parameter of the model.
The idea behind this criterion is that for a given α, the optimal tree is the
subtree Tα ⊆ T₀ that minimizes Cα(T). One can show that this subtree is unique for
any value of α.
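A toy sketch of how the criterion arbitrates this trade-off (the candidate subtrees and their per-region RSS values are made up for illustration):

```python
# Cost-complexity criterion: sum of per-region RSS terms plus alpha * |T|
def C_alpha(region_rss, alpha):
    # region_rss: list with N_m * Q_m(T) for each terminal node m of the subtree
    return sum(region_rss) + alpha * len(region_rss)

# Hypothetical subtrees of T0: more leaves = better fit, higher complexity cost
subtrees = {"2 leaves": [40.0, 35.0],
            "4 leaves": [12.0, 10.0, 11.0, 9.0],
            "8 leaves": [5.0] * 8}
for alpha in (0.0, 5.0, 20.0):
    best = min(subtrees, key=lambda t: C_alpha(subtrees[t], alpha))
    print(f"alpha={alpha:>4}: best subtree -> {best}")
```

As α grows from 0 to 20, the selected subtree shrinks from 8 leaves to 2, exactly the behavior described above.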
5.3.3 Bagging
Formally, consider a set Z containing “training” data {(x₁, y₁), ..., (x_N, y_N)}
on which we fit a model, yielding our main estimate f̂ (x). In order to perform
bagging, we first draw B bootstrap samples from Z, denoted Z^b for b = 1, ..., B.
In each sample, we fit the same model as before, this time yielding a different
estimate f̂ ^b(x). Then, the bagging estimate is the average of all bootstrap fits:

f̂_bag(x) = (1/B) Σ_{b=1}^{B} f̂ ^b(x)

In the tree regression setting, this estimator is interesting as each f̂ ^b(x) will
be a different tree, displaying different features such as different partitions,
but also a different size!
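To make this concrete, here is a minimal sketch of bagging regression trees, using scikit-learn's DecisionTreeRegressor as the base learner (the data-generating process and evaluation grid are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
N = 300
X = rng.uniform(0, 10, size=(N, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, N)

B = 100
x_grid = np.linspace(0, 10, 200).reshape(-1, 1)
preds = np.zeros((B, len(x_grid)))
for b in range(B):
    idx = rng.integers(0, N, N)                 # bootstrap sample Z^b (with replacement)
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds[b] = tree.predict(x_grid)             # each fitted tree is a different f^b

f_bag = preds.mean(axis=0)                      # average of the B bootstrap fits
print(f_bag[:5])
```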
5.3.4 Random Forests
Starting with a data sample of size N, the procedure is defined as follows: for
b = 1, ..., B:

(a). Draw a bootstrap sample Z^b of size N from the data.

(b). Grow a tree T_b on Z^b by repeating the following steps, until a minimum
number of observations per node n_min is attained:

i. Select m variables at random from the p variables.

ii. Pick the best variable/split pair among these m variables.

iii. Split the node into two daughter nodes and repeat from step i.

Finally, the random-forest prediction averages the B trees:

f̂_rf(x) = (1/B) Σ_{b=1}^{B} f̂ ^b(x)
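In practice one would typically rely on an off-the-shelf implementation; here is a minimal sketch using scikit-learn's RandomForestRegressor on hypothetical data (max_features plays the role of m, the number of variables tried at each split):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
N, p = 500, 6
X = rng.normal(size=(N, p))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(0, 0.5, N)

# n_estimators is B; max_features is m; min_samples_leaf caps node size
rf = RandomForestRegressor(n_estimators=200, max_features=2,
                           min_samples_leaf=5, random_state=0).fit(X, y)
print("R^2 on training data:", rf.score(X, y))
```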
5.3.5 Boosting