Maximum Likelihood Notes
(First Part)
Introduction to Econometrics
Christopher Flinn
Fall 2004
Most maximum likelihood estimation begins with the specification of an entire probability distribution for the data (i.e., the dependent variables of the analysis). We will concentrate on the case of one dependent variable, and begin with the case of no exogenous variables just for simplicity. When the distribution of the dependent variable is specified to depend on a finite-dimensional parameter vector, we speak of parametric maximum likelihood estimation. When not, such as the case of trying to estimate the entire distribution of the data making only some assumptions regarding the nature of its cumulative distribution function, we speak of nonparametric m.l.e. In this case the objective is to estimate an entire function, F, rather than a few parameters that characterize the distribution of the data under some specific functional form assumption regarding F.
1 Example 1: Flipping a Coin

Consider flipping a coin with two possible outcomes, ‘1’ and ‘0’, which occur with the following probabilities:

Event   Probability
  1          π
  0        1 − π
We flip the coin N times, and the data consist of a string of N 1’s and 0’s. Let the outcome
of flip $i$ be denoted by $d_i$. Then the joint probability of the sample is given by
$$
L(\{d_i\}_{i=1}^{N}, \pi) = \prod_{i=1}^{N} \pi^{d_i}(1 - \pi)^{1 - d_i}
= \pi^{\sum_i d_i}(1 - \pi)^{\sum_i (1 - d_i)}
= \pi^{N_1}(1 - \pi)^{N_0},
$$
where N1 is the number of times ‘1’ appeared and N0 is the number of times ‘0’ appeared,
with N1 + N0 = N.
The function $L(\{d_i\}_{i=1}^{N}, \pi)$ is the joint probability function of the sample $\{d_i\}_{i=1}^{N}$ given the parameter value π. In this case, we don't know the value of π, and will use the one piece of information we have, the sample $\{d_i\}_{i=1}^{N}$, to attempt to infer its value. Now we treat L as a function of the unknown parameter π, the sample itself being implicitly an argument of the function, and choose that value of π that maximizes L as our estimator. Formally,
$$
\hat{\pi}_{ML} = \arg\max_{\pi} L(\{d_i\}_{i=1}^{N}, \pi).
$$
The value of π that maximizes L will also maximize any increasing monotonic function of L, such as the natural logarithm. Then define
$$
\mathcal{L}(\pi) \equiv \ln L(\pi) = N_1 \ln \pi + N_0 \ln(1 - \pi).
$$
Then the maximum likelihood estimator of π is the solution to the first order condition
$$
\frac{d\mathcal{L}(\hat{\pi}_{ML})}{d\pi} = 0
\;\Rightarrow\; \frac{N_1}{\hat{\pi}_{ML}} - \frac{N_0}{1 - \hat{\pi}_{ML}} = 0
\;\Rightarrow\; N_1(1 - \hat{\pi}_{ML}) - N_0\,\hat{\pi}_{ML} = 0
\;\Rightarrow\; \hat{\pi}_{ML} = \frac{N_1}{N_1 + N_0} = \frac{N_1}{N}.
$$
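As a rough numerical illustration, the following Python sketch simulates N coin flips and computes π̂_ML = N_1/N; the seed, sample size, and true π are arbitrary choices made for the example.

```python
import numpy as np

# Simulate N coin flips with success probability pi_true (arbitrary) and
# compute the ML estimator pi_hat = N1 / N derived above.
rng = np.random.default_rng(0)
N, pi_true = 1000, 0.3
d = rng.binomial(n=1, p=pi_true, size=N)   # the string of N 1's and 0's

N1 = d.sum()          # number of times '1' appeared
pi_hat = N1 / N       # ML estimator: the sample proportion of successes
print(pi_hat)
```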
We know that for this particular example the maximum likelihood estimator has good
“small sample” properties. For example, it is unbiased since
$$
E\,\hat{\pi}_{ML} = E\left[\frac{N_1}{N}\right] = \frac{1}{N}\,E\,N_1 = \frac{1}{N}\,E\sum_{i=1}^{N} d_i = \frac{1}{N}\sum_{i=1}^{N} E\,d_i = \frac{1}{N}\sum_{i=1}^{N} \pi = \frac{N\pi}{N} = \pi.
$$
We even have the small sample variance of the estimator, which is simply
$$
VAR(\hat{\pi}_{ML}) = \frac{\pi(1 - \pi)}{N},
$$
an estimate of which is given by
$$
\widehat{VAR}(\hat{\pi}_{ML}) = \frac{\hat{\pi}_{ML}(1 - \hat{\pi}_{ML})}{N}.
$$
One of the reasons for the good small sample properties of the ML estimator in this case
is that it coincides with the sample mean - which is just the proportion of successes out of
the N draws. Recall that in a random sample of N draws, the sample mean is an unbiased estimator
of the population mean, with the variance of the sample mean
equal to the variance of the random variable divided by the sample size - exactly what
we have above. Proving consistency of the maximum likelihood estimator in this case is
straightforward, since the estimator is unbiased and the limiting value of the variance is 0.
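A small Monte Carlo sketch can illustrate these properties: across many replications of the experiment, the average of π̂_ML should be close to π and its sampling variance close to π(1 − π)/N. The replication count, N, and π below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of the small-sample properties of pi_hat.
rng = np.random.default_rng(1)
N, pi_true, reps = 50, 0.3, 20000

# each row is one sample of N flips; the row mean is that sample's pi_hat
pi_hats = rng.binomial(n=1, p=pi_true, size=(reps, N)).mean(axis=1)

print("mean of pi_hat:     ", pi_hats.mean())               # approx. pi = 0.3
print("variance of pi_hat: ", pi_hats.var())                # approx. 0.0042
print("pi*(1 - pi)/N:      ", pi_true * (1 - pi_true) / N)  # = 0.0042
```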
2 Example 2: Unemployment Durations

Now let the dependent variable be a continuously distributed, nonnegative random variable T, and assume that T follows the negative exponential distribution, with p.d.f. $f(t) = \alpha \exp(-\alpha t)$, $t \ge 0$, $\alpha > 0$. Say that we wanted to estimate the parameter α, which, given our functional form assumption, would serve to completely characterize the distribution of the random variable T. Let's think of T as the duration of time an individual spends unemployed while looking for a job.

We have a random sample of N individuals, and we consider each individual's unemployment duration as an independent draw from the same distribution F. In this case, the joint p.d.f. of all of the sample draws, the likelihood function, is given by
$$
L(t_1, \ldots, t_N, \alpha) = \prod_{i=1}^{N} \alpha \exp(-\alpha t_i) = \alpha^N \exp\left(-\alpha \sum_{i=1}^{N} t_i\right).
$$
The log likelihood function is then
$$
\mathcal{L}(\alpha) = N \ln \alpha - \alpha \sum_{i=1}^{N} t_i.
$$
We want to maximize this function with respect to α. The solution, if there is exactly one,
is the maximum likelihood estimator for the problem. Does this function have a unique
solution? The first derivative is
$$
\frac{d\mathcal{L}(\alpha)}{d\alpha} = \frac{N}{\alpha} - \sum_{i=1}^{N} t_i,
$$
and the second derivative, $-N/\alpha^2$, is negative for all $\alpha > 0$, so the log likelihood is globally concave and the solution to the first order condition is the unique maximizer. That solution satisfies
$$
\frac{d\mathcal{L}(\hat{\alpha}_{ML})}{d\alpha} = 0
\;\Rightarrow\; \frac{N}{\hat{\alpha}_{ML}} - \sum_{i=1}^{N} t_i = 0
\;\Rightarrow\; \hat{\alpha}_{ML} = \frac{N}{\sum_{i=1}^{N} t_i} = \frac{1}{\bar{t}_N},
$$
where $\bar{t}_N$ denotes the sample mean of the durations.
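As a minimal illustration, the sketch below simulates exponential durations and computes α̂_ML = N/Σ t_i = 1/t̄_N. The true α, sample size, and seed are arbitrary choices; note that numpy parameterizes the exponential by its mean 1/α rather than the rate α.

```python
import numpy as np

# Compute the exponential ML estimator alpha_hat = N / sum(t_i),
# i.e. the reciprocal of the sample mean duration.
rng = np.random.default_rng(2)
N, alpha_true = 500, 0.7
t = rng.exponential(scale=1.0 / alpha_true, size=N)   # simulated durations

alpha_hat = N / t.sum()        # equivalently 1.0 / t.mean()
print(alpha_hat)
```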
3 Example 2 continued with Observable Heterogeneity
Oftentimes we have access to information on observable characteristics of individuals in addition to the value of the dependent variable for each one. For example, in the unemployment case discussed above, we may know each individual's gender, years of schooling completed, region of the country in which they live, etc. Let the row vector of these characteristics for individual i be given by $X_i$. We assume that there are K columns in $X_i$, one of which may be a ‘1’ for each individual (corresponding to a sort of intercept term).
Say we continue to assume that the distribution of times in unemployment is negative exponential, and that one individual's unemployment duration is unrelated to (i.e., independent of) another's. We just want to relax the “identical” part of the i.i.d. assumption. The easiest way to accomplish this, but by no means the only way, is to write
$$
\alpha_i = \exp(X_i \beta),
$$
where β is a K × 1 vector of parameters; the exponential form guarantees that $\alpha_i > 0$ whatever value $X_i\beta$ takes. The log likelihood for the sample is then
$$
\mathcal{L}(\beta) = \sum_{i=1}^{N} \{\ln \alpha_i - \alpha_i t_i\} = \sum_{i=1}^{N} \{X_i\beta - \exp(X_i\beta)\, t_i\}.
$$
The first partial derivative of this function with respect to the parameter vector β is
$$
\frac{\partial \mathcal{L}(\beta)}{\partial \beta} = \sum_{i=1}^{N} \{X_i' - \exp(X_i\beta)\, X_i'\, t_i\},
$$
and the second partial of the log likelihood function with respect to β is
$$
\frac{\partial^2 \mathcal{L}(\beta)}{\partial \beta\, \partial \beta'} = \sum_{i=1}^{N} \{-\exp(X_i\beta)\, X_i' X_i\, t_i\},
$$
which is a negative definite matrix,
so that there is a unique maximum likelihood estimator of β. The estimator solves the first
order conditions, or
$$
\frac{\partial \mathcal{L}(\hat{\beta}_{ML})}{\partial \beta} = \sum_{i=1}^{N} \{X_i' - \exp(X_i\hat{\beta}_{ML})\, X_i'\, t_i\} = 0.
$$
There is no “closed form” solution to the first order conditions, and the solution must be
obtained iteratively using numerical methods. We will say a few words on this topic below.
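Anticipating that discussion, the sketch below maximizes L(β) = Σ{X_iβ − exp(X_iβ)t_i} numerically, supplying the first partials derived above as the gradient. The simulated design, true β, and the choice of scipy's BFGS routine are all illustrative assumptions rather than part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize

# Numerically maximize the exponential-with-covariates log likelihood
#   L(beta) = sum_i { X_i beta - exp(X_i beta) t_i }.
rng = np.random.default_rng(3)
N = 1000
beta_true = np.array([-0.5, 0.8])
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept plus one regressor
t = rng.exponential(scale=1.0 / np.exp(X @ beta_true))  # durations with alpha_i = exp(X_i beta)

def neg_loglik(beta):
    xb = X @ beta
    return -np.sum(xb - np.exp(xb) * t)

def neg_score(beta):
    xb = X @ beta
    return -X.T @ (1.0 - np.exp(xb) * t)   # minus the first partials derived above

result = minimize(neg_loglik, x0=np.zeros(2), jac=neg_score, method="BFGS")
print(result.x)   # beta_hat_ML; close to beta_true in large samples
```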
4 Example 3: The Latent Variable Discrete Choice Framework
We saw that there were some interpretive problems with the linear probability model -
namely, the predicted probability of a “success” was often outside of the interval (0,1) -
which is not a good thing for a probability. Instead we can set the problem up in a latent
variable framework. We illustrate the technique in the binary choice case.
Think of a case in which the choice is to buy one unit of some good or not, for example, a
car. There is a utility gain from buying the car, but of course the cost of the car reduces
resources available to spend on other goods. Let U ∗ (X) be the net utility gain from buying
the car, which depends on characteristics X, such as the price of the car, the individual’s
residence and workplace (as indicators of how much she will use it), characteristics of the
car such as color, type, etc., and the individual’s total income. We will assume a functional
form for this indirect utility function, typically
$$
U_i^* = X_i \beta + \varepsilon_i,
$$
where Xi are the characteristics of the individual i and the good in question and εi is
a random “preference shock.” We will make a parametric assumption regarding εi , and
assume that it is independently and identically distributed across consumers, with c.d.f. F(ε).
We will assume that E(ε) = 0.
The individual purchases the good if and only if the net utility from doing so is positive.
If di = 1 when individual i purchases the good, then
$$
d_i = \begin{cases} 1 & \text{iff } U_i^* > 0 \\ 0 & \text{iff } U_i^* \le 0 \end{cases}.
$$
Then the probability that the individual purchases the good is given by
p(di = 1|Xi ) = p(Xi β + ε > 0)
= p(ε > −Xi β)
= 1 − F (−Xi β),
and of course
p(di = 0|Xi ) = F (−Xi β).
If we have a random sample of observations on $\{d_i, X_i\}_{i=1}^{N}$, we define the likelihood function as
$$
L(d_1, \ldots, d_N, \beta \mid X_1, \ldots, X_N) = \prod_{i=1}^{N} p(d_i = 1 \mid X_i)^{d_i}\, p(d_i = 0 \mid X_i)^{1 - d_i}
= \prod_{i=1}^{N} (1 - F(-X_i\beta))^{d_i}\, F(-X_i\beta)^{1 - d_i},
$$
with the log likelihood function given by
$$
\mathcal{L}(\beta) = \sum_{i=1}^{N} \{d_i \ln(1 - F(-X_i\beta)) + (1 - d_i) \ln F(-X_i\beta)\}.
$$
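In code, this log likelihood is a one-liner once a candidate c.d.f. F is supplied; the function below is a minimal sketch in which F is passed in as an argument (the argument names are illustrative).

```python
import numpy as np

# Binary-choice log likelihood for a generic c.d.f. F:
#   L(beta) = sum_i { d_i ln(1 - F(-X_i beta)) + (1 - d_i) ln F(-X_i beta) }.
# F can be, e.g., the standard normal c.d.f. for the probit model discussed below.
def binary_loglik(beta, d, X, F):
    F_neg = F(-X @ beta)                      # F(-X_i beta) for each observation
    return np.sum(d * np.log(1.0 - F_neg) + (1.0 - d) * np.log(F_neg))
```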
Now let’s look at the first partials of the log likelihood function, which are
$$
\frac{\partial \mathcal{L}(\beta)}{\partial \beta} = \sum_{i=1}^{N} \left\{\frac{d_i}{1 - F(-X_i\beta)} - \frac{1 - d_i}{F(-X_i\beta)}\right\} f(-X_i\beta)\, X_i'.
$$
Under many distributional assumptions regarding F it is possible to prove that the matrix
of second partial derivatives is negative definite, and so the solutions to the first order
conditions give the unique maximum likelihood estimates of the parameters β, if these
parameters are “identified.” We will see what this means in the context of the probit
model, where we assume that F is a normal distribution.
Specifically, we will begin by assuming that ε follows the $N(0, \sigma_\varepsilon^2)$ distribution. Then we have that the probability of $d_i = 1$ is given by
$$
p(d_i = 1 \mid X_i) = 1 - \Phi\left(\frac{-X_i\beta}{\sigma_\varepsilon}\right) = \Phi\left(\frac{X_i\beta}{\sigma_\varepsilon}\right),
$$
where Φ denotes the standard normal c.d.f. and the second equality uses the symmetry of the normal distribution, $1 - \Phi(-z) = \Phi(z)$. The log likelihood function is then
$$
\mathcal{L}(\beta, \sigma_\varepsilon^2) = \sum_{i=1}^{N}\left\{d_i \ln \Phi\left(\frac{X_i\beta}{\sigma_\varepsilon}\right) + (1 - d_i)\ln\left[1 - \Phi\left(\frac{X_i\beta}{\sigma_\varepsilon}\right)\right]\right\}.
$$
We note something immediately from the log likelihood function concerning its dependence on the unknown parameters of the model. By assuming that ε is normally distributed with variance $\sigma_\varepsilon^2$, we added an additional parameter to estimate: $\sigma_\varepsilon^2$. However, we can see from inspection of $\mathcal{L}(\beta, \sigma_\varepsilon^2)$ that β and $\sigma_\varepsilon^2$ always appear in the same combination, $\beta/\sigma_\varepsilon$, everywhere in $\mathcal{L}$. This means that these parameters cannot be separately identified. Whatever values are taken by β and $\sigma_\varepsilon$, if we multiply each parameter by the same scalar k we will end up with the same ratio $\beta/\sigma_\varepsilon$. Thus only the ratio of β to $\sigma_\varepsilon$ can be estimated, and not each term individually. Sometimes we simply “normalize” the standard deviation of ε to be equal to 1, and in that way say that we have estimated β uniquely. But the important thing to remember is that the estimate is not really unique unless we have fixed the value of $\sigma_\varepsilon$ by assumption.
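The identification argument is easy to verify numerically: for any data set, the probit log likelihood evaluated at (β, σ_ε) and at (kβ, kσ_ε) takes exactly the same value. The sketch below performs this check on arbitrary simulated data.

```python
import numpy as np
from scipy.stats import norm

# The probit log likelihood depends on (beta, sigma) only through beta/sigma,
# so scaling both by any k > 0 leaves its value unchanged.
def probit_loglik(beta, sigma, d, X):
    z = X @ beta / sigma
    return np.sum(d * norm.logcdf(z) + (1 - d) * norm.logcdf(-z))

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
d = rng.binomial(n=1, p=0.5, size=200)
beta = np.array([0.2, -0.4])

print(probit_loglik(beta, 1.0, d, X))          # identical values...
print(probit_loglik(3.0 * beta, 3.0, d, X))    # ...for k = 3 (or any k > 0)
```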
In light of this discussion, let's simply assume that $\sigma_\varepsilon = 1$ so that we only have the K parameters in β to estimate. The first order conditions in this case are given by
$$
\frac{\partial \mathcal{L}(\hat{\beta}_{ML})}{\partial \beta} = 0
= \sum_{i=1}^{N} \left\{\frac{d_i}{\Phi(X_i\hat{\beta}_{ML})} - \frac{1 - d_i}{1 - \Phi(X_i\hat{\beta}_{ML})}\right\} \phi(X_i\hat{\beta}_{ML})\, X_i',
$$
where φ denotes the standard normal probability density function. This can be rewritten
as
$$
0 = \sum_{i=1}^{N} \left\{\frac{d_i - \Phi(X_i\hat{\beta}_{ML})}{\Phi(X_i\hat{\beta}_{ML})(1 - \Phi(X_i\hat{\beta}_{ML}))}\right\} \phi(X_i\hat{\beta}_{ML})\, X_i'.
$$
Note that the maximum likelihood estimator is chosen to set a certain weighted average of the residuals equal to 0 in the sample, where the residual is $d_i - \Phi(X_i\hat{\beta}_{ML})$, the difference between the observed choice and the predicted probability of purchase.
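A brief sketch of probit estimation illustrates this property: after maximizing the log likelihood numerically (with σ_ε normalized to 1, on simulated data), the weighted residuals in the first order conditions sum to approximately zero. The data-generating values and the use of scipy's optimizer are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Probit by maximum likelihood on simulated data, followed by a check of the
# first order conditions in the "weighted residual" form above.
rng = np.random.default_rng(5)
N = 2000
beta_true = np.array([0.5, -1.0])
X = np.column_stack([np.ones(N), rng.normal(size=N)])
d = (X @ beta_true + rng.normal(size=N) > 0).astype(float)   # observed choices

def neg_loglik(beta):
    z = X @ beta
    return -np.sum(d * norm.logcdf(z) + (1 - d) * norm.logcdf(-z))

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x

z = X @ beta_hat
resid = d - norm.cdf(z)                                  # d_i - Phi(X_i beta_hat)
weight = norm.pdf(z) / (norm.cdf(z) * (1 - norm.cdf(z)))
print(beta_hat)                  # close to beta_true
print(X.T @ (resid * weight))    # first order conditions: approximately zero
```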
We now consider how the likelihood equations are solved in practice. For a (scalar) parameter θ, the maximum likelihood estimator satisfies the first order condition
$$
\frac{\partial \mathcal{L}(\hat{\theta}_{ML})}{\partial \theta} = 0.
$$
Assuming that L is continuously differentiable, let us expand this function around some
initial guess of the parameter value, which we will denote by θ0 . Then write
$$
0 = \frac{\partial \mathcal{L}(\hat{\theta}_{ML})}{\partial \theta}
\simeq \frac{\partial \mathcal{L}(\theta_0)}{\partial \theta} + \left(\hat{\theta}_{ML} - \theta_0\right)\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta^2}.
$$
We can rearrange this last expression and write
$$
\hat{\theta}_{ML} = \theta_0 - \left[\frac{\partial^2 \mathcal{L}(\theta_0)}{\partial \theta^2}\right]^{-1}\frac{\partial \mathcal{L}(\theta_0)}{\partial \theta}. \tag{1}
$$
Now because this first-order Taylor series does not hold exactly, unless $\mathcal{L}$ happens to be quadratic in θ, the value on the left hand side of (1) does not actually satisfy the first order conditions. Instead, we would take the value labelled $\hat{\theta}_{ML}$ and call it $\theta_1$. If the log likelihood function is well-behaved, i.e., globally concave in θ, then $\theta_1$ should be closer to the true solution to the likelihood equation than $\theta_0$. We would proceed with our iteration of (1) until the values $\theta_0, \theta_1, \ldots, \theta_m$ converged according to some criterion we would have to define. This criterion usually involves the first derivative $\partial \mathcal{L}(\theta_m)/\partial \theta$ being sufficiently close to 0, small changes in the value of the log likelihood between successive iterations, or small relative changes in the parameter guesses $\theta_m$ and $\theta_{m+1}$ between iterations.
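A generic sketch of this iterative scheme for a scalar parameter might look as follows; the function names, tolerance, and iteration cap are arbitrary choices.

```python
# Iterate the updating rule (1), theta <- theta - score(theta)/hessian(theta),
# until the first derivative (the score) is sufficiently close to zero.
def newton_solve(score, hessian, theta0, tol=1e-8, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        theta = theta - score(theta) / hessian(theta)   # the updating rule (1)
        if abs(score(theta)) < tol:                     # convergence criterion
            break
    return theta
```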
To illustrate, return to the exponential duration example, and suppose that N = 100 and $\sum_{i=1}^{N} t_i = 148$. The updating rule (1) then becomes
$$
\alpha_{m+1} = \alpha_m + \frac{\alpha_m^2}{100}\left(\frac{100}{\alpha_m} - 148\right).
$$
For illustration we chose an initial guess of $\alpha_0$ that differs from the solution of the likelihood equation, which we know is 100/148 ≈ .675676. We set $\alpha_0 = .5$, and found that:
$\alpha_1$ = .630000
$\alpha_2$ = .672588
$\alpha_3$ = .675662
$\alpha_4$ = .675676
$\alpha_5$ = .675676
Thus convergence was very rapid using this algorithm for this simple model. In practice, to
solve the likelihood equations in more complicated cases, certain modifications are necessary
to obtain good convergence results.
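For completeness, a few lines of code implementing the updating rule above (under the assumption, implied by the solution 100/148, that N = 100 and Σ t_i = 148) reproduce the iterates in the table.

```python
# Newton iteration for the exponential example with N = 100, sum(t_i) = 148.
N, sum_t = 100, 148.0

def score(a):
    return N / a - sum_t          # dL/d(alpha)

def hessian(a):
    return -N / a ** 2            # d2L/d(alpha)2

alpha = 0.5                       # initial guess alpha_0
for m in range(1, 6):
    alpha = alpha - score(alpha) / hessian(alpha)
    print(m, f"{alpha:.6f}")      # .630000, .672588, .675662, .675676, .675676
```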