
Asymptotic Analysis

Paolo Zacchia

Probability and Statistics

Lecture 6
Why asymptotics?
• This lecture characterizes probabilistic results for samples
in asymptotic settings: when the sample size N is large.

• The focus is on convergence results for selected statistics:


their value and distribution for large N .

• These results greatly facilitate estimation & inference when


exact results on sampling distributions are hard to obtain.

• The main objective is to characterize the behavior of known


estimators (MM and MLE) in asymptotic environments.

• This is achieved via the analysis of two fundamental results


about the sample mean x̄: the law of large numbers and
the central limit theorem.
Random sequences
To characterize asymptotic results it is necessary to adopt a notation
that helps express the dependence of these results on the sample size.

Definition 1
Random sequence. Any random vector expressed as an N-indexed sequence, written as x_N = (X_{1N}, . . . , X_{KN})^T, is a random sequence. In the univariate context (K = 1), one can write it simply as X_N.

The definition can also apply to sequences of random matrices of dimension J × K, which combine J vectorial sequences x_{jN} of length K for j = 1, . . . , J. One such matrix is denoted, for example, as follows.

X_N = (x_{1N}  x_{2N}  . . .  x_{JN})^T

Example: both the sample mean and the sample variance-covariance matrix,

x̄_N = (1/N) Σ_{i=1}^N x_i   and   S_N = (1/(N−1)) Σ_{i=1}^N (x_i − x̄_N)(x_i − x̄_N)^T,

are random sequences, as they are statistics that depend upon the sample size N. Their univariate versions are written as X̄_N and S²_N.
Boundedness in probability

The first asymptotic concept related to the idea of “convergence,” as


applied to random sequences, is defined next.

Definition 2
Boundedness in Probability. A sequence x_N of random vectors is bounded in probability if and only if, for any ε > 0, there exists some number δ_ε < ∞ and an integer N_ε such that

P(‖x_N‖ ≥ δ_ε) < ε   ∀N ≥ N_ε

which is also written as x_N = O_p(1) and read as "x_N is big p-oh one."

This is a desirable property of random sequences; however, it is not yet fully satisfactory, as it still allows the distribution of the random vectors to stay spread over an entire interval rather than concentrating around a point.
Convergence in probability
The following “convergence” concept is stronger than the previous.

Definition 3
Convergence in Probability. A sequence x_N of random vectors converges in probability to a constant vector c if

lim_{N→∞} P(‖x_N − c‖ > δ) = 0

for any real number δ > 0.

This definition formalizes the idea that as the sample size N grows in-
creasingly larger, the probability distribution of xN concentrates within
an increasingly smaller neighborhood of c.
Convergence in probability is usually denoted in one of two ways:

x_N →^p c    or    plim x_N = c.

Here the first type of notation (using →^p) is preferred.
Convergence implies boundedness in probability

Theorem 1
Convergent Random Sequences are also Bounded. If some sequence x_N of random vectors converges in probability to a constant c, that is x_N →^p c, then it is also bounded: x_N = O_p(1).
Proof.
By the definition of convergence in probability, for any ε > 0 and any δ > 0 there is always an integer N_ε such that

P(‖x_N − c‖ > δ) < ε   ∀N ≥ N_ε

and since ‖x_N‖ ≤ ‖x_N − c‖ + ‖c‖ by the triangle inequality, setting δ_ε = δ + ‖c‖ gives x_N = O_p(1).

Hence, while "boundedness" only requires the inequality to hold for some specific constant δ_ε once N is large enough, "convergence" must hold for any δ > 0.
Convergence of random to real sequences
If convergence in probability holds for c = 0, one can also write:

xN = op (1)

which is read as “xN is little p-oh one.” This notation helps develop
the following concept.

Definition 4
Convergence of Random Sequences to Real Sequences. Consider a random sequence x_N and a non-random sequence a_N of the same dimension K as x_N. Moreover, define the random sequence z_N = (Z_{1N}, . . . , Z_{KN})^T where Z_{kN} = X_{kN}/a_{kN} for k = 1, . . . , K and for N = 1, 2, . . .
1. If z_N = O_p(1), then x_N is said to be bounded in probability by a_N, which one can write as x_N = O_p(a_N).
2. If z_N = o_p(1), then x_N is said to converge in probability to a_N, which one can write as x_N = o_p(a_N).
Convergence in r-th Mean

The following asymptotic concept is even stronger than convergence in


probability.

Definition 5
Convergence in r-th Mean. A random sequence x_N is said to converge in r-th mean to a constant vector c if the following holds.

lim_{N→∞} E[‖x_N − c‖^r] = 0

In the special case with r = 2, this concept is known as Convergence in Quadratic Mean and is also expressed as follows.

x_N →^{qm} c

This particular kind of convergence is not as general as convergence in


probability, but it may be more convenient to work with, given that it
is based upon moments.
Convergence in lower means
Intuitively, if a random sequence converges in r-th mean, it should also converge in means of lower order than r.

Theorem 2
Convergence in Lower Means. A random sequence x_N that converges in r-th mean to some constant vector c also converges in s-th mean to c for s < r.
Proof.
The proof is based on Jensen's Inequality:

lim_{N→∞} E[‖x_N − c‖^s] = lim_{N→∞} E[(‖x_N − c‖^r)^{s/r}]
                        ≤ lim_{N→∞} {E[‖x_N − c‖^r]}^{s/r}
                        = 0

since lim_{N→∞} E[‖x_N − c‖^r] = 0.
Convergence in quadratic mean (1/2)
Theorem 3
Convergence in Quadratic Mean and Probability. If a random sequence x_N converges in r-th mean to a constant vector c for r ≥ 2 (that is, at least x_N →^{qm} c), then it also converges in probability to c.
Proof.
Define the (one-dimensional) nonnegative random sequence Q_N as:

Q_N = ‖x_N − c‖ = √((x_N − c)^T (x_N − c)) ∈ R₊

and notice that by Theorem 2 it must converge in first mean.

lim_{N→∞} E[Q_N] = lim_{N→∞} E[‖x_N − c‖] = 0

In addition, quadratic mean convergence implies the following.

lim_{N→∞} Var[Q_N] = lim_{N→∞} E[Q_N²] = lim_{N→∞} E[‖x_N − c‖²] = 0

(Continues. . . )
Convergence in quadratic mean (2/2)
Theorem 3
Proof.
(Continued.) At the same time, by Čebyšëv's Inequality:

P(|Q_N − E[Q_N]| > δ) ≤ Var[Q_N]/δ²

therefore, taking limits on both sides gives:

lim_{N→∞} P(‖x_N − c‖ > δ) = lim_{N→∞} P(|Q_N − E[Q_N]| > δ)
                          ≤ lim_{N→∞} Var[Q_N]/δ²
                          = 0

implying convergence in probability: x_N →^p c.

This result is useful to verify that in random samples drawn from a random vector x with finite variance Var[x] < ∞, the sample mean x̄_N converges in probability to the mean of the population, E[x].
Convergence in probability of the sample mean
In a random sample drawn from some random variable X:

lim_{N→∞} E[X̄_N] = lim_{N→∞} (N/N) E[X] = E[X]

and in addition, if Var[X] < ∞:

lim_{N→∞} E[(X̄_N − E[X̄_N])²] = lim_{N→∞} Var[X̄_N] = lim_{N→∞} Var[X]/N = 0

and therefore X̄_N →^{qm} E[X], which also implies X̄_N →^p E[X].

This generalizes to a multivariate environment: given a random sample drawn from a random vector x with Var[x] < ∞, it is:

x̄_N = (1/N) Σ_{i=1}^N x_i →^{qm} E[x]

which also implies convergence in probability, x̄_N →^p E[x].
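As a minimal added sketch (not part of the original slides), the following Python snippet checks numerically that the mean squared deviation of the sample mean from the population mean shrinks at rate 1/N, in line with quadratic-mean convergence; the exponential distribution with mean 1 is an arbitrary choice of population.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 1.0  # mean of Exp(1), an arbitrary example distribution

for n in (10, 100, 1000, 10000):
    # simulate 2000 independent samples of size n and compute their means
    means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)
    # E[(X_bar - E[X])^2] should be close to Var[X]/N = 1/N
    msd = np.mean((means - true_mean) ** 2)
    print(f"N = {n:>6}: mean squared deviation = {msd:.5f}, 1/N = {1/n:.5f}")
```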
Almost Sure Convergence

There is yet another, stronger notion of convergence.

Definition 6
Almost Sure Convergence. A sequence x_N of random vectors converges almost surely, or with probability one, to a constant vector c if it holds that:

P(lim_{N→∞} x_N = c) = 1

where lim_{N→∞} x_N is a random vector. This is also written as x_N →^{a.s.} c.

One can prove that "almost sure" convergence implies convergence in probability, but the converse is not true in general.
Convergence of random matrix sequences
• All concepts and results discussed so far also apply to random sequences that are matrix-valued.

• A matrix-valued random sequence X_N is said to converge in probability to some matrix C if:

lim_{N→∞} P(‖X_N − C‖ > δ) = 0

• (. . . where, for any matrix B, the norm is defined as follows.)

‖B‖ = √(tr(B^T B))

• Convergence in probability of random matrix sequences can be written as follows.

X_N →^p C
Continuous mapping theorem (1/2)
What follows is an extremely useful result.

Theorem 4
Continuous Mapping Theorem. Consider a vector-valued random sequence x_N ∈ X, a vector c ∈ X with the same length as x_N, as well as a vector-valued continuous function g(·) with a set of discontinuity points D_g such that:

P(x ∈ D_g) = 0

(the probability mass at the discontinuities is zero). It follows that:

x_N →^p c  ⇒  g(x_N) →^p g(c)
x_N →^{a.s.} c  ⇒  g(x_N) →^{a.s.} g(c)

thus, convergence in probability and almost sure convergence are preserved when functions are applied to random sequences.
Proof.
(Sketched.) Only the case about convergence in probability is proved
here, for the sake of illustration. (Continues. . . )
Continuous mapping theorem (2/2)
Theorem 4
Proof.
(Continued.) For a given positive number δ > 0, define the set:

G_δ = {x ∈ X | x ∉ D_g : ∃y ∈ X : ‖x − y‖ < δ, ‖g(x) − g(y)‖ > ε}

that is, the set of points in X where g(·) "amplifies" the distance with some other point y beyond a small neighborhood of ε. In light of this definition:

P(‖g(x_N) − g(c)‖ > ε) ≤ P(‖x_N − c‖ ≥ δ) + P(c ∈ G_δ) + P(c ∈ D_g)

and note that upon taking the limit of the right-hand side as N → ∞, the second term vanishes by definition of a continuous function, while the third term is zero by hypothesis. Therefore:

lim_{N→∞} P(‖g(x_N) − g(c)‖ > ε) ≤ lim_{N→∞} P(‖x_N − c‖ ≥ δ)

which proves the result on convergence in probability.


Uses of the continuous mapping theorem (1/3)
• This result also applies to scalar-valued and matrix-valued
sequences.

• The theorem already showcases the convenience of working


in an asymptotic environment.

• Note that in general, one cannot derive the expected value of some function g(μ̂_N) of a given unbiased estimator μ̂_N such that E[μ̂_N] = μ₀ for some μ₀.

• (The best one can do is to derive approximations based on Jensen's Inequality.)

• If μ̂_N converges in probability to μ₀, though, the continuous mapping theorem ensures that in large samples g(μ̂_N) also converges in probability to g(μ₀).
Uses of the continuous mapping theorem (2/3)
A list of 'properties' of random sequences, which can be derived from the continuous mapping theorem, follows.

1. Scalars. Given two scalar random sequences X_N →^p x and Y_N →^p y, the following holds.

(X_N + Y_N) →^p x + y
X_N Y_N →^p xy
X_N / Y_N →^p x/y   if y ≠ 0

2. Vectors. Given two vector random sequences x_N →^p x and y_N →^p y of equal length, the following holds.

x_N^T y_N →^p x^T y
x_N y_N^T →^p x y^T
Uses of the continuous mapping theorem (3/3)
3. Matrices. Given two matrix random sequences X_N →^p X and Y_N →^p Y of appropriate dimension, it holds that:

X_N Y_N →^p XY

while for sequences of random square matrices of full rank Z_N →^p Z, it is as follows.

Z_N^{−1} →^p Z^{−1}

4. Combinations of the above. Consider the random sequences X_N and x_N as above, and suppose that the column dimension of X_N corresponds to the row dimension of x_N. Then, the following holds.

x_N^T X_N x_N →^p x^T X x
Laws of Large Numbers
• Endowed with these convergence concepts, it is possible to
state and prove the Laws of Large Numbers.

• These fundamental results in asymptotic analysis show how


sample means converge to population means in ways
that depend on the assumptions that one makes.

• There are two distinct kinds of Law: weak (for convergence


in probability) and strong (for almost sure convergence).

• While both are stated next, only the weak law is proved. A
full-fledged proof would use characteristic functions; for the
sake of simplicity, a slightly less general proof that is based
on moment-generating functions is given.

• The weak law resembles the result following from quadratic


mean convergence, but it does not impose finite variances.
Weak Law of Large Numbers (1/3)
The simplest version of the Law is introduced next.

Theorem 5
Weak Law of Large Numbers (Khinčin's). The sample mean associated with a random (i.i.d.) sample drawn from the distribution of a random vector x with finite mean E[x] < ∞ converges in probability to that population mean.

x̄_N = (1/N) Σ_{i=1}^N x_i →^p E[x]

Proof.
(Sketched.) The proof is restricted to random vectors x for which the
m.g.f. Mx (t) is actually defined. A more general analysis, which also
allows for random vectors that lack an m.g.f., would use characteristic
functions φx (t) instead. (Continues. . . )
Weak Law of Large Numbers (2/3)
Theorem 5
Proof.
(Continued.) The m.g.f. of the sample mean x̄_N is, for a given N:

M_{x̄_N}(t) = E[exp(t^T x̄_N)]
           = E[exp((1/N) Σ_{i=1}^N t^T x_i)]
           = Π_{i=1}^N E[exp((1/N) t^T x_i)]
           = [M_x(t/N)]^N

where the third line follows from independence between observations, and the fourth line relies on observations being identically distributed (so that they have the same m.g.f.); basically, this is an application of the theorem on the m.g.f. of linear combinations. (Continues. . . )
Weak Law of Large Numbers (3/3)

Theorem 5
Proof.
(Continued.) From a Taylor expansion around t₀ = 0:

M_{x̄_N}(t) = [1 + t^T E[x]/N + o(t^T ι/N)]^N

hence, taking the limit gives the following result.

lim_{N→∞} M_{x̄_N}(t) = exp(t^T E[x])

This is a trivial m.g.f.: that of a degenerate discrete random vector whose entire probability mass is concentrated at E[x]! Therefore, exploiting the fact that m.g.f.s uniquely characterize distributions (so that convergence to the m.g.f. of a constant implies convergence in probability to that constant), one can conclude that the sample mean indeed converges in probability to the population mean as N grows larger.
Strong Laws of Large Numbers (1/2)
Under appropriate restrictions about the variance of the population,
one can also establish almost sure convergence.

Theorem 6
Strong Law of Large Numbers (Kolmogorov's). If in a random (i.i.d.) sample drawn from the distribution of some random vector x it simultaneously holds that:
i. E[x] < ∞,
ii. Var[x] < ∞,
iii. Σ_{n=1}^∞ n^{−2} Var[x_n] < ∞,
then the sample mean converges almost surely to its population mean.

x̄_N = (1/N) Σ_{i=1}^N x_i →^{a.s.} E[x]
Strong Laws of Large Numbers (2/2)
The following result applies to non-identically distributed observations.

Theorem 7
Strong Law of Large Numbers (Markov's). Consider a sample with independent, non-identically distributed observations (i.n.i.d.), so that the random vectors x_i that generate it have possibly heterogeneous moments E[x_i] and Var[x_i]. If for some δ > 0 it holds that:

lim_{N→∞} Σ_{i=1}^N (1/i^{1+δ}) E[|x_i − E[x_i]|^{1+δ}] < ∞

then the following almost sure convergence result holds.

(1/N) Σ_{i=1}^N (x_i − E[x_i]) →^{a.s.} 0

More complex results that apply to non-independent observations also


exist. These are widely applied in econometrics.
LLN: Illustrative simulations (1/5)
• To illustrate how the Law of Large Numbers works in practice, a simulation is presented next.

• The simulation is based on random draws from the Poisson distribution with parameter λ = 4.

• Four histograms are presented. All of them bin the values of 800 sample means from 800 simulated samples.

• Across histograms the size of the sample N varies. Thus, if N = 1, 800 observations are grouped across 800 samples; if N = 10, 8,000 observations are grouped across 800 samples, and so on.

• This helps show how the empirical sampling distribution of the 800 sample means becomes increasingly concentrated around 4 as N becomes larger.
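The following Python sketch reproduces a simulation of this kind (an illustrative addition, not part of the original slides); the sample sizes and the number of replications match those described above, and the printed spread of the sample means is a simple stand-in for the histograms.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n_samples = 4.0, 800  # Poisson parameter and number of simulated samples

for n in (1, 10, 100, 1000):
    # draw 800 samples of size n and compute the 800 sample means
    means = rng.poisson(lam=lam, size=(n_samples, n)).mean(axis=1)
    # the spread of the sample means around lambda = 4 shrinks as n grows (LLN)
    print(f"N = {n:>4}: mean of means = {means.mean():.3f}, "
          f"std of means = {means.std(ddof=1):.3f}")
```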
LLN: Illustrative simulations (2/5)

[Histogram of the 800 sample means, binned over 0-16; N = 1.]
LLN: Illustrative simulations (3/5)

[Histogram of the 800 sample means, binned over 0-16; N = 10.]
LLN: Illustrative simulations (4/5)

[Histogram of the 800 sample means, binned over 0-16; N = 100.]
LLN: Illustrative simulations (5/5)

[Histogram of the 800 sample means, binned over 0-16; N = 1000.]
Consistent estimators
The Laws of Large Numbers can be exploited to show that selected
estimators converge in probability to the parameters of interest for
estimation.

Definition 7
Consistent Estimators. An estimator θ̂_N is called consistent if it converges in probability to the true population parameter θ₀ which it is meant to estimate.

θ̂_N →^p θ₀

Note: from now on, the subscript 0 – as in θ0 – is used to denote the


“true” value of the parameters of interest, the ones that characterize
the distribution which generates the data.

Before proving consistency of both MM and MLE estimators (under some loose conditions), it is useful to provide an example based on the bivariate linear regression model, showing that the estimators associated with it are consistent.
Bivariate regression and consistency (1/2)
Consider the bivariate linear regression model from Lecture 3.
The MM estimator of the true slope parameter β₁ is given (Lecture 4) as:

β̂_{1,MM} = [Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ)] / [Σ_{i=1}^N (X_i − X̄)²]

where X̄ = N^{−1} Σ_{i=1}^N X_i and Ȳ = N^{−1} Σ_{i=1}^N Y_i. This estimator can also be obtained via MLE under certain assumptions.

Observe that after dividing both the numerator and the denominator of the above ratio by N, they become, respectively:
• the sample covariance between X_i and Y_i, and
• the sample variance of X_i,
(both multiplied by the (N − 1)/N factor).
Bivariate regression and consistency (2/2)
Since the numerator and the denominator of the ratio are (generalized) sample means, by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ) →^p Cov[X_i, Y_i]
(1/N) Σ_{i=1}^N (X_i − X̄)² →^p Var[X_i]

and thanks to the Continuous Mapping Theorem:

β̂_{1,MM} →^p β₁

that is, this estimator of the slope parameter β₁ is consistent!

An extended analysis shows that the MM estimator of β₀,

β̂_{0,MM} = Ȳ − β̂_{1,MM} · X̄,

is also consistent: β̂_{0,MM} →^p β₀, again thanks to the Continuous Mapping Theorem.
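A small added Python sketch (with simulated data and arbitrary "true" parameter values, since none are given in the slides) illustrates this consistency: the MM estimates approach the chosen parameters as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 0.5  # illustrative "true" parameters (an assumption)

for n in (50, 500, 5000, 50000):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)          # error term with E[eps] = 0
    y = beta0 + beta1 * x + eps       # bivariate linear CEF plus noise
    # MM estimator of the slope: sample covariance over sample variance of X
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(f"N = {n:>6}: b1 = {b1:.4f}, b0 = {b0:.4f}")
```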
Consistency of the Method of Moments
Theorem 8
Consistency of the Method of Moments. An estimator defined as the solution θ̂_MM of a set of sample moments

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) = 0

is consistent for the set of parameters θ₀ that solves the corresponding population moments E[m(x_i; θ)] = 0, if such a solution exists (i.e. if the estimation problem is well defined).
Proof.
(Heuristic.) By some applicable Law of Large Numbers:

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) →^p E[m(x_i; θ̂_MM)] = 0

where the equality to 0 follows by definition of the MM estimator, which is maintained throughout the sequence as N → ∞. As by hypothesis the zero moment conditions have only one admissible solution, at the probability limit it is plim θ̂_MM = θ₀.
Consistency of Maximum Likelihood (1/3)
Theorem 9
Consistency of Maximum Likelihood Estimators. In a random sample, an estimator θ̂_MLE which is defined as the maximizer of a log-likelihood function as per

θ̂_MLE = argmax_{θ∈Θ} Σ_{i=1}^N log f_{x_i}(θ | x_i)

converges in probability to the parameter set θ₀ that maximizes the corresponding population moment function.

θ₀ = argmax_{θ∈Θ} E[log f_x(x; θ)]

If such a maximum exists, by the likelihood principle it corresponds to the true parameter of the distribution under analysis.
Proof.
(Continues. . . )
Consistency of Maximum Likelihood (2/3)
Theorem 9
Proof.
(Continued.) (Heuristic.) By the Weak Law of Large Numbers, for any θ ∈ Θ, including θ̂_MLE and θ₀:

(1/N) Σ_{i=1}^N log f_x(x_i; θ) →^p E[log f_x(x; θ)]

while by the definition of MLE the following holds for all N ∈ ℕ.

(1/N) Σ_{i=1}^N log f_x(x_i; θ̂_MLE) ≥ (1/N) Σ_{i=1}^N log f_x(x_i; θ₀) →^p E[log f_x(x; θ₀)]

Moreover, since θ₀ maximizes the expected logarithmic p.d.f. or p.m.f. in the population, the following holds too.

lim_{N→∞} P(E[log f_x(x; θ₀)] ≥ E[log f_x(x; θ̂_MLE)]) = 1

(Continues. . . )
Consistency of Maximum Likelihood (3/3)

Theorem 9
Proof.
(Continued.) All these facts can be simultaneously reconciled only if, at the limit:

(1/N) Σ_{i=1}^N log f_x(x_i; θ̂_MLE) →^p E[log f_x(x; θ̂_MLE)]

while, at the same time:

E[log f_x(x; θ̂_MLE)] = E[log f_x(x; θ₀)]

hence, at the limit it follows that plim θ̂_MLE = θ₀ by the Continuous Mapping Theorem.
Convergence to random vectors?
• All convergence concepts that were discussed so far concern convergence to a point or interval in R^K.

• What if interest falls on convergence to a random vector?

• Consider the expression:

x_N →^p x

which can be read as "the random sequence x_N converges in probability to the random vector x," in the sense that:

lim_{N→∞} P(‖x_N − x‖ > δ) = 0

• Is that enough to guarantee that at the probability limit, the distribution (and moments) of x_N and x coincide?
Convergence in distribution
The answer to the question is “no:” such a result is only obtained when
the following stronger concept can be applied.

Definition 8
Convergence in Distribution. Consider:
• a sequence of random vectors x_N, each element of which has a cumulative distribution function F_{x_N}(·),
• and a random vector x with cumulative distribution function F_x(·).
The random sequence x_N is said to converge in distribution to x if:

lim_{N→∞} |F_{x_N}(x) − F_x(x)| = 0

at all continuity points x ∈ X belonging to the support of x. This is usually expressed with the following formalism.

x_N →^d x
Limiting distribution, and discussion

The distribution Fx (x) from the definition takes the following name.

Definition 9
Limiting Distribution. If x_N →^d x, that is, some random sequence x_N converges in distribution to a random vector x, then F_x(x) is said to be the limiting distribution of x_N.

Observe the following.

• The definition of convergence in distribution indicates that the probabilistic behavior of x_N and x becomes increasingly close as N grows, and eventually coincides;
• This is a stronger concept than convergence in probability to a random vector, which implies that x_N and x deliver increasingly close realizations as N grows, even if not necessarily with the same probabilities over the support.
Student’s t convergence to the normal (1/2)
Observation 1
Asymptotics of Student's t-distribution. Consider a random variable that follows Student's t-distribution with parameter ν, that is, X ∼ T(ν). As ν → ∞, the probability distribution of X tends to that of the standard normal distribution, i.e. lim_{ν→∞} X = Z ∼ N(0, 1).
Proof.
Taking the limit of the probability density function of the Student's t-distribution as ν → ∞:

lim_{ν→∞} [1/(√ν B(1/2, ν/2))] (1 + x²/ν)^{−(ν+1)/2} = (1/√(2π)) exp(−x²/2)

since lim_{ν→∞} √ν B(1/2, ν/2) = √(2π) by the properties of the Beta function, while:

lim_{ν→∞} (1 + x²/ν)^{−(ν+1)/2} = exp(−x²/2)

by more standard arguments.
Student’s t convergence to the normal (2/2)

This result was already anticipated in Lecture 2. It is worthwhile to revisit the graphical intuition in a different form.

[Plot of the c.d.f. F_X(x) of Student's t-distribution for ν = 1, ν = 3, and ν → ∞ (the standard normal limit), over x ∈ [−5, 5].]
Convergence of t-statistics
• Consider a random sample which is drawn from a normally distributed random variable X ∼ N(μ, σ²).

• As analyzed in Lecture 4, the t-statistic follows a t-distribution with N − 1 degrees of freedom.

t_N = √N (X̄_N − μ)/S_N ∼ T(N − 1)

• If seen as a random sequence, the t-statistic thus converges in distribution to the standard normal.

t_N →^d N(0, 1)

• Hence, with large N one can very reliably perform inference using t-statistics evaluated against the standard normal.
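As a quick numerical check (an added sketch, not from the slides), the following compares critical values of the t-distribution with those of the standard normal as the degrees of freedom grow; scipy.stats is assumed to be available for the quantile functions.

```python
from scipy.stats import t, norm

z975 = norm.ppf(0.975)  # standard normal 97.5% quantile, about 1.96
for df in (5, 30, 100, 1000):
    # the t critical value approaches the normal one as df = N - 1 grows
    print(f"df = {df:>5}: t_0.975 = {t.ppf(0.975, df):.4f}  vs  z_0.975 = {z975:.4f}")
```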
Snedecor’s F convergence to the chi-squared
Observation 2
Asymptotics of Snedecor’s F-distribution. Consider a random
variable that follows Snedecor’s F-distribution having parameters ν1
and ν2 , X ∼ F (ν1 , ν2 ). As ν2 → ∞, the probability distribution of
W = ν1 X tends to that of a chi-squared distribution with parameter
ν1 , i.e. limν2 →∞ ν1 X = W ∼ χ2 (ν1 ).
Proof.
After deriving the p.d.f. of W = ν₁X, take its limit as ν₂ → ∞:

lim_{ν₂→∞} f_W(w) = lim_{ν₂→∞} [1/B(ν₁/2, ν₂/2)] (1/ν₂)^{ν₁/2} w^{ν₁/2 − 1} (1 + w/ν₂)^{−(ν₁+ν₂)/2}
  = lim_{ν₂→∞} [Γ((ν₁+ν₂)/2)/(Γ(ν₂/2) Γ(ν₁/2))] (ν₂ + w)^{−ν₁/2} w^{ν₁/2 − 1} (1 + w/ν₂)^{−ν₂/2}
  = [1/(Γ(ν₁/2) · 2^{ν₁/2})] w^{ν₁/2 − 1} exp(−w/2)

where lim_{ν₂→∞} Γ((ν₁+ν₂)/2)/[Γ(ν₂/2) (ν₂ + w)^{ν₁/2}] = 2^{−ν₁/2} derives from the properties of the Gamma function.
Convergence of Hotelling’s t-squared statistics
• Recall Hotelling's rescaled t-squared statistic.

[(N − K)/(K(N − 1))] t²_N = [(N − K) N/(K(N − 1))] (x̄ − μ)^T S^{−1} (x̄ − μ) ∼ F(K, N − K)

For a given N, this statistic follows the F-distribution with paired degrees of freedom K and N − K.

• Per Observation 2, Hotelling's t-squared statistic converges in distribution to a chi-squared distribution with K degrees of freedom:

t²_N →^d χ²_K

note that the term (N − K)/(N − 1) converges to one as N → ∞.

• As in the univariate case, this result facilitates statistical inference in multivariate settings.
Gamma convergence to the normal
Observation 3
Asymptotics of the Gamma distribution. Consider a random variable that follows the Gamma distribution with parameters α and β, X ∼ Γ(α, β). Let μ = α/β as well as σ² = α/β². As α → ∞, the probability distribution of X tends to that of a normal distribution with parameters μ and σ², i.e. lim_{α→∞} X ∼ N(μ, σ²).
Proof.
Define the random variable Z = (X − μ)/σ = (β/√α) X − √α; by the properties of m.g.f.s it is:

M_Z(t) = exp(−√α t) · M_X((β/√α) t) = exp(−√α t) (1 − t/√α)^{−α}

and after some manipulation, the limit as α → ∞ gives:

lim_{α→∞} M_Z(t) = lim_{α→∞} exp(−√α t) (1 − t/√α)^{−α} = exp(t²/2)

showing that at the limit, Z ∼ N(0, 1) and thus X ∼ N(μ, σ²).
Mean convergence in exponential samples
• Recall that in a random sample drawn from X ∼ Exp(λ) the sample mean is Gamma-distributed, X̄_N ∼ Γ(N, N/λ).

• Thus, by Observation 3, the following holds.

√N (X̄_N − λ) →^d N(0, λ²)

• This statement is interpreted in the sense that for a fixed value of N:

X̄_N ∼^A N(λ, λ²/N)

where A stands for "asymptotic" (observe that, by the definition of convergence in distribution, N cannot show up in the expression of a limiting distribution).

• This is a particular case of the Central Limit Theorem, one that can help inference about the exponential distribution.
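A small added numerical sketch (assuming the scale parameterization with E[X] = λ, as implied above) compares the exact Gamma sampling distribution of X̄_N with its normal approximation; scipy.stats is assumed to supply both distributions.

```python
import numpy as np
from scipy.stats import gamma, norm

lam = 2.0  # illustrative value of lambda, so E[X] = 2 and Var[X] = 4

for n in (5, 50, 500):
    # exact: sample mean ~ Gamma(shape=N, rate=N/lambda), i.e. scale = lambda/N
    exact = gamma(a=n, scale=lam / n)
    approx = norm(loc=lam, scale=lam / np.sqrt(n))  # asymptotic N(lambda, lambda^2/N)
    q = lam + lam / np.sqrt(n)  # a point one "asymptotic sd" above the mean
    print(f"N = {n:>3}: exact P(mean <= q) = {exact.cdf(q):.4f}, "
          f"normal approx = {approx.cdf(q):.4f}")
```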
Continuous Mapping Theorem, continued
The Continuous Mapping Theorem also applies to the concept of convergence in distribution.

Theorem 10
Continuous Mapping Theorem (convergence in distribution). Under the hypotheses of Theorem 4:

x_N →^d x  ⇒  g(x_N) →^d g(x)

that is, a random sequence obtained from the application of a transformation g(·) to some original random sequence x_N converges in distribution to the distribution resulting from applying the transformation g(·) to the random vector x associated with the limiting distribution of x_N.

The proof of this statement is omitted: it involves advanced measure theory. This version of the continuous mapping theorem is important, as it allows one to prove some properties of random sequences – which are exploited in statistics and econometrics – that are presented next.
Slutskij’s Theorem (1/2)
Theorem 11
Slutskij's Theorem. Consider any two (scalar) random sequences X_N and Y_N such that:

X_N →^d X
Y_N →^p c

that is, X_N converges in distribution to that of the random variable X, while Y_N converges in probability to a constant c. Then, the following holds.

(X_N + Y_N) →^d X + c
X_N Y_N →^d cX
X_N / Y_N →^d X/c   if c ≠ 0

Proof.
(Continues. . . )
Slutskij’s Theorem (2/2)

Theorem 11
Proof.
(Continued.) Recognize that, as Y_N →^p c, Y_N has a degenerate limiting distribution, and the (vector-valued) random sequence (X_N, Y_N) converges in distribution to that of the random vector (X, c). All the results above follow, therefore, from applying the Continuous Mapping Theorem to three given continuous functions of X_N and Y_N.

Corollary: Cramér-Wold Device. Given a random sequence x_N and a constant vector a of the same dimension:

x_N →^d x  ⇒  a^T x_N →^d a^T x

that is, if a vectorial random sequence has a limiting distribution, any linear combination of its elements converges in distribution to the distribution of the corresponding "limiting" linear combination.
The Extreme Value Theorem (1/4)
It is worth briefly sketching here the central result of extreme value theory, that is, the asymptotic theory of order statistics.

Theorem 12
Extreme Value Theorem (Fisher-Tippett-Gnedenko). Given a random (i.i.d.) sample (X₁, . . . , X_N), if a convergence in distribution result of the kind

(X_(N) − b_N)/a_N →^d W

can be established – where X_(N) is the maximum order statistic while a_N > 0 and b_N are sequences of real constants – then:

W ∼ GEV(0, 1, ξ)

for some real ξ. That is, the limiting distribution of the "normalized" maximum is some standardized type of the Generalized Extreme Value distribution.
Proof.
(Outline.) The extended proof is quite elaborate. (Continues. . . )
The Extreme Value Theorem (2/4)
Theorem 12
Proof.
(Continued.) The objective is to show that, given a random variable X from which the random sample is drawn, for all the points x ∈ X in its support where the distribution F_X(x) is continuous:

lim_{N→∞} [F_X(a_N x + b_N)]^N = exp(−(1 + ξx)^{−1/ξ})

where the left-hand side is the limit of the cumulative distribution of the standardized maximum, and the right-hand side is the expression of the cumulative standardized GEV distribution.
By taking the logarithm of this expression, the above is:

lim_{N→∞} N log F_X(a_N x + b_N) = −(1 + ξx)^{−1/ξ}

showing that F_X(a_N x + b_N) → 1 as N → ∞. (Continues. . . )


The Extreme Value Theorem (3/4)
Theorem 12
Proof.
(Continued.) Since −log(x) ≈ 1 − x for any given x close to 1, the above expression approximates the following.

lim_{N→∞} 1/{N [1 − F_X(a_N x + b_N)]} = 1/(1 + ξx)^{−1/ξ}

The rest of the proof is mathematically involved, and it proceeds to:

i. show that the right-hand side of the above expression is the only admissible limit; and

ii. establish conditions under which ξ = 0 (Type I GEV, Gumbel), ξ > 0 (Type II GEV, Fréchet), and ξ < 0 (Type III GEV, reverse Weibull).

In this context, ξ = 0 is interpreted as a limit case (see Lecture 2).
The Extreme Value Theorem (4/4)
The Extreme Value Theorem has the following implications.

1. A standardized maximum does not necessarily always converge to a GEV distribution; the Theorem states that if it converges, the limiting distribution is GEV.

2. By defining Y = −X, for every N it clearly holds that:

Y_(1) = −X_(N)

which helps identify the distribution of the minimum if the


maximum’s is known (e.g. reverse vs. traditional Weibull).

3. The technical conditions in the proof that help identify the


GEV Type are often useful. For example, one can establish
that in sampling from the normal distribution, maxima are
Gumbel-distributed.
Central Limit Theorems
• Convergence in distribution is a useful concept, but it is of
limited practical use in inference if the limiting distribution
of a statistic cannot be derived.

• In this regard, Central Limit Theorems are paramount:


they prove that some specific functions of sample means
converge in distribution to the (multivariate) normal.

• This is even more important as the result does not depend


upon the underlying distribution that generates the sample.

• This result helps conduct inference in a variety of settings, including – as discussed later – estimation results from MM and MLE frameworks alike.

• Once again (as in the Law of Large Numbers case), various


versions of the result exist, for different sets of assumptions.
Classic Central Limit Theorem (1/5)

Theorem 13
Central Limit Theorem (Lindeberg and Lévy's). Consider the sample mean x̄_N associated with a random (i.i.d.) sample drawn from the distribution of a random vector x whose mean and variance are both finite: E[x] < ∞ and Var[x] < ∞. The random sequence defined as the centered sample mean multiplied by √N converges in distribution to a multivariate normal distribution.

√N ((1/N) Σ_{i=1}^N x_i − E[x]) →^d N(0, Var[x])

Proof.
(Sketched.) As in the earlier proof of the Weak Law of Large Numbers (Theorem 5), this one makes use of moment-generating functions; in order to be fully general, characteristic functions should be used instead. (Continues. . . )
Classic Central Limit Theorem (2/5)

Theorem 13
Proof.
(Continued.) Consider the standardized random vector

z = [Var[x]]^{−1/2} (x − E[x])

where the matrix [Var[x]]^{−1/2} and its inverse [Var[x]]^{1/2} satisfy:

[Var[x]]^{−1/2} Var[x] [Var[x]]^{−1/2} = I

as well as the following.

[Var[x]]^{1/2} [Var[x]]^{1/2} = Var[x]

Such a matrix can always be constructed because variance-covariance matrices are positive semi-definite. (Continues. . . )
Classic Central Limit Theorem (3/5)
Theorem 13
Proof.
(Continued.) The objective of the proof is to show that:

z̄̄_N ≡ (1/√N) Σ_{i=1}^N z_i →^d N(0, I)

that is, the random sequence z̄̄_N defined above converges in distribution to a standard multivariate normal distribution. If this holds, the main result also follows thanks to the (linear) properties of the multivariate normal distribution, per the following relationship.

√N (x̄_N − E[x]) = √N ((1/N) Σ_{i=1}^N x_i − E[x]) = [Var[x]]^{1/2} ((1/√N) Σ_{i=1}^N z_i)

(Continues. . . )
Classic Central Limit Theorem (4/5)

Theorem 13
Proof.
(Continued.) To show this, express the m.g.f. of z̄̄_N, for fixed N, as:

M_{z̄̄_N}(t) = E[exp(t^T z̄̄_N)]
            = E[exp((1/√N) Σ_{i=1}^N t^T z_i)]
            = Π_{i=1}^N E[exp((1/√N) t^T z_i)]
            = [M_z(t/√N)]^N

by a derivation analogous to the one in the proof of the Weak Law of Large Numbers. (Continues. . . )
Classic Central Limit Theorem (5/5)
Theorem 13
Proof.
(Continued.) Just like in that proof, apply here a Taylor expansion of the above expression around t₀ = 0, but now of second degree:

M_{z̄̄_N}(t) = [1 + t^T E[z]/√N + t^T E[zz^T] t/(2N) + o(t^T t/(2N))]^N
            = [1 + t^T t/(2N) + o(t^T t/(2N))]^N

where the second line exploits the fact that E[z] = 0 and E[zz^T] = I by construction of z. Taking the limit for N → ∞ now gives:

lim_{N→∞} M_{z̄̄_N}(t) = exp(t^T t/2)

which is precisely the m.g.f. of the standard multivariate normal, as it was postulated.
Use of the Central Limit Theorem
• How to "use" the Central Limit Theorem? Note that for a given N, the result can be restated as follows.

x̄_N = (1/N) Σ_{i=1}^N x_i ∼^A N(E[x], (1/N) Var[x])

• The sample mean is "approximately" normally distributed, with a variance-covariance matrix decreasing in the sample size.

• The notation ∼^A indicates here that the normal distribution in question, which is called the asymptotic distribution, is approximate and is valid for a fixed N, instead of being a "limiting" distribution.

• Recall that limiting distributions cannot be functions of N.


CLT: Illustrative simulations (1/5)
• Once again, simulations reveal themselves useful.

• These are based on exactly the same random draws from the Poisson distribution with parameter λ = 4.

• The four histograms now bin 800 values calculated as:

z̄̄_N = √N · (x̄_N − 4)/2

where x̄_N is a realized mean from the previous simulation. Across histograms the size of the sample N varies as before.

• Note the standardization in the construction of z̄̄_N: in this Poisson distribution, both the mean and the variance equal 4.

• An overlaid standard normal p.d.f. helps show how the sampling distribution of this statistic resembles the normal increasingly well as N increases.
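The following Python sketch (an added illustration, not from the original slides) reproduces this standardization exercise and compares the simulated distribution of the standardized means with the standard normal via a few empirical quantiles.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n_samples = 4.0, 800  # Poisson mean (= variance) and number of samples

for n in (1, 10, 100, 1000):
    means = rng.poisson(lam=lam, size=(n_samples, n)).mean(axis=1)
    z = np.sqrt(n) * (means - lam) / np.sqrt(lam)  # standardized sample means
    # empirical quantiles should approach the N(0,1) values (-1.645, 0, 1.645)
    q05, q50, q95 = np.quantile(z, [0.05, 0.5, 0.95])
    print(f"N = {n:>4}: q05 = {q05:+.2f}, median = {q50:+.2f}, q95 = {q95:+.2f}")
```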
CLT: Illustrative simulations (2/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 1.]
CLT: Illustrative simulations (3/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 10.]
CLT: Illustrative simulations (4/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 100.]
CLT: Illustrative simulations (5/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 1000.]
More general Central Limit Theorems
• As with the Laws of Large Numbers, more general Central
Limit Theorems with less restrictive assumptions exist.

• Two famous versions, which both allow for i.n.i.d. data, are
presented next without proof.

• Of these two, the one that is named after A. Ljapunov is of


particular interest, as it is based on a condition which often
shows up in some technical econometric papers.

• The so-called Ljapunov condition requires that, in a sample, at least some cross-observation moments of order "slightly" higher than two (as detailed later) are finite.

• Even in this case, some more general versions that allow for
weakly dependent observations also exist.
Lindeberg-Feller Central Limit Theorem
Theorem 14
Central Limit Theorem (Lindeberg and Feller's). Consider a non-random (i.n.i.d.) sample where the random vectors x_i that generate it have possibly heterogeneous finite means E[x_i] < ∞, variances Var[x_i] < ∞, and all mixed third moments are finite too. If:

lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−1} Var[x_i] = 0

then it holds that:

(1/√N) Σ_{i=1}^N (x_i − E[x_i]) →^d N(0, Var[x])

where:

(1/N) Σ_{i=1}^N Var[x_i] →^p Var[x]

that is, the positive semi-definite matrix Var[x] is the probability limit of the observations' average variance.
Ljapunov’s Central Limit Theorem
Theorem 15
Central Limit Theorem (Ljapunov's). Consider a non-random (i.n.i.d.) sample where the random vectors x_i that generate it have possibly heterogeneous finite moments E[x_i] < ∞, Var[x_i] < ∞. If:

lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−(1+δ/2)} Σ_{i=1}^N E[|x_i − E[x_i]|^{2+δ}] = 0

for some δ > 0, then:

(1/√N) Σ_{i=1}^N (x_i − E[x_i]) →^d N(0, Var[x])

where Var[x] is the variances' probability limit as in Theorem 14.

Note: in econometric applications with E[x_i] = 0 for i = 1, . . . , N, the "Ljapunov condition" specializes, for some δ > 0, to:

E[|X_{ik} X_{iℓ}|^{1+δ}] < ∞

for any two elements k, ℓ = 1, . . . , K of x and for all observations i.
Asymptotic normality & linear regression (1/5)
To show how the Central Limit Theorem can help statistical inference in practice, consider again the estimator of the slope parameter in the bivariate regression model. Rework it as follows.

β̂_{1,MM} = [Σ_{i=1}^N (X_i − X̄) Y_i] / [Σ_{i=1}^N (X_i − X̄)²]
          = β₁ [Σ_{i=1}^N (X_i − X̄) X_i] / [Σ_{i=1}^N (X_i − X̄)²] + [Σ_{i=1}^N (X_i − X̄) ε_i] / [Σ_{i=1}^N (X_i − X̄)²]
          = β₁ + [(1/N) Σ_{i=1}^N (X_i − X̄) ε_i] / [(1/N) Σ_{i=1}^N (X_i − X̄)²]

where

ε_i ≡ Y_i − β₀ − β₁ X_i

is the error term of the model: the deviation between Y_i and the linear CEF, E[Y_i | X_i] = β₀ + β₁ X_i. Note that E[ε_i] = 0.
Asymptotic normality & linear regression (2/5)
Recall that in the bivariate linear regression model, the Law of Iterated Expectations gives E[X_i ε_i] = 0. This provides another avenue to demonstrate consistency of the MM estimator for β₁. In fact, by the Continuous Mapping Theorem:

(1/N) Σ_{i=1}^N (X_i − X̄) ε_i →^p E[X_i ε_i] − E[X_i] E[ε_i] = 0

(both terms on the right-hand side are zero), implying β̂_{1,MM} →^p β₁.

As the expression on the left-hand side above is a sample mean, under adequate assumptions about the sample some applicable Central Limit Theorem would imply the following.

(1/√N) Σ_{i=1}^N (X_i − X̄) ε_i →^d N(0, E[ε_i² (X_i − E[X_i])²])

Here the limiting variance takes this form because X̄ →^p E[X_i] at the probability limit.
Asymptotic normality & linear regression (3/5)
The limiting variance obtains as:

Var[(1/√N) Σ_{i=1}^N (X_i − E[X_i]) ε_i] = (1/N) Σ_{i=1}^N Var[(X_i − E[X_i]) ε_i] = E[ε_i² (X_i − E[X_i])²]

while in the more specialized case where the squared deviations of X_i and ε_i from their means are mutually independent, it is:

E[ε_i² (X_i − E[X_i])²] = E[ε_i²] E[(X_i − E[X_i])²] = σ²_ε · Var[X_i]

where σ²_ε ≡ E[ε_i²].

This latter case is the one where the conditional variance function of ε_i given X_i is actually a constant – a scenario usually called homoscedasticity (as opposed to heteroscedasticity, the general case). This is typical terminology in regression parlance.
Asymptotic normality & linear regression (4/5)
By the Cramér-Wold device and the following implication of the Continuous Mapping Theorem:

[(1/N) Σ_{i=1}^N (X_i − X̄)²]^{−1} →^p [Var[X_i]]^{−1}

these results together allow one to obtain the limiting distribution of the MM estimator as:

√N (β̂_{1,MM} − β₁) →^d N(0, E[ε_i² (X_i − E[X_i])²] / (Var[X_i])²)

and, for some given N, its asymptotic distribution as follows.

β̂_{1,MM} ∼^A N(β₁, (1/N) E[ε_i² (X_i − E[X_i])²] / (Var[X_i])²)
Asymptotic normality & linear regression (5/5)
In the specialized homoscedastic case, the limiting distribution of the estimator is:

√N (β̂_{1,MM} − β₁) →^d N(0, σ²_ε / Var[X_i])

and, for some given N, its asymptotic distribution is as follows.

β̂_{1,MM} ∼^A N(β₁, (1/N) σ²_ε / Var[X_i])

For them to be used in statistical inference, the results for both the heteroscedastic and homoscedastic cases require knowledge of the various components of the limiting variances. In general, these are unknown to researchers and must be estimated.

This is best discussed later after reviewing the application of the


Central Limit Theorem to general MM and MLE estimators.
The Delta Method (1/2)
Theorem 16
Delta Method. Suppose that some random sequence of dimension K – call it x_N – is asymptotically normal:

√N (x_N − c) →^d N(0, Υ)

for some K × 1 vector c and some K × K matrix Υ. In addition, consider some vector-valued function d(x): R^K → R^J. If the latter is continuously differentiable at c and the J × K Jacobian matrix

Δ ≡ (∂/∂x^T) d(c)

has full row rank J, the limiting distribution of d(x_N) is as follows.

√N (d(x_N) − d(c)) →^d N(0, ΔΥΔ^T)

Proof.
(Continues. . . )
The Delta Method (2/2)
Theorem 16
Proof.
(Continued.) From the mean value theorem it is:

d(x_N) = d(c) + [(∂/∂x^T) d(x̃_N)] (x_N − c)

where x̃_N is a convex combination of x_N and c. However, as x_N →^p c:

(∂/∂x^T) d(x̃_N) →^p (∂/∂x^T) d(c) = Δ

hence, at the probability limit:

√N (d(x_N) − d(c)) →^p Δ · √N (x_N − c)

which, by the given hypotheses, implies the result.

This result is extremely useful to derive the asymptotic distribution of estimators that relate to sample means, but are not themselves sample means.
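To make the result concrete, here is an added Python sketch (assuming an exponential population with mean μ = 2, an arbitrary choice): the sample mean is asymptotically normal, and the Delta Method gives the asymptotic variance of g(X̄_N) = 1/X̄_N as (g′(μ))² Var[X] = 1/μ² in this case.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 2.0                   # assumed population mean of Exp; Var[X] = mu**2
g = lambda x: 1.0 / x      # transformation applied to the sample mean
delta_var = (1.0 / mu**2) ** 2 * mu**2   # (g'(mu))^2 * Var[X] = 1 / mu**2

n, reps = 2000, 5000
means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(means) - g(mu))   # sqrt(N) * (g(sample mean) - g(mu))
print(f"simulated variance: {stat.var(ddof=1):.4f}, delta method: {delta_var:.4f}")
```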
Method of moments asymptotic normality (1/4)
Theorem 17
Asymptotically, MM estimators are normally distributed. An estimator θ̂_MM defined as the solution of a set of sample moments

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) = 0

is asymptotically normal. If the sample is random and the moment conditions are differentiable, the limiting distribution is:

√N (θ̂_MM − θ₀) →^d N(0, M₀ Υ₀ M₀^T)

so long as the following matrices exist and are finite and nonsingular.

Υ₀ = Var[m(x_i; θ₀)]    M₀ ≡ {E[(∂/∂θ^T) m(x_i; θ₀)]}^{−1}

Proof.
(Continues. . . )
Method of moments asymptotic normality (2/4)
Theorem 17
Proof.
(Continued.) The proof applies the same logic as the Delta Method. By the mean value theorem, the sample moment conditions become:

0 = (1/N) Σ_{i=1}^N m(x_i; θ̂_MM)
  = (1/N) Σ_{i=1}^N m(x_i; θ₀) + [(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N)] (θ̂_MM − θ₀)

where the expression in the first line equals zero by construction of all MM estimators. After multiplying both sides by √N and some manipulation, the above expression is rendered as follows.

√N (θ̂_MM − θ₀) = −[(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N)]^{−1} (1/√N) Σ_{i=1}^N m(x_i; θ₀)

(Continues. . . )
Method of moments asymptotic normality (3/4)
Theorem 17
Proof.
(Continued.) Since this is a random sample:
1. by a suitable Central Limit Theorem:

−(1/√N) Σ_{i=1}^N m(x_i; θ₀) →^d N(0, Var[m(x_i; θ₀)])

given that E[m(x_i; θ₀)] = 0 by hypothesis;
2. while by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N) →^p E[(∂/∂θ^T) m(x_i; θ₀)]

since θ̃_N →^p θ₀ by consistency of the estimator (at the limit, θ̃_N, θ̂_MM and θ₀ all coincide).

(Continues. . . )
Method of moments asymptotic normality (4/4)

Theorem 17
Proof.
(Continued.) These intermediate results are combined via the Continuous Mapping Theorem, Slutskij's Theorem, and the Cramér-Wold device so as to imply the statement. Therefore, for a fixed N the asymptotic distribution is:

θ̂_MM ∼^A N(θ₀, (1/N) M₀ Υ₀ M₀^T)

which concludes the proof.

This expression of the asymptotic variance-covariance is typically unknown and must thus be estimated. The general approach to address this issue is shown later alongside the MLE case.
Maximum likelihood asymptotic normality (1/5)
Theorem 18
Asymptotically, ML estimators are normally distributed and they attain the Cramér-Rao bound. An estimator θ̂_MLE defined as the maximizer of a log-likelihood function as per

θ̂_MLE = argmax_{θ∈Θ} Σ_{i=1}^N log f_{x_i}(θ | x_i)

is asymptotically normal. Define the following 'regularity conditions:'
i. the problem is well defined, i.e. θ₀ is the maximizer of the population expression E[log f_x(x_i; θ)] – where f_x(x_i; θ) is the probability mass or density function that generates the data;
ii. f_x(x_i; θ) is three times continuously differentiable and its derivatives are bounded in absolute value;
iii. the support of x_i does not depend on θ, so that derivatives with respect to θ can pass at least twice through an integral defined over f_x(x_i; θ).
(Continues. . . )
Maximum likelihood asymptotic normality (2/5)
Theorem 18
(Continued.) If the sample is random and the regularity conditions hold, then the limiting distribution is expressible as:

√N (θ̂_MLE − θ₀) →^d N(0, [I(θ₀)]^{−1})

where I(θ₀) – written without the N subscript – is the following "single-observation" information matrix evaluated at θ₀.

I(θ₀) ≡ E[((∂/∂θ) log f_x(x_i; θ₀)) ((∂/∂θ) log f_x(x_i; θ₀))^T]
      = −E[(∂²/∂θ∂θ^T) log f_x(x_i; θ₀)]

Consequently, θ̂_MLE asymptotically attains the Cramér-Rao bound.

Proof.
(Continues. . . )
Maximum likelihood asymptotic normality (3/5)
Theorem 18
Proof.
(Continued.) The proof proceeds similarly to the MM case. By the mean value theorem, the MLE First Order Conditions can be written as:

0 = (1/N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ̂_MLE)
  = (1/N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀) + [(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N)] (θ̂_MLE − θ₀)

where the entire expression is zero by definition of MLE. Once again:

√N (θ̂_MLE − θ₀) = −[(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N)]^{−1} (1/√N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀)

but here additional simplifications are possible. (Continues. . . )


Maximum likelihood asymptotic normality (4/5)
Theorem 18
Proof.
(Continued.) Thanks to the Information Matrix Equality, under the regularity conditions the following holds.
1. A suitable Central Limit Theorem implies that:

−(1/√N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀) →^d N(0, I(θ₀))

since θ₀ maximizes E[log f_x(x_i; θ₀)], so that E[(∂/∂θ) log f_x(x_i; θ₀)] = 0;
2. while by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N) →^p −I(θ₀)

again since θ̃_N →^p θ₀ by consistency of MLE as per Theorem 9.
(Continues. . . )
Maximum likelihood asymptotic normality (5/5)
Theorem 18
Proof.
(Continued.) Here, the application of the Delta Method results in a simplified expression of the limiting variance, as given in the statement of the Theorem. Collecting terms, for a fixed N the asymptotic distribution is:

θ̂_MLE ∼^A N(θ₀, [I_N(θ₀)]^{−1})

where I_N(θ₀) is the grand (sample) information matrix for a fixed N. Since the MLE is consistent, at the probability limit its bias is zero; hence the estimator attains the Cramér-Rao bound.

Some comments are in order here.


• Asymptotic attainment of the Cramér-Rao bound is a desirable
property of MLE (alongside invariance – see Lecture 5).
• Yet it hinges on correctly assuming the underlying distribution. If
this is incorrect, the MLE can fail utterly (be inconsistent).
• By contrast, the MM is more robust: there is a trade-off here!
Estimating asymptotic variance-covariances
• The above results develop expressions for both limiting and
asymptotic variance-covariances of MM and ML estimators.

• However, the elements inside such expressions, like M0 , Υ0


and I (θ0 ), are generally unknown ex-ante.

• To use these results in statistical inference it is necessary to


estimate these quantities.

• By the analogy principle, one could use sample analogues


as consistent estimators of population variance-covariances.
• Example: if √N (X̄_N − μ) →^d N(0, σ²), then ((N − 1)/N) S²_N →^p σ².

• A more elaborate example on the bivariate linear regression


model is developed next.

• General estimators for the MM and MLE cases then follow.


Asymptotic inference in linear regression (1/2)
Suppose one wants to perform a two-sided test of hypothesis on the bivariate linear regression slope parameter β₁.

H₀: β₁ = C    H₁: β₁ ≠ C

If C = 0, this is a so-called significance test of the regression: a test of whether the explanatory variable X_i affects the mean of Y_i in a conditional (CEF) sense.

In small samples this test may be problematic and require extra assumptions. In an asymptotic environment, the earlier analysis of the model allows one to establish the following property (under H₀).

t_N = √N (β̂_{1,MM} − C)/S_{β₁} →^d N(0, 1)

Here t_N is a t-statistic and S_{β₁} is a suitable consistent estimator of the limiting standard deviation of √N (β̂_{1,MM} − β₁).
Asymptotic inference in linear regression (2/2)
The expression of S_{β₁} differs across assumptions. In the general heteroscedastic case, its squared version is:

S²_{β₁} = [Σ_{i=1}^N (Y_i − β̂_{0,MM} − β̂_{1,MM} X_i)² (X_i − X̄)²] / [N^{−1} (Σ_{i=1}^N (X_i − X̄)²)²]

while in the more restricted homoscedastic case S²_{β₁} is as follows.

S²_{β₁} = [Σ_{i=1}^N (Y_i − β̂_{0,MM} − β̂_{1,MM} X_i)²] / [Σ_{i=1}^N (X_i − X̄)²]

The quantity S_{β₁}/√N is called the standard error of β̂_{1,MM}. A proper confidence interval for β₁ would be as follows.

β₁ ∈ (β̂_{1,MM} − z*_{α/2} · S_{β₁}/√N , β̂_{1,MM} + z*_{α/2} · S_{β₁}/√N)
Estimating MM asymptotic variance-covariances
In the MM case, M̂_N Υ̂_N M̂_N^T / N is a consistent estimator of the asymptotic variance-covariance in random samples, where:

M̂_N ≡ [(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̂_MM)]^{−1} →^p M₀

is a consistent estimator of M₀ (by some Law of Large Numbers and the Continuous Mapping Theorem), while

Υ̂_N ≡ (1/N) Σ_{i=1}^N [m(x_i; θ̂_MM)] [m(x_i; θ̂_MM)]^T →^p Υ₀

is also a consistent estimator of the variance of the zero moment conditions by some applicable Law of Large Numbers, since in a random sample the following holds.

Υ₀ = Var[m(x_i; θ₀)] = E[(m(x_i; θ₀)) (m(x_i; θ₀))^T]

These estimators also work under general i.n.i.d. assumptions.
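As an added toy illustration, take the single moment condition m(x; θ) = x − θ (so that θ̂_MM is just the sample mean, an assumed example): then M̂_N = −1, Υ̂_N is the (uncorrected) sample variance, and the sandwich M̂_N Υ̂_N M̂_N^T / N reduces to the familiar variance estimate of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=1.5, size=1000)   # any population works here

theta_hat = x.mean()                             # solves (1/N) sum(x_i - theta) = 0

# sandwich components for m(x; theta) = x - theta
M_hat = 1.0 / np.mean(-np.ones_like(x))          # [ (1/N) sum dm/dtheta ]^{-1} = -1
Ups_hat = np.mean((x - theta_hat) ** 2)          # (1/N) sum m(x_i; theta_hat)^2
avar = M_hat * Ups_hat * M_hat / len(x)          # M_hat * Ups_hat * M_hat^T / N

print(f"theta_hat = {theta_hat:.4f}, asymptotic SE = {np.sqrt(avar):.4f}")
print(f"classic SE of the mean = {x.std(ddof=1) / np.sqrt(len(x)):.4f}")
```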


Estimating ML asymptotic variance-covariances
In MLE, there are two ways to estimate the information matrix, corresponding to the two sides of the Information Matrix Equality. The first option is based on the Hessian of the p.m.f. or p.d.f.:

Ĥ_N ≡ −(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̂_MLE) →^p I(θ₀)

while the second option exploits the "squared" score:

Ĵ_N ≡ (1/N) Σ_{i=1}^N [(∂/∂θ) log f_x(x_i; θ̂_MLE)] [(∂/∂θ) log f_x(x_i; θ̂_MLE)]^T

with Ĵ_N →^p I(θ₀).

The choice between Ĥ_N and Ĵ_N is based on convenience and is context-dependent. Observe how all these estimators (both MM and MLE) are evaluated at the consistent parameter estimates.
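As a final added sketch, consider the exponential distribution with rate θ (so log f(x; θ) = log θ − θx, an assumed example): the MLE is θ̂ = 1/X̄, and both Ĥ_N and Ĵ_N can be computed explicitly and compared.

```python
import numpy as np

rng = np.random.default_rng(11)
theta0, n = 1.5, 2000                       # assumed "true" rate and sample size
x = rng.exponential(scale=1 / theta0, size=n)

theta_hat = 1.0 / x.mean()                  # MLE of the exponential rate

# Hessian-based estimate: d^2 log f / d theta^2 = -1/theta^2 for every observation,
# so H_N = -(1/N) * sum(-1/theta_hat^2) = 1/theta_hat^2
H_N = 1.0 / theta_hat**2
# OPG estimate: average squared per-observation score, score_i = 1/theta - x_i
score = 1.0 / theta_hat - x
J_N = np.mean(score**2)

se_H = np.sqrt(1.0 / (n * H_N))             # asymptotic s.e. from either estimator
se_J = np.sqrt(1.0 / (n * J_N))
print(f"theta_hat = {theta_hat:.4f}, H_N = {H_N:.4f}, J_N = {J_N:.4f}")
print(f"SE (Hessian) = {se_H:.4f}, SE (OPG) = {se_J:.4f}")
```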
