
Asymptotic Analysis

Paolo Zacchia

Probability and Statistics

Lecture 6
Why asymptotics?
• This lecture characterizes probabilistic results for samples
in asymptotic settings: when the sample size N is large.

• The focus is on convergence results for selected statistics:


their value and distribution for large N .

• These results greatly facilitate estimation & inference when


exact results on sampling distributions are hard to obtain.

• The main objective is to characterize the behavior of known


estimators (MM and MLE) in asymptotic environments.

• This is achieved via the analysis of two fundamental results


about the sample mean x̄: the law of large numbers and
the central limit theorem.
Random sequences
To characterize asymptotic results it is necessary to adopt a notation
that helps express the dependence of these results on the sample size.

Definition 1
Random sequence. Any random vector expressed as an N-indexed sequence, written as x_N = (X_{1N}, . . . , X_{KN})^T, is a random sequence. In the univariate context (K = 1), one can write it simply as X_N.

The definition can also apply to sequences of random matrices of dimension J × K, which combine J vectorial sequences x_{jN} of length K for j = 1, . . . , J. One such matrix is denoted, for example, as follows.

X_N = (x_{1N}  x_{2N}  . . .  x_{JN})^T

Example: both the sample mean and the sample variance-covariance matrix,

x̄_N = (1/N) Σ_{i=1}^N x_i   and   S_N = (1/(N−1)) Σ_{i=1}^N (x_i − x̄_N)(x_i − x̄_N)^T,

are random sequences, as they are statistics that depend upon the sample size N. Their univariate versions are written as X̄_N and S²_N.
Boundedness in probability

The first asymptotic concept related to the idea of “convergence,” as


applied to random sequences, is defined next.

Definition 2
Boundedness in Probability. A sequence x_N of random vectors is bounded in probability if and only if, for any ε > 0, there exists some number δ_ε < ∞ and an integer N_ε such that

P(‖x_N‖ ≥ δ_ε) < ε   ∀N ≥ N_ε

which is also written as x_N = O_p(1) and read as "x_N is big p-oh one."

This is a desirable property of random sequences; however, it is not yet fully satisfactory, as it still allows the distribution of the random vectors to stay spread over an entire interval rather than concentrating around a point.
Convergence in probability
The following “convergence” concept is stronger than the previous.

Definition 3
Convergence in Probability. A sequence x_N of random vectors converges in probability to a constant vector c if

lim_{N→∞} P(‖x_N − c‖ > δ) = 0

for any real number δ > 0.

This definition formalizes the idea that as the sample size N grows in-
creasingly larger, the probability distribution of xN concentrates within
an increasingly smaller neighborhood of c.
Convergence in probability is usually denoted in one of two ways:

x_N →^p c    or    plim x_N = c.

Here the first type of notation (using →^p) is preferred.
Convergence implies boundedness in probability

Theorem 1
Convergent Random Sequences are also Bounded. If some sequence x_N of random vectors converges in probability to a constant c, that is x_N →^p c, then it is also bounded: x_N = O_p(1).
Proof.
By the definition of convergence in probability, for any ε > 0 and any δ > 0 there is always an integer N_ε such that

P(‖x_N − c‖ > δ) < ε   ∀N ≥ N_ε

and since ‖x_N‖ ≤ ‖x_N − c‖ + ‖c‖ by the triangle inequality, setting δ_ε = δ + ‖c‖ gives x_N = O_p(1).

Hence, while "boundedness" only requires the inequality to hold for some specific constant δ_ε once N is large enough, "convergence" must hold for any δ > 0.
Convergence of random to real sequences
If convergence in probability holds for c = 0, one can also write:

xN = op (1)

which is read as “xN is little p-oh one.” This notation helps develop
the following concept.

Definition 4
Convergence of Random Sequences to Real Sequences. Consider a random sequence x_N and a non-random sequence a_N of the same dimension K as x_N. Moreover, define the random sequence z_N = (Z_{1N}, . . . , Z_{KN})^T where Z_{kN} = X_{kN}/a_{kN} for k = 1, . . . , K and for N = 1, 2, . . .
1. If z_N = O_p(1), then x_N is said to be bounded in probability by a_N, which one can write as x_N = O_p(a_N).
2. If z_N = o_p(1), then x_N is said to converge in probability to a_N, which one can write as x_N = o_p(a_N).
Convergence in r-th Mean

The following asymptotic concept is even stronger than convergence in


probability.

Definition 5
Convergence in r-th Mean. A random sequence x_N is said to converge in r-th mean to a constant vector c if the following holds.

lim_{N→∞} E[‖x_N − c‖^r] = 0

In the special case with r = 2, this concept is known as Convergence in Quadratic Mean and is also expressed as follows.

x_N →^{qm} c

This particular kind of convergence is not as general as convergence in


probability, but it may be more convenient to work with, given that it
is based upon moments.
Convergence in lower means
Intuitively, if a random sequence converges in r-th mean, it should also converge in means of lower order than r.

Theorem 2
Convergence in Lower Means. A random sequence x_N that converges in r-th mean to some constant vector c also converges in s-th mean to c for s < r.
Proof.
The proof is based on Jensen's Inequality:

lim_{N→∞} E[‖x_N − c‖^s] = lim_{N→∞} E[(‖x_N − c‖^r)^{s/r}]
                        ≤ lim_{N→∞} {E[‖x_N − c‖^r]}^{s/r}
                        = 0

since lim_{N→∞} E[‖x_N − c‖^r] = 0.
Convergence in quadratic mean (1/2)
Theorem 3
Convergence in Quadratic Mean and Probability. If a random sequence x_N converges in r-th mean to a constant vector c for r ≥ 2 (that is, at least x_N →^{qm} c), then it also converges in probability to c.
Proof.
Define the (one-dimensional) nonnegative random sequence Q_N as:

Q_N = ‖x_N − c‖ = √((x_N − c)^T (x_N − c)) ∈ R₊

and notice that by Theorem 2 it must converge in first mean.

lim_{N→∞} E[Q_N] = lim_{N→∞} E[‖x_N − c‖] = 0

In addition, quadratic mean convergence implies the following.

lim_{N→∞} Var[Q_N] = lim_{N→∞} E[Q_N²] = lim_{N→∞} E[‖x_N − c‖²] = 0

(Continues. . . )
Convergence in quadratic mean (2/2)
Theorem 3
Proof.
(Continued.) At the same time, by Čebyšëv's Inequality:

P(|Q_N − E[Q_N]| > δ) ≤ Var[Q_N]/δ²

therefore, taking limits on both sides gives:

lim_{N→∞} P(‖x_N − c‖ > δ) = lim_{N→∞} P(|Q_N − E[Q_N]| > δ)
                          ≤ lim_{N→∞} Var[Q_N]/δ²
                          = 0

implying convergence in probability: x_N →^p c.

This result is useful to verify that in random samples drawn from a random vector x with finite variance Var[x] < ∞, the sample mean x̄_N converges in probability to the mean of the population, E[x].
Convergence in probability of the sample mean
In a random sample drawn from some random variable X:

lim_{N→∞} E[X̄_N] = lim_{N→∞} (N/N) E[X] = E[X]

and in addition, if Var[X] < ∞:

lim_{N→∞} E[(X̄_N − E[X̄_N])²] = lim_{N→∞} Var[X̄_N] = lim_{N→∞} Var[X]/N = 0

and therefore X̄_N →^{qm} E[X], which also implies X̄_N →^p E[X].

This generalizes to a multivariate environment: given a random sample drawn from a random vector x with Var[x] < ∞, it is:

x̄_N = (1/N) Σ_{i=1}^N x_i →^{qm} E[x]

which also implies convergence in probability, x̄_N →^p E[x].
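As a minimal added sketch (not part of the original slides), the following Python snippet checks numerically that the mean squared deviation of the sample mean from the population mean shrinks at rate 1/N, in line with quadratic-mean convergence; the exponential distribution with mean 1 is an arbitrary choice of population.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 1.0  # mean of Exp(1), an arbitrary example distribution

for n in (10, 100, 1000, 10000):
    # simulate 2000 independent samples of size n and compute their means
    means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)
    # E[(X_bar - E[X])^2] should be close to Var[X]/N = 1/N
    msd = np.mean((means - true_mean) ** 2)
    print(f"N = {n:>6}: mean squared deviation = {msd:.5f}, 1/N = {1/n:.5f}")
```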
Almost Sure Convergence

There is yet another, stronger notion of convergence.

Definition 6
Almost Sure Convergence. A sequence x_N of random vectors converges almost surely, or with probability one, to a constant vector c if it holds that:

P(lim_{N→∞} x_N = c) = 1

where lim_{N→∞} x_N is a random vector. This is also written as x_N →^{a.s.} c.

One can prove that "almost sure" convergence implies convergence in probability, but the converse is not true in general.
Convergence of random matrix sequences
• All concepts and results discussed so far also apply to random sequences that are matrix-valued.

• A matrix-valued random sequence X_N is said to converge in probability to some matrix C if:

lim_{N→∞} P(‖X_N − C‖ > δ) = 0

• (. . . where, for any matrix B, the norm is defined as follows.)

‖B‖ = √(tr(B^T B))

• Convergence in probability of random matrix sequences can be written as follows.

X_N →^p C
Continuous mapping theorem (1/2)
What follows is an extremely useful result.

Theorem 4
Continuous Mapping Theorem. Consider a vector-valued random sequence x_N ∈ X, a vector c ∈ X with the same length as x_N, as well as a vector-valued continuous function g(·) with a set of discontinuity points D_g such that:

P(x ∈ D_g) = 0

(the probability mass at the discontinuities is zero). It follows that:

x_N →^p c  ⇒  g(x_N) →^p g(c)
x_N →^{a.s.} c  ⇒  g(x_N) →^{a.s.} g(c)

thus, convergence in probability and almost sure convergence are preserved when functions are applied to random sequences.
Proof.
(Sketched.) Only the case about convergence in probability is proved
here, for the sake of illustration. (Continues. . . )
Continuous mapping theorem (2/2)
Theorem 4
Proof.
(Continued.) For a given positive number δ > 0, define the set:

G_δ = {x ∈ X | x ∉ D_g : ∃y ∈ X : ‖x − y‖ < δ, ‖g(x) − g(y)‖ > ε}

that is, the set of points in X where g(·) "amplifies" the distance with some other point y beyond a small neighborhood of ε. In light of this definition:

P(‖g(x_N) − g(c)‖ > ε) ≤ P(‖x_N − c‖ ≥ δ) + P(c ∈ G_δ) + P(c ∈ D_g)

and note that upon taking the limit of the right-hand side as N → ∞, the second term vanishes by definition of a continuous function, while the third term is zero by hypothesis. Therefore:

lim_{N→∞} P(‖g(x_N) − g(c)‖ > ε) ≤ lim_{N→∞} P(‖x_N − c‖ ≥ δ)

which proves the result on convergence in probability.


Uses of the continuous mapping theorem (1/3)
• This result also applies to scalar-valued and matrix-valued
sequences.

• The theorem already showcases the convenience of working


in an asymptotic environment.

• Note that in general, one cannot derive the expected value of some function g(μ̂_N) of a given unbiased estimator μ̂_N such that E[μ̂_N] = μ₀ for some μ₀.

• (The best one can do is to derive approximations based on Jensen's Inequality.)

• If μ̂_N converges in probability to μ₀, though, the continuous mapping theorem ensures that in large samples g(μ̂_N) also converges in probability to g(μ₀).
Uses of the continuous mapping theorem (2/3)
A list of 'properties' of random sequences, which can be derived from the continuous mapping theorem, follows.

1. Scalars. Given two scalar random sequences X_N →^p x and Y_N →^p y, the following holds.

(X_N + Y_N) →^p x + y
X_N Y_N →^p xy
X_N / Y_N →^p x/y   if y ≠ 0

2. Vectors. Given two vector random sequences x_N →^p x and y_N →^p y of equal length, the following holds.

x_N^T y_N →^p x^T y
x_N y_N^T →^p x y^T
Uses of the continuous mapping theorem (3/3)
3. Matrices. Given two matrix random sequences X_N →^p X and Y_N →^p Y of appropriate dimension, it holds that:

X_N Y_N →^p XY

while for sequences of random square matrices of full rank Z_N →^p Z, it is as follows.

Z_N^{−1} →^p Z^{−1}

4. Combinations of the above. Consider the random sequences X_N and x_N as above, and suppose that the column dimension of X_N corresponds to the row dimension of x_N. Then, the following holds.

x_N^T X_N x_N →^p x^T X x
Laws of Large Numbers
• Endowed with these convergence concepts, it is possible to
state and prove the Laws of Large Numbers.

• These fundamental results in asymptotic analysis show how


sample means converge to population means in ways
that depend on the assumptions that one makes.

• There are two distinct kinds of Law: weak (for convergence


in probability) and strong (for almost sure convergence).

• While both are stated next, only the weak law is proved. A
full-fledged proof would use characteristic functions; for the
sake of simplicity, a slightly less general proof that is based
on moment-generating functions is given.

• The weak law resembles the result following from quadratic


mean convergence, but it does not impose finite variances.
Weak Law of Large Numbers (1/3)
The simplest version of the Law is introduced next.

Theorem 5
Weak Law of Large Numbers (Khinčin's). The sample mean associated with a random (i.i.d.) sample drawn from the distribution of a random vector x with finite mean E[x] < ∞ converges in probability to that population mean.

x̄_N = (1/N) Σ_{i=1}^N x_i →^p E[x]

Proof.
(Sketched.) The proof is restricted to random vectors x for which the
m.g.f. Mx (t) is actually defined. A more general analysis, which also
allows for random vectors that lack an m.g.f., would use characteristic
functions φx (t) instead. (Continues. . . )
Weak Law of Large Numbers (2/3)
Theorem 5
Proof.
(Continued.) The m.g.f. of the sample mean x̄_N is, for a given N:

M_{x̄_N}(t) = E[exp(t^T x̄_N)]
           = E[exp((1/N) Σ_{i=1}^N t^T x_i)]
           = Π_{i=1}^N E[exp((1/N) t^T x_i)]
           = [M_x(t/N)]^N

where the third line follows from independence between observations, and the fourth line relies on observations being identically distributed (so that they have the same m.g.f.); basically, this is an application of the theorem on the m.g.f. of linear combinations. (Continues. . . )
Weak Law of Large Numbers (3/3)

Theorem 5
Proof.
(Continued.) From a Taylor expansion around t₀ = 0:

M_{x̄_N}(t) = [1 + t^T E[x]/N + o(t^T ι/N)]^N

hence, taking the limit gives the following result.

lim_{N→∞} M_{x̄_N}(t) = exp(t^T E[x])

This is a trivial m.g.f.: that of a degenerate discrete random vector whose entire probability mass is concentrated at E[x]! Therefore, exploiting the fact that m.g.f.s uniquely characterize distributions (so that convergence to the m.g.f. of a constant implies convergence in probability to that constant), one can conclude that the sample mean indeed converges in probability to the population mean as N grows larger.
Strong Laws of Large Numbers (1/2)
Under appropriate restrictions about the variance of the population,
one can also establish almost sure convergence.

Theorem 6
Strong Law of Large Numbers (Kolmogorov's). If in a random (i.i.d.) sample drawn from the distribution of some random vector x it simultaneously holds that:
i. E[x] < ∞,
ii. Var[x] < ∞,
iii. Σ_{n=1}^∞ n^{−2} Var[x_n] < ∞,
then the sample mean converges almost surely to its population mean.

x̄_N = (1/N) Σ_{i=1}^N x_i →^{a.s.} E[x]
Strong Laws of Large Numbers (2/2)
The following result applies to non-identically distributed observations.

Theorem 7
Strong Law of Large Numbers (Markov's). Consider a sample with independent, non-identically distributed observations (i.n.i.d.), so that the random vectors x_i that generate it have possibly heterogeneous moments E[x_i] and Var[x_i]. If for some δ > 0 it holds that:

lim_{N→∞} Σ_{i=1}^N (1/i^{1+δ}) E[|x_i − E[x_i]|^{1+δ}] < ∞

then the following almost sure convergence result holds.

(1/N) Σ_{i=1}^N (x_i − E[x_i]) →^{a.s.} 0

More complex results that apply to non-independent observations also


exist. These are widely applied in econometrics.
LLN: Illustrative simulations (1/5)
• To illustrate how the Law of Large Numbers works in practice, a simulation is presented next.

• The simulation is based on random draws from the Poisson distribution with parameter λ = 4.

• Four histograms are presented. All of them bin the values of 800 sample means from 800 simulated samples.

• Across histograms the size of the sample N varies. Thus, if N = 1, 800 observations are grouped across 800 samples; if N = 10, 8,000 observations are grouped across 800 samples, and so on.

• This helps show how the empirical sampling distribution of the 800 sample means becomes increasingly concentrated around 4 as N becomes larger.
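The following Python sketch reproduces a simulation of this kind (an illustrative addition, not part of the original slides); the sample sizes and the number of replications match those described above, and the printed spread of the sample means is a simple stand-in for the histograms.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n_samples = 4.0, 800  # Poisson parameter and number of simulated samples

for n in (1, 10, 100, 1000):
    # draw 800 samples of size n and compute the 800 sample means
    means = rng.poisson(lam=lam, size=(n_samples, n)).mean(axis=1)
    # the spread of the sample means around lambda = 4 shrinks as n grows (LLN)
    print(f"N = {n:>4}: mean of means = {means.mean():.3f}, "
          f"std of means = {means.std(ddof=1):.3f}")
```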
LLN: Illustrative simulations (2/5)

[Histogram of the 800 sample means, binned over 0-16; N = 1.]
LLN: Illustrative simulations (3/5)

[Histogram of the 800 sample means, binned over 0-16; N = 10.]
LLN: Illustrative simulations (4/5)

[Histogram of the 800 sample means, binned over 0-16; N = 100.]
LLN: Illustrative simulations (5/5)

[Histogram of the 800 sample means, binned over 0-16; N = 1000.]
Consistent estimators
The Laws of Large Numbers can be exploited to show that selected
estimators converge in probability to the parameters of interest for
estimation.

Definition 7
Consistent Estimators. An estimator θ̂_N is called consistent if it converges in probability to the true population parameter θ₀ which it is meant to estimate.

θ̂_N →^p θ₀

Note: from now on, the subscript 0 – as in θ0 – is used to denote the


“true” value of the parameters of interest, the ones that characterize
the distribution which generates the data.

Before proving consistency of both MM and MLE estimators (under some loose conditions), it is useful to provide an example based on the bivariate linear regression model, showing that the estimators associated with it are consistent.
Bivariate regression and consistency (1/2)
Consider the bivariate linear regression model from Lecture 3.
The MM estimator of the true slope parameter β₁ is given (Lecture 4) as:

β̂_{1,MM} = [Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ)] / [Σ_{i=1}^N (X_i − X̄)²]

where X̄ = N^{−1} Σ_{i=1}^N X_i and Ȳ = N^{−1} Σ_{i=1}^N Y_i. This estimator can also be obtained via MLE under certain assumptions.

Observe that after dividing both the numerator and the denominator of the above ratio by N, they become, respectively:
• the sample covariance between X_i and Y_i, and
• the sample variance of X_i,
(both multiplied by the (N − 1)/N factor).
Bivariate regression and consistency (2/2)
Since the numerator and the denominator of the ratio are (generalized) sample means, by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ) →^p Cov[X_i, Y_i]
(1/N) Σ_{i=1}^N (X_i − X̄)² →^p Var[X_i]

and thanks to the Continuous Mapping Theorem:

β̂_{1,MM} →^p β₁

that is, this estimator of the slope parameter β₁ is consistent!

An extended analysis shows that the MM estimator of β₀,

β̂_{0,MM} = Ȳ − β̂_{1,MM} · X̄,

is also consistent: β̂_{0,MM} →^p β₀, again thanks to the Continuous Mapping Theorem.
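A small added Python sketch (with simulated data and arbitrary "true" parameter values, since none are given in the slides) illustrates this consistency: the MM estimates approach the chosen parameters as N grows.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 2.0, 0.5  # illustrative "true" parameters (an assumption)

for n in (50, 500, 5000, 50000):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)          # error term with E[eps] = 0
    y = beta0 + beta1 * x + eps       # bivariate linear CEF plus noise
    # MM estimator of the slope: sample covariance over sample variance of X
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    print(f"N = {n:>6}: b1 = {b1:.4f}, b0 = {b0:.4f}")
```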
Consistency of the Method of Moments
Theorem 8
Consistency of the Method of Moments. An estimator defined as the solution θ̂_MM of a set of sample moments

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) = 0

is consistent for the set of parameters θ₀ that solves the corresponding population moments E[m(x_i; θ)] = 0, if such a solution exists (i.e. if the estimation problem is well defined).
Proof.
(Heuristic.) By some applicable Law of Large Numbers:

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) →^p E[m(x_i; θ̂_MM)] = 0

where the equality to 0 follows by definition of the MM estimator, which is maintained throughout the sequence as N → ∞. As by hypothesis the zero moment conditions have only one admissible solution, at the probability limit it is plim θ̂_MM = θ₀.
Consistency of Maximum Likelihood (1/3)
Theorem 9
Consistency of Maximum Likelihood Estimators. In a random sample, an estimator θ̂_MLE which is defined as the maximizer of a log-likelihood function as per

θ̂_MLE = argmax_{θ∈Θ} Σ_{i=1}^N log f_{x_i}(θ | x_i)

converges in probability to the parameter set θ₀ that maximizes the corresponding population moment function.

θ₀ = argmax_{θ∈Θ} E[log f_x(x; θ)]

If such a maximum exists, by the likelihood principle it corresponds to the true parameter of the distribution under analysis.
Proof.
(Continues. . . )
Consistency of Maximum Likelihood (2/3)
Theorem 9
Proof.
(Continued.) (Heuristic.) By the Weak Law of Large Numbers, for any θ ∈ Θ, including θ̂_MLE and θ₀:

(1/N) Σ_{i=1}^N log f_x(x_i; θ) →^p E[log f_x(x; θ)]

while by the definition of MLE the following holds for all N ∈ ℕ.

(1/N) Σ_{i=1}^N log f_x(x_i; θ̂_MLE) ≥ (1/N) Σ_{i=1}^N log f_x(x_i; θ₀) →^p E[log f_x(x; θ₀)]

Moreover, since θ₀ maximizes the expected logarithmic p.d.f. or p.m.f. in the population, the following holds too.

lim_{N→∞} P(E[log f_x(x; θ₀)] ≥ E[log f_x(x; θ̂_MLE)]) = 1

(Continues. . . )
Consistency of Maximum Likelihood (3/3)

Theorem 9
Proof.
(Continued.) All these facts can be simultaneously reconciled only if, at the limit:

(1/N) Σ_{i=1}^N log f_x(x_i; θ̂_MLE) →^p E[log f_x(x; θ̂_MLE)]

while, at the same time:

E[log f_x(x; θ̂_MLE)] = E[log f_x(x; θ₀)]

hence, at the limit it follows that plim θ̂_MLE = θ₀ by the Continuous Mapping Theorem.
Convergence to random vectors?
• All convergence concepts that were discussed so far concern convergence to a point or interval in R^K.

• What if interest falls on convergence to a random vector?

• Consider the expression:

x_N →^p x

which can be read as "the random sequence x_N converges in probability to the random vector x," in the sense that:

lim_{N→∞} P(‖x_N − x‖ > δ) = 0

• Is that enough to guarantee that at the probability limit, the distribution (and moments) of x_N and x coincide?
Convergence in distribution
The answer to the question is “no:” such a result is only obtained when
the following stronger concept can be applied.

Definition 8
Convergence in Distribution. Consider:
• a sequence of random vectors x_N, each element of which has a cumulative distribution function F_{x_N}(·),
• and a random vector x with cumulative distribution function F_x(·).
The random sequence x_N is said to converge in distribution to x if:

lim_{N→∞} |F_{x_N}(x) − F_x(x)| = 0

at all continuity points x ∈ X belonging to the support of x. This is usually expressed with the following formalism.

x_N →^d x
Limiting distribution, and discussion

The distribution Fx (x) from the definition takes the following name.

Definition 9
Limiting Distribution. If x_N →^d x, that is, some random sequence x_N converges in distribution to a random vector x, then F_x(x) is said to be the limiting distribution of x_N.

Observe the following.

• The definition of convergence in distribution indicates that the probabilistic behavior of x_N and x becomes increasingly close as N grows, and eventually coincides;
• This is a stronger concept than convergence in probability to a random vector, which implies that x_N and x deliver increasingly close realizations as N grows, even if not necessarily with the same probabilities over the support.
Student’s t convergence to the normal (1/2)
Observation 1
Asymptotics of Student's t-distribution. Consider a random variable that follows Student's t-distribution with parameter ν, that is, X ∼ T(ν). As ν → ∞, the probability distribution of X tends to that of the standard normal distribution, i.e. lim_{ν→∞} X = Z ∼ N(0, 1).
Proof.
Taking the limit of the probability density function of the Student's t-distribution as ν → ∞:

lim_{ν→∞} [1/(√ν B(1/2, ν/2))] (1 + x²/ν)^{−(ν+1)/2} = (1/√(2π)) exp(−x²/2)

since lim_{ν→∞} √ν B(1/2, ν/2) = √(2π) by the properties of the Beta function, while:

lim_{ν→∞} (1 + x²/ν)^{−(ν+1)/2} = exp(−x²/2)

by more standard arguments.
Student’s t convergence to the normal (2/2)

This result was already anticipated in Lecture 2. It is worthwhile to revisit the graphical intuition in a different form.

[Plot of the c.d.f. F_X(x) of Student's t-distribution for ν = 1, ν = 3, and ν → ∞ (the standard normal limit), over x ∈ [−5, 5].]
Convergence of t-statistics
• Consider a random sample which is drawn from a normally distributed random variable X ∼ N(μ, σ²).

• As analyzed in Lecture 4, the t-statistic follows a t-distribution with N − 1 degrees of freedom.

t_N = √N (X̄_N − μ)/S_N ∼ T(N − 1)

• If seen as a random sequence, the t-statistic thus converges in distribution to the standard normal.

t_N →^d N(0, 1)

• Hence, with large N one can very reliably perform inference using t-statistics evaluated against the standard normal.
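As a quick numerical check (an added sketch, not from the slides), the following compares critical values of the t-distribution with those of the standard normal as the degrees of freedom grow; scipy.stats is assumed to be available for the quantile functions.

```python
from scipy.stats import t, norm

z975 = norm.ppf(0.975)  # standard normal 97.5% quantile, about 1.96
for df in (5, 30, 100, 1000):
    # the t critical value approaches the normal one as df = N - 1 grows
    print(f"df = {df:>5}: t_0.975 = {t.ppf(0.975, df):.4f}  vs  z_0.975 = {z975:.4f}")
```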
Snedecor’s F convergence to the chi-squared
Observation 2
Asymptotics of Snedecor’s F-distribution. Consider a random
variable that follows Snedecor’s F-distribution having parameters ν1
and ν2 , X ∼ F (ν1 , ν2 ). As ν2 → ∞, the probability distribution of
W = ν1 X tends to that of a chi-squared distribution with parameter
ν1 , i.e. limν2 →∞ ν1 X = W ∼ χ2 (ν1 ).
Proof.
After deriving the p.d.f. of W = ν₁X, take its limit as ν₂ → ∞:

lim_{ν₂→∞} f_W(w) = lim_{ν₂→∞} [1/B(ν₁/2, ν₂/2)] (1/ν₂)^{ν₁/2} w^{ν₁/2 − 1} (1 + w/ν₂)^{−(ν₁+ν₂)/2}
  = lim_{ν₂→∞} [Γ((ν₁+ν₂)/2)/(Γ(ν₂/2) Γ(ν₁/2))] (ν₂ + w)^{−ν₁/2} w^{ν₁/2 − 1} (1 + w/ν₂)^{−ν₂/2}
  = [1/(Γ(ν₁/2) · 2^{ν₁/2})] w^{ν₁/2 − 1} exp(−w/2)

where lim_{ν₂→∞} Γ((ν₁+ν₂)/2)/[Γ(ν₂/2) (ν₂ + w)^{ν₁/2}] = 2^{−ν₁/2} derives from the properties of the Gamma function.
Convergence of Hotelling’s t-squared statistics
• Recall Hotelling's rescaled t-squared statistic.

[(N − K)/(K(N − 1))] t²_N = [(N − K) N/(K(N − 1))] (x̄ − μ)^T S^{−1} (x̄ − μ) ∼ F(K, N − K)

For a given N, this statistic follows the F-distribution with paired degrees of freedom K and N − K.

• Per Observation 2, Hotelling's t-squared statistic converges in distribution to a chi-squared distribution with K degrees of freedom:

t²_N →^d χ²_K

note that the term (N − K)/(N − 1) converges to one as N → ∞.

• As in the univariate case, this result facilitates statistical inference in multivariate settings.
Gamma convergence to the normal
Observation 3
Asymptotics of the Gamma distribution. Consider a random variable that follows the Gamma distribution with parameters α and β, X ∼ Γ(α, β). Let μ = α/β as well as σ² = α/β². As α → ∞, the probability distribution of X tends to that of a normal distribution with parameters μ and σ², i.e. lim_{α→∞} X ∼ N(μ, σ²).
Proof.
Define the random variable Z = (X − μ)/σ = (β/√α) X − √α; by the properties of m.g.f.s it is:

M_Z(t) = exp(−√α t) · M_X((β/√α) t) = exp(−√α t) (1 − t/√α)^{−α}

and after some manipulation, the limit as α → ∞ gives:

lim_{α→∞} M_Z(t) = lim_{α→∞} exp(−√α t) (1 − t/√α)^{−α} = exp(t²/2)

showing that at the limit, Z ∼ N(0, 1) and thus X ∼ N(μ, σ²).
Mean convergence in exponential samples
• Recall that in a random sample drawn from X ∼ Exp(λ) the sample mean is Gamma-distributed, X̄_N ∼ Γ(N, N/λ).

• Thus, by Observation 3, the following holds.

√N (X̄_N − λ) →^d N(0, λ²)

• This statement is interpreted in the sense that for a fixed value of N:

X̄_N ∼^A N(λ, λ²/N)

where A stands for "asymptotic" (observe that, by the definition of convergence in distribution, N cannot show up in the expression of a limiting distribution).

• This is a particular case of the Central Limit Theorem, one that can help inference about the exponential distribution.
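A small added numerical sketch (assuming the scale parameterization with E[X] = λ, as implied above) compares the exact Gamma sampling distribution of X̄_N with its normal approximation; scipy.stats is assumed to supply both distributions.

```python
import numpy as np
from scipy.stats import gamma, norm

lam = 2.0  # illustrative value of lambda, so E[X] = 2 and Var[X] = 4

for n in (5, 50, 500):
    # exact: sample mean ~ Gamma(shape=N, rate=N/lambda), i.e. scale = lambda/N
    exact = gamma(a=n, scale=lam / n)
    approx = norm(loc=lam, scale=lam / np.sqrt(n))  # asymptotic N(lambda, lambda^2/N)
    q = lam + lam / np.sqrt(n)  # a point one "asymptotic sd" above the mean
    print(f"N = {n:>3}: exact P(mean <= q) = {exact.cdf(q):.4f}, "
          f"normal approx = {approx.cdf(q):.4f}")
```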
Continuous Mapping Theorem, continued
The Continuous Mapping Theorem also applies to the concept of convergence in distribution.

Theorem 10
Continuous Mapping Theorem (convergence in distribution). Under the hypotheses of Theorem 4:

x_N →^d x  ⇒  g(x_N) →^d g(x)

that is, a random sequence obtained from the application of a transformation g(·) to some original random sequence x_N converges in distribution to the distribution resulting from applying the transformation g(·) to the random vector x associated with the limiting distribution of x_N.

The proof of this statement is omitted: it involves advanced measure theory. This version of the continuous mapping theorem is important, as it allows one to prove some properties of random sequences – which are exploited in statistics and econometrics – that are presented next.
Slutskij’s Theorem (1/2)
Theorem 11
Slutskij's Theorem. Consider any two (scalar) random sequences X_N and Y_N such that:

X_N →^d X
Y_N →^p c

that is, X_N converges in distribution to that of the random variable X, while Y_N converges in probability to a constant c. Then, the following holds.

(X_N + Y_N) →^d X + c
X_N Y_N →^d cX
X_N / Y_N →^d X/c   if c ≠ 0

Proof.
(Continues. . . )
Slutskij’s Theorem (2/2)

Theorem 11
Proof.
(Continued.) Recognize that, as Y_N →^p c, Y_N has a degenerate limiting distribution, and the (vector-valued) random sequence (X_N, Y_N) converges in distribution to that of the random vector (X, c). All the results above follow, therefore, from applying the Continuous Mapping Theorem to three given continuous functions of X_N and Y_N.

Corollary: Cramér-Wold Device. Given a random sequence x_N and a constant vector a of the same dimension:

x_N →^d x  ⇒  a^T x_N →^d a^T x

that is, if a vectorial random sequence has a limiting distribution, any linear combination of its elements converges in distribution to the distribution of the corresponding "limiting" linear combination.
The Extreme Value Theorem (1/4)
It is worth briefly sketching here the central result of extreme value theory, that is, the asymptotic theory of order statistics.

Theorem 12
Extreme Value Theorem (Fisher-Tippett-Gnedenko). Given a random (i.i.d.) sample (X₁, . . . , X_N), if a convergence in distribution result of the kind

(X_(N) − b_N)/a_N →^d W

can be established – where X_(N) is the maximum order statistic while a_N > 0 and b_N are sequences of real constants – then:

W ∼ GEV(0, 1, ξ)

for some real ξ. That is, the limiting distribution of the "normalized" maximum is some standardized type of the Generalized Extreme Value distribution.
Proof.
(Outline.) The extended proof is quite elaborate. (Continues. . . )
The Extreme Value Theorem (2/4)
Theorem 12
Proof.
(Continued.) The objective is to show that, given a random variable X from which the random sample is drawn, for all the points x ∈ X in its support where the distribution F_X(x) is continuous:

lim_{N→∞} [F_X(a_N x + b_N)]^N = exp(−(1 + ξx)^{−1/ξ})

where the left-hand side is the limit of the cumulative distribution of the standardized maximum, and the right-hand side is the expression of the cumulative standardized GEV distribution.
By taking the logarithm of this expression, the above is:

lim_{N→∞} N log F_X(a_N x + b_N) = −(1 + ξx)^{−1/ξ}

showing that F_X(a_N x + b_N) → 1 as N → ∞. (Continues. . . )


The Extreme Value Theorem (3/4)
Theorem 12
Proof.
(Continued.) Since −log(x) ≈ 1 − x for any given x close to 1, the above expression approximates the following.

lim_{N→∞} 1/{N [1 − F_X(a_N x + b_N)]} = 1/(1 + ξx)^{−1/ξ}

The rest of the proof is mathematically involved, and it proceeds to:

i. show that the right-hand side of the above expression is the only admissible limit; and

ii. establish conditions under which ξ = 0 (Type I GEV, Gumbel), ξ > 0 (Type II GEV, Fréchet), and ξ < 0 (Type III GEV, reverse Weibull).

In this context, ξ = 0 is interpreted as a limit case (see Lecture 2).
The Extreme Value Theorem (4/4)
The Extreme Value Theorem has the following implications.

1. A standardized maximum does not necessarily always converge to a GEV distribution; the Theorem states that if it converges, the limiting distribution is GEV.

2. By defining Y = −X, for every N it clearly holds that:

Y_(1) = −X_(N)

which helps identify the distribution of the minimum if the


maximum’s is known (e.g. reverse vs. traditional Weibull).

3. The technical conditions in the proof that help identify the


GEV Type are often useful. For example, one can establish
that in sampling from the normal distribution, maxima are
Gumbel-distributed.
Central Limit Theorems
• Convergence in distribution is a useful concept, but it is of
limited practical use in inference if the limiting distribution
of a statistic cannot be derived.

• In this regard, Central Limit Theorems are paramount:


they prove that some specific functions of sample means
converge in distribution to the (multivariate) normal.

• This is even more important as the result does not depend


upon the underlying distribution that generates the sample.

• This result helps conduct inference in a variety of settings, including – as discussed later – estimation results from MM and MLE frameworks alike.

• Once again (as in the Law of Large Numbers case), various


versions of the result exist, for different sets of assumptions.
Classic Central Limit Theorem (1/5)

Theorem 13
Central Limit Theorem (Lindeberg and Lévy's). Consider the sample mean x̄_N associated with a random (i.i.d.) sample drawn from the distribution of a random vector x whose mean and variance are both finite: E[x] < ∞ and Var[x] < ∞. The random sequence defined as the centered sample mean multiplied by √N converges in distribution to a multivariate normal distribution.

√N ((1/N) Σ_{i=1}^N x_i − E[x]) →^d N(0, Var[x])

Proof.
(Sketched.) As in the earlier proof of the Weak Law of Large Numbers (Theorem 5), this one makes use of moment-generating functions; in order to be fully general, characteristic functions should be used instead. (Continues. . . )
Classic Central Limit Theorem (2/5)

Theorem 13
Proof.
(Continued.) Consider the standardized random vector

z = [Var[x]]^{−1/2} (x − E[x])

where the matrix [Var[x]]^{−1/2} and its inverse [Var[x]]^{1/2} satisfy:

[Var[x]]^{−1/2} Var[x] [Var[x]]^{−1/2} = I

as well as the following.

[Var[x]]^{1/2} [Var[x]]^{1/2} = Var[x]

Such a matrix can always be constructed because variance-covariance matrices are positive semi-definite. (Continues. . . )
Classic Central Limit Theorem (3/5)
Theorem 13
Proof.
(Continued.) The objective of the proof is to show that:

z̄̄_N ≡ (1/√N) Σ_{i=1}^N z_i →^d N(0, I)

that is, the random sequence z̄̄_N defined above converges in distribution to a standard multivariate normal distribution. If this holds, the main result also follows thanks to the (linear) properties of the multivariate normal distribution, per the following relationship.

√N (x̄_N − E[x]) = √N ((1/N) Σ_{i=1}^N x_i − E[x]) = [Var[x]]^{1/2} ((1/√N) Σ_{i=1}^N z_i)

(Continues. . . )
Classic Central Limit Theorem (4/5)

Theorem 13
Proof.
(Continued.) To show this, express the m.g.f. of z̄̄_N, for fixed N, as:

M_{z̄̄_N}(t) = E[exp(t^T z̄̄_N)]
            = E[exp((1/√N) Σ_{i=1}^N t^T z_i)]
            = Π_{i=1}^N E[exp((1/√N) t^T z_i)]
            = [M_z(t/√N)]^N

by a derivation analogous to the one in the proof of the Weak Law of Large Numbers. (Continues. . . )
Classic Central Limit Theorem (5/5)
Theorem 13
Proof.
(Continued.) Just like in that proof, apply here a Taylor expansion of the above expression around t₀ = 0, but now of second degree:

M_{z̄̄_N}(t) = [1 + t^T E[z]/√N + t^T E[zz^T] t/(2N) + o(t^T t/(2N))]^N
            = [1 + t^T t/(2N) + o(t^T t/(2N))]^N

where the second line exploits the fact that E[z] = 0 and E[zz^T] = I by construction of z. Taking the limit for N → ∞ now gives:

lim_{N→∞} M_{z̄̄_N}(t) = exp(t^T t/2)

which is precisely the m.g.f. of the standard multivariate normal, as it was postulated.
Use of the Central Limit Theorem
• How to "use" the Central Limit Theorem? Note that for a given N, the result can be restated as follows.

x̄_N = (1/N) Σ_{i=1}^N x_i ∼^A N(E[x], (1/N) Var[x])

• The sample mean is "approximately" normally distributed, with a variance-covariance matrix decreasing in the sample size.

• The notation ∼^A indicates here that the normal distribution in question, which is called the asymptotic distribution, is approximate and is valid for a fixed N, instead of being a "limiting" distribution.

• Recall that limiting distributions cannot be functions of N.


CLT: Illustrative simulations (1/5)
• Once again, simulations reveal themselves useful.

• These are based on exactly the same random draws from the Poisson distribution with parameter λ = 4.

• The four histograms now bin 800 values calculated as:

z̄̄_N = √N · (x̄_N − 4)/2

where x̄_N is a realized mean from the previous simulation. Across histograms the size of the sample N varies as before.

• Note the standardization in the construction of z̄̄_N: in this Poisson distribution, both the mean and the variance equal 4.

• An overlaid standard normal p.d.f. helps show how the sampling distribution of this statistic resembles the normal increasingly well as N increases.
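The following Python sketch (an added illustration, not from the original slides) reproduces this standardization exercise and compares the simulated distribution of the standardized means with the standard normal via a few empirical quantiles.

```python
import numpy as np

rng = np.random.default_rng(42)
lam, n_samples = 4.0, 800  # Poisson mean (= variance) and number of samples

for n in (1, 10, 100, 1000):
    means = rng.poisson(lam=lam, size=(n_samples, n)).mean(axis=1)
    z = np.sqrt(n) * (means - lam) / np.sqrt(lam)  # standardized sample means
    # empirical quantiles should approach the N(0,1) values (-1.645, 0, 1.645)
    q05, q50, q95 = np.quantile(z, [0.05, 0.5, 0.95])
    print(f"N = {n:>4}: q05 = {q05:+.2f}, median = {q50:+.2f}, q95 = {q95:+.2f}")
```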
CLT: Illustrative simulations (2/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 1.]
CLT: Illustrative simulations (3/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 10.]
CLT: Illustrative simulations (4/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 100.]
CLT: Illustrative simulations (5/5)

[Histogram of the 800 standardized means with a standard normal p.d.f. overlaid; N = 1000.]
More general Central Limit Theorems
• As with the Laws of Large Numbers, more general Central
Limit Theorems with less restrictive assumptions exist.

• Two famous versions, which both allow for i.n.i.d. data, are
presented next without proof.

• Of these two, the one that is named after A. Ljapunov is of


particular interest, as it is based on a condition which often
shows up in some technical econometric papers.

• The so-called Ljapunov condition requires that, in a sample, at least some cross-observation moments of order "slightly" higher than two (as detailed later) are finite.

• Even in this case, some more general versions that allow for
weakly dependent observations also exist.
Lindeberg-Feller Central Limit Theorem
Theorem 14
Central Limit Theorem (Lindeberg and Feller's). Consider a non-random (i.n.i.d.) sample where the random vectors x_i that generate it have possibly heterogeneous finite means E[x_i] < ∞, variances Var[x_i] < ∞, and all mixed third moments are finite too. If:

lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−1} Var[x_i] = 0

then it holds that:

(1/√N) Σ_{i=1}^N (x_i − E[x_i]) →^d N(0, Var[x])

where:

(1/N) Σ_{i=1}^N Var[x_i] →^p Var[x]

that is, the positive semi-definite matrix Var[x] is the probability limit of the observations' average variance.
Ljapunov’s Central Limit Theorem
Theorem 15
Central Limit Theorem (Ljapunov's). Consider a non-random (i.n.i.d.) sample where the random vectors x_i that generate it have possibly heterogeneous finite moments E[x_i] < ∞, Var[x_i] < ∞. If:

lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−(1+δ/2)} Σ_{i=1}^N E[|x_i − E[x_i]|^{2+δ}] = 0

for some δ > 0, then:

(1/√N) Σ_{i=1}^N (x_i − E[x_i]) →^d N(0, Var[x])

where Var[x] is the variances' probability limit as in Theorem 14.

Note: in econometric applications with E[x_i] = 0 for i = 1, . . . , N, the "Ljapunov condition" specializes, for some δ > 0, to:

E[|X_{ik} X_{iℓ}|^{1+δ}] < ∞

for any two elements k, ℓ = 1, . . . , K of x and for all observations i.
Asymptotic normality & linear regression (1/5)
To show how the Central Limit Theorem can help statistical inference in practice, consider again the estimator of the slope parameter in the bivariate regression model. Rework it as follows.

β̂_{1,MM} = [Σ_{i=1}^N (X_i − X̄) Y_i] / [Σ_{i=1}^N (X_i − X̄)²]
          = β₁ [Σ_{i=1}^N (X_i − X̄) X_i] / [Σ_{i=1}^N (X_i − X̄)²] + [Σ_{i=1}^N (X_i − X̄) ε_i] / [Σ_{i=1}^N (X_i − X̄)²]
          = β₁ + [(1/N) Σ_{i=1}^N (X_i − X̄) ε_i] / [(1/N) Σ_{i=1}^N (X_i − X̄)²]

where

ε_i ≡ Y_i − β₀ − β₁ X_i

is the error term of the model: the deviation between Y_i and the linear CEF, E[Y_i | X_i] = β₀ + β₁ X_i. Note that E[ε_i] = 0.
Asymptotic normality & linear regression (2/5)
Recall that in the bivariate linear regression model, the Law of Iterated Expectations gives E[X_i ε_i] = 0. This provides another avenue to demonstrate consistency of the MM estimator for β₁. In fact, by the Continuous Mapping Theorem:

(1/N) Σ_{i=1}^N (X_i − X̄) ε_i →^p E[X_i ε_i] − E[X_i] E[ε_i] = 0

(both terms on the right-hand side are zero), implying β̂_{1,MM} →^p β₁.

As the expression on the left-hand side above is a sample mean, under adequate assumptions about the sample some applicable Central Limit Theorem would imply the following.

(1/√N) Σ_{i=1}^N (X_i − X̄) ε_i →^d N(0, E[ε_i² (X_i − E[X_i])²])

Here the limiting variance takes this form because X̄ →^p E[X_i] at the probability limit.
Asymptotic normality & linear regression (3/5)
The limiting variance obtains as:

Var[(1/√N) Σ_{i=1}^N (X_i − E[X_i]) ε_i] = (1/N) Σ_{i=1}^N Var[(X_i − E[X_i]) ε_i] = E[ε_i² (X_i − E[X_i])²]

while in the more specialized case where the squared deviations of X_i and ε_i from their means are mutually independent, it is:

E[ε_i² (X_i − E[X_i])²] = E[ε_i²] E[(X_i − E[X_i])²] = σ²_ε · Var[X_i]

where σ²_ε ≡ E[ε_i²].

This latter case is the one where the conditional variance function of ε_i given X_i is actually a constant – a scenario usually called homoscedasticity (as opposed to heteroscedasticity, the general case). This is typical terminology in regression parlance.
Asymptotic normality & linear regression (4/5)
By the Cramér-Wold device and the following implication of the Continuous Mapping Theorem:

[(1/N) Σ_{i=1}^N (X_i − X̄)²]^{−1} →^p [Var[X_i]]^{−1}

these results together allow one to obtain the limiting distribution of the MM estimator as:

√N (β̂_{1,MM} − β₁) →^d N(0, E[ε_i² (X_i − E[X_i])²] / (Var[X_i])²)

and, for some given N, its asymptotic distribution as follows.

β̂_{1,MM} ∼^A N(β₁, (1/N) E[ε_i² (X_i − E[X_i])²] / (Var[X_i])²)
Asymptotic normality & linear regression (5/5)
In the specialized homoscedastic case, the limiting distribution of the estimator is:

√N (β̂_{1,MM} − β₁) →^d N(0, σ²_ε / Var[X_i])

and, for some given N, its asymptotic distribution is as follows.

β̂_{1,MM} ∼^A N(β₁, (1/N) σ²_ε / Var[X_i])

For them to be used in statistical inference, the results for both the heteroscedastic and homoscedastic cases require knowledge of the various components of the limiting variances. In general, these are unknown to researchers and must be estimated.

This is best discussed later after reviewing the application of the


Central Limit Theorem to general MM and MLE estimators.
The Delta Method (1/2)
Theorem 16
Delta Method. Suppose that some random sequence of dimension K – call it x_N – is asymptotically normal:

√N (x_N − c) →^d N(0, Υ)

for some K × 1 vector c and some K × K matrix Υ. In addition, consider some vector-valued function d(x): R^K → R^J. If the latter is continuously differentiable at c and the J × K Jacobian matrix

Δ ≡ (∂/∂x^T) d(c)

has full row rank J, the limiting distribution of d(x_N) is as follows.

√N (d(x_N) − d(c)) →^d N(0, ΔΥΔ^T)

Proof.
(Continues. . . )
The Delta Method (2/2)
Theorem 16
Proof.
(Continued.) From the mean value theorem it is:

d(x_N) = d(c) + [(∂/∂x^T) d(x̃_N)] (x_N − c)

where x̃_N is a convex combination of x_N and c. However, as x_N →^p c:

(∂/∂x^T) d(x̃_N) →^p (∂/∂x^T) d(c) = Δ

hence, at the probability limit:

√N (d(x_N) − d(c)) →^p Δ · √N (x_N − c)

which, by the given hypotheses, implies the result.

This result is extremely useful to derive the asymptotic distribution of estimators that relate to sample means, but are not themselves sample means.
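To make the result concrete, here is an added Python sketch (assuming an exponential population with mean μ = 2, an arbitrary choice): the sample mean is asymptotically normal, and the Delta Method gives the asymptotic variance of g(X̄_N) = 1/X̄_N as (g′(μ))² Var[X] = 1/μ² in this case.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 2.0                   # assumed population mean of Exp; Var[X] = mu**2
g = lambda x: 1.0 / x      # transformation applied to the sample mean
delta_var = (1.0 / mu**2) ** 2 * mu**2   # (g'(mu))^2 * Var[X] = 1 / mu**2

n, reps = 2000, 5000
means = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(means) - g(mu))   # sqrt(N) * (g(sample mean) - g(mu))
print(f"simulated variance: {stat.var(ddof=1):.4f}, delta method: {delta_var:.4f}")
```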
Method of moments asymptotic normality (1/4)
Theorem 17
Asymptotically, MM estimators are normally distributed. An estimator θ̂_MM defined as the solution of a set of sample moments

(1/N) Σ_{i=1}^N m(x_i; θ̂_MM) = 0

is asymptotically normal. If the sample is random and the moment conditions are differentiable, the limiting distribution is:

√N (θ̂_MM − θ₀) →^d N(0, M₀ Υ₀ M₀^T)

so long as the following matrices exist and are finite and nonsingular.

Υ₀ = Var[m(x_i; θ₀)]    M₀ ≡ {E[(∂/∂θ^T) m(x_i; θ₀)]}^{−1}

Proof.
(Continues. . . )
Method of moments asymptotic normality (2/4)
Theorem 17
Proof.
(Continued.) The proof applies the same logic as the Delta Method. By the mean value theorem, the sample moment conditions become:

0 = (1/N) Σ_{i=1}^N m(x_i; θ̂_MM)
  = (1/N) Σ_{i=1}^N m(x_i; θ₀) + [(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N)] (θ̂_MM − θ₀)

where the expression in the first line equals zero by construction of all MM estimators. After multiplying both sides by √N and some manipulation, the above expression is rendered as follows.

√N (θ̂_MM − θ₀) = −[(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N)]^{−1} (1/√N) Σ_{i=1}^N m(x_i; θ₀)

(Continues. . . )
Method of moments asymptotic normality (3/4)
Theorem 17
Proof.
(Continued.) Since this is a random sample:
1. by a suitable Central Limit Theorem:

−(1/√N) Σ_{i=1}^N m(x_i; θ₀) →^d N(0, Var[m(x_i; θ₀)])

given that E[m(x_i; θ₀)] = 0 by hypothesis;
2. while by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̃_N) →^p E[(∂/∂θ^T) m(x_i; θ₀)]

since θ̃_N →^p θ₀ by consistency of the estimator (at the limit, θ̃_N, θ̂_MM and θ₀ all coincide).

(Continues. . . )
Method of moments asymptotic normality (4/4)

Theorem 17
Proof.
(Continued.) These intermediate results are combined via the Continuous Mapping Theorem, Slutskij's Theorem, and the Cramér-Wold device so as to imply the statement. Therefore, for a fixed N the asymptotic distribution is:

θ̂_MM ∼^A N(θ₀, (1/N) M₀ Υ₀ M₀^T)

which concludes the proof.

This expression of the asymptotic variance-covariance is typically unknown and must thus be estimated. The general approach to address this issue is shown later alongside the MLE case.
Maximum likelihood asymptotic normality (1/5)
Theorem 18
Asymptotically, ML estimators are normally distributed and they attain the Cramér-Rao bound. An estimator θ̂_MLE defined as the maximizer of a log-likelihood function as per

θ̂_MLE = argmax_{θ∈Θ} Σ_{i=1}^N log f_{x_i}(θ | x_i)

is asymptotically normal. Define the following 'regularity conditions:'
i. the problem is well defined, i.e. θ₀ is the maximizer of the population expression E[log f_x(x_i; θ)] – where f_x(x_i; θ) is the probability mass or density function that generates the data;
ii. f_x(x_i; θ) is three times continuously differentiable and its derivatives are bounded in absolute value;
iii. the support of x_i does not depend on θ, so that derivatives with respect to θ can pass at least twice through an integral defined over f_x(x_i; θ).
(Continues. . . )
Maximum likelihood asymptotic normality (2/5)
Theorem 18
(Continued.) If the sample is random and the regularity conditions hold, then the limiting distribution is expressible as:

√N (θ̂_MLE − θ₀) →^d N(0, [I(θ₀)]^{−1})

where I(θ₀) – written without the N subscript – is the following "single-observation" information matrix evaluated at θ₀.

I(θ₀) ≡ E[((∂/∂θ) log f_x(x_i; θ₀)) ((∂/∂θ) log f_x(x_i; θ₀))^T]
      = −E[(∂²/∂θ∂θ^T) log f_x(x_i; θ₀)]

Consequently, θ̂_MLE asymptotically attains the Cramér-Rao bound.

Proof.
(Continues. . . )
Maximum likelihood asymptotic normality (3/5)
Theorem 18
Proof.
(Continued.) The proof proceeds similarly to the MM case. By the mean value theorem, the MLE First Order Conditions can be written as:

0 = (1/N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ̂_MLE)
  = (1/N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀) + [(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N)] (θ̂_MLE − θ₀)

where the entire expression is zero by definition of MLE. Once again:

√N (θ̂_MLE − θ₀) = −[(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N)]^{−1} (1/√N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀)

but here additional simplifications are possible. (Continues. . . )


Maximum likelihood asymptotic normality (4/5)
Theorem 18
Proof.
(Continued.) Thanks to the Information Matrix Equality, under the regularity conditions the following holds.
1. A suitable Central Limit Theorem implies that:

−(1/√N) Σ_{i=1}^N (∂/∂θ) log f_x(x_i; θ₀) →^d N(0, I(θ₀))

since θ₀ maximizes E[log f_x(x_i; θ₀)], so that E[(∂/∂θ) log f_x(x_i; θ₀)] = 0;
2. while by the Weak Law of Large Numbers:

(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̃_N) →^p −I(θ₀)

again since θ̃_N →^p θ₀ by consistency of MLE as per Theorem 9.
(Continues. . . )
Maximum likelihood asymptotic normality (5/5)
Theorem 18
Proof.
(Continued.) Here, the application of the Delta Method results in a simplified expression of the limiting variance, as given in the statement of the Theorem. Collecting terms, for a fixed N the asymptotic distribution is:

θ̂_MLE ∼^A N(θ₀, [I_N(θ₀)]^{−1})

where I_N(θ₀) is the grand (sample) information matrix for a fixed N. Since the MLE is consistent, at the probability limit its bias is zero; hence the estimator attains the Cramér-Rao bound.

Some comments are in order here.


• Asymptotic attainment of the Cramér-Rao bound is a desirable
property of MLE (alongside invariance – see Lecture 5).
• Yet it hinges on correctly assuming the underlying distribution. If
this is incorrect, the MLE can fail utterly (be inconsistent).
• By contrast, the MM is more robust: there is a trade-off here!
Estimating asymptotic variance-covariances
• The above results develop expressions for both limiting and
asymptotic variance-covariances of MM and ML estimators.

• However, the elements inside such expressions, like M0 , Υ0


and I (θ0 ), are generally unknown ex-ante.

• To use these results in statistical inference it is necessary to


estimate these quantities.

• By the analogy principle, one could use sample analogues


as consistent estimators of population variance-covariances.
• Example: if √N (X̄_N − μ) →^d N(0, σ²), then ((N − 1)/N) S²_N →^p σ².

• A more elaborate example on the bivariate linear regression


model is developed next.

• General estimators for the MM and MLE cases then follow.


Asymptotic inference in linear regression (1/2)
Suppose one wants to perform a two-sided test of hypothesis on the bivariate linear regression slope parameter β₁.

H₀: β₁ = C    H₁: β₁ ≠ C

If C = 0, this is a so-called significance test of the regression: a test of whether the explanatory variable X_i affects the mean of Y_i in a conditional (CEF) sense.

In small samples this test may be problematic and require extra assumptions. In an asymptotic environment, the earlier analysis of the model allows one to establish the following property (under H₀).

t_N = √N (β̂_{1,MM} − C)/S_{β₁} →^d N(0, 1)

Here t_N is a t-statistic and S_{β₁} is a suitable consistent estimator of the limiting standard deviation of √N (β̂_{1,MM} − β₁).
Asymptotic inference in linear regression (2/2)
The expression of S_{β₁} differs across assumptions. In the general heteroscedastic case, its squared version is:

S²_{β₁} = [Σ_{i=1}^N (Y_i − β̂_{0,MM} − β̂_{1,MM} X_i)² (X_i − X̄)²] / [N^{−1} (Σ_{i=1}^N (X_i − X̄)²)²]

while in the more restricted homoscedastic case S²_{β₁} is as follows.

S²_{β₁} = [Σ_{i=1}^N (Y_i − β̂_{0,MM} − β̂_{1,MM} X_i)²] / [Σ_{i=1}^N (X_i − X̄)²]

The quantity S_{β₁}/√N is called the standard error of β̂_{1,MM}. A proper confidence interval for β₁ would be as follows.

β₁ ∈ (β̂_{1,MM} − z*_{α/2} · S_{β₁}/√N , β̂_{1,MM} + z*_{α/2} · S_{β₁}/√N)
Estimating MM asymptotic variance-covariances
In the MM case, M̂_N Υ̂_N M̂_N^T / N is a consistent estimator of the asymptotic variance-covariance in random samples, where:

M̂_N ≡ [(1/N) Σ_{i=1}^N (∂/∂θ^T) m(x_i; θ̂_MM)]^{−1} →^p M₀

is a consistent estimator of M₀ (by some Law of Large Numbers and the Continuous Mapping Theorem), while

Υ̂_N ≡ (1/N) Σ_{i=1}^N [m(x_i; θ̂_MM)] [m(x_i; θ̂_MM)]^T →^p Υ₀

is also a consistent estimator of the variance of the zero moment conditions by some applicable Law of Large Numbers, since in a random sample the following holds.

Υ₀ = Var[m(x_i; θ₀)] = E[(m(x_i; θ₀)) (m(x_i; θ₀))^T]

These estimators also work under general i.n.i.d. assumptions.
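As an added toy illustration, take the single moment condition m(x; θ) = x − θ (so that θ̂_MM is just the sample mean, an assumed example): then M̂_N = −1, Υ̂_N is the (uncorrected) sample variance, and the sandwich M̂_N Υ̂_N M̂_N^T / N reduces to the familiar variance estimate of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=1.5, size=1000)   # any population works here

theta_hat = x.mean()                             # solves (1/N) sum(x_i - theta) = 0

# sandwich components for m(x; theta) = x - theta
M_hat = 1.0 / np.mean(-np.ones_like(x))          # [ (1/N) sum dm/dtheta ]^{-1} = -1
Ups_hat = np.mean((x - theta_hat) ** 2)          # (1/N) sum m(x_i; theta_hat)^2
avar = M_hat * Ups_hat * M_hat / len(x)          # M_hat * Ups_hat * M_hat^T / N

print(f"theta_hat = {theta_hat:.4f}, asymptotic SE = {np.sqrt(avar):.4f}")
print(f"classic SE of the mean = {x.std(ddof=1) / np.sqrt(len(x)):.4f}")
```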


Estimating ML asymptotic variance-covariances
In MLE, there are two ways to estimate the information matrix, corresponding to the two sides of the Information Matrix Equality. The first option is based on the Hessian of the p.m.f. or p.d.f.:

Ĥ_N ≡ −(1/N) Σ_{i=1}^N (∂²/∂θ∂θ^T) log f_x(x_i; θ̂_MLE) →^p I(θ₀)

while the second option exploits the "squared" score:

Ĵ_N ≡ (1/N) Σ_{i=1}^N [(∂/∂θ) log f_x(x_i; θ̂_MLE)] [(∂/∂θ) log f_x(x_i; θ̂_MLE)]^T

with Ĵ_N →^p I(θ₀).

The choice between Ĥ_N and Ĵ_N is based on convenience and is context-dependent. Observe how all these estimators (both MM and MLE) are evaluated at the consistent parameter estimates.
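As a final added sketch, consider the exponential distribution with rate θ (so log f(x; θ) = log θ − θx, an assumed example): the MLE is θ̂ = 1/X̄, and both Ĥ_N and Ĵ_N can be computed explicitly and compared.

```python
import numpy as np

rng = np.random.default_rng(11)
theta0, n = 1.5, 2000                       # assumed "true" rate and sample size
x = rng.exponential(scale=1 / theta0, size=n)

theta_hat = 1.0 / x.mean()                  # MLE of the exponential rate

# Hessian-based estimate: d^2 log f / d theta^2 = -1/theta^2 for every observation,
# so H_N = -(1/N) * sum(-1/theta_hat^2) = 1/theta_hat^2
H_N = 1.0 / theta_hat**2
# OPG estimate: average squared per-observation score, score_i = 1/theta - x_i
score = 1.0 / theta_hat - x
J_N = np.mean(score**2)

se_H = np.sqrt(1.0 / (n * H_N))             # asymptotic s.e. from either estimator
se_J = np.sqrt(1.0 / (n * J_N))
print(f"theta_hat = {theta_hat:.4f}, H_N = {H_N:.4f}, J_N = {J_N:.4f}")
print(f"SE (Hessian) = {se_H:.4f}, SE (OPG) = {se_J:.4f}")
```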
