Lecture 6
Paolo Zacchia
Why asymptotics?
• This lecture characterizes probabilistic results for samples
in asymptotic settings: when the sample size N is large.
Definition 1
Random sequence. Any random vector expressed as an N-indexed
sequence, written as x_N = (X_{1N}, . . . , X_{KN})^T, is a random sequence.
In the univariate context (K = 1), one can write it simply as X_N.
Definition 2
Boundedness in Probability. A sequence xN of random vectors is
bounded in probability if and only if, for any ε > 0, there exists some
number δε < ∞ and an integer Nε such that
P(‖x_N‖ ≥ δ_ε) < ε    ∀ N ≥ N_ε
which is also written as x_N = O_p(1) and read as “x_N is big p-oh one.”
Definition 3
Convergence in Probability. A sequence xN of random vectors con-
verges in probability to a constant vector c if, for every ε > 0,
lim_{N→∞} P(‖x_N − c‖ ≥ ε) = 0.
This definition formalizes the idea that as the sample size N grows in-
creasingly larger, the probability distribution of xN concentrates within
an increasingly smaller neighborhood of c.
Convergence in probability is usually denoted in one of two ways.
x_N →p c
plim x_N = c
Here the first type of notation (using →p) is preferred.
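The following minimal Python sketch (an illustration added here, not part of the original slides) makes the definition concrete: it estimates P(|X_N − c| ≥ ε) by simulation for the sample mean of Exponential(1) draws, whose probability limit is c = 1.

```python
# Minimal sketch (not from the slides): Monte Carlo estimate of
# P(|X_N - c| >= eps) for X_N = mean of N Exponential(1) draws, c = 1.
import numpy as np

rng = np.random.default_rng(0)
c, eps, reps = 1.0, 0.1, 2_000

for N in (10, 100, 1_000, 10_000):
    means = rng.exponential(scale=1.0, size=(reps, N)).mean(axis=1)
    prob = np.mean(np.abs(means - c) >= eps)   # share of runs outside the eps-band
    print(f"N = {N:>6}: estimated P(|X_N - c| >= {eps}) = {prob:.4f}")
```

The estimated probability shrinks towards zero as N grows, which is exactly what the definition requires.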
Convergence implies boundedness in probability
Theorem 1
Convergent Random Sequences are also Bounded. If some se-
quence xN of random vectors converges in probability to a constant c,
that is x_N →p c, then it is also bounded: x_N = O_p(1).
Proof.
By the definition of convergence in probability, for any ε > 0 there is
always an integer N_ε such that
P(‖x_N − c‖ ≥ 1) < ε    ∀ N ≥ N_ε
and since ‖x_N‖ ≤ ‖x_N − c‖ + ‖c‖ by the triangle inequality, it follows that
P(‖x_N‖ ≥ ‖c‖ + 1) < ε for all N ≥ N_ε. Choosing δ_ε = ‖c‖ + 1 thus satisfies
the definition of boundedness in probability: x_N = O_p(1).
In the special case where the probability limit is c = 0, convergence in
probability is also written as
x_N = o_p(1)
which is read as “x_N is little p-oh one.” This notation helps develop
the following concept.
Definition 4
Convergence of Random Sequences to Real Sequences. Con-
sider a random sequence xN , and a non-random sequence aN of the
same dimension K as xN . Moreover, define the random sequence
z_N = (Z_{1N}, . . . , Z_{KN})^T where Z_{kN} = X_{kN}/a_{kN} for k = 1, . . . , K and
for N = 1, 2, . . . to infinity.
1. If zN = Op (1), then xN is said to be bounded in probability by
aN , which one can write as xN = Op (aN ).
2. If z_N = o_p(1), then x_N is said to be dominated in probability by a_N
(that is, x_N is of smaller order in probability than a_N), which one
can write as x_N = o_p(a_N).
Convergence in r-th Mean
Definition 5
Convergence in r-th Mean. A random sequence xN is said to con-
verge in r-th mean to a constant vector c if the following holds.
lim_{N→∞} E[‖x_N − c‖^r] = 0
Theorem 2
Convergence in Lower Means. A random sequence xN that con-
verges in r-th mean to some constant vector c also converges in s-th
mean to c for s < r.
Proof.
The proof is based on Jensen’s Inequality:
lim_{N→∞} E[‖x_N − c‖^s] = lim_{N→∞} E[(‖x_N − c‖^r)^{s/r}]
                         ≤ lim_{N→∞} {E[‖x_N − c‖^r]}^{s/r}
                         = 0
since lim_{N→∞} E[‖x_N − c‖^r] = 0.
Convergence in quadratic mean (1/2)
Theorem 3
Convergence in Quadratic Mean and Probability. If a random
sequence xN converges in r-th mean to a constant vector c for r ≥ 2
(that is, at least x_N →qm c), then it also converges in probability to c.
Proof.
Define the (one-dimensional) nonnegative random sequence QN as:
Q_N = ‖x_N − c‖ = √((x_N − c)^T (x_N − c)) ∈ R_+
(Continues. . . )
Convergence in quadratic mean (2/2)
Theorem 3
Proof.
(Continued.) At the same time, by Čebyšëv’s Inequality:
P(|Q_N − E[Q_N]| > δ) ≤ Var[Q_N] / δ²
therefore, taking limits on both sides gives lim_{N→∞} P(|Q_N − E[Q_N]| > δ) = 0,
since convergence in quadratic mean implies both E[Q_N] → 0 and
Var[Q_N] = E[Q_N²] − (E[Q_N])² → 0. Hence lim_{N→∞} P(Q_N > δ) = 0 for any
δ > 0, which is convergence in probability of x_N to c.
Definition 6
Almost Sure Convergence. A sequence xN of random vectors con-
verges almost surely, or with probability one to a constant vector c if it
holds that:
P( lim_{N→∞} x_N = c ) = 1
where lim_{N→∞} x_N is a random vector. This is also written as x_N →a.s. c.
Theorem 4
Continuous Mapping Theorem. Consider a vector-valued random
sequence xN ∈ X, a vector c ∈ X with the same length as xN , as well
as a vector-valued continuous function g (·) with a set of discontinuity
points Dg such that:
P (x ∈ Dg ) = 0
(the probability mass at the discontinuities is zero). It follows that:
x_N →p c   ⇒   g(x_N) →p g(c)
x_N →a.s. c   ⇒   g(x_N) →a.s. g(c)
G_δ = { x ∈ X : x ∉ D_g, ∃ y ∈ X : ‖x − y‖ < δ, ‖g(x) − g(y)‖ > ε }
that is, the set of continuity points x of g(·) at which the function “amplifies”
a distance smaller than δ from some other point y into a distance between
g(x) and g(y) larger than ε. In light of this definition:
and note that upon taking the limit of the right-hand side as N → ∞,
the second term vanishes by definition of a continuous function, while
the third term is zero by hypothesis. Therefore:
• If μ̂_N converges in probability to μ_0, though, the continuous
mapping theorem ensures that in large samples g(μ̂_N) also
converges in probability to g(μ_0).
Uses of the continuous mapping theorem (2/3)
A list of ‘properties’ of random sequences, which can be derived
from the continuous mapping theorem, follows.
1. Scalars. Given two scalar random sequences X_N →p x and
Y_N →p y, the following holds.
(X_N + Y_N) →p x + y
X_N Y_N →p xy
X_N / Y_N →p x/y   if y ≠ 0
2. Vectors. Given two vector random sequences x_N →p x and
y_N →p y of equal length, the following holds.
x_N^T y_N →p x^T y
x_N y_N^T →p x y^T
Uses of the continuous mapping theorem (3/3)
3. Matrices. Given two matrix random sequences X_N →p X
and Y_N →p Y of appropriate dimension, it holds that:
X_N Y_N →p XY
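A small Monte Carlo sketch of these “plim algebra” properties (an illustration added here, not from the slides): X_N and Y_N are sample means of Uniform(0, 2) and Exponential(2) draws, with assumed probability limits x = 1 and y = 0.5.

```python
# Minimal sketch: sums, products and ratios of two convergent sequences
# converge to the corresponding functions of their probability limits.
import numpy as np

rng = np.random.default_rng(1)
x, y = 1.0, 0.5  # probability limits of X_N and Y_N

for N in (10, 1_000, 100_000):
    X_N = rng.uniform(0.0, 2.0, size=N).mean()
    Y_N = rng.exponential(scale=0.5, size=N).mean()
    print(f"N = {N:>6}:  X+Y = {X_N + Y_N:.4f} (-> {x + y}),  "
          f"XY = {X_N * Y_N:.4f} (-> {x * y}),  "
          f"X/Y = {X_N / Y_N:.4f} (-> {x / y})")
```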
• While both are stated next, only the weak law is proved. A
full-fledged proof would use characteristic functions; for the
sake of simplicity, a slightly less general proof that is based
on moment-generating functions is given.
Theorem 5
Weak Law of Large Numbers (Khinčin’s). The sample mean as-
sociated with a random (i.i.d.) sample drawn from the distribution of
a random vector x with finite mean E[x] < ∞ converges in probability
to that population mean:
x̄_N = (1/N) Σ_{i=1}^N x_i →p E[x]
Proof.
(Sketched.) The proof is restricted to random vectors x for which the
m.g.f. Mx (t) is actually defined. A more general analysis, which also
allows for random vectors that lack an m.g.f., would use characteristic
functions φx (t) instead. (Continues. . . )
Weak Law of Large Numbers (2/3)
Theorem 5
Proof.
(Continued.) The m.g.f. of the sample mean x̄_N is, for a given N:
M_{x̄_N}(t) = E[exp(t^T x̄_N)] = Π_{i=1}^N E[exp(t^T x_i / N)] = [M_x(t/N)]^N
Theorem 5
Proof.
(Continued.) From a Taylor expansion around t_0 = 0:
M_{x̄_N}(t) = [1 + t^T E[x]/N + o(t^T ι/N)]^N
and taking the limit as N → ∞ gives lim_{N→∞} M_{x̄_N}(t) = exp(t^T E[x]),
which is the m.g.f. of a random vector that equals the constant E[x] with
probability one; hence x̄_N →p E[x].
Theorem 6
Strong Law of Large Numbers (Kolmogorov’s). If in a random
(i.i.d.) sample drawn from the distribution of some random vector x it
simultaneously holds that:
i. E [x] < ∞,
ii. Var [x] < ∞,
iii. Σ_{n=1}^∞ n^{-2} Var[x_n] < ∞,
then the sample mean x̄_N = (1/N) Σ_{i=1}^N x_i converges almost surely to its
population mean.
x̄_N = (1/N) Σ_{i=1}^N x_i →a.s. E[x]
Strong Laws of Large Numbers (2/2)
The following result applies to non-identically distributed observations.
Theorem 7
Strong Law of Large Numbers (Markov’s). Consider a sample
with independent, non-identically distributed observations (i.n.i.d.), so
that the random vectors x_i that generate it have possibly heterogeneous
moments E[x_i] and Var[x_i]. If for some δ > 0 it holds that:
Σ_{i=1}^∞ (1/i^{1+δ}) E[|x_i − E[x_i]|^{1+δ}] < ∞
then the sample mean converges almost surely to the average of the
population means: x̄_N − (1/N) Σ_{i=1}^N E[x_i] →a.s. 0.
[Figure: LLN simulation, N = 1]
LLN: Illustrative simulations (3/5)
[Figure: LLN simulation, N = 10]
LLN: Illustrative simulations (4/5)
[Figure: LLN simulation, N = 100]
LLN: Illustrative simulations (5/5)
[Figure: LLN simulation, N = 1000]
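The panels above can be reproduced with a sketch along the following lines, under the assumption that they display the sampling distribution of the mean of Poisson(λ = 4) draws (the same draws the CLT figures further below refer to).

```python
# Minimal replication sketch (assumption: the figures show how the sampling
# distribution of the mean of Poisson(lambda = 4) draws concentrates around
# 4 as N grows).
import numpy as np

rng = np.random.default_rng(2)
lam, reps = 4.0, 10_000

for N in (1, 10, 100, 1000):
    means = rng.poisson(lam, size=(reps, N)).mean(axis=1)
    # the spread of the sample mean shrinks roughly like 1/sqrt(N)
    print(f"N = {N:>4}: mean of the sample mean = {means.mean():.3f}, "
          f"std of the sample mean = {means.std():.3f}")
```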
Consistent estimators
The Laws of Large Numbers can be exploited to show that selected
estimators converge in probability to the parameters of interest for
estimation.
Definition 7
Consistent Estimators. An estimator θ̂_N is called consistent if it
converges in probability to the true population parameters θ_0 which it
is meant to estimate.
θ̂_N →p θ_0
where X̄ = N^{-1} Σ_{i=1}^N X_i and Ȳ = N^{-1} Σ_{i=1}^N Y_i. This estimator
can also be obtained via MLE under certain assumptions.
(Continues. . . )
Consistency of Maximum Likelihood (3/3)
Theorem 9
Proof.
(Continued.) All these facts can be simultaneously reconciled only if,
at the limit:
(1/N) Σ_{i=1}^N log f_x(x_i; θ̂_MLE) →p E[log f_x(x; θ̂_MLE)]
Definition 8
Convergence in Distribution. Consider:
• a sequence of random vectors xN , whose each element has a cu-
mulative distribution function FxN (xN ),
• and a random vector x with cumulative distribution function
Fx (x).
The random sequence x_N is said to converge in distribution to x if:
lim_{N→∞} F_{x_N}(z) = F_x(z)
at every point z where F_x(·) is continuous, which is denoted as x_N →d x.
The distribution Fx (x) from the definition takes the following name.
Definition 9
Limiting Distribution. If x_N →d x, that is, some random sequence
x_N converges in distribution to a random vector x, then F_x(x) is said to be
the limiting distribution of x_N.
[Figure: c.d.f. F_X(x) for ν = 1, ν = 3, and the limit ν → ∞, plotted over x ∈ [−5, 5]]
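Assuming the curves in the figure are Student’s t c.d.f.s indexed by the degrees of freedom ν, the short sketch below shows them approaching the standard normal c.d.f. as ν grows.

```python
# Minimal sketch (assumption: the figure compares Student's t c.d.f.s for
# nu = 1 and nu = 3 against their nu -> infinity limit, the standard normal).
from scipy import stats

for x in (-2.0, 0.0, 2.0):
    for nu in (1, 3, 30, 300):
        print(f"F_t(x={x:+.1f}; nu={nu:>3}) = {stats.t.cdf(x, df=nu):.4f}", end="  ")
    print(f"| Phi({x:+.1f}) = {stats.norm.cdf(x):.4f}")
```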
Convergence of t-statistics
• Consider a random sample which is drawn from a normally
distributed random variable X ∼ N(µ, σ²).
[(N − K)/(K(N − 1))] · t²_N = [(N − K) N/(K(N − 1))] · (x̄ − µ)^T S^{−1} (x̄ − µ) ∼ F_{K,N−K}
For a given N , this statistic follows the F-distribution with
paired degrees of freedom K and N − K.
• Per Observation 2, Hotelling’s t-squared statistic converges
in distribution to a chi-squared distribution with K degrees
of freedom:
t²_N →d χ²_K
note that the term (N − K) / (N − 1) vanishes as N → ∞.
• Like in the univariate case, this result facilitates statistical
inference in multivariate settings.
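A short numerical check (an added sketch, not from the slides): as N grows, the exact F-based critical value for Hotelling’s t-squared statistic approaches the chi-squared critical value implied by the limiting distribution.

```python
# Minimal sketch: reject when t2_N exceeds K(N-1)/(N-K) times the F_{K,N-K}
# critical value; this threshold approaches the chi-squared(K) one as N grows,
# consistent with t2_N ->d chi2_K.
from scipy import stats

K, alpha = 3, 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=K)

for N in (20, 50, 200, 1000):
    f_crit = K * (N - 1) / (N - K) * stats.f.ppf(1 - alpha, dfn=K, dfd=N - K)
    print(f"N = {N:>4}: F-based critical value = {f_crit:.3f}, "
          f"chi-squared limit = {chi2_crit:.3f}")
```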
Gamma convergence to the normal
Observation 3
Asymptotics of the Gamma distribution. Consider a random vari-
able that follows the Gamma distribution with parameters α and β,
X ∼ Γ(α, β). Let µ = α/β as well as σ² = α/β². As α → ∞, the
probability distribution of X tends to that of a normal distribution
with parameters µ and σ², i.e. lim_{α→∞} X ∼ N(µ, σ²).
Proof.
Define the random variable Z = (X − µ)/σ = (β/√α) X − √α; by
the properties of m.g.f.s it is:
M_Z(t) = exp(−√α t) · M_X(β t/√α) = exp(−√α t) (1 − t/√α)^{−α}
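A quick numerical illustration of the observation (an added sketch): for a fixed β, the Gamma(α, β) c.d.f. gets closer to the N(µ, σ²) c.d.f. as α grows.

```python
# Minimal sketch: compare the Gamma(alpha, beta) c.d.f. with its normal
# approximation N(mu, sigma^2), mu = alpha/beta and sigma^2 = alpha/beta^2,
# evaluated one standard deviation above the mean (beta fixed at 2 here).
from scipy import stats

beta = 2.0
for alpha in (2, 20, 200, 2000):
    mu, sigma = alpha / beta, (alpha ** 0.5) / beta
    x = mu + sigma
    gamma_cdf = stats.gamma.cdf(x, a=alpha, scale=1 / beta)
    normal_cdf = stats.norm.cdf(x, loc=mu, scale=sigma)
    print(f"alpha = {alpha:>4}: Gamma c.d.f. = {gamma_cdf:.4f}, "
          f"Normal c.d.f. = {normal_cdf:.4f}")
```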
Theorem 10
Continuous Mapping Theorem (convergence in distribution).
Under the hypotheses of Theorem 4:
x_N →d x   ⇒   g(x_N) →d g(x)
Proof.
(Continues. . . )
Slutskij’s Theorem (2/2)
Theorem 11
Proof.
(Continued.) Recognize that, as Y_N →p c, Y_N has a degenerate limi-
ting distribution, and the (vector-valued) random sequence (XN , YN )
converges in distribution to that of the random vector (X, c). All the
results above follow, therefore, from applying the Continuous Mapping
Theorem to three given continuous functions of XN and YN .
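A minimal sketch of Slutskij’s Theorem at work (an added illustration): the studentized sample mean, in which the estimated standard deviation converges in probability to σ, behaves like a standard normal variable in large samples.

```python
# Minimal sketch: sqrt(N)*(xbar - mu) ->d N(0, sigma^2) and s_N ->p sigma,
# so their ratio (the usual t-statistic) converges in distribution to N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, reps, N = 1.0, 10_000, 500

x = rng.exponential(scale=mu, size=(reps, N))       # Exponential(1): mu = sigma = 1
t_stat = np.sqrt(N) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

# compare simulated tail probabilities with the standard normal benchmark
for q in (1.0, 1.645, 1.96):
    print(f"P(T > {q}) ~ {np.mean(t_stat > q):.4f}  vs  "
          f"1 - Phi({q}) = {1 - stats.norm.cdf(q):.4f}")
```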
Theorem 12
Extreme Value Theorem (Fisher-Tippett-Gnedenko). Given a
random (i.i.d.) sample (X1 , . . . , XN ), if a convergence in distribution
result of the kind
(X_(N) − b_N) / a_N →d W
can be established – where X(N ) is the maximum order statistic while
aN > 0 and bN are sequences of real constants – then:
W ∼ GEV (0, 1, ξ)
for some real ξ. That is, the limiting distribution of the “normalized”
maximum is some standardized type of the Generalized Extreme Value
distribution.
Proof.
(Outline.) The extended proof is quite elaborate. (Continues. . . )
The Extreme Value Theorem (2/4)
Theorem 12
Proof.
(Continued.) The objective is to show that, given a random variable
X from which the random sample is drawn, for all the points x ∈ X in
its support where the distribution FX (x) is continuous:
lim_{N→∞} [F_X(a_N x + b_N)]^N = exp(−(1 + ξx)^{−1/ξ})
(A symmetric result for the minimum order statistic follows by applying the
theorem to Y_i = −X_i, since then Y_(1) = −X_(N).)
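A minimal simulation sketch of the Extreme Value Theorem (added here, assuming Exponential(1) data with the normalization a_N = 1 and b_N = log N): the normalized maximum approaches the standard Gumbel law, i.e. the GEV case ξ = 0.

```python
# Minimal sketch: for i.i.d. Exponential(1) draws, X_(N) - log(N) converges
# in distribution to the standard Gumbel law (c.d.f. exp(-exp(-x))).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
reps, N = 10_000, 1_000

maxima = rng.exponential(size=(reps, N)).max(axis=1) - np.log(N)

for x in (-1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(maxima <= x)
    gumbel = stats.gumbel_r.cdf(x)
    print(f"x = {x:+.1f}: empirical c.d.f. = {empirical:.4f}, "
          f"Gumbel c.d.f. = {gumbel:.4f}")
```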
Theorem 13
Central Limit Theorem (Lindeberg and Lévy’s). The sample
mean x̄N associated with a random (i.i.d.) sample drawn from the di-
stribution of a random vector x with mean and variance that are both
finite: E[x] < ∞ and Var[x] < ∞, is such that the random sequence
defined as the centered sample mean multiplied by √N converges in di-
stribution to a multivariate normal distribution.
√N ((1/N) Σ_{i=1}^N x_i − E[x]) →d N(0, Var[x])
Proof.
(Sketched.) Like in the earlier proof for the Weak Law of Large Numbers
(Theorem 5) this one will make use of moment-generating functions,
but in order to be general enough, characteristic functions should be
used instead. (Continues. . . )
Classic Central Limit Theorem (2/5)
Theorem 13
Proof.
(Continued.) Consider the standardized random vector
z = [Var[x]]^{−1/2} (x − E[x])
where the matrix [Var[x]]^{−1/2} and its inverse [Var[x]]^{1/2} satisfy:
[Var[x]]^{−1/2} Var[x] [Var[x]]^{−1/2} = I
(Continues. . . )
Classic Central Limit Theorem (4/5)
Theorem 13
Proof.
(Continued.) To show this, express the m.g.f. of z̄_N ≡ N^{−1/2} Σ_{i=1}^N z_i,
for fixed N, as:
M_{z̄_N}(t) = E[exp(t^T z̄_N)]
           = E[exp((1/√N) Σ_{i=1}^N t^T z_i)]
           = Π_{i=1}^N E[exp((1/√N) t^T z_i)]
           = [M_z(t/√N)]^N
by a derivation analogous to the one in the proof of the Weak Law of
Large Numbers. (Continues. . . )
Classic Central Limit Theorem (5/5)
Theorem 13
Proof.
(Continued.) Just like in that proof, apply here a Taylor expansion
of the above expression around t0 = 0, but now of second degree:
" T # N
tT E [z] tT E zz T t t t
Mz̄¯N (t) = 1 + √ + +o
N 2N 2N
N
tT t tT t
= 1+ +o
2N 2N
where the second line exploits the fact that E [z] = 0 and E zz T = I
by construction of z. Taking the limit for N → ∞ now gives:
T
t t
lim Mz̄¯N (t) = exp
N →∞ 2
• The notation ∼A indicates here that the normal distribution
in question, which is called the asymptotic distribution,
is approximate and is valid for a fixed N , instead of being a
“limiting” distribution.
• These are based on exactly the same random draws from the
Poisson distribution with parameter λ = 4.
[Figure: CLT simulation, N = 1]
CLT: Illustrative simulations (3/5)
[Figure: CLT simulation, N = 10]
CLT: Illustrative simulations (4/5)
[Figure: CLT simulation, N = 100]
CLT: Illustrative simulations (5/5)
[Figure: CLT simulation, N = 1000]
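The panels can be reproduced with a sketch like the one below, assuming they display the standardized sample mean of the Poisson(λ = 4) draws mentioned above.

```python
# Minimal replication sketch (assumption: the figures show the distribution of
# z = sqrt(N)*(xbar - lambda)/sqrt(lambda) for Poisson(lambda = 4) draws,
# which approaches N(0, 1) as N grows).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam, reps = 4.0, 10_000

for N in (1, 10, 100, 1000):
    xbar = rng.poisson(lam, size=(reps, N)).mean(axis=1)
    z = np.sqrt(N) * (xbar - lam) / np.sqrt(lam)     # standardized sample mean
    print(f"N = {N:>4}: P(Z > 1.96) ~ {np.mean(z > 1.96):.4f} "
          f"(normal benchmark: {1 - stats.norm.cdf(1.96):.4f})")
```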
More general Central Limit Theorems
• As with the Laws of Large Numbers, more general Central
Limit Theorems with less restrictive assumptions exist.
• Two famous versions, which both allow for i.n.i.d. data, are
presented next without proof.
• Even in this case, some more general versions that allow for
weakly dependent observations also exist.
Lindeberg-Feller Central Limit Theorem
Theorem 14
Central Limit Theorem (Lindeberg and Feller’s). Consider a
non-random (i.n.i.d.) sample where the random vectors xi that gene-
rate it have possibly heterogeneous finite means E [xi ] < ∞, variances
Var[x_i] < ∞, and all mixed third moments are finite too. If:
lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−1} Var[x_i] = 0
then:
√N (x̄_N − (1/N) Σ_{i=1}^N E[x_i]) →d N(0, Var[x])
where:
(1/N) Σ_{i=1}^N Var[x_i] →p Var[x]
that is, the positive semi-definite matrix Var[x] is the probability limit
of the observations’ average variance.
Ljapunov’s Central Limit Theorem
Theorem 15
Central Limit Theorem (Ljapunov’s). Consider a non-random
(i.n.i.d.) sample where the random vectors xi that generate it have
possibly heterogeneous finite moments E [xi ] < ∞, Var [xi ] < ∞. If:
lim_{N→∞} (Σ_{i=1}^N Var[x_i])^{−(1+δ/2)} Σ_{i=1}^N E[|x_i − E[x_i]|^{2+δ}] = 0
then the same conclusion as in Theorem 14 holds.
Proof.
(Continues. . . )
The Delta Method (2/2)
Theorem 16
Proof.
(Continued.) From the mean value theorem it is:
d(x_N) = d(c) + [∂d(x̃_N)/∂x^T] (x_N − c)
where x̃_N is a convex combination of x_N and c. However, as x_N →p c:
∂d(x̃_N)/∂x^T →p ∂d(c)/∂x^T = ∆
hence, at the probability limit:
√N (d(x_N) − d(c)) →p ∆ · √N (x_N − c)
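A minimal numerical sketch of the Delta Method (an added illustration using the hypothetical map d(u) = exp(u) and Exponential(1) data): the simulated variance of √N(d(x̄_N) − d(µ)) is close to ∆ Var[x] ∆^T.

```python
# Minimal sketch: Delta Method for d(u) = exp(u) applied to the sample mean of
# Exponential(1) data, where mu = E[X] = 1 and sigma^2 = Var[X] = 1. Then
# sqrt(N)*(d(xbar) - d(mu)) has variance close to (d'(mu))^2 * sigma^2.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2 = 1.0, 1.0
reps, N = 10_000, 1_000

xbar = rng.exponential(scale=1.0, size=(reps, N)).mean(axis=1)
stat = np.sqrt(N) * (np.exp(xbar) - np.exp(mu))

delta_var = np.exp(mu) ** 2 * sigma2      # Delta * Var[x] * Delta^T
print(f"simulated variance:    {stat.var():.3f}")
print(f"Delta Method variance: {delta_var:.3f}")
```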
Proof.
(Continues. . . )
Method of moments asymptotic normality (2/4)
Theorem 17
Proof.
(Continued.) The proof applies the same logic as the Delta Method.
By the mean value theorem, the sample moment conditions become:
0 = (1/N) Σ_{i=1}^N m(x_i; θ̂_MM)
  = (1/N) Σ_{i=1}^N m(x_i; θ_0) + [(1/N) Σ_{i=1}^N ∂m(x_i; θ̃_N)/∂θ^T] (θ̂_MM − θ_0)
where the first expression in the first line equals zero by construction
of all MM estimators. After multiplying both sides by √N and some
manipulation, the above expression is rendered as follows.
√N (θ̂_MM − θ_0) = −[(1/N) Σ_{i=1}^N ∂m(x_i; θ̃_N)/∂θ^T]^{−1} (1/√N) Σ_{i=1}^N m(x_i; θ_0)
(Continues. . . )
Method of moments asymptotic normality (3/4)
Theorem 17
Proof.
(Continued.) Since this is a random sample:
1. by a suitable Central Limit Theorem:
−(1/√N) Σ_{i=1}^N m(x_i; θ_0) →d N(0, Var[m(x_i; θ_0)])
(Continues. . . )
Method of moments asymptotic normality (4/4)
Theorem 17
Proof.
(Continued.) These intermediate results are combined together via
the Continuous Mapping Theorem, Slutskij’s Theorem, as well as the
Cramér-Wold device, so as to imply the statement. Therefore, for a fixed
N the asymptotic distribution is:
θ̂_MM ∼A N(θ_0, (1/N) M_0 Υ_0 M_0^T)
Consequently, θ̂_MLE asymptotically attains the Cramér-Rao bound.
Proof.
(Continues. . . )
Maximum likelihood asymptotic normality (3/5)
Theorem 18
Proof.
(Continued.) The proof proceeds similarly to the MM case. By the
mean value theorem, the MLE First Order Conditions can be written as:
0 = (1/N) Σ_{i=1}^N ∂ log f_x(x_i; θ̂_MLE)/∂θ
  = (1/N) Σ_{i=1}^N ∂ log f_x(x_i; θ_0)/∂θ +
    + [(1/N) Σ_{i=1}^N ∂² log f_x(x_i; θ̃_N)/∂θ∂θ^T] (θ̂_MLE − θ_0)
so that, rearranging and multiplying both sides by √N:
√N (θ̂_MLE − θ_0) = −[(1/N) Σ_{i=1}^N ∂² log f_x(x_i; θ̃_N)/∂θ∂θ^T]^{−1} (1/√N) Σ_{i=1}^N ∂ log f_x(x_i; θ_0)/∂θ
H_0: β_1 = C    H_1: β_1 ≠ C
β_1 ∈ [ β̂_{1,MM} − z*_{α/2} S_{β̂_1}/√N ,  β̂_{1,MM} + z*_{α/2} S_{β̂_1}/√N ]
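A minimal sketch (added here) of a confidence interval of this form, using the sample mean as an MM estimator of E[X]; the data-generating process and parameter values are illustrative assumptions.

```python
# Minimal sketch: a 95% asymptotic confidence interval of the textbook form
# [estimate +/- z*_{alpha/2} * S / sqrt(N)], for the mean of Exponential(2) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, alpha = 400, 0.05
x = rng.exponential(scale=2.0, size=N)        # true E[X] = 2

estimate = x.mean()
S = x.std(ddof=1)                             # estimated std. dev. of one observation
z_star = stats.norm.ppf(1 - alpha / 2)        # approximately 1.96
half_width = z_star * S / np.sqrt(N)
print(f"95% CI for E[X]: [{estimate - half_width:.3f}, {estimate + half_width:.3f}]")
```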
Estimating MM asymptotic variance-covariances
In the MM case, M̂_N Υ̂_N M̂_N^T / N is a consistent estimator of the
asymptotic variance-covariance in random samples, where:
M̂_N ≡ [(1/N) Σ_{i=1}^N ∂m(x_i; θ̂_MM)/∂θ^T]^{−1} →p M_0
with
Ĵ_N →p I(θ_0).
The choice between Ĥ_N and Ĵ_N is based on convenience and is
context-dependent. Observe how all these estimators (both MM
and MLE) are evaluated at the consistent parameter estimates.
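As an added illustration of Ĥ_N versus Ĵ_N (a sketch, not from the slides), the code below estimates the MLE asymptotic variance in a simple Exponential(λ) model, where both the average negative Hessian and the average outer product of scores consistently estimate I(λ_0) = 1/λ_0².

```python
# Minimal sketch: two estimators of the MLE asymptotic variance for an
# Exponential(rate = lambda) model, log f(x; lambda) = log(lambda) - lambda*x,
# whose MLE is 1/xbar. H_N uses the average negative Hessian, J_N the average
# outer product of the scores; both are evaluated at the estimate.
import numpy as np

rng = np.random.default_rng(8)
lam_true, N = 2.0, 5_000
x = rng.exponential(scale=1 / lam_true, size=N)

lam_hat = 1 / x.mean()                      # MLE of lambda
scores = 1 / lam_hat - x                    # d log f / d lambda at lam_hat
H_N = 1 / lam_hat ** 2                      # -(1/N) * sum of second derivatives
J_N = np.mean(scores ** 2)                  # (1/N) * sum of squared scores

print(f"lam_hat = {lam_hat:.3f}")
print(f"Var estimate via H_N: {1 / (N * H_N):.6f}")
print(f"Var estimate via J_N: {1 / (N * J_N):.6f}")
print(f"theoretical Var:      {lam_true ** 2 / N:.6f}")
```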