
Lecture Notes 4

Convergence (Chapter 5)

1 Random Samples
Let $X_1, \ldots, X_n \sim F$. A statistic is any function $T_n = g(X_1, \ldots, X_n)$. Recall that the sample mean is
\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i
\]
and the sample variance is
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]
Let $\mu = E(X_i)$ and $\sigma^2 = \mathrm{Var}(X_i)$. Recall that
\[
E(\bar{X}_n) = \mu, \qquad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}, \qquad E(S_n^2) = \sigma^2.
\]

Theorem 1 If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then $\bar{X}_n \sim N(\mu, \sigma^2/n)$.


Proof. We know that $M_{X_i}(s) = e^{\mu s + \sigma^2 s^2/2}$. So,
\[
M_{\bar{X}_n}(t) = E(e^{t\bar{X}_n}) = E\left(e^{\frac{t}{n}\sum_{i=1}^n X_i}\right)
= \left(E e^{tX_i/n}\right)^n = \left(M_{X_i}(t/n)\right)^n
= \left(e^{\mu t/n + \sigma^2 t^2/(2n^2)}\right)^n
= \exp\left\{\mu t + \frac{\sigma^2 t^2}{2n}\right\}
\]
which is the mgf of a $N(\mu, \sigma^2/n)$. $\Box$
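Remark (simulation sketch): the following Python snippet is not part of the notes; it checks Theorem 1 numerically. The values µ = 2, σ = 3, n = 25 are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative check of Theorem 1: the mean of n iid N(mu, sigma^2) draws
# should behave like a N(mu, sigma^2 / n) random variable.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))   # each row is one sample of size n
xbar = samples.mean(axis=1)                       # one sample mean per row

print("mean of X_bar:", xbar.mean(), "(theory:", mu, ")")
print("var  of X_bar:", xbar.var(), "(theory:", sigma**2 / n, ")")
```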

Example 2 Let $X_{(1)}, \ldots, X_{(n)}$ denote the ordered values:
\[
X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.
\]
Then $X_{(1)}, \ldots, X_{(n)}$ are called the order statistics, and $T_n = (X_{(1)}, \ldots, X_{(n)})$ is a statistic.

2 Convergence
Let X1 , X2 , . . . be a sequence of random variables and let X be another random variable.
Let Fn denote the cdf of Xn and let F denote the cdf of X. We are going to study different
types of convergence.

Example: A good example to keep in mind is the following. Let $Y_1, Y_2, \ldots$ be a sequence of iid random variables. Let
\[
X_n = \frac{1}{n}\sum_{i=1}^n Y_i
\]
be the average of the first $n$ of the $Y_i$'s. This defines a new sequence $X_1, X_2, \ldots$. In other words, the sequence of interest $X_1, X_2, \ldots$ might be a sequence of statistics based on some other sequence of iid random variables. Note that the original sequence $Y_1, Y_2, \ldots$ is iid but the sequence $X_1, X_2, \ldots$ is not iid.

1. $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{a.s.} X$, if, for every $\epsilon > 0$,
\[
P\left(\lim_{n\to\infty} |X_n - X| < \epsilon\right) = 1. \tag{1}
\]
$X_n$ converges almost surely to a constant $c$, written $X_n \xrightarrow{a.s.} c$, if
\[
P\left(\lim_{n\to\infty} X_n = c\right) = 1. \tag{2}
\]

2. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$,
\[
P(|X_n - X| > \epsilon) \to 0 \tag{3}
\]
as $n \to \infty$. In other words, $X_n - X = o_P(1)$.
$X_n$ converges to $c$ in probability, written $X_n \xrightarrow{P} c$, if, for every $\epsilon > 0$,
\[
P(|X_n - c| > \epsilon) \to 0 \tag{4}
\]
as $n \to \infty$. In other words, $X_n - c = o_P(1)$.

3. $X_n$ converges to $X$ in quadratic mean (also called convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if
\[
E(X_n - X)^2 \to 0 \tag{5}
\]
as $n \to \infty$.
$X_n$ converges to $c$ in quadratic mean, written $X_n \xrightarrow{qm} c$, if
\[
E(X_n - c)^2 \to 0 \tag{6}
\]
as $n \to \infty$.

4. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if
\[
\lim_{n\to\infty} F_n(t) = F(t) \tag{7}
\]
at all $t$ for which $F$ is continuous.
$X_n$ converges to $c$ in distribution, written $X_n \rightsquigarrow c$, if
\[
\lim_{n\to\infty} F_n(t) = \delta_c(t) \tag{8}
\]
at all $t \ne c$, where $\delta_c(t) = 0$ if $t < c$ and $\delta_c(t) = 1$ if $t \ge c$.

Theorem 3 Convergence in probability does not imply almost sure convergence.

Proof. Let $\Omega = [0,1]$ and let $P$ be the uniform distribution on $[0,1]$. We draw $s \sim P$. Let $X(s) = s$ and let
\[
X_1 = s + I_{[0,1]}(s), \quad X_2 = s + I_{[0,1/2]}(s), \quad X_3 = s + I_{[1/2,1]}(s),
\]
\[
X_4 = s + I_{[0,1/3]}(s), \quad X_5 = s + I_{[1/3,2/3]}(s), \quad X_6 = s + I_{[2/3,1]}(s),
\]
etc. Then $X_n \xrightarrow{P} X$. But, for each $s$, $X_n(s)$ does not converge to $X(s)$. Hence, $X_n$ does not converge almost surely to $X$. In fact, $P(\{s \in \Omega : \lim_n X_n(s) = X(s)\}) = 0$. $\Box$
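Remark (simulation sketch): the construction above can be made concrete in a few lines of Python (not part of the notes). The interval lengths give $P(|X_n - X| > \epsilon)$, which tends to 0, while any fixed point $s$ keeps landing in some interval, so $X_n(s) - X(s)$ equals 1 infinitely often.

```python
# Illustrative sketch of the proof of Theorem 3: the intervals [0,1], [0,1/2],
# [1/2,1], [0,1/3], ... shrink in length, so P(|X_n - X| > eps) -> 0, yet any
# fixed s is covered infinitely often, so X_n(s) does not converge to X(s).
def intervals(max_n):
    out, k = [], 1
    while len(out) < max_n:
        out.extend([(j / k, (j + 1) / k) for j in range(k)])
        k += 1
    return out[:max_n]

s = 0.37                                   # an arbitrary fixed sample point
ivals = intervals(200)
lengths = [round(b - a, 4) for a, b in ivals]
hits = [n for n, (a, b) in enumerate(ivals, start=1) if a <= s <= b]

print("P(|X_n - X| > eps) for the last few n:", lengths[-5:])   # tending to 0
print("last few n <= 200 with X_n(s) - X(s) = 1:", hits[-5:])   # never stops
```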

Example 4 Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at 0, so we would like to say that $X_n$ converges to 0. Let's see if this is true. Let $F$ be the distribution function of a point mass at 0. Note that $\sqrt{n}\,X_n \sim N(0,1)$. Let $Z$ denote a standard normal random variable. For $t < 0$,
\[
F_n(t) = P(X_n < t) = P(\sqrt{n}\,X_n < \sqrt{n}\,t) = P(Z < \sqrt{n}\,t) \to 0
\]
since $\sqrt{n}\,t \to -\infty$. For $t > 0$,
\[
F_n(t) = P(X_n < t) = P(\sqrt{n}\,X_n < \sqrt{n}\,t) = P(Z < \sqrt{n}\,t) \to 1
\]
since $\sqrt{n}\,t \to \infty$. Hence, $F_n(t) \to F(t)$ for all $t \ne 0$ and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \ne F(0) = 1$, so convergence fails at $t = 0$. That doesn't matter, because $t = 0$ is not a continuity point of $F$ and the definition of convergence in distribution only requires convergence at continuity points.

Now consider convergence in probability. For any $\epsilon > 0$, using Markov's inequality,
\[
P(|X_n| > \epsilon) = P(|X_n|^2 > \epsilon^2) \le \frac{E(X_n^2)}{\epsilon^2} = \frac{1}{n\epsilon^2} \to 0
\]
as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$.
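Remark (computation sketch, not part of the notes): the cdf of $X_n \sim N(0,1/n)$ is $F_n(t) = \Phi(t\sqrt{n})$, so the claims above can be checked numerically in a few lines of Python. The tolerance ε = 0.1 and the evaluation points are arbitrary.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Illustrative check of Example 4: X_n ~ N(0, 1/n) concentrates at 0, so
# P(|X_n| > eps) -> 0 and F_n(t) = Phi(t*sqrt(n)) -> 0 or 1 at every t != 0.
eps = 0.1
for n in [1, 10, 100, 1000, 10000]:
    tail = 2 * (1 - Phi(eps * sqrt(n)))          # exact P(|X_n| > eps)
    print(f"n={n:6d}  P(|X_n|>0.1)={tail:.4f}  "
          f"F_n(-0.05)={Phi(-0.05 * sqrt(n)):.4f}  F_n(0.05)={Phi(0.05 * sqrt(n)):.4f}")
```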

The next theorem gives the relationship between the types of convergence.

Theorem 5 The following relationships hold:


(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.

(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.

(c) If $X_n \rightsquigarrow X$ and if $P(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.

(d) $X_n \xrightarrow{a.s.} X$ implies $X_n \xrightarrow{P} X$.

In general, none of the reverse implications hold except the special case in (c).
Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then, using Markov's inequality,
\[
P(|X_n - X| > \epsilon) = P(|X_n - X|^2 > \epsilon^2) \le \frac{E|X_n - X|^2}{\epsilon^2} \to 0.
\]

Proof of (b). Fix $\epsilon > 0$ and let $x$ be a continuity point of $F$. Then
\begin{align*}
F_n(x) = P(X_n \le x) &= P(X_n \le x, X \le x + \epsilon) + P(X_n \le x, X > x + \epsilon) \\
&\le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon) \\
&= F(x + \epsilon) + P(|X_n - X| > \epsilon).
\end{align*}
Also,
\begin{align*}
F(x - \epsilon) = P(X \le x - \epsilon) &= P(X \le x - \epsilon, X_n \le x) + P(X \le x - \epsilon, X_n > x) \\
&\le F_n(x) + P(|X_n - X| > \epsilon).
\end{align*}
Hence,
\[
F(x - \epsilon) - P(|X_n - X| > \epsilon) \le F_n(x) \le F(x + \epsilon) + P(|X_n - X| > \epsilon).
\]
Take the limit as $n \to \infty$ to conclude that
\[
F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon).
\]
This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$, use the fact that $F$ is continuous at $x$, and conclude that $\lim_n F_n(x) = F(x)$.
Proof of (c). Fix $\epsilon > 0$. Then,
\begin{align*}
P(|X_n - c| > \epsilon) &= P(X_n < c - \epsilon) + P(X_n > c + \epsilon) \\
&\le P(X_n \le c - \epsilon) + P(X_n > c + \epsilon) \\
&= F_n(c - \epsilon) + 1 - F_n(c + \epsilon) \\
&\to F(c - \epsilon) + 1 - F(c + \epsilon) \\
&= 0 + 1 - 1 = 0.
\end{align*}

Proof of (d). Omitted.
Let us now show that the reverse implications do not hold.
Convergence in probability does not imply convergence in quadratic mean. Let $U \sim \mathrm{Unif}(0,1)$ and let $X_n = \sqrt{n}\, I_{(0,1/n)}(U)$. Then $P(|X_n| > \epsilon) = P(\sqrt{n}\, I_{(0,1/n)}(U) > \epsilon) = P(0 \le U < 1/n) = 1/n \to 0$. Hence, $X_n \xrightarrow{P} 0$. But $E(X_n^2) = n \int_0^{1/n} du = 1$ for all $n$, so $X_n$ does not converge in quadratic mean.

Convergence in distribution does not imply convergence in probability. Let $X \sim N(0,1)$ and let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0,1)$. $X_n$ has the same distribution function as $X$ for all $n$, so, trivially, $\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But $P(|X_n - X| > \epsilon) = P(|2X| > \epsilon) = P(|X| > \epsilon/2) \ne 0$. So $X_n$ does not converge to $X$ in probability. $\Box$
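Remark (simulation sketch, not part of the notes): the first counterexample is easy to see numerically. The event $\{X_n \ne 0\}$ becomes rare, but when it happens $X_n^2 = n$, so $E(X_n^2)$ stays at 1.

```python
import numpy as np

# Illustrative look at X_n = sqrt(n) * 1{U < 1/n}, U ~ Unif(0,1):
# P(|X_n| > eps) = 1/n -> 0, but E(X_n^2) = 1 for every n.
rng = np.random.default_rng(1)
u = rng.uniform(size=200_000)

for n in [10, 100, 1000]:
    xn = np.sqrt(n) * (u < 1 / n)
    print(f"n={n:5d}  P(|X_n| > 0.5) ~ {np.mean(np.abs(xn) > 0.5):.4f}  "
          f"E(X_n^2) ~ {np.mean(xn**2):.3f}")
```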

The relationships between the types of convergence can be summarized as follows:

\[
\text{q.m.} \;\Longrightarrow\; \text{prob} \;\Longrightarrow\; \text{distribution},
\qquad
\text{a.s.} \;\Longrightarrow\; \text{prob}.
\]

Example 6 One might conjecture that if $X_n \xrightarrow{P} b$, then $E(X_n) \to b$. This is not true. Let $X_n$ be a random variable defined by $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - (1/n)$. Now, $P(|X_n| < \epsilon) = P(X_n = 0) = 1 - (1/n) \to 1$. Hence, $X_n \xrightarrow{P} 0$. However, $E(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus, $E(X_n) \to \infty$.
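Remark (simulation sketch, not part of the notes): the rare but enormous value $n^2$ is exactly what keeps $E(X_n)$ large while $X_n \xrightarrow{P} 0$.

```python
import numpy as np

# Illustrative simulation of Example 6: P(X_n = n^2) = 1/n, P(X_n = 0) = 1 - 1/n.
# X_n -> 0 in probability, yet E(X_n) = n blows up because of the rare huge value.
rng = np.random.default_rng(2)
reps = 500_000

for n in [10, 100, 1000]:
    xn = np.where(rng.uniform(size=reps) < 1 / n, float(n**2), 0.0)
    print(f"n={n:5d}  P(|X_n| > 0.1) ~ {np.mean(xn > 0.1):.4f}  "
          f"E(X_n) ~ {xn.mean():.1f}  (theory: {n})")
```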

Example 7 Let $X_1, \ldots, X_n \sim \mathrm{Uniform}(0,1)$ and let $X_{(n)} = \max_i X_i$. First we claim that $X_{(n)} \xrightarrow{P} 1$. This follows since
\[
P(|X_{(n)} - 1| > \epsilon) = P(X_{(n)} \le 1 - \epsilon) = \prod_i P(X_i \le 1 - \epsilon) = (1 - \epsilon)^n \to 0.
\]
Also,
\[
P(n(1 - X_{(n)}) \le t) = P(X_{(n)} \ge 1 - (t/n)) = 1 - (1 - t/n)^n \to 1 - e^{-t}.
\]
So $n(1 - X_{(n)}) \rightsquigarrow \mathrm{Exp}(1)$.
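Remark (simulation sketch, not part of the notes): the limiting Exp(1) distribution of $n(1 - X_{(n)})$ is easy to see by simulation; the sample size and evaluation points below are arbitrary.

```python
import numpy as np

# Illustrative check of Example 7: for X_1,...,X_n ~ Uniform(0,1),
# n * (1 - max_i X_i) should be approximately Exp(1) when n is large.
rng = np.random.default_rng(3)
n, reps = 500, 100_000

x = rng.uniform(size=(reps, n))
t = n * (1 - x.max(axis=1))

# Compare a few empirical probabilities with the Exp(1) cdf 1 - exp(-t).
for c in [0.5, 1.0, 2.0]:
    print(f"P(n(1 - X_(n)) <= {c}):  empirical {np.mean(t <= c):.4f}   "
          f"Exp(1) {1 - np.exp(-c):.4f}")
```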

Some convergence properties are preserved under transformations.

Theorem 8 Let $X_n, X, Y_n, Y$ be random variables.

(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.

(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.

(c) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.

In general, $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ does not imply that $X_n + Y_n \rightsquigarrow X + Y$. But there are cases when it does:

Theorem 9 (Slutzky's Theorem) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$. Also, if $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.

Theorem 10 (The Continuous Mapping Theorem) Let $X_n, X$ be random variables and let $g$ be a continuous function.

(a) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.

(b) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.

Exercise: Prove the continuous mapping theorem.

3 The Law of Large Numbers


The law of large numbers (LLN) says that the mean of a large sample is close to the mean
of the distribution. For example, the proportion of heads of a large number of tosses of a
fair coin is expected to be close to 1/2. We now make this more precise.
Let $X_1, X_2, \ldots$ be an iid sample, let $\mu = E(X_1)$ and $\sigma^2 = \mathrm{Var}(X_1)$. Recall that the sample mean is defined as $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$ and that $E(\bar{X}_n) = \mu$ and $\mathrm{Var}(\bar{X}_n) = \sigma^2/n$.

Theorem 11 (The Weak Law of Large Numbers (WLLN)) If $X_1, \ldots, X_n$ are iid, then $\bar{X}_n \xrightarrow{P} \mu$. Thus, $\bar{X}_n - \mu = o_P(1)$.

Interpretation of the WLLN: The distribution of $\bar{X}_n$ becomes more concentrated around $\mu$ as $n$ gets large.

Proof. Assume that $\sigma < \infty$. This is not necessary, but it simplifies the proof. Using Chebyshev's inequality,
\[
P\left(|\bar{X}_n - \mu| > \epsilon\right) \le \frac{\mathrm{Var}(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2}
\]
which tends to 0 as $n \to \infty$. $\Box$
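Remark (simulation sketch, not part of the notes): a running sample mean of iid Exponential(1) draws (so µ = 1) illustrates the WLLN.

```python
import numpy as np

# Illustrative WLLN check: running sample means of iid Exponential(1) draws
# (so mu = 1) settle down near mu as n grows.
rng = np.random.default_rng(4)
y = rng.exponential(scale=1.0, size=100_000)
running_mean = np.cumsum(y) / np.arange(1, y.size + 1)

for n in [10, 100, 1000, 10_000, 100_000]:
    print(f"n={n:6d}  X_bar_n = {running_mean[n - 1]:.4f}")
```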

Theorem 12 (The Strong Law of Large Numbers) Let $X_1, \ldots, X_n$ be iid with mean $\mu$. Then $\bar{X}_n \xrightarrow{a.s.} \mu$.

The proof is beyond the scope of this course.

4 The Central Limit Theorem
The law of large numbers says that the distribution of $\bar{X}_n$ piles up near $\mu$. This isn't enough to help us approximate probability statements about $\bar{X}_n$. For this we need the central limit theorem.

Suppose that $X_1, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$. The central limit theorem (CLT) says that $\bar{X}_n = n^{-1}\sum_i X_i$ has a distribution which is approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable since nothing is assumed about the distribution of the $X_i$, except the existence of the mean and variance.
Theorem 13 (The Central Limit Theorem (CLT)) Let $X_1, \ldots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then
\[
Z_n \equiv \frac{\bar{X}_n - \mu}{\sqrt{\mathrm{Var}(\bar{X}_n)}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \rightsquigarrow Z
\]
where $Z \sim N(0,1)$. In other words,
\[
\lim_{n\to\infty} P(Z_n \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx.
\]
Interpretation: Probability statements about $\bar{X}_n$ can be approximated using a Normal distribution. It's the probability statements that we are approximating, not the random variable itself.

Remark: We often write
\[
\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)
\]
as short form for $\sqrt{n}(\bar{X}_n - \mu)/\sigma \rightsquigarrow N(0,1)$.
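Remark (simulation sketch, not part of the notes): even for strongly skewed $X_i$, such as Exponential(1) draws (µ = σ = 1), the standardized mean $Z_n$ is close to $N(0,1)$ for moderate $n$. The reference values of $\Phi$ below are standard normal cdf values.

```python
import numpy as np

# Illustrative CLT check: standardized means of iid Exponential(1) draws
# (mu = sigma = 1) look roughly standard normal even though X_i is very skewed.
rng = np.random.default_rng(5)
n, reps = 50, 200_000
x = rng.exponential(scale=1.0, size=(reps, n))
zn = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0     # sqrt(n)(X_bar - mu)/sigma

# Compare a few empirical probabilities with the standard normal cdf values.
for z, phi in [(-1.0, 0.1587), (0.0, 0.5), (1.0, 0.8413), (2.0, 0.9772)]:
    print(f"P(Z_n <= {z:4.1f}):  empirical {np.mean(zn <= z):.4f}   Phi(z) = {phi:.4f}")
```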

Recall that if $X$ is a random variable, its moment generating function (mgf) is $\psi_X(t) = E e^{tX}$. Assume in what follows that the mgf is finite in a neighborhood around $t = 0$.

Lemma 14 Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the mgf of $Z_n$. Let $Z$ be another random variable and denote its mgf by $\psi$. If $\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around 0, then $Z_n \rightsquigarrow Z$.
Proof of the central limit theorem. Let $Y_i = (X_i - \mu)/\sigma$. Then $Z_n = n^{-1/2}\sum_i Y_i$. Let $\psi(t)$ be the mgf of $Y_i$. The mgf of $\sum_i Y_i$ is $(\psi(t))^n$ and the mgf of $Z_n$ is $[\psi(t/\sqrt{n})]^n \equiv \xi_n(t)$. Now $\psi'(0) = E(Y_1) = 0$ and $\psi''(0) = E(Y_1^2) = \mathrm{Var}(Y_1) = 1$. So,
\begin{align*}
\psi(t) &= \psi(0) + t\psi'(0) + \frac{t^2}{2!}\psi''(0) + \frac{t^3}{3!}\psi'''(0) + \cdots \\
&= 1 + 0 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots \\
&= 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots
\end{align*}
Now,
\begin{align*}
\xi_n(t) &= \left[\psi\left(\frac{t}{\sqrt{n}}\right)\right]^n
= \left[1 + \frac{t^2}{2n} + \frac{t^3}{3!\,n^{3/2}}\psi'''(0) + \cdots\right]^n \\
&= \left[1 + \frac{\frac{t^2}{2} + \frac{t^3}{3!\,n^{1/2}}\psi'''(0) + \cdots}{n}\right]^n
\to e^{t^2/2}
\end{align*}
which is the mgf of a $N(0,1)$. The result follows from Lemma 14. In the last step we used the fact that if $a_n \to a$, then
\[
\left(1 + \frac{a_n}{n}\right)^n \to e^a. \qquad \Box
\]


The central limit theorem tells us that $Z_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ is approximately $N(0,1)$. However, we rarely know $\sigma$. We can estimate $\sigma^2$ from $X_1, \ldots, X_n$ by
\[
S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]

This raises the following question: if we replace σ with Sn , is the central limit theorem still
true? The answer is yes.

Theorem 15 Assume the same conditions as the CLT. Then,
\[
T_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} \rightsquigarrow N(0,1).
\]

Proof. Here is a brief proof. We have that $T_n = Z_n W_n$ where
\[
Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \qquad \text{and} \qquad W_n = \frac{\sigma}{S_n}.
\]
Now $Z_n \rightsquigarrow N(0,1)$ and $W_n \xrightarrow{P} 1$. The result follows from Slutzky's theorem. $\Box$
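Remark (simulation sketch, not part of the notes): before turning to the longer proof, here is a numerical check that replacing σ by $S_n$ leaves the limit unchanged; the Exponential(1) choice (µ = σ = 1) is arbitrary.

```python
import numpy as np

# Illustrative check of Theorem 15: the studentized statistic T_n, which uses
# the sample standard deviation S_n, is still approximately N(0,1).
rng = np.random.default_rng(6)
n, reps, mu = 100, 100_000, 1.0
x = rng.exponential(scale=1.0, size=(reps, n))     # mean 1, sd 1

sn = x.std(axis=1, ddof=1)                         # sample standard deviation S_n
tn = np.sqrt(n) * (x.mean(axis=1) - mu) / sn

for z, phi in [(-1.0, 0.1587), (0.0, 0.5), (1.0, 0.8413)]:
    print(f"P(T_n <= {z:4.1f}):  empirical {np.mean(tn <= z):.4f}   Phi(z) = {phi:.4f}")
```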

Here is an extended proof.

Step 1. We first show that $R_n^2 \xrightarrow{P} \sigma^2$ where
\[
R_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2.
\]
Note that
\[
R_n^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2.
\]
Define $Y_i = X_i^2$. Then, using the LLN (law of large numbers),
\[
\frac{1}{n}\sum_{i=1}^n X_i^2 = \frac{1}{n}\sum_{i=1}^n Y_i \xrightarrow{P} E(Y_i) = E(X_i^2) = \mu^2 + \sigma^2.
\]
Next, by the LLN,
\[
\frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{P} \mu.
\]
Since $g(t) = t^2$ is continuous, the continuous mapping theorem implies that
\[
\left(\frac{1}{n}\sum_{i=1}^n X_i\right)^2 \xrightarrow{P} \mu^2.
\]
Thus
\[
R_n^2 \xrightarrow{P} (\mu^2 + \sigma^2) - \mu^2 = \sigma^2.
\]

Step 2. Note that
\[
S_n^2 = \left(\frac{n}{n-1}\right) R_n^2.
\]
Since $R_n^2 \xrightarrow{P} \sigma^2$ and $n/(n-1) \to 1$, we have that $S_n^2 \xrightarrow{P} \sigma^2$.

Step 3. Since $g(t) = \sqrt{t}$ is continuous (for $t \ge 0$), the continuous mapping theorem implies that $S_n \xrightarrow{P} \sigma$.

Step 4. Since $g(t) = t/\sigma$ is continuous, the continuous mapping theorem implies that $S_n/\sigma \xrightarrow{P} 1$.

Step 5. Since $g(t) = 1/t$ is continuous (for $t > 0$), the continuous mapping theorem implies that $\sigma/S_n \xrightarrow{P} 1$. Since convergence in probability implies convergence in distribution, $\sigma/S_n \rightsquigarrow 1$.

Step 6. Note that
\[
T_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\left(\frac{\sigma}{S_n}\right) \equiv V_n W_n.
\]
Now $V_n \rightsquigarrow Z$ where $Z \sim N(0,1)$ by the CLT, and we showed that $W_n \rightsquigarrow 1$. By Slutzky's theorem, $T_n = V_n W_n \rightsquigarrow Z \times 1 = Z$. $\Box$

The next result is very important. It tells us how close the distribution of $\bar{X}_n$ is to the Normal distribution.

Theorem 16 (Berry-Esseen Theorem) Let $X_1, \ldots, X_n \sim P$. Let $\mu = E[X_i]$ and $\sigma^2 = \mathrm{Var}[X_i]$. Assume that $\mu_3 = E[|X_i - \mu|^3] < \infty$. Let
\[
F_n(z) = P\left(\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \le z\right).
\]
Then
\[
\sup_z |F_n(z) - \Phi(z)| \le \frac{33}{4}\,\frac{\mu_3}{\sigma^3 \sqrt{n}}.
\]

There is also a multivariate version of the central limit theorem. Recall that $X = (X_1, \ldots, X_k)^T$ has a multivariate Normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ if
\[
f(x) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right\}.
\]
In this case we write $X \sim N(\mu, \Sigma)$.

Theorem 17 (Multivariate central limit theorem) Let $X_1, \ldots, X_n$ be iid random vectors where $X_i = (X_{1i}, \ldots, X_{ki})^T$ with mean $\mu = (\mu_1, \ldots, \mu_k)^T$ and covariance matrix $\Sigma$. Let $\bar{X} = (\bar{X}_1, \ldots, \bar{X}_k)^T$ where $\bar{X}_j = n^{-1}\sum_{i=1}^n X_{ji}$. Then,
\[
\sqrt{n}(\bar{X} - \mu) \rightsquigarrow N(0, \Sigma).
\]

Remark: There is also a multivariate version of the Berry-Esseen theorem but it is more
complicated than the one-dimensional version.

5 The Delta Method


If Yn has a limiting Normal distribution then the delta method allows us to find the limiting
distribution of g(Yn ) where g is any smooth function.

Theorem 18 (The Delta Method) Suppose that
\[
\frac{\sqrt{n}(Y_n - \mu)}{\sigma} \rightsquigarrow N(0,1)
\]
and that $g$ is a differentiable function such that $g'(\mu) \ne 0$. Then
\[
\frac{\sqrt{n}(g(Y_n) - g(\mu))}{|g'(\mu)|\,\sigma} \rightsquigarrow N(0,1).
\]
In other words,
\[
Y_n \approx N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{implies that} \quad g(Y_n) \approx N\left(g(\mu),\, (g'(\mu))^2\, \frac{\sigma^2}{n}\right).
\]

Example 19 Let $X_1, \ldots, X_n$ be iid with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, $\sqrt{n}(\bar{X}_n - \mu)/\sigma \rightsquigarrow N(0,1)$. Let $W_n = e^{\bar{X}_n}$. Thus, $W_n = g(\bar{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$.
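Remark (simulation sketch, not part of the notes): with $X_i \sim$ Exponential(1), so µ = σ = 1, the delta-method approximation for $W_n = e^{\bar{X}_n}$ can be checked directly.

```python
import numpy as np

# Illustrative check of Example 19: with X_i iid Exponential(1) (mu = sigma = 1),
# W_n = exp(X_bar_n) should be roughly N(e^mu, e^{2 mu} sigma^2 / n).
rng = np.random.default_rng(7)
n, reps, mu, sigma = 200, 100_000, 1.0, 1.0

x = rng.exponential(scale=1.0, size=(reps, n))
wn = np.exp(x.mean(axis=1))

print("mean of W_n:", wn.mean(), "(delta method:", np.exp(mu), ")")
print("var  of W_n:", wn.var(), "(delta method:", np.exp(2 * mu) * sigma**2 / n, ")")
```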

There is also a multivariate version of the delta method.

Theorem 20 (The Multivariate Delta Method) Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$ is a sequence of random vectors such that
\[
\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma).
\]
Let $g : \mathbb{R}^k \to \mathbb{R}$ and let
\[
\nabla g(y) = \begin{pmatrix} \frac{\partial g}{\partial y_1} \\ \vdots \\ \frac{\partial g}{\partial y_k} \end{pmatrix}.
\]
Let $\nabla_\mu$ denote $\nabla g(y)$ evaluated at $y = \mu$ and assume that the elements of $\nabla_\mu$ are nonzero. Then
\[
\sqrt{n}\,(g(Y_n) - g(\mu)) \rightsquigarrow N\left(0,\, \nabla_\mu^T \Sigma \nabla_\mu\right).
\]

Example 21 Let
\[
\begin{pmatrix} X_{11} \\ X_{21} \end{pmatrix},
\begin{pmatrix} X_{12} \\ X_{22} \end{pmatrix},
\ldots,
\begin{pmatrix} X_{1n} \\ X_{2n} \end{pmatrix}
\]
be iid random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and covariance matrix $\Sigma$. Let
\[
\bar{X}_1 = \frac{1}{n}\sum_{i=1}^n X_{1i}, \qquad \bar{X}_2 = \frac{1}{n}\sum_{i=1}^n X_{2i}
\]
and define $Y_n = \bar{X}_1 \bar{X}_2$. Thus, $Y_n = g(\bar{X}_1, \bar{X}_2)$ where $g(s_1, s_2) = s_1 s_2$. By the central limit theorem,
\[
\sqrt{n}\begin{pmatrix} \bar{X}_1 - \mu_1 \\ \bar{X}_2 - \mu_2 \end{pmatrix} \rightsquigarrow N(0, \Sigma).
\]

Now
\[
\nabla g(s) = \begin{pmatrix} \frac{\partial g}{\partial s_1} \\[4pt] \frac{\partial g}{\partial s_2} \end{pmatrix} = \begin{pmatrix} s_2 \\ s_1 \end{pmatrix}
\]
and so
\[
\nabla_\mu^T \Sigma \nabla_\mu = \begin{pmatrix} \mu_2 & \mu_1 \end{pmatrix}
\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
\begin{pmatrix} \mu_2 \\ \mu_1 \end{pmatrix}
= \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}.
\]
Therefore,
\[
\sqrt{n}\,(\bar{X}_1 \bar{X}_2 - \mu_1\mu_2) \rightsquigarrow N\left(0,\; \mu_2^2 \sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2 \sigma_{22}\right). \qquad \Box
\]
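Remark (simulation sketch, not part of the notes): the variance formula of Example 21 can be checked by simulating correlated pairs; the bivariate normal distribution and the particular µ and Σ below are arbitrary choices.

```python
import numpy as np

# Illustrative check of Example 21: simulate correlated pairs (X_1i, X_2i),
# form Y_n = X_bar_1 * X_bar_2, and compare n * Var(Y_n) with the delta-method
# variance mu2^2 * s11 + 2 * mu1 * mu2 * s12 + mu1^2 * s22.
rng = np.random.default_rng(8)
n, reps = 200, 20_000
mu = np.array([2.0, 3.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])

x = rng.multivariate_normal(mu, Sigma, size=(reps, n))   # shape (reps, n, 2)
xbar = x.mean(axis=1)                                     # shape (reps, 2)
yn = xbar[:, 0] * xbar[:, 1]

theory = (mu[1]**2 * Sigma[0, 0] + 2 * mu[0] * mu[1] * Sigma[0, 1]
          + mu[0]**2 * Sigma[1, 1])
print("n * Var(Y_n) ~", n * yn.var(), "   delta-method theory:", theory)
```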

