Lecture Notes On Probability Theory Dmitry Panchenko
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Kolmogorov’s consistency theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Conditional expectations and distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1 Martingales. Uniform integrability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.2 Stopping times, optional stopping theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.3 Doob’s inequalities and convergence of martingales . . . . . . . . . . . . . . . . . . . . 162
6 Brownian Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.1 Stochastic processes. Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.2 Donsker’s invariance principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.3 Convergence of empirical process to Brownian bridge . . . . . . . . . . . . . . . . . . 182
6.4 Reflection principles for Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.5 Skorohod’s imbedding and laws of the iterated logarithm . . . . . . . . . . . . . . . 198
8 Exchangeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
8.1 Moment problem and de Finetti’s theorem for coin flips . . . . . . . . . . . . . . . . 235
8.2 General de Finetti and Aldous-Hoover representations . . . . . . . . . . . . . . . . . . 240
8.3 The Dovbysh-Sudakov representation for Gram-de Finetti arrays . . . . . . . . . 251
Chapter 1
Introduction
Proof. First, let us show that (3) implies (3′). If we denote $C_n = B_n \setminus B_{n+1}$ then $B_n$ is the disjoint union $\bigcup_{k \ge n} C_k \cup B$ and, by (3),
$$P(B_n) = \sum_{k \ge n} P(C_k) + P(B).$$
Since the last sum is the tail of a convergent series, $\lim_{n\to\infty} P(B_n) = P(B)$. Next, let us show that (3′) implies (3). If, given disjoint sets $(A_n)$, we define $B_n = \bigcup_{i \ge n+1} A_i$, then
$$\bigcup_{i \ge 1} A_i = A_1 \cup A_2 \cup \cdots \cup A_n \cup B_n,$$
and, by finite additivity,
$$P\Big(\bigcup_{i \ge 1} A_i\Big) = \sum_{i=1}^{n} P(A_i) + P(B_n).$$
Clearly, $B_n \supseteq B_{n+1}$ and, since $(A_n)$ are disjoint, $\bigcap_{n \ge 1} B_n = \emptyset$. Therefore, by (3′), $\lim_{n\to\infty} P(B_n) = 0$ and (3) follows. □
Let us give several examples of probability spaces. The most basic example of a probability space is $([0,1], \mathcal{B}([0,1]), \lambda)$, where $\mathcal{B}([0,1])$ is the Borel $\sigma$-algebra on $[0,1]$ and $\lambda$ is the Lebesgue measure. Let us quickly recall how this measure is constructed. More generally, let us consider the construction of the Lebesgue-Stieltjes measure on $\mathbb{R}$ corresponding to a non-decreasing right-continuous function $F(x)$. One considers the algebra of finite unions of disjoint intervals
$$\mathrm{A} = \Big\{ \bigcup_{i \le n} (a_i, b_i] : n \ge 1, \text{ all } (a_i, b_i] \text{ are disjoint} \Big\}$$
and defines the measure $F$ on the sets in this algebra by (we slightly abuse the notation here)
$$F\Big( \bigcup_{i \le n} (a_i, b_i] \Big) = \sum_{i=1}^{n} \big( F(b_i) - F(a_i) \big).$$
One then checks that $F$ is countably additive on this algebra, i.e. $F(\bigcup_{i \ge 1} A_i) = \sum_{i \ge 1} F(A_i)$ whenever all $A_i$ and $\bigcup_{i \ge 1} A_i$ are finite unions of disjoint intervals. The proof is exactly the same as in the case of $F(x) = x$ corresponding to the Lebesgue measure. Once countable additivity is proved on the algebra, it remains to appeal to the following key result. Recall that, given an algebra $\mathrm{A}$, the $\sigma$-algebra $\mathcal{A} = \sigma(\mathrm{A})$ generated by $\mathrm{A}$ is the smallest $\sigma$-algebra that contains $\mathrm{A}$.
Theorem 1.1 (Carathéodory's extension theorem). If $\mathrm{A}$ is an algebra of sets and $\mu : \mathrm{A} \to \mathbb{R}$ is a non-negative countably additive function on $\mathrm{A}$, then $\mu$ can be extended to a measure on the $\sigma$-algebra $\sigma(\mathrm{A})$; if $\mu$ is $\sigma$-finite, this extension is unique.
Therefore, $F$ above can be uniquely extended to the measure on the $\sigma$-algebra $\sigma(\mathrm{A})$ generated by the algebra of finite unions of disjoint intervals. This is the $\sigma$-algebra $\mathcal{B}(\mathbb{R})$ of Borel sets on $\mathbb{R}$. Clearly, $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ will be a probability space if
(1) $F(x) = F((-\infty, x])$ is non-decreasing and right-continuous,
(2) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
The reason we required $F$ to be right-continuous corresponds to our choice that the intervals $(a, b]$ in the algebra are closed on the right, so the two conventions agree and the measure $F$ is continuous, as it should be, e.g.
$$F\big( (a, b] \big) = F\Big( \bigcap_{n \ge 1} (a, b + n^{-1}] \Big) = \lim_{n\to\infty} F\big( (a, b + n^{-1}] \big).$$
In probability theory, functions satisfying properties (1) and (2) above are called cumulative distribution functions, or c.d.f.s for short, and we will give an alternative construction of the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$ in the next section.
Other basic ways to define a probability measure are through a probability function or a density function. If the measurable space is such that all singletons are measurable, we can simply assign weights $p_i = P(\omega_i)$ to a sequence of distinct points $\omega_i \in \Omega$, such that $\sum_{i \ge 1} p_i = 1$, and let
$$P(A) = P\big( A \cap \{\omega_i\}_{i \ge 1} \big) = \sum_{i \,:\, \omega_i \in A} p_i.$$
Example 1.1.1. The probability function $p_i = \lambda^i e^{-\lambda} / i!$ for integer $i \ge 0$ is called the Poisson distribution with the parameter $\lambda > 0$. (Notation: given a set $A$, we will denote by $I(x \in A)$ or $I_A(x)$ the indicator that $x$ belongs to $A$.) □
Example 1.1.2. The probability measure on $\mathbb{R}$ corresponding to the density function $f(x) = \lambda e^{-\lambda x} I(x \ge 0)$ is called the exponential distribution with the parameter $\lambda > 0$. □

Example 1.1.3. The probability measure on $\mathbb{R}$ corresponding to the density function
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
is called the standard normal, or standard Gaussian, distribution on $\mathbb{R}$. □
Recall that a measure $P$ is called absolutely continuous with respect to another measure $Q$, written $P \ll Q$, if for all $A \in \mathcal{A}$,
$$Q(A) = 0 \implies P(A) = 0,$$
in which case the existence of the density is guaranteed by the following classical result from measure theory.
Theorem 1.2 (Radon-Nikodym). On a measurable space $(\Omega, \mathcal{A})$, let $\mu$ be a $\sigma$-finite measure and $\nu$ a finite measure absolutely continuous with respect to $\mu$, $\nu \ll \mu$. Then there exists the Radon-Nikodym derivative $h \in L^1(\Omega, \mathcal{A}, \mu)$ such that
$$\nu(A) = \int_A h(\omega)\, d\mu(\omega)$$
for all $A \in \mathcal{A}$. Such $h$ is unique modulo $\mu$-a.e. equivalence.
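On a finite space the Radon-Nikodym derivative is simply a ratio of point masses, which makes the defining identity easy to verify. A toy sketch (the measures below are my illustrative choices, not from the text):

```python
# Finite toy example of the Radon-Nikodym derivative: mu and nu are measures
# on {0,1,2,3} given by point masses, and nu << mu because nu puts no mass
# where mu has none.
mu = {0: 0.2, 1: 0.3, 2: 0.5, 3: 0.0}
nu = {0: 0.1, 1: 0.6, 2: 0.3, 3: 0.0}

# h = d(nu)/d(mu), defined mu-almost everywhere (i.e. at points with mu > 0)
h = {x: nu[x] / mu[x] for x in mu if mu[x] > 0}

def nu_of(A):
    return sum(nu[x] for x in A)

def integral_h_dmu(A):
    # the integral of h over A with respect to mu
    return sum(h[x] * mu[x] for x in A if mu[x] > 0)

for A in [{0}, {0, 1}, {1, 2}, {0, 1, 2, 3}]:
    assert abs(nu_of(A) - integral_h_dmu(A)) < 1e-12
print("nu(A) equals the integral of h over A with respect to mu")
```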
Of course, the Radon-Nikodym theorem also applies to finite signed measures $\nu$, which can be decomposed into $\nu = \nu^+ - \nu^-$ for some finite measures $\nu^+, \nu^-$, the so-called Hahn-Jordan decomposition. Let us recall a proof of the Radon-Nikodym theorem for convenience.
Proof. Clearly, we can assume that $\mu$ is a finite measure. Consider the Hilbert space $H = L^2(\Omega, \mathcal{A}, \mu + \nu)$ and the linear functional $T : H \to \mathbb{R}$ given by $T(f) = \int f\, d\nu$. Since
$$\Big| \int f\, d\nu \Big| \le \int |f|\, d(\mu + \nu) \le C \|f\|_H,$$
$T$ is a continuous linear functional and, by the Riesz-Fréchet theorem, $\int f\, d\nu = \int f g\, d(\mu + \nu)$ for some $g \in H$. This implies $\int f\, d\mu = \int f (1 - g)\, d(\mu + \nu)$. Now, $g(\omega) \ge 0$ for $(\mu + \nu)$-almost all $\omega$, which can be seen by taking $f(\omega) = I(g(\omega) < 0)$, and similarly $g(\omega) \le 1$ for $(\mu + \nu)$-almost all $\omega$. Therefore, we can take $0 \le g \le 1$. Let $E = \{\omega : g(\omega) = 1\}$. Then
$$\mu(E) = \int I(\omega \in E)\, d\mu(\omega) = \int I(\omega \in E)(1 - g(\omega))\, d(\mu + \nu)(\omega) = 0,$$
and, since $\nu \ll \mu$, $\nu(E) = 0$. Since $\int f g\, d\mu = \int f (1 - g)\, d\nu$, we can restrict both integrals to $E^c$ and, replacing $f$ by $f/(1-g)$ and denoting $h := g/(1-g)$ (defined on $E^c$), we get $\int_{E^c} f\, d\nu = \int_{E^c} f h\, d\mu$ (more carefully, one can truncate $f/(1-g)$ first and then use the monotone convergence theorem). Therefore, $\nu(A \cap E^c) = \int_{A \cap E^c} h\, d\mu$, and this finishes the proof if we set $h = 0$ on $E$. To prove uniqueness, consider two such $h$ and $h'$ and let $A = \{\omega : h(\omega) > h'(\omega)\}$. Then $0 = \int_A (h - h')\, d\mu$ and, therefore, $\mu(A) = 0$. □
Let us now write down some important properties of $\sigma$-algebras and probability measures.

Lemma 1.2 (Approximation lemma). If $\mathrm{A}$ is an algebra of sets then for any $B \in \sigma(\mathrm{A})$ there exists a sequence $B_n \in \mathrm{A}$ such that $\lim_{n\to\infty} P(B \,\triangle\, B_n) = 0$.
We will prove that $\mathcal{D}$ (the collection of sets $B$ that can be approximated in the above sense by sets in the algebra) is a $\sigma$-algebra and, since $\mathrm{A} \subseteq \mathcal{D}$, this will imply that $\sigma(\mathrm{A}) \subseteq \mathcal{D}$. One can easily check that
$$d(B, C) := P(B \,\triangle\, C) = \int_\Omega |I_B(\omega) - I_C(\omega)|\, dP(\omega)$$
Dynkin's theorem. We will now describe a tool, the so-called Dynkin theorem, or $\pi$-$\lambda$ theorem, which is often quite useful in checking various properties of probabilities.

$\pi$-systems: A collection of sets $\mathcal{P}$ is called a $\pi$-system if it is closed under taking intersections, i.e.
1. if $A, B \in \mathcal{P}$ then $A \cap B \in \mathcal{P}$.

$\lambda$-systems: A collection of sets $\mathcal{L}$ is called a $\lambda$-system if
1. $\Omega \in \mathcal{L}$,
2. if $A \in \mathcal{L}$ then $A^c \in \mathcal{L}$,
3. if $A_n \in \mathcal{L}$ are disjoint for $n \ge 1$ then $\bigcup_{n \ge 1} A_n \in \mathcal{L}$.

Given any collection of sets $\mathcal{C}$, by analogy with the $\sigma$-algebra $\sigma(\mathcal{C})$ generated by $\mathcal{C}$, we will denote by $\mathcal{L}(\mathcal{C})$ the smallest $\lambda$-system that contains $\mathcal{C}$. It is easy to see that the intersection of all $\lambda$-systems that contain $\mathcal{C}$ is again a $\lambda$-system that contains $\mathcal{C}$, so this intersection is precisely $\mathcal{L}(\mathcal{C})$.
Theorem 1.3 (Dynkin's theorem). If $\mathcal{P}$ is a $\pi$-system, $\mathcal{L}$ is a $\lambda$-system and $\mathcal{P} \subseteq \mathcal{L}$, then $\sigma(\mathcal{P}) \subseteq \mathcal{L}$.

We will give typical examples of applications of this result below.

Proof. First of all, it should be obvious that a collection of sets which is both a $\pi$-system and a $\lambda$-system is a $\sigma$-algebra. Therefore, if we can show that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system then it is a $\sigma$-algebra and
$$\mathcal{P} \subseteq \sigma(\mathcal{P}) \subseteq \mathcal{L}(\mathcal{P}) \subseteq \mathcal{L},$$
which proves the result. Let us prove that $\mathcal{L}(\mathcal{P})$ is a $\pi$-system. For a fixed set $A \subseteq \Omega$, let us define
$$\mathcal{G}_A = \big\{ B \subseteq \Omega : B \cap A \in \mathcal{L}(\mathcal{P}) \big\}.$$
$$\mathcal{L} = \big\{ B \in \mathcal{B} : P_1(B) = P_2(B) \big\}$$
is trivially a $\lambda$-system, by the properties of probability measures. On the other hand, the collection $\mathcal{P}$ of all open sets is a $\pi$-system and, therefore, if we know that $P_1(B) = P_2(B)$ for all open sets then, by Dynkin's theorem, this holds for all Borel sets $B \in \mathcal{B}$. Similarly, one can see that a probability measure on the Borel $\sigma$-algebra on the real line is determined by the probabilities of the sets $(-\infty, t]$ for all $t \in \mathbb{R}$. □
for all $A \in \mathcal{A}$. It is a standard result in measure theory that every finite measure on $(\mathbb{R}^k, \mathcal{B}(\mathbb{R}^k))$ is regular. In the setting of complete separable metric spaces, this is known as Ulam's theorem, which we will prove below.

Theorem 1.4. Every probability measure $P$ on a metric space $(S, d)$ is closed regular.
Proof. First of all, let us show that each closed set $F \in \mathcal{L}$; we only need to show that the open set $U = F^c$ satisfies (1.1). Let us consider the sets
$$F_n = \big\{ s \in S : d(s, F) \ge 1/n \big\}.$$
It is obvious that all $F_n$ are closed, and $F_n \subseteq F_{n+1}$. One can also easily check that, since $F$ is closed, $\bigcup_{n \ge 1} F_n = U$ and, by the continuity of measure, $\lim_{n\to\infty} P(F_n) = P(U)$. This proves that $U = F^c$ satisfies (1.1) and $F \in \mathcal{L}$. Next, one can easily check that $\mathcal{L}$ is a $\lambda$-system, which we will leave as an exercise below. Since the collection of all closed sets is a $\pi$-system, and closed sets generate the Borel $\sigma$-algebra $\mathcal{A}$, by Dynkin's theorem, all measurable sets are in $\mathcal{L}$. This proves that $P$ is closed regular. □
Proof. First, let us show that there exists a compact set $K \subseteq S$ such that $P(S \setminus K) \le \varepsilon$. Consider a sequence $\{s_1, s_2, \ldots\}$ that is dense in $S$. For any $m \ge 1$, $S = \bigcup_{i=1}^{\infty} B(s_i, \frac{1}{m})$, where $B(s_i, \frac{1}{m})$ is the closed ball of radius $1/m$ centered at $s_i$. By the continuity of measure, for large enough $n(m)$,
$$P\Big( S \setminus \bigcup_{i=1}^{n(m)} B\big(s_i, \tfrac{1}{m}\big) \Big) \le \frac{\varepsilon}{2^m}.$$
If we take
$$K = \bigcap_{m \ge 1} \bigcup_{i=1}^{n(m)} B\big(s_i, \tfrac{1}{m}\big)$$
then
$$P(S \setminus K) \le \sum_{m \ge 1} \frac{\varepsilon}{2^m} = \varepsilon.$$
Obviously, by construction, $K$ is closed and totally bounded. Since $S$ is complete, $K$ is compact. By the previous theorem, given $A \in \mathcal{A}$, we can find a closed subset $F \subseteq A$ such that $P(A \setminus F) \le \varepsilon$. Therefore, $P(A \setminus (F \cap K)) \le 2\varepsilon$ and, since $F \cap K$ is compact, this finishes the proof. □
Exercise 1.1.3. Check that $\mathcal{L}$ in (1.3) is a $\lambda$-system. (In fact, one can check that this is a $\sigma$-algebra.)
1.2 Random variables
$$\{X \le t\} := \big\{ \omega \in \Omega : X(\omega) \in (-\infty, t] \big\} \in \mathcal{A}.$$
To see that this is enough, consider the collection $\mathcal{D} = \{ D \subseteq \mathbb{R} : X^{-1}(D) \in \mathcal{A} \}$ and let us show that it is a $\sigma$-algebra. Since the sets $(-\infty, t] \in \mathcal{D}$, this will imply that $\mathcal{B}(\mathbb{R}) \subseteq \mathcal{D}$. The fact that $\mathcal{D}$ is a $\sigma$-algebra follows simply because taking pre-images preserves set operations. For example, if we consider a sequence $D_i \in \mathcal{D}$ for $i \ge 1$ then
$$X^{-1}\Big( \bigcup_{i \ge 1} D_i \Big) = \bigcup_{i \ge 1} X^{-1}(D_i) \in \mathcal{A},$$
because $X^{-1}(D_i) \in \mathcal{A}$ and $\mathcal{A}$ is a $\sigma$-algebra. Therefore, $\bigcup_{i \ge 1} D_i \in \mathcal{D}$. The other properties can be checked similarly, so $\mathcal{D}$ is a $\sigma$-algebra. □

Given a random element $X$ on $(\Omega, \mathcal{A}, P)$ with values in $(S, \mathcal{B})$, let us denote the image measure on $\mathcal{B}$ by $P_X = P \circ X^{-1}$, which means that for $B \in \mathcal{B}$,
$$P_X(B) = P(X \in B) = P(X^{-1}(B)) = P \circ X^{-1}(B).$$
Another construction can be given on the probability space $([0,1], \mathcal{B}([0,1]), \lambda)$ with the Lebesgue measure $\lambda$, using the quantile transformation. Given a c.d.f. $F$, let us define a random variable $X : [0,1] \to \mathbb{R}$ by the quantile transformation
$$X(u) = \inf\{ x : F(x) \ge u \}.$$
This means that, to define the probability space $(\mathbb{R}, \mathcal{B}(\mathbb{R}), F)$, we can start with $([0,1], \mathcal{B}([0,1]), \lambda)$ and let $F = \lambda \circ X^{-1}$ be the image of the Lebesgue measure under the quantile transformation, i.e. the law of $X$ on $\mathbb{R}$. A related "inverse" property is left as an exercise below.
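The quantile transformation is also the standard way to simulate from a given c.d.f. A minimal sketch in Python, using the exponential c.d.f. as an illustrative choice, for which $X(u) = \inf\{x : F(x) \ge u\}$ has the closed form $-\log(1-u)/\lambda$:

```python
import math
import random

# Quantile transform for F(x) = 1 - exp(-lam * x): X(u) = -log(1 - u) / lam
def quantile_exp(u, lam=1.0):
    return -math.log(1.0 - u) / lam

random.seed(0)
# Push the Lebesgue measure on [0,1] forward: X(U) has c.d.f. F
sample = [quantile_exp(random.random()) for _ in range(200_000)]

# Empirical check: P(X <= 1) should be close to F(1) = 1 - e^{-1}
emp = sum(x <= 1.0 for x in sample) / len(sample)
print(abs(emp - (1.0 - math.exp(-1.0))) < 0.01)  # True with high probability
```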
Given a random element $X : (\Omega, \mathcal{A}) \to (S, \mathcal{B})$, the $\sigma$-algebra
$$\sigma(X) = \big\{ X^{-1}(B) : B \in \mathcal{B} \big\}$$
is called the $\sigma$-algebra generated by $X$.
Example 1.2.1. Consider a random variable $X$ on $([0,1], \mathcal{B}([0,1]), \lambda)$ defined by
$$X(x) = \begin{cases} 0, & 0 \le x \le 1/2, \\ 1, & 1/2 < x \le 1. \end{cases}$$
Lemma 1.4. Consider a probability space $(\Omega, \mathcal{A}, P)$, a measurable space $(S, \mathcal{B})$ and random elements $X : \Omega \to S$ and $Y : \Omega \to \mathbb{R}$. Then the following are equivalent:
1. $Y = g(X)$ for some (Borel) measurable function $g : S \to \mathbb{R}$,
2. $Y : \Omega \to \mathbb{R}$ is $\sigma(X)$-measurable.
It should be obvious from the proof that $\mathbb{R}$ can be replaced by any separable metric space.

Proof. The fact that 1 implies 2 is obvious, since for any Borel set $B \subseteq \mathbb{R}$ the set $B' = g^{-1}(B) \in \mathcal{B}$ and, therefore,
$$\big\{ Y = g(X) \in B \big\} = \big\{ X \in g^{-1}(B) = B' \big\} = X^{-1}(B') \in \sigma(X).$$
Let us show that 2 implies 1. For all integers $n$ and $k$, consider the sets
$$A_{n,k} = \Big\{ \omega : Y(\omega) \in \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big\} = Y^{-1}\Big( \Big[ \frac{k}{2^n}, \frac{k+1}{2^n} \Big) \Big).$$
By 2, $A_{n,k} \in \sigma(X) = \{ X^{-1}(B) : B \in \mathcal{B} \}$ and, therefore, $A_{n,k} = X^{-1}(B_{n,k})$ for some $B_{n,k} \in \mathcal{B}$. Let us consider the function
$$g_n(x) = \sum_{k \in \mathbb{Z}} \frac{k}{2^n}\, I(x \in B_{n,k}).$$
$$P(A_1 \cap \cdots \cap A_n) = \prod_{i \le n} P(A_i)$$
Example 1.2.2. Consider a regular tetrahedron die with red, green and blue sides and a red-green-blue base. If we roll this die then the colors provide an example of pairwise independent random variables that are not independent, since
$$P(r) = P(b) = P(g) = \frac{1}{2} \quad \text{and} \quad P(rb) = P(rg) = P(bg) = \frac{1}{4},$$
while $P(rbg) = \frac{1}{4} \ne P(r)P(b)P(g) = \big(\frac{1}{2}\big)^3$. □
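The four faces are small enough to enumerate, so the computation in the example can be checked mechanically. A minimal sketch in Python:

```python
# The four equally likely faces of the die: three single-color faces and
# one base face carrying all three colors
faces = [{"r"}, {"g"}, {"b"}, {"r", "g", "b"}]

def prob(colors):
    # probability that every color in `colors` appears on the rolled face
    return sum(1 for f in faces if set(colors) <= f) / len(faces)

# Each color appears on 2 of 4 faces; each pair of colors only on the base
assert prob("r") == prob("g") == prob("b") == 1 / 2
assert prob("rg") == prob("rb") == prob("gb") == 1 / 4   # = (1/2)*(1/2)
assert prob("rgb") == 1 / 4 != (1 / 2) ** 3              # but not (1/2)^3
print("pairwise independent but not independent")
```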
Lemma 1.5. If algebras $\mathrm{A}_i$, $i \le n$, are independent then the $\sigma$-algebras $\sigma(\mathrm{A}_i)$ are independent.

Lemma 1.6. If collections of sets $\mathcal{C}_i$ for $i \le n$ are $\pi$-systems (closed under finite intersections) then their independence implies the independence of the $\sigma$-algebras $\sigma(\mathcal{C}_i)$ they generate.
(b) If the laws of $X_i$ have densities $f_i$ on $\mathbb{R}$ then these random variables are independent if and only if a joint density $f$ on $\mathbb{R}^n$ of the vector $(X_i)_{1 \le i \le n}$ exists and
$$f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_i(x_i).$$
Proof. (a) This is obvious by Lemma 1.6, because the collection of sets $(-\infty, t]$ for $t \in \mathbb{R}$ is a $\pi$-system that generates the Borel $\sigma$-algebra on $\mathbb{R}$.
(b) Let us start with the "if" part. If we denote $X = (X_1, \ldots, X_n)$ then, for any $A_i \in \mathcal{B}(\mathbb{R})$,
$$P\Big( \bigcap_{i=1}^{n} \{ X_i \in A_i \} \Big) = P(X \in A_1 \times \cdots \times A_n) = \int_{A_1 \times \cdots \times A_n} \prod_{i=1}^{n} f_i(x_i)\, dx_1 \cdots dx_n = \prod_{i=1}^{n} \int_{A_i} f_i(x_i)\, dx_i = \prod_{i=1}^{n} P(X_i \in A_i).$$
for all $A$ in the Borel $\sigma$-algebra on $\mathbb{R}^n$, which would mean that the joint density exists and is equal to the product of the individual densities. One can prove the above equality for all $A \in \mathcal{B}(\mathbb{R}^n)$ by appealing to the monotone class theorem from measure theory, or the Carathéodory Extension Theorem 1.1, since the above equality, obviously, can be extended from the semi-algebra of measurable rectangles $A_1 \times \cdots \times A_n$ to the algebra of disjoint unions of measurable rectangles, which generates the Borel $\sigma$-algebra. However, we can also appeal to Dynkin's theorem, since the family $\mathcal{L}$ of sets $A$ that satisfy the above equality is a $\lambda$-system by the properties of measures and integrals, and it contains the $\pi$-system $\mathcal{P}$ of measurable rectangles $A_1 \times \cdots \times A_n$ that generates the Borel $\sigma$-algebra, $\mathcal{B}(\mathbb{R}^n) = \sigma(\mathcal{P})$. □
If we would like to construct finitely many independent random variables $(X_i)_{i \le n}$ with arbitrary distributions $(P_i)_{i \le n}$ on $\mathbb{R}$, we can simply consider the space $\Omega = \mathbb{R}^n$ with the product measure
$$P_1 \times \cdots \times P_n$$
and define the random variables $X_i$ by $X_i(x_1, \ldots, x_n) = x_i$. The main result of the next section will imply that one can construct an infinite sequence of independent random variables with arbitrary distributions on the same probability space, and here we will give a sketch of another construction on the space $([0,1], \mathcal{B}([0,1]), \lambda)$. We will write $P = \lambda$ to emphasize that we think of the Lebesgue measure as a probability.
Step 1. If we write the dyadic decomposition of $x \in [0,1]$,
$$x = \sum_{n \ge 1} 2^{-n} \varepsilon_n(x),$$
then it is easy to see that $(\varepsilon_n)_{n \ge 1}$ are independent random variables with the distribution $P(\varepsilon_n = 0) = P(\varepsilon_n = 1) = 1/2$, since for any $n \ge 1$ and any $a_i \in \{0, 1\}$,
$$P(\varepsilon_1 = a_1, \ldots, \varepsilon_n = a_n) = 2^{-n},$$
since fixing the first $n$ coefficients in the dyadic expansion places $x$ into an interval of length $2^{-n}$.
Step 2. Let us consider injections $k_m : \mathbb{N} \to \mathbb{N}$ for $m \ge 1$ such that their ranges $k_m(\mathbb{N})$ are all disjoint, and let us define
$$X_m = X_m(x) = \sum_{n \ge 1} 2^{-n} \varepsilon_{k_m(n)}(x).$$
It is an easy exercise to check that each $X_m$ is well defined and has the uniform distribution on $[0,1]$, which can be seen by looking at the dyadic intervals first. Moreover, by the Grouping Lemma above, the random variables $(X_m)_{m \ge 1}$ are all independent, since they are defined in terms of disjoint groups of independent random variables.
Step 3. Given a sequence of probability distributions $(P_m)_{m \ge 1}$ on $\mathbb{R}$, let $(F_m)_{m \ge 1}$ be the sequence of the corresponding c.d.f.s and let $(Q_m)_{m \ge 1}$ be their quantile transforms. We have seen above that each $Y_m = Q_m(X_m)$ has the distribution $P_m$ on $\mathbb{R}$, and they are obviously independent of each other. Therefore, we have constructed a sequence of independent random variables $Y_m$ on the space $([0,1], \mathcal{B}([0,1]), \lambda)$ with arbitrary distributions $P_m$.
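Steps 1 and 2 can be sketched numerically: extract the dyadic bits of a single uniform $x$ and regroup them along disjoint injections. The particular choice $k_m(n) = (n-1)M + m$ for $m = 1, \ldots, M$ (interleaving the bits) is my illustrative choice of injections with disjoint ranges, and the sums are truncated at finitely many bits:

```python
import random

def bit(x, n):
    # n-th dyadic coefficient eps_n(x) of x in [0,1)
    return int(x * 2 ** n) % 2

def X(x, m, M=3, depth=16):
    # X_m(x) = sum_n 2^{-n} eps_{k_m(n)}(x) with k_m(n) = (n-1)*M + m,
    # truncated at `depth` bits (so all indices stay within float precision)
    return sum(2.0 ** (-n) * bit(x, (n - 1) * M + m) for n in range(1, depth + 1))

random.seed(1)
xs = [random.random() for _ in range(50_000)]
u1 = [X(x, 1) for x in xs]
u2 = [X(x, 2) for x in xs]

# X_1, X_2 should be (approximately) uniform and independent:
# P(X_1 <= 1/2, X_2 <= 1/2) should be close to 1/4
joint = sum(a <= 0.5 and b <= 0.5 for a, b in zip(u1, u2)) / len(xs)
print(abs(joint - 0.25) < 0.02)  # True with high probability
```

Applying the quantile transforms of Step 3 to `u1`, `u2`, ... then yields independent variables with the prescribed laws.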
Lemma 1.9. (1) If $F$ is the c.d.f. (and the law) of $X$ on $\mathbb{R}$ then, for any measurable function $g : \mathbb{R} \to \mathbb{R}$,
$$E g(X) = \int_{\mathbb{R}} g(x)\, dF(x).$$
(2) If the distribution of $X$ is discrete, i.e. $P(X \in \{x_i\}_{i \ge 1}) = 1$, then
$$E g(X) = \sum_{i \ge 1} g(x_i)\, P(X = x_i).$$
(3) If the distribution of $X : \Omega \to \mathbb{R}^n$ on $\mathbb{R}^n$ has the density function $f(x)$ then, for any measurable function $g : \mathbb{R}^n \to \mathbb{R}$,
$$E g(X) = \int_{\mathbb{R}^n} g(x) f(x)\, dx.$$
Proof. All these properties follow by making the change of variables $x = X(\omega)$,
$$E g(X) = \int_\Omega g(X(\omega))\, dP(\omega) = \int g(x)\, d(P \circ X^{-1})(x) = \int g(x)\, dP_X(x),$$
Lemma 1.10. If $X, Y : \Omega \to \mathbb{R}$ are independent and $E|X|, E|Y| < \infty$ then $EXY = EX\, EY$.
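Lemma 1.10 is easy to see at work in a simulation. A minimal sketch, taking $X$, $Y$ to be independent uniforms on $[0,1]$ (an illustrative choice; then $EXY = EX \cdot EY = 1/4$):

```python
import random

# For independent X and Y, E[XY] = E[X] E[Y]; here X, Y ~ Uniform[0,1]
random.seed(2)
n = 200_000
xs = [random.random() for _ in range(n)]
ys = [random.random() for _ in range(n)]

exy = sum(x * y for x, y in zip(xs, ys)) / n   # empirical E[XY]
ex = sum(xs) / n                               # empirical E[X]
ey = sum(ys) / n                               # empirical E[Y]
print(abs(exy - ex * ey) < 0.005)  # True with high probability; both ~ 1/4
```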
Exercise 1.2.1. If a random variable X has continuous c.d.f. F(t), show that F(X)
is uniform on [0, 1], i.e. the law of F(X) is the Lebesgue measure on [0, 1].
Exercise 1.2.2. If $F$ is a continuous distribution function, show that $\int F(x)\, dF(x) = 1/2$.
Exercise 1.2.3. $\mathrm{ch}(\lambda)$ is the moment generating function of a random variable $X$ with distribution $P(X = \pm 1) = 1/2$, since
$$E e^{\lambda X} = \frac{e^{\lambda} + e^{-\lambda}}{2} = \mathrm{ch}(\lambda).$$
Does there exist a (bounded) random variable $X$ such that $E e^{\lambda X} = \mathrm{ch}^m(\lambda)$ for $0 < m < 1$? (Hint: compute several derivatives at zero.)
Exercise 1.2.4. Consider a measurable function $f : X \times Y \to \mathbb{R}$ and a product probability measure $P \times Q$ on $X \times Y$. For $0 < p \le q$, prove that
$$\Big( \int \Big( \int |f(x,y)|^p\, dP(x) \Big)^{q/p} dQ(y) \Big)^{1/q} \le \Big( \int \Big( \int |f(x,y)|^q\, dQ(y) \Big)^{p/q} dP(x) \Big)^{1/p}.$$
Assume that both sides are well-defined, for example, that $f$ is bounded.
Exercise 1.2.5. If the event $A \in \sigma(\mathcal{P})$ is independent of the $\pi$-system $\mathcal{P}$ then $P(A) = 0$ or $1$.
Exercise 1.2.6. Suppose X is a random variable and g : R ! R is measurable. Prove
that if X and g(X) are independent then P(g(X) = c) = 1 for some constant c.
Exercise 1.2.7. Suppose that $(e_m)_{1 \le m \le n}$ are i.i.d. exponential random variables with the parameter $\lambda > 0$, and let
$$e_{(1)} \le e_{(2)} \le \cdots \le e_{(n)}$$
be the order statistics (the random variables arranged in increasing order). Prove that the spacings
$$e_{(1)},\ e_{(2)} - e_{(1)},\ \ldots,\ e_{(n)} - e_{(n-1)}$$
are independent exponential random variables, and $e_{(k+1)} - e_{(k)}$ has the parameter $(n-k)\lambda$.
Exercise 1.2.8. Suppose that $(e_n)_{n \ge 1}$ are i.i.d. exponential random variables with the parameter $\lambda = 1$. Let $S_n = e_1 + \cdots + e_n$ and $R_n = S_{n+1}/S_n$ for $n \ge 1$. Prove that $(R_n)_{n \ge 1}$ are independent and $R_n$ has density $n x^{-(n+1)} I(x \ge 1)$. Hint: let $R_0 = e_1$ and compute the joint density of $(R_0, R_1, \ldots, R_n)$ first.
Exercise 1.2.9. Let $N$ be a Poisson random variable with the mean $\lambda$, i.e. $P(N = j) = \lambda^j e^{-\lambda} / j!$ for integer $j \ge 0$. Then, consider $N$ i.i.d. random variables, independent of $N$, taking values $1, \ldots, k$ with probabilities $p_1, \ldots, p_k$. Let $N_j$ be the number of these variables equal to $j$. Prove that $N_1, \ldots, N_k$ are independent Poisson random variables with means $\lambda p_1, \ldots, \lambda p_k$.
Exercise 1.2.10. Suppose that a measurable subset $P \subseteq [0,1]$ and the interval $I = [a,b] \subseteq [0,1]$ are such that $\lambda(P) = \lambda(I)$, where $\lambda$ is the Lebesgue measure on $[0,1]$. Show that there exists a measure-preserving transformation $T : [0,1] \to [0,1]$, i.e. $\lambda \circ T^{-1} = \lambda$, such that $T(I) \subseteq P$ and $T$ is one-to-one (injective) outside a set of measure zero.
for any finite subsets $N \subseteq M$ and any Borel set $B \in \mathcal{B}^N$. (Of course, to be careful, we should define these probabilities for ordered subsets and also make sure they are consistent under rearrangements, but the notation for unordered sets is clear and should not cause any confusion.)
Our goal is to construct a probability space $(\Omega, \mathcal{A}, P)$ and random variables $X_t : \Omega \to \mathbb{R}$ that have $(P_N)$ as their finite-dimensional distributions. We take
$$\Omega = \mathbb{R}^T = \{ \omega : T \to \mathbb{R} \}$$
to be the space of all real-valued functions on $T$, and let $X_t$ be the coordinate projection $X_t = X_t(\omega) = \omega(t)$. For the coordinate projections to be measurable, the following collection of events,
$$\mathrm{A} = \big\{ B \times \mathbb{R}^{T \setminus N} : B \in \mathcal{B}^N \big\},$$
must be contained in the $\sigma$-algebra $\mathcal{A}$. It is easy to see that $\mathrm{A}$ is, in fact, an algebra of sets, and it is called the cylindrical algebra on $\mathbb{R}^T$. We will then take $\mathcal{A} = \sigma(\mathrm{A})$ to be the smallest $\sigma$-algebra on which all coordinate projections are measurable. This is the so-called cylindrical $\sigma$-algebra on $\mathbb{R}^T$. A set $B \times \mathbb{R}^{T \setminus N}$ is called a cylinder. As we already agreed, the probability $P$ on the sets in the algebra $\mathrm{A}$ is given by
$$P\big( B \times \mathbb{R}^{T \setminus N} \big) = P_N(B).$$
Given two finite subsets $N \subseteq M \subset T$ and $B \in \mathcal{B}^N$, the same set can be represented as two different cylinders,
$$B \times \mathbb{R}^{T \setminus N} = \big( B \times \mathbb{R}^{M \setminus N} \big) \times \mathbb{R}^{T \setminus M}.$$
However, by the consistency condition, the definition of $P$ will not depend on the choice of the representation. To finish the construction, we need to show that $P$ can be extended from the algebra $\mathrm{A}$ to a probability measure on the $\sigma$-algebra $\mathcal{A}$. By the Carathéodory Extension Theorem 1.1, we only need to show that the following holds.
and, therefore,
$$P\Big( \bigcap_{i \le n} C_i \times \mathbb{R}^{T \setminus N_i} \setminus \bigcap_{i \le n} K_i \times \mathbb{R}^{T \setminus N_i} \Big) \le P\Big( \bigcup_{i \le n} (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \Big) \le \sum_{i \le n} P\big( (C_i \setminus K_i) \times \mathbb{R}^{T \setminus N_i} \big) \le \sum_{i \le n} \frac{\varepsilon}{2^{i+1}} \le \frac{\varepsilon}{2},$$
where
$$K^n = \bigcap_{i \le n} \big( K_i \times \mathbb{R}^{N_n \setminus N_i} \big)$$
and, for $m \ge n$,
$$\omega^m \in K^m \subseteq K^n \times \mathbb{R}^{N_m \setminus N_n}.$$
Exercise 1.3.1. Does the set $C([0,1], \mathbb{R})$ of continuous functions on $[0,1]$ belong to the cylindrical $\sigma$-algebra $\mathcal{A}$ on $\mathbb{R}^{[0,1]}$? Hint: an exercise in Section 1 might be helpful.
Exercise 1.3.2. Let $\mathcal{B}$ be the cylindrical $\sigma$-algebra on $\mathbb{R}^{[0,1]}$ and let $\Omega$ be the set of all continuous functions on $[0,1]$, i.e. $\Omega = C[0,1] \subseteq \mathbb{R}^{[0,1]}$. Consider the $\sigma$-algebra $\mathcal{A} := \{ B \cap \Omega : B \in \mathcal{B} \}$ on $\Omega = C[0,1]$, which consists of intersections of sets in $\mathcal{B}$ with the set of continuous functions $\Omega = C[0,1]$. Show that the set
$$A := \Big\{ \omega \in C[0,1] : \int_0^1 \omega(t)\, dt < 1 \Big\}$$
belongs to $\mathcal{A}$.
This section can be studied anytime before Chapter 4. The topic of this section
can be viewed as a vast abstract generalization of the Fubini theorem for product
spaces. For example, given a pair of random variables (X,Y ), how can we average
a function f (X,Y ) by fixing X = x and averaging with respect to Y first? Can we
define some distribution of Y given X = x that can be used to compute this average?
We will begin by defining these conditional averages, or conditional expectations,
and then we will use them to define the corresponding conditional distributions.
Let $(\Omega, \mathcal{A}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ a random variable such that $E|X| < \infty$. Let $\mathcal{B}$ be a $\sigma$-subalgebra of $\mathcal{A}$, $\mathcal{B} \subseteq \mathcal{A}$. A random variable $Y : \Omega \to \mathbb{R}$ is called the conditional expectation of $X$ given $\mathcal{B}$ if
1. $Y$ is measurable with respect to $\mathcal{B}$, i.e. for every Borel set $B$ on $\mathbb{R}$, $Y^{-1}(B) \in \mathcal{B}$.
2. For any set $B \in \mathcal{B}$, we have $E X I_B = E Y I_B$.
This conditional expectation is denoted by $Y = E(X|\mathcal{B})$. If $X, Z$ are random variables then the conditional expectation of $X$ given $Z$ is defined by $E(X|Z) := E(X|\sigma(Z))$.
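On a finite probability space the two defining properties pin $E(X|\mathcal{B})$ down as the average of $X$ over each atom of $\mathcal{B}$. A minimal sketch (the particular $\Omega$, $X$ and $Z$ below are my illustrative choices):

```python
from collections import defaultdict

# Omega = {0,...,5} with uniform P, X(w) = w, and B = sigma(Z) for Z(w) = w % 2
omega = range(6)
P = {w: 1 / 6 for w in omega}
X = {w: float(w) for w in omega}
Z = {w: w % 2 for w in omega}

# The atoms of sigma(Z) are the level sets {Z = z}
atoms = defaultdict(list)
for w in omega:
    atoms[Z[w]].append(w)

# Y = E(X|B): on each atom, the P-weighted average of X over that atom
Y = {w: sum(X[v] * P[v] for v in atoms[Z[w]]) / sum(P[v] for v in atoms[Z[w]])
     for w in omega}

# Defining property 2, checked on the generating sets B = {Z = z}
for z, ws in atoms.items():
    lhs = sum(X[w] * P[w] for w in ws)   # E[X I_B]
    rhs = sum(Y[w] * P[w] for w in ws)   # E[Y I_B]
    assert abs(lhs - rhs) < 1e-12
print(round(Y[0], 6), round(Y[1], 6))  # 2.0 3.0  (mean over evens, over odds)
```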
2. (Uniqueness) Suppose there exists another $Y' = E(X|\mathcal{B})$ such that $P(Y \ne Y') > 0$, i.e. either $P(Y > Y') > 0$ or $P(Y < Y') > 0$. Since both $Y, Y'$ are measurable with respect to $\mathcal{B}$, the set $B = \{Y > Y'\} \in \mathcal{B}$. If $P(B) > 0$ then $E(Y - Y') I_B > 0$. On the other hand,
$$E(Y - Y') I_B = E X I_B - E X I_B = 0,$$
a contradiction. Similarly, $P(Y < Y') = 0$. Of course, the conditional expectation can be redefined on a set of measure zero, so uniqueness is understood in this sense.

3. (Linearity) Conditional expectation is linear,
$$E(cX + Y | \mathcal{B}) = c\, E(X|\mathcal{B}) + E(Y|\mathcal{B}),$$
just like the usual integral. It is obvious that the right-hand side satisfies the definition of the conditional expectation $E(cX + Y | \mathcal{B})$ and the equality follows by uniqueness. Again, the equality holds almost surely.
4. (Smoothing) If $\sigma$-algebras $\mathcal{C} \subseteq \mathcal{B} \subseteq \mathcal{A}$ then
$$E(X|\mathcal{C}) = E\big( E(X|\mathcal{B}) \,\big|\, \mathcal{C} \big).$$
In particular, if $\mathcal{B} = \{\emptyset, \Omega\}$ is the trivial $\sigma$-algebra then
$$E(X|\mathcal{B}) = EX.$$
(Monotonicity) If $X \le Z$ then
$$E(X|\mathcal{B}) \le E(Z|\mathcal{B}).$$
To prove this, let us first note that, since $E(X_n|\mathcal{B}) \le E(X_{n+1}|\mathcal{B}) \le E(X|\mathcal{B})$, there exists a limit
$$g = \lim_{n\to\infty} E(X_n|\mathcal{B}) \le E(X|\mathcal{B}).$$
Since $E(X_n|\mathcal{B})$ are measurable with respect to $\mathcal{B}$, so is the limit $g$. It remains to check that for any set $B \in \mathcal{B}$, $E g I_B = E X I_B$. Since $X_n I_B \uparrow X I_B$ and $E(X_n|\mathcal{B}) I_B \uparrow g I_B$, the usual monotone convergence theorem implies that
$$E X_n I_B \to E X I_B \quad \text{and} \quad E\, E(X_n|\mathcal{B}) I_B \to E g I_B.$$
Since $E I_B E(X_n|\mathcal{B}) = E X_n I_B$, this implies that $E g I_B = E X I_B$ and, therefore, $g = E(X|\mathcal{B})$ almost surely.
9. (Dominated convergence) If $|X_n| \le Y$, $EY < \infty$, and $X_n \to X$, then $E(X_n|\mathcal{B}) \to E(X|\mathcal{B})$ almost surely. To prove this, define
$$-Y \le g_n = \inf_{m \ge n} X_m \le X_n \le h_n = \sup_{m \ge n} X_m \le Y$$
and observe that $g_n \uparrow X$, $h_n \downarrow X$, $|g_n| \le Y$ and $|h_n| \le Y$. Therefore, by the monotone convergence theorem,
(Jensen's inequality) If $f$ is a convex function then
$$f\big( E(X|\mathcal{B}) \big) \le E\big( f(X) \,\big|\, \mathcal{B} \big).$$
Let $h(x)$ be some monotone (thus, measurable) function such that $h(x)$ is in the subgradient $\partial f(x)$ of the convex function $f$ for all $x$. Then, by convexity,
$$f(y) \ge f(x) + h(x)(y - x) \quad \text{for all } x, y.$$
Notice that $B_n \in \mathcal{B}$ and, by property 9, $E(X_n|\mathcal{B}) = I_{B_n} E(X|\mathcal{B}) \in [-n, n]$. Since both $f$ and $h$ are bounded on $[-n, n]$, we now have no integrability issues and, taking conditional expectations of both sides, we get
$$f\big( E(X_n|\mathcal{B}) \big) \le E\big( f(X_n) \,\big|\, \mathcal{B} \big).$$
Now, let $n \to \infty$. Since $E(X_n|\mathcal{B}) = I_{B_n} E(X|\mathcal{B}) \to E(X|\mathcal{B})$ and $f$ is continuous, $f(E(X_n|\mathcal{B}))$ converges to $f(E(X|\mathcal{B}))$. On the other hand,
$P|_{\mathcal{B}}$-null set, so there is some freedom in defining $P_{f|\mathcal{B}}(C, \cdot)$ for a given $C$, but it is not obvious that we can define $P_{f|\mathcal{B}}(C, \cdot)$ simultaneously for all $C$ in such a way that (i) is also satisfied. However, when the space $(Y, \mathcal{Y})$ is regular, i.e. $(Y, d)$ is a complete separable metric space and $\mathcal{Y}$ is its Borel $\sigma$-algebra, then a conditional distribution always exists.

for $P|_{\mathcal{B}}$-almost all $\omega$, i.e. the conditional distribution is unique modulo $P|_{\mathcal{B}}$-almost everywhere equivalence.
(d) for any $C \in \mathcal{C}$ and the specific sequence $K_n$ as chosen, (1.6) holds a.s.

Since there are countably many conditions in (a)-(d), there exists a set $\mathcal{N} \in \mathcal{B}$ such that $P(\mathcal{N}) = 0$ and conditions (a)-(d) hold for all $\omega \notin \mathcal{N}$. We will now show that for all such $\omega$, $P_{f|\mathcal{B}}(\cdot, \omega)$ can be extended to a probability measure on $\mathcal{Y}$ and this extension is such that for any $C \in \mathcal{Y}$, $P_{f|\mathcal{B}}(C, \cdot)$ is a version of the conditional expectation, i.e. (ii) holds.

First, we need to show that for all $\omega \notin \mathcal{N}$, $P_{f|\mathcal{B}}(\cdot, \omega)$ is countably additive on $\mathcal{C}$ or, equivalently, that for any sequence $C_n \in \mathcal{C}$ such that $C_n \downarrow \emptyset$ we have $P_{f|\mathcal{B}}(C_n, \omega) \downarrow 0$. If not, then there exists a sequence $C_n \downarrow \emptyset$ such that $P_{f|\mathcal{B}}(C_n, \omega) \ge \delta > 0$. Using (d), take compact $K_n \in \mathcal{D}$ such that $P_{f|\mathcal{B}}(C_n \setminus K_n, \omega) \le \delta/3^n$. Then
$$P_{f|\mathcal{B}}(K_1 \cap \ldots \cap K_n, \omega) \ge P_{f|\mathcal{B}}(C_n, \omega) - \sum_{1 \le i \le n} \frac{\delta}{3^i} > \frac{\delta}{2}$$
for all $C \in \mathcal{C}$ on a set of $\omega$ of measure one, since $\mathcal{C}$ is countable, and we know that the probability measure on the $\sigma$-algebra $\mathcal{Y}$ is uniquely determined by its values on the generating algebra $\mathcal{C}$. □
u
for $P|_{\mathcal{B}}$-almost all $\omega$. Therefore, (1.7) holds for all simple functions. For $g \ge 0$, take a sequence of simple functions $0 \le g_n \uparrow g$. Since $E g(f) < \infty$, by the monotone convergence theorem for conditional expectations,
$$E\big( g_n(f) \,\big|\, \mathcal{B} \big) \uparrow E\big( g(f) \,\big|\, \mathcal{B} \big)$$
for all $\omega \notin \mathcal{N}$ for some $\mathcal{N} \in \mathcal{B}$ with $P(\mathcal{N}) = 0$. Assume also that for $\omega \notin \mathcal{N}$, (1.7) holds for all functions in the sequence $(g_n)$. By the usual monotone convergence theorem,
$$\int g_n(y)\, P_{f|\mathcal{B}}(dy, \omega) \uparrow \int g(y)\, P_{f|\mathcal{B}}(dy, \omega)$$
for all $\omega$ and, therefore, (1.7) holds for all $\omega \notin \mathcal{N}$. This proves the claim for $g \ge 0$, and the general case follows. □
Product space case. Consider measurable spaces $(X, \mathcal{X})$ and $(Y, \mathcal{Y})$ and let $(\Omega, \mathcal{A})$ be the product space, $\Omega = X \times Y$ and $\mathcal{A} = \mathcal{X} \otimes \mathcal{Y}$, with some probability measure $P$ on it. Let
$$h(x, y) = x \quad \text{and} \quad f(x, y) = y$$
be the coordinate projections. Let $\mathcal{B}$ be the $\sigma$-algebra generated by the first coordinate projection $h$, i.e.
$$\mathcal{B} = h^{-1}(\mathcal{X}) = \big\{ D \times Y : D \in \mathcal{X} \big\}.$$
Suppose that a conditional distribution $P_{f|\mathcal{B}}$ of $f$ given $\mathcal{B}$ exists, for example, if $(Y, \mathcal{Y})$ is regular. For a fixed $C \in \mathcal{Y}$, $P_{f|\mathcal{B}}(C, \cdot)$ is $\mathcal{B}$-measurable, by definition, and since $\mathcal{B} = h^{-1}(\mathcal{X})$,
$$P_{f|\mathcal{B}}(C, (x, y)) = P_x(C)$$
for some $\mathcal{X}$-measurable function $P_x(C)$. In the product-space setting, $P_x$ is called a conditional distribution of $y$ given $x$. Notice that for any set $D \in \mathcal{X}$, $\{\omega : h(\omega) \in D\} \in \mathcal{B}$ and
$$P(D \times C) = \int_D P_x(C)\, d\mu(x),$$
where $\mu(D) = P(D \times Y)$ is the marginal distribution of the first coordinate. Of course, (1) and (2) imply that $P_x(C)$ defines a version of the conditional expectation of $I(y \in C)$ given the first coordinate $x$. Moreover, (3) implies the following more general statement.
Theorem 1.9. If $g : X \times Y \to \mathbb{R}$ is $P$-integrable then
$$\int g(x, y)\, dP(x, y) = \iint g(x, y)\, dP_x(y)\, d\mu(x). \tag{1.9}$$
Proof. This coincides with (3) for the indicator of the rectangle $D \times C$ and, as a result, holds for indicators of disjoint unions of rectangles. Then, using the monotone class theorem (or Dynkin's theorem) as in the proof of Fubini's theorem, (1.9) can be extended to indicators of all measurable sets in the product $\sigma$-algebra $\mathcal{A} = \mathcal{X} \otimes \mathcal{Y}$, and then to simple functions, positive measurable functions and all integrable functions. □
This result is a generalization of Fubini's theorem to the case when the measure is not a product measure. One can also view this as a way to construct more general measures on product spaces than product measures: one can start with any function $P_x(C)$ that satisfies properties (1) and (2) and define a measure $P$ on the product space as in property (3). Such functions satisfying properties (1) and (2) are called probability kernels or transition functions. In the case when $P_x(C) = \nu(C)$, we recover the product measure $\mu \times \nu$.
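On finite spaces the kernel construction and the Fubini-type identity (1.9) reduce to sums and can be checked directly. A minimal sketch (the marginal `mu` and kernel `Px` below are my illustrative choices):

```python
import itertools

# X = {0,1}, Y = {0,1,2}; mu is a marginal on X and Px a probability kernel:
# for each x, Px[x] is a probability measure on Y
mu = {0: 0.4, 1: 0.6}
Px = {0: {0: 0.5, 1: 0.3, 2: 0.2},
      1: {0: 0.1, 1: 0.1, 2: 0.8}}

# The measure on the product defined as in property (3):
# pointwise, P({(x,y)}) = mu(x) * Px[x]({y}); it is NOT a product measure
P = {(x, y): mu[x] * Px[x][y] for x, y in itertools.product(mu, Px[0])}
assert abs(sum(P.values()) - 1.0) < 1e-12

# The identity (1.9) for an arbitrary integrand g
g = lambda x, y: (x + 1) * (y - 0.5) ** 2
lhs = sum(g(x, y) * p for (x, y), p in P.items())                      # int g dP
rhs = sum(sum(g(x, y) * Px[x][y] for y in Px[x]) * mu[x] for x in mu)  # iterated
assert abs(lhs - rhs) < 1e-12
print("integral against P equals the iterated integral")
```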
A special version of the product space case is the so-called disintegration theorem. Let $(Y, \mathcal{Y})$ be regular and let $\nu$ be a probability measure on $\mathcal{Y}$. Consider a measurable function $p : (Y, \mathcal{Y}) \to (X, \mathcal{X})$ and let $\mu = \nu \circ p^{-1}$ be the push-forward measure on $\mathcal{X}$. Let $P$ be the push-forward measure on the product $\sigma$-algebra $\mathcal{X} \otimes \mathcal{Y}$ under the map $y \to (p(y), y)$, so that $P$ has marginals $\mu$ and $\nu$.
Proof. To prove (1.10), just take $g(x, y) = h(x) f(y)$ in (1.9). If we replace $h(x)$ with $h(x) g(x)$ and use (1.10) twice, we get
$$\int \Big( \int f(y) h(p(y))\, dP_x(y) \Big) g(x)\, d\mu(x) = \int f(y) h(p(y)) g(p(y))\, d\nu(y) = \int \Big( \int f(y)\, dP_x(y) \Big) h(x) g(x)\, d\mu(x) = \int \Big( \int f(y) h(x)\, dP_x(y) \Big) g(x)\, d\mu(x).$$
Since this holds for all such $g$, for $\mu$-almost all $x$,
$$\int f(y) h(p(y))\, dP_x(y) = \int f(y) h(x)\, dP_x(y).$$
This also holds for $\mu$-almost all $x$ simultaneously for countably many functions $f$. We can choose these $f$ to be the indicators of sets in the algebra $\mathcal{C}$ in the proof of Theorem 1.7, and this choice obviously implies that $h(p(y)) = h(x)$ for $P_x$-almost all $y$. □
Exercise 1.4.1. Let $(X, Y)$ be a random variable on $\mathbb{R}^2$ with the density $f(x, y)$. Show that if $E|X| < \infty$ then
$$E(X|Y) = \int_{\mathbb{R}} x f(x, Y)\, dx \Big/ \int_{\mathbb{R}} f(x, Y)\, dx.$$
Exercise 1.4.2. Suppose that X,Y and U are random variables on some probability
space such that U is independent of (X,Y ) and has the uniform distribution on
[0, 1]. Prove that there exists a measurable function f : R ⇥ [0, 1] ! R such that
(X, f (X,U)) has the same distribution as (X,Y ) on R2 .
Exercise 1.4.3. Consider a measurable space $(\Omega, \mathcal{B})$ and suppose that random pairs $(X, Y)$ and $(X', Y')$, defined on different probability spaces and taking values in $\mathbb{R} \times \Omega$, have the same distribution. If $E|X|, E|X'| < \infty$, show that the conditional expectations $E(X|Y)$ and $E(X'|Y')$ have the same distribution.
Chapter 2
Laws of Large Numbers
When we toss an unbiased coin many times, we expect the proportion of heads or tails to be close to a half — a phenomenon called the law of large numbers. More generally, if $(X_n)_{n\ge 1}$ is a sequence of independent identically distributed (i.i.d.) random variables, we expect that their average
$$\bar X_n = \frac{S_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i$$
is, in some sense, close to the expectation $\mu = EX_1$, assuming that it exists. In the next section, we will prove a general qualitative result of this type, but we begin with more quantitative statements for bounded random variables. We will need a couple of simple lemmas.
Lemma 2.1 (Chebyshev's inequality). If a random variable $X \ge 0$ then, for $t > 0$,
$$P(X\ge t) \le \frac{E X I(X\ge t)}{t} \le \frac{EX}{t}.$$
Proof. The first inequality follows from $t\,I(X\ge t) \le X\,I(X\ge t)$ after taking expectations, and the second from $X I(X\ge t)\le X$. ⊓⊔
The first main result in this section is the following classical inequality.
Theorem 2.1 (Hoeffding's inequality). If $\varepsilon_1,\dots,\varepsilon_n$ are independent random variables such that $|\varepsilon_i|\le 1$ and $E\varepsilon_i = 0$ then, for any $a_1,\dots,a_n\in\mathbb{R}$ and $t\ge 0$,
$$P\Big(\sum_{i=1}^n \varepsilon_i a_i \ge t\Big) \le \exp\Big(-\frac{t^2}{2\sum_{i=1}^n a_i^2}\Big).$$
The most classical case is that of i.i.d. Rademacher random variables $(\varepsilon_n)_{n\ge 1}$ such that
$$P(\varepsilon_n = 1) = P(\varepsilon_n = -1) = \frac12,$$
which is equivalent to tossing a coin with heads and tails replaced by $\pm 1$.
where in the last step we used Lemma 1.10. By Lemma 2.2, $Ee^{\lambda \varepsilon_i a_i} \le e^{\lambda^2 a_i^2/2}$ and, therefore,
$$P\Big(\sum_{i=1}^n \varepsilon_i a_i \ge t\Big) \le \exp\Big(-\lambda t + \frac{\lambda^2}{2}\sum_{i=1}^n a_i^2\Big).$$
Optimizing over $\lambda \ge 0$ finishes the proof. ⊓⊔
We can also apply the same inequality to $(-\varepsilon_i)$ and, combining both cases,
$$P\Big(\Big|\sum_{i=1}^n \varepsilon_i a_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2\sum_{i=1}^n a_i^2}\Big).$$
Taking $a_i = 1/n$, this shows that, no matter how small $t > 0$ is, the probability that the average deviates from the expectation $E\varepsilon_1 = 0$ by more than $t$ decreases exponentially fast with $n$.
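To see the bound in action, one can compare it with the exact tail of a Rademacher sum with all $a_i = 1$. The following is a minimal sketch in Python (the helper name `rademacher_tail` is ours, not from the text):

```python
import math

def rademacher_tail(n, t):
    """Exact P(|sum_{i<=n} eps_i| >= t) for i.i.d. Rademacher signs,
    computed from the binomial distribution of the number of +1's."""
    total = 0.0
    for k in range(n + 1):
        s = 2 * k - n  # value of the sum when there are k plus-ones
        if abs(s) >= t:
            total += math.comb(n, k) * 0.5 ** n
    return total

n, t = 100, 20
exact = rademacher_tail(n, t)
bound = 2 * math.exp(-t * t / (2 * n))  # two-sided Hoeffding with a_i = 1
assert exact <= bound
```

For these parameters the exact tail is already an order of magnitude below the Hoeffding bound, consistent with the bound being an upper estimate.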
Let us now consider a more general case. For $p, q\in(0,1)$, we consider the function
$$D(p, q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q},$$
which is called the Kullback-Leibler divergence. To see that $D(p,q)\ge 0$, with equality only if $p = q$, just use that $\log x \le x - 1$ with equality only if $x = 1$.
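These two properties of $D(p,q)$ are easy to check numerically; a quick sketch (function name is ours):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p, q) between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# D(p, q) >= 0 on a grid, with equality exactly on the diagonal p = q
grid = [i / 10 for i in range(1, 10)]
for p in grid:
    for q in grid:
        d = kl(p, q)
        assert d >= -1e-12
        if p == q:
            assert abs(d) < 1e-12
```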
Proof. Notice that the probability is zero when $\mu + t > 1$, since the average cannot exceed 1, so $\mu + t\le 1$ is not really a constraint here. Using the convexity of the exponential function, we can write for $x\in[0,1]$ that
$$e^{\lambda x} = e^{\lambda(x\cdot 1 + (1-x)\cdot 0)} \le x e^{\lambda} + (1-x)e^{\lambda\cdot 0} = 1 - x + x e^{\lambda},$$
These inequalities show that, no matter how small $t>0$ is, the probability that the average $\bar X_n$ deviates from the expectation $\mu$ by more than $t$ in either direction decreases exponentially fast with $n$. Of course, the same conclusion applies to any bounded random variables, $|X_i|\le M$, by shifting and rescaling the interval $[-M, M]$ into the interval $[0,1]$.
Even though the Hoeffding-Chernoff inequality applies to all bounded random variables, in real-world applications in engineering, computer science, etc., one would like to improve the control of the probability by incorporating other measures of closeness of the random variable $X$ to the mean $\mu$, for example, the variance
$$\sigma^2 = \mathrm{Var}(X) = E(X-\mu)^2 = EX^2 - (EX)^2.$$
The following inequality is classical. Let us denote
$$\varphi(x) = (1+x)\log(1+x) - x.$$
We will center the random variables $(X_n)$ and instead work with $Z_n = X_n - \mu$. Bennett's inequality, proved below, states that
$$P\Big(\sum_{i=1}^n Z_i\ge nt\Big)\le\exp\Big(-\frac{n\sigma^2}{M^2}\,\varphi\Big(\frac{tM}{\sigma^2}\Big)\Big)$$
for all $t\ge 0$.
Using Taylor series expansion and the fact that $EZ = 0$, $EZ^2 = \sigma^2$ and $|Z|\le M$, we can write
$$Ee^{\lambda Z} = E\sum_{k=0}^\infty \frac{(\lambda Z)^k}{k!} = \sum_{k=0}^\infty \lambda^k\frac{EZ^k}{k!} = 1 + \sum_{k=2}^\infty \frac{\lambda^k}{k!}EZ^2 Z^{k-2} \le 1 + \sum_{k=2}^\infty \frac{\lambda^k}{k!}M^{k-2}\sigma^2$$
$$= 1 + \frac{\sigma^2}{M^2}\sum_{k=2}^\infty\frac{\lambda^k M^k}{k!} = 1 + \frac{\sigma^2}{M^2}\big(e^{\lambda M}-1-\lambda M\big) \le \exp\Big(\frac{\sigma^2}{M^2}\big(e^{\lambda M}-1-\lambda M\big)\Big),$$
where in the last inequality we used $1 + x\le e^x$. Therefore,
$$P\Big(\sum_{i=1}^n Z_i \ge nt\Big) \le \exp\Big(n\Big(-\lambda t + \frac{\sigma^2}{M^2}\big(e^{\lambda M}-1-\lambda M\big)\Big)\Big).$$
Setting the derivative in $\lambda$ to zero,
$$-t + \frac{\sigma^2}{M^2}\big(M e^{\lambda M} - M\big) = 0,\qquad e^{\lambda M} = \frac{tM}{\sigma^2}+1,\qquad \lambda = \frac1M\log\Big(1+\frac{tM}{\sigma^2}\Big).$$
Since this $\lambda\ge 0$, plugging it into the above bound, we get
$$\exp\Big(-n\Big(\frac tM\log\Big(1+\frac{tM}{\sigma^2}\Big)-\frac{\sigma^2}{M^2}\Big(\frac{tM}{\sigma^2}+1-1-\log\Big(1+\frac{tM}{\sigma^2}\Big)\Big)\Big)\Big)$$
$$=\exp\Big(-\frac{n\sigma^2}{M^2}\Big(\frac{tM}{\sigma^2}\log\Big(1+\frac{tM}{\sigma^2}\Big)-\frac{tM}{\sigma^2}+\log\Big(1+\frac{tM}{\sigma^2}\Big)\Big)\Big)$$
$$=\exp\Big(-\frac{n\sigma^2}{M^2}\Big(\Big(1+\frac{tM}{\sigma^2}\Big)\log\Big(1+\frac{tM}{\sigma^2}\Big)-\frac{tM}{\sigma^2}\Big)\Big)=\exp\Big(-\frac{n\sigma^2}{M^2}\,\varphi\Big(\frac{tM}{\sigma^2}\Big)\Big),$$
which finishes the proof. ⊓⊔
To simplify the bound in Bennett's inequality, one can notice that the function
$$f(x) := \varphi(x) - \frac{x^2}{2(1+x/3)} \ge 0,$$
because
$$f(0) = f'(0) = 0 \quad\text{and}\quad f''(x) = \frac{x^2(x+9)}{(x+1)(x+3)^3}\ge 0,$$
which yields Bernstein's inequality
$$P\Big(\sum_{i=1}^n Z_i \ge nt\Big) \le \exp\Big(-\frac{nt^2}{2\sigma^2+\frac{2tM}{3}}\Big).$$
For small $t$, the denominator is of order $\sim 2\sigma^2$, and we get a better control of the probability when the variance is small.
Poisson approximation of the Binomial. As a motivation for one of the exercises below, let us prove a classical approximation of the Binomial distribution by the Poisson. Consider independent Bernoulli random variables $X_i\sim B(p_i)$ for $i\le n$ for some $p_i\in(0,1)$. Recall that, for $\lambda > 0$, the Poisson distribution $\Pi_\lambda$ is defined by
$$\Pi_\lambda(\{k\}) = \frac{\lambda^k}{k!}e^{-\lambda}\quad\text{for integer } k\ge 0.$$
The following holds.
Proof. The proof is based on a construction on “the same probability space”. Let us construct Bernoulli r.v. $X_i\sim B(p_i)$ and Poisson r.v. $X_i^*\sim \Pi_{p_i}$ on the same probability space as follows. Let us consider the standard probability space $([0,1],\mathcal{B},P)$ with the Lebesgue measure $P$. Let us fix $i$ for a moment and define a random variable
$$X_i = X_i(x) = \begin{cases} 0, & 0\le x\le 1-p_i,\\ 1, & 1-p_i < x\le 1.\end{cases}$$
Clearly, $X_i\sim B(p_i)$. Let us construct $X_i^*$ as follows. For $k\ge 0$, let us denote
$$c_k = \sum_{\ell=0}^k \frac{(p_i)^\ell}{\ell!}e^{-p_i}$$
and let
$$X_i^* = X_i^*(x) = \begin{cases} 0, & 0\le x\le c_0,\\ 1, & c_0 < x\le c_1,\\ 2, & c_1 < x\le c_2,\\ \;\vdots \end{cases}$$
Clearly, $X_i^*$ has the Poisson distribution $\Pi_{p_i}$. When do we have $X_i\ne X_i^*$? Since $1-p_i\le e^{-p_i} = c_0$, this can only happen for $1-p_i < x\le c_0$ and $c_1 < x\le 1$, which implies that
$$P(X_i\ne X_i^*) = \big(e^{-p_i}-(1-p_i)\big) + \big(1-e^{-p_i}-p_i e^{-p_i}\big) = p_i\big(1-e^{-p_i}\big)\le p_i^2.$$
Then we construct pairs $(X_i, X_i^*)$ for $i\le n$ on separate coordinates of the product space $[0,1]^n$ with the product Lebesgue measure, thus making them independent for $i\le n$. It is well-known (and easy to check) that a sum of independent Poisson random variables is again Poisson with the parameter given by the sum of individual parameters and, therefore, $\sum_{i\le n}X_i^*$ has the Poisson distribution $\Pi_\lambda$, where $\lambda = p_1+\dots+p_n$. Finally, we use the union bound to conclude that
$$P(S_n\ne S_n^*)\le \sum_{i=1}^n P(X_i\ne X_i^*)\le\sum_{i=1}^n p_i^2,$$
$$D(1-\mu+t,\,1-\mu)\ge \frac{t^2}{2\mu(1-\mu)}.$$
Hint: compare second derivatives.
Exercise 2.1.2. In the setting of Theorem 2.2, show that, for $s > 0$,
$$P\Big(\frac{1}{\sqrt n}\sum_{i=1}^n (X_i-\mu)\ge s\Big)\le \exp\Big(-\frac{s^2}{2\mu(1-\mu)}+O\big(n^{-1/2}\big)\Big).$$
Notice that, by the previous exercise, if $\mu\ge 1/2$ then the same inequality holds without the $O\big(n^{-1/2}\big)$ term. (Once we learn the Central Limit Theorem, this will show that the Hoeffding-Chernoff inequality is almost sharp for i.i.d. Bernoulli $B(\mu)$ random variables in the regime $t = \frac{s}{\sqrt n}$ for $s\ge 1$.)
Exercise 2.1.3. Fix $\lambda > 0$ and $s > 0$. (a) In the setting of Theorem 2.2, if $\mu = \frac{\lambda}{n}$, show that
$$P\Big(\sum_{i=1}^n X_i\ge \lambda+s\Big)\le\exp\Big(s-(\lambda+s)\log\frac{\lambda+s}{\lambda}+O\big(n^{-1}\big)\Big).$$
(b) If $X$ has the Poisson distribution $P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}$, $k = 0,1,2,\dots$, show that
$$P\big(X\ge\lambda+s\big)\le\exp\Big(s-(\lambda+s)\log\frac{\lambda+s}{\lambda}\Big)$$
and, when $\lambda+s$ is an integer,
$$P\big(X\ge\lambda+s\big)\ge\frac{1}{e\sqrt{\lambda+s}}\exp\Big(s-(\lambda+s)\log\frac{\lambda+s}{\lambda}\Big).$$
Hint: For the lower bound, use Stirling's formula in the form $k!\le e\, k^{k+1/2}e^{-k}$. (Together with Theorem 2.4 above, this exercise shows that the Hoeffding-Chernoff inequality is almost sharp in the regime $\mu = \frac{\lambda}{n}$, $t = \frac{s}{n}$ for $s\ge 1$.)
Exercise 2.1.5. Suppose that the random variables $X_1,\dots,X_n,X_1',\dots,X_n'$ are independent and, for all $i\le n$, $X_i$ and $X_i'$ have the same distribution. Prove that
$$P\Big(\sum_{i=1}^n (X_i-X_i') > 2t\Big(\sum_{i=1}^n (X_i-X_i')^2\Big)^{1/2}\Big)\le e^{-2t^2}.$$
Hint: think about a way to introduce Rademacher random variables $\varepsilon_i$ into the problem and then use Hoeffding's inequality.
In this section, we will study two types of convergence of the average to the mean, in probability and almost surely. Consider a sequence of random variables $(Y_n)_{n\ge 1}$ on some probability space $(\Omega,\mathcal{A},P)$. We say that $Y_n$ converges in probability to a random variable $Y$ (and write $Y_n\xrightarrow{p} Y$) if, for all $\varepsilon>0$,
$$\lim_{n\to\infty}P\big(|Y_n-Y|\ge\varepsilon\big)=0,$$
and that $Y_n$ converges to $Y$ almost surely if
$$P\big(\omega : \lim_{n\to\infty}Y_n(\omega)=Y(\omega)\big)=1.$$
The weak law of large numbers below states that if $EX_i=0$, $EX_iX_j=0$ for $i\ne j$, and $EX_i^2\le K$ for all $i$, then $\bar X_n\to 0$ in probability.
Proof. By Chebyshev's inequality, also using that $EX_iX_j = 0$,
$$P\big(|\bar X_n - 0|\ge\varepsilon\big) = P\big(\bar X_n^2\ge\varepsilon^2\big)\le\frac{E\bar X_n^2}{\varepsilon^2} = \frac{1}{n^2\varepsilon^2}E(X_1+\dots+X_n)^2 = \frac{1}{n^2\varepsilon^2}\sum_{i=1}^n EX_i^2\le\frac{nK}{n^2\varepsilon^2} = \frac{K}{n\varepsilon^2}\to 0,$$
$$EX_i = \theta,\qquad \mathrm{Var}(X_i) = \sigma^2(\theta).$$
$$P\big(|\bar X_n-\theta|>\varepsilon\big)\le\frac{\sigma^2(\theta)}{n\varepsilon^2},$$
and, therefore,
$$\big|Eu(\bar X_n)-u(\theta)\big|\le\max_{|x-\theta|\le\varepsilon}|u(x)-u(\theta)|+\frac{2\|u\|_\infty\,\sigma^2(\theta)}{n\varepsilon^2}.$$
$$P(X_i=1)=\theta,\qquad P(X_i=0)=1-\theta,$$
and let $u:[0,1]\to\mathbb{R}$ be continuous. Then, by the above theorem, the following linear combination of the Bernstein polynomials,
$$B_n(\theta) := Eu(\bar X_n) = \sum_{k=0}^n u\Big(\frac kn\Big)P\Big(\sum_{i=1}^n X_i=k\Big) = \sum_{k=0}^n u\Big(\frac kn\Big)\binom nk\theta^k(1-\theta)^{n-k},$$
converges to $u(\theta)$ uniformly over $\theta\in[0,1]$.
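This convergence is easy to observe numerically. The sketch below evaluates the Bernstein polynomial of a continuous but non-smooth function on a grid of $\theta$'s (function names are ours):

```python
import math

def bernstein(u, n, theta):
    """B_n(theta) = E u(Xbar_n): expectation of u(k/n) under Binomial(n, theta)."""
    return sum(
        u(k / n) * math.comb(n, k) * theta**k * (1 - theta)**(n - k)
        for k in range(n + 1)
    )

u = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2
errs = []
for n in (10, 100, 1000):
    # worst-case error over a grid of theta values in [0, 1]
    errs.append(max(abs(bernstein(u, n, t / 50) - u(t / 50)) for t in range(51)))

assert errs[0] > errs[1] > errs[2]  # uniform error shrinks as n grows
assert errs[2] < 0.05
```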
Lemma 2.3. Consider a sequence $(p_i)_{i\ge 1}$ such that $p_i\in[0,1)$. Then
$$\prod_{i\ge 1}(1-p_i)=0\iff\sum_{i\ge 1}p_i=+\infty.$$
“$\Longrightarrow$”. We can assume that $p_i\le 1/2$ for $i\ge m$ for large enough $m$ because, otherwise, the series obviously diverges. Since $1-p\ge e^{-2p}$ for $p\le 1/2$, we have
$$\prod_{m\le i\le n}(1-p_i)\ge\exp\Big(-2\sum_{m\le i\le n}p_i\Big),$$
so if the product converges to zero then the series must diverge.
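The dichotomy in Lemma 2.3 can be illustrated numerically with one divergent and one convergent series (the helper is ours):

```python
def product(ps):
    """Partial product of (1 - p_i) over the given sequence."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return out

n = 10_000
harmonic = [1 / (i + 1) for i in range(1, n)]      # sum p_i diverges
squares = [1 / (i + 1) ** 2 for i in range(1, n)]  # sum p_i converges

assert product(harmonic) < 1e-3  # product drifts to 0
assert product(squares) > 0.49   # product stays bounded away from 0
```

In fact both partial products telescope here: the first equals $1/n$ and the second converges to $1/2$, which is what the assertions reflect.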
The following result plays a key role in Probability Theory. Consider a sequence $(A_n)_{n\ge 1}$ of events $A_n\in\mathcal{A}$ on the probability space $(\Omega,\mathcal{A},P)$. We will denote by
$$\big\{A_n\ \text{i.o.}\big\} := \bigcap_{n\ge 1}\bigcup_{m\ge n}A_m$$
the event that the $A_n$ occur infinitely often, which consists of all $\omega\in\Omega$ that belong to infinitely many events in the sequence $(A_n)_{n\ge 1}$. Then the following holds.
$$P\big(|\bar X_n-\mu|\ge t_n\ \text{i.o.}\big)=0.$$
This means that for $P$-almost all $\omega\in\Omega$, the difference $|\bar X_n(\omega)-\mu|$ will become smaller than $t_n$ for large enough $n\ge n_0(\omega)$. If we recall the definition of almost sure convergence, this means that $\bar X_n$ converges to $\mu$ almost surely — the so-called strong law of large numbers. Next, we will show that this holds even for unbounded random variables under the minimal assumption that the expectation $\mu = EX_1$ exists.
Strong law of large numbers. The following simple observation will be useful: if a random variable $X\ge 0$ then
$$EX = \int_0^\infty P(X\ge x)\,dx.$$
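For a nonnegative integer-valued variable, $P(X\ge x)$ is constant on each interval $(k-1,k]$, so the integral reduces to $\sum_{k\ge 1}P(X\ge k)$, which can be checked exactly, e.g. for a fair die:

```python
from fractions import Fraction

# X uniform on {1,...,6}: the integral of P(X >= x) over [0, inf)
# equals the sum over integer levels k >= 1 of P(X >= k).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}
mean = sum(k * p for k, p in pmf.items())
tail_sum = sum(
    sum(p for j, p in pmf.items() if j >= k) for k in range(1, 7)
)
assert mean == tail_sum == Fraction(7, 2)
```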
As before, let $(X_n)_{n\ge 1}$ be i.i.d. random variables on the same probability space.
$$\bar X_n = \frac1n\sum_{i=1}^n X_i\to\mu = EX_1\quad\text{almost surely}.$$
$$\frac1n\sum_{i=1}^n X_i = \frac1n\sum_{i=1}^n X_i^+ - \frac1n\sum_{i=1}^n X_i^-\to EX_1^+ - EX_1^- = EX_1.$$
the Borel-Cantelli lemma implies that $P(\{X_i\ne Y_i\}\ \text{i.o.})=0$. This means that for some (random) $i_0$ and for $i\ge i_0$ we have $X_i = Y_i$ and, therefore,
$$\lim_{n\to\infty}\frac1n\sum_{i=1}^n X_i = \lim_{n\to\infty}\frac1n\sum_{i=1}^n Y_i\quad\text{a.s.}$$
$$\sum_{k\ge 1}\frac{1}{\varepsilon^2 n(k)^2}\mathrm{Var}\big(T_{n(k)}\big) = \sum_{k\ge 1}\frac{1}{\varepsilon^2 n(k)^2}\sum_{i\le n(k)}\mathrm{Var}(Y_i)\le\sum_{k\ge 1}\frac{1}{\varepsilon^2 n(k)^2}\sum_{i\le n(k)}EY_i^2 = \frac{1}{\varepsilon^2}\sum_{i\ge 1}EY_i^2\sum_{k\,:\,n(k)\ge i}\frac{1}{n(k)^2}.$$
$$n(k) = \lfloor\alpha^k\rfloor\ge\frac{\alpha^k}{2},$$
and if $k_0=\min\{k : \alpha^k\ge i\}$ then
$$\sum_{n(k)\ge i}\frac{1}{n(k)^2}\le\sum_{\alpha^k\ge i}\frac{4}{\alpha^{2k}} = \frac{4}{\alpha^{2k_0}\big(1-\frac{1}{\alpha^2}\big)}\le\frac{K}{i^2},$$
We proved that
$$\sum_{k\ge 1}P\big(|T_{n(k)}-ET_{n(k)}|\ge\varepsilon n(k)\big)<\infty,$$
and the Borel-Cantelli lemma implies that $|T_{n(k)}-ET_{n(k)}|<\varepsilon n(k)$ for all large enough $k$, while $ET_{n(k)}/n(k)\to EX$ as $k\to\infty$, so we proved that
$$\frac{T_{n(k)}}{n(k)}\to EX\quad\text{almost surely}.$$
Exercise 2.2.1. Suppose that random variables $(X_n)_{n\ge 1}$ are i.i.d. such that $E|X_1|^p<\infty$ for some $p>0$. Show that $\max_{i\le n}|X_i|/n^{1/p}$ goes to zero in probability.
Exercise 2.2.2 (Weak LLN for U-statistics). If $(X_n)_{n\ge 1}$ are i.i.d., $EX_1=\mu$ and $\sigma^2=\mathrm{Var}(X_1)<\infty$, show that
$$\binom n2^{-1}\sum_{1\le i<j\le n}X_iX_j\to\mu^2$$
in probability as $n\to\infty$.
Exercise 2.2.4. If $E|X|<\infty$ and $\lim_{n\to\infty}P(A_n)=0$, show that $\lim_{n\to\infty}EXI_{A_n}=0$. (Hint: use the Borel-Cantelli lemma over some subsequence.)
Exercise 2.2.7. Suppose that $(X_n)_{n\ge 1}$ are independent random variables. Show that $P(\sup_{n\ge 1}X_n<\infty)=1$ if and only if $\sum_{n\ge 1}P(X_n>M)<\infty$ for some $M>0$.
Exercise 2.2.8. Consider i.i.d. nonnegative random variables $X_i$ for $i\ge 1$ such that $EX_1=+\infty$, and let $S_n=X_1+\dots+X_n$. Show that $\lim_{n\to\infty}\frac{S_n}n=+\infty$ almost surely.
Exercise 2.2.9. If $(X_n)_{n\ge 1}$ are i.i.d. and not constant with probability one, then $P(X_n\ \text{converges})=0$.
Exercise 2.2.11. Suppose that $(X_n)_{n\ge 1}$ are i.i.d. with $EX_1^+<\infty$ and $EX_1=-\infty$. Show that $S_n/n\to-\infty$ almost surely.
Exercise 2.2.12. Suppose that $(X_n)_{n\ge 1}$ are i.i.d. such that $E|X_1|<\infty$ and $EX_1=0$. If $(c_n)$ is a bounded sequence of real numbers, prove that
$$\frac1n\sum_{i=1}^n c_iX_i\to 0\quad\text{almost surely}.$$
Hint: either (a) group the close values of c or (b) examine the proof of the strong
law of large numbers.
Exercise 2.2.13. Suppose $(X_n)_{n\ge 1}$ are i.i.d. standard normal. Prove that
$$P\Big(\limsup_{n\to\infty}\frac{|X_n|}{\sqrt{2\log n}}=1\Big)=1.$$
2.3 0–1 laws, convergence of random series
is a tail event. It turns out that when (Xi )i 1 are independent then all tail events
have probability 0 or 1.
Theorem 2.8 (Kolmogorov’s 0–1 law). If (Xi )i 1 are independent then P(A) = 0
or 1 for all A 2 T .
This is an algebra, because any set operations on finite number of such sets can
again be expressed in terms of finitely many random variables Xi . By the Ap-
proximation Lemma, we can approximate any set A 2 s (Xi )i 1 by sets in A .
Therefore, for any e > 0, there exists a set A0 2 A such that P(A 4 A0 ) e and,
therefore,
|P(A) P(A0 )| e, |P(A) P(A \ A0 )| e.
By definition, A0 2 s (X1 , ..., Xn ) for large enough n and, since A is a tail event,
A 2 s ((Xi )i n+1 ). The Grouping Lemma implies that A and A0 are independent
and P(A \ A0 ) = P(A)P(A0 ). Finally, we get
Example 2.3.2. Consider the series $\sum_{i\ge 1}X_iz^i$ on the complex plane, for $z\in\mathbb{C}$. Its radius of convergence is
$$r=\liminf_{i\to\infty}|X_i|^{-1/i}.$$
For any $x\ge 0$, the event $\{r\le x\}$ is, obviously, a tail event. This implies that $r=\text{const}$ with probability 1 when the $X_i$'s are independent. ⊓⊔
Next we will prove a stronger result under a more restrictive assumption that the
random variables (Xi )i 1 are not only independent, but also identically distributed.
A set $B\subseteq\mathbb{R}^{\mathbb{N}}$ is called symmetric if, for all $n\ge 1$ and all permutations $\pi$ of $\{1,\dots,n\}$,
$$(x_1,x_2,\dots)\in B\iff(x_{\pi(1)},\dots,x_{\pi(n)},x_{n+1},\dots)\in B.$$
In other words, the set $B$ is symmetric under the permutations of finitely many coordinates. We will call an event $A\in\sigma((X_i)_{i\ge 1})$ symmetric if it is of the form $\{(X_i)_{i\ge 1}\in B\}$ for some symmetric set $B$ in the cylindrical $\sigma$-algebra $\mathcal{B}^\infty$ on $\mathbb{R}^{\mathbb{N}}$. For example, any event in the tail $\sigma$-algebra $\mathcal{T}$ is symmetric.
Theorem 2.9 (Hewitt-Savage 0–1 law). If $(X_i)_{i\ge 1}$ are i.i.d. and $A\in\sigma((X_i)_{i\ge 1})$ is symmetric then $P(A)=0$ or $1$.
Proof. By the Approximation Lemma, for any e > 0, there exists n 1 and An 2
s (X1 , ..., Xn ) such that P(An 4 A) e. Of course, we can write this set as
that switches the first n coordinates with the second n coordinates. Denote X =
(Xi )i 1 and recall that A = {X 2 B} for some symmetric set B 2 RN that, by defi-
nition, satisfies GB = Gx : x 2 B = B. Now we can write
Random series. We already saw above that, by Kolmogorov's 0–1 law, the series $\sum_{i\ge 1}X_i$ for independent $(X_i)_{i\ge 1}$ converges with probability 0 or 1. This means that either $S_n = X_1+\dots+X_n$ converges to its limit $S$ with probability one, or with probability one it does not converge. We know that almost sure convergence is stronger than convergence in probability, so in the case when with probability one $S_n$ does not converge, is it still possible that it converges to some random variable in probability? The answer is no, because we will now prove that for random series convergence in probability implies almost sure convergence. We will need the following.
The equality in the middle holds because the events $\{|S_j|\ge x\}$ and $\{|S_n-S_j|<a\}$ are independent, since the first depends only on $X_1,\dots,X_j$ and the second only on $X_{j+1},\dots,X_n$. The last inequality holds because, if $|S_n-S_j|<a$ and $|S_j|\ge x$ then, by the triangle inequality, $|S_n|>x-a$.
To deal with the maximum, instead of looking at an arbitrary partial sum $S_j$, we will look at the first partial sum that crosses the level $x$. We define this first time by
$$\tau=\min\big\{j\le n : |S_j|\ge x\big\}$$
and let $\tau=n+1$ if all $|S_j|<x$. Notice that the event $\{\tau=j\}$ also depends only on $X_1,\dots,X_j$, so we can again write
Proof. The “only if” direction is obvious, so we only need to prove the “if” part. Since the sequence $M_n$ is decreasing, it converges to some limit $M_n\downarrow M\ge 0$ everywhere. Since for all $\varepsilon>0$,
$$P(M\ge\varepsilon)\le P(M_n\ge\varepsilon)\to 0\ \text{as}\ n\to\infty,$$
this means that $P(M=0)=1$ and $M_n\to 0$ almost surely. Of course, this implies that $Y_n\to Y$ almost surely. ⊓⊔
Proof. Suppose that the partial sums $S_n$ converge to some random variable $S$ in probability, i.e., for any $\varepsilon>0$, for large enough $n\ge n_0(\varepsilon)$ we have $P(|S_n-S|\ge\varepsilon)\le\varepsilon$. If $k\ge j\ge n\ge n_0(\varepsilon)$ then
Let us give one easy-to-check criterion for convergence of random series. Again,
we will need one auxiliary result.
$$\lim_{n,m\to\infty}P\big(|Y_n-Y_m|\ge\varepsilon\big)=0$$
Proof. Again, the “only if” direction is obvious and we only need to prove the “if” part. Given $\varepsilon=\ell^{-2}$, we can find $m(\ell)$ large enough such that, for $n,m\ge m(\ell)$,
$$P\Big(|Y_n-Y_m|\ge\frac{1}{\ell^2}\Big)\le\frac{1}{\ell^2}. \tag{2.4}$$
Without loss of generality, we can assume that $m(\ell+1)\ge m(\ell)$ so that
$$P\Big(|Y_{m(\ell+1)}-Y_{m(\ell)}|\ge\frac{1}{\ell^2}\Big)\le\frac{1}{\ell^2}.$$
Then,
$$\sum_{\ell\ge 1}P\Big(|Y_{m(\ell+1)}-Y_{m(\ell)}|\ge\frac{1}{\ell^2}\Big)\le\sum_{\ell\ge 1}\frac{1}{\ell^2}<\infty$$
and, by the Borel-Cantelli lemma,
$$P\Big(|Y_{m(\ell+1)}-Y_{m(\ell)}|\ge\frac{1}{\ell^2}\ \text{i.o.}\Big)=0.$$
As a result, for large enough (random) $\ell$ and for $k>\ell$,
$$|Y_{m(k)}-Y_{m(\ell)}|\le\sum_{i\ge\ell}\frac{1}{i^2}<\frac{1}{\ell-1}.$$
This means that, with probability one, $(Y_{m(\ell)})_{\ell\ge 1}$ is a Cauchy sequence and there exists an almost sure limit $Y=\lim_{\ell\to\infty}Y_{m(\ell)}$. Together with (2.4) this implies that $Y_n\to Y$ in probability. ⊓⊔
Since $r_i\to 0$, given $\varepsilon>0$, we can find $n_0$ such that for $i\ge n_0$ we have $|r_i|\le\varepsilon$ and $\big|\sum_{i=n_0+1}^n(b_{i+1}-b_i)r_i\big|\le\varepsilon b_n$. Therefore,
$$\Big|b_n^{-1}\sum_{i=1}^n x_i\Big|\le b_n^{-1}\Big|\sum_{i=1}^{n_0}(b_{i+1}-b_i)r_i\Big|+\varepsilon+b_n^{-1}b_1|r_0|+|r_n|.$$
probability 1 p. Show that the event {Sn = 0 i.o.} has probability 0 or 1. (Hint:
Hewitt-Savage 0 – 1 law.)
Exercise 2.3.2. In the setting of the previous problem, show: (a) if $p\ne 1/2$ then $P(S_n=0\ \text{i.o.})=0$; (b) if $p=1/2$ then $P(S_n=0\ \text{i.o.})=1$. Hint: use the fact that the events
$$\Big\{\liminf_{n\to\infty}S_n\le -\tfrac12\Big\},\qquad\Big\{\limsup_{n\to\infty}S_n\ge \tfrac12\Big\}$$
are symmetric.
Exercise 2.3.3. Given any sequence of random variables $(Y_i)_{i\ge 1}$, the event $A=\{\lim_{i\to\infty}Y_i\ \text{exists}\}$ and any $\varepsilon>0$, show that there exist integers $k,n,m\ge 1$ such that the event
$$A_{k,n,m}=\bigcap_{n\le i<j\le m}\Big\{|Y_j-Y_i|\le\frac1k\Big\}$$
approximates $A$ in the sense that $P(A\,\triangle\, A_{k,n,m})\le\varepsilon$. Hint: use continuity of measure and Cauchy's convergence test.
Exercise 2.3.4. Suppose that $(X_n)_{n\ge 1}$ are i.i.d. with $EX_1=0$ and $EX_1^2=1$. Prove that for $\delta>0$,
$$\frac{1}{n^{1/2}(\log n)^{1/2+\delta}}\sum_{i=1}^n X_i\to 0$$
almost surely. Hint: use Kolmogorov's strong law of large numbers.
Exercise 2.3.5. Let (Xn ) be i.i.d. random variables with continuous distribution F.
We say that Xn is a record value if Xn > Xi for i < n. Let In be the indicator of the
event that Xn is a record value.
(a) Show that the random variables (In )n 1 are independent and P(In = 1) = 1/n.
Hint: if Rn 2 {1, . . . , n} is the rank of Xn among the first n random variables
(Xi )in , prove that (Rn ) are independent.
(b) If Sn = I1 + . . . + In is the number of records up to time n, prove that
Sn / log n ! 1 almost surely. Hint: use Kolmogorov’s strong law of large num-
bers.
In this section, we will have our first encounter with two concepts, stopping times
and Markov property, in the setting of the sums of independent random variables.
Later, stopping times will play an important role in the study of martingales, and
Markov property will appear again in the setting of the Brownian motion.
Consider a sequence $(X_i)_{i\ge 1}$ of independent random variables and an integer-valued random variable $\tau\in\{1,2,\dots\}$. We say that $\tau$ is independent of the future if $\{\tau\le n\}$ is independent of $\sigma((X_i)_{i\ge n+1})$. Suppose that $\tau$ is independent of the future and $E|X_i|<\infty$ for all $i\ge 1$. We can formally write
$$ES_\tau=E\sum_{n\ge 1}X_nI(\tau\ge n)\overset{(*)}{=}\sum_{n\ge 1}EX_nI(\tau\ge n).$$
In $(*)$ we can interchange the order of summation if, for example, the double sequence is absolutely summable, by the Fubini-Tonelli theorem. Since $\tau$ is independent of the future, the event $\{\tau\ge n\}=\{\tau\le n-1\}^c$ is independent of $\sigma(X_n)$ and we get
$$ES_\tau=\sum_{n\ge 1}EX_n\,P(\tau\ge n). \tag{2.5}$$
This implies the following.
Theorem 2.14 (Wald’s identity). If (Xi )i 1 are i.i.d., E|X1 | < • and Et < •, then
ESt = EX1 Et.
The reason we can interchange the order of summation in (*) is because under our
assumptions the double sequence is absolutely summable since
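Wald's identity is easy to illustrate by simulation. In the sketch below (all names are ours) the walk has i.i.d. steps uniform on $\{0,1,2\}$, so $EX_1=1$, and is stopped when it first reaches level 10, a stopping time with finite expectation:

```python
import random

random.seed(0)

def run():
    """One run: walk with i.i.d. steps in {0, 1, 2}, stopped when S >= 10."""
    s, steps = 0, 0
    while s < 10:
        s += random.choice((0, 1, 2))
        steps += 1
    return s, steps

trials = [run() for _ in range(100_000)]
mean_S = sum(s for s, _ in trials) / len(trials)
mean_tau = sum(t for _, t in trials) / len(trials)
# Wald: E S_tau = E X_1 * E tau, and here E X_1 = 1
assert abs(mean_S - 1.0 * mean_tau) < 0.05
```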
Given a stopping time $\tau$, we would like to describe all events that depend on $\tau$ and the sequence $X_1,\dots,X_\tau$ up to this stopping time. A formal definition of the $\sigma$-algebra $\sigma_\tau$ generated by the sequence up to a stopping time $\tau$ is the following:
$$\sigma_\tau=\big\{A : A\cap\{\tau\le n\}\in\sigma(X_1,\dots,X_n)\ \text{for all}\ n\ge 1\big\}.$$
When $\tau$ is a stopping time, one can easily check that this is a $\sigma$-algebra, and the meaning of the definition is that, if we know that $\tau\le n$ then the corresponding part of the event $A$ is expressed only in terms of $X_1,\dots,X_n$.
Theorem 2.15 (Markov property). Suppose that $(X_i)_{i\ge 1}$ are i.i.d. and $\tau$ is a stopping time. Then the sequence $T_\tau=(X_{\tau+1},X_{\tau+2},\dots)$ is independent of the $\sigma$-algebra $\sigma_\tau$ and
$$T_\tau\overset{d}{=}(X_1,X_2,\dots),$$
where $\overset{d}{=}$ means the equality in distribution.
In words, this means that the sequence Tt = (Xt+1 , Xt+2 , . . .) after the stopping
time is an independent copy of the entire sequence, also independent of everything
that happens before the stopping time.
$$P\big(A\cap\{T_\tau\in B\}\big)=\sum_{n\ge 1}P\big(A\cap\{\tau=n\}\cap\{T_\tau\in B\}\big)=\sum_{n\ge 1}P\big(A\cap\{\tau=n\}\cap\{T_n\in B\}\big),$$
$$A\cap\{\tau=n\}=\big(A\cap\{\tau\le n\}\big)\setminus\big(A\cap\{\tau\le n-1\}\big)\in\sigma(X_1,\dots,X_n).$$
$$P\big(A\cap\{T_\tau\in B\}\big)=\sum_{n\ge 1}P\big(A\cap\{\tau=n\}\big)P\big(T_n\in B\big)=\sum_{n\ge 1}P\big(A\cap\{\tau=n\}\big)P\big(T_1\in B\big)=P(A)\,P\big(T_1\in B\big),$$
Let us give one interesting application of the Markov property and Wald’s iden-
tity that will yield another proof of the Strong Law of Large Numbers.
Theorem 2.16. Suppose that $(X_i)_{i\ge 1}$ are i.i.d. such that $EX_1>0$. If $Z=\inf_{n\ge 1}S_n$ then $P(Z>-\infty)=1$.
This means that the partial sums cannot drift down to $-\infty$ if the mean $EX_1>0$. Of course, this is obvious by the strong law of large numbers, but we want to prove it independently, since this will give another proof of the SLLN.
Proof. Let us define
$$\tau_1=\min\big\{k\ge 1 : S_k\ge 1\big\},\qquad Z_1=\min_{k\le\tau_1}S_k,\qquad S_k^{(2)}=S_{\tau_1+k}-S_{\tau_1}.$$
Therefore,
By Wald's identity,
if we can show that $E\tau_1<\infty$. This is left as an exercise below. We proved that $P(Z<-N)\to 0$ as $N\to\infty$, which, of course, implies that $P(Z>-\infty)=1$. ⊓⊔
This result can be used to give another proof of the Strong Law of Large Num-
bers.
Theorem 2.17. If (Xi )i 1 are i.i.d. and EX1 = 0 then Sn /n ! 0 almost surely.
Proof. Given $\varepsilon>0$ we define $X_i^\varepsilon=X_i+\varepsilon$ so that $EX_1^\varepsilon=\varepsilon>0$. By the above result, $\inf_{n\ge 1}(S_n+n\varepsilon)>-\infty$ with probability one. This means that for all $n\ge 1$, $S_n+n\varepsilon\ge M>-\infty$ for some random variable $M$. Dividing both sides by $n$ and letting $n\to\infty$ we get
$$\liminf_{n\to\infty}\frac{S_n}{n}\ge-\varepsilon$$
with probability one. We can then let $\varepsilon\downarrow 0$ over some sequence. Similarly, we prove that
$$\limsup_{n\to\infty}\frac{S_n}{n}\le 0$$
with probability one, which finishes the proof. ⊓⊔
Exercise 2.4.1. Let $(X_i)_{i\ge 1}$ be i.i.d. and $EX_1>0$. Given $a>0$, show that $E\tau<\infty$ for $\tau=\inf\{k\ge 1 : S_k>a\}$. (Hint: truncate the $X_i$'s and $\tau$ and use Wald's identity.)
Exercise 2.4.2. Let $S_0=0$, $S_n=\sum_{i=1}^n X_i$ be a random walk with i.i.d. $(X_i)$ such that $P(X_i=1)=p$, $P(X_i=-1)=1-p$ for $p>1/2$. Consider an integer $b\ge 1$ and let $\tau=\min\{n\ge 1 : S_n=b\}$. Show that for $0<s\le 1$,
$$Es^\tau=\Big(\frac{1-(1-4pqs^2)^{1/2}}{2qs}\Big)^b,\qquad q=1-p,$$
and compute $E\tau$.
Exercise 2.4.3. Suppose that we play a game with i.i.d. outcomes (Xn )n 1 such
that E|X1 | < •. If we play n rounds, we gain the largest of the first n outcomes. In
addition, to play each round we have to pay amount c > 0, so after n rounds our
total profit (or loss) is
Yn = max Xm cn.
1mn
In this problem we will find the best strategy to play the game, in some sense.
1. Given $a\in\mathbb{R}$, let $p=P(X_1>a)>0$ and consider the stopping time $T=\inf\{n : X_n>a\}$. Compute $EY_T$. (Hint: sum over sets $\{T=n\}$.)
2. Consider $\alpha$ such that $E(X_1-\alpha)^+=c$. For $a=\alpha$ show that $EY_T=\alpha$.
3. Show that $Y_n\le a+\sum_{m=1}^n\big((X_m-a)^+-c\big)$ (for any $a$, actually).
4. Use Wald's identity to conclude that for any stopping time $\tau$ such that $E\tau<\infty$ we have $EY_\tau\le\alpha$.
This means that stopping at time $T$ results in the best expected profit $\alpha$ among all stopping times $\tau$ with $E\tau<\infty$.
for all $i\le n$, for some constants $a_1,\dots,a_n$. This means that changing the $i$th coordinate of the function $f$ while keeping all the other coordinates fixed can change its value by not more than $a_i$. In particular, the function $f$ is bounded.
Theorem 2.18 (Azuma's inequality). If (2.6) holds then
$$P\big(f-Ef\ge t\big)\le\exp\Big(-\frac{t^2}{2\sum_{i=1}^n a_i^2}\Big) \tag{2.7}$$
for any $t\ge 0$.
Proof. For $i=1,\dots,n$, let $E_i$ denote the expectation in $X_{i+1},\dots,X_n$ with the random variables $X_1,\dots,X_i$ fixed. One can think of $(X_1,\dots,X_n)$ as defined on a product space with the product measure, and $E_i$ denotes the integration over the last $n-i$ coordinates. Let us denote $Y_i=E_if-E_{i-1}f$ and note that $E_nf=f$ and $E_0f=Ef$. Then we can write
$$f-Ef=\sum_{i=1}^n Y_i$$
(this is called the martingale-difference representation) and, as before, for $\lambda\ge 0$,
$$P\big(f-Ef\ge t\big)=P\Big(\sum_{i=1}^n Y_i\ge t\Big)\le e^{-\lambda t}\,Ee^{\lambda Y_1+\dots+\lambda Y_n}.$$
and, therefore,
$$Ee^{\lambda Y_1+\dots+\lambda Y_n}\le e^{(\lambda a_n)^2/2}\,Ee^{\lambda Y_1+\dots+\lambda Y_{n-1}}.$$
By induction on $n$, we get
$$Ee^{\lambda Y_1+\dots+\lambda Y_n}\le e^{\lambda^2\sum_{i=1}^n a_i^2/2}$$
and
$$P\Big(\sum_{i=1}^n Y_i\ge t\Big)\le\exp\Big(-\lambda t+\frac{\lambda^2}{2}\sum_{i=1}^n a_i^2\Big).$$
Optimizing over $\lambda\ge 0$ finishes the proof. ⊓⊔
Notice that in the above proof, we did not use the fact that Xi ’s are random vari-
ables. They could be random vectors or arbitrary random elements taking values in
some measurable spaces. We only used the assumption that they are independent.
Keeping this in mind, let us give one example of application of Azuma’s inequality.
Then, all ei, j are independent Bernoulli B(p) random variables, by the defini-
tion of the Erdős–Rényi random graph. For i = 2, . . . , n, let us denote by Xi =
(e1,i , e2,i . . . , ei 1,i ) the vector of indicators of edges between the vertex vi and ver-
tices v1 , . . . , vi 1 . The vectors X2 , . . . , Xn are independent because they consist of
indicators of disjoint sets of edges, and the chromatic number is a function
since these vectors include indicators of all edges in the graph. To apply Azuma's inequality, we need to determine the stability constants $a_2,\dots,a_n$. Notice that if we replace $X_i$ with another value $X_i'=(e_{1,i}',e_{2,i}',\dots,e_{i-1,i}')$, this means that we modify some edges between $v_i$ and $v_1,\dots,v_{i-1}$. The chromatic number cannot increase by more than 1, because we can always assign a new colour to the vertex $v_i$, so
and this shows that the stability condition (2.6) holds with $a_i=1$. Therefore, Azuma's inequality implies (when applied to both $f$ and $-f$)
$$P\big(|\chi(G(n,p))-E\chi(G(n,p))|\ge t\big)\le 2e^{-\frac{t^2}{2(n-1)}}.$$
For example, if we take $t=\sqrt{2n\log n}$, we get
$$P\big(|\chi(G(n,p))-E\chi(G(n,p))|\le\sqrt{2n\log n}\big)\ge 1-\frac2n,$$
so, with high probability, the chromatic number will be within $\sqrt{2n\log n}$ from its expected value $E\chi(G(n,p))$. It is known (but non-trivial) that this expected value
$$E\chi(G(n,p))\sim\frac{n}{2\log n}\log\frac{1}{1-p},$$
so the deviation $\sqrt{2n\log n}$ is of a smaller order compared to the expectation. This shows that the chromatic number is typically of the same order as its expectation, which gives us an example of the law of large numbers for a very non-trivial functional of independent random variables. ⊓⊔
Example 2.5.2 (Balls in boxes). Suppose that we throw n balls into m boxes at ran-
dom, so that the probability that a ball lands in any given box is 1/m, and indepen-
dently of each other. Let N be the number of non-empty boxes. If Xi 2 {1, . . . , m} is
the box number in which the ith ball lands then X1 , . . . , Xn are independent random
variables and N = card{X1 , . . . , Xn } is the number of distinct boxes hit.
It is easy to compute, writing $N=\sum_{i\le m}I(\text{box } \#i\ \text{not empty})$, that
$$EN=m\Big(1-\Big(1-\frac1m\Big)^n\Big). \tag{2.8}$$
When $n$ is large and $m=an$ for some fixed $a>0$ then
$$EN\sim na\big(1-e^{-1/a}\big).$$
Changing the value of one box $X_i$ changes $N$ by at most 1, so the stability constants are all equal to $a_i=1$. As in the previous example,
$$P\big(|N-EN|\ge t\big)\le 2e^{-\frac{t^2}{2n}}, \tag{2.9}$$
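Formula (2.8) and the concentration of $N$ around it are easy to test by simulation; the sketch below uses illustrative parameters and helper names of our choosing:

```python
import random

random.seed(1)

def empty_boxes_mean(n, m):
    """Exact E N from (2.8): m * (1 - (1 - 1/m)^n)."""
    return m * (1 - (1 - 1 / m) ** n)

def simulate(n, m):
    """Throw n balls into m boxes uniformly; count distinct boxes hit."""
    return len({random.randrange(m) for _ in range(n)})

n, m = 1000, 500
en = empty_boxes_mean(n, m)
avg = sum(simulate(n, m) for _ in range(1000)) / 1000
# the empirical average sits well inside the fluctuation scale from (2.9)
assert abs(avg - en) < 5
```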
Example 2.5.3 (Max-Cut of sparse random graph). This example is quite simi-
lar to the previous one, only balls and boxes have a different meaning and, instead
of the number of non-empty boxes, we consider a much more complicated func-
tion.
2.5 Azuma inequality 61
Let V = {v1 , . . . , vn } be the set of vertices of a graph and E be its set of edges.
Max-Cut problem is to divide vertices into two groups in a way that maximizes
the number of edges between the two groups. This is a very important problem in
computer science that has many applications, for example, to layout of electronic
circuitry, and as a reformulations of various combinatorial optimization problems.
For example, imagine that we have a group of n people and edges represent people
that dislike each other. Then our goal is to separate them into two groups in such a
way that as many ‘enemies’ as possible are in the opposite groups.
Here we will consider a specific model of a random graph, and will show that
the maximal number of edges between the two groups concentrates around its ex-
pectation. We will select edges randomly, but using a different procedure than in
the Erdős–Rényi graph. We will take
m = dn
possible edges for some fixed $d>0$ and, as usual, we think of $n$ as being large. Then we place each of these $m$ edges at random among the set of $\binom n2$ possible edges, independently of each other. Each pair of vertices (a possible location to place an edge) represents a box, and edges represent balls. The random variables $X_1,\dots,X_m$ that describe between which vertices each edge is placed take $\binom n2$ possible values and are all independent. This model produces what is called a sparse graph, because the number of edges $m=nd$ is small relative to $\binom n2$ and there is a fixed average number of edges per vertex, $d$.
It is possible that more than one edge is placed between two vertices, in which case
we keep one of them.
The function that we will consider on the above sparse random graph is called
Max-Cut. It is defined as follows. If we want to split all vertices into two groups,
one way to encode this is to assign each vertex one of the two labels { 1, 1}. Then
all vertices with the same label belong to the same group. In other words, each
vector
$$\sigma=(\sigma_1,\dots,\sigma_n)\in\{-1,1\}^n$$
describes a possible cut of the graph into two groups. Let $E(\sigma)$ be the number of present edges connecting the vertices in opposite groups,
$$E(\sigma)=\mathrm{card}\big\{i<j : e_{i,j}=1,\ \sigma_i\ne\sigma_j\big\}. \tag{2.10}$$
Then the value M of the Max-Cut corresponds to a way to cut the graph so that the
number of edges between the two groups is as large as possible,
which is exactly one half of all the present edges. Taking the expectations on both sides, we can write
$$EM\ge\frac12\sum_{i<j}P\big(\text{there is an edge between } v_i \text{ and } v_j\big)=\frac12\binom n2\Big(1-\Big(1-\frac{1}{\binom n2}\Big)^{dn}\Big),$$
where on the right hand side we have the analogue of (2.8). Using the inequality $1-x\le e^{-x}$, we can write
$$\Big(1-\frac{1}{\binom n2}\Big)^{dn}\le e^{-dn/\binom n2}\le e^{-2d/n}=1-\frac{2d}n+\frac{2d^2}{n^2}+O\Big(\frac1{n^3}\Big),$$
where in the last step we used Taylor's theorem for $e^x$ at zero. Plugging this in the above inequality, we get
$$EM\ge\frac12\binom n2\Big(\frac{2d}n-\frac{2d^2}{n^2}+O\Big(\frac1{n^3}\Big)\Big)=\frac{dn}2-\frac{d+d^2}2+O\Big(\frac1n\Big).$$
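The exact expression and this expansion can be compared numerically. Note that the chain of inequalities gives away roughly $d/2$ in the step replacing $e^{-dn/\binom n2}$ by $e^{-2d/n}$, which the sketch below (names are ours) accounts for in its tolerance:

```python
def expected_edges_bound(n, d):
    """Exact value of (1/2) C(n,2) (1 - (1 - 1/C(n,2))^{dn})."""
    N = n * (n - 1) // 2  # number of possible edge locations C(n,2)
    return 0.5 * N * (1 - (1 - 1 / N) ** (d * n))

n, d = 10_000, 2.0
exact = expected_edges_bound(n, d)
approx = d * n / 2 - (d + d * d) / 2  # dn/2 - (d + d^2)/2
# the exact value exceeds the lower bound by about d/2 for large n
assert 0 < exact - approx < d / 2 + 0.1
```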
is called the Hamming distance between x and y. The function r is called the
Hamming metric on {0, 1}n . We will now use Azuma’s inequality to show that,
given a subset A ✓ {0, 1}n that contains a positive proportion of all points,
the set of all points within Hamming distance $t$ from $A$. For $\varepsilon,\delta\in(0,1)$, let us denote
$$t_{\varepsilon,\delta}=\sqrt{2\log\frac1\varepsilon}+\sqrt{2\log\frac1\delta}.$$
Then the following lemma shows that the $t_{\varepsilon,\delta}\sqrt n$-neighbourhood of $A$ contains at least a $(1-\delta)$ proportion of all points in $\{0,1\}^n$.
Lemma 2.8. For $\varepsilon,\delta\in(0,1)$, if $\mathrm{card}(A)>\varepsilon 2^n$ then
$$\mathrm{card}\big(B(A,t_{\varepsilon,\delta}\sqrt n)\big)\ge(1-\delta)2^n. \tag{2.17}$$
$$P\big(X\in S\big)=\frac{\mathrm{card}(S)}{2^n}.$$
Let us consider a random variable $Z=\min_{y\in A}\rho(X,y)$ equal to the Hamming distance from $X$ to the set $A$. Changing one coordinate $X_i$ to $X_i'$ can change $Z$ by at most 1, so
$$P(Z=0)=P(X\in A)=\frac{\mathrm{card}(A)}{2^n}>\varepsilon,$$
by our assumption, which is a contradiction. Therefore,
$$EZ\le t_\varepsilon\sqrt n.$$
Exercise 2.5.3 (3-SAT). Suppose that a debate class has a large number of stu-
dents, n. The professor creates m = 2n debate teams, each composed of 3 members,
and he does it by assigning each student to 6 different teams.
Each member of each team is assigned to defend a position from the platform of
one of two major political parties, Party A or Party B. The professor does not know
the party affiliations of the students, and we suppose that each student belongs to
either Party A or Party B with probability 1/2 each, independently of each other.
Given their turn, each team will select only one of its members to defend his or
her assigned position. However, if it turns out that all three members have been
assigned positions of the opposing party (not the one to which they belong), the
team will pass.
Let N be the number of speeches delivered in class. What is the expectation
EN? Use Azuma’s inequality to show that, with high probability, the number of
speeches N will be relatively close to EN.
Exercise 2.5.4 (Independent set). Let $G(n,p)$ be the Erdős–Rényi random graph on the set of $n$ vertices $V = \{1,\dots,n\}$. A subset of vertices $V'\subseteq V$ in this graph is called an independent set if there are no edges between any two vertices in $V'$. Let $X$ be the largest cardinality of an independent set in $G(n,p)$. Show that
$$\mathbb{P}\bigl(|X-\mathbb{E}X| \ge t\bigr) \le 2\exp\Bigl(-\frac{t^2}{2n}\Bigr).$$
Chapter 3
Central Limit Theorem
This is often denoted by $P_n \xrightarrow{d} P$ or $P_n \Longrightarrow P$. Of course, the idea here is that the measure $P$ on the Borel $\sigma$-algebra is determined uniquely by the integrals of $f\in C_b(S)$, so we compare closeness of the measures by closeness of all these integrals. To see this, suppose that $\int f\,dP = \int f\,dQ$ for all $f\in C_b(S)$. Consider any open set $U$ in $S$ and let $F = U^c$. Using that $d(x,F)=0$ if and only if $x\in F$ (because $F$ is closed), it is easy to see that
On the real line, the convergence of probability measures can be expressed in terms of their c.d.f.s, as follows.
Proof. “$\Longrightarrow$” Suppose that (3.1) holds. Let us approximate the indicator $I(x\le t)$ by continuous functions $\varphi_1,\varphi_2\in C_b(\mathbb{R})$ so that
More carefully, we should write $\liminf$ and $\limsup$ but, since $t$ is a point of continuity of $F$, letting $\varepsilon\downarrow0$ proves that the limit $\lim_{n\to\infty}F_n(t)$ exists and is equal to $F(t)$.
“$\Longleftarrow$” Let $PC(F)$ be the set of points of continuity of $F$. Since $F$ is monotone, the set $PC(F)$ is dense in $\mathbb{R}$. Take $M$ large enough such that both $-M, M\in PC(F)$ and $P\bigl((-M,M]^c\bigr)\le\varepsilon$. Clearly, for large enough $n\ge1$ we have $P_n\bigl((-M,M]^c\bigr)\le2\varepsilon$. For any $k>1$, consider a sequence of points $-M = x_1^k \le x_2^k \le \dots \le x_k^k = M$ such that all $x_i^k\in PC(F)$ and $\max_i |x_{i+1}^k - x_i^k|\to0$ as $k\to\infty$. Given a function $f\in C_b(\mathbb{R})$, consider an approximating function
Since $f$ is continuous,
Our typical strategy for proving the convergence of probability measures will be
based on the following elementary observation.
Lemma 3.1. If for any sequence $(n(k))_{k\ge1}$ there exists a subsequence $(n(k(r)))_{r\ge1}$ such that $P_{n(k(r))}\to P$ weakly, then $P_n\to P$ weakly.
Proof. Suppose not. Then for some $f\in C_b(S)$ and some $\varepsilon>0$ there exists a subsequence $(n(k))$ such that
$$\Bigl|\int f\,dP_{n(k)} - \int f\,dP\Bigr| > \varepsilon.$$
But this contradicts the fact that for some subsequence $P_{n(k(r))}\to P$ weakly. □
Proof. Let $A = \{a_1,a_2,\dots\}$. Take $(n_1(k))$ such that $f_{n_1(k)}(a_1)$ converges. Take $(n_2(k))\subseteq(n_1(k))$ such that $f_{n_2(k)}(a_2)$ converges. Recursively, take $(n_\ell(k))\subseteq(n_{\ell-1}(k))$ such that $f_{n_\ell(k)}(a_\ell)$ converges. Now consider the diagonal sequence $(n_k(k))$. Clearly, $f_{n_k(k)}(a_\ell)$ converges for any $\ell$ because, for $k\ge\ell$, $n_k(k)\in\{n_\ell(k)\}$ by construction. □
Proof (of Theorem 3.2). We will prove the Selection Theorem for arbitrary metric
spaces, since this result will be useful to us later when we study the convergence
of laws on general metric spaces. However, when S = R one can see this in a much
more intuitive way, as follows.
(The case $S=\mathbb{R}$). Let $A$ be a dense set of points in $\mathbb{R}$. Given a sequence of probability measures $P_n$ on $\mathbb{R}$ and their c.d.f.s $F_n$, by Cantor's diagonalization there exists a subsequence $(n(k))$ such that $F_{n(k)}(a)\to G(a)$ for all $a\in A$. We then define $F(x) = \inf\{G(a) : x<a,\ a\in A\}$. Since, by construction, $G$ is non-decreasing on $A$, the function $F$ is also non-decreasing. Since
$$F(x) = \inf\bigl\{G(a) : x<a,\ a\in A\bigr\} = \inf_{x<y}\inf\bigl\{G(a) : y<a,\ a\in A\bigr\} = \inf_{x<y}F(y),$$
it is also right-continuous. The fact that the $P_n$ are uniformly tight ensures that $F(x)\to 0$ or $1$ as $x\to-\infty$ or $+\infty$, so $F$ is a cumulative distribution function. Although it is possible that $F(a)\ne G(a)$ for some $a\in A$, it is clear from the definition of $F$ that if $x$ is a point of continuity of $F$ then
$$\lim_{A\ni a\to x} G(a) = F(x). \qquad (3.2)$$
In order to prove weak convergence of $P_{n(k)}$ to the measure $P$ with the c.d.f. $F$, let $x$ be a point of continuity of $F$ and let $a,b\in A$ be such that $a<x<b$. We have
$$G(a) = \lim_{k\to\infty}F_{n(k)}(a) \le \liminf_{k\to\infty}F_{n(k)}(x) \le \limsup_{k\to\infty}F_{n(k)}(x) \le \lim_{k\to\infty}F_{n(k)}(b) = G(b).$$
Therefore, (3.2) implies that $F_{n(k)}(x)\to F(x)$ for all such $x$ and, by Theorem 3.1, this means that the laws $P_{n(k)}$ converge to $P$. Let us now prove the general case of arbitrary metric spaces.
(The general case). If $K$ is compact then, obviously, $C_b(K) = C(K)$. Later in these lectures, when we deal in more detail with convergence on general metric spaces, we will prove the following fact, which is well known and is a consequence of the Stone–Weierstrass theorem.
$$f,g\in\mathcal{A} \implies cf+g\in\mathcal{A} \text{ for all } c\in\mathbb{R}, \text{ and } f\wedge g,\ f\vee g\in\mathcal{A}.$$
$$f_n\downarrow 0, \qquad 0\le f_n(x)\le f_1(x)\le \|f_1\|_\infty.$$
Since
$$\int f_n\,dP_{n(k)} = \int_{K_r} f_n\,dP_{n(k)} + \int_{K_r^c} f_n\,dP_{n(k)} \le \varepsilon_{n,r} + \frac{1}{r}\|f_1\|_\infty,$$
we get
$$I(f_n) = \lim_{k\to\infty}\int f_n\,dP_{n(k)} \le \varepsilon_{n,r} + \frac{1}{r}\|f_1\|_\infty.$$
Letting $n\to\infty$ and $r\to\infty$, we get that $I(f_n)\to0$. By the Stone–Daniell theorem,
$$I(f) = \int f\,dP$$
for some unique measure $P$ on $\sigma(C_b(S))$. The choice of $f=1$ shows that $I(f) = 1 = P(S)$, which means that $P$ is a probability measure. Finally, let us show that $\sigma(C_b(S))$ is the Borel $\sigma$-algebra $\mathcal{B}$ generated by open sets. Since any $f\in C_b(S)$ is measurable on $\mathcal{B}$, we get $\sigma(C_b(S))\subseteq\mathcal{B}$. On the other hand, let $F\subseteq S$ be any closed set and take the function $f(x) = \min(1, d(x,F))$. We have $|f(x)-f(y)|\le d(x,y)$, so $f\in C_b(S)$ and
$$f^{-1}(\{0\})\in\sigma(C_b(S)).$$
However, since $F$ is closed, $f^{-1}(\{0\}) = \{x : d(x,F)=0\} = F$, and this proves that $\mathcal{B}\subseteq\sigma(C_b(S))$. □
Conversely, the following holds (we will prove this result later for any complete
separable metric space).
Proof. For any $\varepsilon>0$, there exists large enough $M>0$ such that $P(|x|>M)<\varepsilon$. Consider the function
$$\varphi(s) = \begin{cases} 0, & s\le M,\\ 1, & s\ge 2M,\\ (s-M)/M, & M\le s\le 2M.\end{cases}$$
For $n$ large enough, $n\ge n_0$, we get $P_n(|x|>2M)\le2\varepsilon$. For $n<n_0$ choose $M_n$ so that $P_n(|x|>M_n)\le2\varepsilon$. Take $M' = \max\{M_1,\dots,M_{n_0-1},2M\}$. As a result, $P_n(|x|>M')\le2\varepsilon$ for all $n\ge1$. □
Lemma 3.3. $X_n\to X$ in probability if and only if for any sequence $(n(k))$ there exists a subsequence $(n(k(r)))$ such that $X_{n(k(r))}\to X$ almost surely.
Proof. “$\Longleftarrow$”. Suppose $X_n$ does not converge to $X$ in probability. Then, for some small $\varepsilon>0$, there exists a subsequence $(n(k))_{k\ge1}$ such that $\mathbb{P}(d(X,X_{n(k)})\ge\varepsilon)\ge\varepsilon$. This contradicts the existence of a subsequence $X_{n(k(r))}$ that converges to $X$ almost surely.
“$\Longrightarrow$”. Given a subsequence $(n(k))$, let us choose $(k(r))$ so that
$$\mathbb{P}\Bigl(d(X_{n(k(r))},X)\ge\frac{1}{r}\Bigr)\le\frac{1}{r^2}.$$
By the Borel–Cantelli lemma, these events occur i.o. with probability $0$, which means that, with probability one, for large enough $r$,
$$d(X_{n(k(r))},X)\le\frac{1}{r},$$
i.e. $X_{n(k(r))}\to X$ almost surely. □
Proof. By Lemma 3.3, for any subsequence $(n(k))$ there exists a subsequence $(n(k(r)))$ such that $X_{n(k(r))}\to X$ almost surely. Given $f\in C_b(\mathbb{R})$, by the dominated convergence theorem, $\mathbb{E}f(X_{n(k(r))})\to\mathbb{E}f(X)$, i.e. $X_{n(k(r))}\to X$ weakly. By Lemma 3.1, $X_n\to X$ in distribution. □
Exercise 3.1.1. Let $X_n$ be random variables on the same probability space with values in a metric space $S$. If for some point $s\in S$, $X_n\to s$ in distribution, show that $X_n\to s$ in probability.
Exercise 3.1.4. Suppose that the random variables $(X_n)$ are independent and $X_n\to X$ in probability. Show that $X$ is almost surely constant.
Exercise 3.1.5. Suppose that random variables $X$ and $Y$ taking values in $[0,1]$ satisfy $\mathbb{E}X^n \ge \mathbb{E}Y^n$ for all integers $n\ge1$. Does this imply that $X$ is stochastically greater than $Y$, i.e. $\mathbb{P}(X\ge t)\ge\mathbb{P}(Y\ge t)$ for all $t$?
In the next section we will prove one of the most classical results in probability theory, the central limit theorem, and one of the main tools will be the so-called characteristic functions. Let $X = (X_1,\dots,X_k)$ be a random vector on $\mathbb{R}^k$ with the distribution $P$ and let $t = (t_1,\dots,t_k)\in\mathbb{R}^k$. The characteristic function of $X$ is defined by
$$f(t) = \mathbb{E}e^{i(t,X)} = \int e^{i(t,x)}\,dP(x).$$
In this section we will collect various important and useful facts about characteristic functions. The standard normal distribution $N(0,1)$ on $\mathbb{R}$ with the density
$$p(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$
will play a central role, so let us start by computing its characteristic function. First of all, notice that this is indeed a density, since
$$\Bigl(\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-x^2/2}\,dx\Bigr)^2 = \frac{1}{2\pi}\iint_{\mathbb{R}^2} e^{-(x^2+y^2)/2}\,dx\,dy = \frac{1}{2\pi}\int_0^{2\pi}\!\!\int_0^\infty e^{-r^2/2}\,r\,dr\,d\theta = 1.$$
where we denoted
$$\varphi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \quad\text{for } z\in\mathbb{C}.$$
Since $\varphi$ is analytic, by Cauchy's theorem the integral over a closed path is equal to $0$. Let us take the closed path consisting of $-it+x$ for $x$ from $-M$ to $M$, then $M+iy$ for $y$ from $-t$ to $0$, then $x$ from $M$ to $-M$ and, finally, $-M+iy$ for $y$ from $0$ to $-t$. For large $M$, the function $\varphi(z)$ is very small on the intervals $\pm M+iy$, so letting $M\to\infty$ we get
$$\int_{-it+\mathbb{R}}\varphi(z)\,dz = \int_{\mathbb{R}}\varphi(z)\,dz = 1,$$
which proves that $\mathbb{E}e^{itg} = e^{-t^2/2}$ for a standard normal random variable $g$.
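The identity $\mathbb{E}e^{itg} = e^{-t^2/2}$ derived above can be confirmed by direct numerical integration. A short Python sketch (the truncation at $|x|\le10$ and the step count are our own ad hoc choices); since the density is even, only the cosine part of $e^{itx}$ survives:

```python
import math

def normal_cf(t, half_width=10.0, steps=20001):
    """Trapezoidal approximation of E cos(t X) for X ~ N(0,1);
    the imaginary (sine) part vanishes by symmetry."""
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = -half_width + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0  # trapezoid endpoint weights
        total += w * math.cos(t * x) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)
```

At $t=0$ this recovers the total mass $1$, and at other $t$ it matches $e^{-t^2/2}$ to high accuracy.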
Lemma 3.5. If $X$ is a real-valued random variable such that $\mathbb{E}|X|^r<\infty$ for an integer $r\ge1$, then $f(t)\in C^r(\mathbb{R})$ and $f^{(j)}(t) = \mathbb{E}(iX)^j e^{itX}$ for $j\le r$.
For a partial converse, see an exercise below, where the existence of $f''(0)$ is shown to imply that $\mathbb{E}X^2<\infty$. One can similarly show that, for even $n$, the existence of $f^{(n)}(0)$ implies that $\mathbb{E}X^n<\infty$.
by the dominated convergence theorem. This means that $f\in C(\mathbb{R})$, i.e. characteristic functions are always continuous. If $r=1$, $\mathbb{E}|X|<\infty$, we can use
$$\Bigl|\frac{e^{itX}-e^{isX}}{t-s}\Bigr| \le |X|$$
and, therefore, by the dominated convergence theorem,
$$f'(t) = \lim_{s\to t}\mathbb{E}\frac{e^{itX}-e^{isX}}{t-s} = \mathbb{E}\,iXe^{itX}.$$
Also, by the dominated convergence theorem, $\mathbb{E}\,iXe^{itX}\in C(\mathbb{R})$, which means that $f\in C^1(\mathbb{R})$. We proceed by induction. Suppose that we have proved that $f^{(j)}(t) = \mathbb{E}(iX)^j e^{itX}$ and that $r=j+1$, $\mathbb{E}|X|^{j+1}<\infty$. Then we can use that
Next, we want to show that the characteristic function uniquely determines the distribution. This is usually proved using convolutions. Let $X$ and $Y$ be two independent random vectors on $\mathbb{R}^k$ with the distributions $P$ and $Q$. We denote by $P*Q$ the convolution of $P$ and $Q$, which is the distribution $\mathcal{L}(X+Y)$ of the sum $X+Y$. We have
$$P*Q(A) = \mathbb{E}I(X+Y\in A) = \iint I(x+y\in A)\,dP(x)\,dQ(y) = \iint I(x\in A-y)\,dP(x)\,dQ(y) = \int P(A-y)\,dQ(y).$$
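For discrete laws the convolution formula reduces to a finite sum over atoms, which is easy to sketch in code (the dice example below is our own illustration, not one from the text):

```python
from itertools import product

def convolve(P, Q):
    """Law of X + Y for independent X ~ P, Y ~ Q, with laws given
    as dicts mapping atoms to probabilities."""
    R = {}
    for (x, px), (y, qy) in product(P.items(), Q.items()):
        R[x + y] = R.get(x + y, 0.0) + px * qy
    return R

die = {i: 1.0 / 6 for i in range(1, 7)}
two_dice = convolve(die, die)   # law of the sum of two fair dice
```

For example, `two_dice[7]` equals $6/36$, the familiar probability that two fair dice sum to seven.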
$$P_\sigma = P * N(0,\sigma^2 I).$$
It turns out that $P_\sigma$ is always absolutely continuous and its density can be written in terms of the characteristic function of $P$.
This is the characteristic function of the normal distribution $N(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)$, so the claim follows from the Uniqueness Theorem. One can also prove this by a straightforward computation using the formula (3.7) for the density of the convolution, which is left as an exercise below. □
Lemma 3.6 gives some additional information when the characteristic function
is integrable.
Lemma 3.8 (Fourier inversion formula). If $\int |f(t)|\,dt<\infty$ then $P$ has density
$$p(x) = \Bigl(\frac{1}{2\pi}\Bigr)^k\int f(t)e^{-i(t,x)}\,dt.$$
Proof. Since
$$f(t)e^{-\frac12\sigma^2|t|^2}e^{-i(t,x)} \to f(t)e^{-i(t,x)}$$
pointwise as $\sigma\downarrow0$ and
$$\bigl|f(t)e^{-\frac12\sigma^2|t|^2}e^{-i(t,x)}\bigr| \le |f(t)|,$$
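The inversion formula can also be tested numerically in one dimension. As an illustrative assumption of our own (not an example from the text), take $f(t)=e^{-|t|}$, whose inverse transform should be the Cauchy density $1/(\pi(1+x^2))$:

```python
import math

def invert_cf(x, cf, cutoff=60.0, steps=120001):
    """Approximate p(x) = (1/(2*pi)) * integral of cf(t) * e^{-itx} dt
    by a trapezoidal rule; for an even real cf only the cosine survives."""
    h = 2.0 * cutoff / (steps - 1)
    total = 0.0
    for i in range(steps):
        t = -cutoff + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        total += w * cf(t) * math.cos(t * x)
    return total * h / (2.0 * math.pi)

p1 = invert_cf(1.0, lambda t: math.exp(-abs(t)))
cauchy1 = 1.0 / (math.pi * (1.0 + 1.0 ** 2))   # Cauchy density at x = 1
```

The two numbers agree to several decimal places, consistent with Exercise 3.2.9 below, where $e^{-|t|}$ appears as a characteristic function.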
Proof. For any sequence $(n(k))$, by the Selection Theorem, there exists a subsequence $(n(k(r)))$ such that $P_{n(k(r))}$ converges weakly to some distribution $P$. Since $e^{i(t,x)}$ is bounded and continuous,
$$\int e^{i(t,x)}\,dP_{n(k(r))}(x) \to \int e^{i(t,x)}\,dP(x)$$

where in the last inequality we used that $\frac{\sin y}{y}\le\sin 1<1$ for $y\ge1$. □
Theorem 3.5 (Levy's continuity theorem). Let $(X_n)$ be a sequence of random variables on $\mathbb{R}^k$. Suppose that
and $f(t)$ is continuous at $0$ along each axis. Then there exists a probability distribution $P$ such that
$$f(t) = \int e^{i(t,x)}\,dP(x)$$
for each coordinate: for large enough $n$. Since $|X_n|^2 = \sum_{i=1}^k|X_{n,i}|^2$, the union bound implies that
$$\mathbb{P}\bigl(|X_n| > \sqrt{k}\,M\bigr) \le 14k\varepsilon,$$
which means that the sequence $P_n = \mathcal{L}(X_n)$ is uniformly tight. □
Exercise 3.2.4. Suppose that $f(t) = \mathbb{E}e^{itX}$ is differentiable near $0$ and $f'(t)$ is differentiable at zero, i.e. $f''(0)$ is well defined. Prove that $\mathbb{E}X^2<\infty$. Hint: consider $\varphi(t) = |f(t)|^2 = f(t)f(-t)$, which is the characteristic function of $X_1 - X_2$ for two independent copies $X_1, X_2$ of $X$; apply l'Hôpital's rule to $(1-\varphi(t))/t^2$ and then use Fatou's lemma to show that $\mathbb{E}(X_1-X_2)^2<\infty$; conclude using Fubini's theorem.
Exercise 3.2.5. Prove that a probability distribution on $\mathbb{R}^k$ is uniquely determined by the probabilities of half-spaces.
Exercise 3.2.6. Prove that if $P*Q = P$ on $\mathbb{R}^k$ then $Q(\{0\}) = 1$. (Hint: use that a characteristic function is continuous, in particular in a neighborhood of zero.)
Exercise 3.2.7. Find the characteristic function of the law on $\mathbb{R}$ with density $\max(1-|x|,0)$. Then use the Fourier inversion formula to conclude that $\max(1-|t|,0)$ is a characteristic function of some distribution.
Exercise 3.2.8. Let $f$ be a continuous function on $\mathbb{R}$ with $f(0)=1$, $f(t)=f(-t)$ for all $t$, and such that $f$ restricted to $[0,\infty)$ is convex, with $f(t)\to0$ as $t\to\infty$. Show that $f$ is a characteristic function. (Hint: approximate the general case by piecewise linear functions. Then use that $\max(1-|t|,0)$ is a characteristic function and that characteristic functions are closed under convex combinations and scaling $f(t)\to f(at)$ for $a\in\mathbb{R}$.)
Exercise 3.2.9. Does there exist a random variable $X$ such that $\mathbb{E}e^{itX} = e^{-|t|}$?
We will see that this inequality is rather accurate for large sample size $n$, since we will show that
$$\lim_{n\to\infty}\mathbb{P}(|Z_n|\ge t) = \frac{2}{\sqrt{2\pi}}\int_t^\infty e^{-x^2/2}\,dx \overset{t\to\infty}{\sim} \frac{2}{t\sqrt{2\pi}}e^{-t^2/2}.$$
The asymptotic equivalence as $t\to\infty$ is a simple exercise, and the large-$n$ limit is a consequence of a more general result,
$$\lim_{n\to\infty}\mathbb{P}(Z_n\le z) = \Phi(z) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^z e^{-x^2/2}\,dx, \qquad (3.8)$$
$$Z_n = \frac{S_n-n\mu}{\sqrt{n\sigma^2}} = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{X_i-\mu}{\sigma}. \qquad (3.9)$$
Before we use characteristic functions to prove this result as well as its multivariate version, let us describe a more intuitive approach via the so-called Lindeberg method, which emphasizes a little more explicitly the main reason behind the central limit theorem, namely the stability property of the normal distribution proved in Lemma 3.7.
Theorem 3.6. Consider an i.i.d. sequence $(X_i)_{i\ge1}$ such that $\mathbb{E}X_1=\mu$, $\mathrm{Var}(X_1)=\sigma^2$ and $\mathbb{E}|X_1|^3<\infty$. Then the distribution of $Z_n$ defined in (3.9) converges weakly to the standard normal distribution $N(0,1)$.
Remark 3.1. One can easily modify the proof we give below to get rid of the unnecessary assumption $\mathbb{E}|X_1|^3<\infty$. This will appear as an exercise in the next section, where we will state and prove a more general version of Lindeberg's CLT for non-i.i.d. random variables.
Proof. First of all, notice that the random variables $(X_i-\mu)/\sigma$ have mean $0$ and variance $1$. Therefore, it is enough to prove the result for
$$Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$$
under the assumption that $\mathbb{E}X_1=0$, $\mathbb{E}X_1^2=1$. Let $(g_i)_{i\ge1}$ be independent standard normal random variables. Then, by the stability property,
$$Z = \frac{1}{\sqrt{n}}\sum_{i=1}^n g_i$$
also has the standard normal distribution $N(0,1)$. If, for $1\le m\le n+1$, we define
$$T_m = \frac{1}{\sqrt{n}}\bigl(g_1+\dots+g_{m-1}+X_m+\dots+X_n\bigr)$$
then, for any bounded function $f:\mathbb{R}\to\mathbb{R}$, we can write
$$\bigl|\mathbb{E}f(Z_n)-\mathbb{E}f(Z)\bigr| = \Bigl|\sum_{m=1}^n\bigl(\mathbb{E}f(T_m)-\mathbb{E}f(T_{m+1})\bigr)\Bigr| \le \sum_{m=1}^n\bigl|\mathbb{E}f(T_m)-\mathbb{E}f(T_{m+1})\bigr|.$$
If we denote
$$S_m = \frac{1}{\sqrt{n}}\bigl(g_1+\dots+g_{m-1}+X_{m+1}+\dots+X_n\bigr)$$
then $T_m = S_m + X_m/\sqrt{n}$ and $T_{m+1} = S_m + g_m/\sqrt{n}$. Suppose now that $f$ has a uniformly bounded third derivative. Then, by Taylor's formula,
and
Now, let us prove the CLT using the method of characteristic functions.
Theorem 3.7 (Central Limit Theorem). Consider an i.i.d. sequence $(X_i)_{i\ge1}$ such that $\mathbb{E}X_1=0$, $\mathbb{E}X_1^2=1$. Then the distribution of $Z_n = S_n/\sqrt{n}$ converges weakly to the standard normal distribution $N(0,1)$.
Proof. We begin by showing that the characteristic function of $S_n/\sqrt{n}$ converges to the characteristic function of the standard normal distribution $N(0,1)$,
$$\lim_{n\to\infty}\mathbb{E}e^{it\frac{S_n}{\sqrt{n}}} = e^{-\frac{t^2}{2}}.$$
Since $\mathbb{E}X_1^2<\infty$, Lemma 3.5 implies that the characteristic function $f(t)=\mathbb{E}e^{itX_1}\in C^2(\mathbb{R})$ and, therefore,
$$f(t) = f(0) + f'(0)t + \frac12 f''(0)t^2 + o(t^2) \quad\text{as } t\to0.$$
Since $f(0)=1$, $f'(0) = i\mathbb{E}X_1 = 0$ and $f''(0) = -\mathbb{E}X_1^2 = -1$, we get
$$f(t) = 1-\frac{t^2}{2}+o(t^2).$$
Finally,
$$\mathbb{E}e^{it\frac{S_n}{\sqrt{n}}} = \Bigl(f\Bigl(\frac{t}{\sqrt{n}}\Bigr)\Bigr)^n = \Bigl(1-\frac{t^2}{2n}+o\Bigl(\frac{t^2}{n}\Bigr)\Bigr)^n \to e^{-\frac12 t^2}$$
as $n\to\infty$. The result then follows from Levy's continuity theorem in the previous section. Alternatively, we could also use Lemma 3.9, since it is easy to check that
p
the sequence of laws (L (Sn / n))n 1 is uniformly tight. Indeed, by Chebyshev’s
inequality,
⇣ S ⌘ 1 ⇣S ⌘ 1
n n
P p > M 2 Var p = 2 < e
n M n M
for large enough M. t
u
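The theorem is easy to watch numerically. A Python sketch under our own choice of summands (uniform on $[-\sqrt3,\sqrt3]$, so that $\mathbb{E}X_1=0$, $\mathbb{E}X_1^2=1$): the empirical c.d.f. of $Z_n = S_n/\sqrt{n}$ at a point should approach $\Phi$ there.

```python
import math
import random

def clt_cdf_estimate(z, n=200, reps=10000, seed=1):
    """Monte Carlo estimate of P(S_n / sqrt(n) <= z) for i.i.d. uniform
    summands on [-sqrt(3), sqrt(3)] (mean 0, variance 1)."""
    rng = random.Random(seed)
    a = math.sqrt(3.0)
    hits = 0
    for _ in range(reps):
        s = sum(rng.uniform(-a, a) for _ in range(n))
        if s / math.sqrt(n) <= z:
            hits += 1
    return hits / reps

est = clt_cdf_estimate(0.5)
phi = 0.5 * (1.0 + math.erf(0.5 / math.sqrt(2.0)))  # Phi(0.5), about 0.69
```

With $n=200$ and $10{,}000$ repetitions, the estimate agrees with $\Phi(0.5)$ to within Monte Carlo error.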
Next, we will prove a multivariate version of the central limit theorem. For a random vector $X = (X_1,\dots,X_k)\in\mathbb{R}^k$, let $\mathbb{E}X = (\mathbb{E}X_1,\dots,\mathbb{E}X_k)$ denote its expectation and
$$\mathrm{Cov}(X) = \bigl(\mathbb{E}X_iX_j\bigr)_{1\le i,j\le k}$$
its covariance matrix (here the coordinates are centered). Again, the main step is to prove convergence of characteristic functions.
Theorem 3.8. Let $(X_i)_{i\ge1}$ be a sequence of i.i.d. random vectors on $\mathbb{R}^k$ such that $\mathbb{E}X_1=0$ and $\mathbb{E}|X_1|^2<\infty$. Then the distribution of $S_n/\sqrt{n}$ converges weakly to the distribution $P$ with the characteristic function
$$f_P(t) = e^{-\frac12(Ct,t)}, \qquad (3.10)$$
where $C = \mathrm{Cov}(X_1)$.
Proof. Consider any $t\in\mathbb{R}^k$. Then $Z_i = (t,X_i)$ are i.i.d. real-valued random variables and, by the central limit theorem on the real line,
$$\mathbb{E}\exp\Bigl(i\Bigl(t,\frac{S_n}{\sqrt{n}}\Bigr)\Bigr) = \mathbb{E}\exp\Bigl(\frac{i}{\sqrt{n}}\sum_{i=1}^n Z_i\Bigr) \to \exp\Bigl(-\frac{\mathrm{Var}(Z_1)}{2}\Bigr) = \exp\Bigl(-\frac12(Ct,t)\Bigr)$$
as $n\to\infty$, since $\mathrm{Var}(Z_1) = \mathbb{E}(t,X_1)^2 = (Ct,t)$, so we can use Lemma 3.9 to finish the proof. Alternatively, we can just apply Levy's continuity theorem. □
Since we know the characteristic function of the standard normal random variables $g_\ell$, the characteristic function of $Ag$ is
$$\mathbb{E}\exp i(t,Ag) = \mathbb{E}\exp i(A^Tt,g) = \mathbb{E}\prod_{\ell\le n}\exp\bigl(i(A^Tt)_\ell g_\ell\bigr) = \prod_{\ell\le n}\exp\bigl(-((A^Tt)_\ell)^2/2\bigr)$$
$$= \exp\Bigl(-\frac12|A^Tt|^2\Bigr) = \exp\Bigl(-\frac12 t^TAA^Tt\Bigr) = \exp\Bigl(-\frac12(Ct,t)\Bigr),$$
which means that $Ag$ has the distribution $N(0,C)$. On the other hand, given a symmetric non-negative definite matrix $C$, one can always find $A$ such that $C = AA^T$. For example, let $C = QDQ^T$ be its eigenvalue decomposition, for an orthogonal matrix $Q$ and a diagonal matrix $D$. Since $C$ is non-negative definite, the elements of $D$ are non-negative. Then one can take $n=k$ and $A = C^{1/2} := QD^{1/2}Q^T$ or $A = QD^{1/2}$. This means that any normal distribution can be generated as a linear transformation of a vector of i.i.d. standard normal random variables.
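In code, a factorization $C = AA^T$ is exactly what standard samplers use. A sketch for $k=2$ with a hand-rolled Cholesky factor (our own example matrix; the eigendecomposition choice $A = QD^{1/2}Q^T$ from the text would work equally well):

```python
import math
import random

def sample_gaussian_2d(C, n, seed=2):
    """Draw n samples of A g ~ N(0, C), where A is the Cholesky factor
    of the 2x2 covariance matrix C, i.e. C = A A^T."""
    a11 = math.sqrt(C[0][0])
    a21 = C[0][1] / a11
    a22 = math.sqrt(C[1][1] - a21 * a21)
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        pts.append((a11 * g1, a21 * g1 + a22 * g2))
    return pts

C = [[2.0, 0.6], [0.6, 1.0]]
pts = sample_gaussian_2d(C, 50000)
emp_cov12 = sum(x * y for x, y in pts) / len(pts)   # should be near 0.6
```

The empirical covariance of the samples reproduces the off-diagonal entry of $C$ up to Monte Carlo error.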
Density in the invertible case. Suppose $\det(C)\ne0$. Take any invertible $A$ such that $C = AA^T$, so that $Ag\sim N(0,C)$. Since the density of $g$ is
$$\prod_{\ell\le k}\frac{1}{\sqrt{2\pi}}\exp\Bigl(-\frac12 x_\ell^2\Bigr) = \Bigl(\frac{1}{\sqrt{2\pi}}\Bigr)^k\exp\Bigl(-\frac12|x|^2\Bigr),$$

Therefore, we get
$$\mathbb{P}(Ag\in\Omega) = \int_\Omega\Bigl(\frac{1}{\sqrt{2\pi}}\Bigr)^k\frac{1}{\sqrt{\det(C)}}\exp\Bigl(-\frac12 y^TC^{-1}y\Bigr)\,dy.$$
Consider
$$Z_n = \frac{S_n-\mathbb{E}S_n}{\sqrt{\mathrm{Var}(S_n)}} = \frac{1}{D_n}\sum_{i=1}^n\bigl(X_{ni}-\mu_{ni}\bigr). \qquad (3.12)$$
We have $\mathbb{E}Z_n=0$ and $\mathrm{Var}(Z_n)=1$, but the variance of each summand, $\sigma_{ni}^2/D_n^2$, is not necessarily $1/n$ as in the case of i.i.d. random variables. One condition that ensures that $Z_n$ satisfies the CLT is
$$\lim_{n\to\infty}\frac{1}{D_n^3}\sum_{i=1}^n\mathbb{E}\bigl|X_{ni}-\mu_{ni}\bigr|^3 = 0, \qquad (3.13)$$
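Condition (3.13) is simple to evaluate in concrete cases. A sketch for row sums of i.i.d. Bernoulli($p$) variables (our own illustrative choice, not an example from the text), where the ratio decays like $n^{-1/2}$:

```python
import math

def lyapunov_ratio(n, p=0.3):
    """The quantity in (3.13) for X_ni i.i.d. Bernoulli(p): the sum of
    E|X - p|^3 over i, divided by D_n^3 = (n p (1-p))^{3/2}."""
    third_moment = p * (1 - p) ** 3 + (1 - p) * p ** 3  # E|X - p|^3
    Dn = math.sqrt(n * p * (1 - p))
    return n * third_moment / Dn ** 3
```

Since the ratio is proportional to $n^{-1/2}$, quadrupling $n$ halves it exactly, and it tends to $0$, so (3.13) holds for Bernoulli row sums.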
Theorem 3.9 also follows directly from the next, more general result.
Theorem 3.10 (Lindeberg's CLT). For each $n\ge1$, consider a vector $(X_{ni})_{1\le i\le n}$ of independent random variables such that
$$\mathbb{E}X_{ni} = 0, \qquad \mathrm{Var}(S_n) = \sum_{i=1}^n\mathbb{E}(X_{ni})^2 = 1.$$

leave it as an exercise below. Also, notice that for i.i.d. random variables this implies the CLT without the assumption $\mathbb{E}|X_1|^3<\infty$, since in that case $X_{ni} = X_i/\sqrt{n}$.
Proof. First of all, the sequence $\mathcal{L}\bigl(\sum_{i\le n}X_{ni}\bigr)$ is uniformly tight, because by Chebyshev's inequality
$$\mathbb{P}\Bigl(\Bigl|\sum_{i\le n}X_{ni}\Bigr| > M\Bigr) \le \frac{1}{M^2} \le \varepsilon$$
for large enough $M$. It remains to show that the characteristic function of $S_n$ converges to $e^{-\frac{\lambda^2}{2}}$. For simplicity of notation, let us omit the upper index $n$ and write $X_i$ instead of $X_{ni}$. Since $\mathbb{E}e^{i\lambda S_n} = \prod_{i=1}^n\mathbb{E}e^{i\lambda X_i}$, it is enough to show that
$$\log\mathbb{E}e^{i\lambda S_n} = \sum_{i=1}^n\log\bigl(1+\bigl(\mathbb{E}e^{i\lambda X_i}-1\bigr)\bigr) \to -\frac{\lambda^2}{2}. \qquad (3.15)$$
We will leave it as an exercise at the end of the section. Using this for $m=1$,
$$\bigl|\mathbb{E}e^{i\lambda X_i}-1\bigr| = \bigl|\mathbb{E}e^{i\lambda X_i}-1-i\lambda\mathbb{E}X_i\bigr| \le \frac{\lambda^2}{2}\mathbb{E}X_i^2 \le \frac{\lambda^2}{2}\varepsilon^2 + \frac{\lambda^2}{2}\mathbb{E}X_i^2 I(|X_i|>\varepsilon) \le \lambda^2\varepsilon^2 \le \frac12 \qquad (3.17)$$
for large $n$ by (3.14) and for small enough $\varepsilon$. Using Taylor's expansion of $\log(1+z)$, it is easy to check that
$$\bigl|\log(1+z)-z\bigr| \le |z|^2 \quad\text{for } |z|\le\frac12$$
and, therefore, using this and (3.17),
$$\Bigl|\sum_{i=1}^n\Bigl(\log\bigl(1+\bigl(\mathbb{E}e^{i\lambda X_i}-1\bigr)\bigr) - \bigl(\mathbb{E}e^{i\lambda X_i}-1\bigr)\Bigr)\Bigr| \le \sum_{i=1}^n\bigl|\mathbb{E}e^{i\lambda X_i}-1\bigr|^2 \le \sum_{i=1}^n\frac{\lambda^4}{4}\bigl(\mathbb{E}X_i^2\bigr)^2 \le \frac{\lambda^4}{4}\max_{1\le i\le n}\mathbb{E}X_i^2\sum_{i=1}^n\mathbb{E}X_i^2 = \frac{\lambda^4}{4}\max_{1\le i\le n}\mathbb{E}X_i^2.$$
$$\bigl|e^{i\lambda X_i}-1-i\lambda X_i\bigr|\,I\bigl(|X_i|>\varepsilon\bigr) \le \frac{\lambda^2}{2}X_i^2\,I\bigl(|X_i|>\varepsilon\bigr)$$
and, therefore,
$$\Bigl|e^{i\lambda X_i}-1-i\lambda X_i+\frac{\lambda^2}{2}X_i^2\Bigr|\,I\bigl(|X_i|>\varepsilon\bigr) \le \lambda^2 X_i^2\,I\bigl(|X_i|>\varepsilon\bigr). \qquad (3.18)$$
Using (3.16) for $m=2$, on the event $\{|X_i|\le\varepsilon\}$,
$$\Bigl|e^{i\lambda X_i}-1-i\lambda X_i+\frac{\lambda^2}{2}X_i^2\Bigr|\,I\bigl(|X_i|\le\varepsilon\bigr) \le \frac{\lambda^3}{6}|X_i|^3 I\bigl(|X_i|\le\varepsilon\bigr) \le \frac{\lambda^3\varepsilon}{6}X_i^2. \qquad (3.19)$$
Combining the last two equations and using that $\mathbb{E}X_i=0$,
$$\Bigl|\mathbb{E}e^{i\lambda X_i}-1+\frac{\lambda^2}{2}\mathbb{E}X_i^2\Bigr| \le \lambda^2\mathbb{E}X_i^2 I\bigl(|X_i|>\varepsilon\bigr) + \frac{\lambda^3\varepsilon}{6}\mathbb{E}X_i^2.$$
Finally, this implies
$$\Bigl|\sum_{i=1}^n\bigl(\mathbb{E}e^{i\lambda X_i}-1\bigr)+\frac{\lambda^2}{2}\Bigr| = \Bigl|\sum_{i=1}^n\bigl(\mathbb{E}e^{i\lambda X_i}-1\bigr)+\frac{\lambda^2}{2}\sum_{i=1}^n\mathbb{E}X_i^2\Bigr| \le \lambda^2\sum_{i=1}^n\mathbb{E}X_i^2 I\bigl(|X_i|>\varepsilon\bigr) + \frac{\lambda^3\varepsilon}{6} \to \frac{\lambda^3\varepsilon}{6}$$
Theorem 3.11 (Multivariate Lindeberg's CLT). For each $n\ge1$, let $(X_{ni})_{1\le i\le n}$ be independent random vectors in $\mathbb{R}^k$ such that
$$\mathbb{E}X_{ni} = 0, \qquad \mathrm{Cov}(S_n) = \sum_{i=1}^n\mathrm{Cov}(X_{ni}) \to C \ \text{ as } n\to\infty.$$
Just like in the i.i.d. case, using characteristic functions, the proof is reduced to the one-dimensional case. One only needs to observe that, for any $t\in\mathbb{R}^k$, the random variables $(t,X_{ni})$ satisfy the condition (3.14), because $|(t,X_{ni})|\le|t||X_{ni}|$.
Three series theorem. We will now use Lindeberg's CLT above to prove the “three series theorem” for random series. Let us begin with the following observation. The condition $P*Q=P$ implies that $f_P(t)f_Q(t) = f_P(t)$. Since $f_P(0)=1$ and $f_P(t)$ is continuous, for small enough $|t|\le\varepsilon$ we have $|f_P(t)|>0$ and, as a result, $f_Q(t)=1$. Since
$$f_Q(t) = \int\cos(tx)\,dQ(x) + i\int\sin(tx)\,dQ(x),$$
for $|t|\le\varepsilon$ this implies that $\int\cos(tx)\,dQ(x)=1$ and, since $\cos(s)\le1$, this can happen only if

Take $s,t$ such that $|s|,|t|\le\varepsilon$ and $s/t$ is irrational. For $x$ to be in the support of $Q$ we must have $xs=2\pi k$ and $xt=2\pi m$ for some integers $k,m$. This can happen only if $x=0$. □
for $n\le k\le N$ for large enough $N$. Suppose not. Then there exist $\varepsilon>0$ and sequences $(n(\ell))$ and $(n'(\ell))$ such that $n(\ell)\le n'(\ell)$ and
Let us denote $Y_\ell = S_{n'(\ell)} - S_{n(\ell)}$. Since $\{\mathcal{L}(Y_\ell)\}$ is uniformly tight, by the selection theorem, there exists a subsequence $(\ell(r))$ such that $\mathcal{L}(Y_{\ell(r)})\to Q$. Since
as $n\to\infty$ for any fixed $m$. Intuitively, this should not happen, since $S_{mn}\to0$ in probability but their variance goes to infinity. In principle, one can construct such a sequence of random variables, but in our case it will be ruled out by Lindeberg's CLT, as follows. Let us show that, by Lindeberg's theorem,
$$T_{mn} = \frac{S_{mn}-\mathbb{E}S_{mn}}{\sigma_{mn}} = \sum_{m\le k\le n}\frac{Z_k-\mathbb{E}Z_k}{\sigma_{mn}}$$
converges in distribution to $N(0,1)$ if $m,n\to\infty$ in such a way that $\sigma_{mn}^2\to\infty$. We only need to check that
$$\sum_{m\le k\le n}\mathbb{E}\Bigl(\frac{Z_k-\mathbb{E}Z_k}{\sigma_{mn}}\Bigr)^2 I\Bigl(\Bigl|\frac{Z_k-\mathbb{E}Z_k}{\sigma_{mn}}\Bigr|>\varepsilon\Bigr) \to 0$$
Exercise 3.4.1. Let (Xn ) be i.i.d. random variables with continuous distribution F.
We say that Xn is a record value if Xn > Xi for i < n. If Sn is the number of records
up to time n, show that Zn in (3.12) satisfies the CLT.
Exercise 3.4.2. Consider a triangular array of i.i.d. Bernoulli random variables $X_{n1},\dots,X_{nn}\sim B(p_n)$, where $p_n$ varies with $n$. If $p_n\to0$ but $np_n\to+\infty$, show that $Z_n$ in (3.12) satisfies the CLT.
Exercise 3.4.4. Adapt the argument of Theorem 3.6 in the previous section to prove
Theorem 3.10. Hint: when using the Taylor expansion in that proof, mimic equa-
tions (3.18) and (3.19).
Exercise 3.4.5. Prove (3.16). (Hint: integrate the inequality to make the induction
step.)
Exercise 3.4.6. Consider the random series $\sum_{n\ge1}\varepsilon_n n^{-a}$ where $(\varepsilon_n)$ are i.i.d. random signs, $\mathbb{P}(\varepsilon_n=\pm1)=1/2$. Give the necessary and sufficient condition on $a$ for this series to converge.
Exercise 3.4.7. Let $(X_n)_{n\ge1}$ be i.i.d. random variables such that $\mathbb{E}X_1=0$ and $0<\mathbb{E}X_1^2<\infty$. Show that $\sum_{n\ge1}X_n/a_n$ converges almost surely if and only if $\sum_{n\ge1}1/a_n^2<\infty$.
$$\frac{x^{k-1}e^{-x/\theta}}{\theta^k\,\Gamma(k)}\,I(x>0).$$
Exercise 3.4.9. Let $(X_n)_{n\ge1}$ be i.i.d. random variables with the Poisson distribution with mean $\lambda=1$. Can you find $a_n$ and $b_n$ such that $a_n\sum_{\ell=1}^n X_\ell\log\ell - b_n$ converges in distribution to a standard Gaussian?
Exercise 3.4.10. Suppose that $X_k$ for $k\ge1$ are independent random variables and
$$\mathbb{P}\bigl(X_k=\pm\sqrt{k}\bigr) = \frac{\log k}{2k}, \qquad \mathbb{P}(X_k=0) = 1-\frac{\log k}{k}.$$
If $S_n = \sum_{k=1}^n X_k$ and $D_n^2 = \sum_{k=1}^n\mathrm{Var}(X_k)$, does the distribution of $S_n/D_n$ converge to $N(0,1)$?
Chapter 4
Metrics on Probability Measures
We will now study the class of so-called bounded Lipschitz functions on a metric space $(S,d)$, which will play an important role in the next section, when we deal with convergence of laws on metric spaces. For a function $f:S\to\mathbb{R}$, let us define the Lipschitz semi-norm by
$$\|f\|_L = \sup_{x\ne y}\frac{|f(x)-f(y)|}{d(x,y)}.$$
Let
$$\|f\|_{BL} = \|f\|_L + \|f\|_\infty,$$
and let $BL(S,d)$ be the set of all bounded Lipschitz functions on $(S,d)$. We will now prove several facts about these functions.
Lemma 4.1. If $f,g\in BL(S,d)$ then $fg\in BL(S,d)$ and $\|fg\|_{BL}\le\|f\|_{BL}\|g\|_{BL}$.
Proof. First of all, $\|fg\|_\infty\le\|f\|_\infty\|g\|_\infty$. We can write
and, therefore,
Proof. Let us first find an extension such that $\|h\|_L = \|f\|_L$. We will check that the following choice works,
which means that $h(x)=f(x)$ for $x\in A$. Next, by the triangle inequality,
which means that $\|h\|_L = \|f\|_L$. To extend preserving the bounded Lipschitz norm, take
$$h' = \bigl(h\wedge\|f\|_\infty\bigr)\vee\bigl(-\|f\|_\infty\bigr).$$
By part (1) of the previous lemma, we see that $\|h'\|_{BL} = \|f\|_{BL}$ and, clearly, $h'$ coincides with $f$ on $A$. □
To prove the next property of bounded Lipschitz functions, let us first recall the
following famous generalization of the Weierstrass theorem. We will give the proof
for convenience.
$$ag(x)+b = c, \qquad ag(y)+b = d$$
By construction, $h'(s)\le h(s)+\varepsilon$ and $h'(s)\ge h(s)-\varepsilon$ for all $s\in S$, which means that $d_\infty(h',h)\le\varepsilon$. Since $h'\in\mathcal{F}$, this proves that $\mathcal{F}$ is dense in $(C(S),d_\infty)$. □
The above results can now be combined to prove the following property of bounded Lipschitz functions.
Corollary 4.1. If $(S,d)$ is a compact space then the set of bounded Lipschitz functions $BL(S,d)$ is dense in $(C(S),d_\infty)$.
We will also need another well-known result from analysis. A set $A\subseteq S$ is totally bounded if for any $\varepsilon>0$ there exists a finite $\varepsilon$-cover of $A$, i.e. a set of points $a_1,\dots,a_N$ such that $A\subseteq\bigcup_{i\le N}B(a_i,\varepsilon)$, where $B(a,\varepsilon) = \{y\in S : d(a,y)\le\varepsilon\}$ is the ball of radius $\varepsilon$ centered at $a$.
Theorem 4.3 (Arzela–Ascoli). If $(S,d)$ is a compact metric space then a subset $\mathcal{F}\subseteq C(S)$ is totally bounded in the $d_\infty$ metric if and only if $\mathcal{F}$ is equicontinuous and uniformly bounded.
Equicontinuous means that for any $\varepsilon>0$ there exists $\delta>0$ such that if $d(x,y)\le\delta$ then, for all $f\in\mathcal{F}$, $|f(x)-f(y)|\le\varepsilon$. The following fact was used in the proof of the Selection Theorem, which was proved for general metric spaces.
Corollary 4.2. If $(S,d)$ is a compact space then $C(S)$ is separable in $d_\infty$.
Proof. By the above theorem, $BL(S,d)$ is dense in $C(S)$. For any integer $n\ge1$, the set $\{f : \|f\|_{BL}\le n\}$ is, obviously, uniformly bounded and equicontinuous. By the Arzela–Ascoli theorem, it is totally bounded and, therefore, separable, which can be seen by taking the union of finite $1/m$-covers for all $m\ge1$. The union
$$\bigcup_{n\ge1}\bigl\{f : \|f\|_{BL}\le n\bigr\} = BL(S,d)$$
Exercise 4.1.1. If $F$ is a finite set $\{x_1,\dots,x_n\}$ and $P$ is a law with $P(F)=1$, show that for any law $Q$ the supremum $\sup\bigl\{\int f\,d(P-Q) : \|f\|_L\le1\bigr\}$ can be restricted to functions of the form $f(x) = \min_{1\le i\le n}\bigl(c_i+d(x,x_i)\bigr)$.
Let $(S,d)$ be a metric space and $\mathcal{B}$ the Borel $\sigma$-algebra generated by open sets. Let us recall that $P_n\to P$ weakly if
$$\int f\,dP_n \to \int f\,dP$$
Now, letting $m\to\infty$ and using the monotone convergence theorem implies that
$$\liminf_{n\to\infty}P_n(U) \ge \int I(s\in U)\,dP(s) = P(U),$$
If $P(\partial A)=0$ then $P(\bar A) = P(\operatorname{int}A) = P(A)$ and, therefore, $\lim_{n\to\infty}P_n(A) = P(A)$.
(5) $\Longrightarrow$ (1). Consider $f\in C_b(S)$ and let $F_y = \{s\in S : f(s)=y\}$ be a level set of $f$. There exist at most countably many $y$ such that $P(F_y)>0$. Therefore, for any $\varepsilon>0$ we can find a sequence $a_1\le\dots\le a_N$ such that
$$\mu_n(A)(\omega) = \frac{1}{n}\sum_{i=1}^n I\bigl(X_i(\omega)\in A\bigr), \qquad A\in\mathcal{B}.$$
However, the set of measure zero where this convergence is violated depends on $f$, and it is not immediately obvious that the convergence holds for all $f\in C_b(S)$ with probability one. We will need the following lemma.
Lemma 4.3. If $(S,d)$ is separable then there exists a metric $e$ on $S$ such that $(S,e)$ is totally bounded, and $e$ and $d$ define the same topology, i.e. $e(s_n,s)\to0$ if and only if $d(s_n,s)\to0$.
Proof. First of all, it is easy to check that $c = d/(1+d)$ is a metric (that defines the same topology) taking values in $[0,1]$. If $\{s_n\}$ is a dense subset of $S$, consider the map
$$T(s) = \bigl(c(s,s_n)\bigr)_{n\ge1} : S \to [0,1]^{\mathbb{N}}.$$
The key point here is that $t_m\to t$ in $S$, or $\lim_{m\to\infty}c(t_m,t)=0$, if and only if $\lim_{m\to\infty}c(t_m,s_n) = c(t,s_n)$ for each $n\ge1$. In other words, $t_m\to t$ if and only if $T(t_m)\to T(t)$ in the Cartesian product $[0,1]^{\mathbb{N}}$ equipped with the product topology. This topology is compact and metrizable by the metric
$$\rho\bigl((u_n)_{n\ge1},(v_n)_{n\ge1}\bigr) = \sum_{n\ge1}2^{-n}|u_n-v_n|,$$
so we can now define the metric $e$ on $S$ by $e(s,s') = \rho(T(s),T(s'))$. Because of what we said above, this metric defines the same topology as $d$. Moreover, it is totally bounded, because the image $T(S)$ of our metric space $S$ under the map $T$ is totally bounded in $([0,1]^{\mathbb{N}},\rho)$, since this space is compact. □
Theorem 4.5 (Varadarajan). Let $(S,d)$ be a separable metric space. Then $\mu_n$ converges to $\mu$ weakly almost surely,
$$\mathbb{P}\bigl(\omega : \mu_n(\cdot)(\omega)\to\mu \text{ weakly}\bigr) = 1.$$
Proof. Let $e$ be the metric in the preceding lemma. Clearly, $C_b(S,d) = C_b(S,e)$ and, as we mentioned above, weak convergence of measures is not affected by this change of metric. If $(T,e)$ is the completion of $(S,e)$ then $(T,e)$ is compact. By the Arzela–Ascoli theorem, $BL(T,e)$ is separable with respect to the $d_\infty$ norm and, therefore, $BL(S,e)$ is also separable. Let $(f_m)$ be a dense subset of $BL(S,e)$. Then, by the strong law of large numbers,
$$\int f_m\,d\mu_n = \frac{1}{n}\sum_{i=1}^n f_m(X_i) \to \mathbb{E}f_m(X_1) = \int f_m\,d\mu \quad\text{a.s.}$$
Therefore, on a set of probability one, $\int f_m\,d\mu_n\to\int f_m\,d\mu$ for all $m\ge1$ and, since $(f_m)$ is dense in $BL(S,e)$, on the same set of probability one this convergence holds for all $f\in BL(S,e)$. By the portmanteau theorem, $\mu_n\to\mu$ weakly on this set of probability one. □
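Varadarajan's theorem rests on the SLLN statement above, which can be sketched numerically: for a fixed bounded Lipschitz $f$ and i.i.d. samples from $\mu$, the empirical integral converges. The choices $\mu = N(0,1)$ and $f(x)=\min(1,|x|)$ below are our own illustration:

```python
import math
import random

def empirical_integral(f, n, seed=3):
    """Integral of f against the empirical measure of n i.i.d. N(0,1) draws."""
    rng = random.Random(seed)
    return sum(f(rng.gauss(0.0, 1.0)) for _ in range(n)) / n

f = lambda x: min(1.0, abs(x))       # bounded Lipschitz, ||f||_BL <= 2
est = empirical_integral(f, 200000)
# exact value: E min(1, |g|) = 2*(phi(0) - phi(1)) + 2*(1 - Phi(1))
exact = (2.0 * (1.0 - math.exp(-0.5)) / math.sqrt(2.0 * math.pi)
         + 1.0 - math.erf(1.0 / math.sqrt(2.0)))
```

With $n = 200{,}000$ samples the empirical integral matches the exact expectation to within a few thousandths, as the strong law predicts.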
Next, we will introduce two metrics on the set of all probability measures on $(S,d)$ with the Borel $\sigma$-algebra $\mathcal{B}$ and, under some mild conditions, prove that they metrize weak convergence. For a set $A\subseteq S$, let us denote by
where the last inequality follows from the fact that $A^c \supseteq \bigl((A^\varepsilon)^c\bigr)^\varepsilon$:
Therefore, for the set $B = (A^\varepsilon)^c$, $Q(B) > P(B^\varepsilon)+\varepsilon$. This means that $\rho(Q,P)>\varepsilon$ and, therefore, $\rho(Q,P)\ge\rho(P,Q)$. By symmetry, $\rho(P,Q)\ge\rho(Q,P)$ and, therefore, $\rho(Q,P) = \rho(P,Q)$.
(2) Next, let us show that if $\rho(P,Q)=0$ then $P=Q$. For any set $F$ and any $n\ge1$,
$$P(F) \le Q\bigl(F^{1/n}\bigr) + \frac{1}{n}.$$
If $F$ is closed then $F^{1/n}\downarrow F$ as $n\to\infty$ and, by the continuity of measure,
$$P(F) \le \lim_{n\to\infty}Q\bigl(F^{1/n}\bigr) = Q(F).$$
Similarly, $Q(F)\le P(F)$ and, therefore, $P(F) = Q(F)$ for all closed sets and, therefore, for all Borel sets.
(3) Finally, let us prove the triangle inequality:
which means that $\rho(P,T)\le x+y$, and minimizing over $x,y$ proves the triangle inequality. □
Given probability distributions $P,Q$ on the metric space $(S,d)$, we define the bounded Lipschitz distance between them by
$$\beta(P,Q) = \sup\Bigl\{\Bigl|\int f\,dP - \int f\,dQ\Bigr| : \|f\|_{BL}\le1\Bigr\}.$$
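On a two-point space the supremum defining $\beta$ can be computed by brute force, which makes the definition concrete. The closed-form comparison $2D/(D+2)\,|p_0-q_0|$ below is our own computation for this toy case, not a formula from the text:

```python
def bl_distance_two_points(p0, q0, D, step=0.005):
    """beta(P, Q) for laws on {x0, x1} with d(x0, x1) = D, P({x0}) = p0,
    Q({x0}) = q0, via grid search over f = (f0, f1) with ||f||_BL <= 1."""
    best = 0.0
    m = int(round(2.0 / step))
    for i in range(m + 1):
        f0 = -1.0 + i * step
        for j in range(m + 1):
            f1 = -1.0 + j * step
            sup_norm = max(abs(f0), abs(f1))
            lip_norm = abs(f0 - f1) / D
            if sup_norm + lip_norm <= 1.0:
                val = abs(f0 * (p0 - q0) + f1 * ((1 - p0) - (1 - q0)))
                best = max(best, val)
    return best

beta = bl_distance_two_points(0.9, 0.4, D=1.0)
closed_form = 2.0 * 1.0 / (1.0 + 2.0) * abs(0.9 - 0.4)   # optimum at f0 = -f1
```

The optimal $f$ takes opposite values at the two points, balancing the supremum and Lipschitz parts of the norm; the grid search recovers this up to its resolution.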
Proof. It is obvious that $\beta(P,Q) = \beta(Q,P)$ and that the triangle inequality holds. It remains to prove that if $\beta(P,Q)=0$ then $P=Q$. Given a closed set $F$ and $U=F^c$, it is easy to see that
$$f_m(x) = m\,d(x,F)\wedge1 \uparrow I_U$$
Let us now show that on a separable metric space the metrics $\rho$ and $\beta$ metrize weak convergence. Before we prove this, let us recall the statement of Ulam's theorem proved in Theorem 1.5 in Section 1.1. Namely, every probability law $P$ on a complete separable metric space $(S,d)$ is tight, which means that for any $\varepsilon>0$ there exists a compact $K\subseteq S$ such that $P(S\setminus K)\le\varepsilon$.
Theorem 4.6. If $(S,d)$ is separable or $P$ is tight then the following are equivalent:
(1) $P_n\to P$ weakly.
(2) For all $f\in BL(S,d)$, $\int f\,dP_n\to\int f\,dP$.
(3) $\beta(P_n,P)\to0$.
(4) $\rho(P_n,P)\to0$.
Remark 4.1. We will prove the implications (1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (4) $\Rightarrow$ (1), and the assumption that $(S,d)$ is separable or $P$ is tight will be used only in one step, to prove (2) $\Rightarrow$ (3).
$$= P(A^\varepsilon) + \bigl(1+\varepsilon^{-1}\bigr)\beta(P_n,P) \le P(A^\delta) + \delta,$$
(4) ⇒ (1). Suppose that ρ(Pn, P) → 0, which means that there exists a sequence εn ↓ 0 such that
$$B = \big\{ f : \|f\|_{BL(S,d)} \le 1 \big\}, \qquad B_K = \big\{ f|_K : f \in B \big\} \subseteq C(K),$$
Finally,
$$\beta(P_n, P) = \sup_{f \in B} \Big| \int f\, dP_n - \int f\, dP \Big| \le \max_{1 \le j \le k} \Big| \int f_j\, dP_n - \int f_j\, dP \Big| + 12\varepsilon,$$
and, using the assumption (2), lim sup_{n→∞} β(Pn, P) ≤ 12ε. Letting ε → 0 finishes the proof.
Now, suppose that (S, d) is separable but not complete. Let (T, d) be the comple-
tion of (S, d). We showed in the previous section that every function f 2 BL(S, d)
can be extended to fˆ on (T, d) preserving the norm k f kBL . This extension is obvi-
ously unique, since S is dense in T , so there is one-to-one correspondence f $ fˆ
between the unit balls of BL(S, d) and BL(T, d). Next, we can also view the measure P as a probability measure on (T, d) as follows. If for any Borel set A in (T, d) we let P̂(A) = P(A ∩ S) (and define P̂n similarly), then P̂ is a probability measure on the Borel σ-algebra on (T, d). Note that S might not be a measurable set in its completion (because of this, there is no natural correspondence between measures on S and measures on its completion T), but one can easily check that the sets A ∩ S are Borel in (S, d) (this is because open sets in S are intersections of open sets in T with S).
with S). One can also easily see that, with this definition,
Z Z
f dP = fˆ d P̂,
S T
which means that both statement in (2) and (3) hold simultaneously on (S, d) and
(T, d). This proves that (2) ) (3) on all separable metric spaces. t
u
Theorem 4.8. Let (S, d) be a complete separable metric space and P be a set of probability laws on S. Then the following are equivalent.
(1) P is uniformly tight.
(2) For any sequence Pn ∈ P there exists a converging subsequence Pn(k) → P, where P is a law on S.
(3) P has compact closure in the space of probability laws equipped with the Levy-Prohorov metric ρ or the bounded Lipschitz metric β.
(4) P is totally bounded with respect to ρ or β.
Remark 4.2. In other words, we are showing that, on complete separable metric spaces, total boundedness in the space of probability measures is equivalent to uniform tightness. The rest is just basic properties of metric spaces. Also, the implications (1) ⇒ (2) ⇒ (3) ⇒ (4) hold without the completeness assumption, and the only implication where completeness will be used is (4) ⇒ (1).
Proof. (1) ⇒ (2). Any sequence Pn ∈ P is uniformly tight and, by the Selection Theorem, there exists a converging subsequence.
(2) ⇒ (3). Since (S, d) is separable, by Theorem 4.6, Pn → P if and only if ρ(Pn, P) → 0, or equivalently β(Pn, P) → 0. Every element of the closure of P can be approximated by a sequence in P. That sequence has a converging subsequence that, obviously, converges to an element of the closure, which means that the closure of P is compact.
(3) =) (4). Compact sets are totally bounded and, therefore, if the closure P is
compact, the set P is totally bounded.
(4) ⇒ (1). Since ρ ≤ 2√β, we will only deal with ρ. For any ε > 0, there exists a finite subset P₀ ⊆ P such that P ⊆ P₀^ε. Since (S, d) is complete and separable, by Ulam's theorem, for each Q ∈ P₀ there exists a compact K_Q such that Q(K_Q) > 1 − ε. Therefore,
$$K = \bigcup_{Q \in P_0} K_Q \ \text{ is compact and } \ Q(K) > 1 - \varepsilon \ \text{ for all } Q \in P_0.$$
Let F be a finite set such that K ⊆ F^ε (here we will denote by F^ε the closed ε-neighborhood of F). Since P ⊆ P₀^ε, for any P ∈ P there exists Q ∈ P₀ such that ρ(P, Q) < ε and, therefore,
Thus, 1 − 2ε ≤ P(F^{2ε}) for all P ∈ P. Given δ > 0, take ε_m = δ/2^{m+1} and find F_m as above, i.e.
$$1 - \frac{\delta}{2^m} \le P\big( F_m^{\delta/2^m} \big).$$
Then
$$P\Big( \bigcap_{m \ge 1} F_m^{\delta/2^m} \Big) \ge 1 - \sum_{m \ge 1} \frac{\delta}{2^m} = 1 - \delta.$$
4.2 Convergence of laws on metric spaces 111
Finally, L = ∩_{m≥1} F_m^{δ/2^m} is compact, because it is closed and totally bounded by construction, and S is complete. □
Exercise 4.2.1. Prove that the space of probability laws Pr(S) on a compact metric space (S, d) is compact with respect to the metrics ρ and β.
Exercise 4.2.2. If (S, d) is separable, prove that Pr(S) is also separable with respect to the metrics ρ and β. Hint: think about probability measures concentrated on finite subsets of a dense countable set in S, and use the metric β.
Exercise 4.2.3. Let (S, d) be a metric space and B its Borel σ-algebra. Let Pr(S) be the set of all probability measures on (S, B) equipped with the topology of weak convergence. Let F be the Borel σ-algebra on Pr(S) generated by open sets. Consider the map T : Pr(S) × B → [0, 1] defined by T(P, A) = P(A). Prove that, for any fixed A ∈ B, the function T(·, A) : Pr(S) → [0, 1] is F-measurable. Hint: first prove this for open sets A ⊆ S.
Exercise 4.2.5. Define a specific finite set of laws F on [0, 1] such that for every law P on [0, 1] there exists Q ∈ F with ρ(P, Q) < 0.1, where ρ is Prokhorov's metric. Hint: reduce the problem to the metric β.
Exercise 4.2.6. Let X_j be i.i.d. N(0, 1) random variables. Let H be a Hilbert space with orthonormal basis {e_j}_{j≥1}. Let X = Σ_{j≥1} X_j e_j / j. For any ε > 0, find a compact K such that P(X ∈ K) > 1 − ε.
112 4 Metrics on Probability Measures
is called the Ky Fan metric on the set L⁰(Ω, S) of equivalence classes of such random variables, where two random variables are equivalent if they are equal almost surely. If we take a sequence ε_k ↓ α = α(X, Y) then P(d(X, Y) > ε_k) ≤ ε_k and, since I(d(X, Y) > ε_k) ↑ I(d(X, Y) > α), by the monotone convergence theorem, P(d(X, Y) > α) ≤ α. Thus, the infimum in the definition of α(X, Y) is attained.
Proof. First of all, clearly, α(X, Y) = 0 if and only if X = Y almost surely. To prove the triangle inequality,
so that α(X, Z) ≤ α(X, Y) + α(Y, Z). This proves that α is a metric. Next, if α_n = α(X_n, X) → 0 then, for any ε > 0 and n large enough that α_n < ε,
Lemma 4.7. For X, Y ∈ L⁰(Ω, S), the Levy-Prohorov metric ρ satisfies ρ(L(X), L(Y)) ≤ α(X, Y).
which means that ρ(L(X), L(Y)) ≤ ε. Letting ε ↓ α(X, Y) proves the result. □
We will now prove that, in some sense, the opposite is also true. Let (S, d) be a metric space and P, Q be probability laws on S. Suppose that these laws are close in the Levy-Prohorov metric ρ. Can we construct random variables s₁ and s₂, with laws P and Q, that are defined on the same probability space and are close to each other in the Ky Fan metric α? We will construct a distribution on the product space S × S such that the coordinates s₁ and s₂ have marginal distributions P and Q, the distribution is concentrated on a neighbourhood of the diagonal s₁ = s₂, and the size of the neighbourhood is controlled by ρ(P, Q).
The following result will be a key tool in the proof of the main result of this
section. Consider two sets X and Y. Given a subset K ✓ X ⇥Y and A ✓ X we define
a K-image of A by
AK = y 2 Y : 9x 2 A, (x, y) 2 K .
A K-matching f of X into Y is a one-to-one function (injection) f : X ! Y such
that (x, f (x)) 2 K. We will need the following well-known matching theorem.
Theorem 4.10 (Hall's marriage theorem). If X, Y are finite and, for all A ⊆ X, card(A^K) ≥ card(A), then there exists a K-matching of X into Y.
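Hall's theorem is constructive in the finite case: a K-matching can be found by the standard augmenting-path (Kuhn) algorithm. The sketch below is our own illustration, not the construction used in the text; K is encoded as a set of admissible pairs (x, y).

```python
def k_matching(X, Y, K):
    """Return a K-matching of X into Y as a dict x -> y, or None if none exists
    (by Theorem 4.10, none exists iff Hall's condition fails for some A)."""
    match = {}  # y -> x currently matched to y

    def try_assign(x, seen):
        for y in Y:
            if (x, y) in K and y not in seen:
                seen.add(y)
                # y is free, or its current partner can be re-matched elsewhere
                if y not in match or try_assign(match[y], seen):
                    match[y] = x
                    return True
        return False

    for x in X:
        if not try_assign(x, set()):
            return None
    return {x: y for y, x in match.items()}

# e.g. points of M matchable only to points of N within distance a + eps,
# as in Case A of the proof of Strassen's theorem below:
K = {(0, 0), (0, 1), (1, 0), (2, 2)}
print(k_matching([0, 1, 2], [0, 1, 2], K))
```

The recursion re-routes previously matched points when possible, which is exactly why Hall's condition is sufficient and not just necessary.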
Theorem 4.11 (Strassen). Suppose that (S, d) is a separable metric space and a, b > 0. Suppose the laws P and Q are such that, for all measurable sets F ⊆ S,
$$P(F) \le Q(F^{a}) + b. \qquad (4.4)$$
Then for any ε > 0 there exist two non-negative measures η, γ on S × S such that
Remark 4.3. In the above statement, it is enough to assume that (4.4) holds only for closed sets, or only for open sets F; moreover, one can replace the open a-neighbourhood F^a = {s ∈ S : d(s, F) < a} by the closed a-neighbourhood F^{a]} = {s ∈ S : d(s, F) ≤ a}. This is because the set F^ε is open and F^{ε]} is closed and, for example, the condition (4.4) for closed sets implies
$$P(F) \le P(F^{\varepsilon]}) \le Q\big( (F^{\varepsilon]})^{a} \big) + b \le Q(F^{a + 2\varepsilon}) + b$$
for all measurable sets F, which simply replaces a by a + 2ε in (4.4).
Proof. Case A. The proof will proceed in several steps. We will start with the simplest case which, however, contains the main idea. Given small ε > 0, take n ≥ 1 such that nε > 1. Suppose that the laws P, Q are uniform on finite subsets M, N ⊆ S of equal cardinality,
$$\mathrm{card}(M) = \mathrm{card}(N) = n, \qquad P(\{x\}) = Q(\{y\}) = \frac{1}{n} < \varepsilon, \quad x \in M,\ y \in N.$$
Using the condition (4.4), we would like to match as many points from M and N
as possible, but only points that are within distance a from each other. To use the
matching theorem, we will introduce some auxiliary sets U and V that are not too
big, with size controlled by the parameter b , and the union of these sets with M
and N satisfies a certain matching condition.
Take an integer k such that bn ≤ k < (b + ε)n. Let us take sets U and V such that k = card(U) = card(V) and U, V are disjoint from M, N. Define
$$X = M \cup U, \qquad Y = N \cup V.$$
Let us define a subset K ⊆ X × Y such that (x, y) ∈ K if and only if one of the following holds:
(i) x ∈ U,
(ii) y ∈ V,
(iii) d(x, y) ≤ a + ε, if x ∈ M and y ∈ N.
This means that the small auxiliary sets can be matched with any points, but only close points, with d(x, y) ≤ a + ε, can be matched within the main sets M and N. Consider a set
4.3 Strassen’s theorem. Relationships between metrics 115
since k = card(V) and A^K = V ∪ (A^K ∩ N). By the matching theorem, there exists a K-matching f of X into Y. Let
$$T = \{ x \in M : f(x) \in N \},$$
and let μ = η + γ. First of all, μ obviously has marginals P and Q, because each point in M or N appears in the sum η + γ only once, with weight 1/n. Also,
$$\eta\big( d(x, f(x)) > a + \varepsilon \big) = 0, \qquad \gamma(S \times S) \le \frac{\mathrm{card}(M \setminus T)}{n} \le \frac{k}{n} < b + \varepsilon. \qquad (4.5)$$
Finally, both η and γ are finite sums of point masses, which are product measures of point masses.
Case B. Suppose now that P and Q are concentrated on finitely many points with
rational probabilities. Then we can artificially split all points into “smaller” points
of equal probabilities as follows. Let n be such that nε > 1 and
$$P'(x, i) = \frac{1}{n} \quad \text{for } i = 1, \ldots, j.$$
Define Q′ similarly. Let us check that the laws P′, Q′ satisfy the assumptions of Case A with a + ε instead of a. Given a set F ⊆ S × J, define
Using (4.4),
$$P'(F) \le P(F_1) \le Q(F_1^{a}) + b \le Q'(F^{a + \varepsilon}) + b,$$
because f(i, j) < ε. By Case A and (4.5), we can construct μ′ = η′ + γ′ with marginals P′ and Q′ such that
we get
$$\eta\big( d(x, y) > a + 2\varepsilon \big) \le \eta'\big( e((x, i), (y, j)) > a + 2\varepsilon \big) = 0.$$
Finally, μ is obviously a finite sum of product measures. Of course, we can replace ε by ε/2.
Case C (general case). Let P, Q be laws on a separable metric space (S, d). Let A be a maximal set such that d(x, y) ≥ ε for all distinct x, y ∈ A. Such a set A is called an ε-packing, and it is countable because S is separable, so A = {x_i}_{i≥1}. Since A is maximal, for each x ∈ S there exists y ∈ A such that d(x, y) < ε; otherwise, we could add x to the set A. Let us create a partition of S using ε-balls around the points x_i:
$$P'(F) \le P(F^{\varepsilon}).$$
All the relations above also hold true for Q, Q′ and Q″ that are defined similarly, and we can assume that the same point x₀ plays the role of the auxiliary point for both P″ and Q″. Given F ⊆ S, we can write
Let us also "move" the points (x₀, x_i) and (x_i, x₀) for i ≥ 0 into the support of γ″: since the total weight of these points is at most ε, the total weight of γ″ does not increase much:
$$\gamma''(S \times S) \le b + 5\varepsilon.$$
It remains to "redistribute" these measures from the sequence {x_i}_{i≥0} to the entire space S in a way that recovers the marginal distributions P and Q and does not lose much accuracy. Define a sequence of measures on S by
$$P_i(C) = \frac{P(C \cap B_i)}{P(B_i)} \ \text{ if } P(B_i) > 0, \quad \text{and } P_i(C) = 0 \ \text{otherwise},$$
and define Q_i similarly. The measures P_i and Q_i are concentrated on B_i. Define
$$\eta = \sum_{i, j \ge 1} \eta''(x_i, x_j)\, (P_i \times Q_j).$$
$$P(A) \le Q(A^{\rho + \varepsilon}) + \rho + \varepsilon.$$
then, by the definition of the Ky Fan metric, α(X, Y) ≤ ρ + 2ε. If P and Q are tight then there exists a compact K such that P(K), Q(K) ≥ 1 − δ. For ε = 1/n, find μn as in (4.6). Since μn has marginals P and Q, μn(K × K) ≥ 1 − 2δ, which means that the sequence (μn)_{n≥1} is uniformly tight. By the Selection Theorem, there exists a converging subsequence μ_{n(k)} → μ. Obviously, μ again has marginals P and Q. Since, by construction,
$$\mu_n\Big( d(x, y) > \rho + \frac{2}{n} \Big) \le \rho + \frac{2}{n}$$
and {d(x, y) > ρ + ε} is an open set in S × S, by the Portmanteau Theorem,
$$\mu\big( d(x, y) > \rho + \varepsilon \big) \le \liminf_{k \to \infty} \mu_{n(k)}\big( d(x, y) > \rho + \varepsilon \big) \le \rho.$$
This also implies the following relationship between the bounded Lipschitz metric β and the Levy-Prohorov metric ρ.
Proof. We already proved the second inequality. To prove the first one, by the previous theorem, given ε > 0, we can find random variables X and Y with the distributions P and Q such that α(X, Y) ≤ ρ + ε. Consider a bounded Lipschitz function f with ‖f‖_BL < ∞. Then
$$\Big| \int f\, dP - \int f\, dQ \Big| = |\mathrm{E} f(X) - \mathrm{E} f(Y)| \le \mathrm{E}|f(X) - f(Y)|$$
$$\le \|f\|_L (\rho + \varepsilon) + 2\|f\|_{\infty}\, P\big( d(X, Y) > \rho + \varepsilon \big) \le \|f\|_L (\rho + \varepsilon) + 2\|f\|_{\infty} (\rho + \varepsilon) \le 2\|f\|_{BL} (\rho + \varepsilon).$$
Exercise 4.3.1. Show that the inequality β(P, Q) ≤ 2ρ(P, Q) is sharp in the following sense: for any t ∈ (0, 1) and ε > 0, one can find laws P and Q on R such that ρ(P, Q) = t and β(P, Q) ≥ 2t − ε. Hint: let P, Q have atoms of size t at points far apart, with P, Q otherwise the same.
Exercise 4.3.2. Suppose that X, Y and U are random variables on some probability space such that P(X ∈ A) ≤ P(Y ∈ A^δ) + ε for all measurable sets A ⊆ R, and U is independent of (X, Y) and has the uniform distribution on [0, 1]. Prove that there exists a measurable function f : R × [0, 1] → R such that P(|X − f(X, U)| > δ) ≤ ε and f(X, U) has the same distribution as Y.
Exercise 4.3.3. Let (S, d) be a separable metric space, and let (Pr(S, d), ρ) be the space of probability measures on S equipped with the Levy-Prohorov metric ρ. Let Pr⁽²⁾(S, d) := Pr(Pr(S, d)) be the space of probability measures on (Pr(S, d), ρ) and denote by ρ⁽²⁾ the Levy-Prohorov metric on Pr⁽²⁾(S, d). (Elements of Pr⁽²⁾(S, d) can be viewed as laws of random measures on S.) Given P⁽²⁾ ∈ Pr⁽²⁾(S, d), we can define a two-step process to generate a random variable on S: first, generate a probability measure P ∈ Pr(S, d) from this measure P⁽²⁾ and, second, generate a random variable X from P, so that the distribution L(X) of X is given by
$$\Pr(X \in A) = \int P(A)\, dP^{(2)}(P).$$
Let Q⁽²⁾ ∈ Pr⁽²⁾(S, d) be another measure, and generate a random variable Y in the same way. Prove that ρ(L(X), L(Y)) ≤ 2ρ⁽²⁾(P⁽²⁾, Q⁽²⁾).
where the sup is over all measurable (or closed, or open) sets A. Hint: let b be the supremum on the right-hand side; see the remark below Theorem 4.11; and notice that I(x ∈ A) − I(y ∈ A^{a]}) ≤ I(d(x, y) > a).
4.4 Wasserstein distance and Kantorovich-Rubinstein theorem 121
Let (S, d) be a separable metric space. Denote by P₁(S) the set of all laws on S such that, for some z ∈ S (equivalently, for all z ∈ S),
$$\int_S d(x, z)\, dP(x) < \infty.$$
is called the Wasserstein distance between P and Q. If P and Q are tight, this infimum is attained (see the exercise below).
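As a side illustration (ours, not from the text): for laws on the real line, the Wasserstein distance equals the area between the distribution functions, and for two empirical measures of the same size the optimal coupling simply matches order statistics. Both formulas are classical; the helper names below are made up.

```python
def w1_sorted(xs, ys):
    # optimal (monotone) coupling of two empirical measures of equal size:
    # transport the i-th smallest x to the i-th smallest y
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def w1_cdf(xs, ys):
    # integral of |F - G| over the line, for piecewise-constant empirical CDFs
    pts = sorted(set(xs) | set(ys))
    total = 0.0
    for a, b in zip(pts, pts[1:]):
        F = sum(x <= a for x in xs) / len(xs)
        G = sum(y <= a for y in ys) / len(ys)
        total += abs(F - G) * (b - a)
    return total

xs = [0.0, 1.0, 4.0]
ys = [0.5, 2.0, 2.5]
print(w1_sorted(xs, ys), w1_cdf(xs, ys))  # both are (approximately) 1.0
```

Agreement of the two computations is a concrete instance of Kantorovich-Rubinstein duality on R, proved in general later in this section.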
A measure µ 2 M(P, Q) represents a transportation between the measures P
and Q. We can think of the conditional distribution µ(y|x) as a way to redistribute
the mass in the neighborhood of a point x so that the distribution P will be redis-
tributed to the distribution Q. If the distance d(x, y) represents the cost of moving
x to y then the Wasserstein distance gives the optimal total cost of transporting P
to Q. Given any two laws P and Q on S, let us define
$$\gamma(P, Q) = \sup\Big\{ \int f\, dP - \int f\, dQ : \|f\|_L \le 1 \Big\}.$$
Notice that, obviously, for P, Q ∈ P₁(S), both γ(P, Q) and m_d(P, Q) are finite. Let us show that these two quantities are equal.
Proof. Given a function f such that ‖f‖_L ≤ 1, let us take a small ε > 0 and set g(y) = −f(y) − ε. Then
and
$$\int f\, dP + \int g\, dQ = \int f\, dP - \int f\, dQ - \varepsilon.$$
which means that ‖e‖_L ≤ 1 and, therefore, m_d(P, Q) ≤ γ(P, Q). This finishes the proof. □
Proof. We only need to show the first equality. Consider the vector space V = C(S × S) equipped with the ‖·‖∞ norm and let
This can hold only if r(a) ≥ 0. This implies that if φ₁ ≤ φ₂ then r(φ₁) ≤ r(φ₂). For any function φ, both φ, −φ ≤ ‖φ‖∞ · 1 and, by the monotonicity of r,
$$|r(\phi)| \le \|\phi\|_{\infty}\, r(1) = \|\phi\|_{\infty}.$$
Since r|E = r,
$$\int (f(x) + g(y))\, d\mu(x, y) = \int f\, dP + \int g\, dQ,$$
which implies that μ has marginals P and Q, i.e. μ ∈ M(P, Q). This proves that
$$m_d(P, Q) = \sup_U r(\phi) = \sup\Big\{ \int h(x, y)\, d\mu(x, y) : h(x, y) < d(x, y) \Big\} = \int d(x, y)\, d\mu(x, y) \ge W(P, Q).$$
The opposite inequality is easy because, for any f, g such that f(x) + g(y) < d(x, y) and any ν ∈ M(P, Q),
$$\int f\, dP + \int g\, dQ = \int (f(x) + g(y))\, d\nu(x, y) \le \int d(x, y)\, d\nu(x, y). \qquad (4.7)$$
This finishes the proof and, moreover, it shows that the infimum in the definition of W is attained. □
Remark 4.5. Notice that in the proof of this theorem we never used the fact that d is a metric. The theorem holds for any d ∈ C(S × S) under the corresponding integrability assumptions. For example, one can consider loss functions of the type d(x, y)^p for p > 1, which are not necessarily metrics. However, in Lemma 4.9, the fact that d is a metric was essential. □
Our next goal will be to show that W = g on separable and not necessarily
compact metric spaces. We start with the following.
Lemma 4.10. If (S, d) is a separable metric space then W and γ are metrics on P₁(S).
Proof. Since for the bounded Lipschitz metric β we have β(P, Q) ≤ γ(P, Q), γ is also a metric, because if γ(P, Q) = 0 then β(P, Q) = 0 and, therefore, P = Q. As in (4.7), it should be obvious that γ(P, Q) = m_d(P, Q) ≤ W(P, Q), and if W(P, Q) = 0 then γ(P, Q) = 0 and P = Q. The symmetry of W is obvious. It remains to show that W(P, Q) satisfies the triangle inequality. The idea here is very simple, and let us first explain it in the case when (S, d) is complete. Consider three laws P, Q, T on S and let μ ∈ M(P, Q) and ν ∈ M(Q, T) be such that
$$\int d(x, y)\, d\mu(x, y) \le W(P, Q) + \varepsilon \quad \text{and} \quad \int d(y, z)\, d\nu(y, z) \le W(Q, T) + \varepsilon.$$
Letting ε → 0 proves the triangle inequality for W. In the case when (S, d) is not complete, we will apply the same basic idea after we discretize the space without losing much in the transportation cost integral. This can be done as in the proof of Strassen's theorem, Case C. Given ε > 0, consider a partition (S_n)_{n≥1} of S such that diam(S_n) < ε for all n. On each box S_n × S_m let
$$\mu^1_{nm}(C) = \frac{\mu((C \cap S_n) \times S_m)}{\mu(S_n \times S_m)}, \qquad \mu^2_{nm}(C) = \frac{\mu(S_n \times (C \cap S_m))}{\mu(S_n \times S_m)}$$
be the marginal distributions of the conditional distribution of μ on S_n × S_m. Define
$$\mu' = \sum_{n, m} \mu(S_n \times S_m)\, \mu^1_{nm} \times \mu^2_{nm}.$$
$$\mu'(C \times S) = \sum_{n, m} \mu(S_n \times S_m)\, \mu^1_{nm}(C)\, \mu^2_{nm}(S) = \sum_{n, m} \mu((C \cap S_n) \times S_m) = \sum_n \mu((C \cap S_n) \times S) = \sum_n P(C \cap S_n) = P(C).$$
Let Y′_m be an independent copy of Y_m, also independent of X_n. Then the joint distribution of (X_n, Y′_m) is μ¹_{nm} × μ²_{nm} and
$$\mathrm{E}\, d(X_n, Y'_m) = \int_{S_n \times S_m} d(x, y)\, d(\mu^1_{nm} \times \mu^2_{nm})(x, y).$$
Then
$$\int d(x, y)\, d\mu(x, y) = \sum_{n, m} \mu(S_n \times S_m)\, \mathrm{E}\, d(X_n, Y_m), \qquad \int d(x, y)\, d\mu'(x, y) = \sum_{n, m} \mu(S_n \times S_m)\, \mathrm{E}\, d(X_n, Y'_m).$$
Finally, d(Y_m, Y′_m) ≤ diam(S_m) ≤ ε, so these two integrals differ by at most ε. Therefore,
$$\int d(x, y)\, d\mu'(x, y) \le W(P, Q) + 2\varepsilon.$$
Similarly, we can define
$$\nu' = \sum_{n, m} \nu(S_n \times S_m)\, \nu^1_{nm} \times \nu^2_{nm}$$
such that
$$\int d(x, y)\, d\nu'(x, y) \le W(Q, T) + 2\varepsilon.$$
We will now show that this special simple form of the distributions μ′(x, y), ν′(y, z) ensures that the conditional distributions of x and z given y are well defined. Let Q_m be the restriction of Q to S_m,
Obviously, if Q_m(C) = 0 then μ²_{nm}(C) = 0 for all n, which means that the μ²_{nm} are absolutely continuous with respect to Q_m, and the Radon-Nikodym derivatives
$$f_{nm}(y) = \frac{d\mu^2_{nm}}{dQ_m}(y) \ \text{ exist, with } \ \sum_n \mu(S_n \times S_m)\, f_{nm}(y) = 1 \ \text{ a.s. for } y \in S_m.$$
Of course, we can set f_{nm}(y) = 0 for y outside of S_m. Let us define a conditional distribution of x given y by
$$= \sum_{n, m} \mu(S_n \times S_m)\, \mu^1_{nm}(A)\, \mu^2_{nm}(B) = \mu'(A \times B).$$
n,m
The next lemma shows that, on a separable metric space, any law with a "first moment", i.e. P ∈ P₁(S), can be approximated in the metrics W and γ by laws concentrated on finite sets.
Lemma 4.11. If (S, d) is separable and P ∈ P₁(S) then there exists a sequence of laws P_n such that P_n(F_n) = 1 for some finite sets F_n and W(P_n, P), γ(P_n, P) → 0.
Proof. For each n ≥ 1, let (S_{nj})_{j≥1} be a partition of S such that diam(S_{nj}) ≤ 1/n. Take a point x_{nj} ∈ S_{nj} in each set S_{nj} and for k ≥ 1 define a function
$$f_{nk}(x) = \begin{cases} x_{nj}, & \text{if } x \in S_{nj} \text{ for } j \le k, \\ x_{n1}, & \text{if } x \in S_{nj} \text{ for } j > k. \end{cases}$$
We have
$$\int d(x, f_{nk}(x))\, dP(x) = \sum_{j \ge 1} \int_{S_{nj}} d(x, f_{nk}(x))\, dP(x) \le \frac{1}{n} \sum_{j \le k} P(S_{nj}) + \int_{S \setminus (S_{n1} \cup \cdots \cup S_{nk})} d(x, x_{n1})\, dP(x) \le \frac{2}{n}$$
for k large enough, because P ∈ P₁(S), i.e. ∫ d(x, x_{n1}) dP(x) < ∞, and the sets S ∖ (S_{n1} ∪ ··· ∪ S_{nk}) ↓ ∅.
Let μ_n be the image on S × S of the measure P under the map x → (f_{nk}(x), x), so that μ_n ∈ M(P_n, P) for some P_n concentrated on the set of points {x_{n1}, . . . , x_{nk}}. Finally,
$$W(P_n, P) \le \int d(x, y)\, d\mu_n(x, y) = \int d(f_{nk}(x), x)\, dP(x) \le \frac{2}{n}.$$
Since γ(P_n, P) ≤ W(P_n, P), this finishes the proof. □
Letting n → ∞ proves that W(P, Q) ≤ γ(P, Q). We saw above that the opposite inequality always holds. □
Even though d(x, y) = |x − y|^p is not a metric for p > 1, W_p is still a metric on P_p(R^n), which can be shown in the same way as in Lemma 4.10. Namely, given nearly optimal μ ∈ M(P, Q) and ν ∈ M(Q, T), we can construct a triple (X, Y, Z) with marginals P, Q, T such that (X, Y) ∼ μ and (Y, Z) ∼ ν and, therefore,
$$W_p(P, T) \le \big( \mathrm{E}|X - Z|^p \big)^{1/p} \le \big( \mathrm{E}|X - Y|^p \big)^{1/p} + \big( \mathrm{E}|Y - Z|^p \big)^{1/p} \le \big( W_p^p(P, Q) + \varepsilon \big)^{1/p} + \big( W_p^p(Q, T) + \varepsilon \big)^{1/p}.$$
Proof. We will show below that, for any continuous uniformly bounded function d(x, y) on R^n × R^n,
$$\inf\Big\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\} = \sup\Big\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le d(x, y) \Big\}. \qquad (4.10)$$
This will imply (4.9) as follows. Let us take R ≥ 1 large enough so that, for K = {|x| ≤ R},
$$\int_{K^c} |x|^p\, dP \le 2^{-p-1}\varepsilon, \qquad \int_{K^c} |x|^p\, dQ \le 2^{-p-1}\varepsilon.$$
We can find such R because P, Q ∈ P_p(R^n). Let d(x, y) = |x − y|^p ∧ (2R)^p. Then, for any x, y ∈ K, we have d(x, y) = |x − y|^p and, for any μ ∈ M(P, Q),
$$\int |x - y|^p\, d\mu(x, y) \le \int d(x, y)\, d\mu(x, y) + \int_{(K \times K)^c} |x - y|^p\, d\mu(x, y).$$
Let us break the second integral into two integrals over the disjoint sets {|x| ≥ |y|} and {|x| < |y|}. On the first set, |x − y|^p ≤ 2^p|x|^p and, moreover, x cannot belong to K, since in that case y would have to be in K^c, which is impossible when |y| ≤ |x| ≤ R. This means that the part of the second integral over this set is bounded by
$$\int_{K^c \times \mathbb{R}^n} 2^p |x|^p\, d\mu(x, y) = 2^p \int_{K^c} |x|^p\, dP(x) \le \varepsilon/2.$$
The second integral, over the set {|x| < |y|}, can be similarly bounded by ε/2 and, therefore,
$$\int |x - y|^p\, d\mu(x, y) \le \int d(x, y)\, d\mu(x, y) + \varepsilon.$$
Together with (4.10) this implies that
$$W_p(P, Q)^p = \inf\Big\{ \int |x - y|^p\, d\mu(x, y) : \mu \in M(P, Q) \Big\} \le \inf\Big\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\} + \varepsilon$$
$$= \sup\Big\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le d(x, y) \Big\} + \varepsilon$$
$$\le \sup\Big\{ \int f\, dP + \int g\, dQ : f, g \in C(\mathbb{R}^n),\ f(x) + g(y) \le |x - y|^p \Big\} + \varepsilon.$$
$$P_K(C) = \frac{P(C \cap K)}{P(K)}, \qquad Q_K(C) = \frac{Q(C \cap K)}{Q(K)}$$
$$\mu^* = P(K)\, Q(K)\, \mu_K + (P \times Q)\big|_{(K \times K)^c},$$
If we take R large enough so that the last term is smaller than ε then, using the Kantorovich-Rubinstein theorem on compacts, we get
$$\inf\Big\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\} \le \int d(x, y)\, d\mu_K(x, y) + \varepsilon$$
$$= \inf\Big\{ \int d(x, y)\, d\mu(x, y) : \mu \in M(P_K, Q_K) \Big\} + \varepsilon = \sup\Big\{ \int f\, dP_K + \int g\, dQ_K : f, g \in C(K),\ f(x) + g(y) \le d(x, y) \Big\} + \varepsilon$$
$$\le \int f\, dP_K + \int g\, dQ_K + 2\varepsilon, \qquad (4.12)$$
for some f, g ∈ C(K) such that f(x) + g(y) ≤ d(x, y). To finish the proof, we would like to estimate the above sum of integrals by
$$\int \varphi\, dP + \int \psi\, dQ$$
for some functions φ, ψ ∈ C(R^n) such that φ(x) + ψ(y) ≤ d(x, y). This can be achieved by "improving" our choice of the functions f and g using infimum convolutions, as we will now explain.
However, before we do this, let us make the following useful observation. Since d ≥ 0, we can assume that the supremum in (4.12) is strictly positive (otherwise, there is nothing to prove) and
$$\int f\, dP_K + \int g\, dQ_K \ge 0.$$
exist points x₀, y₀ ∈ K such that f(x₀), g(y₀) ≥ 0. Since f(x) + g(y) ≤ d(x, y),
This is called the infimum-convolution. Clearly, since f(x) ≤ d(x, y) − g(y) for all y ∈ K, we have f(x) ≤ φ(x) for x ∈ K, which implies that
$$\int f\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int g\, dQ_K.$$
Notice that, by definition, φ(x) + g(y) ≤ d(x, y) for all x ∈ R^n, y ∈ K. Also, using (4.13) and the fact that g(y₀) ≥ 0,
Since g(y) ≤ d(x, y) − φ(x) for all x ∈ R^n, we have g(y) ≤ ψ(y) for y ∈ K, which implies that
$$\int f\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int g\, dQ_K \le \int \varphi\, dP_K + \int \psi\, dQ_K.$$
By definition, φ(x) + ψ(y) ≤ d(x, y) for all x, y ∈ R^n. Also, using (4.14) and the fact that φ(x₀) ≥ f(x₀) ≥ 0,
The estimates in (4.13) and (4.14) imply that ‖φ‖∞, ‖ψ‖∞ ≤ ‖d‖∞. Therefore, if we write
$$\int \varphi\, dP = \int_K \varphi\, dP + \int_{K^c} \varphi\, dP = P(K) \int \varphi\, dP_K + \int_{K^c} \varphi\, dP,$$
then the error terms in comparing ∫φ dP_K with ∫φ dP are each bounded in absolute value by ‖d‖∞ P(K^c), which shows that, for large enough R ≥ 1,
$$\int \varphi\, dP_K \le \int \varphi\, dP + \varepsilon.$$
and, since φ(x) + ψ(y) ≤ d(x, y), this finishes the proof. □
Exercise 4.4.1. If P and Q are tight, show that the infimum in the definition of
W (P, Q) is attained.
Exercise 4.4.2. Suppose (S, d) is separable. Show that W(P, Q) ≥ ρ(P, Q)² and W(P, Q) ≥ β(P, Q). If the diameter D = sup_{x,y∈S} d(x, y) of (S, d) is finite, show that W(P, Q) ≤ ρ(P, Q)(1 + D).
Exercise 4.4.3. Let (S, d) be a separable metric space and, given a bounded measurable function f : S² → R, let
$$W_f(P, Q) := \inf\Big\{ \int f(x, y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\}$$
Exercise 4.4.4. (a) Consider a countable set A and two probability measures P and Q on A. Prove that
$$\inf\Big\{ \int I(x \ne y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\} = \frac{1}{2} \sum_{a \in A} \big| P(\{a\}) - Q(\{a\}) \big|.$$
Hint: use the Kantorovich-Rubinstein theorem with the metric d(x, y) = I(x ≠ y).
(b) Let (S, d) be a separable metric space and B be the Borel σ-algebra. Given two probability measures P and Q on B, construct a measure ν ∈ M(P, Q) that witnesses the equality
$$\inf\Big\{ \int I(x \ne y)\, d\mu(x, y) : \mu \in M(P, Q) \Big\} = \sup_{A \in B} |P(A) - Q(A)| =: \mathrm{TV}(P, Q),$$
where TV(P, Q) is called the total variation distance between P and Q. Hint: use the Hahn-Jordan decomposition.
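The identity in part (a) can be checked numerically by building the natural candidate coupling: keep mass min(P({a}), Q({a})) on the diagonal and pair off the leftover masses arbitrarily. The sketch below, with made-up data, is our own.

```python
def optimal_coupling(P, Q):
    """A coupling mu of P, Q minimizing mu(x != y); P, Q: dicts over a finite set."""
    keys = set(P) | set(Q)
    # shared mass stays on the diagonal
    mu = {(a, a): min(P.get(a, 0.0), Q.get(a, 0.0)) for a in keys}
    # the excess mass of P is transported to points where Q has excess
    p_exc = {a: P.get(a, 0.0) - mu[(a, a)] for a in keys}
    q_exc = {a: Q.get(a, 0.0) - mu[(a, a)] for a in keys}
    for a in keys:
        for b in keys:
            if a != b and p_exc[a] > 1e-15 and q_exc[b] > 1e-15:
                m = min(p_exc[a], q_exc[b])
                mu[(a, b)] = mu.get((a, b), 0.0) + m
                p_exc[a] -= m
                q_exc[b] -= m
    return mu

P = {'a': 0.5, 'b': 0.3, 'c': 0.2}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}
mu = optimal_coupling(P, Q)
off_diag = sum(m for (x, y), m in mu.items() if x != y)
tv = 0.5 * sum(abs(P.get(a, 0.0) - Q.get(a, 0.0)) for a in set(P) | set(Q))
print(off_diag, tv)  # both are (approximately) 0.3
```

The off-diagonal mass of this coupling equals the total variation distance, which is exactly the statement of the exercise.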
Exercise 4.4.5. Let P₁(R) be the set of laws on R such that ∫|x| dP(x) < ∞. Let us define a map F : P₁(R) → P₁(R) as follows. Consider a random variable τ with values in N such that Eτ < ∞, and an independent random variable ξ such that E|ξ| < ∞. Given a sequence (X_i) of i.i.d. random variables with the distribution P ∈ P₁(R), independent of τ and ξ, let F(P) be the distribution of the sum ξ + Σ_{i=1}^{τ} X_i, where the sum Σ_{i=1}^{τ} X_i is zero if τ = 0. If Eτ ∈ [0, 1), prove that F has a unique fixed point, F(P) = P. Hint: use the Banach Fixed Point Theorem for the Wasserstein metric W₁. (You need to prove that the metric space (P₁(R), W₁) is complete.)
Exercise 4.4.6. Let P be the set of probability laws on [−1, 1] and define a map F : P → P as follows. Consider a random variable τ with values in N and an independent random variable ξ. Given a sequence (X_i) of i.i.d. random variables on [−1, 1] with the distribution P ∈ P, independent of τ and ξ, let F(P) be the distribution of cos(ξ + Σ_{i=1}^{τ} X_i), where the sum Σ_{i=1}^{τ} X_i is zero if τ = 0. Prove that F has a fixed point, F(P) = P. Hint: use the Schauder Fixed Point Theorem for P equipped with the topology of weak convergence. (Notice that now a fixed point is possibly not unique.)
f₁, . . . , f_n : A → R
In this section we will make several connections between the Wasserstein distance
and other classical objects, with application to Gaussian concentration. Let us start
with the following classical inequality.
Proof. First, suppose that A and B are open and bounded. Since the Lebesgue measure is invariant under translations, let us translate A and B so that sup A = inf B = 0. Let us check that, in this case, A ∪ B ⊆ A + B. Since A is open, for each a ∈ A there exists ε > 0 such that (a − ε, a + ε) ⊆ A. Since inf B = 0, there exists b ∈ B such that 0 ≤ b < ε/2. Then a ∈ (a + b − ε, a + b + ε) ⊆ A + B, which proves that A ⊆ A + B. One can prove similarly that B is also a subset of A + B. Since A and B are disjoint, we proved that g(A) + g(B) = g(A ∪ B) ≤ g(A + B).
Now, suppose that A and B are compact. Then, obviously, A + B is also compact. For ε > 0, let us denote by C^ε an open ε-neighborhood of the set C. Since A^ε + B^ε ⊆ (A + B)^{2ε}, using the previous case of open bounded sets, we get g(A^ε) + g(B^ε) ≤ g(A^ε + B^ε) ≤ g((A + B)^{2ε}). Since A is closed, A^ε ↓ A as ε ↓ 0 and, by the continuity of measure, g(A^ε) ↓ g(A). The same holds for B and A + B, and we have proved the inequality for two compact sets.
Finally, consider arbitrary measurable sets A and B. We can assume that g(A) < ∞ and g(B) < ∞, otherwise there is nothing to prove. By the regularity of the Lebesgue measure, we can find compacts C ⊆ A and D ⊆ B such that g(A ∖ C) ≤ ε and g(B ∖ D) ≤ ε. Using the previous case of compact sets, we can write g(A) + g(B) − 2ε ≤ g(C) + g(D) ≤ g(C + D) ≤ g(A + B), and letting ε ↓ 0 finishes the proof. □
Then,
$$\int w\, dx \ge \Big( \int u\, dx \Big)^{\lambda} \Big( \int v\, dx \Big)^{1 - \lambda}.$$
Using the Prekopa-Leindler inequality one can prove that the Lebesgue measure g on R^n satisfies the Brunn-Minkowski inequality
We will leave this as an exercise. From this one can easily deduce the famous isoperimetric property of Euclidean balls with respect to the Lebesgue measure, namely, that among all sets of the same measure, a ball has the smallest surface area. If B is the unit open ball in R^n and g(A) = g(B) then, by (4.16),
In other words, the volume of A grows faster than that of B as we expand the sets, which means that the surface area of B is smaller.
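The one-dimensional inequality g(A) + g(B) ≤ g(A + B) for disjoint sets, proved above, can be checked exactly for finite unions of intervals, where Minkowski sums and Lebesgue measure are computable in closed form. The sketch below is our own.

```python
def minkowski_sum(A, B):
    # A, B: lists of intervals (a1, a2); the sumset is the union of pairwise sums
    return [(a1 + b1, a2 + b2) for a1, a2 in A for b1, b2 in B]

def measure(intervals):
    # Lebesgue measure of a union of intervals, merging overlaps left to right
    total, end = 0.0, float('-inf')
    for a, b in sorted(intervals):
        a = max(a, end)          # discard the part already counted
        if b > a:
            total += b - a
        end = max(end, b)
    return total

A = [(0.0, 1.0), (3.0, 4.0)]   # measure 2
B = [(1.5, 2.5)]               # measure 1, disjoint from A
AB = minkowski_sum(A, B)       # [(1.5, 3.5), (4.5, 6.5)], measure 4
print(measure(A) + measure(B), measure(AB))  # 3.0 <= 4.0
```

Note that equality holds when A and B are single intervals; the gap in A is what makes the sumset strictly larger here.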
Proof (of Theorem 4.18). The proof will proceed by induction on n. Let us first show the induction step. Suppose the statement holds for n and we would like to show it for n + 1. By the assumption, for any x, y ∈ R^n and a, b ∈ R,
Let us fix a and b and consider the functions w₁(x) = w(x, λa + (1 − λ)b), u₁(x) = u(x, a) and v₁(x) = v(x, b) on R^n, which satisfy w₁(λx + (1 − λ)y) ≥ u₁(x)^λ v₁(y)^{1−λ}. By the induction assumption,
$$\int_{\mathbb{R}^n} w_1\, dx \ge \Big( \int_{\mathbb{R}^n} u_1\, dx \Big)^{\lambda} \Big( \int_{\mathbb{R}^n} v_1\, dx \Big)^{1 - \lambda}.$$
and, similarly,
$$u_2(a) = \int_{\mathbb{R}^n} u_1(x, a)\, dx \quad \text{and} \quad v_2(b) = \int_{\mathbb{R}^n} v_1(x, b)\, dx.$$
cation and scaling. Also, we can assume that u and v are not identically zero, since there is nothing to prove in that case, and we can scale them by their ‖·‖∞ norms and assume that ‖u‖∞ = ‖v‖∞ = 1. We have the following set inclusion:
$$\{w \ge a\} \supseteq \lambda\{u \ge a\} + (1 - \lambda)\{v \ge a\},$$
When a ∈ [0, 1], the sets {u ≥ a} and {v ≥ a} are not empty and the Brunn-Minkowski inequality implies that
Finally,
$$\int_{\mathbb{R}} w(z)\, dz = \int_{\mathbb{R}} \int_0^1 I(a \le w(z))\, da\, dz = \int_0^1 g(w \ge a)\, da$$
$$\ge \lambda \int_0^1 g(u \ge a)\, da + (1 - \lambda) \int_0^1 g(v \ge a)\, da$$
$$= \lambda \int u(z)\, dz + (1 - \lambda) \int v(z)\, dz \ge \Big( \int u(z)\, dz \Big)^{\lambda} \Big( \int v(z)\, dz \Big)^{1 - \lambda},$$
where in the last step we used the arithmetic-geometric mean inequality. This finishes the proof. □
Remark 4.6. Another common proof of the last step n = 1 uses transportation of measure, as follows. We can assume that ∫u = ∫v = 1 by the rescaling
$$u \to \frac{u}{\int u}, \qquad v \to \frac{v}{\int v}, \qquad w \to \frac{w}{(\int u)^{\lambda} (\int v)^{1 - \lambda}}.$$
Then we need to show that ∫w ≥ 1. Consider the cumulative distribution functions
$$F(x) = \int_{-\infty}^x u(y)\, dy \quad \text{and} \quad G(x) = \int_{-\infty}^x v(y)\, dy,$$
and let x(t) and y(t) be their quantile transforms. Then F(x(t)) = t and G(y(t)) = t for 0 ≤ t ≤ 1. By the Lebesgue differentiation theorem, F′(x) = u(x) for all x ∉ N, for some set N of Lebesgue measure zero. Since x(t) is (strictly) monotone, the derivative x′(t) exists for all t ∉ N′, for some set N′ of Lebesgue measure zero. Also, notice that
$$x(t) \in N \implies t \in F(N),$$
and, since F is absolutely continuous, F(N) also has Lebesgue measure zero (the Luzin N property). Therefore, for t ∉ F(N) ∪ N′, the derivative x′(t) exists and F′(x(t)) = u(x(t)). Using the chain rule in the equation F(x(t)) = t, for all such t,
4.5 Wasserstein distance and entropy 137
u(x(t))x′(t) = 1. Similarly, outside of some set of measure zero, v(y(t))y′(t) = 1. Now, consider the function z(t) = λx(t) + (1 − λ)y(t). This function is strictly increasing and differentiable almost everywhere. Therefore,
$$\int_{-\infty}^{+\infty} w(z)\, dz \ge \int_0^1 w(z(t))\, z'(t)\, dt = \int_0^1 w\big( \lambda x(t) + (1 - \lambda) y(t) \big)\, z'(t)\, dt,$$
and, by assumption,
Therefore,
$$\int_{-\infty}^{\infty} w(z)\, dz \ge \int_0^1 \big( u(x(t))\, x'(t) \big)^{\lambda} \big( v(y(t))\, y'(t) \big)^{1 - \lambda}\, dt = \int_0^1 1\, dt = 1.$$
where 0 · log 0 = 0. Notice that Ent_P(u) ≥ 0 by Jensen's inequality, since u log u is a convex function. Entropy has the following variational representation, known in physics as the Gibbs variational principle (see the exercise below).
The right-hand side is convex in λ for λ ≥ 0 (since u is nonnegative), the minimum is achieved at λ = ∫u dP, and with this choice of λ the right-hand side is equal to Ent_P(u). On the other hand, it is easy to see that this upper bound is achieved at the function v = log(u / ∫u dP), which satisfies ∫e^v dP = 1. □
Suppose now that a law $Q$ is absolutely continuous with respect to $P$ and denote its Radon-Nikodym derivative by
$$u = \frac{dQ}{dP}.$$
Then the Kullback-Leibler divergence of $P$ from $Q$ is defined by
$$D(Q \,\|\, P) := \int \log u\,dQ = \int \log\frac{dQ}{dP}\,dQ. \qquad (4.18)$$
Notice that this quantity is not symmetric in $P$ and $Q$, so the order is important. Clearly, $D(Q\,\|\,P) = \operatorname{Ent}_P(u)$, since
$$\operatorname{Ent}_P(u) = \int \frac{dQ}{dP}\log\frac{dQ}{dP}\,dP - \int \frac{dQ}{dP}\,dP \cdot \log\int \frac{dQ}{dP}\,dP = \int \log\frac{dQ}{dP}\,dQ,$$
because $\int (dQ/dP)\,dP = 1$. The variational characterization (4.17) implies that
$$\text{if } \int e^v\,dP \le 1 \text{ then } \int v\,dQ = \int uv\,dP \le D(Q\,\|\,P). \qquad (4.19)$$
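The identity $D(Q\|P) = \operatorname{Ent}_P(dQ/dP)$ and the bound (4.19) are easy to verify numerically for measures on a finite set. A minimal sketch, with arbitrary three-point distributions:

```python
import numpy as np

# two probability vectors on a three-point set (arbitrary example)
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])
u = Q / P                                   # Radon-Nikodym derivative dQ/dP

D = np.sum(Q * np.log(Q / P))               # D(Q||P) = int log(dQ/dP) dQ

# Ent_P(u) = E_P[u log u] - E_P[u] log E_P[u]   (here E_P[u] = 1)
ent = np.sum(P * u * np.log(u)) - np.sum(P * u) * np.log(np.sum(P * u))
assert abs(D - ent) < 1e-12                 # D(Q||P) = Ent_P(dQ/dP)

# (4.19): whenever E_P[e^v] <= 1, we must have E_Q[v] <= D(Q||P)
rng = np.random.default_rng(0)
for _ in range(1000):
    v = rng.normal(size=3)
    v -= np.log(np.sum(P * np.exp(v)))      # normalize so E_P[e^v] = 1
    assert np.sum(Q * v) <= D + 1e-10

# equality at v = log u, which satisfies E_P[e^v] = 1
assert abs(np.sum(Q * np.log(u)) - D) < 1e-12
print(D)
```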
where $\lambda_{\max}(C)$ is the largest eigenvalue of $C$ and $K = 1/(2\lambda_{\max}(C))$. We will use this to prove the following useful inequality for the Wasserstein distance $W_2$ defined at the end of the previous section.

Theorem 4.19. If $P = N(0,C)$ and $Q$ is absolutely continuous with respect to $P$ with $\int |x|^2\,dQ(x) < \infty$, then
$$W_2(P,Q)^2 \le 2\lambda_{\max}(C)\,D(Q\,\|\,P).$$
(The proof applies the Prekopa-Leindler inequality to the functions
$$u(x) = e^{(1-t)f(x) - V(x)}, \qquad v(y) = e^{tg(y) - V(y)}, \qquad w(z) = e^{-V(z)}.)$$
As an application, consider a Borel set $A$ with $P(A)>0$ and the conditional distribution
$$P_A(C) = \frac{P(C \cap A)}{P(A)}.$$
Then, obviously, the Radon-Nikodym derivative
$$\frac{dP_A}{dP} = \frac{1}{P(A)}\,I_A$$
and the Kullback-Leibler divergence
$$D(P_A \,\|\, P) = \int_A \log\frac{1}{P(A)}\,dP_A = \log\frac{1}{P(A)}.$$
Since $W_2$ is a metric, for any two Borel sets $A$ and $B$, $W_2(P_A,P_B) \le W_2(P_A,P) + W_2(P,P_B)$. Taking $B = \{x : d(x,A) \ge t\}$, so that $W_2(P_A,P_B) \ge t$, this leads to
$$P\bigl(d(x,A) \ge t\bigr) \le \frac{1}{P(A)}\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).$$
If the set $A$ is not too small, e.g. $P(A) \ge 1/2$, this implies that
$$P\bigl(d(x,A) \ge t\bigr) \le 2\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).$$
This shows that the Gaussian measure is exponentially concentrated near any "large enough" set. The constant $1/4$ in the exponent is not optimal and can be replaced by $1/2$; this is just an example of application of the above ideas. The optimal result is the famous Gaussian isoperimetry. Note that the bound above can be written as
$$P\bigl(d(x,A)\ge t\bigr) \le \frac{1}{P(A)}\exp\Bigl(-\frac{ct^2}{4}\Bigr) = \frac{1}{P(A)}\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr),$$
with $c = 1/\lambda_{\max}(C)$.
142 4 Metrics on Probability Measures
Discrete metric and total variation. The total variation distance between two probability measures $P$ and $Q$ on a measurable space $(S,\mathcal B)$ is defined by
$$\mathrm{TV}(P,Q) = \sup_{B\in\mathcal B}\,\bigl|P(B) - Q(B)\bigr|.$$
Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation.

Lemma 4.13. If $f$ is a measurable function on $S$ such that $|f| \le 1$ and $\int f\,dP = 0$, then for any $\lambda \in \mathbb R$,
$$\int e^{\lambda f}\,dP \le e^{\lambda^2/2}.$$
However, since any uncountable set $S$ is not separable with respect to the discrete metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case, one can use the Hahn-Jordan decomposition to show that $W$ coincides with the total variation distance, $W(P,Q) = \mathrm{TV}(P,Q)$, and it is easy to construct a measure $\mu \in M(P,Q)$ explicitly that witnesses the above equality. This was one of the exercises at the end of the previous section. One can also easily check directly that $\gamma(P,Q) = \mathrm{TV}(P,Q)$. Thus, for the discrete metric $d$, $W(P,Q) = \gamma(P,Q) = \mathrm{TV}(P,Q)$.
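For discrete measures the coupling witnessing $W(P,Q) = \mathrm{TV}(P,Q)$ can be written down explicitly: keep the mass $\min(P,Q)$ on the diagonal and spread the leftovers proportionally off it. A sketch with arbitrary three-point distributions:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

tv = 0.5 * np.abs(P - Q).sum()         # total variation distance

# coupling mu in M(P,Q): mass min(P,Q) on the diagonal, leftovers off it
m = np.minimum(P, Q)
rP, rQ = P - m, Q - m                  # leftovers; both sum to tv
mu = np.diag(m) + np.outer(rP, rQ) / tv

assert np.allclose(mu.sum(axis=1), P)  # first marginal is P
assert np.allclose(mu.sum(axis=0), Q)  # second marginal is Q

# cost for the discrete metric = mass placed off the diagonal
cost = mu.sum() - np.trace(mu)
print(tv, cost)                        # equal: the coupling attains TV(P,Q)
```

The leftovers $rP$ and $rQ$ have disjoint supports, so the off-diagonal part carries exactly mass $\mathrm{TV}(P,Q)$.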
Hamming metric on a product space. Let us consider a finite set $A$ and, given an integer $n \ge 1$, consider the following Hamming metric on $A^n$,
$$d_H(x,y) = \frac1n\sum_{i=1}^n I(x_i \ne y_i).$$
On the other hand, on the product space $A^n$, one is often interested to understand how different a measure $\mu$ is from some (or any) product measure $\nu = \nu_1\times\cdots\times\nu_n$ with respect to the above Wasserstein metric. Consider the function
where $H(P) = -\sum_s P(s)\log P(s)$ is called the entropy of the discrete measure $P$. The inequality (4.27) then shows that the KL divergence between $\mu$ and the product measure with the same marginals, $\mu_1\times\cdots\times\mu_n$ (which is an explicit quantity that in principle should be easier to compute), can be used to control from below the Wasserstein distance $W_H(\mu,\nu)$ between $\mu$ and an arbitrary product measure $\nu$ with respect to the Hamming distance $d_H$. Let us also note that one can strengthen the inequality (4.27) by replacing $\varphi_A(W_H(\mu,\nu))$ with
$$\inf\Bigl\{\frac1n\sum_{i=1}^n \varphi_A\bigl(\lambda(x_i\ne y_i)\bigr) : \lambda \in M(\mu,\nu)\Bigr\}.$$
Exercise 4.5.2. Using the Prekopa-Leindler inequality, prove that if $\gamma_N$ is the standard Gaussian measure on $\mathbb R^N$, $A$ and $B$ are Borel sets, and $\lambda\in[0,1]$, then
$$\gamma_N\bigl(\lambda A + (1-\lambda)B\bigr) \ge \gamma_N(A)^{\lambda}\,\gamma_N(B)^{1-\lambda}.$$
Exercise 4.5.4. If $C$ is convex and symmetric around $0$, $X$ and $Y$ are (centred) Gaussian on $\mathbb R^N$, and $\mathrm{Cov}(X) \le \mathrm{Cov}(Y)$ (i.e. their difference is nonnegative definite), then $P(X\in C) \ge P(Y\in C)$. Hint: use the previous problem.
Chapter 5
Martingales
Let $(\Omega,\mathcal B,P)$ be a probability space and let $(T,\le)$ be a linearly ordered set. We will mostly work with countable sets $T$, such as $\mathbb Z$ or subsets of $\mathbb Z$. Consider a family of random variables $X_t:\Omega\to\mathbb R$ and a family of $\sigma$-algebras $(\mathcal B_t)_{t\in T}$ such that $\mathcal B_t\subseteq\mathcal B_u\subseteq\mathcal B$ for $t\le u$. A family $(X_t,\mathcal B_t)_{t\in T}$ is called a martingale if the following hold:
1. $X_t$ is $\mathcal B_t$-measurable for all $t\in T$ ($X_t$ is adapted to $\mathcal B_t$),
2. $E|X_t| < \infty$ for all $t\in T$,
3. $E(X_u\,|\,\mathcal B_t) = X_t$ for all $t\le u$.
If the last equality is replaced by $E(X_u\,|\,\mathcal B_t)\le X_t$ then the process is called a supermartingale, and if $E(X_u\,|\,\mathcal B_t)\ge X_t$ then it is called a submartingale. If $(X_t,\mathcal B_t)$ is a martingale and $X_t = E(X\,|\,\mathcal B_t)$ for some random variable $X$ then the martingale is called right-closable. If some $t_0\in T$ is an upper bound of $T$ (i.e. $t\le t_0$ for all $t\in T$) then the martingale is called right-closed. Of course, in this case $X_t = E(X_{t_0}\,|\,\mathcal B_t)$. Clearly, a right-closable martingale can be made into a right-closed martingale by adding an additional point, say $t_0$, to the set $T$ so that $t\le t_0$ for all $t\in T$, and defining $X_{t_0} := X$. Let us give several examples of martingales.
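The defining property $E(X_u|\mathcal B_t) = X_t$ can be illustrated by Monte Carlo for the simplest example, partial sums of mean-zero i.i.d. steps, by testing $EX_{n+1}I_A = EX_nI_A$ for an event $A\in\mathcal B_n$; the event chosen below is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 200000
X = rng.choice([-1.0, 1.0], size=(reps, n + 1))  # i.i.d. mean-zero steps
S = np.cumsum(X, axis=1)                         # S[:, k-1] = S_k

# martingale property tested through E[S_{n+1} I_A] = E[S_n I_A], A in B_n
A = S[:, n - 1] > 0                              # A = {S_n > 0}
lhs = np.mean(S[:, n] * A)                       # E[S_{n+1} I_A]
rhs = np.mean(S[:, n - 1] * A)                   # E[S_n I_A]
print(lhs, rhs)                                  # agree up to Monte Carlo error
```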
Clearly, $\mathcal B_{-(n+1)} \subseteq \mathcal B_{-n}$. For $1\le k\le n$, by symmetry (we leave the details for the exercise at the end of the section),
$$E(X_1\,|\,\mathcal B_{-n}) = E(X_k\,|\,\mathcal B_{-n}).$$
Therefore,
$$Z_{-n} := \frac{S_n}{n} = E\Bigl(\frac{S_n}{n}\,\Big|\,\mathcal B_{-n}\Bigr) = \frac1n\sum_{k=1}^n E(X_k\,|\,\mathcal B_{-n}) = E(X_1\,|\,\mathcal B_{-n}).$$
Thus, $(Z_{-n},\mathcal B_{-n})_{n\ge 1}$ is a right-closed martingale.
$$H_n = D_n - G_n, \qquad Y_n = H_1+\dots+H_n, \qquad Z_n = G_1+\dots+G_n.$$
5.1 Martingales. Uniform integrability 149
For example, when $|X_t|\le Y$ for all $t\in T$ and $EY<\infty$ then, clearly, $(X_t)_{t\in T}$ is uniformly integrable. Other basic criteria of uniform integrability will be given in the exercises below. The following criterion of $L^1$-convergence, which can be viewed as a strengthening of the dominated convergence theorem, will be useful.

Lemma 5.1. Consider random variables $(X_n), X$ such that $E|X_n|<\infty$, $E|X|<\infty$. The following are equivalent:
1. $E|X_n - X| \to 0$ as $n\to\infty$.
2. $(X_n)_{n\ge1}$ is uniformly integrable and $X_n\to X$ in probability.

Proof. 2$\implies$1. Truncating at level $K$ and using convergence in probability, for any $\varepsilon>0$,
$$\limsup_{n\to\infty} E|X_n - X| \le \varepsilon + 2\sup_n E|X_n|I(|X_n|>K) + 2E|X|I(|X|>K).$$
Now, first letting $\varepsilon\to0$, and then letting $K\to\infty$ and using that $(X_n)$ is uniformly integrable proves the result.

1$\implies$2. By Chebyshev's inequality,
$$P\bigl(|X_n - X|>\varepsilon\bigr) \le \frac{1}{\varepsilon}\,E|X_n - X| \to 0,$$
so $X_n\to X$ in probability. For uniform integrability, choose $n_0$ so that $E|X_n - X|\le\varepsilon$ for $n\ge n_0$. We can also choose $M$ large enough so that $E|X_n|I(|X_n|>M)\le 2\varepsilon$ for $n\ge n_0$, and since finitely many random variables $X_1,\dots,X_{n_0}$ are always uniformly integrable, this finishes the proof. ⊓⊔
Next, we will prove some uniform integrability properties for martingales and
submartingales, but first let us make the following simple observation.
Lemma 5.2. Let $f:\mathbb R\to\mathbb R$ be a convex function such that $E|f(X_t)|<\infty$ for all $t$. Suppose that either of the following two conditions holds:
1. $(X_t,\mathcal B_t)$ is a martingale;
2. $(X_t,\mathcal B_t)$ is a submartingale and $f$ is increasing.
Then $(f(X_t),\mathcal B_t)$ is a submartingale.
Proof. Under the first condition, for $t\le u$, by conditional Jensen's inequality,
$$f(X_t) = f\bigl(E(X_u\,|\,\mathcal B_t)\bigr) \le E\bigl(f(X_u)\,|\,\mathcal B_t\bigr).$$
Under the second condition, since $X_t\le E(X_u\,|\,\mathcal B_t)$ for $t\le u$ and $f$ is increasing,
$$f(X_t) \le f\bigl(E(X_u\,|\,\mathcal B_t)\bigr)$$
and, therefore, again by conditional Jensen's inequality, $f(X_t)\le E(f(X_u)\,|\,\mathcal B_t)$. ⊓⊔

Also, since the function $\max(a,x)$ is convex and increasing in $x$, by part 2 of the previous lemma,
$$E\max(a,X_t) \le E\max(a,X). \qquad (5.2)$$
Finally, if we take $M>|a|$ then the inequality $|\max(X_t,a)|>M$ can hold only if $\max(X_t,a) = X_t>M$. Combining all these observations,
$$E|\max(X_t,a)|\,I\bigl(|\max(X_t,a)|>M\bigr) = EX_tI(X_t>M) \le EXI(X_t>M)$$
Exercise 5.1.1. In the setting of Example 3 above, prove carefully that, for $1\le k\le n$, $E(X_1\,|\,\mathcal B_{-n}) = E(X_k\,|\,\mathcal B_{-n})$.

Exercise 5.1.3. Let $(X_n)_{n\ge1}$ be i.i.d. random variables and, given a bounded measurable function $f:\mathbb R^2\to\mathbb R$ such that $f(x,y)=f(y,x)$, consider
$$S_{-n} = \frac{2}{n(n-1)}\sum_{1\le\ell<\ell'\le n} f(X_\ell,X_{\ell'})$$
and the $\sigma$-algebras $\mathcal F_{-n}$ generated by $X_{(1)},\dots,X_{(n)}$ and $(X_k)_{k>n}$, where $X_{(1)},\dots,X_{(n)}$ are the order statistics of $X_1,\dots,X_n$. Prove that $\mathcal F_{-(n+1)}\subseteq\mathcal F_{-n}$ and $S_{-n} = E(S_{-2}\,|\,\mathcal F_{-n})$, i.e. $(S_{-n},\mathcal F_{-n})_{n\ge1}$ is a right-closed martingale.
Exercise 5.1.4. Suppose that $(X_n)_{n\ge1}$ are i.i.d. and $E|X_1|<\infty$. If $S_n = X_1+\dots+X_n$, show that the sequence $(S_n/n)_{n\ge1}$ is uniformly integrable.

Exercise 5.1.5. Show that if $\sup_{t\in T} E|X_t|^{1+\delta}<\infty$ for some $\delta>0$ then $(X_t)_{t\in T}$ is uniformly integrable.

Exercise 5.1.6. Show that $(X_t)_{t\in T}$ is uniformly integrable if and only if
(a) $\sup_t E|X_t|<\infty$ and
(b) for any $\varepsilon>0$, there exists $\delta>0$ such that $\sup_{t\in T} E|X_t|I_A\le\varepsilon$ whenever $P(A)\le\delta$.
$$\mathcal B_\tau = \bigl\{ B\in\mathcal B : \{\tau\le n\}\cap B\in\mathcal B_n \text{ for all } n \bigr\},$$
consisting of all events that "depend on the data up to the stopping time $\tau$". If a sequence of random variables $(X_n)$ is adapted to $(\mathcal B_n)$, i.e. $X_n$ is $\mathcal B_n$-measurable, then random variables such as $X_\tau$ or $\sum_{k=1}^{\tau}X_k$ are $\mathcal B_\tau$-measurable. For example,
$$\{\tau\le n\}\cap\{X_\tau\in A\} = \bigcup_{k\le n}\{\tau=k\}\cap\{X_k\in A\} = \bigcup_{k\le n}\Bigl(\{\tau\le k\}\setminus\{\tau\le k-1\}\Bigr)\cap\{X_k\in A\} \in \mathcal B_n.$$
Similarly,
$$\{\tau\le k\}\cap\{\tau\le n\} = \{\tau\le k\wedge n\}\in\mathcal B_{k\wedge n}\subseteq\mathcal B_n.$$
5. If $\tau_1$ and $\tau_2$ are stopping times then the sum $\tau_1+\tau_2$ is a stopping time. This is true because
$$\{\tau_1+\tau_2\le n\} = \bigcup_{k=0}^n \{\tau_1=k\}\cap\{\tau_2\le n-k\}\in\mathcal B_n.$$
For example, if $\tau_1=0$ and $A=\Omega$ then $EX_{\tau_2}=EX_0$ if the condition (5.3) is satisfied. For example, if the stopping time $\tau_2$ is bounded then (5.3) is obviously satisfied. Let us note that without some kind of integrability condition, Theorem 5.2 cannot hold, as the following example shows.

Example 5.2.1. Consider independent $(X_n)_{n\ge1}$ such that $P(X_n=\pm 2^n)=1/2$. If $\mathcal B_n=\sigma(X_1,\dots,X_n)$ then $(S_n,\mathcal B_n)$ is a martingale. Let $\tau_1=1$ and $\tau_2=\min\{k\ge1 : S_k>0\}$. Clearly, $S_{\tau_2}=2$, because if $\tau_2=k$ then
$$S_{\tau_2} = S_k = -2 - 2^2 - \dots - 2^{k-1} + 2^k = 2.$$
However, $2 = ES_{\tau_2}\ne ES_1 = 0$. Notice that the second condition in (5.3) is violated, since
$$P(\tau_2=n) = 2^{-n}, \qquad P(\tau_2\ge n+1) = \sum_{k\ge n+1} 2^{-k} = 2^{-n}$$
and, therefore,
$$E|S_m|\,I(\tau_2>m) = (2^{m+1}-2)\,2^{-m} \to 2 \ne 0.$$
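Example 5.2.1 is easy to reproduce by simulation: every stopped path ends exactly at $S_{\tau_2}=2$, even though the martingale starts at mean zero. A sketch (the cutoff of 60 steps is a practical safeguard; the walk stops much sooner with overwhelming probability):

```python
import numpy as np

rng = np.random.default_rng(0)

def run(max_n=60):
    """One path of S_n with X_n = ±2^n; stop at the first time S_k > 0."""
    s = 0
    for k in range(1, max_n + 1):
        s += 2**k if rng.random() < 0.5 else -(2**k)
        if s > 0:
            return k, s
    return max_n, s          # practically never reached (prob. 2^-max_n)

taus, stops = zip(*(run() for _ in range(20000)))
assert all(s == 2 for s in stops)          # S_tau = 2 on every path
p1 = np.mean(np.array(taus) == 1)
print(p1)                                  # close to P(tau = 1) = 1/2
```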
Proof (of Theorem 5.2). Consider a set $A\in\mathcal B_{\tau_1}$. The result is based on the following formal computation, with the middle step (*) proved below:
$$EX_{\tau_2}I_AI(\tau_1\le\tau_2) = \sum_{n\ge1} EX_{\tau_2}I_{A\cap\{\tau_1=n\}}I(n\le\tau_2) \stackrel{(*)}{=} \sum_{n\ge1} EX_nI_{A\cap\{\tau_1=n\}}I(n\le\tau_2) = EX_{\tau_1}I_AI(\tau_1\le\tau_2).$$
To prove (*), denote $A_n = A\cap\{\tau_1=n\}\in\mathcal B_n$. We begin by writing
$$EX_nI_{A_n}I(n\le\tau_2) = EX_nI_{A_n}I(\tau_2=n) + EX_nI_{A_n}I(n<\tau_2) = EX_{\tau_2}I_{A_n}I(\tau_2=n) + EX_nI_{A_n}I(n<\tau_2).$$
Since $\{n<\tau_2\} = \{\tau_2\le n\}^c\in\mathcal B_n$ and $(X_n,\mathcal B_n)$ is a martingale, the last term
$$EX_nI_{A_n}I(n<\tau_2) = EX_{n+1}I_{A_n}I(n<\tau_2) = EX_{n+1}I_{A_n}I(n+1\le\tau_2)$$
and, therefore, iterating this for $m>n$,
$$EX_nI_{A_n}I(n\le\tau_2) = EX_{\tau_2}I_{A_n}I(n\le\tau_2<m) + EX_mI_{A_n}I(m\le\tau_2). \qquad (5.5)$$
It remains to let $m\to\infty$ and use the assumptions in (5.3). By the second assumption, the last term
$$\bigl|EX_mI_{A_n}I(m\le\tau_2)\bigr| \le E|X_m|I(m\le\tau_2) \to 0$$
as $m\to\infty$. Since the first assumption allows us to apply the dominated convergence theorem, the first term converges to $EX_{\tau_2}I_{A_n}I(n\le\tau_2)$, which proves (*),
which gives a clean condition to check if we would like to show that $EX_\tau = EX_0$.
Example 5.2.3 (Hitting times of simple random walk). Given $p\in(0,1)$, consider i.i.d. random variables $(X_i)_{i\ge1}$ such that
$$P(X_i=1) = p, \qquad P(X_i=-1) = 1-p =: q,$$
and consider a random walk $S_{n+1}=S_n+X_{n+1}$ starting with $S_0=0$. Consider two integers $a\ge1$ and $b\ge1$ and the stopping time
$$\tau = \min\bigl\{k\ge1 : S_k = a \text{ or } -b\bigr\}.$$
If $p\ne q$ then $(q/p)^{S_n}$ is a martingale and the optional stopping theorem gives
$$P(S_\tau=-b) = \frac{(q/p)^a - 1}{(q/p)^a - (q/p)^{-b}}.$$
If $q=p=1/2$ then $S_n$ itself is a martingale and the equation (5.6) implies that
$$0 = -b\,P(S_\tau=-b) + a\bigl(1-P(S_\tau=-b)\bigr) \quad\text{and}\quad P(S_\tau=-b) = \frac{a}{a+b}.$$
One can also compute $E\tau$, which we will also leave as an exercise.
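A quick Monte Carlo check of the exit probability in the symmetric case $p=1/2$, with barriers at $a$ and $-b$ (the values $a=3$, $b=5$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 3, 5                        # barriers at +a and -b (arbitrary)

def exit_point():
    s = 0
    while s != a and s != -b:
        s += 1 if rng.random() < 0.5 else -1
    return s

hits = np.array([exit_point() for _ in range(20000)])
p_bottom = np.mean(hits == -b)
print(p_bottom)                    # close to a/(a+b) = 0.375
```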
$$\lim_{n\to\infty} E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}\,I(\tau\le n) = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}\,I(\tau<\infty)$$
by the monotone convergence theorem. Since $Y_0=1$, (5.6) implies in this case that
$$1 = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}\,I(\tau<\infty) \iff \lim_{n\to\infty} E\,\frac{e^{\lambda S_n}}{\varphi(\lambda)^n}\,I(\tau>n) = 0. \qquad (5.7)$$
The left hand side is called the fundamental Wald identity. In some cases one can use this to compute the Laplace transform of a stopping time and, thus, its distribution, as we will see below.
Notice that in the case when $\tau<\infty$ almost surely and (either side of) the equation (5.7) holds, we can write
$$1 = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}.$$
Computing formally the derivative in $\lambda$ at $\lambda=0$, we get $ES_\tau = EX_1\,E\tau$, which is the Wald identity we proved in (2.4), so (5.7) is a generalization of that identity. Of course, this is only a formal generalization, because here we make a stronger assumption on the existence of the Laplace transform in a neighbourhood of zero, while the identity $ES_\tau = EX_1\,E\tau$ was proved only under the assumption that the expectations $EX_1$ and $E\tau$ are well defined.
For the symmetric simple random walk and an integer $z>0$, consider the stopping time
$$\tau = \min\{k : S_k = z \text{ or } -z\}.$$
Since $|S_n|\le z$ on $\{\tau>n\}$ and $\varphi(\lambda)=\operatorname{ch}(\lambda)$,
$$E\,\frac{e^{\lambda S_n}}{\varphi(\lambda)^n}\,I(\tau>n) \le \frac{e^{|\lambda|z}}{\operatorname{ch}(\lambda)^n}\,P(\tau>n) \to 0,$$
and the right hand side of (5.7) holds. Therefore,
$$1 = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = e^{\lambda z}\,E\,\operatorname{ch}(\lambda)^{-\tau}I(S_\tau=z) + e^{-\lambda z}\,E\,\operatorname{ch}(\lambda)^{-\tau}I(S_\tau=-z) = \frac{e^{\lambda z}+e^{-\lambda z}}{2}\,E\,\operatorname{ch}(\lambda)^{-\tau}$$
by symmetry. Therefore,
$$E\,\operatorname{ch}(\lambda)^{-\tau} = \frac{1}{\operatorname{ch}(\lambda z)} \quad\text{and}\quad E\,e^{\gamma\tau} = \frac{1}{\operatorname{ch}\bigl(\operatorname{ch}^{-1}(e^{-\gamma})\,z\bigr)}$$
by the change of variables $e^{\gamma} = 1/\operatorname{ch}\lambda$. ⊓⊔
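The formula $E\operatorname{ch}(\lambda)^{-\tau} = 1/\operatorname{ch}(\lambda z)$ can be checked by simulation; the parameters $z=3$, $\lambda=0.5$ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
z, lam = 3, 0.5                    # arbitrary parameters

def tau():
    s, k = 0, 0
    while abs(s) < z:              # stop when the walk hits +z or -z
        s += 1 if rng.random() < 0.5 else -1
        k += 1
    return k

samples = np.array([tau() for _ in range(20000)], dtype=float)
lhs = np.mean(np.cosh(lam) ** (-samples))
rhs = 1.0 / np.cosh(lam * z)
print(lhs, rhs)                    # agree up to Monte Carlo error
```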
For more general stopping times the condition on the right hand side of (5.7) might not be easy to check. We will now consider another point of view that may sometimes be helpful in verifying the fundamental Wald identity. If $P$ is the distribution of $X_1$, let $P_\lambda$ be the distribution with the Radon-Nikodym derivative with respect to $P$ given by
$$\frac{dP_\lambda}{dP} = \frac{e^{\lambda x}}{\varphi(\lambda)}.$$
This is, indeed, a density with respect to $P$, since
$$\int_{\mathbb R} \frac{e^{\lambda x}}{\varphi(\lambda)}\,dP = \frac{\varphi(\lambda)}{\varphi(\lambda)} = 1.$$
For convenience of notation, we will think of $(X_n)$ as the coordinates on the product space $(\mathbb R^\infty,\mathcal B^\infty,P^{\otimes\infty})$ with the cylindrical $\sigma$-algebra. Therefore, $\{\tau=n\}\in\sigma(X_1,\dots,X_n)\subseteq\mathcal B^n$ is a Borel set in $\mathbb R^n$. We can write
$$E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}\,I(\tau<\infty) = \sum_{n=1}^\infty E\,\frac{e^{\lambda S_n}}{\varphi(\lambda)^n}\,I(\tau=n) = \sum_{n=1}^\infty \int_{\{\tau=n\}} \frac{e^{\lambda(x_1+\dots+x_n)}}{\varphi(\lambda)^n}\,dP(x_1)\cdots dP(x_n)$$
$$= \sum_{n=1}^\infty \int_{\{\tau=n\}} dP_\lambda(x_1)\cdots dP_\lambda(x_n) = P_\lambda^{\otimes\infty}(\tau<\infty).$$
This means that we can think of the random variables $X_n$ as having the distribution $P_\lambda$ and, to prove Wald's identity, we need to show that $\tau<\infty$ with probability one.
Example 5.2.6 (Crossing a growing boundary). Suppose that we have a boundary given by some sequence $f:\mathbb N\to\mathbb R$ and a stopping time (crossing time)
$$\tau = \min\bigl\{k : S_k \ge f(k)\bigr\}.$$
To verify Wald's identity in (5.7), we need to show that $\tau<\infty$ with probability one, assuming now that the random variables $X_n$ have the distribution $P_\lambda$. Under this distribution, the random variables $X_n$ have the expectation
$$E_\lambda X_i = \int x\,\frac{e^{\lambda x}}{\varphi(\lambda)}\,dP(x) = \frac{\varphi'(\lambda)}{\varphi(\lambda)}.$$
By the strong law of large numbers,
$$\lim_{n\to\infty} \frac{S_n}{n} = \frac{\varphi'(\lambda)}{\varphi(\lambda)}$$
and, therefore, if the growth of the crossing boundary satisfies
5.2 Stopping times, optional stopping theorem 159
$$\limsup_{n\to\infty} \frac{f(n)}{n} < \frac{\varphi'(\lambda)}{\varphi(\lambda)} \qquad (5.8)$$
then, obviously, the random walk $S_n$ will cross it with probability one, in which case $P_\lambda^{\otimes\infty}(\tau<\infty)=1$. By Hölder's inequality, $\varphi(\lambda)$ is log-convex and, therefore, $\varphi'(\lambda)/\varphi(\lambda)$ is increasing, which means that if this condition holds for $\lambda_0$ then it holds for $\lambda>\lambda_0$. Also, note that, for $\lambda\in(\lambda_-,\lambda_+)$, the derivative of $\varphi'(\lambda)/\varphi(\lambda)$ is the variance of $X\sim P_\lambda$, so this function is actually strictly increasing unless the random variable $X\sim P$ is constant. ⊓⊔

In the next two examples, we will consider a constant boundary $f(k)=z$ for some integer $z>0$. In this case, the condition (5.8) becomes $\varphi'(\lambda)>0$. We will consider the case when $\varphi'(0)=EX_1>0$, which means that the random walk has a positive drift and, in particular, $\tau<\infty$ with probability one.
$$1 = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = e^{\lambda z}\,E\,\frac{1}{\varphi(\lambda)^\tau},$$
because in this case $S_\tau = z$. As in the example of the symmetric walk, this determines the Laplace transform $Ee^{\gamma\tau}$ on some interval of $\gamma$ and determines the distribution of $\tau$. ⊓⊔
Example 5.2.8 (Geometric case). Let us now consider the case when the $X_i$ are i.i.d. geometric random variables and, for some $p\in(0,1)$,
$$P(X_i \ge k) = (1-p)^{k-1} \quad\text{for } k=1,2,3,\dots.$$
In this case, one can show that $\tau$ and $S_\tau$ are independent and $S_\tau \stackrel{d}{=} z-1+X_1$. Indeed, for any $y\ge z$ and $n\ge1$, we can write
$$P(S_\tau\ge y,\tau=n) = \sum_{x<z} P(S_n\ge y, S_{n-1}=x, \tau=n) = \sum_{x<z} P(X_n\ge y-x, S_{n-1}=x, \tau\ge n).$$
By independence and the form of the geometric tail,
$$P(S_\tau\ge y,\tau=n) = \sum_{x<z} (1-p)^{y-x-1}\,P(S_{n-1}=x,\tau\ge n)$$
$$= (1-p)^{y-z}\sum_{x<z} (1-p)^{z-x-1}\,P(S_{n-1}=x,\tau\ge n)$$
$$= (1-p)^{y-z}\sum_{x<z} P(X_n\ge z-x, S_{n-1}=x, \tau\ge n)$$
$$= (1-p)^{y-z}\,P(S_\tau\ge z,\tau=n) = (1-p)^{y-z}\,P(\tau=n).$$
This proves the above claim. Together with the fundamental Wald identity this shows that
$$1 = E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau} = E\,e^{\lambda S_\tau}\,E\,\frac{1}{\varphi(\lambda)^\tau} = e^{\lambda(z-1)}\varphi(\lambda)\,E\,\frac{1}{\varphi(\lambda)^\tau}.$$
The Laplace transform of the geometric distribution is
$$\varphi(\lambda) = \frac{pe^{\lambda}}{1-(1-p)e^{\lambda}} \quad\text{for } \lambda < -\log(1-p),$$
and we again determined the Laplace transform of $\tau$ on some interval. ⊓⊔
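The two claims of Example 5.2.8, that the overshoot $S_\tau - z$ is again geometric (shifted) and independent of $\tau$, can be checked by simulation. Parameters are arbitrary, and independence is only tested crudely, through the sample correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
p, z = 0.3, 10                               # arbitrary parameters

def cross():
    s, k = 0, 0
    while s < z:
        s += rng.geometric(p)                # P(X >= k) = (1-p)^(k-1)
        k += 1
    return k, s

data = np.array([cross() for _ in range(20000)])
taus, S_tau = data[:, 0], data[:, 1]

over = S_tau - z                             # overshoot, claimed Geometric(p) - 1
assert abs(over.mean() - (1 - p) / p) < 0.1  # E[X_1 - 1] = (1-p)/p
assert abs(np.corrcoef(taus, over)[0, 1]) < 0.05   # crude independence check
print(over.mean())
```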
Exercise 5.2.3. Let $(X_n,\mathcal B_n)_{n\ge0}$ be a martingale and $\tau$ a stopping time such that $P(\tau<\infty)=1$. Suppose that $E|X_{\tau\wedge n}|^{1+\delta}\le c$ for some $c,\delta>0$ and for all $n$. Prove that $EX_\tau = EX_0$.

Exercise 5.2.4. Given $0<p<1$, consider i.i.d. random variables $(X_i)_{i\ge1}$ such that $P(X_i=1)=p$, $P(X_i=-1)=1-p$ and consider a random walk $S_0=0$, $S_{n+1}=S_n+X_{n+1}$. Consider two integers $a\ge1$ and $b\ge1$ and define a stopping time $\tau = \min\{k\ge1 : S_k = a \text{ or } -b\}$.

Exercise 5.2.6. Let $X_0=0$, $P(X_i=\pm1)=1/2$ and $S_n=\sum_{k\le n}X_k$. Let $\tau=\min\{k : S_k\ge 0.95k+100\}$. Prove that
$$E\,\frac{e^{\lambda S_\tau}}{\varphi(\lambda)^\tau}\,I(\tau<\infty) = 1$$
for $\lambda$ large enough.
Let us average this inequality over the set $A = \{Y_n = \max_{1\le k\le n} X_k \ge M\}$, which belongs to $\mathcal B_{\tau_1}$, because
$$A\cap\{\tau_1\le k\} = \Bigl\{\max_{1\le i\le k\wedge n} X_i \ge M\Bigr\}\in\mathcal B_{k\wedge n}\subseteq\mathcal B_k.$$
This is precisely the first inequality in (5.9). The second inequality is obvious. ⊓⊔
Given $a<b$, define recursively $\tau_1=\min\{j : X_j\le a\}$, $\tau_{2k}=\min\{j>\tau_{2k-1} : X_j\ge b\}$, $\tau_{2k+1}=\min\{j>\tau_{2k} : X_j\le a\}$, and let $\nu(a,b,n)=\max\{k : \tau_{2k}\le n\}$, which is the number of upward crossings of the interval $[a,b]$ before time $n$.

Theorem (Doob's upcrossing inequality). If $(X_n,\mathcal B_n)$ is a submartingale then
$$E\nu(a,b,n) \le \frac{E(X_n-a)^+}{b-a}. \qquad (5.10)$$
Proof. Since $x\to(x-a)^+$ is an increasing convex function, $Z_n=(X_n-a)^+$ is also a submartingale. Clearly, the number of upcrossings of the interval $[a,b]$ by $(X_n)$ can be expressed in terms of the number of upcrossings of $[0,b-a]$ by $(Z_n)$, which means that it is enough to prove (5.10) for nonnegative submartingales. From now on we can assume that $0\le X_n$ and we would like to show that
$$E\nu(0,b,n) \le \frac{EX_n}{b}.$$
Let us define a sequence of random variables $\eta_j$ for $j\ge1$ by
$$\eta_j = \begin{cases} 1, & \tau_{2k-1}<j\le\tau_{2k} \text{ for some } k,\\ 0, & \text{otherwise},\end{cases}$$
i.e. $\eta_j$ is the indicator of the event that at time $j$ the process is crossing $[0,b]$ upward. Define $X_0=0$. Then, clearly,
$$b\,\nu(0,b,n) \le \sum_{j=1}^n \eta_j(X_j-X_{j-1}) = \sum_{j=1}^n I(\eta_j=1)(X_j-X_{j-1}).$$
Notice that $\{\eta_j=1\}\in\mathcal B_{j-1}$, i.e. the fact that at time $j$ we are crossing upward is determined completely by the sequence up to time $j-1$. Then
$$b\,E\nu(0,b,n) \le \sum_{j=1}^n E\,I(\eta_j=1)(X_j-X_{j-1}) = \sum_{j=1}^n E\Bigl[I(\eta_j=1)\,E\bigl(X_j-X_{j-1}\,\big|\,\mathcal B_{j-1}\bigr)\Bigr]$$
$$= \sum_{j=1}^n E\Bigl[I(\eta_j=1)\bigl(E(X_j|\mathcal B_{j-1})-X_{j-1}\bigr)\Bigr].$$
Since $E(X_j|\mathcal B_{j-1})-X_{j-1}\ge0$ for a submartingale, dropping the indicator only increases the sum. Therefore,
$$b\,E\nu(0,b,n) \le \sum_{j=1}^n E\bigl(E(X_j|\mathcal B_{j-1})-X_{j-1}\bigr) = \sum_{j=1}^n E(X_j-X_{j-1}) = EX_n.$$
This finishes the proof for nonnegative submartingales and, therefore, the general case. ⊓⊔
Finally, we will now use Doob's upcrossing inequality to prove our main result about the convergence of submartingales, from which the convergence of martingales will follow.
where the union is taken over all rational numbers $a<b$. This means that the probability $P(X_n \text{ diverges})>0$ if and only if there exist rational $a<b$ such that
$$P\Bigl(\liminf_{n\to\infty}X_n < a < b < \limsup_{n\to\infty}X_n\Bigr) > 0.$$
If we recall that $\nu(a,b,n)$ denotes the number of upcrossings of $[a,b]$, this is equivalent to $P\bigl(\lim_{n\to\infty}\nu(a,b,n)=\infty\bigr)>0$.
5.3 Doob’s inequalities and convergence of martingales 165
By Doob's inequality, the limit
$$X_{-\infty} = \lim_{n\to-\infty} X_n$$
exists with probability one. By Fatou's lemma and the submartingale property, $E|X_{-\infty}|<\infty$. It remains to show that $(X_n,\mathcal B_n)_{-\infty\le n<\infty}$ is a submartingale. First of all, the limit $X_{-\infty}=\lim_{n\to-\infty}X_n$ is measurable with respect to $\mathcal B_n$ for all $n$ and, therefore, measurable with respect to $\mathcal B_{-\infty}=\cap\,\mathcal B_n$. Let us take a set $A\in\cap\,\mathcal B_n$. We would like to show that, for any $m$, $EX_{-\infty}I_A \le EX_mI_A$. Since $(X_n,\mathcal B_n)_{n\le0}$ is a right-closed submartingale, by part 2 of Lemma 5.3 in Section 5.1, $(X_n\vee a)_{-\infty<n\le0}$ is uniformly integrable for any $a\in\mathbb R$. Since $X_{-\infty}\vee a = \lim_{n\to-\infty}X_n\vee a$ almost surely, by Lemma 5.1 of Section 5.1, $E(X_n\vee a)I_A \to E(X_{-\infty}\vee a)I_A$.
Therefore, $E\nu(a,b)<\infty$ for $\nu(a,b)=\lim_{n\to\infty}\nu(a,b,n)$ and, as above, the limit $X_{+\infty}=\lim_{n\to+\infty}X_n$ exists. Since all $X_n$ are measurable with respect to $\mathcal B_{+\infty}$, so is $X_{+\infty}$. Finally, by Fatou's lemma,
$$EX_{+\infty}^+ \le \liminf_{n\to\infty} EX_n^+ < \infty.$$
Notice that a different assumption, $\sup_n E|X_n|<\infty$, would similarly imply that $E|X_{+\infty}| \le \liminf_{n\to\infty} E|X_n| < \infty$.
Proof of 3. By 2, the limit $X_{+\infty}$ exists. We want to show that for any $m$ and for any set $A\in\mathcal B_m$,
$$EX_mI_A \le EX_{+\infty}I_A.$$
We already mentioned above that $(X_n\vee a)$ is a submartingale and, therefore, $E(X_m\vee a)I_A \le E(X_n\vee a)I_A$ for $m\le n$. Clearly, if $(X_n^+)_{-\infty<n<\infty}$ is uniformly integrable then $(X_n\vee a)_{-\infty<n<\infty}$ is uniformly integrable for all $a\in\mathbb R$. Since $X_n\vee a\to X_{+\infty}\vee a$ almost surely, by Lemma 5.1,
$$E(X_n\vee a)I_A \to E(X_{+\infty}\vee a)I_A,$$
and this shows that $E(X_m\vee a)I_A \le E(X_{+\infty}\vee a)I_A$. Letting $a\to-\infty$ and using the monotone convergence theorem we get that $EX_mI_A \le EX_{+\infty}I_A$. ⊓⊔
Convergence of martingales. If $(X_n,\mathcal B_n)$ is a martingale then both $(X_n,\mathcal B_n)$ and $(-X_n,\mathcal B_n)$ are submartingales and one can apply the above theorem to both of them. For example, this implies the following.
1. If $\sup_n E|X_n|<\infty$ then the a.s. limit $X_{+\infty}=\lim_{n\to+\infty}X_n$ exists and $E|X_{+\infty}|<\infty$.
2. If $(X_n)_{-\infty<n<\infty}$ is uniformly integrable then $(X_n,\mathcal B_n)_{-\infty<n\le+\infty}$ is a right-closed martingale.
In other words, a martingale is right-closable if and only if it is uniformly integrable. Of course, in this case we also conclude that $\lim_{n\to+\infty}E|X_n - X_{+\infty}|=0$. For a martingale of the type $E(X|\mathcal B_n)$ we can identify the limit as follows.
Theorem 5.6 (Levy's convergence theorem). Let $(\Omega,\mathcal B,P)$ be a probability space and $X:\Omega\to\mathbb R$ be a random variable such that $E|X|<\infty$. Given a sequence of $\sigma$-algebras
$$\mathcal B_1\subseteq\dots\subseteq\mathcal B_n\subseteq\dots\subseteq\mathcal B_{+\infty}\subseteq\mathcal B,$$
where $\mathcal B_{+\infty}=\sigma\bigl(\bigcup_{1\le n<\infty}\mathcal B_n\bigr)$, we have $X_n=E(X|\mathcal B_n)\to E(X|\mathcal B_{+\infty})$ a.s.

Proof. $(X_n,\mathcal B_n)_{1\le n<\infty}$ is a right-closable martingale, because $X_n=E(X|\mathcal B_n)$. Therefore, it is uniformly integrable and the a.s. limit $X_{+\infty}:=\lim_{n\to+\infty}E(X|\mathcal B_n)$ exists. It remains to show that $X_{+\infty}=E(X|\mathcal B_{+\infty})$. Consider $A\in\bigcup_n\mathcal B_n$. Since $A\in\mathcal B_m$ for some $m$ and $(X_n)$ is a martingale, $EX_nI_A = EX_mI_A$ for $n\ge m$. Therefore,
$$E(X_{n+1}\,|\,X_1,\dots,X_n) = 0,$$
then $S_n/b_n\to0$ almost surely. Indeed, $Y_n=\sum_{k\le n}(X_k/b_k)$ is a martingale and $(Y_n)$ is uniformly integrable, since
$$E|Y_n|\,I(|Y_n|>M) \le \frac1M\,E|Y_n|^2 = \frac1M\sum_{k=1}^n \frac{EX_k^2}{b_k^2},$$
so that
$$\sup_n E|Y_n|\,I(|Y_n|>M) \le \frac1M\sum_{k=1}^\infty \frac{EX_k^2}{b_k^2} \to 0$$
as $M\to\infty$.
Exercise 5.3.1. Let $(X_n)_{n\ge1}$ be a martingale with $EX_n=0$ and $EX_n^2<\infty$ for all $n$. Show that
$$P\Bigl(\max_{1\le k\le n} X_k \ge M\Bigr) \le \frac{EX_n^2}{EX_n^2+M^2}.$$
Hint: $(X_n+c)^2$ is a submartingale for any $c>0$.
Exercise 5.3.2. Let $(g_i)$ be i.i.d. standard normal random variables and let $X_n=\sum_{i=1}^n \sigma_i^2(g_i^2-1)$ for some numbers $(\sigma_i)$. Prove that, for $t>0$,
$$P\Bigl(\max_{1\le i\le n}|X_i|\ge t\Bigr) \le \frac{2}{t^2}\sum_{i=1}^n \sigma_i^4.$$
Exercise 5.3.4. Let $X,Y$ be two non-negative random variables such that for every $t>0$,
$$P(Y\ge t) \le \frac1t\int X\,I(Y\ge t)\,dP.$$
For any $p>1$, $\|f\|_p=(\int|f|^p\,dP)^{1/p}$ and $1/p+1/q=1$, show that $\|Y\|_p\le q\|X\|_p$. Hint: use the previous exercise, switch the order of integration to integrate in $t$ first, then use Hölder's inequality and solve for $\|Y\|_p$.

Exercise 5.3.5. Given a non-negative submartingale $(X_n,\mathcal B_n)$, let $X_n^*:=\max_{j\le n}X_j$. Prove that for any $p>1$ and $1/p+1/q=1$, $\|X_n^*\|_p\le q\|X_n\|_p$. Hint: use the previous exercise and Doob's maximal inequality.
Exercise 5.3.6 (Polya's urn model). Suppose we have $b$ blue and $r$ red balls in the urn. We pick a ball randomly and return it together with $c$ balls of the same color. Let us consider the sequence
$$Y_n = \frac{\#(\text{blue balls after } n \text{ iterations})}{\#(\text{total after } n \text{ iterations})}.$$
Prove that the limit $Y=\lim_{n\to\infty}Y_n$ exists almost surely.
Exercise 5.3.9. Let $(X_n)_{n\ge1}$ be i.i.d. random variables and, given a bounded measurable function $f:\mathbb R^2\to\mathbb R$ such that $f(x,y)=f(y,x)$, consider the U-statistics
$$S_{-n} = \frac{2}{n(n-1)}\sum_{1\le\ell<\ell'\le n} f(X_\ell,X_{\ell'}).$$
Chapter 6
Brownian Motion
A stochastic process is a family of random variables
$$X_t(\omega) = X(t,\omega) : T\times\Omega\to\mathbb R.$$
Its finite dimensional distributions $P_F$, indexed by finite subsets $F\subseteq T$, satisfy the consistency condition: for $F_1\subseteq F_2$, $P_{F_1}$ is the image of $P_{F_2}$ under the canonical projection $\mathbb R^{F_2}\to\mathbb R^{F_1}$. Finite dimensional distributions do not determine the process uniquely, as the following example shows.

Example 6.1.1. Let $T=[0,1]$, $(\Omega,P)=([0,1],\lambda)$ where $\lambda$ is the Lebesgue measure. Let $X_t(\omega)=I(t=\omega)$ and $X_t'(\omega)=0$. The finite dimensional distributions of these processes are the same because, for any fixed $t\in[0,1]$,
$$P(X_t=0)=P(X_t'=0)=1.$$
A process $X_t$ is called measurable if
$$X_t(\omega) : T\times\Omega\to\mathbb R$$
is jointly measurable on the product space $(T,\mathcal B)\times(\Omega,\mathcal F)$, where $\mathcal B$ is the Borel $\sigma$-algebra on $T$.
Lemma 6.1. If $(T,d)$ is a separable metric space and $X_t$ is sample continuous then $X_t$ is measurable.

Proof. Take a sequence $(t_j)$ dense in $T$ and, for each $n\ge1$, partition $T$ into disjoint sets $S_j$ of diameter at most $1/n$ around the points $t_j$; define $X_t^n(\omega)=X_{t_j}(\omega)$ for $t\in S_j$. The process $X_t^n(\omega)$ is, obviously, measurable on $T\times\Omega$, because for any Borel set $A$ on $\mathbb R$,
$$\bigl\{(\omega,t) : X_t^n(\omega)\in A\bigr\} = \bigcup_{j\ge1} S_j\times\bigl\{\omega : X_{t_j}(\omega)\in A\bigr\}.$$
Since $X_t(\omega)$ is sample continuous, $X_t^n(\omega)\to X_t(\omega)$ as $n\to\infty$ for all $(\omega,t)$. Hence, $X_t$ is also measurable. ⊓⊔
Proof. Let us first show that $\mathcal S_T\subseteq\mathcal B$. Any element of the cylindrical algebra that generates the cylindrical $\sigma$-algebra $\mathcal B^T$ is of the form $B\times\mathbb R^{T\setminus F}$ for a finite $F\subseteq T$ and a Borel set $B\subseteq\mathbb R^F$. Then
$$\bigl(B\times\mathbb R^{T\setminus F}\bigr)\cap C(T,d) = \bigl\{x\in C(T,d) : (x_t)_{t\in F}\in B\bigr\} = \bigl\{\pi_F(x)\in B\bigr\},$$
where $\pi_F$ is the coordinate projection.
In the remainder of the section we will define two specific sample continuous
stochastic processes.
Brownian motion. Brownian motion is defined as a sample continuous process $X_t$ on $T=\mathbb R_+$ such that
(a) the distribution of $X_t$ is a centered Gaussian for each $t\ge0$;
(b) $X_0=0$ and $EX_1^2=1$;
(c) if $t<s$ then $\mathcal L(X_s-X_t)=\mathcal L(X_{s-t})$;
(d) for any $t_1<\dots<t_n$, the increments $X_{t_1}-X_0,\dots,X_{t_n}-X_{t_{n-1}}$ are independent.
If we denote $\sigma^2(t)=\mathrm{Var}(X_t)$ then these properties imply
$$\sigma^2(nt)=n\sigma^2(t), \qquad \sigma^2\Bigl(\frac tm\Bigr)=\frac1m\,\sigma^2(t) \qquad\text{and}\qquad \sigma^2(qt)=q\sigma^2(t)$$
for all rational $q$. Since $\sigma^2(1)=1$, $\sigma^2(q)=q$ for all rational $q$ and, by the sample continuity, $\sigma^2(t)=t$ for all $t\ge0$. For any $t_1<\dots<t_n$, the increments $X_{t_{j+1}}-X_{t_j}$ are independent Gaussian $N(0,t_{j+1}-t_j)$ and, therefore, jointly Gaussian. This implies that the vector $(X_{t_1},\dots,X_{t_n})$ also has a jointly Gaussian distribution, since it can be written as a linear map of the increments. The covariance matrix can be easily computed, since for $s<t$,
$$EX_sX_t = EX_s\bigl(X_s+(X_t-X_s)\bigr) = EX_s^2 = s = \min(s,t).$$
Lemma 6.3. The tail probability of the standard normal distribution satisfies
$$\overline{\Phi}(c) = \frac{1}{\sqrt{2\pi}}\int_c^{\infty} e^{-x^2/2}\,dx \le e^{-c^2/2}$$
for all $c\ge0$.
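Lemma 6.3 is easy to confirm numerically with a crude quadrature of the Gaussian tail (the truncation at 12 and the step count below are arbitrary choices):

```python
import math

def normal_tail(c, steps=50000, upper=12.0):
    """Midpoint-rule integral of the standard normal density on [c, upper]."""
    h = (upper - c) / steps
    return h * sum(math.exp(-(c + (i + 0.5) * h)**2 / 2)
                   for i in range(steps)) / math.sqrt(2 * math.pi)

for c in [0.0, 0.5, 1.0, 2.0, 3.0]:
    assert normal_tail(c) <= math.exp(-c * c / 2)   # Lemma 6.3
print(normal_tail(1.0))                             # ~0.1587
```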
Proof. It is enough to construct $X_t$ on the interval $[0,1]$. Given a process $X_t$ that has the finite dimensional distributions of the Brownian process, but is not necessarily continuous, let us define, for $n\ge1$,
6.1 Stochastic processes. Brownian motion 175
$$V_k = X_{\frac{k+1}{2^n}} - X_{\frac{k}{2^n}} \quad\text{for } k=0,\dots,2^n-1.$$
The sequence of piecewise linear interpolations $X_t^{(n)}$ over the dyadic points $k/2^n$ converges almost surely to some limit $Z_t$ because, by (6.1), with probability one, the differences $\sup_t|X_t^{(n)}-X_t^{(n-1)}|$ are summable over $n$, which proves the continuity of $Z_t$. On the event in (6.1) of probability zero we set $Z_t=0$. ⊓⊔
Notice that $B_0 = B_1 = 0$. ⊓⊔
is a Brownian motion on $[0,1]$. Hint: to show that $EW_tW_s=\min(t,s)$, use Parseval's identity. To prove continuity, show that
$$\varepsilon_n = \sup_{0\le t\le1}\Bigl|\sum_{k\in I_n} g_k\int_0^t \phi_k(x)\,dx\Bigr| \le 2^{-(n+1)/2}\max_{k\in I_n}|g_k|$$
satisfies $P\bigl(\varepsilon_n\ge 2\sqrt{2^{-n}\ln 2^n}\bigr)\le 8^{-n}$ and use the Borel-Cantelli lemma.
In this section we will show how Brownian motion $W_t$ arises in a classical central limit theorem on the space of continuous functions on $\mathbb R_+$. When working with continuous processes defined on $\mathbb R_+$, such as the Brownian motion, the metric $\|\cdot\|_\infty$ on $C(\mathbb R_+)$ is too strong. A more appropriate metric $d$ can be defined by
$$d(f,g) = \sum_{n\ge1} \frac{1}{2^n}\,\frac{d_n(f,g)}{1+d_n(f,g)}, \qquad\text{where } d_n(f,g) = \sup_{0\le t\le n}|f(t)-g(t)|.$$
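The metric $d$ can be implemented directly, truncating the series at some $n_{\max}$ (the neglected tail is at most $2^{-n_{\max}}$); the sketch below also approximates each supremum $d_n$ on a grid, so it is only a numerical approximation:

```python
import numpy as np

def d(f, g, n_max=20, pts_per_unit=1000):
    """d(f,g) = sum_n 2^{-n} d_n/(1+d_n), d_n = sup over [0,n] of |f-g|,
    truncated at n_max; each d_n is approximated on a grid."""
    total = 0.0
    for n in range(1, n_max + 1):
        t = np.linspace(0.0, n, n * pts_per_unit + 1)
        dn = np.max(np.abs(f(t) - g(t)))
        total += 2.0**-n * dn / (1 + dn)
    return total

f = np.sin
g = lambda t: np.sin(t) + 0.01
assert d(f, g) < 0.01          # uniformly close functions are d-close
assert d(f, np.cos) <= 1.0     # d is always bounded by 1
print(d(f, g))
```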
Given a compact set $K\subseteq C(\mathbb R_+)$ and $n\ge1$, denote by
$$K_n = \bigl\{ f\big|_{[0,n]} : f\in K \bigr\}$$
the restriction of $K$ to the interval $[0,n]$. Then it is clear that $K$ is compact with respect to $d$ if and only if each $K_n$ is compact with respect to $d_n$. Therefore, using the Arzela-Ascoli theorem to characterize compact sets in $C[0,n]$ we get the following.
For a function $x\in C[0,T]$, let us denote its modulus of continuity by
$$m_T(x,\delta) = \sup\bigl\{ |x_a-x_b| : |a-b|\le\delta,\ a,b\in[0,T] \bigr\}.$$
This implies the following criterion for the uniform tightness of laws on the metric space $(C(\mathbb R_+),d)$, which is simply a translation of the Arzela-Ascoli theorem into probabilistic language.
and, for every $T\ge1$ and $\varepsilon>0$,
$$\lim_{\delta\downarrow0}\sup_{n\ge1} P_n\bigl(m_T(x,\delta)>\varepsilon\bigr) = 0. \qquad (6.3)$$
Proof. ($\Longrightarrow$) Suppose that for any $\gamma>0$ there exists a compact $K$ such that $P_n(K)>1-\gamma$ for all $n\ge1$. By the Arzela-Ascoli theorem, $|x_0|\le l$ for some $l>0$ and for all $x\in K$ and, therefore,
$$\sup_{n\ge1} P_n\bigl(|x_0|>l\bigr) \le \gamma.$$
Also, by equicontinuity, for any $\varepsilon>0$ there exists $\delta_0>0$ such that for $\delta<\delta_0$ and for all $x\in K$ we have $m_T(x,\delta)<\varepsilon$. Therefore,
$$\sup_{n\ge1} P_n\bigl(m_T(x,\delta)>\varepsilon\bigr) \le \gamma.$$

($\Longleftarrow$) Fix any $\gamma>0$. For each integer $T\ge1$, find $l_T>0$ such that
$$\sup_n P_n\bigl(|x_0|>l_T\bigr) \le \frac{\gamma}{2^{T+1}}.$$
For each integer $T\ge1$ and $k\ge1$, find $\delta_{T,k}>0$ such that
$$\sup_n P_n\Bigl(m_T(x,\delta_{T,k})>\frac1k\Bigr) \le \frac{\gamma}{2^{T+k+1}}.$$
Consider the set
$$A_T = \Bigl\{x\in C(\mathbb R_+) : |x_0|\le l_T,\ m_T(x,\delta_{T,k})\le\frac1k \text{ for all } k\ge1\Bigr\}.$$
Then,
$$\sup_n P_n\bigl((A_T)^c\bigr) \le \frac{\gamma}{2^{T+1}} + \sum_{k\ge1}\frac{\gamma}{2^{T+k+1}} = \frac{\gamma}{2^T},$$
and their intersection $A = \bigcap_{T\ge1}A_T$ satisfies
$$\sup_n P_n(A^c) = \sup_n P_n\Bigl(\bigcup_{T\ge1}(A_T)^c\Bigr) \le \sum_{T\ge1}\frac{\gamma}{2^T} = \gamma.$$
By the Arzela-Ascoli theorem, the closure of $A$ is compact, which finishes the proof. ⊓⊔
Remark 6.1. Of course, for uniform tightness on $(C[0,1],\|\cdot\|_\infty)$ we only need the second condition (6.3) for $T=1$. Also, it will be convenient to slightly relax (6.3) and replace it with the asymptotic equicontinuity condition
$$\lim_{\delta\downarrow0}\limsup_{n\to\infty} P_n\bigl(m_T(x,\delta)>\varepsilon\bigr) = 0. \qquad (6.4)$$
If this holds then, given $\gamma>0$, we can find $\delta_0>0$ and $n_0\ge1$ such that for all $n>n_0$,
$$P_n\bigl(m_T(x,\delta_0)>\varepsilon\bigr)\le\gamma.$$
6.2 Donsker’s invariance principle 179
Using this characterization of uniform tightness, we will now give our first example of convergence on $(C(\mathbb R_+),d)$ to the Brownian motion $W_t$. Consider a sequence $(X_i)_{i\ge1}$ of i.i.d. random variables such that $EX_i=0$ and $\sigma^2=EX_i^2<\infty$. Let us consider a continuous partial sum process on $[0,\infty)$ defined by
$$W_t^n = \frac{1}{\sigma\sqrt n}\sum_{i\le\lfloor nt\rfloor} X_i + (nt-\lfloor nt\rfloor)\,\frac{X_{\lfloor nt\rfloor+1}}{\sigma\sqrt n}, \qquad (6.5)$$
Theorem (Donsker's invariance principle). $W^n$ converges to the Brownian motion $W$ in distribution.
Proof. Since the last term in $W_t^n$ is of order $n^{-1/2}$, for simplicity of notation we will simply write
$$W_t^n = \frac{1}{\sigma\sqrt n}\sum_{i\le nt} X_i$$
and treat $nt$ as an integer. By the central limit theorem,
$$\frac{1}{\sigma\sqrt n}\sum_{i\le nt} X_i = \sqrt t\,\frac{1}{\sigma\sqrt{nt}}\sum_{i\le nt} X_i,$$
which converges weakly to $N(0,t)$, and similarly for the finite dimensional distributions.
because the second maximum over $0<j\le n\delta$ is taken over intervals of the same size $n\delta$. As a consequence, if $m_T(W^n,\delta)>\varepsilon$ then one of the events
$$\max_{0<j\le n\delta}\,\Bigl|\frac{1}{\sigma\sqrt n}\sum_{\ell n\delta<i\le \ell n\delta+j} X_i\Bigr| > \frac{\varepsilon}{3}$$
occurs, and the probabilities of these events can be controlled, for example, by Kolmogorov's inequality. This yields the asymptotic equicontinuity condition (6.4) for all $T>0$ and $\varepsilon>0$. This proves that $W_t^n\to W_t$ weakly in $(C[0,\infty),d)$. ⊓⊔
Exercise 6.2.2. Let $(\varepsilon_i)_{1\le i\le n}$ be i.i.d. Rademacher random variables, i.e. $P(\varepsilon_i=\pm1)=1/2$. For $\delta\in(0,1)$, let $F_\delta$ be the subset of vectors in $\{0,1\}^n$ with at most $\delta n$ coordinates equal to $1$, and such that these coordinates are consecutive. Show that, for a large enough absolute constant $c>0$,
$$P\Bigl(\max_{f\in F_\delta}\Bigl|\frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i f_i\Bigr| \ge c\bigl(\sqrt\delta+t\bigr)\Bigr) \le \frac{c}{\delta}\,e^{-\frac{t^2}{2\delta}}.$$
Hint: use discretization and Kolmogorov's inequality as in the proof of Donsker's theorem, and then apply Hoeffding's inequality instead of the CLT.
The stochastic process $X_t^n$ is called the empirical process. The covariance of this process, for $s\le t$,
$$EX_s^nX_t^n = s(1-t),$$
is the same as the covariance of the Brownian bridge and, by the multivariate CLT, all finite dimensional distributions of the empirical process converge to the finite dimensional distributions of the Brownian bridge,
$$\mathcal L\bigl(X_{t_1}^n,\dots,X_{t_k}^n\bigr) \to \mathcal L\bigl(B_{t_1},\dots,B_{t_k}\bigr). \qquad (6.7)$$
Notice also that
$$\sup_t\bigl|X_t^n\bigr| \stackrel{d}{=} \sup_{t\in[0,1]}\sqrt n\,\Bigl|\frac1n\sum_{i=1}^n I\bigl(F(X_i)\le t\bigr) - t\Bigr|,$$
because $(F(X_i))$ are i.i.d. and have the uniform distribution on $[0,1]$. This means that the distribution of the left hand side does not depend on $F$ and, in order to infer whether the sample $(X_i)_{1\le i\le n}$ comes from the distribution with the c.d.f. $F$, statisticians need to know only the distribution of the supremum of the empirical process or, as an approximation, the distribution of its limit. Equation (6.7) suggests that
$$\mathcal L\Bigl(\sup_t|X_t^n|\Bigr) \to \mathcal L\Bigl(\sup_t|B_t|\Bigr), \qquad (6.8)$$
6.3 Convergence of empirical process to Brownian bridge 183
and the right hand side is called the Kolmogorov-Smirnov distribution, which will be computed in the next section. Since $B_t$ is sample continuous, its distribution is a law on the metric space $(C[0,1],\|\cdot\|_\infty)$. Even though $X_t^n$ is not continuous, its jumps are equal to $n^{-1/2}$, so it can be approximated by a continuous process $Y_t^n$ uniformly within $n^{-1/2}$. Since $\|\cdot\|_\infty$ is a continuous functional on $(C[0,1],\|\cdot\|_\infty)$, (6.8) would hold if we can prove the weak convergence $\mathcal L(Y_t^n)\to\mathcal L(B_t)$. We only need to prove uniform tightness of $\mathcal L(Y_t^n)$ because, by Lemma 6.2, (6.7) already identifies the law of the Brownian bridge as the unique possible limit. Thus, we need to address the question of uniform tightness of $(\mathcal L(Y_t^n))$ on the complete separable space $(C[0,1],\|\cdot\|_\infty)$ or, equivalently, by the result of the previous section, the asymptotic equicontinuity of $Y_t^n$,
so we can deal with the process Xtn , even though it is not continuous. By Cheby-
shev’s inequality,
1
P m(X n , d ) > e Em(X n , d )
e
and we need to learn how to control Em(X , d ). The modulus of continuity of X n
n
can be written as
p 1 n
m(X n , d ) = sup |Xtn Xsn | = n sup  I(s < xi t)
n i=1
(t s)
|t s|d |t s|d
p 1 n
= n sup
f 2F
 f (xi )
n i=1
Ef , (6.9)
We will now develop an approach to control the expectation of (6.9) for general
classes of functions F and we will only use the specific definition (6.10) at the
very end. This will be done in several steps.
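As a numerical illustration of (6.9), the following Python sketch (my addition, not from the notes; the grid and sample size are arbitrary choices) computes a discretized modulus of continuity of the empirical process for a uniform sample.

```python
import bisect
import math
import random

def modulus(xs, delta, grid=200):
    # Discretized version of (6.9): sqrt(n) * sup over grid points s <= t with
    # t - s <= delta of |F_n(t) - F_n(s) - (t - s)|, F_n the empirical c.d.f.
    n = len(xs)
    pts = sorted(xs)
    F = [bisect.bisect_right(pts, i / grid) / n for i in range(grid + 1)]
    best = 0.0
    for i in range(grid + 1):
        for j in range(i, min(grid, i + int(delta * grid)) + 1):
            best = max(best, abs(F[j] - F[i] - (j - i) / grid))
    return math.sqrt(n) * best

random.seed(0)
xs = [random.random() for _ in range(2000)]
m_small, m_large = modulus(xs, 0.01), modulus(xs, 0.5)
print(m_small, m_large)
```

Since the supremum for a small window is taken over a subset of the pairs used for a large window, the computed modulus is monotone in $\delta$, as the definition requires.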
Symmetrization. As the first step, we will replace the empirical process (6.9) by a symmetrized version, called the Rademacher process, which will be easier to control. Let $x_1',\ldots,x_n'$ be independent copies of $x_1,\ldots,x_n$ and let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. Rademacher random variables, so that $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$. Let us define
$$P_n f = \frac1n\sum_{i=1}^n f(x_i) \quad\text{and}\quad P_n' f = \frac1n\sum_{i=1}^n f(x_i').$$
Let
$$Z = \sup_{f\in\mathcal{F}} |P_n f - E f| \quad\text{and}\quad R = \sup_{f\in\mathcal{F}} \Bigl|\frac1n\sum_{i=1}^n \varepsilon_i f(x_i)\Bigr|.$$
Then, using Jensen's inequality, the symmetry, and then the triangle inequality, we can write
$$E Z = E \sup_{f\in\mathcal{F}} |P_n f - E P_n' f| \le E \sup_{f\in\mathcal{F}} |P_n f - P_n' f| = E \sup_{f\in\mathcal{F}} \Bigl|\frac1n\sum_{i=1}^n \varepsilon_i \bigl(f(x_i) - f(x_i')\bigr)\Bigr| \le 2\, E R.$$
Equality in the second line holds because switching $x_i \leftrightarrow x_i'$ arbitrarily does not change the expectation, so the equality holds for any fixed $(\varepsilon_i)$ and, therefore, for any random $(\varepsilon_i)$. $\square$
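The symmetrization bound $EZ \le 2ER$ can be sanity-checked by simulation. The sketch below is my addition; the grid of intervals and the sample sizes are ad hoc choices.

```python
import random

random.seed(1)
n, trials, grid = 100, 100, 10
# class F: indicators f(x) = I(s < x <= t) over a grid of endpoints (s, t)
pairs = [(i / grid, j / grid) for i in range(grid + 1) for j in range(i + 1, grid + 1)]

def Z(xs):
    # sup_f |P_n f - E f|; for uniform x and f = I(s < . <= t), E f = t - s
    return max(abs(sum(s < x <= t for x in xs) / n - (t - s)) for s, t in pairs)

def R(xs, eps):
    # sup_f |(1/n) sum_i eps_i f(x_i)|, the Rademacher process
    return max(abs(sum(e for x, e in zip(xs, eps) if s < x <= t)) / n
               for s, t in pairs)

EZ = ER = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(n)]
    eps = [random.choice((-1, 1)) for _ in range(n)]
    EZ += Z(xs) / trials
    ER += R(xs, eps) / trials
print(EZ, 2 * ER)
```

In practice the constant $2$ is quite loose for this class, so the Monte Carlo estimate of $2ER$ comfortably dominates that of $EZ$.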
Given this symmetrization bound, for the specific class of functions in (6.10),
one can finish the proof of asymptotic equicontinuity using the last exercise in
the previous section together with a simple observation (6.13) below. However,
instead, we will describe a much more general approach to this problem.
Covering numbers, Kolmogorov’s chaining and Dudley’s entropy integral. To
control ER for general classes of functions F , we will need to use some measures
of complexity of F . First, we will show how to control the Rademacher process R
conditionally on $x_1,\ldots,x_n$. Suppose that $(F,d)$ is a totally bounded metric space. For any $u > 0$, the $u$-packing number of $F$ with respect to $d$ is defined by
$$D(F,u,d) = \max\bigl\{ k : \text{there exist } f_1,\ldots,f_k \in F \text{ with } d(f_i,f_j) > u \text{ for } i \ne j \bigr\},$$
and the $u$-covering number $N(F,u,d)$ is the smallest number of balls of radius $u$ needed to cover $F$. Both packing and covering numbers measure how many points are needed to approximate any element in the set $F$ within distance $u$. It is a simple exercise (at the end of the section) to show that
$$N(F,u,d) \le D(F,u,d) \le N(F,u/2,d)$$
and, in this sense, packing and covering numbers are closely related. Let $F$ be a subset of the cube $[-1,1]^n$ equipped with the rescaled Euclidean metric
$$d(f,g) = \Bigl( \frac1n \sum_{i=1}^n (f_i - g_i)^2 \Bigr)^{1/2},$$
and consider the Rademacher process
$$R(f) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f_i.$$
Then we have the following version of the classical Kolmogorov chaining lemma. We will assume that $0 \in F$, increasing $D(F,\varepsilon,d)$ by $1$ if necessary. For $j \ge 1$, let $F_j$ be a maximal $2^{-j}$-separated subset of $F$, chosen so that
$$\{0\} = F_0 \subseteq F_1 \subseteq \ldots \subseteq F_j \subseteq \ldots \subseteq F,$$
and note that $|F_j| \le D(F, 2^{-j}, d)$. Let $\pi_j(f)$ be a point in $F_j$ closest to $f$, where we break the tie arbitrarily; by maximality, $d(f, \pi_j(f)) \le 2^{-j}$. Then any $f \in F$ can be decomposed into the telescopic series
$$f = \sum_{j=1}^{\infty} \bigl( \pi_j(f) - \pi_{j-1}(f) \bigr),$$
and
$$d(\pi_{j-1}(f), \pi_j(f)) \le d(\pi_{j-1}(f), f) + d(f, \pi_j(f)) \le 2^{-(j-1)} + 2^{-j} = 3\cdot 2^{-j} \le 2^{-j+2}.$$
Consider the sets of differences
$$L_{j-1,j} = \bigl\{ \pi_j(f) - \pi_{j-1}(f) : f \in F \bigr\}.$$
We will call elements of these sets links (as in a chain). Since $R(f)$ is linear,
$$R(f) = \sum_{j=1}^{\infty} R\bigl( \pi_j(f) - \pi_{j-1}(f) \bigr).$$
We first show how to control $R$ on the set of all non-trivial links. Assume that $0 \ne \ell \in L_{j-1,j}$. Since $\frac1n\sum_{i=1}^n \ell_i^2 = d(0,\ell)^2 \le 2^{-2j+4}$, by Hoeffding's inequality,
$$P\Bigl( R(\ell) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i \ell_i \ge t \Bigr) \le \exp\Bigl( -\frac{t^2}{2\, n^{-1}\sum_{i=1}^n \ell_i^2} \Bigr) \le \exp\Bigl( -\frac{t^2}{2\cdot 2^{-2j+4}} \Bigr).$$
If $|F|$ denotes the cardinality of the set $F$ then $|L_{j-1,j}| \le |F_{j-1}|\cdot|F_j| \le |F_j|^2$ and, therefore, by the union bound,
$$P\bigl( \forall \ell \in L_{j-1,j},\ R(\ell) \le t \bigr) \ge 1 - |F_j|^2 \exp\Bigl( -\frac{t^2}{2^{-2j+5}} \Bigr) = 1 - \frac{1}{|F_j|^2}\, e^{-u},$$
after making the change of variables
$$t = \Bigl( 2^{-2j+5} \bigl( 4\log|F_j| + u \bigr) \Bigr)^{1/2} \le 2^{7/2}\, 2^{-j} \log^{1/2}|F_j| + 2^{5/2}\, 2^{-j} \sqrt{u}.$$
Hence,
$$P\Bigl( \forall \ell \in L_{j-1,j},\ R(\ell) \le 2^{7/2}\, 2^{-j} \log^{1/2}|F_j| + 2^{5/2}\, 2^{-j} \sqrt{u} \Bigr) \ge 1 - \frac{1}{|F_j|^2}\, e^{-u}.$$
Given $f \in F$, let the integer $k$ be such that $2^{-(k+1)} < d(0,f) \le 2^{-k}$. Then in the above construction $\pi_0(f) = \ldots = \pi_{k-1}(f) = 0$ and, with probability at least $1 - e^{-u}$,
$$\Bigl| \sum_{j=k}^{\infty} R\bigl( \pi_j(f) - \pi_{j-1}(f) \bigr) \Bigr| \le \sum_{j=k}^{\infty} \Bigl( 2^{7/2}\, 2^{-j} \log^{1/2}|F_j| + 2^{5/2}\, 2^{-j} \sqrt{u} \Bigr) \le 2^{7/2} \sum_{j=k}^{\infty} 2^{-j} \log^{1/2} D(F, 2^{-j}, d) + 2^{7/2}\, 2^{-k} \sqrt{u}.$$
Note that $2^{-k} < 2\, d(f,0)$, so $2^{7/2}\, 2^{-k}\sqrt{u} < 2^{9/2}\, d(f,0) \sqrt{u}$. Finally, since the packing numbers $D(F,\varepsilon,d)$ are decreasing in $\varepsilon$, we can write
$$2^{9/2} \sum_{j=k}^{\infty} 2^{-(j+1)} \log^{1/2} D(F, 2^{-j}, d) \le 2^{9/2} \int_0^{2^{-k}} \log^{1/2} D(F,\varepsilon,d)\, d\varepsilon \le 2^{9/2} \int_0^{2 d(0,f)} \log^{1/2} D(F,\varepsilon,d)\, d\varepsilon. \tag{6.11}$$
The integral in (6.11) is called Dudley’s entropy integral. We would like to apply
the bound of the above theorem to
$$\sqrt{n}\, R = \sup_{f\in\mathcal{F}} \Bigl| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i) \Bigr|$$
for the class of functions $\mathcal{F}$ in (6.10). Suppose that $x_1,\ldots,x_n \in [0,1]$ are fixed and let
$$F = \Bigl\{ (f_i)_{1\le i\le n} = \bigl( I(s < x_i \le t) \bigr)_{1\le i\le n} : t, s \in [0,1],\ |t-s| \le \delta \Bigr\} \subseteq \{0,1\}^n.$$
Lemma 6.4. $N(F, u, d) \le K u^{-4}$ for some absolute constant $K > 0$ independent of the points $x_1,\ldots,x_n$.
This kind of estimate is called a uniform covering number bound, because the bound does not depend on the points $x_1,\ldots,x_n$ that generate the class $F$ from $\mathcal{F}$.
Proof. We can assume that $x_1 \le \ldots \le x_n$. Then the class $F$ consists of all vectors of the type
$$(0 \ldots 0\ 1 \ldots 1\ 0 \ldots 0),$$
i.e. the coordinates equal to $1$ come in blocks. Given $u$, let $F_u$ be the subset of such vectors with blocks of $1$'s starting and ending at coordinates that are multiples of $\lfloor nu \rfloor$. Given any vector $f \in F$, let us approximate it by a vector $f' \in F_u$ by choosing the closest starting and ending coordinates for the block of $1$'s. The number of differing coordinates will be bounded by $2\lfloor nu \rfloor$ and, therefore, the distance between $f$ and $f'$ will be bounded by
$$d(f, f') \le \bigl( 2 n^{-1} \lfloor nu \rfloor \bigr)^{1/2} \le \sqrt{2u}.$$
The cardinality of $F_u$ is, obviously, of order $u^{-2}$, and this proves that $N(F, \sqrt{2u}, d) \le K u^{-2}$. Making the change of variables $\sqrt{2u} \to u$ proves the result. $\square$
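The bound of Lemma 6.4 can be observed numerically. In the sketch below (my addition; the constant $16$ and the grids are ad hoc choices, not the lemma's optimal constant), each block indicator is encoded by the index range of sample points it contains, a greedy maximal $u$-packing is built, and a maximal packing is also a covering.

```python
import bisect
import math
import random

random.seed(2)
n, grid, delta = 300, 100, 0.3
xs = sorted(random.random() for _ in range(n))
# each f in F is a block of 1's; represent it by the index range (a, b] of the
# sorted sample points falling in (s, t]
ranges = set()
for i in range(grid + 1):
    for j in range(i, grid + 1):
        s, t = i / grid, j / grid
        if t - s <= delta:
            ranges.add((bisect.bisect_right(xs, s), bisect.bisect_right(xs, t)))
ranges = sorted(ranges)

def dist(r1, r2):
    # d(f, g)^2 = (1/n) * |symmetric difference of the two blocks|
    (a1, b1), (a2, b2) = r1, r2
    if max(a1, a2) < min(b1, b2):       # blocks overlap
        diff = abs(a1 - a2) + abs(b1 - b2)
    else:                               # disjoint blocks
        diff = (b1 - a1) + (b2 - a2)
    return math.sqrt(diff / n)

def packing(u):
    centers = []
    for r in ranges:
        if all(dist(r, c) > u for c in centers):
            centers.append(r)   # maximal u-packing, hence also a u-covering
    return len(centers)

sizes = {u: packing(u) for u in (0.4, 0.2, 0.1)}
print(sizes)
```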
For this class,
$$D_n^2 = \sup_F d(0,f)^2 = \sup_{|t-s|\le\delta} \frac1n \sum_{i=1}^n I(s < x_i \le t) = \sup_{|t-s|\le\delta} \Bigl( \frac1n \sum_{i=1}^n I(x_i \le t) - \frac1n \sum_{i=1}^n I(x_i \le s) \Bigr).$$
Since the integral on the right hand side of (6.12) is concave in $D_n$, by Jensen's inequality,
$$E \sup_{\mathcal{F}} \Bigl| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f(x_i) \Bigr| \le K \Bigl( \int_0^{E D_n} \sqrt{\log \frac{K}{u}}\, du + E D_n \Bigr).$$
By the law of large numbers (see also Exercise 6.3.2 below),
$$\sup_{t\in[0,1]} \Bigl| \frac1n \sum_{i=1}^n I(x_i \le t) - t \Bigr| \to 0, \tag{6.13}$$
so $D_n^2 \to \delta$ and $E D_n \to \sqrt{\delta}$.
The right hand side then goes to zero as $\delta \to 0$, and this finishes the proof of asymptotic equicontinuity of $X^n$. As a result, for any continuous function $\Phi$ on $(C[0,1], \|\cdot\|_\infty)$, the distribution of $\Phi(X_t^n)$ converges to the distribution of $\Phi(B_t)$. For example,
$$\sqrt{n} \sup_{0\le t\le 1} \Bigl| \frac1n \sum_{i=1}^n I(x_i \le t) - t \Bigr| \to \sup_{0\le t\le 1} |B_t|$$
in distribution. We will find the distribution of the right hand side in the next section. Notice that the methods we used to prove equicontinuity were quite general, and the main step where we used the specific class of functions $\mathcal{F}$ was to control the covering numbers.
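The convergence of the Kolmogorov–Smirnov statistic $\sqrt{n}\sup_t |F_n(t)-t|$ to $\sup_t|B_t|$ can be checked against the classical series for the limiting Kolmogorov distribution, which is computed in the next section. This is a rough Monte Carlo sketch (my addition; sample sizes and the evaluation point are arbitrary).

```python
import math
import random

def ks_stat(xs):
    # D_n = sqrt(n) * sup_t |F_n(t) - t| for a sample from U[0,1]
    n, pts = len(xs), sorted(xs)
    return math.sqrt(n) * max(max((i + 1) / n - x, x - i / n)
                              for i, x in enumerate(pts))

def kolmogorov_cdf(z, terms=50):
    # P(sup_t |B_t| <= z) = 1 - 2 sum_{k>=1} (-1)^{k-1} exp(-2 k^2 z^2)
    return 1 - 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                       for k in range(1, terms + 1))

random.seed(3)
n, m, z = 400, 1000, 0.8
hits = sum(ks_stat([random.random() for _ in range(n)]) <= z for _ in range(m))
print(hits / m, kolmogorov_cdf(z))
```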
Exercise 6.3.2. If $(x_i)_{i\ge1}$ are i.i.d. uniform on $[0,1]$, prove that
$$\sup_{t\in[0,1]} \Bigl| \frac1n \sum_{i=1}^n I(x_i \le t) - t \Bigr| \to 0 \quad\text{a.s.}$$
Exercise 6.3.3. If $\varepsilon_i$ are i.i.d. Rademacher, $g_i$ are i.i.d. standard Gaussian, and $F$ is a bounded subset of $\mathbb{R}^n$, show that
$$E \sup_{f\in F} \Bigl| \frac{1}{\sqrt{n}} \sum_{i=1}^n \varepsilon_i f_i \Bigr| \le \frac{1}{E|g_1|}\, E \sup_{f\in F} \Bigl| \frac{1}{\sqrt{n}} \sum_{i=1}^n g_i f_i \Bigr|.$$
Exercise 6.3.4. Suppose that a class of functions $\mathcal{F}$ satisfies, for all $\varepsilon > 0$ and all points $w_1,\ldots,w_n$,
$$\log D(\mathcal{F}, \varepsilon, d) \le V \log\frac{2}{\varepsilon}, \quad\text{where}\quad d(f,g) = \Bigl( \frac1n \sum_{i=1}^n \bigl( f(w_i) - g(w_i) \bigr)^2 \Bigr)^{1/2}.$$
If $X_1, X_2, \ldots$ are i.i.d., show that
$$E \sup_{f\in\mathcal{F}} \Bigl| \frac1n \sum_{i=1}^n f(X_i) - E f(X_1) \Bigr| \le K \Bigl( \frac{V}{n} \Bigr)^{1/2}$$
for some absolute constant $K$. (Assume, for example, that $\mathcal{F}$ is countable to avoid any measurability issues.)
generated by the data up to the stopping time $\tau$. Let us denote by $\mathcal{B}(C(\mathbb{R}_+), d)$ the Borel $\sigma$-algebra on $(C(\mathbb{R}_+), d)$, where the metric $d$ metrizes uniform convergence on compacts. The following is very similar to the property of stopping times for sums of i.i.d. random variables in Section 2.4.
Theorem 6.6 (Strong Markov Property). On the event $\{\tau < \infty\}$, the increments $W_t' := W_{\tau+t} - W_\tau$ of the process after the stopping time are independent of $\mathcal{F}_\tau$ and, moreover, the distribution of $W'$ is that of a Brownian motion. More precisely, for any $B \in \mathcal{F}_\tau$ and $A \in \mathcal{B}(C(\mathbb{R}_+), d)$,
$$E\, I(W' \in A)\, I_B\, I(\tau < \infty) = E\, I(W \in A)\; E\, I_B\, I(\tau < \infty). \tag{6.15}$$
the form $C \times \mathbb{R}^{[0,\infty)\setminus F}$ for $C \in \mathcal{B}(\mathbb{R}^F)$, for a finite $F = \{t_1,\ldots,t_d\}$ with some integer $d \ge 1$ and $0 \le t_1 < \ldots < t_d$. By Dynkin's theorem, it is enough to consider only open sets $C$ and, finally, approximating open sets by bounded Lipschitz functions from below, it is enough to prove (6.15) with the indicators of $A$ replaced by $f(W_{t_1}',\ldots,W_{t_d}')$ and $f(W_{t_1},\ldots,W_{t_d})$ for bounded Lipschitz $f$. For the dyadic approximation $\tau_n = (\lfloor 2^n \tau \rfloor + 1)/2^n$,
$$B \cap \{\tau_n = k2^{-n}\} = B \cap \bigl\{ (k-1)2^{-n} \le \tau < k2^{-n} \bigr\} \in \mathcal{F}_{k2^{-n}}.$$
In the second line we used the homogeneity of the Brownian motion to write $E\, e(k2^{-n}) = E\, e(0)$. Since $e(\tau_n) \to e(\tau)$ almost surely, this finishes the proof. $\square$
One can show that a stopping time is always a Markov time and $\mathcal{F}_\tau \subseteq \mathcal{F}_{\tau+}$. One can check that the above strong Markov property also holds for Markov times $\tau$ and $\sigma$-algebras $\mathcal{F}_{\tau+}$ (exercise below).
We will use the strong Markov property in the above form in the next section, but in the examples in this section the events $A$ will be allowed to depend on the stopping time $\tau$. For this purpose, it will be convenient to have the following version of the strong Markov property. Given a time $s \ge 0$, consider the following two modifications of the Brownian motion:
$$W^i(t;s) := W(t)\, I(t \le s) + \bigl( W_s + \tilde{W}_{t-s} \bigr)\, I(t > s),$$
$$W^r(t;s) := W(t)\, I(t \le s) + \bigl( W_s - (W_t - W_s) \bigr)\, I(t > s), \tag{6.17}$$
where the increment $W_t - W_s$ of the original process $W_t$ after time $s$ was replaced in the first case by an independent Brownian motion $(\tilde{W}_h)_{h\ge0}$ at time $t-s$ and in the second case by $-(W_t - W_s)$, i.e. the original increment reflected around zero. (To accommodate the Brownian motion $(\tilde{W}_h)_{h\ge0}$ independent of all $\sigma$-algebras $\mathcal{F}_t$ we can extend the probability space $\Omega$, for example, to the product space $\Omega \times C(\mathbb{R}_+)$ if necessary.) Of course, both modifications satisfy all the properties of a Brownian motion, so
$$\bigl( W^i(t;s) \bigr)_{t\ge0} \stackrel{d}{=} \bigl( W^r(t;s) \bigr)_{t\ge0} \stackrel{d}{=} \bigl( W(t) \bigr)_{t\ge0}.$$
Our next result shows that this still holds if we replace a fixed time $s$ by a stopping time $\tau$.
Theorem 6.7 (Reflection principle). If $\tau$ is a stopping time and $a \in \{i, r\}$ then
$$\bigl( W^a(t;\tau) \bigr)_{t\ge0} \stackrel{d}{=} (W_t)_{t\ge0}.$$
Proof. By the same reduction as in the previous theorem, it is enough to show that, for any $d \ge 1$, any $0 < t_1 < \ldots < t_d$, and any $f \in C_b(\mathbb{R}^d)$,
$$E\, f\bigl( W^a(t_1;\tau), \ldots, W^a(t_d;\tau) \bigr)\, I(\tau < \infty) = E\, f(W_{t_1},\ldots,W_{t_d})\, I(\tau < \infty).$$
Let us denote $e(s) := f(W^a(t_1;s),\ldots,W^a(t_d;s))$ and note that it is continuous in $s$, because the processes $W^a(t;s)$ for $a\in\{i,r\}$ are continuous in $(t,s)$ and $f \in C_b(\mathbb{R}^d)$. Therefore, for the stopping time $\tau_n$ defined as in the proof of the strong Markov property above, $e(\tau_n) \to e(\tau)$ on the event $\{\tau < \infty\}$. Finally, write
$$E\, e(\tau_n)\, I(\tau_n < \infty) = \sum_{k\ge0} E\, e(k2^{-n})\, I(\tau_n = k2^{-n})$$
and note that $\{\tau_n = k2^{-n}\} \in \mathcal{F}_{k2^{-n}}$, while for $s = k2^{-n}$ the only difference between $W_t$ and $W^a(t;s)$ is how the increments after time $k2^{-n}$ were defined (all independent of $\mathcal{F}_{k2^{-n}}$). As a result, the sum on the right hand side above is equal to
$$\sum_{k\ge0} E\, f(W_{t_1},\ldots,W_{t_d})\, I(\tau_n = k2^{-n}) = E\, f(W_{t_1},\ldots,W_{t_d})\, I(\tau < \infty),$$
and letting $n \to \infty$ finishes the proof. $\square$
Example 6.4.1. Given $c > 0$ and the hitting time $\tau_c = \inf\{t : W_t = c\}$,
$$P(\tau_c \le b) = P(\tau_c \le b,\ W_b - W_{\tau_c} > 0) + P(\tau_c \le b,\ W_b - W_{\tau_c} < 0).$$
By the reflection principle above applied to $W^r(t;\tau_c)$, the two probabilities above are equal, so
$$P(\tau_c \le b) = 2 P(\tau_c \le b,\ W_b - W_{\tau_c} > 0) = 2 P(\tau_c \le b,\ W_b > c) = 2 P(W_b > c),$$
where we could omit $\tau_c \le b$ because, by continuity of the process, the event $W_b > c$ automatically implies that $\tau_c \le b$. Hence,
$$P\Bigl( \sup_{t\le b} W_t \ge c \Bigr) = P(\tau_c \le b) = 2 P(W_b > c) = 2 \int_{c/\sqrt{b}}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx. \tag{6.19}$$
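Formula (6.19) lends itself to a quick Monte Carlo check. The random-walk discretization below is my addition; the step count is arbitrary and introduces a small downward bias in the hitting probability.

```python
import math
import random

def hits_level(c, steps=1000):
    # random-walk approximation of (W_t) on [0,1]; does it reach level c?
    w, sd = 0.0, math.sqrt(1.0 / steps)
    for _ in range(steps):
        w += random.gauss(0.0, sd)
        if w >= c:
            return True
    return False

random.seed(4)
m, c = 4000, 1.0
p_hat = sum(hits_level(c) for _ in range(m)) / m
# right hand side of (6.19) with b = 1: 2 P(W_1 > c)
p_exact = 2 * (1 - 0.5 * (1 + math.erf(c / math.sqrt(2))))
print(p_hat, p_exact)
```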
Next, we compute various probabilities for the suprema of the Brownian bridge. In addition to the reflection principle, we will need the following observation: the Brownian bridge can be viewed as a Brownian motion conditioned to be equal to zero at time $t = 1$ (a.k.a. pinned down Brownian motion).
Lemma 6.5 (BB as pinned down BM). The conditional distribution of $(W_t)_{t\in[0,1]}$ given $|W_1| < \varepsilon$ converges to the law of $(B_t)_{t\in[0,1]}$ as $\varepsilon \downarrow 0$. $\square$
Using this lemma we will first prove the analogue of the above Example 6.4.1 for
the Brownian bridge.
Theorem 6.8. If $B_t$ is a Brownian bridge then, for all $b > 0$,
$$P\Bigl( \sup_{t\in[0,1]} B_t \ge b \Bigr) = e^{-2b^2}.$$
Let us first analyze the upper bound. If we define the hitting time $\tau = \inf\{t : W_t = b - \varepsilon\}$ then $W_\tau = b - \varepsilon$ and, reflecting the path at time $\tau$,
$$P\bigl( \tau \le 1,\ |W_1| < \varepsilon \bigr) = P\bigl( W_1 \in (2b - 3\varepsilon,\ 2b - \varepsilon) \bigr),$$
because the fact that $W_1 \in (2b-3\varepsilon, 2b-\varepsilon)$ automatically implies that $\tau \le 1$ for $b > 0$ and $\varepsilon$ small enough. Therefore, we proved that
$$P\Bigl( \sup_{t\in[0,1]} B_t \ge b \Bigr) = \lim_{\varepsilon\downarrow0} \frac{ P\bigl( W_1 \in (2b-3\varepsilon,\ 2b-\varepsilon) \bigr) }{ P\bigl( W_1 \in (-\varepsilon,\varepsilon) \bigr) } = e^{-\frac12 (2b)^2} = e^{-2b^2}. \qquad\square$$
Next, consider the two-sided supremum. Let $A_n$ be the event that the process visits $b$ and $-b$, alternating between them, at least $n$ times,
$$A_n = \bigl\{ \exists\, t_1 < \ldots < t_n :\ |B_{t_1}| = \ldots = |B_{t_n}| = b,\ B_{t_{i+1}} = -B_{t_i} \bigr\},$$
and let $\tau_b$ and $\tau_{-b}$ be the hitting times of $b$ and $-b$. By symmetry of the distribution of the process $B_t$,
$$P\Bigl( \sup_{0\le t\le 1} |B_t| \ge b \Bigr) = P\bigl( \tau_b \le 1 \text{ or } \tau_{-b} \le 1 \bigr) = 2 P(A_1,\ \tau_b < \tau_{-b}).$$
Again, by symmetry,
$$P(A_1,\ \tau_b < \tau_{-b}) = P(A_1,\ \tau_{-b} < \tau_b),$$
and, by induction, inclusion–exclusion over the number of alternating visits gives
$$P\Bigl( \sup_{0\le t\le 1} |B_t| \ge b \Bigr) = 2 \sum_{n\ge1} (-1)^{n-1} P(A_n).$$
As in Theorem 6.8, reflecting the Brownian motion each time we hit $b$ or $-b$, one can show that
$$P(A_n) = \lim_{\varepsilon\downarrow0} \frac{ P\bigl( W_1 \in (2nb - \varepsilon,\ 2nb + \varepsilon) \bigr) }{ P\bigl( W_1 \in (-\varepsilon, \varepsilon) \bigr) } = e^{-\frac12 (2nb)^2} = e^{-2 n^2 b^2}.$$
Similarly, for the two-sided boundary with $a, b > 0$,
$$P\bigl( \exists t : B_t = -a \text{ or } b \bigr) = \sum_{n\ge0} \Bigl( e^{-2(na+(n+1)b)^2} + e^{-2((n+1)a+nb)^2} \Bigr) - \sum_{n\ge1} 2\, e^{-2 n^2 (a+b)^2}. \tag{6.20}$$
Proof. Let
$$A_n = \bigl\{ \exists\, t_1 < \ldots < t_n : B_{t_1} = -a,\ B_{t_2} = b,\ B_{t_3} = -a,\ \ldots \bigr\},$$
and let $C_n$ be defined in the same way with the roles of $-a$ and $b$ interchanged. Then, as in the previous theorem, inclusion–exclusion over the number of alternating visits gives, by induction,
$$P\bigl( \exists t : B_t = -a \text{ or } b \bigr) = \sum_{n=1}^{\infty} (-1)^{n-1} \bigl( P(A_n) + P(C_n) \bigr).$$
Probabilities of the events $A_n$ and $C_n$ can be computed using the reflection principle as above: $P(A_{2n}) = P(C_{2n}) = e^{-2n^2(a+b)^2}$, $P(C_{2n+1}) = e^{-2(na+(n+1)b)^2}$, and $P(A_{2n+1}) = e^{-2((n+1)a+nb)^2}$, which finishes the proof. $\square$
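Theorem 6.8 can be checked by simulating the bridge as $B_t = W_t - t W_1$. The sketch below is my addition, with an ad hoc discretization; it estimates $P(\sup_t B_t \ge b)$.

```python
import math
import random

def bridge_max(steps=500):
    # Brownian bridge via B_{k/n} = W_{k/n} - (k/n) W_1 on a discrete grid
    w, path = 0.0, [0.0]
    sd = math.sqrt(1.0 / steps)
    for _ in range(steps):
        w += random.gauss(0.0, sd)
        path.append(w)
    w1 = path[-1]
    return max(x - (k / steps) * w1 for k, x in enumerate(path))

random.seed(5)
m, b = 4000, 1.0
p_hat = sum(bridge_max() >= b for _ in range(m)) / m
print(p_hat, math.exp(-2 * b * b))  # Theorem 6.8 predicts e^{-2 b^2}
```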
Proof. Let $X = -\inf_{0\le t\le1} B_t$ and $Y = \sup_{0\le t\le1} B_t$. First of all, (6.20) gives the joint c.d.f. of $(X,Y)$ because
$$F(a,b) := P(X < a,\ Y < b) = 1 - P\bigl( \exists t : B_t = -a \text{ or } b \bigr).$$
If $f(a,b) = \frac{\partial^2 F}{\partial a\, \partial b}$ is the joint p.d.f. of $(X,Y)$ then the c.d.f. of the spread $X + Y$ is
$$P(Y + X \le t) = \int_0^t \int_0^{t-a} f(a,b)\, db\, da.$$
Computing this integral with (6.20) gives
$$P(Y + X \le t) = \sum_{n\ge0} (2n+1) \bigl( e^{-2n^2t^2} - e^{-2(n+1)^2t^2} \bigr) - \sum_{n\ge1} 8 n^2 t^2 e^{-2n^2t^2} = 1 + 2\sum_{n\ge1} e^{-2n^2t^2} - \sum_{n\ge1} 8 n^2 t^2 e^{-2n^2t^2}.$$
Exercise 6.4.1. Prove the strong Markov property for a Markov time $\tau$ and the $\sigma$-algebra $\mathcal{F}_{\tau+}$ in (6.16).
Exercise 6.4.5. If $(\mathcal{F}_t)$ satisfies conditions (i)–(iii) at the beginning of this section, prove that $(\mathcal{F}_{t+})$ satisfies them too.
In this section we will prove another classical limit theorem in probability, called the law of the iterated logarithm, using the method of Skorohod's imbedding, which we will describe first. Throughout this section, $(W_t)_{t\ge0}$ will be a Brownian motion and $\mathcal{F}_t = \sigma((W_s)_{s\le t})$. We will need the following preliminary technical result.
Theorem 6.12. If $\tau < \infty$ is a stopping time such that $E\tau < \infty$ then $E W_\tau = 0$ and $E W_\tau^2 = E\tau$.
Proof. Let us start with the case when the stopping time $\tau$ takes a finite number of values, $\tau \in \{t_1,\ldots,t_n\}$. Note that $(W_{t_j}, \mathcal{F}_{t_j})$ is a martingale, so, by the optional stopping theorem for martingales, $E W_\tau = E W_{t_1} = 0$. Next, let us prove that $E W_\tau^2 = E\tau$ by induction on $n$. If $n = 1$ then $\tau = t_1$ and $E W_\tau^2 = t_1 = E\tau$. For the induction step, with $\sigma = \min(\tau, t_{n-1})$,
$$E W_\tau^2 = E\sigma + (t_n - t_{n-1}) P(\tau = t_n) = E\tau,$$
and this finishes the proof of the induction step.
Next, let us consider the case of a uniformly bounded stopping time $\tau \le M < \infty$. In the previous lecture we defined the dyadic approximation
$$\tau_n = \frac{\lfloor 2^n \tau \rfloor + 1}{2^n},$$
which is also a stopping time, $\tau_n \downarrow \tau$, and by sample continuity $W_{\tau_n} \to W_\tau$ almost surely. Since $(\tau_n)$ are uniformly bounded, $E\tau_n \to E\tau$. To prove that $E W_{\tau_n}^2 \to E W_\tau^2$, we need to show that the sequence $(W_{\tau_n}^2)$ is uniformly integrable. Notice that
6.5 Skorohod’s imbedding and laws of the iterated logarithm 199
tn < 2M and, therefore, tn takes possible values of the type k/2n for k k0 =
b2n (2M)c. Since the sequence (W1/2n , . . . ,Wk0 /2n ,W2M ) is a martingale, adapted to
a corresponding sequence of Ft , and tn and 2M are two stopping times such that
tn < 2M, by the optional stopping theorem, Theorem 5.2, Wtn = E(W2M |Ftn ). By
Jensen’s inequality,
Wt4n E(W2M
4
|Ftn ), EWt4n EW2M
4
= 6M,
EWt4n 6M
EWt2n I(|Wtn | > N) 2
2.
N N
This proves that EWtn ! EWt and EWt2n ! EWt2 . Since tn takes a finite number
of values, EWtn = 0 and EWt2n = Etn , and letting n ! • proves
Before we consider the general case, let us notice that for two bounded stopping times $\tau \le \rho \le M$ one can similarly show that
$$E W_\tau W_\rho = E W_\tau^2 \tag{6.21}$$
and
$$E( W_\rho - W_\tau )^2 = E\rho - E\tau, \tag{6.22}$$
because, by optional stopping, $(W_{\tau_n}, \mathcal{F}_{\tau_n}), (W_{\rho_n}, \mathcal{F}_{\rho_n})$ is a martingale and $E(W_{\rho_n} \mid \mathcal{F}_{\tau_n}) = W_{\tau_n}$.
Finally, we consider the general case. Let us define $\tau(n) = \min(\tau, n)$. Since $E W_{\tau(n)}^2 = E\tau(n) \le E\tau < \infty$, the sequence $(W_{\tau(n)})$ is uniformly integrable and, thus, $0 = E W_{\tau(n)} \to E W_\tau$, which proves that $E W_\tau = 0$. Next, for $m \le n$,
$$E\bigl( W_{\tau(n)} - W_{\tau(m)} \bigr)^2 = E\tau(n) - E\tau(m),$$
using (6.21), (6.22) and the fact that $\tau(n), \tau(m)$ are bounded stopping times. Since $\tau(n) \uparrow \tau$, Fatou's lemma and the monotone convergence theorem imply
$$E\bigl( W_\tau - W_{\tau(m)} \bigr)^2 \le E\tau - E\tau(m) \to 0,$$
which means that $E W_{\tau(m)}^2 \to E W_\tau^2$. Since $E W_{\tau(m)}^2 = E\tau(m)$ and $E\tau(m) \to E\tau$ by the monotone convergence theorem, this implies that $E W_\tau^2 = E\tau$. $\square$
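The identities $E W_\tau = 0$ and $E W_\tau^2 = E\tau$ can be observed for the exit time of an interval $(-a, b)$, for which $E\tau = ab$. The following random-walk sketch is my addition; the step size introduces a small overshoot bias.

```python
import math
import random

def run_until_exit(a, b, dt=1e-3):
    # random-walk approximation of W run until it leaves (-a, b)
    w, t, sd = 0.0, 0.0, math.sqrt(dt)
    while -a < w < b:
        w += random.gauss(0.0, sd)
        t += dt
    return w, t

random.seed(6)
a = b = 1.0
samples = [run_until_exit(a, b) for _ in range(1500)]
mean_w = sum(w for w, t in samples) / len(samples)
mean_w2 = sum(w * w for w, t in samples) / len(samples)
mean_tau = sum(t for w, t in samples) / len(samples)
print(mean_w, mean_w2, mean_tau)
```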
Theorem 6.13 (Skorohod's imbedding). For any random variable $Y$ with $EY = 0$ and $EY^2 < \infty$, there exists a stopping time $\tau$ such that $W_\tau \stackrel{d}{=} Y$ and $E\tau = EY^2$.
Proof. Let us start with the simplest case when $Y$ takes only two values, $Y \in \{-a, b\}$ for $a, b > 0$. The condition $EY = 0$ determines the distribution of $Y$:
$$p b + (1-p)(-a) = 0 \quad\text{and}\quad p = \frac{a}{a+b}. \tag{6.23}$$
Let $\tau = \inf\{t > 0 : W_t = -a \text{ or } b\}$ be the hitting time of the two-sided boundary $-a$, $b$. The tail probability of $\tau$ can be bounded by
Consider a sequence of $\sigma$-algebras
$$\mathcal{B}_1 \subseteq \mathcal{B}_2 \subseteq \ldots \subseteq \mathcal{B}.$$
Given $\mathcal{B}_j$, let us define $\mathcal{B}_{j+1}$ by splitting each finite interval $[c,d) \in \mathcal{B}_j$ into two intervals $[c, (c+d)/2)$ and $[(c+d)/2, d)$, splitting the infinite interval $(-\infty, -j)$ into $(-\infty, -(j+1))$ and $[-(j+1), -j)$ and, similarly, splitting $[j, +\infty)$ into $[j, j+1)$ and $[j+1, \infty)$. Consider the right-closed martingale
$$Y_j = E(Y \mid \mathcal{B}_j). \tag{6.25}$$
On an interval $[c,d)$ with $\mu([c,d)) > 0$, where $\mu$ is the law of $Y$, the variable $Y_j$ takes the value
$$y = \frac{1}{\mu([c,d))} \int_{[c,d)} x\, d\mu(x). \tag{6.26}$$
If $\mu([c,d)) = 0$ we pick any $y \in (c,d)$ as the value of $Y_j$ on $[c,d)$. Since in the $\sigma$-algebra $\mathcal{B}_{j+1}$ the interval $[c,d)$ is split into two intervals, the random variable $Y_{j+1}$ can take only two values on the interval $[c,d)$, say $c \le y_1 < y_2 < d$. Note that $\{W_{\tau_j} = y\} \in \mathcal{F}_{\tau_j}$ and, by the strong Markov property in Theorem 6.6,
Skorohod's imbedding will be used below to reduce the case of sums of i.i.d. random variables to the following law of the iterated logarithm for the Brownian motion, which we prove first.
Theorem 6.14. If $u(t) = \sqrt{2 t\, \ell(t)}$, where $\ell(t) = \log\log t$, then
$$\limsup_{t\to+\infty} \frac{W_t}{u(t)} = 1 \quad\text{a.s.}$$
Let us start with the following result, which shows that the Brownian motion does not fluctuate much between times $s$ and $(1+\varepsilon)s$ for large $s$.
Lemma 6.6. For any $\varepsilon \in (0,1)$,
$$\limsup_{s\to\infty}\ \sup\Bigl\{ \frac{|W_t - W_s|}{u(s)} : s \le t \le (1+\varepsilon)s \Bigr\} \le 8\sqrt{\varepsilon} \quad\text{a.s.}$$
Proof. Let $\varepsilon > 0$, $t_k = (1+\varepsilon)^k$ and $M_k = \sqrt{2\varepsilon}\, u(t_k)$. By symmetry, the equation (6.19) and the Gaussian tail estimate in Lemma 6.3,
$$P\Bigl( \sup_{t_k \le t \le t_{k+1}} |W_t - W_{t_k}| \ge M_k \Bigr) \le 2 P\Bigl( \sup_{0 \le t \le t_{k+1}-t_k} W_t \ge M_k \Bigr) = 4\, N(0, t_{k+1}-t_k)\bigl( (M_k, \infty) \bigr) \le 4 \exp\Bigl( -\frac12 \frac{M_k^2}{t_{k+1}-t_k} \Bigr) = 4 \exp\Bigl( -\frac{4\varepsilon t_k \ell(t_k)}{2\varepsilon t_k} \Bigr) = 4 \Bigl( \frac{1}{k \log(1+\varepsilon)} \Bigr)^2.$$
The sum of these probabilities converges and, by the Borel–Cantelli lemma, for large enough $k$,
$$\sup_{t_k \le t \le t_{k+1}} |W_t - W_{t_k}| \le \sqrt{2\varepsilon}\, u(t_k).$$
This series converges so, by the Borel–Cantelli lemma, $W_{a^k} \le (1+\varepsilon) u(a^k)$ for large enough $k$. If we take $a = 1+\varepsilon$, together with Lemma 6.6 this gives that, for large enough $k$, if $t_k \le t < t_{k+1}$ then
$$\frac{W_t}{u(t)} = \frac{W_t - W_{t_k}}{u(t_k)} \cdot \frac{u(t_k)}{u(t)} + \frac{W_{t_k}}{u(t_k)} \cdot \frac{u(t_k)}{u(t)} \le (1+\varepsilon) + 8\sqrt{\varepsilon}.$$
Letting $\varepsilon \downarrow 0$ over some sequence proves that, with probability one,
$$\limsup_{t\to\infty} \frac{W_t}{u(t)} \le 1.$$
To prove that the upper limit is equal to $1$, we will use the Borel–Cantelli lemma for the independent increments $W_{a^k} - W_{a^{k-1}}$ for large values of the parameter $a > 1$. If $0 < \varepsilon < 1$ then, similarly to (6.28), the series
$$\sum_k P\Bigl( W_{a^k} - W_{a^{k-1}} \ge (1-\varepsilon)\, u(a^k - a^{k-1}) \Bigr)$$
diverges and, since these events are independent, by the Borel–Cantelli lemma they occur infinitely often with probability one. We already proved in (6.28) that, for $\varepsilon > 0$, $W_{a^k} \le (1+\varepsilon) u(a^k)$ for large enough $k \ge 1$ or, by symmetry, $W_{a^k} \ge -(1+\varepsilon) u(a^k)$ for large enough $k \ge 1$. Together with $W_{a^k} - W_{a^{k-1}} \ge (1-\varepsilon)\, u(a^k - a^{k-1})$ this gives, infinitely often,
$$\frac{W_{a^k}}{u(a^k)} \ge (1-\varepsilon) \frac{u(a^k - a^{k-1})}{u(a^k)} - (1+\varepsilon) \frac{u(a^{k-1})}{u(a^k)}$$
and
$$\limsup_{t\to\infty} \frac{W_t}{u(t)} \ge \limsup_{k\to\infty} \frac{W_{a^k}}{u(a^k)} \ge (1-\varepsilon)\sqrt{1 - \frac1a} - (1+\varepsilon)\sqrt{\frac1a}.$$
Letting $\varepsilon \downarrow 0$ and $a \to \infty$ over some sequences proves that the upper limit is equal to one. $\square$
The LIL for Brownian motion implies the LIL for sums of independent random
variables via Skorohod’s imbedding.
Theorem 6.15. Suppose that $Y_1, \ldots, Y_n$ are i.i.d. with $E Y_i = 0$, $E Y_i^2 = 1$. If $S_n = Y_1 + \ldots + Y_n$ then
$$\limsup_{n\to\infty} \frac{S_n}{\sqrt{2 n \log\log n}} = 1 \quad\text{a.s.}$$
Proof. Let us define a stopping time $\tau(1)$ such that $W_{\tau(1)} \stackrel{d}{=} Y_1$. By the strong Markov property, the increment of the process after the stopping time is independent of the process before the stopping time and has the law of the Brownian motion. Therefore, we can define $\tau(2)$ such that $W_{\tau(1)+\tau(2)} - W_{\tau(1)} \stackrel{d}{=} Y_2$ and, by independence, $W_{\tau(1)+\tau(2)} \stackrel{d}{=} Y_1 + Y_2$, and $\tau(1), \tau(2)$ are i.i.d. By induction, we can define i.i.d. $\tau(1),\ldots,\tau(n)$ such that, for $T(n) = \tau(1) + \ldots + \tau(n)$,
$$\Bigl( \frac{S_n}{u(n)} \Bigr)_{n\ge1} \stackrel{d}{=} \Bigl( \frac{W_{T(n)}}{u(n)} \Bigr)_{n\ge1}.$$
By Exercise 2.3.6, it is enough to prove the almost sure convergence of the right hand side. Let us write
$$\frac{W_{T(n)}}{u(n)} = \frac{W_n}{u(n)} + \frac{W_{T(n)} - W_n}{u(n)}.$$
By the LIL for Brownian motion,
$$\limsup_{n\to\infty} \frac{W_n}{u(n)} = 1.$$
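Theorem 6.15 can be watched in action for coin flips. The sketch below is my addition; the horizon is arbitrary and far from the asymptotic regime, so the running maximum of $S_n/u(n)$ only roughly indicates the limit $1$.

```python
import math
import random

random.seed(7)
N = 200000
s, best = 0, 0.0
for n in range(1, N + 1):
    s += random.choice((-1, 1))       # a +/-1 step with mean 0, variance 1
    if n >= 100:
        # ratio S_n / u(n) with u(n) = sqrt(2 n log log n)
        best = max(best, s / math.sqrt(2 * n * math.log(math.log(n))))
print(best)
```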
The LIL for Brownian motion also implies the local LIL:
$$\limsup_{t\to0} \frac{W_t}{\sqrt{2t\, \ell(1/t)}} = 1.$$
It is easy to check that if $W_t$ is a Brownian motion then $t W_{1/t}$ is also a Brownian motion, and the result follows by the change of variable $t \to 1/t$. To check that $t W_{1/t}$ is a Brownian motion, notice that for $t < s$,
$$E\, t W_{1/t} \bigl( s W_{1/s} - t W_{1/t} \bigr) = st\cdot\frac1s - t^2\cdot\frac1t = t - t = 0$$
and
$$E\bigl( t W_{1/t} - s W_{1/s} \bigr)^2 = t + s - 2t = s - t.$$
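The covariance computation behind time inversion can be verified by simulation. Below (my addition; the times $t, s$ and the sample size are arbitrary) the product moment $E[(tW_{1/t})(sW_{1/s})]$ is estimated and compared with $\min(t,s)$, the Brownian covariance.

```python
import math
import random

random.seed(8)
t, s, m = 0.5, 2.0, 20000   # t < s, so 1/s < 1/t
acc = 0.0
for _ in range(m):
    w_s = random.gauss(0.0, math.sqrt(1.0 / s))                  # W_{1/s}
    w_t = w_s + random.gauss(0.0, math.sqrt(1.0 / t - 1.0 / s))  # W_{1/t}
    acc += (t * w_t) * (s * w_s)
print(acc / m)  # should be close to min(t, s) = 0.5 if t W_{1/t} is a BM
```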
Exercise 6.5.1. Show that the condition $E\tau < \infty$ in Theorem 6.12 above cannot be replaced with the condition $E W_\tau^2 < \infty$. Hint: see the previous section.
Exercise 6.5.2. Consider the sequence $T(n) = \tau(1) + \ldots + \tau(n)$ as in the proof of Theorem 6.15. Prove that, for any $\varepsilon > 0$,
$$\lim_{n\to\infty} P\Bigl( \max_{j\le n} \frac{|W_{T(j)} - W_j|}{\sqrt{n}} \ge \varepsilon \Bigr) = 0.$$
Hint: first prove that, by the SLLN, for any $\delta > 0$,
$$\lim_{n\to\infty} P\Bigl( \max_{j\le n} \frac{|T(j) - j|}{n} \ge \delta \Bigr) = 0,$$
and then use the fact that $(\sqrt{c}\, W_t)_{t\ge0} \stackrel{d}{=} (W_{ct})_{t\ge0}$.
Exercise 6.5.3. For $t \le 1$, define the process $X_t^n := W_{T(nt)}/\sqrt{n}$ when $nt$ is an integer, and by linear interpolation in between, as in Donsker's theorem. Let $K$ be a closed set in $(C[0,1], \|\cdot\|_\infty)$. Show that, for any $\varepsilon \ge 0$,
$$\limsup_{n\to\infty} P(X^n \in K) \le P(W \in K_\varepsilon),$$
where $K_\varepsilon$ is the closed $\varepsilon$-neighborhood of $K$. Hint: use the previous exercise for $\varepsilon > 0$ and the fact that $(\sqrt{c}\, W_t)_{t\ge0} \stackrel{d}{=} (W_{ct})_{t\ge0}$.
Exercise 6.5.4 (Donsker's theorem). In the setting of the previous exercise, prove that $(X_t^n)_{t\le1}$ converges in distribution to $(W_t)_{t\le1}$ in $(C[0,1], \|\cdot\|_\infty)$. Hint: use the Portmanteau theorem.
Chapter 7
Poisson Processes
In this section we will introduce and review several important properties of Poisson processes on a measurable space $(S, \mathcal{S})$. In applications, $(S,\mathcal{S})$ is usually some nice space, such as a Euclidean space $\mathbb{R}^n$ with the Borel $\sigma$-algebra. However, in general, it is enough to require that the diagonal $\{(s,s') : s = s'\}$ be measurable in the product space $S \times S$ which, in particular, implies that every singleton set $\{s\}$ in $S$ is measurable as a section of the diagonal. This condition is needed to be able to write $P(X = Y)$ for a pair $(X,Y)$ of random variables defined on the product space. From now on, every time we consider a measurable space we will assume that it satisfies this condition. Let us also notice that a product $S \times S'$ of two such spaces will also satisfy this condition, since $\{(s_1, s_1') = (s_2, s_2')\} = \{s_1 = s_2\} \cap \{s_1' = s_2'\}$.
Let $\mu$ and $\mu_n$ for $n \ge 1$ be non-atomic (not having any atoms, i.e. points of positive measure) measures on $S$ such that
$$\mu = \sum_{n\ge1} \mu_n, \qquad \mu_n(S) < \infty. \tag{7.1}$$
For each $n \ge 1$, let $N_n$ be a random variable with the Poisson distribution $P(\mu_n(S))$ with mean $\mu_n(S)$, and let $(X_{n\ell})_{\ell\ge1}$ be i.i.d. random variables, also independent of $N_n$, with the distribution
$$p_n(B) = \frac{\mu_n(B)}{\mu_n(S)}. \tag{7.2}$$
µn (S)
We assume that all these random variables are independent for different n 1. The
condition that µ is non-atomic implies that P(Xn` = Xm j ) = 0 if n 6= m or ` 6= j.
Let us consider random sets
[
Pn = Xn1 , . . . , XnNn and P = Pn . (7.3)
n 1
The set P will be called a Poisson process on S with the mean measure µ. Let us
point out a simple observation that will be used many times below that if, first, we
are given the means $\mu_n(S)$ and distributions (7.2) that were used to generate the set (7.3), then the measure $\mu$ in (7.1) can be written as
$$\mu = \sum_{n\ge1} \mu_n(S)\, p_n. \tag{7.4}$$
We will show in Theorem 7.4 below that when the measure $\mu$ is $\sigma$-finite then, in some sense, this definition of a Poisson process $\Pi$ does not depend on the particular representation (7.1). However, for several reasons it is convenient to think of the above construction as the definition of a Poisson process. First of all, it allows us to avoid any discussion of what "a random set" means and, moreover, many important properties of Poisson processes follow from it rather directly. In any case, we will show in Theorem 7.5 below that all such processes satisfy the traditional definition of a Poisson process.
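The construction (7.3) is easy to implement when the representation has a single finite piece. The sketch below is my addition (the inverse-c.d.f. Poisson sampler and all parameters are ad hoc); it checks that the count of points in a set $A$ behaves like a Poisson variable with mean $\mu(A)$.

```python
import math
import random

def poisson(mu):
    # inverse-c.d.f. sampling of a Poisson(mu) random variable
    u, k, p = random.random(), 0, math.exp(-mu)
    c = p
    while c < u and k < 10 * (int(mu) + 10):
        k += 1
        p *= mu / k
        c += p
    return k

random.seed(9)
mu_total, m = 5.0, 3000   # mean measure mu = 5 * (uniform distribution on [0,1])
counts = []
for _ in range(m):
    points = [random.random() for _ in range(poisson(mu_total))]  # the set (7.3)
    counts.append(sum(x < 0.5 for x in points))                   # N(A), A = [0, 1/2)
mean = sum(counts) / m
var = sum((c - mean) ** 2 for c in counts) / m
print(mean, var)  # Poisson(mu(A)) = Poisson(2.5) has equal mean and variance
```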
One important immediate consequence of the definition in (7.3) is the Mapping Theorem. Given a Poisson process $\Pi$ on $S$ with the mean measure $\mu$ and a measurable map $f : S \to S'$ into another measurable space $(S', \mathcal{S}')$, let us consider the image set
$$f(\Pi) = \bigcup_{n\ge1} \bigl\{ f(X_{n1}), \ldots, f(X_{nN_n}) \bigr\}.$$
Since the random variables $f(X_{n\ell})$ have the distribution $p_n \circ f^{-1}$ on $S'$, the set $f(\Pi)$ fits the definition (7.3) corresponding to the measure
$$\sum_{n\ge1} \mu_n(S)\, \bigl( p_n \circ f^{-1} \bigr) = \sum_{n\ge1} \mu_n \circ f^{-1} = \mu \circ f^{-1}.$$
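The Mapping Theorem can be sanity-checked with $f(x) = x^2$: the count of the image process in a set should be Poisson with mean given by $\mu \circ f^{-1}$. A sketch (my addition, with ad hoc parameters):

```python
import math
import random

random.seed(10)
lam, m = 5.0, 3000   # mean measure mu = lam * (Lebesgue measure on [0,1])

def process_points():
    # Poisson process on [0,1] with intensity lam via exponential interarrivals
    pts, t = [], -math.log(random.random()) / lam
    while t < 1.0:
        pts.append(t)
        t += -math.log(random.random()) / lam
    return pts

# image process f(P) with f(x) = x^2: the count of f(P) in [0, 1/4] equals the
# count of P in [0, 1/2], so it should be Poisson with mean lam / 2 = 2.5
counts = [sum(x * x <= 0.25 for x in process_points()) for _ in range(m)]
mean = sum(counts) / m
print(mean)
```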
Next, let us consider a sequence $(\lambda_m)$ of measures that satisfy (7.1), $\lambda_m = \sum_{n\ge1} \lambda_{mn}$ with $\lambda_{mn}(S) < \infty$. Let $\Pi_m = \bigcup_{n\ge1} \Pi_{mn}$ be a Poisson process with the mean measure $\lambda_m$ defined as in (7.3), and suppose that all these Poisson processes are generated independently over $m \ge 1$. Since
$$\lambda := \sum_{m\ge1} \lambda_m = \sum_{m\ge1} \sum_{n\ge1} \lambda_{mn} = \sum_{\ell\ge1} \lambda_{m(\ell) n(\ell)}$$
for any enumeration $(m(\ell), n(\ell))_{\ell\ge1}$ of the pairs of indices, the union $\bigcup_{m\ge1} \Pi_m$ is again of the form (7.3) and is a Poisson process with the mean measure $\lambda$ (the Superposition Theorem).
The above results give us several ways to generate a new Poisson process from the old one. However, in each case, the new process is generated in a way that depends on a particular representation of its mean measure. We will now show that, when the measure $\mu$ is $\sigma$-finite, the random set $\Pi$ can be generated in a way that, in some sense, does not depend on the particular representation (7.1). This point is very important because, often, given a Poisson process with one representation of its mean measure, we study its properties using another, more convenient, representation. Suppose that $S$ is equal to a disjoint union $\bigcup_{m\ge1} S_m$ of sets such that $0 < \mu(S_m) < \infty$, in which case
$$\mu = \sum_{m\ge1} \mu|_{S_m} \tag{7.7}$$
is another representation of the type (7.1), where $\mu|_{S_m}$ is the restriction of $\mu$ to the set $S_m$. The following holds.
Theorem 7.4. If the measure $\mu$ is $\sigma$-finite then, for any set $A$ with $\mu(A) < \infty$, conditionally on the number of points $N(A) = |\Pi \cap A|$, the points of $\Pi \cap A$ are i.i.d. with the distribution $\mu|_A/\mu(A)$; in particular, the law of $\Pi$ does not depend on the representation (7.1).
We will deduce this from the following traditional description of a Poisson process.
Theorem 7.5. For any disjoint measurable sets $A_1,\ldots,A_k$, the counts $N(A_\ell) = |\Pi \cap A_\ell|$ are independent and each $N(A_\ell)$ has the Poisson distribution $P(\mu(A_\ell))$.
Proof. Given $x \ge 0$, let us denote the weights of the Poisson distribution $P(x)$ with mean $x$ by
$$p_j(x) = P(x)\bigl( \{j\} \bigr) = \frac{x^j}{j!}\, e^{-x}.$$
Consider disjoint sets $A_1, \ldots, A_k$ and let $A_0 = (\bigcup_{i\le k} A_i)^c$ be the complement of their union. Fix $m_1, \ldots, m_k \ge 0$ and let $m = m_1 + \cdots + m_k$. Given any set $A$, denote $N_n(A) = |\Pi_n \cap A|$. With this notation, let us compute the probability of the event
$$\Omega = \bigl\{ N_n(A_1) = m_1, \ldots, N_n(A_k) = m_k \bigr\}.$$
Recall that the random variables $(X_{n\ell})_{\ell\ge1}$ are i.i.d. with the distribution $p_n$ defined in (7.2) and, therefore, conditionally on $N_n$ in (7.3), the cardinalities $(N_n(A_\ell))_{0\le\ell\le k}$ have the multinomial distribution and we can write
$$P(\Omega) = \sum_{j\ge0} P\bigl( \Omega \mid N_n = m+j \bigr)\, P(N_n = m+j) = \sum_{j\ge0} P\bigl( \{N_n(A_0) = j\} \cap \Omega \mid N_n = m+j \bigr)\, p_{m+j}\bigl( \mu_n(S) \bigr)$$
$$= \sum_{j\ge0} \frac{(m+j)!}{j!\, m_1! \cdots m_k!}\, p_n(A_0)^j\, p_n(A_1)^{m_1} \cdots p_n(A_k)^{m_k}\, \frac{\mu_n(S)^{m+j}}{(m+j)!}\, e^{-\mu_n(S)}$$
$$= \sum_{j\ge0} \frac{1}{j!\, m_1! \cdots m_k!}\, \mu_n(A_0)^j\, \mu_n(A_1)^{m_1} \cdots \mu_n(A_k)^{m_k}\, e^{-\mu_n(S)} = \frac{\mu_n(A_1)^{m_1}}{m_1!}\, e^{-\mu_n(A_1)} \cdots \frac{\mu_n(A_k)^{m_k}}{m_k!}\, e^{-\mu_n(A_k)}.$$
This means that the cardinalities $N_n(A_\ell)$ for $1 \le \ell \le k$ are independent random variables with the distributions $P(\mu_n(A_\ell))$. Since all measures $p_n$ are non-atomic, $P(X_{nj} = X_{mk}) = 0$ for any $(n,j) \ne (m,k)$. Therefore, the sets $\Pi_n$ are all disjoint with probability one and $N(A) = \sum_{n\ge1} N_n(A)$.
First of all, the cardinalities $N(A_1), \ldots, N(A_k)$ are independent, since we showed that $N_n(A_1), \ldots, N_n(A_k)$ are independent for each $n \ge 1$, and it remains to show that $N(A)$ has the Poisson distribution with mean $\mu(A) = \sum_{n\ge1} \mu_n(A)$. The partial sum $S_m = \sum_{n\le m} N_n(A)$ has the Poisson distribution with mean $\sum_{n\le m} \mu_n(A)$ and, since $S_m \uparrow N(A)$, for any integer $r \ge 0$, the probability $P(N(A) \le r)$ equals
$$\lim_{m\to\infty} P(S_m \le r) = \lim_{m\to\infty} \sum_{j\le r} p_j\Bigl( \sum_{n\le m} \mu_n(A) \Bigr) = \sum_{j\le r} p_j\bigl( \mu(A) \bigr).$$
If $\mu(A) < \infty$, this shows that $N(A)$ has the distribution $P(\mu(A))$. If $\mu(A) = \infty$ then $P(N(A) \le r) = 0$ for all $r \ge 0$, which implies that $N(A)$ is countably infinite. $\square$
Proof (of Theorem 7.4). First, suppose that $\mu(S) < \infty$ and that the Poisson process $\Pi$ was generated as in (7.3) using some sequence of measures $(\mu_n)$. By Theorem 7.5, the cardinality $N = N(S)$ has the distribution $P(\mu(S))$. Let us show that, conditionally on the event $\{N = n\}$, the set $\Pi$ "looks like" an i.i.d. sample of size $n$ from the distribution $p(\cdot) = \mu(\cdot)/\mu(S)$ or, more precisely, if we randomly assign the labels $\{1,\ldots,n\}$ to the points in $\Pi$, the resulting vector $(X_1,\ldots,X_n)$ has the same distribution as an i.i.d. sample from $p$. Let us consider some measurable sets $B_1,\ldots,B_n \in \mathcal{S}$ and compute
$$P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) = P\bigl( X_1 \in B_1, \ldots, X_n \in B_n \mid N = n \bigr). \tag{7.9}$$
Let $A_1,\ldots,A_k$ be the disjoint atoms of the algebra generated by $B_1,\ldots,B_n$, so that, for some $\delta_{\ell j} \in \{0,1\}$,
$$I(x \in B_\ell) = \sum_{j\le k} \delta_{\ell j}\, I(x \in A_j).$$
Then
$$P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) = E_n \prod_{\ell\le n} \sum_{j\le k} \delta_{\ell j}\, I(X_\ell \in A_j) = \sum_{j_1,\ldots,j_n \le k} \delta_{1 j_1} \cdots \delta_{n j_n}\, P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr). \tag{7.10}$$
If $I_j = \{\ell \le n : j_\ell = j\}$ then
$$P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr) = P_n\bigl( X_\ell \in A_j \text{ for all } j \le k,\ \ell \in I_j \bigr).$$
If $n_j = |I_j|$ then the last event can be expressed in words by saying that, for each $j \le k$, we observe $n_j$ points of the random set $\Pi$ in the set $A_j$ and then assign the labels in $I_j$ to the points $\Pi \cap A_j$. By Theorem 7.5, the probability to observe $n_j$ points in each set $A_j$, given that $N = n$, equals
$$P_n\bigl( N(A_j) = n_j,\ j \le k \bigr) = \frac{P\bigl( N(A_j) = n_j,\ j \le k \bigr)}{P(N = n)} = \prod_{j\le k} \frac{\mu(A_j)^{n_j}}{n_j!}\, e^{-\mu(A_j)} \Bigm/ \frac{\mu(S)^n}{n!}\, e^{-\mu(S)},$$
while the probability to randomly assign the labels in $I_j$ to the points in $A_j$ for all $j \le k$ is equal to $\prod_{j\le k} n_j!/n!$. Therefore,
$$P_n\bigl( X_1 \in A_{j_1}, \ldots, X_n \in A_{j_n} \bigr) = \prod_{j\le k} \Bigl( \frac{\mu(A_j)}{\mu(S)} \Bigr)^{n_j} = \prod_{\ell\le n} p(A_{j_\ell}), \tag{7.11}$$
and, going back to (7.10),
$$P_n\bigl( X_1 \in B_1, \ldots, X_n \in B_n \bigr) = \Bigl( \sum_{j_1\le k} \delta_{1 j_1}\, p(A_{j_1}) \Bigr) \cdots \Bigl( \sum_{j_n\le k} \delta_{n j_n}\, p(A_{j_n}) \Bigr) = p(B_1) \cdots p(B_n).$$
Clearly, the measure $\mu$ is $\sigma$-finite, which can be seen, for example, by considering the partition $(0,\infty) = \bigcup_{m\ge1} S_m$ with $S_1 = [1,\infty)$ and $S_m = [1/m, 1/(m-1))$ for $m \ge 2$, since
$$\int_1^\infty \zeta\, x^{-1-\zeta}\, dx = 1 < \infty.$$
By Theorem 7.4, each cardinality $N(S_m) = |\Pi \cap S_m|$ has the Poisson distribution with mean $\mu(S_m)$ and, conditionally on $N(S_m)$, the points in $\Pi \cap S_m$ are i.i.d. with the distribution $\mu|_{S_m}/\mu(S_m)$. Therefore, by Wald's identity,
$$E \sum_{x\in\Pi\cap S_m} x = E N(S_m) \int_{S_m} \frac{x\, \mu(dx)}{\mu(S_m)} = \int_{S_m} x\, \mu(dx)$$
and, therefore,
$$E \sum_{x\in\Pi} x\, I(x < 1) = \int_0^1 x\, \mu(dx) = \int_0^1 \zeta\, x^{-\zeta}\, dx = \frac{\zeta}{1-\zeta} < \infty. \tag{7.13}$$
This means that the sum $\sum_{x\in\Pi} x\, I(x < 1)$ is finite with probability one and, since there are finitely many points in the set $\Pi \cap [1,\infty)$, the sum $\sum_{x\in\Pi} x$ is also finite with probability one.
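The computation (7.13) can be reproduced numerically by generating $\Pi$ via the mapping of Exercise 7.1.4 below: if $t_1 < t_2 < \ldots$ are the points of a unit-rate Poisson process on $(0,\infty)$, then the points $x = t^{-1/\zeta}$ have the mean measure $\zeta x^{-1-\zeta}\,dx$. A sketch with $\zeta = 1/2$ (my addition; the truncation $T$ introduces a small bias):

```python
import math
import random

random.seed(11)
zeta, T, m = 0.5, 500.0, 2000
acc = 0.0
for _ in range(m):
    t, s = 0.0, 0.0
    while True:
        t += -math.log(random.random())   # unit-rate Poisson arrivals on (0, T]
        if t > T:
            break
        if t > 1.0:                        # x = t^(-1/zeta) < 1 iff t > 1
            s += t ** (-1.0 / zeta)
    acc += s
print(acc / m)  # compare with zeta / (1 - zeta) = 1, cf. (7.13)
```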
Exercise 7.1.1. Let $\Pi$ be a Poisson process with the mean measure (7.12). Show that $E\bigl( \sum_{x\in\Pi} x \bigr)^a < \infty$ for all $0 < a < \zeta$.
Exercise 7.1.2. Let $\Pi$ be a Poisson process on $(0,\infty)$ with the mean measure $\lambda\, dx$. If we enumerate the points in $\Pi$ in increasing order, $X_1 < X_2 < \ldots$, show that the increments $X_1, X_2 - X_1, X_3 - X_2, \ldots$ are i.i.d. exponential random variables with the density $\lambda e^{-\lambda x}$ on $(0,\infty)$. Hint: compute the probability for $X_1,\ldots,X_n$ to be in the intervals $[x_1, x_1 + \Delta x_1], [x_2, x_2 + \Delta x_2], \ldots, [x_n, x_n + \Delta x_n]$ for $x_1 < x_2 < \ldots < x_n$.
Exercise 7.1.3. If $\Pi$ is a Poisson process on $(0,\infty)$ with the mean measure (7.12) and $(U_x)_{x\in\Pi}$ are i.i.d. markings uniform on $[0,1]$, compute the mean measure of the process $\{x U_x : x \in \Pi\}$. Is it a Poisson process?
Exercise 7.1.4. If $\Pi$ is a Poisson process on $(0,\infty)$ with the Lebesgue mean measure $dx$ and $\zeta > 0$, show that $\{x^{-1/\zeta} : x \in \Pi\}$ is a Poisson process on $(0,\infty)$ with the mean measure (7.12).
In this section we study sums of the form
$$S = \sum_{x\in\Pi} f(x), \tag{7.14}$$
where $\Pi$ is a Poisson process with the mean measure $\mu$ and $f$ is a real-valued measurable function. Since the points in $\Pi$ are not necessarily ordered, we understand that the series converges if it converges absolutely. If we consider the disjoint subsets $S_+ = \{f > 0\}$ and $S_- = \{f < 0\}$ then $\Pi_+ = \Pi \cap S_+$ and $\Pi_- = \Pi \cap S_-$ are independent Poisson processes and we can decompose $S$ as the difference $S_+ - S_-$ of two independent random variables
$$S_\pm = \sum_{x\in\Pi_\pm} |f(x)|. \tag{7.15}$$
Theorem 7.6 (Campbell's Theorem I). If $\Pi$ is a Poisson process with the mean measure $\mu$ and $f \ge 0$ then the Laplace transform of $S = \sum_{x\in\Pi} f(x)$ equals
$$E e^{-tS} = \exp\Bigl[ -\int \bigl( 1 - e^{-t f(x)} \bigr)\, d\mu(x) \Bigr]. \tag{7.17}$$
Moreover, $ES = \int f(x)\, d\mu(x)$ and $\mathrm{Var}(S) = \int f(x)^2\, d\mu(x)$, finite or infinite. (Of course, the variance is defined only if $ES < \infty$.)
Proof. First, suppose that $f$ is a simple (step) function taking non-zero values $(f_j)_{j\le k}$ (and possibly zero) and each level set $A_j = \{x : f(x) = f_j\}$ has finite measure, $m_j = \mu(A_j) < \infty$. Then the random variables $N_j = \mathrm{card}(\Pi \cap A_j)$ are independent $P(m_j)$ for $j \le k$, the sum $S = \sum_{x\in\Pi} f(x) = \sum_{j\le k} f_j N_j$ and, by independence,
$$E e^{-tS} = E e^{-t \sum_{j\le k} f_j N_j} = \prod_{j\le k} E e^{-t f_j N_j},$$
so
$$E e^{-tS} = \exp\Bigl[ \sum_{j\le k} m_j \bigl( e^{-t f_j} - 1 \bigr) \Bigr] = \exp\Bigl[ \int \bigl( e^{-t f(x)} - 1 \bigr)\, d\mu(x) \Bigr] = \exp\Bigl[ -\int \bigl( 1 - e^{-t f(x)} \bigr)\, d\mu(x) \Bigr].$$
The general case follows from the decomposition (7.15), although we have to replace the Laplace transform by the characteristic function to make sure it is defined for both positive and negative parts.
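Formula (7.17) can be verified by Monte Carlo for a concrete choice, here $\mu = \lambda\,dx$ on $[0,1]$ and $f(x) = x$ (my choices, not from the notes):

```python
import math
import random

random.seed(12)
lam, t, m = 3.0, 1.0, 20000  # mu = lam * Lebesgue on [0,1], f(x) = x

def S():
    # one sample of S = sum_{x in P} f(x); P generated via exponential gaps
    total, u = 0.0, -math.log(random.random()) / lam
    while u < 1.0:
        total += u
        u += -math.log(random.random()) / lam
    return total

emp = sum(math.exp(-t * S()) for _ in range(m)) / m
# (7.17): exp( -lam * int_0^1 (1 - e^{-t x}) dx )
exact = math.exp(-lam * (1.0 - (1.0 - math.exp(-t)) / t))
print(emp, exact)
```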
Theorem 7.7 (Campbell's Theorem II). If Π is a Poisson process with the mean measure µ then the series S = Σ_{x∈Π} f(x) is absolutely convergent almost surely if and only if
    ∫ min(|f(x)|, 1) dµ(x) < ∞,   (7.20)
and in this case the characteristic function of S equals
    E e^{itS} = exp[ ∫ (e^{it f(x)} − 1) dµ(x) ].   (7.21)
Moreover,
    ES = ∫ f(x) dµ(x),   (7.22)
where the expectation exists if and only if the integral converges absolutely. If ES converges then
    Var(S) = ∫ f(x)² dµ(x),   (7.23)
finite or infinite.
for real z = −t, implies the same formula for the imaginary z = it, since both sides are analytic on Re z < 0. ⊓⊔
Exercise 7.2.2. Let Π be a Poisson process on (0, ∞) with the mean measure ζ x^{−1−ζ} dx for ζ ∈ (0, 1). Show that the Laplace transform of Σ_{x∈Π} x is of the form exp(−c t^ζ) and find the constant c.
Exercise 7.2.3. If ζ ∈ (0, 1), Π is a Poisson process on (0, 1) with the mean measure µ(dx) = x^{−2−ζ} dx, and Sm = [1/m, 1/(m−1)) for m ≥ 2, show that the series
    Σ_{m≥2} ( Σ_{x∈Π∩Sm} x − ∫_{Sm} x dµ(x) )
converges almost surely.
The motivation for this definition is to understand the so-called Lévy processes, which are stochastic processes X(c) indexed by c ≥ 0 with X(0) = 0, with independent and stationary increments, and continuous in probability. This means:
(a) (Independence of increments) for any 0 ≤ c₁ < c₂ < · · · < cₙ < ∞, the increments X(c₁), X(c₂) − X(c₁), X(c₃) − X(c₂), …, X(cₙ) − X(cₙ₋₁) are independent;
(b) (Stationary increments) for any c′ < c, the increment X(c) − X(c′) is equal in distribution to X(c − c′);
(c) (Continuity in probability) for any ε > 0 and c ≥ 0, lim_{h→0} P(|X(c + h) − X(c)| > ε) = 0.
The first two properties imply that the distribution of X(c) for any given c must be infinitely divisible, because X(c) is equal to the sum of the i.i.d. increments X((k + 1)c/n) − X(kc/n) for k = 0, …, n − 1.
First, we will show that the characteristic functions of infinitely divisible distributions do not vanish, and for this we will need the following lemma.
Lemma 7.1. If f(t) = E e^{itX} then 1 − Re f(2t) ≤ 4(1 − Re f(t)). Moreover, |f(t)|² is also a characteristic function and, therefore, 1 − |f(2t)|² ≤ 4(1 − |f(t)|²).
Proof. The first statement follows from:
    1 − Re f(2t) = ∫ (1 − cos 2tx) dP(x) = 2 ∫ sin² tx dP(x)
                 = 2 ∫ (1 + cos tx)(1 − cos tx) dP(x)
                 ≤ 4 ∫ (1 − cos tx) dP(x) = 4(1 − Re f(t)).
Iterating the second inequality of the lemma k times gives 1 − |fn(t)|² ≤ 4^k (1 − |f(t/2^k)|^{2/n}). Since f(0) = 1 and f(t) is continuous, we can choose k large enough such that ξ = |f(t/2^k)| > 0. Since ξ^{2/n} → 1 as n → ∞, choosing n large enough we can make the right hand side above smaller than 1/2, which implies that |fn(t)|² > 1/2 and, therefore, |f(t)| = |fn(t)|ⁿ ≠ 0. ⊓⊔
If r(t) = |f(t)| ≠ 0, since f(0) = 1, we can represent f(t) = r(t)e^{iθ(t)} for some unique continuous θ(t) such that θ(0) = 0. If f(t) = fn(t)ⁿ for some continuous fn(t) with fn(0) = 1, because rn(t) := |fn(t)| = r(t)^{1/n} ≠ 0, we can also represent fn(t) = rn(t)e^{iθn(t)} for some unique continuous θn(t) such that θn(0) = 0. Clearly this means that θn(t) = θ(t)/n. In other words, if f(t) is a characteristic function of an infinitely divisible distribution, we can take fn(t) = r(t)^{1/n} e^{iθ(t)/n}.
Next, let us state without proof an obvious result that follows immediately from the definition.
Theorem 7.9. Convolution of infinitely divisible distributions is infinitely divisible.
In other words, the sum of independent infinitely divisible random variables is infinitely divisible. In particular, if f(t) is a characteristic function of an infinitely divisible distribution then so is |f(t)|² = f(−t) f(t).
Theorem 7.10. If a distribution P is a limit of infinitely divisible distributions Pk
then it is also infinitely divisible.
Proof. Let f be the c.f. of P, let fk be the c.f. of Pk and, by infinite divisibility of Pk, let fk,n be the c.f. such that fk(t) = fk,n(t)ⁿ. The convergence Pk → P implies that fk(t) → f(t) and, therefore, writing fk,n(t) = rk(t)^{1/n} e^{iθk(t)/n} as above, fk,n(t) converges to r(t)^{1/n} e^{iθ(t)/n}. This limit is continuous at zero, so by Lévy's continuity theorem it is a characteristic function, and its nth power is f(t). This proves that P is infinitely divisible. ⊓⊔
Theorem 7.11. If f(t) is the characteristic function of an infinitely divisible distribution then, for any c > 0, f(t)^c := r(t)^c e^{icθ(t)} is also the characteristic function of an infinitely divisible distribution.
Proof. If c = m/n ∈ Q then, for any k ≥ 1, f(t)^{m/n} = (φk(t))^k, where φk(t) = (f(t)^{1/nk})^m. Since f(t) is infinitely divisible, f(t)^{1/nk} is a characteristic function and its mth power, φk(t), is also a characteristic function. This proves that f(t)^{m/n} is infinitely divisible. Taking a limit m/n → c and using the previous theorem implies this for all c > 0. ⊓⊔
Theorem 7.12. (i) If (X(c))_{c≥0} is a Lévy process and f(t) is the characteristic function of X(1) then the characteristic function of X(c) is f(t)^c.
(ii) If f(t) is a characteristic function of an infinitely divisible distribution then there exists a Lévy process (X(c))_{c≥0} such that f(t) is the characteristic function of X(1).
Proof. (i) By the properties (a) and (b) of the increments of a Lévy process, X(m/n) is the sum of m i.i.d. copies of X(1/n) and X(1) is the sum of n such copies, so the characteristic function of X(m/n) equals f(t)^{m/n}. If m/n → c then f(t)^{m/n} → f(t)^c and, by continuity in probability (c) of the Lévy process, X(m/n) → X(c) in distribution. This implies that the characteristic function of X(c) is f(t)^c.
(ii) For any 0 ≤ c₁ < c₂ < · · · < cₙ < ∞, let us define the distribution of (X(c₁), …, X(cₙ)) by the conditions that X(0) = 0 and the increments X(c_ℓ) − X(c_{ℓ−1}) for ℓ = 1, …, n are independent and have characteristic functions f(t)^{c_ℓ − c_{ℓ−1}}. These distributions are consistent because, if we remove a point c_ℓ, the increment X(c_{ℓ+1}) − X(c_{ℓ−1}) must have characteristic function f(t)^{c_{ℓ+1} − c_{ℓ−1}}, which equals f(t)^{c_{ℓ+1} − c_ℓ} f(t)^{c_ℓ − c_{ℓ−1}}, the characteristic function of the sum of X(c_{ℓ+1}) − X(c_ℓ) and X(c_ℓ) − X(c_{ℓ−1}). By Kolmogorov's consistency theorem, we can define a family of random variables (X(c))_{c≥0} with these finite dimensional distributions. The properties (a) and (b) of the Lévy process hold by construction. Since f(t)^h → 1 as h ↓ 0, the increment X(c + h) − X(c) converges in distribution to 0, which implies the continuity in probability property (c). ⊓⊔
Example 7.3.2. Let us give another example which is not obviously infinitely divisible and which follows from the above properties. Given p ∈ (0, 1), let us consider a geometric distribution P({n}) = (1 − p)pⁿ for integer n = 0, 1, 2, … If p is the probability of failure, this is the distribution of the number of failures before the first success, which occurs with probability 1 − p. To see that this distribution is infinitely divisible, let us compute the characteristic function
    f(t) = Σ_{n≥0} e^{itn} (1 − p) pⁿ = (1 − p)/(1 − p e^{it}).
Taking the logarithm and expanding,
    log f(t) = log(1 − p) − log(1 − p e^{it}) = Σ_{k≥1} (p^k / k)(e^{ikt} − 1).
Now, let us recall that if X has Poisson distribution with the mean λ > 0 then its characteristic function is
    E e^{itX} = Σ_{k≥0} e^{itk} (λ^k / k!) e^{−λ} = e^{−λ} Σ_{k≥0} (λ e^{it})^k / k! = exp[λ(e^{it} − 1)].
This means that exp[λ(e^{ict} − 1)] for c ∈ R is the characteristic function of cX. If we denote by P(p^k/k) Poisson random variables with the mean p^k/k, independent for k ≥ 1, then it is easy to check that the series Σ_{k≥1} k·P(p^k/k) converges and its characteristic function is f(t) defined above. In other words, this random series gives a representation of a geometric random variable as a sum of independent scaled Poisson random variables. It is an example of a compound Poisson random variable, a notion we now recall. Given λ > 0 and a distribution P with the characteristic function f(t), let N be Poiss(λ) and let X₁, X₂, … be i.i.d. with the distribution P, independent of N. The distribution of Σ_{i=1}^N Xi is called compound Poisson, and its characteristic function equals
    E e^{it Σ_{i=1}^N Xi} = E[ E( e^{it Σ_{i=1}^N Xi} | N ) ] = E f(t)^N = e^{−λ} Σ_{k≥0} (λ f(t))^k / k!
    = exp[λ( f(t) − 1)] = exp[ λ ∫ (e^{itx} − 1) dP(x) ].   (7.24)
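The series representation of the geometric characteristic function above can be checked numerically: the partial sums of Σ_{k≥1} (p^k/k)(e^{ikt} − 1) should converge to log f(t). A deterministic sketch:

```python
import numpy as np

# Check that log[(1-p)/(1 - p e^{it})] equals sum_{k>=1} (p^k/k)(e^{ikt} - 1),
# i.e. the geometric distribution is the law of sum_k k * Poiss(p^k/k).
p = 0.4
for t in [0.3, 1.0, 2.5]:
    f = (1 - p) / (1 - p * np.exp(1j * t))       # geometric c.f.
    k = np.arange(1, 200)                         # truncate the series; p^200 is negligible
    series = np.sum((p ** k / k) * (np.exp(1j * k * t) - 1))
    assert abs(np.log(f) - series) < 1e-10
print("series representation matches")
```

Both sides lie in the principal branch of the logarithm here (the real part of 1 − p e^{it} is positive for p < 1), so comparing with numpy's principal `log` is legitimate.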
We will use this to show the following fundamental fact that will be used in the
next section to derive a representation for general infinitely divisible distributions.
Theorem 7.13. Infinitely divisible distributions are limits of compound Poisson distributions.
Proof. If f(t) is a characteristic function of some infinitely divisible distribution and fn(t) = f(t)^{1/n} is the characteristic function of a distribution Pn then, since n(fn(t) − 1) → log f(t) as n → ∞,
    f(t) = lim_{n→∞} exp[ n(fn(t) − 1) ] = lim_{n→∞} exp[ n ∫ (e^{itx} − 1) dPn(x) ],   (7.25)
and each term on the right hand side is the characteristic function of a compound Poisson distribution (7.24) with λ = n and P = Pn. ⊓⊔
Similarly,
    f(t)^c = lim_{n→∞} exp[ cn ∫ (e^{itx} − 1) dPn(x) ]
is the characteristic function of X(c) for c ≥ 0. For a fixed n, the right hand side corresponds to the following Lévy process Xn(c). If Π is a Poisson process on (0, ∞) with the mean measure n dx and (Y_x)_{x∈Π} are i.i.d. markings of this process from the distribution Pn then
    Xn(c) = Σ_{x∈Π, x≤c} Y_x.
It is obvious that its increments are stationary, independent, and Xn(c) − Xn(c′) has a compound Poisson distribution with λ = n(c − c′) and P = Pn. In other words, in the sense of finite dimensional distributions, all Lévy processes are limits of such jump processes, with i.i.d. jumps Y_x occurring at the locations x of a Poisson process Π.
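The jump process Xn(c) above is easy to simulate. The sketch below uses standard exponential marks as an arbitrary illustrative choice of Pn; for a compound Poisson sum the mean at time c = 1 is the jump rate times the mean jump size.

```python
import numpy as np

# X_n(1) = sum of marks Y_x over the points x <= 1 of a Poisson process
# with mean measure n*dx: the number of jumps in (0, 1] is Poiss(n) and
# the marks are i.i.d. (here Exp(1)), so E X_n(1) = n * E[Y] = n.
rng = np.random.default_rng(2)
n, trials = 5, 100000
samples = np.empty(trials)
for i in range(trials):
    num_jumps = rng.poisson(n * 1.0)                      # points of Pi in (0, 1]
    samples[i] = rng.exponential(1.0, size=num_jumps).sum()
print(samples.mean())   # should be close to n = 5.0
```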
Exercise 7.3.3. In the proof of Theorem 7.10, show that θk(t) → θ(t) for all t.
Exercise 7.3.4. If Π is a Poisson process on R such that Σ_{x∈Π} |x| < ∞ a.s., is the distribution of Σ_{x∈Π} x infinitely divisible?
X = b + Σ_{x∈Π} x
has the distribution P with Laplace transform f(t). Notice that the condition ∫_{(0,∞)} (x ∧ 1) dµ(x) < ∞ coincides with the one in Campbell's theorem and ensures that X < ∞ almost surely. For any n ≥ 1, if Πn is a Poisson process on (0, ∞) with the mean measure µ/n then the distribution Pn of b/n + Σ_{x∈Πn} x satisfies P = Pn^{*n}, by the superposition property of Poisson processes, or by looking at the Laplace transforms. In other words, all f(t) as in (7.28) are Laplace transforms of a sum over a Poisson process on (0, ∞) plus a constant.
Now, suppose that X ≥ 0 is infinitely divisible, which means that, for all n ≥ 1, there exists a distribution Pn on R₊ = [0, ∞) such that P = Pn^{*n}, which implies that
    f(t) = E e^{−tX} = fn(t)ⁿ, where fn(t) = ∫₀^∞ e^{−tx} dPn(x).
First of all, if we take t = 1 then this implies that, for large enough n,
    n ∫₀^∞ (1 − e^{−x}) dPn(x) ≤ −log f(1).
One can check that 1 − e^{−x} ≥ (1 − e^{−1})(x ∧ 1) and, therefore, for large enough n,
    n ∫₀^∞ (x ∧ 1) dPn(x) ≤ (−log f(1)) / (1 − e^{−1}) < ∞.
If we consider the measures Gn on [0, ∞) defined by dGn(x) = n (x ∧ 1) dPn(x) then Gn(R₊) ≤ K < ∞ for all n, i.e. these measures are uniformly bounded.
Next, we show that these measures are also uniformly tight. For t ≤ 1, we can write
    Gn([1/t, ∞)) = n Pn([1/t, ∞)) ≤ (n/(1 − e^{−1})) ∫_{1/t}^∞ (1 − e^{−tx}) dPn(x)
                 ≤ (n/(1 − e^{−1})) ∫₀^∞ (1 − e^{−tx}) dPn(x) → (−log f(t))/(1 − e^{−1}).
Since lim_{t↓0} f(t) = lim_{t↓0} E e^{−tX} = 1, by choosing t small enough we can make the right hand side above smaller than any ε > 0 so, for large enough n, we have Gn([1/t, ∞)) ≤ ε. This implies that (Gn) is uniformly tight.
By the Selection Theorem, we can choose a subsequence (nk)_{k≥1} such that Gnk → G weakly for some finite measure G on R₊. On the one hand,
    ∫₀^∞ ((1 − e^{−tx})/(x ∧ 1)) dGnk(x) = nk ∫₀^∞ (1 − e^{−tx}) dPnk(x) → −log f(t)
where p(x) = e^{−x}( sinh(x)/x − 1 ). Since 0 ≤ p(x) ≤ 1 for x > 0 and p(x) = O(x²) near the origin, the assumption ∫_{(0,∞)} (x ∧ 1) dµ(x) < ∞ implies that
    ∫_{(0,∞)} p(x) dµ(x) < ∞.
Next, we will consider the case of general infinitely divisible random variables
and derive the canonical Lévy-Khintchine representation of their characteristic
functions.
Proof. We saw in (7.25) that, if Pn is the distribution with the characteristic function fn(t) = f(t)^{1/n}, then
    f(t) = lim_{n→∞} exp[ n ∫ (e^{itx} − 1) dPn(x) ].
Taking absolute values, |f(t)| = lim_{n→∞} exp[ −n ∫ (1 − cos tx) dPn(x) ] and, since f(1) ≠ 0, the sequence n ∫ (1 − cos x) dPn(x) is bounded for large enough n and, since 1 − cos x ≥ x²/3 for |x| ≤ 1, we get that
    n ∫_{|x|≤1} x² dPn(x) ≤ K < ∞.   (7.31)
Next, let us consider the average of the above limit over t ∈ [0, δ]. Since
    (1/δ) ∫₀^δ [ n ∫ (1 − cos tx) dPn(x) ] dt = n ∫ (1 − sin δx/(δx)) dPn(x),
we have
    lim_{n→∞} n ∫ (1 − sin δx/(δx)) dPn(x) = −(1/δ) ∫₀^δ log |f(t)| dt,
which is bounded because |f(t)| ≠ 0 for any infinitely divisible distribution. Since sin x/x ≤ 1 and 1 − sin x/x ≥ 1/2 for |x| ≥ 1,
    n Pn(|x| ≥ 1/δ) ≤ 2n ∫_{|x|≥1/δ} (1 − sin δx/(δx)) dPn(x) ≤ −(2/δ) ∫₀^δ log |f(t)| dt + ε
for n large enough. First of all, for δ = 1 this implies that n Pn(|x| ≥ 1) is bounded by a constant for large enough n and, combining this with (7.31), proves that
    n ∫ (x² ∧ 1) dPn(x) ≤ K′ < ∞.   (7.32)
If we consider the measures Gn on R defined by dGn(x) = n (x² ∧ 1) dPn(x) then Gn(R) ≤ K″ < ∞ for all n, i.e. the total masses of these measures are uniformly bounded. Next, since |f(t)| ≈ 1 for small t, we can choose δ above small enough so that
    Gn(|x| ≥ 1/δ) = n Pn(|x| ≥ 1/δ) ≤ 2ε
for large enough n. This means that the sequence of measures (Gn) is uniformly tight.
By the Selection Theorem, we can choose a subsequence (nk)_{k≥1} such that Gnk → G weakly for some finite measure G on R. On the one hand,
    ∫ ((e^{itx} − 1)/(x² ∧ 1)) dGnk(x) = nk ∫ (e^{itx} − 1) dPnk(x) → log f(t)
as k → ∞. On the other hand, in contrast with the case of positive infinitely divisible random variables above, the function (e^{itx} − 1)/(x² ∧ 1) blows up at x = 0 and cannot be extended by continuity to zero. To fix this, let us take any c > 0 and rewrite
    ∫ (e^{itx} − 1) (1/(x² ∧ 1)) dGnk(x) = ∫ ( e^{itx} − 1 − itx I(|x| ≤ c) ) (1/(x² ∧ 1)) dGnk(x)
                                         + it ∫ x I(|x| ≤ c) (1/(x² ∧ 1)) dGnk(x).
The function h(x) = (e^{itx} − 1 − itx I(|x| ≤ c))/(x² ∧ 1) can be extended by continuity to zero, h(0) = −t²/2, in which case its only points of discontinuity are x = c, −c. However, if we choose c > 0 such that c and −c are points of continuity of the limiting measure G then the weak convergence Gnk → G implies that
    ∫ h(x) dGnk(x) → ∫ h(x) dG(x)
as k → ∞. Since the sum of the two integrals above also converges (to log f(t)), this forces the second integral to converge,
    it ∫ x I(|x| ≤ c) (1/(x² ∧ 1)) dGnk(x) → itγc,
for some γc ∈ R. We proved that
    f(t) = exp[ itγc + ∫ ( e^{itx} − 1 − itx I(|x| ≤ c) ) (1/(x² ∧ 1)) dG(x) ].
Changing the indicator I(|x| ≤ c) to I(|x| ≤ 1) will only result in changing the term itγc to itγ for some γ ∈ R, so
    f(t) = exp[ itγ + ∫ ( e^{itx} − 1 − itx I(|x| ≤ 1) ) (1/(x² ∧ 1)) dG(x) ].
As before, although Gn({0}) = 0, in the limit we might have G({0}) = σ² ≥ 0, so
    f(t) = exp[ itγ − t²σ²/2 + ∫_{R\{0}} ( e^{itx} − 1 − itx I(|x| ≤ 1) ) (1/(x² ∧ 1)) dG(x) ].
Finally, if we consider the measure
    dµ(x) := dG(x) / (x² ∧ 1)   (7.34)
on R \ {0} then ∫_{R\{0}} (x² ∧ 1) dµ(x) = G(R \ {0}) < ∞ and the representation (7.30) holds.
To prove the uniqueness of such a representation, one can easily check that
    log f(t) − (1/2) ∫_{t−1}^{t+1} log f(s) ds = ∫ e^{itx} (1 − sin x / x) (1/(x² ∧ 1)) dG(x) = ∫ e^{itx} dF(x),
where we defined the measure F on R by
    dF(x) = (1 − sin x / x) (1/(x² ∧ 1)) dG(x).
One can see that the prefactor in front of dG(x) is strictly positive (when extended by continuity to 1/6 at x = 0) and bounded, so F is a positive finite measure. Since its characteristic function ∫ e^{itx} dF(x) determines F, we showed that f(t) determines F, and therefore G, uniquely. Once we know G, we know σ² = G({0}) and µ, which proves that the representation is unique. ⊓⊔
Theorem 7.16. All functions f (t) of the form (7.30) correspond to some infinitely
divisible distributions.
    S₁ = Σ_{k≥1} x_k (N_k − Δ_k),
where N_k are independent Poiss(Δ_k). The series converges almost surely because the sum of the variances Σ_{k≥1} x_k² Δ_k is finite.
On each interval Sm (separated away from zero) the measure µ₂ is finite and, thus, both integrals are well defined. Let Π be a Poisson process with the mean measure µ₂ on I, and consider the random series
    S₂ = Σ_{m≥1} ( Σ_{x∈Π∩Sm} x − ∫_{Sm} x dµ₂(x) ).
By Campbell's theorem,
    E Σ_{x∈Π∩Sm} x = ∫_{Sm} x dµ₂(x),   Var( Σ_{x∈Π∩Sm} x ) = ∫_{Sm} x² dµ₂(x),
R
so the sum of variances is again bounded by I x2 dµ < • and, therefore, the above
series S2 converges almost surely. Again, using Campbell’s theorem, one can see
that its characteristic function equals (7.37). Since the series is obviously infinitely
divisible, this finishes the proof. t
u
This means that if f (t) can be represented in such a way and µ is not a positive
measure (i.e. the component µ2 is non-trivial) then f (t) is not infinitely divisible.
Proof. Suppose we have two such representations with parameters (γ₁, σ₁², µ₁¹, µ₂¹) and (γ₂, σ₂², µ₁², µ₂²). Then, rearranging the terms,
    itγ₁ − t²σ₁²/2 + ∫_{R\{0}} ( e^{itx} − 1 − itx I(|x| ≤ 1) ) d(µ₁¹ + µ₂²)(x)
    = itγ₂ − t²σ₂²/2 + ∫_{R\{0}} ( e^{itx} − 1 − itx I(|x| ≤ 1) ) d(µ₁² + µ₂¹)(x).
Exercise 7.4.1. For a, b ∈ (0, 1), let
    f(t) = ((1 − b)/(1 + a)) · ((1 + a e^{−it})/(1 − b e^{it})).
Find the distribution for which f(t) is a characteristic function, and show that it is not infinitely divisible. Hint: compute log f(t) and use Lemma 7.2.
Exercise 7.4.2. In the setting of the previous exercise, suppose that a ≤ b and show that |f(t)|² is an infinitely divisible characteristic function. (Together with the previous exercise this implies that the sum or difference of two non-infinitely divisible random variables can be infinitely divisible.)
Exercise 7.4.3. If in Theorem 7.16 the measure µ satisfies ∫_R (|x| ∧ 1) dµ(x) < ∞, what is the random variable corresponding to such a c.f. f(t)?
Exercise 7.4.4. Show that if X is an infinitely divisible random variable with c.f. (7.30) then, for p > 0, E|X|^p < ∞ if and only if ∫_{|x|>1} |x|^p dµ(x) < ∞. Hint: Consider the decomposition in the proof of Theorem 7.16 and prove that the existence of the pth moment is determined by the compound Poisson component corresponding to the first integral in (7.35), because the last component (call it Y) corresponding to the second integral in (7.35) has the moment generating function
    E e^{zY} = exp[ ∫_{[−1,1]} (e^{zx} − 1 − zx) dµ(x) ]
for all z ∈ C.
Exercise 7.4.5. Suppose that X is an infinitely divisible random variable with the c.f. (7.30) with σ² = 0 (i.e. without the Gaussian component) and suppose that the distribution of X is symmetric. Show that, in this case,
    f(t) = E e^{itX} = exp[ ∫_R (cos(tx) − 1) dµ(x) ].   (7.39)
for some constant Kn that depends on n and the parameters of the distribution. If γ = 0 and m₁ = m₂ then Kn = 0.
Proof. If f(t) in (7.40) is the c.f. of X then f(tc) is the c.f. of cX, which is equal to the exponential of
    itcγ + ∫ ( e^{itcx} − 1 − itcx I(|x| ≤ 1) ) p(x) dx
    = itcγ + |c|^α ∫ ( e^{ity} − 1 − ity I(|y| ≤ c) ) p(y) dy   (change of variables y = cx)
    = itL + |c|^α ∫ ( e^{ity} − 1 − ity I(|y| ≤ 1) ) p(y) dy,
is called a subordinator. ⊓⊔
Chapter 8
Exchangeability
Let us start by recalling an example from Section 5. Let (Xi)_{i≥1} be i.i.d. with the Bernoulli distribution with the probability of success θ ∈ [0, 1], i.e. P_θ(Xi = 1) = θ and P_θ(Xi = 0) = 1 − θ, and let u : [0, 1] → R be some continuous function on [0, 1]. Then, by Theorem 2.6 in Section 5, the Bernstein polynomials
    Bn(θ) := Σ_{k=0}^n u(k/n) P_θ( Σ_{i=1}^n Xi = k ) = Σ_{k=0}^n u(k/n) C(n,k) θ^k (1 − θ)^{n−k},
where C(n,k) denotes the binomial coefficient, converge to u(θ) uniformly on [0, 1].
By taking u ≡ 1, it is easy to see that Σ_{k=0}^n p_k^{(n)} = 1, so we can think of p_k^{(n)} as the distribution
    P( X^{(n)} = k/n ) = p_k^{(n)}   (8.2)
of some random variable X^{(n)}. We showed that E Bn(X) = E u(X^{(n)}) → E u(X) for any continuous function u, which means that X^{(n)} converges to X in distribution. In other words, given finitely many moments of X, this construction gives an explicit approximation of the distribution of X.
If we want to express the moment µk for a fixed k in terms of the distribution (8.2), we can write, for n ≥ k,
    µk = E X^k = E X^k (X + (1 − X))^{n−k}
       = Σ_{j=0}^{n−k} C(n−k, j) E X^{k+j} (1 − X)^{n−(k+j)}
       = Σ_{m=k}^{n} C(n−k, m−k) E X^m (1 − X)^{n−m}   (8.3)
       = Σ_{m=k}^{n} C(n−k, m−k) (−Δ)^{n−m} µm = Σ_{m=k}^{n} C(n−k, m−k) C(n,m)^{−1} p_m^{(n)}.
Let us note that the part of this formula which is expressed only in terms of the sequence (µk)_{k≥0},
    µk = Σ_{m=k}^{n} C(n−k, m−k) (−Δ)^{n−m} µm,   (8.4)
holds for any sequence (µk)_{k≥0} and not necessarily a sequence of moments of some random variable X ∈ [0, 1]. For example, for n = k + 1 this formula (called the inversion formula) says that µk = µ_{k+1} − Δµk, and it can be proved by induction for all n ≥ k (exercise).
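The inversion formula (8.4) can be verified numerically. A deterministic sketch, using the moments µm = 1/(m+1) of the Uniform[0,1] distribution and the binomial expansion ((−Δ)^j µ)_m = Σ_{i=0}^j (−1)^i C(j,i) µ_{m+i}:

```python
from math import comb

def neg_delta_power(mu, j, m):
    """((-Delta)^j mu)_m via the binomial expansion of the difference operator."""
    return sum((-1) ** i * comb(j, i) * mu(m + i) for i in range(j + 1))

mu = lambda m: 1.0 / (m + 1)   # moments of Uniform[0,1]
n, k = 7, 2
rhs = sum(comb(n - k, m - k) * neg_delta_power(mu, n - m, m)
          for m in range(k, n + 1))
print(rhs)   # should equal mu_2 = 1/3
```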
Next, given a sequence (µk), we consider the following question: when is (µk) the sequence of moments of some [0, 1]-valued random variable X? By the above, it is necessary that
    p_m^{(n)} = C(n,m) (−Δ)^{n−m} µm ≥ 0   (8.5)
for all n and 0 ≤ m ≤ n. Conversely, suppose that this condition holds. The inversion formula (8.4) still applies and, by the assumption (8.5), p_m^{(n)} ≥ 0. Notice that, by (8.3), for any fixed k,
    µk = Σ_{m=k}^{n} C(n−k, m−k) C(n,m)^{−1} p_m^{(n)}
       = Σ_{m=k}^{n} [ m(m−1)···(m−k+1) / (n(n−1)···(n−k+1)) ] p_m^{(n)} ≈ Σ_{m=0}^{n} (m/n)^k p_m^{(n)}
where k = x₁ + … + xₙ. In other words, in order to generate such an exchangeable sequence of 0's and 1's, we first pick p ∈ [0, 1] from some distribution F and then generate a sequence of i.i.d. Bernoulli random variables with probability of success p.
We have
Similarly, by induction,
Since, by exchangeability, changing the order of 1's and 0's does not affect the probability, this finishes the proof. ⊓⊔
Example 8.1.1 (Polya's urn model). Suppose we have b blue and r red balls in an urn. We pick a ball uniformly at random and return it together with c balls of the same color. Consider the random variables
    Xi = 1 if the ith ball picked is blue, and Xi = 0 otherwise.
The Xi's are not independent but it is easy to check that they are exchangeable. For example,
    P(bbr) = (b/(b+r)) × ((b+c)/(b+r+c)) × (r/(b+r+2c))
           = (b/(b+r)) × (r/(b+r+c)) × ((b+c)/(b+r+2c)) = P(brb).
To identify the distribution F in de Finetti's theorem, let us look at its moments µk in (8.6),
    µk = P(b…b, k times) = (b/(b+r)) × ((b+c)/(b+r+c)) × ··· × ((b+(k−1)c)/(b+r+(k−1)c)).
One can recognize or check that (µk) are the moments of the Beta(α, β) distribution with the density
    (Γ(α+β)/(Γ(α)Γ(β))) x^{α−1} (1−x)^{β−1} I(0 ≤ x ≤ 1)
with the parameters α = b/c, β = r/c. By de Finetti's theorem, we can generate the Xi's by first picking p from the distribution Beta(b/c, r/c) and then generating i.i.d. Bernoulli (Xi)'s with the probability of success p. By the strong law of large
numbers, applied conditionally on p, the proportion of blue balls picked in the first n trials
    pn = (X₁ + … + Xn)/n
will converge to this probability of success p, i.e. in the limit it will be random with the Beta distribution. Recall that this example came up in the exercises in Section 15 on the convergence of martingales, where one showed that
    Yn = (b + (X₁ + … + Xn)c)/(b + r + nc) = (b + pn·nc)/(b + r + nc)
converges almost surely because it is a bounded martingale. Since this limit coincides with the limit of pn, this means that we have identified its distribution. ⊓⊔
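A short simulation illustrates the conclusion of the example. With b = r = c = 1 the limiting distribution Beta(b/c, r/c) = Beta(1, 1) is uniform, so across many independent urns the proportion of blue draws should have mean close to 1/2 and variance close to 1/12:

```python
import numpy as np

def polya_proportion(b, r, c, n, rng):
    """Run one Polya urn for n draws and return the proportion of blue draws."""
    blue, red, picked_blue = b, r, 0
    for _ in range(n):
        if rng.random() < blue / (blue + red):
            blue += c            # return the ball with c extra blue balls
            picked_blue += 1
        else:
            red += c
    return picked_blue / n

rng = np.random.default_rng(3)
props = np.array([polya_proportion(1, 1, 1, 400, rng) for _ in range(4000)])
print(props.mean(), props.var())   # roughly 0.5 and 1/12 = 0.0833...
```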
Exercise 8.1.3. In the setting of de Finetti's theorem above, prove that the limit lim_{n→∞} n^{−1} Σ_{i=1}^n Xi exists almost surely and, conditionally on this limit, the sequence (Xi)_{i≥1} is i.i.d.
where w and (u` ) are i.i.d. random variables with the uniform distribution on [0, 1].
Before we proceed to prove the above results, let us recall the following definition.
Definition 8.1. A measurable space (Ω, B) is called a Borel space if there exists a one-to-one function φ from Ω onto a Borel subset A ⊆ [0, 1] such that both φ and φ^{−1} are measurable. ⊓⊔
Perhaps the most important examples of Borel spaces are complete separable metric spaces and their Borel subsets (see e.g. Section 13.1 in R.M. Dudley, "Real Analysis and Probability"). The existence of the isomorphism φ automatically implies that if we can prove the above results in the case when the elements of the sequence (s_ℓ) or array (s_{ℓ,ℓ′}) take values in [0, 1] then the same representation results hold when the elements take values in a Borel space. Similarly, other standard results for real-valued or [0, 1]-valued random variables are often automatically extended to Borel spaces. For example, one can generate any real-valued random variable as a measurable function of a uniform random variable on [0, 1] using the quantile transform and, therefore, any random element on a Borel space can also be generated as a function of a uniform random variable on [0, 1].
Let us describe another typical measure theoretic argument that will be used
many times below.
Lemma 8.1 (Coding Lemma). Suppose that a random pair (X, Y) takes values in the product of a measurable space (Ω₁, B₁) and a Borel space (Ω₂, B₂). Then there exists a measurable function f : Ω₁ × [0, 1] → Ω₂ such that
    (X, Y) =d ( X, f(X, u) ),   (8.9)
where u is a uniform random variable on [0, 1] independent of X. Rather than using the Coding Lemma itself we will often use the general ideas of conditioning and generating random variables as functions of uniform random variables on [0, 1] that are used in its proof, without repeating the same argument.
Proof. By the definition of a Borel space, one can easily reduce the general case to the case when Ω₂ = [0, 1] equipped with the Borel σ-algebra. Since in this case the regular conditional distribution Pr(x, B) of Y given X exists, for a fixed x ∈ Ω₁, we can define by F(x, y) = Pr(x, [0, y]) the conditional distribution function of Y given x and by
    f(x, u) = F^{−1}(x, u) = inf{ s ∈ [0, 1] : u ≤ F(x, s) }
its quantile transformation. It is easy to see that f is measurable on the product space Ω₁ × [0, 1] because, for any t ∈ R,
    { (x, u) : f(x, u) ≤ t } = { (x, u) : u ≤ F(x, t) }
and, by the definition of the regular conditional probability, F(x, t) = Pr(x, [0, t]) is measurable in x for a fixed t. If u is uniform on [0, 1] then, for a fixed x ∈ Ω₁, f(x, u) has the distribution Pr(x, ·) and this finishes the proof. ⊓⊔
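The quantile transform in the proof is easy to see in action. The sketch below uses an illustrative kernel not taken from the text: given X = x > 0, let Y be exponential with rate x, so F(x, y) = 1 − e^{−xy} and f(x, u) = F^{−1}(x, u) = −log(1 − u)/x. Feeding uniform values u through f(x, ·) reproduces the conditional distribution of Y given X = x:

```python
import numpy as np

def f(x, u):
    """Quantile transform f(x, u) = F^{-1}(x, u) for F(x, y) = 1 - exp(-x*y)."""
    return -np.log(1.0 - u) / x

# A fine midpoint grid of u values stands in for a uniform sample; the
# empirical mean of f(x, u) approximates E[Y | X = x] = 1/x.
u = (np.arange(100000) + 0.5) / 100000
for x in [0.5, 1.0, 2.0]:
    print(x, f(x, u).mean())   # approximately 1/x
```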
The most important way in which the exchangeability condition (8.7) will be used is to say that for any infinite subset I ⊆ N,
    (s_ℓ)_{ℓ∈I} =d (s_ℓ)_{ℓ≥1}.   (8.10)
One important consequence of this observation will be the following. If
    F_I = σ( s_ℓ : ℓ ∈ I )   (8.11)
is the σ-algebra generated by the random variables s_ℓ for ℓ ∈ I then the following holds.
Lemma 8.2. For any infinite subset I ⊆ N and j ∉ I, the conditional expectations
    E( f(s_j) | F_I ) = E( f(s_j) | F_{N\{j}} )
almost surely, for any bounded measurable function f.
Proof. First, using the property (8.10) for I ∪ {j} instead of I implies the equality in distribution (see Exercise 1.4.3 in Section 1.4),
    E( f(s_j) | F_I ) =d E( f(s_j) | F_{N\{j}} ),
and, in particular, the equality of the second moments,
    E[ E( f(s_j) | F_I )² ] = E[ E( f(s_j) | F_{N\{j}} )² ].   (8.12)
Proof (of Theorem 8.3). Let us take any infinite subset I ⊆ N such that its complement N \ I is also infinite. By (8.10), we only need to prove the representation (8.8) for (s_ℓ)_{ℓ∈I}. First, we will show that, conditionally on (s_ℓ)_{ℓ∈N\I}, the random variables (s_ℓ)_{ℓ∈I} are independent. This means that, given n ≥ 1, any distinct ℓ₁, …, ℓₙ ∈ I, and any bounded measurable functions f₁, …, fₙ : R → R,
    E( ∏_{j≤n} f_j(s_{ℓ_j}) | F_{N\I} ) = ∏_{j≤n} E( f_j(s_{ℓ_j}) | F_{N\I} ),   (8.14)
where the second equality follows from Lemma 8.2. This implies that
    E( ∏_{j≤n} f_j(s_{ℓ_j}) | F_{N\I} ) = E( ∏_{j≤n−1} f_j(s_{ℓ_j}) | F_{N\I} ) E( f_n(s_{ℓ_n}) | F_{N\I} )
and (8.14) follows by induction on n. Let us also observe that, because of (8.10), the distribution of the array (s_ℓ, (s_j)_{j∈N\I}) does not depend on ℓ ∈ I. This implies that, conditionally on (s_j)_{j∈N\I}, the random variables (s_ℓ)_{ℓ∈I} are identically distributed in addition to being independent. The product space [0, 1]^∞ with the Borel σ-algebra generated by the product topology is a Borel space (recall that, equipped with the usual metric, it becomes a complete separable metric space). Therefore, in order to conclude the proof, it remains to generate X = (s_ℓ)_{ℓ∈N\I} as a function of a uniform random variable w on [0, 1], X = X(w), and then use the argument in the Coding Lemma 8.1 to generate the sequence (s_ℓ)_{ℓ∈I} as (f(X(w), u_ℓ))_{ℓ∈I}, where (u_ℓ)_{ℓ∈I} are i.i.d. random variables uniform on [0, 1]. This finishes the proof. ⊓⊔
    µ_w = λ ∘ g(w, ·)^{−1},
where λ is the Lebesgue measure on [0, 1]. By the strong law of large numbers for empirical measures (Varadarajan's Theorem 4.5 in Section 4.2),
    µ_w = lim_{n→∞} (1/n) Σ_{ℓ=1}^n δ_{g(w, u_ℓ)}.
The limit on the right hand side is taken in the complete separable metric space P(Ω, B) of probability measures on Ω equipped, for example, with the bounded Lipschitz metric β or the Lévy-Prokhorov metric ρ, and this limit exists almost surely over (u_ℓ)_{ℓ≥1}. This implies that
(i) the limit µ = lim_{n→∞} (1/n) Σ_{ℓ=1}^n δ_{s_ℓ} exists almost surely;
(ii) µ is a (random) probability measure on Ω, i.e. a random element in P(Ω, B);
(iii) given µ, the sequence (s_ℓ)_{ℓ≥1} is i.i.d. with the distribution µ.
The measure µ is called the empirical measure of the sequence (s_ℓ)_{ℓ≥1}. One can now interpret de Finetti's representation as follows. First, we generate µ as a function of a uniform random variable w on [0, 1] and then generate s_ℓ as a function of µ and i.i.d. random variables u_ℓ uniform on [0, 1], using the Coding Lemma. Combining the two steps, we generate s_ℓ as a function of w and u_ℓ.
Remark 8.1. There are two equivalent definitions of a random probability measure on a complete separable metric space Ω with the Borel σ-algebra B. On the one hand, as above, these are just random elements taking values in the space P(Ω, B) of probability measures on (Ω, B) equipped with the topology of weak convergence or a metric that metrizes weak convergence. On the other hand, we can think of a random measure η as a probability kernel, i.e. as a function η = η(x, A) of a generic point x ∈ X for some probability space (X, F, Pr) and a measurable set A ∈ B such that, for a fixed x, η(x, ·) is a probability measure and, for a fixed A, η(·, A) is a measurable function on (X, F). It is well known that these two definitions coincide (see, e.g., Lemma 1.37 and Theorem A2.3 in Kallenberg's "Foundations of Modern Probability").
Example 8.2.1 (de Finetti's representation for {0, 1}-valued sequences). Coming back to the example considered in the previous section, consider an exchangeable sequence (s_ℓ)_{ℓ≥1} of random variables that take values in {0, 1}. Then the empirical measure is a random probability measure on {0, 1}. It is encoded by one (random) parameter p = µ({1}) ∈ [0, 1]. To fix µ means to fix p and, given p, the sequence (s_ℓ)_{ℓ≥1} is i.i.d. Bernoulli with the probability of success equal to p. If η is the distribution of p then, for any n ≥ 1 and any e₁, …, eₙ ∈ {0, 1},
    P( s₁ = e₁, …, sₙ = eₙ ) = ∫_{[0,1]} p^k (1 − p)^{n−k} dη(p),
where k = e₁ + … + eₙ. ⊓⊔
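The mixture formula of the example can be checked numerically for a concrete mixing distribution. The sketch below takes η = Beta(a, b) with a = 2, b = 3 (an arbitrary illustrative choice) and compares numerical integration of the mixture integral with the closed-form Beta integral ∫ p^k (1−p)^{n−k} dη(p) = B(a+k, b+n−k)/B(a, b):

```python
import numpy as np
from math import gamma

def B(a, b):
    """Beta function via the Gamma function."""
    return gamma(a) * gamma(b) / gamma(a + b)

a, b, n, k = 2.0, 3.0, 5, 2
closed_form = B(a + k, b + n - k) / B(a, b)

# Midpoint-rule integration of the same mixture integral against the
# Beta(a, b) density; note it depends on e_1, ..., e_n only through k.
p = (np.arange(200000) + 0.5) / 200000
density = p ** (a - 1) * (1 - p) ** (b - 1) / B(a, b)
numeric = np.mean(p ** k * (1 - p) ** (n - k) * density)
print(closed_form, numeric)   # the two should agree to many decimals
```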
Another piece of information that can be deduced from de Finetti's theorem and is often useful is the following. Suppose that in addition to the sequence (s_ℓ)_{ℓ≥1} we are given another random element Z taking values in a complete separable metric space, and their joint distribution is not affected by permutations of the sequence,
    ( Z, (s_ℓ)_{ℓ≥1} ) =d ( Z, (s_{π(ℓ)})_{ℓ≥1} ).   (8.15)
Theorem 8.4. Conditionally on µ, the sequence (s_ℓ)_{ℓ≥1} is i.i.d. with the distribution µ and independent of Z.
For example, one can deduce Theorem 8.7 in the exercise below from this state-
ment rather directly.
Proof (of Theorem 8.4). If we define t_ℓ = (Z, s_ℓ) then (8.15) implies that (t_ℓ)_{ℓ≥1} is exchangeable. Let
    ν = lim_{n→∞} (1/n) Σ_{ℓ=1}^n δ_{t_ℓ} = lim_{n→∞} (1/n) Σ_{ℓ=1}^n δ_{(Z, s_ℓ)}
be the empirical measure of this sequence. Obviously, ν = δ_Z × µ. Since, given ν, the sequence (Z, s_ℓ) is i.i.d. from ν, this means that, given ν, the sequence (s_ℓ) is i.i.d. from µ. In other words, given Z and µ, the sequence (s_ℓ) is i.i.d. from µ. The statement then follows from the following simple lemma, which appears as one of the exercises in Section 1.4. ⊓⊔
Lemma 8.3. Suppose that three random elements X, Y and Z take values in some complete separable metric spaces. The conditional distribution of X given the pair (Y, Z) depends only on Y (is σ(Y)-measurable) if and only if X and Z are independent conditionally on Y.
in the sense that their finite dimensional distributions are equal. Here is one natural example of an exchangeable array. Given a measurable function σ : [0, 1]⁴ → R and sequences of i.i.d. random variables w, (u_ℓ), (v_{ℓ′}), (x_{ℓ,ℓ′}) that have the uniform distribution on [0, 1], the array
    s_{ℓ,ℓ′} = σ( w, u_ℓ, v_{ℓ′}, x_{ℓ,ℓ′} )   (8.17)
is, obviously, exchangeable. It turns out that all exchangeable arrays are of this form.
Theorem 8.5 (Aldous-Hoover). Any infinite exchangeable array (s_{ℓ,ℓ′})_{ℓ,ℓ′≥1} is equal in distribution to (8.17) for some function σ.
Another version of the Aldous-Hoover representation holds in the symmetric case. A symmetric array s is called weakly exchangeable if for any permutation π of finitely many indices we have the equality in distribution
    ( s_{π(ℓ),π(ℓ′)} )_{ℓ,ℓ′≥1} =d ( s_{ℓ,ℓ′} )_{ℓ,ℓ′≥1}.   (8.18)
Because of the symmetry, we can consider only half of the array indexed by (ℓ, ℓ′) such that ℓ ≤ ℓ′ or, equivalently, consider the array indexed by sets {ℓ, ℓ′} and rewrite (8.18) as
    ( s_{{π(ℓ),π(ℓ′)}} )_{ℓ,ℓ′≥1} =d ( s_{{ℓ,ℓ′}} )_{ℓ,ℓ′≥1}.   (8.19)
Notice that, compared to (8.16), the diagonal elements now play somewhat different roles from the rest of the array. One natural example of a weakly exchangeable array is given by
    s_{{ℓ,ℓ′}} = f( w, u_ℓ, u_{ℓ′}, x_{{ℓ,ℓ′}} ) for ℓ ≠ ℓ′, and s_{{ℓ,ℓ}} = g( w, u_ℓ ),   (8.20)
for any measurable functions g : [0, 1]² → R and f : [0, 1]⁴ → R, where f is symmetric in its middle two coordinates u_ℓ, u_{ℓ′}, and i.i.d. random variables w, (u_ℓ), (x_{{ℓ,ℓ′}}) with the uniform distribution on [0, 1]. Again, it turns out that such examples cover all possible weakly exchangeable arrays.
Theorem 8.6 (Aldous-Hoover). Any infinite weakly exchangeable array is equal in distribution to the array (8.20) for some functions g and f, where f is symmetric in its two middle coordinates.
After we give a proof of Theorem 8.6, we will leave a similar proof of Theorem 8.5
as an exercise (with some hints). We will then give another, quite different, proof of
Theorem 8.5. The proof of the Dovbysh-Sudakov representation in the next section
will be based on Theorem 8.5.
The most important way in which the exchangeability condition (8.18) will be used is to say that, for any infinite subset $I \subseteq \mathbb{N}$,
$$
\bigl( s_{\{\ell,\ell'\}} \bigr)_{\ell,\ell' \in I} \stackrel{d}{=} \bigl( s_{\{\ell,\ell'\}} \bigr)_{\ell,\ell' \ge 1}. \tag{8.21}
$$
Again, one important consequence of this observation will be the following. Given $j, j' \in I$ such that $j \ne j'$, let us now define the $\sigma$-algebra
$$
\mathcal{F}_I(j,j') = \sigma\bigl( s_{\{\ell,\ell'\}} : \ell, \ell' \in I,\ \{\ell,\ell'\} \ne \{j,j'\} \bigr). \tag{8.22}
$$
In other words, this $\sigma$-algebra is generated by all elements $s_{\{\ell,\ell'\}}$ with both indices $\ell$ and $\ell'$ in $I$, excluding $s_{\{j,j'\}}$. The following analogue of Lemma 8.2 holds.
Lemma 8.4. For any infinite subset $I \subseteq \mathbb{N}$, any $j, j' \in I$ such that $j \ne j'$, and any bounded measurable function $f$, the conditional expectations satisfy, almost surely,
$$
\mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j') \bigr) = \mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j') \bigr).
$$

Proof. The proof is almost identical to the proof of Lemma 8.2. The property (8.21) implies the equality in distribution,
$$
\mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j') \bigr) \stackrel{d}{=} \mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j') \bigr),
$$
and, in particular, the equality of the second moments,
$$
\mathbb{E}\, \mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_I(j,j') \bigr)^2 = \mathbb{E}\, \mathbb{E}\bigl( f(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j') \bigr)^2.
$$
Since $\mathcal{F}_I(j,j') \subseteq \mathcal{F}_{\mathbb{N}}(j,j')$, the first conditional expectation is the conditional expectation of the second one given $\mathcal{F}_I(j,j')$, and equality of their second moments then forces them to be equal almost surely. $\square$
Proof (of Theorem 8.6). Let us take an infinite subset $I \subseteq \mathbb{N}$ such that its complement $\mathbb{N} \setminus I$ is also infinite. By (8.21), we only need to prove the representation (8.20) for $(s_{\{\ell,\ell'\}})_{\ell,\ell' \in I}$. For each $j \in I$, let us consider the array
$$
S_j = \bigl( s_{\{\ell,\ell'\}} \bigr)_{\ell,\ell' \in (\mathbb{N}\setminus I) \cup \{j\}} = \Bigl( s_{\{j,j\}},\ \bigl( s_{\{j,\ell\}} \bigr)_{\ell \in \mathbb{N}\setminus I},\ \bigl( s_{\{\ell,\ell'\}} \bigr)_{\ell,\ell' \in \mathbb{N}\setminus I} \Bigr). \tag{8.23}
$$
It is obvious that the weak exchangeability (8.19) implies that the sequence $(S_j)_{j \in I}$ is exchangeable, since any permutation $\pi$ of finitely many indices from $I$ in the array (8.19) results in the corresponding permutation of the sequence $(S_j)_{j \in I}$. We can view each array $S_j$ as an element of the Borel space $[0,1]^\infty$, and de Finetti's Theorem 8.3 implies that
$$
\bigl( S_j \bigr)_{j \in I} \stackrel{d}{=} \bigl( X(w, u_j) \bigr)_{j \in I} \tag{8.24}
$$
for some measurable function $X$ on $[0,1]^2$ taking values in the space of such arrays.
Next, we will show that, conditionally on the sequence $(S_j)_{j \in I}$, the off-diagonal elements $s_{\{j,j'\}}$ for $j, j' \in I$, $j \ne j'$, are independent and, moreover, the conditional distribution of $s_{\{j,j'\}}$ depends only on $S_j$ and $S_{j'}$. This means that if we consider the $\sigma$-algebras
$$
\mathcal{F} = \sigma\bigl( (S_j)_{j \in I} \bigr) \quad \text{and} \quad \mathcal{F}_{j,j'} = \sigma\bigl( S_j, S_{j'} \bigr), \tag{8.25}
$$
then we would like to show that, for any finite set $\mathcal{C}$ of indices $\{\ell,\ell'\}$ for $\ell, \ell' \in I$ such that $\ell \ne \ell'$, and any bounded measurable functions $f_{\ell,\ell'}$ corresponding to the indices $\{\ell,\ell'\} \in \mathcal{C}$, we have
$$
\mathbb{E}\Bigl( \prod_{\{\ell,\ell'\} \in \mathcal{C}} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F} \Bigr) = \prod_{\{\ell,\ell'\} \in \mathcal{C}} \mathbb{E}\bigl( f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\big|\, \mathcal{F}_{\ell,\ell'} \bigr). \tag{8.26}
$$
Notice that the definitions (8.22) and (8.25) imply that $\mathcal{F}_{j,j'} = \mathcal{F}_{(\mathbb{N}\setminus I) \cup \{j,j'\}}(j,j')$, since all the elements $s_{\{\ell,\ell'\}}$ with both indices in $(\mathbb{N}\setminus I) \cup \{j,j'\}$, except for $s_{\{j,j'\}}$, appear as one of the coordinates in the arrays $S_j$ or $S_{j'}$. Therefore, by Lemma 8.4, for a fixed $\{j,j'\} \in \mathcal{C}$,
$$
\mathbb{E}\bigl( f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{\mathbb{N}}(j,j') \bigr) = \mathbb{E}\bigl( f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'} \bigr). \tag{8.27}
$$
Since all the factors $s_{\{\ell,\ell'\}}$ for $\{\ell,\ell'\} \in \mathcal{C}' = \mathcal{C} \setminus \{\{j,j'\}\}$ are $\mathcal{F}_{\mathbb{N}}(j,j')$-measurable,
$$
\mathbb{E}\Bigl( \prod_{\{\ell,\ell'\} \in \mathcal{C}} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F}_{\mathbb{N}}(j,j') \Bigr) = \prod_{\{\ell,\ell'\} \in \mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \cdot \mathbb{E}\bigl( f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'} \bigr),
$$
where the second equality follows from (8.27). Since $\mathcal{F}_{j,j'} \subseteq \mathcal{F} \subseteq \mathcal{F}_{\mathbb{N}}(j,j')$, taking the conditional expectation given $\mathcal{F}$ on both sides implies that
$$
\mathbb{E}\Bigl( \prod_{\{\ell,\ell'\} \in \mathcal{C}} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F} \Bigr) = \mathbb{E}\Bigl( \prod_{\{\ell,\ell'\} \in \mathcal{C}'} f_{\ell,\ell'}(s_{\{\ell,\ell'\}}) \,\Big|\, \mathcal{F} \Bigr)\, \mathbb{E}\bigl( f_{j,j'}(s_{\{j,j'\}}) \,\big|\, \mathcal{F}_{j,j'} \bigr),
$$
and (8.26) follows by induction on the size of $\mathcal{C}$.
By (8.26), conditionally on $\mathcal{F}$ the off-diagonal elements are independent, and the conditional distribution of $s_{\{j,j'\}}$ depends only on $(S_j, S_{j'})$, so they can be generated as
$$
s_{\{j,j'\}} = h\bigl( S_j, S_{j'}, x_{\{j,j'\}} \bigr) \tag{8.28}
$$
for some measurable function $h$ and i.i.d. uniform random variables $x_{\{j,j'\}}$ on $[0,1]$. The reason why the function $h$ can be chosen to be the same for all $\{j,j'\}$ is that, by symmetry, the distribution of $(S_j, S_{j'}, s_{\{j,j'\}})$ does not depend on $\{j,j'\}$, and hence neither does the conditional distribution of $s_{\{j,j'\}}$ given $S_j$ and $S_{j'}$. Also, the arrays $(S_j, S_{j'}, s_{\{j,j'\}})$ and $(S_{j'}, S_j, s_{\{j,j'\}})$ are equal in distribution and, therefore, the function $h$ is symmetric in the coordinates $S_j$ and $S_{j'}$. Finally, let us recall (8.24) and define the function
$$
f\bigl( w, u_j, u_{j'}, x_{\{j,j'\}} \bigr) = h\bigl( X(w, u_j), X(w, u_{j'}), x_{\{j,j'\}} \bigr),
$$
which is, obviously, symmetric in $u_j$ and $u_{j'}$. Then, the equations (8.24) and (8.28) imply that
$$
\Bigl( \bigl( S_j \bigr)_{j \in I},\ \bigl( s_{\{j,j'\}} \bigr)_{j \ne j' \in I} \Bigr) \stackrel{d}{=} \Bigl( \bigl( X(w, u_j) \bigr)_{j \in I},\ \bigl( f(w, u_j, u_{j'}, x_{\{j,j'\}}) \bigr)_{j \ne j' \in I} \Bigr),
$$
which yields the representation (8.20) and finishes the proof. $\square$
Proof of Theorem 8.5. One proof of Theorem 8.5, similar to the proof of Theorem 8.6, is sketched in an exercise below. We will now give a different proof, and the main part of the proof will be based on the following observation. Suppose that we have an exchangeable sequence of pairs $(t_\ell, s_\ell)_{\ell \ge 1}$ with coordinates in complete separable metric spaces. Given the sequence $(t_\ell)_{\ell \ge 1}$, how can we generate the sequence of second coordinates $(s_\ell)_{\ell \ge 1}$? Consider the empirical measures
$$
\mu = \lim_{n \to \infty} \frac{1}{n} \sum_{\ell=1}^{n} \delta_{(t_\ell, s_\ell)}, \qquad \mu_1 = \lim_{n \to \infty} \frac{1}{n} \sum_{\ell=1}^{n} \delta_{t_\ell}.
$$
Lemma 8.5. Given $(t_\ell)_{\ell \ge 1}$, we can generate the sequence $(s_\ell)_{\ell \ge 1}$ in distribution as
$$
\bigl( s_\ell \bigr)_{\ell \ge 1} \stackrel{d}{=} \bigl( f(\mu_1, t_\ell, v, x_\ell) \bigr)_{\ell \ge 1}
$$
for some measurable function $f$ and i.i.d. uniform random variables $v$ and $(x_\ell)_{\ell \ge 1}$ on $[0,1]$.
Proof. First, let us note how to generate the empirical measure $\mu$ given the sequence $t = (t_\ell)_{\ell \ge 1}$. Given $\mu$, the sequence $(t_\ell, s_\ell)_{\ell \ge 1}$ is i.i.d. from $\mu$ and, since $\mu_1$ is the first marginal of $\mu$, $(t_\ell)_{\ell \ge 1}$ are i.i.d. from $\mu_1$. This means that, if we consider the triple $(\mu, \mu_1, t)$, then the conditional distribution of $t$ given $(\mu, \mu_1)$ depends only on $\mu_1$,
$$
\mathbb{P}\bigl( t \in \cdot \,\big|\, \mu, \mu_1 \bigr) = \mathbb{P}\bigl( t \in \cdot \,\big|\, \mu_1 \bigr).
$$
By Lemma 8.3, $t$ and $\mu$ are independent given $\mu_1$. Therefore, again by Lemma 8.3,
$$
\mathbb{P}\bigl( \mu \in \cdot \,\big|\, t, \mu_1 \bigr) = \mathbb{P}\bigl( \mu \in \cdot \,\big|\, \mu_1 \bigr)
$$
and, since $\mu_1$ is a function of $t$,
$$
\mathbb{P}\bigl( \mu \in \cdot \,\big|\, t \bigr) = \mathbb{P}\bigl( \mu \in \cdot \,\big|\, \mu_1 \bigr).
$$
Proof (of Theorem 8.5). Let us, for convenience, index the array $s_{\ell,\ell'}$ by $\ell \ge 1$ and $\ell' \in \mathbb{Z}$ instead of $\ell' \ge 1$. Let us denote by
$$
X_{\ell'} = \bigl( s_{\ell,\ell'} \bigr)_{\ell \ge 1} \quad \text{and} \quad X = \bigl( X_{\ell'} \bigr)_{\ell' \le 0}
$$
the $\ell'$-th column and the 'left half' of this array. Since the sequence of columns $(X_{\ell'})_{\ell' \in \mathbb{Z}}$ is exchangeable, we showed in the proof of de Finetti's theorem that, conditionally on $X$, the columns $(X_{\ell'})_{\ell' \ge 1}$ in the 'right half' of the array are i.i.d. If we describe the distribution of one column $X_1$ given $X$ then we can generate all columns $(X_{\ell'})_{\ell' \ge 1}$ independently from this distribution. Therefore, our strategy will be to describe the distribution of $X_1$ given $X$, and then combine it with the structure of the distribution of $X$. Both steps will use exchangeability with respect to permutations of rows, because so far we have only used exchangeability with respect to permutations of columns. Let
$$
Y_\ell = \bigl( s_{\ell,\ell'} \bigr)_{\ell' \le 0}
$$
be the elements in the $\ell$-th row of the 'left half' $X$ of the array. We want to describe the distribution of $X_1 = (s_{\ell,1})_{\ell \ge 1}$ given $X = (Y_\ell)_{\ell \ge 1}$, and we will use the fact that the sequence $(Y_\ell, s_{\ell,1})_{\ell \ge 1}$ is exchangeable. By Lemma 8.5, conditionally on $X = (Y_\ell)_{\ell \ge 1}$, $(s_{\ell,1})_{\ell \ge 1}$ can be generated as
$$
\bigl( s_{\ell,1} \bigr)_{\ell \ge 1} \stackrel{d}{=} \bigl( f(\mu_1, Y_\ell, v_1, x_{\ell,1}) \bigr)_{\ell \ge 1},
$$
where $\mu_1$ is the empirical measure of $(Y_\ell)$ and where, instead of $v$ and $(x_\ell)$, we wrote $v_1$ and $(x_{\ell,1})$ to emphasize the first column index $1$. Since, conditionally on $X$, the columns $(X_{\ell'})_{\ell' \ge 1}$ in the 'right half' of the array are i.i.d., we can generate
$$
s_{\ell,\ell'} = f(\mu_1, Y_\ell, v_{\ell'}, x_{\ell,\ell'}),
$$
where $v_{\ell'}$ and $x_{\ell,\ell'}$ are i.i.d. uniform random variables on $[0,1]$. Finally, since $(Y_\ell)_{\ell \ge 1}$ are i.i.d. given the empirical distribution $\mu_1$, we can generate $\mu_1 = h(w)$ as a function of a uniform random variable $w$ on $[0,1]$ and then, using the Cod-
Exercise 8.2.2. Give a similar proof of Theorem 8.5, using only global symmetry considerations. Hints: Since the row and column indices play a different role in this case, the first step will be slightly different (the second step will be essentially the same). One has to consider two sequences indexed by $j \in I$,
$$
S_j^1 = \Bigl( \bigl( s_{\ell,j} \bigr)_{\ell \in \mathbb{N}\setminus I},\ \bigl( s_{\ell,\ell'} \bigr)_{\ell,\ell' \in \mathbb{N}\setminus I} \Bigr) \quad \text{and} \quad S_j^2 = \Bigl( \bigl( s_{j,\ell} \bigr)_{\ell \in \mathbb{N}\setminus I},\ \bigl( s_{\ell,\ell'} \bigr)_{\ell,\ell' \in \mathbb{N}\setminus I} \Bigr),
$$
which are separately exchangeable,
$$
\Bigl( \bigl( S^1_{\pi(j)} \bigr)_{j \in I},\ \bigl( S^2_{\rho(j)} \bigr)_{j \in I} \Bigr) \stackrel{d}{=} \Bigl( \bigl( S^1_j \bigr)_{j \in I},\ \bigl( S^2_j \bigr)_{j \in I} \Bigr),
$$
for any permutations $\pi$ and $\rho$ of finitely many indices. Notice that these sequences are not independent. In this case one needs to prove (as a part of the exercise) the following modification of de Finetti's representation.

Theorem 8.7. If the sequences $(s^1_\ell)_{\ell \ge 1}$ and $(s^2_\ell)_{\ell \ge 1}$ are separately exchangeable then there exist measurable functions $g_1, g_2 : [0,1]^2 \to \mathbb{R}$ such that
$$
\Bigl( \bigl( s^1_\ell \bigr)_{\ell \ge 1},\ \bigl( s^2_\ell \bigr)_{\ell \ge 1} \Bigr) \stackrel{d}{=} \Bigl( \bigl( g_1(w, u_\ell) \bigr)_{\ell \ge 1},\ \bigl( g_2(w, v_\ell) \bigr)_{\ell \ge 1} \Bigr), \tag{8.29}
$$
where $w$, $(u_\ell)$ and $(v_\ell)$ are i.i.d. random variables uniform on $[0,1]$.
Consider two arrays
$$
\bigl( s^1_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1} \quad \text{and} \quad \bigl( s^2_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1}
$$
that are separately exchangeable in the first coordinate and jointly exchangeable in the second coordinate, that is,
$$
\Bigl( \bigl( s^1_{\pi_1(\ell),\rho(\ell')} \bigr)_{\ell,\ell' \ge 1},\ \bigl( s^2_{\pi_2(\ell),\rho(\ell')} \bigr)_{\ell,\ell' \ge 1} \Bigr) \stackrel{d}{=} \Bigl( \bigl( s^1_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1},\ \bigl( s^2_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1} \Bigr)
$$
for any permutations $\pi_1, \pi_2, \rho$ of finitely many coordinates. Show that there exist two functions $\sigma_1, \sigma_2$ such that these arrays can be generated in distribution by
$$
s^1_{\ell,\ell'} = \sigma_1\bigl( w, u^1_\ell, v_{\ell'}, x^1_{\ell,\ell'} \bigr) \quad \text{and} \quad s^2_{\ell,\ell'} = \sigma_2\bigl( w, u^2_\ell, v_{\ell'}, x^2_{\ell,\ell'} \bigr),
$$
where all the arguments are i.i.d. uniform random variables on $[0,1]$.
8.3 The Dovbysh-Sudakov representation for Gram-de Finetti arrays
In addition, suppose that $R$ is positive definite with probability one, where by positive definite we will always mean non-negative definite. Such weakly exchangeable positive definite arrays are called Gram-de Finetti arrays. It turns out that all such arrays are generated essentially as the covariance matrix of an i.i.d. sample from a random measure on a Hilbert space. Let $H$ be the Hilbert space $L^2([0,1], dv)$, where $dv$ denotes the Lebesgue measure on $[0,1]$.

Theorem 8.8 (Dovbysh-Sudakov). There exists a random probability measure $\eta$ on $H \times \mathbb{R}_+$ such that the array $R = (R_{\ell,\ell'})_{\ell,\ell' \ge 1}$ is equal in distribution to
$$
\bigl( h_\ell \cdot h_{\ell'} + a_\ell\, \delta_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1},
$$
where $(h_\ell, a_\ell)_{\ell \ge 1}$ is an i.i.d. sequence from the measure $\eta$.

Proof. Conditionally on $R$, let $(g_{\ell,i})_{\ell \ge 1}$, for $i \ge 1$, be i.i.d. Gaussian vectors with the covariance $\mathbb{E}( g_{\ell,i}\, g_{\ell',i} \,|\, R ) = R_{\ell,\ell'}$. One can check that the resulting array is exchangeable and, by the Aldous-Hoover representation,
$$
\bigl( g_{\ell,i} \bigr) \stackrel{d}{=} \bigl( \sigma(w, u_\ell, v_i, x_{\ell,i}) \bigr), \tag{8.32}
$$
where $w$, $(u_\ell)$, $(v_i)$, $(x_{\ell,i})$ are i.i.d. random variables with the uniform distribution on $[0,1]$. By the strong law of large numbers (applied conditionally on $R$), for any $\ell, \ell' \ge 1$,
$$
\frac{1}{n} \sum_{i=1}^{n} g_{\ell,i}\, g_{\ell',i} \to R_{\ell,\ell'}
$$
almost surely as $n \to \infty$. Similarly, by the strong law of large numbers (now applied conditionally on $w$ and $(u_\ell)_{\ell \ge 1}$),
$$
\frac{1}{n} \sum_{i=1}^{n} \sigma(w, u_\ell, v_i, x_{\ell,i})\, \sigma(w, u_{\ell'}, v_i, x_{\ell',i}) \to \mathbb{E}'\, \sigma(w, u_\ell, v_1, x_{\ell,1})\, \sigma(w, u_{\ell'}, v_1, x_{\ell',1})
$$
almost surely, where $\mathbb{E}'$ denotes the expectation with respect to the random variables $v_1$ and $(x_{\ell,1})_{\ell \ge 1}$ only. Therefore, (8.32) implies that
$$
\bigl( R_{\ell,\ell'} \bigr)_{\ell,\ell' \ge 1} \stackrel{d}{=} \Bigl( \mathbb{E}'\, \sigma(w, u_\ell, v_1, x_{\ell,1})\, \sigma(w, u_{\ell'}, v_1, x_{\ell',1}) \Bigr)_{\ell,\ell' \ge 1}. \tag{8.33}
$$
If we denote
$$
\sigma^{(1)}(w, u, v) = \int \sigma(w, u, v, x)\, dx, \qquad \sigma^{(2)}(w, u, v) = \int \sigma(w, u, v, x)^2\, dx,
$$
then the off-diagonal and diagonal elements on the right hand side of (8.33) are given by
$$
\int \sigma^{(1)}(w, u_\ell, v)\, \sigma^{(1)}(w, u_{\ell'}, v)\, dv \quad \text{and} \quad \int \sigma^{(2)}(w, u_\ell, v)\, dv
$$
correspondingly. Notice that, for almost all $w$ and $u$, the function $v \to \sigma^{(1)}(w, u, v)$ is in $H = L^2([0,1], dv)$, since, by Jensen's inequality applied to the inner integral in $x$,
$$
\int \sigma^{(1)}(w, u, v)^2\, dv \le \int \sigma^{(2)}(w, u, v)\, dv
$$
and, by (8.33), the right hand side is equal in distribution to $R_{1,1}$. Therefore, if we denote
$$
h_\ell = \sigma^{(1)}(w, u_\ell, \cdot\,), \qquad a_\ell = \int \sigma^{(2)}(w, u_\ell, v)\, dv - h_\ell \cdot h_\ell,
$$
then the right hand side of (8.33) can be written as $h_\ell \cdot h_{\ell'} + a_\ell\, \delta_{\ell,\ell'}$. It remains to observe that $(h_\ell, a_\ell)_{\ell \ge 1}$ is an i.i.d. sequence from the random measure $\eta$ on $H \times \mathbb{R}_+$ given by the image of the Lebesgue measure $du$ on $[0,1]$ under the map
$$
u \to \Bigl( \sigma^{(1)}(w, u, \cdot\,),\ \int \sigma^{(2)}(w, u, v)\, dv - \int \sigma^{(1)}(w, u, v)\, \sigma^{(1)}(w, u, v)\, dv \Bigr).
$$
Lemma 8.6. There exists a measurable function $\eta' = \eta'(R)$ of the array $(R_{\ell,\ell'})_{\ell,\ell' \ge 1}$, with values in $\mathcal{P}(H \times \mathbb{R}_+)$, such that $\eta' = \eta \circ (U, \mathrm{id})^{-1}$ almost surely for some orthogonal operator $U$ on $H$ that depends on the sequence $(h_\ell)_{\ell \ge 1}$.
Proof. Let us begin by showing that the norms $\|h_\ell\|$ can be reconstructed almost surely from the array $R$. Consider a sequence $(g_\ell)$ on $H$ such that $g_\ell \cdot g_{\ell'} = R_{\ell,\ell'}$ for all $\ell, \ell' \ge 1$. In other words, $\|g_\ell\|^2 = \|h_\ell\|^2 + a_\ell$ and $g_\ell \cdot g_{\ell'} = h_\ell \cdot h_{\ell'}$ for all $\ell \ne \ell'$. Without loss of generality, let us assume that $g_\ell = h_\ell + \sqrt{a_\ell}\, e_\ell$, where $(e_\ell)_{\ell \ge 1}$ is an orthonormal sequence orthogonal to the closed span of $(h_\ell)$. If necessary, we identify $H$ with $H \oplus H$ to choose such a sequence $(e_\ell)$. Since $(h_\ell)$ is an i.i.d. sequence from the marginal $G$ of the measure $\eta$ on $H$, with probability one, there are elements in the sequence $(h_\ell)_{\ell \ge 2}$ arbitrarily close to $h_1$ and, therefore, the length of the orthogonal projection of $h_1$ onto the closed span of $(h_\ell)_{\ell \ge 2}$ is equal to $\|h_1\|$. As a result, the length of the orthogonal projection of $g_1$ onto the closed span of $(g_\ell)_{\ell \ge 2}$ is also equal to $\|h_1\|$, and it is obvious that this length is a measurable function of the array $g_\ell \cdot g_{\ell'} = R_{\ell,\ell'}$. Similarly, we can reconstruct all the norms $\|h_\ell\|$ as measurable functions of the array $R$ and, thus, all $a_\ell = R_{\ell,\ell} - \|h_\ell\|^2$. Therefore,
$$
\bigl( h_\ell \cdot h_{\ell'} \bigr)_{\ell,\ell' \ge 1} \quad \text{and} \quad \bigl( a_\ell \bigr)_{\ell \ge 1} \tag{8.35}
$$
are both measurable functions of the array $R$. Given the matrix $(h_\ell \cdot h_{\ell'})$, we can find a sequence $(x_\ell)$ in $H$ isometric to $(h_\ell)$, for example, by choosing $x_\ell$ to be in the span of the first $\ell$ elements of some fixed orthonormal basis. This means that all $x_\ell$ are measurable functions of $R$ and that there exists an orthogonal operator $U = U((h_\ell)_{\ell \ge 1})$ on $H$ such that $x_\ell = U h_\ell$ for all $\ell \ge 1$.
Since $(h_\ell, a_\ell)_{\ell \ge 1}$ is an i.i.d. sequence from the distribution $\eta$, by the strong law of large numbers for empirical measures (Varadarajan's theorem in Section 1.7),
$$
\frac{1}{n} \sum_{1 \le \ell \le n} \delta_{(h_\ell, a_\ell)} \to \eta \quad \text{weakly}
$$
almost surely and, applying the map $(U, \mathrm{id})$,
$$
\frac{1}{n} \sum_{1 \le \ell \le n} \delta_{(x_\ell, a_\ell)} \to \eta \circ (U, \mathrm{id})^{-1} \quad \text{weakly}.
$$
The left hand side is, obviously, a measurable function of the array $R$ in the space of all probability measures on $H \times \mathbb{R}_+$ equipped with the topology of weak convergence and, therefore, so is its limit $\eta' = \eta \circ (U, \mathrm{id})^{-1}$. This finishes the proof. $\square$
$$
R^N_{\ell,\ell'} = \frac{1}{N} \sum_{i=1}^{N} s^{\ell}_i\, s^{\ell'}_i,
$$
which is weakly exchangeable not under all permutations of the integers $\mathbb{Z}$, but only under those permutations that map positive integers into positive and non-positive into non-positive. Prove that there exists a pair of random measures $G_1$ and $G_2$ on a separable Hilbert space $H$ (not necessarily independent) such that
$$
\bigl( R_{\ell,\ell'} \bigr)_{\ell \ne \ell' \in \mathbb{Z}} \stackrel{d}{=} \bigl( h_\ell \cdot h_{\ell'} \bigr)_{\ell \ne \ell' \in \mathbb{Z}},
$$
where $(h_\ell)_{\ell \le 0}$ is an i.i.d. sample from $G_1$ and $(h_\ell)_{\ell \ge 1}$ is an i.i.d. sample from $G_2$.
Chapter 9
Finite State Markov Chains
9.1 Definitions and basic properties

In this chapter, we will study finite state Markov chains, namely, sequences $(X_i)_{i \ge 0}$ of random variables taking values in a finite set
$$
S = \{ s_1, \ldots, s_m \} \tag{9.1}
$$
and satisfying, for all $k \ge 1$,
$$
\mathbb{P}\bigl( X_{k+1} = x_{k+1} \,\big|\, X_1 = x_1, \ldots, X_k = x_k \bigr) = \mathbb{P}\bigl( X_{k+1} = x_{k+1} \,\big|\, X_k = x_k \bigr). \tag{9.2}
$$
This means that the conditional distribution of the next outcome $X_{k+1}$ given the preceding outcomes $X_1 = x_1, \ldots, X_k = x_k$ depends only on the most recent outcome $X_k = x_k$. This property is called the Markov property of the sequence, also called the memoryless property, in the sense that we do not need to remember the entire past and only need to know the most recent outcome to know the chances of the next outcome. The conditional distribution in (9.2) is also called a transition probability. A Markov chain is called homogeneous if the transition probabilities $\mathbb{P}(X_{k+1} = a \,|\, X_k = b)$ do not depend on the index $k$. For any $n \ge 1$, the joint distribution of $X_0, X_1, \ldots, X_n$ can be constructed sequentially,
$$
\mathbb{P}\bigl( X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n \bigr) = \mu(x_0) \prod_{k=0}^{n-1} \mathbb{P}\bigl( X_{k+1} = x_{k+1} \,\big|\, X_k = x_k \bigr), \tag{9.3}
$$
where $\mu(x) = \mathbb{P}(X_0 = x)$ is called the initial distribution of the Markov chain. For a homogeneous chain, the transition probabilities $p_{ij} = \mathbb{P}(X_{k+1} = s_j \,|\, X_k = s_i)$ form the transition matrix $P = (p_{ij})_{i,j \le m}$. With this notation, the definition of the (homogeneous finite state) Markov chain can be rewritten as
$$
\mathbb{P}\bigl( X_0 = x_0, X_1 = x_1, \ldots, X_n = x_n \bigr) = \mu(x_0)\, p_{x_0 x_1} \cdots p_{x_{n-1} x_n}.
$$
The initial distribution $\mu$ will vary depending on how we want to start the chain, so we can also express this definition entirely in terms of the transition matrix,
$$
\mathbb{P}\bigl( X_1 = x_1, \ldots, X_n = x_n \,\big|\, X_0 = x_0 \bigr) = p_{x_0 x_1} \cdots p_{x_{n-1} x_n}.
$$
If we start the chain in the state $x_0$, then the probability that the chain will visit the states $x_1, \ldots, x_n$ in the next $n$ steps is the product of the transition probabilities along the path. If we multiply by $\mu(x_0)$, we recover the previous equation.
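The product formula is easy to sketch in a few lines of Python; the two-state transition matrix here is a made-up example, not one from the text.

```python
def path_probability(P, mu, path):
    # P(X_0 = path[0], ..., X_n = path[n]) = mu(x_0) times the product of
    # transition probabilities along the path, as in (9.3)
    prob = mu[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= P[a][b]
    return prob

P = [[0.9, 0.1],
     [0.5, 0.5]]          # hypothetical two-state transition matrix
mu = [1.0, 0.0]           # start in state s1 with probability one
p = path_probability(P, mu, [0, 0, 1, 1])   # s1 -> s1 -> s2 -> s2
```

Dropping the factor `mu[path[0]]` gives the conditional version, the probability of the path given the starting state.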
It is also convenient to visualize transition probabilities by a transition graph. For example, the transition matrix
$$
P = \begin{bmatrix}
0.25 & 0.25 & 0 & 0 & 0.5 & 0 & 0 \\
0.4 & 0 & 0 & 0 & 0.6 & 0 & 0 \\
0 & 0.3 & 0 & 0.7 & 0 & 0 & 0 \\
0 & 0 & 0.25 & 0 & 0.75 & 0 & 0 \\
0 & 0.9 & 0 & 0 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0.5 \\
0 & 0 & 0 & 0 & 0 & 0.5 & 0.5
\end{bmatrix} \tag{9.9}
$$
can be depicted as in Figure 9.1, with dots representing states and arrows representing non-zero transition probabilities. Notice that, for example, $p_{12} \ne p_{21}$.
This graphical representation of transition probabilities suggests that we can think of Markov chains as random walks on the set of states, where at each step we pick one of the neighbours of the current state $s_i$ with probabilities $p_{ij}$ and move there. If $p_{ii} \ne 0$, we can also end up staying at $s_i$.

A state $s_i$ is called inessential if we can reach from it, with positive probability (not necessarily in one step), another state $s_j$ from which we cannot come back to $s_i$. For example, in Figure 9.1, the states $s_3$ and $s_4$ are inessential. Such states are also called transient, although this terminology is better suited for infinite state Markov chains. For finite state chains, once we reach such $s_j$, we can never come back to $s_i$, so for the long-term behaviour of the Markov chain these states do not matter.
Two states $s_i$ and $s_j$ are called communicating, denoted $s_i \leftrightarrow s_j$, if we can go from each of them to the other with positive probability. For each state $s_i$, we can collect all the states communicating with it into the cluster
Fig. 9.1 Transition graph of a Markov chain with the transition matrix (9.9). States $s_3, s_4$ are inessential, and the essential states are divided into two clusters, $\{s_1, s_2, s_5\}$ and $\{s_6, s_7\}$.
$$
S_i = \{ s \in S : s_i \leftrightarrow s \}.
$$
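The classification of states is easy to automate. The sketch below computes reachability for the matrix (9.9) by transitive closure and recovers $s_3, s_4$ as the inessential states; states are 0-indexed in the code, and only which entries of $P$ are non-zero matters here, not their exact values.

```python
def reachable(P):
    # r[i][j] is True if s_j can be reached from s_i in one or more steps
    m = len(P)
    r = [[P[i][j] > 0 for j in range(m)] for i in range(m)]
    for k in range(m):               # transitive closure (Floyd-Warshall style)
        for i in range(m):
            for j in range(m):
                r[i][j] = r[i][j] or (r[i][k] and r[k][j])
    return r

P = [[0.25, 0.25, 0,    0,    0.5,  0,   0],
     [0.4,  0,    0,    0,    0.6,  0,   0],
     [0,    0.3,  0,    0.7,  0,    0,   0],
     [0,    0,    0.25, 0,    0.75, 0,   0],
     [0,    0.9,  0,    0,    0.1,  0,   0],
     [0,    0,    0,    0,    0,    0.5, 0.5],
     [0,    0,    0,    0,    0,    0.5, 0.5]]

r = reachable(P)
# s_i is inessential if some reachable s_j cannot lead back to s_i
inessential = [i for i in range(len(P))
               if any(r[i][j] and not r[j][i] for j in range(len(P)))]
```

With this matrix, `inessential` comes out as `[2, 3]`, i.e. $s_3$ and $s_4$, matching Figure 9.1.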
For $n \ge 1$, let us denote the $n$-step transition probabilities by
$$
p_{ij}(n) = \mathbb{P}\bigl( X_n = s_j \,\big|\, X_0 = s_i \bigr). \tag{9.10}
$$

Lemma 9.1. For all $n \ge 1$, we have
$$
p_{ij}(n) = \bigl( P^n \bigr)_{ij} \tag{9.11}
$$
and, in particular,
$$
\mathbb{P}(X_n = s_j) = \sum_{i=1}^{m} \mu_i\, p_{ij}(n). \tag{9.12}
$$
Proof. For $n = 1$, (9.11) holds by the definition of $P$. For $n = 2$,
$$
\mathbb{P}(X_2 = s_j \,|\, X_0 = s_i) = \sum_{k=1}^{m} \mathbb{P}(X_2 = s_j \,|\, X_1 = s_k)\, \mathbb{P}(X_1 = s_k \,|\, X_0 = s_i) = \sum_{k=1}^{m} p_{ik}\, p_{kj},
$$
which is the $(i,j)$th entry of $P^2$. The general case follows by induction, using a similar calculation. Assuming that (9.11) holds,
$$
\begin{aligned}
\mathbb{P}(X_{n+1} = s_j \,|\, X_0 = s_i) &= \sum_{k=1}^{m} \mathbb{P}(X_n = s_k, X_{n+1} = s_j \,|\, X_0 = s_i) \\
&= \sum_{k=1}^{m} \mathbb{P}(X_{n+1} = s_j \,|\, X_n = s_k)\, \mathbb{P}(X_n = s_k \,|\, X_0 = s_i) \\
&= \sum_{k=1}^{m} p_{ik}(n)\, p_{kj} = p_{ij}(n+1).
\end{aligned}
$$
If we multiply this by $\mu_i = \mathbb{P}(X_0 = s_i)$ and sum over $i \le m$, we get (9.12) for $X_{n+1}$, so the proof of the induction step is complete. $\square$
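The matrix-power formula is easy to check numerically with plain Python lists; the two-state chain below is a made-up example.

```python
def mat_mul(A, B):
    # product of two matrices given as lists of rows
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(P, n):
    # n-th power of a square matrix, starting from the identity
    R = [[float(i == j) for j in range(len(P))] for i in range(len(P))]
    for _ in range(n):
        R = mat_mul(R, P)
    return R

P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = [[1.0, 0.0]]                        # X_0 = s_1
dist3 = mat_mul(mu, mat_pow(P, 3))[0]    # distribution of X_3, as in (9.12)
```

Carrying out the three multiplications by hand gives the distribution $(0.844, 0.156)$ for $X_3$, which the code reproduces.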
The period $d_i$ of a state $s_i$ is defined by
$$
d_i = \gcd\{ n \ge 1 : p_{ii}(n) > 0 \},
$$
the greatest common divisor of all the times $n$ at which a walk starting at $s_i$ can return to $s_i$. If $d_i = 1$ then the state is called aperiodic. We call a Markov chain aperiodic if all the periods $d_i = 1$. It turns out that for irreducible chains this is the same as requiring only one period $d_i$ to be equal to $1$.
Lemma 9.2. If a Markov chain is irreducible then all the periods $d_i$ are equal.

Proof. Let us suppose that $s_j$ can be reached from $s_i$ in $N$ steps and $s_i$ can be reached from $s_j$ in $M$ steps with positive probabilities, $p_{ij}(N) > 0$ and $p_{ji}(M) > 0$. If $p_{jj}(n) > 0$ then
$$
p_{ii}(N + n + M) \ge p_{ij}(N)\, p_{jj}(n)\, p_{ji}(M) > 0,
$$
because we can reach $s_j$ in $N$ steps, then come back to $s_j$ in $n$ steps, and then reach $s_i$ in $M$ steps, so the probability of returning to $s_i$ in $N + M + n$ steps is positive. This is also true for $n = 0$, because we can go to $s_j$ and come back to $s_i$ in $N + M$ steps. Since $d_i$ is the period of $s_i$, by definition, $d_i$ divides all the numbers $N + M + n$ as above. In particular, it divides $N + M$, which implies that it divides all $n$ such that $p_{jj}(n) > 0$. This means that $d_i$ divides $d_j$, because $d_j$ is the greatest common divisor of all such $n$. Similarly, $d_j$ divides $d_i$, so they must be equal. $\square$
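Periods can be computed numerically by taking the gcd of the return times seen among the first few matrix powers, a finite truncation of the definition above. The two small chains below are made-up examples.

```python
from math import gcd

def mat_mul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def period(P, i, max_n=50):
    # gcd of all n <= max_n with p_ii(n) > 0; for a small finite chain a
    # modest max_n already determines the true period d_i
    d = 0
    R = [row[:] for row in P]       # R holds P^n, starting with n = 1
    for n in range(1, max_n + 1):
        if R[i][i] > 0:
            d = gcd(d, n)
        R = mat_mul(R, P)
    return d

flip = [[0, 1], [1, 0]]             # deterministic two-state flip: period 2
lazy = [[0.5, 0.5], [1, 0]]         # self-loop at s_1: aperiodic
```

Consistently with Lemma 9.2, `period(lazy, 0)` and `period(lazy, 1)` agree even though $s_2$ has no self-loop.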
The following property of irreducible aperiodic Markov chains will be very important in Section 9.3.

Lemma 9.3. If a Markov chain is irreducible and aperiodic then there exists $N \ge 1$ such that, for all $n \ge N$,
$$
p_{ij}(n) > 0 \quad \text{for all } i, j. \tag{9.15}
$$
In other words, for large $n$, all the entries of $P^n$ are strictly positive.
Proof. Let $T(s_1) = \{ n \ge 1 : p_{11}(n) > 0 \}$ be the set of all times at which the chain starting at $s_1$ can come back to $s_1$. Let $d \ge 1$ be the smallest positive integer that can be written as
$$
d = a_1 n_1 + \ldots + a_k n_k
$$
for $k \ge 1$, $n_i \in T(s_1)$ and $a_i \in \mathbb{Z}$. Then $d$ must divide all $n \in T(s_1)$, because if it does not divide some $n \in T(s_1)$ then
$$
n = ad + r
$$
for some integers $a$ and $0 < r < d$, and
$$
r = n - ad = n - (a a_1) n_1 - \ldots - (a a_k) n_k
$$
is also a linear combination with integer coefficients of the times in $T(s_1)$, which contradicts that $d$ was the smallest such number. This proves that $d$ divides the period of $s_1$ and, since the chain is aperiodic, $d = 1$. We showed that
$$
1 = a_1 n_1 + \ldots + a_k n_k \tag{9.16}
$$
for some $n_i \in T(s_1)$ and $a_i \in \mathbb{Z}$. Let us take the largest $|a_i|$ and, for certainty, suppose that $|a_1|$ is the largest. If $N_1 = |a_1|\, n_1 (n_1 + \ldots + n_k)$ then, for any $n \ge N_1$, we will show that
$$
n = c_1 n_1 + \ldots + c_k n_k
$$
with non-negative integer coefficients $c_i$. This will prove that $p_{11}(n) > 0$ for such $n$, because the chain starting from the state $s_1$ can come back to $s_1$ in $n_i$ steps, so we just need to repeat these loops $c_i$ times for all $i \le k$ to see that the chain can come back in $n$ steps with positive probability.
Let us first consider any $n$ between $N_1$ and $N_1 + n_1$, which means that $n = N_1 + \ell$ for some $\ell \le n_1$. By (9.16), we can write
$$
n = N_1 + \ell(a_1 n_1 + \ldots + a_k n_k) = c_1 n_1 + \ldots + c_k n_k, \quad \text{where } c_i = |a_1| n_1 + \ell a_i,
$$
and
$$
c_i = |a_1| n_1 + \ell a_i \ge |a_1| \ell + \ell a_i = (|a_1| + a_i)\ell \ge 0,
$$
as we wished, because we assumed that $|a_1|$ is greater than or equal to $|a_i|$. Since increasing $c_1$ by one increases $n$ by $n_1$, we can repeat the same argument for $n$ between $N_1 + n_1$ and $N_1 + 2n_1$, and so on.
Similarly, for each state $s_i$, we can find $N_i$ such that $p_{ii}(n) > 0$ for $n \ge N_i$. If we take $N' = \max(N_1, \ldots, N_m)$ then $p_{ii}(n) > 0$ for all $i$ and all $n \ge N'$. Since the chain is irreducible, we can always go from one state $s_i$ to another state $s_j$ in at most $m$ steps with positive probability (see the exercise below), which implies that $p_{ij}(n) > 0$ for all $n \ge N' + m$. $\square$
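For the essential cluster $\{s_1, s_2, s_5\}$ of the matrix (9.9), which is irreducible and (since $p_{11} > 0$) aperiodic, the sketch below finds the smallest $n$ with all entries of $P^n$ positive.

```python
def mat_mul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# restriction of (9.9) to the cluster {s1, s2, s5}
P = [[0.25, 0.25, 0.5],
     [0.4,  0.0,  0.6],
     [0.0,  0.9,  0.1]]

R = [row[:] for row in P]
n = 1
while not all(entry > 0 for row in R for entry in row):
    R = mat_mul(R, P)
    n += 1
# n is now the smallest power for which all entries of P^n are positive
```

Here already $P^2$ has all entries strictly positive, so the loop stops at $n = 2$; Lemma 9.3 only guarantees that some such $n$ exists.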
Exercise 9.1.1. Let $U_1, U_2, \ldots$ be i.i.d. random variables with the uniform distribution on $\{1, \ldots, m\}$ and let
$$
X_n = \max_{i \le n} U_i.
$$
Show that X1 , X2 , . . . is a Markov chain and find its transition probabilities. What
are the essential states of this chain?
Exercise 9.1.2. Suppose that $m$ white balls and $m$ black balls are mixed together
and divided evenly between two baskets. At each step two balls are chosen at ran-
dom, one from each basket, and switched. We say that the system is in the state si
if there are i white balls in the first basket. Find the transition probabilities of this
chain.
Exercise 9.1.3. If a Markov chain on $m$ states is irreducible, show that, for any $i, j \le m$, we have $p_{ij}(k) > 0$ for some $k \le m$.
Exercise 9.1.4. If a Markov chain is irreducible and p11 > 0, what is the period d2
of the state s2 ?
Exercise 9.1.5. If a Markov chain $(X_n)_{n \ge 0}$ is irreducible and has period $d$, show that $Y_n = X_{dn}$ for $n \ge 0$ is a Markov chain and that it is aperiodic. Does it have to be irreducible?
Exercise 9.1.6. Let $N \sim \operatorname{Poiss}(\lambda)$ be a Poisson random variable. Let us run $N$ independent Markov chains starting from the state $s_1$, and let $N_n(i)$ be the number of these chains in the state $s_i$ at time $n$. Show that $N_n(i) \sim \operatorname{Poiss}(\lambda\, p_{1i}(n))$.
9.2 Stationary distributions
In Lemma 9.1 we showed that if the vector $\mu = (\mu_1, \ldots, \mu_m)$ describes the distribution of $X_0$ at time zero, then $\mu P^n$ is the distribution of $X_n$ at time $n$. The distribution $\mu$ on the set of states is called stationary if
$$
\mu = \mu P, \tag{9.17}
$$
i.e. the distribution of $X_1$ is the same as that of $X_0$. Of course, this implies that $\mu = \mu P^n$, so the distribution of every $X_n$ is the same. We will show that a stationary distribution always exists and that, for an irreducible Markov chain, it is unique. For irreducible chains, we will also derive a representation of the stationary distribution in terms of expected return times. To find a stationary distribution in practice, one can just solve the system of linear equations $\mu = \mu P$.
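For an irreducible aperiodic chain one can also find $\mu$ numerically by simply iterating $\mu \mapsto \mu P$, which anticipates the convergence theorem of Section 9.3. The two-state matrix below is a made-up example whose stationary distribution $(5/6, 1/6)$ is easy to verify by solving $\mu = \mu P$ by hand.

```python
def stationary(P, tol=1e-13):
    # iterate mu -> mu P until the change is below tol;
    # this converges for irreducible aperiodic chains
    m = len(P)
    mu = [1.0 / m] * m
    while True:
        nxt = [sum(mu[i] * P[i][j] for i in range(m)) for j in range(m)]
        if max(abs(nxt[j] - mu[j]) for j in range(m)) < tol:
            return nxt
        mu = nxt

P = [[0.9, 0.1],
     [0.5, 0.5]]
mu = stationary(P)   # should be close to (5/6, 1/6)
```

For periodic chains this iteration need not converge (see the two-state flip example in Section 9.3), in which case one must solve the linear system directly.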
Lemma 9.4 (Existence). For any finite state Markov chain, there exists at least
one stationary distribution.
Notice that if the stationary distribution is unique, all the rows of A in the above
proof must be equal.
Next, we will show the following.
Lemma 9.5 (Uniqueness). If a Markov chain is irreducible then its stationary distribution is unique.
Proof. The previous lemma shows that $P - I$ is not invertible, since there exists a non-zero solution of $\mu(P - I) = 0$. Let us first show that, when $P$ is the transition matrix of an irreducible Markov chain, the rank of $P - I$ is $m - 1$. Consider the linear system $(P - I)v^T = 0$ for $v = (v_1, \ldots, v_m)$. Let us show that irreducibility implies that
$$
v_1 = v_2 = \ldots = v_m.
$$
Consider the largest coordinate of $v$, let us say, $v_1$. The first equation in the system reads
$$
\sum_{j=1}^{m} p_{1j} v_j = v_1.
$$
Since $v_1$ is the largest coordinate and the weights $p_{1j}$ add up to one, this forces $v_j = v_1$ for every $j$ with $p_{1j} > 0$; repeating this argument along paths and using irreducibility, all coordinates must be equal to $v_1$. Therefore, the kernel of the system is spanned by the vector $(1, \ldots, 1)$, which means that the rank of $P - I$ equals $m - 1$. On the other hand, if we had more than one stationary distribution, the system $\mu = \mu P$ would have at least two linearly independent solutions, which would mean that the rank of $P - I$ would be not greater than $m - 2$. This proves that the stationary distribution is unique. $\square$
Next, we will derive a more descriptive formula for the stationary distribution of irreducible Markov chains. First, let us give a heuristic, non-rigorous explanation, and then check carefully that our intuitive guess is correct. In the proof of Lemma 9.4 we considered the sequence of matrices
$$
A_n = \frac{1}{n} \bigl( I + P + \ldots + P^n \bigr)
$$
and showed that any limit $A$ over some subsequence consists of rows which are stationary distributions. For irreducible chains, we just showed that the stationary distribution $\mu$ is unique, so we must have
$$
A = \begin{bmatrix} \mu \\ \vdots \\ \mu \end{bmatrix} = \begin{bmatrix} \mu_1 & \cdots & \mu_m \\ \vdots & \ddots & \vdots \\ \mu_1 & \cdots & \mu_m \end{bmatrix}.
$$
In particular, this means that the limits over all convergent subsequences are equal to $A$ and, therefore, the limit exists over the entire sequence,
$$
\lim_{n \to \infty} \frac{1}{n} \bigl( I + P + \ldots + P^n \bigr) = \begin{bmatrix} \mu_1 & \cdots & \mu_m \\ \vdots & \ddots & \vdots \\ \mu_1 & \cdots & \mu_m \end{bmatrix}.
$$
Multiplying this limit on the left by an arbitrary initial distribution of $X_0$ and reading off the $j$th coordinate, we get
$$
\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathbb{P}(X_k = s_j) = \mu_j. \tag{9.20}
$$
In other words, this equation holds no matter how we start the chain, i.e. no matter what the distribution of $X_0$ is. Since
$$
\mathbb{P}(X_k = s_j) = \mathbb{E}\, I(X_k = s_j),
$$
this can be rewritten as
$$
\lim_{n \to \infty} \mathbb{E}\, \frac{1}{n} \sum_{k=1}^{n} I(X_k = s_j) = \mu_j. \tag{9.21}
$$
Clearly, this equation can be interpreted by saying that $\mu_j$ is the expected proportion of time the chain visits the state $s_j$, if we look at these proportions over long periods of time.
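The time-average interpretation (9.21) is easy to illustrate by simulation. The chain below is the same hypothetical two-state example used earlier, whose stationary distribution is $(5/6, 1/6)$.

```python
import random

def visit_proportions(P, steps, seed=0):
    # empirical fraction of time spent in each state, starting from state 0
    rng = random.Random(seed)
    m = len(P)
    counts = [0] * m
    state = 0
    for _ in range(steps):
        state = rng.choices(range(m), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]

P = [[0.9, 0.1],
     [0.5, 0.5]]
freq = visit_proportions(P, 200_000)   # should approach (5/6, 1/6)
```

Over 200,000 steps the empirical proportions typically agree with $(5/6, 1/6)$ to two decimal places, regardless of the starting state.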
There is another interpretation of the last equation. Since it does not matter how we start the chain, let us start it in the state $s_1$. Then, let us divide the time between $1$ and $n$ into intervals between consecutive visits to the state $s_1$. Let us say that $M$ visits to $s_1$ occurred between $1$ and $n$, with intervals $T_1, T_2, \ldots, T_M$, so that
$$
T_1 + \ldots + T_M \approx n
$$
in the sense that their ratio is close to $1$. Let $N_1(j), \ldots, N_M(j)$ be the numbers of times we visited $s_j$ in between these consecutive returns to $s_1$, so that
$$
N_1(j) + \ldots + N_M(j) \approx \sum_{k=1}^{n} I(X_k = s_j).
$$
Then
$$
\frac{1}{n} \sum_{k=1}^{n} I(X_k = s_j) \approx \frac{N_1(j) + \ldots + N_M(j)}{T_1 + \ldots + T_M}.
$$
If we know that the Markov chain is visiting $s_1$, the future does not depend on the past because of the memoryless property. This suggests that what happens in between consecutive visits is independent of each other and that the pairs
$$
\bigl( N_i(j), T_i \bigr), \quad i = 1, \ldots, M,
$$
are actually independent and identically distributed. To make this statement rigorous, one needs to prove the so-called strong Markov property, which we are not going to go into here. However, if we use it as a guiding intuition and divide both the numerator and the denominator above by $M$, the law of large numbers tells us that
$$
\frac{N_1(j) + \ldots + N_M(j)}{T_1 + \ldots + T_M} \approx \frac{\mathbb{E} N_1(j)}{\mathbb{E} T_1}.
$$
This suggests another representation,
$$
\mu_j = \frac{\mathbb{E} N_1(j)}{\mathbb{E} T_1}, \tag{9.22}
$$
where we assume that the chain starts in the state $s_1$, and
$$
T_1 = \min\{ n \ge 1 : X_n = s_1 \} \tag{9.23}
$$
is the time of the first return to $s_1$, so that
$$
\{ T_1 = n \} = \{ X_1 \ne s_1, \ldots, X_{n-1} \ne s_1, X_n = s_1 \}.
$$
Notice also that $N_1(1) = 1$, because the chain is in the state $s_1$ only at time zero before the return, so (9.22) implies that
$$
\mu_1 = \frac{1}{\mathbb{E} T_1}. \tag{9.25}
$$
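The relation $\mu_1 = 1/\mathbb{E} T_1$ can also be checked by simulation. For the two-state example with stationary distribution $(5/6, 1/6)$, the mean return time to $s_1$ should be close to $6/5$.

```python
import random

def mean_return_time(P, start, trials, seed=0):
    # average of the first return time T_1 over many independent runs
    rng = random.Random(seed)
    m = len(P)
    total = 0
    for _ in range(trials):
        state, t = start, 0
        while True:
            state = rng.choices(range(m), weights=P[state])[0]
            t += 1
            if state == start:
                break
        total += t
    return total / trials

P = [[0.9, 0.1],
     [0.5, 0.5]]
est = mean_return_time(P, 0, 20_000)   # should be close to 1/(5/6) = 1.2
```

Here the exact value is easy to verify directly: $T_1 = 1$ with probability $0.9$, and otherwise $T_1 = 1 + G$ with $G$ geometric of mean $2$, giving $\mathbb{E} T_1 = 0.9 \cdot 1 + 0.1 \cdot 3 = 1.2$.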
Since it did not matter how we start the chain, if we start it at $s_j$ and define
$$
T_j = \min\{ n \ge 1 : X_n = s_j \}, \tag{9.26}
$$
then, similarly, $\mu_j = 1 / \mathbb{E} T_j$.

In order to prove the representation (9.22), we need to check two things. First, we will need to check that $\mathbb{E} T_1 < \infty$. Because the numbers of visits to different states before the time $T_1$ add up to $T_1$,
$$
\sum_{j=1}^{m} N_1(j) = T_1,
$$
and therefore $N_1(j) \le T_1$, proving $\mathbb{E} T_1 < \infty$ would also imply that $\mathbb{E} N_1(j) < \infty$. This would show that the numbers in (9.22) are well defined, and they define a distribution because
$$
\sum_{j=1}^{m} \mu_j = \sum_{j=1}^{m} \frac{\mathbb{E} N_1(j)}{\mathbb{E} T_1} = \frac{\mathbb{E} T_1}{\mathbb{E} T_1} = 1.
$$
Lemma 9.6. If a Markov chain is irreducible then $\mathbb{E} T_1 < \infty$, where $T_1$ defined in (9.23) is the first return time to the state $s_1$ for the chain that starts at $s_1$.

In the proof, we will use the Markov property (9.2) in the following way. Let us consider two vectors
$$
X^1 = (X_1, \ldots, X_{t_1}) \quad \text{and} \quad X^2 = (X_{t_1+1}, \ldots, X_{t_2}),
$$
for some times $t_1 < t_2$, and sets $A$ and $B$ of possible outcomes for these two blocks of the Markov chain, and consider the conditional probability
$$
\mathbb{P}\bigl( X^2 \in B \,\big|\, X^1 \in A \bigr).
$$
We cannot use the Markov property directly, because the condition is a set, but we can rewrite this as
$$
\begin{aligned}
\mathbb{P}\bigl( X^2 \in B \,\big|\, X^1 \in A \bigr) &= \frac{\mathbb{P}(X^2 \in B, X^1 \in A)}{\mathbb{P}(X^1 \in A)} \\
&= \sum_{a \in A} \frac{\mathbb{P}(X^2 \in B, X^1 = a)}{\mathbb{P}(X^1 \in A)} \\
&= \sum_{a \in A} \frac{\mathbb{P}(X^2 \in B \,|\, X^1 = a)\, \mathbb{P}(X^1 = a)}{\mathbb{P}(X^1 \in A)} \\
&= \sum_{a \in A} \frac{\mathbb{P}(X^2 \in B \,|\, X_{t_1} = a_{t_1})\, \mathbb{P}(X^1 = a)}{\mathbb{P}(X^1 \in A)},
\end{aligned} \tag{9.28}
$$
where in the last step we used the Markov property to write the condition in terms of the last outcome $X_{t_1} = a_{t_1}$. This is very useful because, for example, if we can control the probability
$$
\mathbb{P}\bigl( X^2 \in B \,\big|\, X_{t_1} = a_{t_1} \bigr) \le p
$$
for all possible outcomes $X_{t_1} = a_{t_1}$, then the above equation implies that
$$
\mathbb{P}\bigl( X^2 \in B \,\big|\, X^1 \in A \bigr) \le p \sum_{a \in A} \frac{\mathbb{P}(X^1 = a)}{\mathbb{P}(X^1 \in A)} = p. \tag{9.29}
$$
In other words, the Markov property can be used in this way even if the condition is a set.
Proof (of Lemma 9.6). Since the chain is irreducible, if the chain is in the state $s_i$ at time $k$, $X_k = s_i$, we can reach the state $s_1$ in at most $m$ steps, which means that
$$
\mathbb{P}\bigl( X_{k+\ell} = s_1 \,\big|\, X_k = s_i \bigr) > 0
$$
for some $\ell \le m$. This implies that the probability that we do not reach the state $s_1$ in the next $m$ steps is strictly smaller than $1$: for some $\varepsilon > 0$ and all $i \le m$,
$$
\mathbb{P}\bigl( X_{k+1} \ne s_1, \ldots, X_{k+m} \ne s_1 \,\big|\, X_k = s_i \bigr) \le \mathbb{P}\bigl( X_{k+\ell} \ne s_1 \,\big|\, X_k = s_i \bigr) = 1 - \mathbb{P}\bigl( X_{k+\ell} = s_1 \,\big|\, X_k = s_i \bigr) \le 1 - \varepsilon. \tag{9.30}
$$
If we take
$$
A = \{ s_2, \ldots, s_m \}^{(N-1)m}, \qquad B = \{ s_2, \ldots, s_m \}^{m}
$$
to be the sets consisting of all the paths that avoid the state $s_1$ and have lengths $(N-1)m$ and $m$, then, with $X^1 = (X_1, \ldots, X_{(N-1)m})$ and $X^2 = (X_{(N-1)m+1}, \ldots, X_{Nm})$,
$$
\mathbb{P}\bigl( T_1 > Nm \bigr) = \mathbb{P}\bigl( X^2 \in B \,\big|\, X^1 \in A \bigr)\, \mathbb{P}\bigl( X^1 \in A \bigr).
$$
The equation (9.30) implies that, no matter what the value of $X_{(N-1)m}$ is, the conditional probability that all the coordinates of $X^2$ avoid the state $s_1$ is smaller than $1 - \varepsilon$. By the discussion before the proof and (9.29), the Markov property implies
$$
\mathbb{P}\bigl( X^2 \in B \,\big|\, X^1 \in A \bigr) \le 1 - \varepsilon
$$
and, therefore,
$$
\mathbb{P}\bigl( T_1 > Nm \bigr) \le (1 - \varepsilon)\, \mathbb{P}\bigl( T_1 > (N-1)m \bigr) \le \ldots \le (1 - \varepsilon)^N,
$$
which implies that $\mathbb{E} T_1 < \infty$. $\square$
Recall the definition
$$
N_1(j) = \sum_{k=0}^{T_1 - 1} I(X_k = s_j), \tag{9.34}
$$
where we assumed that the chain starts in the state $s_1$. To check that our heuristic discussion above gives the correct answer, it remains to prove the following.
It remains to prove that
$$
\mathbb{E} N_1(j) = \mu_j\, \mathbb{E} T_1, \tag{9.35}
$$
which is equivalent to (9.22) because the two equations only differ by a factor of $1/\mathbb{E} T_1$. To prove (9.35), first let us consider $j \ne 1$. Because the chain starts at $s_1$, we have $X_0 = s_1 \ne s_j$ and, in the definition of $N_1(j)$, we can start the summation from $k = 1$,
$$
N_1(j) = \sum_{k=0}^{T_1 - 1} I(X_k = s_j) = \sum_{k=1}^{T_1 - 1} I(X_k = s_j).
$$
Then
$$
\mathbb{E} N_1(j) = \sum_{k \ge 1} \mathbb{P}\bigl( X_k = s_j, k - 1 < T_1 \bigr) = \sum_{k \ge 1} \sum_{i=1}^{m} \mathbb{P}\bigl( X_{k-1} = s_i, X_k = s_j, k - 1 < T_1 \bigr),
$$
where we also partitioned the event into the different outcomes $X_{k-1} = s_i$. The key observation we need to make is that the event
$$
\{ k - 1 < T_1 \} = \{ X_1 \ne s_1, \ldots, X_{k-1} \ne s_1 \}
$$
depends only on $X_1, \ldots, X_{k-1}$, so we can write
$$
\mathbb{P}\bigl( X_{k-1} = s_i, X_k = s_j, k - 1 < T_1 \bigr) = \mathbb{P}\bigl( X_k = s_j \,\big|\, X_{k-1} = s_i, k - 1 < T_1 \bigr)\, \mathbb{P}\bigl( X_{k-1} = s_i, k - 1 < T_1 \bigr).
$$
If we use the Markov property in the form (9.28), we can drop the condition $k - 1 < T_1$ and rewrite this as
$$
\begin{aligned}
\mathbb{P}\bigl( X_{k-1} = s_i, X_k = s_j, k - 1 < T_1 \bigr) &= \mathbb{P}\bigl( X_k = s_j \,\big|\, X_{k-1} = s_i \bigr)\, \mathbb{P}\bigl( X_{k-1} = s_i, k - 1 < T_1 \bigr) \\
&= p_{ij}\, \mathbb{P}\bigl( X_{k-1} = s_i, k - 1 < T_1 \bigr).
\end{aligned}
$$
Exercise 9.2.1. Show that if a Markov chain has two different stationary distribu-
tions then there exist infinitely many stationary distributions.
Exercise 9.2.2. Find the stationary distribution of the Markov chain with the transition matrix
$$
P = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \end{bmatrix}.
$$
If the chain starts at $s_2$, what is the expectation of the first return time to $s_2$?
Exercise 9.2.3. If $\mu = (\mu_1, \ldots, \mu_m)$ is the stationary distribution of an irreducible Markov chain, show that $\mu_i \ne 0$ for all $1 \le i \le m$.
Exercise 9.2.4 (Another proof of uniqueness). Consider an irreducible Markov chain and suppose that there exist two stationary distributions $\mu^1$ and $\mu^2$. Let $j$ be any minimizer of
$$
\frac{\mu_j^1}{\mu_j^2} = \min_{i \le m} \frac{\mu_i^1}{\mu_i^2}.
$$
Show that
$$
\frac{\mu_j^1}{\mu_j^2} = \frac{\mu_i^1}{\mu_i^2}
$$
for any $i$ such that $p_{ij} > 0$. From this, derive that $\mu^1 = \mu^2$.
Exercise 9.2.5. If $p_{1j} = p_{2j}$ for all $j$ and
$$
Y_n =
\begin{cases}
s_j, & \text{if } X_n = s_j \text{ for } j \ge 3, \\
s_*, & \text{if } X_n = s_j \text{ for } j \le 2,
\end{cases}
$$
show that $Y_1, Y_2, \ldots$ is a Markov chain on the new state space $S = \{ s_*, s_3, \ldots, s_m \}$.
Exercise 9.2.6. If a Markov chain is irreducible, prove that all the moments $\mathbb{E} T_1^p < \infty$ for $p \ge 1$, where $T_1$ is defined in (9.23).
Exercise 9.2.7. In Toronto, it rains (or snows) on about 38% of days per year. Let
us consider a Markov chain model of weather with two states s1 = ‘rain’, s2 = ‘no
rain’, and the transition matrix
p 1 p
P= ,
1 q q
for some p, q 2 (0, 1), so that if it rains today then it will rain tomorrow with
probability p, and if it does not rain today then it will not rain tomorrow with
probability q. Find all p and q for which the expected proportion of rainy days over
long periods of time is 0.38.
Exercise 9.2.8. Suppose that the state $s_1$ is inessential and let $S_1$ be the cluster of states that communicate with $s_1$. Suppose that the chain starts at $s_1$, and let $T$ be the first time we exit the cluster $S_1$,
$$
T = \min\{ n \ge 1 : X_n \notin S_1 \}.
$$
Prove that $\mathbb{E} T < \infty$.

Exercise 9.2.9. Given a subset of states $A \subseteq S$, let
$$
T_A(s_i) = \min\{ n \ge 0 : X_n \in A \}
$$
be the first time the chain starting at $s_i$ visits one of the states in $A$, and let $t_i = \mathbb{E} T_A(s_i)$ be its expectation. Prove that:
(a) If $s_i \in A$ then $t_i = 0$.
(b) If $s_i \notin A$ then $t_i = 1 + \sum_{j=1}^{m} p_{ij}\, t_j$.
(c) The equations in parts (a) and (b) determine $t = (t_1, \ldots, t_m)$ uniquely. Hint: consider equation (9.18).
Exercise 9.3.1. If the limit \lim_{n\to\infty} P^n exists then the limit \lim_{n\to\infty} \frac{1}{n}(1 + P + \cdots + P^n) exists and is the same.
It is easy to see that the convergence of P^n can not hold in general. For example, if a Markov chain has two states s_1, s_2 and it has the transition probabilities p_{12} = p_{21} = 1 then
\[
P^{2n} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad P^{2n+1} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
\]
Another way to put it is that, if we start the chain in the state s_i, it can be back in s_i only at even times. The problem with this example is that this chain has period 2, which results in this periodic behaviour. However, if the chain is aperiodic then the convergence holds.
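The periodic two-state example can be checked in a few lines (an illustrative sketch; the Cesàro average below uses the powers P^0, \ldots, P^{n-1}, which has the same limit as the average in Exercise 9.3.1):

```python
# The chain with p12 = p21 = 1 has P^2 = I, so P^n oscillates between two
# matrices and lim P^n does not exist, while the Cesaro averages converge
# (here to the matrix with all entries 1/2).

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.0, 1.0], [1.0, 0.0]]
I = [[1.0, 0.0], [0.0, 1.0]]

assert mat_mul(P, P) == I      # period 2: P^(2n) = I, P^(2n+1) = P

n = 1000                       # average of P^0, P^1, ..., P^(n-1)
S = [[0.0, 0.0], [0.0, 0.0]]
Q = I
for _ in range(n):
    S = [[S[i][j] + Q[i][j] / n for j in range(2)] for i in range(2)]
    Q = mat_mul(Q, P)
assert all(abs(S[i][j] - 0.5) < 1e-9 for i in range(2) for j in range(2))
```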
In one direction, multiplying the right hand side of (9.38) by \nu on the left obviously gives \mu. In the other direction, if we take \nu = (0, \ldots, 1, \ldots, 0), where the i-th coordinate is 1, then
\[
\lim_{n\to\infty} \nu P^n = \mu
\]
is the i-th row of the limit \lim_{n\to\infty} P^n, so all the rows of this limit are equal to \mu.
Recall that if \nu is the initial distribution of the chain, i.e. the distribution of X_0, then \nu P^n is the distribution of X_n. Hence, the convergence theorem in the form (9.39) states that the distribution of X_n converges to the stationary distribution \mu no matter what the initial distribution is.
Proof (Theorem 9.2). We will be proving the formula (9.39). Let us consider the L_1-distance between vectors on R^m,
\[
\| x - y \|_1 = \sum_{i=1}^m |x_i - y_i|.
\]
Let us notice that multiplying by any transition matrix P on the right does not increase this distance, because
\[
\| xP - yP \|_1 = \sum_{j=1}^m \Big| \sum_{i=1}^m p_{ij} (x_i - y_i) \Big|
\le \sum_{j=1}^m \sum_{i=1}^m p_{ij} |x_i - y_i|
= \sum_{i=1}^m \sum_{j=1}^m p_{ij} |x_i - y_i|
= \sum_{i=1}^m |x_i - y_i| = \| x - y \|_1.
\]
If we apply this to distributions \nu and \mu and use that \mu = \mu P, we get that
\[
\| \nu P - \mu \|_1 = \| \nu P - \mu P \|_1 \le \| \nu - \mu \|_1.
\]
This suggests that the distributions get closer to each other after one step but, in order to prove convergence, we would like to have a strict inequality and then iterate it.
This is where the assumption that our chain is aperiodic comes into play. By Lemma 9.3 in Section 9.1, for large enough N, all entries p_{ij}(N) of P^N are strictly positive. Let \varepsilon > 0 be the smallest among its entries. If we apply the above calculation to the transition matrix P^N and distributions \nu and \mu, the improvement comes from the fact that
\[
\sum_{i=1}^m \varepsilon (\nu_i - \mu_i) = \varepsilon (1 - 1) = 0
\]
and, therefore,
\[
\sum_{i=1}^m p_{ij}(N) (\nu_i - \mu_i) = \sum_{i=1}^m \big( p_{ij}(N) - \varepsilon \big)(\nu_i - \mu_i).
\]
Notice that all p_{ij}(N) - \varepsilon \ge 0, because \varepsilon was the smallest among all entries, so
\[
\| \nu P^N - \mu P^N \|_1 = \sum_{j=1}^m \Big| \sum_{i=1}^m \big( p_{ij}(N) - \varepsilon \big)(\nu_i - \mu_i) \Big|
\le \sum_{j=1}^m \sum_{i=1}^m \big( p_{ij}(N) - \varepsilon \big) |\nu_i - \mu_i|
= \sum_{i=1}^m \sum_{j=1}^m \big( p_{ij}(N) - \varepsilon \big) |\nu_i - \mu_i|
= (1 - m\varepsilon) \sum_{i=1}^m |\nu_i - \mu_i|.
\]
Iterating this inequality (and using that multiplication by P never increases the distance) proves that \| \nu P^n - \mu P^n \|_1 \to 0 for any two distributions \nu and \mu. When \mu is the stationary distribution, \mu P^n = \mu, so \| \nu P^n - \mu \|_1 \to 0. This finishes the proof. □
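The two inequalities used in the proof, non-expansion under any P and strict contraction when all entries are at least \varepsilon, can be illustrated numerically; the 3 \times 3 matrix below is an arbitrary assumption with all entries at least \varepsilon = 0.2.

```python
# Check: multiplying distributions by a transition matrix P on the right never
# increases the L1-distance, and if all entries of P are >= eps, the distance
# shrinks at least by the factor (1 - m*eps).
import random

def apply_P(x, P):
    m = len(P)
    return [sum(x[i] * P[i][j] for i in range(m)) for j in range(m)]

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

random.seed(0)
P = [[0.2, 0.3, 0.5], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]]  # entries >= 0.2
m, eps = 3, 0.2

for _ in range(100):
    w = [random.random() for _ in range(m)]
    v = [random.random() for _ in range(m)]
    nu = [a / sum(w) for a in w]   # two random probability distributions
    mu = [a / sum(v) for a in v]
    d0 = l1(nu, mu)
    d1 = l1(apply_P(nu, P), apply_P(mu, P))
    assert d1 <= d0 + 1e-12                   # never increases
    assert d1 <= (1 - m * eps) * d0 + 1e-12   # contraction by 1 - m*eps
```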
Exercise 9.3.2. Given p \in (0,1), consider a Markov chain with the transition probabilities
\[
p_{i,i-1} = 1 - p, \quad p_{i,i+1} = p \quad \text{for } i = 2, \ldots, m-1,
\]
and p_{1,1} = 1 - p, p_{1,2} = p, p_{m,m-1} = 1 - p, p_{m,m} = p. What is the limit \lim_{n\to\infty} P^n? Hint: solve the system \mu = \mu P explicitly.
Exercise 9.3.3. Show that if a Markov chain is irreducible and aperiodic then, for any function f : S \to R, the expected value E f(X_n) converges.
Exercise 9.3.4. Suppose that m people sit at a round table and all have plates with rice in front of them. At the same time, each person divides his or her rice in half and evens out one half with the half of the person on the right (i.e. they split the total of their halves evenly), and the other half with the person on the left. If they keep repeating this, what will happen in the long run?
Let us consider a Markov chain with the transition matrix P. A probability distribution \mu = (\mu_1, \ldots, \mu_m) on the set of states is called reversible for the chain if
\[
\mu_i p_{ij} = \mu_j p_{ji} \qquad (9.40)
\]
for all i and j. A Markov chain is called reversible if it has a reversible distribution. The equations (9.40) are called the detailed balance equations. If we start the chain with the distribution \mu then (9.40) states that the first two outcomes (s_i, s_j) and (s_j, s_i) are equally likely. Moreover,
\[
P(X_0 = s_i, X_1 = s_j, X_2 = s_k) = \mu_i p_{ij} p_{jk}
= p_{ji} \mu_j p_{jk} = p_{ji} p_{kj} \mu_k = P(X_0 = s_k, X_1 = s_j, X_2 = s_i),
\]
which means that the first three outcomes (s_i, s_j, s_k) and (s_k, s_j, s_i) are equally likely. The same calculation shows that any sequence of outcomes and the same sequence in reverse order are equally likely, hence the name reversible. It is easy to check that a reversible distribution is stationary.
Lemma 9.7. A reversible distribution \mu is stationary.
Proof. Summing (9.40) over i, we get
\[
\sum_{i=1}^m \mu_i p_{ij} = \mu_j \sum_{i=1}^m p_{ji} = \mu_j,
\]
which means that \mu P = \mu, so \mu is stationary. □
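Lemma 9.7 is easy to verify numerically on a small example. The chain and reversible distribution below are illustrative assumptions chosen so that the detailed balance equations hold exactly.

```python
# A 3-state chain with a distribution satisfying detailed balance
# mu_i p_ij = mu_j p_ji; we then check that mu is stationary, mu P = mu.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.25, 0.5],
     [0.0, 0.5, 0.5]]
mu = [0.2, 0.4, 0.4]
m = 3

# detailed balance equations (9.40)
for i in range(m):
    for j in range(m):
        assert abs(mu[i] * P[i][j] - mu[j] * P[j][i]) < 1e-12

# hence stationarity: (mu P)_j = mu_j for every j
for j in range(m):
    assert abs(sum(mu[i] * P[i][j] for i in range(m)) - mu[j]) < 1e-12
```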
In other words, in each state s_i the chain picks a neighbour uniformly at random and moves there. A distribution \mu is reversible for this chain if
\[
\mu_i \frac{1}{N_i} = \mu_j \frac{1}{N_j}.
\]
Therefore, all the ratios \mu_i / N_i must be equal to the same constant c, so \mu_i = c N_i. Since the probabilities must add up to 1, the constant c = \frac{1}{N}, where N = N_1 + \cdots + N_m, and
\[
\mu_i = \frac{N_i}{N}. \qquad (9.43)
\]
Since the graph is connected, the Markov chain is irreducible and \mu is the unique stationary distribution. □
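The formula \mu_i = N_i / N can be checked on any small graph; the graph below is an arbitrary assumption for illustration.

```python
# Random walk on a graph: jump to a uniformly chosen neighbour. The
# distribution proportional to the degrees N_i is reversible, hence stationary.
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}   # adjacency lists
deg = {v: len(nbrs) for v, nbrs in graph.items()}
N = sum(deg.values())
mu = {v: deg[v] / N for v in graph}

# detailed balance: mu_i * (1/N_i) = 1/N = mu_j * (1/N_j) for every edge (i,j)
for i, nbrs in graph.items():
    for j in nbrs:
        assert abs(mu[i] / deg[i] - 1.0 / N) < 1e-12
        assert abs(mu[i] / deg[i] - mu[j] / deg[j]) < 1e-12
```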
The most important examples of reversible Markov chains appear in the context of Markov Chain Monte Carlo (MCMC) algorithms. Let us describe one such algorithm.
without computing \mu? The idea of the Markov Chain Monte Carlo method is to combine the law of large numbers with the convergence theorem for Markov chains. Suppose that we can construct a Markov chain with the stationary distribution \mu, whose transition probabilities depend only on the ratios r_{ij}. Then we can pick the initial state at random and run the chain for a large number of steps, n, to obtain a random variable X_n with the distribution close to \mu. If we repeat this procedure N times, we will get N i.i.d. random variables X_n^1, \ldots, X_n^N with the distribution close to \mu. By the law of large numbers,
\[
\frac{1}{N} \sum_{k=1}^N f(X_n^k) \approx E f(X_n^1) \approx \sum_{i=1}^m f(s_i) \mu_i.
\]
so this definition makes sense. Notice also that the transition probabilities depend only on the ratios r_{ij}.
To show that \mu is the stationary distribution of this Markov chain, we will check that \mu satisfies the detailed balance equations (9.40). For i = j there is nothing to check, and if i \not\sim j then both sides are zero, so we only need to check this for i \sim j. If \mu_j N_i \ge \mu_i N_j then, by (9.44),
\[
p_{ij} = \frac{1}{N_i} \quad \text{and} \quad p_{ji} = \frac{1}{N_j}\, \frac{\mu_i N_j}{\mu_j N_i} = \frac{\mu_i}{\mu_j N_i},
\]
which immediately implies (9.40). The case \mu_j N_i \le \mu_i N_j is exactly the same, so the Metropolis chain has the stationary distribution \mu.
If at least one p_{ii} > 0 then the chain is aperiodic. If the chain is periodic, a slight modification will make it aperiodic without affecting the stationary distribution \mu (see exercise below). Then the convergence theorem applies and we can use this chain to produce an i.i.d. sample with the distribution close to \mu. □
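A minimal sketch of the Metropolis chain described above: from state i, propose a uniformly chosen neighbour j and accept with probability \min(1, \mu_j N_i / (\mu_i N_j)), so that only the ratios of the \mu_i enter. The graph and weights below are illustrative assumptions, not part of the notes.

```python
# Build the Metropolis transition matrix on a path graph and verify that the
# target distribution mu satisfies detailed balance (hence is stationary).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path on 4 vertices
w = [1.0, 2.0, 3.0, 2.0]                          # unnormalized target weights
mu = [x / sum(w) for x in w]
m = len(w)

P = [[0.0] * m for _ in range(m)]
for i in range(m):
    Ni = len(graph[i])
    for j in graph[i]:
        Nj = len(graph[j])
        # propose j with prob 1/N_i, accept with prob min(1, mu_j N_i / mu_i N_j)
        P[i][j] = (1.0 / Ni) * min(1.0, (mu[j] * Ni) / (mu[i] * Nj))
    P[i][i] = 1.0 - sum(P[i])                     # rejected proposals stay put

for i in range(m):
    for j in range(m):
        assert abs(mu[i] * P[i][j] - mu[j] * P[j][i]) < 1e-12   # (9.40)
for j in range(m):
    assert abs(sum(mu[i] * P[i][j] for i in range(m)) - mu[j]) < 1e-12
```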
Exercise 9.4.3. Show that a birth-and-death Markov chain is aperiodic if and only if at least one p_{ii} > 0.
Exercise 9.4.4. Show that a Markov chain with the following transition matrix is not reversible:
\[
P = \begin{pmatrix} 0 & 0.75 & 0.25 \\ 0.25 & 0 & 0.75 \\ 0.75 & 0.25 & 0 \end{pmatrix}.
\]
Exercise 9.4.5. Check that the Binomial distribution B(m, \frac{1}{2}) satisfies the detailed balance equations (9.40) for the Ehrenfest chain (9.41).
Exercise 9.4.6. Suppose that two containers contain a total of m white balls and m black balls. At each step a ball is chosen at random from all 2m balls and put into the other container. We say that the system is in the state s_i if there are i white balls in the first container. Find the transition probabilities of this chain and compute \lim_{n\to\infty} P^n.
Exercise 9.4.7. Consider the Ehrenfest chain (9.41) and let D_n be the difference of the numbers of particles in the two containers at time n. This means that if the chain is in the state s_i at time n then D_n = 2i - m. Prove that
\[
E D_{n+1} = \frac{m-2}{m}\, E D_n, \quad \text{so that} \quad E D_n = \Big( \frac{m-2}{m} \Big)^n E D_0.
\]
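Since (9.41) is not reproduced in this excerpt, the sketch below assumes the standard Ehrenfest transition probabilities p_{i,i-1} = i/m and p_{i,i+1} = (m-i)/m on the states i = 0, \ldots, m; under that assumption the recursion for E D_n can be checked exactly by evolving the distribution of the chain.

```python
# Evolve the exact distribution of the Ehrenfest chain and check that
# E D_{n+1} = ((m-2)/m) E D_n at every step, where D_n = 2i - m.
m = 5
states = range(m + 1)

def step(dist):
    new = [0.0] * (m + 1)
    for i in states:
        if i > 0:
            new[i - 1] += dist[i] * i / m          # a particle leaves
        if i < m:
            new[i + 1] += dist[i] * (m - i) / m    # a particle arrives
    return new

def ED(dist):
    return sum(dist[i] * (2 * i - m) for i in states)

dist = [0.0] * (m + 1)
dist[0] = 1.0            # start with all particles in the second container
for _ in range(20):
    nxt = step(dist)
    assert abs(ED(nxt) - (m - 2) / m * ED(dist)) < 1e-12
    dist = nxt
```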
Exercise 9.4.8. Consider a random walk on the following graph:
[Figure: a graph on the vertices s_1, \ldots, s_6.]
If we start the chain at s_1, what is the expected time of the first return to s_1? What is the expected number of visits to s_4 before the first return to s_1? Hint: recall the results in Section 9.2.
Chapter 10
Gaussian Distributions
In this section, we will describe several Gaussian techniques, such as the Gaussian integration by parts, interpolation and concentration. We will begin with the Gaussian integration by parts. Let g be a centered Gaussian random variable with variance v^2 and let us denote the density function of its distribution by
\[
\varphi_v(x) = \frac{1}{\sqrt{2\pi}\, v} \exp\Big( -\frac{x^2}{2v^2} \Big). \qquad (10.1)
\]
Since x \varphi_v(x) = -v^2 \varphi_v'(x), given a differentiable function F : R \to R, we can formally integrate by parts,
\[
E g F(g) = \int_{-\infty}^{+\infty} x F(x) \varphi_v(x)\, dx
= -v^2 F(x) \varphi_v(x) \Big|_{-\infty}^{+\infty} + v^2 \int_{-\infty}^{+\infty} F'(x) \varphi_v(x)\, dx,
\]
if the limits \lim_{x \to \pm\infty} F(x) \varphi_v(x) = 0 and the expectations on both sides are finite. In fact, this formula holds under the assumption that E |F'(g)| < \infty, i.e. the right hand side is well defined.
Proof. By making the change of variables Z = g/v and considering the function f(x) = F(vx), it is enough to prove that Z \sim N(0,1) satisfies E Z f(Z) = E f'(Z).
If we represent e^{-z^2/2} = -\int_{-\infty}^z x e^{-x^2/2}\, dx for z \le 0 then
\[
\frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 f'(z) e^{-z^2/2}\, dz
= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 \int_{-\infty}^z f'(z)\, x e^{-x^2/2}\, dx\, dz
= -\frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 \int_x^0 f'(z)\, x e^{-x^2/2}\, dz\, dx
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^0 \big( f(x) - f(0) \big)\, x e^{-x^2/2}\, dx,
\]
where we could switch the order of integration, by Fubini's theorem, because the integrand is absolutely integrable over the region x \le z \le 0 by the assumption that E |f'(Z)| < \infty. Similarly, representing
\[
e^{-z^2/2} = \int_z^{\infty} x e^{-x^2/2}\, dx \quad \text{for } z \ge 0,
\]
we can handle the integral over [0, \infty). Adding up the two integrals, we get E f'(Z) = E Z f(Z) - f(0)\, E Z = E Z f(Z), which finishes the proof. □
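The formula E g F(g) = v^2 E F'(g) can be checked numerically with a simple Riemann sum; the test function F(x) = \sin x and the value of v are arbitrary assumptions for illustration.

```python
# Numerical check of Gaussian integration by parts: E g F(g) = v^2 E F'(g)
# for g ~ N(0, v^2), with F(x) = sin(x), F'(x) = cos(x).
import math

def phi(x, v):
    """Density of N(0, v^2), formula (10.1)."""
    return math.exp(-x * x / (2 * v * v)) / (math.sqrt(2 * math.pi) * v)

v = 1.3
h = 1e-3
xs = [-10 * v + h * k for k in range(int(20 * v / h))]   # grid on [-10v, 10v]
lhs = sum(x * math.sin(x) * phi(x, v) * h for x in xs)       # E g F(g)
rhs = v * v * sum(math.cos(x) * phi(x, v) * h for x in xs)   # v^2 E F'(g)
assert abs(lhs - rhs) < 1e-8
```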
\[
F_\ell = \frac{\partial F}{\partial x_\ell}. \qquad (10.4)
\]
Then the following holds.
our assumption that, for all \ell \le n, either E |F_\ell(g)| < \infty or \lambda_\ell = 0 implies (10.7). Moreover, by this assumption, (10.7) is absolutely integrable and integrating in g' finishes the proof. □
We will encounter two types of examples, when the function F = F((x_\ell)_{\ell \le n}) and all its partial derivatives are bounded, or when they grow at most exponentially fast, that is, for some constants c_1, c_2 > 0, for \ell \le n,
\[
\Big| \frac{\partial F}{\partial x_\ell}(x) \Big| \le c_1 e^{c_2 |x|}. \qquad (10.9)
\]
The end points of this interpolation are f(0) = E F(Y) and f(1) = E F(X), and the derivative along the interpolation can be computed as follows. If
\[
\Big| \frac{\partial F}{\partial x_i}(x) \Big| \le c_1 e^{c_2 |x|} \quad \text{and, either} \quad \Big| \frac{\partial^2 F}{\partial x_i \partial x_j}(x) \Big| \le c_1 e^{c_2 |x|} \quad \text{or} \quad a_{i,j} = b_{i,j}, \qquad (10.13)
\]
then
\[
f'(t) = \frac{1}{2} \sum_{i,j \le n} (a_{i,j} - b_{i,j})\, E \frac{\partial^2 F}{\partial x_i \partial x_j}(Z(t)). \qquad (10.14)
\]
Proof. When the partial derivatives have at most exponential growth, it is easy to check that one can interchange the derivative and integral to write
\[
f'(t) = \frac{d}{dt}\, E F(Z(t)) = E \sum_{i \le n} \frac{\partial F}{\partial x_i}(Z(t))\, Z_i'(t) = \sum_{i \le n} E \frac{\partial F}{\partial x_i}(Z(t))\, Z_i'(t).
\]
Let us apply Gaussian integration by parts to each of the terms on the right hand side. Since the covariance
\[
E Z_i'(t) Z_j(t) = E \Big( \frac{1}{2\sqrt{t}} X_i - \frac{1}{2\sqrt{1-t}} Y_i \Big) \big( \sqrt{t}\, X_j + \sqrt{1-t}\, Y_j \big) = \frac{1}{2} (a_{i,j} - b_{i,j}),
\]
the Gaussian integration by parts formula (10.5) implies that
\[
E \frac{\partial F}{\partial x_i}(Z(t))\, Z_i'(t) = \frac{1}{2} \sum_{j \le n} (a_{i,j} - b_{i,j})\, E \frac{\partial^2 F}{\partial x_i \partial x_j}(Z(t)),
\]
Using the above interpolation, one can prove the following classical Gaussian concentration inequality for Lipschitz functions. Let us consider a function F = F((x_\ell)_{\ell \le n}) : R^n \to R such that, for some L > 0,
Proof. First, let us suppose that F is differentiable and its gradient is bounded by L, |\nabla F| \le L. Take any \lambda \ge 0. The Gaussian interpolation we would like to consider is of the form
\[
f(t) = E \exp \lambda \Big( F(\sqrt{t}\, g^1 + \sqrt{1-t}\, g) - F(\sqrt{t}\, g^2 + \sqrt{1-t}\, g) \Big), \qquad (10.17)
\]
where g = (g_i)_{i \le n}, g^1 = (g^1_i)_{i \le n} and g^2 = (g^2_i)_{i \le n} are three independent standard Gaussian vectors on R^n. If we consider a function G : R^{2n} \to R given by the formula
\[
G(x_1, \ldots, x_n, x_{n+1}, \ldots, x_{2n}) = \exp \lambda \Big( F(x_1, \ldots, x_n) - F(x_{n+1}, \ldots, x_{2n}) \Big),
\]
and define Gaussian random vectors X and Y on R^{2n} by X = (g^1, g^2) and Y = (g, g), then the above interpolation can be rewritten in the form (10.12) as
\[
f(t) = E\, G(\sqrt{t}\, X + \sqrt{1-t}\, Y).
\]
\[
\frac{\partial^2 G}{\partial x_i \partial x_{i+n}} = -\lambda^2\, G\, F_i(x_1, \ldots, x_n)\, F_i(x_{n+1}, \ldots, x_{2n}).
\]
By our assumption, |\nabla F| \le L, so the partial derivatives F_i are all bounded, and F grows at most linearly, |F(x)| \le c_1 + c_2 |x|. Hence, the growth conditions in (10.13) (for G) are satisfied, and the formula (10.14) gives
\[
f'(t) = -\lambda^2\, E\, G(\sqrt{t}\, X + \sqrt{1-t}\, Y) \sum_{i=1}^n F_i(\sqrt{t}\, g^1 + \sqrt{1-t}\, g)\, F_i(\sqrt{t}\, g^2 + \sqrt{1-t}\, g).
\]
Since, by the Cauchy–Schwarz inequality, |\sum_{i \le n} F_i F_i| \le |\nabla F|^2 \le L^2, this implies f'(t) \le \lambda^2 L^2 f(t). Therefore,
\[
\big( f(t) e^{-\lambda^2 L^2 t} \big)' = e^{-\lambda^2 L^2 t} \big( f'(t) - \lambda^2 L^2 f(t) \big) \le 0,
\]
so f(t) e^{-\lambda^2 L^2 t} is decreasing and f(1) \le e^{\lambda^2 L^2} f(0). Recalling the definition (10.17), we see that f(0) = 1 and, therefore, we proved that
\[
f(1) = E \exp \lambda \big( F(g^1) - F(g^2) \big) \le e^{\lambda^2 L^2}.
\]
The Gaussian interpolation and the assumption on the gradient, |\nabla F| \le L, have played their roles.
Since g^1 and g^2 are independent, integrating in g^2 first (let us denote this integral E_2) and using Jensen's inequality,
\[
E_2 \exp \lambda \big( F(g^1) - F(g^2) \big) \ge \exp \lambda \big( F(g^1) - E F(g^2) \big).
\]
Integrating this in g^1, and using that g^1, g^2 have the same distribution as g, we get
\[
E \exp \lambda \big( F(g) - E F(g) \big) \le E \exp \lambda \big( F(g^1) - F(g^2) \big) \le e^{\lambda^2 L^2}.
\]
This inequality holds for any \lambda \ge 0, and minimizing the right hand side over \lambda (in other words, setting \lambda = t/2L^2) proves that
\[
P\big( F(g) - E F(g) \ge t \big) \le \exp\Big( -\frac{t^2}{4 \| F \|_{Lip}^2} \Big). \qquad (10.18)
\]
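The inequality (10.18) can be observed in simulation. The sketch below uses F(x) = \max_i x_i, which is 1-Lipschitz with respect to the Euclidean norm; the dimension and sample sizes are arbitrary assumptions.

```python
# Monte Carlo check of P(F(g) - E F(g) >= t) <= exp(-t^2/4) for the
# 1-Lipschitz function F(x) = max_i x_i of a standard Gaussian vector.
import math
import random

random.seed(1)
n, samples = 10, 20000
vals = [max(random.gauss(0, 1) for _ in range(n)) for _ in range(samples)]
EF = sum(vals) / samples                      # Monte Carlo estimate of E F(g)

for t in (0.5, 1.0, 2.0):
    emp = sum(1 for v in vals if v - EF >= t) / samples
    # the theoretical bound should hold up to Monte Carlo error
    assert emp <= math.exp(-t * t / 4) + 0.02
```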
Since
\[
X = g_1 b_1 + \cdots + g_n b_n \qquad (10.23)
\]
and, in particular, the norm of the random vector X in (10.23) can be written as
\[
\| X \|_E = \sup \big\{ g_1 z(b_1) + \cdots + g_n z(b_n) : z \in E^*, \| z \|_{E^*} \le 1 \big\}. \qquad (10.25)
\]
This functional is of the same form as in (10.21), with the set A in R^n given by
\[
A = \big\{ (z(b_1), \ldots, z(b_n)) : z \in E^*, \| z \|_{E^*} \le 1 \big\}.
\]
This shows that the norm of a random vector X with Gaussian coordinates satisfies the following concentration inequality,
\[
P\big( \big| \| X \|_E - E \| X \|_E \big| \ge t \big) \le 2 \exp\Big( -\frac{t^2}{4 \sigma(X)^2} \Big). \qquad (10.27)
\]
Another common way to write this, replacing t by t E \| X \|_E, is
\[
P\big( \big| \| X \|_E - E \| X \|_E \big| \ge t E \| X \|_E \big) \le 2 \exp\Big( -\frac{t^2}{4} \Big( \frac{E \| X \|_E}{\sigma(X)} \Big)^2 \Big). \qquad (10.28)
\]
The quantity d(X) := \big( \frac{E \| X \|_E}{\sigma(X)} \big)^2 is called the concentration dimension of X.
where c_p is some constant that depends only on p. The first inequality is just Jensen's inequality, so we only need to show the second one.
If we denote by x_+ = \max(0, x) the positive part of x, then we can bound
where we made the change of variables t = s x and then denoted by a_p the constant p \int_0^{\infty} t^{p-1} e^{-t^2/4}\, dt. Using the concentration inequality in the previous example, we proved that
\[
E \| X \|_E^p \le 2^{p-1} (E \| X \|_E)^p + 2^{p-1} a_p \sigma(X)^p. \qquad (10.31)
\]
It remains to understand how to bound \sigma(X) in (10.26).
For z \in E^*, the random variable
which implies that v^2 = E Y^2 = \frac{\pi}{2} (E |Y|)^2. Using this for Y = z(X), we get
\[
z(b_1)^2 + \cdots + z(b_n)^2 = E z(X)^2 = \frac{\pi}{2} \big( E |z(X)| \big)^2.
\]
Taking the supremum over z \in E^* such that \| z \|_{E^*} \le 1, we get that
\[
\sigma(X)^2 = \frac{\pi}{2} \sup_z \big( E |z(X)| \big)^2 \le \frac{\pi}{2} \Big( E \sup_z |z(X)| \Big)^2 = \frac{\pi}{2} \big( E \| X \|_E \big)^2.
\]
This means that \sigma(X) \le \sqrt{\pi/2}\, E \| X \|_E, and plugging this into (10.31) finally proves (10.29) with the constant c_p = 2^{p-1} (1 + a_p (\pi/2)^{p/2}).
Theorem 10.2 (Slepian–Fernique inequality I). Let X = (X_i)_{i \le n} and Y = (Y_i)_{i \le n} be two Gaussian vectors on R^n such that
1. E X_i^2 = E Y_i^2 for all i \le n,
2. E X_i X_j \le E Y_i Y_j for all i, j \le n.
Then, for any choice of parameters (\lambda_i)_{i \le n} \in R^n,
\[
P\Big( \bigcap_{i=1}^n \{ X_i \le \lambda_i \} \Big) \le P\Big( \bigcap_{i=1}^n \{ Y_i \le \lambda_i \} \Big) \qquad (10.33)
\]
and
\[
E \max_{i} X_i \ge E \max_{i} Y_i. \qquad (10.34)
\]
What this means is that, if the coordinates of X have the same variances as those of Y, but are less correlated, then they are less likely to stay below given thresholds (\lambda_i) and, therefore, their maximum is bigger on average. After we prove this result, we will give another proof of the second statement (10.34) under less restrictive assumptions.
because the derivatives applied to the factors i and j in the product will both be negative, since all functions \varphi_\ell are decreasing, so their product is positive. On the other hand, by assumption, the difference of the covariances E X_i X_j - E Y_i Y_j \le 0 is non-positive in this case, so the corresponding term in (10.14) will be non-positive. The derivatives \partial^2 / \partial x_i^2 are not important because the variances are equal and E X_i^2 - E Y_i^2 = 0. This proves that f'(t) \le 0 and, therefore, f(1) \le f(0),
which is the same as (10.33). Let us show how this implies (10.34).
Notice that, if we take all \lambda_i = \lambda then (10.33) can be rewritten as
\[
P\big( \max_{i \le n} X_i \le \lambda \big) \le P\big( \max_{i \le n} Y_i \le \lambda \big). \qquad (10.35)
\]
Writing X = \max_{i \le n} X_i and Y = \max_{i \le n} Y_i, this means that, for all \lambda \ge 0,
\[
P( X^+ \ge \lambda ) \ge P( Y^+ \ge \lambda ), \qquad P( X^- \ge \lambda ) \le P( Y^- \ge \lambda ).
\]
then, under the first assumption that all variances are equal, E X_i^2 = E Y_i^2, the second assumption that E X_i X_j \le E Y_i Y_j for all i, j \le n is equivalent to
\[
E (X_i - X_j)^2 \ge E (Y_i - Y_j)^2 \quad \text{for all } i, j \le n.
\]
We will now show that with this reformulation of the second assumption, we can drop the first assumption, which is often convenient in applications.
Theorem 10.3 (Slepian–Fernique inequality II). Let X = (X_i)_{i \le n} and Y = (Y_i)_{i \le n} be two Gaussian vectors on R^n such that E (X_i - X_j)^2 \ge E (Y_i - Y_j)^2 for all i, j \le n. Then
\[
E \max_{i \le n} X_i \ge E \max_{i \le n} Y_i. \qquad (10.36)
\]
Proof. We will apply the same interpolation as in the above proof, but to a special smooth approximation of the maximum function \max_{i \le n} x_i given by
\[
F(x) = \frac{1}{\beta} \log \sum_{i \le n} e^{\beta x_i},
\]
for a positive parameter \beta > 0. The reason we can view this as a smooth approximation of the maximum is because
\[
\max_{i \le n} x_i \le F(x) = \frac{1}{\beta} \log \sum_{i \le n} e^{\beta x_i} \le \max_{i \le n} x_i + \frac{\log n}{\beta}.
\]
Indeed, if we keep only the largest term in the sum \sum_{i \le n} e^{\beta x_i} we get the lower bound, and if we replace each term in this sum by the largest one, we get the upper bound. We can make the second term \log n / \beta as small as we like by taking \beta large. If we prove that E F(X) \ge E F(Y), we recover (10.36) by letting \beta go to infinity. To use the Gaussian interpolation (10.14), let us compute the derivatives of F. First of all,
\[
p_i(x) := \frac{\partial F}{\partial x_i} = \frac{e^{\beta x_i}}{\sum_{j \le n} e^{\beta x_j}}.
\]
Then,
\[
\frac{\partial^2 F}{\partial x_i^2} = \beta \big( p_i(x) - p_i(x)^2 \big), \qquad \frac{\partial^2 F}{\partial x_i \partial x_j} = -\beta\, p_i(x) p_j(x) \quad \text{if } i \neq j.
\]
Therefore, the Gaussian interpolation formula (10.14) gives (to simplify notation, we write p_i instead of p_i(Z(t)))
\[
f'(t) = \frac{d}{dt}\, E F(Z(t)) = \frac{1}{2} \sum_{i,j \le n} (E X_i X_j - E Y_i Y_j)\, E \frac{\partial^2 F}{\partial x_i \partial x_j}(Z(t))
= \frac{\beta}{2} \sum_{i \le n} (E X_i^2 - E Y_i^2)\, E (p_i - p_i^2) - \frac{\beta}{2} \sum_{i \neq j} (E X_i X_j - E Y_i Y_j)\, E\, p_i p_j.
\]
Since
\[
\sum_{i \le n} p_i(x) = \sum_{i \le n} \frac{e^{\beta x_i}}{\sum_{j \le n} e^{\beta x_j}} = 1,
\]
we can write p_i - p_i^2 = \sum_{j \neq i} p_i p_j, so that
\[
\sum_{i \le n} (E X_i^2 - E Y_i^2)\, E (p_i - p_i^2) = \sum_{i \neq j} (E X_i^2 - E Y_i^2)\, E\, p_i p_j.
\]
Switching the indices i and j, this can also be written as \sum_{i \neq j} (E X_j^2 - E Y_j^2)\, E\, p_i p_j and, taking the average of the two, we get
\[
\sum_{i \le n} (E X_i^2 - E Y_i^2)\, E (p_i - p_i^2) = \frac{1}{2} \sum_{i \neq j} (E X_i^2 + E X_j^2 - E Y_i^2 - E Y_j^2)\, E\, p_i p_j.
\]
Plugging this into the above expression for the derivative f'(t) and collecting the terms, we get
\[
f'(t) = \frac{\beta}{4} \sum_{i \neq j} \big( E (X_i - X_j)^2 - E (Y_i - Y_j)^2 \big)\, E\, p_i p_j \ge 0,
\]
since, by the assumption of the theorem, each term is non-negative. This implies that f(1) \ge f(0) or, in other words, E F(X) \ge E F(Y). This finishes the proof. □
\[
\| f(t) - f(t') \| \le \| t - t' \|.
\]
Next, we will prove a more general minimax analogue of the first theorem above, known as Gordon's comparison inequality.
Theorem 10.4 (Gordon's inequality). Let (X_{i,j}), (Y_{i,j}) be two Gaussian vectors indexed by i \le n, j \le m such that
1. E X_{i,j}^2 = E Y_{i,j}^2 for all i, j,
2. E X_{i,j} X_{i,\ell} \le E Y_{i,j} Y_{i,\ell} for all i, j, \ell,
3. E X_{i,j} X_{k,\ell} \ge E Y_{i,j} Y_{k,\ell} for all i, j, k, \ell such that i \neq k.
Then, for any choice of parameters (\lambda_{i,j}),
\[
P\Big( \bigcup_{i=1}^n \bigcap_{j=1}^m \{ X_{i,j} \le \lambda_{i,j} \} \Big) \le P\Big( \bigcup_{i=1}^n \bigcap_{j=1}^m \{ Y_{i,j} \le \lambda_{i,j} \} \Big) \qquad (10.40)
\]
and
\[
E \min_i \max_j X_{i,j} \ge E \min_i \max_j Y_{i,j}. \qquad (10.41)
\]
This reduces to Slepian's inequality when the index i takes only one value, so that the condition 3 is empty.
Proof. Let us rewrite the indicator
\[
I\Big( \bigcup_{i=1}^n \bigcap_{j=1}^m \{ x_{i,j} \le \lambda_{i,j} \} \Big) = 1 - \prod_{i=1}^n \Big( 1 - \prod_{j=1}^m I\big( x_{i,j} \le \lambda_{i,j} \big) \Big).
\]
because the derivatives applied to two factors in the last product will both be non-positive (since all functions \varphi_{i,j} are non-increasing). On the other hand, the difference of the covariances E X_{i,j} X_{i,\ell} - E Y_{i,j} Y_{i,\ell} \le 0 is non-positive in this case, so the corresponding term in (10.14) will be non-positive. Similarly, for i \neq k,
\[
\frac{\partial^2 \varphi}{\partial x_{i,j} \partial x_{k,\ell}}
= - \prod_{k' \neq i,k} \Big( 1 - \prod_{p=1}^m \varphi_{k',p}(x_{k',p}) \Big)\, \frac{\partial}{\partial x_{i,j}} \prod_{p=1}^m \varphi_{i,p}(x_{i,p})\, \frac{\partial}{\partial x_{k,\ell}} \prod_{p=1}^m \varphi_{k,p}(x_{k,p}),
\]
which is the same as (10.40). If we take all \lambda_{i,j} = \lambda, this can be rewritten as
\[
P\big( \min_i \max_j X_{i,j} \le \lambda \big) \le P\big( \min_i \max_j Y_{i,j} \le \lambda \big).
\]
For the last statement of the previous theorem, one can also remove the assumption about the equality of variances.
Theorem 10.5 (Gordon's inequality II). Let (X_{i,j}) and (Y_{i,j}) be two Gaussian vectors indexed by i \le n, j \le m such that
1. E (X_{i,j} - X_{i,\ell})^2 \ge E (Y_{i,j} - Y_{i,\ell})^2 for all i, j, \ell,
2. E (X_{i,j} - X_{k,\ell})^2 \le E (Y_{i,j} - Y_{k,\ell})^2 for all i, j, k, \ell such that i \neq k.
Then,
\[
E \min_i \max_j X_{i,j} \ge E \min_i \max_j Y_{i,j}. \qquad (10.42)
\]
and, as in the proof of Theorem 10.3, show that E F(X) \ge E F(Y). The calculation is a bit more involved but straightforward. Letting \beta \to \infty proves (10.42).
where (g_{i,j})_{i \le n, j \le m} are i.i.d. standard Gaussian random variables, and we compare the functionals
\[
\max_{t \in T} \max_{u \in U} X(t,u), \quad \min_{t \in T} \max_{u \in U} X(t,u) \quad \text{or} \quad \min_{u \in U} \max_{t \in T} X(t,u)
\]
with
\[
\max_{t \in T} \max_{u \in U} Y(t,u), \quad \min_{t \in T} \max_{u \in U} Y(t,u) \quad \text{or} \quad \min_{u \in U} \max_{t \in T} Y(t,u).
\]
Let us take any two pairs of parameters (t,u) and (t',u') and compute
\[
E\big( X(t,u) - X(t',u') \big)^2 = E\Big( \sum_{i=1}^n \sum_{j=1}^m g_{i,j} (t_i u_j - t_i' u_j') \Big)^2
= \sum_{i=1}^n \sum_{j=1}^m (t_i u_j - t_i' u_j')^2 = \| t \|^2 \| u \|^2 + \| t' \|^2 \| u' \|^2 - 2(t,t')(u,u'). \qquad (10.45)
\]
□
In order to apply the results for the minimax \min_{t \in T} \max_{u \in U}, we need the reverse inequality for t = t'. Because of (10.48), we can only hope to get the equality
\[
E\big( Y(t,u) - Y(t,u') \big)^2 = E\big( X(t,u) - X(t,u') \big)^2. \qquad (10.50)
\]
There are a couple of ways this can be achieved, as we will see in the following examples.
Example 10.4.2. One way to do this is to modify the definition of the bilinear form slightly and consider
\[
X^+(t,u) = \sum_{i=1}^n \sum_{j=1}^m g_{i,j} t_i u_j + z \| t \| \| u \|, \qquad (10.51)
\]
Notice that the inequality is reversed in this case and, together with (10.49), we can write
for an arbitrary function \lambda(t,u). Notice that (10.52) is also equal to zero if u = u' so, by the same logic, we can switch the roles of the parameters t and u,
and
\[
P\Big( \bigcup_{u \in U} \bigcap_{t \in T} \{ X^+(t,u) \le \lambda(t,u) \} \Big) \le P\Big( \bigcup_{u \in U} \bigcap_{t \in T} \{ Y(t,u) \le \lambda(t,u) \} \Big). \qquad (10.57)
\]
Example 10.4.3. There is another special case that is very useful, when
for some constants a, b > 0. In other words, T is a subset of the sphere of radius a in R^n and U is a subset of the ball of radius b in R^m. In this case, we will modify the definition (10.44) slightly and consider a proper random linear form
\[
Y^+(t,u) = b \sum_{i=1}^n h_i t_i + a \sum_{j=1}^m g_j u_j. \qquad (10.59)
\]
Example 10.4.4. In the setting of the Example 10.4.1, let us suppose that
so
\[
E \min_{u \in U} \max_{t \in T} Y^+(t,u) = E \| h \| - E \| g \|.
\]
This can be computed explicitly, and it is well-known that
\[
\frac{n}{\sqrt{n+1}} \le E \| h \| = \frac{\sqrt{2}\, \Gamma\big( \frac{n+1}{2} \big)}{\Gamma\big( \frac{n}{2} \big)} \le \sqrt{n} \qquad (10.66)
\]
(we leave it as an exercise below). The same holds for E \| g \| with n replaced by m.
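The Gamma-function formula and the two-sided bound in (10.66) are easy to check numerically (this does not replace the exercise, only confirms the constants for small n):

```python
# Check n/sqrt(n+1) <= E||h|| = sqrt(2) Gamma((n+1)/2) / Gamma(n/2) <= sqrt(n)
# for a standard Gaussian vector h in R^n, for a range of small n.
import math

def E_norm(n):
    """Exact value of E||h|| for h standard Gaussian in R^n."""
    return math.sqrt(2) * math.gamma((n + 1) / 2) / math.gamma(n / 2)

for n in range(1, 50):
    assert n / math.sqrt(n + 1) <= E_norm(n) <= math.sqrt(n)

# sanity check against the known value E|g| = sqrt(2/pi) in dimension one
assert abs(E_norm(1) - math.sqrt(2 / math.pi)) < 1e-12
```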
On the other hand,
\[
\min_{u \in U} \max_{t \in T} X(t,u) = \min_{\| u \| = 1} \Big( \sum_{i=1}^n \Big( \sum_{j=1}^m g_{i,j} u_j \Big)^2 \Big)^{1/2},
\]
where \lambda_{\min}(Z) is the smallest eigenvalue of Z. The first inequality in (10.62) gives
\[
E \| h \| - E \| g \| \le E \sqrt{n \lambda_{\min}(Z)}, \qquad (10.68)
\]
the statements we obtained above for the sample covariance matrix Z can be transferred to similar statements for Z_C, only now scaled by \lambda_{\min}(C). □
In this section, we will give one classical application of the results in the previous section. Let us consider N \ge 1 and a set
\[
V = \{ v_1, \ldots, v_m \} \subseteq R^N \qquad (10.73)
\]
for all 1 \le k < \ell \le m, where \| \cdot \| denotes the Euclidean norm (in R^N and R^n). The dimension n in (10.74) that we are allowed to choose depends on the distortion parameter \varepsilon and the number of points m, but it does not depend on N. The dependence on m is logarithmic, so the dimension can be relatively small even when the number of points is very large. We will give two proofs of this result, using Gaussian concentration and Gaussian comparison.
Proof (using Gaussian concentration). In our first proof we will assume (10.74) with constant 8 instead of 4 (however, using the Gaussian concentration inequality with the optimal constant would resolve this issue).
Let us consider a random matrix
\[
G = \begin{pmatrix}
g_{11} & g_{12} & g_{13} & \ldots & g_{1N} \\
g_{21} & g_{22} & g_{23} & \ldots & g_{2N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
g_{n1} & g_{n2} & g_{n3} & \ldots & g_{nN}
\end{pmatrix},
\]
where the entries g_{ij} are i.i.d. standard Gaussian random variables, and show that the linear map
\[
f(x) = \frac{1}{\sqrt{n}}\, G x \qquad (10.76)
\]
satisfies the low-distortion property (10.75) with positive probability. Let us denote, for 1 \le k < \ell \le m,
\[
a_{k\ell} = \frac{v_k - v_\ell}{\| v_k - v_\ell \|}.
\]
Then the equation (10.75) can be rewritten, for f in (10.76), as
\[
\frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon \le \frac{1}{\sqrt{n}} \| G a_{k\ell} \| \le 1 + \varepsilon.
\]
Since a_{k\ell} \in S^{N-1} for all 1 \le k < \ell \le m,
\[
P\Big( \frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon \le \frac{1}{\sqrt{n}} \| G a_{k\ell} \| \le 1 + \varepsilon \Big) \ge 1 - 2 e^{-n\varepsilon^2/4},
\]
and, by the union bound,
\[
P\Big( \forall\, 1 \le k < \ell \le m, \ \frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon \le \frac{1}{\sqrt{n}} \| G a_{k\ell} \| \le 1 + \varepsilon \Big) \ge 1 - m^2 e^{-n\varepsilon^2/4}.
\]
The right hand side is strictly positive if n > \frac{8}{\varepsilon^2} \log m. The fact that the probability is positive means that there exists G such that the low-distortion condition (10.75) is satisfied. □
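The random projection (10.76) is easy to try in simulation. The sketch below projects a few random high-dimensional points and checks that pairwise distances are nearly preserved; the dimensions, the points, and the tolerance 0.6–1.4 are illustrative assumptions, not the constants of the lemma.

```python
# Johnson-Lindenstrauss in action: f(x) = (1/sqrt(n)) G x with Gaussian G
# approximately preserves all pairwise Euclidean distances.
import math
import random

random.seed(2)
N, n, m = 600, 200, 8
points = [[random.gauss(0, 1) for _ in range(N)] for _ in range(m)]
G = [[random.gauss(0, 1) for _ in range(N)] for _ in range(n)]

def project(x):
    return [sum(G[i][j] * x[j] for j in range(N)) / math.sqrt(n)
            for i in range(n)]

def dist(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

images = [project(x) for x in points]
for k in range(m):
    for l in range(k + 1, m):
        ratio = dist(images[k], images[l]) / dist(points[k], points[l])
        assert 0.6 < ratio < 1.4    # low distortion, with high probability
```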
Proof (using Gaussian comparison). We will now show a proof using Gordon's inequality in the form of the Example 10.4.3. Let us again consider the set
\[
A = \Big\{ a_{k\ell} = \frac{v_k - v_\ell}{\| v_k - v_\ell \|} \in S^{N-1} : 1 \le k < \ell \le m \Big\},
\]
and let B = \{ x \in R^n : \| x \| \le 1 \} be the unit ball in R^n. Given the Gaussian n \times N matrix G in the previous proof, consider a bilinear form for a \in A and b \in B,
\[
X(a,b) := \sum_{i=1}^n \sum_{j=1}^N g_{ij} b_i a_j.
\]
If we denote
\[
D := \max_{a \in A} (g, a) = \max_{a \in A} \sum_{j=1}^N g_j a_j \qquad (10.78)
\]
then (10.77) can be rewritten in terms of D (also using the symmetry g \stackrel{d}{=} -g). The standard bound for the expected maximum of finitely many standard Gaussian random variables gives
\[
E D \le \sqrt{2 \log \mathrm{card}(A)} < \varepsilon \sqrt{n},
\]
since \mathrm{card}(A) \le m^2 and we assumed that n > \frac{4}{\varepsilon^2} \log m. Together, the inequalities above imply
\[
\frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon < \frac{1}{\sqrt{n}}\, E \min_{a \in A} \| G a \| \le \frac{1}{\sqrt{n}}\, E \max_{a \in A} \| G a \| < 1 + \varepsilon.
\]
This implies that
\[
\Big( \frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon \Big)^{-1} \frac{1}{\sqrt{n}}\, E \min_{a \in A} \| G a \| > 1 > (1 + \varepsilon)^{-1} \frac{1}{\sqrt{n}}\, E \max_{a \in A} \| G a \|
\]
and, therefore,
\[
E \Big[ \min_{a \in A} \| G a \| - (1+\varepsilon)^{-1} \Big( \frac{\sqrt{n}}{\sqrt{n+1}} - \varepsilon \Big) \max_{a \in A} \| G a \| \Big] > 0.
\]
This shows the existence of a linear map G with the distortion property (10.75). □
Suppose that we also want to control the distortion of the areas of triangles formed by any three points v_i, v_j and v_k. This can be done as follows. First, the areas of all triangles in the same (two-dimensional) plane are distorted by the same factor under a linear transformation. For each three points, we can add another point v_{ijk} in the same plane, which forms an isosceles right triangle with v_i, v_j. This means that, instead of m points, we now have m + \binom{m}{3} \le m^3 points on the list. We can now find f as in Lemma 10.6 for this new enlarged list of points. Since the dependence on m is logarithmic, we lose a factor of 3 in the bound (10.74) for n. It remains to solve the following exercise to see that this procedure allows us to control the distortion of areas.
Exercise 10.5.1. Suppose that three points v_i, v_j and v_k in the setting of Lemma 10.6 form an isosceles right triangle \Delta. Show that
\[
1 - 5\varepsilon \le \frac{\mathrm{Area}(f(\Delta))}{\mathrm{Area}(\Delta)} \le 1 + 5\varepsilon.
\]
The first statement holds for all \kappa, but for negative \kappa this threshold is not the correct one. The second statement shows that the threshold is sharp for nonnegative \kappa, in the sense that for \alpha below the threshold the set A is not empty with high probability and for \alpha above the threshold the set A is empty with high probability. For \kappa = 0, which corresponds to half-spaces through the origin, the threshold is 2, which means that one must intersect more than 2n half-spaces to get an empty intersection with the unit sphere with high probability.
Proof (of (10.80)). Assume \alpha E (z+\kappa)_+^2 > 1. Let G be an m \times n matrix whose entries are i.i.d. standard Gaussian random variables, and 1 = (1, \ldots, 1)^T \in R^m. Then A is non-empty if and only if there exists x \in S^{n-1} such that Gx \ge \kappa 1, where the inequality of two vectors should be interpreted coordinate-wise. We notice that this is equivalent to
\[
\min_x \max_\lambda \lambda^T (\kappa 1 - Gx) \le 0, \qquad (10.82)
\]
where the minimum is over x \in S^{n-1} and the maximum is over \lambda in
\[
B_+^m = \big\{ \lambda \in R^m : \| \lambda \| \le 1;\ \lambda_i \ge 0,\ 1 \le i \le m \big\}.
\]
The second term on the right-hand side can be bounded by e^{-\varepsilon^2 n / 8}, so we will try to bound the probability on the left-hand side from below. We will use Gordon's inequality in Theorem 10.4 in Section 10.3. Let
\[
X(x,\lambda) := \lambda^T G x + z \| \lambda \|, \qquad Y(x,\lambda) := \lambda^T g + \| \lambda \|\, x^T h,
\]
and
where (g + \kappa 1)_+ = ((g_i + \kappa)_+)_{i \le m}. By the assumption \alpha E (z+\kappa)_+^2 > 1, we can choose \varepsilon > 0 small enough such that \alpha E (z+\kappa)_+^2 \ge (1 + 3\varepsilon)^2, or
\[
\big( m E (z+\kappa)_+^2 \big)^{1/2} - \sqrt{n} - 2\varepsilon \sqrt{n} \ge \varepsilon \sqrt{n}.
\]
Therefore,
\[
P\Big( \min_x \max_\lambda \big( X(x,\lambda) + \kappa \lambda^T 1 \big) \ge \varepsilon \sqrt{n} \Big)
\ge P\Big( \| (g + \kappa 1)_+ \| - \| h \| \ge \varepsilon \sqrt{n} \Big)
\ge P\Big( \| (g + \kappa 1)_+ \| - \| h \| \ge \big( m E (z+\kappa)_+^2 \big)^{1/2} - \sqrt{n} - 2\varepsilon \sqrt{n} \Big)
\ge 1 - P\Big( \| h \| \ge \sqrt{n} + \varepsilon \sqrt{n} \Big) - P\Big( \| (g + \kappa 1)_+ \| \le \big( m E (z+\kappa)_+^2 \big)^{1/2} - \varepsilon \sqrt{n} \Big).
\]
where C_\varepsilon = 2\varepsilon \big( \alpha E (z+\kappa)_+^2 \big)^{1/2} - \varepsilon^2 \ge 2\varepsilon - \varepsilon^2 \ge \varepsilon for \varepsilon \in (0,1). One can then rewrite this as
\[
\sum_{i=1}^m \Big[ E (g_i + \kappa)_+^2 - (g_i + \kappa)_+^2 \Big] \ge C_\varepsilon n,
\]
and use Markov's inequality as in Section 2.1 to show that this probability is exponentially small. We will leave this as an exercise. □
Proof (of (10.81)). Now we assume that \alpha E (z+\kappa)_+^2 < 1 and \kappa \ge 0. Because \kappa \ge 0, the existence of a point x \in S^{n-1} such that Gx \ge \kappa 1 is equivalent to the existence of a point x \in B^n in the unit ball such that Gx \ge \kappa 1. Analogously to (10.82), this is equivalent to
\[
\min_x \max_\lambda \lambda^T (\kappa 1 - Gx) = \max_\lambda \min_x \lambda^T (\kappa 1 - Gx) = - \min_\lambda \max_x \lambda^T (Gx - \kappa 1),
\]
for some c > 0. In this direction, the calculation will be based on the comparison inequality in the Example 10.4.2 in Section 10.4.
As above, let z \sim N(0,1) be a standard Gaussian random variable independent of the entries of G. Then, for any \varepsilon > 0,
\[
P\Big( \min_\lambda \max_x \big( \lambda^T G x - \kappa \lambda^T 1 + z \| \lambda \| \| x \| - \varepsilon \sqrt{n} \| \lambda \| \| x \| \big) \ge 0 \Big)
\le P\Big( \min_\lambda \max_x \big( \lambda^T G x - \kappa \lambda^T 1 \big) \ge 0 \Big) + P\big( z \ge \varepsilon \sqrt{n} \big).
\]
The second term on the right-hand side can be bounded by e^{-\varepsilon^2 n / 2}, so we will try to bound the probability on the left-hand side from below. We will again use Gordon's inequality in Theorem 10.4, only now
and
\[
E Y(x_1,\lambda_1) Y(x_2,\lambda_2) = \| x_1 \| \| x_2 \| (\lambda_1 \cdot \lambda_2) + \| \lambda_1 \| \| \lambda_2 \| (x_1 \cdot x_2).
\]
If (x_1,\lambda_1) = (x_2,\lambda_2) then the variances are again equal. Also,
The minimax on the right hand side can be computed explicitly. First, we maximize over x \in B^n,
\[
\max_x \Big( \| x \| \lambda^T g + \| \lambda \|\, x^T h - \varepsilon \sqrt{n} \| x \| \| \lambda \| \Big)
= \max_{r \in [0,1],\, v \in S^{n-1}} r \Big( \lambda^T g + \| \lambda \|\, v^T h - \varepsilon \sqrt{n} \| \lambda \| \Big)
= \max_{r \in [0,1]} r \Big( \lambda^T g + \| \lambda \| \| h \| - \varepsilon \sqrt{n} \| \lambda \| \Big)
= \max\Big( 0,\ \lambda^T g + \| \lambda \| \| h \| - \varepsilon \sqrt{n} \| \lambda \| \Big)
\ge \lambda^T g - \varepsilon \sqrt{n} \| \lambda \| + \| \lambda \| \| h \|.
\]
If we represent \lambda \in B_+^m as \lambda = r v for r \in [0,1] and
\[
v \in S_+^{m-1} = \big\{ \lambda \in S^{m-1} : \lambda_i \ge 0,\ 1 \le i \le m \big\},
\]
d
using symmetry g = g. Since aE(z + k)2+ < 1, we can find e > 0 such that
p p p
(mE(z + k)2+ )1/2 + n 2e n > e n.
The rest of the argument is the same as in the first part above, based on Gaussian
concentration and the exercise below. t
u