Probability Theory
Nate Eldredge
November 29, 2012
Caution! These lecture notes are very rough. They are mainly intended for my own use during lecture.
They almost surely contain errors and typos, and tend to be written in a stream-of-consciousness style. In
many cases details, precise statements, and proofs are left to Durrett's text, homework, or presentations. But
perhaps these notes will be useful as a reminder of what was done in lecture. If I do something that is
substantially different from Durrett I will put it in here.
Thursday, August 23
Overview: Probability is the study of randomness. Questions whose answer is not "yes" or "no" but a number indicating the probability of "yes." Basic courses divide into discrete and continuous, and are mainly restricted to finite or short-term problems, involving a finite number of events or random variables. To escape these bounds: measure theory (introduced to probability by Kolmogorov). Unify discrete/continuous, and enable the study of long-term or limiting properties. Ironically, many theorems will be about how randomness disappears or is constrained in the limit.
Intro
1. Basic objects of probability: events and their probabilities, combining events with logical operations,
random variables: numerical quantities, statements about them are events. Expected values.
2. Table of measure theory objects: measure space, measurable functions, almost everywhere, integral. Recall definitions. Ex: Borel/Lebesgue measure on R^n.
3. What's the correspondence? Sample space Ω: all possible outcomes of an experiment, or "states of the world." Events: sets of outcomes corresponding to "the event happened." σ-field F: all reasonable events = measurable sets. Logic operations correspond to set operations. Random variables: measurable functions, so a statement about one, i.e. the pullback of a Borel set, is an event.
4. 2 coin flip example.
5. Fill in other half of the table. Notation matchup.
6. Suppressing the sample space
Tuesday, August 28
2 Random variables
2.1 Random vectors
Definition 2.1. An n-dimensional random vector is a measurable map X : Ω → R^n. (As usual R^n is equipped with its Borel σ-field.)
Fact: If X has components X = (X1 , . . . , Xn ) then X is a random vector iff the Xi are random variables.
More generally we can consider measurable functions X from Ω to any other set S equipped with a σ-field 𝒮 (a measurable space). Such an X could be called an S-valued random variable. Examples:
The set (group) of permutations of some set. For instance when we shuffle a deck of cards, we get a random permutation, which could be considered an S_52-valued random variable.
A manifold with its Borel σ-field. Picking random points on the manifold.
Function spaces. E.g. C([0, 1]) with its Borel σ-field. Brownian motion is a random continuous path, which could be viewed as a C([0, 1])-valued random variable.
Many results that don't use the structure of R^n (e.g. arithmetic, ordering, topology) in any obvious way will extend to any S-valued random variable. In some cases problems can arise if the σ-field 𝒮 is bad. However, good examples are Polish spaces (complete separable metric spaces) with their Borel σ-fields. It turns out that these measurable spaces behave just like R, and so statements proved for R (as a measurable space) work for them as well.
2.2
2.3 Distributions
The most important questions about X are: what values does it take on, and with what probabilities? Here's an object that encodes that information.
Definition 2.2. The distribution (or law) of X is the probability measure μ on (R, B_R) defined by μ(B) = P(X ∈ B). In other notation, μ = P ∘ X^{-1}. It is the pushforward of P onto R by the map X. We write X ∼ μ. It is easy to check that μ is in fact a probability measure.
Note μ tells us the probability of every event that is a question about X.
Note: Every probability measure μ on R arises as the distribution of some random variable on some probability space. Specifically: take Ω = R, F = B_R, P = μ, X(ω) = ω.
Example 2.3. μ = δ_c, a point mass at c (i.e. μ(A) = 1 if c ∈ A and 0 otherwise). Corresponds to a constant r.v. X = c a.s.
Example 2.4. Integer-valued (discrete) distributions: μ(B) = Σ_{n ∈ B ∩ Z} p(n) for some probability mass function p : Z → [0, 1] with Σ_n p(n) = 1. Such X takes on only integer values (almost surely) and has P(X = n) = p(n). Other notation: μ = Σ_n p(n) δ_n.
Bernoulli: p(1) = p, p(0) = 1 − p.
Binomial, Poisson, geometric, etc.
Example 2.5. Continuous distributions: μ(B) = ∫_B f dm for some f ≥ 0 with ∫_R f dm = 1. f is called the density of the measure μ. Radon–Nikodym theorem: μ is of this form iff m(B) = 0 implies μ(B) = 0.
Uniform distribution U(a, b): f = (1/(b − a)) 1_{[a,b]}.
Normal distribution: f(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)}.
2.4
Definition 2.6. Associated to each probability measure μ on R (and hence each random variable X) is the (cumulative) distribution function F_μ(x) := μ((−∞, x]). We write F_X for F_μ where X ∼ μ; then F_X(x) = P(X ≤ x).
Fact: F is monotone increasing and right continuous; F(−∞) = 0, F(+∞) = 1. (*)
Proposition 2.7. F uniquely determines the measure μ.
We'll use the proof of this as an excuse to introduce a very useful measure-theoretic tool: Dynkin's π-λ lemma. We need two ad hoc definitions.
Definition 2.8. Let Ω be any set. A collection P ⊂ 2^Ω is a π-system if for all A, B ∈ P we have A ∩ B ∈ P (i.e. closed under intersection). A collection L ⊂ 2^Ω is a λ-system if:
1. Ω ∈ L;
2. If A, B ∈ L with A ⊂ B then B \ A ∈ L (closed under subset subtraction);
3. If A_n ∈ L where A_1 ⊂ A_2 ⊂ A_3 ⊂ … and A = ∪_n A_n (we write A_n ↑ A) then A ∈ L (closed under increasing unions).
Theorem 2.9. If P is a π-system, L is a λ-system, and P ⊂ L, then σ(P) ⊂ L.
Proof: See Durrett A.1.4. It's just some set manipulations and not too instructive.
Here's how we'll prove our proposition:
Proof. Suppose μ, ν both have distribution function F. Let
P = {(−∞, a] : a ∈ R},
L = {B ∈ B_R : μ(B) = ν(B)}.
Notice that P is clearly a π-system and σ(P) = B_R. Also P ⊂ L by assumption, since μ((−∞, a]) = F(a) = ν((−∞, a]). So by Dynkin's lemma, if we can show L is a λ-system we are done.
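A quick sketch of that remaining check, using only that μ and ν are probability measures:
1. R ∈ L, since μ(R) = 1 = ν(R).
2. If A, B ∈ L with A ⊂ B, then μ(B \ A) = μ(B) − μ(A) = ν(B) − ν(A) = ν(B \ A), so B \ A ∈ L.
3. If A_n ∈ L with A_n ↑ A, then by continuity from below μ(A) = lim μ(A_n) = lim ν(A_n) = ν(A), so A ∈ L.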
2.5 Joint distribution
If X is an n-dimensional random vector, its distribution is a probability measure on R^n defined in the same way. (It's possible to define multi-dimensional cdfs, but it's messier and I am going to avoid it.)
Given random variables X_1, …, X_n, their joint distribution is the probability measure on R^n which is the distribution of the random vector (X_1, …, X_n). This is not determined solely by the distributions of the X_i! Dumb example: X = ±1 a coin flip, Y = −X. Then both X, Y have the distribution μ = ½ δ_{−1} + ½ δ_1. But (X, X) does not have the same joint distribution as (X, Y). For instance, P((X, X) = (1, 1)) = 1/2 but P((X, Y) = (1, 1)) = 0.
Moral: The distribution of X will let you answer any question you have that is only about X itself. If you
want to know how it interacts with another random variable then you need to know their joint distribution.
3 Expectation
3.1 Definition, integrability
Expectation EX or E[X].
Defining:
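The construction is left to lecture; a brief sketch of the usual measure-theoretic stages (this is exactly the Lebesgue integral EX = ∫_Ω X dP):
For simple X = Σ_{i=1}^k c_i 1_{A_i}: EX := Σ_{i=1}^k c_i P(A_i).
For X ≥ 0: EX := sup{EY : Y simple, 0 ≤ Y ≤ X} ∈ [0, ∞].
In general: EX := EX⁺ − EX⁻, defined whenever at least one of EX⁺, EX⁻ is finite; X is called integrable if E|X| < ∞.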
3.2 Inequalities
Theorem 3.2 (Jensen's inequality; Durrett Theorem 1.5.1). If φ is convex, then φ(EX) ≤ E[φ(X)]. (Costandino will present.)
Remark 3.3. Examples of convex functions: |t| (useful for remembering which way the inequality goes), exp(t), |t|^p for 1 ≤ p < ∞, any C² function with φ'' ≥ 0.
Definition 3.4. The L^p norm of a random variable X is ‖X‖_p := E[|X|^p]^{1/p}. (Could be infinite.) L^p is the space of all random variables with finite L^p norm.
Theorem 3.5 (Hölder's inequality). If 1 < p, q < ∞ with 1/p + 1/q = 1, then E|XY| ≤ ‖X‖_p ‖Y‖_q.
Proof. First, if ‖X‖_p = 0 then X = 0 a.s. (homework) and the inequality is trivial; likewise if ‖Y‖_q = 0.
We have the following inequality for all nonnegative real numbers x, y:
xy ≤ x^p/p + y^q/q.   (1)
(Calculus exercise: find where the difference is maximized.) Now take x = |X|/‖X‖_p, y = |Y|/‖Y‖_q, and take expectations:
E|XY| / (‖X‖_p ‖Y‖_q) ≤ E|X|^p / (p ‖X‖_p^p) + E|Y|^q / (q ‖Y‖_q^q) = 1/p + 1/q = 1.   (2)
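The calculus exercise behind (1), sketched: fix y ≥ 0 and let g(x) = x^p/p + y^q/q − xy for x ≥ 0. Then g'(x) = x^{p−1} − y vanishes at x_0 = y^{1/(p−1)} = y^{q/p}, and g is convex, so this is the minimum. Since x_0^p = y^q and x_0 y = y^q (using p + q = pq),
g(x_0) = y^q/p + y^q/q − y^q = y^q (1/p + 1/q − 1) = 0,
so g ≥ 0 everywhere, which is (1).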
Special cases:
1. Take p = q = 2; this is the Cauchy–Schwarz inequality.
2. Take Y = 1; this says E|X| ≤ ‖X‖_p. In particular, if E|X|^p < ∞ then E|X| < ∞, i.e. L^p ⊂ L^1. (We can also get this from Jensen.) Note it is essential that we are using a finite measure!
3. Let 1 ≤ r ≤ r′; take X = |Z|^r, p = r′/r, Y = 1; this says ‖Z‖_r ≤ ‖Z‖_{r′}, and in particular L^{r′} ⊂ L^r. (More moments is better.)
Theorem 3.6 (Markov's inequality). If X ≥ 0 and a > 0, then P(X ≥ a) ≤ (1/a) EX.
Remark 3.7. Due to Chebyshev. Named in accordance with Stigler's law (due to Merton).
Proof. Trivial fact: X ≥ a 1_{{X ≥ a}} (draw picture: a box under a hump). Now take expectations and rearrange.
Corollary 3.8. Chebyshev's inequality (due to Bienaymé): If X ∈ L², then P(|X − EX| ≥ a) ≤ Var(X)/a².
Recall Var(X) = E[(X − EX)²] = E[X²] − (EX)², which exists whenever X ∈ L². This says an L² random variable has a small probability of being far from its mean.
Proof. Apply Markov's inequality to the random variable (X − EX)², with a replaced by a².
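A standard way to read Chebyshev, taking a = kσ where σ² = Var(X) > 0:
P(|X − EX| ≥ kσ) ≤ σ²/(kσ)² = 1/k²,
so, for instance, at most 1/4 of the probability mass can lie two or more standard deviations from the mean.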
3.3 In terms of distribution
Theorem 3.9 (Change of variables). Let X be a random variable with distribution μ. For any measurable f : R → R,
E f(X) = ∫_R f dμ,   (3)
where the expectation on the left exists iff the integral on the right does.
Remark 3.10. The same holds for random vectors. This is an illustration of the principle that any statement
about a single random variable should only depend on its distribution (and any statement about several
should only depend on their joint distribution).
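For instance, if μ has a density f as in Example 2.5, the change-of-variables formula reads, for measurable g,
E g(X) = ∫_R g dμ = ∫_R g(x) f(x) dx;
e.g. for X ∼ U(a, b), EX = ∫_a^b x (1/(b − a)) dx = (a + b)/2.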
Tuesday, September 4
4 Modes of convergence
4.1 Almost sure, L^p
Definition 4.1. X_n → X a.s. means what it says, i.e. P({ω : X_n(ω) → X(ω)}) = 1. Idea: In the long run, X_n is guaranteed to be close to X. (Note this is a pointwise statement: the speed of convergence can depend arbitrarily on ω. We rarely deal with uniform convergence of random variables.)
Definition 4.2. X_n → X in L¹ means E|X_n − X| → 0. Idea: In the long run, X_n is on average close to X.
Note by the triangle inequality, if X_n → X in L¹ then EX_n → EX. (Expectation is a continuous linear functional on L¹.)
Example 4.3. Let U ∼ U(0, 1), and set
X_n = n if U ≤ 1/n, and X_n = 0 otherwise.
Then X_n → 0 a.s. but EX_n = 1, so X_n does not converge to 0 in L¹.
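To check the two claims: since P(U > 0) = 1, for almost every ω we have U(ω) > 0, and then X_n(ω) = 0 as soon as 1/n < U(ω), so X_n → 0 a.s. On the other hand,
EX_n = n · P(U ≤ 1/n) = n · (1/n) = 1 for every n,
so E|X_n − 0| = 1 does not go to 0.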
Definition 4.4. X_n → X in L^p means E[|X_n − X|^p] → 0. A sort of weighting: places where X_n and X are far apart contribute more to the average when p is bigger.
4.2
4.3 Convergence i.p.
Definition 4.11. We say X_n → X in probability (i.p.) if for all ε > 0, P(|X_n − X| ≥ ε) → 0, or equivalently P(|X_n − X| < ε) → 1. Idea: In the long run, X_n is very likely to be close to X.
Proposition 4.12. If X_n → X a.s. then X_n → X i.p.
Proof. If X_n → X a.s., then for any ε we have, almost surely, |X_n − X| < ε for sufficiently large n. That is, P(lim inf_n {|X_n − X| < ε}) = 1. By homework, lim inf_n P(|X_n − X| < ε) ≥ P(lim inf_n {|X_n − X| < ε}) = 1. Of course since these are probabilities the limsup has to be at most 1, so the limit exists and is 1.
Proposition 4.13. If X_n → X in L^p for any p ≥ 1 then X_n → X i.p.
Proof. Using Chebyshev,
P(|X_n − X| ≥ ε) = P(|X_n − X|^p ≥ ε^p) ≤ ε^{−p} E|X_n − X|^p → 0.   (4)
Example 4.14. Let Y_n be uniformly distributed on {1, …, n} (i.e. P(Y_n = k) = 1/n, k = 1, …, n). (We don't care about their joint distribution; they don't have to be independent.) Consider the triangular array of random variables X_{n,k}, 1 ≤ k ≤ n, defined by X_{n,k} = 1_{{Y_n = k}}. Think of this as a sequence of random variables X_{1,1}, X_{2,1}, X_{2,2}, X_{3,1}, … where we traverse the rows of the array one by one. (If you like you could take X_m = X_{n,k} where m = n(n−1)/2 + k.) Take X = 0. Note that for any ε < 1, we have P(|X_{n,k} − X| ≥ ε) = P(X_{n,k} = 1) = P(Y_n = k) = 1/n → 0, thus X_{n,k} → 0 i.p. We also have E|X_{n,k}|^p = 1/n, so X_{n,k} → 0 in L^p for all p. But with probability 1, the sequence X_{n,k} contains infinitely many 1s and also infinitely many 0s, so X_{n,k} does not converge a.s.
In this case, the large discrepancies (i.e. events when Xn,k = 1) become less and less likely to occur
at any given time, and their average effect is also becoming small. But they still happen infinitely often so
almost sure convergence is impossible.
Proposition 4.15. Suppose X_n → X i.p. and f : R → R is continuous. Then f(X_n) → f(X) i.p. (The corresponding statement for a.s. convergence is obvious.)
Proof. Proved in Durrett Theorem 2.3.4 using the double subsequence trick. Homework: give a direct
proof.
4.4 Borel–Cantelli
Theorem 4.16 (Borel–Cantelli lemma). If Σ_{n=1}^∞ P(A_n) < ∞, then P(lim sup A_n) = 0.
Proof. We have
P(lim sup A_n) = P(∩_{m=1}^∞ ∪_{n=m}^∞ A_n) = lim_{m→∞} P(∪_{n=m}^∞ A_n),   (5)
and by countable subadditivity
P(∪_{n=m}^∞ A_n) ≤ Σ_{n=m}^∞ P(A_n).   (6)
Since Σ_{n=1}^∞ P(A_n) < ∞, the tail Σ_{n=m}^∞ P(A_n) → 0 as m → ∞.
Corollary 4.17. Suppose for every ε > 0 we have Σ_{n=1}^∞ P(|X_n − X| ≥ ε) < ∞. Then X_n → X a.s. (Note this hypothesis looks similar to convergence i.p., except that we require P(|X_n − X| ≥ ε) to go to zero a little bit faster: 1/n is not fast enough but 1/n² or 2^{−n} is.)
Proof. Take ε = 1/m. Borel–Cantelli implies that P(lim sup_n {|X_n − X| ≥ 1/m}) = 0. This says that, almost surely, |X_n − X| is eventually less than 1/m, i.e. lim sup_n |X_n − X| ≤ 1/m. So let B_m = {lim sup_n |X_n − X| ≤ 1/m}; we just showed P(B_m) = 1. If B = ∩_m B_m then P(B) = 1 as well. But on B we have lim sup_n |X_n − X| ≤ 1/m for every m, which is to say lim sup_n |X_n − X| = 0, which is to say X_n → X.
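A toy illustration of the corollary (and of the rate remark): take events with P(X_n = 1) = p_n and X_n = 0 otherwise.
If p_n = 1/n², then for any ε ∈ (0, 1), Σ_n P(|X_n| ≥ ε) = Σ_n 1/n² < ∞, so X_n → 0 a.s.
If p_n = 1/n, then P(|X_n| ≥ ε) → 0, so X_n → 0 i.p.; but Σ_n 1/n = ∞, and if the X_n are independent, the second Borel–Cantelli lemma (proved later) gives X_n = 1 infinitely often a.s., so X_n does not converge to 0 almost surely.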
4.5
This fact (that convergence in probability implies almost sure convergence along a subsequence) can often be used to take a theorem about a.s. converging sequences, and prove it under the weaker assumption of i.p. convergence. For instance:
Theorem 4.19 (Upgraded DCT). Suppose X_n → X i.p., and there exists Y with |X_n| ≤ Y and E|Y| < ∞. Then EX_n → EX (and X_n → X in L¹).
Proof. Suppose not. Then there is an ε > 0 and a subsequence X_{n_m} such that |EX_{n_m} − EX| > ε for all m. We still have X_{n_m} → X i.p., so there is a further subsequence X_{n_{m_k}} such that X_{n_{m_k}} → X a.s. We still have |EX_{n_{m_k}} − EX| > ε for all k. But our classic DCT applies to X_{n_{m_k}}, so EX_{n_{m_k}} → EX. This is a contradiction.
We can get upgraded MCT and Fatou in the same way.
4.6
E|X_n − X| ≤ E|X_n − φ_M(X_n)| + E|φ_M(X_n) − φ_M(X)| + E|φ_M(X) − X|   (7)
by the triangle inequality (φ_M denoting truncation at level M). By uniform integrability, we can take M so large that for all n,
E|X_n − φ_M(X_n)| ≤ E[|X_n|; |X_n| ≥ M] < ε,   (8)
and by taking M larger we can get the same to hold for X. So for such large M we have
E|X_n − X| ≤ 2ε + E|φ_M(X_n) − φ_M(X)|.
By continuous mapping, φ_M(X_n) → φ_M(X) i.p., and by bounded convergence also in L¹, so E|φ_M(X_n) − φ_M(X)| → 0. Taking the limsup we have lim sup E|X_n − X| ≤ 2ε. But ε was arbitrary, so lim sup E|X_n − X| = 0 and we are done.
Remark 4.24. The converse of this statement is also true: if X_n → X in L¹ then X_n → X i.p. (this is just Chebyshev) and {X_n} is ui. So the condition of ui for L¹ convergence is in some sense optimal, because it is necessary as well as sufficient.
Remark 4.25. Vitali implies DCT since a dominated sequence is ui. (Homework.)
Lemma 4.26 (Crystal ball). Suppose for some p > 1 we have sup_{X ∈ S} E|X|^p < ∞ (i.e. S is bounded in L^p). Then S is ui.
Proof. Homework.
5 Smaller σ-fields
Our probability space comes equipped with the σ-field F consisting of all reasonable events (those whose probability is defined). A feature which distinguishes probability from other parts of measure theory is that we also consider various sub-σ-fields G ⊂ F.
Interpretation: if an event A is a question or a piece of binary data about an experiment, a σ-field G is a collection of information, corresponding to the information that can answer all the questions A ∈ G. This makes sense with the σ-field axioms: if you can answer the questions A, B, then with simple logic you can answer the questions A^c (not A) and A ∪ B (A or B). If you can answer a whole sequence of questions A_1, A_2, …, you can answer the question ∪_n A_n (are any of the A_n true?).
We say a random variable X is G-measurable if X^{-1}(B) ∈ G for every Borel B ⊂ R. As an abuse of notation, we also write X ∈ G. Idea: you have been given enough information from the experiment that you can deduce the value of the numerical data X.
Definition 5.1. If X is a random variable, σ(X) denotes the smallest σ-field with respect to which X is measurable; we call it the σ-field generated by X. (As usual, "smallest" means that if G is a σ-field with X ∈ G, then σ(X) ⊂ G. This σ-field is obviously unique, and it also exists because it is the intersection of all sub-σ-fields G of F for which X ∈ G; the collection being intersected is nonempty because X ∈ F by definition of random variable.)
Think of σ(X) as all the information that can be learned by observing X.
Proposition 5.2. σ(X) = {X^{-1}(B) : B ∈ B_R}.
Proof. Let G denote the collection on the right. Since X is σ(X)-measurable, we must have X^{-1}(B) ∈ σ(X) for all Borel sets B. This proves G ⊂ σ(X). Conversely, since preimages preserve unions and complements, G is a σ-field, and since it contains X^{-1}(B) for all Borel B, we have X ∈ G. By the minimality of σ(X), we must have σ(X) ⊂ G.
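A minimal concrete case of Proposition 5.2: if X = 1_A for some A ∈ F, then X^{-1}(B) is one of ∅, A, A^c, Ω according to which of the values 0, 1 lie in B, so
σ(X) = {∅, A, A^c, Ω}:
observing X tells you exactly whether A occurred, and nothing more.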
Intuitively, Y ∈ σ(X) should mean knowing X is enough to determine Y. The next proposition makes this explicit.
Proposition 5.3 (Doob–Dynkin lemma). Y ∈ σ(X) if and only if there exists a measurable f : R → R such that Y = f(X).
Proof. Presentation.
6 Independence
Definition 6.1. A sequence of events A_1, A_2, … (finite or infinite) is independent if, for any distinct indices i_1, …, i_n, we have
P(A_{i_1} ∩ ⋯ ∩ A_{i_n}) = P(A_{i_1}) ⋯ P(A_{i_n}).   (9)
To understand this, start with just two events. Then this just says P(A_1 ∩ A_2) = P(A_1)P(A_2). This may be best understood in terms of conditional probability: it says P(A_1 | A_2) = P(A_1), i.e. knowing that A_2 happened doesn't give you any information about whether A_1 happened, and doesn't let you improve your estimate of its probability.
More generally, for any distinct i_0, i_1, …, i_n we have
P(A_{i_0} | A_{i_1} ∩ ⋯ ∩ A_{i_n}) = P(A_{i_0}).
This says that if you want to know something about one of the events, Ai0 , then knowing that any number of
the other events happened is irrelevant information.
This is not a pairwise statement; it is not sufficient that any two A_i, A_j are independent. We really need to be allowed to consider subsequences of any (finite) size. It could be that no single A_j is useful in learning about A_i, but maybe several of them together are. You can't look at just two of the A_i at a time. However, you can see from the definition that it is sufficient to only look at finitely many at a time.
Example 6.2. Flip two fair coins. Let A_1 be the event that the first coin is heads, A_2 that the second coin is heads, A_3 that the two coins are the same. Then any two of the A_i are independent, but all three together are not. If you want to know whether A_1 occurred, then neither A_2 nor A_3 by itself helps you at all, but A_2 and A_3 together answer the question completely.
Note we could extend this to infinite subsequences: if A_1, A_2, … are independent, then for any distinct i_1, i_2, … we have
P(∩_{j=1}^∞ A_{i_j}) = ∏_{j=1}^∞ P(A_{i_j}).
Definition 6.4. We say random variables X_1, X_2, … are independent if the σ-fields σ(X_1), σ(X_2), … are independent, i.e. if for any distinct indices i_1, …, i_n and any events A_{i_j} ∈ σ(X_{i_j}) we have P(A_{i_1} ∩ ⋯ ∩ A_{i_n}) = P(A_{i_1}) ⋯ P(A_{i_n}) (independence of σ-fields in the sense of the previous definition).
To understand independence in other ways, this lemma will be very useful.
Lemma 6.5. Let A ∈ F be any event, and let
L_A := {B ∈ F : P(A ∩ B) = P(A)P(B)}
be the collection of all events which are independent of A. Then L_A is a λ-system.
Proof. (Don't do in class?)
1. P(A ∩ Ω) = P(A) = P(A)P(Ω), so Ω ∈ L_A.
2. If B_1, B_2 ∈ L_A and B_1 ⊂ B_2, then we have
P(A ∩ (B_2 \ B_1)) = P((A ∩ B_2) \ (A ∩ B_1))
= P(A ∩ B_2) − P(A ∩ B_1)   (since A ∩ B_1 ⊂ A ∩ B_2)
= P(A)(P(B_2) − P(B_1))
= P(A)P(B_2 \ B_1).
3. If B_1 ⊂ B_2 ⊂ … is an increasing sequence of events in L_A, and B = ∪_n B_n, then
P(A ∩ B) = P(∪_n (A ∩ B_n))
= lim P(A ∩ B_n)
= lim P(A)P(B_n)
= P(A)P(B).
Corollary 6.6. If C is any collection of events, then L_C := {B ∈ F : P(A ∩ B) = P(A)P(B) for all A ∈ C} is a λ-system.
Proof. L_C = ∩_{A ∈ C} L_A, and it is simple to check from the definition that an arbitrary intersection of λ-systems is another λ-system.
Proposition 6.7. If P_1, P_2, … are independent π-systems, then σ(P_1), σ(P_2), … are independent σ-fields.
Proof. Let
P = {A_{i_1} ∩ ⋯ ∩ A_{i_n} : A_{i_j} ∈ P_{i_j}, i_j ≥ 2}
be the collection of all finite intersections of events from P_2, P_3, … . By the assumption of independence, for any A ∈ P_1 and B = A_{i_1} ∩ ⋯ ∩ A_{i_n} ∈ P we have
P(A ∩ B) = P(A ∩ A_{i_1} ∩ ⋯ ∩ A_{i_n}) = P(A)P(A_{i_1}) ⋯ P(A_{i_n}) = P(A)P(B).
So A, B are independent. This shows A ∈ L_P, so we have P_1 ⊂ L_P. By the π-λ lemma we have σ(P_1) ⊂ L_P, which is to say that σ(P_1), P_2, P_3, … are independent.
This is still a sequence of π-systems, so we can repeat the argument, using P_2 instead of P_1, to see that σ(P_1), σ(P_2), P_3, … are independent. Indeed, for any n, we can repeat this n times, and learn that σ(P_1), …, σ(P_n), P_{n+1}, … are independent. Since the definition of independence only looks at a finite number of the P_i at a time, this in fact shows that all the σ-fields σ(P_1), σ(P_2), … are independent.
Proposition 7.1. If X_1, X_2, … are independent, and f_1, f_2, … are measurable functions, then f_1(X_1), f_2(X_2), … are independent.
Proof. Since f_i(X_i) is σ(X_i)-measurable, we have σ(f_i(X_i)) ⊂ σ(X_i). This implies that σ(f_1(X_1)), σ(f_2(X_2)), … are independent.
Independence of random variables can be expressed in terms of their joint distribution. Recall the
following facts about product measures:
Definition 7.2. Let (S_1, 𝒮_1), …, (S_n, 𝒮_n) be measurable spaces, and let S = S_1 × ⋯ × S_n. The product σ-field 𝒮 on S is the σ-field generated by all rectangles of the form A_1 × ⋯ × A_n, A_i ∈ 𝒮_i; we abuse notation and write 𝒮 = 𝒮_1 × ⋯ × 𝒮_n.
Theorem 7.3. If μ_1, …, μ_n are finite measures on S_1, …, S_n respectively, there is a unique measure μ on the product σ-field of S which satisfies μ(A_1 × ⋯ × A_n) = μ_1(A_1) ⋯ μ_n(A_n). μ is called the product measure of μ_1, …, μ_n and we write μ = μ_1 × ⋯ × μ_n.
This is a standard theorem of measure theory and I won't prove it here. The existence is an application of the Carathéodory extension theorem; the uniqueness follows immediately using the π-λ theorem (the collection of rectangles is a π-system which generates 𝒮).
Also recall the Fubini–Tonelli theorem: suppose μ = μ_1 × ⋯ × μ_n is a product measure on (S, 𝒮) and f : S → R is measurable. If either f ≥ 0 or ∫ |f| dμ < ∞, then we have
∫_S f dμ = ∫_{S_1} ⋯ ∫_{S_n} f(x_1, …, x_n) μ_n(dx_n) ⋯ μ_1(dx_1),
and the integrals on the right may be interchanged at will. (Note that it is often useful to apply the nonnegative case to |f| to verify that ∫ |f| dμ < ∞.)
Theorem 7.4. Suppose X_1, …, X_n are random variables with distributions μ_1, …, μ_n. Then X_1, …, X_n are independent if and only if their joint distribution μ = μ_{X_1,…,X_n} is the product measure μ_1 × ⋯ × μ_n on R^n.
Proof. (Worth doing in class?) By Proposition 5.2 the events in σ(X_i) are precisely those of the form {X_i ∈ B} for Borel sets B.
Suppose X_1, …, X_n are independent. Let B_1, …, B_n be Borel subsets of R. Then the events {X_1 ∈ B_1}, …, {X_n ∈ B_n} are in σ(X_1), …, σ(X_n) respectively, hence are independent. So
μ(B_1 × ⋯ × B_n) = P((X_1, …, X_n) ∈ B_1 × ⋯ × B_n)
= P({X_1 ∈ B_1} ∩ ⋯ ∩ {X_n ∈ B_n})
= P(X_1 ∈ B_1) ⋯ P(X_n ∈ B_n)
= μ_1(B_1) ⋯ μ_n(B_n).
Therefore μ = μ_1 × ⋯ × μ_n.
Conversely, suppose μ = μ_1 × ⋯ × μ_n. Let A_1 ∈ σ(X_1), …, A_n ∈ σ(X_n). By Proposition 5.2 above, we must have A_1 = {X_1 ∈ B_1}, …, A_n = {X_n ∈ B_n} for Borel sets B_1, …, B_n. Thus
P(A_1 ∩ ⋯ ∩ A_n) = P({X_1 ∈ B_1} ∩ ⋯ ∩ {X_n ∈ B_n})
= P((X_1, …, X_n) ∈ B_1 × ⋯ × B_n)
= μ(B_1 × ⋯ × B_n)
= μ_1(B_1) ⋯ μ_n(B_n)
= P(A_1) ⋯ P(A_n).
Thursday, September 13
Proposition 7.5. If X1 , . . . , Xn are independent nonnegative random variables, then E[X1 . . . Xn ] = E[X1 ] . . . E[Xn ].
If X1 , . . . , Xn are independent integrable random variables, then X1 . . . Xn is integrable and E[X1 . . . Xn ] =
E[X1 ] . . . E[Xn ]. (Note that without independence, a product of integrable random variables need not be
integrable.)
Proof. We'll just write out the case n = 2. If X, Y are independent and nonnegative then we have
E[XY] = ∫_{R²} xy μ_{X,Y}(dx, dy)   (change of variables)
= ∫_R ∫_R xy μ_X(dx) μ_Y(dy)   (Tonelli)
= (∫_R x μ_X(dx)) (∫_R y μ_Y(dy))
= E[X] E[Y].
If instead X, Y are independent and integrable, then |X|, |Y| are independent and nonnegative. By the previous case we have E|XY| = E|X| E|Y| < ∞. So ∫_{R²} |xy| μ_{X,Y}(dx, dy) < ∞ and we can use the same argument as above, with Fubini in place of Tonelli.
Corollary 7.6. If X_1, …, X_n ∈ L² are independent, then Var(X_1 + ⋯ + X_n) = Var(X_1) + ⋯ + Var(X_n).
Proof. Again we just do n = 2. If X, Y ∈ L², then by a simple computation we have Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y), where Cov(X, Y) = E[XY] − E[X]E[Y]. But by the previous proposition we have Cov(X, Y) = 0.
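The "simple computation," written out:
Var(X + Y) = E[((X − EX) + (Y − EY))²]
= E[(X − EX)²] + 2 E[(X − EX)(Y − EY)] + E[(Y − EY)²]
= Var(X) + Var(Y) + 2 Cov(X, Y),
where expanding the middle term gives E[(X − EX)(Y − EY)] = E[XY] − E[X]E[Y] = Cov(X, Y).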
Notation 8.1. If G_1, G_2 are independent σ-fields we will write G_1 ⫫ G_2; likewise for independent events or random variables. (We will not use this notation when more than two things are independent.)
Let us start with a result that looks very abstract but is actually profound.
Theorem 8.2 (Kolmogorov 0-1 law). Let G_1, G_2, … be a sequence of independent σ-fields. Define the tail σ-field
T = ∩_{n=1}^∞ σ(G_n, G_{n+1}, …).
Then every event A ∈ T has P(A) = 0 or P(A) = 1.
For example, if the X_n are random variables and G_n = σ(X_n), the following is a tail event:
(d) {Σ_{n=1}^∞ X_n converges}.
On the other hand, the following apparently limiting objects are not in T:
(a) sup_n X_n, inf_n X_n (for instance, the first term could affect the supremum if it is larger than all the remaining terms);
(b) Σ_{n=1}^∞ X_n when it exists. (For instance, think about the case that all X_n ≥ 0, so the infinite sum definitely exists. The value of X_1 cannot affect whether or not the value of the sum is ∞, but it can affect whether the value is, say, 1 or 2.)
So, for instance, the Kolmogorov zero-one law says that if I have an independent sequence of random variables, there is no element of chance affecting whether or not the sequence converges: depending on the distributions of the X_n, either it almost surely converges, or it almost surely diverges; there is no middle ground.
Example 8.4 (Percolation). Consider the integer lattice Z² in the plane, and produce a random graph by turning each edge "on" independently with probability p. You get a big graph. It is probably (certainly) disconnected, so it has a bunch of components (or clusters). What's the probability that one of those components is infinite (we say percolation occurs)?
If you think about it, if I hide any finite number of the edges (say, those in some ball), you can still tell whether there's an infinite component or not; the presence or absence of any given finite set of edges can't change that event. So this is a tail event; by the Kolmogorov zero-one law it must have probability zero or one. Depending on the value of p, an infinite component is either guaranteed or impossible.
But which is it, for which values of p? Well, it's intuitively clear that increasing p should make it easier to have an infinite component, and this can be made rigorous with a coupling argument (explain?). So P_p(percolation) must jump from 0 to 1 at some critical value p = p_c, i.e. for all p < p_c percolation does not happen, and for all p > p_c it does.
This leaves two questions: what is the value of p_c, and what happens at p_c itself? In 1960 Harris showed there is no percolation at p = 1/2, so p_c ≥ 1/2. The other direction was open for 20 years until Harry Kesten famously showed in 1980 that for the 2-dimensional integer lattice, p_c ≤ 1/2. So the critical threshold is 1/2, and percolation does not occur at this threshold.
What about higher dimensions, i.e. the integer lattice Z^d? There's no reason to expect a nice value for p_c, but it's been estimated numerically by simulations, and asymptotics as d → ∞ are also known. In particular it can be shown it is always strictly between 0 and 1. Is there percolation at p_c itself? It is conjectured the answer is no in every dimension. This is Harris and Kesten's result in d = 2, and Hara and Slade in the early 1990s showed it also holds for all d ≥ 19. For 3 ≤ d ≤ 18 this remains one of the most notorious open problems in probability today.
Let's think back to the situation of a sequence of events A_1, A_2, … . The (first) Borel–Cantelli lemma gave us a sufficient condition to ensure P(lim sup A_n) = 0, i.e. almost surely only finitely many A_n happen; namely, that Σ_n P(A_n) < ∞. This didn't require anything about the relationship between the sets, and in general this sufficient condition is not necessary. (Example: let U ∼ U(0, 1), A_n = {U < 1/n}. Then P(A_n) = 1/n so Σ P(A_n) = ∞, but P(lim sup A_n) = P(U = 0) = 0.) But in the presence of independence this sufficient condition is also necessary.
Theorem 8.5 (Second Borel–Cantelli lemma). Let A_1, A_2, … be independent events. If Σ_n P(A_n) = ∞, then P(lim sup A_n) = 1.
This is sometimes also combined with the first Borel–Cantelli lemma and stated as the Borel zero-one law (what happened to Cantelli?):
Corollary 8.6 (Borel zero-one law). If A_1, A_2, … are independent events, then P(lim sup A_n) is 0 or 1, according to whether Σ P(A_n) < ∞ or Σ P(A_n) = ∞.
Example 8.7. If you flip a fair coin infinitely many times, you will get infinitely many heads and infinitely
many tails. (What else?)
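The computation behind this: let A_n = {nth flip is heads}. The A_n are independent with P(A_n) = 1/2, so Σ_n P(A_n) = ∞ and the second Borel–Cantelli lemma gives P(A_n i.o.) = 1, i.e. infinitely many heads almost surely; the same argument applied to the complements gives infinitely many tails almost surely.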
Tuesday, September 18
Formally we define expectation as a Lebesgue integral, but surely that isn't how you would define it intuitively. If I asked you what's the expected number of green M&Ms in a bag, you'd go buy a bunch of bags, count the number of green ones in each bag, and average them. So if X_i is the number in bag i, you'd compute (1/n)(X_1 + ⋯ + X_n) for some large n, and you'd expect that would be a good approximation to the abstract expected value.
If probability theory is going to be a useful mathematical machine, it had better agree with our intuition
on this point: that expectation is a long-run average. Luckily it does, via the following fundamental theorem.
Theorem 9.1 (Strong law of large numbers). Let X_1, X_2, … be independent and identically distributed (we write iid for short), and integrable. Then
(X_1 + ⋯ + X_n)/n → E[X_1]
almost surely.
(A weak law is a statement of this kind, in which the conclusion is only convergence in probability.)
It looks a bit weird that we have singled out X1 to appear in the conclusion; but of course since the Xi
are identically distributed, they all have the same expected value. So we mean that the averages of the iid
random variables converge to their common expected value.
The above is the optimal version of the SLLN: the conclusion only uses the expectation of X_1, and we don't assume any higher moments. Proving this sharp version is a bit involved and I don't intend to do it. Anyway the full strength is rarely needed in practice; most of the random variables we meet in everyday life have many more finite moments, usually even all of them. So we'll prove some results assuming more integrability.
For the rest of this section, S_n is the sum S_n = X_1 + ⋯ + X_n, so we are interested in the convergence of S_n/n.
Theorem 9.2 (L² WLLN). Let X_1, X_2, … be independent, identically distributed, and L². Then S_n/n → E[X_1] in L² (hence also in probability).
We are assuming more than the classic SLLN, i.e. L2 instead of L1 , and getting only convergence ip.
But the proof will be very easy.
Proof. For short, set μ = E[X_1], σ² = Var(X_1). Since the X_n are iid, we have E[S_n] = nμ and Var(S_n) = nσ². Now
E[(S_n/n − μ)²] = E[((S_n − nμ)/n)²] = (1/n²) Var(S_n) = nσ²/n² = σ²/n → 0.
The only place we used independence was the fact that Var(S_n) = n Var(X_1), which follows from E[X_i X_j] = E[X_i]E[X_j] for i ≠ j, or in other words Cov(X_i, X_j) = 0. That is, we only used that the X_i are uncorrelated. In particular, it is sufficient for them to be pairwise independent.
Note that using Chebyshev's inequality with the above computation tells us that
P(|S_n/n − μ| > ε) ≤ σ²/(n ε²),
which is nice because it gives you a quantitative bound.
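For a sense of scale (numbers chosen only for illustration): with σ² = 1 and ε = 0.1 the bound is
σ²/(n ε²) = 100/n,
so it says nothing until n > 100, and gives probability at most 0.01 once n = 10,000.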
Here's a strong law of large numbers, under even stronger hypotheses:
Theorem 9.3. Suppose X_1, X_2, … are iid and L⁴. Then S_n/n → E[X_1] almost surely.
Note this proof has shown that if X_n ≥ 0 and EX_1 = ∞, then S_n/n → +∞ almost surely. So again we have a result that applies to either integrable or nonnegative random variables.
This doesn't mean all hope is lost for convergence in the non-integrable case; for instance, we could try putting something else in the denominator, which grows a bit faster than n. See Durrett Example 2.2.7 for an example where you use this idea to get convergence in probability, and estimate the average size of a non-integrable distribution (discuss in class?); see Theorem 2.5.9 for a proof that you cannot get almost sure convergence in such cases, no matter how you normalize.
Example 9.5 (The St. Petersburg paradox). Consider the following game: we flip a coin repeatedly until heads comes up. If the first heads is on flip number k, you win 2^k dollars (so your winnings double with each tails). Thus if X is the amount you win, we have P(X = 2^k) = 2^{−k} for k ≥ 1. How much should you pay to get into such a game?
We can easily compute E[X] = Σ_{k=1}^∞ 2^k P(X = 2^k) = Σ_{k=1}^∞ 1 = ∞. So your expected winnings are infinite. Suppose you get to play repeatedly, i.e. X_1, X_2, … are iid with the distribution of X, and S_n = X_1 + ⋯ + X_n is your total winnings up to time n; we showed above that S_n/n → +∞ almost surely. If you pay an amount c for each play, your net winnings after time n are S_n − cn = n(S_n/n − c) → ∞ a.s.; you will eventually make back everything you paid.
On the other hand, this may not happen very fast. Say you pay c = 200 dollars for each play. You only profit if at least 7 tails are flipped, which happens only 1/128 of the time. So the vast majority of plays will see you lose money. It's just that, every once in a great while, you'll win a huge amount that will, on average, make up for all those losses.
The point is that if you pay a fixed amount for each play, the total amount you paid grows linearly with time, while your winnings grow faster than linearly. How much faster? Durrett works out an example to show that S_n/(n lg n) → 1 i.p. So after a large number of plays, you have paid out cn dollars, and with high probability, you have won about n lg n dollars. Thus you break even after about 2^c plays. For c = 200 dollars that is a very long time.
Durrett's Theorem 2.5.9 shows that we cannot get almost sure convergence of S_n/(n lg n). For large n, S_n is very likely to be close to n lg n, but there will be occasional excursions that take it further away.
10
We have been blithely talking about iid sequences of random variables and such. But a priori it is not clear that our development was not vacuous. Could there actually be a sequence of random variables on some probability space with that complicated set of properties? We certainly hope so, because the idea of an iid sequence is so intuitive. But not just any old probability space (Ω, F, P) will work. (For example, you can show that it won't work for Ω to be a countable set.) The natural spaces to use are infinite product spaces, which we will now discuss.
But first an easy example.
Theorem 10.1. Let Ω = [0, 1] and P be Lebesgue measure. Let Y_n(ω) be the nth bit in the binary expansion of ω. Then the random variables Y_n are iid Bernoulli(1/2).
Remark 10.2 (Normal numbers). In other words, if you choose a number uniformly at random in [0, 1], the bits in its binary expansion look like fair coin flips. In particular, by the SLLN, asymptotically you will see equal numbers of 0 and 1 bits (in the sense that the fraction of the first n bits which are 0 goes to 1/2 as n → ∞). We say a number with this property is normal in base 2, and we've just shown that almost every number in [0, 1] is normal in base 2.
Of course, not every number has this property; for example, 1/2 = 0.1000…₂ is not normal in base 2. But 1/3 = 0.01010101…₂ is.
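Checking the claim about 1/3 (with the definition of normality used above, namely asymptotic frequency 1/2 for each bit): the binary expansion of 1/3 is 0.010101…, so among the first n bits there are ⌊n/2⌋ ones, and the fraction of 1 bits is ⌊n/2⌋/n → 1/2; likewise the fraction of 0 bits tends to 1/2.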
We could ask the same for other bases. 1/3 = 0.1000…₃ is not normal in base 3. But the same argument shows that almost every number is normal in base 3. Taking an intersection of measure 1 sets, almost every number is normal in bases 2 and 3 simultaneously.
In fact, taking a countable intersection, almost every number is normal in every base simultaneously, or in other words is a normal number. This is remarkable because, as far as I know, no explicit example of a normal number is known. Rational numbers are not normal (p/q is not normal in base q because its base q expansion terminates). It is a famous conjecture that π is normal, but this has never been proved. So this is one of those cases in mathematics where it is hard to find a single example, but easy to show there must be lots of them.
For this section, I = [0, 1] denotes the unit interval.
Definition 10.3. R^N is the infinite Cartesian product of R with itself. You can think of it as the set of all sequences of real numbers, or alternatively as the set of all functions x : N → R; we will use the latter notation. We define I^N similarly.
Definition 10.4. A cylinder set is a subset of R^N which is of the form A = B × R × R × ⋯, where for some n we have B ⊂ R^n and B is Borel. That is, x ∈ A iff (x(1), …, x(n)) ∈ B.
We equip R^N with the infinite product σ-field B^N, which is the σ-field generated by the cylinder sets. We remark that B^N contains more than cylinder sets; for example it contains all products of Borel sets B_1 × B_2 × ⋯ (this is the countable intersection of the cylinder sets B_1 × ⋯ × B_n × R × ⋯).
We can use R^N to talk about infinite joint distributions. Suppose X_1, X_2, … is an infinite sequence of random variables on some probability space (Ω, F, P). Define a map X : Ω → R^N as
(X(ω))(k) = X_k(ω),
where we are again thinking of R^N as a space of functions. Or in terms of sequences, X(ω) = (X_1(ω), X_2(ω), …). It is simple to check that X is measurable. Thus we can use it to push forward P to get a measure μ on (R^N, B^N), which can be viewed as the joint distribution of the entire sequence X_1, X_2, … .
Much as the joint distribution of a finite number of random variables completely describes how they interact, the same is true for an infinite sequence. It even describes "infinite" properties. For example, suppose we have X_n → 3 a.s. This says that the subset of R^N consisting of sequences which converge to 3 (you can check this set is in B^N) has μ-measure 1.
Somewhat surprisingly, at this point we cease to be able to do all our work in the language of measure
theory alone; we have to bring in some topology.
R^N and I^N carry natural topologies: the product topology on R^N is generated by the sets of the form U × R × R × ⋯ where U ⊂ R^n for some n and U is open.¹ A better way to understand this topology is via convergent sequences: a sequence x_1, x_2, … ∈ R^N converges to some x with respect to the product topology iff it converges pointwise, i.e. iff for every k we have lim_{n→∞} x_n(k) = x(k). Actually R^N with the product topology is really a metric space, so we can do everything in terms of sequences. The metric is:
d(x, y) = Σ_{k=1}^∞ 2^{−k} (|x(k) − y(k)| ∧ 1).   (10)
A previous version of these notes erroneously stated that every open set was of this form.
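A quick check that d metrizes pointwise convergence (the series converges since each term is at most 2^{−k}): if d(x_n, x) → 0, then for each fixed k, 2^{−k}(|x_n(k) − x(k)| ∧ 1) ≤ d(x_n, x) → 0, so x_n(k) → x(k). Conversely, if x_n(k) → x(k) for every k, then each term of the series tends to 0 and is dominated by 2^{−k}; splitting off a tail Σ_{k>K} 2^{−k} < ε shows d(x_n, x) → 0.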
4. X is complete (every Cauchy sequence converges) and totally bounded (for any ε > 0, we can cover X with finitely many balls of radius ε).
Topology and measure in R and similar spaces interact in interesting ways. Here is one that we will use.
Theorem 10.7. Any probability measure μ on R^n is regular, i.e. for all Borel B and all ε > 0, there exist closed F and open U with F ⊂ B ⊂ U and μ(U \ F) < ε. F can even be taken compact.
Proof. Homework.
Now we have enough tools to prove a big theorem that lets us construct pretty much any conceivable
measure on I N .
Theorem 10.8 (Kolmogorov extension theorem). For each n, suppose μ_n is a probability measure on I^n, and the μ_n are consistent in that for any Borel B ⊂ I^n, we have μ_n(B) = μ_{n+1}(B × I). Then there exists a probability measure μ on I^N (with its product σ-field) such that for each n and each Borel B ⊂ I^n, we have μ_n(B) = μ(B × I × I × ⋯).
One measure to rule them all! Or, in fancier words, a projective limit.
Corollary 10.9. This also works if we replace I by R.
Proof. We first observe that we could do it for (0, 1). Any probability measure ν_n on (0, 1)^n extends to one ν̄_n on [0, 1]^n in the obvious way (don't put any mass at the edges; or to say that in a fancier way, push the measure forward under the inclusion map). It is easy to see that if {ν_n} is a consistent sequence then so is {ν̄_n}, and the Kolmogorov extension theorem gives us a probability measure ν̄ on [0, 1]^N. One can now check that (0, 1)^N is a Borel subset of [0, 1]^N, the restriction of ν̄ to (0, 1)^N is a probability measure, and it interacts with the ν_n as desired.
But R is homeomorphic to (0, 1) so we can push measures back and forth between them (under a homeomorphism and its inverse) at will.
Actually this would still work if we replaced (0, 1) by any Borel subset B ⊂ [0, 1], and R by any measurable space X for which there is a measurable φ : X → B with a measurable inverse. It turns out that this can be done, in particular, whenever X is any Polish space (a complete separable metric space, or any topological space homeomorphic to one) with its Borel σ-algebra. So [0, 1] is universal in that sense; in fact any uncountable Polish space will do here. We chose [0, 1] mainly because it is compact.
Corollary 10.10. Given any sequence {μ_n} of consistent measures on R^n, there exists a probability space (Ω, F, P) and a sequence of random variables X_1, X_2, … defined on Ω such that (X_1, …, X_n) ∼ μ_n.
Proof. Let μ be the measure on R^N produced by the Kolmogorov extension theorem. Take the probability space (Ω, F, P) = (R^N, B^N, μ), and for each n let X_n : R^N → R be the projection map X_n(x) = x(n). It is simple to check that each X_n is measurable (hence a random variable on (Ω, F, P)) and that the joint distribution of (X_1, …, X_n) is μ_n.
Corollary 10.11. If μ_1, μ_2, … are probability measures on R, there exists a probability space with a sequence of independent random variables X_1, X_2, … such that X_n ∼ μ_n.
Proof. Let ν_n = μ_1 × ⋯ × μ_n. This is a consistent family by definition of product measure. By the previous corollary we can find X_1, X_2, … with (X_1, …, X_n) ∼ ν_n = μ_1 × ⋯ × μ_n. This means X_1, …, X_n are independent, for any n. But the definition of independence only takes finitely many objects at a time, so in fact this means the entire sequence X_1, X_2, … is independent.
In other words, infinite product measure exists. (This actually holds in more generality than Kolmogorov's theorem.) It is then clear that a sequence of random variables is independent iff their joint distribution is an infinite product measure.
Now let's prove Kolmogorov's extension theorem.
Proof. We are going to use Carathéodory's extension theorem to produce the measure μ; it's the only nontrivial tool we have to construct measures.
Let A be the algebra of all cylinder sets, i.e. those of the form B × I × ⋯, B ⊂ I^n for some n. Define μ on A in the obvious way: μ(B × I × ⋯) = μ_n(B). We have to check this is well defined: for example, we could also write B × I × I × ⋯ as (B × I) × I × ⋯, in which case our definition says its measure should be μ_{n+1}(B × I). But by our consistency condition this is the same value. By induction we can see we get the same value no matter how we express the cylinder set.
It is not hard to see that μ is finitely additive on A. Suppose that A_1, A_2 are two disjoint cylinder sets, where A_1 = B_1 × I × ⋯ for some Borel B_1 ⊂ I^{n_1} and A_2 = B_2 × I × ⋯ for some Borel B_2 ⊂ I^{n_2}. Without loss of generality, assume n_2 ≤ n_1; then we can rewrite A_2 as B′_2 × I × ⋯ where B′_2 = B_2 × I^{n_1 − n_2} ⊂ I^{n_1}. If A_1, A_2 are disjoint then so are B_1 and B′_2, and we have A_1 ∪ A_2 = (B_1 ∪ B′_2) × I × ⋯ . So we have
μ(A_1 ∪ A_2) = μ_{n_1}(B_1 ∪ B′_2) = μ_{n_1}(B_1) + μ_{n_1}(B′_2).
But μ_{n_1}(B_1) = μ(A_1) by definition of μ, and by consistency we have μ_{n_1}(B′_2) = μ_{n_2}(B_2) = μ(A_2). So we have finite additivity.
For countable additivity, suppose that A_1, A_2, … ∈ A are disjoint and that A := ∪_{n=1}^∞ A_n ∈ A as well; we want to show Σ_{n=1}^∞ μ(A_n) = μ(A). It suffices to consider the case A = I^N. (If this is shown, then for any other A we may take A_0 = A^c and see that 1 = μ(I^N) = μ(A^c) + Σ_{n=1}^∞ μ(A_n), but by finite additivity μ(A^c) = 1 − μ(A).) One inequality is easy: finite additivity gives Σ_{n=1}^N μ(A_n) ≤ 1, so the same holds in the limit.
For the reverse inequality, fix ε > 0. For each n we can write A_n = B_n × I × ⋯ for some Borel B_n ⊂ I^n. As mentioned above, μ_n is a regular measure, so there is an open U_n ⊃ B_n with
μ_n(U_n) ≤ μ_n(B_n) + ε 2^{−n}.   (11)
Set V_n = U_n × I × ⋯; then V_n is an open subset of I^N and A_n ⊂ V_n. Since ∪_n A_n = I^N we also have ∪_n V_n = I^N. So {V_n} is an open cover of I^N. By compactness there is a finite subcover, say {V_1, …, V_N}, which is to say that V_1 ∪ ⋯ ∪ V_N = I^N. By finite (sub)additivity we must have μ(V_1) + ⋯ + μ(V_N) ≥ 1. But then
1 ≤ Σ_{n=1}^N μ(V_n) = Σ_{n=1}^N μ_n(U_n) ≤ Σ_{n=1}^N (μ_n(B_n) + ε 2^{−n}) = Σ_{n=1}^N (μ(A_n) + ε 2^{−n}) ≤ Σ_{n=1}^∞ μ(A_n) + ε.
Since ε was arbitrary, Σ_{n=1}^∞ μ(A_n) ≥ 1.
Remark 10.12. One can also prove Kolmogorovs extension theorem for uncountable products of R with
itself (i.e. the space of real-valued functions on an arbitrary uncountable set). (R can again be replaced by
any standard Borel space.) However this is more complicated and less useful. It's more complicated because we have to use the full version of Tychonoff's theorem on the compactness of uncountable products, and this requires Zorn's lemma, ultrafilters, etc.; it is much more abstract and nonconstructive. Also, uncountable product space is a nasty measurable space; it has in some sense too many measurable sets (the product σ-field is not generated by any countable collection) and not enough (for instance, finite sets are not measurable). And every practical theorem in probability I've ever seen can be done with just the countable version.
One might think the uncountable version of Kolmogorov would be useful when trying to construct uncountable families of random variables, such as when constructing continuous-time stochastic processes like Brownian motion. Indeed, some textbooks actually take this approach (see for instance Karatzas and Shreve). But when you do this it turns out the random variables you constructed don't actually do everything you wanted, and you have to fix them (this is the idea of a "modification" of a process), and your life is complicated by the nastiness of using uncountable product space as a probability space. All this trouble can be avoided by first constructing a countable family of random variables with countable Kolmogorov, and then describing the rest of the family in terms of them.
We'll see more applications of Kolmogorov's extension theorem when we construct stochastic processes such as Markov chains and Brownian motion.
Tuesday, September 26
11 Weak convergence
Our next result will be the central limit theorem. The SLLN tells us approximately how large S_n is; it's about nE[X_1], to first order. The central limit theorem tells us about its distribution, or in some sense its shape: exactly how likely it is to be far from its expected value. Specifically, it says that for large n, the distribution of S_n is approximately normal.
The word approximately in the previous sentence needs to be interpreted in terms of some notion of
convergence. This is so-called weak convergence, which we will now discuss. Since we are interested in
distributions (i.e. probability measures on R or Rn ) being close, weak convergence is fundamentally a notion
of convergence of measures.
My goal here is to take a somewhat different approach from Durrett, that places a bit less emphasis on
working on R, and uses techniques that, where feasible, apply more generally.
Notation 11.1. C_b(R^d) denotes the set of all bounded continuous f : R^d → R. C_c(R^d) is the set of all continuous f : R^d → R that have compact support.
Definition 11.2. Let μ_1, μ_2, …, μ be probability measures on R^d. We say μ_n → μ weakly if, for every bounded continuous f : R^d → R, we have ∫ f dμ_n → ∫ f dμ. (Durrett would write μ_n ⇒ μ.) [This definition makes sense if R^d is replaced by any topological space X equipped with its Borel σ-field.]
Note: Durrett defines weak convergence in terms of the distribution functions F_n(x) = μ_n((−∞, x]). We will see later that both definitions are equivalent. I prefer the definition in terms of bounded continuous functions because it seems cleaner and it is not tied to the structure of R, so it is more general.
Example 11.3. For x ∈ R^d, let δ_x be the Dirac delta measure which puts one unit of mass at x. If x_1, x_2, … ∈ R^d, we have δ_{x_n} → δ_x weakly if and only if x_n → x.
Since ∫ f dδ_x = f(x), we have δ_{x_n} → δ_x iff f(x_n) → f(x) for every bounded continuous f. If x_n → x, this is immediate from continuity. Conversely, if x_n does not converge to x then there is a neighborhood U of x such that x_n ∉ U infinitely often. Construct a continuous bump function f which is 1 at x and 0 outside U. Then the sequence {f(x_n)} is 0 infinitely often and cannot converge to f(x) = 1.
Remark 11.4. If you were guessing what would be a good notion of convergence for probability measures, perhaps the most obvious guess would be to require that μ_n(B) → μ(B) for every measurable set B. This would not satisfy the above property, since if we take B = {x} and choose a sequence x_n → x with x_n ≠ x, we would have δ_{x_n}(B) = 0 for all n while δ_x(B) = 1. This notion of convergence fails to respect the topological structure on R^d; it doesn't know that nearby points of R^d are close. Likewise, requiring that ∫ f dμ_n → ∫ f dμ for all bounded measurable f would be problematic for the same reason.
A useful fact:
Lemma 11.5. If U ⊂ R^d is open, there is a sequence f_k of nonnegative bounded continuous functions increasing to 1_U (i.e. f_k = 0 outside U and f_k ↑ 1 inside U).
Proof. Notice that U^c is closed, so d(x, U^c) = inf{d(x, y) : y ∈ U^c} is continuous in x. Then just take f_k(x) = (k d(x, U^c)) ∧ 1. [This works on any metric space.²]
Corollary 11.6. If μ, ν are two probability measures on R^d and ∫ f dμ = ∫ f dν for all bounded continuous f, then μ = ν.
Proof. Fix an open set U and choose continuous f_k ↑ 1_U. Then by monotone convergence we get
μ(U) = ∫ 1_U dμ = lim ∫ f_k dμ = lim ∫ f_k dν = ν(U).
Since the open sets are a π-system which generates the Borel σ-field, we must have μ = ν.
Actually it is sufficient that ∫ f dμ = ∫ f dν for all continuous, compactly supported f, since using such f we can approximate the indicator of any bounded open set U, and the bounded open sets are also a π-system generating the Borel σ-field. [This works in any locally compact, separable metric space.]
Corollary 11.7. Weak limits are unique; if μ_n → μ weakly and μ_n → ν weakly, then μ = ν.
Theorem 11.8. If ∫ f dμ_n → ∫ f dμ for all compactly supported continuous f, then μ_n → μ weakly, i.e. the same holds for all bounded continuous f. (The converse is trivial.) (Homework? Presentation?)
We can also think of weak convergence as a mode of convergence for random variables: we say X_n → X weakly if their distributions converge weakly, i.e. μ_n → μ where X_n ∼ μ_n, X ∼ μ. An equivalent statement, thanks to the change-of-variables theorem: X_n → X weakly iff for every bounded continuous f : R → R, we have E f(X_n) → E f(X).
This idea can be a bit misleading: since weak convergence is really inherently a property of measures, not random variables, it can't tell the difference between random variables that have the same distribution. For example, if X_n → X weakly, and X has the same distribution as Y, then we also have X_n → Y weakly, even though we may not have X = Y a.s. (For instance, X and Y could be independent.) So weak limits are not unique as random variables; they are only unique up to distribution. Moreover, we can even talk about weak convergence of a sequence of random variables which are defined on completely different probability spaces! In this case,
² I erroneously said in class that it also works for any completely regular space, i.e. whenever Urysohn's lemma holds, and in particular on locally compact Hausdorff spaces. This is not true, and the uncountable ordinal space ω_1 + 1 is a counterexample; you cannot approximate 1_{{ω_1}} by a sequence of continuous functions. I think the right topological condition for this is "perfectly normal."
statements like "almost sure convergence" have no meaning, because you can't compare functions that are defined on different spaces.
You'll prove a few properties of weak convergence in your homework. An important one: if X_n → X in probability then X_n → X weakly. So this is the weakest mode of convergence yet.
Thursday, September 28
Here are some other equivalent characterizations of weak convergence:
Theorem 11.9 (Portmanteau theorem, named after the suitcase). Let μ_1, μ_2, …, μ be probability measures on R^d. The following are equivalent:
1. μ_n → μ weakly;
2. For every open U ⊂ R^d, μ(U) ≤ lim inf μ_n(U);
3. For every closed F ⊂ R^d, μ(F) ≥ lim sup μ_n(F);
4. For every Borel B ⊂ R^d satisfying μ(∂B) = 0, we have μ_n(B) → μ(B).
To understand why the inequalities are as they are, think again of our example of taking μ_n = δ_{x_n}, μ = δ_x where x_n → x in R^d, where we are just pushing a point mass around. If x_n ∈ U for every n, we could have x ∉ U if we push the mass to the boundary of U. Then the mass inside U decreases in the limit. But if x ∈ U we have to have infinitely many x_n ∈ U, so the mass inside U was already 1. That is, in a weak limit, an open set could lose mass (if it gets pushed to its boundary) but it cannot gain mass. Conversely, a closed set F can gain mass (if mass from outside F reaches the boundary of F in the limit). Statement 4 says that the only way to have a drastic change in the mass inside a set B at the limit is to have some mass wind up on the boundary of B.
Proof. (1 implies 2): Suppose μ_n → μ weakly and U is open. Choose bounded continuous f_k ↑ 1_U as in our remark. Then for each k and each n we have ∫ f_k dμ_n ≤ μ_n(U). Letting n → ∞ and using weak convergence, ∫ f_k dμ ≤ lim inf_n μ_n(U); letting k → ∞ and using monotone convergence, μ(U) ≤ lim inf_n μ_n(U).
(2 and 3 are equivalent by taking complements.)
(2 and 3 imply 4): Let B be Borel with μ(∂B) = 0, and write B° for its interior and B̄ for its closure, so that μ(B°) = μ(B) = μ(B̄). Then
μ(B°) ≤ lim inf μ_n(B°) ≤ lim inf μ_n(B) ≤ lim sup μ_n(B) ≤ lim sup μ_n(B̄) ≤ μ(B̄).
The first and last are equal, so equality must hold throughout. In particular lim sup μ_n(B) = lim inf μ_n(B) = μ(B).
(4 implies 1): Recall in HW 1 you showed that for a nonnegative random variable X, we have EX = ∫_0^∞ P(X ≥ t) dt. Recasting this in the case where our probability space is R, it says that for any probability measure μ on R and any nonnegative measurable f, we have
∫ f dμ = ∫_0^∞ μ({f ≥ t}) dt,   (12)
which thanks to (12) gives us exactly what we want. Finally we can get rid of the assumption that f is nonnegative by considering f⁺ and f⁻.
[The proof works in any metric space.]
For probability measures on R there is a nice characterization of weak convergence in terms of distribution functions. Recall the distribution function F of a measure μ is defined by F(x) = μ((−∞, x]); F is nondecreasing and right continuous.
Theorem 11.10. Let μ_1, μ_2, …, μ be probability measures on R with distribution functions F_n, F. Then μ_n → μ weakly if and only if, for every x ∈ R such that F is continuous at x, we have F_n(x) → F(x). That is, F_n → F pointwise, except possibly at points where F is discontinuous. These correspond to point masses of the measure μ.
Proof. One direction is easy. Suppose μ_n → μ weakly, and let x ∈ R. If F is continuous at x, this means μ({x}) = 0. Since (−∞, x] is a Borel set whose boundary is {x}, by Portmanteau part 4 we have μ_n((−∞, x]) → μ((−∞, x]), which is to say F_n(x) → F(x).
Conversely, suppose the distribution functions F_n converge as described. We will verify Portmanteau part 2.
Let C be the set of points at which F is continuous. Note that F can have at most countably many discontinuities (since μ can have at most countably many point masses, as they are disjoint sets of positive measure), so C is co-countable and in particular dense. Let D be a countable dense subset of C (e.g. choose one element of C in each rational interval).
If a, b ∈ D then we have μ_n((a, b]) = F_n(b) − F_n(a) → F(b) − F(a) = μ((a, b]). Likewise, if A is a finite disjoint union of intervals of the form (a, b] with a, b ∈ D, we also have μ_n(A) → μ(A). Actually saying "disjoint" there was redundant, because any finite union of such intervals can be written as a disjoint union (if two intervals overlap, merge them into a single interval). So let A denote the class of all such finite unions; note that A is countable.
I claim any open set U can be written as a countable increasing union of sets of A. If x ∈ U we can find an interval (a_x, b_x], with a_x, b_x ∈ D, containing x and contained in U. The union over all x of such intervals must equal U, and since D is countable it is really a countable union. Then we can write the countable union as an increasing union of finite unions.
So we can find A_n ∈ A with A_n ↑ U. In particular, by continuity from below, for any ε we can find A ∈ A with A ⊂ U and μ(A) ≥ μ(U) − ε. Now for each n, μ_n(U) ≥ μ_n(A), so taking the liminf we have
lim inf_n μ_n(U) ≥ lim inf_n μ_n(A) = μ(A) ≥ μ(U) − ε.
Since ε was arbitrary, lim inf_n μ_n(U) ≥ μ(U), verifying part 2.
First proof. Consider the distribution functions F_1, F_2, … . Enumerate the rationals in [0, 1] as q_1, q_2, … . If we set x_n(i) = F_n(q_i) then {x_n} is a sequence in [0, 1]^N; by Tychonoff's theorem it has a subsequence x_{n_k} converging to some x ∈ [0, 1]^N. That is to say, lim_k F_{n_k}(q_i) = x(i) for each i.
We have to turn x into an F which will be the distribution function of the limit. F needs to be nondecreasing and right continuous, so let
F(t) = inf{x(i) : q_i > t}
for t < 1, and F(1) = 1. This is clearly nondecreasing. To see it is right continuous, suppose t_m ↓ t. For each rational q > t we can find a t_m with t < t_m < q, thus F(t_m) ≤ F(q). Taking the limit in m, lim F(t_m) ≤ F(q). Taking the infimum over q > t, lim F(t_m) ≤ F(t). The reverse inequality is immediate since F is nondecreasing.
Now suppose F is continuous at t. Fix ε and choose rationals q < t < r with F(r) − F(q) < ε. Now for each k, F_{n_k}(t) ≥ F_{n_k}(q) → F(q), so lim inf_k F_{n_k}(t) ≥ F(q) ≥ F(t) − ε. Similarly, lim sup_k F_{n_k}(t) ≤ F(r) ≤ F(t) + ε. Letting ε → 0 we see that lim_k F_{n_k}(t) = F(t).
A similar proof can be used on [0, 1]^d. It is also possible to extend it to [0, 1]^N using the Kolmogorov extension theorem and more work.
What goes wrong if we try to do this on R instead of [0, 1]? We can still find a nondecreasing, right continuous F such that F_{n_k} → F at points where F is continuous. The only problem is that F may not be a proper distribution function; it may fail to have F(+∞) = 1 and/or F(−∞) = 0 (interpreted via limits). This happens, for instance, with our example of μ_n = δ_n; in some sense we push mass off to infinity. The limiting object could be interpreted as a sub-probability measure, a measure with total mass less than 1, and we would say the μ_n converge vaguely to this degenerate sub-probability measure. (We would have ∫ f dμ_n → ∫ f dμ for all f ∈ C_c(R), but not for all f ∈ C_b(R); consider f = 1 for instance.) But we are really most interested in the case where the limit is actually an honest probability measure, and this is where we would need the condition of tightness.
Second proof, uses functional analysis. C([0, 1]) is a separable Banach space, and the Riesz representation theorem says its dual space C([0, 1])* is the space of all finite signed measures on [0, 1], with the total variation norm. We note that weak-* convergence in C([0, 1])* is exactly the same as what we were calling weak convergence. The unit ball B of C([0, 1])*, in the weak-* topology, is compact (Alaoglu's theorem, really Tychonoff again) and metrizable (it's enough to check convergence on a countable dense subset of C([0, 1]), so in fact B embeds in [0, 1]^N). Thus any sequence of probability measures has a weakly convergent subsequence. It is easy to check the limit μ is a (positive) probability measure by noting that, since ∫ f dμ_n ≥ 0 for f ≥ 0 and ∫ 1 dμ_n = 1, the same properties must hold for μ.
This proof works on any compact metric space.
Remark 11.16. We could apply the Riesz/Alaoglu argument on [0, 1]^N to prove the Kolmogorov extension theorem. If μ_n are measures on [0, 1]^n, extend them to [0, 1]^N in some silly way. For example, let ν_n = μ_n × δ_0, where δ_0 is a Dirac mass at (0, 0, . . .) on the remaining coordinates. Now some subsequence ν_{n_k} converges weakly to some measure ν. I claim ν is the measure desired in the Kolmogorov extension theorem. Let V = U × I × I × · · · be an open cylinder set, where U ⊂ I^n is open. We have ν_m(V) = μ_n(U) for all m ≥ n, so by weak convergence (V is open) ν(V) ≤ μ_n(U). Now since I^n is a metric space and U^c is closed, we can find open sets U_i ↓ U^c. If we set V_i = U_i × I × · · · we also have ν(V_i) ≤ μ_n(U_i). Passing to the limit, ν(V^c) ≤ μ_n(U^c), i.e. ν(V) ≥ μ_n(U), so in fact ν(V) = μ_n(U). By a π–λ argument the same holds for all cylinder sets, and ν has the desired property.
Now we can prove Prohorovs theorem for R.
Proof. We will actually prove it for (0, 1), which is homeomorphic to R. (Here the issue is not pushing mass off to infinity, but pushing it off the edge of the interval toward 0 or 1.) If we have probability measures μ_n on (0, 1) then they extend in the obvious way to probability measures on [0, 1], and there is a subsequence converging weakly (with respect to [0, 1]) to a measure μ on [0, 1], which we can then restrict to a measure on (0, 1) again.
There are two issues. First, we have to check that μ((0, 1)) = 1 (conceivably μ could put some of its mass on the endpoints). Second, we only have weak convergence with respect to [0, 1], which is to say ∫ f dμ_n → ∫ f dμ for all f which are bounded and continuous on [0, 1]. We need to know it holds for f which are merely bounded and continuous on (0, 1), which is a larger class (consider for instance f(x) = sin(1/x)).
So we need to use tightness. Fix ε > 0; there is a compact K ⊂ (0, 1) with μ_n(K) ≥ 1 − ε for all n. K is also compact and hence closed in [0, 1], so by Portmanteau 3, we have μ((0, 1)) ≥ μ(K) ≥ lim sup μ_n(K) ≥ 1 − ε. Letting ε → 0 we see μ((0, 1)) = 1.
Next, suppose f is bounded and continuous on (0, 1); say |f| ≤ C. K must be contained in some closed interval [a, b] ⊂ (0, 1). Let us modify f outside [a, b] to get a new function f̃ that is continuous up to [0, 1]; for instance, let f̃ = f(a) on all of [0, a] and f̃ = f(b) on [b, 1]. (In a more general setting we would use the Tietze extension theorem here.) We now have ∫ f̃ dμ_n → ∫ f̃ dμ. But on the other hand, for each n we have
|∫ f dμ_n − ∫ f̃ dμ_n| ≤ ∫ |f − f̃| dμ_n ≤ 2C(1 − μ_n(K)) ≤ 2Cε
since f and f̃ agree inside K, and are each bounded by C outside K. The same goes for μ. So we must have ∫ f dμ_n → ∫ f dμ and we have shown μ_n → μ weakly.
Remark 11.17. This proof would work if we replaced R by any space homeomorphic to a Borel subset of [0, 1] (or of [0, 1]^d or [0, 1]^N if we used a fancier proof of Helly, or of any other compact metric space if we use Riesz). It turns out that every complete separable metric space is homeomorphic to a Borel subset of [0, 1]^N, so this proof applies to all Polish spaces. Actually with a bit more work it can be proved for any metric space at all; again see Billingsley.
Example 11.18. For the central limit theorem, we want to examine the weak convergence of S_n/√n. (We assume for simplicity that X_1 has mean 0 and variance 1, which can be accomplished by looking at (X − EX)/√Var(X).) Notice that Var(S_n/√n) = 1 for all n. So by Chebyshev, P(|S_n|/√n ≥ a) ≤ 1/a². Given any ε > 0, choose a so large that 1/a² < ε; then if we set K = [−a, a], we have for all n that P(S_n/√n ∈ K) ≥ 1 − ε, so the distributions of the S_n/√n are tight.
Theorem 11.20 (Skorohod representation theorem). Suppose μ_n, μ are probability measures on R^d and μ_n → μ weakly. There exists a probability space (Ω̃, F̃, P̃) and random variables X̃_n, X̃ defined on it, with X̃_n distributed as μ_n and X̃ distributed as μ, such that X̃_n → X̃ P̃-almost surely.
It is essential to keep in mind that the theorem only guarantees that the individual distributions of X_n and X̃_n are the same; in general their joint distributions will be different. In particular, any independence that may hold among the X_n in general will not hold for the X̃_n.
I will only sketch the proof here; see Durrett Theorem 3.2.2 for more details. The proof is based on the
following fact, which is Durretts Theorem 1.2.2.
Lemma 11.21. Suppose μ is a probability measure on R with distribution function F. Define the inverse of F by
F^{−1}(t) = sup{x : F(x) < t}.
(This definition ensures that F^{−1} : (0, 1) → R is everywhere defined and nondecreasing.) If U ∼ U(0, 1) is a uniform random variable, then F^{−1}(U) ∼ μ.
Now to prove Skorohod, let F_n, F be the distribution functions of μ_n, μ. We know that F_n(x) → F(x) at all points x where F is continuous. Essentially by turning our head sideways, we can show that we also have F_n^{−1}(t) → F^{−1}(t) at all points t where F^{−1} is continuous. In particular, F_n^{−1} → F^{−1} almost everywhere on (0, 1). So if U is a single uniform U(0, 1) random variable defined on some probability space (Ω̃, F̃, P̃) (for example, we could use (0, 1) with Lebesgue measure and take U(ω) = ω), we can set X̃_n = F_n^{−1}(U), X̃ = F^{−1}(U). By the previous lemma, the distributions of X̃_n, X̃ are as desired, and since F_n^{−1} → F^{−1} almost everywhere, we have X̃_n → X̃ almost surely.
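A quick numerical illustration of Lemma 11.21 (not from lecture; the exponential distribution and the helper name F_inv are my own choices): sampling by plugging uniforms into the inverse distribution function.

import random
import math

def F_inv(t, rate=1.0):
    # Inverse of the Exp(rate) distribution function F(x) = 1 - exp(-rate*x).
    # Illustration only; any distribution function would do.
    return -math.log(1.0 - t) / rate

random.seed(0)
# Generate samples as F^{-1}(U) with U uniform on (0,1), as in Lemma 11.21.
samples = [F_inv(random.random()) for _ in range(100000)]

# Sanity check: the empirical mean should be close to 1/rate = 1.
print(sum(samples) / len(samples))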
12 Characteristic functions
A very powerful way to describe probability measures on Rd is via their Fourier transforms, also called
characteristic functions.
Definition 12.1. Let μ be a probability measure on R. The Fourier transform or characteristic function or chf of μ is the function φ = φ_μ : R → C defined by
φ(t) = ∫_R e^{itx} μ(dx).
Note that φ(0) = 1, |φ(t)| ≤ 1, and |φ(t + h) − φ(t)| ≤ ∫ |e^{ihx} − 1| μ(dx) → 0 as h → 0 by dominated convergence, so φ is uniformly continuous.
Example 12.2. For a normal distribution with mean c and variance σ² (for σ = 1 the density is (1/√(2π)) e^{−(x−c)²/2}), the chf is φ(t) = e^{ict − σ²t²/2}.
Example 12.3. For many other examples of computing characteristic functions, see Durrett.
The more moments a measure has, the smoother its characteristic function.
Proposition 12.4 (Durrett Exercise 3.3.14). If ∫ |x|^n μ(dx) < ∞, then φ is C^n and its nth derivative is given by φ^(n)(t) = ∫ (ix)^n e^{itx} μ(dx). (Presentation)
For a random variable, this says:
Corollary 12.5. If E|X|^n < ∞ then φ_X is C^n and its nth derivative is φ_X^(n)(t) = E[(iX)^n e^{itX}].
When you add two independent random variables, their distributions convolve, which is a bit messy. But
their chfs multiply:
Proposition 12.6. If X, Y are independent, then φ_{X+Y} = φ_X · φ_Y.
Proof.
E[eit(X+Y) ] = E[eitX eitY ] = E[eitX ]E[eitY ]
since eitX , eitY are independent random variables for each t.
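A minimal Monte Carlo check of Proposition 12.6 (my own illustration; the particular distributions, sample size, and t are arbitrary): estimate the chf of X + Y and compare it to the product of the individual chfs.

import cmath
import random

def chf_estimate(samples, t):
    # Empirical characteristic function: average of e^{i t x} over the samples.
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

random.seed(0)
N, t = 100000, 0.7
X = [random.gauss(0, 1) for _ in range(N)]             # X ~ N(0,1)
Y = [random.expovariate(1.0) - 1.0 for _ in range(N)]  # Y = Exp(1) - 1, independent of X

lhs = chf_estimate([x + y for x, y in zip(X, Y)], t)   # chf of X + Y
rhs = chf_estimate(X, t) * chf_estimate(Y, t)          # product of the two chfs
print(abs(lhs - rhs))  # should be small (Monte Carlo error only)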
Thursday, October 11
Durrett's development is completely in terms of probability. This makes it more self-contained but I think it loses the connection with the wider world of Fourier theory. I'll use it more explicitly.
Definition 12.7. If f : R → C is integrable, we define its Fourier transform as
f̂(t) = ∫_R e^{itx} f(x) dx.
Using t as the argument of f is perhaps wrong because it should really be a variable in the frequency
domain, not the time domain. However everyone in probability seems to use t for the Fourier transform
variable.
A nice class of functions to work with when doing Fourier analysis are the Schwartz functions:
Definition 12.8. The Schwartz class S consists of all functions f : R → C which are C^∞ and such that, for every n, k, |x|^n f^(k)(x) is bounded. So f and all its derivatives decay at infinity faster than any polynomial; in particular they are integrable. Examples: any C^∞ function with compact support, e^{−x²}.
Theorem 12.9. If f ∈ S then f̂ ∈ S.
Proof. This follows from two basic facts about the Fourier transform:
For f ∈ S we can recover f from f̂ by the Fourier inversion formula f(x) = (1/2π) ∫_R e^{−itx} f̂(t) dt. Then, by Fubini,
∫_R f dμ = (1/2π) ∫_R ∫_R f̂(t) e^{−itx} μ(dx) dt = (1/2π) ∫_R f̂(t) φ(−t) dt.
The same holds for ν. So we can conclude that ∫ f dμ = ∫ f dν for all f ∈ S, which by the previous lemma shows μ = ν.
It is actually possible to invert the Fourier transform of a measure more explicitly. If the measure μ has a density f with respect to Lebesgue measure and its chf φ is integrable, you will just have f(x) = (1/2π) ∫ e^{−itx} φ(t) dt. If μ is not absolutely continuous with respect to Lebesgue measure then φ may not be integrable, and the inversion has to be interpreted more carefully.
Theorem 12.13. Let μ_n be a sequence of probability measures on R with characteristic functions φ_n. Suppose that φ_n(t) converges pointwise to some function φ(t), and further suppose that {μ_n} is tight. Then μ_n converges weakly to some probability measure μ, whose characteristic function is φ.
Example 12.14. To see we need something besides just pointwise convergence of chfs, let μ_n be a normal distribution with mean 0 and variance n, so that φ_n(t) = e^{−nt²/2}. Then φ_n(t) converges pointwise, but the limit is 1_{{0}}, which cannot be the chf of any measure because it is not continuous. This shows that the sequence {μ_n} does not converge weakly, and indeed is not tight.
Proof. By tightness, there is a subsequence μ_{n_k} converging weakly to some probability measure μ. Thus φ_{n_k} → φ_μ pointwise, so we must have φ_μ = φ. If there is another convergent subsequence μ_{n'_k} converging to some other measure μ', the same argument shows that φ_{μ'} = φ, whence from the previous theorem we have μ' = μ. So in fact every convergent subsequence of {μ_n} converges to μ.
It will follow, using a double subsequence trick, that the entire sequence converges to μ. Suppose it does not. Then there is a bounded continuous f such that ∫ f dμ_n does not converge to ∫ f dμ. We can then find an ε > 0 and a subsequence m_k so that |∫ f dμ_{m_k} − ∫ f dμ| > ε for all k. But {μ_{m_k}} is itself tight, so it has a further subsequence μ_{m_{k_j}} converging weakly, and as argued above the limit must be μ. Then ∫ f dμ_{m_{k_j}} → ∫ f dμ, which is a contradiction.
Combined with a little calculus, we can prove the central limit theorem.
Theorem 12.15 (Central limit theorem). Let X_1, X_2, . . . be iid L² random variables. Set S_n = X_1 + · · · + X_n. Then
(S_n − nE[X_1]) / √(n Var(X_1)) → N(0, 1) weakly,
where N(0, 1) is the standard normal distribution, i.e. the measure μ defined by dμ = (1/√(2π)) e^{−x²/2} dx.
Proof. By replacing X_n by (X_n − E[X_n])/√Var(X_n), we can assume without loss of generality that E[X_n] = 0 and Var(X_n) = 1, so we just have to show S_n/√n → N(0, 1) weakly. We have already shown the sequence is tight, so by the previous theorem it suffices to show the chfs converge to the chf of the normal distribution, e^{−t²/2}.
If φ is the chf of X_n, then the chf of S_n/√n is given by φ_n(t) = φ(t/√n)^n. Since E|X_n|² < ∞, φ is C² by Proposition 12.4; we have φ(0) = 1, φ'(0) = E[iX] = 0, and φ''(0) = −E[X²] = −1. Then Taylor's theorem tells us that
φ(t) = 1 − t²/2 + o(t²),
or in other words, φ(t) = 1 − (1/2)t² + ψ(t) where ψ(t)/t² → 0 as t → 0. Thus
φ_n(t) = (1 − t²/(2n) + ψ(t/√n))^n.
We just have to compute the limit as n → ∞; we are only asking for pointwise convergence so we can treat t as fixed. The rest of the proof is nothing but calculus and could be assigned to a determined Math 1120 student. Notice if the ψ term were not there, we would just have the classic limit lim_{n→∞} (1 + x/n)^n = e^x with x = −t²/2. So we just have to show the ψ term can be neglected.
Set a_n = −t²/(2n) + ψ(t/√n). I claim na_n → −t²/2 as n → ∞, because if we let b_n = t/√n and note that b_n → 0, we have nψ(b_n) = t² ψ(b_n)/b_n² → 0. So now we just need to show: if na_n → a then (1 + a_n)^n → e^a.
Remark 12.17. Durrett goes to more trouble in estimating the remainder term (which I have called ψ(t)); see his Lemma 3.3.7. He makes it sound like this is necessary to get the CLT for L² random variables, but unless I am really missing something, this is not so. The plain ordinary Taylor theorem says that φ(t) = φ(0) + tφ'(0) + (1/2)t²φ''(0) + o(t²) (the Peano remainder) provided only that φ''(0) exists, which we know to be the case by Proposition 12.4; indeed, we know φ ∈ C²(R).
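A simulation sketch of the CLT (my own, with an arbitrary choice of increment distribution and sample sizes): standardized sums of centered Exp(1) variables should look approximately N(0, 1).

import random
import math

def standardized_sum(n):
    # S_n / sqrt(n) for centered Exp(1) increments (mean 0, variance 1).
    xs = [random.expovariate(1.0) - 1.0 for _ in range(n)]
    return sum(xs) / math.sqrt(n)

random.seed(0)
vals = [standardized_sum(500) for _ in range(20000)]

# Compare the empirical P(S_n/sqrt(n) <= 1) with Phi(1) ~ 0.8413 for the standard normal.
frac = sum(v <= 1.0 for v in vals) / len(vals)
print(frac, 0.5 * (1 + math.erf(1 / math.sqrt(2))))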
Theorem 12.13 requires convergence of chfs as well as tightness. For the CLT we were able to show the tightness of S_n/√n directly using Chebyshev (see Example 11.18). A nifty theorem due to Lévy says you can get tightness by looking at what the chfs converge to.
Theorem 12.18 (Lévy continuity theorem). Let μ_n be a sequence of probability measures with chfs φ_n. Suppose that φ_n(t) converges pointwise to some function φ(t). If φ is continuous at t = 0, then {μ_n} is tight. It then follows from Theorem 12.13 that μ_n converges weakly to the measure whose chf is φ.
Note the requirement that the limit φ is continuous at 0 is enough to exclude the situation of Example 12.14.
Proof. Let's start with what we know: the continuity of φ at 0. Since φ(0) = lim φ_n(0) = 1, fix an ε > 0 and choose δ > 0 such that |φ(t) − 1| < ε for all |t| < δ. If we had the φ_n converging to φ uniformly, we could say something similar was true for the φ_n; but we don't have uniform convergence, only pointwise.
So let's average instead. If we average φ over (−δ, δ), we will get something close to 1:
|(1/2δ) ∫_{−δ}^{δ} φ(t) dt − 1| ≤ (1/2δ) ∫_{−δ}^{δ} |φ(t) − 1| dt < ε.
But by dominated convergence, (1/2δ) ∫_{−δ}^{δ} φ_n(t) dt → (1/2δ) ∫_{−δ}^{δ} φ(t) dt, so for sufficiently large n,
|(1/2δ) ∫_{−δ}^{δ} φ_n(t) dt − 1| < 2ε.
Actually, since φ_n(−t) is the complex conjugate of φ_n(t), the imaginary part of φ_n is an odd function, so the integral in the previous equation is actually real. So we can say
(1/2δ) ∫_{−δ}^{δ} φ_n(t) dt > 1 − 2ε.
Since we will apply this with complex values of a_n, we should choose a branch of the log function whose branch cut stays away from the positive real axis. Since a_n → 0, for sufficiently large n we will avoid the branch cut.
On the other hand,
(1/2δ) ∫_{−δ}^{δ} φ_n(t) dt = (1/2δ) ∫_{−δ}^{δ} ∫_R e^{itx} μ_n(dx) dt
= ∫_R ((1/2δ) ∫_{−δ}^{δ} e^{itx} dt) μ_n(dx)   (Fubini)
= ∫_R (sin(δx)/(δx)) μ_n(dx),
interpreting sin(0)/0 = 1.
13 Stochastic processes
A major focus of probability theory is the study of stochastic processes. The idea is to study random
phenomena that evolve over time. Some examples might include:
The weather
Stock prices
Profits in a gambling scheme (perhaps related to the previous example)
Noise in an electrical circuit
Small particles moving around in a fluid
In this section we are going to study discrete-time stochastic processes, where we model time by the
integers, so that it passes in discrete steps. For some applications this is reasonable: e.g. modeling the
change in weather from day to day, looking at closing prices for a stock, playing a game that proceeds one
play at a time. For other applications (electrical noise, moving particles, etc) it is not so reasonable, and
it is better to work in continuous time, where we model time by the reals. This can give more realistic
models, but it also adds quite a bit of mathematical complexity. Essentially, the issue is that the integers
are countable while the reals are not. Probability deals much better with things that are countable, and
so continuous-time models tend to build in enough continuity to guarantee that one can do everything on
countable dense subsets (such as the rationals).
Definition 13.1. Let (S , S) be any measurable space. An S -valued discrete-time stochastic process is
simply a sequence X0 , X1 , . . . of S -valued random variables (i.e. measurable maps from a probability space
to S ).
We think of a system evolving over time; Xn is what you see when you observe the system at time n.
In general the state space of possible observations could be any measurable space S . For most of our
examples it will be something like Rd .
A sequence of iid random variables {Xn } is technically a stochastic process, but this is not really what
you should think of, because it doesnt evolve: the behavior of X1 , . . . , Xn tells you absolutely nothing
about Xn+1 . In contrast, most interesting real-world events have some dependence. For instance, knowing
the weather today tells us a lot about the weather tomorrow. If it is 80 and sunny today, that makes it much
less likely that it will be 20 and snowing tomorrow (though in Ithaca this is perhaps not so clear).
So the canonical example is:
Example 13.2. Simple random walk. Let {ξ_i} be an iid sequence of coin flips: P(ξ_i = 1) = P(ξ_i = −1) = 1/2. Our stochastic process is X_n = ξ_1 + · · · + ξ_n. Think of X_n as the position of a drunkard on the integers: at each time step he moves one unit right or left, chosen independently of all other choices. Note that the X_n themselves are not independent, since knowing X_n leaves only two possible values for X_{n+1}.
We could also take i to have any other distribution in R, which would lead to a more general random
walk.
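A small simulation of Example 13.2 (my own sketch; the path length is arbitrary):

import random

def simple_random_walk(n_steps):
    # X_n = xi_1 + ... + xi_n with iid xi_i = +1 or -1, each with probability 1/2.
    path = [0]
    for _ in range(n_steps):
        path.append(path[-1] + random.choice([1, -1]))
    return path

random.seed(0)
print(simple_random_walk(20))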
Example 13.3. If you are gambling, and with each (iid) play of the game your net winnings are given by ξ_i, then X_n is your total net profit at time n. For instance, if you are playing roulette and betting $1 on black each time, then P(ξ_i = 1) = 18/38, P(ξ_i = −1) = 20/38. If you are buying lottery tickets, then the distribution of ξ_i is something like P(ξ_i = 70000000) = 1/175711536, P(ξ_i = −1) = 175711535/175711536.
Example 13.4 (Random walk on a group). The notion of sum (or product) of random variables makes sense in any group. If we take as our state space S a group G (equipped with a σ-algebra such that the group operation is measurable), then we can take a sequence of G-valued random variables ξ_1, ξ_2, . . . which are iid and multiply them to get X_n = ξ_1 · · · ξ_n (so the walk starts at the identity: X_0 = e). Then {X_n} is a nice example of a G-valued stochastic process, called a random walk on G. (For simple random walk, G = Z.)
As a concrete example of this, consider shuffling a deck of 52 cards according to the following procedure: choose one of the 52 cards uniformly at random and swap it with the top card. This corresponds to a random walk on the group G = S_52, the symmetric group on 52 elements. (Specifically, S_52 is the set of all bijections of the set {1, . . . , 52}, with composition as the group operation.) If the ξ_i are iid with the distribution P(ξ_i = (1 k)) = 1/52 for k = 1, . . . , 52, then the permutation of the deck after n swaps is X_n = ξ_n · · · ξ_1. (Note that here we are multiplying on the left instead of the right by the usual convention for function composition; since this is a non-abelian group the order matters, but the theory is the same either way.)
An unsurprising result is that Xn converges weakly to the uniform measure on S 52 , i.e. the measure
which assigns the same measure 1/52! to each of the 52! elements of S 52 . That is, this procedure really does
shuffle the deck. But the rate of convergence is important, because it tells you how long it would take before
the deck is mostly random. This topic is often called mixing times.
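A sketch of this shuffle (my own; deck[0] plays the role of the top card, and the number of swaps is arbitrary): repeatedly swap a uniformly chosen card with the top card and watch the deck randomize.

import random

def top_card_shuffle(deck, n_swaps):
    # Random walk on S_52: at each step swap the top card with a uniformly chosen card
    # (possibly the top card itself, which gives the identity permutation).
    deck = list(deck)
    for _ in range(n_swaps):
        k = random.randrange(len(deck))
        deck[0], deck[k] = deck[k], deck[0]
    return deck

random.seed(0)
print(top_card_shuffle(range(52), 400)[:10])  # first 10 cards after 400 swaps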
Example 13.5 (Random walk on a graph). Let G = (V, E) be a locally finite graph. We can imagine a
random walk on G as follows: start at some vertex X0 , and at each step n, let Xn+1 be a uniformly chosen
neighbor of Xn . There is a tricky detail here though: intuitively we would like the choices of neighbors to be
independent. But they cant actually be independent, because the values of X1 , . . . , Xn affect your location
Xn at time n, which affects the possible values for Xn+1 because it must be a neighbor of Xn . What we really
want, it turns out, is that Xn+1 can depend on the vertex Xn but not on how you got to that vertex. This is the
fundamental idea of a Markov chain, which well discuss later.
13.1 Filtrations
As the system evolves, we learn more about it; new information is revealed. Since we have thought about
encoding information in -fields, this leads to the notion of a filtration.
Definition 13.6. A filtration on a probability space (Ω, F, P) is a sequence {F_n} of sub-σ-fields of F which are increasing: F_0 ⊂ F_1 ⊂ · · · ⊂ F. A probability space equipped with a filtration is sometimes called a filtered probability space, i.e. a 4-tuple (Ω, F, {F_n}, P).
Fn is interpreted as the information available at time n; an event A is in Fn if, by time n, we can
determine whether or not it happens. The Fn are increasing, which means that information learned at time n
is remembered from then on.
We sometimes define a last σ-field in a filtration: F_∞ := σ(F_n : n ≥ 0), which contains all information that will ever be revealed. F_∞ is not necessarily equal to F (the probability space could contain additional randomness that we never get to see), but there is often no loss of generality in replacing (Ω, F, P) by the probability space (Ω, F_∞, P).
Example 13.7. For random walk, you could think of the filtration F_n = σ(ξ_1, . . . , ξ_n) generated by the iid sequence ξ_i; the information available at time n is everything we have learned from observing the coin flips so far. (Take F_0 to be the trivial σ-field {∅, Ω} since at time 0 nothing has happened yet.) Then, for instance:
The event that the first coin is heads, {ξ_1 = 1}, is in F_1.
The event that at least four of the first seven flips are heads is in F7 (but not F6 ).
The event that there is ever at least one heads is not in any of the F_n (but it is in F_∞).
We would like the process {Xn } that we are studying to be part of the available information.
Definition 13.8. A stochastic process {X_n} is adapted to a filtration {F_n} if X_n ∈ F_n (i.e. {X_n ∈ B} ∈ F_n for all measurable B ∈ S). Usually the filtration {F_n} is understood (we are working on a filtered probability
space) and we will just say {Xn } is adapted.
Example 13.9. In our running (walking?) random walk example, the random walk X_n = ξ_1 + · · · + ξ_n is adapted.
13.2 Stopping times
A reasonable question about a randomly evolving system is: when does some phenomenon⁵ occur? Since the answer may be random, it should be expressed by a random variable. A random time is just an N ∪ {∞}-valued random variable τ whose value is interpreted as a time. (The event {τ = ∞} is to be interpreted as "the phenomenon never occurs.")
Example 13.10. Suppose X_n is the state of the weather on day n, taking values in the set S = {C, R, S} (clear, rain, snow). Let us think of the filtration F_n = σ(X_1, . . . , X_n). Here are some examples of random times:
1. τ_1 = min{n : X_n = S} is the first day on which it snows. (To convince yourself that τ_1 is measurable, notice that for each n, {τ_1 = n} = {X_1 ≠ S} ∩ · · · ∩ {X_{n−1} ≠ S} ∩ {X_n = S}, and for any Borel set B, {τ_1 ∈ B} = ∪_{n∈B∩N} {τ_1 = n} is a countable union of such sets.) Day τ_1 is a good time to skip class and go sledding.
2. τ_2 = τ_1 − 1 is the day before the first snow. Day τ_2 would be an excellent time to install your snow tires and buy some rock salt.
3. τ_3 = min{n : X_{n−1} = S, X_n = R} is the first time we see rain coming right after snow. Day τ_3 is a good time to buy sandbags because it may flood. (Day τ_3 − 1 would be even better.)
In this example, τ_2 is a bit problematic: since weather forecasting is imperfect, you will not know the date of τ_2 until it has already passed (i.e. on day τ_1, when the snow actually falls, you will know τ_2 was the previous day). So it isn't actually possible to accomplish the plan "install snow tires on day τ_2." To avoid issues like this, we introduce the idea of a stopping time.
Definition 13.11. A random time τ is a stopping time if for each n, we have {τ = n} ∈ F_n.
That is, on day n, using the available information F_n, you can determine whether or not τ happens today.
Proposition 13.12. τ is a stopping time iff for each n we have {τ ≤ n} ∈ F_n.
This says, for a stopping time τ, you can tell on day n whether τ has already happened, and this gives an equivalent definition.
⁵ I keep wanting to use the word "event" here but that's already in use.
Proof. If τ is a stopping time, for each k ≤ n we have {τ = k} ∈ F_k ⊂ F_n. But {τ ≤ n} = ∪_{k=0}^{n} {τ = k}, so {τ ≤ n} ∈ F_n also. Conversely, assume for every n we have {τ ≤ n} ∈ F_n. Then {τ = n} = {τ ≤ n} \ {τ ≤ n − 1}. But {τ ≤ n} ∈ F_n, and {τ ≤ n − 1} ∈ F_{n−1} ⊂ F_n, so their difference is also in F_n.
Example 13.13. In Example 13.10, τ_1 is a stopping time, because {τ_1 ≤ n} = {X_0 = S} ∪ · · · ∪ {X_n = S}. This is a finite union of events from F_n. τ_2 is not a stopping time: for example, {τ_2 = 1} = {X_0 ≠ S, X_1 ≠ S, X_2 = S}, but this event is not in F_1 (unless the weather is way more predictable than we think). We cannot know on day 1 whether or not it is going to snow on day 2. τ_3 is a stopping time because
{τ_3 ≤ n} = ∪_{k=1}^{n} {X_k = R, X_{k−1} = S}
and each {X_k = R, X_{k−1} = S} ∈ F_k ⊂ F_n.
A useful thing to do is to observe an adapted process {X_n} at a stopping time τ. The result is the random variable X_τ. (Note that X, τ are both random variables, so you should think of this as X_{τ(ω)}(ω). If you have measurability worries, you can allay them by writing X_τ = Σ_{n≥0} X_n 1_{{τ=n}}.)
One caution: for this to make sense, we have to know that τ < ∞, at least almost surely, since our stochastic process does not necessarily include an X_∞. One workaround is to add an extra "cemetery" state to the state space S, and define X_τ(ω) to be that state whenever τ(ω) = ∞.
Proposition 13.18. If {X_n} is adapted and τ < ∞ is a stopping time, then X_τ ∈ F_τ.
Proof. If B ∈ S is measurable, we have
{X_τ ∈ B} ∩ {τ = n} = {X_n ∈ B} ∩ {τ = n} ∈ F_n.
For example, if τ = τ_D is the first time that X_n hits a measurable set D, then X_{τ_D} is the point of D that actually gets hit. (And on the event that D is never hit, our above convention says to take X_{τ_D} to be the cemetery state.)
Proposition 13.19. If σ, τ are stopping times with σ ≤ τ, then F_σ ⊂ F_τ.
That is, if τ always happens later than σ, then at time τ you have more information than at time σ.
Proof. Suppose A ∈ F_σ. We have
A ∩ {τ = n} = ∪_{k=0}^{n} (A ∩ {σ = k}) ∩ {τ = n}.
Each A ∩ {σ = k} ∈ F_k ⊂ F_n, and {τ = n} ∈ F_n, so A ∩ {τ = n} ∈ F_n, which shows A ∈ F_τ.
14 Conditional expectation
A big part of the study of stochastic processes is looking at how information accumulates over time, and
what we can do with it. The essential tool here is conditional expectation, which we now develop.
In elementary probability, one studies conditional probability: if we are looking at the probability of
an event B, but we have the additional information that another event A happened, this may let us improve
our estimate of P(B) to the conditional probability P(B | A) := P(B A)/P(A). In some sense, we are
restricting the probability measure P to the event A, and dividing by P(A) to normalize; we only consider
those outcomes that are in A, and ignore the others. To think in terms of information, knowing whether or
not A happened is one bit of information, and we can adjust our estimate of P(B) based on this bit: P(B | A)
if A happened, and P(B | Ac ) if it did not.
Likewise, if we are looking at an (integrable) random variable X and we have no information whatever
about the outcome of our experiment, the best guess we can make as to the value of X is its expectation
E[X]. But if we know whether or not A happened, then we can improve our guess to one of the values E[X | A] = E[X; A]/P(A) or E[X | A^c]. (Recall the notation: E[X; A] = E[X 1_A] = ∫_A X dP is the expectation of X over the event A.)
Now what if we have more than one bit of information? Say we have an entire σ-field worth of information, call it G. How can we describe the way this information lets us improve our estimate of a random variable X? We could record, for each A ∈ G, the value of E[X | A]. One could imagine this as a map (from G to R), but in fact, we can encode all these improved estimates in a single random variable.
Theorem 14.1. Let X be an integrable random variable, and G ⊂ F a σ-field. There exists a random variable Y such that:
1. Y ∈ G, and
2. For every event A ∈ G, we have E[Y; A] = E[X; A].
Moreover, Y is unique up to measure-zero events: if Y' also satisfies items 1, 2 above, then Y = Y' a.s.
Thus Y is something that only depends on the information in G, but it encodes all the conditional expectations E[X; A] for A ∈ G; just compute E[Y; A].
We can also view it as the best approximation we can make to X, given the information in G. This
parallels the idea of (unconditional) expectation being the best approximation we can make to X given no
information about the outcome of the experiment.
Definition 14.2. The conditional expectation of X given G, denoted E[X | G], is the unique random variable
described in Theorem 14.1. If B is an event, we define P(B | G) as the random variable E[1B | G].
The uniqueness part of the proof is elementary. For existence, I dont know a proof which is completely
elementary; all the ones I know involve some deeper theorem or machinery. You dont have to worry too
much about this, because the existence can be used as a black box.
Proof. For uniqueness: suppose Y, Y' satisfy items 1, 2. Then for any A ∈ G we have E[Y − Y'; A] = 0. Since Y, Y' are both G-measurable, {Y > Y'} ∈ G. Thus E[Y − Y'; Y > Y'] = 0. This is the expectation of the nonnegative random variable (Y − Y')1_{{Y>Y'}}, but you proved in homework that a nonnegative random variable with zero expectation is zero almost surely. Similarly, (Y' − Y)1_{{Y<Y'}} = 0 a.s. Adding, Y − Y' = 0 a.s.
Existence: This uses the Radon–Nikodym theorem. For A ∈ G let μ(A) = P(A) and ν(A) = E[X; A]. Then μ is a probability measure and ν is a signed measure on the measurable space (Ω, G), and ν ≪ μ. The Radon–Nikodym theorem says there is a (G-measurable!) Y : Ω → R such that dν = Y dμ, i.e. ν(A) = ∫_A Y dμ for every A ∈ G. But this says precisely that E[X; A] = E[Y; A].
Lets derive some properties of conditional expectation. They mostly parallel those of unconditional
expectation, and are usually proved using the uniqueness in Theorem 14.1.
Proposition 14.3. Conditional expectation is linear: if X, Y are integrable and a, b R then E[aX + bY |
G] = aE[X | G] + bE[Y | G] a.s.
Proof. Set Z = aE[X | G] + bE[Y | G]; we show Z satisfies the two properties in Theorem 14.1. Clearly
Z G since it is a linear combination of G-measurable random variables. Now if A G, we have
E[Z1A ] = aE[E[X | G]1A ] + bE[E[Y | G]1A ] = aE[X1A ] + bE[Y1A ] = E[(aX + bY)1A ]
by linearity of E and definition of E[X | G]. Thus by the uniqueness, we must have Z = E[aX + bY | G]
almost surely.
Proposition 14.4. Conditional expectation is monotone: if X ≤ Y a.s. then E[X | G] ≤ E[Y | G] a.s.
Proof. Replacing X by X − Y and using linearity, we can assume Y = 0; so suppose X ≤ 0 and let us show E[X | G] ≤ 0 a.s. Let A = {E[X | G] ≥ 0}, so that E[X | G]1_A ≥ 0. But taking expectations and noting that A ∈ G, we have E[E[X | G]1_A] = E[X1_A] ≤ 0 since X1_A ≤ 0. Since E[X | G]1_A is a nonnegative random variable with nonpositive expectation, we must have E[X | G]1_A = 0 (you proved this in HW 1), which is to say E[X | G] ≤ 0 a.s.
Proposition 14.5. Triangle inequality: |E[X | G]| ≤ E[|X| | G], almost surely.
Proof. By monotonicity above, E[X⁺ | G] and E[X⁻ | G] are nonnegative. Now
|E[X | G]| = |E[X⁺ | G] − E[X⁻ | G]| ≤ E[X⁺ | G] + E[X⁻ | G] = E[X⁺ + X⁻ | G] = E[|X| | G].
Proposition 14.6. Conditional monotone convergence theorem: if X_n ≥ 0 and X_n ↑ X a.s., then E[X_n | G] ↑ E[X | G] almost surely.
Proof. By the monotonicity proved in the previous proposition, E[X_n | G] is an increasing sequence, hence converges a.s. to some limit Y. We have to show Y = E[X | G], for which we'll use the uniqueness of Theorem 14.1. As a limit of G-measurable random variables, Y is also G-measurable. If A ∈ G, we have X_n 1_A ↑ X1_A and E[X_n | G]1_A ↑ Y1_A. Since E[X_n 1_A] = E[E[X_n | G]1_A], using (unconditional) monotone convergence on both sides gives E[X1_A] = E[Y1_A]. So Y = E[X | G].
Proposition 14.7. Conditional Fatou lemma: If X_n ≥ 0 are integrable and so is X := lim inf X_n, then E[X | G] ≤ lim inf E[X_n | G], almost surely. (Presentation.)
The integrability assumptions can be removed using your homework problem which extends conditional
expectation to nonnegative random variables.
Proposition 14.8. Suppose X_n → X in L¹. Then E[X_n | G] → E[X | G] in L¹.
Proof.
E[|E[X_n | G] − E[X | G]|] = E[|E[X_n − X | G]|] ≤ E[E[|X_n − X| | G]] = E[|X_n − X|] → 0.
In the last equality we used the fact E[E[Y | G]] = E[Y]; this comes from the definition of conditional expectation, taking A = Ω.
Remark 14.9. Even if we have X_n → X a.s. and in L¹, we cannot conclude that E[X_n | G] → E[X | G] almost surely. It seems there is a fairly strong negation of this statement due to Blackwell and Dubins. [Nate: add a reference here.]
Example 14.10. Consider the special case where we fix an event A and set G = σ({A}) = {∅, Ω, A, A^c}. Then in order for a random variable Y to be G-measurable, it must be of the form Y = a1_A + b1_{A^c}. So to compute Y = E[X | G] for some other random variable X, we just note that a = E[Y1_A]/P(A) = E[X1_A]/P(A) = E[X | A], the elementary definition of conditional expectation. Likewise b = E[X | A^c]. So E[X | G] = E[X | A]1_A + E[X | A^c]1_{A^c}. On A, E[X | G] is constant, and its value is the average of X over A; the conditional expectation just flattens out X to make it conform to the partition {A, A^c} of Ω.
By a similar argument, if we partition Ω into a finite or countable sequence of events A_1, A_2, . . . which are pairwise disjoint and whose union is Ω, and set G = σ(A_1, A_2, . . .), then E[X | G] = Σ_i E[X | A_i] 1_{A_i}.
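Example 14.10 can be mimicked numerically (my own sketch, for a finite sample space with equally likely outcomes; all names here are made up for illustration): E[X | G] for G generated by a partition is obtained by averaging X over each cell.

def conditional_expectation(X, partition):
    # X: dict mapping outcomes to values; partition: disjoint sets of outcomes covering
    # all of them. Returns E[X | sigma(partition)] as a dict, assuming all outcomes
    # are equally likely (illustration only).
    Y = {}
    for cell in partition:
        avg = sum(X[w] for w in cell) / len(cell)  # E[X | A_i]
        for w in cell:
            Y[w] = avg
    return Y

# Toy example: two coin flips, X = number of heads, G = sigma(first flip).
X = {"HH": 2, "HT": 1, "TH": 1, "TT": 0}
G = [{"HH", "HT"}, {"TH", "TT"}]
print(conditional_expectation(X, G))  # 1.5 on {first = H}, 0.5 on {first = T}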
Proposition 14.11. If Z G, and X, ZX are both integrable, then E[ZX | G] = ZE[X | G].
That is, when conditioning on G, G-measurable random variables act like constants and can be factored
out. Since G is the information we know, it makes sense that a quantity Z depending only on things we know
should behave like a constant.
Proof. Presentation.
Proposition 14.12. If X is independent of G then E[X | G] = E[X], i.e. the conditional expectation is a constant.
If you are given some information G to make a prediction of X, but the information is completely
irrelevant (i.e. independent), then your best guess wont actually involve the given information (i.e. E[X | G]
is a deterministic constant) and will be the same as what you would guess given no information at all (i.e.
E[X]).
Proof. We use uniqueness: clearly E[X] is G-measurable (its a constant!) and if A G, then by independence E[1A E[X]] = E[1A ]E[X] = E[1A X].
Conversely:
Proposition 14.13. If for each Borel set B, P(X ∈ B | G) is a.s. equal to a constant, then X is independent of G. (In particular, this holds if E[f(X) | G] is constant for all, say, bounded measurable f.)
Proof. First note that if P(X ∈ B | G) is a constant then it is equal to its expectation, so P(X ∈ B | G) = E[P(X ∈ B | G)] = P(X ∈ B).
Suppose A ∈ G and C ∈ σ(X). We know C = {X ∈ B} for some Borel B. Then
P(A ∩ C) = E[1_A 1_B(X)] = E[1_A E[1_B(X) | G]] = E[1_A P(X ∈ B | G)] = E[1_A P(X ∈ B)] = P(A)P(C),
so A is independent of C. Hence G is independent of σ(X).
We can also prove a conditional Jensen inequality: for convex φ (with X and φ(X) integrable), E[φ(X) | G] ≥ φ(E[X | G]) a.s. Recall, from the proof of the ordinary Jensen inequality, that for a convex function φ the one-sided limits
lim_{y↑x} (φ(y) − φ(x))/(y − x)  and  lim_{y↓x} (φ(y) − φ(x))/(y − x)
exist, and that c could be taken to be any number between these two limits. We'd like to apply this same idea with x = E[X | G]; but this is a random variable, i.e. depends on ω, and so c will also depend on ω, and we have to be a little careful to make sure the dependence is measurable.
Set c_n(x) = (φ(x + 1/n) − φ(x))/(1/n). Convex functions are continuous, so c_n is continuous also. As argued, for each x, c(x) := lim_n c_n(x) exists and is finite, and c is measurable since it is a pointwise limit of measurable functions. Moreover, as before, for any x, y we have φ(y) ≥ φ(x) + c(x)(y − x). So taking y = X and x = E[X | G] we can do:
E[φ(X) | G] ≥ E[φ(E[X | G]) + c(E[X | G])(X − E[X | G]) | G]
= φ(E[X | G]) + c(E[X | G]) E[X − E[X | G] | G] = φ(E[X | G]),
since E[X − E[X | G] | G] = 0.
15 Martingales
The first main class of stochastic processes we will study are martingales. These have a rather special
property that, in some sense, they predict their own future behavior. There are a limited number of real-life
systems that are reasonable to model using martingales; but there are several very useful theorems that apply
to them, and it turns out to be possible to build martingales based on more general processes and use them
for analysis. Also, martingales are the cornerstone of the theory of stochastic calculus that will be developed
in Math 6720.
Well first give the definition and then think about what it means.
Definition 15.1. Let (, F , {Fn }, P) be a filtered probability space. An adapted stochastic process {Mn }n0
is a martingale if each Mn is integrable and for each n 0, we have
E[Mn+1 | Fn ] = Mn ,
a.s.
The way to think of a martingale is as a fair game: you are gambling in a casino where every game is
fair, and Mn is your fortune at time n. If you look at all the information available to you at time n (including
your current fortune Mn ) and try to predict your fortune after the next play, your best estimate of Mn+1 is the
same as your current fortune Mn ; since the games are fair, on average, you expect to neither win nor lose
money on the next play.
A submartingale is a favorable game, where you expect, on average, to win money every time (or at
worst to break even); likewise a supermartingale represents an unfavorable game. The names may seem
backwards; they arise because there is a correspondence with subharmonic and superharmonic functions
in PDEs and potential theory, but as a mnemonic you can think of the names as referring to the casinos
point of view: a submartingale is favorable to the player and hence sub-optimal for the casino, while a
supermartingale is super for the casino.
Notice that if we just look at expectations we see that for a martingale E[Mn+1 ] = E[E[Mn+1 | Fn ]] =
E[Mn ], i.e. a martingale has constant expectations. For a supermartingale, E[Mn ] decreases with n, and for
a submartingale it increases.
Also, if M_n is a martingale we have by the tower property
E[Mn+2 | Fn ] = E[E[Mn+2 | Fn+1 ] | Fn ] = E[Mn+1 | Fn ] = Mn .
Iterating, E[Mn+k | Fn ] = Mn . So given the information at time n, your best estimate of the martingale at
any future time is Mn itself.
Example 15.2 (Independent increments). Let {ξ_i} be independent random variables with E[ξ_i] = 0 for every i (they do not have to be identically distributed) and take F_n = σ(ξ_1, . . . , ξ_n). Let M_0 = 0, M_n = ξ_1 + · · · + ξ_n. Then I claim M_n is a martingale; it is clearly adapted, and
E[M_{n+1} | F_n] = E[M_n + ξ_{n+1} | F_n] = E[M_n | F_n] + E[ξ_{n+1} | F_n]
by linearity. But M_n ∈ F_n, so E[M_n | F_n] = M_n, and ξ_{n+1} is independent of F_n, so E[ξ_{n+1} | F_n] = E[ξ_{n+1}] = 0.
If E[ξ_i] ≤ 0 we get a supermartingale, and if E[ξ_i] ≥ 0 we get a submartingale.
Example 15.3. Independence of increments is not necessary for a martingale. Its okay if M1 , . . . , Mn affect
the distribution of the increment Mn+1 Mn somehow, as long as even given all this information, it still has
(conditional) expectation zero.
For example, let ξ_1, ξ_2, . . . be iid fair coin flips (taking values ±1), and F_n = σ(ξ_1, . . . , ξ_n). Let's consider the following gambling strategy: bet 1 dollar on the first coin flip (so you win $1 or lose $1 with probability 1/2). If you win, keep your dollar and quit playing. If you lose, bet $2 on the next flip. Keep this going; if you win, quit playing, and if you lose, double your bet. We can see that when you eventually win, your winnings cancel all your losses so far and leave you with a profit of $1. Then your profit at time n can be written recursively as M_0 = 0 and
M_{n+1} = 1 if M_n = 1,  and  M_{n+1} = M_n + 2^n ξ_{n+1} otherwise.
I claim M_n is a martingale with respect to F_n. It is clearly adapted (use induction if you like), and we have
E[M_{n+1} | F_n] = E[1_{{M_n=1}} + 1_{{M_n≠1}}(M_n + 2^n ξ_{n+1}) | F_n]
= 1_{{M_n=1}} + 1_{{M_n≠1}}(M_n + 2^n E[ξ_{n+1} | F_n])
= 1_{{M_n=1}} + 1_{{M_n≠1}}(M_n + 2^n E[ξ_{n+1}])
= 1_{{M_n=1}} + 1_{{M_n≠1}} M_n
= M_n.
In the second line we used linearity and Proposition 14.11 repeatedly; in the third line we used Proposition 14.12 since ξ_{n+1} is independent of F_n.
This doubling strategy is a bit weird. It results in a martingale, which should be a fair game; however,
as soon as any coin flip comes up heads, you win back all your losses and have a profit of $1, and with
probability 1, this will eventually happen. So this fair game is actually a guaranteed win.
Worse yet, we could repeat this with a biased coin that only comes up heads with probability 0 < p <
1/2. Then this is an unfavorable game, and the argument above shows it is a supermartingale. But again it
is a guaranteed win!
There are two catches though. First, you may have to wait arbitrarily long for a heads, so youd better
have unlimited time to play, because if you quit early the whole thing falls apart. Second, you could go very
far into debt while waiting for the first heads, so youd also better have unlimited credit. Well show a little
later that if either of these things is absent, you cannot use any such strategy to turn a fair or unfavorable
game in your favor.
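A simulation of the doubling strategy (my own sketch; the horizon of 60 flips is an arbitrary cutoff standing in for "unlimited time"):

import random

def doubling_strategy(max_flips=60):
    # Bet 1, 2, 4, ... on fair coin flips until the first win; report the final profit
    # and the worst debt incurred along the way. Illustration only.
    profit, bet, worst = 0, 1, 0
    for _ in range(max_flips):
        if random.random() < 0.5:       # win: recover all losses plus $1
            return profit + bet, worst
        profit -= bet                    # lose the bet
        worst = min(worst, profit)
        bet *= 2
    return profit, worst                 # never won within the horizon

random.seed(0)
print([doubling_strategy() for _ in range(5)])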
Example 15.4. The property that E[M_n] is constant is necessary in order to be a martingale, but not sufficient. Let ξ be a fair coin flip (±1), and let M_1 = ξ, M_2 = ξ + ξ = 2ξ. So if ξ is heads, you win a dollar at time 1, and another dollar at time 2; if ξ is tails you lose both times. Clearly E[M_1] = E[M_2] = 0, so this game is fair in a certain sense. But it isn't a martingale (with respect to its natural filtration, say) because E[M_2 | M_1] = M_2 ≠ M_1.
You can interpret the fairness required to be a martingale as follows: no matter what has happened up
to time n, the game from then on is still fair. Someone who has been watching the game without playing
would be willing to jump in at any time, no matter what they have seen so far. This fails in this example; if
the coin comes up tails at time 1, then given this information, the play at time 2 is a guaranteed loss, and an
onlooker wouldnt want to enter the game at that point.
Here are a couple of simple facts:
Proposition 15.5. If M_n, M'_n are martingales then so is aM_n + bM'_n. (The martingales form a vector space.)
Proposition 15.6. If M_n is a martingale and φ is convex (with φ(M_n) integrable), then φ(M_n) is a submartingale.
Proof. By the conditional Jensen inequality, E[φ(M_{n+1}) | F_n] ≥ φ(E[M_{n+1} | F_n]) = φ(M_n).
Proposition 15.7. If M_n is a submartingale and φ is convex and nondecreasing then φ(M_n) is a submartingale.
Proof. Just as above, E[φ(M_{n+1}) | F_n] ≥ φ(E[M_{n+1} | F_n]) ≥ φ(M_n), where the second inequality is because E[M_{n+1} | F_n] ≥ M_n and φ is nondecreasing.
Here is a more elaborate, but extremely powerful, way to get new martingales from old.
Definition 15.8. A process {H_n}_{n≥1} is said to be predictable if H_n ∈ F_{n−1} for each n. That is, from what you know at time n − 1, you can determine exactly what H_n will be.
Definition 15.9. Suppose H_n is predictable and M_n is a martingale. Define a new process (H · M)_n by
(H · M)_n = Σ_{i=1}^{n} H_i (M_i − M_{i−1}).
(H · M)_n is called the discrete stochastic integral or martingale transform of H_n with respect to M_n.
Perhaps the best way to think of this is as an investment strategy. Suppose Mn is the price of a stock at
the close of day n. Our strategy will be: at the start of day n, buy Hn shares of stock (at the previous closing
price Mn1 ) and then sell it at the end of the day (at the price Mn ). We can make the decision as to how
much stock to buy using any information gathered up to day n 1 (including the closing price Mn1 ) but of
course we cannot know what day ns closing price will be.
Proposition 15.10. If each Hn is bounded, then (H M)n is a martingale.
So if the stock price is fair, then (on average) you cant make money trading it.
Proof. Its easy to see it is adapted, since (H M)n is defined completely in terms of H1 , . . . , Hn and
M0 , . . . , Mn , all of which are Fn -measurable. We need Hn to be bounded in order to be sure that (H M)n is
integrable. Then we note that
E[(H · M)_{n+1} − (H · M)_n | F_n] = E[H_{n+1}(M_{n+1} − M_n) | F_n] = H_{n+1} E[M_{n+1} − M_n | F_n] = 0
since H_{n+1} ∈ F_n.
If M_n is a supermartingale, then (H · M)_n is also a supermartingale given the additional assumption that H ≥ 0. (If the stock is tending to lose money, so will any strategy based on it, provided that the strategy is only allowed to hold positive shares of stock and isn't allowed to sell short.) The proof is just the same as above. The analogous statement for submartingales is also true.
It's worth observing that (H · M) is bilinear in H and M.
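A sketch of the discrete stochastic integral (my own; the strategy shown, "hold one share whenever yesterday's price is below the starting price," is just an example of a predictable H):

import random

def martingale_transform(H, M):
    # (H.M)_n = sum_{i=1}^n H_i (M_i - M_{i-1}); H[i] may only use information up to time i-1.
    return sum(H[i] * (M[i] - M[i - 1]) for i in range(1, len(M)))

random.seed(0)
M = [0]
for _ in range(100):                       # fair-coin random walk as the 'stock price'
    M.append(M[-1] + random.choice([1, -1]))

# Predictable strategy: H_i depends only on M_0, ..., M_{i-1}.
H = [0] + [1 if M[i - 1] < 0 else 0 for i in range(1, len(M))]
print(martingale_transform(H, M))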
Here's a simple type of strategy: let τ be a stopping time, and set H_n = 1_{{n≤τ}}. This is clearly predictable, since {n ≤ τ} = {τ ≤ n − 1}^c ∈ F_{n−1}. It corresponds to the strategy "buy one share, and hold it until time τ, then sell it." (On the event τ = ∞ we just hold the stock forever.) Indeed, we have (H · M)_n = M_{τ∧n} − M_0, so our previous proposition shows:
Proposition 15.11. If τ is a stopping time and M_n is a smartingale then so is M_{τ∧n}.
In particular, for martingales, for every n we have E[M_{τ∧n}] = E[M_0]. As a corollary, if τ is a bounded stopping time, say τ ≤ C a.s., then E[M_τ] = E[M_0]. This says that something like our doubling strategy above, which makes us guaranteed free money, cannot work if we cannot wait arbitrarily long.
Theorem 15.12 (Doob decomposition). Yipu presents.
If τ < ∞ a.s., then M_τ is a random variable; what can we say about it, and in particular its expectation? We expect to have something like E[M_τ] = E[M_0], since this is what happens when τ is deterministic. If we can pass to the limit then this works:
Theorem 15.13 (Optional stopping theorem). If M_n is a submartingale, τ is a stopping time, τ < ∞ almost surely, and {M_{τ∧n}}_{n≥0} is uniformly integrable, then E[M_τ] ≥ E[M_0]. For supermartingales the inequality reverses; for martingales we get equality.
Proof. The proof is just the Vitali convergence theorem. If τ < ∞ almost surely, then M_{τ∧n} → M_τ almost surely (indeed, for every ω with τ(ω) < ∞, we have M_{τ∧n}(ω) = M_τ(ω) for all n ≥ τ(ω)). So if {M_{τ∧n}} is ui, Vitali says E[M_{τ∧n}] → E[M_τ]. But E[M_{τ∧n}] ≥ E[M_0] for all n because M_{τ∧n} is a submartingale.
Uniform integrability is the most general condition for this to work. But any condition that lets you pass to the limit in E[M_{τ∧n}] will also do, for instance, dominated convergence. The "crystal ball" condition sup_n E|M_{τ∧n}|^p < ∞ (for some p > 1) is also useful.
We cannot do without uniform integrability. If we consider our doubling martingale, and let τ = inf{n : M_n = 1}, then τ < ∞ almost surely, but M_τ = 1, and so 1 = E[M_τ] ≠ 0 = E[M_0]. Our doubling martingale was not ui.
This also goes to show that something like our doubling martingale cannot work if we don't have unlimited credit: if we had a finite amount of credit, the martingale would be bounded, hence ui, and every stopping strategy would have zero expectation.
Here are a couple of classic applications of optional stopping.
Example 15.14. Consider the simple random walk martingale: ξ_i are iid fair coin flips, F_n = σ(ξ_1, . . . , ξ_n), and M_n = ξ_1 + · · · + ξ_n. So we bet $1 on successive fair coin flips and M_n records our net profit. A classic question is the gambler's ruin problem: suppose we decide that we will quit when we have amassed either b dollars of profit (victory) or a dollars of debt (ruin). What are the probabilities of victory and ruin?
Let τ = inf{n : M_n ∈ {−a, b}} be the time at which we quit playing. Let's first argue that we will eventually quit, i.e. τ < ∞ a.s. For a crude argument, let r = a + b and let A_n = {ξ_{nr+1} = · · · = ξ_{nr+r} = −1} be the event that all the r coin flips from nr + 1 to nr + r were tails. On this event, the game must be over by time nr + r, since if we had fewer than b dollars at time nr + 1, we will have less than b − r = −a by time nr + r. Now P(A_n) = 2^{−r} > 0 and all the A_n are independent, so P(∪ A_n) = 1, i.e. almost surely such a run will eventually happen, and this will end the game if it hadn't ended already. (Actually such a run will happen infinitely often, by Second Borel–Cantelli.) Thus τ < ∞ a.s. In particular M_τ is well defined, and we are asking for the probabilities of {M_τ = −a} and {M_τ = b}.
Next, we have −a ≤ M_{τ∧n} ≤ b for all n, so in particular {M_{τ∧n}} is ui. So by the optional stopping theorem, E[M_τ] = E[M_0] = 0. However, since M_τ must be either −a or b, we have
0 = E[M_τ] = −aP(M_τ = −a) + bP(M_τ = b).
Since P(M_τ = −a) + P(M_τ = b) = 1, we can solve this to find that P(M_τ = −a) = b/(a+b) (and P(M_τ = b) = a/(a+b)). So ruin occurs with probability b/(a+b). Note that this is increasing in b and decreasing in a, which makes sense.
Since we showed that the probability of hitting −a before b is b/(a+b), in particular the probability that we ever hit −a is at least this large. But as b → ∞ this tends to 1. This shows that, almost surely, the random walk will eventually hit any arbitrary negative value −a, and we can say the same for positive values. So taking a countable intersection, almost surely, simple random walk eventually hits every value. In fact it is not hard to show that simple random walk hits every value infinitely many times; we say it is recurrent.
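A Monte Carlo check of the ruin probability b/(a+b) from Example 15.14 (my own sketch; a, b, and the number of trials are arbitrary):

import random

def ruin_probability(a, b, trials=20000):
    # Fraction of simple random walks started at 0 that hit -a before +b.
    ruined = 0
    for _ in range(trials):
        m = 0
        while -a < m < b:
            m += random.choice([1, -1])
        ruined += (m == -a)
    return ruined / trials

random.seed(0)
print(ruin_probability(3, 7), 7 / (3 + 7))  # simulation vs. b/(a+b)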
Lemma 15.15 (Upcrossing inequality, out of order). Let M_n be a submartingale, a < b, and U_k the number of upcrossings of [a, b] that M_n makes by time k. Then
(b − a) E U_n ≤ E(M_n − a)^+ − E(M_0 − a)^+.
Example 15.16. We can also use optional stopping to analyze an asymmetric random walk. Again let ξ_i be iid, but now with P(ξ_i = 1) = p, P(ξ_i = −1) = 1 − p, for some p ≠ 1/2. Let X_n = ξ_1 + · · · + ξ_n as before. This is no longer a martingale, but we can turn it into one. The idea is to look at an exponential martingale: we will find a number θ such that M_n := θ^{X_n} is a martingale, and then use optional stopping on M_n.
To determine what value to use for θ, we compute
E[M_{n+1} − M_n | F_n] = E[M_n(θ^{ξ_{n+1}} − 1) | F_n] = M_n(E[θ^{ξ_{n+1}}] − 1)
since M_n ∈ F_n and ξ_{n+1} is independent of F_n. So we need to choose θ so that E[θ^{ξ_{n+1}}] = 1. But we have
E[θ^{ξ_{n+1}}] = pθ + (1 − p)θ^{−1},
so a little algebra shows we should take θ = (1 − p)/p. (The other solution is θ = 1, but this results in the constant martingale M_n = 1 which won't be useful for anything.)
Now as before let τ = inf{n : X_n ∈ {−a, b}}. By a similar argument as before we have τ < ∞ almost surely, and M_{τ∧n} is bounded between θ^{−a} and θ^b, so again M_{τ∧n} is ui. Optional stopping now shows
1 = E[M_0] = E[M_τ] = θ^{−a} P(X_τ = −a) + θ^b P(X_τ = b)
and we obtain that the probability of ruin is
P(X_τ = −a) = (1 − θ^b) / (θ^{−a} − θ^b).
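The same kind of Monte Carlo check for the asymmetric formula (my own sketch; p, a, b are arbitrary):

import random

def asymmetric_ruin(p, a, b, trials=20000):
    # Fraction of p-biased walks started at 0 that hit -a before +b.
    ruined = 0
    for _ in range(trials):
        x = 0
        while -a < x < b:
            x += 1 if random.random() < p else -1
        ruined += (x == -a)
    return ruined / trials

random.seed(0)
p, a, b = 0.55, 3, 7
theta = (1 - p) / p
print(asymmetric_ruin(p, a, b), (1 - theta**b) / (theta**(-a) - theta**b))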
E[M_k^−] = E[M_k^+] − E[M_k] ≤ E[M_k^+] − E[M_0] ≤ 2C
where we used the fact that M_n is a submartingale and hence E[M_k] ≥ E[M_0]. In the end we have E[|M_k|] ≤ 3C and can use Fatou as before to get E|M_τ| < ∞, and proceed.
Actually the assumption τ < ∞ here is unnecessary because the martingale convergence theorem, below, implies that if {M_n} is ui then it converges almost surely, so M_τ is well-defined even on the event {τ = ∞}.
Theorem 15.18 (Martingale convergence theorem). Suppose M_n is a submartingale with sup_n E M_n^+ < ∞. Then M_n converges almost surely to some random variable M_∞, and E|M_∞| < ∞.
Proof. Let C = sup_n E M_n^+ < ∞.
Fix a < b and let U_n be the number of upcrossings as in the previous lemma. Since (M_n − a)^+ ≤ M_n^+ + |a|, the upcrossing lemma tells us that EU_n ≤ (C + |a|)/(b − a). If U is the total number of upcrossings (for all time), clearly U_n ↑ U, and so by MCT or Fatou we have EU ≤ (C + |a|)/(b − a) as well; in particular U < ∞ a.s. Thus if A_{a,b} is the event that M_n makes infinitely many upcrossings of [a, b], we have P(A_{a,b}) = 0.
Now suppose M_n(ω) fails to converge, so that lim inf M_n(ω) < lim sup M_n(ω). We can then choose rational a, b (depending on ω) such that lim inf M_n(ω) < a < b < lim sup M_n(ω). Then M_n(ω) has to be less than a infinitely often, and greater than b infinitely often, so it makes infinitely many upcrossings of [a, b]; that is, ω ∈ A_{a,b}. Thus we have
{M_n does not converge} ⊂ ∪_{a,b∈Q} A_{a,b}.
The right side is a countable union of probability-zero events, so we have that M_n converges almost surely.
Call the limit M; we have to show E|M| < ∞. Since M_n^+ → M^+ almost surely, and EM_n^+ ≤ C, by Fatou's lemma we get EM^+ ≤ C. On the other hand, E[M_n^−] = E[M_n^+] − E[M_n] ≤ C − E[M_0] since M_n is a submartingale and has E[M_n] ≥ E[M_0]. Thus E|M| = E[M^+] + E[M^−] ≤ 2C − E[M_0] < ∞. In particular, M is finite almost surely.
Corollary 15.19. If M_n is a smartingale with sup_n E|M_n| < ∞ then M_n converges almost surely. In particular this happens if {M_n} is ui, and in this case it also converges in L¹.
Proof. For a submartingale, we note that M_n^+ ≤ |M_n| and use the previous theorem. For a supermartingale, consider the submartingale −M_n. In the ui case, we know that ui sequences are L¹-bounded, and that ui together with a.s. convergence gives L¹ convergence (Vitali).
Corollary 15.20. If M_n is a positive supermartingale (or bounded below) it converges almost surely.
Proof. −M_n is a submartingale and (−M_n)^+ = M_n^− = 0, so the previous theorem applies.
Example 15.21. The martingale convergence theorem gives us a quick proof of the recurrence of simple random walk. Let M_n be simple random walk, which is a martingale; pick an integer a > 0, and let τ = inf{n : M_n = −a}. Then M_{τ∧n} is a martingale with M_{τ∧n} ≥ −a, so the martingale convergence theorem applies and M_{τ∧n} converges almost surely. But on the event {τ = ∞}, M_{τ∧n} = M_n diverges, because M_n moves by ±1 at every time step. Hence we must have P(τ = ∞) = 0, i.e. M_n almost surely hits −a. For a < 0, consider −M_n instead.
Example 15.22. For asymmetric random walk, say with p > 1/2, then for any a > 0, the stopped walk S_{τ_a∧n} (where τ_a is the first time S_n hits a) is a submartingale bounded above. As before, this means P(τ_a = ∞) = 0. But we can do better: choosing θ = (1 − p)/p < 1 as in our example above, so that M_n := θ^{S_n} is a positive martingale, we have that M_n converges almost surely to a finite limit. This limit cannot take a nonzero value (because S_n itself cannot converge), so the only possibility is that M_n → 0 a.s. That means that S_n → +∞ almost surely; the asymmetric random walk drifts to +∞. Of course, we also knew this in several other ways.
Example 15.23 (GaltonWatson branching process). This is another nice example of a process that we
can analyze with martingale techniques. We are studying a population of some organism. In each generation, each individual in the population gives birth to a random number of offspring, and then dies. We are
interested in the long-term behavior of the population: in particular, will it become extinct?
Let X_n be the number of individuals in the population at generation n; we'll take X_0 = 1. Let ξ_{n,i} be the number of offspring produced by the ith individual in generation n; we'll assume that {ξ_{n,i} : n, i ≥ 1} are iid nonnegative integer-valued random variables. (Note that for any given n, only finitely many of the ξ_{n,i} will be used, but the number depends on the previous generation, so we don't know in advance how many we'll need. So we do assume we have infinitely many ξ_{n,i} available in our model, but in any given outcome most of them will go unused.) Then we can define
X_n = Σ_{i=1}^{X_{n−1}} ξ_{n,i}.
This sum may look funny because of the random limit, but on a pointwise level it makes sense. Let E be the event of extinction, i.e. E = ∪_n {X_n = 0}. We would like to know P(E). Certainly it will depend on the distribution of the ξ_{n,i} (the offspring distribution); specifically on their mean. So let μ = E[ξ_{n,i}] (assume the expectation exists and is finite). The case μ = 0 results in immediate extinction (since there are no offspring at all), so assume μ > 0. Set F_n = σ(ξ_{k,i} : k ≤ n).
Proposition 15.24. M_n := X_n/μ^n is a martingale.
Proof. Clearly M_n is adapted. Next, we compute E[X_n | F_{n−1}]. Intuitively, given the information F_{n−1} we know the number of individuals X_{n−1} at time n − 1, but we have no information about how many offspring each one will have (since the ξ_{n,i} are independent of F_{n−1}), so the best guess we can make is E[ξ_{n,i}] = μ. Thus our best guess at X_n should be μX_{n−1}. To make this precise, we can do the following:
E[X_n | F_{n−1}] = E[Σ_{i=1}^{∞} ξ_{n,i} 1_{{X_{n−1} ≥ i}} | F_{n−1}] = Σ_{i=1}^{∞} E[ξ_{n,i} 1_{{X_{n−1} ≥ i}} | F_{n−1}]   (cMCT)
= Σ_{i=1}^{∞} E[ξ_{n,i}] 1_{{X_{n−1} ≥ i}} = μ Σ_{i=1}^{∞} 1_{{X_{n−1} ≥ i}} = μ X_{n−1}.
Thus, E[M_n | F_{n−1}] = μ^{−n} E[X_n | F_{n−1}] = μ^{−(n−1)} X_{n−1} = M_{n−1}.
If μ = 1 but ξ_{n,i} is not identically 1, then we must have p_0 := P(ξ_{n,i} = 0) > 0. Now in this case {X_n} is a nonnegative martingale and hence converges almost surely. Since it is integer valued, it must be eventually constant. So consider A_{n,k} = {X_n = X_{n+1} = · · · = k}. In order for A_{n,k} to happen (for k ≥ 1), we must avoid the events E_{m,k} = {ξ_{m,1} = · · · = ξ_{m,k} = 0}, which is the event that in generation m, the first k individuals die childless. However, the events E_{m,k}, m ≥ n, are independent, and P(E_{m,k}) = p_0^k > 0, so by (a trivial version of) Borel–Cantelli we have P(∪_{m≥n} E_{m,k}) = 1, and therefore P(A_{n,k}) = 0. Taking a countable union, P(∪_{n≥0,k≥1} A_{n,k}) = 0. This says that X_n cannot be eventually constant at any positive value. Since by almost sure convergence X_n must be eventually constant at some value, it must be zero, which is extinction. (Note that we have X_n → 0 but E[X_n] = 1, so the convergence is certainly not in L¹ and this is yet another example of a non-ui martingale.)
It can be shown that if μ > 1 then there is a positive probability that extinction is avoided. See Durrett.
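A simulation sketch of the branching process (my own; the two offspring distributions, the generation horizon, and the population cap are arbitrary choices):

import random

def galton_watson(offspring, generations=30, cap=5000):
    # X_0 = 1; each individual independently has offspring() children (iid draws).
    x = 1
    for _ in range(generations):
        x = sum(offspring() for _ in range(x))
        if x == 0 or x > cap:   # stop early: extinct, or so large extinction is unlikely
            break
    return x

def critical():        # offspring distribution with mean 1: dies out almost surely
    return random.choice([0, 2])

def supercritical():   # mean 1.5: survives with positive probability
    return random.choice([0, 1, 2, 3])

random.seed(0)
print(sum(galton_watson(critical) == 0 for _ in range(200)) / 200)
print(sum(galton_watson(supercritical) == 0 for _ in range(200)) / 200)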
15.1 L¹ convergence
Here is a very simple type of martingale, which is actually very general as we shall see: Let X be an
integrable random variable, and set Mn = E[X | Fn ]. The idea is that Mn is the best approximation of X
that we can get given the information available at time n. This is clearly a martingale thanks to the tower
property.
Theorem 15.26. If X is integrable, then the set of random variables {E[X | G] : G ⊂ F is a σ-field} is ui.
Corollary 15.29 (Lévy zero-one law). If A ∈ F_∞ then P(A | F_n) → 1_A a.s. and in L¹. (In particular, the limit of P(A | F_n)(ω) is almost surely either 0 or 1, depending on whether ω ∈ A.)
Interestingly, we can use this to prove the Kolmogorov zero-one law. Let F_n = σ(G_1, . . . , G_n) where the G_i are independent σ-fields, and we let T = ∩_n σ(G_n, G_{n+1}, . . .) be the tail σ-field. If A ∈ T ⊂ F_∞ then A is independent of F_n for each n, and so P(A | F_n) = P(A). Lévy says the left side converges almost surely to 1_A, so we must have 1_A = P(A) a.s. This is only possible if P(A) is 0 or 1.
Actually it turns out every ui martingale is of the form Mn = E[X | Fn ].
Theorem 15.30. Suppose M_n is a martingale, and M_n → M in L^1. Then M_n = E[M | F_n] almost surely.
Proof. We use uniqueness of conditional expectation to verify that E[M | F_n] is in fact M_n. Clearly M_n is
F_n-measurable. Now let A ∈ F_n. For any k > n, since M_n = E[M_k | F_n], we have E[M_n 1_A] = E[M_k 1_A].
Now let k → ∞; since M_k → M in L^1 we have

|E[M_k 1_A] − E[M 1_A]| ≤ E[|M_k − M| 1_A] ≤ E|M_k − M| → 0.

So E[M_k 1_A] → E[M 1_A]. Passing to the limit we have E[M_n 1_A] = E[M 1_A], so by uniqueness M_n = E[M | F_n] a.s.
16 Markov chains
The tricky part about any stochastic process is the dependence between the random variables X_1, X_2, . . . . In
most models you want some relationship between X_n and X_{n+1}; they shouldn't be independent, but they also
shouldn't be deterministically related, otherwise there is no randomness. If the dependence structure is too
complicated, you get a model that you can't analyze.
Random walks are a nice simple example: if X_n = ξ_1 + ... + ξ_n for iid ξ_i, then X_{n+1} is X_n with a little
bit of extra randomness added (literally). But requiring that relationship to be additive is quite restrictive;
indeed, we often want to look at processes in state spaces that have no notion of addition or group structure.
The Markov chain is a model where the process evolves by making incremental random changes to its state,
and it's a good compromise: reasonable to analyze while still being quite a flexible model.
Fix a measurable space (S , S) to be our state space. For most of our development, we will take S to be
a finite or countable set, with S = 2^S. This makes all the measure theory very simple: all subsets of S are
measurable, and any measure μ on S can be represented by a probability mass function: μ(A) = Σ_{x∈A} p(x), where p(x) = μ({x}).
Occasionally we might allow for more generality, but countable sets are what you should keep in mind.
The simple picture of a Markov chain is as a weighted directed graph. The vertex set is the state space S ,
and each directed edge (x, y) is labeled with a probability p(x, y) in such a way that for each x, Σ_y p(x, y) = 1.
(Any directed edge that is not present in the graph can be thought of as having weight 0.) The process starts
at some vertex x0 and evolves: if at some time it is at vertex x, its next move is to a randomly chosen neighbor
of x, so that the probability of moving to y is p(x, y). We have to say something about how these random
moves are made; in particular their dependence on one another. Intuitively we think the move from x should
be independent of all previous moves, but that is not quite right, since the previous moves determined the
vertex x where we are now sitting, which in turn determines the possible vertices for the next move (and
their respective probabilities). So this intuitive idea takes a little more work to describe formally.
We can describe it this way:
Definition 16.1. (Preliminary) Fix a filtered probability space. An adapted stochastic process {X_n}_{n≥0} with
state space (S, S) is a (discrete-time, time-homogeneous) Markov chain if, for each n and each measurable
B ∈ S, the conditional probability P(X_{n+1} ∈ B | F_n) only depends on B and X_n. So if we know X_n, the state
of the process at time n, then we cannot improve our estimate of P(X_{n+1} ∈ B) using further information from
F_n (which includes older history about the process).
In other words, there should exist a function p : S × S → [0, 1] so that P(X_{n+1} ∈ B | F_n) = p(X_n, B).
This function should have the properties that:
1. For each B ∈ S, x ↦ p(x, B) is a measurable function on S.
2. For each x ∈ S, B ↦ p(x, B) is a probability measure on (S, S).
p is called the transition function or transition probability of the chain {Xn }. In words, if the process is in
state x at some time, then p(x, B) is the probability that it will be in B at the next step.
When S is countable, any probability measure can be defined by a mass function. So we could instead
think of p : S × S → [0, 1] being jointly measurable and satisfying Σ_{y∈S} p(x, y) = 1 for every x, and let
p(x, B) = Σ_{y∈B} p(x, y). Then if the process is at state x, p(x, y) is the probability that its next step will be to
state y.
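As a concrete (and purely illustrative) sketch of this countable-state picture, one can store the mass functions p(x, ·) and simulate the chain directly; the particular three-state chain below is made up for the example.

import random

# p[x][y] = p(x, y); each row sums to 1
p = {
    'a': {'a': 0.5, 'b': 0.5},
    'b': {'a': 0.25, 'b': 0.25, 'c': 0.5},
    'c': {'a': 1.0},
}

def step(x):
    # choose the next state y with probability p(x, y)
    ys = list(p[x])
    return random.choices(ys, weights=[p[x][y] for y in ys])[0]

def run_chain(x0, n_steps):
    # simulate X_0 = x0, X_1, ..., X_{n_steps}
    path = [x0]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path

print(run_chain('a', 20))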
Corollary 16.2. For bounded measurable f : S → R, E[f(X_{n+1}) | F_n] = ∫_S f(y) p(X_n, dy).
If p is allowed to depend on n as well, then {X_n} is a time-inhomogeneous Markov chain: the probability of transitioning from x to y may change over time. These models are considerably more annoying to
deal with, so we will not consider them.
If S is finite and we identify it with {1, . . . , N}, we can think of p as an N × N matrix whose (i, j) entry
is p(i, j), the probability of transitioning from i to j.
There is another technicality to deal with. Normally a stochastic process in S would just be a sequence
of S-valued random variables X_0, X_1, . . . . However, we would like to be able to consider arbitrary starting
points x_0 ∈ S. One option would be to have a whole family of processes, indexed by S, like {X_0^x, X_1^x, . . . }_{x∈S}
where X_0^x = x. However a convention that turns out to work better, though it looks weird at the outset, is
to use a single process but vary the probability measure P. Thus our starting point is a measurable space
(Ω, F) (usually with a filtration {F_n}), a sequence X_0, X_1, . . . of measurable maps X_n : Ω → S, and a family
of probability measures {P_x}_{x∈S}, having the property that P_x(X_0 = x) = 1. You can read P_x as "probability
when started at x"; so P_x(X_1 = y) is the probability that the process, when started at x, visits y at the next
step. We will use E_x to denote expectation (or conditional expectation) with respect to the measure P_x.
If I write P or E without a subscript (which I will try to avoid), it normally means that the statement
holds for every P_x. Thus P(X_n → z) = 1 means: for every x ∈ S, P_x(X_n → z) = 1. That is, no matter where
the process starts, it converges to z almost surely. Likewise, I will try to qualify almost sure statements by
writing P_x-a.s., but if I just say a.s. it means P_x-a.s. for every x ∈ S.
An alternative approach is to use a single probability measure P with the property that P(X_0 = x) > 0 for
every x ∈ S, and think of P_x(A) as the conditional probability P(A | X_0 = x). This works okay for countable
S, but it breaks down when S is uncountable, since P(X_0 = x) is necessarily zero for all but countably many
x ∈ S, so you find yourself conditioning on events of probability zero.
So here's our full definition. It is different from Durrett's, but we will see that they are equivalent.
Definition 16.3. Suppose:
1. (S , S) is a measurable space;
2. (Ω, F) is a measurable space;
3. {F_n}_{n≥0} is a filtration on (Ω, F);
4. {P_x}_{x∈S} is a family of probability measures on (Ω, F), such that for each A ∈ F, the function x ↦ P_x(A) is measurable;
5. {X_n}_{n≥0} is a sequence of measurable functions from Ω to S which is adapted (X_n ∈ F_n);
6. For each x ∈ S, P_x(X_0 = x) = 1.
7. For each x ∈ S, each n ≥ 0, and each B ∈ S, the conditional probability P_x(X_{n+1} ∈ B | F_n) depends only on X_n and B, through a single function p satisfying properties 1 and 2 above: P_x(X_{n+1} ∈ B | F_n) = p(X_n, B), P_x-a.s.
Then we call {X_n} (together with the filtration and the measures P_x) a Markov chain.
In this case, we can identify the transition function p as p(x, B) = P_x(X_1 ∈ B).
Lemma 16.4. If f_0, . . . , f_k : S → R are bounded measurable functions, then

E_x[f_0(X_n) f_1(X_{n+1}) · · · f_k(X_{n+k}) | F_n] = ∫ · · · ∫ f_k(y_k) p(y_{k−1}, dy_k) f_{k−1}(y_{k−1}) p(y_{k−2}, dy_{k−1}) · · · f_1(y_1) p(X_n, dy_1) f_0(X_n).
Proof. The base case k = 0 is trivial since it just reads E_x[f_0(X_n) | F_n] = f_0(X_n), which holds because X_n ∈ F_n.
Now suppose it holds for k ≥ 0. We have, by the tower property and Corollary 16.2,

E_x[f_0(X_n) · · · f_k(X_{n+k}) f_{k+1}(X_{n+k+1}) | F_n] = E_x[f_0(X_n) · · · f_k(X_{n+k}) E_x[f_{k+1}(X_{n+k+1}) | F_{n+k}] | F_n]
  = E_x[f_0(X_n) · · · f_k(X_{n+k}) ∫ f_{k+1}(y_{k+1}) p(X_{n+k}, dy_{k+1}) | F_n]
  = ∫ · · · ∫ f_{k+1}(y_{k+1}) p(y_k, dy_{k+1}) f_k(y_k) p(y_{k−1}, dy_k) · · · f_1(y_1) p(X_n, dy_1) f_0(X_n),

applying the induction hypothesis with f_k(y_k) replaced by f_k(y_k) ∫ f_{k+1}(y_{k+1}) p(y_k, dy_{k+1}), which is again a bounded
measurable function of y_k.
Using this, we can prove a key property of a Markov process: that it "starts fresh" at every step. It
doesn't remember its history, so if at time n you observe that the process is in some state x, from then on it
behaves exactly like a new process which was started at x.
It is a bit challenging to state this precisely, in its full generality. Durrett does it by assuming that the
underlying probability space is the infinite product space S^N and defining shift operators on this space. To
me this feels weird, so I'm taking the following approach, which makes the use of the sequence space more
explicit.
Let S^N be the product of countably many copies of S, equipped with the infinite product σ-field
generated by the cylinder sets B_0 × · · · × B_n × S × S × · · · . The elements of S^N are sequences z = (z(0), z(1), . . . ) of
elements of S. (Here we are abusing notation to think of N as starting with 0.)
For each x ∈ S, we let ν_x be the probability measure on S^N which is the law of the Markov chain {X_n}
started at x; i.e. ν_x(B) = P_x((X_0, X_1, . . . ) ∈ B). Observe that ν_x puts all its mass on those sequences whose
0th term is x. Note that for each measurable B ⊆ S^N, the map x ↦ ν_x(B) is measurable.
Theorem 16.5 (Markov property). For any measurable B ⊆ S^N, any x ∈ S, and any n ≥ 0, we have

P_x((X_n, X_{n+1}, . . . ) ∈ B | F_n) = ν_{X_n}(B),   P_x-a.s.   (13)
So if you have a question to ask about the behavior of the process after time n, and you know its location
X_n at time n, then the answer is the same as for a brand new process whose starting point is X_n. More
precisely, conditionally on F_n, the law of {X_n, X_{n+1}, . . . } is the same as that of {X_0, X_1, . . . } under P_{X_n}.
Proof. Fix x ∈ S and n ≥ 0. Let L be the collection of all B ⊆ S^N such that (13) holds. It's easy to verify that L
is a λ-system. Let P be the set of all B = B_0 × · · · × B_k × S × · · · with B_0, . . . , B_k ∈ S measurable. This is a
π-system which generates the product σ-field, so by the π-λ theorem it suffices to show P ⊆ L. For such a B,
Lemma 16.4 with f_i = 1_{B_i} gives

P_x((X_n, X_{n+1}, . . . ) ∈ B | F_n) = E_x[1_{B_0}(X_n) · · · 1_{B_k}(X_{n+k}) | F_n]
  = ∫ · · · ∫ 1_{B_k}(y_k) p(y_{k−1}, dy_k) 1_{B_{k−1}}(y_{k−1}) p(y_{k−2}, dy_{k−1}) · · · 1_{B_1}(y_1) p(X_n, dy_1) 1_{B_0}(X_n).   (15)

On the other hand, for any y ∈ S, the same computation under P_y (with n = 0) gives

ν_y(B) = E_y[1_{B_0}(X_0) · · · 1_{B_k}(X_k)]
  = ∫ · · · ∫ 1_{B_k}(y_k) p(y_{k−1}, dy_k) 1_{B_{k−1}}(y_{k−1}) p(y_{k−2}, dy_{k−1}) · · · 1_{B_1}(y_1) p(y, dy_1) 1_{B_0}(y).   (17)

Substituting X_n for y, we see this is the right side of (15), and we have established (13).
Corollary 16.6. The law ν_x of {X_n} is completely determined by the transition function p.
Proof. In (17) we showed that ν_x(B) can be written in terms of x and p for every B in a π-system that
generates the product σ-field.
Corollary 16.7 (Multistep transition probabilities). Define p^k recursively (with p^1 = p) by p^{k+1}(x, B) = ∫_S p(y, B) p^k(x, dy).
(In the countable case, we can say p^{k+1}(x, z) = Σ_{y∈S} p(x, y) p^k(y, z); in the finite case, note this is just matrix
multiplication.) Then P(X_{n+k} ∈ B | F_n) = p^k(X_n, B), and E[f(X_{n+k}) | F_n] = ∫ f(y) p^k(X_n, dy). In particular
(with n = 0 and taking expectations), P_x(X_k ∈ B) = p^k(x, B).
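In the finite case this is literally a matrix power. A tiny numerical illustration (the 2-state matrix is arbitrary, and numpy is assumed to be available):

import numpy as np

P = np.array([[0.9, 0.1],      # row i is the distribution of the next state given the current state i
              [0.4, 0.6]])

P5 = np.linalg.matrix_power(P, 5)   # p^5(i, j) = P_i(X_5 = j)
print(P5)
print(P5.sum(axis=1))               # each row of p^5 is still a probability distribution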
Corollary 16.8. For any bounded measurable f : S^N → R, we have

E_x[f(X_n, X_{n+1}, . . . ) | F_n] = ∫_{S^N} f dν_{X_n}.
We've just shown (in fancy language) that the map from Markov chains (up to distribution) to transition
functions is one-to-one. The next statement says it is also onto. So all you have to do is write down a
transition function, and you know that the corresponding Markov chain exists.
Theorem 16.9. Given any transition function p on a standard Borel space (S, S), there is a Markov chain
(Ω, F, {F_n}, {P_x}, {X_n}) whose transition function is p.
Proof. Let Ω = S^N (here again N = {0, 1, 2, . . . }) with its product σ-field, and let X_n : S^N → S be the
coordinate maps. Set F_n = σ(X_0, X_1, . . . , X_n). It remains to produce the measures {P_x} and verify that they
make {X_n} into a Markov chain with transition function p.
Fix x ∈ S. For measurable A ⊆ S^{n+1}, define

μ_n(A) = ∫ · · · ∫ 1_A(x, x_1, . . . , x_n) p(x_{n−1}, dx_n) · · · p(x, dx_1)

(where μ_0 = δ_x). This is a consistent family of measures (verify). Let P_x be the measure on S^N produced by
the Kolmogorov extension theorem. We have P_x(X_0 = x) = μ_0({x}) = 1.
Now we have to check that P_x(X_{n+1} ∈ B | F_n) = p(X_n, B); we use the uniqueness of conditional
expectation. p(X_n, B) is clearly F_n-measurable. Let A ∈ F_n; since F_n = σ(X_0, . . . , X_n) we have A =
{(X_0, . . . , X_n) ∈ C} for some measurable C ⊆ S^{n+1}. By definition of P_x we have

E_x[1_B(X_{n+1}) 1_A] = E_x[1_{C×B}(X_0, . . . , X_{n+1})]
  = μ_{n+1}(C × B)
  = ∫ · · · ∫ 1_C(x, x_1, . . . , x_n) 1_B(x_{n+1}) p(x_n, dx_{n+1}) · · · p(x, dx_1)
  = ∫ · · · ∫ 1_C(x, x_1, . . . , x_n) p(x_n, B) · · · p(x, dx_1)
  = ∫ 1_C(x, x_1, . . . , x_n) p(x_n, B) dμ_n
  = E_x[1_C(X_0, . . . , X_n) p(X_n, B)]
  = E_x[1_A p(X_n, B)].

Repeating this for every x ∈ S, we have a Markov chain with the desired properties.
To check that x ↦ P_x(A) is measurable, note that it holds for all cylinder sets A, and use π-λ.
The following "strong Markov property" says that the Markov property also holds at stopping times.
After time τ, the process behaves like a fresh process whose starting point was X_τ, and any other history
information from before time τ is irrelevant.
Theorem 16.10 (Strong Markov property). Let τ be a stopping time. Then for every x ∈ S and every measurable B ⊆ S^N
we have

P_x(τ < ∞, (X_τ, X_{τ+1}, . . . ) ∈ B | F_τ) = 1_{τ<∞} ν_{X_τ}(B),   P_x-a.s.   (18)

The {τ < ∞} event is inserted to ensure that we are only writing X_τ for outcomes where it makes sense.
Usually we apply it to stopping times with τ < ∞ almost surely, in which case this is redundant.
Proof. It suffices to show that for any n, we have

P_x(τ = n, (X_τ, X_{τ+1}, . . . ) ∈ B | F_τ) = 1_{τ=n} ν_{X_τ}(B),

since then we simply have to sum over all finite n. The right side is F_τ-measurable (since τ and X_τ both are).
Let A ∈ F_τ, so that A ∩ {τ = n} ∈ F_n. Then

E_x[1_A 1_{τ=n} 1_B(X_τ, X_{τ+1}, . . . )] = E_x[1_A 1_{τ=n} 1_B(X_n, X_{n+1}, . . . )]
  = E_x[E_x[1_A 1_{τ=n} 1_B(X_n, X_{n+1}, . . . ) | F_n]]
  = E_x[1_A 1_{τ=n} E_x[1_B(X_n, X_{n+1}, . . . ) | F_n]]
  = E_x[1_A 1_{τ=n} ν_{X_n}(B)]
  = E_x[1_A 1_{τ=n} ν_{X_τ}(B)].
This proof was pretty easy; we just had to consider all the possible values for τ, of which there are
countably many because we are working in discrete time. In continuous time this gets harder, and in fact
there can be processes that satisfy the Markov property but not the strong Markov property. The strong Markov property
is really the most useful one, and hence in continuous time one typically only studies processes that satisfy it.
17
Example 17.1 (Simple random walk). Take S = Z and p(x, x+1) = p(x, x−1) = 1/2, p(x, y) = 0 otherwise.
(We will verify later that ξ_n := X_n − X_{n−1} are iid, or possibly in homework.)
Example 17.2 (General random walk). Take S = R, any probability measure μ on R, and p(x, B) = μ(B − x).
Example 17.3 (Branching process). If μ is a probability measure on {0, 1, 2, . . . }, then the branching process
with offspring distribution μ is a Markov chain with transition function p(n, B) = μ^{∗n}(B), the n-fold convolution. Less fancifully,
p(n, m) = P(ξ_1 + · · · + ξ_n = m) where the ξ_i are iid with distribution μ.
Example 17.4 (Random walk on a graph). G = (V, E) is a countable, locally finite graph; p(x, y) = 1/d(x) if y is adjacent to x, and 0 otherwise, where d(x) is the degree of x.
(So the next step from x is to a uniformly chosen neighbor of x.)
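Two of these transition mechanisms are easy to write down explicitly in code; the little graph below is an arbitrary example, not anything canonical.

import random

def srw_step(x):
    # Example 17.1: from x, move to x - 1 or x + 1 with probability 1/2 each
    return x + random.choice([-1, 1])

# Example 17.4: adjacency lists of a small locally finite graph
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def graph_step(x):
    # p(x, y) = 1/d(x) for each neighbor y of x
    return random.choice(adj[x])

print([srw_step(0) for _ in range(5)], [graph_step(0) for _ in range(5)])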
18
In particular, if P_x(T_1 < ∞) = 1, i.e. if we start at x we are guaranteed to return at least once, then
T_k < ∞ for all k (P_x-a.s.), so we are guaranteed to visit x infinitely many times. In this case we say x is a
recurrent state. On the other hand, if P_x(T_1 < ∞) < 1, i.e. there is a positive probability we will never
return, then we have P_x(T_k < ∞) = P_x(T_1 < ∞)^k → 0 as k → ∞,
so we are guaranteed not to visit x infinitely many times. In this case we say x is transient. Note this is a sort
of zero-one law: the probability of infinitely many returns to x is either 0 or 1.
Recurrent states are the ones we should study if we want to learn about long-term behavior of a Markov
chain. In classifying states as transient or recurrent, some connectivity properties are important.
Notation 18.2. For y ∈ S, let τ_y = inf{n ≥ 1 : X_n = y}. For x, y ∈ S, let ρ_{xy} = P_x(τ_y < ∞) be the probability
that, when starting at x, we eventually reach y. (Note that ρ_{xx} is the probability of a return to x; the visit at
time 0 doesn't count.) If ρ_{xy} > 0 we write x → y; this means it is possible to get from x to y.
We show that → is basically the reachability relation in the transition graph of the chain, i.e.
the transitive closure of the edge relation.
Lemma 18.3. If p(x, y) > 0 then x → y.
Proof. Obvious, since p(x, y) = P_x(X_1 = y) = P_x(τ_y = 1) ≤ P_x(τ_y < ∞).
Lemma 18.4. If x → y and y → z, then x → z.
Proof. Homework?
Lemma 18.5. If S is countable and x → y, there is a path x = x_0, x_1, . . . , x_m = y with p(x_i, x_{i+1}) > 0 for each i.
Proof. If ρ_{xy} = P_x(τ_y < ∞) > 0, then there exists some m with P_x(τ_y ≤ m) > 0, i.e. it is possible to get from
x to y in at most m steps. There are countably many sequences x = x_0, . . . , x_m = y, so it must be that for one
of them we have P_x(X_0 = x_0, . . . , X_m = x_m) > 0. Unwinding the notation in Lemma 16.4 shows that
P_x(X_0 = x_0, . . . , X_m = x_m) = p(x_0, x_1) · · · p(x_{m−1}, x_m), so all these factors must be positive.
This can fail if S is uncountable. Consider a chain on state space {a} ∪ [0, 1] ∪ {b} with p(a, ·) uniform
on [0, 1], and p(x, {b}) = 1 for x ∈ [0, 1] or x = b. We have p(a, x) = 0 for every single point x, but ρ_{ab} = 1.
Recurrence is contagious.
Theorem 18.6. If x is recurrent and x → y, then ρ_{xy} = ρ_{yx} = 1 and y is recurrent.
Proof. Let A be the event that τ_y < ∞ and the chain visits x at some time after τ_y. On the one hand, since
the chain must visit x infinitely many times if it starts there, we must have P_x(A) = P_x(τ_y < ∞) = ρ_{xy}. On
the other hand, we can use the strong Markov property: if B ⊆ S^N is the set of sequences that contain at
least one x, then A = {τ_y < ∞} ∩ {(X_{τ_y}, X_{τ_y+1}, . . . ) ∈ B}. So strong Markov says

P_x(A) = E_x[P_x(τ_y < ∞, (X_{τ_y}, X_{τ_y+1}, . . . ) ∈ B | F_{τ_y})]
  = E_x[1_{τ_y<∞} ν_{X_{τ_y}}(B)]
  = E_x[1_{τ_y<∞} ν_y(B)]
  = ν_y(B) E_x[1_{τ_y<∞}]
  = P_y(τ_x < ∞) P_x(τ_y < ∞) = ρ_{yx} ρ_{xy}.

Thus we have shown ρ_{xy} = ρ_{yx} ρ_{xy}. Since ρ_{xy} > 0, we must have ρ_{yx} = 1.
Let T_k = inf{n > T_{k−1} : X_n = x} be the time of the kth return to x, as above (with T_0 = 0); by recurrence
we have T_k < ∞ for all k, P_x-a.s. Let A_k = {there is an n ∈ (T_k, T_{k+1}) with X_n = y} be the event that the chain visits y
between the kth and (k+1)th visits to x. Then if B ⊆ S^N is the set of all sequences that start with z(0) = x
and contain a y before the next x, we have A_k = {(X_{T_k}, X_{T_k+1}, . . . ) ∈ B}. By strong Markov,

P_x(A_k | F_{T_k}) = ν_{X_{T_k}}(B) = ν_x(B) = P_x(τ_y < τ_x).

The right side is deterministic, so we have shown that A_k is independent of F_{T_k} (under P_x). Since clearly
A_1, . . . , A_{k−1} ∈ F_{T_k}, we have shown that the events A_i are independent. Moreover, they all have probability
P_x(τ_y < τ_x), which is positive: every visit to y falls in some interval (T_k, T_{k+1}) (including (T_0, T_1) = (0, T_1)),
so if this probability were 0 we would have ρ_{xy} = 0. So Borel-Cantelli tells us that P_x(A_k i.o.) = 1, i.e. if we start at x, we
almost surely make infinitely many visits to y. In particular, we almost surely make at least one, so ρ_{xy} = 1.
Finally, we have ρ_{yy} ≥ ρ_{yx} ρ_{xy} = 1, so y is recurrent.
Definition 18.7. A chain is irreducible if for every x, y ∈ S we have x → y. That is, from any state we can get
to any other state.
For a countable chain, this happens iff the transition digraph is strongly connected.
Corollary 18.8. If {X_n} is irreducible, then either every state is recurrent or every state is transient.
For finite irreducible chains, only one of these is possible.
Proposition 18.9. If S is finite then there exists a recurrent state.
Proof. (Sketch) Let N_y be the total number of visits to y. Suppose to the contrary that every state is transient, so
for each y we have P_y(N_y = ∞) = 0. Fix x ∈ S, and use the strong Markov property to show that for every
y ∈ S, we have P_x(N_y = ∞) = 0. Use a pigeonhole argument to get a contradiction (if S is finite, some state must be visited infinitely often).
For an example of an (infinite-state) irreducible chain with every state transient, consider asymmetric
simple random walk on Z. Another classic example is simple random walk on Z^d with d ≥ 3, though we have not
proved this yet.
19 Stationary distributions
We already have a mechanism for starting a chain at any given state x thanks to our family of probability
measures P x . We could also start the chain at a state chosen randomly according to any given distribution.
Definition 19.1. Let μ be a probability measure on (S, S), and define a probability measure P_μ on (Ω, F) by

P_μ(A) = ∫_S P_x(A) μ(dx),   A ∈ F.
Proposition 19.2.
1. P_μ is a probability measure on (Ω, F);
2. Under P_μ, X_0 ~ μ;
3. Under P_μ, {X_n} is still a Markov chain with transition function p, and the strong Markov property still holds;
4. If S is countable, then P_μ(X_n = y) = Σ_{x∈S} p^n(x, y) μ(x). If we think of p as a matrix, we should think of
μ as a row vector; then the distribution of X_n is the row vector μ p^n.
Definition 19.3. A probability measure π on S is a stationary measure for {X_n} if, under P_π, we have X_n ~ π
for every n. (In the countable case, this means that π, viewed as a row vector, satisfies π p = π, i.e. π is a left eigenvector of p.)
Proposition 19.4. It suffices to verify the above with n = 1; if X_1 ~ π under P_π, then π is stationary.
Proof. We show, by induction on n, that X_n ~ π for every n. By assumption this holds for n = 1. Let us
write this another way: we have

π(B) = P_π(X_1 ∈ B) = E_π[P_π(X_1 ∈ B | F_0)] = E_π[p(X_0, B)] = ∫_S p(x, B) π(dx).

Now suppose X_n ~ π. Then by the same computation, P_π(X_{n+1} ∈ B) = E_π[p(X_n, B)] = ∫_S p(x, B) π(dx) = π(B), so X_{n+1} ~ π as well.
Example 19.5. Let X_n be random walk on a finite graph G = (V, E) (Example 17.4). Then π(x) = d(x)/(2|E|) is a stationary distribution: it is a probability measure since Σ_x d(x) = 2|E|, and for each y,

Σ_x π(x) p(x, y) = Σ_{x∼y} (d(x)/(2|E|)) (1/d(x)) = Σ_{x∼y} 1/(2|E|) = d(y)/(2|E|) = π(y).
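A quick numerical sanity check of this computation, on an arbitrary small graph (numpy assumed; the graph itself is just an example):

import numpy as np

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}     # a small connected graph
n = len(adj)
P = np.zeros((n, n))
for x, nbrs in adj.items():
    for y in nbrs:
        P[x, y] = 1 / len(nbrs)                        # p(x, y) = 1/d(x)

E = sum(len(nbrs) for nbrs in adj.values()) / 2        # number of edges |E|
pi = np.array([len(adj[x]) / (2 * E) for x in range(n)])   # pi(x) = d(x)/(2|E|)

print(np.allclose(pi @ P, pi), pi.sum())               # pi P = pi, and pi sums to 1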
Example 19.6. If Xn is random walk on a finite regular graph, then the uniform measure on G is stationary.
Example 19.7. Simple random walk on Z has no stationary distribution. If it did, we would have π(x) = (1/2)(π(x−1) + π(x+1)). Rearranging, π(x+1) − π(x) = π(x) − π(x−1). By induction, π(x+1) − π(x) = π(1) − π(0) =: C
for every x, i.e. π(x) = π(0) + Cx. If C ≠ 0 this cannot be a positive measure, since π(x) is negative for
appropriate x. If C = 0 then it is a uniform measure on Z, which cannot be a probability measure.
Example 19.8. Asymmetric reflecting random walk. You'll compute a stationary distribution in your homework.
Proposition 19.9. If X_n has a stationary distribution π, and π(x) > 0, then x is recurrent. In particular, if X_n
has a stationary distribution, then it has at least one recurrent state.
Proof. On the one hand, we have

P_π(N_x = ∞) = P_π(lim sup_n {X_n = x}) ≥ lim sup_n P_π(X_n = x) = π(x) > 0,

as you showed in an early homework. On the other hand, let B be the set of all sequences containing
infinitely many x's. Then by the strong Markov property

P_π(N_x = ∞) = P_π(τ_x < ∞, (X_{τ_x}, X_{τ_x+1}, . . . ) ∈ B)
  = P_π(τ_x < ∞) ν_x(B)
  = P_π(τ_x < ∞) P_x(N_x = ∞).

So we must have P_x(N_x = ∞) > 0, which means x is recurrent.
Note symmetric simple random walk is a Markov chain where every state is recurrent but no stationary
distribution exists. So recurrence is necessary but not sufficient. In fact the right condition is:
Definition 19.10. A state x in a Markov chain is positive recurrent if E_x τ_x < ∞. That is, not only do you
return to x, but the average time to do so is finite. If x is recurrent but not positive recurrent, we say it is null
recurrent.
For simple random walk on Z, every state is null recurrent, since we showed using martingales that E_x τ_y = ∞
for any x ≠ y; conditioning on the first step then gives E_x τ_x = ∞ as well.
We collect some related results, but won't give the proofs here. Consult Durrett.
Proposition 19.11. If x is positive recurrent and x → y, then y is positive recurrent.
Corollary 19.12. If X_n is irreducible and has one positive recurrent state, then every state is positive recurrent.
Proposition 19.13. If X_n has a positive recurrent state, then it has at least one stationary distribution.
Proposition 19.14. If π is a stationary distribution and π(x) > 0, then x is positive recurrent.
So positive recurrence is a necessary and sufficient condition for the existence of a stationary distribution.
We could explore this further but won't; most of the time, if a stationary distribution exists, it is obvious or
easy to compute directly what it is.
We are heading for a result that says that (under appropriate conditions) a Markov chain converges
weakly to its stationary distribution π. That is, if you start the chain in some arbitrary state, let it run for a
long time, and then look at where it is, the state you see looks a lot like a sample drawn from the stationary
distribution π.
Let's figure out what some of these "appropriate conditions" are by looking at some examples where
this fails to hold.
One obvious issue is that if there is more than one stationary distribution, which one we converge to
might depend on the starting point.
Example 19.15. Consider a chain on two states a, b, with p(a, a) = p(b, b) = 1 (i.e. we never move). Then
δ_a and δ_b are both stationary. (In fact, every probability measure on {a, b} is stationary; the transition matrix
is the identity matrix.) Under P_a, we have X_n ~ δ_a for every n; under P_b we have X_n ~ δ_b.
Evidently, the problem is that we have several subsets of S that don't communicate with one another. In
some sense we should break the state space up and consider it as two separate chains.
As we shall see, irreducibility is a sufficient condition to rule this out. In an irreducible
chain, if a stationary distribution exists, it is unique.
Another obstruction to convergence is periodicity.
Example 19.16. Again take a chain with two states a, b, now with p(a, b) = p(b, a) = 1 (so we flip-flop back and
forth). It is not hard to see that the unique stationary distribution is the uniform distribution π(a) = π(b) =
1/2. However, X_n does not converge weakly to π if started at a; its distribution alternates between δ_a and δ_b.
This kind of behavior rarely happens except in contrived examples. (But note that it also happens in
random walk on Z: if we start at 0, say, then we can only visit odd-numbered states at odd-numbered times,
and even-numbered states at even-numbered times. However, random walk on Z doesn't have a stationary
distribution anyway.) We could study it further, but instead we'll just discuss how to rule it out.
Definition 19.17. A state x ∈ S is aperiodic if there exists a number r_x such that for all n ≥ r_x we have
p^n(x, x) > 0. (So after waiting long enough, there are no obstructions to returning to x at any given time.)
The chain X_n is aperiodic if every state is aperiodic.
Notice that the above example fails this property: we have p^n(a, a) = 0 for every odd n.
Here are some useful properties for verifying aperiodicity. I don't know if I'll go into the proofs.
Proposition 19.18. If p(x, x) > 0 then x is aperiodic.
Proof. Obvious: you could sit at x for n steps with probability p(x, x)^n, so p^n(x, x) ≥ p(x, x)^n > 0.
Proposition. If there exist numbers n_1, . . . , n_m whose greatest common divisor is 1 and with p^{n_i}(x, x) > 0 for all i,
then x is aperiodic.
Proof. Note that we have p^n(x, x) > 0 for all n of the form n = a_1 n_1 + · · · + a_m n_m with a_i ≥ 0, since
p^n(x, x) ≥ (p^{n_1}(x, x))^{a_1} · · · (p^{n_m}(x, x))^{a_m} > 0; i.e. we could return to x in n_1 steps and repeat this a_1 times,
etc. Now combine this with the elementary number theory fact that if gcd(n_1, . . . , n_m) = 1, then
any sufficiently large integer can be written in this form. (The Euclidean algorithm says any integer can be
written in this form if negative coefficients are allowed.)
Proposition 19.19. If x is aperiodic and y → x → y, then y is aperiodic. In particular, in an irreducible
chain, if any state is aperiodic then the chain is aperiodic.
Proof. Since y → x → y we have p^s(y, x) > 0 and p^t(x, y) > 0 for some s, t. We also have p^n(x, x) > 0 for
all n ≥ r_x. So for all n ≥ s + t + r_x we have p^n(y, y) ≥ p^s(y, x) p^{n−s−t}(x, x) p^t(x, y) > 0.
Example 19.20. Random walk on an odd cycle is aperiodic. Random walk on an even cycle is not. Reflecting random walk is aperiodic (since p(0, 0) > 0).
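One way to see the odd/even cycle dichotomy numerically is to look at the set {n : p^n(x, x) > 0} and take its gcd (this gcd is usually called the period of x); the code below is only a sketch and assumes numpy is available.

import numpy as np
from math import gcd
from functools import reduce

def cycle_matrix(m):
    # transition matrix of simple random walk on the m-cycle
    P = np.zeros((m, m))
    for i in range(m):
        P[i, (i - 1) % m] = P[i, (i + 1) % m] = 0.5
    return P

def period(P, x=0, N=60):
    # gcd of the return times n <= N with p^n(x, x) > 0
    returns = [n for n in range(1, N + 1) if np.linalg.matrix_power(P, n)[x, x] > 0]
    return reduce(gcd, returns)

print(period(cycle_matrix(5)), period(cycle_matrix(6)))   # odd cycle: 1 (aperiodic), even cycle: 2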
So here is the main convergence theorem for countable Markov chains.
Theorem 19.21. Let X_n be a Markov chain on a countable state space S. Suppose that X_n is irreducible,
aperiodic, and has a stationary distribution π. Then for every x ∈ S, we have P_x(X_n = z) → π(z) for all z
(in fact, uniformly in z). In particular, under any P_x, we have X_n → π weakly (in fact, in total variation).
Proof. The idea of the proof is a technique known as coupling. We will run two independent copies of
the chain, one (call it X_n) started at the fixed state x, and the other (Y_n) started at the stationary distribution
π. Then we will run them until they meet; the hypotheses will guarantee that this happens. After the meeting time,
since they have both been transitioning by the same rules from the same point (the place where they met),
they must have the same distribution.
Note first that the theorem can be expressed solely in terms of the transition function p: the conclusion
is that p^n(x, z) → π(z). So the specific random variables involved are unimportant; we can run the proof
using any Markov chain with this transition function.
Let (X_n, Y_n) be two independent copies of the original chain; so (X_n, Y_n) is a Markov chain on the state
space S × S whose transition function is p̄((x, y), (x', y')) = p(x, x') p(y, y'). In particular, X_n by itself still has transition function p:

Σ_{y'∈S} p̄((X_n, Y_n), (x', y')) = Σ_{y'∈S} p(X_n, x') p(Y_n, y') = p(X_n, x') Σ_{y'∈S} p(Y_n, y') = p(X_n, x'),

and likewise for Y_n. Moreover, the product measure π × π is stationary for the product chain:

Σ_{x∈S} Σ_{y∈S} π(x) π(y) p̄((x, y), (x', y')) = (Σ_x π(x) p(x, x')) (Σ_y π(y) p(y, y')) = π(x') π(y').

Aperiodicity is what makes the product chain irreducible: given (x, y) and (x', y'), irreducibility gives paths from x to x' and from y to y', and aperiodicity lets us pad them to a common length n, so that p̄^n((x, y), (x', y')) ≥ p^n(x, x') p^n(y, y') > 0. So by Proposition 19.9 and Theorem 18.6, the product chain is recurrent, and started from any state it almost surely hits the diagonal (indeed it almost surely hits (z, z) for any fixed z). Start the product chain with X_0 = x and Y_0 ~ π, independent of each other, and let τ = inf{n ≥ 0 : X_n = Y_n} be the meeting time; the above shows τ < ∞ almost surely.
For k ≤ n, on the event {τ = k} we have X_k = Y_k, so, using the Markov property and {τ = k} ∈ F_k,

P(X_n = z, τ = k) = E[1_{τ=k} p^{n−k}(X_k, z)] = E[1_{τ=k} p^{n−k}(Y_k, z)] = P(Y_n = z, τ = k).   (19)

Summing over k ≤ n,

P(X_n = z, τ ≤ n) = P(Y_n = z, τ ≤ n).   (20)

Rearranging,

P(X_n = z) − P(Y_n = z) = P(X_n = z, τ > n) − P(Y_n = z, τ > n).

But P(Y_n = z) = π(z) for all n, since Y_n starts in the stationary distribution and hence keeps that distribution.
Rearranging some more,

|P(X_n = z) − π(z)| = |P(X_n = z, τ > n) − P(Y_n = z, τ > n)| ≤ P(X_n = z, τ > n) + P(Y_n = z, τ > n) ≤ 2 P(τ > n).   (21)

Since τ < ∞ almost surely, P(τ > n) → 0; note that the bound does not depend on z, which gives the claimed uniformity.
Remark 19.22. Note that if we can estimate P(τ > n) for a particular chain, then we will have a statement
about the rate of convergence.
Remark 19.23. There was nothing magical about choosing X_n, Y_n to be independent copies of the chain;
it was just a construction that made it convenient to verify that the meeting time τ was finite. We could
consider any other coupling, i.e. any adapted process (X_n, Y_n) on S × S such that X_n and Y_n are individually
Markov chains on S with transition function p; we just have to be able to show that they meet almost surely.
We don't even need (X_n, Y_n) to be a Markov chain. It may be that some other coupling could give us a better
bound on P(τ > n) and thus a better bound on the rate of convergence.
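Here is a rough sketch of the coupling in action, for an arbitrary small chain (nothing below is canonical, and the draw from π is itself only approximated by a long burn-in run of the chain). Recording the empirical tail of the meeting time τ gives, via (21), a numerical bound on the distance to stationarity.

import random

p = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.2, 2: 0.5}, 2: {1: 0.6, 2: 0.4}}   # irreducible, aperiodic

def step(x):
    ys = list(p[x])
    return random.choices(ys, weights=[p[x][y] for y in ys])[0]

def approx_pi_draw(burn_in=200):
    # cheap approximate draw from the stationary distribution: run the chain a long time
    x = 0
    for _ in range(burn_in):
        x = step(x)
    return x

def meeting_time(x0, cap=10_000):
    X, Y = x0, approx_pi_draw()
    for n in range(cap):
        if X == Y:
            return n
        X, Y = step(X), step(Y)   # two independent copies
    return cap

taus = [meeting_time(0) for _ in range(1000)]
for n in (1, 5, 10, 20):
    print(n, 2 * sum(t > n for t in taus) / len(taus))   # 2 P(tau > n) bounds |P_0(X_n = z) - pi(z)|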
Corollary 19.24. If Xn is an irreducible Markov chain on a countable state space then it has at most one
stationary distribution.
Proof. Suppose it has two; call them π and π'. Suppose first that X_n is aperiodic. Fix any x ∈ S; by the above
theorem, we have P_x(X_n = y) → π(y) for all y. But by the same logic, we also have P_x(X_n = y) → π'(y).
Hence π = π'.
To dispose of the aperiodicity assumption, let p be the transition function of X_n, and define a new chain X̄_n with transition function

p̄(x, y) = 1/2 + (1/2) p(x, x)   if x = y,
p̄(x, y) = (1/2) p(x, y)           if x ≠ y.

So X̄_n flips a coin at each step; if it is heads it doesn't move at that step, and if it is tails it moves according to
p. We've produced a "lazy" version of the chain. X̄_n is still irreducible, is clearly aperiodic since p̄(x, x) > 0
for every x, and it is easy to check that π, π' are still stationary distributions for X̄_n (indeed π p̄ = (1/2)π + (1/2)π p = π). So by the previous case,
π = π'.
Remark 19.25. This proof itself was also lazy, because we took advantage of a big theorem and used a
sneaky trick to avoid the aperiodicity hypothesis. For a more honest proof, see Durrett's Theorem 6.5.7.
Remark 19.26. The convergence theorem is the basis for the so-called Markov Chain Monte Carlo (MCMC)
method of sampling from a probability distribution. Suppose we have a probability measure π on S and we
want to generate a random variable ξ with distribution π. One approach is to come up with a Markov
chain X_n whose stationary distribution is π, start it at some arbitrarily chosen state x_0, run it for some
large number of steps N, and then take ξ = X_N. The convergence theorem says that the distribution of ξ is
approximately π. And if we have information about the rate of convergence, we can work out how large we
need to take N (the so-called mixing time) to get the distribution of ξ within any given ε of π.
For example, when you shuffle a deck of cards, you are running a Markov chain on the finite group
S_52 (whose transition function depends on your specific shuffling technique). For a very simple example,
suppose a shuffle consists of swapping the top card with a randomly chosen card. This is a random walk
on S_52 using the generators {id, (1 2), (1 3), . . . , (1 52)}, or in other words a random walk on the Cayley
graph, in which every vertex has degree 52. Thus the stationary distribution is uniform. So if you repeat this
shuffle enough times, you get an approximately uniform random permutation of the deck; every permutation
has approximately equal probability. This also works with a more elaborate shuffling procedure, such as a
riffle or dovetail shuffle, which mixes faster; Bayer and Diaconis famously showed that most of the mixing
happens within 7 shuffles, i.e. X_7 is already quite close to uniform.
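Here is a sketch of the top-card-swap shuffle as a Markov chain on permutations, shrunk to a 5-card deck so that the whole state space (120 permutations) can be enumerated and the distance to uniform estimated; the deck size and trial counts are arbitrary choices for illustration.

import random
from itertools import permutations
from collections import Counter

DECK = 5   # 5! = 120 permutations

def shuffle_once(perm):
    # swap the top card with a uniformly chosen position (possibly the top itself)
    perm = list(perm)
    j = random.randrange(DECK)
    perm[0], perm[j] = perm[j], perm[0]
    return tuple(perm)

def empirical_tv(n_shuffles, trials=50_000):
    # estimate the total variation distance between the law of X_{n_shuffles} and uniform
    counts = Counter()
    start = tuple(range(DECK))
    for _ in range(trials):
        x = start
        for _ in range(n_shuffles):
            x = shuffle_once(x)
        counts[x] += 1
    return 0.5 * sum(abs(counts[s] / trials - 1 / 120) for s in permutations(range(DECK)))

for n in (1, 5, 10, 20):
    print(n, empirical_tv(n))   # should decrease toward 0, up to Monte Carlo noise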
It may seem dumb to use a Markov chain algorithm to sample from the uniform distribution on a finite
set such as S_52. After all, there are a zillion algorithms to generate (pseudo)random integers of any desired
size, so it's easy to sample uniformly from {1, 2, . . . , 52!}. But the problem is that it is not so easy to find
an explicit and easily computable bijection from {1, 2, . . . , 52!} to S_52 (given an integer, how do you quickly
find the permutation to which it corresponds?).