
18.615: Introduction to Stochastic Processes

Rachel Wu

Spring 2017

These are my lecture notes from 18.615, Introduction to Stochastic Processes, at the Massachusetts Institute of Technology, taught this semester (Spring 2017) by Professor Alexey Bufetov¹.
I wrote these lecture notes in LaTeX in real time during lectures, so there may be errors and typos. I have lovingly pillaged Tony Zhang's² formatting commands and style. Should you encounter an error in the notes, wish to suggest improvements, or alert me to a failure on my part to keep the web notes updated, please contact me at [email protected].
This document was last modified 2018-05-01.

1 [email protected]
2 [email protected]

Contents

1 February 28, 2017
  1.1 Markov chains
  1.2 Classification of states

2 March 2, 2017
  2.1 Markov chain convergence
  2.2 Implications of convergence

3 March 7, 2017
  3.1 Markov chains (continued)
  3.2 Simple random walks in Z
  3.3 Counting paths examples

4 March 9, 2017
  4.1 More simple random walks in Z
  4.2 Applications to probability

5 March 21, 2017
  5.1 Last return
  5.2 Maximum value of simple random walks

6 March 23, 2017
  6.1 Maximum value, ctd.
  6.2 Time above line

7 April 4, 2017
  7.1 Infinite-state Markov chains

8 April 6, 2017
  8.1 Infinite-state Markov chains, ctd.
  8.2 Applications of Infinite-state MCs

9 April 11, 2017
  9.1 Probability generating functions
  9.2 Branching processes

10 April 13, 2017
  10.1 Branching processes, ctd.
  10.2 Conditional probability

11 April 20, 2017
  11.1 Martingales

12 April 27, 2017
  12.1 Rip Liang is a bad influence

13 May 1, 2018

1 February 28, 2017


1.1 Markov chains
Markov chains have state space S = {1, 2, . . . , N } and transition matrix P . We
start at
ϕ0 = (P (X0 = 1), P (X0 = 2), . . . , P (X0 = N )). (1.1)
In addition,
(P (Xn = 1), . . . , P (Xn = N )) = ϕ0 P n (1.2)
and
Pn (i, j) = P (Xn = j|X0 = i). (1.3)

1.2 Classification of states


Definition 1.1. For i, j ∈ S, we write

1. i → j if ∃m ∈ Z+ such that Pm (i, j) > 0,


2. i ↔ j if ∃m1 , m2 ∈ Z+ such that Pm1 (i, j) > 0 and Pm2 (j, i) > 0.

The former means that j is reachable from i, and the latter means that each of i and j is reachable from the other.

Lemma 1.2 (Transitivity)


If i ↔ j, and j ↔ k, then i ↔ k.

The state space S can be partitioned into S = S̃₁ ∪ S̃₂ ∪ · · · ∪ S̃ₗ, where the S̃ᵢ are disjoint, nonempty communication classes. Each of these classes has the property that all states i, j in it satisfy i ↔ j.

Example 1.3 (Communication classes)


How many communication classes does a given transition graph have?

[transition graph omitted]

One communication class here, since all states are reachable from all others. How about for gambler's ruin? There are three classes: {0}, {N}, and the rest.

Since Markov chains are directed graphs, communication classes are just maximal strongly connected subgraphs.
Definition 1.4. A Markov chain is called irreducible if it has only one com-
munication class.
Definition 1.5. Let S̃ be a communication class, and let X₀ ∈ S̃. If Pr{Xₙ ∉ S̃} → 1 as n → ∞, then S̃ is called transient, and all states s ∈ S̃ are transient states. Otherwise, S̃ is recurrent.


If any edge leaves S̃ (that is, if S̃ has positive out-degree), then S̃ is transient.


Definition 1.6. For each s ∈ S, we define T_s = {n ∈ Z₊ : P_n(s, s) > 0}³. Then d = gcd(T_1 ∪ T_2 ∪ · · · ∪ T_N) is the period of the Markov chain. If d = 1, then this Markov chain is aperiodic.

Example 1.7
If T_s = {2, 4, 6, 8, . . .}, then the period is 2. For gambler's ruin, the chain is aperiodic (the absorbing states have loops).

In fact, if there is any loop (node to self) in the graph, then the Markov
chain is aperiodic.

Proposition 1.8
If there exists a ∈ Z+ , such that all entries of P a are strictly positive, then
the Markov chain is irreducible and aperiodic.

Proof. If P_a(i, j) > 0 for all i, j, then there is only one communication class, and the chain is irreducible by definition. In addition, a ∈ T_1, and a + 1 ∈ T_1 as well (every entry of P^{a+1} = P^a P is also positive), and gcd(a, a + 1) = 1, so the chain is aperiodic.

Theorem 1.9 (Convergence theorem)


If all entries of P^a are strictly positive, for some a, then there exists a unique invariant distribution π, and for any initial distribution ϕ₀, ϕ₀P^n → π as n → ∞.

Example 1.10 (Phone call)


Suppose we have a phone that is either free or busy, with transition matrix

    P = [ 3/4  1/4 ]
        [ 1/2  1/2 ]    (1.4)

The invariant distribution for this is π = (2/3, 1/3).
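As a quick sanity check of this example (my addition, not from the lecture), a few lines of Python verify both that π = (2/3, 1/3) is invariant and that P^n converges to the matrix with every row equal to π; this is just a sketch using the numbers above.

import numpy as np

P = np.array([[3/4, 1/4],
              [1/2, 1/2]])

pi = np.array([2/3, 1/3])
print(pi @ P)                           # [0.6667 0.3333] -- equals pi
print(np.linalg.matrix_power(P, 20))    # both rows approach (2/3, 1/3)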

Example 1.11 (Symmetric random walk with reflective boundaries)


Take gambler's ruin, except at the ends, we bounce back with probability 1. With states 0, . . . , N, this has a stationary distribution π = (1/2N, 1/N, . . . , 1/N, 1/2N).

Example 1.12 (Lazy random walks)


The point is lazy, so it doesn’t like moving.

p(i, i + 1) = 1/4 p(i, i) = 1/2 p(i, i − 1) = 1/4 (1.5)

3 The professor used J.


2 March 2, 2017
Pset 2 is due next lecture. Office hours March 6, 2-3p.

2.1 Markov chain convergence


We prove theorem 1.9 from the previous lecture.
If P^n converges, as n → ∞, to a matrix whose rows are all equal to π, then the theorem holds:

    P^n → [ π(1) π(2) . . . π(N) ]
          [ π(1) π(2) . . . π(N) ]
          [  ...             ... ]
          [ π(1) π(2) . . . π(N) ]    (2.1)

Then for any ϕ₀, since the entries of ϕ₀ sum to 1,

    ϕ₀ P^n → (π(1), π(2), . . . , π(N)).    (2.2)

Now assume that π₂P = π₂ for some distribution π₂. Then π₂ = π₂P = π₂P^2 = · · · = π₂P^n for every n; letting n → ∞ and using (2.1) gives π₂ = π, so the invariant distribution is unique.
Let d = min_{i,j} P_a(i, j) > 0, g_{n,j} = max_i P_n(i, j), and s_{n,j} = min_i P_n(i, j). We need to prove that g_{n,j} − s_{n,j} → 0. Essentially, this means that the elements of P^n converge.

Lemma 2.1
gn,j − sn,j → 0 as n → ∞.

Proof of lemma. We write out the matrix entry form of P^{a(n+1)} = P^a P^{an}, and see that

    g_{a(n+1),j} = max_i Σ_k P_a(i, k) P_{an}(k, j) ≤ (1 − d) g_{an,j} + d s_{an,j},    (2.3)

since each row of P^a puts weight at least d on the state k where P_{an}(k, j) is smallest. In words: the largest entries get multiplied against the smallest entries sometime during the matrix multiplication, so the larger numbers shrink and the smaller numbers grow. Similarly, the smallest element increases:

    s_{a(n+1),j} = min_i Σ_k P_a(i, k) P_{an}(k, j) ≥ (1 − d) s_{an,j} + d g_{an,j}.    (2.4)

So

    g_{a(n+1),j} − s_{a(n+1),j} ≤ (1 − 2d)(g_{an,j} − s_{an,j}),    (2.5)

and the gap decays geometrically along the subsequence an; combined with the monotonicity of lemma 2.2, g_{n,j} − s_{n,j} → 0.

Lemma 2.2
gn+1,j ≤ gn,j and sn+1,j ≥ sn,j .


Proof of lemma. P^{n+1} = P · P^n, so each entry of P^{n+1} is a weighted average of entries in a column of P^n, with non-negative weights summing to 1. So g_{n+1,j} ≤ 1 · g_{n,j} = g_{n,j}, and also s_{n+1,j} ≥ s_{n,j} for the same reason. Hence g_{n+1,j} − s_{n+1,j} ≤ g_{n,j} − s_{n,j}. (Monotonicity alone does not prove convergence to 0, since the two sides may stay apart; that is what lemma 2.1 supplies.)

Question 2.3. Couldn't we just have used eigenvalues, with λ = 1? Well yes, but this argument is more general and works in arbitrary dimensions. In general, there may also be more than one invariant distribution.

2.2 Implications of convergence


We make the following observations.

• If a THMC is irreducible and aperiodic, then ∃a such that P^a has all entries strictly positive. (The relevant number-theoretic fact: if J ⊂ Z₊ is closed under addition and gcd(J) = 1, then J contains all sufficiently large integers.)

Corollary 2.4
Suppose we have function f : S → R. Then
" n #
1X
E f (Xi ) → π(1)f (1) + π(2)f (2) + . . . (2.6)
n i=1

since when n → ∞, the Markov chain does not change from π.

Corollary 2.5 (Ergodic theorem)


Let f̄ = π(1)f(1) + · · · + π(N)f(N). Then, as a strengthening of the convergence theorem,

    (1/n) Σ_{i=1}^n f(Xᵢ) → f̄    (2.7)

with probability 1.

Now consider a few more things.

• Let 1_{Xᵢ=r} be the indicator function of state r. Then E[ (1/n) Σᵢ 1_{Xᵢ=r} ] → π(r), i.e. the number of visits to state r is approximately nπ(r).

• Now let τ be the return time to r. Suppose X₀ = r, and τ(r) = min{i ≥ 1 : Xᵢ = r}. Then E[τᵣ] = 1/π(r).

We can just use these facts on psets. However, we must show that the
Markov chain is irreducible and aperiodic, and find the invariant distribution
(and of course, show that it is a Markov chain).
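As an illustration of these facts (my addition, not from the lecture), here is a short simulation of the phone-call chain of example 1.10: the fraction of time spent in state 0 should approach π(0) = 2/3, and the mean return time to state 0 should approach 1/π(0) = 3/2. A sketch only; the chain and its numbers come from example 1.10.

import random

def simulate(steps, seed=0):
    rng = random.Random(seed)
    P = [[3/4, 1/4], [1/2, 1/2]]   # phone-call chain from example 1.10
    x, visits, gaps, last = 0, 0, [], 0
    for n in range(1, steps + 1):
        x = 0 if rng.random() < P[x][0] else 1
        if x == 0:
            visits += 1
            gaps.append(n - last)  # time since the previous visit to 0
            last = n
    print("fraction of time in state 0:", visits / steps)        # ~ 2/3
    print("mean return time to state 0:", sum(gaps) / len(gaps)) # ~ 3/2

simulate(200_000)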


3 March 7, 2017
3.1 Markov chains (continued)
Today we continue Markov chains.

Example 3.1 (Ehrenfest Urn model)


There are two urns containing n balls in total. At each step, one of the n balls is selected uniformly at random and transferred to the other urn. The states are the number of balls in urn 1, k ∈ {0, . . . , n}. The transition probabilities are

    p(k, k+1) = (n − k)/n,    (3.1)
    p(k, k−1) = k/n.          (3.2)

We observe that the urns play symmetric roles, so π(k) = π(n − k). By definition of a stationary distribution,

    π(k−1) p(k−1, k) + π(k+1) p(k+1, k) = π(k),    (3.3)

for k = 1, . . . , n − 1. We will show that π(k) = (1/2^n) C(n, k) is a solution:

    π(k−1) p(k−1, k) + π(k+1) p(k+1, k)
      = (1/2^n) [ n!/((k−1)!(n−k+1)!) · (n−k+1)/n + n!/((k+1)!(n−k−1)!) · (k+1)/n ]
      = (1/2^n) (n−1)! [ k + (n−k) ] / (k!(n−k)!)
      = (1/2^n) n!/(k!(n−k)!) = π(k).    (3.4)

This concept is similar to the second law of thermodynamics: the chain concentrates near the balanced states k ≈ n/2.
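A numerical check (my addition): build the (n+1) × (n+1) transition matrix of the urn chain and verify that the binomial distribution above is stationary. A sketch assuming NumPy is available.

import numpy as np
from math import comb

n = 10
P = np.zeros((n + 1, n + 1))
for k in range(n + 1):
    if k < n: P[k, k + 1] = (n - k) / n   # a ball from urn 2 moves to urn 1
    if k > 0: P[k, k - 1] = k / n         # a ball from urn 1 moves to urn 2

pi = np.array([comb(n, k) / 2**n for k in range(n + 1)])
print(np.allclose(pi @ P, pi))   # True: binomial(n, 1/2) is stationary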

3.2 Simple random walks in Z


Let ε₁, . . . , εₙ be i.i.d. random variables,

    εᵢ = +1 with probability 1/2,  −1 with probability 1/2,    (3.5)

and let Sₙ = ε₁ + · · · + εₙ be the sum of the first n of them.

[plot of a sample path (i, Sᵢ) omitted]


How many paths are there from (0,0) to (n, x)? Well, if we take p steps up and q steps down,

    n = p + q,
    x = p − q.

Then the number of paths is equal to

    N_{(0,0)→(n,x)} = C(p+q, p) = C(n, (n+x)/2)  if (n+x)/2 ∈ Z and |x| ≤ n,  and 0 otherwise.    (3.6)

To quantify this, we introduce our best friend.

Theorem 3.2 (Stirling’s formula)


Most useful approximation ever:

    n! ∼ √(2πn) (n/e)^n    (3.7)

Using Stirling's formula, Pr{S_{2n} = 0} = C(2n, n)/2^{2n} ∼ 1/√(πn). Then we find that

    E[# of returns to 0] = Σ_{n≥1} Pr{S_{2n} = 0} = ∞.    (3.8)
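The approximation is easy to test (my addition): compare the exact probability C(2n, n)/4^n with 1/√(πn).

from math import comb, pi, sqrt

for n in (10, 100, 1000):
    exact = comb(2 * n, n) / 4**n       # Pr{S_{2n} = 0}
    approx = 1 / sqrt(pi * n)           # Stirling approximation
    print(n, exact, approx, exact / approx)
# the ratio tends to 1 as n grows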

3.3 Counting paths examples


Example 3.3 (Burning houses)
Imagine that there is a river, a firetruck, and a burning house. The firetruck
must visit the river before rescuing the house. How do we determine the
fastest route?

Solution. Reflect the firetruck across the river, connect the house and the reflection with a straight line, and the point where that line intersects the river gives the fastest route.

Example 3.4
We have two points (a, α) and (b, β), such that α, β > 0, b > a, and a, b, α, β ∈ Z. How many paths between them touch or cross the x axis?

Solution. There is a bijection between such paths and all the paths from a′ to (b, β), where a′ = (a, −α) is the starting point reflected across the x axis: simply reflect the initial segment of the path, up to the first time it touches the x axis. There are

    N_{b−a, β+α} = C( b−a, ((b−a)+(β+α))/2 )    (3.9)

paths that satisfy the constraint.


Corollary 3.5
The number of paths from (a, α) to (b, β) which do not touch the x axis is equal to the total number of paths from (a, α) to (b, β) minus the number of paths from a′ = (a, −α) to (b, β).

Corollary 3.6
The number of paths from (1, 1) → (n, x) which are strictly above the x
axis is Nn−1,x−1 − Nn−1,x+1 .

Example 3.7
How many paths from (1, 1) → (5, 1) do not touch the x axis?

Solution. Trivially we could draw them and see that there are 2. Using the
formula,

    N_{4,0} − N_{4,2} = C(4, 2) − C(4, 3) = 6 − 4 = 2.    (3.10)

Example 3.8 (Dyck paths)


How many paths from (1, 1) → (2n + 1, 1) do not touch the x axis?

Solution. We again use the formula and see that

    N_{2n,0} − N_{2n,2} = C(2n, n) − C(2n, n+1) = (1/(n+1)) C(2n, n).    (3.11)

In particular, we have discovered Catalan numbers! This is also equivalent to


the number of Dyck paths from (0, 0) → (2n, 0).
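A brute-force check of (3.11) (my addition): enumerate all ±1 step sequences of length 2n that keep a path starting at height 1 strictly positive and return it to height 1, and compare with the Catalan number.

from itertools import product
from math import comb

def dyck_like(n):
    # paths (1,1) -> (2n+1,1) staying strictly above the x axis
    count = 0
    for steps in product((1, -1), repeat=2 * n):
        h, ok = 1, True
        for s in steps:
            h += s
            if h <= 0:
                ok = False
                break
        if ok and h == 1:
            count += 1
    return count

for n in range(1, 7):
    print(n, dyck_like(n), comb(2 * n, n) // (n + 1))  # the two columns agree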


4 March 9, 2017
4.1 More simple random walks in Z
Recall from last lecture that the number of paths from (0, 0) → (n, x) is

    N_{n,x} = C(n, (n+x)/2)  if (n+x)/2 ∈ Z and |x| ≤ n,  and 0 otherwise.    (4.1)

The number of paths from (a, α) → (b, β), with α, β > 0, which do not touch the x axis is

    N_{b−a, β−α} − N_{b−a, β+α}.    (4.2)
At the end of the course, we will model Brownian motion with such a simple
random walk.

“We will suffer all this combinatorics. . . ”—abufetov

Finally, we also found Catalan numbers in Dyck paths.

Theorem 4.1 (Ballot’s theorem)


The number of paths from (0, 0) → (n, x) such that sᵢ > 0 for all i ≥ 1 is

    N_{n−1,x−1} − N_{n−1,x+1} = (x/n) N_{n,x}.

Proof. The left side is equal to the number of strictly positive paths from (1, 1) → (n, x), since from (0,0) such a path can only go up to (1,1). For the right side, let us take p steps up and q steps down, such that x = p − q, n = p + q. Then

    C(p+q−1, p−1) − C(p+q−1, p) = ( (p+q−1)!/((p−1)!(q−1)!) ) (1/q − 1/p)
                                = ((p−q)/(p+q)) C(p+q, p) = (x/n) N_{n,x}.    (4.3)

Example 4.2 (Application of Ballot’s theorem)


Of the N_{3,1} = 3 paths from (0, 0) → (3, 1), only (1/3) N_{3,1} = 1 stays strictly positive.

[plot of the single such path, +1, +1, −1, omitted]

Proposition 4.3
The number of paths of length 2n, such that sᵢ > 0 for all i ≠ 0, is

    N_{2n−1,1} = (1/2) N_{2n,0}.


Proof. The number of paths satisfying this condition is

    Σ_{r≥1} #paths(s₀ = 0, s₁ > 0, . . . , s_{2n−1} > 0, s_{2n} = 2r)
      = Σ_{r≥1} ( N_{2n−1,2r−1} − N_{2n−1,2r+1} )
      = (N_{2n−1,1} − N_{2n−1,3}) + (N_{2n−1,3} − N_{2n−1,5}) + · · ·
      = N_{2n−1,1},

since we have a telescoping series.

Corollary 4.4
The number of paths that do not return to 0 in the first 2n steps (s₀ = 0, s₁ ≠ 0, . . . , s_{2n} ≠ 0) is

    2 · (1/2) N_{2n,0} = N_{2n,0},

since such paths are either all positive or all negative.

For n = 2, we can have the positive or negative versions of the following.

[plots of the three all-positive paths of length 4 omitted]

Proposition 4.5
The number of paths that start from 0 and return to 0 for the first time at step 2n is

    4N_{2n−2,0} − N_{2n,0} = (1/(2n−1)) N_{2n,0}.

We demonstrate with n = 3, 2n = 6.

[plots of the two all-positive first-return paths of length 6 omitted]

And we also have the negative versions of these.


4.2 Applications to probability


Recall that the steps εᵢ are i.i.d. symmetric Bernoulli. Let

    µ_{2n} = Pr{S_{2n} = 0} = N_{2n,0}/2^{2n} ∼ 1/√(πn) → 0

and

    f_{2n} = Pr{first return to 0 at step 2n} = (1/(2n−1)) µ_{2n}.

Corollary 4.6
We can derive that f2n = µ2n−2 − µ2n from proposition 4.5.

Corollary 4.7
The probability that a simple random walk ever returns to 0 is

    Σ_{n=1}^∞ f_{2n} = Σ_{n=1}^∞ (µ_{2n−2} − µ_{2n}) = µ₀ = 1,

since we have another telescoping series (using µ_{2n} → 0).

So, although the walk returns with probability 1, the expected return time is infinite:

    E[τ₀] = Σ_{n=1}^∞ 2n f_{2n} ∼ Σ_{n=1}^∞ 1/√(πn) = ∞.    (4.4)


5 March 21, 2017


We had an exam and a snow day, so we continue discussing simple random walks
now.

5.1 Last return


Let τ = max{2i : S_{2i} = 0, 0 ≤ i ≤ n}, the last return to 0 within the first 2n steps. So

    Pr{τ = 2k} = N_{2k,0} N_{2n−2k,0} / 2^{2n} = µ_{2k} µ_{2n−2k}.

Bashing with Stirling's,

    Pr{τ = 2k} ∼ (1/√(πk)) (1/√(π(n−k))) = 1/(π √(k(n−k))).    (5.1)

Proposition 5.1
N_{2k,0} · N_{2n−2k,0} is the number of length-2n paths whose last return to 0 is at step 2k (combine the loops of length 2k with corollary 4.4 for the remaining 2n − 2k steps).

However, probabilities at individual points tend to 0, so we need to determine the cumulative distribution. For 0 < α < 1,

    Pr{τ ≤ 2αn} = Σ_{k=0}^{αn} Pr{τ = 2k}
                ≈ Σ_{k=1}^{αn} 1/(π √(k(n−k)))
                = Σ_{k=1}^{αn} (1/n) · 1/(π √((k/n)(1 − k/n))).    (5.2)

As n tends to ∞, this Riemann sum becomes an integral:

    ∫₀^α 1/(π √(x(1−x))) dx = (2/π) arcsin √α.

Theorem 5.2
As n tends to ∞, Pr{τ ≤ 2αn} → (2/π) arcsin √α.

For example, take α = 0.1. Then (2/π) arcsin √0.1 ≈ 0.204, so approximately 20.4% of paths have their last return to 0 within the first 10% of the walk. It is very likely for last returns to be close to 0 or 2n, and least likely for them to occur in the middle: the limiting density 1/(π√(x(1−x))) is U-shaped, blowing up at x = 0 and x = 1.
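A simulation (my addition) makes the arcsine law concrete: generate many walks of length 2n, record the last return to 0, and compare the empirical CDF with (2/π) arcsin √α.

import random
from math import asin, pi, sqrt

def last_return_cdf(alpha, n=200, trials=5000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s, tau = 0, 0
        for i in range(1, 2 * n + 1):
            s += rng.choice((1, -1))
            if s == 0:
                tau = i                 # latest return seen so far
        if tau <= 2 * alpha * n:
            hits += 1
    return hits / trials

for alpha in (0.1, 0.5, 0.9):
    print(alpha, last_return_cdf(alpha), (2 / pi) * asin(sqrt(alpha)))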


5.2 Maximum value of simple random walks


We now consider paths with a given maximum value.

Proposition 5.3
The number of paths from (0, 0) → (n, x) of length n which touch the line y = r (for r ≥ max(x, 1)) is N_{n,2r−x}.

This is trivially true by bijection and reflection.

Proposition 5.4
The number of paths from (0, 0) → (n, x) of length n such that maxk Sk = r
is Nn,2r−x − Nn,2r+2−x

This is equivalent to the number of paths that touch y = r minus the number of
paths that touch y = r + 1. All paths going beyond must also touch y = r + 1.

Proposition 5.5
The number of paths of length n such that max_k S_k = r is

    Σ_{x≤r} #paths( (0, 0) → (n, x), max = r ) = max(N_{n,r}, N_{n,r+1}),

where one of the two terms vanishes, depending on parity.

This comes from the typical telescoping series.

Example 5.6
Let n = 4, r = 1. How many paths are there?

Solution. There are max(N_{4,1}, N_{4,2}) = N_{4,2} = C(4, 3) = 4.

[plots of the four paths of length 4 with maximum exactly 1 omitted]

Theorem 5.7
As n tends to ∞,

    Pr{Sₙ ≤ α√n} → (1/√(2π)) ∫_{−∞}^α e^{−t²/2} dt

by the central limit theorem.


6 March 23, 2017


6.1 Maximum value, ctd.
Recall that we are dealing with a simple random walk of length n.

Theorem 6.1
By the central limit theorem, for α ≥ 0, as n → ∞,

    Pr{ max_k S_k ≥ α√n } → 2 · (1/√(2π)) ∫_α^{+∞} e^{−t²/2} dt.

The probability on the left is the sum

    ( #[paths with max = n] + #[paths with max = n−1] + · · · + #[paths with max = ⌈α√n⌉] ) / 2^n.

Proof. From the central limit theorem,

    Pr{ Sₙ ≥ α√n } = ( N_{n,n} + N_{n,n−1} + N_{n,n−2} + · · · + N_{n,⌈α√n⌉} ) / 2^n → (1/√(2π)) ∫_α^∞ e^{−t²/2} dt,

where N_{n,n−1} and all other wrong-parity terms are 0. However, we want to find

    ( #[paths with max = n] + #[paths with max = n−1] + · · · ) / 2^n
      = ( max(N_{n,n}, N_{n,n+1}) + max(N_{n,n−1}, N_{n,n}) + · · · ) / 2^n

by proposition 5.5. By parity, each nonzero N_{n,x} with x ≥ ⌈α√n⌉ appears exactly twice in this sum (once for r = x and once for r = x − 1), up to a negligible boundary term, so this value is twice the central limit theorem result.

6.2 Time above line


We wonder how much time a path spends above the x axis.

Lemma 6.2
Let F2k be the number of paths which return to 0 for the first time at 2k.
Then

N2k,0 = F2k + F2k−2 N2,0 + F2k−4 N4,0 + · · · + F2 N2k−2,0


Each step of a path of length 2n lies in either the positive or the negative half plane. We study paths of length 2n, (S₁, S₂, . . . , S_{2n}). Let N₊ be the number of “positive steps,” and N₋ the number of “negative steps.” Observe that both N₊ and N₋ must be even, as we must return to 0 to change sign. Let

    B_{2k,2n} = #[paths of length 2n such that N₊ = 2k].

Example 6.3
Find B_{2,4}. We draw out the possibilities; there are 4.

[plots of the four length-4 paths with exactly two positive steps omitted]

Proposition 6.4
We know that B_{2n,2n} = N_{2n,0} = # of non-negative paths of length 2n. Then (1/2) N_{2n,0} = # of strictly positive paths, as we can just shift a non-negative path up-right and fill in the extra step.

Proposition 6.5
B2k,2n = N2k,0 · N2n−2k,0

This can be proved by induction on n.
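Proposition 6.5 is also easy to check by enumeration (my addition). One detail the notes leave implicit is exactly when a step counts as positive; below I assume the usual convention that step i is positive when S_{i−1} > 0 or S_i > 0, which does make both N₊ and N₋ even. Treat that convention as an assumption of this sketch.

from itertools import product
from math import comb

def B(two_k, two_n):
    # step i is "positive" when S_{i-1} > 0 or S_i > 0 (assumed convention)
    count = 0
    for steps in product((1, -1), repeat=two_n):
        s, pos = 0, 0
        for e in steps:
            prev, s = s, s + e
            if prev > 0 or s > 0:
                pos += 1
        if pos == two_k:
            count += 1
    return count

def N0(m):  # N_{m,0}
    return comb(m, m // 2) if m % 2 == 0 else 0

two_n = 8
for k in range(two_n // 2 + 1):
    print(2 * k, B(2 * k, two_n), N0(2 * k) * N0(two_n - 2 * k))  # columns agree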


7 April 4, 2017
7.1 Infinite-state Markov chains
We are given an unbounded but countable state space S = {0, 1, 2, . . .} ⊂ Z, with an initial probability distribution {α(x)}, Σₓ α(x) = 1. A Markov chain X₀, X₁, . . . has transition probability

    Pr{Xₙ₊₁ = y | Xₙ = x} = p(x, y), for x, y ∈ S.⁴

So by total probability,

    Pr{Xₙ₊₁ = y} = Σ_{x∈S} Pr{Xₙ = x} p(x, y).

An invariant distribution π has the property that

    π(y) = Σ_{x∈S} π(x) p(x, y).

Example 7.1
We have an infinite Markov chain where

p(i, i + 1) = 1, i ≥ 0.

As a graph, 0 → 1 → 2 → 3 → 4 → · · ·

This has no stationary distribution: all the mass marches off to infinity.

4 In addition, Pr {Xn+k = y|Xn = x} = pk (x, y).


Example 7.2
Consider a Markov chain, similar to a simple random walk:

    p(i, i+1) = p(i, i−1) = 1/2 for i ≥ 1,    p(0, 0) = p(0, 1) = 1/2.

As a graph, it is the half-line 0 ↔ 1 ↔ 2 ↔ · · · with a loop at 0. An invariant distribution would satisfy

    π(n) = π(n−1)/2 + π(n+1)/2,  n ≥ 1.

This is a bit harder, but the boundary equation π(0) = π(0)/2 + π(1)/2 gives π(0) = π(1), and thus

    π(0) = π(1) = π(2) = · · · = π(n).

A constant sequence cannot sum to 1 over infinitely many states, so there is no invariant distribution.

These examples show two types of new behavior compared to the finite case: there may be no convergence at all, as in 7.1, or there may simply be no invariant distribution, as in 7.2.
Definition 7.3. Let X₀, X₁, . . . be a sequence of random variables. A random variable τ is a stopping time if the event {τ = n} depends on X₀, X₁, . . . , Xₙ only.
Definition 7.4. Given a state x ∈ S, the hitting time of x is the first time the Markov chain visits x: τₓ = min{n ≥ 0 : Xₙ = x}.

Proposition 7.5 (Strong Markov property)


Let X0 , X1 , . . . be a Markov chain, and let τ be a stopping time. Then

Pr {Xτ +k = y|Xτ = x} = pk (x, y)

for any x, y ∈ S.

Proof. By total probability, conditioning on the value of τ,

    Pr{X_{τ+k} = y | X_τ = x} = Σ_{n=1}^∞ Pr{τ = n} Pr{X_{n+k} = y | Xₙ = x}
                              = Σ_{n=1}^∞ Pr{τ = n} p_k(x, y);

the Markov property applies at time n because the event {τ = n} depends only on X₀, . . . , Xₙ. We can pull p_k(x, y) out of the summation, and realize that Σ_{n=1}^∞ Pr{τ = n} = 1, so all that is left is p_k(x, y).

Communication classes are the same as before. An irreducible Markov chain is one with a single communication class.


Definition 7.6. An ITHMC⁵ is recurrent if for some state x, the event {Xₙ = x for infinitely many n} has probability 1. Otherwise, the Markov chain is transient.

Proposition 7.7
If an irreducible Markov chain is recurrent, then the event {Xₙ = y for infinitely many n} has probability 1, for every y.

Proof. For some x ∈ S, the event {Xₙ = x} happens infinitely often. By the definition of a communication class, ∃k with p_k(x, y) = ε > 0. Each visit to x gives a chance ε of being at y exactly k steps later, so over infinitely many visits to x, the state y is also visited infinitely often with probability 1.

5 infinite-state time-homogeneous Markov chain


8 April 6, 2017
8.1 Infinite-state Markov chains, ctd.
We continue discussion of infinite-state Markov chains.

Proposition 8.1
For a state x ∈ S,

    Σₙ pₙ(x, x) < ∞

implies that the Markov chain is transient, while

    Σₙ pₙ(x, x) = ∞

implies that the Markov chain is recurrent.

Definition 8.2. If a recurrent Markov chain satisfies pₙ(x, y) → 0 as n → ∞ for all states x, y ∈ S, then it is known as null recurrent, and it does not have an invariant distribution. Otherwise, the Markov chain is positive recurrent, and it has an invariant distribution.

Proposition 8.3
If a recurrent ITHMC has an invariant distribution, then it is positive
recurrent.

Proof. Take a state y ∈ S with π(y) > 0. By the definition of an invariant distribution,

    π(y) = Σ_{x∈S} π(x) p(x, y).

That implies that after n steps, we are still at the same distribution:

    π(y) = Σ_{x∈S} π(x) pₙ(x, y).

There exists a finite set F ⊂ S, |F| < ∞, such that

    Σ_{x∉F} π(x) < π(y)/2.

Splitting the sum,

    π(y) = Σ_{x∈F} π(x) pₙ(x, y) + Σ_{x∉F} π(x) pₙ(x, y) ≤ Σ_{x∈F} π(x) pₙ(x, y) + π(y)/2.

If the chain were null recurrent, then pₙ(x, y) → 0 for each of the finitely many x ∈ F, so letting n → ∞ would give π(y) ≤ π(y)/2, a contradiction. Hence the chain is positive recurrent.

Claim 8.4. A positive recurrent ITHMC has a unique invariant distribution π, and for any x, y ∈ S, pₙ(x, y) → π(y) (convergence theorem).


8.2 Applications of Infinite-state MCs


For the rest of class, we will go through examples.

Example 8.5 (Simple random walk)


Here, S = Z, and the transition probabilities are p to the right and q = 1 − p to the left (the symmetric case is p = q = 1/2). How do we characterize this Markov chain?

Solution. The probability of being back at 0 after 2n steps is

    p_{2n}(0, 0) = C(2n, n) p^n q^n.

By Stirling's, C(2n, n) ∼ 2^{2n}/√(πn). Then

    Σ_{n=1}^∞ p_{2n}(0, 0) ≈ Σ_{n=1}^∞ p^n q^n 2^{2n} (1/√(πn)) = Σ_{n=1}^∞ (4p(1 − p))^n (1/√(πn)).

This sum converges if p ≠ 1/2, since then 4p(1 − p) < 1. Otherwise it diverges, since 4 · (1/4) = 1. Therefore the walk is transient if p ≠ 1/2 and recurrent otherwise.

Example 8.6 (Simple random walk in Z≥0)


Let S be the set of non-negative integers, 0 ↔ 1 ↔ 2 ↔ · · ·, with transition probabilities

    p(x, x+1) = p,    p(x, x−1) = q = 1 − p,

where 0 bounces back: p(0, 1) = p and p(0, 0) = q. What is the invariant distribution?

Solution. The invariant distribution satisfies

    π(x) = π(x+1) q + π(x−1) p,  x ≥ 1.

This is a difference equation, with general solution

    π(x) = c₁ + c₂ (p/q)^x.

The base condition is π(0) = qπ(0) + qπ(1). Plugging in and bashing, we find that there are 3 possible outcomes.

Case 1 If p > q, then (p/q)^x grows without bound, so no non-negative solution can be normalized; there is no invariant distribution, and the chain is transient.
Case 2 If p = q, then π must be constant, which also cannot be normalized. The chain is null recurrent.
Case 3 If p < q, then we have an invariant distribution

    π(x) = ((q − p)/q) (p/q)^x.
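A numerical check of case 3 (my addition), with p = 0.4: the distribution π(x) = ((q−p)/q)(p/q)^x sums to 1 (geometric series) and satisfies the balance equations, up to truncation error.

p, q = 0.4, 0.6
pi = [(q - p) / q * (p / q) ** x for x in range(200)]
print(sum(pi))                                          # ~ 1

# balance: pi(0) = q*pi(0) + q*pi(1), and pi(x) = q*pi(x+1) + p*pi(x-1)
print(abs(pi[0] - (q * pi[0] + q * pi[1])))             # ~ 0
print(max(abs(pi[x] - (q * pi[x + 1] + p * pi[x - 1]))
          for x in range(1, 198)))                      # ~ 0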


Example 8.7 (d-dimensional lattice walk)


Let S = Z^d, which is a d-dimensional lattice. Transition probabilities are p(vᵢ, vⱼ) = 1/(2d) for neighboring sites. What is p_{2n}(0, 0)?

Solution. We have about 2n/d steps of simple random walk in each dimension. The probability of returning to 0 in one dimension is about 1/√(πn/d). If the coordinate walks were independent, the probability of returning would be (1/√(πn/d))^d. For large n, this heuristic can be made rigorous, though the coordinates are not actually independent. So

    Σ_{n=1}^∞ p_{2n}(0, 0) ∼ c Σ_{n=1}^∞ 1/n^{d/2}.

If d > 2, this sum converges, and the Markov chain is transient. If d = 1, 2, then this is a (generalized) harmonic series that diverges, and the Markov chain is recurrent.
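A Monte Carlo illustration (my addition): estimate the probability that the walk returns to the origin within a finite horizon, for d = 1, 2, 3. For d = 1, 2 the fractions keep climbing toward 1 as the horizon grows, while for d = 3 they stabilize near Pólya's constant ≈ 0.34. A rough sketch only; the horizon and trial counts are arbitrary.

import random

def returns_within(d, steps, rng):
    pos = [0] * d
    for _ in range(steps):
        axis = rng.randrange(d)
        pos[axis] += rng.choice((1, -1))
        if not any(pos):                # back at the origin
            return True
    return False

rng = random.Random(1)
for d in (1, 2, 3):
    trials = 1000
    frac = sum(returns_within(d, 1000, rng) for _ in range(trials)) / trials
    print(d, frac)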


9 April 11, 2017


9.1 Probability generating functions
Let X be a random variable that takes on values in {−k, −k+1, . . . , 0, 1, . . .}. Then the probability generating function is

    f_X(s) = Σ_{n=−k}^∞ Pr{X = n} s^n.

There are two useful properties of probability generating functions.

Proposition 9.1
If X₁ and X₂ are independent, then f_{X₁}(s) f_{X₂}(s) = f_{X₁+X₂}(s)

Proof. If X1 and X2 are independent, then

Pr {X1 = x1 } Pr {X2 = x2 } = Pr {X1 = x1 , X2 = x2 } .

We use this in the probability generating functions.



    f_{X₁}(s) f_{X₂}(s) = ( Σ_{n₁=−k₁}^∞ Pr{X₁ = n₁} s^{n₁} ) ( Σ_{n₂=−k₂}^∞ Pr{X₂ = n₂} s^{n₂} )
                        = Σ_{n=−k₁−k₂}^∞ s^n Σ_{n₁} Pr{X₁ = n₁, X₂ = n − n₁}
                        = Σ_{n=−k₁−k₂}^∞ s^n Pr{X₁ + X₂ = n} = f_{X₁+X₂}(s),

where we grouped the terms with n₁ + n₂ = n, i.e. we let n₂ = n − n₁.

Proposition 9.2
The expected value can be found as E[X] = f_X′(s)|_{s=1}.

Proof. This follows by definition:

    E[X] = Σ_{n=−k}^∞ n Pr{X = n} = ( Σₙ Pr{X = n} s^n )′ |_{s=1}.

We look at some examples.


Example 9.3 (Bernoulli)


Let X be a symmetric Bernoulli random variable on {−1, 1}. Then

    f_X(s) = (1/2) s^{−1} + (1/2) s.

The expectation is 0. If X₁, . . . , Xₙ are i.i.d., then

    f_{X₁+···+Xₙ}(s) = (1/2^n)(s^{−1} + s)^n.

Example 9.4 (Poisson)


Let X ∼ Poisson(α). That is, Pr{X = k} = e^{−α} α^k/k!. Then

    f_X(s) = Σ_{k=0}^∞ e^{−α} (αs)^k/k! = e^{−α} e^{αs} = e^{α(s−1)}.

The expectation is α. If X₁, . . . , Xₙ are independent Poisson with parameters α₁, . . . , αₙ, then

    f_{X₁+···+Xₙ}(s) = e^{(s−1)(α₁+α₂+···+αₙ)}.
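Both propositions are easy to demo in code (my addition): represent a pgf with non-negative support by its coefficient list, so multiplying pgfs becomes polynomial convolution, and evaluate the derivative at s = 1 for the mean. Here X is a fair die; a sketch assuming NumPy.

import numpy as np
from numpy.polynomial import polynomial as P

pX = np.array([0] + [1/6] * 6)        # fair die: coeff of s^n is Pr{X = n}

pSum = np.convolve(pX, pX)            # prop 9.1: product of pgfs = convolution
print(pSum[7])                        # Pr{X1 + X2 = 7} = 6/36

print(P.polyval(1.0, P.polyder(pX)))  # prop 9.2: f'(1) = E[X] = 3.5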

9.2 Branching processes


At time n, the process consists of Xₙ particles, each of which produces random offspring and then dies.

[diagram of generation Xₙ producing generation Xₙ₊₁ omitted]

In this class, we assume that each particle reproduces independently and with the same distribution ξ, such that Pr{ξ = i} = pᵢ, pᵢ ≥ 0, Σᵢ pᵢ = 1. A branching process is a THMC with transition probabilities

    P(k, j) = Pr{y₁ + · · · + y_k = j},

where y₁, y₂, . . . , y_k are i.i.d. ∼ ξ.



    Pr{Xₙ₊₁ = j} = Σ_{k=0}^∞ Pr{Xₙ = k, Xₙ₊₁ = j}
                 = Σ_{k=0}^∞ Pr{Xₙ = k} Pr{Xₙ₊₁ = j | Xₙ = k}
                 = Σ_{k=0}^∞ Pr{Xₙ = k} Pr{y₁ + · · · + y_k = j}.    (9.1)


Theorem 9.5
fXn+1 (s) = fXn (fξ (s))

Proof. Multiply (9.1) by s^j and sum over j: by proposition 9.1, Σⱼ Pr{y₁ + · · · + y_k = j} s^j = f_ξ(s)^k, so f_{Xₙ₊₁}(s) = Σ_k Pr{Xₙ = k} f_ξ(s)^k = f_{Xₙ}(f_ξ(s)).

Example 9.6
Let p0 = 1/2, p2 = 1/2, X0 = 1.



[sample family tree omitted]

We see that f_{X₀}(s) = s and f_{X₁}(s) = 1/2 + (1/2)s². Continuing,

    f_{Xₙ}(s) = f_ξ(f_ξ(· · · f_ξ(s)))    (the n-fold composition of f_ξ).

Let A = limₙ→∞ Pr{Xₙ = 0} be the extinction probability. (If E[ξ] < 1, then A = 1, as we will see.) Note that

    Pr{Xₙ = 0} = f_{Xₙ}(0) = f_ξ(f_ξ(· · · f_ξ(0))).

Observe that f_ξ : [0, 1] → R satisfies f_ξ(1) = 1 and f_ξ(0) = p₀ > 0, and that f_ξ is increasing and continuous on [0, 1].

[plot of f_ξ against the diagonal y = s omitted]

Theorem 9.7
The extinction probability A is the smallest positive root of fξ (s) = s.

Proof. Write f_ξ^{(n)} for the n-fold composition. The limit A = limₙ→∞ f_ξ^{(n)}(0) exists, since f_ξ^{(n)}(0) is increasing in n and bounded by 1. In addition, by continuity,

    f_ξ(A) = f_ξ( limₙ→∞ f_ξ^{(n)}(0) ) = limₙ→∞ f_ξ^{(n+1)}(0) = A,

so A is a root of f_ξ(s) = s. Finally, if ε is the smallest positive root, then by induction f_ξ^{(n)}(0) ≤ ε for all n (monotonicity gives f_ξ^{(n+1)}(0) ≤ f_ξ(ε) = ε), hence A ≤ ε; since A is itself a root, A = ε.

Example 9.8
Let p₀ = 1/2, p₂ = 1/2. Then f_ξ(s) = 1/2 + (1/2)s², so s = 1 is the only root of f_ξ(s) = s. This has extinction probability 1.


Example 9.9
Let p₀ = 1/4, p₁ = 1/4, p₂ = 1/2. Then f_ξ(s) = 1/4 + (1/4)s + (1/2)s², so s = 1, 1/2 are the roots of f_ξ(s) = s. This has extinction probability 1/2, the smaller root.
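Theorem 9.7 suggests an algorithm (my addition): iterate s ← f_ξ(s) starting from s = 0. The iterates equal Pr{Xₙ = 0} and increase to the extinction probability. For example 9.9 they converge to 1/2.

def extinction_probability(f, iterations=200):
    s = 0.0
    for _ in range(iterations):
        s = f(s)            # s is now Pr{X_n = 0} after n iterations
    return s

f = lambda s: 1/4 + s/4 + s*s/2     # pgf from example 9.9
print(extinction_probability(f))    # -> 0.4999... ~ 1/2

g = lambda s: 1/2 + s*s/2           # pgf from example 9.8
print(extinction_probability(g))    # -> approaches 1 (slowly: double root)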


10 April 13, 2017


10.1 Branching processes, ctd.
Recall that we have a distribution ξ with Pr {ξ = 0} , Pr {ξ = 1} , . . .

Example 10.1
If Pr{ξ = 0} + Pr{ξ = 1} = 1 (with Pr{ξ = 0} > 0), then at each step every particle has either 1 or 0 offspring. This process dies out with probability 1.

Assume that X₀ = 1. Recall that the extinction probability satisfies

    A = f_ξ(A) = Pr{ξ = 0} + Pr{ξ = 1} A + Pr{ξ = 2} A² + · · · + Pr{ξ = k} A^k + · · ·.

This is because each particle's line dies out independently, so if there are k offspring, all k lines need to die out, which happens with probability A^k.

Theorem 10.2
If E[ξ] ≤ 1, then A = 1, and if E[ξ] > 1, then f_ξ(s) = s has a unique positive root a such that a < 1, and A = a.

[plots of f_ξ against the diagonal omitted]

Graphically, E[ξ] = f_ξ′(1) is the slope of f_ξ at s = 1: when the slope is ≤ 1 the curve stays above the diagonal on [0, 1), and when it is > 1 the curve dips below the diagonal and crosses it at a second, smaller root.
We also use that f_ξ(s) is a convex function, f_ξ″(s) ≥ 0:

    f_ξ″(s) = d²/ds² ( Σ_{k=0}^∞ Pr{ξ = k} s^k ) = Σ_{k=2}^∞ k(k−1) Pr{ξ = k} s^{k−2} ≥ 0.

We can also prove the theorem analytically. First a useful claim about convexity.

Claim 10.3. If f(x) is convex, then the region {(x, y) : y ≥ f(x)} above its graph is convex. In particular, the intersection of this region with any line is a segment.

There are two cases to show.

Case 1 E[ξ] = f_ξ′(1) ≤ 1 implies that f_ξ′(s) ≤ f_ξ′(1) ≤ 1 for all s ≤ 1, since f_ξ′ is increasing. Then

    1 − f_ξ(s) = ∫_s^1 f_ξ′(t) dt < 1 − s

for all s < 1 (strictly, in the non-degenerate case). This implies that f_ξ(s) > s for all s < 1, so s = 1 is the smallest positive root and A = 1.

Case 2 E[ξ] = f_ξ′(1) > 1 implies that

    f_ξ(1 − ε) ≈ 1 − f_ξ′(1) · ε < 1 − ε

for small ε. Thus, f_ξ(s) < s when s is close to 1. On the other hand, f_ξ(0) = Pr{ξ = 0} > 0 = s at s = 0, so by continuity there should be some intersection, and f_ξ(s) = s has a root strictly < 1. f_ξ(s) is convex, so f_ξ(s) = s has at most 2 roots, and 1 is a root, so we have found the only other root.

Note, this proof handwaves past the degenerate cases (e.g., Pr{ξ = 1} = 1, or Pr{ξ = 0} = 0, where A = 0).
Moving on,

fXn+1 (s) = fXn (fξ (s)) = fX0 (fξ (fξ (. . . fξ (s))))

10.2 Conditional probability


Our next subject of study involves conditional probability and expectation,
which we introduce here.
Given two random variables X, Y , we know Pr {X = x, Y = y} for each pair
(x, y). The conditional probability

    Pr{Y = y | X = x} = Pr{X = x, Y = y} / Pr{X = x},

if Pr{X = x} ≠ 0. In the continuous case, this is problematic since Pr{X = x} is always 0. The usual expectation is a real number,

    E[f(X, Y)] = Σ_{x,y} f(x, y) Pr{X = x, Y = y}.

However, the conditional expectation is not a single number; it is a random variable.


Definition 10.4. The conditional expectation E[Y | X] is a random variable: on the event X = x it takes the value Σ_y y Pr{Y = y | X = x}.

Example 10.5 (Dice)


Let X₁, X₂ be i.i.d. random variables with Pr{Xⱼ = i} = 1/6 for i ∈ {1, 2, . . . , 6}. It's intuitive to see that E[X₁ + X₂ | X₁] = E[X₂] + X₁.

Definition 10.6. The conditional expectation on several variables is


    E[Y | X₁, . . . , X_l] = Σ_y y Pr{Y = y | X₁, . . . , X_l},

defined where Pr{X₁ = x₁, . . . , X_l = x_l} > 0.

There are several useful properties of conditional expectations. Let a, b ∈ R


and X, Y be random variables.
Claim 10.7. E [a|X] = a
Claim 10.8. E [aY1 + bY2 |X] = aE [Y1 |X] + bE [Y2 |X]
Claim 10.9. If X and Y are independent, then E [Y |X] = E [Y ].


Claim 10.10. E[Y f(X) | X] = f(X) E[Y | X]

Proof. We prove this by bashing simple math. On the event X = x,

    E[Y f(X) | X = x] = Σ_y y f(x) Pr{Y = y | X = x}
                      = f(x) Σ_y y Pr{Y = y | X = x}
                      = f(x) E[Y | X = x].

Claim 10.11. E [E [Y |X]] = E [Y ]

Proof.

    E[E[Y | X]] = Σ_x Pr{X = x} Σ_y y Pr{Y = y | X = x}
                = Σ_x Σ_y y Pr{Y = y, X = x}
                = Σ_y y Pr{Y = y} = E[Y].

Claim 10.12. E [E [Y |X1 , X2 ] |X1 ] = E [Y |X1 ]

Proof. On the event X₁ = x₁,

    E[E[Y | X₁, X₂] | X₁ = x₁]
      = Σ_{x₂} Pr{X₂ = x₂ | X₁ = x₁} Σ_y y Pr{Y = y | X₁ = x₁, X₂ = x₂}
      = Σ_{x₂, y} Pr{X₂ = x₂ | X₁ = x₁} · y · Pr{Y = y, X₂ = x₂ | X₁ = x₁} / Pr{X₂ = x₂ | X₁ = x₁}
      = Σ_y y Pr{Y = y | X₁ = x₁} = E[Y | X₁ = x₁].
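A tiny numeric check of claim 10.11 (my addition), with two dice and Y = X₁ + X₂: averaging the conditional expectation E[Y | X₁] over X₁ recovers E[Y] = 7.

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))     # (X1, X2), uniform

def cond_exp(x1):                                   # E[X1 + X2 | X1 = x1]
    vals = [a + b for a, b in outcomes if a == x1]
    return Fraction(sum(vals), len(vals))

inner = [cond_exp(a) for a, b in outcomes]          # the random variable E[Y | X1]
print(sum(inner, Fraction(0)) / len(inner))         # E[E[Y | X1]] = 7 = E[Y]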


11 April 20, 2017


11.1 Martingales
In this class, we study stochastic processes, where we are given a sequence of random variables X₀, X₁, . . . . A martingale is a type of stochastic process whose next value, conditioned on the entire past, equals the current value on average.
Definition 11.1. M₀, M₁, . . . , Mₙ, Mₙ₊₁ is a martingale if for any k < n,

    E[M_{k+1} | M_k, M_{k−1}, . . . , M₁, M₀] = M_k.

Proposition 11.2
E [Mk+2 |M0 . . . Mk ] = Mk

Proof. By the tower property (claim 10.12),

    E[M_{k+2} | M₀, . . . , M_k] = E[ E[M_{k+2} | M₀, . . . , M_{k+1}] | M₀, . . . , M_k ]
                                 = E[M_{k+1} | M₀, . . . , M_k] = M_k.

Proposition 11.3
E[M_{k+k₀} | M₀, . . . , M_k] = M_k for any k₀ ≥ 1 (by induction on k₀)

Proposition 11.4
E [Mk ] = E [M0 ]

Proof. E [Mk |M0 ] = M0 , so E [Mk ] = E [E [Mk |M0 ]] = E [M0 ].

Example 11.5
Let X₁, X₂, . . . , Xₙ be independent random variables, such that E[Xᵢ] = 0 for every i. Then Mₙ = Σᵢ Xᵢ is a martingale.

Example 11.6
We play a game in which each round we gain 3 or −1 with equal probability, but we must pay 1 to play the round (so the net gain per round has mean 0). More generally, let X₁, X₂, . . . , Xₙ be independent random variables, such that E[Xᵢ] = µ for every i. Then Mₙ = X₁ + · · · + Xₙ − nµ is a martingale.

Proof. Change variables with X̃ᵢ = Xᵢ − µ, which are independent with mean 0. By the previous example, Mₙ = Σᵢ₌₁ⁿ X̃ᵢ is a martingale.


Example 11.7
Let X₁, X₂, . . . , Xₙ be i.i.d. random variables, where E[Xᵢ] = 0 and E[Xᵢ²] = V, where V is a constant. Then Mₙ = (X₁ + X₂ + · · · + Xₙ)² − nV is a martingale.

Proof. Let Sₙ = Σᵢ₌₁ⁿ Xᵢ. Then

    E[Mₙ₊₁ | Mₙ] = E[(Sₙ + Xₙ₊₁)² − (n+1)V | Sₙ]
                 = −(n+1)V + E[Sₙ² | Sₙ] + 2 E[Sₙ Xₙ₊₁ | Sₙ] + E[Xₙ₊₁² | Sₙ]
                 = −(n+1)V + Sₙ² + 2 Sₙ · 0 + V
                 = Sₙ² − nV = Mₙ.

Example 11.8
Let X₁, X₂, . . . , Xₙ be i.i.d. random variables, and Sₙ = Σᵢ₌₁ⁿ Xᵢ. Then the reversed sequence M₁ = Sₙ/n, M₂ = Sₙ₋₁/(n−1), . . . , M_k = S_{n−k+1}/(n−k+1), . . . , down to S₁, is a martingale.

The key computation is E[S_{k−1} | S_k] = ((k−1)/k) S_k, which holds since by symmetry E[Xᵢ | S_k] = S_k/k for each i ≤ k.

Example 11.9
Consider a branching process with X₀ = 1, where ξ is the offspring distribution of one particle and E[ξ] = µ. Xₙ is the number of particles at time n.

Here E[Xₙ₊₁ | Xₙ] = Xₙµ: at every step, we expect each particle to produce µ particles.

“I don’t know! I need your help. . . ”—abufetov, in a pouty voice

Then the martingale we are looking for is Mₙ = Xₙ/µ^n:

    E[Mₙ₊₁ | Mₙ] = E[ Xₙ₊₁/µ^{n+1} | Xₙ ] = µXₙ/µ^{n+1} = Xₙ/µ^n = Mₙ.
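A simulation of this example (my addition), using the offspring law of example 9.9 (so µ = 5/4): individual runs of Mₙ = Xₙ/µ^n fluctuate and often die out, but the average over many runs stays at M₀ = 1, consistent with proposition 11.4.

import random

def branching_run(generations, rng):
    x = 1
    for _ in range(generations):
        # each particle independently has 0, 1, or 2 children (example 9.9)
        x = sum(rng.choices((0, 1, 2), weights=(1, 1, 2))[0] for _ in range(x))
    return x

rng = random.Random(0)
mu, gens, runs = 1.25, 10, 20000
avg = sum(branching_run(gens, rng) for _ in range(runs)) / runs
print(avg / mu**gens)      # E[M_n] = E[X_n]/mu^n stays ~ 1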


12 April 27, 2017


12.1 Rip Liang is a bad influence

13 May 1, 2018
After a year I am returning to finish my notes!

