Albert N. Shiryaev
Probability-2
Third Edition
Graduate Texts in Mathematics 95
Graduate Texts in Mathematics
Series Editors:
Sheldon Axler
San Francisco State University, San Francisco, CA, USA
Kenneth Ribet
University of California, Berkeley, CA, USA
Graduate Texts in Mathematics bridge the gap between passive study and creative understanding, offering graduate-level introductions to advanced topics in mathematics. The volumes are carefully written as teaching aids and highlight characteristic features of the theory. Although these books are frequently used as textbooks in graduate courses, they are also suitable for individual study.
Albert N. Shiryaev
Department of Probability Theory
and Mathematical Statistics
Steklov Mathematical Institute and
Lomonosov Moscow State University
Moscow, Russia
This Springer imprint is published by the registered company Springer Science+Business Media, LLC
part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface to the Third English Edition
The present edition is a translation of the fourth Russian edition of 2007, with the
previous three published in 1980, 1989, and 2004. The English translations of the
first two appeared in 1984 and 1996. The third and fourth Russian editions, extended
compared to the second edition, were published in two volumes titled Probability-1
and Probability-2. Accordingly, the present edition consists of two volumes: this
Vol. 2, titled Probability-2, contains Chaps. 4–8, and Chaps. 1–3 are contained in
Vol. 1, titled Probability-1, which was published in 2016.
This English translation of Probability-2 was prepared by the editor and translator Prof. D. M. Chibisov, Professor of the Steklov Mathematical Institute. A former student of N. V. Smirnov, he has a broad view of probability and mathematical statistics, which enabled him not only to translate the parts that had not been translated before, but also to edit both the previous translation and the Russian text, making in them quite a number of corrections and amendments.
The author is sincerely grateful to D. M. Chibisov for the translation and scientific editing of this book.
In Chaps. 7 and 8, we set out random sequences that form martingales and Markov chains. These classes of processes enable us to study the behavior of various stochastic systems in the “future,” depending on their “past” and “present,” thanks to which these processes play a very important role in modern probability theory and its applications.
The book concludes with a Historical Review of the Development of Mathematical Theory of Probability.
7 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
1 Definitions of Martingales and Related Concepts . . . . . . . . . . . . . . . 107
2 Preservation of Martingale Property Under a Random
Time Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Table of Contents of Probability-1
1. Zero–One Laws
The concept of mutual independence of two or more experiments holds, in a certain sense, a central position in the theory of probability. . . . Historically, the independence of experiments and random variables represents the very mathematical concept that has given the theory of probability its peculiar stamp.
A. N. Kolmogorov, Foundations of Probability Theory [50]
1. The series $\sum_{n=1}^{\infty} (1/n)$ diverges and the series $\sum_{n=1}^{\infty} (-1)^n (1/n)$ converges. We ask the following question. What can we say about the convergence or divergence of a series $\sum_{n=1}^{\infty} (\xi_n/n)$, where $\xi_1, \xi_2, \ldots$ is a sequence of independent identically distributed Bernoulli random variables with $P(\xi_1 = +1) = P(\xi_1 = -1) = \frac12$? In other words, what can be said about the convergence of a series whose general term is $\pm 1/n$, where the signs are chosen in a random manner, according to the sequence $\xi_1, \xi_2, \ldots$?
Let
$$A_1 = \Big\{\omega : \sum_{n=1}^{\infty} \frac{\xi_n}{n} \text{ converges}\Big\}$$
be the set of sample points for which $\sum_{n=1}^{\infty} (\xi_n/n)$ converges (to a finite number),
and consider the probability P(A1 ) of this set. It is far from clear, to begin with,
what values this probability might have. However, it is a remarkable fact that we are
able to say that the probability can have only two values, 0 or 1. This is a corollary
of Kolmogorov’s zero–one law, whose statement and proof form the main content
of this section.
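Before turning to the formal statement, the question is easy to probe numerically: for a typical realization of the signs, the partial sums of $\sum (\xi_n/n)$ settle down to a finite value. A minimal simulation sketch (NumPy, the seed, and the sample size are assumptions of the sketch, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def partial_sums(n_terms: int) -> np.ndarray:
    """Partial sums of sum_{n>=1} xi_n / n with iid random signs xi_n = +-1."""
    signs = rng.choice([-1.0, 1.0], size=n_terms)
    return np.cumsum(signs / np.arange(1, n_terms + 1))

s = partial_sums(100_000)
# Fluctuations of the partial sums die out along the tail of the sequence,
# consistent with convergence for almost every choice of signs.
early_spread = s[:1_000].max() - s[:1_000].min()
late_spread = s[50_000:].max() - s[50_000:].min()
```

Here the event $A_1$ in fact has probability 1; the two-series theorem of Sect. 2 explains why, since $\sum \operatorname{Var}(\xi_n/n) = \sum 1/n^2 < \infty$.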
where $I_n \in \mathscr{B}(R)$, $n \ge 1$;
$$A_4 = \Big\{\limsup_n \xi_n < \infty\Big\};$$
$$A_5 = \Big\{\limsup_n \frac{\xi_1 + \cdots + \xi_n}{n} < \infty\Big\};$$
$$A_6 = \Big\{\limsup_n \frac{\xi_1 + \cdots + \xi_n}{n} < c\Big\};$$
$$A_7 = \Big\{\frac{S_n}{n} \text{ converges}\Big\}, \quad \text{where } S_n = \xi_1 + \cdots + \xi_n;$$
$$A_8 = \Big\{\limsup_n \frac{S_n}{\sqrt{2n \log n}} = 1\Big\}.$$
PROOF. The idea of the proof is to show that every tail event $A$ is independent of itself, and therefore $P(A \cap A) = P(A) \cdot P(A)$, i.e., $P(A) = P^2(A)$, so that $P(A) = 0$ or 1.
If $A \in \mathscr{X}$, then $A \in \mathscr{F}_1^{\infty} = \sigma\{\xi_1, \xi_2, \ldots\} = \sigma\big(\bigcup_n \mathscr{F}_1^n\big)$, where $\mathscr{F}_1^n = \sigma\{\xi_1, \ldots, \xi_n\}$, and we find (Problem 8, Sect. 3, Chap. 2, Vol. 1) sets $A_n \in \mathscr{F}_1^n$, $n \ge 1$, such that $P(A \triangle A_n) \to 0$, $n \to \infty$. Hence
for every $n \ge 1$. Hence (1) implies that $P(A) = P^2(A)$, and therefore $P(A) = 0$ or 1.
This completes the proof of the theorem.
Corollary. Let $\eta$ be a random variable that is measurable with respect to the tail σ-algebra $\mathscr{X}$, i.e., $\{\eta \in B\} \in \mathscr{X}$, $B \in \mathscr{B}(R)$. Then $\eta$ is degenerate, i.e., there is a constant $c$ such that $P(\eta = c) = 1$.
PROOF. We first observe that the event $B = \{S_n = 0 \text{ i.o.}\}$ is not a tail event, i.e., $B \notin \mathscr{X} = \bigcap \mathscr{F}_n^{\infty}$, $\mathscr{F}_n^{\infty} = \sigma\{\xi_n, \xi_{n+1}, \ldots\}$. Consequently it is, in principle, not clear that $P(B)$ should take only the values 0 and 1.
Statement (b) is easily proved by applying (the first part of) the Borel–Cantelli
lemma. In fact, if B2n = {S2n = 0}, then, by Stirling’s formula (see (6), Sect. 2,
Chap. 1, Vol. 1),
4 Sequences and Sums of Independent Random Variables
$$P(B_{2n}) = C_{2n}^{n} p^n q^n \sim \frac{(4pq)^n}{\sqrt{\pi n}},$$
and therefore $\sum P(B_{2n}) < \infty$. Consequently, $P(S_n = 0 \text{ i.o.}) = 0$.
To prove (a), it is enough to prove that the event
$$A = \Big\{\limsup_n \frac{S_n}{\sqrt{n}} = \infty,\ \liminf_n \frac{S_n}{\sqrt{n}} = -\infty\Big\}$$
Then $A_c \downarrow A$, $c \to \infty$, and all the events $A$, $A_c$, $A^c$ are tail events. Let us show that $P(A_c) = P(A^c) = 1$ for each $c > 0$. Since $A_c \in \mathscr{X}$ and $A^c \in \mathscr{X}$, it is sufficient to show only that $P(A_c) > 0$, $P(A^c) > 0$. But by Problem 5,
$$P\Big(\liminf_n \frac{S_n}{\sqrt{n}} \le -c\Big) = P\Big(\limsup_n \frac{S_n}{\sqrt{n}} \ge c\Big) \ge \limsup_n P\Big(\frac{S_n}{\sqrt{n}} \ge c\Big) > 0,$$
where the last inequality follows from the de Moivre–Laplace theorem (Sect. 6, Chap. 1, Vol. 1).
Thus, P(Ac ) = 1 for all c > 0, and therefore P(A) = limc→∞ P(Ac ) = 1.
This completes the proof of the theorem.
4. Let us observe again that B = {Sn = 0 i.o.} is not a tail event. Nevertheless, it
follows from Theorem 2 that, for a Bernoulli scheme, the probability of this event,
just as for tail events, takes only the values 0 and 1. This phenomenon is not acci-
dental: it is a corollary of the Hewitt–Savage zero–one law, which for independent
identically distributed random variables extends the result of Theorem 1 to the class
of “symmetric” events (which includes the class of tail events).
Let us give the essential definitions. A one-to-one mapping $\pi = (\pi_1, \pi_2, \ldots)$ of the set $(1, 2, \ldots)$ onto itself is said to be a finite permutation if $\pi_n = n$ for all but finitely many $n$.
If $\xi = (\xi_1, \xi_2, \ldots)$ is a sequence of random variables, $\pi(\xi)$ denotes the sequence $(\xi_{\pi_1}, \xi_{\pi_2}, \ldots)$. If $A$ is the event $\{\xi \in B\}$, $B \in \mathscr{B}(R^{\infty})$, then $\pi(A)$ denotes the event $\{\pi(\xi) \in B\}$.
We call an event A = {ξ ∈ B}, B ∈ B(R∞ ) symmetric if π(A) coincides with A
for every finite permutation π.
An example of a symmetric event is $A = \{S_n = 0 \text{ i.o.}\}$, where $S_n = \xi_1 + \cdots + \xi_n$. Moreover, it can be shown (Problem 4) that every event in the tail σ-algebra $\mathscr{X}(S) = \bigcap \mathscr{F}_n^{\infty}(S)$, where $\mathscr{F}_n^{\infty}(S) = \sigma\{S_n, S_{n+1}, \ldots\}$, generated by $S_1 = \xi_1,\ S_2 = \xi_1 + \xi_2, \ldots$, is symmetric.
$$P(A \triangle A_n) \to 0, \quad n \to \infty. \qquad (2)$$
Therefore
whence, by (7),
$$P(A) = P^2(A)$$
and therefore P(A) = 0 or 1.
This completes the proof of the theorem.
5. PROBLEMS
1. Prove the corollary to Theorem 1.
2. Show that if (ξn )n≥1 is a sequence of independent random variables, then the
random variables lim sup ξn and lim inf ξn are degenerate.
3. Let (ξn ) be a sequence of independent random variables, Sn = ξ1 + · · · + ξn ,
and let the constants bn satisfy 0 < bn ↑ ∞. Show that the random variables
lim sup (Sn /bn ) and lim inf (Sn /bn ) are degenerate.
4. Let $S_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$, and
$$\mathscr{X}(S) = \bigcap \mathscr{F}_n^{\infty}(S), \quad \mathscr{F}_n^{\infty}(S) = \sigma\{S_n, S_{n+1}, \ldots\}.$$
Show that every event in $\mathscr{X}(S)$ is symmetric.
2. Convergence of Series
since
$$E\big[S_k(\xi_{k+1} + \cdots + \xi_n) I_{A_k}\big] = E\big[S_k I_{A_k}\big] \cdot E\,(\xi_{k+1} + \cdots + \xi_n) = 0$$
because of independence and the conditions $E\,\xi_i = 0$, $1 \le i \le n$. Hence
$$E\,S_n^2 \ge \sum_k E\,S_k^2 I_{A_k} \ge \varepsilon^2 \sum_k P(A_k) = \varepsilon^2 P(A),$$
and therefore
$$E\,S_n^2 I_A = \sum_k E\,S_k^2 I_{A_k} + \sum_k E\big(I_{A_k}(S_n - S_k)^2\big) \le \sum_k (\varepsilon + c)^2 P(A_k) + \sum_k P(A_k) \sum_{j=k+1}^{n} E\,\xi_j^2$$
$$\le P(A)\Big[(\varepsilon + c)^2 + \sum_{j=1}^{n} E\,\xi_j^2\Big] = P(A)\big[(\varepsilon + c)^2 + E\,S_n^2\big]. \qquad (5)$$
By (2),
$$P\Big\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Big\} = \lim_{N \to \infty} P\Big\{\max_{1 \le k \le N} |S_{n+k} - S_n| \ge \varepsilon\Big\} \le \lim_{N \to \infty} \frac{\sum_{k=n}^{n+N} E\,\xi_k^2}{\varepsilon^2} = \frac{\sum_{k=n}^{\infty} E\,\xi_k^2}{\varepsilon^2}.$$
Therefore (6) is satisfied if $\sum_{k=1}^{\infty} E\,\xi_k^2 < \infty$, and consequently $\sum \xi_k$ converges with probability 1.
(b) Let $\sum \xi_k$ converge. Then, by (6), for sufficiently large $n$,
$$P\Big\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Big\} < \tfrac12. \qquad (7)$$
By (3),
$$P\Big\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Big\} \ge 1 - \frac{(c + \varepsilon)^2}{\sum_{k=n}^{\infty} E\,\xi_k^2}.$$
Therefore, if we suppose that $\sum_{k=1}^{\infty} E\,\xi_k^2 = \infty$, then we obtain
$$P\Big\{\sup_{k \ge 1} |S_{n+k} - S_n| \ge \varepsilon\Big\} = 1,$$
which contradicts (7).
3. The following theorem provides a necessary and sufficient condition for the convergence of $\sum \xi_n$ without any boundedness condition on the random variables.
Let $c$ be a constant and
$$\xi^c = \begin{cases} \xi, & |\xi| \le c, \\ 0, & |\xi| > c. \end{cases}$$
A necessary condition for the convergence of $\sum \xi_n$ with probability 1 is that the three series $\sum E\,\xi_n^c$, $\sum \operatorname{Var} \xi_n^c$, and $\sum P(|\xi_n| \ge c)$ converge for every $c > 0$; a sufficient condition is that these series converge for some $c > 0$.
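For the random-sign series $\sum (\xi_n/n)$ of Sect. 1 the three series are easy to evaluate at $c = 1$. The following sketch checks them numerically (plain Python; the truncation level $N$ is an assumption of the sketch):

```python
import math

N = 100_000  # truncation level for the numerical check (assumption of the sketch)

# xi_n = eps_n / n with P(eps_n = +1) = P(eps_n = -1) = 1/2; take c = 1.
# (i)   sum_n P(|xi_n| >= 1): since |xi_n| = 1/n, only n = 1 contributes.
series_i = 1.0
# (ii)  sum_n E xi_n^c: each term is 0 by the symmetry of the signs.
series_ii = 0.0
# (iii) sum_n Var xi_n^c = sum_n 1/n^2, convergent (the full sum is pi^2 / 6).
series_iii = sum(1.0 / n ** 2 for n in range(1, N + 1))
```

All three series converge, so $\sum (\xi_n/n)$ converges with probability 1, which settles the question posed at the beginning of Sect. 1.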
PROOF. Sufficiency. By the two-series theorem, $\sum \xi_n^c$ converges with probability 1. But if $\sum P(|\xi_n| \ge c) < \infty$, then $\sum I(|\xi_n| \ge c) < \infty$ with probability 1 by the Borel–Cantelli lemma. Consequently, $\xi_n = \xi_n^c$ for all $n$ with at most finitely many exceptions. Therefore $\sum \xi_n$ also converges (P-a.s.).
Necessity. If $\sum \xi_n$ converges (P-a.s.), then $\xi_n \to 0$ (P-a.s.), and therefore, for every $c > 0$, at most a finite number of the events $\{|\xi_n| \ge c\}$ can occur (P-a.s.). Therefore $\sum I(|\xi_n| \ge c) < \infty$ (P-a.s.), and, by the second part of the Borel–Cantelli lemma, $\sum P(|\xi_n| \ge c) < \infty$. Moreover, the convergence of $\sum \xi_n$ implies the convergence of $\sum \xi_n^c$. Therefore, by the two-series theorem, both of the series $\sum E\,\xi_n^c$ and $\sum \operatorname{Var} \xi_n^c$ converge.
This completes the proof of the theorem.
Since $E\,\xi_n = 0$, we have
$$|E\,\xi_n^1| = |E\,\xi_n I(|\xi_n| \le 1)| = |E\,\xi_n I(|\xi_n| > 1)| \le E\,|\xi_n| I(|\xi_n| > 1) < \infty.$$
Therefore both $\sum E\,\xi_n^1$ and $\sum \operatorname{Var} \xi_n^1$ converge. Moreover, by Chebyshev's inequality,
$$P\{|\xi_n| > 1\} = P\{|\xi_n| I(|\xi_n| > 1) > 1\} \le E\big(|\xi_n| I(|\xi_n| > 1)\big).$$
Therefore $\sum P(|\xi_n| > 1) < \infty$. Hence the convergence of $\sum \xi_n$ follows from the three-series theorem.
2 Convergence of Series 11
4. PROBLEMS
1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables, $S_n = \xi_1 + \cdots + \xi_n$. Show, using the three-series theorem, that:
(a) If $\sum \xi_n^2 < \infty$ (P-a.s.), then $\sum \xi_n$ converges with probability 1 if and only if $\sum E\,\xi_n I(|\xi_n| \le 1)$ converges;
(b) If $\sum \xi_n$ converges (P-a.s.), then $\sum \xi_n^2 < \infty$ (P-a.s.) if and only if $\sum \big(E\,|\xi_n| I(|\xi_n| \le 1)\big)^2 < \infty$.
6. Let $\xi_1, \xi_2, \ldots$ be a sequence of (arbitrary) random variables. Show that if $\sum_{n \ge 1} E\,|\xi_n| < \infty$, then $\sum_{n \ge 1} \xi_n$ converges absolutely with probability 1.
7. Let $\xi_1, \xi_2, \ldots$ be independent random variables with a symmetric distribution. Show that
$$E\Big[\Big(\sum_n \xi_n\Big)^2 \wedge 1\Big] \le \sum_n E(\xi_n^2 \wedge 1).$$
8. Let $\xi_1, \xi_2, \ldots$ be independent random variables with finite second moments. Show that $\sum \xi_n$ converges in $L^2$ if and only if $\sum E\,\xi_n$ and $\sum \operatorname{Var} \xi_n$ converge.
9. Let $\xi_1, \xi_2, \ldots$ be independent random variables and the series $\sum \xi_n$ converge a.s. Show that the value of this series is independent of the order of its terms if and only if $\sum |E(\xi_n; |\xi_n| \le 1)| < \infty$.
10. Let $\xi_1, \xi_2, \ldots$ be independent random variables with $E\,\xi_n = 0$, $n \ge 1$, and
$$\sum_{n=1}^{\infty} E\big[\xi_n^2 I(|\xi_n| \le 1) + |\xi_n| I(|\xi_n| > 1)\big] < \infty.$$
Then $\sum_{n=1}^{\infty} \xi_n$ converges P-a.s.
$$\frac{\sum_{j=1}^{n} I(A_j)}{\sum_{j=1}^{n} P(A_j)} \to 1 \quad (P\text{-a.s.}) \text{ as } n \to \infty.$$
$$\frac{\sum_{j=1}^{n} \xi_j/\sigma_j^2}{\sum_{j=1}^{n} 1/\sigma_j^2} \to c \quad (P\text{-a.s.}) \text{ as } n \to \infty.$$
$$\frac{S_n - E\,S_n}{n} \xrightarrow{P} 0, \quad n \to \infty. \qquad (1)$$
A strong law of large numbers is a proposition in which convergence in proba-
bility is replaced by convergence with probability 1.
One of the earliest results in this direction is the following theorem.
$$E\,|\xi_n - E\,\xi_n|^4 \le C, \quad n \ge 1,$$
then
$$\frac{S_n - E\,S_n}{n} \to 0 \quad (P\text{-a.s.}). \qquad (2)$$
3 Strong Law of Large Numbers 13
Let us show that this condition is actually satisfied under our hypotheses.
We have
$$S_n^4 = (\xi_1 + \cdots + \xi_n)^4 = \sum_{i=1}^{n} \xi_i^4 + \frac{4!}{2!\,2!} \sum_{i<j} \xi_i^2 \xi_j^2 + \frac{4!}{2!\,1!\,1!} \sum_{\substack{i \ne j,\ i \ne k \\ j < k}} \xi_i^2 \xi_j \xi_k + 4! \sum_{i<j<k<l} \xi_i \xi_j \xi_k \xi_l + \frac{4!}{3!\,1!} \sum_{i \ne j} \xi_i^3 \xi_j.$$
Hence, since the $\xi_i$ are independent with $E\,\xi_i = 0$,
$$E\,S_n^4 = \sum_{i=1}^{n} E\,\xi_i^4 + 6 \sum_{i<j} E\,\xi_i^2\, E\,\xi_j^2 \le nC + 6 \sum_{i<j} \sqrt{E\,\xi_i^4 \cdot E\,\xi_j^4} \le nC + \frac{6n(n-1)}{2}\,C = (3n^2 - 2n)C < 3n^2 C.$$
Consequently,
$$\sum_n E\Big(\frac{S_n}{n}\Big)^4 \le 3C \sum_n \frac{1}{n^2} < \infty.$$
This completes the proof of the theorem.
In particular, if
$$\sum \frac{\operatorname{Var} \xi_n}{n^2} < \infty, \qquad (5)$$
then
$$\frac{S_n - E\,S_n}{n} \to 0 \quad (P\text{-a.s.}). \qquad (6)$$
For the proof of this, and of Theorem 3 in what follows, we need two lemmas.
Lemma 1 (Toeplitz). Let $\{a_n\}$ be a sequence of nonnegative numbers, $b_n = \sum_{i=1}^{n} a_i$, $b_1 = a_1 > 0$, and $b_n \uparrow \infty$, $n \to \infty$. Let $\{x_n\}_{n \ge 1}$ be a sequence of numbers converging to $x$. Then
$$\frac{1}{b_n} \sum_{j=1}^{n} a_j x_j \to x. \qquad (7)$$
In particular, if $a_n = 1$, then
$$\frac{x_1 + \cdots + x_n}{n} \to x. \qquad (8)$$
PROOF. Let $\varepsilon > 0$, and let $n_0 = n_0(\varepsilon)$ be such that $|x_n - x| \le \varepsilon/2$ for all $n \ge n_0$. Choose $n_1 > n_0$ such that
$$\frac{1}{b_{n_1}} \sum_{j=1}^{n_0} a_j |x_j - x| < \varepsilon/2.$$
Then, for $n > n_1$,
$$\Big| \frac{1}{b_n} \sum_{j=1}^{n} a_j x_j - x \Big| \le \frac{1}{b_n} \sum_{j=1}^{n_0} a_j |x_j - x| + \frac{1}{b_n} \sum_{j=n_0+1}^{n} a_j |x_j - x| \le \frac{1}{b_{n_1}} \sum_{j=1}^{n_0} a_j |x_j - x| + \frac{1}{b_n} \sum_{j=n_0+1}^{n} a_j |x_j - x| \le \frac{\varepsilon}{2} + \frac{b_n - b_{n_0}}{b_n} \cdot \frac{\varepsilon}{2} \le \varepsilon.$$
This completes the proof of the lemma.
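A numerical illustration of the lemma (the weights $a_j = j$ and the sequence $x_j = 2 + 1/j$ are arbitrary choices made for this sketch):

```python
def toeplitz_averages(a, x):
    """Weighted averages (1/b_n) * sum_{j<=n} a_j x_j with b_n = a_1 + ... + a_n."""
    out, num, b = [], 0.0, 0.0
    for aj, xj in zip(a, x):
        num += aj * xj
        b += aj
        out.append(num / b)
    return out

n_max = 100_000
a = [float(j) for j in range(1, n_max + 1)]       # a_j = j, so b_n -> infinity
x = [2.0 + 1.0 / j for j in range(1, n_max + 1)]  # x_j -> 2
avg = toeplitz_averages(a, x)                     # approaches the limit x = 2
```

Here $\frac{1}{b_n}\sum_{j\le n} a_j x_j = 2 + 2/(n+1)$ exactly, so the averages visibly approach the limit of $\{x_n\}$.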
Lemma 2 (Kronecker). Let $\{b_n\}$ be an increasing sequence of positive numbers, $b_n \uparrow \infty$, and let $\{x_n\}$ be a sequence of numbers such that $\sum x_n$ converges. Then
$$\frac{1}{b_n} \sum_{j=1}^{n} b_j x_j \to 0, \quad n \to \infty.$$
In particular, if $b_n = n$, $x_n = y_n/n$ and $\sum (y_n/n)$ converges, then
$$\frac{y_1 + \cdots + y_n}{n} \to 0, \quad n \to \infty. \qquad (9)$$
PROOF. Let $b_0 = 0$, $S_0 = 0$, $S_n = \sum_{j=1}^{n} x_j$. Then (by summation by parts)
$$\sum_{j=1}^{n} b_j x_j = \sum_{j=1}^{n} b_j (S_j - S_{j-1}) = b_n S_n - b_0 S_0 - \sum_{j=1}^{n} S_{j-1}(b_j - b_{j-1}).$$
Hence, setting $a_j = b_j - b_{j-1}$, we obtain
$$\frac{1}{b_n} \sum_{j=1}^{n} b_j x_j = S_n - \frac{1}{b_n} \sum_{j=1}^{n} S_{j-1} a_j \to 0,$$
since $S_n \to x$ and, by Toeplitz's lemma,
$$\frac{1}{b_n} \sum_{j=1}^{n} S_{j-1} a_j \to x.$$
$$\frac{S_n}{n} \to m \quad (P\text{-a.s.}), \qquad (11)$$
where $m = E\,\xi_1$.
(Or one can use formula (69) with n = 1 of Sect. 6, Chap. 2, Vol. 1.)
and suppose that $E\,\xi_n = 0$, $n \ge 1$. Then $\xi_n \ne \tilde{\xi}_n$ for only finitely many $n$ (P-a.s.), and therefore $(\xi_1 + \cdots + \xi_n)/n \to 0$ (P-a.s.) if and only if $(\tilde{\xi}_1 + \cdots + \tilde{\xi}_n)/n \to 0$ (P-a.s.). Note that in general $E\,\tilde{\xi}_n \ne 0$, but
$$\sum_{n=1}^{\infty} \frac{1}{n^2}\, E\big[\xi_1^2 I(|\xi_1| < n)\big] = \sum_{n=1}^{\infty} \frac{1}{n^2} \sum_{k=1}^{n} E\big[\xi_1^2 I(k-1 \le |\xi_1| < k)\big] = \sum_{k=1}^{\infty} E\big[\xi_1^2 I(k-1 \le |\xi_1| < k)\big] \cdot \sum_{n=k}^{\infty} \frac{1}{n^2}$$
$$\le 2 \sum_{k=1}^{\infty} \frac{1}{k}\, E\big[\xi_1^2 I(k-1 \le |\xi_1| < k)\big] \le 2 \sum_{k=1}^{\infty} E\big[|\xi_1|\, I(k-1 \le |\xi_1| < k)\big] = 2\,E\,|\xi_1| < \infty.$$
$$\frac{\xi_1 + \cdots + \xi_n}{n} \to C$$
with probability 1, where $C$ is a (finite) constant. Then $E\,|\xi_1| < \infty$ and $C = E\,\xi_1$.
and therefore $P(|\xi_n| > n \text{ i.o.}) = 0$. By the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1),
$$\sum P(|\xi_1| > n) < \infty,$$
and by Lemma 3 we have E |ξ1 | < ∞. Then it follows from the theorem that C =
E ξ1 .
Consequently, for independent identically distributed random variables, the condition $E\,|\xi_1| < \infty$ is necessary and sufficient for the convergence (with probability 1) of the ratio $S_n/n$ to a finite limit.
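This dichotomy is visible in simulation: sample means stabilize exactly when $E\,|\xi_1| < \infty$. A sketch (the two distributions, the seed, and the sample size are choices made for the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000

# E|xi_1| < infinity: exponential with mean 3, so S_n / n -> E xi_1 = 3 (P-a.s.).
mean_finite_case = rng.exponential(3.0, size=n).mean()

# E|xi_1| = infinity: standard Cauchy; S_n / n is itself Cauchy distributed for
# every n, so the running means keep fluctuating and do not converge.
running_means_cauchy = np.cumsum(rng.standard_cauchy(size=n)) / np.arange(1, n + 1)
```

Plotting `running_means_cauchy` against $n$ shows the characteristic occasional large jumps of the Cauchy case, in contrast with the exponential case.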
Remark 2. If the expectation $m = E\,\xi_1$ exists but is not necessarily finite, the conclusion (9) of the theorem remains valid.
In fact, let, for example, $E\,\xi_1^- < \infty$ and $E\,\xi_1^+ = \infty$. With $C > 0$, put
$$S_n^C = \sum_{i=1}^{n} \xi_i I(\xi_i \le C).$$
Then (P-a.s.)
$$\liminf_n \frac{S_n}{n} \ge \liminf_n \frac{S_n^C}{n} = E\,\xi_1 I(\xi_1 \le C).$$
But as $C \to \infty$,
$$E\,\xi_1 I(\xi_1 \le C) \to E\,\xi_1 = \infty,$$
and therefore $S_n/n \to +\infty$ (P-a.s.).
Remark 3. Theorem 3 asserts the convergence $S_n/n \to m$ (P-a.s.). Note that, besides the convergence almost surely, in this case the convergence in the mean, $\frac{S_n}{n} \xrightarrow{L^1} m$, also holds, i.e., $E\,\big|\frac{S_n}{n} - m\big| \to 0$, $n \to \infty$. This follows from the ergodic Theorem 3 of Sect. 3, Chap. 5. But in the case under consideration of independent identically distributed random variables $\xi_1, \xi_2, \ldots$ and $S_n = \xi_1 + \xi_2 + \cdots + \xi_n$, this can be proved directly (Problem 7) without invoking the ergodic theorem.
4. Let us give some applications of the strong law of large numbers.
EXAMPLE 2 (Application to number theory). Let $\Omega = [0, 1)$, let $\mathscr{B}$ be the σ-algebra of Borel subsets of $\Omega$, and let $P$ be the Lebesgue measure on $[0, 1)$. Consider the binary expansions $\omega = 0.\omega_1\omega_2\ldots$ of numbers $\omega \in \Omega$ (with infinitely many 0s), and define random variables $\xi_1(\omega), \xi_2(\omega), \ldots$ by putting $\xi_n(\omega) = \omega_n$. Since, for all $n \ge 1$ and all $x_1, \ldots, x_n$ taking a value 0 or 1,
$$\{\omega : \xi_1(\omega) = x_1, \ldots, \xi_n(\omega) = x_n\} = \Big\{\omega : \frac{x_1}{2} + \frac{x_2}{2^2} + \cdots + \frac{x_n}{2^n} \le \omega < \frac{x_1}{2} + \cdots + \frac{x_n}{2^n} + \frac{1}{2^n}\Big\},$$
the probability of each such event is $2^{-n}$, so that $\xi_1, \xi_2, \ldots$ are independent and identically distributed with
$$P(\xi_1 = 0) = P(\xi_1 = 1) = \tfrac12.$$
Hence, by the strong law of large numbers, we have the following result of Borel: almost every number in $[0, 1)$ is normal, in the sense that with probability 1 the proportion of zeros and ones in its binary expansion tends to $\frac12$, i.e.,
$$\frac{1}{n} \sum_{k=1}^{n} I(\xi_k = 1) \to \frac12 \quad (P\text{-a.s.}).$$
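Modeling the digits $\xi_1, \xi_2, \ldots$ as iid fair bits, Borel's statement can be checked empirically (the seed and the number of digits are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2024)
bits = rng.integers(0, 2, size=1_000_000)  # binary digits of a "typical" omega
freq_of_ones = bits.mean()                 # (1/n) sum_k I(xi_k = 1), near 1/2
```

By the de Moivre–Laplace theorem the deviation of `freq_of_ones` from $\frac12$ is of order $1/(2\sqrt{n})$, here about $0.0005$.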
EXAMPLE 3 (The Monte Carlo method). Let $f(x)$ be a continuous function defined on $[0, 1]$, with values in $[0, 1]$. The following idea is the foundation of the statistical method of calculating $\int_0^1 f(x)\,dx$ (the Monte Carlo method). Let $\xi_1, \eta_1, \xi_2, \eta_2, \ldots$ be a sequence of independent random variables uniformly distributed on $[0, 1]$. Put
$$\rho_i = \begin{cases} 1 & \text{if } f(\xi_i) > \eta_i, \\ 0 & \text{if } f(\xi_i) \le \eta_i. \end{cases}$$
It is clear that
$$E\,\rho_1 = P\{f(\xi_1) > \eta_1\} = \int_0^1 f(x)\,dx.$$
By the strong law of large numbers (Theorem 3),
$$\frac{1}{n} \sum_{i=1}^{n} \rho_i \to \int_0^1 f(x)\,dx \quad (P\text{-a.s.}).$$
Consequently, we can approximate the integral $\int_0^1 f(x)\,dx$ by taking a simulation consisting of $n$ pairs of random variables $(\xi_i, \eta_i)$, $i \ge 1$, and then calculating $\rho_i$ and $(1/n) \sum_{i=1}^{n} \rho_i$.
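The procedure just described is only a few lines of code. A hit-or-miss sketch (NumPy, the seed, and the test function $f(x) = x^2$ are assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def hit_or_miss(f, n: int) -> float:
    """Monte Carlo estimate of the integral of f over [0,1], for f with values in [0,1]."""
    xi = rng.random(n)   # the xi_i, uniform on [0, 1]
    eta = rng.random(n)  # the eta_i, uniform on [0, 1]
    rho = f(xi) > eta    # rho_i = 1 iff the point (xi_i, eta_i) lies under the graph of f
    return rho.mean()

est = hit_or_miss(lambda x: x * x, 1_000_000)  # integral of x^2 over [0,1] equals 1/3
```

By the central limit theorem the error of the estimate is of order $1/\sqrt{n}$.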
EXAMPLE 4 (The strong law of large numbers for a renewal process). Let $N = (N_t)_{t \ge 0}$ be a renewal process introduced in Subsection 4 of Sect. 9, Chap. 2, Vol. 1: $N_t = \sum_{n=1}^{\infty} I(T_n \le t)$, $T_n = \sigma_1 + \cdots + \sigma_n$, where $\sigma_1, \sigma_2, \ldots$ is a sequence of independent identically distributed positive random variables. We assume now that $\mu = E\,\sigma_1 < \infty$. Under this condition, the process $N$ satisfies the strong law of large numbers:
$$\frac{N_t}{t} \to \frac{1}{\mu} \quad (P\text{-a.s.}), \quad t \to \infty. \qquad (14)$$
For the proof, we observe that, for $N_t > 0$, the fact that $T_{N_t} \le t < T_{N_t + 1}$, $t \ge 0$, implies the inequalities
$$\frac{T_{N_t}}{N_t} \le \frac{t}{N_t} < \frac{T_{N_t + 1}}{N_t + 1}\Big(1 + \frac{1}{N_t}\Big). \qquad (15)$$
and hence we see from (15) that the limit $\lim_{t \to \infty} t/N_t$ exists (P-a.s.) and is equal to $\mu$, which proves the strong law of large numbers (14).
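A simulation consistent with (14). Exponential inter-arrival times, the seed, and the horizon are assumptions of the sketch; the theorem itself requires only $\mu = E\,\sigma_1 < \infty$:

```python
import numpy as np

rng = np.random.default_rng(7)

def renewal_count(t: float, mean_gap: float, n_max: int) -> int:
    """N_t = number of renewal epochs T_n = sigma_1 + ... + sigma_n with T_n <= t."""
    sigma = rng.exponential(mean_gap, size=n_max)  # iid positive sigma_1, sigma_2, ...
    t_n = np.cumsum(sigma)                         # the epochs T_n
    return int(np.searchsorted(t_n, t, side="right"))

t, mu = 100_000.0, 2.0
n_t = renewal_count(t, mu, n_max=200_000)
rate = n_t / t  # should be close to 1/mu = 0.5 by (14)
```

With exponential gaps $N_t$ is Poisson with mean $t/\mu$, so the relative error of `rate` here is of order $\sqrt{\mu/t}$.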
5. PROBLEMS
1. Show that $E\,\xi^2 < \infty$ if and only if $\sum_{n=1}^{\infty} n\,P(|\xi| > n) < \infty$.
2. Supposing that ξ1 , ξ2 , . . . are independent identically distributed, show that if
E |ξ1 |α < ∞ for some α, 0 < α < 1, then Sn /n1/α → 0 (P-a.s.), and if
E |ξ1 |β < ∞ for some β, 1 ≤ β < 2, then (Sn − n E ξ1 )/n1/β → 0 (P-a.s.).
3. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables, and let $E\,|\xi_1| = \infty$. Show that
$$\limsup_n \Big|\frac{S_n}{n} - a_n\Big| = \infty \quad (P\text{-a.s.})$$
for every sequence of constants $(a_n)$.
$$\frac{1}{n} \sum_{k=1}^{n} I(\xi_k(\omega) = i) \to \frac{1}{10} \quad (P\text{-a.s.}) \quad \text{for any } i = 0, 1, \ldots, 9.$$
11. Prove that Kolmogorov's law of large numbers (Theorem 3) can be restated in the following form: Let $\xi_1, \xi_2, \ldots$ be independent identically distributed random variables; then $S_n/n$ converges (P-a.s.) to a finite constant $c$ if and only if $E\,|\xi_1| < \infty$, in which case $c = E\,\xi_1$.
Prove that the first statement remains true with independence replaced by pairwise independence.
12. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables. Show that
$$E \sup_n \frac{|\xi_n|}{n} < \infty \iff E\,|\xi_1| \log^+ |\xi_1| < \infty.$$
(d) If $E\,\xi_1 = 0$, $E\,\xi_1^2 < \infty$, and $M(\varepsilon) = \sup_{n \ge 0}(S_n - n\varepsilon)$, $\varepsilon > 0$, then
Definition. We call a function $\varphi^* = \varphi^*(n)$, $n \ge 1$, upper (for $(S_n)_{n \ge 1}$) if, with probability 1, $S_n \le \varphi^*(n)$ for all $n$ from some $n = n_0(\omega)$ on.
We call a function $\varphi_* = \varphi_*(n)$, $n \ge 1$, lower (for $(S_n)_{n \ge 1}$) if, with probability 1, $S_n > \varphi_*(n)$ for infinitely many $n$.
Using these definitions, and appealing to (1) and (2), we can say that every function $\varphi^* = \varepsilon\sqrt{n \log n}$, $\varepsilon > 0$, is upper, whereas $\varphi_* = \varepsilon\sqrt{n}$, $\varepsilon > 0$, is lower.
Let $\varphi = \varphi(n)$ be a function and $\varphi_\varepsilon^* = (1 + \varepsilon)\varphi$, $\varphi_{*\varepsilon} = (1 - \varepsilon)\varphi$, where $\varepsilon > 0$. Then it is easily seen that
$$\Big\{\limsup_n \frac{S_n}{\varphi(n)} \le 1\Big\} = \Big\{\lim_n \sup_{m \ge n} \frac{S_m}{\varphi(m)} \le 1\Big\}$$
4 Law of the Iterated Logarithm 23
$$\Leftrightarrow \Big\{\sup_{m \ge n_1(\varepsilon, \omega)} \frac{S_m}{\varphi(m)} \le 1 + \varepsilon \text{ for any } \varepsilon > 0 \text{ and some } n_1(\varepsilon, \omega)\Big\}$$
$$\Leftrightarrow \{S_m \le (1 + \varepsilon)\varphi(m) \text{ for any } \varepsilon > 0, \text{ from some } n_1(\varepsilon, \omega) \text{ on}\}. \qquad (3)$$
It follows from (3) and (4) that to verify that each function $\varphi_\varepsilon^* = (1 + \varepsilon)\varphi$, $\varepsilon > 0$, is upper, we must show that
$$P\Big\{\limsup_n \frac{S_n}{\varphi(n)} \le 1\Big\} = 1, \qquad (5)$$
and to show that $\varphi_{*\varepsilon} = (1 - \varepsilon)\varphi$, $\varepsilon > 0$, is lower, we must show that
$$P\Big\{\limsup_n \frac{S_n}{\varphi(n)} \ge 1\Big\} = 1. \qquad (6)$$
$$P(B) \ge \sum_{k=1}^{n} P(A_k \cap B) \ge \frac12 \sum_{k=1}^{n} P(A_k) = \frac12\, P(A),$$
$$P(S_n > a(n)) \sim \frac{\sigma(n)}{\sqrt{2\pi}\,a(n)} \exp\big\{-\tfrac12\, a^2(n)/\sigma^2(n)\big\}. \qquad (10)$$
PROOF OF THEOREM 1 (for $\xi_i \sim \mathscr{N}(0, 1)$). Let us first establish (5). Let $\varepsilon > 0$, $\lambda = 1 + \varepsilon$, $n_k = \lambda^k$, where $k \ge k_0$, and $k_0$ is chosen so that $\log \log k_0$ is defined.
We also define
and put
$$A = \{A_k \text{ i.o.}\} = \{S_n > \lambda\psi(n) \text{ for infinitely many } n\}.$$
$$S_{n_{k-1}} \ge -2\psi(n_{k-1})$$
or
$$S_{n_k} \ge Y_k - 2\psi(n_{k-1}), \qquad (12)$$
where $Y_k = S_{n_k} - S_{n_{k-1}}$.
Hence, if we show that, for infinitely many $k$, $Y_k \ge \lambda\psi(n_k) + 2\psi(n_{k-1})$, this and (12) show that (P-a.s.) $S_{n_k} > \lambda\psi(n_k)$ for infinitely many $k$. Take some $\lambda' \in (\lambda, 1)$. Then there is an $N > 1$ such that for all $k$
$$\lambda'\big[2(N^k - N^{k-1}) \log\log N^k\big]^{1/2} > \lambda\big(2N^k \log\log N^k\big)^{1/2} + 2\big(2N^{k-1} \log\log N^{k-1}\big)^{1/2} \equiv \lambda\psi(N^k) + 2\psi(N^{k-1}).$$
Remark 1. Applying (7) to the random variables $(-S_n)_{n \ge 1}$, we find that (P-a.s.)
$$\liminf_n \frac{S_n}{\varphi(n)} = -1. \qquad (15)$$
It follows from (7) and (15) that the law of the iterated logarithm can be put in the form
$$P\Big\{\limsup_n \frac{|S_n|}{\varphi(n)} = 1\Big\} = 1. \qquad (16)$$
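A simulation sketch of (16) for Bernoulli $\pm 1$ steps (the seed, the range of $n$, and the cutoff $n \ge 10$, which keeps $\log\log n$ positive, are assumptions of the sketch; convergence in the law of the iterated logarithm is slow, so for moderate $n$ the running maximum only hovers near 1):

```python
import numpy as np

rng = np.random.default_rng(3)
n_max = 1_000_000
s = np.cumsum(rng.choice([-1.0, 1.0], size=n_max))    # the random walk S_n
n = np.arange(1, n_max + 1)
mask = n >= 10                                        # keep log log n well defined
phi = np.sqrt(2 * n[mask] * np.log(np.log(n[mask])))  # phi(n) = sqrt(2 n log log n)
ratio_max = np.max(np.abs(s[mask]) / phi)             # finite-n proxy for lim sup |S_n|/phi(n)
```

No expected value is asserted beyond an order-of-magnitude range: the limit 1 in (16) is attained only as $n \to \infty$.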
Remark 2. The law of the iterated logarithm says that for every $\varepsilon > 0$ each function $\psi_\varepsilon^* = (1 + \varepsilon)\psi$ is upper and $\psi_{*\varepsilon} = (1 - \varepsilon)\psi$ is lower.
The conclusion (7) is also equivalent to the statement that, for each ε > 0,
3. PROBLEMS
1. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables with $\xi_n \sim \mathscr{N}(0, 1)$. Show that
$$P\Big\{\limsup_n \frac{\xi_n}{\sqrt{2\log n}} = 1\Big\} = 1.$$
2. Let $\xi_1, \xi_2, \ldots$ be a sequence of independent random variables, distributed according to Poisson's law with parameter $\lambda > 0$. Show that (regardless of $\lambda$)
$$P\Big\{\limsup_n \frac{\xi_n \log\log n}{\log n} = 1\Big\} = 1.$$
Show that
$$P\Big\{\limsup_n \Big(\frac{S_n}{n^{1/\alpha}}\Big)^{1/(\log\log n)} = e^{1/\alpha}\Big\} = 1.$$
4. Establish the following generalization of (9). Let $\xi_1, \ldots, \xi_n$ be independent random variables, and let $S_0 = 0$, $S_k = \xi_1 + \cdots + \xi_k$. Then Lévy's inequality
$$P\Big\{\max_{0 \le k \le n} [S_k + \mu(S_n - S_k)] > a\Big\} \le 2\,P(S_n > a)$$
5 Probabilities of Large Deviations 27
holds for every real $a > 0$, where $\mu(\xi)$ is the median of $\xi$, i.e., a constant such that
$$P(\xi \ge \mu(\xi)) \ge \tfrac12, \quad P(\xi \le \mu(\xi)) \ge \tfrac12.$$
5. Let ξ1 , . . . , ξn be independent random variables, and let S0 = 0, Sk = ξ1 +· · ·+ξk .
Prove that:
(a) (In addition to Problem 4)
$$P\Big\{\max_{1 \le k \le n} |S_k + \mu(S_n - S_k)| \ge a\Big\} \le 2\,P\{|S_n| \ge a\},$$
Under the same assumptions, show that if $(a_n)$ is a sequence of real numbers such that $a_n/\sqrt{n} \to \infty$ but $a_n = o(n)$, then for any $\varepsilon > 0$ and sufficiently large $n$
$$P\{S_n > a_n\} > \exp\Big\{-\frac{a_n^2}{2n\sigma^2}(1 + \varepsilon)\Big\}.$$
8. Let $\xi_1, \ldots, \xi_n$ be independent random variables such that $E\,\xi_i = 0$, $|\xi_i| \le C$ (P-a.s.), $i \le n$. Let $D_n = \sum_{i=1}^{n} \operatorname{Var} \xi_i$. Show that $S_n = \xi_1 + \cdots + \xi_n$ satisfies the inequality (Yu. V. Prohorov)
$$P\{S_n \ge a\} \le \exp\Big\{-\frac{a}{2C} \operatorname{arcsinh} \frac{aC}{2D_n}\Big\}, \quad a \in R.$$
1. Consider the Bernoulli scheme treated in Sect. 6, Chap. 1, Vol. 1. For this scheme, the de Moivre–Laplace theorem provides an approximation for the probabilities of standard (normal) deviations $|S_n - np| \ge \varepsilon\sqrt{n}$, i.e., deviations of $S_n$ from the central value $np$ by a quantity of order $\sqrt{n}$. In the same Sect. 6, Chap. 1, Vol. 1 we gave a
bound for probabilities of so-called large deviations $|S_n - np| \ge \varepsilon n$, i.e., deviations of $S_n$ from $np$ of order $n$:
$$P\Big\{\Big|\frac{S_n}{n} - p\Big| \ge \varepsilon\Big\} \le 2e^{-2n\varepsilon^2} \qquad (1)$$
(see (42) in Sect. 6, Chap. 1, Vol. 1). From this, of course, there follow the inequalities
$$P\Big\{\sup_{m \ge n} \Big|\frac{S_m}{m} - p\Big| \ge \varepsilon\Big\} \le \sum_{m \ge n} P\Big\{\Big|\frac{S_m}{m} - p\Big| \ge \varepsilon\Big\} \le \frac{2}{1 - e^{-2\varepsilon^2}}\, e^{-2n\varepsilon^2}, \qquad (2)$$
the function $\psi(\lambda)$ is convex (from below) and infinitely differentiable. We also notice that
$$\psi(0) = 0, \quad \psi'(0) = m \ (= E\,\xi), \quad \psi''(\lambda) \ge 0.$$
We define the function
$$H(a) = \sup_{\lambda} [a\lambda - \psi(\lambda)],$$
called the Cramér transform (of the distribution function $F = F(x)$ of the random variable $\xi$). The function $H(a)$ is also convex (from below) and its minimum is zero, attained at $a = m$.
If $a > m$, we have
$$H(a) = \sup_{\lambda > 0} [a\lambda - \psi(\lambda)].$$
Then
$$P\{\xi \ge a\} \le \inf_{\lambda > 0} E\,e^{\lambda(\xi - a)} = \inf_{\lambda > 0} e^{-[a\lambda - \psi(\lambda)]} = e^{-H(a)}. \qquad (7)$$
then
$$H_n(a) = nH(a) = n \sup_{\lambda} [a\lambda - \psi(\lambda)]$$
and inequalities (7), (8), and (9) assume the following forms:
$$P\Big\{\frac{S_n}{n} \ge a\Big\} \le e^{-nH(a)}, \quad a > m, \qquad (11)$$
$$P\Big\{\frac{S_n}{n} \le a\Big\} \le e^{-nH(a)}, \quad a < m, \qquad (12)$$
$$P\Big\{\Big|\frac{S_n}{n} - m\Big| \ge \varepsilon\Big\} \le 2e^{-n \min\{H(m - \varepsilon),\, H(m + \varepsilon)\}}. \qquad (13)$$
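For a Bernoulli scheme the Cramér transform has the closed form $H(x) = x\log(x/p) + (1 - x)\log((1 - x)/(1 - p))$ (it reappears in (23) below), so inequality (11) can be compared with the exact binomial tail. A sketch (the parameters $n$, $p$, $a$ are arbitrary choices made for the sketch):

```python
import math

def cramer_H(x: float, p: float) -> float:
    """Cramer transform (rate function) of a Bernoulli(p) random variable."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

def binomial_tail(n: int, p: float, k0: int) -> float:
    """Exact P(S_n >= k0) for S_n ~ Binomial(n, p), via log binomial coefficients."""
    log_c, total = 0.0, 0.0  # log C(n, 0) = 0
    for k in range(n + 1):
        if k >= k0:
            total += math.exp(log_c + k * math.log(p) + (n - k) * math.log(1 - p))
        if k < n:
            log_c += math.log(n - k) - math.log(k + 1)  # update to log C(n, k+1)
    return total

n, p, a = 200, 0.5, 0.7
bound = math.exp(-n * cramer_H(a, p))         # right-hand side of (11)
exact = binomial_tail(n, p, round(n * a))     # P(S_n / n >= a), exactly
```

As (11) guarantees, `exact` never exceeds `bound`; the bound captures the correct exponential order of the tail, in accordance with (23) below.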
where a > 0 and b > 0, indicate exponential convergence “adjusted” by the constants a and b. In the theory of large deviations, such results are often presented in a somewhat different, “cruder,” form,
$$\limsup_n \frac{1}{n} \log P\Big\{\Big|\frac{S_n}{n} - m\Big| \ge \varepsilon\Big\} < 0, \qquad (15)$$
that clearly arises from (14) and refers to the “exponential” rate of convergence, but
without specifying the values of the constants a and b.
Now we turn to the question of upper bounds for the probabilities
$$P\Big\{\sup_{k \ge n} \frac{S_k}{k} > a\Big\}, \quad P\Big\{\inf_{k \ge n} \frac{S_k}{k} < a\Big\}, \quad P\Big\{\sup_{k \ge n} \Big|\frac{S_k}{k} - m\Big| > \varepsilon\Big\},$$
which can provide definite bounds on the rate of convergence in the strong law of large numbers.
Then
$$P\Big\{\sup_{k \ge n} \frac{S_k}{k} > a\Big\} = P\Big(\bigcup_{k \ge n} \Big\{\frac{S_k}{k} > a\Big\}\Big) = P\Big\{\frac{S_\varkappa}{\varkappa} > a,\ \varkappa < \infty\Big\} = P\{e^{\lambda S_\varkappa} > e^{\lambda a \varkappa},\ \varkappa < \infty\}$$
$$= P\{e^{\lambda S_\varkappa - \varkappa \log \varphi(\lambda)} > e^{\varkappa(\lambda a - \log \varphi(\lambda))},\ \varkappa < \infty\} \le P\{e^{\lambda S_\varkappa - \varkappa \log \varphi(\lambda)} > e^{n(\lambda a - \log \varphi(\lambda))},\ \varkappa < \infty\}$$
$$\le P\Big\{\sup_{k \ge n} e^{\lambda S_k - k \log \varphi(\lambda)} \ge e^{n(\lambda a - \log \varphi(\lambda))}\Big\}. \qquad (17)$$
To take the final step, we notice that the sequence of random variables
Let $a > m$. Since the function $f(\lambda) = a\lambda - \log \varphi(\lambda)$ has the properties $f(0) = 0$, $f'(0) > 0$, there is a $\lambda > 0$ for which (16) is satisfied, and consequently we obtain from (18) that if $a > m$, then
$$P\Big\{\sup_{k \ge n} \frac{S_k}{k} > a\Big\} \le e^{-n \sup_{\lambda > 0}[\lambda a - \log \varphi(\lambda)]} = e^{-nH(a)}. \qquad (19)$$
Remark 2. The fact that the right-hand sides of inequalities (11) and (19) are the
same leads us to suspect that this situation is not accidental. In fact, this expectation
is concealed in the property that the sequences (Sk /k)n≤k≤N form, for every n ≤ N,
reversed martingales (see Problem 5 in Sect. 1, Chap. 7, and Example 4 in Sect. 11,
Chap. 1, Vol. 1).
2. PROBLEMS
1. Carry out the proof of inequalities (8) and (20).
2. Verify that under condition (3), the function ψ(λ) is convex (from below) on the
interior of the set Λ (see (5)) (and strictly convex provided ξ is nondegenerate)
and infinitely differentiable.
3. Assuming that ξ is nondegenerate, prove that the function H(a) is differentiable
on the whole real line and is convex (from below).
4. Prove the following inversion formula for Cramér's transform:
$$\psi(\lambda) = \sup_a [a\lambda - H(a)]$$
(for all $\lambda$, except, possibly, the endpoints of the set $\Lambda = \{\lambda : \psi(\lambda) < \infty\}$).
5. Let $S_n = \xi_1 + \cdots + \xi_n$, where $\xi_1, \ldots, \xi_n$, $n \ge 1$, are independent identically distributed simple random variables with $E\,\xi_1 < 0$, $P\{\xi_1 > 0\} > 0$. Let $\varphi(\lambda) = E\,e^{\lambda \xi_1}$ and $\inf_{\lambda} \varphi(\lambda) = \rho$ ($0 < \rho < 1$).
Show that the following result (Chernoff's theorem) holds:
$$\lim_n \frac{1}{n} \log P\{S_n \ge 0\} = \log \rho. \qquad (22)$$
6. Using (22), prove that in the Bernoulli scheme ($P\{\xi_1 = 1\} = p$, $P\{\xi_1 = 0\} = q$)
$$\lim_n \frac{1}{n} \log P\{S_n \ge nx\} = -H(x), \qquad (23)$$
for $p < x < 1$, where (cf. notation in Sect. 6, Chap. 1, Vol. 1)
$$H(x) = x \log \frac{x}{p} + (1 - x) \log \frac{1 - x}{1 - p}.$$
$$P\{S_n \ge x_n \sqrt{n}\} = e^{-\frac{x_n^2}{2}(1 + y_n)},$$
where $y_n \to 0$, $n \to \infty$.
8. Derive from (23) that in the Bernoulli case ($P\{\xi_1 = 1\} = p$, $P\{\xi_1 = 0\} = q$) we have:
(a) For $p < x < 1$ and $x_n = n(x - p)$,
$$P\{S_n \ge np + x_n\} = \exp\Big\{-nH\Big(p + \frac{x_n}{n}\Big)(1 + o(1))\Big\}; \qquad (24)$$
(b) For $x_n = a_n \sqrt{npq}$ with $a_n \to \infty$, $a_n/\sqrt{n} \to 0$,
$$P\{S_n \ge np + x_n\} = \exp\Big\{-\frac{x_n^2}{2npq}(1 + o(1))\Big\}. \qquad (25)$$
Compare (24) with (25) and both of them with the corresponding results in Sect. 6 of Chap. 1, Vol. 1.
Chapter 5
Stationary (Strict Sense) Random
Sequences and Ergodic Theory
In the strict sense, the theory [of stationary stochastic processes] can be stated outside the
framework of probability theory as the theory of one-parameter groups of transformations
of a measure space that preserve the measure; this theory is very close to the general theory
of dynamical systems and to ergodic theory.
Encyclopaedia of Mathematics [42, Vol. 8, p. 479].
Definition 1. A random sequence ξ is stationary (in the strict sense) if the probabil-
ity distributions of θk ξ and ξ are the same for every k ≥ 1:
$$\frac{\xi_1 + \cdots + \xi_n}{n} \to m, \quad n \to \infty.$$
In 1931, Birkhoff [6] obtained a remarkable generalization of this fact, which was
stated as a theorem of statistical mechanics dealing with the behavior of the “relative
residence time” of dynamical systems described by differential equations admitting
an integral invariant (“conservative systems”). Soon after, in 1932, Khinchin [47]
obtained an extension of Birkhoff’s theorem to a more general case of “stationary
motions of a multidimensional space within itself preserving the measure of a set.”
The following presentation of Birkhoff’s and Khinchin’s results will combine the
ideas of the theory of “dynamical systems” and the theory of “stationary in a strict
sense random sequences.”
In this presentation we will primarily concentrate on the “ergodic” results of
these theories.
2. Let (Ω, F , P) be a (complete) probability space.
T −1 A = {ω : Tω ∈ A} ∈ F .
P(T −1 A) = P(A).
EXAMPLE 2. If Ω = [0, 1), F = B([0, 1)), P is the Lebesgue measure, λ ∈ [0, 1),
then Tx = (x + λ) mod 1 is a measure-preserving transformation.
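For irrational λ this rotation is in fact ergodic (this is the content of the Fourier-coefficient argument appearing in the next section), so its time averages converge to the corresponding space average (Birkhoff's theorem, Sect. 3). A sketch (the choices $\lambda = \sqrt{2} - 1$, $f(x) = x$, and $x_0 = 0.1$ are assumptions of the sketch):

```python
import math

def birkhoff_average(f, x0: float, lam: float, n: int) -> float:
    """Time average (1/n) sum_{k<n} f(T^k x0) for the rotation T x = (x + lam) mod 1."""
    x, total = x0, 0.0
    for _ in range(n):
        total += f(x)
        x = (x + lam) % 1.0
    return total / n

# For irrational lam, the time average of f(x) = x approaches the space average:
# the integral of x over [0, 1), which equals 1/2 -- for any starting point x0.
avg = birkhoff_average(lambda x: x, x0=0.1, lam=math.sqrt(2) - 1.0, n=100_000)
```

By Weyl's equidistribution theorem the orbit $\{T^k x_0\}$ is uniformly distributed in $[0, 1)$, which is why the average stabilizes near $\tfrac12$.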
Let us consider the physical hypotheses that lead to the consideration of measure-
preserving transformations.
Suppose that Ω is the phase space of a system that evolves (in discrete time)
according to a given law of motion. If ω is the state at instant n = 1, then T n ω,
where T is the translation operator induced by the given law of motion, is the state
attained by the system after n steps. Moreover, if A is some set of states ω, then
T −1 A = {ω : Tω ∈ A} is, by definition, the set of states ω that lead to A in one step.
Therefore, if we interpret Ω as an incompressible fluid, the condition P(T −1 A) =
P(A) can be thought of as the rather natural condition of conservation of volume.
(For the classical conservative Hamiltonian systems, Liouville’s theorem asserts that
the corresponding transformation T preserves the Lebesgue measure.)
4. PROBLEMS
1. Let T be a measure-preserving transformation and ξ = ξ(ω) a random variable
whose expectation E ξ(ω) exists. Show that E ξ(ω) = E ξ(Tω).
2. Show that the transformations in Examples 1 and 2 are measure-preserving.
3. Let Ω = [0, 1), F = B([0, 1)), and let P be a measure whose distribution func-
tion is continuous. Show that the transformations Tx = λx, 0 < λ < 1, and
Tx = x2 are not measure-preserving.
4. Let Ω be the set of all sequences ω = (. . . , ω−1 , ω0 , ω1 , . . . ) of real numbers, F
the σ-algebra generated by measurable cylinders {ω : (ωk , . . . , ωk+n−1 ) ∈ Bn },
where n = 1, 2, . . . , k = 0, ±1, ±2, . . . , and Bn ∈ B(Rn ). Let P be a probability
measure on (Ω, F ), and let T be the two-sided transformation defined by
T(. . . , ω−1 , ω0 , ω1 , . . . ) = (. . . , ω0 , ω1 , ω2 , . . . ).
(in other words, if ξk (ω) = ωk , then ξk (Tω) = ωk+1 ). Then (1) becomes
The following lemma establishes a connection between invariant and almost in-
variant sets.
Lemma 1. If $A$ is an almost invariant set, then there is an invariant set $B$ such that $P(A \triangle B) = 0$.
Hence $P(A \triangle B) = 0$.
Sect. 1), we have (Problem 1, Sect. 1) that, since the random variable $\xi$ is invariant,
$$c_n = E\,\xi(\omega) e^{-2\pi i n \omega} = E\,\xi(T\omega) e^{-2\pi i n T\omega} = e^{-2\pi i n \lambda}\, E\,\xi(T\omega) e^{-2\pi i n \omega} = e^{-2\pi i n \lambda}\, E\,\xi(\omega) e^{-2\pi i n \omega} = c_n e^{-2\pi i n \lambda}.$$
It is clear that this set is invariant; but $P(A) = \tfrac12$. Consequently, $T$ is not ergodic.
2. Definition 4. A measure-preserving transformation $T$ is mixing (or has the mixing property) if, for all $A$ and $B \in \mathscr{F}$,
$$\lim_{n \to \infty} P(A \cap T^{-n} B) = P(A)\,P(B). \qquad (1)$$
If $B$ is invariant, then $P(A \cap T^{-n} B) = P(A \cap B)$ for all $n \ge 1$. Because of (1), $P(A \cap B) = P(A)\,P(B)$. Hence we find, when $A = B$, that $P(B) = P^2(B)$, and consequently $P(B) = 0$ or 1. This completes the proof.
3. PROBLEMS
1. Show that a random variable ξ is invariant if and only if it is I -measurable.
2. Show that a set A is almost invariant if and only if P(T −1 A \ A) = 0.
3. Show that a transformation T is mixing if and only if, for all random variables ξ and η with E ξ² < ∞ and E η² < ∞,

$$E\,\xi(T^n\omega)\,\eta(\omega) \to E\,\xi\cdot E\,\eta, \quad n\to\infty.$$

4. Let A be an algebra of subsets of Ω and let the mixing condition (1) be satisfied for sets A and B in A. Show that this property will then hold for all A and B in F = σ(A) (and therefore the transformation T is mixing).
5. Show that this statement remains true if A is a π-system such that σ(A) = F.
6. Let A be an almost invariant set. Show that ω ∈ A (P-a.s.) if and only if T n ω ∈ A
for all n = 1, 2, . . . (cf. Theorem 1 in Sect. 1.)
7. Give examples of measure-preserving transformations T on (Ω, F , P) such that
(a) A ∈ F does not imply that TA ∈ F and (b) A ∈ F and TA ∈ F do not
imply that P(A) = P(TA).
8. Let T be a measurable transformation on (Ω, F ), and let P be the set of proba-
bility measures P with respect to which T is measure-preserving. Show that:
(a) The set P is convex;
(b) T is an ergodic transformation with respect to P if and only if P is an extreme point of P (i.e., P cannot be represented as P = λ₁P₁ + λ₂P₂ with λ₁ > 0, λ₂ > 0, λ₁ + λ₂ = 1, P₁ ≠ P₂, and P₁, P₂ ∈ P).
5 Stationary (Strict Sense) Random Sequences and Ergodic Theory

3. Ergodic Theorems

Theorem 1 (Birkhoff and Khinchin). Let T be a measure-preserving transformation and ξ = ξ(ω) a random variable with E |ξ| < ∞. Then (P-a.s.)

$$\lim_n \frac{1}{n}\sum_{k=0}^{n-1}\xi(T^k\omega) = E(\xi\,|\,\mathscr{I}). \qquad (1)$$

If, moreover, T is ergodic, then (P-a.s.)

$$\lim_n \frac{1}{n}\sum_{k=0}^{n-1}\xi(T^k\omega) = E\,\xi. \qquad (2)$$
The proof given below is based on the following proposition, whose simple proof
was given by Garsia [28].
Lemma (Maximal Ergodic Theorem). Let T be a measure-preserving transformation, let ξ be a random variable with E |ξ| < ∞, and let

$$S_k(\omega) = \xi(\omega)+\xi(T\omega)+\cdots+\xi(T^{k-1}\omega), \qquad M_k(\omega) = \max(0, S_1,\ldots,S_k).$$

Then

$$E\,[\xi(\omega)\,I_{\{M_n>0\}}(\omega)] \ge 0$$

for every n ≥ 1.
$$0 \le \underline{\eta} \le \overline{\eta} \le 0 \quad \text{(P-a.s.)}.$$
Consider the random variable η̄ = η̄(ω). Since η̄(ω) = η̄(Tω), the variable η̄ is invariant, and consequently, for every ε > 0, the set A_ε = {η̄(ω) > ε} is also invariant. Let us introduce the new random variable

$$\xi^*(\omega) = (\xi(\omega)-\varepsilon)\,I_{A_\varepsilon}(\omega)$$

and set

$$S_k^*(\omega) = \xi^*(\omega)+\cdots+\xi^*(T^{k-1}\omega), \qquad M_k^*(\omega) = \max(0, S_1^*,\ldots,S_k^*),$$

where the last equation follows because sup_{k≥1}(S_k^*/k) ≥ η̄ − ε and A_ε = {ω : η̄ > ε}.
Moreover, E |ξ ∗ | ≤ E |ξ| + ε. Hence, by the dominated convergence theorem,
Thus,

$$\lim_n \frac{1}{n}\sum_{k=0}^{n-1} I_{T^{-k}B}(\omega) = P(B).$$
If we now integrate both sides over A ∈ F and use the dominated convergence
theorem, we obtain (3), as required.
2. We now show that, under the hypotheses of Theorem 1, there is not only almost
sure convergence in (1) and (2), but also convergence in the mean. (This result will
be used subsequently in the proof of Theorem 3.)
Theorem 2. Let T be a measure-preserving transformation and let E |ξ| < ∞. Then

$$E\left|\frac{1}{n}\sum_{k=0}^{n-1}\xi(T^k\omega) - E(\xi\,|\,\mathscr{I})\right| \to 0, \quad n\to\infty, \qquad (4)$$

and, if T is ergodic,

$$E\left|\frac{1}{n}\sum_{k=0}^{n-1}\xi(T^k\omega) - E\,\xi\right| \to 0, \quad n\to\infty. \qquad (5)$$

PROOF. For every ε > 0 there is a bounded random variable η (|η(ω)| ≤ M) such that E |ξ − η| ≤ ε. Then
$$E\left|\frac{1}{n}\sum_{k=0}^{n-1}\xi(T^k\omega) - E(\xi\,|\,\mathscr{I})\right| \le E\left|\frac{1}{n}\sum_{k=0}^{n-1}(\xi(T^k\omega)-\eta(T^k\omega))\right| + E\left|\frac{1}{n}\sum_{k=0}^{n-1}\eta(T^k\omega) - E(\eta\,|\,\mathscr{I})\right| + E\,|E(\xi\,|\,\mathscr{I}) - E(\eta\,|\,\mathscr{I})|. \qquad (6)$$
Since |η| ≤ M, by the dominated convergence theorem and using (1), we find that
the second term on the right-hand side of (6) tends to zero as n → ∞. The first
and third terms are each at most ε. Hence, for sufficiently large n, the left-hand side
of (6) is less than 3ε, so that (4) is proved. Finally, if T is ergodic, then (5) follows
from (4) and the remark that E(ξ | I ) = E ξ (P-a.s.).
This completes the proof of the theorem.
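The conclusion (2) can be watched numerically. The sketch below (our own illustration, not part of the text) iterates the ergodic rotation Tx = (x + λ) mod 1 with an irrational λ; the time average of ξ(x) = x along one orbit approaches E ξ = 1/2.

```python
import math

# A numeric illustration (ours, not from the text) of (2) for the ergodic
# rotation T x = (x + lam) mod 1 on [0, 1) with irrational lam: the time
# average of xi(x) = x along one orbit approaches E xi = 1/2.
lam = math.sqrt(2) - 1        # an irrational shift
x = 0.1                       # the starting point omega
n = 100_000
total = 0.0
for _ in range(n):
    total += x                # accumulate xi(T^k omega), xi(x) = x
    x = (x + lam) % 1.0       # apply T once
time_average = total / n
print(round(time_average, 3))   # close to 0.5
```

By Weyl's equidistribution theorem the orbit is uniformly distributed on [0, 1), which is exactly the ergodicity used here.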
3. We now turn to the question of the validity of the ergodic theorem for station-
ary (in strict sense) random sequences ξ = (ξ1 , ξ2 , . . .) defined on a probabil-
ity space (Ω, F , P). In general, (Ω, F , P) need not carry any measure-preserving
transformations, so that it is not possible to apply Theorem 1 directly. However, as
we observed in Sect. 1, we can construct a coordinate probability space (Ω̃, F˜ , P̃),
random variables ξ˜ = (ξ˜1 , ξ˜2 , . . .), and a measure-preserving transformation T̃ such
that ξ˜n (ω̃) = ξ˜1 (T̃ n−1 ω̃) and the distributions of ξ and ξ˜ are the same. Since such
properties as almost sure convergence and convergence in the mean are defined only in terms of probability distributions, from the convergence of (1/n)∑_{k=1}^{n} ξ̃₁(T̃^{k−1}ω̃) (P̃-a.s. and in the mean) to a random variable η̃ it follows that (1/n)∑_{k=1}^{n} ξ_k(ω) also converges (P-a.s. and in the mean) to a random variable η that has the same distribution as η̃. It follows from Theorem 1 that if Ẽ|ξ̃₁| < ∞, then η̃ = Ẽ(ξ̃₁ | 𝓘̃), where 𝓘̃ is the collection of invariant sets (Ẽ is the expectation with respect to the measure P̃). We now describe the structure of η.
Let us now show that if the random variable η is the limit (P-a.s. and in the mean) of (1/n)∑_{k=1}^{n} ξ_k(ω), n → ∞, then it can be taken equal to E(ξ₁ | 𝓘_ξ). To this end, notice that we can set

$$\eta(\omega) = \limsup_n \frac{1}{n}\sum_{k=1}^{n}\xi_k(\omega). \qquad (7)$$
It follows from the definition of lim sup that for the random variable η(ω) so defined, the sets {ω : η(ω) < y}, y ∈ R, are invariant, and therefore η is 𝓘_ξ-measurable. Now, let A ∈ 𝓘_ξ. Then, since E |(1/n)∑_{k=1}^{n} ξ_k − η| → 0, we have for η defined by (7)

$$\frac{1}{n}\sum_{k=1}^{n}\int_A \xi_k\,dP \to \int_A \eta\,dP. \qquad (8)$$
Let B ∈ B(R^∞) be such that A = {ω : (ξ_k, ξ_{k+1}, . . .) ∈ B} for all k ≥ 1. Then, since ξ is stationary,

$$\int_A \xi_k\,dP = \int_{\{\omega:\,(\xi_k,\xi_{k+1},\ldots)\in B\}}\xi_k\,dP = \int_{\{\omega:\,(\xi_1,\xi_2,\ldots)\in B\}}\xi_1\,dP = \int_A \xi_1\,dP,$$

which implies (see (1) in Sect. 7, Chap. 2, Vol. 1) that (η being 𝓘_ξ-measurable) η = E(ξ₁ | 𝓘_ξ). Here E(ξ₁ | 𝓘_ξ) = E ξ₁ if ξ is ergodic.
Therefore we have proved the following theorem.

Theorem 3. Let ξ = (ξ₁, ξ₂, . . .) be a stationary (strict sense) random sequence with E |ξ₁| < ∞. Then (P-a.s. and in the mean)

$$\lim_n \frac{1}{n}\sum_{k=1}^{n}\xi_k(\omega) = E(\xi_1\,|\,\mathscr{I}_\xi).$$

If, moreover, the sequence ξ is ergodic, then (P-a.s. and in the mean)

$$\lim_n \frac{1}{n}\sum_{k=1}^{n}\xi_k(\omega) = E\,\xi_1.$$
4. PROBLEMS
1. Let ξ = (ξ1 , ξ2 , . . .) be a Gaussian stationary sequence with E ξn = 0 and
covariance function R(n) = E ξk+n ξk . Show that R(n) → 0 is a sufficient con-
dition for the measure-preserving transformation related to ξ to be mixing (and,
hence, ergodic).
2. Show that for every sequence ξ = (ξ1 , ξ2 , . . .) of independent identically dis-
tributed random variables the corresponding measure-preserving transformation
is mixing.
3. Show that a stationary sequence ξ is ergodic if and only if, for every k ≥ 1 and every B ∈ B(R^k),

$$\frac{1}{n}\sum_{i=1}^{n} I_B(\xi_i,\ldots,\xi_{i+k-1}) \to P((\xi_1,\ldots,\xi_k)\in B) \quad \text{(P-a.s.)}.$$
5. Let A be an algebra of subsets of Ω such that σ(A) = F, and for A ∈ F set

$$I_A^{(n)} = \frac{1}{n}\sum_{k=0}^{n-1} I_A(T^k\omega).$$

Prove that T is ergodic if and only if one of the following conditions holds:

(a) I_A^{(n)} → P(A) in probability for any A ∈ A;
(b) lim_n (1/n)∑_{k=0}^{n-1} P(A ∩ T^{−k}B) = P(A) P(B) for all A, B ∈ A;
(c) I_A^{(n)} → P(A) in probability for any A ∈ F.
6. Let T be a measure-preserving transformation on (Ω, F, P). Prove that T is ergodic (with respect to P) if and only if there is no measure P̄ ≠ P on (Ω, F) such that P̄ ≪ P and T is measure-preserving with respect to P̄.
7. (Bernoullian shifts.) Let S be a finite set (say, S = {1, 2, . . . , N}), and let Ω =
S∞ be the space of sequences ω = (ω0 , ω1 , . . . ) with ωi ∈ S. Set ξk (ω) = ωk ,
10. Borel's normality theorem (Example 3 in Sect. 3, Chap. 4) states that the fraction of ones and zeros in the binary expansion of a number ω in [0, 1) converges to 1/2 almost everywhere (with respect to the Lebesgue measure). Prove this result by considering the transformation T : [0, 1) → [0, 1) defined by Tω = 2ω (mod 1).
The [spectral] decomposition provides grounds for considering any stationary stochastic
process in the wide sense as a superposition of a set of non-correlated harmonic oscillations
with random amplitudes and phases.
Encyclopaedia of Mathematics [42, Vol. 8, p. 480].
E ξn = E ξ1 , (2)
can easily be obtained as special cases of the corresponding results for complex
random variables.
Let H² = H²(Ω, F, P) be the space of (complex) random variables ξ = α + iβ, α, β ∈ R, with E |ξ|² < ∞, where |ξ|² = α² + β². If ξ and η ∈ H², then we set (ξ, η) = E ξη̄ and ‖ξ‖ = (ξ, ξ)^{1/2}.

As for real random variables, the space H² (more precisely, the space of equivalence classes of random variables; cf. Sects. 10 and 11 of Chap. 2, Vol. 1) is complete under the scalar product (ξ, η) and norm ‖ξ‖. In accordance with the terminology of functional analysis, H² is called the complex (or unitary) Hilbert space (of random variables considered on the probability space (Ω, F, P)).
If ξ, η ∈ H², their covariance is

$$\operatorname{Cov}(\xi,\eta) = E\,(\xi - E\,\xi)\overline{(\eta - E\,\eta)}.$$

A sequence ξ = (ξ_n)_{n∈Z} with E |ξ_n|² < ∞ is stationary in the wide sense if, for all n and k ∈ Z,

$$E\,\xi_n = E\,\xi_0, \qquad \operatorname{Cov}(\xi_{k+n},\xi_k) = \operatorname{Cov}(\xi_n,\xi_0). \qquad (8)$$
$$R(n) = \operatorname{Cov}(\xi_n,\xi_0), \qquad (9)$$

$$\rho(n) = \frac{R(n)}{R(0)}, \quad n \in Z. \qquad (10)$$

We call R(n) the covariance function and ρ(n) the correlation function of the sequence ξ (assumed stationary in the wide sense).
1 Spectral Representation of the Covariance Function 49
It follows immediately from (9) that R(n) is positive semidefinite, i.e., for all complex numbers a₁, . . . , a_m and t₁, . . . , t_m ∈ Z, m ≥ 1, we have

$$\sum_{i,j=1}^{m} a_i\bar a_j\,R(t_i-t_j) \ge 0, \qquad (11)$$

since the left-hand side of (11) is equal to E |∑_{i=1}^{m} a_i ξ_{t_i}|². It is then easy to deduce (either from (11) or directly from (9)) the following properties of the covariance function (see Problem 1):
2. Let us give some examples of stationary sequences ξ = (ξn )n∈Z . (From now on,
the words “in the wide sense” and the statement n ∈ Z will often be omitted.)
$$g(n) = g(0)e^{i\lambda n}.$$

Thus the sequence

$$\xi_n = \xi_0\cdot g(0)e^{i\lambda n}$$

is stationary with

$$R(n) = |g(0)|^2 e^{i\lambda n}.$$
In particular, the random “constant” ξn ≡ ξ0 is a stationary sequence.
Remark. In connection with this example, notice that, since e^{iλn} = e^{in(λ+2πk)}, k =
±1, ±2, . . ., the (circular) frequency λ is defined up to a multiple of 2π. Following
tradition, we will assume henceforth that λ ∈ [−π, π].
$$\xi_n = \sum_{k=1}^{N} z_k e^{i\lambda_k n}, \qquad (13)$$

$$R(n) = \sum_{k=1}^{N}\sigma_k^2 e^{i\lambda_k n}. \qquad (14)$$
6 Stationary (Wide Sense) Random Sequences: L²-Theory

The series converges in mean square, and

$$R(n) = \sum_{k=-\infty}^{\infty}\sigma_k^2 e^{i\lambda_k n}. \qquad (16)$$
where

$$F(\lambda) = \int_{-\pi}^{\lambda} f(v)\,dv, \qquad f(\lambda) = \frac{1}{2\pi}, \quad -\pi \le \lambda < \pi. \qquad (20)$$
Comparison of the spectral functions (17) and (20) shows that, whereas the spec-
trum in Example 2 is discrete, in the present example it is absolutely continuous
with constant “spectral density” f (λ) ≡ 1/2π. In this sense we can say that the se-
quence ε = (εn ) “consists of harmonics of equal intensities.” It is just this property
that has led to calling such a sequence ε = (εn ) “white noise” by analogy with white
light, which consists of different frequencies with the same intensities.
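The defining property of white noise — R(0) = 1 and R(n) = 0 for n ≠ 0 — can be checked on simulated data. Below is a rough sketch of ours (the N(0, 1) noise and the sample size are arbitrary choices, not from the text):

```python
import random

# A rough numeric sketch (ours; the N(0, 1) noise and sample size are
# arbitrary): for white noise, R(0) = 1 and R(n) = 0 for n != 0.
random.seed(0)
N = 100_000
eps = [random.gauss(0.0, 1.0) for _ in range(N)]

def R_hat(n):
    """Empirical covariance (1/(N - n)) * sum_k eps[n + k] * eps[k]."""
    return sum(eps[k + n] * eps[k] for k in range(N - n)) / (N - n)

print(round(R_hat(0), 2), round(R_hat(1), 2), round(R_hat(5), 2))  # ≈ 1, 0, 0
```

The empirical covariances at lags 1 and 5 vanish up to sampling error of order N^{-1/2}, the "flat spectrum" picture described above.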
EXAMPLE 4 (Moving Averages). Starting from the white noise ε = (εn ) introduced
in Example 3, let us form the new sequence
$$\xi_n = \sum_{k=-\infty}^{\infty} a_k\varepsilon_{n-k}, \qquad (21)$$

where the a_k are complex numbers such that ∑_{k=−∞}^{∞} |a_k|² < ∞. From (21) we obtain

$$\operatorname{Cov}(\xi_{n+m},\xi_m) = \operatorname{Cov}(\xi_n,\xi_0) = \sum_{k=-\infty}^{\infty} a_{n+k}\bar a_k,$$
so that ξ = (ξk ) is a stationary sequence, which we call the sequence obtained from
ε = (εk ) by a (two-sided) moving average.
In the special case where the a_k with negative indices are zero, i.e.,

$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k}, \qquad (22)$$

the sequence ξ is a one-sided moving average. If, moreover, a_k = 0 for all k > p, the spectral function has the density

$$f(\lambda) = \frac{1}{2\pi}\,|P(e^{-i\lambda})|^2 \qquad (23)$$

with

$$P(z) = a_0 + a_1 z + \cdots + a_p z^p.$$
Under what conditions on b1 , . . . , bn can we say that (24) has a stationary solu-
tion? To find an answer, let us begin with the case q = 1:
ξn = αξn−1 + εn , (25)
where α = −b1 . If |α| < 1, then it is easy to verify that the stationary sequence
ξ˜ = (ξ˜n ) with
$$\tilde\xi_n = \sum_{j=0}^{\infty}\alpha^j\varepsilon_{n-j} \qquad (26)$$
is a solution of (25). (The series on the right-hand side of (26) converges in mean
square.) Let us now show that, in the class of stationary sequences ξ = (ξn ) (with
finite second moments), this is the only solution. In fact, we find from (25), by
successive iteration, that
$$\xi_n = \alpha\xi_{n-1}+\varepsilon_n = \alpha[\alpha\xi_{n-2}+\varepsilon_{n-1}]+\varepsilon_n = \cdots = \alpha^k\xi_{n-k}+\sum_{j=0}^{k-1}\alpha^j\varepsilon_{n-j}.$$
Therefore, when |α| < 1, a stationary solution of (25) exists and is representable as
the one-sided moving average (26).
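A short simulation (ours; it assumes real N(0, 1) white noise, a special case) illustrates the stationary solution: for |α| < 1 the covariances of the sequence (26) satisfy R(0) = 1/(1 − α²) and R(1)/R(0) = α.

```python
import random

# A simulation sketch (ours; assumption: real N(0, 1) white noise) of the
# stationary AR(1) solution: for |alpha| < 1, Eq. (26) gives
# R(0) = 1/(1 - alpha^2) and R(1)/R(0) = alpha.
random.seed(1)
alpha = 0.6
N, burn = 100_000, 1_000
xi, xs = 0.0, []
for t in range(N + burn):
    xi = alpha * xi + random.gauss(0.0, 1.0)   # the recursion (25)
    if t >= burn:                              # discard the transient part
        xs.append(xi)

R0 = sum(v * v for v in xs) / N
R1 = sum(xs[k + 1] * xs[k] for k in range(N - 1)) / (N - 1)
print(round(R0, 2), round(R1 / R0, 2))   # near 1/(1 - 0.36) ≈ 1.56 and 0.6
```

The burn-in discards the influence of the arbitrary initial value, so that the simulated sequence is close to the stationary solution.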
There is a similar result for every q > 1: if all the zeros of the polynomial
$$Q(z) = 1 + b_1 z + \cdots + b_q z^q \qquad (27)$$
lie outside the unit disk, then the autoregression equation (24) has a unique station-
ary solution, which is representable as a one-sided moving average (Problem 2).
Here the covariance function R(n) can be represented (Problem 3) in the form

$$R(n) = \int_{-\pi}^{\pi} e^{i\lambda n}\,dF(\lambda), \qquad F(\lambda) = \int_{-\pi}^{\lambda} f(v)\,dv, \qquad (28)$$

where

$$f(\lambda) = \frac{1}{2\pi}\cdot\frac{1}{|Q(e^{-i\lambda})|^2}. \qquad (29)$$
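Formulas (28)–(29) can be verified numerically in the simplest case q = 1, where Q(z) = 1 − αz (the case of Eq. (25) with unit-variance noise — an assumption of this sketch of ours): the integral of e^{iλn} f(λ) reproduces the AR(1) covariance α^n/(1 − α²).

```python
import cmath, math

# Numeric sketch of (28)-(29) for q = 1 (assumption: Q(z) = 1 - alpha*z,
# i.e., Eq. (25) with unit-variance white noise).  The integral of
# e^{i lam n} f(lam) over [-pi, pi) should equal alpha^n / (1 - alpha^2).
alpha = 0.6

def f(lam):
    Q = 1 - alpha * cmath.exp(-1j * lam)         # Q(e^{-i lam})
    return 1.0 / (2 * math.pi * abs(Q) ** 2)     # density (29)

def R(n, steps=4096):
    # midpoint rule; spectrally accurate for a smooth periodic integrand
    h = 2 * math.pi / steps
    total = 0.0
    for k in range(steps):
        lam = -math.pi + (k + 0.5) * h
        total += (cmath.exp(1j * lam * n) * f(lam)).real * h
    return total

exact = lambda n: alpha ** n / (1 - alpha ** 2)
print(round(R(0), 6), round(exact(0), 6))   # both 1.5625
print(round(R(3), 6), round(exact(3), 6))
```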
Under the same hypotheses as in Example 5 on the zeros of Q(z) (see (27)), it will be shown later (Corollary 6, Sect. 3) that (32) has a stationary solution ξ = (ξ_n) for which the covariance function is R(n) = ∫_{−π}^{π} e^{iλn} dF(λ) with F(λ) = ∫_{−π}^{λ} f(v) dv, where

$$f(\lambda) = \frac{1}{2\pi}\left|\frac{P(e^{-i\lambda})}{Q(e^{-i\lambda})}\right|^2$$

with P and Q as in (23) and (27).
3. Theorem (Herglotz). Let R(n) be the covariance function of a stationary (wide
sense) random sequence with zero mean. Then there is, on ([−π, π), B([−π, π))),
a finite measure F = F(B), B ∈ B([−π, π)), such that for every n ∈ Z
π
R(n) = eiλn F(dλ), (33)
−π
where the integral is understood as the Lebesgue–Stieltjes integral over [−π, π).
PROOF. For N ≥ 1 and λ ∈ [−π, π], set

$$f_N(\lambda) = \frac{1}{2\pi N}\sum_{k=1}^{N}\sum_{l=1}^{N} R(k-l)\,e^{-ik\lambda} e^{il\lambda}. \qquad (34)$$

Since R(n) is nonnegative definite, f_N(λ) is nonnegative. Since there are N − |m| pairs (k, l) for which k − l = m, we have

$$f_N(\lambda) = \frac{1}{2\pi}\sum_{|m|<N}\left(1-\frac{|m|}{N}\right) R(m)\,e^{-im\lambda}. \qquad (35)$$
Let

$$F_N(B) = \int_B f_N(\lambda)\,d\lambda, \quad B \in \mathscr{B}([-\pi,\pi)).$$

Then

$$\int_{-\pi}^{\pi} e^{i\lambda n}\,F_N(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n} f_N(\lambda)\,d\lambda = \begin{cases} \left(1-\dfrac{|n|}{N}\right) R(n), & |n| < N, \\ 0, & |n| \ge N. \end{cases} \qquad (36)$$
The measures FN , N ≥ 1, are supported on the interval [−π, π] and FN ([−π, π]) =
R(0) < ∞ for all N ≥ 1. Consequently, the family of measures {FN }, N ≥ 1,
is tight, and by Prokhorov’s theorem (Theorem 1 of Sect. 2, Chap. 3, Vol. 1) there
w
are a sequence {Nk } ⊆ {N} and a measure F such that FNk → F. (The concepts of
tightness, relative compactness, and weak convergence, together with Prokhorov’s
theorem, can be extended in an obvious way from probability measures to any finite
measures.)
It then follows from (36) that

$$\int_{-\pi}^{\pi} e^{i\lambda n}\,F(d\lambda) = \lim_{N_k\to\infty}\int_{-\pi}^{\pi} e^{i\lambda n}\,F_{N_k}(d\lambda) = R(n).$$
The measure F so constructed is supported on [−π, π]. Without changing the integral ∫_{−π}^{π} e^{iλn} F(dλ), we can redefine F by transferring the "mass" F({π}), which is
It follows (cf. proof of Theorem 2 in Sect. 12, Chap. 2, Vol. 1) that F1 (B) = F2 (B)
for all B ∈ B([−π, π)).
Remark 3. If ξ = (ξ_n) is a stationary sequence of real random variables ξ_n, then R(n) = R(−n), and therefore

$$R(n) = \frac{R(n)+R(-n)}{2} = \int_{-\pi}^{\pi}\cos\lambda n\,F(d\lambda).$$
4. PROBLEMS
1. Derive (12) from (11).
2. Prove that the autoregression Eq. (24) has a unique stationary solution repre-
sentable as a one-sided moving average if all the zeros of the polynomial Q(z)
defined by (27) lie outside the unit disk.
3. Show that the spectral functions of the sequences (22) and (24) have densities specified by (23) and (29), respectively.
4. Show that if ∑_{n=−∞}^{+∞} |R(n)|² < ∞, then the spectral function F(λ) has a density f(λ) given by

$$f(\lambda) = \frac{1}{2\pi}\sum_{n=-\infty}^{\infty} e^{-i\lambda n} R(n).$$
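As a sketch of this inversion formula, one can take the AR(1) covariances R(n) = α^{|n|}/(1 − α²) as test data (an assumption of ours, not part of the problem) and compare the truncated series with the closed-form density:

```python
import math

# Sketch of the inversion formula above, with the AR(1) covariances
# R(n) = alpha^{|n|} / (1 - alpha^2) as test data (an assumption used only
# for this check).  The closed form of the sum is the AR(1) spectral density
# (1/2pi) / |1 - alpha*e^{-i lam}|^2 = (1/2pi) / (1 - 2 alpha cos lam + alpha^2).
alpha, lam = 0.5, 1.0
R = lambda n: alpha ** abs(n) / (1 - alpha ** 2)

# since R(n) = R(-n), the terms e^{-i lam n} R(n) combine into cosines
series = sum(R(n) * math.cos(lam * n) for n in range(-200, 201)) / (2 * math.pi)
closed = 1 / (2 * math.pi * (1 - 2 * alpha * math.cos(lam) + alpha ** 2))
print(abs(series - closed) < 1e-9)
```

The truncation at |n| = 200 is harmless here because R(n) decays geometrically.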
Remark 2. In analogy with nonstochastic measures, one can show that for finitely
additive stochastic measures the condition (5) of countable additivity (in the mean-
square sense) is equivalent to continuity (in the mean-square sense) at “zero”:
We write

$$m(\Delta) = E\,|Z(\Delta)|^2, \quad \Delta \in \mathscr{E}_0. \qquad (9)$$
For elementary orthogonal stochastic measures, the set function m = m(Δ), Δ ∈
E0 , is, as is easily verified, a finite measure, and, consequently, by Carathéodory’s
theorem (Sect. 3, Chap. 2, Vol. 1), it can be extended to (E, E ). The resulting mea-
sure will again be denoted by m = m(Δ) and called the structure function (of the
elementary orthogonal stochastic measure Z = Z(Δ), Δ ∈ E0 ).
The following question now arises naturally: since the set function m = m(Δ)
defined on (E, E0 ) admits an extension to (E, E ), where E = σ(E0 ), can an elemen-
tary orthogonal stochastic measure Z = Z(Δ), Δ ∈ E0 , be extended to sets Δ in E
in such a way that E |Z(Δ)|2 = m(Δ), Δ ∈ E ?
The answer is affirmative, as follows from the construction given below. This
construction, at the same time, leads to the stochastic integral that we need for the
integral representation of stationary sequences.
For every function

$$f(\lambda) = \sum f_k I_{\Delta_k}(\lambda), \qquad (10)$$

with only a finite number of different (complex) values, we define the random variable

$$I(f) = \sum f_k Z(\Delta_k).$$

Then, with the scalar product (ξ, η) = E ξη̄, we have

$$\|I(f)\|^2 = \|f\|^2 = \int_E |f(\lambda)|^2\,m(d\lambda).$$
Now let f ∈ L², and let {f_n} be functions of the type (10) such that ‖f − f_n‖ → 0, n → ∞ (Problem 2). Consequently,
We note the following basic properties of the stochastic integral I(f); these are direct consequences of its construction. Let g, f, and f_n ∈ L². Then

$$(I(f), I(g)) = (f, g); \qquad (11)$$

$$\|I(f)\| = \|f\|; \qquad (12)$$

$$I(af + bg) = aI(f) + bI(g) \quad \text{(P-a.s.)}; \qquad (13)$$

$$\|I(f_n) - I(f)\| \to 0 \quad \text{if } \|f_n - f\| \to 0. \qquad (14)$$
4. Let us use the preceding definition of the stochastic integral to extend the elemen-
tary stochastic measure Z(Δ), Δ ∈ E0 , to sets in E = σ(E0 ).
Since the measure m is assumed to be finite, we have I_Δ = I_Δ(λ) ∈ L² for all Δ ∈ E. Write Z̃(Δ) = I(I_Δ). It is clear that Z̃(Δ) = Z(Δ) for Δ ∈ E₀. It follows from (13) that if Δ₁ ∩ Δ₂ = ∅ for Δ₁ and Δ₂ ∈ E, then Z̃(Δ₁ ∪ Δ₂) = Z̃(Δ₁) + Z̃(Δ₂) (P-a.s.), and from (12) that

$$E\,|\tilde Z(\Delta)|^2 = m(\Delta), \quad \Delta\in\mathscr{E}.$$
Let us show that the random set function Z̃(Δ), Δ ∈ E, is countably additive in the mean-square sense. In fact, let the sets Δ_k ∈ E be pairwise disjoint with Δ = ⋃_{k=1}^{∞} Δ_k. Then

$$\tilde Z(\Delta) - \sum_{k=1}^{n}\tilde Z(\Delta_k) = I(g_n),$$

where

$$g_n(\lambda) = I_\Delta(\lambda) - \sum_{k=1}^{n} I_{\Delta_k}(\lambda) = I_{\Sigma_n}(\lambda), \qquad \Sigma_n = \bigcup_{k=n+1}^{\infty}\Delta_k.$$

But

$$E\,|I(g_n)|^2 = \|g_n\|^2 = m(\Sigma_n)\downarrow 0, \quad n\to\infty,$$

i.e.,

$$E\,\Big|\tilde Z(\Delta) - \sum_{k=1}^{n}\tilde Z(\Delta_k)\Big|^2 \to 0, \quad n\to\infty.$$
$$E\,\tilde Z(\Delta_1)\overline{\tilde Z(\Delta_2)} = 0$$

when Δ₁ ∩ Δ₂ = ∅, Δ₁, Δ₂ ∈ E.
Thus, our function Z̃(Δ), defined on Δ ∈ E , is countably additive in the mean-
square sense and coincides with Z(Δ) on the sets Δ ∈ E0 . We shall call Z̃(Δ),
Δ ∈ E , an orthogonal stochastic measure (since it is an extension of the elementary
orthogonal stochastic measure Z(Δ)) with respect to the structure function m(Δ),
Δ ∈ E; and we call the integral I(f) = ∫_E f(λ) Z̃(dλ), defined earlier, a stochastic
integral with respect to this measure.
5. We now consider the case (E, E ) = (R, B(R)), which is the most important for
our purposes. As we know (Theorem 1, Sect. 3, Chap. 2, Vol. 1), there is a one-to-one
correspondence between finite measures m = m(Δ) on (R, B(R)) and (generalized)
distribution functions G = G(x), with m(a, b] = G(b) − G(a).
It turns out that there is something similar for orthogonal stochastic measures.
We introduce the following definition.
$$E\,|Z_\lambda - Z_{\lambda_n}|^2 \to 0, \quad \lambda_n\downarrow\lambda,\ \lambda_n\in R; \qquad (2)$$
Condition (3) is the condition of orthogonal increments. Condition (1) means that
Zλ ∈ H 2 . Finally, condition (2) is included for technical reasons; it is a requirement
of continuity on the right (in the mean-square sense) at each λ ∈ R.
Let Z = Z(Δ) be an orthogonal stochastic measure with respect to the struc-
ture function m = m(Δ), which is a finite measure with (generalized) distribution
function G(λ). Let us set
Zλ = Z(−∞, λ].
Then
and (evidently) (3) is also satisfied. Thus, {Zλ } is a process with orthogonal incre-
ments.
On the other hand, let G(λ) be a generalized distribution function, G(−∞) = 0,
G(+∞) < ∞, and let {Zλ } be a process with orthogonal increments such that
E |Zλ |2 = G(λ). Set
Z(Δ) = Zb − Za
when Δ = (a, b]. Let E₀ be the algebra generated by the sets Δ = ⋃_{k=1}^{n}(a_k, b_k] with disjoint (a_k, b_k], and set

$$Z(\Delta) = \sum_{k=1}^{n} Z((a_k, b_k]).$$
It is clear that

$$E\,|Z(\Delta)|^2 = m(\Delta),$$

where m(Δ) = ∑_{k=1}^{n} [G(b_k) − G(a_k)], and

$$E\,Z(\Delta_1)\overline{Z(\Delta_2)} = 0$$
and let L₀²(F) be the linear manifold (L₀²(F) ⊆ L²(F)) spanned by the functions e_n = e_n(λ), n ∈ Z, where e_n(λ) = e^{iλn}.

Observe that since E = [−π, π) and F is finite, the closure of L₀²(F) coincides (Problem 1) with L²(F):

$$\overline{L_0^2(F)} = L^2(F).$$
Also, let L02 (ξ) be the linear manifold spanned by the random variables ξn , n ∈ Z,
and let L2 (ξ) be its closure in the mean-square sense (with respect to P).
We establish a one-to-one correspondence between the elements of L₀²(F) and L₀²(ξ), denoted by "↔," by setting
en ↔ ξn , n ∈ Z, (4)
and defining it for elements in general (more precisely, for equivalence classes of
elements) by linearity:
$$\sum\alpha_n e_n \leftrightarrow \sum\alpha_n\xi_n \qquad (5)$$
(here we suppose that only finitely many of the complex numbers αn are different
from zero).
Observe that (5) is a consistent definition, in the sense that ∑α_n e_n = 0 almost everywhere with respect to F if and only if ∑α_n ξ_n = 0 (P-a.s.).
The correspondence "↔" is an isometry, i.e., it preserves scalar products. In fact, by (3),

$$(e_n, e_m) = \int_{-\pi}^{\pi} e_n(\lambda)\overline{e_m(\lambda)}\,F(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda(n-m)}\,F(d\lambda) = R(n-m) = E\,\xi_n\bar\xi_m = (\xi_n, \xi_m),$$

and similarly,

$$\Big(\sum\alpha_n e_n,\ \sum\beta_n e_n\Big) = \Big(\sum\alpha_n\xi_n,\ \sum\beta_n\xi_n\Big). \qquad (6)$$
Now let η ∈ L²(ξ). Since L²(ξ) is the closure of L₀²(ξ), there is a sequence {η_n} such that η_n ∈ L₀²(ξ) and ‖η_n − η‖ → 0, n → ∞. Consequently, {η_n} is a fundamental sequence, and therefore so is the sequence {f_n}, where f_n ∈ L₀²(F) and f_n ↔ η_n. The space L²(F) is complete, and consequently there is an f ∈ L²(F) such that ‖f_n − f‖ → 0.

There is an evident converse: if f ∈ L²(F) and ‖f − f_n‖ → 0, f_n ∈ L₀²(F), then there is an element η of L²(ξ) such that ‖η − η_n‖ → 0, η_n ∈ L₀²(ξ), and η_n ↔ f_n.
3 Spectral Representation of Stationary (Wide Sense) Sequences 63
Up to now, the isometry “↔” has been defined only as between elements of L02 (ξ)
and L02 (F). We extend it by continuity, taking f ↔ η when f and η are the elements
considered earlier. It is easily verified that the correspondence obtained in this way
is one-to-one (between classes of equivalent random variables and of functions), is
linear, and preserves scalar products.
Consider the function f(λ) = I_Δ(λ), where Δ ∈ B([−π, π)), λ ∈ [−π, π), and let Z(Δ) be the element of L²(ξ) such that I_Δ(λ) ↔ Z(Δ). It is clear that ‖I_Δ(λ)‖² = F(Δ), and therefore E |Z(Δ)|² = F(Δ). Since E ξ_n = 0, n ∈ Z, every element of L₀²(ξ) (and hence of L²(ξ)) has zero expectation. In particular, E Z(Δ) = 0. Moreover, if Δ₁ ∩ Δ₂ = ∅, then E Z(Δ₁)\overline{Z(Δ₂)} = 0 and

$$E\,\Big|Z(\Delta) - \sum_{k=1}^{n} Z(\Delta_k)\Big|^2 \to 0, \quad n\to\infty, \quad \text{where } \Delta = \bigcup_{k=1}^{\infty}\Delta_k.$$
Hence the family of elements Z(Δ), Δ ∈ B([−π, π)), forms an orthogonal stochastic measure, with respect to which (according to Sect. 2) we can define the stochastic integral

$$I(f) = \int_{-\pi}^{\pi} f(\lambda)\,Z(d\lambda), \quad f\in L^2(F).$$
Let f ∈ L2 (F) and η ↔ f . Denote the element η by Φ(f ) (more precisely, se-
lect single representatives from the corresponding equivalence classes of random
variables or functions). Let us show that (P-a.s.)
I (f ) = Φ(f ). (7)
In fact, if
$$f(\lambda) = \sum_k \alpha_k I_{\Delta_k}(\lambda), \qquad (8)$$
$$\xi_n = \int_{-\pi}^{\pi} e^{i\lambda n}\,Z(d\lambda), \quad n\in Z \quad \text{(P-a.s.)}.$$
Since I_Δ(λ) ↔ Z(Δ), it follows from (10) that I_Δ(−λ) ↔ \overline{Z(Δ)} (or, equivalently, I_{−Δ}(λ) ↔ \overline{Z(Δ)}). On the other hand, I_{−Δ}(λ) ↔ Z(−Δ). Therefore \overline{Z(Δ)} = Z(−Δ) (P-a.s.).
Corollary 2. Again let ξ = (ξn ) be a stationary sequence of real random variables
ξn and Z(Δ) = Z1 (Δ) + iZ2 (Δ). Then
E Z1 (Δ1 )Z2 (Δ2 ) = 0 (11)
2. Let ξ = (ξ_n) be a stationary sequence with the spectral representation (2), and let η ∈ L²(ξ). The following theorem describes the structure of such random variables.

Theorem 2. If η ∈ L²(ξ), there is a function ϕ ∈ L²(F) such that (P-a.s.)

$$\eta = \int_{-\pi}^{\pi}\varphi(\lambda)\,Z(d\lambda). \qquad (18)$$
PROOF. If

$$\eta_n = \sum_{|k|\le n}\alpha_k\xi_k, \qquad (19)$$

then, by (2),

$$\eta_n = \int_{-\pi}^{\pi}\Big(\sum_{|k|\le n}\alpha_k e^{i\lambda k}\Big)Z(d\lambda), \qquad (20)$$

i.e., η_n = I(ϕ_n) with ϕ_n(λ) = ∑_{|k|≤n} α_k e^{iλk}.
In the general case, where η ∈ L²(ξ), there are variables η_n of type (19) such that ‖η − η_n‖ → 0, n → ∞. But then ‖ϕ_n − ϕ_m‖ = ‖η_n − η_m‖ → 0, n, m → ∞. Consequently, {ϕ_n} is fundamental in L²(F), and therefore there is a function ϕ ∈ L²(F) such that ‖ϕ − ϕ_n‖ → 0, n → ∞.

By property (14) of Sect. 2, we have ‖I(ϕ_n) − I(ϕ)‖ → 0, and since η_n = I(ϕ_n), we also have η = I(ϕ) (P-a.s.).
This completes the proof of the theorem.
Remark. Let H₀(ξ) and H₀(F) be the respective closed linear manifolds spanned by the variables ξ⁰ = (ξ_n)_{n≤0} and by the functions e⁰ = (e_n)_{n≤0}. Then, if η ∈ H₀(ξ), there is a function ϕ ∈ H₀(F) such that (P-a.s.) η = ∫_{−π}^{π} ϕ(λ) Z(dλ).
3. Formula (18) describes the structure of the random variables that are obtained
from ξn , n ∈ Z, by linear transformations, i.e., in the form of finite sums (19) and
their mean-square limits.
For physically realizable systems, the value of the output at instant n is determined only by the "past" values of the signal, i.e., the values x_m for m ≤ n. It is therefore natural to call a filter with the impulse response h(s) physically realizable if h(s) = 0 for all s < 0, in other words, if

$$y_n = \sum_{m=-\infty}^{\infty} h(n-m)x_m = \sum_{m=0}^{\infty} h(m)x_{n-m}. \qquad (23)$$
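A minimal sketch of a physically realizable filter as in (23); the function name and the decaying impulse response below are illustrative, not from the text.

```python
# A minimal sketch of a physically realizable filter as in (23); the function
# name and the decaying impulse response are illustrative, not from the text.
def apply_causal_filter(h, x):
    """y[n] = sum_{m=0}^{n} h[m] * x[n - m]: output depends only on the past."""
    y = []
    for n in range(len(x)):
        y.append(sum(h[m] * x[n - m] for m in range(min(n + 1, len(h)))))
    return y

h = [0.5 ** m for m in range(10)]             # h(s) = 0 for s < 0 is implicit
print(apply_causal_filter(h, [1, 0, 0, 0]))   # [1.0, 0.5, 0.25, 0.125]
```

Applied to a unit impulse, the filter simply reproduces its impulse response, which is the defining property of (23).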
In terms of the spectral measure, (25) is evidently equivalent to saying that ϕ(λ) ∈ L²(F), i.e.,

$$\int_{-\pi}^{\pi}|\varphi(\lambda)|^2\,F(d\lambda) < \infty. \qquad (27)$$
of η from (26) and (2). Consequently, the covariance function R_η(n) of η is given by the formula

$$R_\eta(n) = \int_{-\pi}^{\pi} e^{i\lambda n}|\varphi(\lambda)|^2\,F(d\lambda). \qquad (29)$$
Theorem 3. Let η = (ηn ) be a stationary sequence with spectral density fη (λ). Then
(possibly at the expense of enlarging the original probability space) we can find a
sequence ε = (εn ) representing white noise, and a filter, such that the representation
(30) holds.
Let

$$\eta_n = \int_{-\pi}^{\pi} e^{i\lambda n}\,Z(d\lambda), \quad n\in Z.$$
where

$$a^{\oplus} = \begin{cases} a^{-1}, & a\ne 0, \\ 0, & a = 0. \end{cases}$$
The stochastic measure Z̄ = Z̄(Δ) so constructed has orthogonal values, and for every Δ = (a, b], we have

$$E\,|\bar Z(\Delta)|^2 = \frac{1}{2\pi}\int_\Delta|\varphi^{\oplus}(\lambda)|^2|\varphi(\lambda)|^2\,d\lambda + \frac{1}{2\pi}\int_\Delta|1-\varphi^{\oplus}(\lambda)\varphi(\lambda)|^2\,d\lambda = \frac{|\Delta|}{2\pi},$$

so that the sequence ε = (ε_n) with

$$\varepsilon_n = \int_{-\pi}^{\pi} e^{i\lambda n}\,\bar Z(d\lambda)$$

is a white noise.
We now observe that

$$\int_{-\pi}^{\pi} e^{i\lambda n}\varphi(\lambda)\,\bar Z(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n}\,Z(d\lambda) = \eta_n \qquad (31)$$

and, on the other hand, by the definition of ϕ(λ) and property (14) in Sect. 2, we have (P-a.s.)

$$\int_{-\pi}^{\pi} e^{i\lambda n}\varphi(\lambda)\,\bar Z(d\lambda) = \int_{-\pi}^{\pi} e^{i\lambda n}\sum_{m=-\infty}^{\infty} e^{-i\lambda m}h(m)\,\bar Z(d\lambda) = \sum_{m=-\infty}^{\infty} h(m)\int_{-\pi}^{\pi} e^{i\lambda(n-m)}\,\bar Z(d\lambda) = \sum_{m=-\infty}^{\infty} h(m)\,\varepsilon_{n-m},$$
ηn = a0 εn + a1 εn−1 + · · · + ap εn−p .
Conversely, every stationary sequence ξ = (ξn ) that satisfies this equation with
some white noise ε = (εn ) and some polynomial Q(z) with no zeros on {z : |z| = 1}
has a spectral density (32).
In fact, let ηn = ξn + b1 ξn−1 + · · · + bq ξn−q . Then fη (λ) = (1/2π)|P(e−iλ )|2 ,
and the required representation follows from Corollary 5.
On the other hand, if (33) holds and Fξ (λ) and Fη (λ) are the spectral functions
of ξ and η, then
$$F_\eta(\lambda) = \int_{-\pi}^{\lambda}|Q(e^{-iv})|^2\,dF_\xi(v) = \frac{1}{2\pi}\int_{-\pi}^{\lambda}|P(e^{-iv})|^2\,dv.$$

Since |Q(e^{−iv})|² > 0, it follows that F_ξ(λ) has a density defined by (32).
$$\frac{1}{n}\sum_{k=0}^{n-1}\xi_k \xrightarrow{L^2} Z(\{0\}) \qquad (34)$$

and

$$\frac{1}{n}\sum_{k=0}^{n-1} R(k) \to F(\{0\}). \qquad (35)$$
PROOF. By (2),

$$\frac{1}{n}\sum_{k=0}^{n-1}\xi_k = \int_{-\pi}^{\pi}\frac{1}{n}\sum_{k=0}^{n-1} e^{ik\lambda}\,Z(d\lambda) = \int_{-\pi}^{\pi}\varphi_n(\lambda)\,Z(d\lambda),$$
where

$$\varphi_n(\lambda) = \frac{1}{n}\sum_{k=0}^{n-1} e^{ik\lambda} = \begin{cases} 1, & \lambda = 0, \\ \dfrac{1}{n}\cdot\dfrac{e^{in\lambda}-1}{e^{i\lambda}-1}, & \lambda\ne 0. \end{cases} \qquad (36)$$

It is clear that |ϕ_n(λ)| ≤ 1.
Moreover, ϕ_n(λ) → I_{\{0\}}(λ) in L²(F), and therefore, by (14) of Sect. 2,

$$\int_{-\pi}^{\pi}\varphi_n(\lambda)\,Z(d\lambda) \xrightarrow{L^2} \int_{-\pi}^{\pi} I_{\{0\}}(\lambda)\,Z(d\lambda) = Z(\{0\}),$$
$$\frac{1}{n}\sum_{k=0}^{n-1} R(k) \to 0 \;\Rightarrow\; \frac{1}{n}\sum_{k=0}^{n-1}\xi_k \xrightarrow{L^2} 0.$$

Since

$$\Big|\frac{1}{n}\sum_{k=0}^{n-1} R(k)\Big|^2 = \Big|\,E\Big(\frac{1}{n}\sum_{k=0}^{n-1}\xi_k\Big)\bar\xi_0\Big|^2 \le E\,|\xi_0|^2\cdot E\,\Big|\frac{1}{n}\sum_{k=0}^{n-1}\xi_k\Big|^2,$$

we also have

$$\frac{1}{n}\sum_{k=0}^{n-1}\xi_k \xrightarrow{L^2} 0 \;\Rightarrow\; \frac{1}{n}\sum_{k=0}^{n-1} R(k) \to 0.$$
Therefore the condition (1/n)∑_{k=0}^{n−1} R(k) → 0 is necessary and sufficient for the convergence (in the mean-square sense) of the arithmetic means (1/n)∑_{k=0}^{n−1} ξ_k to zero. It follows that if the original sequence ξ = (ξ_n) has expectation m (that is, E ξ₀ = m), then

$$\frac{1}{n}\sum_{k=0}^{n-1} R(k) \to 0 \;\Longleftrightarrow\; \frac{1}{n}\sum_{k=0}^{n-1}\xi_k \xrightarrow{L^2} m. \qquad (37)$$
ξ_n = α + η_n,

for some C > 0, α > 0. Use the Borel–Cantelli lemma to show that then

$$\frac{1}{N}\sum_{k=0}^{N}\xi_k \to 0 \quad \text{(P-a.s.)}.$$

4. Statistical Estimation of Covariance Function and Spectral Density
Then it follows from the elementary properties of the expectation that this is a "good" estimator of m in the sense that, on average over all possible realizations of the data x₀, . . . , x_{N−1}, it is unbiased, i.e.,

$$E\,m_N(\xi) = E\,\frac{1}{N}\sum_{k=0}^{N-1}\xi_k = m. \qquad (2)$$

In addition, it follows from Theorem 4 of Sect. 3 that if (1/N)∑_{k=0}^{N−1} R(k) → 0, N → ∞, our estimator is consistent (in mean square), i.e.,
Next we take up the problem of estimating the covariance function R(n), the
spectral function F(λ) = F([−π, λ]), and the spectral density f (λ), all under the
assumption that m = 0.
$$\hat R_N(n; x) = \frac{1}{N-n}\sum_{k=0}^{N-n-1} x_{n+k}x_k.$$
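The estimator above translates directly into code; the data values in this sketch are illustrative, and the mean is assumed to be m = 0 as in the text.

```python
# A direct transcription of the estimator above (the data values are
# illustrative, and the mean is assumed to be m = 0 as in the text):
def R_hat(x, n):
    """(1/(N-n)) * sum_{k=0}^{N-n-1} x[n+k] * x[k], for 0 <= n < len(x)."""
    N = len(x)
    return sum(x[n + k] * x[k] for k in range(N - n)) / (N - n)

x = [1.0, 2.0, -1.0, 0.5]
print(R_hat(x, 0))   # 1.5625
print(R_hat(x, 1))   # -0.1666...
```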
$$\frac{1}{N}\sum_{k=0}^{N-1} E\,[\xi_{n+k}\xi_k - R(n)][\xi_n\xi_0 - R(n)] \to 0, \quad N\to\infty, \qquad (4)$$
Let us suppose that the original sequence ξ = (ξn ) is Gaussian (with zero mean
and covariance R(n)). Then, proceeding analogously to (51) of Sect. 12, Chap. 2,
Vol. 1, we obtain
$$\frac{1}{N}\sum_{k=0}^{N-1}[R^2(k) + R(n+k)R(n-k)] \to 0, \quad N\to\infty. \qquad (6)$$
Since |R(n+k)R(n−k)| ≤ |R(n+k)|² + |R(n−k)|², the condition

$$\frac{1}{N}\sum_{k=0}^{N-1} R^2(k) \to 0, \quad N\to\infty, \qquad (7)$$
But as N → ∞,

$$f_N(\lambda,\nu) \to f(\lambda,\nu) = \begin{cases} 1, & \lambda = \nu, \\ 0, & \lambda \ne \nu. \end{cases}$$

Therefore

$$\frac{1}{N}\sum_{k=0}^{N-1} R^2(k) \to \int_{-\pi}^{\pi}\int_{-\pi}^{\pi} f(\lambda,\nu)\,F(d\lambda)\,F(d\nu) = \int_{-\pi}^{\pi} F(\{\lambda\})\,F(d\lambda) = \sum_{\lambda} F^2(\{\lambda\}),$$

where the sum over λ contains at most a countable number of terms since the measure F is finite.
Hence (7) is equivalent to

$$\sum_{\lambda} F^2(\{\lambda\}) = 0, \qquad (8)$$

which means that the spectral function F(λ) = F([−π, λ]) is continuous.
2. We now turn to the problem of finding estimators for the spectral function F(λ)
and the spectral density f (λ) (under the assumption that they exist).
A method that naturally suggests itself for estimating the spectral density follows
from the proof of Herglotz’s theorem that we gave earlier. Recall that the function
$$f_N(\lambda) = \frac{1}{2\pi}\sum_{|n|<N}\left(1-\frac{|n|}{N}\right) R(n)\,e^{-i\lambda n} \qquad (9)$$
converges on the whole to the spectral function F(λ). Therefore, if F(λ) has a density f(λ), then we have

$$\int_{-\pi}^{\lambda} f_N(\nu)\,d\nu \to \int_{-\pi}^{\lambda} f(\nu)\,d\nu. \qquad (10)$$
The function f̂N (λ; x) is known as a periodogram. It is easily verified that it can
also be represented in the following more convenient form:
$$\hat f_N(\lambda; x) = \frac{1}{2\pi N}\left|\sum_{n=0}^{N-1} x_n e^{-i\lambda n}\right|^2. \qquad (12)$$
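The representation (12) can be checked against the covariance-based form on a toy data set. Since Eq. (11) is not reproduced above, the identity used here — periodogram = (1/2π)∑_{|n|<N} e^{−iλn} R*_N(n), with the biased estimator R*_N(n) = (1/N)∑_k x_{|n|+k}x_k — is our assumption about what (11) states.

```python
import cmath, math

# A small identity check: the squared-modulus form (12) of the periodogram
# equals (1/2pi) * sum_{|n|<N} e^{-i lam n} R*_N(n), where
# R*_N(n) = (1/N) * sum_k x_{|n|+k} * x_k (the biased covariance estimator).
# (Matching this against Eq. (11) is our assumption; (11) is not shown here.)
x = [1.0, -0.3, 0.7, 0.2, -1.1]
N = len(x)
lam = 0.9

direct = abs(sum(x[n] * cmath.exp(-1j * lam * n) for n in range(N))) ** 2 \
         / (2 * math.pi * N)

def R_star(n):
    n = abs(n)
    return sum(x[n + k] * x[k] for k in range(N - n)) / N

via_cov = sum(R_star(n) * cmath.exp(-1j * lam * n)
              for n in range(-N + 1, N)).real / (2 * math.pi)
print(abs(direct - via_cov) < 1e-12)   # True
```

The identity is exact (expand the squared modulus and group terms by the lag n = k − l), so the two computations agree to rounding error.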
If the spectral function F(λ) has density f (λ), then, since fN (λ) can also be written
in the form (34) of Sect. 1, we find that
$$f_N(\lambda) = \frac{1}{2\pi N}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1}\int_{-\pi}^{\pi} e^{i\nu(k-l)}\,e^{i\lambda(l-k)} f(\nu)\,d\nu = \int_{-\pi}^{\pi}\frac{1}{2\pi N}\left|\sum_{k=0}^{N-1} e^{i(\nu-\lambda)k}\right|^2 f(\nu)\,d\nu.$$
The function

$$\Phi_N(\lambda) = \frac{1}{2\pi N}\left|\sum_{k=0}^{N-1} e^{i\lambda k}\right|^2$$

is the Fejér kernel. It is known, from the properties of this function, that for almost every λ (with respect to the Lebesgue measure)

$$\int_{-\pi}^{\pi}\Phi_N(\lambda-\nu)\,f(\nu)\,d\nu \to f(\lambda). \qquad (13)$$
in other words, the estimator f̂N (λ; x) of f (λ) on the basis of x0 , x1 , . . . , xN−1 is
asymptotically unbiased.
In this sense, the estimator f̂N (λ; x) could be considered “good.” However, at
the individual observed values x0 , . . . , xN−1 the values of the periodogram f̂N (λ; x)
usually turn out to be far from the actual values f (λ). In fact, let ξ = (ξn ) be a sta-
tionary sequence of independent Gaussian random variables, ξn ∼ N (0, 1). Then
f(λ) ≡ 1/2π and

$$\hat f_N(\lambda;\xi) = \frac{1}{2\pi}\left|\frac{1}{\sqrt N}\sum_{k=0}^{N-1}\xi_k e^{-i\lambda k}\right|^2.$$
Therefore for λ = 0 we have that 2π f̂N (0, ξ) coincides in distribution with the square
of the Gaussian random variable η ∼ N (0, 1). Hence, for every N,
$$E\,|\hat f_N(0;\xi) - f(0)|^2 = \frac{1}{4\pi^2}\,E\,|\eta^2 - 1|^2 > 0.$$
Moreover, an easy calculation shows that if f (λ) is the spectral density of a station-
ary sequence ξ = (ξn ) that is constructed as a moving average:
$$\xi_n = \sum_{k=0}^{\infty} a_k\varepsilon_{n-k} \qquad (15)$$
with ∑_{k=0}^{∞} |a_k| < ∞, ∑_{k=0}^{∞} |a_k|² < ∞, where ε = (ε_n) is white noise with E ε₀⁴ < ∞, then

$$\lim_{N\to\infty} E\,|\hat f_N(\lambda;\xi) - f(\lambda)|^2 = \begin{cases} 2f^2(\lambda), & \lambda = 0, \pm\pi, \\ f^2(\lambda), & \lambda \ne 0, \pm\pi. \end{cases} \qquad (16)$$
Hence it is clear that the periodogram cannot be a satisfactory estimator of the
spectral density. To improve the situation, one often uses an estimator for f (λ) of
the form

$$\hat f_N^W(\lambda; x) = \int_{-\pi}^{\pi} W_N(\lambda-\nu)\,\hat f_N(\nu; x)\,d\nu, \qquad (17)$$
which is obtained from the periodogram f̂N (λ; x) by means of a smoothing function
WN (λ), which we call a spectral window. Natural requirements on WN (λ) are as
follows:
(a) W_N(λ) has a sharp maximum at λ = 0;
(b) ∫_{−π}^{π} W_N(λ) dλ = 1;
(c) E |f̂_N^W(λ; ξ) − f(λ)|² → 0, N → ∞, λ ∈ [−π, π).
By (14) and (b), the estimators f̂NW (λ; ξ) are asymptotically unbiased. Condition (c)
is the condition of consistency in mean square, which, as we showed above, is vio-
lated for the periodogram. Finally, condition (a) ensures that the required frequency
λ is “picked out” from the periodogram.
Let us give some examples of estimators of the form (17).
Bartlett's estimator is based on the spectral window

$$W_N(\lambda) = a_N B(a_N\lambda),$$

where a_N ↑ ∞, a_N/N → 0, N → ∞, and

$$B(\lambda) = \frac{1}{2\pi}\left(\frac{\sin(\lambda/2)}{\lambda/2}\right)^2.$$
Another example is based on the spectral window

$$W_N(\lambda) = a_N Z(a_N\lambda)$$

with

$$Z(\lambda) = \begin{cases} -\dfrac{\alpha+1}{2\alpha}\,|\lambda|^{\alpha} + \dfrac{\alpha+1}{2\alpha}, & |\lambda| \le 1, \\ 0, & |\lambda| > 1, \end{cases}$$

where 0 < α ≤ 2 and the a_N are selected in a particular way.
We shall not spend any more time on problems of estimating spectral densities;
we merely note that there is an extensive statistical literature dealing with the con-
struction of spectral windows and the comparison of the corresponding estimators
f̂NW (λ; x). (See, e.g., [36, 37, 38].)
for every n, as N → ∞.
5. Wold’s Expansion
the projection of η on the subspace Hn (ξ) (Sect. 11, Chap. 2, Vol. 1). We also write π̂−∞ (η) for the projection of η on H−∞ (ξ) = ∩n Hn (ξ). Every element η ∈ H(ξ) can then be represented as
η = π̂−∞ (η) + (η − π̂−∞ (η)),
where η − π̂−∞ (η) ⊥ π̂−∞ (η). Therefore H(ξ) is represented as the orthogonal
sum
H(ξ) = S(ξ) ⊕ R(ξ),
where S(ξ) consists of the elements π̂−∞ (η) with η ∈ H(ξ), and R(ξ) consists of
the elements of the form η − π̂−∞ (η).
We shall now assume that E ξn = 0 and Var ξn > 0. Then H(ξ) is automatically
nontrivial (contains elements different from zero).
Definition 1. A stationary sequence ξ = (ξn ) is regular if
H(ξ) = R(ξ)
and singular if
H(ξ) = S(ξ).
Remark 1. Singular sequences are also called deterministic and regular sequences
are called purely or completely nondeterministic. If S(ξ) is a proper subspace of
H(ξ), we just say that ξ is nondeterministic.
Theorem 1. Every stationary (wide sense) random sequence ξ has a unique decomposition,
ξn = ξnr + ξns , (1)
where ξ r = (ξnr ) is regular and ξ s = (ξns ) is singular. Here ξ r and ξ s are orthogonal
(ξnr ⊥ ξms for all n and m).
PROOF. We define
ξns = Ê(ξn | S(ξ)), ξnr = ξn − ξns .
Since ξnr ⊥ S(ξ) for every n, we have S(ξ r ) ⊥ S(ξ). On the other hand, S(ξ r ) ⊆ S(ξ),
and therefore S(ξ r ) is trivial (contains only random sequences that coincide almost
surely with zero). Consequently, ξ r is regular.
Moreover, Hn (ξ) ⊆ Hn (ξ s ) ⊕ Hn (ξ r ) and Hn (ξ s ) ⊆ Hn (ξ), Hn (ξ r ) ⊆ Hn (ξ).
Therefore Hn (ξ) = Hn (ξ s ) ⊕ Hn (ξ r ), and hence
S(ξ) ⊆ Hn (ξ s ) ⊕ Hn (ξ r ). (2)
Since every element of S(ξ) is orthogonal to Hn (ξ r ) (recall that ξnr ⊥ S(ξ) for all n), it follows from (2) that
S(ξ) ⊆ Hn (ξ s ),
and therefore S(ξ) ⊆ S(ξ s ) ⊆ H(ξ s ). But ξns ∈ S(ξ); hence H(ξ s ) ⊆ S(ξ), and
consequently
S(ξ) = S(ξ s ) = H(ξ s ),
which means that ξ s is singular.
The orthogonality of ξ s and ξ r follows in an obvious way from ξns ∈ S(ξ) and ξnr ⊥ S(ξ).
Since Hn (ξ) is spanned by elements of Hn−1 (ξ) and elements of the form βξn , where
β is a complex number, the dimension of Bn is either zero or one. But the space
Hn (ξ) is different from Hn−1 (ξ) for any value of n. In fact, if Bn is trivial for some
n, then, by stationarity, Bk is trivial for all k, hence H(ξ) = S(ξ), contradicting the
assumption that ξ is regular. Thus, Bn has the dimension dim Bn = 1.
Let ηn be a nonzero element of Bn . Set
εn = ηn /‖ηn ‖.
ξn = Σ_{j=0}^{k−1} aj εn−j + π̂n−k (ξn ), (4)
where aj = E ξn ε̄n−j .
By Bessel’s inequality (6), Sect. 11, Chap. 2, Vol. 1,
Σ_{j=0}^{∞} |aj |² ≤ ‖ξn ‖² < ∞.
It follows that Σ_{j=0}^{∞} aj εn−j converges in mean square, and then, by (4), Eq. (3) will be established as soon as we show that π̂n−k (ξn ) → 0 in L² as k → ∞.
It is enough to consider the case n = 0. Let π̂i = π̂i (ξ0 ). Since
π̂−k = π̂0 + Σ_{i=1}^{k} [π̂−i − π̂−i+1 ],
and the terms that appear in this sum are orthogonal, we have for every k ≥ 0
Σ_{i=1}^{k} ‖π̂−i − π̂−i+1 ‖² = ‖Σ_{i=1}^{k} (π̂−i − π̂−i+1 )‖² = ‖π̂−k − π̂0 ‖² ≤ 4‖ξ0 ‖² < ∞.
where ε̃ = (ε̃n ) is an orthonormal system (see the definition in Example 4 of Sect. 1).
In this sense, the conclusion of Theorem 2 says more, specifically that for a regular
sequence ξ there exist a = (an ) and an orthonormal system ε = (εn ) such that not
only (5) but also (3) is satisfied, with Hn (ξ) = Hn (ε), n ∈ Z.
3. The significance of the concepts introduced here (regular and singular sequences)
becomes particularly clear if we consider the following (linear) extrapolation prob-
lem, for whose solution the Wold expansion (6) is especially useful.
Let H0 (ξ) = L²(ξ⁰) be the closed linear manifold spanned by the variables ξ⁰ = (. . . , ξ−1 , ξ0 ). Consider the problem of constructing an optimal (least-squares) linear estimator ξ̂n of ξn in terms of the “past” ξ⁰ = (. . . , ξ−1 , ξ0 ).
It follows from Sect. 11, Chap. 2, Vol. 1, that ξ̂n = Ê(ξn | H0 (ξ)). (In the notation of Subsection 1, ξ̂n = π̂0 (ξn ).) Since ξ r and ξ s are orthogonal and H0 (ξ) = H0 (ξ r ) ⊕ H0 (ξ s ), we obtain, by using (6),
ξ̂n = ξ̂ns + ξ̂nr = ξns + Ê(ξnr | H0 (ξ r )). (7)
In (6), the sequence ε = (εn ) is an innovation sequence for ξ r = (ξnr ), and therefore
H0 (ξ r ) = H0 (ε). Therefore
ξ̂n = ξns + Ê(Σ_{k=0}^{∞} ak εn−k | H0 (ε)) = ξns + Σ_{k=n}^{∞} ak εn−k (8)
and
σn² = E |ξn − ξ̂n |² = Σ_{k=0}^{n−1} |ak |². (9)
Since
Σ_{k=0}^{∞} |ak |² = E |ξn |²,
Set
Φ(z) = Σ_{k=0}^{∞} ak z^k . (14)
This function is analytic in the open domain |z| < 1, and since Σ_{k=0}^{∞} |ak |² < ∞, it belongs to the Hardy class H², the class of functions g = g(z), analytic in |z| < 1, satisfying
sup_{0≤r<1} (1/(2π)) ∫_{−π}^{π} |g(re^{iθ})|² dθ < ∞. (15)
In fact,
(1/(2π)) ∫_{−π}^{π} |Φ(re^{iθ})|² dθ = Σ_{k=0}^{∞} |ak |² r^{2k}
and
sup_{0≤r<1} Σ_{k=0}^{∞} |ak |² r^{2k} ≤ Σ_{k=0}^{∞} |ak |² < ∞.
It is shown in the theory of functions of a complex variable (e.g., [64]) that the
boundary function Φ(e^{iλ}), −π ≤ λ < π, of Φ ∈ H², not identically zero, has the property that
∫_{−π}^{π} log |Φ(e^{−iλ})| dλ > −∞. (16)
In our case,
f (λ) = (1/(2π)) |Φ(e^{−iλ})|²,
where Φ ∈ H². Therefore
∫_{−π}^{π} log f (λ) dλ > −∞. (17)
On the other hand, let the spectral density f (λ) satisfy (17). It again follows from the theory of functions of a complex variable that there is then a function Φ(z) = Σ_{k=0}^{∞} ak z^k in the Hardy class H² such that (almost everywhere with respect to Lebesgue measure)
f (λ) = (1/(2π)) |Φ(e^{−iλ})|².
Therefore, if we set ϕ(λ) = Φ(e^{−iλ}), we obtain
f (λ) = (1/(2π)) |ϕ(λ)|²,
where ϕ(λ) is given by (13). Then it follows from Corollary 5, Sect. 3, that ξ ad-
mits a representation as a one-sided moving average (11), where ε = (εn ) is an
orthonormal sequence. From this and from Remark 4 it follows that ξ is regular.
Thus, we have the following theorem.
5. PROBLEMS
1. Show that a stationary sequence with discrete spectrum (piecewise-constant
spectral function F(λ)) is singular.
2. Let σn2 = E |ξn − ξˆn |2 , ξˆn = Ê(ξn | H0 (ξ)). Show that if σn2 = 0 for some n ≥ 1,
the sequence is singular; if σn2 → R(0) as n → ∞, the sequence is regular.
3. Show that the stationary sequence ξ = (ξn ), ξn = einϕ , where ϕ is a uniform
random variable on [0, 2π], is regular. Find the estimator ξˆn and its mean-square
error σn², and show that the nonlinear estimator
ξ̃n = (ξ0 /ξ−1 )^n
satisfies
E |ξ̃n − ξn |² = 0, n ≥ 1.
4. Prove that decomposition (1) into regular and singular components is unique.

6. Extrapolation, Interpolation, and Filtering
and
σn² = E |ξn − ξ̂n |² = Σ_{k=0}^{n−1} |ak |². (3)
However, this can be considered only a theoretical solution, for the following rea-
sons.
The sequences that we consider are ordinarily not given to us by means of their
representations (1), but by their covariance functions R(n) or the spectral densities
f (λ) (which exist for regular sequences). Hence a solution (2) can only be regarded
as satisfactory if the coefficients ak are given in terms of R(n) or of f (λ), and the εk
in terms of . . . , ξk−1 , ξk .
Without discussing the problem in general, we consider only the special case (of
interest in applications) when the spectral density has the form
f (λ) = (1/(2π)) |Φ(e^{−iλ})|², (4)
where Φ(z) = Σ_{k=0}^{∞} bk z^k has radius of convergence r > 1 and has no zeros in |z| ≤ 1.
Let
ξn = ∫_{−π}^{π} e^{iλn} Z(dλ) (5)
Theorem 1. If the spectral density of ξ has the form (4), then the optimal (linear)
estimator ξˆn of ξn in terms of ξ 0 = (. . . , ξ−1 , ξ0 ) is given by
ξ̂n = ∫_{−π}^{π} ϕ̂n (λ) Z(dλ), (6)
where
ϕ̂n (λ) = e^{iλn} Φn (e^{−iλ})/Φ(e^{−iλ}) (7)
and
Φn (z) = Σ_{k=n}^{∞} bk z^k .
where H0 (F) is the closed linear manifold spanned by the functions en = e^{iλn} for n ≤ 0 (here F(λ) = ∫_{−π}^{λ} f (ν) dν).
Since (Sect. 2)
E |ξn − ξ̃n |² = E |∫_{−π}^{π} (e^{iλn} − ϕ̃n (λ)) Z(dλ)|² = ∫_{−π}^{π} |e^{iλn} − ϕ̃n (λ)|² f (λ) dλ,
It follows from Hilbert-space theory (Sect. 11, Chap. 2, Vol. 1) that the optimal
function ϕ̂n (λ) (in the sense of (9)) is determined by the two conditions
Since
and, in a similar way, 1/Φ(e−iλ ) ∈ H0 (F), the function ϕ̂n (λ) defined in (7) belongs
to H0 (F). Therefore in proving that ϕ̂n (λ) is optimal, it is sufficient to verify that,
for every m ≥ 0,
e^{iλn} − ϕ̂n (λ) ⊥ e^{−iλm},
i.e.,
In,m ≡ ∫_{−π}^{π} [e^{iλn} − ϕ̂n (λ)] e^{iλm} f (λ) dλ = 0, m ≥ 0.
The following chain of equations shows that this is actually the case:
In,m = (1/(2π)) ∫_{−π}^{π} e^{iλ(n+m)} [1 − Φn (e^{−iλ})/Φ(e^{−iλ})] |Φ(e^{−iλ})|² dλ
= (1/(2π)) ∫_{−π}^{π} e^{iλ(n+m)} [Φ(e^{−iλ}) − Φn (e^{−iλ})] Φ̄(e^{−iλ}) dλ
= (1/(2π)) ∫_{−π}^{π} e^{iλ(n+m)} (Σ_{k=0}^{n−1} bk e^{−iλk}) (Σ_{l=0}^{∞} b̄l e^{iλl}) dλ
= (1/(2π)) ∫_{−π}^{π} e^{iλm} (Σ_{k=0}^{n−1} bk e^{iλ(n−k)}) (Σ_{l=0}^{∞} b̄l e^{iλl}) dλ = 0,
since in the last integrand every exponent n − k + m + l is at least 1.
ϕ̂1 (λ) = e^{iλ} · e^{−iλ}/(2 + e^{−iλ}) = 1/(2 + e^{−iλ}), ϕ̂n (λ) = 0 for n ≥ 2. (12)
f (λ) = (1/(2π)) (1 − |a|²)/|1 − ae^{−iλ}|²,
i.e.,
f (λ) = (1/(2π)) |Φ(e^{−iλ})|²,
where
Φ(z) = (1 − |a|²)^{1/2}/(1 − az) = (1 − |a|²)^{1/2} Σ_{k=0}^{∞} a^k z^k ,
from which ϕ̂n (λ) = a^n , and therefore
ξ̂n = ∫_{−π}^{π} a^n Z(dλ) = a^n ξ0 .
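The prediction ξ̂n = a^n ξ0 and the error formula (3) can be checked by simulation. The sketch below (Python; the value of a, the horizon, and the sample size are illustrative choices) simulates the sequence with this spectral density as an AR(1)-type recursion and compares the empirical prediction error with σn² = Σ_{k=0}^{n−1} |ak |² = 1 − a^{2n}:

```python
import numpy as np

rng = np.random.default_rng(1)
a, n, n_paths = 0.6, 3, 200_000
xi0 = rng.standard_normal(n_paths)              # stationary start, E xi_0^2 = 1
xi = xi0.copy()
for _ in range(n):
    # recursion matching the density of this example (innovations scaled by sqrt(1-a^2))
    xi = a * xi + np.sqrt(1 - a**2) * rng.standard_normal(n_paths)
pred = a**n * xi0                                # optimal linear predictor from the past
emp_mse = np.mean((xi - pred) ** 2)
theor_mse = 1 - a ** (2 * n)                     # sigma_n^2 = sum_{k<n} |a_k|^2
print(emp_mse, theor_mse)
```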
Remark 3. It follows from the Wold expansion of a regular sequence ξ = (ξn ) with
ξn = Σ_{k=0}^{∞} ak εn−k (13)
It is evident that the converse also holds, that is, if f (λ) admits the representa-
tion (14) with a function Φ(z) of the form (15), then the Wold expansion of ξn
has the form (13). Therefore the problem of representing the spectral density in the
form (14) and the problem of determining the coefficients ak in the Wold expansion
are equivalent.
The assumptions that Φ(z) in Theorem 1 has no zeros for |z| ≤ 1 and that r > 1
are in fact not essential. In other words, if the spectral density of a regular sequence
is represented in the form (14), then the optimal estimator ξˆn (in the mean-square
sense) for ξn in terms of ξ 0 = (. . . , ξ−1 , ξ0 ) is determined by formulas (6) and (7).
Remark 4. Theorem 1 (with the preceding Remark 3) solves the prediction problem
for regular sequences. Let us show that in fact the same answer remains valid for
arbitrary stationary sequences. More precisely, let
ξn = ξns + ξnr , ξn = ∫_{−π}^{π} e^{iλn} Z(dλ), F(Δ) = E |Z(Δ)|²,
and let f r (λ) = (1/2π)|Φ(e−iλ )|2 be the spectral density of the regular sequence
ξ r = (ξnr ). Then ξˆn is determined by (6) and (7).
where Z r (Δ) is the orthogonal stochastic measure in the representation of the regu-
lar sequence ξ r . Then
E |ξn − ξ̂n |² = ∫_{−π}^{π} |e^{iλn} − ϕ̂n (λ)|² F(dλ)
≥ ∫_{−π}^{π} |e^{iλn} − ϕ̂n (λ)|² f r (λ) dλ ≥ ∫_{−π}^{π} |e^{iλn} − ϕ̂rn (λ)|² f r (λ) dλ. (16)
But ξn − ξ̂n = ξnr − ξ̂nr . Hence E |ξn − ξ̂n |² = E |ξnr − ξ̂nr |², and it follows from (16) that we may take ϕ̂n (λ) to be ϕ̂rn (λ).
where ϕ belongs to H 0 (F), the closed linear manifold spanned by the functions e^{iλn}, n ≠ 0. The estimator
ξ̌0 = ∫_{−π}^{π} ϕ̌(λ) Z(dλ) (17)
It follows from the perpendicularity properties of the Hilbert space H 0 (F) that
ϕ̌(λ) is completely determined (compare (10)) by the two conditions
Then
ϕ̌(λ) = 1 − α/f (λ), (20)
where
α = 2π / ∫_{−π}^{π} (dλ/f (λ)), (21)
for every n ≠ 0. By (22), the function [1 − ϕ̌(λ)] f (λ) belongs to the Hilbert space L²([−π, π], B[−π, π], μ) with Lebesgue measure μ. In this space the functions {e^{inλ}/√(2π), n = 0, ±1, . . .} form an orthonormal basis (Problem 10, Sect. 12,
Chap. 2, Vol. 1). Hence it follows from (23) that [1 − ϕ̌(λ)] f (λ) is a constant, which
we denote by α.
Thus, the second condition in (18) leads to the conclusion that
ϕ̌(λ) = 1 − α/f (λ). (24)
Corollary. If
ϕ̌(λ) = Σ_{0<|k|≤N} ck e^{iλk},
then
ξ̌0 = ∫_{−π}^{π} Σ_{0<|k|≤N} ck e^{iλk} Z(dλ) = Σ_{0<|k|≤N} ck ξk .
EXAMPLE 3. Let f (λ) be the spectral density in Example 2 above. Then an easy calculation shows that
ξ̌0 = ∫_{−π}^{π} (a/(1 + |a|²)) [e^{iλ} + e^{−iλ}] Z(dλ) = (a/(1 + |a|²)) [ξ1 + ξ−1 ].
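The coefficients a/(1 + |a|²) can be recovered directly from the covariance function R(n) = a^{|n|} of this density by solving the normal equations for the best linear estimator of ξ0 in terms of ξ1 and ξ−1 (a Python sketch; taking a real and restricting to these two neighbors are illustrative simplifications):

```python
import numpy as np

a = 0.4
# Gram matrix of (xi_1, xi_{-1}) and their covariances with xi_0, using R(n) = a^{|n|}
G = np.array([[1.0, a**2],
              [a**2, 1.0]])
r = np.array([a, a])
c = np.linalg.solve(G, r)     # normal equations G c = r
print(c)                      # both coefficients equal a/(1+a^2)
```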
We write
Fθ (Δ) = E |Zθ (Δ)|2 , Fξ (Δ) = E |Zξ (Δ)|2
and
Fθξ (Δ) = E Zθ (Δ) Z̄ξ (Δ).
In addition, we suppose that θ and ξ are connected in a stationary way, i.e., that their covariance function Cov(θn , ξm ) = E θn ξ̄m depends only on the difference n − m. Let Rθξ (n) = E θn ξ̄0 ; then
Rθξ (n) = ∫_{−π}^{π} e^{iλn} Fθξ (dλ).
As in Subsections 1 and 2, the conditions to impose on the optimal ϕ̂n (λ) are that
(i) ϕ̂n (λ) ∈ H(Fξ ),
(ii) θn − θ̂n ⊥ H(ξ).
From the latter condition we find
∫_{−π}^{π} e^{iλ(n−m)} Fθξ (dλ) − ∫_{−π}^{π} e^{−iλm} ϕ̂n (λ) Fξ (dλ) = 0 (26)
for every m ∈ Z. Therefore, if we suppose that Fθξ (λ) and Fξ (λ) have densities
fθξ (λ) and fξ (λ), we find from (26) that
∫_{−π}^{π} e^{iλ(n−m)} [ fθξ (λ) − e^{−iλn} ϕ̂n (λ) fξ (λ)] dλ = 0.
As is easily verified, ϕ̂ ∈ H(Fξ ), and consequently the estimator (25), with the
function (27), is optimal.
where
ϕ̂(λ) = fθ (λ)[ fθ (λ) + fη (λ)]⊕ ,
and the filtering error is
E |θn − θ̂n |² = ∫_{−π}^{π} [ fθ (λ) fη (λ)][ fθ (λ) + fη (λ)]⊕ dλ.
The solution (25) obtained earlier can now be used to construct an optimal esti-
mator θ̃n+m of θn+m based on observations ξk , k ≤ n, where m is a given number in
Z. Let us suppose that ξ = (ξn ) is regular, with spectral density
f (λ) = (1/(2π)) |Φ(e^{−iλ})|²,
where Φ(z) = Σ_{k=0}^{∞} ak z^k . By the Wold expansion,
ξn = Σ_{k=0}^{∞} ak εn−k ,
Since
and
θ̂n+m = ∫_{−π}^{π} e^{iλ(n+m)} ϕ̂(λ)Φ(e^{−iλ}) Zε (dλ) = Σ_{k=−∞}^{∞} ân+m−k εk ,
where
âk = (1/(2π)) ∫_{−π}^{π} e^{iλk} ϕ̂(λ)Φ(e^{−iλ}) dλ, (29)
we have
θ̃n+m = Ê( Σ_{k=−∞}^{∞} ân+m−k εk | Hn (ξ) ).
Theorem 3. If the sequence ξ = (ξn ) under observation is regular, then the optimal
(mean-square) linear estimator θ̃n+m of θn+m in terms of ξk , k ≤ n, is given by
7 The Kalman–Bucy Filter and Its Generalizations 95
θ̃n+m = ∫_{−π}^{π} e^{iλn} Hm (e^{−iλ}) Zξ (dλ), (30)
where
Hm (e^{−iλ}) = Σ_{l=0}^{∞} âl+m e^{−iλl} Φ⊕ (e^{−iλ}) (31)
where
ck = (1/(2π)) ∫_{−π}^{π} e^{ikλ} log f (λ) dλ.
Deduce from this formula that the one-step prediction error σ1² = E |ξ̂1 − ξ1 |² is given by the Szegő–Kolmogorov formula
σ1² = 2π exp{ (1/(2π)) ∫_{−π}^{π} log f (λ) dλ }.
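The Szegő–Kolmogorov formula can be checked numerically for the density of Example 2 of the preceding section, for which σ1² = |b0 |² = 1 − a² (a Python sketch; the value of a and the integration grid are illustrative):

```python
import numpy as np

a = 0.5
lam = np.linspace(-np.pi, np.pi, 400001)
# spectral density f(lam) = (1 - a^2) / (2*pi*|1 - a e^{-i lam}|^2)
f = (1 - a**2) / (2 * np.pi * np.abs(1 - a * np.exp(-1j * lam)) ** 2)
# (1/2pi) * integral of log f over [-pi, pi] approximated by the grid mean
sigma2 = 2 * np.pi * np.exp(np.mean(np.log(f)))
print(sigma2, 1 - a**2)
```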
7. The Kalman–Bucy Filter and Its Generalizations

1. From a computational point of view, the solution presented earlier for the problem
of filtering out an unobservable component θ by means of observations of ξ is not
practical since, because it is expressed in terms of the spectrum, it has to be carried
out by analog devices. In the method proposed by Kalman and Bucy, the synthesis
of the optimal filter is carried out recursively; this makes it possible to do it with a
digital computer. There are also other reasons for the wide use of the Kalman–Bucy
filter, one being that it still “works” even without the assumption that the sequence
(θ, ξ) is stationary.
We shall present not only the usual Kalman–Bucy method but also its generaliza-
tions in which the recurrent equations for (θ, ξ) have coefficients that may depend
on all the data observed in the past.
Thus, let us suppose that (θ, ξ) = ((θn ), (ξn )) is a partially observed sequence,
and let
θn = (θ1 (n), . . . , θk (n)) and ξn = (ξ1 (n), . . . , ξl (n))
be governed by the recurrent equations
Here
ε1 (n) = (ε11 (n), . . . , ε1k (n)) and ε2 (n) = (ε21 (n), . . . , ε2l (n))
Lemma 1. Under the assumptions made earlier about the coefficients of (1), to-
gether with (2), the sequence (θ, ξ) is conditionally Gaussian, i.e., the conditional
distribution function
P{θ0 ≤ a0 , . . . , θn ≤ an | Fnξ }
is (P-a.s.) the distribution function of an n-dimensional Gaussian vector whose
mean and covariance matrix depend on (ξ0 , . . . , ξn ).
PROOF. We prove only the Gaussian character of P(θn ≤ a | Fnξ ); this is enough to
let us obtain equations for mn and γn .
First we observe that (1) implies that the conditional distribution
E[exp(it∗ζn+1 ) | Fnξ , θn ] = exp{it∗(A0 (n, ξ) + A1 (n, ξ)θn ) − ½ t∗B(n, ξ)t}. (3)
Suppose now that the conclusion of the lemma holds for some n ≥ 0. Then
is Gaussian.
As in the proof of the theorem on normal correlation (Theorem 2 in Sect. 13,
Chap. 2, Vol. 1) we can verify that there is a matrix C such that the vector
This implies that the conditionally Gaussian vectors η and ξn+1 , considered under
the condition Fnξ , are independent, i.e., (P-a.s.)
By (5), the conditional distribution P(η ≤ y | Fnξ ) is Gaussian. With (6), this
ξ
shows that the conditional distribution P(θn+1 ≤ a | Fn+1 ) is also Gaussian.
This completes the proof of the lemma.
Theorem 1. Let (θ, ξ) be a partially observed sequence that satisfies the system (1)
and condition (2). Then (mn , γn ) obey the following recursion relations:
and
Let us write
Then, by (10),
By the theorem on normal correlation (see Theorem 2 and Problem 4 in Sect. 13,
Chap. 2, Vol. 1),
mn+1 = E(θn+1 | Fnξ , ξn+1 ) = E(θn+1 | Fnξ ) + d12 d22⊕ (ξn+1 − E(ξn+1 | Fnξ ))
and
γn+1 = Cov(θn+1 , θn+1 | Fnξ , ξn+1 ) = d11 − d12 d22⊕ d12∗ .
If we then use the expressions from (9) for E(θn+1 | Fnξ ) and E(ξn+1 | Fnξ ) and
those for d11 , d12 , d22 from (11), we obtain the required recursion formulas (7) and
(8).
This completes the proof of the theorem.
Corollary 1. If the coefficients a0 (n, ξ), . . . , B2 (n, ξ) in (1) are independent of ξ,
the corresponding method is known as the Kalman–Bucy method, and Eqs. (7) and
(8) for mn and γn describe the Kalman–Bucy filter. It is important to observe that in
this case the conditional and unconditional error matrices γn agree, i.e.,
γn ≡ E γn = E[(θn − mn )(θn − mn )∗ ].
Corollary 2. Suppose that a partially observed sequence (θn , ξn ) has the property
that θn satisfies the first equation (1), and that ξn satisfies the equation
Then evidently
ξn+1 = Ã0 (n, ξ) + Ã1 (n, ξ)[a0 (n, ξ) + a1 (n, ξ)θn + b1 (n, ξ)ε1 (n + 1) + b2 (n, ξ)ε2 (n + 1)] + B̃1 (n, ξ)ε1 (n + 1) + B̃2 (n, ξ)ε2 (n + 1),
we find that the case under consideration also obeys the model (1) and that mn and
γn satisfy (7) and (8).
2. We now consider a linear model (cf. (1))
θn+1 = a0 + a1 θn + a2 ξn + b1 ε1 (n + 1) + b2 ε2 (n + 1),
(13)
ξn+1 = A0 + A1 θn + A2 ξn + B1 ε1 (n + 1) + B2 ε2 (n + 1),
where the coefficients a0 , . . . , B2 may depend on n (but not on ξ), and εij (n) are
independent Gaussian random variables with E εij (n) = 0 and E ε2ij (n) = 1.
Let (13) be solved with initial values (θ0 , ξ0 ) such that the conditional distribution P(θ0 ≤ a | ξ0 ) is Gaussian with parameters m0 = E(θ0 | ξ0 ) and γ0 = Cov(θ0 , θ0 | ξ0 ). Then, by the theorem on normal correlation and (7) and
(8), the optimal estimator mn = E(θn | Fnξ ) is a linear function of ξ0 , ξ1 , . . . , ξn .
This remark makes it possible to prove the following important statement about
the structure of the optimal linear filter without the assumption that the random
variables involved are Gaussian.
Theorem 2. Let (θ, ξ) = (θn , ξn )n≥0 be a partially observed sequence that satisfies
(13), where εij (n) are uncorrelated random variables with E εij (n) = 0, E ε2ij (n) =
1, and the components of the initial vector (θ0 , ξ0 ) have finite second moments. Then
the optimal linear estimator m̂n = Ê(θn | ξ0 , . . . , ξn ) satisfies (7) with a0 (n, ξ) =
a0 (n) + a2 (n)ξn , A0 (n, ξ) = A0 (n) + A2 (n)ξn , and the error matrix
γ̂n = E[(θn − m̂n )(θn − m̂n )∗ ]
For the proof of this theorem, we need the following lemma, which reveals the
role of the Gaussian case in determining optimal linear estimators.
Lemma 2. Let (α, β) be a two-dimensional random vector with E(α² + β²) < ∞, and (α̃, β̃) a two-dimensional Gaussian vector with the same first and second moments as (α, β). Let λ = λ(b) be a linear function of b such that λ(b) = E(α̃ | β̃ = b).
Then λ(β) is the optimal (in the mean-square sense) linear estimator of α in terms
of β, i.e.,
Ê(α | β) = λ(β).
Here E λ(β) = E α.
PROOF. We first observe that the existence of a linear function λ(b) coinciding with
E(α̃ | β̃ = b) follows from the theorem on normal correlation. Moreover, let λ̄(b) be any other linear estimator. Then
E[α̃ − λ̄(β̃)]² ≥ E[α̃ − λ(β̃)]²,
and since λ̄(b) and λ(b) are linear and the hypotheses of the lemma are satisfied, we have
which shows that λ(β) is optimal in the class of linear estimators. Finally,
where ε̃ij (n) are independent Gaussian random variables with E ε̃ij (n) = 0 and
E ε̃2ij (n) = 1. Let (θ̃0 , ξ˜0 ) also be a Gaussian vector that has the same first mo-
ments and covariance as (θ0 , ξ0 ) and is independent of ε̃ij (n). Then, since (15) is
linear, the vector (θ̃0 , . . . , θ̃n , ξ˜0 , . . . , ξ˜n ) is Gaussian, and therefore the conclusion
of the theorem follows from Lemma 2 (more precisely, from its multidimensional
analog) and the theorem on normal correlation.
This completes the proof of the theorem.
3. Let us consider some illustrations of Theorems 1 and 2.
EXAMPLE 1. Let θ = (θn ) and η = (ηn ) be two stationary (wide sense) uncorrelated
random sequences with E θn = E ηn = 0 and spectral densities
fθ (λ) = 1/(2π|1 + b1 e^{−iλ}|²) and fη (λ) = 1/(2π|1 + b2 e^{−iλ}|²),
Then
ξn+1 = θn+1 + ηn+1 = −b1 θn − b2 ηn + ε1 (n + 1) + ε2 (n + 1)
= −b2 (θn + ηn ) − θn (b1 − b2 ) + ε1 (n + 1) + ε2 (n + 1)
= −b2 ξn − (b1 − b2 )θn + ε1 (n + 1) + ε2 (n + 1).
Let us find the initial conditions under which we should solve this system. Write
d11 = E θn2 , d12 = E θn ξn , d22 = E ξn2 . Then we find from (16) that
from which
d11 = 1/(1 − b1²), d12 = 1/(1 − b1²), d22 = (2 − b1² − b2²)/((1 − b1²)(1 − b2²)),
and
m0 = (d12 /d22 ) ξ0 = ((1 − b2²)/(2 − b1² − b2²)) ξ0 ,
γ0 = d11 − d12²/d22 = 1/(1 − b1²) − (1 − b2²)/((1 − b1²)(2 − b1² − b2²)) = 1/(2 − b1² − b2²). (18)
Thus the optimal (in the least-squares sense) linear estimators mn for the signal
θn in terms of ξ0 , . . . , ξn and the mean-square error are determined by the system of
recurrent equations (17), solved under the initial conditions (18). Observe that the
equation for γn contains no random components, and consequently the numbers γn ,
which are needed for finding mn , can be calculated in advance, before solving the
filtering problem.
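The way such recursions are run in practice can be sketched for the scalar model θn+1 = aθn + bε1 (n + 1), ξn+1 = Aθn + Bε2 (n + 1) with constant coefficients, a special case of (13) (the parameter values and the known start θ0 = 0, so that m0 = 0 and γ0 = 0, are illustrative). The Monte Carlo estimate of E(θn − mn )² should agree with the deterministic γn :

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, A, B = 0.8, 1.0, 1.0, 1.0
n_paths, n_steps = 100_000, 30
theta = np.zeros(n_paths)       # theta_0 = 0 known exactly
m = np.zeros(n_paths)           # m_0 = 0
gamma = 0.0                     # gamma_0 = 0
for _ in range(n_steps):
    theta_new = a * theta + b * rng.standard_normal(n_paths)
    xi_new = A * theta + B * rng.standard_normal(n_paths)  # observes previous theta
    d11 = a**2 * gamma + b**2          # Cov(theta_{n+1} | F_n)
    d12 = a * A * gamma                # Cov(theta_{n+1}, xi_{n+1} | F_n)
    d22 = A**2 * gamma + B**2          # Cov(xi_{n+1} | F_n)
    m = a * m + (d12 / d22) * (xi_new - A * m)
    gamma = d11 - d12**2 / d22
    theta = theta_new
emp = np.mean((theta - m) ** 2)
print(emp, gamma)
```

The numbers γn contain no random components and can be computed in advance, exactly as noted in the text.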
EXAMPLE 2. This example is instructive because it shows that the result of Theo-
rem 2 can be applied to find the optimal linear filter in a case where the sequence
(θ, ξ) is described by a (nonlinear) system that is different from (13).
Let ε1 = (ε1 (n)) and ε2 = (ε2 (n)) be two independent Gaussian sequences of
independent random variables with E εi (n) = 0 and E ε2i (n) = 1, n ≥ 1. Consider a
pair of sequences (θ, ξ) = (θn , ξn ), n ≥ 0, with
m̂n+1 = a1 m̂n + (a1 A1 γ̂n /(1 + A1²γ̂n )) [ξn+1 − A1 m̂n ],
γ̂n+1 = (a1²γ̂n + b1²(n)) − (a1 A1 γ̂n )²/(1 + A1²γ̂n ),
where b1 (n) = √(E(1 + θn )²) must be found from the first equation in (19).
where ε1 is as in (1).
Then we have from (7) and (8) that mn = E(θ | Fnξ ) and γn can be found from
mn+1 = mn + γn A∗1 (n, ξ)[(B1 B∗1 )(n, ξ) + A1 (n, ξ)γn A∗1 (n, ξ)]⊕
× [ξn+1 − A0 (n, ξ) − A1 (n, ξ)mn ], (22)
γn+1 = γn − γn A∗1 (n, ξ)[(B1 B∗1 )(n, ξ) + A1 (n, ξ)γn A∗1 (n, ξ)]⊕ A1 (n, ξ)γn .
If the matrices B1 B∗1 are nonsingular for all n and ξ, the solution of (22) is given
by
mn+1 = [E + γ Σ_{i=0}^{n} A1∗ (i, ξ)(B1 B1∗ )^{−1} (i, ξ)A1 (i, ξ)]^{−1}
× [m + γ Σ_{i=0}^{n} A1∗ (i, ξ)(B1 B1∗ )^{−1} (i, ξ)(ξi+1 − A0 (i, ξ))], (23)
γn+1 = [E + γ Σ_{i=0}^{n} A1∗ (i, ξ)(B1 B1∗ )^{−1} (i, ξ)A1 (i, ξ)]^{−1} γ,
4. PROBLEMS
1. Show that the vectors mn and θn − mn in (1) are uncorrelated:
E[mn∗ (θn − mn )] = 0.
2. In (1)–(2), let γ and the coefficients other than a0 (n, ξ) and A0 (n, ξ) be inde-
pendent of “chance” (i.e., of ξ). Show that then the conditional covariance γn is
independent of “chance”: γn = E γn .
3. Show that the solution of (22) is given by (23).
4. Let (θ, ξ) = (θn , ξn ) be a Gaussian sequence satisfying the following special case
of (1):
θn+1 = aθn + bε1 (n + 1), ξn+1 = Aθn + Bε2 (n + 1).
Show that if A ≠ 0, b ≠ 0, B ≠ 0, the limiting error of filtering, γ = limn→∞ γn , exists and is determined as the positive root of the equation
γ² + (B²(1 − a²)/A² − b²) γ − b²B²/A² = 0.
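A quick numerical check of this problem (a Python sketch; the parameter values are illustrative): iterate the error recursion for this special case of (1) and compare the limit with the positive root of the quadratic:

```python
import numpy as np

a, b, A, B = 0.8, 1.0, 1.0, 1.0
gamma = 0.0
for _ in range(1000):
    # gamma_{n+1} = a^2 gamma_n + b^2 - (a A gamma_n)^2 / (B^2 + A^2 gamma_n)
    gamma = a**2 * gamma + b**2 - (a * A * gamma) ** 2 / (B**2 + A**2 * gamma)
# positive root of gamma^2 + (B^2(1-a^2)/A^2 - b^2) gamma - b^2 B^2 / A^2 = 0
p = B**2 * (1 - a**2) / A**2 - b**2
q = -(b**2) * B**2 / A**2
root = (-p + np.sqrt(p**2 - 4 * q)) / 2
print(gamma, root)
```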
of θm is Gaussian.
(a) Show that the conditional distribution
(a) Show that in this case the distribution πa,b (m, n) = P(θn ≤ a, ξn ≤ b | Fmξ ) is
Gaussian (n ≥ m).
(b) Find the extrapolation estimators
Here the “control” un is Fnξ -measurable and satisfies E u2n < ∞ for all 0 ≤ n ≤
N − 1. The variables ε1 (n) and ε2 (n), n = 1, . . . , N, are the same as in (1), (2);
ξ0 = 0, θ0 ∼ N (m, γ).
We say that the “control” u∗ = (u0∗ , . . . , uN−1∗ ) is optimal if V(u∗ ) = infu V(u),
where
V(u) = E[ Σ_{n=0}^{N−1} (θn² + un²) + θN² ].
Show that
u∗n = −[1 + Pn+1 ]+ Pn+1 m∗n , n = 0, . . . , N − 1,
where
a⁺ = a^{−1} for a ≠ 0 and a⁺ = 0 for a = 0,
with γ0∗ = γ.
Chapter 7
Martingales
Martingale theory illustrates the history of mathematical probability; the basic definitions
are inspired by crude notions of gambling, but the theory has become a sophisticated tool
of modern abstract mathematics, drawing from and contributing to other fields.
J. L. Doob [19]
1. Definitions of Martingales and Related Concepts

1. The study of the dependence between random variables arises in various ways in
probability theory. In the theory of stationary (wide sense) random sequences, the
basic indicator of dependence is the covariance function, and the inferences made in
this theory are determined by the properties of that function. In the theory of Markov
chains (Sect. 12 of Chap. 1, Vol. 1 and Chap. 8) the basic dependence is supplied by
the transition function, which completely determines the development of the random
variables involved in Markov dependence.
In this chapter (see also Sect. 11 of Chap. 1, Vol. 1) we single out a rather wide
class of sequences of random variables (martingales and their generalizations) for
which dependence can be studied by methods based on the properties of conditional
expectations.
2. Let (Ω, F , P) be a given probability space with a filtration (flow), i.e., with a
family (Fn ) of σ-algebras Fn , n ≥ 0, such that F0 ⊆ F1 ⊆ . . . ⊆ F (“filtered
probability space”).
Let X0 , X1 , . . . be a sequence of random variables defined on (Ω, F , P). If, for
each n ≥ 0, the variable Xn is Fn -measurable, the set X = (Xn , Fn )n≥0 , or simply
X = (Xn , Fn ), is called a stochastic sequence.
If a stochastic sequence X = (Xn , Fn ) has the property that, for each n ≥ 1, the
variable Xn is Fn−1 -measurable, we write X = (Xn , Fn−1 ), taking F−1 = F0 , and
call X a predictable sequence. We call such a sequence increasing if X0 = 0 and
Xn ≤ Xn+1 (P-a.s.).
F 0 ⊆ F1 ⊆ · · · ⊆ F .
{ω : E(X+_{n+1} | Fn ) < ∞} ∪ {ω : E(X−_{n+1} | Fn ) < ∞} = Ω (P-a.s.)
Definition 3. A random variable τ = τ(ω) with values in the set {0, 1, . . . , +∞}
is a Markov time (with respect to (Fn )) (or a random variable independent of the
future) if, for each n ≥ 0,
{τ = n} ∈ Fn . (4)
When P(τ < ∞) = 1, a Markov time τ is called a stopping time.
for every n ≥ 0.
EXAMPLE 7. Let X = (Xn , Fn ) be a martingale (or submartingale) and τ a Markov time (with respect to (Fn )). Then the “stopped” sequence X^τ = (Xn∧τ , Fn ) is also a martingale (or submartingale).
In fact, the equation
Xn∧τ = Σ_{m=0}^{n−1} Xm I{τ=m} + Xn I{τ≥n}
implies that the variables Xn∧τ are Fn -measurable, are integrable, and satisfy
X(n+1)∧τ − Xn∧τ = I{τ>n} (Xn+1 − Xn ),
whence
Definition 5. Let Y = (Yn , Fn )n≥0 be a stochastic sequence and V = (Vn , Fn−1 )n≥0 a predictable sequence (F−1 = F0 ). The stochastic sequence V · Y = ((V · Y)n , Fn ) with
(V · Y)n = V0 Y0 + Σ_{i=1}^{n} Vi ΔYi (5)
is called the transform of Y by V; if, in addition, Y is a martingale, V · Y is called a martingale transform.
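A simulation sketch of the transform (5) (Python; the choice of fair ±1 increments and of the particular predictable sequence V are illustrative): if Vi depends only on ΔY1 , . . . , ΔYi−1 , the transform of a martingale keeps a constant expectation:

```python
import numpy as np

rng = np.random.default_rng(3)
n_paths, n_steps = 200_000, 10
dY = rng.choice([-1.0, 1.0], size=(n_paths, n_steps))  # fair increments Delta Y_i
V = np.ones((n_paths, n_steps))
V[:, 1:] = 1.0 + dY[:, :-1]      # predictable: uses only the increments before time i
X = np.cumsum(V * dY, axis=1)    # (V . Y)_n with V_0 Y_0 = 0
print(X[:, -1].mean())           # approx 0: the transform is again a martingale
```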
PROOF. (a) ⇒ (b). Let X be a local martingale, and let (τk ) be a localizing se-
quence of Markov times for X. Then, for every m ≥ 0,
and therefore
The random variable I{τk >n} is Fn -measurable. Hence it follows from (7) that
Under this condition, E[Xn+1 | Fn ] is defined, and it remains only to show that
E[Xn+1 | Fn ] = Xn (P-a.s.).
To do this, we need to show that
∫A Xn+1 d P = ∫A Xn d P
Letting k → ∞, we have
∫A |Xn | d P ≤ ∫A |Xn+1 | d P,
from which there follows the required σ-finiteness of the measure ∫A |Xn | d P, A ∈ Fn .
Let A ∈ Fn have the property ∫A |Xn+1 | d P < ∞. Then, by Lebesgue’s theorem on dominated convergence, we may take limits in the relation
∫A∩{τk >n} Xn d P = ∫A∩{τk >n} Xn+1 d P,
Consider the sequence X^{τk} = ((V · Y)n∧τk I{τk >0} , Fn ). On the set {τk > 0}, the inequality |Vn∧τk | ≤ k is in effect. Hence it follows that E |(V · Y)n∧τk I{τk >0} | < ∞ for every n ≥ 1. In addition, for n ≥ 1,
Xn = Σ_{i=1}^{n} Vi ηi = Xn−1 + Vn ηn , X0 = 0.
It is quite natural to suppose that the amount Vn at the nth turn may depend on the
results of the preceding turns, i.e., on V1 , . . . , Vn−1 and on η1 , . . . , ηn−1 . In other
words, if we put F0 = {∅, Ω} and Fn = σ{η1 , . . . , ηn }, then Vn is an Fn−1 -
measurable random variable, i.e., the sequence V = (Vn , Fn−1 ) that determines the
player’s “strategy” is predictable. Putting Yn = η1 + · · · + ηn , we find that
Xn = Σ_{i=1}^{n} Vi ΔYi ,
fair if p = q = 12 ,
favorable if p > q,
unfavorable if p < q.
Since X = (Xn , Fn ) is a
martingale if p = q = 12 ,
submartingale if p > q,
supermartingale if p < q,
we can say that the assumption that the game is fair (or favorable or unfavorable)
corresponds to the assumption that the sequence X is a martingale (or submartingale
or supermartingale).
Let us now consider the special class of strategies V = (Vn , Fn−1 )n≥1 with
V1 = 1 and (for n > 1)
Vn = 2^{n−1} if η1 = −1, . . . , ηn−1 = −1, and Vn = 0 otherwise. (9)
In such a strategy, a player, having started with a stake V1 = 1, doubles the stake
after a loss and drops out of the game immediately after a win.
If η1 = −1, . . . , ηn = −1, the total loss to the player after n turns will be
Σ_{i=1}^{n} 2^{i−1} = 2^n − 1.
E Xn = E X0 = 0 for every n ≥ 1.
We may therefore expect that this equation will be preserved if the instant n is
replaced by a random instant τ. It will appear later (Theorem 1 in Sect. 2) that
E Xτ = E X0 in “typical” situations. Violations of this equation (as in the game
discussed above) arise in what we may describe as physically unrealizable situa-
tions, when either τ or |Xn | takes values that are much too large. (Note that the game
discussed above would be physically unrealizable since it supposes an unbounded
time for playing and an unbounded initial capital for the player.)
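The doubling strategy (9) can be simulated directly (a Python sketch; the number of paths and the cap on the number of turns are illustrative — the cap is exactly what makes the simulation “physically realizable”):

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths, n_max = 100_000, 40
X_tau = np.empty(n_paths)
for p in range(n_paths):
    stake, capital = 1.0, 0.0
    for n in range(n_max):
        win = rng.random() < 0.5           # fair game, p = q = 1/2
        capital += stake if win else -stake
        if win:                            # stop at the first win
            break
        stake *= 2.0                       # double the stake after a loss
    X_tau[p] = capital
print(X_tau.mean())
```

Every path that wins ends with the total gain Xτ = −(2^{τ−1} − 1) + 2^{τ−1} = 1, so the simulated E Xτ is 1, not E X0 = 0 — the violation discussed above, at the cost of stakes that grow without bound.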
1 Definitions of Martingales and Related Concepts 115
Xn = mn + An (P-a.s.), (11)
where
mn = m0 + Σ_{j=0}^{n−1} [Xj+1 − E(Xj+1 | Fj )], (12)
An = Σ_{j=0}^{n−1} [E(Xj+1 | Fj ) − Xj ]. (13)
It is evident that m and A, defined in this way, have the required properties. In addition, let Xn = m′n + A′n , where m′ = (m′n , Fn ) is a martingale and A′ = (A′n , Fn−1 ) is a predictable increasing sequence. Then
Xn+1 − Xn = (m′n+1 − m′n ) + (A′n+1 − A′n ),
and if we take conditional expectations on both sides, we find that (P-a.s.) An+1 − An = A′n+1 − A′n . But A0 = A′0 = 0, and therefore An = A′n and mn = m′n (P-a.s.) for all n ≥ 0.
This completes the proof of the theorem.
It follows from (11) that the sequence A = (An , Fn−1 ) compensates X = (Xn , Fn )
so that it becomes a martingale. This observation justifies the following definition.
Definition 7. A predictable increasing sequence A = (An , Fn−1 ) appearing in the
Doob decomposition (11) is called a compensator (of the submartingale X).
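A concrete instance (a Python sketch; the random-walk setup is illustrative): for the simple random walk Sn with fair ±1 steps, Xn = Sn² is a submartingale with E(Xj+1 | Fj ) − Xj = 1, so by (13) its compensator is An = n, and mn = Sn² − n is the martingale part:

```python
import numpy as np

rng = np.random.default_rng(5)
n_paths, n_steps = 200_000, 20
S = np.cumsum(rng.choice([-1.0, 1.0], size=(n_paths, n_steps)), axis=1)
X = S**2                                 # a submartingale
A = np.arange(1, n_steps + 1)            # its compensator A_n = n (predictable, increasing)
mart = X - A                             # martingale part of the Doob decomposition
print(mart[:, -1].mean())                # approx E m_n = 0
```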
The Doob decomposition plays a key role in the study of square-integrable martingales M = (Mn , Fn ), i.e., martingales for which E Mn² < ∞, n ≥ 0; this depends on the observation that the stochastic sequence M² = (Mn² , Fn ) is a submartingale. According to Theorem 2, there are a martingale m = (mn , Fn ) and a predictable increasing sequence ⟨M⟩ = (⟨M⟩n , Fn−1 ) such that
Mn² = mn + ⟨M⟩n . (14)
The sequence ⟨M⟩ is called the quadratic characteristic of M and, in many respects, determines its structure and properties.
It follows from (13) that
⟨M⟩n = Σ_{j=1}^{n} E[(ΔMj )² | Fj−1 ] (15)
It is easily verified that (Xn Yn − ⟨X, Y⟩n , Fn ) is a martingale, and therefore
⟨X, Y⟩n = Σ_{i=1}^{n} Cov(ξi , ηi ).
The sequence ⟨X, Y⟩ = (⟨X, Y⟩n , Fn−1 ), defined in (19), is often called the mutual characteristic of the (square-integrable) martingales X and Y. It is easy to show
[X]n = Σ_{i=1}^{n} (ΔXi )²,
which can be defined for all random sequences X = (Xn )n≥1 and Y = (Yn )n≥1 .
PROOF. (1) Let us show that either of the conditions E Xn− < ∞, n ≥ 0, or E Xn+ <
∞, n ≥ 0, implies that E |Xn | < ∞, n ≥ 0.
Indeed, let, for example, E Xn− < ∞ for all n ≥ 0. Then, by the Fatou lemma,
E Xn+ = E lim infk X+_{n∧τk} ≤ lim infk E X+_{n∧τk} = lim infk [E Xn∧τk + E X−_{n∧τk}]
= E X0 + lim infk E X−_{n∧τk} ≤ |E X0 | + Σ_{k=0}^{n} E Xk− < ∞.
|X(n+1)∧τk | ≤ Σ_{i=0}^{n+1} |Xi |,
where
Σ_{i=0}^{n+1} E |Xi | < ∞.
for every n ≥ 1.
6. Let ξ1 , ξ2 , . . . be independent random variables,
P(ξi = 0) = P(ξi = 2) = 1/2 and Xn = Π_{i=1}^{n} ξi .
Show that there does not exist an integrable random variable ξ and a nondecreas-
ing family (Fn ) of σ-algebras such that Xn = E(ξ | Fn ). This example shows
that not every martingale (Xn )n≥1 can be represented in the form (E(ξ | Fn ))n≥1
(cf. Example 3 in Sect. 11, Chap. 1, Vol. 1).
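A simulation sketch for this problem (Python; the path count and horizon are illustrative): the products Xn keep E Xn = 1 for every n, yet the proportion of paths on which Xn = 0 tends to one — consistent with the failure of uniform integrability that underlies this example (cf. also Problem 8):

```python
import numpy as np

rng = np.random.default_rng(7)
n_paths, n_steps = 500_000, 10
xi = rng.choice([0.0, 2.0], size=(n_paths, n_steps))  # P(xi=0)=P(xi=2)=1/2
X_final = np.prod(xi, axis=1)                          # X_n = product of the xi_i
print(X_final.mean(), (X_final == 0).mean())
```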
7. (a) Let ξ1 , ξ2 , . . . be independent random variables with E |ξn | < ∞, E ξn = 0,
n ≥ 1. Show that for any k ≥ 1 the sequence
X_n^{(k)} = Σ_{1≤i_1<···<i_k≤n} ξ_{i_1} · · · ξ_{i_k},  n ≥ k,
is a martingale.
2 Preservation of Martingale Property Under a Random Time Change 119
E(ξ_{n+1} | ξ_1, . . . , ξ_n) = (ξ_1 + · · · + ξ_n)/n  (= X_n).
Prove that the sequence X1 , X2 , . . . is a martingale.
8. Give an example of a martingale (Xn , Fn )n≥1 such that the family {Xn , n ≥ 1}
is not uniformly integrable.
9. Let X = (X_n)_{n≥0} be a Markov chain (Sect. 1, Chap. 8) with a countable state space E = {i, j, . . . } and transition probabilities p_{ij}. Let ψ = ψ(x), x ∈ E, be a bounded function such that

Σ_{j∈E} p_{ij} ψ(j) ≤ λψ(i)

for some λ > 0 and all i ∈ E. Show that the sequence (λ^{−n} ψ(X_n))_{n≥0} is a supermartingale.
E Xn = E X0 (1)
E Xτ = E X0 . (2)
The following basic theorem describes the “typical” situation, in which, in par-
ticular, E Xτ = E X0 . (We let Xτ = 0 on the set {τ = ∞}.)
Theorem 1 (Doob). (a) Let X = (X_n, F_n)_{n≥0} be a submartingale, and τ and σ finite (P-a.s.) stopping times for which E X_τ and E X_σ are defined (e.g., such that E |X_τ| < ∞ and E |X_σ| < ∞). Assume that

lim inf_{n→∞} E X_n^+ I(τ > n) = 0.    (3)

Then

E(X_τ | F_σ) ≥ X_{τ∧σ}  (P-a.s.)    (4)

or, equivalently,

E(X_τ | F_σ) ≥ X_σ  ({τ ≥ σ}; P-a.s.).

(b) Let M = (M_n, F_n)_{n≥0} be a martingale, and τ and σ finite (P-a.s.) stopping times for which E M_τ and E M_σ are defined (e.g., such that E |M_τ| < ∞ and E |M_σ| < ∞). Assume that

lim inf_{n→∞} E |M_n| I(τ > n) = 0.    (5)

Then

E(M_τ | F_σ) = M_{τ∧σ}  (P-a.s.)    (6)

or, equivalently,

E(M_τ | F_σ) = M_σ  ({τ ≥ σ}; P-a.s.).
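A hedged numerical illustration of Theorem 1 (my own sketch, not from the text): for the symmetric ±1 walk and the bounded stopping time τ ∧ N, Doob's theorem gives E M_{τ∧N} = E M_0 = 0, which a simulation should reproduce up to Monte Carlo error:

```python
import random

def stopped_walk_mean(barrier=5, horizon=200, trials=20000, seed=0):
    """Simulate M_{tau ^ N} for a symmetric +-1 walk started at 0, where
    tau = first time |M_n| >= barrier and N = horizon.  For the bounded
    stopping time tau ^ N, optional stopping gives E M_{tau ^ N} = 0."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        m = 0
        for _ in range(horizon):
            if abs(m) >= barrier:  # stop once the barrier is reached
                break
            m += rng.choice((-1, 1))
        total += m
    return total / trials
```

The same experiment with a walk having drift (a submartingale) would show E X_{τ∧N} ≥ E X_0 instead, in line with part (a).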
i.e., that
E Xτ I(B, τ ≥ n) ≥ E Xn I(A, τ ≥ n), B = A ∩ {σ = n}.
Using the property B ∪ {τ > n} ∈ Fn and the fact that the process X =
(Xn , Fn )n≥0 is a submartingale, we find by iterating in n that for any m ≥ n
Consequently,
Thus, we have
E Xτ I(B, σ = n, τ ≥ n) ≥ E Xn I(B, σ = n, τ ≥ n)
or
E Xτ I(A, τ ≥ σ, σ = n) ≥ E Xσ I(A, τ ≥ σ, σ = n).
Hence, using the assumption P{σ < ∞} = 1 and the fact that the expectations E Xτ
and E Xσ are defined, we obtain the desired inequality (7).
(b) Let M = (Mn , Fn )n≥0 be a martingale satisfying (5). This condition implies
that
lim inf_{m→∞} E M_m^+ I(τ > m) = lim inf_{m→∞} E M_m^− I(τ > m) = 0.
with the latter inequality telling us that E[Mτ | Fσ ] ≤ Mτ∧σ . Hence E[Mτ | Fσ ] =
Mτ∧σ (P-a.s.), which is precisely equality (6).
Corollary 1. Let τ and σ be stopping times such that

P{σ ≤ τ ≤ N} = 1

for some N ≥ 1. Then for a submartingale X

E X_0 ≤ E X_σ ≤ E X_τ ≤ E X_N,

and for a martingale M

E M_0 = E M_σ = E M_τ = E M_N.
E X0 ≤ E Xσ ≤ E Xτ .
(and similarly for σ) because, due to inequality (16) of Sect. 6, Chap. 2, Vol. 1, the
assumption of uniform integrability of {Xn , n ≥ 0} implies that supN E |XN | < ∞;
hence the required inequality E |X_τ| < ∞ (and, similarly, E |X_σ| < ∞) will follow from (9).
Corollary 1 applied to the bounded stopping time τN = τ ∧ N implies
E X0 ≤ E XτN .
Therefore
E |X_{τ_N}| = 2 E X_{τ_N}^+ − E X_{τ_N} ≤ 2 E X_{τ_N}^+ − E X_0.    (10)
The sequence X^+ = (X_n^+, F_n)_{n≥0} is a submartingale (see Example 5 in Sect. 1);
hence
E X_{τ_N}^+ = Σ_{j=0}^{N} E X_j^+ I(τ = j) + E X_N^+ I(τ > N)
≤ Σ_{j=0}^{N} E X_N^+ I(τ = j) + E X_N^+ I(τ > N) = E X_N^+ ≤ E |X_N| ≤ sup_m E |X_m|,
E |X_τ| = E lim_N |X_{τ_N}| = E lim inf_N |X_{τ_N}| ≤ lim inf_N E |X_{τ_N}| ≤ 3 sup_N E |X_N|,
Therefore condition (5) fails here. It is of interest to notice that the property (6) fails
here as well since, as was shown in that example, there is a stopping time τ such that
E Xτ = 1 > X0 = 0. In this sense, condition (5) (together with the condition that
E Xσ and E Xτ are defined) is not only sufficient for (6), but also “almost necessary.”
2. The following proposition, which we shall deduce from Theorem 1, is often useful
in applications.
Then

E |X_τ| < ∞

and

E X_τ = E X_0,    (11)

where for submartingales the equality sign in (11) is replaced by "≥".
PROOF. We first verify that the stopping time τ has the properties

E |X_τ| < ∞  and  lim inf_{n→∞} ∫_{{τ>n}} |X_n| d P = 0.
E Σ_{j=0}^{τ} Y_j = Σ_{n=0}^{∞} ∫_{{τ=n}} Σ_{j=0}^{n} Y_j d P = Σ_{j=0}^{∞} Σ_{n=j}^{∞} ∫_{{τ=n}} Y_j d P = Σ_{j=0}^{∞} ∫_{{τ≥j}} Y_j d P,

and therefore

∫_{{τ>n}} |X_n| d P ≤ ∫_{{τ>n}} Σ_{j=0}^{τ} Y_j d P.
Hence, since (by (12)) E Σ_{j=0}^{τ} Y_j < ∞ and {τ > n} ↓ ∅, n → ∞, the dominated convergence theorem yields
lim inf_{n→∞} ∫_{{τ>n}} |X_n| d P ≤ lim inf_{n→∞} ∫_{{τ>n}} Σ_{j=0}^{τ} Y_j d P = 0.
Hence the hypotheses of Theorem 1 are satisfied, and (11) follows, as required.
This completes the proof of the theorem.
E(ξ1 + · · · + ξτ ) = E ξ1 · E τ. (13)
E[|X_{n+1} − X_n| | X_1, . . . , X_n] = E[|ξ_{n+1} − E ξ_1| | ξ_1, . . . , ξ_n] = E |ξ_{n+1} − E ξ_1| ≤ 2 E |ξ_1| < ∞.
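Wald's identity (13) is easy to test by simulation. In this sketch (the die-roll example and all names are assumptions of mine, not the book's), τ is the first time a 6 appears, so E τ = 6, E ξ_1 = 3.5, and (13) predicts E S_τ = 21:

```python
import random

def wald_check(trials=20000, seed=0):
    """Monte Carlo check of Wald's identity E(xi_1+...+xi_tau) = E xi_1 * E tau
    for iid die rolls and the stopping time tau = first n with xi_n = 6.
    Returns (mean of S_tau, mean of tau); the identity predicts the first
    to be 3.5 times the second."""
    rng = random.Random(seed)
    s_total = 0.0
    t_total = 0.0
    for _ in range(trials):
        s = 0
        n = 0
        while True:
            x = rng.randint(1, 6)  # one die roll
            s += x
            n += 1
            if x == 6:             # tau is a genuine stopping time:
                break              # it depends only on xi_1, ..., xi_n
        s_total += s
        t_total += n
    return s_total / trials, t_total / trials
```

Note that τ here satisfies E τ < ∞, as the identity requires; for stopping rules with E τ = ∞ the identity can fail.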
E S_τ^2 = E η_1^2 · E τ.
the sequence (S_n^2 − Σ_{i=1}^{n} η_i^2, F_n^ξ)_{n≥1} is a martingale with zero expectation.
By Corollary 1 we have

E S_{τ(n)}^2 = E Σ_{i=1}^{τ(n)} η_i^2,

so that E S_{τ(n)}^2 = E η_1^2 · E τ(n).
It remains to identify the random variable S. Let us observe that with probability 1
there is a subsequence {n } ⊆ {n} such that both Sτ(n ) → S and τ(n ) → τ. But
then it is clear that Sτ(n ) → Sτ with probability 1. Therefore S and Sτ are the same
almost surely; hence E Sτ2 = E η12 · E τ, which was to be proved.
The second proof. By Fatou's lemma (Theorem 2 (a), Sect. 6, Chap. 2, Vol. 1), we obtain from the equality E S_{τ(n)}^2 = E η_1^2 · E τ(n) established above that

E S_τ^2 ≤ lim inf_n E S_{τ(n)}^2 = E η_1^2 · E τ.
Notice, using Wald's first identity (13) applied to |η_1|, |η_2|, . . . , that

E(|η_1| + · · · + |η_τ|) = E |η_1| · E τ < ∞,

so

E |S_n| I(τ > n) = E |η_1 + · · · + η_n| I(τ > n) ≤ E(|η_1| + · · · + |η_n|) I(τ > n)
≤ E(|η_1| + · · · + |η_τ|) I(τ > n) → 0 as n → ∞.
Applying Theorem 1 to the submartingale (|S_n|, F_n^ξ)_{n≥1}, we find that, on the set {τ ≥ n},

E(|S_τ| | F_n^ξ) ≥ |S_n|  (P-a.s.).
Hence, by Jensen's inequality for conditional expectations (Problem 5, Sect. 7, Chap. 2, Vol. 1), we obtain that on the set {τ ≥ n}

E(S_τ^2 | F_n^ξ) ≥ S_n^2 = S_{τ(n)}^2.

And on the complementary set {τ < n} we have E(S_τ^2 | F_n^ξ) = S_τ^2 = S_{τ(n)}^2. Thus (P-a.s.)

E(S_τ^2 | F_n^ξ) ≥ S_{τ(n)}^2,
because the required convergence will then follow by Lebesgue’s dominated con-
vergence theorem (Theorem 3, Sect. 6, Chap. 2, Vol. 1).
For the proof of this inequality we will use the "maximal inequality" (13) to be given in the next Sect. 3. This inequality applied to the martingale (S_{τ(k)}, F_k^ξ)_{k≥1} yields

E max_{1≤k≤n} S_{τ(k)}^2 ≤ 4 E S_{τ(n)}^2 ≤ 4 sup_n E S_{τ(n)}^2.

But

E S_{τ(n)}^2 = E η_1^2 · E τ(n) ≤ E η_1^2 · E τ < ∞.

Therefore

E sup_n S_{τ(n)}^2 ≤ 4 E η_1^2 · E τ < ∞,
as was to be shown.
PROOF. Take
Y_n = e^{t_0 S_n} (ϕ(t_0))^{−n}.
Then Y = (Yn , Fnξ )n≥1 is a martingale with E Yn = 1 and, on the set {τ ≥ n},
E{|Y_{n+1} − Y_n| | Y_1, . . . , Y_n} = Y_n E{ |e^{t_0 ξ_{n+1}}/ϕ(t_0) − 1| | ξ_1, . . . , ξ_n }
= Y_n · E |e^{t_0 ξ_1} (ϕ(t_0))^{−1} − 1| ≤ C < ∞  (P-a.s.),
E τ = E S_τ/(p − q) = (αA + βB)/(p − q),
where α and β are defined by (17).
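The identity E τ = E S_τ/(p − q) can be checked by simulation (a sketch under my own choice of parameters; the function name is not from the text):

```python
import random

def exit_time_identity(p=0.6, A=-4, B=6, trials=20000, seed=0):
    """Monte Carlo check of E tau = E S_tau / (p - q) for the asymmetric
    +-1 walk (P(step = +1) = p) stopped on first leaving (A, B), A < 0 < B.
    Returns (mean exit time, mean S_tau / (p - q)); the two estimates
    should agree up to Monte Carlo error."""
    rng = random.Random(seed)
    q = 1.0 - p
    t_sum = 0.0
    s_sum = 0.0
    for _ in range(trials):
        s, n = 0, 0
        while A < s < B:
            s += 1 if rng.random() < p else -1
            n += 1
        t_sum += n
        s_sum += s
    return t_sum / trials, s_sum / (trials * (p - q))
```

The derivation rests on the fact that S_n − n(p − q) is a martingale, so the per-trial difference τ − S_τ/(p − q) has mean zero.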
EXAMPLE 2. In the example considered above, let p = q = 1/2. Let us show that for τ defined in (16) and every λ with 0 < λ < π/(B + |A|)

E(cos λ)^{−τ} = cos(λ(B + A)/2) / cos(λ(B + |A|)/2).    (18)
For this purpose we consider the martingale X = (X_n, F_n^ξ)_{n≥0} with

X_n = (cos λ)^{−n} cos(λ(S_n − (B + A)/2)).    (19)
Let us show that the family {X_{n∧τ}} is uniformly integrable. For this purpose we observe that, by Corollary 1 to Theorem 1, for 0 < λ < π/(B + |A|)

E X_0 = E X_{n∧τ} = E(cos λ)^{−(n∧τ)} cos(λ(S_{n∧τ} − (B + A)/2))
≥ E(cos λ)^{−(n∧τ)} cos(λ(B − A)/2).    (20)

Therefore, by (20),

E(cos λ)^{−(n∧τ)} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2),

and hence, by the monotone convergence theorem,

E(cos λ)^{−τ} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2) < ∞.    (21)

Consequently, by (19),

|X_{n∧τ}| ≤ (cos λ)^{−τ}.
With (21), this establishes the uniform integrability of the family {Xn∧τ }. Then, by
Corollary 2 to Theorem 1,
cos(λ(B + A)/2) = E X_0 = E X_τ = E(cos λ)^{−τ} cos(λ(B − A)/2),
m(t)/t → 1/μ,  t → ∞.    (22)
(Recall that the process N = (N_t)_{t≥0} itself obeys the strong law of large numbers:
N_t/t → 1/μ  (P-a.s.),  t → ∞;
see Example 4 in Sect. 3, Chap. 4.)
To prove (22), we will show that
lim inf_{t→∞} m(t)/t ≥ 1/μ  and  lim sup_{t→∞} m(t)/t ≤ 1/μ.    (23)
Hence we see from the right inequality in (24) that t < μ[m(t) + 1], i.e.,
m(t)/t > 1/μ − 1/t,    (26)
whence, letting t → ∞, we obtain the first inequality in (23).
Next, the left inequality in (24) implies that t ≥ E T_{N_t}. Since T_{N_t+1} = T_{N_t} + σ_{N_t+1}, we have

t ≥ E T_{N_t} = E T_{N_t+1} − E σ_{N_t+1} = μ[m(t) + 1] − E σ_{N_t+1}.    (27)
If we assume that the variables σi are bounded from above (σi ≤ c), then (27)
implies that t ≥ μ[m(t) + 1] − c, and hence
m(t)/t ≤ 1/μ + (1/t) · (c − μ)/μ.    (28)
Then the second inequality in (23) would follow.
To discard the restriction σi ≤ c, i ≥ 1, we introduce, for some c > 0, the
variables
σ_i^c = σ_i I(σ_i < c) + c I(σ_i ≥ c)

and define the related renewal process N^c = (N_t^c)_{t≥0} with N_t^c = Σ_{n=1}^{∞} I(T_n^c ≤ t), T_n^c = σ_1^c + · · · + σ_n^c. Since σ_i^c ≤ σ_i, i ≥ 1, we have N_t^c ≥ N_t; hence
m(t)/t ≤ m^c(t)/t ≤ 1/μ^c + (1/t) · (c − μ^c)/μ^c,
where μc = E σ1c .
Therefore
lim sup_{t→∞} m(t)/t ≤ 1/μ^c.
Letting now c → ∞ and using that μc → μ, we obtain the required second inequal-
ity in (23).
Thus (22) is established.
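The conclusion (22) can be illustrated numerically. In the sketch below (my construction, not the book's), the interarrival times are Exp(1), so μ = 1 and in fact m(t) = t exactly; the estimated ratio m(t)/t should be near 1/μ = 1:

```python
import math
import random

def renewal_count_mean(t, trials=2000, seed=0):
    """Estimate m(t) = E N_t for a renewal process with Exp(1)
    interarrival times (mu = 1), sampled by inverse transform.
    The elementary renewal theorem gives m(t)/t -> 1/mu = 1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        s, n = 0.0, 0
        while True:
            s += -math.log(1.0 - rng.random())  # Exp(1) interarrival time
            if s > t:
                break
            n += 1
        total += n
    return total / trials
```

Replacing the exponential draw with any positive distribution of mean μ gives the general statement m(t)/t → 1/μ, though convergence is then only asymptotic rather than exact.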
Remark. For more general results of renewal theory see, for example, [10, Chap.
9], [25, Chap. 13].
5. PROBLEMS
1. Show that
E |X_τ| ≤ lim_{N→∞} E |X_N|
for any martingale or nonnegative submartingale X = (Xn , Fn )n≥0 and any finite
(P -a.s.) stopping time τ. (Compare with inequality E |Xτ | ≤ 3 supN E |XN | in
Corollary 2 to Theorem 1.)
2. Let X = (Xn , Fn )n≥0 be a square-integrable martingale, E X0 = 0, τ a stopping
time, and
lim inf_{n→∞} ∫_{{τ>n}} X_n^2 d P = 0.

Show that

E X_τ^2 = E ⟨X⟩_τ = E Σ_{j=0}^{τ} (ΔX_j)^2,
Xσ ≥ E(Xτ | Fσ ) (P -a.s.).
X_n = a Σ_{k=1}^{n} I(ξ_k = +1) − b Σ_{k=1}^{n} I(ξ_k = −1)

and

τ = min{n ≥ 1 : X_n ≤ −r},  r > 0.
Show that E eλτ < ∞ for λ ≤ α0 and E eλτ = ∞ for λ > α0 , where
α_0 = (b/(a + b)) log(2b/(a + b)) + (a/(a + b)) log(2a/(a + b)).
5. Let ξ1 , ξ2 , . . . be a sequence of independent random variables with E ξi = 0,
Var ξ_i = σ_i^2, S_n = ξ_1 + · · · + ξ_n, F_n^ξ = σ{ξ_1, . . . , ξ_n}. Prove the following generalizations of Wald's identities (13) and (14): If E Σ_{j=1}^{τ} E |ξ_j| < ∞, then E S_τ = 0; if E Σ_{j=1}^{τ} E ξ_j^2 < ∞, then

E S_τ^2 = E Σ_{j=1}^{τ} ξ_j^2 = E Σ_{j=1}^{τ} σ_j^2.    (29)
Show that if

lim inf_{n→∞} E(X_n^2 I(τ > n)) < ∞  or  lim inf_{n→∞} E(|X_n| I(τ > n)) = 0,

then E X_τ^2 = E Σ_{n=1}^{τ} (ΔX_n)^2.
7. Let X = (Xn , Fn )n≥1 be a submartingale and τ1 ≤ τ2 ≤ . . . stopping times such
that E Xτm are defined and
Prove that the sequence (Xτm , Fτm )m≥1 is a submartingale. (As usual, Fτm =
{A ∈ F : A ∩ {τm = j} ∈ Fj , j ≥ 1}.)
3. Fundamental Inequalities
III. Let Y = (Y_n, F_n)_{n≥0} be a nonnegative supermartingale. Then for all λ > 0

λ P{max_{k≤n} Y_k ≥ λ} ≤ E Y_0,    (7)

λ P{sup_{k≥n} Y_k ≥ λ} ≤ E Y_n.    (8)

and if p > 1,

‖X_n‖_p ≤ ‖X_n^*‖_p ≤ (p/(p − 1)) ‖X_n‖_p.    (12)

In particular, if p = 2,

P{max_{k≤n} |X_k| ≥ λ} ≤ E |X_n|^2/λ^2,    (13)

E max_{k≤n} X_k^2 ≤ 4 E X_n^2.    (14)
Hence

λ P{min_{k≤n} Y_k ≤ −λ} ≤ − E Y_n I{min_{k≤n} Y_k ≤ −λ} ≤ E Y_n^−,
from which, as N → ∞,

E Y_n ≥ λ P{γ < ∞} = λ P{sup_{k≥n} Y_k ≥ λ}.
PROOF OF THEOREM 2. The first inequalities in (9) and (10) are evident.
To prove the second inequality in (9), we first suppose that
and use the fact that, for every nonnegative random variable ξ and for r > 0,
E ξ^r = r ∫_0^∞ t^{r−1} P(ξ ≥ t) dt    (16)
(see (69) in Sect. 6, Chap. 2, Vol. 1). Then we obtain, by (1) and Fubini’s theorem,
that for p > 1
E(X_n^*)^p = p ∫_0^∞ t^{p−1} P{X_n^* ≥ t} dt ≤ p ∫_0^∞ t^{p−2} ( ∫_{{X_n^* ≥ t}} X_n d P ) dt
= p ∫_0^∞ t^{p−2} ( ∫_Ω X_n I{X_n^* ≥ t} d P ) dt
= p ∫_Ω X_n ( ∫_0^{X_n^*} t^{p−2} dt ) d P = (p/(p − 1)) E X_n (X_n^*)^{p−1}.    (17)
and therefore

E(X_n^*)^p = lim_{L→∞} E(X_n^* ∧ L)^p ≤ q^p ‖X_n‖_p^p.
We now prove the second inequality in (10). Again applying (1), we obtain

E X_n^* − 1 ≤ E(X_n^* − 1)^+ = ∫_0^∞ P{X_n^* − 1 ≥ t} dt
≤ ∫_0^∞ (1/(1 + t)) ∫_{{X_n^* ≥ 1+t}} X_n d P dt = E X_n ∫_0^{X_n^*−1} dt/(1 + t) = E X_n log X_n^*.
Since a log b ≤ a log^+ a + b e^{−1} for all a ≥ 0 and b > 0, we have

E X_n^* − 1 ≤ E X_n log X_n^* ≤ E X_n log^+ X_n + e^{−1} E X_n^*.

If E X_n^* < ∞, we immediately obtain the second inequality (10). However, if E X_n^* = ∞, we proceed, as above, by replacing X_n^* with X_n^* ∧ L.
This proves the theorem.
PROOF OF THEOREM 3. The proof follows from the remark that |X|p , p ≥ 1, is
a nonnegative submartingale (if E |Xn |p < ∞, n ≥ 0), and from inequalities (1)
and (9).
Corollary of Theorem 3. Let Xn = ξ0 +· · ·+ξn , n ≥ 0, where (ξk )k≥0 is a sequence
of independent random variables with E ξk = 0 and E ξk2 < ∞. Then inequality (13)
becomes Kolmogorov’s inequality (Sect. 2, Chap. 4).
2. Let X = (X_n, F_n) be a nonnegative submartingale and let

X_n = M_n + A_n

be its Doob decomposition. Then, since E M_n = 0, it follows from (1) that

P{X_n^* ≥ ε} ≤ E A_n/ε.
Theorem 4, below, shows that this inequality is valid, not only for submartingales,
but also for the wider class of sequences that have the property of domination in the
following sense.
Definition. Let X = (Xn , Fn ) be a nonnegative stochastic sequence and A =
(An , Fn−1 ) an increasing predictable sequence. We shall say that X is dominated
by sequence A if
E Xτ ≤ E Aτ (20)
for every stopping time τ.
Theorem 4. If X = (Xn , Fn ) is a nonnegative stochastic sequence dominated by an
increasing predictable sequence A = (An , Fn−1 ), then for λ > 0, a > 0, and any
stopping time τ,
P{X_τ^* ≥ λ} ≤ E A_τ/λ,    (21)

P{X_τ^* ≥ λ} ≤ (1/λ) E(A_τ ∧ a) + P(A_τ ≥ a),    (22)

‖X_τ^*‖_p ≤ ((2 − p)/(1 − p))^{1/p} ‖A_τ‖_p,  0 < p < 1.    (23)
PROOF. We set
σn = min{j ≤ τ ∧ n : Xj ≥ λ},
taking σ_n = τ ∧ n if {·} = ∅. Then

E A_τ ≥ E A_{σ_n} ≥ E X_{σ_n} ≥ ∫_{{X_{τ∧n}^* > λ}} X_{σ_n} d P ≥ λ P{X_{τ∧n}^* > λ},

from which

P{X_{τ∧n}^* > λ} ≤ (1/λ) E A_τ,
and we obtain (21) by Fatou’s lemma.
For the proof of (22), we introduce the time

γ = min{j : A_{j+1} ≥ a},

which is a stopping time because the sequence A is predictable.

Remark. If the increments satisfy ΔA_k ≤ c, where ΔA_k = A_k − A_{k−1}, then the following inequality is satisfied (cf. (22)):

P{X_τ^* ≥ λ} ≤ (1/λ) E[A_τ ∧ (a + c)] + P{A_τ ≥ a}.    (24)
The proof is analogous to that of (22). We have only to replace the time γ =
min{j : Aj+1 ≥ a} with γ = min{j : Aj ≥ a} and notice that Aγ ≤ a + c.
Corollary. Let the sequences X^k = (X_n^k, F_n^k) and A^k = (A_n^k, F_n^k), n ≥ 0, k ≥ 1, satisfy the hypotheses of Theorem 4 or the remark. Also, let (τ_k)_{k≥1} be a sequence of stopping times (with respect to F^k = (F_n^k)) such that A_{τ_k}^k → 0 in probability. Then (X^k)_{τ_k}^* → 0 in probability.
3. In this subsection we present (without proofs, but with applications) a num-
ber of significant inequalities for martingales. These generalize the inequalities of
Khinchin and of Marcinkiewicz and Zygmund for sums of independent random
variables stated below.
Khinchin's Inequalities. Let ξ_1, ξ_2, . . . be independent identically distributed Bernoulli random variables with P(ξ_i = 1) = P(ξ_i = −1) = 1/2, and let (c_n)_{n≥1} be a sequence of numbers.
Then for every p, 0 < p < ∞, there are universal constants A_p and B_p (independent of (c_n)) such that

A_p (Σ_{j=1}^{n} c_j^2)^{1/2} ≤ ‖Σ_{j=1}^{n} c_j ξ_j‖_p ≤ B_p (Σ_{j=1}^{n} c_j^2)^{1/2}    (25)
for every n ≥ 1.
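For p = 2 the constants in (25) can be taken A_2 = B_2 = 1, since the ξ_j are orthonormal in L^2. The following sketch (function name mine) computes the L^2 norm exactly by enumerating all 2^n equally likely sign patterns:

```python
from itertools import product

def l2_norm_exact(coeffs):
    """Exact L2 norm of sum_j c_j xi_j over all 2^n equally likely sign
    patterns xi_j = +-1.  By orthogonality it equals (sum_j c_j^2)^(1/2),
    i.e., Khinchin's inequality (25) holds with equality when p = 2."""
    n = len(coeffs)
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        s = sum(c * e for c, e in zip(coeffs, signs))
        total += s * s
    return (total / 2 ** n) ** 0.5

# e.g. coeffs (3, 4) -> norm 5
```

For p ≠ 2 the norms are no longer given by a closed sum of squares, which is exactly where the nontrivial constants A_p, B_p enter.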
Marcinkiewicz and Zygmund's Inequalities. If ξ_1, ξ_2, . . . is a sequence of independent integrable random variables with E ξ_i = 0, then for p ≥ 1 there are universal constants A_p and B_p (independent of (ξ_n)) such that

A_p ‖(Σ_{j=1}^{n} ξ_j^2)^{1/2}‖_p ≤ ‖Σ_{j=1}^{n} ξ_j‖_p ≤ B_p ‖(Σ_{j=1}^{n} ξ_j^2)^{1/2}‖_p    (26)
for every n ≥ 1.
The sequences X = (X_n) with X_n = Σ_{j=1}^{n} c_j ξ_j and X_n = Σ_{j=1}^{n} ξ_j in (25) and
(26) are martingales involving independent ξj . It is natural to ask whether these
inequalities can be extended to arbitrary martingales.
The first result in this direction was obtained by Burkholder.
A_p^* ‖[X]_n^{1/2}‖_p ≤ ‖X_n‖_p ≤ B_p^* ‖[X]_n^{1/2}‖_p,    (27)

where

[X]_n = Σ_{j=1}^{n} (ΔX_j)^2,  X_0 = 0,    (28)

and

A_p^* = [18 p^{3/2}/(p − 1)]^{−1},  B_p^* = 18 p^{5/2}/(p − 1)^{3/2}.
Burkholder’s inequalities (27) hold for p > 1, whereas the Marcinkiewicz–
Zygmund inequalities (26) also hold when p = 1. What can we say about the
validity of (27) for p = 1? It turns out that a direct generalization to p = 1 is
impossible, as the following example shows.
where

τ = min{n ≥ 1 : Σ_{j=1}^{n} ξ_j = 1}.
But

‖[X]_n^{1/2}‖_1 = E [X]_n^{1/2} = E (Σ_{j=1}^{τ∧n} 1)^{1/2} = E √(τ ∧ n) → ∞.
i.e.,

A E (Σ_{j=1}^{n} (ΔX_j)^2)^{1/2} ≤ E max_{1≤j≤n} |X_j| ≤ B E (Σ_{j=1}^{n} (ΔX_j)^2)^{1/2}.
E Sτ = 0 (31)
for every stopping time τ (with respect to (Fnξ )) for which E τ < ∞.
If we assume additionally that E |ξ_1|^r < ∞, where 1 < r ≤ 2, then the condition E τ^{1/r} < ∞ is sufficient for (31).
For the proof, we set τ_n = τ ∧ n, Y = sup_n |S_{τ_n}|, and let m = [t^r] (integral part of t^r) for t > 0. By Corollary 1 to Theorem 1 (Sect. 2), we have E S_{τ_n} = 0. Therefore
a sufficient condition for E Sτ = 0 is (by the dominated convergence theorem) that
E supn |Sτn | < ∞.
Using (1) and (27), we obtain
and therefore
E Y = ∫_0^∞ P(Y ≥ t) dt ≤ (1 + B_r^r μ_r) E τ^{1/r} + B_r^r μ_r ∫_0^∞ t^{−r} ( ∫_{{τ<t^r}} τ d P ) dt
= (1 + B_r^r μ_r) E τ^{1/r} + B_r^r μ_r ∫_Ω τ ( ∫_{τ^{1/r}}^∞ t^{−r} dt ) d P
= (1 + B_r^r μ_r + B_r^r μ_r/(r − 1)) E τ^{1/r} < ∞.
Corollary 2. Let M = (Mn ) be a martingale with E |Mn |2r < ∞ for some r ≥ 1
and such that (with M0 = 0)
Σ_{n=1}^{∞} E |ΔM_n|^{2r}/n^{1+r} < ∞.    (32)
Then (cf. Theorem 2 in Sect. 3, Chap. 4) we have the strong law of large numbers:

M_n/n → 0  (P-a.s.),  n → ∞.    (33)
When r = 1, the proof follows the same lines as the proof of Theorem 2 in
Sect. 3, Chap. 4. In fact, let
m_n = Σ_{k=1}^{n} ΔM_k/k.
Then

M_n/n = (Σ_{k=1}^{n} ΔM_k)/n = (1/n) Σ_{k=1}^{n} k Δm_k,
and, by Kronecker’s lemma (Sect. 3, Chap. 4), a sufficient condition for the limit
relation (P-a.s.)
(1/n) Σ_{k=1}^{n} k Δm_k → 0,  n → ∞,
is that the limit limn mn exists and is finite (P-a.s.), which in turn (Theorems 1 and
4 in Sect. 10, Chap. 2, Vol. 1) is true if and only if
P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} → 0,  n → ∞.    (34)
By (1),

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} ≤ ε^{−2} Σ_{k=n}^{∞} E(ΔM_k)^2/k^2.
≤ (1/n^{2r}) E |M_n|^{2r} + Σ_{j≥n+1} (1/j^{2r}) E(|M_j|^{2r} − |M_{j−1}|^{2r}).
We have

I_N = Σ_{j=2}^{N} (1/j^{2r}) [E |M_j|^{2r} − E |M_{j−1}|^{2r}]
≤ Σ_{j=2}^{N} [1/(j − 1)^{2r} − 1/j^{2r}] E |M_{j−1}|^{2r} + E |M_N|^{2r}/N^{2r}.
Hence

I_N ≤ B_{2r}^{2r} Σ_{j=2}^{N−1} [1/j^{2r} − 1/(j + 1)^{2r}] j^{r−1} Σ_{i=1}^{j} E |ΔM_i|^{2r} + E |M_N|^{2r}/N^{2r}
≤ C_1 Σ_{j=2}^{N−1} (1/j^{r+2}) Σ_{i=1}^{j} E |ΔM_i|^{2r} + E |M_N|^{2r}/N^{2r}
≤ C_2 + C_3 Σ_{j=2}^{N} E |ΔM_j|^{2r}/j^{r+1}.
4. The sequence of random variables {Xn }n≥1 has a limit lim Xn (finite or infinite)
with probability 1 if and only if the number of “oscillations between two arbitrary
rational numbers a and b, a < b” is finite with probability 1. In what follows, Theo-
rem 5 provides an upper bound for the number of “oscillations” for submartingales.
In the next section, this will be applied to prove the fundamental result on their
convergence.
Let us choose two numbers a and b, a < b, and define the following times in
terms of the stochastic sequence X = (Xn , Fn ):
τ0 = 0,
τ1 = min{n > 0 : Xn ≤ a},
τ2 = min{n > τ1 : Xn ≥ b},
··· ··· ························
τ2m−1 = min{n > τ2m−2 : Xn ≤ a},
τ2m = min{n > τ2m−1 : Xn ≥ b},
··· ··· ························
and

{ϕ_i = 1} = ∪_{m odd} [{τ_m < i} \ {τ_{m+1} < i}] ∈ F_{i−1}.
Therefore

b E β_n(0, b) ≤ E Σ_{i=1}^{n} ϕ_i [X_i − X_{i−1}] = Σ_{i=1}^{n} ∫_{{ϕ_i=1}} (X_i − X_{i−1}) d P
= Σ_{i=1}^{n} ∫_{{ϕ_i=1}} E(X_i − X_{i−1} | F_{i−1}) d P
= Σ_{i=1}^{n} ∫_{{ϕ_i=1}} [E(X_i | F_{i−1}) − X_{i−1}] d P
≤ Σ_{i=1}^{n} ∫_Ω [E(X_i | F_{i−1}) − X_{i−1}] d P = E X_n − E X_0 ≤ E X_n,
5. In this subsection we discuss some of the simplest inequalities for the probabilities
of large deviations for square-integrable martingales.
Let M = (M_n, F_n)_{n≥0} be a square-integrable martingale with quadratic characteristic ⟨M⟩ = (⟨M⟩_n, F_{n−1}), and let M_0 = 0. If we apply inequality (22) to X_n = M_n^2, A_n = ⟨M⟩_n, we find that for a > 0 and b > 0

P{max_{k≤n} |M_k| ≥ an} = P{max_{k≤n} M_k^2 ≥ (an)^2}
≤ (1/(an)^2) E[⟨M⟩_n ∧ (bn)] + P{⟨M⟩_n ≥ bn}.    (39)
In fact, at least in the case where |ΔMn | ≤ C for all n and ω ∈ Ω, this inequality
can be substantially improved using the ideas explained in Sect. 5 of Chap. 4 for
estimating the probabilities of large deviations for sums of independent identically
distributed random variables.
Let us recall that in Sect. 5, Chap. 4, when we introduced the corresponding inequalities, the essential point was to use the property that the sequence (e^{λM_n}/E_n(λ), F_n)_{n≥0}, where

E_n(λ) = ∏_{j=1}^{n} E(e^{λΔM_j} | F_{j−1}),    (41)

is a martingale. The sequence E_n(λ) is called the stochastic exponential (see also Subsection 13, Sect. 6, Chap. 2, Vol. 1).
This expression is rather complicated. At the same time, in using (8) it is not
necessary for the sequence to be a martingale. It is enough for it to be a nonnega-
tive supermartingale. Here we can arrange this by forming a sequence (Zn (λ), Fn )
((43), below), which sufficiently simply depends on Mn and Mn , and to which we
can apply the method used in Sect. 5, Chap. 4.
Lemma 1. Let M = (M_n, F_n)_{n≥0} be a square-integrable martingale, M_0 = 0, ΔM_0 = 0, and |ΔM_n(ω)| ≤ c for all n and ω. Let λ > 0,

ψ_c(λ) = (e^{λc} − 1 − λc)/c^2 for c > 0,  ψ_c(λ) = λ^2/2 for c = 0,    (42)

and

Z_n(λ) = e^{λM_n − ψ_c(λ)⟨M⟩_n}.    (43)
Then for every c ≥ 0 the sequence Z(λ) = (Zn (λ), Fn )n≥0 is a nonnegative
supermartingale.
PROOF. For |x| ≤ c,

e^{λx} − 1 − λx = (λx)^2 Σ_{m≥2} (λx)^{m−2}/m! ≤ (λx)^2 Σ_{m≥2} (λc)^{m−2}/m! ≤ x^2 ψ_c(λ).
we find that

E(ΔZ_n | F_{n−1})
= Z_{n−1} [E(e^{λΔM_n} − 1 | F_{n−1}) e^{−Δ⟨M⟩_n ψ_c(λ)} + (e^{−Δ⟨M⟩_n ψ_c(λ)} − 1)]
= Z_{n−1} [E(e^{λΔM_n} − 1 − λΔM_n | F_{n−1}) e^{−Δ⟨M⟩_n ψ_c(λ)} + (e^{−Δ⟨M⟩_n ψ_c(λ)} − 1)]
≤ Z_{n−1} [ψ_c(λ) E((ΔM_n)^2 | F_{n−1}) e^{−Δ⟨M⟩_n ψ_c(λ)} + (e^{−Δ⟨M⟩_n ψ_c(λ)} − 1)]
= Z_{n−1} [ψ_c(λ) Δ⟨M⟩_n e^{−Δ⟨M⟩_n ψ_c(λ)} + (e^{−Δ⟨M⟩_n ψ_c(λ)} − 1)] ≤ 0,    (44)
where we have also used the fact that xe−x + (e−x − 1) ≤ 0 for x ≥ 0.
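The elementary inequality e^{λx} − 1 − λx ≤ x^2 ψ_c(λ) for |x| ≤ c, used in the chain (44), is easy to verify numerically (a sketch with my own names; equality holds at x = c):

```python
import math

def psi_c(lam, c):
    """psi_c(lambda) from (42): (e^{lambda c} - 1 - lambda c)/c^2 for
    c > 0, and lambda^2/2 for c = 0 (the limit as c -> 0)."""
    if c == 0:
        return lam * lam / 2.0
    return (math.exp(lam * c) - 1.0 - lam * c) / (c * c)

def key_bound_holds(lam, c, samples=1000):
    """Check e^{lambda x} - 1 - lambda x <= x^2 psi_c(lambda) on a grid of
    x with |x| <= c, i.e., the inequality behind the supermartingale
    property of Z_n(lambda)."""
    p = psi_c(lam, c)
    for i in range(samples + 1):
        x = -c + 2.0 * c * i / samples
        if math.exp(lam * x) - 1.0 - lam * x > x * x * p + 1e-12:
            return False
    return True
```

Since ψ_c(λ) → λ^2/2 as c → 0, the c = 0 branch makes ψ_c continuous in c, which is why the Gaussian-type bound appears in the limit of small increments.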
Passing from M to −M, we find that the right-hand side of (46) also provides an upper bound for the probability P{min_{k≤n} M_k ≤ −an}. Consequently,

P{max_{k≤n} |M_k| ≥ an} ≤ 2 P{⟨M⟩_n > bn} + 2e^{−nH_c(a,b)}.    (47)
which characterize, in particular, the rate of convergence in the strong law of large
numbers for martingales (also see Theorem 4 in Sect. 5).
Proceeding as in Sect. 5, Chap. 4, we find that for every a > 0 there is a λ > 0 for which aλ − ψ_c(λ) > 0. Then, for every b > 0,

P{sup_{k≥n} M_k/⟨M⟩_k > a} ≤ P{sup_{k≥n} e^{λM_k − ψ_c(λ)⟨M⟩_k} > e^{[aλ−ψ_c(λ)]⟨M⟩_n}}
≤ P{sup_{k≥n} e^{λM_k − ψ_c(λ)⟨M⟩_k} > e^{[aλ−ψ_c(λ)]bn}} + P{⟨M⟩_n < bn}
≤ e^{−bn[aλ−ψ_c(λ)]} + P{⟨M⟩_n < bn},    (49)
from which

P{sup_{k≥n} M_k/⟨M⟩_k > a} ≤ P{⟨M⟩_n < bn} + e^{−nH_c(ab,b)},    (50)

and

P{sup_{k≥n} |M_k|/⟨M⟩_k > a} ≤ 2 P{⟨M⟩_n < bn} + 2e^{−nH_c(ab,b)}.    (51)
7. PROBLEMS
1. Let X = (Xn , Fn ) be a nonnegative submartingale, and let V = (Vn , Fn−1 ) be
a predictable sequence such that 0 ≤ Vn+1 ≤ Vn ≤ C (P-a.s.), where C is a
constant. Establish the following generalization of (1):
ε P{max_{1≤j≤n} V_j X_j ≥ ε} + ∫_{{max_{1≤j≤n} V_j X_j < ε}} V_n X_n d P ≤ E Σ_{j=1}^{n} V_j ΔX_j.    (52)
E Sn∗ ≤ 8 E |Sn |.
In particular,

P{max_{1≤k≤n} X_k ≥ x} ≤ e^{−tx} E e^{tX_n}.
11. Let ξ = (ξn )n≥1 be a martingale difference and 1 < p ≤ 2. Show that
E sup_{n≥1} |Σ_{j=1}^{n} ξ_j|^p ≤ C_p Σ_{j=1}^{∞} E |ξ_j|^p
for a constant Cp .
12. Let X = (Xn )n≥1 be a martingale with E Xn = 0 and E Xn2 < ∞. As a general-
ization of Problem 5 of Sect. 2, Chap. 4, show that for any n ≥ 1 and ε > 0
P{max_{1≤k≤n} X_k > ε} ≤ E X_n^2/(ε^2 + E X_n^2).
1. The following result, which is fundamental for all problems about the conver-
gence of submartingales, can be thought of as an analog of the fact that in real
analysis a bounded monotonic sequence of numbers has a (finite) limit.
Then with probability 1 the limit lim Xn = X∞ exists and E |X∞ | < ∞.
Then, since

{lim sup X_n > lim inf X_n} = ∪_{a<b} {lim sup X_n > b > a > lim inf X_n}

(here a and b are rational numbers), there are values a and b such that
and

X_n → X_∞  (P-a.s.),    (4)

X_n → X_∞  in L^1.    (5)
Moreover, the sequence X = (X_n, F_n), 1 ≤ n ≤ ∞, with F_∞ = σ(∪_n F_n) is also a submartingale.
PROOF. Statement (4) follows from Theorem 1, and (5) follows from (4) and The-
orem 4 (Sect. 6, Chap. 2, Vol. 1).
Moreover, if A ∈ F_n and m ≥ n, then

E I_A |X_m − X_∞| → 0,  m → ∞,

and therefore

lim_{m→∞} ∫_A X_m d P = ∫_A X_∞ d P.

The sequence (∫_A X_m d P)_{m≥n} is nondecreasing, and therefore

∫_A X_n d P ≤ ∫_A X_m d P ≤ ∫_A X_∞ d P,
then there is an integrable random variable X∞ for which (4) and (5) are satisfied.
For the proof, it is enough to observe that, by Lemma 3 of Sect. 6, Chap. 2, Vol. 1,
condition (6) guarantees the uniform integrability of the family {Xn }.
Theorem 3 (P. Lévy). Let (Ω, F, P) be a probability space, and let (F_n)_{n≥1} be a nondecreasing family of σ-algebras, F_1 ⊆ F_2 ⊆ · · · ⊆ F. Let ξ be a random variable with E |ξ| < ∞ and F_∞ = σ(∪_n F_n). Then, both P-a.s. and in the L^1 sense,

E(ξ | F_n) → E(ξ | F_∞),  n → ∞.    (7)
4 General Theorems on Convergence of Submartingales and Martingales 151
X∞ = E(ξ | F∞ ) (P -a.s.).
Since the family {X_n} is uniformly integrable and since, by Theorem 5, Sect. 6, Chap. 2, Vol. 1, we have E I_A |X_m − X_∞| → 0 as m → ∞, it follows that

∫_A X_∞ d P = ∫_A ξ d P.    (8)

This equation is satisfied for all A ∈ F_n and, therefore, for all A ∈ ∪_{n=1}^{∞} F_n.
Since E |X_∞| < ∞ and E |ξ| < ∞, the left-hand and right-hand sides of (8) are σ-additive measures, possibly taking negative as well as positive values, but finite and agreeing on the algebra ∪_{n=1}^{∞} F_n. Because of the uniqueness of the extension of a σ-additive measure from an algebra to the smallest σ-algebra containing it (Carathéodory's theorem, Sect. 3, Chap. 2, Vol. 1), Eq. (8) remains valid for sets A ∈ F_∞ = σ(∪_n F_n). Thus

∫_A X_∞ d P = ∫_A ξ d P = ∫_A E(ξ | F_∞) d P,  A ∈ F_∞.    (9)
The next two examples illustrate possible applications of the preceding results to
convergence theorems in analysis.
(In this sense, g(x) is a “derivative” of f (x).) Let us show how this result can be
deduced from Theorem 1.
Let Ω = [0, 1), F = B([0, 1)), and let P denote Lebesgue measure. Put

ξ_n(x) = Σ_{k=1}^{2^n} ((k − 1)/2^n) I{(k − 1)/2^n ≤ x < k/2^n},

F_n = σ{ξ_1, . . . , ξ_n}, and

X_n = E[g | F_n].    (11)
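Here X_n(x) is simply the average of g over the dyadic interval of length 2^{−n} containing x, and the convergence theorems above yield X_n → g a.e. A sketch approximating these averages by Riemann sums (the names and the test function below are mine, for illustration only):

```python
def dyadic_average(g, x, n, samples_per_cell=1000):
    """E[g | F_n](x): the average of g over the dyadic interval of length
    2^-n containing x, approximated by a midpoint Riemann sum."""
    k = int(x * 2 ** n)          # index of the dyadic cell [k/2^n, (k+1)/2^n)
    lo = k / 2 ** n
    h = 1.0 / (2 ** n * samples_per_cell)
    return sum(g(lo + (i + 0.5) * h) for i in range(samples_per_cell)) * 2 ** n * h

# For g(x) = x^2, the averages X_n(0.3) should approach g(0.3) = 0.09 as n grows.
```

As n increases, the cell containing x shrinks to x, so for continuous g the averages converge to g(x) at every point; for integrable g, Theorem 3 gives convergence almost everywhere.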
and since n and k are arbitrary, we obtain the required equation (10).
EXAMPLE 3. Let Ω = [0, 1), F = B([0, 1)), and let P denote Lebesgue measure.
Consider the Haar system {H_n(x)}_{n≥1}, as defined in Example 3 of Sect. 11, Chap. 2, Vol. 1. Put F_n = σ{H_1, . . . , H_n}, and observe that σ(∪_n F_n) = F. From the properties of conditional expectations and the structure of the Haar functions, it is easy
to deduce that

E[f(x) | F_n] = Σ_{k=1}^{n} a_k H_k(x)  (P-a.s.),    (12)

where a_k = (f, H_k). Hence, by Theorem 3,

Σ_{k=1}^{n} (f, H_k) H_k(x) → f(x)  (P-a.s.)

and

∫_0^1 |Σ_{k=1}^{n} (f, H_k) H_k(x) − f(x)| dx → 0.
t0 so that |t0 | < δ. Then there is an n0 = n0 (t0 ) such that | E eit0 Sn | ≥ c > 0 for all
n ≥ n0 , where c is a constant.
For n ≥ n0 , we form the sequence X = (Xn , Fn ) with
X_n = e^{it_0 S_n}/E e^{it_0 S_n},  F_n = σ{ξ_1, . . . , ξ_n}.
Since ξ1 , ξ2 , . . . were assumed to be independent, the sequence X = (Xn , Fn ) is a
martingale with
sup E |Xn | ≤ c−1 < ∞.
n≥n0
Then it follows from Theorem 1 that with probability 1 the limit limn Xn exists and is
finite. Therefore the limit limn→∞ eit0 Sn also exists with probability 1. Consequently,
we can assert that there is a δ > 0 such that for each t in the set T = {t : |t| < δ}
the limit limn eitSn exists with probability 1.
Let T × Ω = {(t, ω) : t ∈ T, ω ∈ Ω}, let B(T) be the σ-algebra of Lebesgue sets
on T, and let λ be Lebesgue measure on (T, B(T)). Also, let
C = {(t, ω) ∈ T × Ω : lim_n e^{itS_n(ω)} exists}.
n
of Theorem 3: As n → ∞,
if the sequence (X_n)_{n≥1} is uniformly integrable, then the limit X_∞ = lim_n X_n exists (P-a.s.) and the "closed" sequence X = (X_n, F_n)_{1≤n≤∞} is a martingale.
12. Assume that X = (X_n, F_n)_{n≥1} is a submartingale, and let F_∞ = σ(∪_{n=1}^{∞} F_n). Prove that if (X_n^+)_{n≥1} is uniformly integrable, then the limit X_∞ = lim_n X_n exists (P-a.s.) and the "closed" sequence X = (X_n, F_n)_{1≤n≤∞} is a submartingale.
PROOF. The inclusion {Xn →} ⊆ {sup Xn < ∞} is evident. To establish the oppo-
site inclusion, we consider the stopped submartingale X τa = (Xτa ∧n , Fn ). Then, by
(1),
Therefore (P-a.s.)
The following propositions extend this result to more general martingales and
submartingales.
Theorem 2. Let X = (Xn , Fn ) be a submartingale and
Xn = mn + An
PROOF. (a) The second inclusion in (7) is obvious. To establish the first inclusion,
we introduce the times
Therefore (P-a.s.)

{A_∞ < ∞} = ∪_{a>0} {A_∞ ≤ a} ⊆ {X_n →}.
(b) The first equation follows from Theorem 1. To prove the second, we notice
that, in accordance with (5),
and therefore
E A_{τ_a} = E lim_n A_{τ_a∧n} < ∞.

Hence {τ_a = ∞} ⊆ {A_∞ < ∞}, and we obtain the required conclusion since ∪_{a>0} {τ_a = ∞} = {sup X_n < ∞}.
5 Sets of Convergence of Submartingales and Martingales 159
where

⟨M⟩_∞ = Σ_{n=1}^{∞} E((ΔM_n)^2 | F_{n−1}).    (15)
Σ_{k=1}^{n} E(Δ(M_k + 1)^2 | F_{k−1}) = Σ_{k=1}^{n} E(Δ M_k^2 | F_{k−1}) = A_n

because the linear term in E(Δ(M_k + 1)^2 | F_{k−1}) vanishes. Hence (7) implies that (P-a.s.)
Because of (9), Eq. (14) will be established if we show that the condition
E sup |ΔMn |2 < ∞ guarantees that M 2 belongs to C+ .
Let τa = min{n ≥ 1 : Mn2 > a}, a > 0. Then, on the set {τa < ∞},
whence

E |Δ(M^2)_{τ_a}| I{τ_a < ∞} ≤ E(ΔM_{τ_a})^2 I{τ_a < ∞} + 2a^{1/2} √(E(ΔM_{τ_a})^2 I{τ_a < ∞})
≤ E sup_n |ΔM_n|^2 + 2a^{1/2} √(E sup_n |ΔM_n|^2) < ∞.
m_n = Σ_{i=1}^{n} ΔM_i/A_i.

Then

⟨m⟩_n = Σ_{i=1}^{n} E[(ΔM_i)^2 | F_{i−1}]/A_i^2.    (19)
Since

M_n/A_n = (Σ_{k=1}^{n} A_k Δm_k)/A_n,
An An
we have, by Kronecker’s lemma (Sect. 3, Chap. 4), Mn /An → 0 (P-a.s.) if the limit
limn mn exists (finite) with probability 1. By (13),
Therefore it follows from (19) that (16) is a sufficient condition for (17).
If now An = Mn , then (16) is automatically satisfied (Problem 6).
This completes the proof of the theorem.
EXAMPLE. Consider a sequence ξ1 , ξ2 , . . . of independent random variables with
E ξi = 0, Var ξi = Vi > 0, and let the sequence X = {Xn }n≥0 be defined recursively
by
Xn+1 = θXn + ξn+1 , (21)
where X0 is independent of ξ1 , ξ2 , . . . and θ is an unknown parameter, −∞ < θ <
∞.
We interpret Xn as the result of an observation made at time n and ask for an esti-
mator of the unknown parameter θ. As an estimator of θ in terms of X0 , X1 , . . . , Xn ,
we take

θ̂_n = (Σ_{k=0}^{n−1} X_k X_{k+1}/V_{k+1}) / (Σ_{k=0}^{n−1} X_k^2/V_{k+1}),    (22)
P(θ̂n → θ) = 1 (23)
are sufficient for (24), and therefore sufficient for (23). We have
Σ_{n=1}^{∞} (ξ_n^2/V_n ∧ 1) ≤ Σ_{n=1}^{∞} ξ_n^2/V_n = Σ_{n=1}^{∞} (X_n − θX_{n−1})^2/V_n
≤ 2 (Σ_{n=1}^{∞} X_n^2/V_n + θ^2 Σ_{n=1}^{∞} X_{n−1}^2/V_n) ≤ 2 (sup_n (V_{n+1}/V_n) + θ^2) ⟨M⟩_∞,

where ⟨M⟩_n = Σ_{k=0}^{n−1} X_k^2/V_{k+1} by definition.
Therefore

{Σ_{n=1}^{∞} (ξ_n^2/V_n ∧ 1) = ∞} ⊆ {⟨M⟩_∞ = ∞}.
By the three-series theorem (Theorem 3, Sect. 2, Chap. 4), the divergence of Σ_{n=1}^{∞} E((ξ_n^2/V_n) ∧ 1) guarantees the divergence (P-a.s.) of Σ_{n=1}^{∞} ((ξ_n^2/V_n) ∧ 1). Therefore P{⟨M⟩_∞ = ∞} = 1, and (24) follows directly from Theorem 4.
Estimators θ̂n , n ≥ 1, with property (23) are said to be strongly consistent; com-
pare the notion of consistency in Sect. 4, Chap. 1, Vol. 1.
In Subsection 5 of the next section we continue the discussion of this example
for Gaussian variables ξ1 , ξ2 , . . . .
Theorem 5. Let X = (Xn , Fn ) be a submartingale, and let
Xn = mn + An
or, equivalently,

{Σ_{n=1}^{∞} E[ΔX_n + (ΔX_n)^2 | F_{n−1}] < ∞} = {X_n →}.    (27)
PROOF. Since

A_n = Σ_{k=1}^{n} E(ΔX_k | F_{k−1})    (28)

and

m_n = Σ_{k=1}^{n} [ΔX_k − E(ΔX_k | F_{k−1})],    (29)
it follows from the assumption that |ΔXk | ≤ C that the martingale m = (mn , Fn )
is square-integrable with |Δmn | ≤ 2C. Then, by (13),
A ∩ {X_n →} = A ∩ {Σ_{k=1}^{n} ξ_k I(|ξ_k| ≤ c) →}
= A ∩ {Σ_{k=1}^{n} [ξ_k I(|ξ_k| ≤ c) − E(ξ_k I(|ξ_k| ≤ c) | F_{k−1})] →}.    (31)
n
Let ηk = ξkc − E(ξkc | Fk−1 ), and let Yn = k=1 ηk . Then Y = (Yn , Fn ) is a square-
integrable martingale with |ηk | ≤ 2c. By Theorem 5 we have
A⊆ Var(ξnc | Fn−1 ) < ∞ = {Y∞ < ∞} = {Yn →}. (32)
A ∩ {Xn →} = A,
4. Show that the corollary of Theorem 1 remains valid for generalized martingales.
5. Show that every generalized submartingale of class C+ is a local submartingale.
6. Let a_n > 0, n ≥ 1, and let b_n = Σ_{k=1}^{n} a_k. Show that

Σ_{n=1}^{∞} a_n/b_n^2 < ∞.
1. Let (Ω, F ) be a measurable space on which there is defined a family (Fn )n≥1 of
σ-algebras such that F1 ⊆ F2 ⊆ · · · ⊆ F and
F = σ(∪_{n=1}^{∞} F_n).    (1)
Let us suppose that two probability measures P and P̃ are given on (Ω, F ). Let us
write
Pn = P | Fn , P̃n = P̃ | Fn
for the restrictions of these measures to Fn , i.e., let Pn and P̃n be measures on
(Ω, F_n), and for B ∈ F_n let P_n(B) = P(B), P̃_n(B) = P̃(B).
The fundamental question that we shall consider in this section is the determination of conditions under which local absolute continuity P̃ ≪_loc P implies one of the properties P̃ ≪ P, P̃ ∼ P, P̃ ⊥ P. It will become clear that martingale theory is the mathematical apparatus that lets us give definitive answers to these questions.
Recall that the problems of absolute continuity and singularity were considered
in Sect. 9, Chap. 3, Vol. 1, for arbitrary probability measures. It was shown that the
corresponding tests could be stated in terms of the Hellinger integrals (Theorems 2
and 3 therein). The results about absolute continuity and singularity for locally ab-
solutely continuous measures to be stated below could be obtained using those tests.
This approach is revealed in the monographs [34, 43]. Here we prefer another pre-
sentation, which enables us to better illustrate the possibilities of using the results on
the sets of convergence of submartingales obtained in Sect. 5. (Note that throughout
this section we assume the property of local absolute continuity. This is done only
to simplify the presentation. The reader is referred to [34, 43] for the general case.)
Let us then suppose that P̃ ≪loc P. Denote by

  z_n = dP̃_n / dP_n

the density (Radon–Nikodým derivative) of P̃_n with respect to P_n.
Theorem 1. Let P̃ ≪loc P.
(a) Then with ½(P̃ + P)-probability 1 there exists the limit lim_n z_n, to be denoted by z_∞, such that

  P(z_∞ = ∞) = 0.

(b) The Lebesgue decomposition

  P̃(A) = ∫_A z_∞ dP + P̃(A ∩ {z_∞ = ∞}), A ∈ F, (3)

holds, and the measures P̃(A ∩ {z_∞ = ∞}) and P(A), A ∈ F, are singular.
PROOF. Let us notice first that, according to the classical Lebesgue decomposition (see (29) in Sect. 9, Chap. 3, Vol. 1) of an arbitrary probability measure P̃ with respect to a probability measure P, the following representation holds:

  P̃(A) = ∫_A (𝔷̃/𝔷) dP + P̃(A ∩ {𝔷 = 0}), A ∈ F, (4)

where

  𝔷 = dP/dQ, 𝔷̃ = dP̃/dQ,

and the measure Q can be taken, for example, to be Q = ½(P + P̃). Conclusion (3) can be thought of as a specialization of decomposition (4) under the assumption that P̃ ≪loc P, i.e., P̃_n ≪ P_n.
Let

  𝔷_n = dP_n/dQ_n, 𝔷̃_n = dP̃_n/dQ_n, Q_n = ½(P_n + P̃_n).

The sequences (𝔷_n, F_n) and (𝔷̃_n, F_n) are martingales with respect to Q such that 0 ≤ 𝔷_n ≤ 2, 0 ≤ 𝔷̃_n ≤ 2. Therefore, by Theorem 2, Sect. 4, there exist the limits

  𝔷_∞ = lim_n 𝔷_n, 𝔷̃_∞ = lim_n 𝔷̃_n (Q-a.s.). (5)

Then we obtain by Carathéodory's theorem (Sect. 3, Chap. 2, Vol. 1) that for any A ∈ F = σ(⋃_n F_n)

  ∫_A 𝔷_∞ dQ = P(A), ∫_A 𝔷̃_∞ dQ = P̃(A),

6 Absolute Continuity and Singularity of Probability Distributions. . . 167

i.e., dP/dQ = 𝔷_∞ and dP̃/dQ = 𝔷̃_∞.
Thus, we have established the result that was to be expected: If the measures P and Q are defined on F = σ(⋃ F_n) and P_n, Q_n are the restrictions of these measures to F_n, then

  lim_n (dP_n/dQ_n) = dP/dQ

(Q-a.s. and in L¹(Ω, F, Q)). Similarly,

  lim_n (dP̃_n/dQ_n) = dP̃/dQ.

In the special case under consideration, where P̃_n ≪ P_n, n ≥ 1, it is not hard to show that (Q-a.s.)

  z_n = 𝔷̃_n / 𝔷_n, (6)

and Q{𝔷_n = 0, 𝔷̃_n = 0} ≤ ½[P{𝔷_n = 0} + P̃{𝔷̃_n = 0}] = 0, so that (6) Q-a.s. does not involve an indeterminacy of the form 0/0. An expression of the form a/0 with a > 0 is, as usual, set equal to +∞. It is useful to note that, since (𝔷_n, F_n) is a nonnegative martingale, relation (5) of Sect. 2 implies that if 𝔷_τ = 0, then 𝔷_n = 0 for all n ≥ τ (Q-a.s.). Of course, the same holds also for (𝔷̃_n, F_n). Therefore the points 0 and +∞ are "absorbing states" for the sequence (z_n)_{n≥1}.
It follows from (5) and (6) that the limit

  z_∞ ≡ lim_n z_n = (lim_n 𝔷̃_n)/(lim_n 𝔷_n) = 𝔷̃_∞/𝔷_∞ (7)

exists Q-a.s.
Since P{𝔷_∞ = 0} = ∫_{{𝔷_∞=0}} 𝔷_∞ dQ = 0, we have P{z_∞ = ∞} = 0, which proves conclusion (a).
For the proof of (3) we use the general decomposition (4). In our setup, by what has been proved, we have 𝔷 = dP/dQ = 𝔷_∞, 𝔷̃ = dP̃/dQ = 𝔷̃_∞ (Q-a.s.), hence (4) yields

  P̃(A) = ∫_A (𝔷̃_∞/𝔷_∞) dP + P̃(A ∩ {𝔷_∞ = 0}),

which, in view of (7), coincides with (3), and the measures P̃(A ∩ {z_∞ = ∞}) and P(A), A ∈ F, are singular.
The Lebesgue decomposition (3) implies the following useful tests for absolute
continuity or singularity of locally absolutely continuous probability measures.
Theorem 2. Let P̃ ≪loc P, i.e., P̃_n ≪ P_n, n ≥ 1. Then
2. It is clear from Theorem 2 that the tests for absolute continuity or singularity can
be expressed in terms of either P (verify the equation Ez∞ = 1 or Ez∞ = 0) or P̃
(verify that P̃(z∞ < ∞) = 1 or that P̃(z∞ = ∞) = 1).
By Theorem 5 in Sect. 6, Chap. 2, Vol. 1, the condition Ez∞ = 1 is equivalent to
the uniform integrability (with respect to P) of the family {zn }n≥1 . This allows us
to give simple sufficient conditions for the absolute continuity P̃ P. For example,
if

  sup_n E[z_n log⁺ z_n] < ∞ (12)

or if

  sup_n E z_n^{1+ε} < ∞ for some ε > 0, (13)

then, by Lemma 3 in Sect. 6, Chap. 2, Vol. 1, the family of random variables {z_n}_{n≥1} is uniformly integrable, and therefore P̃ ≪ P.
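Condition (13) can be checked by hand in a simple Gaussian-shift model (this model is an illustrative assumption, not the text's example): under P the coordinates ξ_k are i.i.d. N(0,1), while under P̃ they are N(a_k, 1). Then z_n = exp{∑_{k≤n} a_k ξ_k − ½∑_{k≤n} a_k²}, E z_n = 1, and E z_n² = exp{∑_{k≤n} a_k²}, so (13) holds with ε = 1 whenever ∑ a_k² < ∞. A minimal Monte Carlo sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 50_000
a = 1.0 / np.arange(1, n + 1)            # shifts with sum(a_k^2) < infinity
xi = rng.standard_normal((m, n))         # samples of (xi_1, ..., xi_n) under P
z = np.exp(xi @ a - 0.5 * np.sum(a**2))  # z_n = exp(sum a_k xi_k - (1/2) sum a_k^2)

print(z.mean())                          # E z_n = 1 (martingale property)
print((z**2).mean(), np.exp(np.sum(a**2)))  # E z_n^2 = exp(sum a_k^2): condition (13) with eps = 1
```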
In many cases, it is preferable to verify the property of absolute continuity or
of singularity using a test in terms of P̃, since then the question is reduced to the
investigation of the probability of the “tail” event {z∞ < ∞}, where one can use
propositions like the zero–one law.
Let us show, by way of illustration, that the Kakutani dichotomy can be deduced
from Theorem 2.
Let ξ = (ξ1 , ξ2 , . . .) and ξ˜ = (ξ˜1 , ξ˜2 , . . .) be sequences of independent random
variables defined on a probability space (Ω, F , P).
Also, let
Pn = P | Bn , P̃n = P̃ | Bn
be the restrictions of P and P̃ to Bn , and let
Then either P̃ ≪ P or P̃ ⊥ P.
PROOF. Condition (14) is evidently equivalent to P̃_n ≪ P_n, n ≥ 1, i.e., P̃ ≪loc P. It is clear that
  z_n = dP̃_n/dP_n = q₁(x₁) · · · q_n(x_n),

where

  q_i(x_i) = (dP_{ξ̃_i}/dP_{ξ_i})(x_i). (15)
Consequently,

  {x : z_∞ < ∞} = {x : log z_∞ < ∞} = { x : ∑_{i=1}^{∞} log q_i(x_i) < ∞ }.

The event {x : ∑_{i=1}^{∞} log q_i(x_i) < ∞} is a tail event. Therefore, by the Kolmogorov zero–one law (Theorem 1, Sect. 1, Chap. 4) the probability P̃{x : z_∞ < ∞} has only two values (0 or 1), and therefore, by Theorem 2, either P̃ ⊥ P or P̃ ≪ P.
This completes the proof of the theorem.
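The dichotomy is easy to watch numerically in a Gaussian product model (an assumed illustration): P makes the coordinates i.i.d. N(0,1) and P̃ makes them independent N(a_k, 1). A standard computation gives E√(q_k(ξ_k)) = exp(−a_k²/8), so the Kakutani-type criterion reduces to ∑ a_k² < ∞ for absolute continuity, and ∑ a_k² = ∞ for singularity. The sketch below contrasts log z_n in the two regimes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5000, 2000
xi = rng.standard_normal((m, n))    # m trajectories under P = product of N(0,1)
k = np.arange(1, n + 1)

def log_zn(a):
    # log z_n = sum_k (a_k xi_k - a_k^2/2): log-density of product N(a_k,1) w.r.t. P
    return xi @ a - 0.5 * np.sum(a**2)

lz_ac = log_zn(1.0 / k)              # sum a_k^2 < inf: P-tilde << P, z_n has a finite positive limit
lz_sing = log_zn(1.0 / np.sqrt(k))   # sum a_k^2 = inf: P-tilde  ⊥ P, z_n -> 0 (P-a.s.)

print(np.mean(lz_ac), np.mean(lz_sing))   # the second drifts to -infinity with n
```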
3. The following theorem provides, in “predictable” terms, a test for absolute conti-
nuity or singularity.
Theorem 4. Let P̃ ≪loc P, and let

  α_n = z_n z⊕_{n−1}, n ≥ 1,

where a⊕ = a^{−1} for a > 0 and 0⊕ = 0. Then

  P̃ ≪ P ⇔ P̃{ ∑_{n=1}^{∞} [1 − E(√α_n | F_{n−1})] < ∞ } = 1, (16)
  P̃ ⊥ P ⇔ P̃{ ∑_{n=1}^{∞} [1 − E(√α_n | F_{n−1})] = ∞ } = 1. (17)
PROOF. Since

  P̃_n{z_n = 0} = ∫_{{z_n=0}} z_n dP = 0,

we have (P̃-a.s.)

  z_n = ∏_{k=1}^{n} α_k = exp{ ∑_{k=1}^{n} log α_k }. (18)
Then

  { −∞ < lim_n ∑_{k=1}^{n} log α_k < ∞ } = { −∞ < lim_n ∑_{k=1}^{n} u(log α_k) < ∞ }. (20)
Let Ẽ denote averaging with respect to P̃, and let η be an F_n-measurable integrable random variable. It follows from the properties of conditional expectations (Problem 4) that

  Ẽ(η | F_{n−1}) = z⊕_{n−1} E(η z_n | F_{n−1}) (P̃-a.s.). (22)

Recalling that α_n = z⊕_{n−1} z_n, we obtain the following useful formula for "recalculation of conditional expectations" (see (44) in Sect. 7, Chap. 2, Vol. 1):
By (23),

  X_n = ∑_{k=1}^{n} u(log α_k),
Hence we find, by combining (19), (20), (22), and (25), that (P̃-a.s.)

  {z_∞ < ∞} = { ∑_{k=1}^{∞} Ẽ[u(log α_k) + u²(log α_k) | F_{k−1}] < ∞ }
            = { ∑_{k=1}^{∞} E[α_k u(log α_k) + α_k u²(log α_k) | F_{k−1}] < ∞ }
and for x ≥ 0 there are constants A and B (0 < A < B < ∞) such that

  A(1 − √x)² ≤ x u(log x) + x u²(log x) + 1 − x ≤ B(1 − √x)². (28)
Hence (16) and (17) follow from (26), (27) and (24), (28).
This completes the proof of the theorem.
Corollary 1. If, for all n ≥ 1, the σ-algebras σ(α_n) and F_{n−1} are independent with respect to P (or P̃), and P̃ ≪loc P, then we have the following dichotomy: either P̃ ≪ P or P̃ ⊥ P. Correspondingly,

  P̃ ≪ P ⇔ ∑_{n=1}^{∞} [1 − E √α_n] < ∞,
  P̃ ⊥ P ⇔ ∑_{n=1}^{∞} [1 − E √α_n] = ∞.
Corollary 2. Let P̃ ≪loc P. Then

  P̃{ ∑_{n=1}^{∞} E(α_n log α_n | F_{n−1}) < ∞ } = 1 ⇒ P̃ ≪ P.
Corollary 4. Let there exist constants A and B such that 0 ≤ A < 1, B ≥ 0, and
P{1 − A ≤ αn ≤ 1 + B} = 1, n ≥ 1.
Then, if P̃ ≪loc P, we have

  P̃ ≪ P ⇔ P̃{ ∑_{n=1}^{∞} E[(1 − α_n)² | F_{n−1}] < ∞ } = 1,
  P̃ ⊥ P ⇔ P̃{ ∑_{n=1}^{∞} E[(1 − α_n)² | F_{n−1}] = ∞ } = 1.
From (34),

  log E(α_n^{1/2} | B_{n−1}) = ½ log [2d_{n−1}/(1 + d²_{n−1})] − [d²_{n−1}/(1 + d²_{n−1})] · [(a_{n−1}(x) − ã_{n−1}(x))/b_{n−1}]².

Since log [2d_{n−1}/(1 + d²_{n−1})] ≤ 0, statement (30) can be written in the form

  P̃ ≪ P ⇔ P̃{ ∑_{n=1}^{∞} ( ½ log [(1 + d²_{n−1})/(2d_{n−1})] + [d²_{n−1}/(1 + d²_{n−1})] · [(a_{n−1}(x) − ã_{n−1}(x))/b_{n−1}]² ) < ∞ } = 1. (35)
The series

  ∑_{n=1}^{∞} log [(1 + d²_{n−1})/(2d_{n−1})] and ∑_{n=1}^{∞} (d_{n−1} − 1)²

converge or diverge simultaneously.
Then it is clear that if P̃ and P are not singular measures, we have P̃ ≪ P. But by hypothesis, P̃_n ∼ P_n, n ≥ 1; hence by symmetry, we have P ≪ P̃. Therefore we have the following theorem.
Theorem 5 (Hájek–Feldman Dichotomy). Let ξ = (ξ₁, ξ₂, . . .) and ξ̃ = (ξ̃₁, ξ̃₂, . . .) be Gaussian sequences whose finite-dimensional distributions are equivalent: P̃_n ∼ P_n, n ≥ 1. Then either P̃ ∼ P or P̃ ⊥ P. Moreover,

  P̃ ∼ P ⇔ ∑_{n=0}^{∞} Ẽ[ (Δ_n(x)/b_n)² + (b̃²_n/b²_n − 1)² ] < ∞,
  P̃ ⊥ P ⇔ ∑_{n=0}^{∞} Ẽ[ (Δ_n(x)/b_n)² + (b̃²_n/b²_n − 1)² ] = ∞. (38)
Lemma. Let β = (β_n)_{n≥1} be a Gaussian sequence defined on (Ω, F, P). Then

  P{ ∑_{n=1}^{∞} β²_n < ∞ } > 0 ⇔ P{ ∑_{n=1}^{∞} β²_n < ∞ } = 1 ⇔ ∑_{n=1}^{∞} E β²_n < ∞. (39)
PROOF. The implications (⇐) are obvious. To establish the implications (⇒), we first suppose that E β_n = 0, n ≥ 1. Here it is enough to show that

  E ∑_{n=1}^{∞} β²_n ≤ [ E exp( −∑_{n=1}^{∞} β²_n ) ]^{−2}, (40)

since then the condition P{∑ β²_n < ∞} = 1 will imply that the right-hand side of (40) is finite. Therefore then ∑_{n=1}^{∞} E β²_n < ∞, and hence P{ ∑_{n=1}^{∞} β²_n < ∞ } = 1 by the implication (⇐).
Select an n ≥ 1. Then it follows from Sects. 11 and 13, Chap. 2, Vol. 1, that there
are independent Gaussian random variables βk,n , k = 1, . . ., r ≤ n, with Eβk,n = 0,
such that
  ∑_{k=1}^{n} β²_k = ∑_{k=1}^{r} β²_{k,n}.
If we write E β²_{k,n} = λ_{k,n}, we easily see that

  E ∑_{k=1}^{r} β²_{k,n} = ∑_{k=1}^{r} λ_{k,n} (41)
and

  E exp( −∑_{k=1}^{r} β²_{k,n} ) = ∏_{k=1}^{r} (1 + 2λ_{k,n})^{−1/2}. (42)
Comparing the right-hand sides of (41) and (42), we obtain

  E ∑_{k=1}^{n} β²_k = E ∑_{k=1}^{r} β²_{k,n} ≤ [ E exp( −∑_{k=1}^{r} β²_{k,n} ) ]^{−2} = [ E exp( −∑_{k=1}^{n} β²_k ) ]^{−2},
Since

  (E β_n)² ≤ 2β²_n + 2(β_n − E β_n)²,

we have ∑_{n=1}^{∞} (E β_n)² < ∞, and therefore

  ∑_{n=1}^{∞} E β²_n = ∑_{n=1}^{∞} (E β_n)² + ∑_{n=1}^{∞} E(β_n − E β_n)² < ∞.
where

  M_n = ∑_{k=0}^{n−1} X_k ξ_{k+1} / V_{k+1},   ⟨M⟩_n = ∑_{k=0}^{n−1} X²_k / V_{k+1}.
  P_θ(⟨M⟩_∞ = ∞) = 1 ⇔ E_θ ⟨M⟩_∞ = ∞,
But

  E_θ X²_k = ∑_{i=0}^{k} θ^{2i} V_{k−i}

and

  ∑_{k=0}^{∞} E_θ X²_k / V_{k+1} = ∑_{k=0}^{∞} (1/V_{k+1}) ∑_{i=0}^{k} θ^{2i} V_{k−i}
   = ∑_{k=0}^{∞} θ^{2k} ∑_{i=k}^{∞} V_{i−k}/V_{i+1} = ∑_{i=0}^{∞} V_i/V_{i+1} + ∑_{k=1}^{∞} θ^{2k} ∑_{i=k}^{∞} V_{i−k}/V_{i+1}. (45)
Hence (44) follows from (43), and therefore, by Theorem 4, the estimator θ̂n , n ≥ 1,
is strongly consistent for every θ.
Necessity. For all θ ∈ R, let P_θ(θ̂_n → θ) = 1. Let us show that if θ₁ ≠ θ₂, the measures P_{θ₁} and P_{θ₂} are singular (P_{θ₁} ⊥ P_{θ₂}). In fact, since the sequence (X₀, X₁, . . .) is Gaussian, by Theorem 5, the measures P_{θ₁} and P_{θ₂} are either singular or equivalent. But they cannot be equivalent, since, if P_{θ₁} ∼ P_{θ₂} but P_{θ₁}(θ̂_n → θ₁) = 1, then also P_{θ₂}(θ̂_n → θ₁) = 1. However, by hypothesis, P_{θ₂}(θ̂_n → θ₂) = 1 and θ₂ ≠ θ₁. Therefore P_{θ₁} ⊥ P_{θ₂} for θ₁ ≠ θ₂.
According to (38),

  P_{θ₁} ⊥ P_{θ₂} ⇔ (θ₁ − θ₂)² ∑_{k=0}^{∞} E_{θ₁} (X²_k / V_{k+1}) = ∞

and

  P₀ ⊥ P_{θ₂} ⇔ ∑_{i=0}^{∞} V_i / V_{i+1} = ∞,
6. PROBLEMS
1. Prove (6).
2. Let P̃n ∼ Pn , n ≥ 1. Show that
3. Let P̃_n ≪ P_n, n ≥ 1, let τ be a stopping time (with respect to (F_n)), and let P̃_τ = P̃ | F_τ and P_τ = P | F_τ be the restrictions of P̃ and P to the σ-algebra F_τ. Show that P̃_τ ≪ P_τ if and only if {τ = ∞} ⊆ {z_∞ < ∞} (P̃-a.s.). (In particular, if P̃{τ < ∞} = 1, then P̃_τ ≪ P_τ.)
4. Prove the “recalculation formulas” (21) and (22).
5. Verify (28), (29), and (32).
6. Prove (34).
7. In Subsection 2, let the sequences ξ = (ξ₁, ξ₂, . . .) and ξ̃ = (ξ̃₁, ξ̃₂, . . .) consist of independent identically distributed random variables. Show that if P_{ξ̃₁} ≪ P_{ξ₁}, then P̃ ≪ P if and only if the measures P_{ξ̃₁} and P_{ξ₁} coincide. If, however, P_{ξ̃₁} ≪ P_{ξ₁} and P_{ξ̃₁} ≠ P_{ξ₁}, then P̃ ⊥ P.
be the first time at which the random walk (Sn ) is found below the boundary g =
g(n). (As usual, τ = ∞ if {·} = ∅.)
It is difficult to discover the exact form of the distribution of the time τ. In the
present section we find the asymptotic form of the probability P(τ > n) as n → ∞,
for a wide class of boundaries g = g(n) and assuming that the ξi are normally
distributed. The method of proof is based on the idea of an absolutely continuous
change of measure together with a number of the properties of martingales and
Markov times that were presented earlier.
Then

  P(τ > n) = exp{ −½ ∑_{k=2}^{n} [Δg(k)]² (1 + o(1)) }, n → ∞. (3)
Before starting the proof, let us observe that (1) and (2) are satisfied if, for example,

  g(n) = an^ν + b, ½ < ν ≤ 1, a > 0, a + b < 0,

or (for sufficiently large n)

  g(n) = n^ν L(n), ½ ≤ ν ≤ 1,

where L(n) is a slowly varying function (e.g., L(n) = C(log n)^β, C > 0, with arbitrary β for ½ < ν < 1 or with β > 0 for ν = ½).
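The leading factor exp{−½∑[Δg(k)]²} in (3) can be compared with a crude Monte Carlo estimate of P(τ > n) for a boundary of the first type. The parameter values below (a = 0.5, ν = 0.75, b = −2) are an arbitrary illustration; since (3) is only an asymptotic statement, at moderate n the agreement is of rough exponential order only:

```python
import numpy as np

rng = np.random.default_rng(3)

def g(n):
    # boundary a*n**nu + b with a = 0.5, nu = 0.75, b = -2 (so a + b < 0), an assumed example
    return 0.5 * n**0.75 - 2.0

N, m = 400, 10_000
k = np.arange(1, N + 1)
S = np.cumsum(rng.standard_normal((m, N)), axis=1)   # random walk with N(0,1) steps
alive = np.logical_and.accumulate(S > g(k), axis=1)  # alive[:, n-1] is the event {tau > n}

dg = np.diff(g(k))                                   # increments Delta g(k), k = 2, ..., N
for n in (100, 400):
    p_hat = alive[:, n - 1].mean()                   # Monte Carlo estimate of P(tau > n)
    p_asy = np.exp(-0.5 * np.sum(dg[: n - 1] ** 2))  # leading factor in (3)
    print(n, p_hat, p_asy)
```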
2. We shall need the following two auxiliary propositions for the proof of Theo-
rem 1.
Let us suppose that ξ1 , ξ2 , . . . is a sequence of independent identically distributed
random variables, ξi ∼ N (0, 1). Let F0 = {∅, Ω}, Fn = σ{ξ1 , . . . , ξn }, and let
α = (αn , Fn−1 ) be a predictable sequence with P(|αn | ≤ C) = 1, n ≥ 1, where C
is a constant. Form the sequence z = (zn , Fn ) with
  z_n = exp{ ∑_{k=1}^{n} α_k ξ_k − ½ ∑_{k=1}^{n} α²_k }, n ≥ 1, (4)

and let P̃_n, n ≥ 1, be the probability measures on (Ω, F_n) with dP̃_n = z_n dP_n. (5)
Lemma 1 (Discrete version of Girsanov’s theorem). With respect to P̃n , the random
variables ξ˜k = ξk − αk , 1 ≤ k ≤ n, are independent and normally distributed,
ξ˜k ∼ N (0, 1).
PROOF. Let Ẽ_n denote the expectation with respect to P̃_n. Then for λ_k ∈ R, 1 ≤ k ≤ n,

  Ẽ_n exp{ i ∑_{k=1}^{n} λ_k ξ̃_k } = E exp{ i ∑_{k=1}^{n} λ_k ξ̃_k } z_n
   = E exp{ i ∑_{k=1}^{n−1} λ_k ξ̃_k } z_{n−1} · E[ exp{ iλ_n(ξ_n − α_n) + α_n ξ_n − α²_n/2 } | F_{n−1} ]
   = E exp{ i ∑_{k=1}^{n−1} λ_k ξ̃_k } z_{n−1} · exp{ −½ λ²_n } = · · · = exp{ −½ ∑_{k=1}^{n} λ²_k }.

Now the desired conclusion follows from Theorem 4, Sect. 12, Chap. 2, Vol. 1.
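Lemma 1 can be checked by simulation: weighting samples drawn under P by z_n from (4) realizes the measure P̃_n, under which ξ̃_k = ξ_k − α_k must have standard normal moments. A sketch with constant (hence predictable) α_k; the value 0.3 is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 5, 400_000
alpha = np.full(n, 0.3)                     # deterministic, hence predictable, |alpha_k| <= C
xi = rng.standard_normal((m, n))            # xi_k ~ N(0,1) i.i.d. under P

z = np.exp(xi @ alpha - 0.5 * np.sum(alpha**2))   # density z_n from (4)
xt = xi - alpha                             # xi-tilde_k = xi_k - alpha_k

# Reweighting by z_n realizes P-tilde_n; under it xi-tilde should be N(0,1) i.i.d.:
mean_t = (z[:, None] * xt).mean(axis=0)     # E-tilde of xi-tilde_k, close to 0
var_t = (z[:, None] * xt**2).mean(axis=0)   # E-tilde of xi-tilde_k^2, close to 1
print(mean_t, var_t)
```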
Lemma 2. Let X = (Xn , Fn )n≥1 be a square-integrable martingale with mean zero
and
σ = min{n ≥ 1 : Xn ≤ −b},
where b is a constant, b > 0. Suppose that
On the set {σ ≤ n} we have −X_σ ≥ b > 0.
Therefore, for n ≥ 1,
which, with (7) and (8), leads to the required inequality with
and
n
lim sup log P(τ > n) [Δg(k)]2 ≤ − 12 . (11)
n→∞
k=2
7 Asymptotics of the Probability of the Outcome 181
For this purpose we consider the (nonrandom) sequence (αn )n≥1 with
α1 = 0, αn = Δg(n), n ≥ 2,
and the probability measures (P̃_n)_{n≥1} defined by (5). Then, by Hölder's inequality (with p > 1, 1/p + 1/q = 1),

  P̃_n(τ > n) = E I(τ > n) z_n ≤ (P(τ > n))^{1/q} (E z^p_n)^{1/p}, (12)
Now let us estimate the probability P̃n (τ > n) that appears on the left-hand side
of (12). We have
  P̃(τ > n) ≥ C/n, (14)

where C is a constant.
Then it follows from (12)–(14) that, for every p > 1,

  P(τ > n) ≥ C_p exp{ −(p/2) ∑_{k=2}^{n} [Δg(k)]² − (p/(p−1)) log n }, (15)

where C_p is a constant. Then (15) implies the lower bound (10) by the hypotheses of the theorem, since p > 1 is arbitrary.
To obtain the upper bound (11), we first observe that since z_n > 0 (P- and P̃-a.s.), we have, by (5),

  P(τ > n) = Ẽ_n I(τ > n) z_n^{−1}, (16)

where Ẽ_n denotes an average with respect to P̃_n.
In the case under consideration α₁ = 0, α_n = Δg(n), n ≥ 2, and therefore for n ≥ 2

  z_n^{−1} = exp{ −∑_{k=2}^{n} Δg(k) · ξ_k + ½ ∑_{k=2}^{n} [Δg(k)]² }.
By the formula for summation by parts (see the proof of Lemma 2 in Sect. 3,
Chap. 4)
  ∑_{k=2}^{n} Δg(k) · ξ_k = Δg(n) · S_n − ∑_{k=2}^{n} S_{k−1} Δ(Δg(k)).

Hence, on the set {τ > n},

  ∑_{k=2}^{n} Δg(k) · ξ_k ≥ Δg(n) · g(n) − ∑_{k=3}^{n} g(k − 1) Δ(Δg(k)) − ξ₁ Δg(2)
   = ∑_{k=2}^{n} [Δg(k)]² + g(1)Δg(2) − ξ₁ Δg(2).
Thus, by (16),

  P(τ > n) ≤ exp{ −½ ∑_{k=2}^{n} [Δg(k)]² − g(1)Δg(2) } Ẽ_n I(τ > n) e^{−ξ₁Δg(2)}
   = exp{ −g(1)Δg(2) } exp{ −½ ∑_{k=2}^{n} [Δg(k)]² } Ẽ_n I(τ > n) e^{−ξ₁Δg(2)},

where

  Ẽ_n I(τ > n) e^{−ξ₁Δg(2)} ≤ E z_n e^{−ξ₁Δg(2)} = E e^{−ξ₁Δg(2)} < ∞.
Therefore

  P(τ > n) ≤ C exp{ −½ ∑_{k=2}^{n} [Δg(k)]² },
f (n) → ∞, n → ∞,
and
  ∑_{k=2}^{n} [Δf(k)]² = o( ∑_{k=1}^{n} f^{−2}(k) ), n → ∞.
Then for
σ = min{n ≥ 1 : |Sn | ≥ f (n)}
8 Central Limit Theorem for Sums of Dependent Random Variables 183
we have

  P(σ > n) = exp{ −(π²/8) ∑_{k=1}^{n} f^{−2}(k) (1 + o(1)) }, n → ∞. (17)
4. PROBLEMS
1. Show that the sequence defined in (4) is a martingale. Is it still true without the
condition |αn | ≤ c (P-a.s.), n ≥ 1?
2. Establish (13).
3. Prove (17).
1. In Sect. 4, Chap. 3, Vol. 1, the central limit theorem for sums Sn = ξn1 + · · · + ξnn ,
n ≥ 1, of random variables ξn1 , . . . , ξnn was established under the assumptions
of their independence, finiteness of second moments, and asymptotic negligibility
of their terms. In this section, we give up both the assumption of independence
and even that of the finiteness of the absolute first-order moments. However, the
asymptotic negligibility of the terms will be retained.
Thus, we suppose that on the probability space (Ω, F , P) there are given stochas-
tic sequences
ξ n = (ξnk , Fkn ), 0 ≤ k ≤ n, n ≥ 1,
with ξ_{n0} = 0, F^n_0 = {∅, Ω}, F^n_k ⊆ F^n_{k+1} ⊆ F (k + 1 ≤ n). We set

  X^n_t = ∑_{k=0}^{[nt]} ξ_{nk}, 0 ≤ t ≤ 1.
Theorem 1. For a given t, 0 < t ≤ 1, let the following conditions be satisfied: for each ε ∈ (0, 1), as n → ∞,

  (A) ∑_{k=1}^{[nt]} P(|ξ_{nk}| > ε | F^n_{k−1}) →ᴾ 0,
  (B) ∑_{k=1}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ ε) | F^n_{k−1}] →ᴾ 0,
  (C) ∑_{k=1}^{[nt]} Var[ξ_{nk} I(|ξ_{nk}| ≤ ε) | F^n_{k−1}] →ᴾ σ²_t, where σ²_t ≥ 0.

Then

  X^n_t →ᵈ N(0, σ²_t).
Remark 1. Hypotheses (A) and (B) guarantee that X^n_t can be represented in the form X^n_t = Y^n_t + Z^n_t with Z^n_t →ᴾ 0 and Y^n_t = ∑_{k=0}^{[nt]} η_{nk}, where the sequence ηⁿ = (η_{nk}, F^n_k) is a martingale difference, E(η_{nk} | F^n_{k−1}) = 0, with |η_{nk}| ≤ c uniformly for 1 ≤ k ≤ n and n ≥ 1. Consequently, in the cases under consideration, the proof reduces to proving the central limit theorem for martingale differences.
In the case where the variables ξ_{n1}, . . . , ξ_{nn} are independent, conditions (A), (B), and (C), with t = 1 and σ² = σ²₁, become

  (a) ∑_{k=1}^{n} P(|ξ_{nk}| > ε) → 0,
  (b) ∑_{k=1}^{n} E[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → 0,
  (c) ∑_{k=1}^{n} Var[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → σ².
These are well known; see the book by Gnedenko and Kolmogorov [33]. Hence
we have the following corollary to Theorem 1.
Corollary. If ξ_{n1}, . . . , ξ_{nn} are independent random variables, n ≥ 1, then

  (a), (b), (c) ⇒ X^n_1 = ∑_{k=1}^{n} ξ_{nk} →ᵈ N(0, σ²).
Remark 2. In hypothesis (C), the case σt2 = 0 is not excluded. Hence, in particular,
d
Theorem 1 yields a convergence condition to the degenerate distribution (Xtn → 0).
Remark 3. The method used to prove Theorem 1 lets us state and prove the follow-
ing more general proposition.
Let 0 = t₀ < t₁ < t₂ < · · · < t_j ≤ 1, 0 = σ²_{t₀} ≤ σ²_{t₁} ≤ σ²_{t₂} ≤ · · · ≤ σ²_{t_j}, and let ε₁, . . . , ε_j be independent Gaussian random variables with zero means and E ε²_k = σ²_{t_k} − σ²_{t_{k−1}}. Form the Gaussian vector (W_{t₁}, . . . , W_{t_j}) with W_{t_k} = ε₁ + · · · + ε_k.
Let conditions (A), (B), and (C) be satisfied for t = t1 , . . . , tj . Then the joint
distribution (Pnt1 ,...,tj ) of the random variables (Xtn1 , . . . , Xtnj ) converges weakly to
the Gaussian distribution P_{t₁,...,t_j} of the variables (W_{t₁}, . . . , W_{t_j}):

  Pⁿ_{t₁,...,t_j} →ʷ P_{t₁,...,t_j}.
of the process W. (For details, see [4, 55, 43].) This result is usually called the
functional central limit theorem or the invariance principle (when ξn1 , . . . , ξnn are
independent, the latter is referred to as the Donsker–Prohorov invariance principle).
2. Theorem 2. 1. Condition (A) is equivalent to the uniform asymptotic negligibility condition

  (A*) max_{1≤k≤[nt]} |ξ_{nk}| →ᴾ 0.

2. Assuming (A) or (A*), condition (C) is equivalent to

  (C*) ∑_{k=0}^{[nt]} [ξ_{nk} − E(ξ_{nk} I(|ξ_{nk}| ≤ 1) | F^n_{k−1})]² →ᴾ σ²_t.

(The value of t in (A*) and (C*) is the same as in (A) and (C).)
For sequences ξⁿ = (ξ_{nk}, F^n_k), 1 ≤ k ≤ n, of square-integrable martingale differences, with the quadratic characteristic

  ⟨X⟩^n_t = ∑_{k=0}^{[nt]} E(ξ²_{nk} | F^n_{k−1}), (2)

one has

  ∑_{k=0}^{[nt]} E(ξ²_{nk} | F^n_{k−1}) →ᴾ σ²_t ⇒ X^n_t →ᵈ N(0, σ²_t), (5)
  ∑_{k=0}^{[nt]} ξ²_{nk} →ᴾ σ²_t ⇒ X^n_t →ᵈ N(0, σ²_t). (6)
3. We have

  X^n_t = ∑_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| ≤ 1) + ∑_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1)
   = ∑_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F^n_{k−1}] + ∑_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1)
   + ∑_{k=0}^{[nt]} { ξ_{nk} I(|ξ_{nk}| ≤ 1) − E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F^n_{k−1}] }. (7)
We define

  B^n_t = ∑_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F^n_{k−1}],
  μ_{nk}(Γ) = I(ξ_{nk} ∈ Γ), (8)
  ν_{nk}(Γ) = P(ξ_{nk} ∈ Γ | F^n_{k−1}),
where Γ is a set from the smallest σ-algebra B₀ = σ(A₀) generated by the system A₀ of sets in R₀ = R \ {0} consisting of finite sums of disjoint intervals (a, b] not containing the point {0}, and P(ξ_{nk} ∈ Γ | F^n_{k−1}) is a regular conditional distribution of ξ_{nk} given the σ-algebra F^n_{k−1}.
which is known as the canonical decomposition of (Xtn , Ftn ). (The integrals are to
be understood as Lebesgue–Stieltjes integrals, defined for every sample point.)
According to (B), we have B^n_t →ᴾ 0. Let us show that (A) implies

  ∑_{k=1}^{[nt]} ∫_{{|x|>1}} |x| dμ_{nk} →ᴾ 0. (10)
We have

  ∑_{k=1}^{[nt]} ∫_{{|x|>1}} |x| dμ_{nk} = ∑_{k=1}^{[nt]} |ξ_{nk}| I(|ξ_{nk}| > 1). (11)

For 0 < δ < 1,

  { ∑_{k=1}^{[nt]} |ξ_{nk}| I(|ξ_{nk}| > 1) > δ } = { ∑_{k=1}^{[nt]} I(|ξ_{nk}| > 1) > δ }, (12)

since each sum is greater than δ if and only if |ξ_{nk}| > 1 for at least one k ≤ [nt].
It is clear that

  ∑_{k=1}^{[nt]} I(|ξ_{nk}| > 1) = ∑_{k=1}^{[nt]} ∫_{{|x|>1}} dμ_{nk} (≡ Uⁿ_{[nt]}).

By (A),

  Vⁿ_{[nt]} ≡ ∑_{k=1}^{[nt]} ∫_{{|x|>1}} dν_{nk} →ᴾ 0, (13)

whence, by the corollary to Theorem 4, Sect. 3,

  Vⁿ_{[nt]} →ᴾ 0 ⇒ Uⁿ_{[nt]} →ᴾ 0. (14)
Note that by the same corollary and the inequality ΔUⁿ_{[nt]} ≤ 1, we also have the converse implication:

  Uⁿ_{[nt]} →ᴾ 0 ⇒ Vⁿ_{[nt]} →ᴾ 0, (15)

which will be needed in the proof of Theorem 2.
The required proposition (10) now follows from (11)–(14).
Thus

  X^n_t = Y^n_t + Z^n_t, (16)

where

  Y^n_t = ∑_{k=1}^{[nt]} ∫_{{|x|≤1}} x d(μ_{nk} − ν_{nk}), (17)

and

  Z^n_t = B^n_t + ∑_{k=1}^{[nt]} ∫_{{|x|>1}} x dμ_{nk} →ᴾ 0. (18)
Write

  Y^n_t = γⁿ_{[nt]}(ε) + Δⁿ_{[nt]}(ε), ε ∈ (0, 1],

where

  γⁿ_{[nt]}(ε) = ∑_{k=1}^{[nt]} ∫_{{ε<|x|≤1}} x d(μ_{nk} − ν_{nk}), (20)
  Δⁿ_{[nt]}(ε) = ∑_{k=1}^{[nt]} ∫_{{|x|≤ε}} x d(μ_{nk} − ν_{nk}). (21)
  ⟨Δⁿ(ε)⟩_k = ∑_{i=1}^{k} [ ∫_{{|x|≤ε}} x² dν_{ni} − ( ∫_{{|x|≤ε}} x dν_{ni} )² ]
   = ∑_{i=1}^{k} Var[ξ_{ni} I(|ξ_{ni}| ≤ ε) | F^n_{i−1}].
Because of (C),

  ⟨Δⁿ(ε)⟩_{[nt]} →ᴾ σ²_t.

Hence, for every ε ∈ (0, 1],

  max{ |γⁿ_{[nt]}(ε)|, |⟨Δⁿ(ε)⟩_{[nt]} − σ²_t| } →ᴾ 0,

and therefore (choosing, by Problem 2, a suitable sequence ε_n ↓ 0) it remains to show that

  Mⁿ_{[nt]} →ᵈ N(0, σ²_t), (22)
where

  M^n_k = Δⁿ_k(ε_n) = ∑_{i=1}^{k} ∫_{{|x|≤ε_n}} x d(μ_{ni} − ν_{ni}). (23)
For Γ ∈ B₀, let μ̃_{nk}(Γ) = I(ΔM^n_k ∈ Γ) and ν̃_{nk}(Γ) = P(ΔM^n_k ∈ Γ | F^n_{k−1}). Since |ΔM^n_k| ≤ 2ε_n, we can write M^n_k in the form

  M^n_k = ∑_{i=1}^{k} ΔM^n_i = ∑_{i=1}^{k} ∫_{{|x|≤2ε_n}} x dμ̃_{ni}.
and
  E^n_k(Gⁿ) = ∏_{j=1}^{k} (1 + ΔG^n_j).
Observe that

  1 + ΔG^n_k = 1 + ∫_{{|x|≤2ε_n}} (e^{iλx} − 1) dν̃_{nk} = E[exp(iλΔM^n_k) | F^n_{k−1}],

and consequently,

  E^n_k(Gⁿ) = ∏_{j=1}^{k} E[exp(iλΔM^n_j) | F^n_{j−1}].
j=1
By the lemma to be proved in Subsection 4, (24) will follow if, for every real λ,

  |Eⁿ_{[nt]}(Gⁿ)| = ∏_{j=1}^{[nt]} |E[exp(iλΔM^n_j) | F^n_{j−1}]| ≥ c(λ) > 0 (25)

and

  Eⁿ_{[nt]}(Gⁿ) →ᴾ exp(−½ λ² σ²_t). (26)
To see this, we represent E^n_k(Gⁿ) in the form

  E^n_k(Gⁿ) = exp(G^n_k) · ∏_{j=1}^{k} (1 + ΔG^n_j) exp(−ΔG^n_j).
(Compare the function Et (A) defined by (76) of Sect. 6, Chap. 2, Vol. 1.)
Since

  ∫_{{|x|≤2ε_n}} x dν̃_{nj} = E(ΔM^n_j | F^n_{j−1}) = 0,

we have

  G^n_k = ∑_{j=1}^{k} ∫_{{|x|≤2ε_n}} (e^{iλx} − 1 − iλx) dν̃_{nj}. (27)
Therefore

  |ΔG^n_k| ≤ ∫_{{|x|≤2ε_n}} |e^{iλx} − 1 − iλx| dν̃_{nk} ≤ ½ λ² ∫_{{|x|≤2ε_n}} x² dν̃_{nk} ≤ ½ λ² (2ε_n)² → 0 (28)

and

  ∑_{j=1}^{k} |ΔG^n_j| ≤ ½ λ² ∑_{j=1}^{k} ∫_{{|x|≤2ε_n}} x² dν̃_{nj} = ½ λ² ⟨Mⁿ⟩_k. (29)
By (C),

  ⟨Mⁿ⟩_{[nt]} →ᴾ σ²_t. (30)

Suppose first that ⟨Mⁿ⟩_{[nt]} ≤ a (P-a.s.). Then, by (28), (29), and Problem 3,

  ∏_{k=1}^{[nt]} (1 + ΔG^n_k) exp(−ΔG^n_k) →ᴾ 1, n → ∞,
But

  |e^{iλx} − 1 − iλx + ½λ²x²| ≤ ⅙ |λx|³,

and therefore

  ∑_{k=1}^{[nt]} ∫_{{|x|≤2ε_n}} |e^{iλx} − 1 − iλx + ½λ²x²| dν̃_{nk} ≤ ⅙ |λ|³ (2ε_n) ∑_{k=1}^{[nt]} ∫_{{|x|≤2ε_n}} x² dν̃_{nk}
But

  log(1 − ½λ²Δ⟨Mⁿ⟩_j) ≥ − (½λ²Δ⟨Mⁿ⟩_j) / (1 − ½λ²Δ⟨Mⁿ⟩_j)

and Δ⟨Mⁿ⟩_j ≤ (2ε_n)² ↓ 0, n → ∞. Therefore there is an n₀ = n₀(λ) such that for all n ≥ n₀(λ),

  |E^n_k(Gⁿ)| ≥ exp{−λ²⟨Mⁿ⟩_k},

and therefore

  |Eⁿ_{[nt]}(Gⁿ)| ≥ exp{−λ²⟨Mⁿ⟩_{[nt]}} ≥ e^{−λ²a}.
Hence the theorem is proved under the assumption that ⟨Mⁿ⟩_{[nt]} ≤ a (P-a.s.). To remove this assumption, we proceed as follows. Let

  τ_n = min{ k ≤ [nt] : ⟨Mⁿ⟩_k ≥ σ²_t + 1 },

taking τ_n = ∞ if ⟨Mⁿ⟩_{[nt]} < σ²_t + 1. Then, for M̄^n_k = M^n_{k∧τ_n}, we have

  ⟨M̄ⁿ⟩_{[nt]} = ⟨Mⁿ⟩_{[nt]∧τ_n} ≤ 1 + σ²_t + (2ε_n)² ≤ 1 + σ²_t + (2ε₁)² (= a),
But

  lim_n |E{ exp(iλM^n_{[nt]}) − exp(iλM̄^n_{[nt]}) }| ≤ 2 lim_n P(τ_n < ∞) = 0.

Consequently,

  lim_n E exp(iλM^n_{[nt]}) = lim_n E{ exp(iλM^n_{[nt]}) − exp(iλM̄^n_{[nt]}) } + lim_n E exp(iλM̄^n_{[nt]}) = exp(−½λ²σ²_t).
The proof of this is similar to that of (24), replacing (Mkn , Fkn ) by the square-
integrable martingales (M̂kn , Fkn ),
  M̂^n_k = ∑_{i=1}^{k} ν_i ΔM^n_i,
4. In this subsection we prove a simple lemma that lets us reduce the verification of
(24) to the verification of (25) and (26).
Let ηⁿ = (η_{nk}, F^n_k), 1 ≤ k ≤ n, n ≥ 1, be stochastic sequences, let

  Yⁿ = ∑_{k=1}^{n} η_{nk},

and let

  𝓔ⁿ(λ) = ∏_{k=1}^{n} E[exp(iλη_{nk}) | F^n_{k−1}], λ ∈ R,
  E(λ) = E e^{iλYⁿ}, λ ∈ R.
PROOF. Let

  m_n(λ) = e^{iλYⁿ} / 𝓔ⁿ(λ).

Then |m_n(λ)| ≤ c^{−1}(λ) < ∞, and it is easily verified that

  E m_n(λ) = 1.
and

  { ∑_{k=1}^{n} |ξ_{nk}| I(|ξ_{nk}| > ε) > δ } = { ∑_{k=1}^{n} I(|ξ_{nk}| > ε) > δ },

we have

  P{ max_{1≤k≤n} |ξ_{nk}| > ε } ≤ P{ ∑_{k=1}^{n} I(|ξ_{nk}| > ε) > δ } = P{ ∑_{k=1}^{n} ∫_{{|x|>ε}} dμ_{nk} > δ }.
  ∑_{k=1}^{n∧σ_n} I(|ξ_{nk}| ≥ ε/2) = ∑_{k=1}^{n∧σ_n} ∫_{{|x|≥ε/2}} dμ_{nk} →ᴾ 0.

Therefore, by (15),

  ∑_{k=1}^{n∧σ_n} ∫_{{|x|≥ε}} dν_{nk} ≤ ∑_{k=1}^{n∧σ_n} ∫_{{|x|≥ε/2}} dν_{nk} →ᴾ 0,

which, together with the property lim_n P(σ_n < ∞) = 0, proves that (A*) ⇒ (A).
2. Again suppose that t = 1. Choose an ε ∈ (0, 1] and consider the square-integrable martingales Δⁿ(δ) = (Δⁿ_k(δ), F^n_k) (see (21)) with δ ∈ (0, ε]. For the given ε ∈ (0, 1], we have, according to (C),

  ⟨Δⁿ(ε)⟩_n →ᴾ σ²₁.
Let us show that from (C*) and (A) or, equivalently, from (C*) and (A*), it follows that, for every δ ∈ (0, ε],

  [Δⁿ(δ)]_n →ᴾ σ²₁, (36)

where

  [Δⁿ(δ)]_n = ∑_{k=1}^{n} ( ξ_{nk} I(|ξ_{nk}| ≤ δ) − ∫_{{|x|≤δ}} x dν_{nk} )².
But

  | ∑_{k=1}^{n} ( ξ_{nk} − ∫_{{|x|≤1}} x dν_{nk} )² − ∑_{k=1}^{n} ( ξ_{nk} I(|ξ_{nk}| ≤ 1) − ∫_{{|x|≤1}} x dν_{nk} )² |
   ≤ ∑_{k=1}^{n} I(|ξ_{nk}| > 1) ( ξ²_{nk} + 2|ξ_{nk}| · | ∫_{{|x|≤1}} x d(μ_{nk} − ν_{nk}) | )
   ≤ 5 ∑_{k=1}^{n} I(|ξ_{nk}| > 1) ξ²_{nk} ≤ 5 max_{1≤k≤n} ξ²_{nk} · ∑_{k=1}^{n} ∫_{{|x|>1}} dμ_{nk} →ᴾ 0. (38)
Let

  m^n_k(δ) = [Δⁿ(δ)]_k − ⟨Δⁿ(δ)⟩_k, 1 ≤ k ≤ n.

The sequence mⁿ(δ) = (m^n_k(δ), F^n_k) is a square-integrable martingale, and (mⁿ(δ))² is dominated (in the sense of the definition from Sect. 3) by the sequences [mⁿ(δ)] and ⟨mⁿ(δ)⟩.
It is clear that

  [mⁿ(δ)]_n = ∑_{k=1}^{n} (Δm^n_k(δ))² ≤ max_{1≤k≤n} |Δm^n_k(δ)| · { [Δⁿ(δ)]_n + ⟨Δⁿ(δ)⟩_n } ≤ 3δ² { [Δⁿ(δ)]_n + ⟨Δⁿ(δ)⟩_n }. (40)

Since [Δⁿ(δ)] and ⟨Δⁿ(δ)⟩ dominate each other, it follows from (40) that (mⁿ(δ))² is dominated by the sequences 6δ²[Δⁿ(δ)] and 6δ²⟨Δⁿ(δ)⟩.
Hence, if (C) is satisfied, then for sufficiently small δ (e.g., for δ < min{ε, ⅙ b(σ²₁ + 1)})
Since |Δ[Δⁿ(δ)]_k| ≤ (2δ)², the validity of (39) follows from (41) and another appeal to the corollary to Theorem 4 (Sect. 3).
This completes the proof of Theorem 2.
6. PROOF OF THEOREM 3. On account of the Lindeberg condition (L), the equiv-
alence of (C) and (1), and of (C∗ ) and (3), can be established by direct calculation
(Problem 6).
P
Therefore Bnt → 0 by the Lindeberg condition (L).
8. The fundamental theorem of the present section, namely, Theorem 1, was proved
under the hypothesis that the terms that are summed are uniformly asymptotically
infinitesimal. It is natural to ask for conditions of the central limit theorem without
such a hypothesis. For independent random variables, an example of such a theorem
was Theorem 1 in Sect. 5, Chap. 3, Vol. 1 (assuming finite second moments).
We quote (without proof) an analog of this theorem, restricting ourselves to sequences ξⁿ = (ξ_{nk}, F^n_k), 1 ≤ k ≤ n, that are square-integrable martingale differences (E ξ²_{nk} < ∞, E(ξ_{nk} | F^n_{k−1}) = 0).
Let F_{nk}(x) = P(ξ_{nk} ≤ x | F^n_{k−1}) be a regular distribution function of ξ_{nk} given F^n_{k−1}, and let Δ_{nk} = E(ξ²_{nk} | F^n_{k−1}). If

  ∑_{k=0}^{[nt]} Δ_{nk} →ᴾ σ²_t, 0 ≤ σ²_t < ∞, 0 ≤ t ≤ 1,

then

  X^n_t →ᵈ N(0, σ²_t).
9. PROBLEMS
d d d
1. Let ξn = ηn + ζn , n ≥ 1, where ηn → η and ζn → 0. Prove that ξn → η.
P
2. Let (ξn (ε)), n ≥ 1, ε > 0, be a family of random variables such that ξn (ε) → 0
for each ε > 0 as n → ∞. Using, for example, Problem 11 from Sect. 10,
P
Chap. 2, Vol. 1, prove that there is a sequence εn ↓ 0 such that ξn (εn ) → 0.
3. Let (α^n_k), 1 ≤ k ≤ n, n ≥ 1, be complex-valued random variables such that (P-a.s.)

  ∑_{k=1}^{n} |α^n_k| ≤ C,  max_{1≤k≤n} |α^n_k| ≤ a_n ↓ 0.
9 Discrete Version of Itô’s Formula 197
1. In the stochastic analysis of Brownian motion and other related processes (e.g.,
martingales, local martingales, semimartingales) Itô’s change-of-variables formula
plays a key role. In this section, we present a discrete (in time) version of this for-
mula and show briefly how Itô’s formula for Brownian motion could be derived
from it using a limiting procedure.
2. Let X = (Xn )0≤n≤N and Y = (Yn )0≤n≤N be two sequences of random variables
on the probability space (Ω, F , P), X0 = Y0 = 0, and
where
  [X, Y]_n = ∑_{i=1}^{n} ΔX_i ΔY_i (1)
Given the function f = f(x) as in (2), consider the quadratic covariation [X, f(X)] of the sequences X and f(X) = (f(X_n))_{0≤n≤N}. By (1),

  [X, f(X)]_n = ∑_{k=1}^{n} Δf(X_k) ΔX_k = ∑_{k=1}^{n} (f(X_k) − f(X_{k−1}))(X_k − X_{k−1}). (4)
Set

  I_n(X, f(X)) = ∑_{k=1}^{n} f(X_{k−1}) ΔX_k, 1 ≤ n ≤ N, (5)
  Ĩ_n(X, f(X)) = ∑_{k=1}^{n} f(X_k) ΔX_k, 1 ≤ n ≤ N. (6)

Then

  [X, f(X)]_n = Ĩ_n(X, f(X)) − I_n(X, f(X)). (7)

(For n = 0, we set I₀ = Ĩ₀ = 0.)
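Identity (7) is purely algebraic, so it holds to rounding error on any path and for any f; the sketch below uses an arbitrary path and f = tanh (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
X = np.concatenate([[0.0], np.cumsum(rng.standard_normal(N))])  # any path with X_0 = 0
f = np.tanh                                  # any function f; tanh is an arbitrary choice

dX = np.diff(X)
fX = f(X)
bracket = np.sum(np.diff(fX) * dX)           # [X, f(X)]_N from (4)
I_fwd = np.sum(fX[:-1] * dX)                 # "forward integral" I_N, formula (5)
I_bwd = np.sum(fX[1:] * dX)                  # "backward integral" I~_N, formula (6)
print(bracket, I_bwd - I_fwd)                # identity (7): equal up to rounding
```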
For a fixed N, we introduce a new (reversed) sequence X̃ = (X̃_n)_{0≤n≤N} with

  X̃_n = X_{N−n}. (8)

Then, clearly,

  Ĩ_N(X, f(X)) = −I_N(X̃, f(X̃))

and, analogously,

  [X, f(X)]_n = −{ I_N(X̃, f(X̃)) − I_{N−n}(X̃, f(X̃)) } − I_n(X, f(X))
   = − ∑_{k=N−n+1}^{N} f(X̃_{k−1}) ΔX̃_k − ∑_{k=1}^{n} f(X_{k−1}) ΔX_k. (9)
Remark 1. We note that the structures of the right-hand sides of (7) and (9) are dif-
ferent. Equation (7) contains two different forms of “discrete integral.” The integral
In (X, f (X)) is a “forward integral” in the sense that the value f (Xk−1 ) of f at the left
end of the interval [k − 1, k] is multiplied by the increment ΔXk = Xk − Xk−1 on this
interval, whereas in Ĩn (X, f (X)) the increment ΔXk is multiplied by the value f (Xk )
at the right end of [k − 1, k].
Thus, (7) contains both the “forward integral” In (X, f (X)) and the “backward
integral” Ĩn (X, f (X)), while in (9), both integrals are “forward integrals,” over two
different sequences X and X̃.
it is clear that

  F(X_n) = F(X₀) + ∑_{k=1}^{n} g(X_{k−1}) ΔX_k + ½ [X, g(X)]_n
   + ∑_{k=1}^{n} ( (F(X_k) − F(X_{k−1})) − ((g(X_{k−1}) + g(X_k))/2) ΔX_k ). (10)
where

  R_n(X, f(X)) = ∑_{k=1}^{n} ∫_{X_{k−1}}^{X_k} ( f(x) − (f(X_{k−1}) + f(X_k))/2 ) dx. (12)
From analysis, it is well known that if the function f(x) has a continuous second derivative, then the following formula ("trapezoidal rule") holds:

  ∫_a^b ( f(x) − (f(a) + f(b))/2 ) dx = ∫_a^b (f′′(ξ(x))/2!) (x − a)(x − b) dx
   = ((b − a)³/2) ∫_0^1 x(x − 1) f′′(ξ(a + x(b − a))) dx
   = ((b − a)³/2) f′′(η) ∫_0^1 x(x − 1) dx = − ((b − a)³/12) f′′(η),

where ξ(x) and η are "intermediate" points in the interval [a, b].
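A one-interval numeric check of the trapezoidal remainder (f = exp on [0, 0.5] is an arbitrary choice): solving −(b − a)³ f′′(η)/12 for η from the computed remainder shows that the implied intermediate point indeed lies inside [a, b]:

```python
import numpy as np

a, b = 0.0, 0.5
f = np.exp                                    # convenient because f'' = exp as well

exact = np.exp(b) - np.exp(a)                 # integral of f over [a, b]
trap = (b - a) * (f(a) + f(b)) / 2.0          # trapezoidal approximation
rem = exact - trap                            # the one-interval integral in (12); negative here

# remainder = -(b-a)^3 f''(eta)/12 for some eta in [a, b]; solve for eta (f'' = exp):
eta = np.log(-12.0 * rem / (b - a) ** 3)
print(rem, eta)                               # eta is approximately 0.256, inside [0, 0.5]
```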
Thus, in (12),

  R_n(X, f(X)) = −(1/12) ∑_{k=1}^{n} f′′(η_k)(ΔX_k)³,

and hence

  |R_n(X, f(X))| ≤ (1/12) sup_η |f′′(η)| · ∑_{k=1}^{n} |ΔX_k|³, (13)
where the supremum is taken over min(X₀, X₁, . . . , X_n) ≤ η ≤ max(X₀, X₁, . . . , X_n).
We shall refer to formula (11) as the discrete analog of Itô’s formula. We note that
the right-hand side of this formula contains the following three “natural” ingredi-
ents: “the discrete integral” In (X, f (X)), the quadratic covariation [X, f (X)]n , and
the “remainder” term Rn (X, f (X)), which is so termed because it goes to zero in the
limit transition to the continuous time (see Subsection 5 for details).
4.
EXAMPLE 1. If f (x) = a + bx, then Rn (X, f (X)) = 0, and formula (11) takes the
following form:
EXAMPLE 2. Let

  f(x) = sign x = 1 for x > 0, 0 for x = 0, −1 for x < 0,
and let F(x) = |x|.
Let Xk = Sk , where
Sk = ξ1 + ξ2 + · · · + ξk
with ξ1 , ξ2 , . . . independent Bernoulli random variables taking values ±1 with prob-
ability 1/2.
If we also set S₀ = 0, we obtain from (11) that

  |S_n| = ∑_{k=1}^{n} (sign S_{k−1}) ΔS_k + N_n, (15)

where

  N_n = #{0 ≤ k < n : S_k = 0}

is the number of zeroes in the sequence S₀, S₁, . . . , S_{n−1}.
We note that the sequence of discrete integrals ( ∑_{k=1}^{n} (sign S_{k−1}) ΔS_k )_{n≥1} involved in (15) forms a martingale, and therefore

  E |S_n| = E N_n. (16)

Since (Problem 2)

  E |S_n| ∼ √(2n/π), n → ∞, (17)

(16) yields

  E N_n ∼ √(2n/π), n → ∞. (18)
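Formula (15) holds pathwise, so it can be tested exactly on simulated Bernoulli walks, together with (16)–(18) (the sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 400, 20_000
xi = rng.choice([-1, 1], size=(m, n))                # Bernoulli +-1 steps, probability 1/2
S = np.concatenate([np.zeros((m, 1), int), np.cumsum(xi, axis=1)], axis=1)

stoch_int = np.sum(np.sign(S[:, :-1]) * xi, axis=1)  # sum of sign(S_{k-1}) Delta S_k
Nn = np.sum(S[:, :-1] == 0, axis=1)                  # zeroes among S_0, ..., S_{n-1}
assert np.all(np.abs(S[:, -1]) == stoch_int + Nn)    # formula (15), exact for every path

# (16)-(18): E|S_n| = E N_n, both close to sqrt(2n/pi) for large n
print(np.abs(S[:, -1]).mean(), Nn.mean(), np.sqrt(2 * n / np.pi))
```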
  F(B₁) = F(B₀) + ∑_{k=1}^{n} f(B_{(k−1)/n}) ΔB_{k/n} + ½ [f(B_{·/n}), B_{·/n}]_n + R_n(B_{·/n}, f(B_{·/n})). (19)
It is known from the stochastic calculus of Brownian motion (e.g., [75, 32]) that

  ∑_{k=1}^{n} |B_{k/n} − B_{(k−1)/n}|³ →ᴾ 0, n → ∞, (20)

and that the quadratic covariations [f(B_{·/n}), B_{·/n}]_n converge in probability to [B, f(B)]₁. We have here

  [B, f(B)]₁ = ∫_0^1 f′(B_s) ds, (22)
and therefore
and therefore

  F(B₁) = F(0) + ∫_0^1 f(B_s) dB_s + ½ ∫_0^1 f′(B_s) ds, (23)
6. PROBLEMS
1. Prove formula (15).
2. Establish that property (17) is true.
3. Prove formula (22).
4. Try to prove that (24) holds for any F ∈ C2 .
1. The material studied in this section is a good illustration of the fact that the theory
of martingales provides a simple way of estimating the risk faced by an insurance
company.
We shall assume that the evolution of the capital of a certain insurance company
is described by a random process X = (Xt )t≥0 . The initial capital is X0 = u > 0.
Insurance payments arrive continuously at a constant rate c > 0 (in time Δt the
amount arriving is cΔt) and claims are received at random times T1 , T2 , . . . (0 <
T1 < T2 < · · · ), where the amounts to be paid out at these times are described by
nonnegative random variables ξ1 , ξ2 , . . . .
Thus, taking into account receipts and claims, the capital Xt at time t > 0 is
determined by the formula
Xt = u + ct − St , (1)
where

  S_t = ∑_{i≥1} ξ_i I(T_i ≤ t). (2)
We denote by
T = inf{t ≥ 0 : Xt ≤ 0}
the first time at which the insurance company’s capital becomes less than or equal
to zero (“time of ruin”). Of course, if Xt > 0 for all t ≥ 0, then the time T is set
equal to +∞.
10 Application of Martingale Methods to Calculation of Probability of Ruin in Insurance 203
A. The times T1 , T2 , . . . at which claims are received are such that the variables
(T0 ≡ 0)
σi = Ti − Ti−1 , i ≥ 1,
are independent identically distributed random variables having an exponential distribution with density λe^{−λt}, t ≥ 0 (see Table 2.3 in Sect. 3, Chap. 2, Vol. 1).

B. The claim amounts ξ_1, ξ_2, . . . are independent identically distributed nonnegative random variables with a common distribution function F = F(x) and finite mean μ = E ξ_1.

C. The sequences (T1 , T2 , . . .) and (ξ1 , ξ2 , . . .) are independent sequences (in the
sense of Definition 6 of Sect. 5, Chap. 2, Vol. 1).
Denote by
N_t = ∑_{i≥1} I(T_i ≤ t)     (3)
the number of claims received up to time t. By assumption A,
P(N_t < k) = P(σ_1 + · · · + σ_k > t) = e^{−λt} ∑_{i=0}^{k−1} (λt)^i / i!,
whence
P(N_t = k) = e^{−λt} (λt)^k / k!,  k = 0, 1, . . . ,     (4)
i.e., the random variable Nt has the Poisson distribution (see Table 2.2 in Sect. 3,
Chap. 2, Vol. 1) with parameter λt. Here, E Nt = λt.
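A quick simulation (a sketch with illustrative parameters) confirms (4): counting exponential interarrival times up to t reproduces the Poisson(λt) probabilities.

```python
import math
import random

lam, t = 2.0, 3.0            # illustrative rate and horizon
rng = random.Random(5)
n_paths = 50_000
counts = {}
for _ in range(n_paths):
    s, k = rng.expovariate(lam), 0
    while s <= t:            # claim times T_1 < T_2 < ... with exp(lam) gaps
        k += 1
        s += rng.expovariate(lam)
    counts[k] = counts.get(k, 0) + 1

for k in range(4):           # compare empirical frequencies with (4)
    emp = counts.get(k, 0) / n_paths
    poi = math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)
    print(k, round(emp, 4), round(poi, 4))
```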
The Poisson process N = (Nt )t≥0 constructed in this way is a special case of
a renewal process (Subsection 4, Sect. 9, Chap. 2, Vol. 1). The trajectories of this
process are discontinuous (specifically, piecewise-constant, continuous on the right,
and with unit jumps). Like Brownian motion (Sect. 13, Chap. 2, Vol. 1) having con-
tinuous trajectories, this process plays a fundamental role in the theory of random
processes. From these two processes one can build random processes of rather complicated probabilistic structure. (We mention processes with independent increments
as a typical example of these; see, e.g., [31, 75, 68].)
Thus, we see that, in the case under consideration, a natural requirement for an
insurance company to operate with a clear profit (i.e., E(Xt − X0 ) > 0, t > 0) is
that
c > λμ. (5)
In the following analysis, an important role is played by the function
h(z) = ∫_0^∞ (e^{zx} − 1) dF(x),  z ≥ 0,     (6)
Using assumptions A–C and (4), we find that

E e^{−r(X_t − u)} = E e^{−r(ct − S_t)} = e^{−rct} ∑_{n=0}^{∞} E e^{r(ξ_1 + · · · + ξ_n)} P(N_t = n)
  = e^{−rct} ∑_{n=0}^{∞} (1 + h(r))^n e^{−λt} (λt)^n / n! = e^{−rct + λt h(r)} = e^{t g(r)},

where g(r) = λ h(r) − cr.
Let F_t^X = σ(X_s, s ≤ t). Since the process X = (X_t)_{t≥0} is a process with independent increments (Problem 2), we have (P-a.s.)
E(e^{−r(X_t − X_s)} | F_s^X) = e^{(t−s) g(r)},  s ≤ t,     (7)
hence (P-a.s.)
E(e^{−rX_t − t g(r)} | F_s^X) = e^{−rX_s − s g(r)}.     (8)
Using the notation
Zt = e−rXt −tg(r) , t ≥ 0, (9)
we see that property (8) can be rewritten in the form
E(Z_t | F_s^X) = Z_s  (P-a.s.),  s ≤ t,     (10)
i.e., Z = (Z_t)_{t≥0} is a martingale with respect to the filtration (F_t^X)_{t≥0}.
Recall that a nonnegative random variable τ = τ(ω) is called a Markov time (with respect to (F_t^X)_{t≥0}) if, for every t ≥ 0,
{τ(ω) ≤ t} ∈ F_t^X.
It turns out that for martingales with continuous time, which are considered now,
Theorem 1 from Sect. 2 remains valid (with self-evident changes to the notation). In
particular,
E Zt∧τ = E Z0 (11)
for any Markov time τ.
Let τ = T. Then, by virtue of (9), we find from (11) that for any t > 0
e^{−ru} = E Z_0 = E Z_{t∧T} ≥ E [e^{−rX_T − T g(r)} I(T ≤ t)] ≥ min_{0≤s≤t} e^{−s g(r)} · P(T ≤ t),
since X_T ≤ 0 on the set {T ≤ t}. Therefore
P(T ≤ t) ≤ e^{−ru} / min_{0≤s≤t} e^{−s g(r)} = e^{−ru} max_{0≤s≤t} e^{s g(r)}.     (12)
Let us now examine the function
g(r) = λh(r) − cr
in more detail. Clearly, g(0) = 0, g′(0) = λμ − c < 0 (by virtue of (5)), and g″(r) = λh″(r) ≥ 0. Thus, there exists a unique positive value r = R with g(R) = 0.
206 7 Martingales
Integration by parts shows that h(r) = r ∫_0^∞ e^{rx} (1 − F(x)) dx. From this and λh(R) − cR = 0 we conclude that R is the (unique) root of the equation
(λ/c) ∫_0^∞ e^{Rx} (1 − F(x)) dx = 1.     (13)
Let us set r = R in (12). Then, since g(R) = 0, we obtain, for any t > 0,
P(T ≤ t) ≤ e^{−Ru},     (14)
whence
P(T < ∞) ≤ e^{−Ru}.     (15)
Hence we have proved the following theorem.
Theorem (Cramér–Lundberg). Suppose that conditions A–C and (5) are fulfilled. Then the probability of ruin satisfies P(T < ∞) ≤ e^{−Ru}, where R is the positive root of Eq. (13).
4. In the foregoing proof, we used relation (11), which, as we said, follows from
a continuous-time analog of Theorem 1, Sect. 2 (on preservation of the martingale
property under random time change). The proof of this continuous-time result can be
found, for example, in [54, Sect. 3.2]. However, if we assumed that σi , i = 1, 2, . . . ,
had a (discrete) geometric (rather than an exponential) distribution (P{σ_i = k} = q^{k−1} p, k ≥ 1), then Theorem 1 of Sect. 2 would suffice.
The derivations in this section, which appeal to the theory of random processes
with continuous time, demonstrate, in particular, how mathematical models with
continuous time arise in applied problems.
5. PROBLEMS
1. Prove that the process N = (Nt )t≥0 (under assumption A) is a process with
independent increments.
2. Prove that X = (Xt )t≥0 is also a process with independent increments.
3. Consider the Cramér–Lundberg model and obtain an analog of the foregoing
theorem assuming that the variables σi , i = 1, 2, . . ., have a geometric (rather
than exponential) distribution (P(σ_i = k) = q^{k−1} p, k = 1, 2, . . .).
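The bound (15) can also be checked by simulation. The sketch below uses illustrative parameters and exponential claim sizes, for which the root of (13) is available in closed form (R = 1/μ − λ/c); it estimates the ruin frequency over a long but finite horizon and compares it with e^{−Ru}.

```python
import math
import random

def ruin_frequency(u, c, lam, mu, horizon, n_paths, seed=6):
    # Monte Carlo for the Cramer-Lundberg model: X_t = u + c*t - S_t with
    # Poisson(lam) claim arrivals and exponential(mean mu) claim sizes.
    # Ruin can occur only at claim instants, so only those are inspected.
    rng = random.Random(seed)
    ruined = 0
    for _ in range(n_paths):
        t, s = 0.0, 0.0
        while True:
            t += rng.expovariate(lam)       # next claim time (assumption A)
            if t > horizon:
                break
            s += rng.expovariate(1.0 / mu)  # claim amount (assumption B)
            if u + c * t - s <= 0:          # X_t <= 0: ruin
                ruined += 1
                break
    return ruined / n_paths

u, c, lam, mu = 10.0, 1.5, 1.0, 1.0         # note c > lam * mu, cf. (5)
R = 1.0 / mu - lam / c                      # root of (13) for exponential F
bound = math.exp(-R * u)                    # right-hand side of (15)
freq = ruin_frequency(u, c, lam, mu, horizon=200.0, n_paths=4000)
print(round(freq, 4), "<=", round(bound, 4))
```

The estimated ruin frequency stays below the Cramér–Lundberg bound, as (15) requires.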
11 Fundamental Theorems 207
1. In the previous section we applied the martingale theory to the proof of the
Cramér–Lundberg theorem, which is a basic result of the mathematical theory of
insurance. In this section the martingale theory will be applied to the problem of ab-
sence of arbitrage in a financial market in the situation of stochastic indeterminacy.
In what follows, Theorems 1 and 2, which are called the fundamental theorems of
arbitrage theory in stochastic financial mathematics, are of particular interest be-
cause they state conditions for the absence of arbitrage in martingale terms (in a
sense to be explained later) in the markets under consideration as well as conditions
that guarantee the possibility of meeting financial obligations. (For a more detailed
exposition of the financial mathematics, see [71].)
2. Let us give some definitions. It will be assumed throughout that we are given a
filtered probability space (Ω, F , (Fn )n≥0 , P), which describes the stochastic inde-
terminacy of the evolution of prices, financial indexes, and other financial indicators.
The totality of events in Fn will be interpreted as the information available at time n
(inclusive). For example, F_n may comprise information about the particular values of some financial assets or financial indexes.
The main object of the fundamental theorems will be the concept of a (B, S)-
market, defined as follows.
Let B = (Bn )n≥0 and S = (Sn )n≥0 be positive random sequences. It is assumed
that Bn for every n ≥ 0 is Fn−1 -measurable, whereas Sn is Fn -measurable. For
simplicity, we assume that the initial σ-algebra F0 is trivial, i.e., F0 = {∅, Ω}
(Sect. 2, Chap. 2, Vol. 1). Therefore B0 and S0 are constants. In the terminology of
Sect. 1, B = (Bn )n≥0 and S = (Sn )n≥0 are stochastic sequences, and moreover, the
sequence B = (Bn )n≥0 is predictable (since Bn are Fn−1 -measurable).
The financial meaning of B = (Bn )n≥0 is that it describes the evolution of a bank
account with initial value B0 . The fact that Bn is Fn−1 -measurable means that the
state of the bank account at time n (say, “today”) becomes already known at time
n − 1 (“yesterday”).
If we let
r_n = ΔB_n / B_{n−1},  n ≥ 1,     (1)
with ΔBn = Bn − Bn−1 , then we obviously get
Bn = (1 + rn )Bn−1 , n ≥ 1, (2)
where rn are Fn−1 -measurable and satisfy rn > −1 (since Bn > 0 by assumption).
In the financial literature rn are called the (bank) interest rates.
The sequence S = (Sn )n≥0 differs from B = (Bn )n≥0 in that Sn is Fn -measurable,
in contrast to the Fn−1 -measurability of Bn . This reflects the situation with stock
prices, whose actual value at time n becomes known only when it is announced (i.e.,
“today” rather than “yesterday” as for a bank account).
Similarly to the bank interest rate, we can define the market interest rate
ρ_n = ΔS_n / S_{n−1},  n ≥ 1.     (3)
By definition, the pair of processes B = (Bn )n≥0 and S = (Sn )n≥0 introduced
in the foregoing form a financial (B, S)-market consisting of two assets, the bank
account B and the stock S.
Remark. It is clear that this (B, S)-market is merely a simple model of real financial
markets, which usually consist of many assets of a diverse nature (e.g., [71]). Nev-
ertheless, even this simple example demonstrates that the methods of martingale
theory are very efficient in the treatment of many issues of a financial and eco-
nomic nature (including, for example, the question about the absence of arbitrage in
a (B, S)-market, which will be solved by the first fundamental theorem.)
3. Now we provide a definition of an investment portfolio and its value and define
the important notion of a self-financing portfolio.
Let (Ω, F , (Fn )n≥0 , P) be a basic filtered probability space with F0 = {∅, Ω},
and let π = (β, γ) be a pair of predictable sequences β = (βn )n≥0 , γ = (γn )n≥0 . We
impose no other restrictions on βn and γn , n ≥ 0, except that they are predictable,
i.e., Fn−1 -measurable (F−1 = F0 ). In particular, they can take fractional and
negative values.
The meaning of βn is the amount of “units” in a bank account, and that of γn is
the amount of shares in an investor’s possession at time n.
We will call π = (β, γ) the investment portfolio in the (B, S)-market under con-
sideration.
We associate with each portfolio π = (β, γ) the corresponding value X^π = (X_n^π)_{n≥0} by setting
X_n^π = β_n B_n + γ_n S_n     (7)
and interpreting βn Bn as the amount of money in the bank account and γn Sn as the
total price of the stock at time n. The intuitive meaning of the predictability of β and
γ is also clear: the investment portfolio “for tomorrow” must be composed “today.”
The following important notion of a self-financing portfolio expresses the idea of
considering the (B, S)-markets that admit neither outflow nor inflow of capital. The
formal definition is as follows.
Using the formula of discrete differentiation (Δ(a_n b_n) = a_n Δb_n + b_{n−1} Δa_n), we find that the increment ΔX_n^π (= X_n^π − X_{n−1}^π) of the value is representable as
ΔX_n^π = (β_n ΔB_n + γ_n ΔS_n) + (B_{n−1} Δβ_n + S_{n−1} Δγ_n).     (8)
The real change of the value may be caused only by market-based changes in the bank account and the stock price, related to the quantity β_n ΔB_n + γ_n ΔS_n. The second expression on the right-hand side of (8), i.e., B_{n−1} Δβ_n + S_{n−1} Δγ_n, is F_{n−1}-measurable and reflects only a reallocation of the capital X_{n−1}^π between the two assets at time n, not a market change. Therefore it must be equal to zero.
In general, the value can vary not only because of market-based changes in in-
terest rates (rn and ρn , n ≥ 1) but also due to, say, inflow of capital from outside
or outflow of capital for operating expenditures, and so on. We will not take into
account such possibilities; in addition, we will consider (in accordance with the
foregoing discussion) only portfolios π = (β, γ) satisfying the condition
B_{n−1} Δβ_n + S_{n−1} Δγ_n = 0     (9)
for all n ≥ 1.
In stochastic financial mathematics such portfolios are called self-financing.
4. It follows from (9) that a self-financing portfolio π = (β, γ) satisfies
X_n^π = X_0^π + ∑_{k=1}^{n} (β_k ΔB_k + γ_k ΔS_k),     (10)
and since
Δ(X_n^π / B_n) = γ_n Δ(S_n / B_n),     (11)
we have
X_n^π / B_n = X_0^π / B_0 + ∑_{k=1}^{n} γ_k Δ(S_k / B_k).     (12)
Let us fix an N ≥ 1 and consider the evolution of the (B, S)-market at times
n = 0, 1, . . . , N.
Definition 2. We say that there is no arbitrage on the (B, S)-market (at time N), or
that this market is arbitrage-free if, for any portfolio π = (β, γ) with X_0^π = 0 and P{X_N^π ≥ 0} = 1, it holds that P{X_N^π = 0} = 1, i.e., the event {X_N^π > 0} may occur only with zero P-probability.
The financial meaning of these definitions is that it is impossible to obtain any
risk-free income in an arbitrage-free market.
Clearly, the property of a (B, S)-market to be arbitrage-free, and hence to be in
a certain sense “fair” or “rational,” depends on the probabilistic properties of the
sequences B = (Bn )n≤N and S = (Sn )n≤N , as well as on the assumptions regarding
the structure of the filtered probability space (Ω, F , (Fn )n≤N , P).
Remarkably, the theory of martingales enables us to effectively state conditions
that guarantee the absence of arbitrage opportunities.
Theorem 1 (First Fundamental Theorem). Assume that stochastic indeterminacy is
described by a filtered probability space (Ω, F , (Fn )n≤N , P) with F0 = {∅, Ω},
FN = F .
A (B, S)-market defined on (Ω, F , (Fn )n≤N , P) is arbitrage-free if and only if
there exists a measure P̃ on (Ω, F) equivalent to P (P̃ ∼ P) such that the discounted sequence S/B = (S_n/B_n)_{n≤N} is a martingale with respect to this measure, i.e.,
Ẽ (S_n/B_n) < ∞,  n ≤ N,
and
Ẽ(S_n/B_n | F_{n−1}) = S_{n−1}/B_{n−1},  n ≤ N,
where Ẽ is the expectation with respect to P̃.
Remark 1. The statement of the theorem remains valid also for vector processes
S = (S1 , . . . , Sd ) with d < ∞ [71, Chap. V, Sect. 2b].
Remark 2. For obvious reasons, the measure P̃ involved in the theorem is called the martingale measure.
Denote by M(P) = {P̃ ∼ P : S/B is a P̃-martingale} the class of measures P̃ which are equivalent to P and such that the sequence S/B = (S_n/B_n)_{n≤N} is a martingale with respect to P̃.
We will write NA for the absence of arbitrage (no arbitrage). Using this notation, the conclusion of Theorem 1 can be written as
NA ⇐⇒ M(P) ≠ ∅.     (13)
PROOF. Sufficiency. Let P̃ ∈ M(P), and let π = (β, γ) be a self-financing portfolio with X_0^π = 0. Then, by (12),
X_n^π / B_n = ∑_{k=1}^{n} γ_k Δ(S_k / B_k),  1 ≤ n ≤ N.     (14)
The sequence S/B = (S_k/B_k)_{k≤N} is a P̃-martingale; therefore the sequence G^π = (G_n^π)_{0≤n≤N} with G_0^π = 0 and G_n^π = ∑_{k=1}^{n} γ_k Δ(S_k/B_k), 1 ≤ n ≤ N, is a martingale transform. Hence the sequence (X_n^π/B_n)_{0≤n≤N} is also a martingale transform.
When testing for arbitrage or its absence, we must consider portfolios π such that not only X_0^π = 0 but also X_N^π ≥ 0 (P-a.s.). Since P̃ ∼ P and B_N > 0 (P- and P̃-a.s.), we obtain that P̃{X_N^π/B_N ≥ 0} = 1.
Then, applying Theorem 3 in Sect. 1 to the martingale transform (X_n^π/B_n)_{0≤n≤N}, we obtain that this sequence is in fact a P̃-martingale. Thus, Ẽ(X_N^π/B_N) = Ẽ(X_0^π/B_0) = 0, and since P̃{X_N^π/B_N ≥ 0} = 1, we have P̃{X_N^π/B_N = 0} = 1.
Hence we see that X_N^π = 0 (P̃- and P-a.s.), and therefore X_N^π = 0 (P-a.s.) for any self-financing portfolio π with X_0^π = 0 and X_N^π ≥ 0 (P-a.s.), which by definition means the absence of arbitrage opportunities.
Necessity. We will give the proof only for the one-step model of a (B, S)-market,
i.e., for N = 1. But even this simple case will enable us to demonstrate the idea
of the proof, which consists in an explicit construction of a martingale measure
using the absence of arbitrage. We will construct this measure using the Esscher
transform (see subsequent discussion). (For the proof in the general case N ≥ 1 see
[71, Chapter V, Sect. 2d].)
Without loss of generality we can assume that B_0 = B_1 = 1. In the current setup, the absence of arbitrage opportunities reduces (Problem 1) to the condition
P(ΔS_1 > 0) > 0 and P(ΔS_1 < 0) > 0     (16)
(excluding the trivial case ΔS_1 = 0 (P-a.s.)).
Lemma 1. Let (Ω, F) = (R, B(R)), and let X = X(ω) be the coordinate random variable (X(ω) = ω). Let P be a probability measure on (Ω, F) such that P(X > 0) > 0 and P(X < 0) > 0. Then there exists a probability measure P̃ ∼ P such that
Ẽ e^{aX} < ∞  for all a ∈ R.     (17)
In particular, Ẽ|X| < ∞ and, moreover,
Ẽ X = 0.     (18)
PROOF. Define the measure Q = Q(dx) with Q(dx) = c e^{−x²} P(dx), where c is a normalizing constant, and set ϕ(a) = ∫ e^{ax} Q(dx) (finite for every a). Then
P̃_a(dx) = (e^{ax}/ϕ(a)) Q(dx)
is a probability measure for any real a. Clearly, P̃_a ∼ Q ∼ P.
Remark 3. The transformation Q(dx) → (e^{ax}/ϕ(a)) Q(dx) is known as the Esscher transform. As we will see later, the measure P̃ = P̃_{a*} for a certain value a* possesses the martingale property (18). This measure is referred to as the Esscher measure or the martingale Esscher measure.
Now we return to the proof of Theorem 1. The function ϕ = ϕ(a) defined for all real a is strictly convex, since ϕ″(a) > 0. Let ϕ* = inf{ϕ(a) : a ∈ R}. The following two cases are possible: (i) there exists a* such that ϕ(a*) = ϕ*, and (ii) there is no such (finite) a*.
In the first case, ϕ′(a*) = 0. Therefore
E_{P̃_{a*}} X = E_Q [X e^{a*X}/ϕ(a*)] = ϕ′(a*)/ϕ(a*) = 0,
i.e., P̃_{a*} possesses property (18).
EXAMPLE 1. Suppose that the (B, S)-market is described by (5) and (6) with 1 ≤
k ≤ N, where rk = r (a constant) for all 1 ≤ k ≤ N and ρ = (ρ1 , ρ2 , . . . , ρN ) is a
sequence of independent identically distributed Bernoulli random variables taking two values, b and a, with −1 < a < r < b.
This model of a (B, S)-market is known as the CRR model, after the names of its authors J. C. Cox, S. A. Ross, and M. Rubinstein; for more details see [71].
Since in this model
S_n/B_n = ((1 + ρ_n)/(1 + r)) · S_{n−1}/B_{n−1},
it is clear that the martingale measure P̃ must satisfy
Ẽ [(1 + ρ_n)/(1 + r)] = 1,
i.e., Ẽ ρ_n = r.
Use the notation p̃ = P̃{ρ_n = b}, q̃ = P̃{ρ_n = a}; then for any n ≥ 1
p̃ + q̃ = 1,  b p̃ + a q̃ = r.
Hence
p̃ = (r − a)/(b − a),  q̃ = (b − r)/(b − a).     (23)
In this case the whole “randomness” is determined by the Bernoulli sequence
ρ = (ρ1 , ρ2 , . . . , ρN ). We let Ω = {a, b}N , i.e., we assume that the space of elemen-
tary outcomes consists of sequences (x1 , . . . , xN ) with xi = a or b. (Assuming this
specific “coordinate” structure of Ω does not restrict generality; in this connection,
see the end of the proof of sufficiency in Theorem 2, Subsection 6.)
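A few lines of arithmetic (with illustrative parameter values, not from the book) confirm that p̃ and q̃ from (23) make the discounted price a martingale:

```python
# Illustrative CRR parameters with -1 < a < r < b.
a, b, r = -0.1, 0.2, 0.05

p = (r - a) / (b - a)    # p~ = P~{rho_n = b}, cf. (23)
q = (b - r) / (b - a)    # q~ = P~{rho_n = a}

assert abs(p + q - 1) < 1e-9                            # probability measure
assert abs(b * p + a * q - r) < 1e-9                    # E~ rho_n = r
assert abs((p * (1 + b) + q * (1 + a)) / (1 + r) - 1) < 1e-9
# The last identity is E~[(1 + rho_n)/(1 + r)] = 1: S_n/B_n is a P~-martingale.
print(p, q)
```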
As an exercise (Problem 2) we suggest showing that the measure P̃ defined by
P̃({(x_1, . . . , x_N)}) = p̃^{ν} q̃^{N−ν},  where ν = #{i ≤ N : x_i = b},     (24)
is indeed a martingale measure.
EXAMPLE 2. Let the (B, S)-market have the structure Bn = 1 for all n = 0, 1, . . . , N
and
S_n = S_0 exp(∑_{k=1}^{n} ρ̂_k),  1 ≤ n ≤ N.     (25)
Let ρ̂k = μk + σk εk , where μk and σk > 0 are Fk−1 -measurable and (ε1 , . . . , εN )
are independent standard Gaussian random variables, εk ∼ N (0, 1).
We will construct the required Esscher measure P̃ (on (Ω, F_N)) by means of the conditional Esscher transform. That is, let P̃(dω) = Z_N(ω) P(dω), where Z_N(ω) = ∏_{1≤k≤N} z_k(ω) with
z_k(ω) = e^{a_k ρ̂_k} / E(e^{a_k ρ̂_k} | F_{k−1})     (26)
(F_0 = {∅, Ω}), and where the F_{k−1}-measurable random variables a_k = a_k(ω) are to be chosen so that the sequence (S_n)_{0≤n≤N} is a P̃-martingale.
In view of (25), P̃ is a martingale measure if and only if
μ_n + σ_n²/2 = −a_n σ_n²,
i.e.,
a_n = −μ_n/σ_n² − 1/2.
With this choice of a_n, 1 ≤ n ≤ N, the density Z_N(ω) is given by the formula
Z_N(ω) = exp{ −∑_{n=1}^{N} [ (μ_n/σ_n + σ_n/2) ε_n + (1/2)(μ_n/σ_n + σ_n/2)² ] }.     (28)
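The choice of a_n can be verified numerically. The one-step sketch below (illustrative μ, σ; with a single period, (26) reduces to an unconditional Esscher tilt) checks that the reweighted expectation of e^{ρ̂} equals 1, i.e., that S becomes a martingale under P̃.

```python
import math
import random

mu, sigma = 0.05, 0.2                    # illustrative parameters
a = -mu / sigma ** 2 - 0.5               # the value of a_n derived above

rng = random.Random(3)
n = 200_000
# E e^{a rho} for rho ~ N(mu, sigma^2), in closed form (normalizer in (26)):
norm = math.exp(a * mu + 0.5 * a ** 2 * sigma ** 2)
acc = 0.0
for _ in range(n):
    rho = mu + sigma * rng.gauss(0.0, 1.0)
    z = math.exp(a * rho) / norm         # Esscher density, cf. (26)
    acc += z * math.exp(rho)             # E~[e^rho] = E[z * e^rho]
tilted_mean = acc / n
print(round(tilted_mean, 4))             # close to 1
```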
Theorem 2 (Second Fundamental Theorem). Suppose that stochastic indeterminacy is described by a filtered probability space (Ω, F, (F_n)_{n≤N}, P) with F_0 = {∅, Ω}, F_N = F, and the (B, S)-market defined on this space is arbitrage-free (M(P) ≠ ∅). Then this market is complete if and only if there exists a unique equivalent martingale measure (|M(P)| = 1).
PROOF. Necessity. Let the market at hand be complete. This means that for any
FN -measurable contingent claim fN there is a self-financing portfolio π = (β, γ)
such that XNπ = fN (P-a.s.). Without loss of generality we may assume that Bn = 1,
0 ≤ n ≤ N. Hence we see from (10) that
f_N = X_N^π = X_0^π + ∑_{k=1}^{N} γ_k ΔS_k.     (29)
Let P¹ and P² be two measures from M(P), and let A ∈ F_N. Taking f_N = I_A(ω), we obtain
I_A(ω) = X_N^π = X_0^π + ∑_{k=1}^{N} γ_k ΔS_k  (P-a.s.).
We conclude from Theorem 3 in Sect. 1 that the sequence (∑_{k=1}^{n} γ_k ΔS_k)_{1≤n≤N} is a martingale with respect to either of the measures P¹ and P². Therefore
E_{P¹} I_A = x,  E_{P²} I_A = x,     (30)
where E_{P^i} is the expectation with respect to P^i and x = X_0^π, which is a constant since F_0 = {∅, Ω}. Now (30) implies that P¹(A) = P²(A) for any set A ∈ F_N.
Hence the uniqueness of the martingale measure is established.
The proof of sufficiency is more complicated and will be carried out in several steps. We consider an arbitrage-free (B, S)-market (M(P) ≠ ∅) such that the martingale measure is unique (|M(P)| = 1).
It is worth noting that both the uniqueness of the martingale measure and the completeness of the market are strong restrictions. What is more, it
turns out that these assumptions imply that the trajectories S = (Sn )0≤n≤N are
“conditionally two-pointed,” which will be explained subsequently. (This may be
exemplified by the CRR model ΔSn = ρn Sn−1 , where ρn takes only two values, so
that the conditional probabilities P(ΔSn ∈ · | Fn−1 ) “sit” on two points, aSn−1 and
bSn−1 .)
The uniqueness of the martingale measure (| M(P)| = 1) also imposes restric-
tions on the structure of the filtration (Fn )n≤N . Under this condition the σ-algebras
Fn must be generated by the prices S0 , S1 , . . . , Sn (assuming that Bk ≡ 1, k ≤ n).
In this regard, see the diagram on p. 610 of [71] and Chap. V, Sect. 4e, therein.
m_n = m_0 + ∑_{k=1}^{n} γ_k* Δ(S_k/B_k)     (31)
and
X_n^{π*} = x + ∑_{k=1}^{n} γ_k* ΔS_k,     (32)
with x = X_0^{π*}.
Since X_N^{π*} = f_N ≤ c, the sequence X^{π*} = (X_n^{π*}, F_n, P̃)_{0≤n≤N} is a martingale (Theorem 3, Sect. 1). Thus, we have two martingales, m and X^{π*}, with the same terminal value f_N (X_N^{π*} = m_N = f_N). But by the definition of the martingale property, m_n = Ẽ(m_N | F_n) and X_n^{π*} = Ẽ(X_N^{π*} | F_n), 0 ≤ n ≤ N. Therefore the Lévy martingales m and X^{π*} are the same, and by (32) the martingale m = (m_n, F_n, P̃)_{0≤n≤N} admits the "S-representation"
m_n = x + ∑_{k=1}^{n} γ_k* ΔS_k,  1 ≤ n ≤ N,     (33)
with x = m_0.
Let us now prove the reverse statement (S-representation ⇒ completeness).
By assumption, there exists a measure P̃ ∈ M(P) such that any bounded P̃-martingale admits an S-representation.
Take for such a martingale X = (X_n, F_n, P̃)_{0≤n≤N} the martingale with X_n = Ẽ(f_N | F_n), where Ẽ is the expectation with respect to P̃ and f_N is the contingent claim involved in Definition 3, for which we must find a self-financing portfolio π such that X_N^π = f_N (P̃- and P-a.s.).
X_n = X_0 + ∑_{k=1}^{n} γ_k ΔS_k,     (34)
as required in Definition 3.
Using representation (34), set γ̃_n = γ_n and define
β̃_n = X_n − γ_n S_n.     (36)
(Implication {1} was established in the proof of necessity and implication {2} in
the foregoing lemma.)
To make the proof of {3} more transparent, we will consider the particular case
of a (B, S)-market described by the CRR model.
As was pointed out earlier (Example 1), in this model the martingale measure P̃ is unique (|M(P)| = 1). So we need to understand why in this case the S-representation (with respect to the martingale measure P̃) holds. We have already indicated that the key reason for that is the fact that the ρ_n in (4) take only two values, a and b, and therefore the conditional distributions P(ΔS_n ∈ · | F_{n−1}) are two-pointed.
Thus we will consider the CRR model introduced in Example 1 and assume additionally that F_n = σ(ρ_1, . . . , ρ_n) for 1 ≤ n ≤ N and F_0 = {∅, Ω}. Let P̃ denote the martingale measure on (Ω, F_N) defined by (24).
Every F_n-measurable variable X_n can be written in the form X_n = g_n(ρ_1, . . . , ρ_n) for some function g_n, so that
ΔX_n(ω) = W_n(ω, ρ_n(ω)),
where
W_n(ω, x) = g_n(ρ_1(ω), . . . , ρ_{n−1}(ω), x) − g_{n−1}(ρ_1(ω), . . . , ρ_{n−1}(ω)).
Set
W_n*(ω, x) = W_n(ω, x) / (x − r)
and denote by μ_n(dx; ω) the conditional distribution of ρ_n given F_{n−1}. Since Ẽ(ΔX_n | F_{n−1}) = 0, we have
0 = ∫ W_n(ω, x) μ_n(dx; ω) = ∫ (x − r) W_n*(ω, x) μ_n(dx; ω),     (37)
and since μ_n( · ; ω) is concentrated at the two points a and b with ∫ (x − r) μ_n(dx; ω) = Ẽ(ρ_n | F_{n−1}) − r = 0, relation (37) yields
W_n*(ω, b) = W_n*(ω, a).     (38)
By (38) the functions W_n*(ω, x) do not depend on x. Therefore, denoting the expression on the left-hand side (or, equivalently, on the right-hand side) of (38) by γ_n*(ω), we find that
ΔX_n(ω) = γ_n*(ω) (ρ_n(ω) − r).     (39)
Therefore
X_n(ω) = X_0(ω) + ∑_{k=1}^{n} γ_k*(ω) (ρ_k(ω) − r).     (40)
Hence
ρ_n − r = (1 + r) (B_{n−1}/S_{n−1}) Δ(S_n/B_n),
and consequently we see from (40) that
X_n(ω) = X_0(ω) + ∑_{k=1}^{n} γ_k(ω) Δ(S_k/B_k),     (41)
where
γ_k(ω) = γ_k*(ω) (1 + r) B_{k−1}/S_{k−1}.
The sequence S/B = (S_n/B_n)_{0≤n≤N} is a martingale with respect to P̃. Thus, (41) is simply the "S/B-representation" for X with respect to the (basic) P̃-martingale S/B.
The key argument in the proof of {3} for the CRR model (where |M(P)| = 1) was the fact that the ρ_n take on only two values. However, it turns out that the uniqueness assumption of the martingale measure P̃ is so strong that in the general case it also implies that the variables ρ_n = ΔS_n/S_{n−1} are "two-pointed," i.e., there exist predictable a_n = a_n(ω) and b_n = b_n(ω) such that
P̃(ρ_n = a_n | F_{n−1}) + P̃(ρ_n = b_n | F_{n−1}) = 1  (P-a.s.).     (42)
Taking this property for granted, the foregoing proof of the S/B-representation in the CRR model will "work" also in the general case. Thus, all that remains is to
establish (42). We leave obtaining this result to the reader (Problem 5). Nevertheless,
we give some heuristic arguments showing how the uniqueness of the martingale
measure leads to two-pointed conditional distributions.
Let Q = Q(dx) be a probability distribution on (R, B(R)) and ξ = ξ(x) the coordinate random variable (ξ(x) = x). Let E_Q |ξ| < ∞, E_Q ξ = 0 ("martingale property"), and suppose that Q has the property that for any other measure Q̃ ∼ Q such that E_{Q̃} |ξ| < ∞ and E_{Q̃} ξ = 0, it holds that Q̃ = Q ("uniqueness of the martingale measure").
We assert that in this case Q is supported on at most two points (a ≤ 0 and b ≥ 0), which may coincide as the single point zero (a = b = 0).
The aforementioned heuristic arguments, which make this assertion very likely,
are as follows.
Suppose that Q is supported on three points, x− ≤ x0 ≤ x+ , with masses
q− , q0 , q+ , respectively. The condition EQ ξ = 0 means that
q− x− + q0 x0 + q+ x+ = 0.
If x0 = 0, then q− x− + q+ x+ = 0.
Let
q̃_− = q_−/2,  q̃_0 = 1/2 + q_0/2,  q̃_+ = q_+/2,     (43)
i.e., we move some parts of masses q− and q+ from the points x− and x+ to x0 .
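The mass-moving step (43) is easy to check numerically (the three points and masses below are illustrative): the new measure is again a probability measure with zero mean and the same support, yet differs from Q, so a three-point Q cannot be the unique "martingale" measure.

```python
# Q sits on three points x_- < x_0 = 0 < x_+ with zero mean.
x_minus, x0, x_plus = -2.0, 0.0, 1.0
q_minus, q0, q_plus = 0.2, 0.4, 0.4        # zero mean: -0.4 + 0 + 0.4 = 0

# Move half of the masses at x_- and x_+ to x_0, as in (43):
qt_minus = q_minus / 2
qt0 = 0.5 + q0 / 2
qt_plus = q_plus / 2

assert abs(q_minus * x_minus + q0 * x0 + q_plus * x_plus) < 1e-12
assert abs(qt_minus + qt0 + qt_plus - 1) < 1e-12       # still a probability
assert abs(qt_minus * x_minus + qt0 * x0 + qt_plus * x_plus) < 1e-12
assert (qt_minus, qt0, qt_plus) != (q_minus, q0, q_plus)
print("two distinct equivalent zero-mean measures on the same support")
```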
1. Hedging is one of the basic methods of the dynamic control of investment port-
folios. We will set out some basic notions and results related to this method consid-
ering as an example the pricing of so-called option contracts (or simply options).
At the same time, the amount of this payment should not give the seller arbitrage
possibilities of a “free lunch” type, i.e., the chance to earn a risk-free profit.
Before defining what the fair price of an option should mean, we give the com-
monly accepted classification of options.
2. We will consider a (B, S)-market, B = (Bn )0≤n≤N , S = (Sn )0≤n≤N , operat-
ing at time instants n = 0, 1, . . . , N and defined on a filtered probability space
(Ω, F , (Fn )0≤n≤N , P) with F0 = {∅, Ω} and FN = F .
We will consider options written on stock with prices described by the random
sequence S = (Sn )0≤n≤N .
With regard to the time of their exercise, options are of two types: European and
American.
If an option can be exercised only at the time instant N fixed in the contract, then
N is called the time of its exercise, and this option is said to be of the European type.
For example, for the standard seller's (put) option the contingent claim is
f_N = (K − S_N)+.     (2)
3. When defining the fair price in an arbitrage-free (B, S)-market we must distin-
guish between two cases, complete and incomplete markets.
Definition 1. Let a (B, S)-market be arbitrage-free and complete. The fair price of an option of the European type with F_N-measurable bounded (nonnegative) contingent claim f_N is the price of perfect hedging
C(f_N; P) = inf{x ≥ 0 : there exists a self-financing portfolio π with X_0^π = x and X_N^π = f_N (P-a.s.)}.
Note that this definition is correct, i.e., for any bounded function fN there always
exists a portfolio π with some initial capital x such that XNπ ≥ fN (P-a.s.).
4. Now we give a formula for the price C(fN ; P). We will prove it for complete
markets and refer the reader to specialized literature for incomplete markets (e.g.,
[71, Chap. VI, Sect. 1c]).
Theorem 1. (i) For a complete arbitrage-free (B, S)-market, the fair price of a European-type option with a contingent claim f_N is
C(f_N; P) = B_0 Ẽ (f_N/B_N),     (5)
where Ẽ is the expectation with respect to the (unique) martingale measure P̃.
(ii) For a general arbitrage-free (B, S)-market, the fair price of a European-type option with a contingent claim f_N is
C(f_N; P) = sup_{P̃∈M(P)} B_0 Ẽ (f_N/B_N),     (6)
where the sup is taken over the set M(P) of all martingale measures.
PROOF. (i) Let π be a perfect hedge with X_0^π = x and X_N^π = f_N (P-a.s.). Then (see (15) in Sect. 11)
f_N/B_N = X_N^π/B_N = x/B_0 + ∑_{k=1}^{N} γ_k Δ(S_k/B_k).     (7)
12 Hedging in Arbitrage-Free Models 223
Taking the expectation Ẽ of both sides of (7) and using the fact that the martingale-transform sum has zero Ẽ-expectation, we obtain
B_0 Ẽ (f_N/B_N) = x.     (8)
Note that the left-hand side of (8) does not depend on the structure of a particular hedge π with initial value X_0^π = x. If we take another hedge π′ with initial value X_0^{π′}, then, according to (8), this initial value is again equal to B_0 Ẽ (f_N/B_N). Hence it is clear that the initial value x is the same for all perfect hedges, which proves (5).
(ii) Here we only prove the inequality
sup_{P̃∈M(P)} B_0 Ẽ (f_N/B_N) ≤ C(f_N; P).     (10)
(The proof of the reverse inequality relies on the so-called optional decomposi-
tion, which goes beyond the scope of this book; see, e.g., [71, Chap. VI, Sects. 1c
and 2d].)
Suppose that the hedge π is such that X_0^π = x and X_N^π ≥ f_N (P-a.s.). Then (7) implies that
x/B_0 + ∑_{k=1}^{N} γ_k Δ(S_k/B_k) ≥ f_N/B_N ≥ 0.
Therefore, for any measure P̃ ∈ M(P),
B_0 Ẽ (f_N/B_N) ≤ x
(cf. (8) and (9)). Hence, taking the supremum on the left-hand side over all measures P̃ ∈ M(P), we arrive at the required inequality (10).
5. Now we consider some definitions and results related to options of the American
type. For these options we must assume that we are given not a single contingent
claim fN related to time N, but a collection of claims f0 , f1 , . . . , fN whose meaning is
that once the buyer exercises the option at time n, the payoff (by the option seller to
the buyer) is determined by the (F_n-measurable) function f_n = f_n(ω).
If the buyer of an option decides to exercise the option at time τ = τ(ω), which
is a Markov time with values in {0, 1, . . . , N}, then the payoff is fτ(ω) (ω). Therefore
the seller of the option when composing his portfolio π must envisage that for any τ
the following hedging condition must hold: Xτπ ≥ fτ (P-a.s.).
Theorem 2. (i) For a complete arbitrage-free (B, S)-market, the fair price of an American-type option with a system of payoff functions f = (f_n)_{0≤n≤N} is given by
C(f; P) = sup_{τ∈M_0^N} B_0 Ẽ (f_τ/B_τ),     (12)
where M_0^N = {τ : τ ≤ N} is the class of stopping times (with respect to (F_n)_{0≤n≤N}) and P̃ is the unique martingale measure.
(ii) In the general case of an (incomplete) arbitrage-free (B, S)-market, the fair price of an American-type option with a system of payoff functions f = (f_n)_{0≤n≤N} is given by
C(f; P) = sup_{τ∈M_0^N, P̃∈M(P)} B_0 Ẽ (f_τ/B_τ),     (13)
where M(P) is the set of martingale measures P̃.
The proof follows directly from the proof of the implication "completeness" ⇒ "S/B-representation" in Lemma 2, Sect. 11, applied to the martingale m = (m_n)_{0≤n≤N} with m_n = Ẽ(f_N/B_N | F_n).
7. As an example of actual option pricing consider a (B, S)-market described by the
CRR model,
B_n = B_{n−1}(1 + r),  S_n = S_{n−1}(1 + ρ_n),     (17)
where ρ1 , . . . , ρN are independent identically distributed random variables taking
two values, a and b, −1 < a < r < b.
This market is arbitrage-free and complete (Problem 3, Sect. 11) with martingale
measure P̃ such that P̃{ρ_n = b} = p̃, P̃{ρ_n = a} = q̃, where
p̃ = (r − a)/(b − a),  q̃ = (b − r)/(b − a).     (18)
(See Example 1 in Sect. 11, Subsection 5.)
By formula (5) of Theorem 1, the fair price for this (B, S)-market is
C(f_N; P) = Ẽ [f_N/(1 + r)^N].     (19)
(with F_n = σ(ρ_1, . . . , ρ_n), 1 ≤ n ≤ N, and F_0 = {∅, Ω}) and then to find γ_n* and β_n* by (15) and (16).
Since X_0^{π*} = C(f_N; P), the problem amounts to finding the conditional expectations on the right-hand side of (20) for n = 0, 1, . . . , N.
We will assume that the FN -measurable function fN has a “Markov” structure,
i.e., fN = f (SN ), where f = f (x) is a nonnegative function of x ≥ 0.
Use the notation
F_n(x; p) = ∑_{k=0}^{n} f(x (1 + b)^k (1 + a)^{n−k}) C_n^k p^k (1 − p)^{n−k},     (21)
with p̃ = (r − a)/(b − a).
Using that S_N = S_n ∏_{n<k≤N} (1 + ρ_k), we see that (21) and (20) imply
X_n^{π*}/B_n = Ẽ [f_N/(1 + r)^N | F_n] = (1 + r)^{−N} F_{N−n}(S_n; p̃)     (23)
(here B_0 = 1, so that B_n = (1 + r)^n). In particular,
C(f_N; P) = X_0^{π*} = (1 + r)^{−N} F_N(S_0; p̃).     (24)
Finally, taking into account (23), we obtain from (15) that
γ_n* = Δ(X_n^{π*}/B_n) / Δ(S_n/B_n)
is given by
γ_n* = (B_n/B_N) · [F_{N−n}(S_{n−1}(1 + b); p̃) − F_{N−n}(S_{n−1}(1 + a); p̃)] / (S_{n−1}(b − a)).     (25)
To find β_n*, note that B_{n−1} Δβ_n* + S_{n−1} Δγ_n* = 0 by the self-financing condition. Therefore
X_{n−1}^{π*} = β_n* B_{n−1} + γ_n* S_{n−1},     (26)
and consequently,
β_n* = (X_{n−1}^{π*} − γ_n* S_{n−1}) / B_{n−1}.     (27)
Using this formula along with (23) and (25) we obtain
β_n* = (1/B_N) { F_{N−n+1}(S_{n−1}; p̃) − [(1 + r)/(b − a)] [F_{N−n}(S_{n−1}(1 + b); p̃) − F_{N−n}(S_{n−1}(1 + a); p̃)] }.     (28)
Let us see, finally, what the fair price C(fN ; P) is in the case of a standard buyer’s
(call) option when fN = (SN − K)+ .
B(K_0, N; p) = ∑_{k=K_0}^{N} C_N^k p^k (1 − p)^{N−k},     (31)
it is not hard to derive from (24) the following formula (Cox–Ross–Rubinstein) for the fair price (denoted presently by C_N) of the standard call option:
C_N = S_0 B(K_0, N; p̃*) − K (1 + r)^{−N} B(K_0, N; p̃),  where p̃* = p̃ (1 + b)/(1 + r).     (32)
If K_0 > N, then C_N = 0.
Remark. Since
(K − SN )+ = (SN − K)+ − SN + K,
the fair price of a standard seller’s (put) option denoted by PN (= C(fN ; P) with
fN = (K − SN )+ ) is given by
P_N = Ẽ (1 + r)^{−N} (K − S_N)+ = C_N − Ẽ (1 + r)^{−N} S_N + K (1 + r)^{−N}.
8. PROBLEMS
1. Find the price C(fN ; P) of a standard call option with fN = (SN − K)+ for the
model of the (B, S)-market considered in Example 2, Subsection 5, Sect. 11.
2. Try to prove the reverse inequality in (10).
3. Prove (12), and try to prove (13).
4. Give a detailed derivation of (23).
5. Prove (25) and (28).
6. Give a detailed derivation of (32).
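To tie Sects. 11 and 12 together, here is a small sketch (illustrative parameters, B_0 = 1; my own code, not the book's) that prices a call in the CRR model by computing the conditional expectations behind (19) and (23) via backward induction, and then verifies completeness path by path: the self-financing portfolio determined by the one-step replication equations ends with X_N^π = f_N exactly.

```python
import itertools

a, b, r = -0.1, 0.2, 0.05        # CRR parameters, -1 < a < r < b
S0, K, N = 100.0, 100.0, 4       # call option f_N = (S_N - K)^+
p = (r - a) / (b - a)            # p~ from (18)

def price(s, n):
    # E~[(1+r)^{-(N-n)} (S_N - K)^+ | S_n = s] by backward induction.
    if n == N:
        return max(s - K, 0.0)
    up, down = price(s * (1 + b), n + 1), price(s * (1 + a), n + 1)
    return (p * up + (1 - p) * down) / (1 + r)

C = price(S0, 0)                 # the fair price, cf. (19) and (24)

# Perfect hedging: run the portfolio along every path of the tree.
for path in itertools.product([a, b], repeat=N):
    x, s = C, S0
    for n, rho in enumerate(path, start=1):
        v_up, v_down = price(s * (1 + b), n), price(s * (1 + a), n)
        gamma = (v_up - v_down) / (s * (b - a))      # stock position
        beta = (x - gamma * s) / (1 + r) ** (n - 1)  # bank position, cf. (27)
        s = s * (1 + rho)
        x = beta * (1 + r) ** n + gamma * s          # new value, cf. (7)
    assert abs(x - max(s - K, 0.0)) < 1e-9           # X_N^pi = f_N exactly
print(round(C, 4))
```

That every path is replicated exactly is precisely the completeness of the CRR market asserted by the second fundamental theorem.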
where M_0^∞ = {τ : τ ≤ ∞} is the class of all Markov times. Obviously, V_0^∞ = sup_{τ∈M_0^∞} E f_τ when letting f_∞ = 0 (cf. Sect. 1, Subsection 3).
In what follows we will only treat problem (1). (Regarding the case N = ∞, see
Sect. 9, Chap. 8.) If the probabilistic structure of the sequence f = (f0 , f1 , . . . , fN )
is not specified, the most efficient method of solving problems (1) and (3) is the
“martingale” method described in what follows. (We will always assume without
mention that E |fn | < ∞ for all n ≤ N.)
2. Thus, let N < ∞. This case may be treated by what is known as backward
induction, which is carried out here as follows.
13 Optimal Stopping Problems 229
Using this notation, the following theorem completely describes the solution of
the optimal stopping problems (1) and (5).
Theorem 1. Let $f = (f_0, f_1, \ldots, f_N)$ be such that every $f_n$ is $\mathscr{F}_n$-measurable.
(i) For any $n$, $0 \le n \le N$, the stopping time $\tau_n^N$ is optimal in the class $M_n^N$: $\mathrm{E}\, f_{\tau_n^N} = V_n^N$.
(ii) The stopping times $\tau_n^N$, $0 \le n \le N$, are optimal also in the following "conditional" sense:
$$\mathrm{E}(f_{\tau_n^N} \mid \mathscr{F}_n) = \operatorname*{ess\,sup}_{\tau \in M_n^N} \mathrm{E}(f_\tau \mid \mathscr{F}_n) \quad (\text{P-a.s.}). \qquad (10)$$
The "stochastic prices" $\operatorname{ess\,sup}_{\tau \in M_n^N} \mathrm{E}(f_\tau \mid \mathscr{F}_n)$ are equal to $v_n^N$:
$$v_n^N = \operatorname*{ess\,sup}_{\tau \in M_n^N} \mathrm{E}(f_\tau \mid \mathscr{F}_n) \quad (\text{P-a.s.}), \qquad (11)$$
and
$$V_n^N = \mathrm{E}\, v_n^N. \qquad (12)$$
If $n = 0$, then
$$V_0^N = v_0^N. \qquad (13)$$
For $n = N$,
$$V_N^N = \mathrm{E}\, f_N. \qquad (14)$$
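For a concrete feel of how the theorem is used, here is a minimal sketch of the backward induction $v_N = f_N$, $v_n = \max(f_n, \mathrm{E}(v_{n+1} \mid \mathscr{F}_n))$ in the Markov case, on the lattice of a symmetric $\pm 1$ random walk. The payoff and parameters are our illustrative choices, not the book's.

```python
def optimal_stopping_walk(N, f, x0=0):
    # Backward induction: v_N = f_N and v_n = max(f_n, E(v_{n+1} | F_n)).
    # For a symmetric +/-1 walk, E(v_{n+1} | X_n = x) = (v_{n+1}(x+1) + v_{n+1}(x-1))/2.
    # Returns v_0(x0) (= V_0^N, the price, F_0 being trivial) and the stopping
    # sets D_n = {x : v_n(x) = f(x)}.
    v = {x: f(x) for x in range(x0 - N, x0 + N + 1)}        # v_N = f_N
    D = {N: set(v)}
    for n in range(N - 1, -1, -1):
        v_next = v
        v = {}
        for x in range(x0 - n, x0 + n + 1):
            cont = 0.5 * (v_next[x + 1] + v_next[x - 1])    # continuation value
            v[x] = max(f(x), cont)                          # max returns f(x) itself
        D[n] = {x for x, w in v.items() if w == f(x)}       # ... so this test is exact
    return v[x0], D
```

With $N = 3$ and $f(x) = (2 - x)^+$ this gives $V_0^3 = 2.125 > f(0) = 2$, so at $n = 0$ it is optimal to continue observation.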
3. Before we proceed to the proof, let us recall the definition of the essential supremum $\operatorname{ess\,sup}_{\alpha \in \mathfrak{A}} \xi_\alpha(\omega)$ of a family of $\mathscr{F}$-measurable random variables $\{\xi_\alpha(\omega),\ \alpha \in \mathfrak{A}\}$ involved in (10).
We need this concept because, in the case of an uncountable set $\mathfrak{A}$, the use of the ordinary $\sup_{\alpha \in \mathfrak{A}} \xi_\alpha(\omega)$ may, in general, give rise to functions (of $\omega \in \Omega$) that are not $\mathscr{F}$-measurable. Indeed, for any $c \in \mathbb{R}$,
$$\Big\{\omega : \sup_{\alpha \in \mathfrak{A}} \xi_\alpha(\omega) \le c\Big\} = \bigcap_{\alpha \in \mathfrak{A}} \{\omega : \xi_\alpha(\omega) \le c\},$$
and an intersection of uncountably many measurable sets need not be measurable.
PROOF. First, assume that all $\xi_\alpha(\omega)$, $\alpha \in \mathfrak{A}$, are uniformly bounded: $|\xi_\alpha(\omega)| \le c$, $\omega \in \Omega$, $\alpha \in \mathfrak{A}$.
Let $A$ be a finite set of indices $\alpha \in \mathfrak{A}$ and set $S(A) = \mathrm{E} \max_{\alpha \in A} \xi_\alpha(\omega)$. Let, further, $S = \sup S(A)$, where the supremum is taken over all finite subsets $A \subseteq \mathfrak{A}$. Denote by $A_n$, $n \ge 1$, finite sets such that
$$\mathrm{E} \max_{\alpha \in A_n} \xi_\alpha(\omega) \ge S - \frac{1}{n}.$$
Let $A_0 = \bigcup_{n \ge 1} A_n$. This set is countable; hence
$$\xi(\omega) = \sup_{\alpha \in A_0} \xi_\alpha(\omega)$$
is $\mathscr{F}$-measurable, i.e., a random variable. (Note that $|\xi(\omega)| \le c$, hence $\xi(\omega)$ is an ordinary rather than extended random variable.)
This construction of ξ(ω) implies (Problem 1) that this random variable has prop-
erties (a) and (b) of the foregoing definition.
Therefore we have established the existence of the essential supremum for a uniformly bounded family $\{\xi_\alpha(\omega),\ \alpha \in \mathfrak{A}\}$.
In the general case we first pass from the $\xi_\alpha(\omega)$ to the bounded random variables $\tilde\xi_\alpha(\omega) = \arctan \xi_\alpha(\omega)$, for which $|\tilde\xi_\alpha(\omega)| \le \pi/2$, $\alpha \in \mathfrak{A}$, $\omega \in \Omega$, and then we let $\tilde\xi(\omega) = \operatorname{ess\,sup}_{\alpha \in \mathfrak{A}} \tilde\xi_\alpha(\omega)$. Then the random variable $\xi(\omega) = \tan \tilde\xi(\omega)$ will satisfy requirements (a) and (b) of the definition of the essential supremum (Problem 2).
In view of the $\mathscr{F}_{k-1}$-measurability of the set $A$, this implies that for any $\tau \in M_{k-1}$
$$\mathrm{E}(f_\tau \mid \mathscr{F}_{k-1}) \le v_{k-1} \quad (\text{P-a.s.}). \qquad (16)$$
We will show now that for the Markov time $\tau_{k-1}$
$$\mathrm{E}(f_{\tau_{k-1}} \mid \mathscr{F}_{k-1}) = v_{k-1} \qquad (17)$$
with P-probability 1. (If this equality is established, we obtain by (16) that (10) and (11) hold also for $n = k - 1$.)
For that purpose it suffices to show that (15) holds for τ = τk−1 with equality
rather than inequality signs throughout.
Beginning as in (15) and using then that on the set $\{\tau_{k-1} \ge k\}$ we have $\tau_{k-1} = \tau_k$ by definition (5), and that (by the induction assumption) $\mathrm{E}(f_{\tau_k} \mid \mathscr{F}_k) = v_k$ (P-a.s.), we obtain
$$\begin{aligned}
\mathrm{E}[I_A f_{\tau_{k-1}}] &= \mathrm{E}[I_{A \cap \{\tau_{k-1} = k-1\}} f_{k-1}] + \mathrm{E}[I_{A \cap \{\tau_{k-1} \ge k\}}\, \mathrm{E}(f_{\tau_{k-1}} \mid \mathscr{F}_{k-1})] \\
&= \mathrm{E}[I_{A \cap \{\tau_{k-1} = k-1\}} f_{k-1}] + \mathrm{E}[I_{A \cap \{\tau_{k-1} \ge k\}}\, \mathrm{E}(f_{\tau_k} \mid \mathscr{F}_{k-1})] \\
&= \mathrm{E}[I_{A \cap \{\tau_{k-1} = k-1\}} f_{k-1}] + \mathrm{E}[I_{A \cap \{\tau_{k-1} \ge k\}}\, \mathrm{E}(v_k \mid \mathscr{F}_{k-1})] = \mathrm{E}[I_A v_{k-1}],
\end{aligned}$$
where the last equality holds because $v_{k-1} = \max(f_{k-1}, \mathrm{E}(v_k \mid \mathscr{F}_{k-1}))$ by definition, hence $v_{k-1} = f_{k-1}$ on the set $\{\tau_{k-1} = k-1\}$ and $v_{k-1} > f_{k-1}$ on the set $\{\tau_{k-1} > k-1\} = \{\tau_{k-1} \ge k\}$ (so that on this set $v_{k-1} = \mathrm{E}(v_k \mid \mathscr{F}_{k-1})$).
Thus (17) is established. As was pointed out earlier, this property, together with
(16), implies that (10) and (11) hold.
It follows from these relations that (P-a.s.)
$$v_n = \mathrm{E}(f_{\tau_n} \mid \mathscr{F}_n) \ge \mathrm{E}(f_\tau \mid \mathscr{F}_n) \qquad (18)$$
for any $\tau \in M_n\ (= M_n^N)$. Therefore, taking into account the convention $v_n^N = v_n$, we find that
$$\mathrm{E}\, v_n^N = \mathrm{E}\, f_{\tau_n} \ge \sup_{\tau \in M_n^N} \mathrm{E}\, f_\tau = V_n^N, \qquad (19)$$
$$v_n^N \ge f_n, \qquad (20)$$
$$v_n^N \ge \mathrm{E}(v_{n+1}^N \mid \mathscr{F}_n). \qquad (21)$$
The first inequality here means that the sequence $v^N$ majorizes the sequence $f = (f_0, f_1, \ldots, f_N)$. The second inequality shows that $v^N$ is a supermartingale with "terminal" value $v_N^N = f_N$. Thus, we can say that $v^N = (v_0^N, v_1^N, \ldots, v_N^N)$, with the $v_n^N$ defined by (6) or by (11), is a supermartingale majorant for the sequence $f = (f_0, f_1, \ldots, f_N)$.
In other words, this means that the sequence $v^N$ belongs to the class of sequences $\gamma^N = (\gamma_0^N, \gamma_1^N, \ldots, \gamma_N^N)$ with $\gamma_N^N \ge f_N$ satisfying (P-a.s.) the "variational inequalities"
$$\gamma_n^N \ge \max(f_n, \mathrm{E}(\gamma_{n+1}^N \mid \mathscr{F}_n)) \qquad (22)$$
for all $n = 0, 1, \ldots, N - 1$.
But the sequence $v^N$ possesses additionally the property that (22) holds for $v^N$ not only with the nonstrict inequality "$\ge$" but with equality "$=$" (see (6)). This property singles out the sequence $v^N$ among the sequences $\gamma^N$ (with $\gamma_N^N \ge f_N$) as follows.
Theorem 2. The sequence $v^N$ is the least supermartingale majorant for the sequence $f = (f_0, f_1, \ldots, f_N)$.
PROOF. Since $v_N^N = f_N$ and $\gamma_N^N \ge f_N$, we have $\gamma_N^N \ge v_N^N$. Together with (22) and (6) this implies that (P-a.s.)
$$\gamma_{N-1}^N \ge \max(f_{N-1}, \mathrm{E}(\gamma_N^N \mid \mathscr{F}_{N-1})) \ge \max(f_{N-1}, \mathrm{E}(v_N^N \mid \mathscr{F}_{N-1})) = v_{N-1}^N.$$
In a similar way we find that $\gamma_n^N \ge v_n^N$ (P-a.s.) also for all $n < N - 1$.
Remark. The result of this theorem can be restated as follows: the solution $v^N = (v_0^N, v_1^N, \ldots, v_N^N)$ of the recurrence system
$$v_n^N = \max(f_n, \mathrm{E}(v_{n+1}^N \mid \mathscr{F}_n)), \qquad 0 \le n \le N - 1,$$
with $v_N^N = f_N$, is the smallest among the solutions $\gamma^N = (\gamma_0^N, \gamma_1^N, \ldots, \gamma_N^N)$ of the recurrence system of inequalities
$$\gamma_n^N \ge \max(f_n, \mathrm{E}(\gamma_{n+1}^N \mid \mathscr{F}_n)), \qquad 0 \le n \le N - 1,$$
with $\gamma_N^N \ge f_N$.
6. Theorems 1 and 2 not only describe the method of finding the price $V_0^N = \sup \mathrm{E}\, f_\tau$, where the sup is taken over the class of Markov times $M_0^N$, but also enable us to determine the optimal time $\tau_0^N$, i.e., the time for which $\mathrm{E}\, f_{\tau_0^N} = V_0^N$. According to (8),
$$\tau_0^N = \min\{0 \le k \le N : v_k^N = f_k\}.$$
When solving specific optimal stopping problems, the following equivalent description of the stopping time $\tau_0^N$ is useful. Let
$$D_n^N = \{\omega : v_n^N(\omega) = f_n(\omega)\} \qquad (25)$$
and
$$C_n^N = \Omega \setminus D_n^N = \{\omega : v_n^N(\omega) = \mathrm{E}(v_{n+1}^N \mid \mathscr{F}_n)(\omega) > f_n(\omega)\}.$$
Clearly, $D_N^N = \Omega$ and $C_N^N = \varnothing$.
It follows from (24) and (25) that the stopping time $\tau_0^N$ can also be defined as
$$\tau_0^N = \min\{0 \le n \le N : \omega \in D_n^N\}.$$
It is natural to call the $D_k^N$ the "stopping sets" and the $C_k^N$ the "continuation-of-observation sets." This terminology can be justified as follows.
Consider the time instant $n = 0$, and divide the set $\Omega$ into the sets $D_0^N$ and $C_0^N$ ($\Omega = D_0^N \cup C_0^N$, $D_0^N \cap C_0^N = \varnothing$). If $\omega \in D_0^N$, then $\tau_0^N(\omega) = 0$; in other words, "stopping" occurs at time $n = 0$. But if $\omega \in C_0^N$, then $\tau_0^N(\omega) \ge 1$. In the case where $\omega \in D_1^N \cap C_0^N$, we have $\tau_0^N(\omega) = 1$. The subsequent steps are treated in a similar manner. At time $N$ the observations are certainly terminated.
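The description via stopping sets can also be tested numerically. The sketch below (our illustrative payoff and parameters) computes the $v_n$ by the recursion, then simulates paths of a symmetric walk, stopping at the first $n$ with $X_n \in D_n^N$; the average reward should match $v_0$, in accordance with Theorem 1.

```python
import random

def stopping_value(N=3, K=2, trials=20000, seed=1):
    # v_n = max(f_n, E(v_{n+1} | F_n)) for f_n = (K - X_n)^+ on a symmetric
    # +/-1 walk (illustrative choice), followed by Monte Carlo of the rule
    # tau_0 = first n with X_n in D_n = {x : v_n(x) = f(x)}.
    f = lambda x: max(K - x, 0)
    levels = {N: {x: f(x) for x in range(-N, N + 1)}}       # v_N = f_N
    for n in range(N - 1, -1, -1):
        levels[n] = {x: max(f(x), 0.5 * (levels[n + 1][x + 1] + levels[n + 1][x - 1]))
                     for x in range(-n, n + 1)}
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = 0
        for n in range(N + 1):
            if levels[n][x] == f(x):                        # x lies in D_n: stop
                total += f(x)
                break
            x += rng.choice((-1, 1))                        # x in C_n: continue
    return levels[0][0], total / trials
```

For these parameters the exact price is $v_0(0) = 2.125$, and the Monte Carlo average of $f_{\tau_0}$ agrees with it up to sampling error.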
Thus for martingale sequences the optimal stopping problem is actually solved in a trivial manner: the stopping time $\tau_0^N(\omega) \equiv 0$, $\omega \in \Omega$, is optimal (as, by the way, is any other stopping time $\tau_n^N(\omega) \equiv n$, $\omega \in \Omega$, $1 \le n \le N$).
The preceding examples are fairly simple, and the problem of finding optimal
stopping times is solved in them actually without invoking the theory given by The-
orems 1 and 2. Their solution relies on the results on preservation of martingale,
submartingale, and supermartingale properties under time change by a Markov time
(Sect. 2). But in general finding the price V0N and the optimal stopping time τ0N may
be a very difficult problem.
Of great interest are the cases where the functions $f_n$ have the form $f_n = g_n(X_n)$, where $X = (X_n)_{n \ge 0}$ is a Markov chain. As will be shown in Sect. 9 of Chap. 8, the solution of optimal stopping problems then reduces in fact to the solution of variational inequalities and Wald–Bellman equations of dynamic programming.
We also provide therein (nontrivial) examples of complete solutions to some op-
timal stopping problems for Markov sequences.
8. Problems
1. Show that the random variable $\xi(\omega) = \sup_{\alpha \in A_0} \xi_\alpha(\omega)$ constructed in the proof of the lemma (Subsection 3) satisfies requirements (a) and (b) in the definition of the essential supremum. (Hint: In the case $\alpha \notin A_0$, consider $\mathrm{E} \max(\xi(\omega), \xi_\alpha(\omega))$.)
2. Show that $\xi(\omega) = \tan \tilde\xi(\omega)$ (see the end of the proof of the lemma in Subsection 3) also satisfies requirements (a) and (b).
4. Let
$$M_n^\infty = \{\tau : n \le \tau < \infty\}, \qquad V_n^\infty = \sup_{\tau \in M_n^\infty} \mathrm{E}\, f_\tau, \qquad v_n^\infty = \operatorname*{ess\,sup}_{\tau \in M_n^\infty} \mathrm{E}(f_\tau \mid \mathscr{F}_n), \qquad \tau_n^\infty = \min\{k \ge n : v_k^\infty = f_k\}.$$
Assuming that $\mathrm{E} \sup_n f_n^- < \infty$, show that the limiting random variables $\tilde v_n$ satisfy
(a) $\tilde v_n = \mathrm{E}(f_{\tau_n^\infty} \mid \mathscr{F}_n)$;
(b) $\tilde v_n = v_n^\infty\ \big(= \operatorname*{ess\,sup}_{\tau \in M_n^\infty} \mathrm{E}(f_\tau \mid \mathscr{F}_n)\big)$.
5. Let $\tau_n^\infty \in M_n^\infty$. Deduce from (a) and (b) of the previous problem that $\tau_n^\infty$ is the optimal stopping time in the sense that
$$\sup_{\tau \in M_n^\infty} \mathrm{E}\, f_\tau = \mathrm{E}\, f_{\tau_n^\infty}.$$
The modern theory of Markov processes has its origins in the studies of A. A. Markov
(1906–1907) on sequences of experiments “connected in a chain” and in attempts to de-
scribe mathematically the physical phenomenon known as Brownian motion (L. Bachelier
1900, A. Einstein 1905).
E. B. Dynkin “Markov processes,” [21, Vol. 1]
1. In Sect. 12 of Chap. 1 we set out, for the case of finite probability spaces, the
fundamental ideas and principles behind the concept of Markov dependence (see
property (7) therein) of random variables, which is designed to describe the evolu-
tion of memoryless systems. In this chapter we extend this treatment to more general
probability spaces.
One of the main problems of the theory of Markov processes is the study of the
asymptotic behavior (as time goes to infinity) of memoryless systems. Remarkably,
under very broad assumptions, such a system evolves as if it “forgot” the initial state,
its behavior "stabilizes," and the system reaches a "steady-state" regime. We will analyze in detail the asymptotic behavior of systems described as Markov chains with countably many states. To this end we will provide a classification of the states of
Markov chains according to the algebraic and asymptotic properties of their transi-
tion probabilities.
2. Let (Ω, F , (Fn )n≥0 , P) be a filtered probability space, i.e., a probability space
(Ω, F , P) with a specified filtration (flow) (Fn )n≥0 of σ-algebras Fn , n ≥ 0, such
that F0 ⊆ F1 ⊆ . . . ⊆ F . Intuitively, Fn describes the “information” available by
the time n (inclusive).
Let, further, (E, E ) be a measurable space representing the “state space,” where
systems under consideration take their values. For “technical reasons” (e.g., so that,
for any random element X0 (ω) and x ∈ E, the set {ω : X0 (ω) = x} ∈ F ) it will be
assumed that the σ-algebra E contains all singletons in E, i.e., sets consisting of one
point. (Regarding this assumption, see Subsection 6 below.)
The measurable spaces (E, E ) subject to this assumption are called the phase
spaces or the state spaces of the systems under consideration.
Definition 1 (Markov chain in wide sense). Let (Ω, F , (Fn )n≥0 , P) be a filtered
probability space and (E, E ) a phase space. A sequence X = (Xn )n≥0 of E-
valued Fn /E -measurable random elements Xn = Xn (ω), n ≥ 0, defined on
$(\Omega, \mathscr{F}, (\mathscr{F}_n)_{n \ge 0}, \mathrm{P})$, is a sequence of random variables with Markov dependence (or a Markov chain) in the wide sense if for any $n \ge 0$ and $B \in \mathscr{E}$ the following wide-sense Markov property holds:
$$\mathrm{P}(X_{n+1} \in B \mid \mathscr{F}_n) = \mathrm{P}(X_{n+1} \in B \mid X_n) \quad (\text{P-a.s.}). \qquad (1)$$
For clarity (cf. Sect. 12 in Chap. 1, Vol. 1), this property is often written in the form
$$\mathrm{P}(X_{n+1} \in B \mid X_0, \ldots, X_n) = \mathrm{P}(X_{n+1} \in B \mid X_n) \quad (\text{P-a.s.}). \qquad (2)$$
The strict-sense Markov property (2) deduced from (1) suggests the definition of
the Markov dependence in the case where the flow (Fn )n≥0 is not specified a priori.
Remark. The introduction from the outset of a filtered probability space on which
a Markov chain in the wide sense is defined is useful in many problems where the
behavior of systems depends on a “flow of information” (Fn )n≥0 . For example, it
may happen that the first component X = (Xn )n≥0 of a “two-dimensional” process
(X, Y) = (Xn , Yn )n≥0 is not a Markov chain in the sense of (2), but nevertheless it is
a Markov chain in the sense of (1) with Fn = FnX,Y , n ≥ 0.
However, in the elementary exposition of the theory of Markov chains to be set
out in this chapter, the flow (Fn )n≥0 is not usually introduced and the presentation
is based on Definition 2.
$$\mathrm{P}(F \mid \mathscr{F}^X_{[0,n]})(\omega) = \mathrm{P}(F \mid X_n(\omega)) \quad (\text{P-a.s.}) \qquad (6)$$
To this end, consider first a particular case of such a set, namely a set $P \cap N$, where $P \in \mathscr{F}^X_{[0,n-1]}$ and $N \in \sigma(X_n)$, and show that in this case (6′) follows from (7). A direct calculation then shows that property (6′) holds for sets $C$ of the form $P \cap N$, where $P \in \mathscr{F}^X_{[0,n-1]}$ and $N \in \sigma(X_n)$. By means of monotone class arguments (Sect. 2, Chap. 2, Vol. 1) we deduce that property (6′) is valid for all sets $C \in \mathscr{F}^X_{[0,n]}$. Since the function $\mathrm{P}(F \mid X_n)$ is $\mathscr{F}^X_{[0,n]}$-measurable, (6′) implies that $\mathrm{P}(F \mid X_n)$ is a version of the conditional probability $\mathrm{P}(F \mid \mathscr{F}^X_{[0,n]})$.
Therefore it is natural to start with the proof of (6) for sets $F$ in the σ-algebras $\mathscr{F}^X_{[n+1,n+k]}$.
We will prove this by induction. If $k = 1$, then $\mathscr{F}^X_{[n+1,n+1]} = \sigma(X_{n+1})$, and (6) is the same as (2), which is assumed to hold.
Now let (6) hold for some $k \ge 1$; let us prove its validity for $k + 1$. To this end, take a set $F \in \mathscr{F}^X_{[n+1,n+k+1]}$ of the form $F = F_1 \cap F_2$, where $F_1 \in \mathscr{F}^X_{[n+1,n+k]}$ and $F_2 \in \sigma(X_{n+k+1})$. Then, using the induction assumption, we obtain
$$\begin{aligned}
\mathrm{P}(F \mid \mathscr{F}^X_{[0,n]}) &= \mathrm{E}(I_F \mid \mathscr{F}^X_{[0,n]}) = \mathrm{E}[I_{F_1 \cap F_2} \mid \mathscr{F}^X_{[0,n]}] \\
&= \mathrm{E}[I_{F_1}\, \mathrm{E}(I_{F_2} \mid \mathscr{F}^X_{[0,n+k]}) \mid \mathscr{F}^X_{[0,n]}] \\
&= \mathrm{E}[I_{F_1}\, \mathrm{E}(I_{F_2} \mid X_{n+k}) \mid \mathscr{F}^X_{[0,n]}] = \mathrm{E}[I_{F_1}\, \mathrm{E}(I_{F_2} \mid X_{n+k}) \mid X_n] \\
&= \mathrm{E}[I_{F_1}\, \mathrm{E}(I_{F_2} \mid \mathscr{F}^X_{[n,n+k]}) \mid X_n] = \mathrm{E}[\mathrm{E}(I_{F_1} I_{F_2} \mid \mathscr{F}^X_{[n,n+k]}) \mid X_n] \\
&= \mathrm{E}[I_{F_1} I_{F_2} \mid X_n] = \mathrm{P}(F_1 \cap F_2 \mid X_n) = \mathrm{P}(F \mid X_n). \qquad (9)
\end{aligned}$$
The fact that property (9) holds, as we have proved, for sets $F \in \mathscr{F}^X_{[n+1,n+k+1]}$ of the form $F = F_1 \cap F_2$ implies (Problem 1a) that this property holds for all sets $F \in \mathscr{F}^X_{[n+1,n+k+1]}$. Hence we conclude (Problem 1b) that (9) is valid also for $F$ in the algebra $\bigcup_{k=1}^{\infty} \mathscr{F}^X_{[n+1,n+k]}$, which implies in turn (Problem 1c) that this property is satisfied also for the σ-algebra $\sigma\big(\bigcup_{k=1}^{\infty} \mathscr{F}^X_{[n+1,n+k]}\big) = \mathscr{F}^X_{(n,\infty)}$.
Remark. The reasoning in this proof is based on the principle of appropriate sets
(starting the proof with sets of a “simple” structure) by applying subsequently the
results on monotone classes (Sect. 2, Chap. 2, Vol. 1). In what follows this method
will be repeatedly used (e.g., proofs of Theorems 2 and 3, which, in particular,
enable one to recover the parts of the foregoing proof of Theorem 1 that were stated
as Problems 1a, 1b, and 1c).
where
$$P_{n+1}(A) = \mathrm{P}\{\xi_{n+1} \in A\} \qquad (13)$$
and
$$B - X_n(\omega) = \{y : y + X_n(\omega) \in B\}, \qquad B \in \mathscr{B}(\mathbb{R}).$$
PROOF. We will prove (11) and (12) simultaneously.
For discrete probability spaces, similar results were proved in Sect. 12, Chap. 1,
Vol. 1, and it may appear that the proof here should be rather simple, too. But, as
will be seen from the subsequent proof, the present situation is more complicated.
Let $A$ be a set of the form $A = \{X_0 \in B_0,\ \xi_1 \in B_1, \ldots, \xi_n \in B_n\}$, where $B_i \in \mathscr{B}(\mathbb{R})$, $i = 0, 1, \ldots, n$. By the definition of the conditional probability $\mathrm{P}(X_{n+1} \in B \mid \mathscr{F}_n)(\omega)$ (Sect. 7, Chap. 2, Vol. 1),
$$\begin{aligned}
\int_A \mathrm{P}(X_{n+1} \in B \mid \mathscr{F}_n)(\omega)\, \mathrm{P}(d\omega) &= \int_A I_{\{X_{n+1} \in B\}}(\omega)\, \mathrm{P}(d\omega) \\
&= \mathrm{P}\{X_0 \in B_0, \xi_1 \in B_1, \ldots, \xi_n \in B_n, X_{n+1} \in B\} \\
&= \int_{B_0 \times \cdots \times B_n} P_{n+1}\big(B - (x_0 + x_1 + \cdots + x_n)\big)\, P_0(dx_0) \cdots P_n(dx_n) \\
&= \int_A P_{n+1}(B - X_n(\omega))\, \mathrm{P}(d\omega). \qquad (14)
\end{aligned}$$
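The identity $\mathrm{P}(X_{n+1} \in B \mid \mathscr{F}_n) = P_{n+1}(B - X_n)$ admits a quick empirical check. In the sketch below (our illustrative choice: the $\xi_i$ are uniform on $\{-1, +1\}$ and $X_0 = 0$) we condition on the value of $X_1$ and compare the conditional frequency of $\{X_2 \in B\}$ with the translated kernel $P_2(B - x)$.

```python
import random

def check_kernel(trials=40000, seed=7):
    # Empirical check of P(X_{n+1} in B | X_n = x) = P_{n+1}(B - x) for the
    # random walk X_n = X_0 + xi_1 + ... + xi_n with xi_i uniform on {-1, +1}.
    rng = random.Random(seed)
    B = {1, 2}                                  # a test set B in B(R)
    counts = {}                                 # x -> [#{X_2 in B}, #{X_1 = x}]
    for _ in range(trials):
        x1 = rng.choice((-1, 1))                # X_1 (with X_0 = 0)
        x2 = x1 + rng.choice((-1, 1))           # X_2
        c = counts.setdefault(x1, [0, 0])
        c[1] += 1
        c[0] += x2 in B
    # The kernel: P_2(B - x) = P{xi in B - x} for xi uniform on {-1, +1}
    kernel = lambda x: sum(0.5 for xi in (-1, 1) if xi + x in B)
    return {x: (c[0] / c[1], kernel(x)) for x, c in counts.items()}
```

For $B = \{1, 2\}$ the kernel gives $P_2(B - 1) = 1/2$ and $P_2(B - (-1)) = 0$, and the conditional frequencies reproduce these values up to sampling error.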
where the function Pn+1 (x; B), B ∈ E , x ∈ E, has the following properties (Defini-
tion 7, Sect. 7, Chap. 2, Vol. 1):
(a) For any x the set function Pn+1 (x, · ) is a measure on (E, E );
(b) For any B ∈ E the function Pn+1 ( · ; B) is E -measurable.
The functions Pn = Pn (x; B), n ≥ 1, are called transition functions (or Markov
kernels).
The case of special interest to us will be the one where all these transition functions are the same, $P_1 = P_2 = \cdots$, or, more precisely, where the conditional probabilities $\mathrm{P}(X_{n+1} \in B \mid X_n(\omega))$, $n \ge 0$, have a common version of the regular conditional distribution $P(x; B)$ such that (P-a.s.)
$$\mathrm{P}(X_{n+1} \in B \mid X_n(\omega)) = P(X_n(\omega); B), \qquad n \ge 0.$$
If such a version P = P(x; B) exists (in which case we can set all Pn = P,
n ≥ 0), then the Markov chain is called homogeneous (in time) with transition
function P = P(x; B), x ∈ E, B ∈ E .
The intuitive meaning of the homogeneity property of Markov chains is clear:
the corresponding system evolves homogeneously in the sense that the probabilistic
mechanisms governing the transitions of the system remain the same for all time
instants n ≥ 0. (In the theory of dynamical systems, systems with this property are
said to be conservative.)
Besides the transition probabilities P1 , P2 , . . . , or the transition probability P for
homogeneous chains, the important characteristic of Markov chains is the initial
distribution π = π(B), B ∈ E , i.e., the probability distribution π(B) = P{X0 ∈ B},
B ∈ E.
The set of objects (π, P1 , P2 , . . . ) completely determines the probabilistic prop-
erties of the sequence X = (Xn )n≥0 , since all the finite-dimensional distributions of
this sequence are given by the formulas
$$\mathrm{P}\{X_0 \in B\} = \pi(B), \qquad B \in \mathscr{E},$$
and
$$\mathrm{P}\{(X_0, X_1, \ldots, X_n) \in B\} = \int_{E \times \cdots \times E} I_B(x_0, x_1, \ldots, x_n)\, \pi(dx_0)\, P_1(x_0; dx_1) \cdots P_n(x_{n-1}; dx_n). \qquad (19)$$
244 8 Markov Chains
$$\begin{aligned}
\mathrm{P}\{X_0 \in B_0, X_1 \in B_1, \ldots, X_n \in B_n\}
&= \int_\Omega I_{\{X_0 \in B_0, \ldots, X_{n-1} \in B_{n-1}\}}(\omega)\, \mathrm{P}(X_n \in B_n \mid X_0(\omega), \ldots, X_{n-1}(\omega))\, \mathrm{P}(d\omega) \\
&= \int_\Omega I_{\{X_0 \in B_0, \ldots, X_{n-1} \in B_{n-1}\}}(\omega)\, \mathrm{P}(X_n \in B_n \mid X_{n-1}(\omega))\, \mathrm{P}(d\omega) \\
&= \int_\Omega I_{\{X_0 \in B_0, \ldots, X_{n-1} \in B_{n-1}\}}(\omega)\, P_n(X_{n-1}(\omega); B_n)\, \mathrm{P}(d\omega) \\
&= \int_{E \times \cdots \times E} I_{B_0 \times \cdots \times B_{n-1}}(x_0, \ldots, x_{n-1})\, P_n(x_{n-1}; B_n)\, \mathrm{P}\{X_0 \in dx_0, \ldots, X_{n-1} \in dx_{n-1}\} \\
&= \int_{E \times \cdots \times E} I_{B_0 \times \cdots \times B_n}(x_0, \ldots, x_n)\, P_n(x_{n-1}; dx_n)\, P_{n-1}(x_{n-2}; dx_{n-1}) \cdots P_1(x_0; dx_1)\, \pi(dx_0).
\end{aligned}$$
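The product structure of the finite-dimensional distributions is easy to verify by simulation. The following minimal sketch (an illustrative two-state homogeneous chain; the particular $\pi$ and $P$ are our choices) draws $X_0$ from $\pi$, each $X_{k+1}$ from $P(X_k; \cdot)$, and compares the empirical frequency of a cylinder event with the product formula of type (19).

```python
import random

def finite_dim_check(trials=50000, seed=3):
    # A two-state homogeneous chain: compare the empirical frequency of
    # {X_0 = 0, X_1 = 1, X_2 = 1} with pi(0) P(0; 1) P(1; 1), as in (19).
    pi = {0: 0.3, 1: 0.7}
    P = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}
    exact = pi[0] * P[0][1] * P[1][1]           # = 0.3 * 0.2 * 0.6
    rng = random.Random(seed)
    def draw(dist):
        return 0 if rng.random() < dist[0] else 1
    hits = 0
    for _ in range(trials):
        x0 = draw(pi)                           # initial distribution pi
        x1 = draw(P[x0])                        # one-step transitions
        x2 = draw(P[x1])
        hits += (x0, x1, x2) == (0, 1, 1)
    return hits / trials, exact
```

The empirical frequency agrees with the product $\pi(dx_0) P(x_0; dx_1) P(x_1; dx_2)$ evaluated on the cylinder, illustrating that $(\pi, P)$ completely determines the law of the sequence.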
Before proceeding to the proofs of these theorems, let us define first of all the measurable space
(Ω, F ) by setting (Ω, F ) = (E∞ , B(E∞ )), where E∞ = E × E × · · · , B(E∞ ) =
E ⊗ E ⊗ · · · ; in other words, we take the elementary events to be the “points”
ω = (x0 , x1 , . . . ), where xi ∈ E.
Define the flow (Fn )n≥0 by setting Fn = σ(x0 , x1 , . . . , xn ). The random vari-
ables Xn = Xn (ω) will be defined “canonically” by setting Xn (ω) = xn if ω =
(x0 , x1 , . . . ).
Ionescu Tulcea’s theorem states that for arbitrary measurable spaces (E, E ) (and
in particular for phase spaces under consideration) there exists a probability measure
Pπ on (Ω, F ) such that
$$\mathrm{P}_\pi\{(X_0, X_1, \ldots, X_n) \in B\} = \int_E \pi(dx_0) \int_E P_1(x_0; dx_1) \cdots \int_E I_B(x_0, \ldots, x_n)\, P_n(x_{n-1}; dx_n). \qquad (22)$$
We will prove this by using the principle of appropriate sets and the results on
monotone classes (Sect. 2, Chap. 2, Vol. 1).
As before, we take for appropriate sets the sets of a “simple” structure A ∈ Fn
of the form
A = {ω : X0 (ω) ∈ B0 , . . . , Xn (ω) ∈ Bn },
where Bi ∈ E , i = 0, 1, . . . , n, and let B ∈ E .
Then the construction of the measure $\mathrm{P}_\pi$ (see (22)) implies
$$\begin{aligned}
\int_A I_{\{X_{n+1} \in B\}}(\omega)\, \mathrm{P}_\pi(d\omega) &= \mathrm{P}_\pi\{X_0 \in B_0, \ldots, X_n \in B_n, X_{n+1} \in B\} \\
&= \int_{B_0} \pi(dx_0) \int_{B_1} P_1(x_0; dx_1) \cdots \int_{B_n} P_n(x_{n-1}; dx_n)\, P_{n+1}(x_n; B) \\
&= \int_A P_{n+1}(X_n(\omega); B)\, \mathrm{P}_\pi(d\omega). \qquad (25)
\end{aligned}$$
By the same arguments as in the proof of Theorem 2 (see the proof of (15) for sets $A \in \mathscr{F}_n$) we find that (25) is also fulfilled for all $A \in \mathscr{F}_n$, i.e., for sets of the form $A = \{\omega : (X_0(\omega), \ldots, X_n(\omega)) \in C\}$, where $C \in \mathscr{B}(E^{n+1})$.
Since, by the definition of conditional probabilities (Sect. 7, Chap. 2, Vol. 1),
$$\int_A I_{\{X_{n+1} \in B\}}(\omega)\, \mathrm{P}_\pi(d\omega) = \int_A \mathrm{P}_\pi(X_{n+1} \in B \mid \mathscr{F}_n)(\omega)\, \mathrm{P}_\pi(d\omega), \qquad (26)$$
and the functions $P_{n+1}(X_n(\omega); B)$ are $\mathscr{F}_n$-measurable, we obtain the required properties (23) and (24) from (25) and the "telescopic" property of conditional expectations (see H* in Subsection 4, Sect. 7, Chap. 2, Vol. 1).
7. Thus, with any given collection of distributions (π, P1 , P2 , . . . ), we can associate
a Markov chain (to be denoted by X π = (Xn , Pπ )n≥0 ) with initial distribution π and
transition probabilities P1 , P2 , . . . (i.e., a chain with properties (21), (23), and (24)).
This chain proceeds as follows. At the time instant $n = 0$ the initial state is chosen randomly according to the distribution $\pi$. If the initial value $X_0$ is equal to $x$, then at the first step the system moves from this state to a state $x_1$ according to the distribution $P_1(x; \cdot)$, and so on.
Therefore the initial distribution π acts only at time n = 0, while the subsequent
evolution of the system is determined by the transition probabilities P1 , P2 , . . . .
Consequently, if the random choice of two initial distributions π1 and π2 results in
the same state x, then the behavior of the system will be the same (in probabilistic
terms), being determined only by the transition probabilities P1 , P2 , . . . . This can
also be expressed as follows.
Let Px denote the distribution Pπ corresponding to the case where π is supported
at a single point x: π(dy) = δx (dy), i.e., π({x}) = 1, where {x} is a singleton, which
belongs to E by the assumption about the phase space (E, E ) (Subsection 2).
Then (22) implies (Problem 4) that for any $A \in \mathscr{B}(E^\infty)$ and $x \in E$ the probability $\mathrm{P}_x(A)$ is (for each $\pi$) a version of the conditional probability $\mathrm{P}_\pi(A \mid X_0 = x)$, i.e.,
$$\mathrm{P}_\pi(A \mid X_0 = x) = \mathrm{P}_x(A) \quad (\pi\text{-a.s.}).$$
These arguments gave rise to an approach according to which the main object of the "general theory of Markov processes" (see [21]) (with discrete time in the present case) is not a particular Markov chain $X^\pi = (X_n, \mathrm{P}_\pi)_{n \ge 0}$, but rather the whole family of chains $(X_n, \mathrm{P}_x)_{n \ge 0}$, $x \in E$, corresponding to all possible initial states.
1 Definitions and Basic Properties 247
$$\mathrm{P}_\pi\{(X_0, X_1, \ldots, X_n) \in B\} = \int_E \pi(dx_0) \int_E P(x_0; dx_1) \cdots \int_E I_B(x_0, x_1, \ldots, x_n)\, P(x_{n-1}; dx_n). \qquad (29)$$
Hence, by the Radon–Nikodym theorem (Sect. 6, Chap. 2, Vol. 1) and the definition of the conditional probabilities, we find that ($\pi$-a.s.)
$$\mathrm{P}_\pi(X_2 \in B_2 \mid X_0 = x) = \int_E P(x; dx_1)\, P(x_1; B_2), \qquad (31)$$
where $B_2 \in \mathscr{E}$.
P(Xn+1 ∈ B | X0 ∈ B0 , X1 ∈ B1 , . . . , Xn ∈ Bn ) = P(Xn+1 ∈ B | Xn ∈ Bn ),
$$\{\omega : \tau(\omega) = n\} \in \mathscr{F}_n^X.$$
According to the terminology used in Sect. 1, Chap. 7 (Definition 3), such a random variable is called a (finite) Markov or stopping time.
We will associate with the flow $(\mathscr{F}_n^X)_{n \ge 0}$ and the stopping time $\tau$ the σ-algebra
$$\mathscr{F}_\tau = \{A \in \mathscr{F} : A \cap \{\tau = n\} \in \mathscr{F}_n^X \text{ for all } n \ge 0\}.$$
Theorem 2. Suppose that the conditions of Theorem 1 are fulfilled and let $\tau = \tau(\omega)$ be a finite Markov time. Then the following strong Markov property holds:
$$\mathrm{E}_\pi(H \circ \theta_\tau \mid \mathscr{F}_\tau) = \mathrm{E}_{X_\tau} H \quad (\mathrm{P}_\pi\text{-a.s.}). \qquad (7)$$
Before we proceed to the proof, let us comment on how $\mathrm{E}_{X_\tau} H$ and $H \circ \theta_\tau$ must be understood. Let $\psi(x) = \mathrm{E}_x H$. (We pointed out in Subsection 1 that $\psi(x)$ is an $\mathscr{E}$-measurable function of $x$.) By $\mathrm{E}_{X_\tau} H$ we mean $\psi(X_\tau) = \psi(X_{\tau(\omega)}(\omega))$. As concerns $(H \circ \theta_\tau)(\omega)$, this is the random variable $(H \circ \theta_{\tau(\omega)})(\omega) = H(\theta_{\tau(\omega)}(\omega))$.
PROOF. Take a set $A \in \mathscr{F}_\tau$. As in Theorem 1, for the proof of (7) we must show that
$$\mathrm{E}_\pi(H \circ \theta_\tau;\, A) = \mathrm{E}_\pi(\mathrm{E}_{X_\tau} H;\, A). \qquad (8)$$
Consider the left-hand side. We have
$$\mathrm{E}_\pi(H \circ \theta_\tau;\, A) = \sum_{n=0}^{\infty} \mathrm{E}_\pi(H \circ \theta_\tau;\, A \cap \{\tau = n\}) = \sum_{n=0}^{\infty} \mathrm{E}_\pi(H \circ \theta_n;\, A \cap \{\tau = n\}). \qquad (9)$$
$$\mathrm{P}_\pi\big((X_\tau, X_{\tau+1}, \ldots) \in B \mid X_0, X_1, \ldots, X_\tau\big) = \mathrm{P}_{X_\tau}\{(X_0, X_1, \ldots) \in B\} \quad (\mathrm{P}_\pi\text{-a.s.}). \qquad (11)$$
Remark 1. If we analyze the proof of the strong Markov property (7), we can see that in fact the following property also holds. Let, for any $n \ge 0$, the real-valued functions $H_n = H_n(\omega)$ defined on $\Omega = E^\infty$ be $\mathscr{F}$-measurable ($\mathscr{F} = \mathscr{E}^\infty$) and uniformly bounded ($|H_n(\omega)| \le c$, $n \ge 0$, $\omega \in \Omega$). Then for any finite Markov time $\tau = \tau(\omega)$ ($\tau(\omega) < \infty$, $\omega \in \Omega$) the following form of the strong Markov property holds (Problem 2):
$$\mathrm{E}_\pi(H_\tau \circ \theta_\tau \mid \mathscr{F}_\tau) = \psi_\tau(X_\tau), \quad \text{where } \psi_n(x) = \mathrm{E}_x H_n. \qquad (12)$$
In other words, in this case (12) holds $\mathrm{P}_\pi$-a.s. on the set $\{\tau < \infty\}$.
3. EXAMPLE (related to the strong Markov property). When dealing with the law of the iterated logarithm we used an inequality (Lemma 1 in Sect. 4, Chap. 4; see also (14) in what follows) whose counterpart for the Brownian motion $B = (B_t)_{t \le T}$ is the equality $\mathrm{P}\{\max_{0 \le t \le T} B_t > a\} = 2\,\mathrm{P}\{B_T > a\} = \mathrm{P}\{|B_T| > a\}$ ([12, Chap. 3]).
Let ξ1 , ξ2 , . . . be a sequence of independent identically distributed random vari-
ables with symmetric (about zero) distribution. Let X0 = x ∈ R, Xm = X0 + (ξ1 +
· · · + ξm ), m ≥ 1. As before, we denote by Px the probability distribution of the
sequence X = (Xm )m≥0 with X0 = x. (The space Ω is assumed to be specified
coordinate-wise, ω = (x0 , x1 , . . . ) and Xm (ω) = xm .)
According to the (slightly modified) inequality (9) of Sect. 4, Chap. 4,
$$\mathrm{P}_0\Big\{\max_{0 \le m \le n} X_m > a\Big\} \le 2\,\mathrm{P}_0\{X_n > a\}. \qquad (14)$$
Consider the Markov time $\tau = \inf\{m \ge 0 : X_m > a\}$. (As usual, we set $\inf \varnothing = \infty$.) Let us demonstrate an "easy proof" of (14) using this
Markov time, which would be valid if such a (random) time could be treated in the
same manner as if it were nonrandom. We have (cf. proof of Lemma 1 in Sect. 4,
Chap. 4)
where we have used the seemingly “almost obvious” property that Xn − Xτ ∧n and
Xτ ∧n are independent, which is true for a deterministic time τ but is, in general,
false for a random τ (Problem 4). (This means that our “easy proof” is incorrect.)
Now we give a correct proof of (14) based on the strong Markov property (13).
Since {Xn > a} ⊆ {τ ≤ n}, we have
Hence
$$\mathrm{E}_0(H_\tau \circ \theta_\tau \mid \mathscr{F}_\tau) \ge \tfrac{1}{2} \quad (\text{P-a.s.}) \qquad (21)$$
on the set $\{\tau \le n\}$. Together with (19) and (20), this implies the required inequality (14).
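Inequality (14) is easy to observe numerically. The sketch below (our illustrative parameters) simulates a symmetric $\pm 1$ walk and compares the frequency of $\{\max_{0 \le m \le n} X_m > a\}$ with the bound $2\,\mathrm{P}_0\{X_n > a\}$.

```python
import random

def maximal_inequality(n=20, a=3, trials=30000, seed=5):
    # Monte Carlo illustration of (14) for a symmetric +/-1 walk from 0:
    # P_0(max_{0<=m<=n} X_m > a) <= 2 P_0(X_n > a).
    rng = random.Random(seed)
    hit_max = hit_end = 0
    for _ in range(trials):
        x, m = 0, 0
        for _ in range(n):
            x += rng.choice((-1, 1))
            m = max(m, x)
        hit_max += m > a                  # running-maximum event
        hit_end += x > a                  # terminal-value event
    return hit_max / trials, 2 * hit_end / trials
```

For $n = 20$, $a = 3$ the left-hand side is near $0.38$ while the bound is near $0.50$, so the inequality holds with a visible margin.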
4. If we compare the Kolmogorov–Chapman Eq. (13) with Eq. (38), both in Sect. 12,
Chap. 1, Vol. 1, we can observe that they are very similar. Therefore it is of interest
to analyze the common points and the differences in their statements and proofs.
(We restrict ourselves to homogeneous Markov chains with discrete state space E.)
Using (1) and (2), we obtain for $n \ge 1$, $1 \le k \le n$, and $i, j \in E$, that
$$\begin{aligned}
\mathrm{P}_i\{X_n = j\} &= \sum_{\alpha \in E} \mathrm{P}_i\{X_n = j, X_k = \alpha\} = \sum_{\alpha \in E} \mathrm{E}_i[I(X_n = j) I(X_k = \alpha)] \\
&= \sum_{\alpha \in E} \mathrm{E}_i[\mathrm{E}_i(I(X_n = j) I(X_k = \alpha) \mid \mathscr{F}_k)] \\
&= \sum_{\alpha \in E} \mathrm{E}_i[I(X_k = \alpha)\, \mathrm{E}_i(I(X_n = j) \mid \mathscr{F}_k)] \\
&\overset{(1)}{=} \sum_{\alpha \in E} \mathrm{E}_i[I(X_k = \alpha)\, \mathrm{E}_i(I(X_{n-k} = j) \circ \theta_k \mid \mathscr{F}_k)] \\
&\overset{(2)}{=} \sum_{\alpha \in E} \mathrm{E}_i[I(X_k = \alpha)\, \mathrm{E}_{X_k} I(X_{n-k} = j)] \\
&= \sum_{\alpha \in E} \mathrm{E}_i[I(X_k = \alpha)\, \mathrm{E}_\alpha I(X_{n-k} = j)] \\
&= \sum_{\alpha \in E} \mathrm{E}_i I(X_k = \alpha)\, \mathrm{E}_\alpha I(X_{n-k} = j) = \sum_{\alpha \in E} \mathrm{P}_i\{X_k = \alpha\}\, \mathrm{P}_\alpha\{X_{n-k} = j\}, \qquad (22)
\end{aligned}$$
which is exactly the Kolmogorov–Chapman Eq. (13) of Sect. 12, Chap. 1, Vol. 1, written there as
$$p_{ij}^{(n)} = \sum_{\alpha \in E} p_{i\alpha}^{(k)}\, p_{\alpha j}^{(n-k)}. \qquad (23)$$
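In matrix form (23) says $\mathbb{P}^{(n)} = \mathbb{P}^{(k)} \mathbb{P}^{(n-k)}$, which is immediate to check on a small example (the particular transition matrix below is our illustrative choice).

```python
def mat_mul(A, B):
    # (A B)_{ij} = sum_l A_{il} B_{lj} -- the Kolmogorov-Chapman composition
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    # n-step transition matrix P^(n) = P^n
    R = P
    for _ in range(n - 1):
        R = mat_mul(R, P)
    return R

def kc_holds(P, n, k, tol=1e-12):
    # Entrywise check of (23): p_ij^(n) = sum_a p_ia^(k) p_aj^(n-k)
    lhs = mat_pow(P, n)
    rhs = mat_mul(mat_pow(P, k), mat_pow(P, n - k))
    return all(abs(lhs[i][j] - rhs[i][j]) < tol
               for i in range(len(P)) for j in range(len(P)))
```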
Both in (22) and (23) the summation is done over the phase variable α ∈ E, whereas
in (38) of Sect. 12, Chap. 1, Vol. 1, the summation is over the time variable.
2 Generalized Markov and Strong Markov Properties 255
Having noticed this, assume that $\tau$ is a Markov time with values in $\{1, 2, \ldots\}$. Starting as in the derivation of (38) given earlier, we find that
$$\begin{aligned}
\mathrm{P}_i\{X_n = j\} &= \sum_{k=1}^{n} \mathrm{P}_i\{X_n = j,\, \tau = k\} + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\} \\
&= \sum_{k=1}^{n} \mathrm{E}_i[I(X_n = j) I(\tau = k)] + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\} \\
&= \sum_{k=1}^{n} \mathrm{E}_i[\mathrm{E}_i(I(X_n = j) I(\tau = k) \mid \mathscr{F}_k)] + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\} \\
&= \sum_{k=1}^{n} \mathrm{E}_i[I(\tau = k)\, \mathrm{E}_i(I(X_n = j) \mid \mathscr{F}_k)] + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\} \\
&= \sum_{k=1}^{n} \mathrm{E}_i[I(\tau = k)\, \mathrm{E}_i(I(X_{n-k} = j) \circ \theta_k \mid \mathscr{F}_k)] + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\} \\
&= \sum_{k=1}^{n} \mathrm{E}_i[I(\tau = k)\, \mathrm{E}_{X_k} I(X_{n-k} = j)] + \mathrm{P}_i\{X_n = j,\, \tau \ge n+1\}. \qquad (24)
\end{aligned}$$
Consider, for instance, the first-passage time
$$\tau_j = \min\{1 \le k \le n : X_k = j\},$$
with the condition that $\tau_j = n + 1$ if the set $\{\,\cdot\,\} = \varnothing$. In this case, (24) simplifies to
$$\mathrm{P}_i\{X_n = j\} = \sum_{k=1}^{n} \mathrm{E}_i\big(I(\tau_j = k)\, \mathrm{E}_{X_{\tau_j}} I(X_{n-k} = j)\big) = \sum_{k=1}^{n} \mathrm{E}_i\big(I(\tau_j = k)\, \mathrm{E}_j I(X_{n-k} = j)\big) = \sum_{k=1}^{n} \mathrm{P}_i\{\tau_j = k\}\, \mathrm{P}_j\{X_{n-k} = j\},$$
i.e., with $f_{ij}^{(k)} = \mathrm{P}_i\{\tau_j = k\}$,
$$p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)}\, p_{jj}^{(n-k)}. \qquad (25)$$
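The first-passage decomposition (25) can be checked directly: compute the $f_{ij}^{(k)}$ by the "taboo" recursion $f_{ij}^{(k)} = \sum_{l \ne j} p_{il} f_{lj}^{(k-1)}$ (first step to some $l \ne j$, then first passage in $k - 1$ steps) and compare both sides of (25). The chain below is our illustrative choice.

```python
def first_passage_check():
    # Verify (25): p_ij^(n) = sum_{k=1}^n f_ij^(k) p_jj^(n-k), with
    # f_ij^(k) = P_i{tau_j = k} computed by the taboo recursion.
    P = [[0.5, 0.5, 0.0],
         [0.2, 0.3, 0.5],
         [0.3, 0.4, 0.3]]
    i, j, n = 0, 2, 6
    m = len(P)
    def p_n(steps, a, b):
        # a -> b transition probability in the given number of steps
        row = [1.0 if x == a else 0.0 for x in range(m)]
        for _ in range(steps):
            row = [sum(row[l] * P[l][y] for l in range(m)) for y in range(m)]
        return row[b]
    # f[k][a] = P_a{tau_j = k}, k = 1..n
    f = {1: [P[a][j] for a in range(m)]}
    for k in range(2, n + 1):
        f[k] = [sum(P[a][l] * f[k - 1][l] for l in range(m) if l != j)
                for a in range(m)]
    lhs = p_n(n, i, j)
    rhs = sum(f[k][i] * p_n(n - k, j, j) for k in range(1, n + 1))
    return abs(lhs - rhs) < 1e-12
```

This is exactly the summation over the time variable, in contrast to the summation over the phase variable in the Kolmogorov–Chapman equation.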
Equation (24) makes it possible to also derive other useful formulas involving summation with respect to the time variable (in contrast to the Kolmogorov–Chapman equation). For example, consider the Markov time
$$\tau(\alpha) = \min\{1 \le k \le n : X_k = \alpha(k)\},$$
where the (deterministic) function $\alpha = \alpha(k)$, $1 \le k \le n$, and the Markov chain are such that $\mathrm{P}_i\{\tau(\alpha) \le n\} = 1$ (for fixed $i$ and $n$). Then (24) implies that
$$\mathrm{P}_i\{X_n = j\} = \sum_{k=1}^{n} \mathrm{E}_i[I(\tau(\alpha) = k)\, \mathrm{E}_{X_{\tau(\alpha)}} I(X_{n-k} = j)] = \sum_{k=1}^{n} \mathrm{E}_i I(\tau(\alpha) = k)\, \mathrm{E}_{\alpha(k)} I(X_{n-k} = j),$$
i.e.,
$$\mathrm{P}_i\{X_n = j\} = \sum_{k=1}^{n} \mathrm{P}_i\{\tau(\alpha) = k\}\, \mathrm{P}_{\alpha(k)}\{X_{n-k} = j\}$$
(cf. (23)).
5. Problems
1. Prove that the function ψ(x) = Ex H introduced in the remark in Subsection 1
is E -measurable.
2. Prove (12).
3. Prove (13).
4. Is the independence property of Xn − Xτ ∧n and Xτ ∧n in the example in Subsec-
tion 3 true?
5. Prove (23).
Let us emphasize that definition (1) does not, in general, presume that q = q(A)
is a probability measure (q(E) = 1).
If this is a probability measure, it is said to be a stationary or invariant distribution. The meaning of this terminology is clear: If we take q for the initial
distribution π, i.e., assume that Pq {X0 ∈ A} = q(A), then (1) will imply that
Pq {Xn ∈ A} = q(A) for any n ≥ 1, i.e., this distribution remains invariant in
time.
It is easy to come up with an example where there is no stationary distribution
q = q(A), but there are stationary measures.
EXAMPLE. Let $X = (X_n)_{n \ge 0}$ be the Markov chain generated by Bernoulli trials, i.e., $X_{n+1} = X_n + \xi_{n+1}$, where $\xi_1, \xi_2, \ldots$ is a sequence of independent identically distributed random variables with $\mathrm{P}\{\xi_n = +1\} = p$, $\mathrm{P}\{\xi_n = -1\} = q$, $p + q = 1$. Let $X_0 = x$, $x \in \{0, \pm 1, \ldots\}$. It is clear that the transition function here is
$$P(x; \{x + 1\}) = p, \qquad P(x; \{x - 1\}) = q.$$
It is not hard to verify that one of the solutions to (1) is the measure $q(A)$ with $q(\{x\}) = 1$ for all $x \in \{0, \pm 1, \ldots\}$. If $p \ne q$ ($p, q > 0$), then $q(A)$ with $q(\{x\}) = (p/q)^x$ is another invariant measure. It is obvious that neither of them is a probability measure, and there is no invariant probability measure here.
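For this walk the stationarity equation (1) reduces, at each state $x$, to $m(\{x\}) = m(\{x-1\})\,p + m(\{x+1\})\,q$, and both measures named in the example satisfy it; a minimal numerical check (range of states and parameters are our choices):

```python
def invariance_check(p=0.3):
    # Check m({x}) = m({x-1}) p + m({x+1}) q for the two invariant measures
    # of the Bernoulli walk: the counting measure and m({x}) = (p/q)^x.
    q = 1 - p
    for m in (lambda x: 1.0, lambda x: (p / q) ** x):
        for x in range(-10, 11):
            lhs = m(x)
            rhs = m(x - 1) * p + m(x + 1) * q
            assert abs(lhs - rhs) <= 1e-9 * max(1.0, abs(lhs))
    return True
```

For the counting measure the identity is just $p + q = 1$; for $(p/q)^x$ it follows from $(p/q)^{x-1} p + (p/q)^{x+1} q = (p/q)^x (q + p)$.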
This simple example shows that the existence of a stationary (invariant) distribu-
tion requires certain assumptions about Markov chains.
The main interest in the problem of convergence of transition probabilities
P(n) (x; A) as n → ∞ lies in the existence of a limit that is independent of the initial
state x. We must bear in mind that there may exist no limiting distribution at all, for
example, it may happen that lim P(n) (x; A) = 0 for any A ∈ E and any initial state
x ∈ E. For example, take p = 1 in the preceding example, i.e., consider the deter-
ministic motion to the right. (See also Examples 4 and 5 in Sect. 8; cf. Problem 6 in
Sect. 5.)
Establishing the conditions for the existence of stationary (invariant) distributions
and the convergence of transition probabilities (and obtaining their properties) for
arbitrary phase spaces (E, E ) is a very difficult problem (e.g., [9]). However, in the
case of a countable state space (for “countable Markov chains”), interesting results
in this area admit fairly transparent formulations. They will be presented in Sects. 6
and 7. But before that we will give a detailed classification of the states of countable
Markov chains according to the algebraic and asymptotic properties of transition
probabilities.
Let us point out that the questions concerning stationary distributions and
the existence of the limits limn P(n) (x; A) are closely related. Indeed, if the limit
limn P(n) (x; A) (= ν(A)) exists, does not depend on x, and is a measure (in A ∈ E ),
then, letting $n \to \infty$ in the Kolmogorov–Chapman equation
$$P^{(n+1)}(x; A) = \int_E P^{(n)}(x; dy)\, P(y; A),$$
we find that $\nu(A) = \int_E \nu(dy)\, P(y; A)$, i.e., the limit $\nu$ satisfies the stationarity equation (1).
Remark. The term “ergodicity” used here appeared already in Chap. 5 (ergodicity
as a metric transitivity property, the Birkhoff–Khinchin ergodic theorem). Formally,
these terms are related to different objects, but their common feature is that they
reflect the asymptotic behavior of various probabilistic characteristics as the time
parameter goes to infinity.
3. Problems.
1. Give examples of Markov chains for which the limits $\pi_j = \lim_n p_{ij}^{(n)}$ exist and (a) are independent of the initial state $i$; (b) depend on $i$.
2. Give examples of ergodic and nonergodic Markov chains.
3. Give examples where the stationary distribution is not ergodic.
4 Classification of States of Markov Chains 259
1. We will assume that the Markov chain under consideration has a countable set of states $E = \{1, 2, \ldots\}$ and transition probabilities $p_{ij}$, $i, j \in E$. The matrix of these probabilities will be denoted by $\mathbb{P} = \|p_{ij}\|$ or, in expanded form,
$$\mathbb{P} = \begin{Vmatrix} p_{11} & p_{12} & p_{13} & \cdots \\ p_{21} & p_{22} & p_{23} & \cdots \\ \cdots & \cdots & \cdots & \cdots \\ p_{i1} & p_{i2} & p_{i3} & \cdots \\ \cdots & \cdots & \cdots & \cdots \end{Vmatrix}.$$
State 1 in this example is such that the particle can enter into it (from state 0),
but cannot leave it.
Consider the graph in Fig. 36, from which one can easily recover the transition
matrix P. It is clear from this graph that there are three states (the left-hand part of
the figure) such that, leaving any of them, there is no way to return.
With regard to the future behavior of the "particle" wandering in accordance with this graph, these three states are inessential, because the particle can leave them but cannot return anymore.
These "inessential" states are of little interest, and we can discard them to focus our attention on the classification of the remaining "essential" states. (This descriptive definition of "inessential" and "essential" states can be formulated precisely in terms of the transition probabilities $p_{ij}^{(n)}$, $i, j \in E$, $n \ge 1$; Problem 1.)
2. To classify essential states or groups of such states, we need some definitions.
PROOF. By the definition of the equivalence relation (in this case “↔”), we must
verify that it is reflexive (i ↔ i), symmetric (if i ↔ j, then j ↔ i), and transitive (if
i ↔ j and j ↔ k, then i ↔ k).
The first two properties follow directly from the definition of communicating states. Transitivity follows from the Kolmogorov–Chapman equation: if $p_{ij}^{(n)} > 0$ and $p_{jk}^{(m)} > 0$, then
$$p_{ik}^{(n+m)} = \sum_{l \in E} p_{il}^{(n)}\, p_{lk}^{(m)} \ge p_{ij}^{(n)}\, p_{jk}^{(m)} > 0,$$
i.e., $i \to k$.
It is clear that this chain has two indecomposable classes, E1 = {1, 2}, E2 =
{3, 4, 5}, and the investigation of its properties reduces to the investigation of the
two separate chains with state spaces E1 and E2 and transition matrices P1 and P2 .
Now let us consider any indecomposable class E, for example, the one sketched
in Fig. 37.
Observe that in this case a return to each state is possible only after an even
number of steps, and a transition to an adjacent state after an odd number. The
transition matrix has a block structure:
$$P = \begin{pmatrix} 0 & 0 & 1/2 & 1/2 \\ 0 & 0 & 1/2 & 1/2 \\ 1/2 & 1/2 & 0 & 0 \\ 1/2 & 1/2 & 0 & 0 \end{pmatrix}.$$
Therefore it is clear that the class E = {1, 2, 3, 4} separates into two subclasses
C0 = {1, 2} and C1 = {3, 4} with the following cyclic property: After one step
from C0 the particle necessarily enters C1 , and from C1 it returns to C0 .
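This cyclic alternation is easy to confirm numerically. The sketch below (plain Python; the matrix is copied from the display above, with states 1–4 relabeled 0–3) checks that one step from C0 moves all mass to C1, while two steps return it to C0.

```python
# The 4x4 cyclic transition matrix from the text, states 1..4
# relabeled 0..3, with subclasses C0 = {0, 1} and C1 = {2, 3}.
P = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

def mat_mul(A, B):
    """Plain matrix product, kept dependency-free."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P2 = mat_mul(P, P)

# One step from C0 puts all mass on C1 ...
mass_C1_from_C0 = P[0][2] + P[0][3]
# ... while after two steps the particle is back in its own subclass:
mass_C0_after_two = P2[0][0] + P2[0][1]
print(mass_C1_from_C0, mass_C0_after_two)  # 1.0 1.0
```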
3. This example suggests that, in general, it is possible to give a classification of
indecomposable classes into cyclic subclasses.
To this end we will need some definitions and a fact from number theory.
262 8 Markov Chains
where GCD(M_ϕ) is the greatest common divisor of the set M_ϕ of indices n ≥ 1 for which ϕ_n > 0; if ϕ_n = 0 for all n ≥ 1, then M_ϕ = ∅, and GCD(M_ϕ) is set to be zero.
The following elementary result of number theory will be useful in the sequel for
the classification of states in terms of the cyclicity property.
Now we state a theorem showing that the periods of the states of an indecompos-
able class are of the same “type.”
PROOF. Let i, j ∈ E*. Then there are k and l such that $p_{ij}^{(k)} > 0$ and $p_{ji}^{(l)} > 0$. But then, by the Kolmogorov–Chapman equation,
$$p_{ii}^{(k+l)} = \sum_{a \in E} p_{ia}^{(k)} p_{ai}^{(l)} \ge p_{ij}^{(k)} p_{ji}^{(l)} > 0.$$
Suppose now that the wandering particle whose evolution is driven by matrix P
starts from a state in the subclass C0 . Then at each time n = p + kd this particle will
be (by the definition of subclasses C0 , C1 , . . . , Cd−1 ) in the set Cp .
Therefore, with each set C_p of states we can associate a new Markov chain with transition matrix $\|p_{ij}^{(d)}\|$, i, j ∈ C_p. This new chain is indecomposable and aperiodic.
Thus, taking into account the foregoing classification (into inessential and essen-
tial states, indecomposable classes and cyclic subclasses; see Fig. 39), we can draw
the following conclusion:
To investigate the limiting behavior of the transition probabilities $p_{ij}^{(n)}$, n ≥ 1, i, j ∈ E, which determine the motion of the “Markov particle,” we can restrict our attention to the case where the phase space E itself is a single indecomposable and aperiodic class of states.
In this case, the Markov chain X = (Xn )n≥0 itself with such a phase space and
the matrix of transition probabilities P is called indecomposable and aperiodic.
5 Classification of States of Markov Chains 265
5. Problems
1. Give a precise formulation, in terms of the transition probabilities $p_{ij}^{(n)}$, i, j ∈ E, n ≥ 1, of the descriptive definition of inessential and essential states stated at the end of Subsection 1.
2. Let P be the matrix of transition probabilities of an indecomposable Markov chain with finitely many states, and let $P^2 = P$. Explore the structure of P.
3. Let P be the matrix of transition probabilities of a finite Markov chain X = (X_n)_{n≥0}. Let σ_1, σ_2, . . . be a sequence of independent identically distributed nonnegative integer-valued random variables independent of X, and let τ_0 = 0, τ_n = σ_1 + · · · + σ_n, n ≥ 1. Show that the sequence $\tilde{X} = (\tilde{X}_n)_{n \ge 0}$ with $\tilde{X}_n = X_{\tau_n}$ is a Markov chain. Find the matrix $\tilde{P}$ of transition probabilities for this chain. Show that if states i and j communicate for the chain X, they do so for $\tilde{X}$.
4. Consider a Markov chain with two states, E = {0, 1}, and the matrix of transition probabilities
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}, \quad 0 < \alpha < 1,\ 0 < \beta < 1.$$
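For a two-state chain of this form the limiting behavior of $p_{ij}^{(n)}$ can be explored numerically; the values α = 0.3 and β = 0.6 below are sample choices of ours, not taken from the text, and the numerical limit is compared with the solution of π = πP.

```python
# Numerical sketch for a two-state chain; alpha and beta are sample values.
alpha, beta = 0.3, 0.6
P = [[alpha, 1 - alpha],
     [1 - beta, beta]]

def step(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pn = P
for _ in range(200):          # high power of P; convergence is geometric here
    Pn = step(Pn, P)

# Both rows of P^n approach the same limit (pi_0, pi_1), which solves
# pi = pi P; for this chain pi_0 = (1 - beta) / (2 - alpha - beta).
pi0 = (1 - beta) / (2 - alpha - beta)
print(Pn[0][0], Pn[1][0], pi0)
```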
1. Let X = (Xn )n≥0 be a homogeneous Markov chain with countable state space
E = {1, 2, . . . } and transition probabilities pij = Pi {X1 = j}, i, j ∈ E.
Let
$$f_{ii}^{(n)} = P_i\{X_n = i,\ X_k \ne i,\ 1 \le k \le n-1\} \tag{1}$$
and (for i ≠ j)
$$f_{ij}^{(n)} = P_i\{X_n = j,\ X_k \ne j,\ 1 \le k \le n-1\}. \tag{2}$$
It is clear that $f_{ii}^{(n)}$ is the probability of first return to state i at time n, while $f_{ij}^{(n)}$ is the probability of first arrival at state j at time n, provided that X_0 = i.
If we set
$$\sigma_i(\omega) = \min\{ n \ge 1 : X_n(\omega) = i \} \tag{3}$$
with σ_i(ω) = ∞ when the set in (3) is empty, then the probabilities $f_{ii}^{(n)}$ and $f_{ij}^{(n)}$ can be represented as
$$f_{ii}^{(n)} = P_i\{\sigma_i = n\}, \qquad f_{ij}^{(n)} = P_i\{\sigma_j = n\}. \tag{4}$$
(b) The state i ∈ E is transient if and only if either of the following two conditions is satisfied:
$$P_i\{X_n = i \text{ i.o.}\} = 0 \quad\text{or}\quad \sum_n p_{ii}^{(n)} < \infty.$$
Therefore, according to this theorem,
$$f_{ii} = 1 \iff P_i\{X_n = i \text{ i.o.}\} = 1 \iff \sum_n p_{ii}^{(n)} = \infty, \tag{7}$$
$$f_{ii} < 1 \iff P_i\{X_n = i \text{ i.o.}\} = 0 \iff \sum_n p_{ii}^{(n)} < \infty. \tag{8}$$
Remark. Recall that, according to Table 2.1 in Sect. 1, Chap. 2, Vol. 1, the event {X_n = i i.o.} is the set of outcomes ω for which X_n(ω) = i for infinitely many indices n. If we use the notation $A_n = \{\omega : X_n(\omega) = i\}$, then $\{X_n = i \text{ i.o.}\} = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$; see the table mentioned earlier.
follows from the Borel–Cantelli lemma, since $p_{ii}^{(n)} = P_i\{X_n = i\}$ (see statement (a) of this lemma, Sect. 10, Chap. 2, Vol. 1).
Let us show that
$$f_{ii} = 1 \iff \sum_n p_{ii}^{(n)} = \infty. \tag{10}$$
The Markov property and homogeneity imply that for any collections (i1 , . . . , ik )
and (j1 , . . . , jn )
$$P_i\{(X_1, \ldots, X_k) = (i_1, \ldots, i_k),\ (X_{k+1}, \ldots, X_{k+n}) = (j_1, \ldots, j_n)\}$$
$$= P_i\{(X_1, \ldots, X_k) = (i_1, \ldots, i_k)\}\, P_{i_k}\{(X_1, \ldots, X_n) = (j_1, \ldots, j_n)\}.$$
This implies at once that (cf. derivation of (38) in Sect. 12, Chap. 1, Vol. 1 and (25)
in Sect. 2 of this chapter)
$$p_{ij}^{(n)} = P_i\{X_n = j\} = \sum_{k=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-k-1} \ne j,\ X_{n-k} = j,\ X_n = j\}$$
$$= \sum_{k=0}^{n-1} P_i\{X_1 \ne j, \ldots, X_{n-k-1} \ne j,\ X_{n-k} = j\}\, P_j\{X_k = j\}$$
$$= \sum_{k=0}^{n-1} f_{ij}^{(n-k)} p_{jj}^{(k)} = \sum_{k=1}^{n} f_{ij}^{(k)} p_{jj}^{(n-k)}.$$
Thus
$$p_{ij}^{(n)} = \sum_{k=1}^{n} f_{ij}^{(k)} p_{jj}^{(n-k)}. \tag{11}$$
Letting j = i, we find that (with $p_{ii}^{(0)} = 1$)
$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \sum_{n=1}^{\infty} \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)} = \sum_{k=1}^{\infty} f_{ii}^{(k)} \sum_{n=k}^{\infty} p_{ii}^{(n-k)} = f_{ii} \sum_{n=0}^{\infty} p_{ii}^{(n)}$$
$$= f_{ii} \Big( 1 + \sum_{n=1}^{\infty} p_{ii}^{(n)} \Big). \tag{12}$$
Let now $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$. Then
$$\sum_{n=1}^{N} p_{ii}^{(n)} = \sum_{n=1}^{N} \sum_{k=1}^{n} f_{ii}^{(k)} p_{ii}^{(n-k)} = \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{n=k}^{N} p_{ii}^{(n-k)} \le \sum_{k=1}^{N} f_{ii}^{(k)} \sum_{l=0}^{N} p_{ii}^{(l)},$$
hence
$$f_{ii} = \sum_{k=1}^{\infty} f_{ii}^{(k)} \ge \sum_{k=1}^{N} f_{ii}^{(k)} \ge \frac{\sum_{n=1}^{N} p_{ii}^{(n)}}{\sum_{l=0}^{N} p_{ii}^{(l)}} \to 1, \quad N \to \infty.$$
Thus,
$$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty \implies f_{ii} = 1. \tag{14}$$
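The decomposition (11), on which the computations above rest, can be verified numerically on a small chain. In the sketch below the 3×3 matrix is a sample of our choosing; the n-step probabilities come from matrix powers, and the first-passage probabilities from the recursion $f_{aj}^{(1)} = p_{aj}$, $f_{aj}^{(n)} = \sum_{k \ne j} p_{ak} f_{kj}^{(n-1)}$.

```python
# Sample 3-state chain (our choice, not from the text), used to check
#   p_ij^(n) = sum_{k=1..n} f_ij^(k) * p_jj^(n-k),   with p_jj^(0) = 1.
P = [[0.2, 0.5, 0.3],
     [0.4, 0.1, 0.5],
     [0.3, 0.3, 0.4]]
S = range(3)
i, j, N = 0, 2, 12

# n-step transition probabilities: powers[n] = P^n (powers[0] = identity).
powers = [[[1.0 if a == b else 0.0 for b in S] for a in S]]
for _ in range(N):
    A = powers[-1]
    powers.append([[sum(A[a][k] * P[k][b] for k in S) for b in S] for a in S])

# First-passage probabilities: f[n-1][a] = f_aj^(n), via
# f_aj^(1) = p_aj and f_aj^(n) = sum_{k != j} p_ak * f_kj^(n-1).
f = [[P[a][j] for a in S]]
for n in range(2, N + 1):
    prev = f[-1]
    f.append([sum(P[a][k] * prev[k] for k in S if k != j) for a in S])

# Largest deviation in identity (11) over n = 1..N.
err = max(abs(powers[n][i][j]
              - sum(f[k - 1][i] * powers[n - k][j][j]
                    for k in range(1, n + 1)))
          for n in range(1, N + 1))
print(err)  # zero up to rounding
```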
These properties are easily comprehensible from an intuitive point of view. For
example, if fii = 1, then Pi {σi < ∞} = 1, i.e., the “particle” sooner or later
will return to the same state i from where it started its motion. But then, by the
strong Markov property, the “life of the particle” starts anew from this (random)
time. Continuing this reasoning, we obtain that the events {Xn = i} will occur for
infinitely many indices n, i.e., Pi {Xn = i i.o.} = 1.
Let us give a formal proof of (17) and (18). For a given state i ∈ E, consider the
probability that the number of returns to i is greater than or equal to m. We claim
that this probability is equal to (fii )m .
Indeed, for m = 1 this follows from the definition of fii . Suppose that our claim
is true for m − 1. We will show that the probability of interest is then equal to (fii )m .
By the strong Markov property (see (8) in Sect. 2) and since {σi = k} ∈ Fσi , we
find
$$P_i(\text{the number of returns to } i \text{ is at least } m)$$
$$= \sum_{k=1}^{\infty} P_i(\sigma_i = k \text{ and there are at least } m-1 \text{ returns to } i \text{ after time } k)$$
$$= \sum_{k=1}^{\infty} P_i\{\sigma_i = k\}\, P_i(\text{at least } m-1 \text{ of } X_{\sigma_i+1}, X_{\sigma_i+2}, \ldots \text{ are equal to } i \mid \sigma_i = k)$$
$$= \sum_{k=1}^{\infty} P_i\{\sigma_i = k\}\, P_i(\text{at least } m-1 \text{ of } X_1, X_2, \ldots \text{ are equal to } i)$$
$$= \sum_{k=1}^{\infty} f_{ii}^{(k)} (f_{ii})^{m-1} = f_{ii}\,(f_{ii})^{m-1} = (f_{ii})^m.$$
Using the notation A = {An i.o.} (= lim sup An ), where An = {Xn = i}, we see
from (19) that Pi (A) obeys the “0 or 1 law,” i.e., Pi (A) can take only two values 0
or 1. (Note that this property does not follow directly from statements (a) and (b) of
the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1) since the events An , n ≥ 1, are,
in general, dependent.)
Equation (19) and the property that Pi (A) can take only the values 0 and 1 imply
the required implications in (17) and (18).
2. The theorem just proved implies the following simple, but important, property of
transient states.
is finite or infinite. (Recall that, according to (1), $f_{ii}^{(n)}$ is the probability of first return to i in exactly n steps.)
and null if
$$\mu_i = \sum_{n=1}^{\infty} n f_{ii}^{(n)} = \infty. \tag{24}$$
Hence, according to this definition, the first return to a null (recurrent) state requires, on average, infinite time, whereas the average time of first return to a positive (recurrent) state is finite.
4. The following figure illustrates the classification of states of a Markov chain in
terms of recurrence and transience, and positive and null recurrence (Fig. 40).
Then
$$u_n \to \mu^{-1}$$
as n → ∞, where $\mu = \sum_{n=1}^{\infty} n \varphi_n$.
$$p_{jj}^{(n)} \to \frac{1}{\mu_j}, \quad n \to \infty. \tag{28}$$
With
$$u_k = p_{jj}^{(k)}, \qquad \varphi_k = f_{jj}^{(k)}, \tag{30}$$
(29) becomes
$$u_n = \varphi_1 u_{n-1} + \varphi_2 u_{n-2} + \cdots + \varphi_n u_0,$$
which is exactly the recurrence formula (27).
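The renewal-type convergence u_n → 1/μ asserted by Lemma 1 can be watched numerically; the distribution (φ_1, φ_2, φ_3) below is a sample aperiodic choice of ours, and u_n is generated by the recurrence just displayed.

```python
# Numerical sketch of the renewal theorem: for an aperiodic distribution
# (phi_n), the solution of u_n = sum_k phi_k u_{n-k}, u_0 = 1, converges
# to 1/mu with mu = sum n * phi_n.  The phi below is a sample choice.
phi = {1: 0.2, 2: 0.5, 3: 0.3}           # GCD of the support is 1
mu = sum(n * p for n, p in phi.items())  # mu = 2.1

u = [1.0]
for n in range(1, 400):
    u.append(sum(phi.get(k, 0.0) * u[n - k] for k in range(1, n + 1)))

print(u[-1], 1 / mu)  # the two values agree for large n
```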
The required result (28) will follow directly from Lemma 1 if we show that the period $d_f(j)$ of the sequence $(f_{jj}^{(1)}, f_{jj}^{(2)}, \ldots)$ is equal to 1, provided that the period of $(p_{jj}^{(1)}, p_{jj}^{(2)}, \ldots)$ is 1.
PROOF. Let
$$M = \{ n : p_{jj}^{(n)} > 0 \} \quad\text{and}\quad M_f = \{ n : f_{jj}^{(n)} > 0 \}.$$
Since $M_f \subseteq M$, we have
$$\operatorname{GCD}(M) \le \operatorname{GCD}(M_f),$$
i.e., $d(j) \le d_f(j)$.
The reverse inequality follows from the probabilistic meaning of $p_{jj}^{(n)}$ and $f_{jj}^{(n)}$, n ≥ 1. If the “particle” leaving state j arrives at this state again in n steps ($p_{jj}^{(n)} > 0$), this means that it returned to j for the first time in $k_1$ steps ($f_{jj}^{(k_1)} > 0$), then in $k_2$ steps ($f_{jj}^{(k_2)} > 0$), . . . , and finally in $k_l$ steps ($f_{jj}^{(k_l)} > 0$), so that $n = k_1 + k_2 + \cdots + k_l$. The number $d_f(j)$ divides each of $k_1, k_2, \ldots, k_l$; hence it divides n. But d(j) is the greatest common divisor of all n with $p_{jj}^{(n)} > 0$, and $d_f(j)$ is one of these common divisors. Hence $d(j) \ge d_f(j)$.
Thus $d(j) = d_f(j)$. This, by the way, means that instead of defining the period d(j) of a state j by the formula $d(j) = \operatorname{GCD}\{ n \ge 1 : p_{jj}^{(n)} > 0 \}$, we could equally well define it by $d(j) = \operatorname{GCD}\{ n \ge 1 : f_{jj}^{(n)} > 0 \}$.
The proof of Lemma 2 is completed.
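A small numerical check of Lemma 2: for a sample chain of period 2 (the block matrix from Sect. 4, states relabeled 0–3), the GCD of {n : p_jj^(n) > 0} and the GCD of {n : f_jj^(n) > 0} coincide.

```python
from math import gcd
from functools import reduce

# Sample period-2 chain (the 4x4 cyclic matrix, our relabeling 0..3).
P = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.5, 0.5],
     [0.5, 0.5, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0]]
S, j, N = range(4), 0, 20

# Matrix powers for p_jj^(n).
powers = [[[1.0 if a == b else 0.0 for b in S] for a in S]]
for _ in range(N):
    A = powers[-1]
    powers.append([[sum(A[a][k] * P[k][b] for k in S) for b in S] for a in S])

# First-passage probabilities f[n-1][a] = f_aj^(n).
f = [[P[a][j] for a in S]]
for n in range(2, N + 1):
    prev = f[-1]
    f.append([sum(P[a][k] * prev[k] for k in S if k != j) for a in S])

M = [n for n in range(1, N + 1) if powers[n][j][j] > 1e-15]
Mf = [n for n in range(1, N + 1) if f[n - 1][j] > 1e-15]
d = reduce(gcd, M)
df = reduce(gcd, Mf)
print(d, df)  # both equal 2 for this chain
```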
Now we proceed to the proof of (25) for i ≠ j. Rewrite (11) in the form
$$p_{ij}^{(n)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} p_{jj}^{(n-k)}, \tag{32}$$
where we set $p_{jj}^{(l)} = 0$ for l < 0.
Since $p_{jj}^{(n)} \to \frac{1}{\mu_j}$ here and $\sum_{k=1}^{\infty} f_{ij}^{(k)} \le 1$, we have, by the dominated convergence theorem (Theorem 3 in Sect. 6, Chap. 2, Vol. 1),
$$\lim_n \sum_{k=1}^{\infty} f_{ij}^{(k)} p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \lim_n p_{jj}^{(n-k)} = \sum_{k=1}^{\infty} f_{ij}^{(k)} \frac{1}{\mu_j} = \frac{f_{ij}}{\mu_j}. \tag{33}$$
Hence, by (32),
$$\lim_n p_{ij}^{(n)} = \frac{f_{ij}}{\mu_j}, \tag{34}$$
Finally, we will show that under the additional assumption i ↔ j (i.e. i, j belong
to the same indecomposable class of communicating states), we have fij = 1. Then
(34) will imply property (26).
State j is recurrent by assumption. Therefore $P_j\{X_n = j \text{ i.o.}\} = 1$ by statement (a) of Theorem 1. Hence for any m
$$p_{ji}^{(m)} = P_j(\{X_m = i\} \cap \{X_n = j \text{ i.o.}\})$$
$$\le \sum_{n>m} P_j\{X_m = i,\ X_{m+1} \ne j, \ldots, X_{n-1} \ne j,\ X_n = j\}$$
$$= \sum_{n>m} p_{ji}^{(m)} f_{ij}^{(n-m)} = p_{ji}^{(m)} f_{ij}, \tag{35}$$
where the next-to-last equality follows from the generalized Markov property (see
(2) in Sect. 2).
Since E is a class of communicating states, there is an m such that $p_{ji}^{(m)} > 0$. Therefore (35) implies that $f_{ij} = 1$.
The proof of Theorem 5 is completed.
6. It is natural to state an analog of Theorem 5 for an arbitrary period d of state j of
interest (d = d(j) ≥ 1).
Theorem 4. Let the state j ∈ E of a Markov chain be recurrent with period d =
d(j) ≥ 1, and let i be a state in E (possibly coinciding with j).
(a) Suppose that i and j are in the same indecomposable class C ⊆ E with
(cyclic) subclasses C0 , C1 , . . . , Cd−1 numbered so that j ∈ C0 , i ∈ Ca , where
a ∈ {0, 1, . . . , d − 1}, and the motion over them goes in cyclic order, C0 → C1 →
· · · → Ca → · · · → Cd−1 → C0 . Then
$$p_{ij}^{(nd+a)} \to \frac{d}{\mu_j} \quad\text{as } n \to \infty. \tag{36}$$
(b) In the general case, when i and j may belong to different indecomposable classes,
$$p_{ij}^{(nd+a)} \to \frac{d}{\mu_j} \sum_{k=0}^{\infty} f_{ij}^{(kd+a)} \quad\text{as } n \to \infty \tag{37}$$
for any a = 0, 1, . . . , d − 1.
PROOF. (a) At first, let a = 0, i.e., i and j belong to the same indecomposable
class C and, moreover, to the same cyclic subclass C0 .
Consider the transition probabilities $p_{ij}^{(d)}$, i, j ∈ C, and construct from them a new Markov chain (according to the constructions of Sect. 1).
For this new chain state j will be recurrent and aperiodic, and states i and j remain
communicating (i ↔ j). Therefore, by property (26) of Theorem 5,
$$p_{ij}^{(nd)} \to \frac{1}{\sum_{k=1}^{\infty} k\, f_{jj}^{(kd)}} = \frac{d}{\sum_{k=1}^{\infty} (kd)\, f_{jj}^{(kd)}} = \frac{d}{\mu_j},$$
where the last equality holds because $f_{jj}^{(l)} = 0$ for all l not divisible by d and $\mu_j = \sum_{l=1}^{\infty} l\, f_{jj}^{(l)}$ by definition.
Assume now that (36) has been proved for a = 0, 1, . . . , r (≤ d − 2). By the dominated convergence theorem (Theorem 3 in Sect. 6, Chap. 2, Vol. 1),
$$p_{ij}^{(nd+r+1)} = \sum_{k=1}^{\infty} p_{ik}\, p_{kj}^{(nd+r)} \to \sum_{k=1}^{\infty} p_{ik} \frac{d}{\mu_j} = \frac{d}{\mu_j}.$$
(b) By (11),
$$p_{ij}^{(nd+a)} = \sum_{k=1}^{nd+a} f_{ij}^{(k)} p_{jj}^{(nd+a-k)}, \quad a = 0, 1, \ldots, d-1.$$
By assumption, the period of j is d. Hence $p_{jj}^{(nd+a-k)} = 0$ unless k − a has the form rd. Therefore
$$p_{ij}^{(nd+a)} = \sum_{r=0}^{n} f_{ij}^{(rd+a)} p_{jj}^{((n-r)d)}.$$
Using this equality and (36) and applying again the dominated convergence the-
orem, we arrive at the required relation (37).
7. As was pointed out at the end of Sect. 4, in the problem of classifying Markov
chains in terms of asymptotic properties of transition probabilities, we can restrict
ourselves to aperiodic indecomposable chains.
The results of Theorems 1–5 actually contain all that we need for the complete
classification of such chains.
The following lemma is one of the results saying that for an indecomposable
chain all states are of the same (recurrent or transient) type. (Compare with the
property that the states are “of the same type” in Theorem 2, Sect. 4.)
PROOF. Let the chain have at least one transient state, say, state i. By Theorem 1, $\sum_n p_{ii}^{(n)} < \infty$.
Now let j be another state. Since E is an indecomposable class of communicating states (i ↔ j), there are k and l such that $p_{ij}^{(k)} > 0$ and $p_{ji}^{(l)} > 0$. The obvious inequality
$$p_{ii}^{(n+k+l)} \ge p_{ij}^{(k)} p_{jj}^{(n)} p_{ji}^{(l)}$$
implies now that
$$\sum_n p_{ii}^{(n+k+l)} \ge p_{ij}^{(k)} p_{ji}^{(l)} \sum_n p_{jj}^{(n)}.$$
By assumption, $\sum_n p_{ii}^{(n)} < \infty$, and k, l satisfy $p_{ij}^{(k)} p_{ji}^{(l)} > 0$. Hence $\sum_n p_{jj}^{(n)} < \infty$.
By statement (b) of Theorem 1, this implies that j is also a transient state. In other
words, if at least one state of an indecomposable chain is transient, then so are all
other states.
Now let i be a recurrent state. We will show that all the other states are then re-
current. Suppose that (along with the recurrent state i) there is at least one transient
state. Then, by what has been proved, all other states must be transient, which con-
tradicts the assumption that i is a recurrent state. Thus the presence of at least one
recurrent state implies that all other states (of an indecomposable chain) are also
recurrent.
This lemma justifies the common practice of calling an indecomposable chain itself (rather than its individual states) recurrent or transient.
for any i, j ∈ E, with convergence to zero rather “fast,” in the sense that
$$\sum_n p_{ij}^{(n)} < \infty.$$
$$\lim_n p_{ij}^{(n)} = \frac{1}{\mu_j} > 0$$
PROOF. Statement (i) has been proved in Theorems 1 (b) and 2. Statements (ii) and
(iii) follow directly from Theorems 1 (a) and 3.
Consider the case of finite Markov chains, i.e., the case where the state set E
consists of finitely many elements.
It turns out that in this case only the third of the three options (i), (ii), (iii) in
Theorem 5 is possible.
Theorem 6. Let a finite Markov chain be indecomposable and aperiodic. Then this chain is positive recurrent and $\lim_n p_{ij}^{(n)} = \frac{1}{\mu_j} > 0$.
PROOF. Suppose that the chain is transient. If the state space consists of r states (E = {1, 2, . . . , r}), then
$$\lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \lim_n p_{ij}^{(n)}. \tag{38}$$
Obviously, the left-hand side is equal to 1. But the assumption that the chain is
transient implies (by Theorem 1 (i)) that the right-hand side is zero.
Suppose now that the states of the chain are recurrent.
Since by Theorem 5 there remain only two options, (ii) and (iii), we must exclude (ii). But since in case (ii) $\lim_n p_{ij}^{(n)} = 0$ for all i, j ∈ E, we arrive at a contradiction using (38) in the same way as in the case of transient states.
Thus only (iii) is possible.
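Theorem 6 can be illustrated numerically on a sample indecomposable aperiodic 3-state chain (the matrix is our choice): for large n every entry of P^n is strictly positive, and all rows of P^n agree, reflecting convergence of $p_{ij}^{(n)}$ to limits independent of i.

```python
# Sample indecomposable aperiodic finite chain (our choice).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
S = range(3)

def mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in S) for j in S] for i in S]

Pn = P
for _ in range(100):
    Pn = mul(Pn, P)

# This matrix happens to be doubly stochastic, so the common limit of the
# rows is the uniform distribution (1/3, 1/3, 1/3).
min_entry = min(min(row) for row in Pn)
row_gap = max(abs(Pn[0][j] - Pn[1][j]) for j in S)
print(min_entry, row_gap)  # min_entry near 1/3, row_gap near 0
```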
9. Problems
1. Consider an indecomposable chain with state space {0, 1, 2, . . . }. Show that this chain is transient if and only if the system of equations $u_j = \sum_i u_i p_{ij}$, j = 0, 1, . . . , has a bounded solution that is not identically constant ($u_i \not\equiv c$, i = 0, 1, . . . ).
2. Show that a sufficient condition for an indecomposable chain with states 0, 1, . . . to be recurrent is the existence of a sequence (u_0, u_1, . . . ) with $u_i \to \infty$ as $i \to \infty$ such that $u_j \ge \sum_i u_i p_{ij}$ for all j ≠ 0.
3. Show that a necessary and sufficient condition for an indecomposable chain with states 0, 1, . . . to be recurrent and positive is that the system of equations $u_j = \sum_i u_i p_{ij}$, j = 0, 1, . . . , has a solution, not identically zero, such that $\sum_i |u_i| < \infty$.
4. Consider a Markov chain with states 0, 1, 2, . . . and transition probabilities
1. We begin with a general result clarifying the relationship between the limits Π = (π_1, π_2, . . . ), where $\pi_j = \lim_n p_{ij}^{(n)}$, j = 1, 2, . . . , and stationary distributions Q = (q_1, q_2, . . . ).
Theorem 1. Consider a Markov chain with a countable state space E = {1, 2, . . . }
and transition probabilities pij , i, j ∈ E, such that the limits
$$\pi_j = \lim_n p_{ij}^{(n)}, \quad j \in E,$$
exist and are independent of the initial state i ∈ E. Then:
(a) $\sum_{j=1}^{\infty} \pi_j \le 1$ and $\sum_{i=1}^{\infty} \pi_i p_{ij} = \pi_j$, j ∈ E;
(b) either $\sum_{j=1}^{\infty} \pi_j = 0$ (and hence all $\pi_j = 0$, j ∈ E) or $\sum_{j=1}^{\infty} \pi_j = 1$;
(c) if $\sum_{j=1}^{\infty} \pi_j = 0$, then the Markov chain has no stationary distribution, while if $\sum_{j=1}^{\infty} \pi_j = 1$, then the vector of limiting values Π = (π_1, π_2, . . . ) is a stationary distribution for this chain, and the chain has no other stationary distribution.
PROOF. We have
$$\sum_{j=1}^{\infty} \pi_j = \sum_{j=1}^{\infty} \lim_n p_{ij}^{(n)} \le \liminf_n \sum_{j=1}^{\infty} p_{ij}^{(n)} = 1, \tag{1}$$
Remark. Note that the inequalities and lower limits appear here, of course, due to
Fatou’s lemma, which is applied to Lebesgue’s integral over a σ-finite (nonnegative)
measure rather than a probability measure as in Sect. 6, Chap. 2, Vol. 1.
Then
$$\sum_{j=1}^{\infty} \pi_j > \sum_{j=1}^{\infty} \sum_{i=1}^{\infty} \pi_i p_{ij} = \sum_{i=1}^{\infty} \pi_i \sum_{j=1}^{\infty} p_{ij} = \sum_{i=1}^{\infty} \pi_i.$$
The contradiction thus obtained shows that $\sum_{i=1}^{\infty} \pi_i p_{ij} = \pi_j$. Together with the inequality $\sum_{j=1}^{\infty} \pi_j \le 1$, this proves conclusion (a).
∞
For the proof of (b), we iterate the equality i=1 πi pij = πj to obtain
∞
(n)
πi pij = πj
i=1
for any n ≥ 1 and any j ∈ E. Hence, by the dominated convergence theorem (Theo-
rem 3, Sect. 6, Chap. 2, Vol. 1),
∞
∞
∞
(n) (n)
πj = lim πi pij = πi lim pij = πi π j ,
n n
i=1 i=1 i=1
i.e.,
∞
πj 1 − πi = 0, j ∈ E,
i=1
∞ ∞ ∞
so that j=1 πj 1 − i=1 πi = 0. Thus a(1 − a) = 0 with a = i=1 πi ,
implying that either a = 1 or a = 0, which proves conclusion (b).
For the proof of (c), assume that Q = (q_1, q_2, . . . ) is a stationary distribution. Then $\sum_{i=1}^{\infty} q_i p_{ij}^{(n)} = q_j$, and, by the dominated convergence theorem, we obtain $\big( \sum_{i=1}^{\infty} q_i \big) \pi_j = q_j$, j ∈ E.
6 Stationary and Ergodic Distributions 279
Therefore, if Q is a stationary distribution, then $\sum_{i=1}^{\infty} q_i = 1$, and hence this stationary distribution must satisfy $q_j = \pi_j$ for all j ∈ E. Thus in the case where $\sum_{j=1}^{\infty} \pi_j = 0$, it is impossible to have $\sum_{i=1}^{\infty} q_i = 1$, so that there is no stationary distribution in this case.
According to (b), there remains the possibility that $\sum_{j=1}^{\infty} \pi_j = 1$. In this case, by (a), Π = (π_1, π_2, . . . ) is itself a stationary distribution, and the foregoing proof implies that if Q is also a stationary distribution, then it must coincide with Π, which proves the uniqueness of the stationary distribution when $\sum_{j=1}^{\infty} \pi_j = 1$.
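A numerical illustration of Theorem 1 on a sample 3-state chain (matrix of our choosing): the distribution of X_n converges to the same limit from every initial state, and that limit is a fixed point of π ↦ πP.

```python
# Sample chain with strictly positive entries (our choice).
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
S = range(3)

def evolve(dist, steps=300):
    """Push an initial distribution through `steps` transitions."""
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in S) for j in S]
    return dist

pi_from_0 = evolve([1.0, 0.0, 0.0])
pi_from_2 = evolve([0.0, 0.0, 1.0])

# Same limit from different initial states, and it is stationary: pi P = pi.
gap = max(abs(a - b) for a, b in zip(pi_from_0, pi_from_2))
residual = max(abs(sum(pi_from_0[i] * P[i][j] for i in S) - pi_from_0[j])
               for j in S)
print(gap, residual)  # both near zero
```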
2. Theorem 1 provides a sufficient condition for the existence of a unique stationary
distribution. This condition requires that for all j ∈ E the limits $\pi_j = \lim_n p_{ij}^{(n)}$ exist, are independent of i ∈ E, and satisfy $\pi_j > 0$ for at least one j ∈ E.
At the same time, the more general problem of the existence of the limits $\lim_n p_{ij}^{(n)}$ was thoroughly explored in Sect. 5 in terms of the “intrinsic” properties of the chains
was thoroughly explored in Sect. 5 in terms of the “intrinsic” properties of the chains
such as indecomposability, periodicity, recurrence and transience, and positive and
null recurrence. Therefore it would be natural to formulate the conditions for the
existence of the stationary distribution in terms of these intrinsic properties deter-
mined by the structure of the matrix of transition probabilities pij , i, j ∈ E. It is seen
also that if conditions stated in these terms imply that all the limiting values are
positive, πj > 0, j ∈ E, then by definition (see property C in Sect. 3) the vector
Π = (π1 , π2 , . . . ) will be an ergodic limit distribution.
The answers to these questions are given in the following two theorems.
3. PROOF OF THEOREM 2. Necessity. Let the chain at hand have a unique stationary distribution, to be denoted by $\tilde{Q}$. We will show that in this case there is a unique positive recurrent subclass in the state space E.
Let N denote the number of such subclasses (0 ≤ N ≤ ∞).
Suppose N = 0, and let j be a state in E. Since there are no positive recurrent classes, state j must be either transient or null recurrent.
In the former case, the limits $\lim_n p_{ij}^{(n)}$ exist and are equal to zero for all i ∈ E by Theorem 2 in Sect. 5.
In the latter case these limits also exist and are equal to zero, which follows from
(37) in Sect. 5 and the fact that μj = ∞, since state j is null recurrent.
Thus, if N = 0, then the limits $\pi_j = \lim_n p_{ij}^{(n)}$ exist and are equal to zero for all i, j ∈ E. Therefore, by Theorem 1 (c), in this case there is no stationary distribution, so the case N = 0 is excluded by the assumption of the existence of the stationary distribution $\tilde{Q}$.
Suppose now that N = 1. Denote the only positive recurrent class by C. If the period of this class is d(C) = 1, then by (26) of Theorem 5, Sect. 5,
$$p_{ij}^{(n)} \to \mu_j^{-1}, \quad n \to \infty,$$
for all i, j ∈ C. If j ∉ C, then this state is transient, and by property (21) of Theorem 2, Sect. 5,
$$p_{ij}^{(n)} \to 0, \quad n \to \infty,$$
for all i ∈ E.
Let
$$q_j = \begin{cases} \mu_j^{-1}\ (>0), & j \in C, \\ 0, & j \notin C. \end{cases} \tag{5}$$
Then, since C ≠ ∅, the collection Q = (q_1, q_2, . . . ) is (by Theorem 1 (a)) the unique stationary distribution; therefore Q = $\tilde{Q}$.
Suppose now that the period d(C) > 1. Let C_0, C_1, . . . , C_{d−1} be the cyclic subclasses of the (positive recurrent) class C.
Every C_k, k = 0, 1, . . . , d − 1, is a recurrent and aperiodic subclass with respect to the matrix of transition probabilities $\|p_{ij}^{(d)}\|$, i, j ∈ C. Hence, for i, j ∈ C_k,
$$p_{ij}^{(nd)} \to \frac{d}{\mu_j} > 0$$
by (36) from Sect. 5. Therefore, for each set C_k, the collection $\{ d/\mu_j,\ j \in C_k \}$ is (by Theorem 1 (b)) the unique stationary distribution (with respect to the matrix $\|p_{ij}^{(d)}\|$, i, j ∈ C_k). This implies, in particular, that $\sum_{j \in C_k} \frac{d}{\mu_j} = 1$, i.e., $\sum_{j \in C_k} \frac{1}{\mu_j} = \frac{1}{d}$.
Let us set
$$q_j = \begin{cases} \mu_j^{-1}, & j \in C = C_0 + \cdots + C_{d-1}, \\ 0, & j \notin C, \end{cases} \tag{6}$$
and show that the collection Q = (q_1, q_2, . . . ) is the unique stationary distribution. Indeed, if i ∈ C, then
$$p_{ii}^{(nd)} = \sum_{j \in C} p_{ij}^{(nd-1)} p_{ji},$$
hence
$$\frac{1}{\mu_i} \ge \sum_{j \in C} \frac{1}{\mu_j} p_{ji}. \tag{7}$$
But
$$\sum_{i \in C} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \sum_{i \in C_k} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \frac{1}{d} = 1. \tag{8}$$
As in the proof of Theorem 1 (see (3) and (4)), we obtain from (7) and (8) that (7) holds in fact with the equality sign:
$$\frac{1}{\mu_i} = \sum_{j \in C} \frac{1}{\mu_j} p_{ji}. \tag{9}$$
Since $q_i = \mu_i^{-1} > 0$, we see from (9) that the collection Q = (q_1, q_2, . . . ) is a stationary distribution, which is unique by Theorem 1. Therefore Q = $\tilde{Q}$.
Finally, let 2 ≤ N < ∞ or N = ∞. Denote the positive recurrent subclasses by
C1 , . . . , CN if N < ∞, and by C1 , C2 , . . . if N = ∞.
Let $Q^k = (q_1^k, q_2^k, \ldots)$ be the stationary distribution for the class C_k, given by the formula (compare with (5), (6))
$$q_j^k = \begin{cases} \mu_j^{-1} > 0, & j \in C_k, \\ 0, & j \notin C_k. \end{cases}$$
Then for any nonnegative numbers $a_1, a_2, \ldots$ with $\sum_{k=1}^{\infty} a_k = 1$ ($a_{N+1} = \cdots = 0$ if N < ∞), the collection $a_1 Q^1 + \cdots + a_N Q^N + \cdots$ is, obviously, a stationary distribution. Hence the assumption 2 ≤ N ≤ ∞ leads to the existence of a continuum of stationary distributions, which contradicts the assumed uniqueness.
Thus, the foregoing proof shows that only the case N = 1 is possible. In other
words, the existence of a unique stationary distribution implies that the chain has
only one indecomposable class, which consists of positive recurrent states.
Sufficiency. If the chain has an indecomposable subclass of positive recurrent states, i.e., the case N = 1 takes place, then the preceding arguments imply (by Theorem 1 (c)) the existence and uniqueness of the stationary distribution.
This completes the proof of Theorem 2.
4. PROOF OF THEOREM 3. Actually, all we need is contained in Theorem 2 and its
proof.
Sufficiency. Using the notation of the proof of Theorem 2, we have, by the conditions of the present theorem, N = 1, C = E, and d(E) = 1 (aperiodicity). Then the reasoning in the case N = 1 of the proof of Theorem 2 implies that Q = (q_1, q_2, . . . ) with $q_j = \mu_j^{-1}$, j ∈ E, is a stationary and ergodic distribution, since all $\mu_j < \infty$, j ∈ E.
Thus, the existence of an ergodic distribution Π = (π1 , π2 , . . . ) is established
(Π = Q).
with the components of these solutions being nonnegative ($x_j \ge 0$, j ∈ E) and normalized ($\sum_{j=1}^{\infty} x_j = 1$).
Under the conditions of Theorem 5 there exists a stationary distribution that at the same time is ergodic. Hence, by Theorem 1 (c), there is a unique solution to system (12) within the class of sequences x = (x_1, x_2, . . . ) with $x_j \ge 0$, j ∈ E, and $\sum_{j=1}^{\infty} x_j = 1$.
But in fact we can make a stronger assertion. Since we assume that the conditions of Theorem 5 are fulfilled, there exists an ergodic distribution Π = (π_1, π_2, . . . ). Consider under this assumption the problem of the existence of a solution to (12) in the wider class of sequences x = (x_1, x_2, . . . ) such that $x_j \in \mathbb{R}$, j ∈ E, $\sum_{j=1}^{\infty} |x_j| < \infty$, and $\sum_{j=1}^{\infty} x_j = 1$. We will show that there is a unique solution in this class, given by the ergodic distribution Π.
Indeed, if x = (x_1, x_2, . . . ) is a solution, then, using $\sum_{j=1}^{\infty} |x_j| < \infty$, we obtain the following chain of equalities:
7 Stationary and Ergodic Distributions 283
$$x_j = \sum_{i=1}^{\infty} x_i p_{ij} = \sum_{i=1}^{\infty} \Big( \sum_{k=1}^{\infty} x_k p_{ki} \Big) p_{ij} = \sum_{k=1}^{\infty} x_k \sum_{i=1}^{\infty} p_{ki} p_{ij} = \sum_{k=1}^{\infty} x_k p_{kj}^{(2)} = \cdots = \sum_{k=1}^{\infty} x_k p_{kj}^{(n)}$$
for any n ≥ 1. Taking the limit as n → ∞, we obtain (by the dominated convergence theorem) that $x_j = \big( \sum_{k=1}^{\infty} x_k \big) \pi_j$, where $\pi_j = \lim_n p_{kj}^{(n)}$ for any k ∈ E. By assumption, $\sum_{k=1}^{\infty} x_k = 1$. Hence $x_j = \pi_j$, j ∈ E, which was to be proved.
6. Problems
1. Investigate the problem of stationary, limiting, and ergodic distributions for a
Markov chain with the transition probabilities matrix
$$P = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/4 & 1/2 & 1/4 & 0 \\ 0 & 1/2 & 1/2 & 0 \end{pmatrix}.$$
2. Let $P = \|p_{ij}\|$ be a finite doubly stochastic matrix (i.e., $\sum_{j=1}^{m} p_{ij} = 1$ for i = 1, . . . , m and $\sum_{i=1}^{m} p_{ij} = 1$ for j = 1, . . . , m). Show that Q = (1/m, . . . , 1/m) is a stationary distribution of the corresponding Markov chain.
3. Let X be a Markov chain with two states, E = {0, 1}, and the transition proba-
bilities matrix
$$P = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\beta & \beta \end{pmatrix}, \quad 0 < \alpha < 1,\ 0 < \beta < 1.$$
Explore the limiting, ergodic, and stationary distributions for this chain.
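The claim of Problem 2 above is quick to check numerically; the 3×3 doubly stochastic matrix below is a sample of our choosing.

```python
# Sample doubly stochastic matrix: each row and each column sums to 1.
P = [[0.2, 0.3, 0.5],
     [0.5, 0.2, 0.3],
     [0.3, 0.5, 0.2]]
m = len(P)
Q = [1.0 / m] * m
# Q P should return Q itself: uniform in, uniform out.
QP = [sum(Q[i] * P[i][j] for i in range(m)) for j in range(m)]
print(QP)  # each component equals 1/3 up to rounding
```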
(c) Moreover, these limits π_j are equal to $\mu_j^{-1} > 0$ for all j ∈ E, where $\mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)}$ is the mean time of return to state j (i.e., $\mu_j = E_j \tau(j)$ with $\tau(j) = \min\{ n \ge 1 : X_n = j \}$), so that Π = (π_1, π_2, . . . , π_r) is an ergodic distribution.
(d) The stationary distribution Q = (q_1, q_2, . . . , q_r) exists, is unique, and is equal to Π = (π_1, π_2, . . . , π_r).
2. In addition to Theorem 1, we state the following result clarifying the role of the
properties of a chain being indecomposable and aperiodic.
Theorem 2. Consider a Markov chain with a finite state space E = {1, 2, . . . , r}.
The following statements are equivalent:
(a) The chain is indecomposable and aperiodic (d = 1).
(b) The chain is indecomposable, aperiodic (d = 1), and positive recurrent.
(c) The chain is ergodic.
(d) There is an $n_0$ such that for all $n \ge n_0$
$$\min_{i,j \in E} p_{ij}^{(n)} > 0.$$
PROOF. The implication (d) ⇒ (c) was proved in Theorem 1 of Sect. 12, Chap. 1
(Vol. 1). The converse implication (c) ⇒ (d) is obvious. The implication (a) ⇒ (b)
follows from Theorem 6, Sect. 5, while (b) ⇒ (a) is obvious. Finally, the equivalence
of (b) and (c) is contained in Theorem 5 from Sect. 6.
where p + q = 1.
The following graph demonstrates the possible transitions of this chain.
8 Simple Random Walk as a Markov Chain 285
If p = 0 or 1, the motion is deterministic, and the particle moves to the left or the
right, respectively.
These deterministic cases are of little interest; all the states here are inessential.
Hence we will assume that 0 < p < 1.
Under this assumption the states form a single class of essential communicating
states. In other words, when 0 < p < 1, the chain is indecomposable (Sect. 4).
By the formula for the binomial distribution (Sect. 2, Chap. 1, Vol. 1),
$$p_{jj}^{(2n)} = C_{2n}^{n} (pq)^n = \frac{(2n)!}{(n!)^2} (pq)^n \tag{1}$$
for any j ∈ E. By Stirling's formula (see (6) in Sect. 2, Chap. 1, Vol. 1; see also Problem 1),
$$n! \sim \sqrt{2\pi n}\, n^n e^{-n}.$$
Therefore we find from (1) that
$$p_{jj}^{(2n)} \sim \frac{(4pq)^n}{\sqrt{\pi n}}, \tag{2}$$
hence
$$\sum_{n=1}^{\infty} p_{jj}^{(2n)} = \infty \quad\text{if } p = q, \tag{3}$$
$$\sum_{n=1}^{\infty} p_{jj}^{(2n)} < \infty \quad\text{if } p \ne q. \tag{4}$$
These formulas, together with Theorem 1 in Sect. 5, yield the following result.
The simple one-dimensional random walk over the set E = Z = {0, ±1, ±2, . . . } is recurrent in the symmetric case p = q = 1/2 and transient when p ≠ q.
If p = q = 1/2, then, as was shown in Sect. 10, Chap. 1, Vol. 1, for large n
$$f_{jj}^{(2n)} \sim \frac{1}{2\sqrt{\pi}\, n^{3/2}}. \tag{5}$$
Therefore
$$\mu_j = \sum_{n=1}^{\infty} (2n) f_{jj}^{(2n)} = \infty, \quad j \in E. \tag{6}$$
Thus all the states in this case are null recurrent. Hence, by Theorem 5, Sect. 5, we obtain that $p_{ij}^{(n)} \to 0$ as n → ∞ for any i and j and all 0 < p < 1. This implies (Theorem 1, Sect. 6) that there are no limiting distributions and no stationary or ergodic distributions.
EXAMPLE 2. Let d = 2. Consider the symmetric motion in the plane (corresponding to the case p = q = 1/2 in the previous example), when the particle can move one step in any of the four directions (to the left or right, up or down) with probability 1/4 (Fig. 41).
Assuming for definiteness that the particle was at the origin 0 = (0, 0) at the
initial time instant, we will investigate the problem of its return or nonreturn to this
zero state.
To this end, consider the paths in which a particle makes i steps to the right
and i steps to the left and j steps up and j steps down. If 2i + 2j = 2n, this means
that the particle starting from the origin returns to this state in 2n steps. It is also
clear that the particle cannot return to the origin after an odd number of steps.
This implies that the probabilities of transition from state 0 to the same state 0
are given by the following formulas:
$$p_{00}^{(2n+1)} = 0, \quad n = 0, 1, 2, \ldots,$$
$$p_{00}^{(2n)} = \sum_{(i,j)\,:\, i+j=n} \frac{(2n)!}{(i!)^2 (j!)^2} \Big( \frac{1}{4} \Big)^{2n}, \quad n = 1, 2, \ldots, \tag{7}$$
where we have used the formula (Problem 4 in Sect. 2, Chap. 1, Vol. 1)
$$\sum_{i=0}^{n} C_n^i C_n^{n-i} = C_{2n}^{n}.$$
By Stirling's formula we obtain from (8) that $p_{00}^{(2n)} \sim \frac{1}{\pi n}$, hence
$$\sum_{n=0}^{\infty} p_{00}^{(2n)} = \infty. \tag{9}$$
Of course, a similar assertion, by symmetry, holds not only for (0, 0) but also for
any state (i, j).
As in the case d = 1, we obtain from (9) and Theorem 1 of Sect. 5 the following
statement.
The simple two-dimensional symmetric random walk over the set E = Z2 =
{0, ±1, ±2, . . . }2 is recurrent.
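Combining (7) with the displayed combinatorial identity gives $p_{00}^{(2n)} = (C_{2n}^{n} 2^{-2n})^2$, and the ~1/(πn) behavior behind (9) can be checked numerically:

```python
from math import pi

def p00_2n(n):
    """p_00^(2n) = (C(2n,n)/4^n)^2 for the planar symmetric walk,
    with C(2n,n)/4^n built up by the ratio (2k-1)/(2k)."""
    u = 1.0
    for k in range(1, n + 1):
        u *= (2 * k - 1) / (2 * k)
    return u * u

ratio = p00_2n(500) * pi * 500
print(ratio)  # close to 1, matching p_00^(2n) ~ 1/(pi n)
```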
EXAMPLE 3. It turns out that in the case d ≥ 3, the behavior of the symmetric
random walk over the states E = Zd = {0, ±1, ±2, . . . }d is quite different from the
cases d = 1 and d = 2 considered above.
That is:
The simple d-dimensional symmetric random walk over the set E = Z^d = {0, ±1, ±2, . . . }^d is transient for every d ≥ 3.
The proof relies on the fact that the asymptotic behavior of the probabilities $p_{jj}^{(2n)}$ as n → ∞ is
$$p_{jj}^{(2n)} \sim \frac{c(d)}{n^{d/2}} \tag{10}$$
with a positive constant c(d) depending only on the dimension d.
We will give the proof for d = 3 leaving the case d > 3 as a problem.
The symmetry of the random walk means that at every step the particle moves by
one unit in one of the six coordinate directions, each with probability 1/6.
Let the particle start from the state 0 = (0, 0, 0). Then, as for d = 2, we find
from the formulas for the multinomial distribution (Sect. 2, Chap. 1, Vol. 1) that
$$p_{00}^{(2n)} = \sum_{(i,j)\colon 0 \le i+j \le n} \frac{(2n)!}{(i!)^2 (j!)^2 ((n-i-j)!)^2}\left(\frac{1}{6}\right)^{2n}$$
$$= 2^{-2n} C_{2n}^{n} \sum_{(i,j)\colon 0 \le i+j \le n} \left[\frac{n!}{i!\,j!\,(n-i-j)!}\right]^2 \left(\frac{1}{3}\right)^{2n}$$
$$\le C_n\, 3^{-n}\, 2^{-2n} C_{2n}^{n} \sum_{(i,j)\colon 0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!}\left(\frac{1}{3}\right)^{n}$$
$$= C_n\, 2^{-2n} C_{2n}^{n}\, 3^{-n}, \tag{11}$$
where
$$C_n = \max_{(i,j)\colon 0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!} \tag{12}$$
and where we have used the obvious fact that
$$\sum_{(i,j)\colon 0 \le i+j \le n} \frac{n!}{i!\,j!\,(n-i-j)!}\left(\frac{1}{3}\right)^{n} = 1.$$
Here $m_n(i,j)$ denotes $\frac{n!}{i!\,j!\,(n-i-j)!}$, and $(i_0, j_0)$ is a point at which the maximum $C_n$ is attained. Taking the four points (i0 − 1, j0), (i0 + 1, j0), (i0, j0 − 1), (i0, j0 + 1) and using that the
corresponding values mn(i0 − 1, j0), mn(i0 + 1, j0), mn(i0, j0 − 1), and mn(i0, j0 + 1)
are less than or equal to mn(i0, j0), we obtain the inequalities
n − i0 − 1 ≤ 2j0 ≤ n − i0 + 1,
n − j0 − 1 ≤ 2i0 ≤ n − j0 + 1.
Summarizing these cases, we can state the following theorem due to G. Pólya: the simple symmetric random walk on Zd is recurrent for d = 1, 2 and transient for every d ≥ 3.
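Pólya's dichotomy can be checked numerically. The sketch below is my own illustration, not part of the text: it computes the exact return probabilities $p_{00}^{(2n)}$ for d = 1, 2, 3 from formulas of the type (7)–(8) and compares their partial sums, which keep growing for d = 1, 2 but stabilize for d = 3.

```python
from math import comb, factorial

def p00(d, n):
    """Exact return probability p_00^(2n) of the simple symmetric walk on Z^d."""
    if d == 1:
        return comb(2 * n, n) / 4.0 ** n
    if d == 2:
        # formula (8): p_00^(2n) = (C_{2n}^n 2^{-2n})^2
        return (comb(2 * n, n) / 4.0 ** n) ** 2
    if d == 3:
        # sum over i + j <= n of (2n)! / ((i!)^2 (j!)^2 ((n-i-j)!)^2) * 6^(-2n)
        s = 0
        for i in range(n + 1):
            for j in range(n - i + 1):
                m = factorial(n) // (factorial(i) * factorial(j) * factorial(n - i - j))
                s += m * m
        return comb(2 * n, n) * s / 6.0 ** (2 * n)
    raise ValueError("only d = 1, 2, 3 are implemented")

# Partial sums of p_00^(2n), n = 1..59: divergent for d = 1, 2; convergent for d = 3.
partial = {d: sum(p00(d, n) for n in range(1, 60)) for d in (1, 2, 3)}
```

The walk is recurrent exactly when the series (9) diverges; the d = 3 partial sum is already close to its finite limit, in agreement with the asymptotics (10).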
2. The previous examples dealt with a simple random walk in the entire space Zd . In
this subsection we will consider examples of simple random walks with state space
E strictly less than Zd . We will restrict ourselves to the case d = 1.
EXAMPLE 4. Consider a simple random walk on E = {0, 1, 2, . . . } in which state 0 is absorbing, while from each state i ≥ 1 the particle moves to i + 1 with probability p and to i − 1 with probability q = 1 − p. State 0 is here the only positive recurrent state that forms a unique indecomposable subclass. (All the other states are transient.) By Theorem 2 in Sect. 6 there exists a unique stationary distribution Q = (q0 , q1 , . . . ) with q0 = 1 and qi = 0, i = 1, 2, . . . .
This walk provides an example where (for some i and j) the limits $\lim_n p_{ij}^{(n)}$ exist
but depend on the initial state, which means, in particular, that this random walk
possesses no ergodic distribution.
It is clear that $p_{00}^{(n)} = 1$ and $p_{0j}^{(n)} = 0$ for j = 1, 2, . . . , and an easy calculation
shows that $p_{ij}^{(n)} \to 0$ for all i, j = 1, 2, . . . .
Let us show that the limits $\alpha(i) = \lim_n p_{i0}^{(n)}$ exist for all i = 1, 2, . . . and are
given by the formula
$$\alpha(i) = \begin{cases} (q/p)^i, & p > q, \\ 1, & p \le q. \end{cases} \tag{16}$$
This formula demonstrates that when p > q (trend to the right), the limiting
probability $\lim_n p_{i0}^{(n)}$ of transition from state i (i = 1, 2, . . . ) to state 0 depends on i,
decreasing geometrically as i grows.
For the proof of (16), notice that $p_{i0}^{(n)} = \sum_{k \le n} f_{i0}^{(k)}$, since the null state is absorbing;
hence the limit $\lim_n p_{i0}^{(n)}\ (= \alpha(i))$ exists and equals $f_{i0}$, i.e., the probability of interest
is the probability that the particle leaving state i eventually reaches the null state. By
the same method as in Sect. 12, Chap. 1, Vol. 1 (see also Sect. 2, Chap. 7), we obtain
recursive relations for these probabilities:
$$\alpha(i) = p\,\alpha(i+1) + q\,\alpha(i-1), \quad i \ge 1, \qquad \alpha(0) = 1. \tag{17}$$
For p ≠ q, together with the boundary condition α(0) = 1, the general solution of (17) has the form
$$\alpha(i) = a + (1-a)(q/p)^i. \tag{18}$$
Let p > q. If we knew that
$$\alpha(i) \to 0, \quad i \to \infty, \tag{19}$$
it would follow that a = 0 and
$$\alpha(i) = (q/p)^i. \tag{20}$$
Instead of establishing (19) first, we will prove this equality in another way.
Along with the absorbing barrier at point 0, consider one more absorbing barrier
at point N. Denote by αN (i) the probability that a particle leaving point i reaches the
null state before getting to state N. The probabilities αN (i) satisfy equations (17)
with boundary conditions
$$\alpha_N(0) = 1, \qquad \alpha_N(N) = 0,$$
whence
$$\alpha_N(i) = \frac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, \quad 0 \le i \le N. \tag{21}$$
Therefore $\lim_N \alpha_N(i) = (q/p)^i$, so that for the proof of (20) we must show that
$$\lim_N \alpha_N(i) = \alpha(i). \tag{22}$$
This is easily seen intuitively. The proof can be carried out as follows.
Assume that the particle starts from a given state i. Then
$$\alpha(i) = P_i(A), \tag{23}$$
where $A = \bigcup_N A_N$ and $A_N$ is the event that the particle leaving state i reaches
the null state before state N. Since $A_N \subseteq A_{N+1}$, we have
$$P_i(A) = \lim_N P_i(A_N). \tag{24}$$
But αN (i) = Pi (AN ), so that (22) follows directly from (23) and (24).
Thus, when p > q, the limits $\lim_n p_{i0}^{(n)}$ depend on i. If p ≤ q, then $\lim_n p_{i0}^{(n)} = 1$
for any i and $\lim_n p_{ij}^{(n)} = 0$, j ≥ 1. Hence in this case there exists a limit distribution
Π = (π0 , π1 , . . . ) with $\pi_j = \lim_n p_{ij}^{(n)}$ independent of i, namely Π = (1, 0, 0, . . . ).
EXAMPLE 5. Consider now a simple random walk on E = {0, 1, . . . , N} with absorbing barriers at both 0 and N: from each state 1 ≤ i ≤ N − 1 the particle moves to the right with probability p and to the left with probability q = 1 − p. In this case there are two indecomposable positive recurrent classes {0} and {N}.
All the other states 1, 2, . . . , N − 1 are transient. We can see from the proof of
Theorem 2, Sect. 6, that there exists a continuum of stationary distributions Q =
(q0 , q1 , . . . , qN ), all of which have the form q1 = · · · = qN−1 = 0 and q0 = a,
qN = b with a ≥ 0, b ≥ 0, and a + b = 1.
According to the results of Subsection 2 in Sect. 9, Chap. 1, Vol. 1,
$$\lim_n p_{i0}^{(n)} = \begin{cases} \dfrac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, & p \ne q, \\[2mm] 1 - \dfrac{i}{N}, & p = q = 1/2. \end{cases} \tag{25}$$
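Formulas (21) and (25) are easy to check numerically. In the sketch below (my own; the helper names `alpha_iter` and `alpha_formula` are made up for this illustration), the absorption probabilities are obtained by simply iterating the one-step equations α(i) = pα(i + 1) + qα(i − 1) with α(0) = 1, α(N) = 0, and compared with the closed form.

```python
def alpha_iter(p, N, sweeps=5000):
    """Solve alpha(i) = p*alpha(i+1) + q*alpha(i-1), alpha(0)=1, alpha(N)=0 by iteration."""
    q = 1 - p
    a = [1.0] + [0.0] * N
    for _ in range(sweeps):
        a = [1.0] + [p * a[i + 1] + q * a[i - 1] for i in range(1, N)] + [0.0]
    return a

def alpha_formula(p, N, i):
    """Closed form (21)/(25): probability of hitting 0 before N starting from i."""
    q = 1 - p
    if p == q:
        return 1 - i / N
    r = q / p
    return (r ** i - r ** N) / (1 - r ** N)

N = 10
err = max(abs(alpha_iter(p, N)[i] - alpha_formula(p, N, i))
          for p in (0.5, 0.6, 0.8) for i in range(N + 1))
```

Letting N grow with p > q reproduces the limit (20): αN(i) → (q/p)^i.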
EXAMPLE 6. Consider a simple random walk with state space E = {0, 1, . . .} and a
reflecting barrier at 0: from 0 the particle passes to 1 with probability 1, while from each state i ≥ 1 it moves to i + 1 with probability p and to i − 1 with probability q = 1 − p.
If p > q, then a wandering particle has a trend to the right, and the reflecting
barrier enhances this trend, unlike the chain in Example 4, where a particle may
become "stuck" in the zero state. All the states are transient: $p_{ij}^{(n)} \to 0$, n → ∞, for
all i, j ∈ E; there are no stationary or ergodic distributions.
If p < q, there is a leftward trend and the chain is recurrent; so is the chain
for p = q.
Let us write down the system of equations (cf. (12) in Sect. 6) for the stationary
distribution Q = (q0 , q1 , . . . ):
q0 = q1 q,
q1 = q0 + q2 q,
q2 = q1 p + q3 q,
...............
Hence
q1 = q(q1 + q2 ),
q2 = q(q2 + q3 ),
...............
Therefore
$$q_j = \frac{p}{q}\, q_{j-1}, \quad j = 2, 3, \ldots.$$
If p = q, then q1 = q2 = · · · , and hence there is no nonnegative solution of
this system satisfying the conditions $\sum_{j=0}^{\infty} q_j = 1$ and q0 = q1 q. Therefore, when
p = q = 1/2, there is no stationary distribution. All the states in this case are
recurrent.
Finally, let p < q. The condition $\sum_{j=0}^{\infty} q_j = 1$ yields
$$q_1\Big(q + 1 + \frac{p}{q} + \Big(\frac{p}{q}\Big)^2 + \cdots\Big) = 1.$$
Hence
$$q_1 = \frac{q-p}{2q^2}, \qquad q_0 = q_1 q = \frac{q-p}{2q},$$
and
$$q_j = \frac{q-p}{2q^2}\Big(\frac{p}{q}\Big)^{j-1}, \quad j \ge 2.$$
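This computation can be double-checked numerically. The sketch below (mine, not from the text) builds the geometric solution q_j = q_1 (p/q)^{j−1}, j ≥ 1, with q_0 = q_1 q, normalizes it, and verifies the stationarity equations; the normalizing constant comes out as q_1 = (q − p)/(2q²).

```python
p, q = 0.3, 0.7           # leftward trend (p < q), so a stationary distribution exists
M = 400                   # truncation level; the geometric tail beyond M is negligible

qs = [0.0] * M
qs[1] = 1.0               # unnormalized q_j = q_1 (p/q)^(j-1), j >= 1
for j in range(2, M):
    qs[j] = qs[j - 1] * (p / q)
qs[0] = qs[1] * q         # q_0 = q_1 * q
Z = sum(qs)
qs = [x / Z for x in qs]  # the stationary distribution

# Stationarity: q_0 = q_1 q, q_1 = q_0 + q_2 q, q_j = q_{j-1} p + q_{j+1} q (j >= 2).
```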
EXAMPLE 7. The state space of the simple random walk in this example is E =
{0, 1, . . . , N}, with reflecting barriers 0 and N: the particle passes from 0 to 1 and from N to N − 1 with probability 1, while from each interior state i it moves one step to the right with probability p and to the left with probability q = 1 − p, 0 < p < 1.
The states of this chain constitute one indecomposable class. They are positive recurrent with period d = 2. By Theorem 2 in Sect. 6, there is a unique stationary distribution Q = (q0 , q1 , . . . , qN ). On solving the system of equations $q_j = \sum_{i=0}^{N} q_i p_{ij}$ subject to the conditions $\sum_{i=0}^{N} q_i = 1$, $q_j \ge 0$, j ∈ E, we find
$$q_j = \frac{(p/q)^{j-1}}{q + \sum_{i=1}^{N-1}(p/q)^{i-1} + p\,(p/q)^{N-2}}, \qquad 1 \le j \le N-1, \tag{26}$$
and $q_0 = q_1 q$, $q_N = q_{N-1}\, p$.
There is no ergodic distribution, which follows from Theorem 3, Sect. 6, and the
fact that this chain has period d = 2. The lack of an ergodic distribution can also be
seen directly. For example, let N = 2: the particle passes from 0 to 1 and from 2 to 1 with probability 1, and from 1 it moves to 2 with probability p and to 0 with probability q = 1 − p, 0 < p < 1. Starting, say, from 0, the particle is at state 1 at all odd times and at 0 or 2 at all even times, so the probabilities $p_{ij}^{(n)}$ oscillate with n and have no limits independent of i.
A. Ehrenfest Model. In 1907, Paul and Tatiana Ehrenfest [23] proposed this Markov chain as the
model of statistical mechanics describing the motion of gas molecules from one
container (A or B) to the other (B or A) through the membrane between them.
It is assumed that the total number of molecules in the two containers is N, and
at each step one of them is randomly chosen (with probability 1/N) and placed into
the other container. The choice of the molecule at each step is made independently
of its prehistory.
Let Xn be the number of molecules in container A at time n. If Xn = i, then the chosen molecule belongs to A with probability i/N and to B with probability 1 − i/N, so that
$$p_{i,i-1} = \frac{i}{N}, \qquad p_{i,i+1} = 1 - \frac{i}{N}, \qquad i = 0, 1, \ldots, N, \tag{27}$$
with pij = 0 for all other j. The sequence (Xn) obeys the Markov property
$$P(X_{n+1} = j \mid X_0 = i_0, \ldots, X_n = i) = P(X_{n+1} = j \mid X_n = i) \tag{28}$$
and
P(Xn+1 = j | Xn = i) = pij , (29)
with pij defined by (27).
For this model there exists a stationary distribution Q = (q0 , q1 , . . . , qN ) given
by the following binomial formula (Problem 3):
$$q_j = C_N^j \left(\frac{1}{2}\right)^{N}, \quad j = 0, 1, \ldots, N. \tag{30}$$
All the states of this Markov chain are recurrent (Problem 4).
It is of interest to note that the maximum of qj , j = 0, 1, . . . , N, is attained for,
say, even N, at the central value j = N/2, which corresponds to the most probable
“equilibrium” state, when the number of molecules in both containers is the same.
Clearly, this equilibrium, which is established in the course of time, is of a prob-
abilistic nature (specified by the stationary distribution Q).
The possibility of “stabilization” of the number of molecules in the containers
is quite understandable intuitively: the farther state i is from the central value, the
larger the probability (by (27)) that the molecule will move toward this value.
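This can be verified directly: with the transition probabilities p_{i,i−1} = i/N, p_{i,i+1} = 1 − i/N of the Ehrenfest chain, the short check below (my own sketch, not part of the text) confirms that the binomial distribution (30) satisfies q_j = Σ_i q_i p_{ij}.

```python
from math import comb

N = 10
q = [comb(N, j) / 2 ** N for j in range(N + 1)]   # binomial distribution (30)

def inflow(j):
    """Stationary inflow sum_i q_i p_ij for the Ehrenfest chain."""
    total = 0.0
    if j + 1 <= N:
        total += q[j + 1] * (j + 1) / N           # p_{j+1, j} = (j + 1)/N
    if j - 1 >= 0:
        total += q[j - 1] * (1 - (j - 1) / N)     # p_{j-1, j} = 1 - (j - 1)/N
    return total

max_err = max(abs(inflow(j) - q[j]) for j in range(N + 1))
```

The maximum of q_j is attained at the central value j = N/2, the "equilibrium" state described above.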
B. D. Bernoulli–Laplace Model. This model, which is akin to the Ehrenfest
model, was proposed by Daniel Bernoulli in 1769 and analyzed by Laplace in 1812
in the context of describing the exchange of particles between two ideal liquids.
Specifically, there are two containers, A and B, containing 2N particles, of which
N particles are white and N particles are black.
The system is said to be in state i, where i ∈ E = {0, 1, . . . , N}, if there are
i white particles and N − i black particles in container A. The assumption of ideal
liquids means that in state i there are N − i white particles and i black particles in
container B, i.e., the number of particles in each container remains equal to N.
At each step n one particle in each container is randomly chosen (with proba-
bility 1/N), and these particles interchange their containers. The two choices are
independent, and each choice is independent of the choices in the previous steps.
Let Xn be the number of white particles in container A. Then the aforementioned
mechanism of particle interchange obeys the Markov property (28) with transition
probabilities pij in (29) given by the formula (Problem 5)
$$p_{ij} = \begin{cases} (i/N)^2, & j = i - 1, \\ (1 - i/N)^2, & j = i + 1, \\ 2\,(i/N)(1 - i/N), & j = i, \end{cases} \tag{31}$$
with pij = 0 for all other j.
As in the Ehrenfest model, all the states are recurrent. There exists a unique
stationary distribution Q = (q0 , q1 , . . . , qN ) determined by (Problem 5)
$$q_j = \frac{(C_N^j)^2}{C_{2N}^{N}}, \quad j = 0, 1, \ldots, N. \tag{32}$$
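The claim that (32) is stationary can be checked exactly with rational arithmetic; the following sketch (my own check, not from the text) verifies q_j = Σ_i q_i p_{ij} for the transition probabilities (31).

```python
from fractions import Fraction
from math import comb

N = 6
q = [Fraction(comb(N, j) ** 2, comb(2 * N, N)) for j in range(N + 1)]  # formula (32)

def p(i, j):
    """Transition probabilities (31) of the Bernoulli-Laplace chain, as exact fractions."""
    r = Fraction(i, N)
    if j == i - 1:
        return r ** 2
    if j == i + 1:
        return (1 - r) ** 2
    if j == i:
        return 2 * r * (1 - r)
    return Fraction(0)

is_stationary = all(sum(q[i] * p(i, j) for i in range(N + 1)) == q[j]
                    for j in range(N + 1))
```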
4. At the beginning of this chapter we wrote that the issue of major interest here
is the asymptotic behavior (as n → ∞) of memoryless systems. In the previous
sections we considered Markov chains with countable state space E = {i, j, . . . } as
a specific class of these systems, and we studied in them the behavior of transition
probabilities $p_{ij}^{(n)}$ as n → ∞. In particular, we investigated the asymptotic behavior
of a simple random walk in which transitions are possible only to adjacent states.
Of great interest is the study of similar problems for Markov chains with more
complicated state spaces. In this regard, see, for example, [14, 65].
5. The two models considered above (Ehrenfest and Bernoulli–Laplace) are said to
be discrete diffusion models.
We will give an explanation to this expression in terms of asymptotic behavior of
a simple random walk in R. Let Sn = ξ1 + · · · + ξn , n ≥ 1, S0 = 0, where ξ1 , ξ2 , . . .
is a sequence of independent identically distributed random variables with E ξi = 0,
Var ξi = 1. Let $X_0^n = 0$ and
$$X_t^n = \frac{S_{[nt]}}{\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{k=1}^{[nt]} \xi_k, \quad 0 < t \le 1.$$
Clearly, the sequence $(0, X_{1/n}^n, X_{2/n}^n, \ldots, X_1^n)$ may be regarded as a simple random
walk at times Δ, 2Δ, . . . , 1 with Δ = 1/n and jumps of order $\sqrt{\Delta}$
($\Delta X_{k\Delta}^n \equiv X_{k\Delta}^n - X_{(k-1)\Delta}^n = \xi_k \sqrt{\Delta}$).
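For illustration, the scaled walk is easy to simulate; the sketch below (mine, with Bernoulli ±1 steps as a concrete choice of ξi) checks that the increments have size √Δ = 1/√n while X^n_1 has mean about 0 and variance about 1, in agreement with E ξi = 0, Var ξi = 1.

```python
import random

random.seed(0)
n, trials = 400, 2000
dt = 1.0 / n                       # Delta = 1/n; every jump has size sqrt(Delta)

def X1():
    """X^n_1 = S_n / sqrt(n) for a +-1 random walk."""
    return sum(random.choice((-1, 1)) for _ in range(n)) * dt ** 0.5

samples = [X1() for _ in range(trials)]
mean = sum(samples) / trials
var = sum(x * x for x in samples) / trials
```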
1. The subject of this section is closely related to Sect. 13, Chap. 7, which dealt
with the martingale approach to optimal stopping problems for arbitrary stochas-
tic sequences. In this section we focus on the case where stochastic sequences are
generated by functions of states of Markov chains, which enables us to present and
interpret the general results of Sect. 13, Chap. 7, in a simple and comprehensible way.
2. Let X = (Xn , Fn , Px ) be a homogeneous Markov chain with discrete time and
phase space (E, E ).
We will assume that the space (Ω, F ) on which the variables Xn = Xn (ω), n ≥ 0,
are defined is a coordinate space (as in Subsection 6, Sect. 1) and that the Xn (ω) are
specified coordinate-wise,
i.e., $X_n(\omega) = x_n$ if ω = (x0 , x1 , . . . ) ∈ Ω. The σ-algebra
F is defined as $\sigma\big(\bigcup_n \mathscr{F}_n\big)$, where $\mathscr{F}_n = \sigma(x_0, \ldots, x_n)$, n ≥ 0.
Remark. In the “general theory of optimal stopping rules” there is no need to re-
quire that Ω be a coordinate space. Nevertheless in the “general theory” one also
must assume that the space is sufficiently “rich.” (For details, see [69].)
In the present exposition, the assumption of coordinate space simplifies the pre-
sentation, in particular, regarding the generalized Markov property (Theorem 1 in
Sect. 2), which was defined in this very framework.
As before, P(x; B) will denote the transition function of our chain: P(x; B) =
Px {X1 ∈ B}, x ∈ E, B ∈ E.
Let T be the one-step transition operator that acts on E -measurable functions
f = f (x) satisfying Ex |f (X1 )| < ∞, x ∈ E, in the following way:
$$(Tf)(x) = E_x f(X_1) = \int_E f(y)\, P(x; dy). \tag{1}$$
(For notational simplicity, we will write Tf (x) instead of (Tf )(x). A similar conven-
tion will also be used in other cases.)
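On a finite state space, T is simply multiplication of the transition matrix by the column vector f. A minimal sketch of mine (the two-state matrix is a made-up illustration, not from the text):

```python
# (Tf)(x) = E_x f(X_1) = sum_y P(x, y) f(y) for a finite chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]          # illustrative transition matrix

def T(f):
    return [sum(P[x][y] * f[y] for y in range(len(P))) for x in range(len(P))]

f = [1.0, 0.0]            # indicator of state 0
Tf = T(f)                 # Tf[x] = P_x(X_1 = 0)
TTf = T(Tf)               # iterating T yields the two-step probabilities
```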
9 Optimal Stopping Problems for Markov Chains 297
3. To state the optimal stopping problem for the Markov chain X, let g = g(x) be a
given E -measurable real-valued function such that Ex |g(Xn )| < ∞, x ∈ E, for all
n ≥ 0 (or for 0 ≤ n ≤ N if we are to take the “optimal decision” before a time N
specified a priori).
Let $M_0^n$ be the class of Markov times τ = τ(ω) (with respect to the filtration
$(\mathscr{F}_k)_{0 \le k \le n}$) taking values in the set {0, 1, . . . , n}.
The following theorem is a "Markov" version of Theorems 1 and 2 in Sect. 13,
Chap. 7. Let
$$s_n(x) = \sup_{\tau \in M_0^n} E_x g(X_\tau).$$

Theorem 1. (1) The stopping time
$$\tau_0^n = \min\{0 \le k \le n \colon s_{n-k}(X_k) = g(X_k)\} \tag{3}$$
is optimal in the class $M_0^n$:
$$s_n(x) = E_x g(X_{\tau_0^n})$$
for all x ∈ E.
(2) The functions sn (x) are determined by the formula
$$s_n(x) = (Q^n g)(x), \tag{6}$$
where Q is the operator defined by
$$Qg(x) = \max(g(x), Tg(x)),$$
and satisfy the recurrence relations
$$s_n(x) = \max(g(x),\ T s_{n-1}(x)), \quad 1 \le n \le N, \tag{7}$$
with s0 (x) = g(x).
PROOF. Let us apply Theorems 1 and 2 of Sect. 13, Chap. 7, to the functions fn =
g(Xn ), 0 ≤ n ≤ N. To this end, fix an initial state x ∈ E and consider the functions
$V_n^N$ and $v_n^N$ introduced therein. To highlight the dependence on the initial state, we
will write $V_n^N = V_n^N(x)$. Thus,
$$V_n^N(x) = \sup_{\tau \in M_n^N} E_x g(X_\tau),$$
where $M_n^N$ is the class of Markov times (with respect to the filtration $(\mathscr{F}_k)_{k \le N}$)
taking values in the set {n, n + 1, . . . , N}.
In accordance with (6) of Sect. 13, Chap. 7, the functions $v_n^N$ are defined recursively:
$$v_N^N = g(X_N), \qquad v_n^N = \max\big(g(X_n),\ E_x(v_{n+1}^N \mid \mathscr{F}_n)\big). \tag{9}$$
By the generalized Markov property (Theorem 1 in Sect. 2),
where $E_{X_{N-1}} g(X_1)$ is to be understood as follows (Sect. 2): for the function ψ(x) =
Ex g(X1 ), i.e., ψ(x) = (Tg)(x), we define $E_{X_{N-1}} g(X_1) \equiv \psi(X_{N-1}) = (Tg)(X_{N-1})$.
Hence $v_N^N = g(X_N)$ and
$$v_{N-1}^N = \max\big(g(X_{N-1}),\ (Tg)(X_{N-1})\big) = (Qg)(X_{N-1}).$$
By (13) of Sect. 13, Chap. 7, we have $v_0^N = V_0^N$. Since $V_0^N = V_0^N(x) = s_N(x)$, we
have $s_N(x) = (Q^N g)(x)$, which proves (6) for n = N (and similarly for any n < N).
The recurrence formulas (7) follow from (6) and the definition of Q.
We show now that the stopping time defined by (3) is optimal (for n = N) in the
class $M_0^N$ (and similarly in the classes $M_0^n$ for n < N).
By Theorem 1 of Sect. 13, Chap. 7, the optimal stopping time is
$$\tau_0^N = \min\{0 \le k \le N \colon v_k^N = g(X_k)\}. \tag{12}$$
Now (12) and the fact established above that sn (x) = (Qn g)(x) for any n ≥ 0 imply
that
$$v_k^N = (Q^{N-k} g)(X_k) = s_{N-k}(X_k). \tag{13}$$
Therefore
$$\tau_0^N = \min\{0 \le k \le N \colon s_{N-k}(X_k) = g(X_k)\}, \tag{14}$$
which proves the optimality of this stopping time in the class $M_0^N$.
4. Use the notation
$$D_k^N = \{x \in E \colon s_{N-k}(x) = g(x)\}, \qquad C_k^N = \{x \in E \colon s_{N-k}(x) > g(x)\};$$
by analogy with the sets $D_k^N$ and $C_k^N$ (in Ω) introduced in Subsection 6, Sect. 13,
Chap. 7, these sets can be called the stopping sets and continuation-of-observation sets (in E), respectively.
Let us point out the specific features of the stopping problems in the case of
Markov chains. Unlike the general case, the answer to the question of whether ob-
servations are to be stopped or continued is given in the Markov case in terms of
the states of the Markov chain itself (τ0N = min{0 ≤ k ≤ N : Xk ∈ DNk }), in
other words, depending on the position of the wandering particle. And the com-
plete solution of the optimal stopping problems (i.e., the description of the “price”
sN (x) and the optimal stopping time τ0N ) is obtained from the recurrence “dynamic
programming equations” (7) by finding successively the functions s0 (x) = g(x),
s1 (x), . . . , sN (x).
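The recurrence (7) is exactly backward induction ("value iteration"). A small sketch under assumed data: the three-state chain and the payoff g below are my own illustration (a walk absorbed at the two endpoints), not an example from the text.

```python
P = [[1.0, 0.0, 0.0],     # state 0 absorbing
     [0.5, 0.0, 0.5],     # from state 1: left or right with probability 1/2
     [0.0, 0.0, 1.0]]     # state 2 absorbing
g = [0.0, 0.4, 1.0]       # illustrative payoff function

def T(h):
    return [sum(P[x][y] * h[y] for y in range(3)) for x in range(3)]

s = g[:]                   # s_0 = g
for _ in range(50):        # s_{n+1}(x) = max(g(x), T s_n(x))
    Ts = T(s)
    s = [max(g[x], Ts[x]) for x in range(3)]

stop_set = [x for x in range(3) if abs(s[x] - g[x]) < 1e-9]  # D = {x : s(x) = g(x)}
```

From state 1 it is better to continue (s(1) = 0.5 > g(1) = 0.4), so the stopping set is {0, 2}.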
5. Consider now the optimal stopping problem assuming that τ ∈ $M_0^\infty$, where $M_0^\infty$
is the class of all finite Markov times. (The assumption τ ∈ $M_0^N$ means that τ ≤ N,
while the assumption τ ∈ $M_0^\infty$ means only that τ = τ(ω) < ∞ for all ω ∈ Ω.)
Thus we consider the price
$$s(x) = \sup_{\tau \in M_0^\infty} E_x g(X_\tau). \tag{20}$$
To avoid any questions about the existence of expectations Ex g(Xτ ), we can assume,
for example, that
$$E_x \sup_n g^-(X_n) < \infty, \quad x \in E. \tag{21}$$
for all x ∈ E. Of course, it is natural to expect that limN→∞ sN (x) equals s(x). If this
is the case, then, passing to the limit in (7), we find that s(x) satisfies the equation
$$s(x) = \max(g(x),\ Ts(x)). \tag{23}$$
This equation implies that s(x), x ∈ E, fulfills the following "variational inequalities":
$$s(x) \ge g(x), \tag{24}$$
$$s(x) \ge Ts(x). \tag{25}$$
Inequality (24) says that s(x) is a majorant of g(x). Inequality (25), according to
the terminology of the general theory of Markov processes, means that s(x) is an
excessive or a superharmonic function.
Therefore, if we could establish that s(x) satisfies (23), then the price s(x) would
be an excessive majorant for g(x).
Note now that if a function v(x) is an excessive majorant for g(x), then, obviously,
the following variational inequality holds:
$$v(x) \ge \max(g(x),\ Tv(x)). \tag{26}$$
It turns out, however, that if we assume additionally that v(x) is the least excessive
majorant, then (26) becomes an equality, i.e., v(x) satisfies the equation
$$v(x) = \max(g(x),\ Tv(x)). \tag{27}$$
Lemma 1. Any least excessive majorant v(x) of g(x) satisfies equation (27).
PROOF. The proof is fairly simple. Clearly, v(x) satisfies inequality (26). Let
v1 (x) = max(g(x), Tv(x)). Since v1 (x) ≥ g(x) and v1 (x) ≤ v(x), x ∈ E, we have
Tv1 (x) ≤ Tv(x) ≤ v1 (x). Therefore v1 (x) is an excessive majorant for g(x). But v(x) is the least excessive
majorant. Hence v(x) ≤ v1 (x), i.e., v(x) ≤ max(g(x), Tv(x)). Together with (26)
this implies (27).
The preliminary discussion presented earlier, which was based on the assumption s(x) = limN→∞ sN (x) and led to (23), as well as the statement of Lemma 1,
suggests a characterization of the price s(x): it is likely to be the
least excessive majorant of g(x).
Indeed, the following theorem is true.
Theorem 2. Suppose a function g = g(x) satisfies Ex supn g− (Xn ) < ∞, x ∈ E.
Then the following statements are valid.
(a) The price s = s(x) is the least excessive majorant of g = g(x).
(b) The price s(x) is equal to limN→∞ sN (x) = limN→∞ (QN g)(x) and satisfies the
Wald–Bellman dynamic programming equation s(x) = max(g(x), Ts(x)), x ∈ E.
(c) For every ε > 0 the stopping time τε∗ = min{n ≥ 0 : s(Xn ) ≤ g(Xn ) + ε} is ε-optimal:
s(x) − ε ≤ Ex g(Xτε∗ ), x ∈ E.
(d) If Px {τ0∗ < ∞} = 1, x ∈ E, for the stopping time τ0∗ = min{n ≥ 0 : s(Xn ) = g(Xn )}, then τ0∗ is optimal: s(x) = Ex g(Xτ0∗ ), x ∈ E.
is well defined, so that we can consider the optimal stopping problem in the class
$M_0^\infty$ as well.
PROOF OF THEOREM 2. We will give the proof only for the case of a finite set E.
In this case the proof is rather simple and clarifies quite well the appearance of
excessive functions in optimal stopping problems. For the proof in the general case,
see [69, 22].
Proof of (a). Let us show that s(x) is excessive, i.e., s(x) ≥ Ts(x), x ∈ E.
It is obvious that for any state y ∈ E and any ε > 0 there is a finite (Py -a.s.)
Markov time τy ∈ $M_0^\infty$ (depending in general on ε > 0) such that
Using these times τy , y ∈ E, we will construct one more time τ̂ , which will deter-
mine the following strategy of the choice of the stopping time.
Let the particle be in the state x ∈ E at the initial time instant. The observation
process is certainly not stopped at this time, and one observation is made. Suppose
that at time n = 1 the particle is in the state y ∈ E. Then the strategy determined by τ̂
consists, informally speaking, in treating the "life" of the particle as if it started
anew at this time, subject further to the stopping rule governed by τy .
The formal definition of τ̂ is as follows.
Let y ∈ E. Consider the event {ω : τy (ω) = n}, n ≥ 0. Since τy is a Markov time,
this event belongs to Fn . We assume that Ω is a coordinate space generated by
sequences ω = (x0 , x1 , . . . ) with xi ∈ E, and Fn = σ(x0 , . . . , xn ). This implies that
the set {ω : τy (ω) = n} can be written {ω : (X0 (ω), . . . , Xn (ω)) ∈ By (n)}, where
By (n) is a set in E n+1 = E ⊗ · · · ⊗ E (n + 1 times). (See also Theorem 4 in Sect. 2,
Chap. 2, Vol. 1).
By definition, the Markov time τ̂ = τ̂(ω) equals n + 1 with n ≥ 0 on the set
$$\hat A_n = \bigcup_{y \in E} \{\omega \colon X_1(\omega) = y,\ (X_1(\omega), \ldots, X_{n+1}(\omega)) \in B_y(n)\}.$$
Thus,
$$s(x) \ge E_x g(X_{\hat\tau}) \ge Ts(x) - \varepsilon, \quad x \in E,$$
and, since ε > 0 is arbitrary,
$$s(x) \ge Ts(x), \quad x \in E.$$
(Note that the conditions of Theorem 1, Sect. 2, Chap. 7, mentioned earlier are ful-
filled in this case since space E is finite.)
Now we deduce from (32) the following corollary.
Corollary 2. Let the function g = g(x), x ∈ E, in the optimal stopping problem (20)
be excessive (superharmonic). Then τ0∗ ≡ 0 is an optimal stopping time.
This implies that s̄(x) is an excessive majorant for g(x). But s(x) is the least exces-
sive majorant. Hence s(x) ≤ s̄(x). On the other hand, since sN (x) ≤ s(x) for any
N ≥ 0, we have s̄(x) ≤ s(x).
Therefore s̄(x) = s(x), which proves statement (b).
Proof of (c, d). Finally, we will show that the stopping time
$$\tau_0^* = \min\{n \ge 0 \colon X_n \in D^*\}, \quad \text{where } D^* = \{x \in E \colon s(x) = g(x)\},$$
is finite: Px {τ0∗ < ∞} = 1, x ∈ E. This is true indeed under our assumption that the state space E is finite. (For an
infinite E this is, in general, not the case; see Problem 1.)
For the proof of this, note that the event {τ0∗ = ∞} is the same as
$A = \bigcap_{n \ge 0} \{X_n \notin D^*\}$. Thus, we are to show that Px (A) = 0 for all x ∈ E.
Obviously, this is the case if D∗ = E.
Let D∗ ≠ E. Since E is finite, there is α > 0 such that g(y) ≤ s(y) − α for all
y ∈ E \ D∗ . Then, for any τ ∈ $M_0^\infty$,
$$E_x g(X_\tau) = \sum_{n=0}^{\infty} \sum_{y \in E} P_x\{\tau = n,\ X_n = y\}\, g(y)$$
$$= \sum_{n=0}^{\infty} \sum_{y \in D^*} P_x\{\tau = n,\ X_n = y\}\, g(y) + \sum_{n=0}^{\infty} \sum_{y \in E \setminus D^*} P_x\{\tau = n,\ X_n = y\}\, g(y)$$
$$\le \sum_{n=0}^{\infty} \sum_{y \in D^*} P_x\{\tau = n,\ X_n = y\}\, s(y) + \sum_{n=0}^{\infty} \sum_{y \in E \setminus D^*} P_x\{\tau = n,\ X_n = y\}\, (s(y) - \alpha)$$
$$= E_x s(X_\tau) - \alpha\, P_x\{X_\tau \in E \setminus D^*\} \le s(x) - \alpha\, P_x\{X_\tau \in E \setminus D^*\}, \tag{38}$$
where the last inequality follows because s(x) is excessive (superharmonic) and
satisfies Eq. (32).
Taking the supremum over all τ ∈ $M_0^\infty$ on the left-hand side of (38), we obtain
But |s(x)| < ∞ and α > 0. Therefore Px (A) = 0, x ∈ E, which proves the finiteness
of the stopping time τ0∗ .
We will show now that this stopping time is optimal in the class $M_0^\infty$. By the
definition of τ0∗ ,
s(Xτ0∗ ) = g(Xτ0∗ ). (39)
With this property in mind, consider the function γ(x) = Ex g(Xτ0∗ ) = Ex s(Xτ0∗ ).
In what follows, we show that γ(x) has the properties:
(i) γ(x) is excessive;
(ii) γ(x) majorizes g(x), i.e., γ(x) ≥ g(x), x ∈ E;
(iii) Obviously, γ(x) ≤ s(x).
Properties (i) and (ii) imply that γ(x) is an excessive majorant for s(x), which, in
turn, is the least excessive majorant for g(x). Hence, by (iii), γ(x) = s(x), x ∈ E,
which yields
s(x) = Ex g(Xτ0∗ ), x ∈ E,
thereby proving the required optimality of τ0∗ within $M_0^\infty$.
Let us prove (i). Use the notation τ̄ = min{n ≥ 1 : Xn ∈ D∗ }. This is a Markov
time, τ0∗ ≤ τ̄ , τ̄ ∈ $M_1^\infty$, and since s(x) is excessive, we have, by (33),
Next, using the generalized Markov property (see (2) in Theorem 1, Sect. 2) we
obtain
$$E_x s(X_{\bar\tau}) = \sum_{n=1}^{\infty} \sum_{y \in D^*} P_x\{X_1 \notin D^*, \ldots, X_{n-1} \notin D^*,\ X_n = y\}\, s(y)$$
$$= \sum_{n=1}^{\infty} \sum_{y \in D^*} \sum_{z \in E} p_{xz}\, P_z\{X_0 \notin D^*, \ldots, X_{n-2} \notin D^*,\ X_{n-1} = y\}\, s(y)$$
$$= \sum_{z \in E} p_{xz}\, E_z s(X_{\tau_0^*}). \tag{41}$$
i.e.,
$$\gamma(x) \ge \sum_{z \in E} p_{xz}\, \gamma(z), \quad x \in E,$$
so that γ(x) is excessive, which proves (i).
Let us prove (ii). Let ε = maxx∈E (g(x) − γ(x)), let x∗ be a point at which this maximum is attained, and assume, contrary to (ii), that ε > 0. Put
$$\tilde\gamma(x) = \gamma(x) + \varepsilon. \tag{42}$$
Clearly, this function is excessive (being the sum of an excessive function and a
constant) and
$$\tilde\gamma(x) \ge g(x)$$
for all x ∈ E. Thus, γ̃(x) is an excessive majorant for g(x), and hence γ̃(x) ≥ s(x),
since s(x) is the least excessive majorant for g(x).
This implies that
$$\tilde\gamma(x^*) \ge s(x^*).$$
But γ̃(x∗ ) = g(x∗ ) by (42); therefore g(x∗ ) ≥ s(x∗ ). Since s(x) ≥ g(x) for
all x ∈ E, we obtain g(x∗ ) = s(x∗ ), i.e., x∗ ∈ D∗ . But at any point of D∗ we have
τ0∗ = 0 and hence γ(x∗ ) = Ex∗ g(Xτ0∗ ) = g(x∗ ), whereas γ(x∗ ) < g(x∗ ) by assumption.
This contradiction shows that γ(x) ≥ g(x) for all x ∈ E.
6. Let us give some examples.
EXAMPLE 1. Consider the simple random walk with two absorbing barriers de-
scribed in Example 5, Sect. 8, assuming that p = q = 1/2 (symmetric random
walk). If a function γ(x), x ∈ E = {0, 1, . . . , N}, is excessive for this random walk,
then
$$\gamma(x) \ge \frac{1}{2}\,\gamma(x-1) + \frac{1}{2}\,\gamma(x+1) \tag{43}$$
for all x = 1, . . . , N − 1.
Suppose we are given a function g = g(x), x ∈ {0, 1, . . . , N}. Since states 0
and N are absorbing, the function s(x) must be sought among the functions γ(x)
satisfying condition (43) and the boundary conditions γ(0) = g(0), γ(N) = g(N).
Condition (43) means that γ(x) is concave (convex upward) on the set {1, 2, . . . , N − 1}. Hence we
can conclude that the price s(x) in the problem s(x) = $\sup_{\tau \in M_0^\infty}$ Ex g(Xτ ) is the
least concave majorant of g(x) subject to the boundary conditions s(0) = g(0), s(N) = g(N).
A visual description of the rule for determining s(x) is as follows. Let us cover the
values of g(x) by a stretched thread from above. In Fig. 42 this thread passes through the points
(0, a), (1, b), (4, c), (6, d), where the points 0, 1, 4, 6 form the set D∗ of stopping
states. At these points we have s(x) = g(x). The values of s(x) at the other points x =
2, 3, 5 are determined by linear interpolation. In the general case, the least concave majorant
s(x), x ∈ E, is constructed in a similar manner.
Fig. 42 The function g(x) (dashed line) and its least concave majorant s(x), x = 0, 1, . . . , 6
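The "stretched thread" can be computed by iterating the dynamic-programming step s(x) ← max(g(x), ½(s(x−1) + s(x+1))) with the endpoint values fixed. In the sketch below the payoffs are hypothetical values of my own, chosen so that, as in Fig. 42, the stopping set comes out as {0, 1, 4, 6}.

```python
g = [0.0, 1.0, 0.3, 0.2, 0.8, 0.1, 0.4]   # hypothetical g(x), x = 0, ..., 6
N = len(g) - 1

s = g[:]                                   # boundary values s(0) = g(0), s(N) = g(N)
for _ in range(500):
    s = [s[0]] + [max(g[x], 0.5 * (s[x - 1] + s[x + 1])) for x in range(1, N)] + [s[N]]

D = [x for x in range(N + 1) if abs(s[x] - g[x]) < 1e-6]   # stopping states
```

At the remaining points x = 2, 3, 5 the computed s(x) is the linear interpolation along the thread.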
EXAMPLE 2 (the problem of choosing the best object, often called the "secretary problem"). Let ω = (a1 , a2 , . . . , aN ) be a permutation of the numbers (1, 2, . . . , N), all N! permutations being equally likely, where an is interpreted as the rank of the nth candidate (rank N being the best). The candidates are examined successively, and at each step only the relative ranks of the candidates seen so far are known. The aim is to maximize the probability P{aτ = N} of selecting the best candidate:
$$V^* = \sup_{\tau} P\{a_\tau = N\},$$
where τ runs over a class of stopping times $M_1^N$ determined by information accessible to the fiancée.
To describe the class MN1 more precisely, let us construct the rank sequence X =
(X1 , X2 , . . . ) depending on ω = (a1 , a2 , . . . , aN ), which will determine the action
of the fiancée.
That is, let X1 = 1, and let X2 be the order number of the first candidate that
dominates all the preceding ones, X3 the order number of the next such candidate, and so on. If, for example, X2 = 3, this means that in
ω = (a1 , a2 , . . . , aN ) we have a1 > a2 , but a3 > a1 (> a2 ). If, say, X3 = 5,
then a3 > a4 , but a5 > a3 (> a4 ).
There might be at most N dominants (when (a1 , a2 , . . . , aN ) = (1, 2, . . . , N)). If
ω = (a1 , a2 , . . . , aN ) contains m dominants, we set Xm+1 = Xm+2 = · · · = N + 1.
The class MN1 of admissible stopping times will consist of those τ = τ (ω) for
which
{ω : τ (ω) = n} ∈ FnX ,
where FnX = σ(X1 , . . . , Xn ), 1 ≤ n ≤ N.
Consider the structure of the rank sequence X = (X1 , X2 , . . . ) in more detail. It
is not hard to see (Problem 3) that this sequence is a homogeneous Markov chain
(with phase space E = {1, 2, . . . , N + 1}). The transition probabilities of this chain
are given by the following formulas:
i
pij = , 1 ≤ i < j ≤ N, (45)
j(j − 1)
i
pi,N+1 = , 1 ≤ i ≤ N, (46)
N
pN+1,N+1 = 1. (47)
It is seen from these formulas that the state N + 1 is absorbing and all the transitions
on set E are upward, i.e., the only possible transitions are i → j with j > i.
Remark. Formula (45) follows from the following simple arguments taking into
account that the probability of every sequence ω = (a1 , . . . , aN ) is 1/N!.
For 1 ≤ i < j ≤ N the transition probability is equal to
$$p_{ij} = P(X_{n+1} = j \mid X_n = i) = \frac{P\{X_n = i,\ X_{n+1} = j\}}{P\{X_n = i\}}. \tag{48}$$
Suppose now that the fiancée adopted a stopping time τ (with respect to the
system of σ-algebras (FnX )) and Xτ = i. Then the conditional probability that this
stopping time is successful (i.e., aτ = N) is, according to (46), equal to Xτ /N = i/N.
Therefore
$$P\{a_\tau = N\} = E\,\frac{X_\tau}{N},$$
and hence seeking the optimal stopping time τ ∗ (i.e., the stopping time for which
P{aτ ∗ = N} = supτ P{aτ = N}) reduces to the optimal stopping problem
$$V^* = \sup_{\tau} E\,\frac{X_\tau}{N}. \tag{49}$$
By Theorem 2, the price v(i) in problem (49) is an excessive majorant of g(i) = i/N (with g(N + 1) = 0), i.e.,
$$v(i) \ge Tv(i) = \sum_{j=i+1}^{N} \frac{i}{j(j-1)}\, v(j), \tag{50}$$
$$v(i) \ge g(i), \tag{51}$$
and moreover it is the least excessive majorant. The same Theorem 2 implies that
v(i), 1 ≤ i ≤ N + 1, satisfies the equation
D∗ = {i ∈ E : v(i) = g(i)}.
Therefore, if i ∈ D∗ , then
$$g(i) = v(i) \ge Tv(i) = \sum_{j=i+1}^{N} \frac{i}{j(j-1)}\, v(j) \ge \sum_{j=i+1}^{N} \frac{i}{j(j-1)}\, g(j)$$
$$= \sum_{j=i+1}^{N} \frac{i}{j(j-1)} \cdot \frac{j}{N} = g(i) \sum_{j=i+1}^{N} \frac{1}{j-1},$$
whence
$$\sum_{j=i+1}^{N} \frac{1}{j-1} \le 1.$$
Therefore the stopping set is of the form
$$D^* = \{i^*, i^*+1, \ldots, N, N+1\},$$
where i∗ = i∗ (N) is the smallest i (1 ≤ i ≤ N) for which $\sum_{j=i+1}^{N} \frac{1}{j-1} \le 1$.
Outside D∗ the price satisfies v(i) = Tv(i). In particular, for i = i∗ − 1,
$$v(i^*-1) = \sum_{j=i^*}^{N} \frac{i^*-1}{j(j-1)}\, g(j) = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big),$$
and for i = i∗ − 2,
$$v(i^*-2) = Tv(i^*-2) = \frac{i^*-2}{(i^*-1)(i^*-2)}\, v(i^*-1) + \sum_{j=i^*}^{N} \frac{i^*-2}{j(j-1)}\, g(j)$$
$$= \frac{1}{i^*-1}\, v(i^*-1) + \frac{i^*-2}{N} \sum_{j=i^*}^{N} \frac{1}{j-1} = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big).$$
By induction we establish that
$$v(i) = v^*(N) = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big) \tag{55}$$
for 1 ≤ i < i∗ . Therefore
$$v(i) = \begin{cases} v^*(N), & 1 \le i < i^*(N), \\ g(i) = i/N, & i^*(N) \le i \le N. \end{cases} \tag{56}$$
$$\lim_{N\to\infty} v^*(N) = \lim_{N\to\infty} \frac{i^*(N)-1}{N} = \frac{1}{e} \approx 0.368. \tag{58}$$
This result may appear somewhat surprising since it implies that for a large num-
ber N of candidates the fiancée can choose the best one with a fairly high probability,
V ∗ = supτ P{aτ = N} = v∗ (N) ≈ 0.368. The optimal stopping time for that is
τ ∗ = min{n : Xn ∈ D∗ }.
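The threshold rule is easy to test by simulation. In the sketch below (my own; the names are arbitrary), i∗(N) is found from the condition Σ_{j=i+1}^N 1/(j−1) ≤ 1, and the rule "skip the first i∗ − 1 candidates, then take the first one dominating all predecessors" is applied to random permutations.

```python
import random

random.seed(1)
N = 100

# i* = the smallest i with sum_{j=i+1}^N 1/(j-1) <= 1
i_star = next(i for i in range(1, N + 1)
              if sum(1.0 / (j - 1) for j in range(i + 1, N + 1)) <= 1)

def trial():
    perm = random.sample(range(1, N + 1), N)   # a_1, ..., a_N; rank N is the best
    prefix_best = max(perm[:i_star - 1], default=0)
    for k in range(i_star - 1, N):
        if perm[k] > prefix_best:              # first dominant after the threshold
            return perm[k] == N                # success iff it is the overall best
    return False                               # no dominant appeared: failure

trials = 10000
success = sum(trial() for _ in range(trials)) / trials
```

For N = 100 the threshold i∗(N) is close to N/e, and the observed success frequency is near 1/e ≈ 0.368, in line with (58).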
$$f(x) = \int_R f(y)\, P(x; dy), \quad x \in R. \tag{59}$$
(If the equality sign in (59) is replaced by the greater-than-or-equal sign,
then f is superharmonic.) Prove that if f is superharmonic, then for any x ∈ R
the sequence (f (Xn ))n≥0 with X0 = x is a supermartingale (with respect to Px ).
5. Show that the stopping time τ̄ involved in (38) belongs to the class $M_1^\infty$.
6. Similarly to Example 1 in Subsection 6, consider the optimal stopping problems with the prices
$$s_N(x) = \sup_{\tau \in M_0^N} E_x g(X_\tau)$$
and
$$s(x) = \sup_{\tau \in M_0^\infty} E_x g(X_\tau)$$
In the history of probability theory, we can distinguish the following periods of its
development (cf. [34, 46]∗ ):
1. Prehistory,
2. First period (the seventeenth century–early eighteenth century),
3. Second period (the eighteenth century–early nineteenth century),
4. Third period (second half of the nineteenth century),
5. Fourth period (the twentieth century).
Prehistory. Intuitive notions of randomness and the beginning of reflection about
chances (in ritual practice, deciding controversies, fortune telling, and so on) ap-
peared far back in the past. In the prescientific ages, these phenomena were regarded
as incomprehensible for human intelligence and rational investigation. It was only
several centuries ago that their understanding and logically formalized study began.
Archeological findings tell us about ancient “randomization instruments”, which
were used for gambling games. They were made from the ankle bone (Latin: astragalus) of hoofed animals and had four faces on which they could fall. Such dice were
definitely used for gambling during the First Dynasty in Egypt (around 3500 BC),
and then in ancient Greece and ancient Rome. It is known ([14]) that the Roman
Emperors Augustus (63 BC–14 AD) and Claudius (10 BC–54 AD) were passionate
dicers.
In addition to gambling, which even at that time raised issues of favorable and
unfavorable outcomes, similar questions appeared in insurance and commerce. The
oldest forms of insurance were contracts for maritime transportation, which were
found in Babylonian records of the 4th to 3rd Millennium BC. Afterwards the prac-
tice of similar contracts was taken over by the Phoenicians and then came to Greece,
Rome, and India. Its traces can be found in early Roman legal codes and in legis-
lation of the Byzantine Empire. In connection with life insurance, the Roman jurist
Ulpian compiled (around 220 AD) the first mortality tables.
∗ The citations here refer to the list of References following this section.
In the time of flourishing Italian city-states (Rome, Venice, Genoa, Pisa, Flo-
rence), the practice of insurance caused the necessity of statistics and actuarial cal-
culations. It is known that the first dated life insurance contract was concluded in
Genoa in 1347.
The city-states gave rise to the Renaissance (fourteenth to early seventeenth cen-
turies), the period of social and cultural upheaval in Western Europe. In the Italian
Renaissance, there appeared the first discussions, mostly of a philosophical nature,
regarding the “probabilistic” arguments, attributed to Luca Pacioli (1447–1517),
Celio Calcagnini (1479–1541), and Niccolò Fontana Tartaglia (1500–1557) (see
[46, 14]).
Apparently, one of the first people to analyze gambling chances mathematically
was Gerolamo Cardano (1501–1576), who was widely known for inventing the Cardan
gear and for publishing a solution of the cubic equation (apparently first found by
Tartaglia). His manuscript (written around 1525
but not published until 1663) “Liber de ludo aleae” (“Book on Games of Chance”)
was more than a kind of practical manual for gamblers. Cardano was the first to state the
idea of combinations by which one could describe the set of all possible outcomes
(in throwing dice of various kinds and numbers). He observed also that for true
dice “the ratio of the number of favorable outcomes to the total number of possible
outcomes is in good agreement with gambling practice” ([14]).
1. First period (the seventeenth century–early eighteenth century). Many mathematicians and historians, such as Laplace [44] (see also [64]), traced the beginning of the “calculus of probabilities” to the correspondence between Blaise Pascal (1623–1662) and Pierre de Fermat (1601–1665). This correspondence arose from certain questions that Antoine Gombaud (alias Chevalier de Méré, a writer and moralist, 1607–1684) asked Pascal.
One of the questions was how to divide the stake in an interrupted game. Namely, suppose two gamblers, A and B, agreed to play a certain number of games, say until one of them wins five games, but were interrupted by an external cause when A had won 4 games and B had won 3. A seemingly natural answer is to divide the stake in the proportion 2 : 1. Indeed, the game certainly finishes within two more games, of which A needs to win only one, while B has to win both; this appears to imply the proportion 2 : 1. However, A has won 4 games against 3 won by B, so that the proportion 4 : 3 also looks natural. In fact, the correct answer found by Pascal and Fermat is 3 : 1.
Another question was: what is more likely, to have at least one 6 in 4 throws of a die, or to have at least one pair (6, 6) in 24 throws of two dice? In this problem, Pascal and Fermat also gave the correct answer: the former event is slightly more probable than the latter (1 − (5/6)^4 ≈ 0.518 against 1 − (35/36)^24 ≈ 0.491).
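The two probabilities are obtained by complementing the chance of no favorable outcome at all; a quick check (an illustration, not from the text):

```python
# De Méré's comparison: at least one 6 in 4 throws of a die, versus
# at least one double six in 24 throws of a pair of dice.
p_one_six = 1 - (5 / 6) ** 4        # complement of "no 6 in 4 throws"
p_double_six = 1 - (35 / 36) ** 24  # complement of "no (6,6) in 24 throws"
print(round(p_one_six, 3), round(p_double_six, 3))  # -> 0.518 0.491
```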
In solving these problems, Pascal and Fermat (as well as Cardano) applied com-
binatorial arguments that became one of the basic tools in “calculus of probabil-
ities” for the calculation of various chances. Among these tools, Pascal’s triangle
also found its place (although it was known before).
In 1657, the book by Christianus Huygens (1629–1695) “De Ratiociniis in Ludo
Aleæ” (“On Reasoning in Games of Chance”) appeared; it is regarded as the first treatise on probability theory. By then there was already an understanding of the difference between the notions of the probability of an event and the frequency of its occurrence in a finite number of trials, as well as of the possibility of determining this probability (to within a certain accuracy) from its frequency in a large number of trials.
2. Second period (the eighteenth century–early nineteenth century). This period
is associated, essentially, with the names of Pierre-Rémond de Montmort (1678–
1719), Abraham de Moivre (1667–1754), Thomas Bayes (1702–1761), Pierre Si-
mon de Laplace (1749–1827), Carl Friedrich Gauss (1777–1855), and Siméon De-
nis Poisson (1781–1840).
While the first period was largely of a philosophical nature, in the second one the
analytic methods were developed and perfected, computations became necessary in
various applications, and probabilistic and statistical approaches were introduced in
the theory of observation errors and shooting theory.
Both Montmort and de Moivre were greatly influenced by Bernoulli’s work in the
calculus of probability. In his book “Essai d’Analyse sur les Jeux de Hasard” (“Essay on the Analysis of Games of Chance”), Montmort pays major attention to methods of computation in diverse gambling games.
In the books “The Doctrine of Chances” (1718) and “Miscellanea Analytica” (“Analytical Miscellany”, 1730), de Moivre carefully defines such concepts as independence, expectation, and conditional probability.
De Moivre’s name is best known in connection with the normal approximation
for the binomial distribution. While J. Bernoulli’s law of large numbers showed that
the relative frequencies obey a certain regularity, namely, they converge to the corre-
sponding probabilities, the normal approximation discovered by de Moivre revealed
another universal regularity in the behavior of deviations from the mean value. This result of de Moivre and its subsequent generalizations played such a significant role in probability theory that the corresponding “integral limit theorem” became known as the Central Limit Theorem. (The term was introduced by G. Pólya (1887–1985) in 1920; see [60].)
The main figure of this period was, of course, Laplace (Pierre-Simon de Laplace,
1749–1827). His treatise “Théorie analytique des probabilités” (“Analytic Theory
of Probability”) published in 1812 was the main manual on the probability theory in
the nineteenth century. He also wrote several memoirs on foundations, philosophical
issues, and particular problems of the probability theory, in addition to his works on
astronomy and calculus. He made a significant contribution to the theory of errors.
The idea that the measurement errors are normally distributed as a result of the
summation of many independent elementary random errors is due to Laplace and
Gauss. Laplace not only restated de Moivre’s integral limit theorem in a more general form (the “de Moivre–Laplace theorem”), but also gave it a new analytic proof.
Following Bernoulli, Laplace maintained the equiprobability principle implying
the classical definition of probability (in the case of finitely many possible out-
comes).
However, already at that time there appeared “nonclassical” probability distributions that did not conform to the classical concepts. Such were, for example, the normal and Poisson laws, which for a long time were considered merely as certain approximations rather than as probability distributions per se (in the modern sense of the term).
Other problems, where “nonclassical” probabilities arose, were the ones related
to “geometric probabilities” (treated, e.g., by Newton 1665, see [55, p. 60]). An
example of such a problem is the “Buffon needle”. Moreover, unequal probabilities
arose from the Bayes formula (presented in “An Essay towards Solving a Problem in
the Doctrine of Chances” which was read to the Royal Society in 1763 after Bayes’
death). This formula gives the rule for recalculation of prior probabilities (assumed
equal by Bayes) into posterior ones given the occurrence of a certain event. This
formula gave rise to the statistical approach called nowadays “the Bayes approach”.
It can be seen from all that has been said that the framework of the “classical”
(finite) probability theory limited the possibilities of its development and applica-
tion, and the interpretation of the normal, Poisson, and other distributions merely
as limiting objects was giving rise to the feeling of incompleteness. During this pe-
riod, there were no abstract mathematical concepts in the probability theory and it
was regarded as nothing but a branch of applied mathematics. Moreover, its meth-
ods were confined to the needs of specific applications (such as gambling, theory of
observation errors, theory of shooting, insurance, demography, and so on).
3. Third period (second half of the nineteenth century). During the third pe-
riod, the general problems of probability theory developed primarily in St. Peters-
burg. The Russian mathematicians P. L. Chebyshev (1821–1894), A. A. Markov
(1856–1922), and A. M. Lyapunov (1857–1918) made an essential contribution to
the broadening and in-depth study of probability theory. It was due to them that
the limitation to “classical” probability was abandoned. Chebyshev clearly realized the role of the notions of a random variable and expectation and demonstrated their usefulness, which has now become a matter of course.
Bernoulli’s law of large numbers and the de Moivre–Laplace theorem dealt with
random variables taking only two values. Chebyshev extended the scope of these
theorems to much more general random variables. Already his first result estab-
lished the law of large numbers for sums of arbitrary independent random variables
bounded by a constant. (The next step was taken by Markov, who used the “Chebyshev–Markov inequality” in the proof.)
After the law of large numbers, Chebyshev turned to establishing the de Moivre–
Laplace theorem for sums of independent random variables, for which he worked
out a new tool, the method of moments, which was later elaborated by Markov.
The next unexpected step in finding general conditions for the validity of the de
Moivre–Laplace theorem was done by Lyapunov, who used the method of charac-
teristic functions taking its origin from Laplace. He proved this theorem assuming
only that the random variables involved in the sum have moments of order 2 + δ,
δ > 0 (rather than the moments of all orders required by the method of moments)
and satisfy Lyapunov’s condition.
Moreover, Markov introduced a fundamentally new concept, namely, that of a sequence of dependent random variables possessing the memoryless property known
nowadays as a Markov chain, for which he rigorously proved the first “ergodic the-
orem”.
Thus, we can definitely state that the works of Chebyshev, Markov, and Lyapunov
(“Petersburg school”) laid the foundation for all subsequent development of the
probability theory.
In Western Europe, interest in probability theory in the late nineteenth century was rapidly increasing due to its deep connections, discovered at that time, with pure mathematics, statistical physics, and the flourishing mathematical statistics.
It became clear that the development of probability theory was restrained by its
classical framework (finitely many equiprobable outcomes) and its extension had to
be sought in the models of pure mathematics. (Recall that at that time set theory was only beginning to be developed and measure theory was on the threshold of its creation.)
At the same time, pure mathematics, particularly number theory, which is an area
apparently very remote from probability theory, began to use concepts and obtain
results of a probabilistic nature with the help of probabilistic intuition.
For example, Jules Henri Poincaré (1854–1912), in his paper [58] of 1890 dealing with the three-body problem, stated a result on the recurrence of the motion of a dynamical system described by a transformation T preserving the “volume”. This result asserted that if A is a set of initial states ω, then for “typical” states ω ∈ A the trajectories T^n ω return to the set A infinitely often (in the modern language, the system returns for almost all [rather than for all] initial states of the system).
In the considerations of that time, expressions like “random choice”, “typical case”, and “special case” were often used. In the handbook Calcul des Probabilités [59], 1896,
H. Poincaré asks the question about the probability that a randomly chosen point of
[0, 1] happens to be a rational number.
In 1888, the astronomer Johan August Hugo Gyldén (1841–1896) published a paper [24] that dealt (like Poincaré’s [58]) with planetary stability and which nowadays would fall within the domain of number theory.
Let ω ∈ [0, 1] be a number chosen “at random” and let ω = (a1 , a2 , . . . )
be its continued fraction representation, where an = an (ω) are integers. (For a
rational number ω there are only finitely many nonvanishing an’s; the numbers ω_k = (a1, a2, . . . , ak, 0, 0, . . . ) formed from the representation ω = (a1, a2, . . . ) are in a sense the “best possible” rational approximations of ω.) The question is how
the numbers an (ω) behave for large n in “typical” cases.
Gyldén established (though nonrigorously) that the “probability” of having an = k in the representation ω = (a1, a2, . . . ) is “more or less” inversely proportional to k^2 for large k. Somewhat later, T. Brodén [9] and A. Wiman [69] showed, by dealing with geometric probabilities, that if the “random” choice of ω ∈ [0, 1] is determined by the uniform distribution of ω on [0, 1], then the probability that an(ω) = k tends to

    (log 2)^{-1} log( (1 + 1/k) / (1 + 1/(k+1)) )

as n → ∞. This expression is inversely proportional to k^2 for large k, which is what Gyldén essentially meant.
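This limiting law (now known as the Gauss–Kuzmin distribution) is easy to probe by simulation; a sketch in Python, assuming uniform ω sampled with the standard random module (the helper name is ours, and floating-point continued fractions are only an approximation):

```python
import math
import random

def cf_digit(omega: float, n: int) -> int:
    """n-th digit a_n in the continued fraction expansion of omega."""
    x, a = omega, 0
    for _ in range(n):
        if x == 0:            # expansion terminated (rational omega)
            return 0
        x = 1.0 / x
        a = int(x)
        x -= a
    return a

random.seed(1)
trials, n, k = 20_000, 12, 1
freq = sum(cf_digit(random.random(), n) == k for _ in range(trials)) / trials
limit = math.log((1 + 1 / k) / (1 + 1 / (k + 1))) / math.log(2)
print(round(freq, 3), round(limit, 3))  # both close to 0.415 for k = 1
```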
In the second half of the nineteenth century, probabilistic concepts and arguments found their way into classical physics and statistical mechanics. Let us
mention, for example, the Maxwell distribution (James Clerk Maxwell, 1831–1879)
for molecular velocities, see [51], and Boltzmann’s temporal averages and ergodic
hypothesis (Ludwig Boltzmann, 1844–1906), see [6, 7].
The concept of a statistical ensemble is connected with their names; it was further elaborated by Josiah Willard Gibbs (1839–1903), see [23].
An important role in the development of probability theory and the understanding of its concepts and approaches was played by the discovery in 1827 by Robert Brown (1773–1858) of the phenomenon now known as Brownian motion, which he described in the paper “A Brief Account of Microscopical Observations . . .” published
in 1828 (see [11]). Another phenomenon of this kind was the radioactive decay discovered in 1896 by Antoine Henri Becquerel (1852–1908), who studied the properties of uranium. In 1900, Louis Bachelier (1870–1946) used Brownian motion for the mathematical description of stock prices, see [2].
A qualitative explanation and a quantitative description of Brownian motion were given later by Albert Einstein (1879–1955) [15] and Marian Smoluchowski (1872–1917) [63]. The phenomenon of radioactivity was explained in the framework of quantum mechanics, which was created in the 1920s.
From all that has been said, it becomes apparent that the appearance of new probabilistic models and the use of probabilistic methodology went far beyond the scope of “classical probability” and required new concepts that would give a precise mathematical meaning to expressions such as a “randomly chosen point from the interval [0, 1]”, let alone the probabilistic description of Brownian
motion. From this perspective, measure theory and the notion of “Borel measure” introduced by Émile Borel (1871–1956) in 1898 [8], and the theory of integration by Henri Lebesgue (1875–1941) expounded in his book [45] of 1904, came at just the right time. (Borel introduced the measure on Euclidean space as a generalization of the notion of length. The modern presentation of measure theory on abstract measurable spaces follows Maurice Fréchet (1878–1973), see [22] of 1915. The history of measure theory and integration can be found, e.g., in [25].)
It was immediately recognized that Borel’s measure theory together with Lebesgue’s theory of integration formed a conceptual basis that could justify many probabilistic considerations and give precise meaning to intuitive formulations
like the “random choice of a point from [0, 1]”. Soon afterwards (1905), Borel himself produced an application of the measure-theoretic approach to probability theory by proving the first limit theorem, viz. the strong law of large numbers, regarding certain properties of real numbers that hold “with probability one”. This theorem, which gives an idea of “how many” real numbers with exceptional properties (in the sense specified below) there are, consists in the following.
Let ω = 0.α1α2 . . . be the binary representation, with αn = 0 or 1, of a real number ω ∈ [0, 1] (compare with the continued fraction representation ω = (a1, a2, . . . ) considered above), and let νn(ω) be the (relative) frequency of ones among the first n digits α1, . . . , αn. Then the set of those numbers ω for which νn(ω) → 1/2 as n → ∞ (“normal” numbers according to Borel) has Borel measure 1, while the (“exceptional”) numbers for which this convergence fails form a set of measure zero.
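Borel’s theorem is easy to probe empirically: modeling the binary digits of a “randomly chosen” ω as independent fair coin flips, the frequency of ones settles near 1/2. A small sketch (an illustration, not from the text):

```python
import random

random.seed(0)
n = 100_000
# digits alpha_1, ..., alpha_n of a "random" omega, modeled as fair coin flips
ones = sum(random.getrandbits(1) for _ in range(n))
nu_n = ones / n          # relative frequency of ones among the first n digits
print(nu_n)              # close to 1/2 for a "typical" omega
```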
† Bernoulli’s LLN dealt with a problem quite different from the one solved by the SLLN, namely,
obtaining an approximation for the distribution of the sum of n independent variables. Of course,
it was only the first step in this direction; the next one was the de Moivre–Laplace theorem. This
problem does not concern infinite sequences of random variables and is of current interest for
probability theory and mathematical statistics. For the modern form of the LLN, see, e.g., the
degenerate convergence criterion in Loève’s “Probability Theory” (Translator).
probability required (in the simplest cases) the concepts of “relative measures” and “relative frequencies” and (in the general case) some artificial limiting procedures.
Among the authors of subsequent work on the logical justification of probability theory, we mention first of all S. N. Bernstein (1880–1968) and Richard von Mises (1883–1953).
Bernstein’s system of axioms ([4], 1917) was based on the notion of qualita-
tive comparison of events according to their greater or smaller likelihood. But the
numerical value of probability was defined as a subordinate notion.
Afterwards, a very similar approach based on subjective qualitative statements
(“system of knowledge of the subject”) was extensively developed by Bruno de
Finetti (1906–1985) in the late 1920s–early 1930s (see, e.g., [16–21]).
The ideas of de Finetti found support from some statisticians following the Bayes
approach, e.g., Leonard Savage (1917–1971), see [61], and were adopted in game
and decision theory, where subjectivity plays a significant role.
In 1919, Mises proposed ([52, 53]) the frequentist (in other terms, statistical or empirical) approach to the foundations of probability theory. His basic idea was that probabilistic concepts are applicable only to so-called “collectives”, i.e., individual infinite ordered sequences possessing a certain property of “random” formation. Mises’ general scheme may be outlined as follows.
We have a space of outcomes of an “experiment” and assume that we can produce
infinitely many trials resulting in a sequence x = (x1 , x2 , . . .) of outcomes. Let
A be a subset of the set of outcomes and νn(A; x) = n^{-1} ∑_{i=1}^{n} I_A(xi) the relative frequency of occurrence of A in the first n trials. Mises’ postulates are the following.
(i) (The existence of limiting frequencies). The limit limn νn(A; x) must exist.
(ii) (The existence of the limiting frequencies for subsequences). For all subsequences x′ = (x′1, x′2, . . .) obtained from the sequence x = (x1, x2, . . .) by means of a certain preconditioned system of (“admissible”) rules (termed by Mises “place-selection functions”), the limits of frequencies limn νn(A; x′) must be the same as for the initial sequence x = (x1, x2, . . .), i.e., must be equal to limn νn(A; x).
According to Mises, one can speak of the “probability of A” only in connection
with a certain “collective”, and this probability P(A; x) is defined (by (i)) as the limit
limn νn (A; x). It should be emphasized that if this limit does not exist (so that x by
definition is not a “collective”), this probability is not defined. The second postulate
was intended to set forth the concept of “randomness” in the formation of the “col-
lective” x = (x1 , x2 , . . .) (which is the cornerstone of the probabilistic reasoning
and must be in accordance with intuition). It had to express the idea of “irregu-
larity” of this sequence and “unpredictability” of its “future values” (xn , xn+1 , . . .)
from the “past” (x1 , x2 , . . . , xn−1 ) for any n ≥ 1. (In probability theory based on
Kolmogorov’s axioms exposed in Sect. 1, Chap. 2, Vol. 1, such sequences are “typi-
cal” sequences of independent identically distributed random variables, see Subsec-
tion 4 of Sect. 5, Chap. 1).
The postulates used by Mises in the construction of “a mathematical theory of repetitive events” (as he wrote in [54]) caused much discussion and criticism, especially
in the 1930s. The objections concerned mainly the fact that in practice we deal only
with finite rather than infinite sequences. Therefore, in reality, it is impossible to determine whether the limit limn νn(A; x) exists and how sensitive this limit is to taking it along a subsequence x′ instead of the sequence x. Also severely criticized were Mises’ manner of defining the “admissible” rules for selecting subsequences, as well as the vagueness in specifying the set of those (“test”) rules that can be considered in condition (ii).
If we consider a sequence x = (x1, x2, . . .) of zeroes and ones such that the limit limn νn({1}; x) lies in (0, 1), then this sequence must contain infinitely many zeroes and infinitely many ones. Therefore, if arbitrary rules of forming subsequences are admitted, then we can always take a subsequence x′ of x consisting, e.g., only of ones, for which limn νn({1}; x′) = 1. Hence nontrivial collectives invariant with respect to all rules of taking subsequences do not exist.
The first step towards the proof that the class of collectives is not empty was taken
in 1937 by Abraham Wald (1902–1950), see [68]. In his construction, the rules of selecting subsequences x′ = (x′1, x′2, . . .) from x = (x1, x2, . . .) were determined by a countable collection of functions fi = fi(x1, . . . , xi), i = 1, 2, . . ., taking the two values 0 and 1, so that xi+1 is included into x′ if fi(x1, . . . , xi) = 1 and not included otherwise.
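Wald-type rules are easy to visualize: for an i.i.d. fair-coin sequence, a selection rule that looks only at the past preserves the limiting frequency, while a rule allowed to inspect the current element destroys it. A simulation sketch (illustrative, not from the text):

```python
import random

# Wald-type place selection: whether x_{i+1} enters the subsequence may
# depend only on (x_1, ..., x_i).  For an i.i.d. fair-coin sequence any
# such rule leaves the limiting frequency of ones unchanged, whereas a
# "cheating" rule that inspects the current element can force frequency 1.
random.seed(42)
x = [random.getrandbits(1) for _ in range(200_000)]

# admissible rule: keep x_{i+1} iff the previous element was a one
fair_sub = [x[i + 1] for i in range(len(x) - 1) if x[i] == 1]
# inadmissible rule: keep x_i iff x_i itself is a one
cheat_sub = [xi for xi in x if xi == 1]

print(sum(fair_sub) / len(fair_sub))    # close to 1/2
print(sum(cheat_sub) / len(cheat_sub))  # exactly 1
```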
In 1940, Alonzo Church (1903–1995) proposed [12] another approach to forming
subsequences based on the idea that every rule must be “effectively computable” in
practice. This idea led Church to the concept of algorithmically computable functions (i.e., functions computable by means of, say, a Turing machine). (Let, for example, each xj take one of the two values ω1 = 0 and ω2 = 1, and let us assign to (x1, . . . , xn) the integer

    λ = ∑_{k=1}^{n} i_k 2^{k-1},
which is naturally derived from the interpretation of Pt (x) as the probability that the
increment of the process for the time t is no greater than x.
However, from the formal logical point of view, the existence of an object that could be called a “process” with transition probabilities P(s, x, t, A) or with
increments distributed according to Pt (x), remained an open question.
This was the question solved by the basic theorem stating that for any system of consistent finite-dimensional probability distributions one can construct a probability space (Ω, F, P) and a system of random variables X = (Xt)t≥0, Xt = Xt(ω), whose finite-dimensional distributions coincide with the prescribed ones.
Under this approach, the algorithms in Φ must reject atypical chains of simple structure, selecting as random those that are sufficiently complicated.
This consideration led Kolmogorov to the “second” approach to the concept of randomness. The emphasis in this approach is placed on the “complexity” of the chains rather than on the “simplicity” of the related algorithms. Kolmogorov introduces a certain numerical characteristic of complexity, which is designed to show the degree of “irregularity” in the formation of these chains.
This characteristic is known as the “algorithmic” (or “Kolmogorov”) complexity KA(x) of an individual chain x with respect to the algorithm A; heuristically, it is the shortest length of a binary chain at the input of the algorithm A from which this algorithm can recover the chain x at the output.
The formal definitions are as follows.
Let Σ be the collection of all finite binary chains x = (x1, x2, . . . , xn), let |x| (= n) denote the length of a chain, and let Φ be a certain class of algorithms. The
complexity of a chain x ∈ Σ with respect to an algorithm A ∈ Φ is the number

    KA(x) = min{ |p| : A(p) = x },

i.e., the minimal length |p| of a binary chain p at the input of the algorithm A from which x can be recovered at the output (A(p) = x).
In [36], Kolmogorov establishes that (for some important classes of algorithms Φ) the following statement holds: there exists a universal algorithm U ∈ Φ such that for any A ∈ Φ there is a constant C(A) satisfying

    KU(x) ≤ KA(x) + C(A) for all x ∈ Σ,

where C(A) does not depend on x ∈ Σ. (Kolmogorov points out in [36] that a similar result was obtained simultaneously by R. Solomonoff.)
Taking into account the fact that KU (x) grows to infinity with |x| for “typical”
chains x, this result justifies the following definition: the complexity of a chain x ∈ Σ
with respect to a class Φ of algorithms is K(x) = KU (x), where U is a universal
algorithm in Φ.
The quantity K(x) is customarily referred to as algorithmic or Kolmogorov’s
complexity of an “object” x. Kolmogorov regarded this quantity as measuring the
amount of algorithmic information contained in the “finite object” x. He believed
that this concept is even more fundamental than the probabilistic notion of infor-
mation, which requires knowledge of a probability distribution on objects x for its
definition.
The quantity K(x) may also be considered as a measure of compression of a “text” x. If the class Φ includes algorithms like simple enumeration of elements, then (up to an additive constant) the complexity K(x) is no greater than the length |x|. On the other hand, it is easy to show that the number of (binary) chains x of length n whose complexity is smaller than n − m does not exceed 2^{n−m}, so that most chains cannot be substantially compressed.
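The compression view of K(x) can be illustrated with an ordinary lossless compressor, whose output length is a computable upper bound (up to a constant) on the complexity of a chain; a sketch using Python’s zlib (an illustration only — true Kolmogorov complexity is not computable):

```python
import random
import zlib

random.seed(7)
regular = b"01" * 5_000                                        # simple, regular chain
typical = bytes(random.getrandbits(8) for _ in range(10_000))  # "typical" random chain

# compressed length serves as an upper bound on complexity
print(len(regular), len(zlib.compress(regular)))   # regular chain compresses a lot
print(len(typical), len(zlib.compress(typical)))   # random chain hardly compresses
```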
References
[8] É. Borel. Leçons sur la théorie des fonctions. Gauthier-Villars, Paris, 1898;
Éd. 2. Gauthier-Villars, Paris, 1914.
[9] T. Brodén. Wahrscheinlichkeitsbestimmungen bei der gewöhnlichen Ketten-
bruchentwicklung reeller Zahlen. Akad. Förh. Stockholm 57 (1900), 239–266.
[10] U. Broggi. Die Axiome der Wahrscheinlichkeitsrechnung. Dissertation.
Göttingen, 1907.
[11] R. Brown. A brief account of microscopical observations made in the months
of June, July, and August, 1827, on the particles contained in the pollen of
plants; and on the general existence of active molecules in organic and inor-
ganic bodies. Philosophical Magazine N.S. 4 (1828), 161–173.
[12] A. Church. On the concept of a random sequence. Bull. Amer. Math. Soc. 46,
2 (1940), 130–135.
[13] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algo-
rithms. 3rd ed. MIT Press, 2009.
[14] F. N. David. Games, Gods and Gambling. The Origin and History of Probabil-
ity and Statistical Ideas from the Earliest Times to the Newtonian Era. Griffin,
London, 1962.
[15] A. Einstein. Über die von der molekularkinetischen Theorie der Wärme
geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen.
Annalen der Physik, 17 (1905), 549–560.
[16] B. de Finetti. Sulle probabilità numerabili e geometriche. Istituto Lombardo.
Accademia di Scienze e Lettere. Rendiconti (2), 61 (1928), 817–824.
[17] B. de Finetti. Sulle funzioni a incremento aleatorio. Accademia Nazionale dei
Lincei. Rendiconti (6), 10 (1929), 163–168.
[18] B. de Finetti. Integrazione delle funzioni a incremento aleatorio. Accademia
Nazionale dei Lincei. Rendiconti (6), 10 (1929), 548–553.
[19] B. de Finetti. Probabilismo: saggio critico sulla teoria delle probabilità e sul
valore della scienza. Perrella, Napoli, 1931; Logos. 14 (1931), 163–219. En-
glish transl.: Erkenntnis. The International Journal of Analytic Philosophy 31
(1989), 169–223.
[20] B. de Finetti. Probability, Induction and Statistics. The Art of Guessing. Wiley,
New York etc., 1972.
[21] B. de Finetti. Teoria delle probabilità: sintesi introduttiva con appendice crit-
ica. Vol. 1, 2. Einaudi, Torino, 1970. English transl.: Theory of Probability: A
Critical Introductory Treatment. Vol. 1, 2. Wiley, New York etc., 1974, 1975.
[22] M. Fréchet. Sur l’intégrale d’une fonctionnelle étendue à un ensemble abstrait.
Bulletin de la Société Mathématique de France. 43 (1915), 248–265.
[23] J. W. Gibbs. Elementary Principles in Statistical Mechanics. Developed with
especial reference to the rational foundation of thermodynamics. Yale Univ.
Press, New Haven, 1902; Dover, New York, 1960.
[24] H. Gyldén. Quelques remarques relativement à la représentation de nombres
irrationnels au moyen des fractions continues. Comptes Rendus, Paris, 107
(1888), 1584–1587.
[25] T. Hawkins. Lebesgue’s Theory of Integration. Its Origin and Development.
Univ. Wisconsin Press, Madison, Wis. – London, 1970.
Chapter 4
Section 1. Kolmogorov’s zero–one law appears in his book [50]. For the Hewitt–
Savage zero–one law see also Borovkov [10], Breiman [11], and Ash [2].
Sections 2–4. Here the fundamental results were obtained by Kolmogorov and
Khinchin (see [50] and references therein). See also Petrov [62], Stout [74], and Durrett [20]. For probabilistic methods in number theory see Kubilius [51].
It is appropriate to recall here the historical background of the strong law of large
numbers and the law of the iterated logarithm for the Bernoulli scheme.
The first paper where the strong law of large numbers appeared was Borel’s paper
[7] on normal numbers in [0, 1). Using the notation of Example 3 in Sect. 3, let

    Sn = ∑_{k=1}^{n} ( I(ξk = 1) − 1/2 ).
Then Borel’s result stated that for almost all (Lebesgue) ω ∈ [0, 1) there exists
N = N(ω) such that
S (ω) log(n/2)
n
≤ √
n 2n
for all n ≥ N(ω). This implies, in particular, that Sn = o(n) almost surely.
The next step was taken by Hausdorff [41], who established that Sn = o(n^{1/2+ε}) almost surely for any ε > 0. In 1914 Hardy and Littlewood [39] showed that Sn = O((n log n)^{1/2}) almost surely. In 1922 Steinhaus [73] improved their result by showing that

    lim sup_n Sn / √((n/2) log n) ≤ 1

almost surely. In 1923 Khinchin showed that Sn = O((n log log n)^{1/2}) almost surely, and in 1924 he obtained the definitive result, the law of the iterated logarithm:

    lim sup_n Sn / √((n/2) log log n) = 1

almost surely. (Note that in this case σ^2 = E[I(ξk = 1) − 1/2]^2 = 1/4, which explains the appearance of the factor n/2 rather than the usual 2n; cf. Theorem 4 in Sect. 4.)
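These growth rates can be probed numerically: along a seeded Bernoulli path, |Sn| stays far below the Hardy–Littlewood envelope (n log n)^{1/2}. A simulation sketch (illustrative, not from the text):

```python
import math
import random

random.seed(3)
s, worst = 0.0, 0.0
for n in range(1, 200_001):
    s += random.getrandbits(1) - 0.5      # summand I(xi_k = 1) - 1/2
    if n >= 10:                           # avoid log n = 0 for small n
        worst = max(worst, abs(s) / math.sqrt(n * math.log(n)))
print(round(worst, 3))  # typically well below 1
```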
As was mentioned in Sect. 4, the next step in establishing the law of the iterated
logarithm for a broad class of independent random variables was taken in 1929 by
Kolmogorov [48].
Section 5. Regarding these questions, see Petrov [62], Borovkov [8–10], and
Dacunha-Castelle and Duflo [16].
Chapter 5
Sections 1–3. Our exposition of the theory of (strict sense) stationary random pro-
cesses is based on Breiman [11], Sinai [72], and Lamperti [52]. The simple proof of
the maximal ergodic theorem was given by Garsia [28].
Chapter 6
Section 1. The books by Rozanov [67] and Gihman and Skorohod [30, 31] are de-
voted to the theory of (wide sense) stationary random processes. Example 6 was
frequently presented in Kolmogorov’s lectures.
Section 2. For orthogonal stochastic measures and stochastic integrals see also Doob
[18], Gihman and Skorohod [31], Rozanov [67], and Ash and Gardner [3].
Section 3. The spectral representation (2) was obtained by Cramér and Loève (e.g.,
[56]). Also see Doob [18], Rozanov [67], and Ash and Gardner [3].
Section 4. There is a detailed exposition of problems of statistical estimation of the
covariance function and spectral density in Hannan [37, 38].
Sections 5–6. See also Rozanov [67], Lamperti [52], and Gihman and Skorohod
[30, 31].
Section 7. The presentation follows Liptser and Shiryaev [54].
Historical and Bibliographical Notes (Chaps. 4–8)
Chapter 7
Section 1. Most of the fundamental results of the theory of martingales were ob-
tained by Doob [18]. Theorem 1 is taken from Meyer [57]. Also see Meyer [58],
Liptser and Shiryaev [54], Gihman and Skorohod [31], and Jacod and Shiryaev [43].
Theorem 1 is often called the theorem “on transformation under a system of
optional stopping” (Doob [18]). For identities (13) and (14) and Wald’s fundamental
identity see Wald [76].
Section 3. The right inequality in (25) was established by Khinchin [45] (1923) in
the course of proving the law of the iterated logarithm. To explain what led Khinchin
to obtain this inequality, let us recall the line of the proof of the strong law of
large numbers by Borel and Hausdorff (see also the earlier comment to Sects. 2–
4, Chap. 4).
Let ξ1 , ξ2 , . . . be a sequence of independent identically distributed random vari-
ables with P{ξ1 = 1} = P{ξ1 = −1} = 1/2 (Bernoulli scheme), and let
Sn = ξ1 + · · · + ξn .
Borel’s proof that Sn = o(n) almost surely was essentially as follows. Since
S E S4 3n2
n 3
P ≥ δ ≤ 4 n4 ≤ 4 4 = 2 4 for any δ > 0,
n n δ n δ n δ
we have
S
S 3 1
k k
P sup ≥ δ ≤ P ≥δ ≤ 4 →0
k≥n k k δ k2
k≥n k≥n
Borel’s proof and t(n) = n1/2+ε in Hausdorff’s (while Hardy and Littlewood needed
t(n) = (n log n)1/2 ).
Analogously, Khinchin needed inequalities (25) (in fact, the right one) to obtain
a “good” bound for the probabilities P{|Sn | ≥ t(n)}.
Regarding the derivation of Khinchin’s inequalities (both right and left) for any
p > 0 and optimality of the constants Ap and Bp in (25) see the survey paper by
Peškir and Shiryaev [61].
Khinchin derives from the right inequality in (25) for $p = 2m$ that for any $t > 0$
$$
\mathsf{P}\{|X_n| > t\} \le t^{-2m}\, \mathsf{E}\,|X_n|^{2m} \le \frac{(2m)!}{2^m\, m!}\, t^{-2m} [X]_n^{2m}.
$$
By Stirling's formula
$$
\frac{(2m)!}{2^m\, m!} \le D \Big(\frac{2}{e}\Big)^m m^m,
$$
where $D = \sqrt{2}$. Therefore, setting $m$ equal to the integer part of $t^2/(2[X]_n^2)$, we obtain
$$
\mathsf{P}\{|X_n| > t\} \le D \Big(\frac{2m[X]_n^2}{e\,t^2}\Big)^m \le D\, e^{-m} \le D \exp\Big(1 - \frac{t^2}{2[X]_n^2}\Big) = D\,e \exp\Big(-\frac{t^2}{2[X]_n^2}\Big) = c \exp\Big(-\frac{t^2}{2[X]_n^2}\Big)
$$
with $c = D\,e = \sqrt{2}\,e$.
This inequality implies the bound
$$
\mathsf{P}\{|S_n| > t\} \le c\, e^{-t^2/(2n)},
$$
which was used by Khinchin for the proof that $S_n = O(\sqrt{n \log \log n})$ almost surely.
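Both steps of this derivation can be verified numerically. The Python snippet below (an illustrative check, not from the original text) confirms the Stirling-type estimate with $D = \sqrt{2}$ and compares the exact binomial tail of the Bernoulli walk with the resulting Gaussian-type bound:

```python
from math import comb, exp, factorial, sqrt

# Stirling-type estimate: (2m)! / (2^m m!) <= sqrt(2) * (2/e)^m * m^m
for m in range(1, 30):
    lhs = factorial(2 * m) / (2 ** m * factorial(m))
    rhs = sqrt(2) * (2 / exp(1)) ** m * m ** m
    assert lhs <= rhs * (1 + 1e-12)  # tolerance for floating-point rounding

def tail(n, t):
    # Exact P{|S_n| > t} for S_n = 2B - n, B ~ Binomial(n, 1/2).
    return sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) > t) / 2 ** n

# Gaussian-type tail bound with c = sqrt(2) * e
n = 100
for t in (5, 10, 20, 30):
    assert tail(n, t) <= sqrt(2) * exp(1) * exp(-t * t / (2 * n))
```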
The book by Chow and Teicher [13] contains an illuminating study of the inequalities presented here. Theorem 2 is due to Lenglart [53].
Section 4. See Doob [18].
Section 5. Here we follow Kabanov, Liptser, and Shiryaev [44], Engelbert and
Shiryaev [24], and Neveu [59]. Theorem 4 and the examples were given by Liptser.
Section 6. This approach to problems of absolute continuity and singularity, and the
results given here, can be found in Kabanov, Liptser, and Shiryaev [44].
Section 7. Theorems 1 and 2 were given by Novikov [60]. Lemma 1 is a discrete
analog of Girsanov’s lemma (see [54]).
Section 8. See also Liptser and Shiryaev [55] and Jacod and Shiryaev [43], which
discuss limit theorems for random processes of a rather general nature (e.g., martin-
gales, semimartingales).
Section 9. The presentation follows Shiryaev [70, 71]. For the development of the approach given here to generalizations of Itô's formula, see [27].
Section 10. Martingale methods in insurance are treated in [29]. The proofs pre-
sented here are close to those in [70].
Sections 11–12. For a more detailed exposition of the topics related to the application of martingale methods in financial mathematics and engineering, see [71].
Section 13. The basic monographs on the theory and problems of optimal stopping
rules are Dynkin and Yushkevich [22], Robbins, Chow, and Siegmund [66], and
Shiryaev [69].
Chapter 8
Sections 1–2. For the definitions and basic properties of Markov chains see also
Dynkin and Yushkevich [22], Dynkin [21], Ventzel [75], Doob [18], Gihman and
Skorohod [31], Breiman [11], Chung [14, 15], and Revuz [65].
Sections 3–7. For problems related to limiting, ergodic, and stationary distributions
for Markov chains see Kolmogorov’s paper [49] and the books by Feller [25, 26],
Borovkov [10, 9], Ash [1], Chung [15], Revuz [65], and Dynkin and Yushkevich
[22].
Section 8. The simple random walk is a textbook example of the simplest Markov
chain for which many regularities were discovered (e.g., the properties of recur-
rence, transience, and ergodicity). These issues are treated in many of the books
cited earlier; see, for example, [1, 10, 15, 65].
Section 9. The interest in optimal stopping was due to the development of statisti-
cal sequential analysis (Wald [76], De Groot [17], Zacks [77], Shiryaev [69]). The
theory of optimal stopping rules is treated in Dynkin and Yushkevich [22], Shiryaev
[69], and Billingsley [5]. The martingale approach to the optimal stopping problems
is presented in Robbins, Chow, and Siegmund [66].
DEVELOPMENT OF MATHEMATICAL THEORY OF PROBABILITY: HISTORICAL
REVIEW. This historical review was written by the author as a supplement to the
third edition of Kolmogorov’s Foundations of the Theory of Probability [50].
References
Index

Symbols
B(K_0, N; p), 227
C^+, 156
C(f_N; P), 222
C_N, 227
R(n), 48
X_n^π, 208
Z(λ), 56
Z(Δ), 56
[X, Y]_n, 117
[X]_n, 117
#, 200
M(P), 210
NA, 210
Z, 47
Cov(ξ, η), 48
M, 116
⟨X, Y⟩, 116
⟨f, g⟩, 58
P_N, 227
M_n^N, 229
E_n(λ), 144
C(f; P), 224
ρ(n), 48
θ^k ξ, 33
{X_n →}, 156

A
Absolutely continuous
  change of measure, 178, 182
  probability measures, 165
  spectral density, 51, 55
Absolute moment, 15
Absorbing
  barrier, 290, 305, 307
  state, 167, 289, 306
Almost
  invariant, 37
  periodic, 49
Amplitude, 50
Arbitrage
  opportunity, 209
  absence of, 207
Arithmetic properties, 265
Asymptotically uniformly infinitesimal, 196
Asymptotic negligibility condition, 185
Asymptotic properties, 265
Average time of return, 270

B
Balance equation, 53
Bank account, 207
Bartlett's estimator, 76
Bernoullian shifts, 44
Binary expansion, 18
Birkhoff G. D., 34
Borel, É.
  normal numbers, 19, 45
  zero–one law, 3
Brownian motion, 184
(B, S)-market
  complete, 214

C
CRR (Cox, Ross, Rubinstein) model, 213, 225
Cesàro summation, 277
Compensator, 115
Contingent claim, 214
  replicable, 214
Convergence of
  submartingales and martingales
    general theorems, 148

T
Theorem
  Birkhoff and Khinchin, 39
  Cantelli, 12
  central limit
    for dependent random variables, 183
    functional, 185
  Chernoff, 31
  Doob
    on convergence of submartingales, 148
    on maximal inequalities, 132
    on random time change, 119
    submartingale decomposition, 115
  ergodic, 44
    mean-square, 69
  Girsanov, discrete version, 179
  Herglotz, 54, 74
  Kolmogorov
    on interpolation, 91
    regular stationary sequence, 84
    strong law of large numbers, 13, 16
    three-series, 9
    zero–one law, 3, 169
  Kolmogorov and Khinchin
    convergence of series, 6
    two-series, 9
  Lévy, 150
  law of the iterated logarithm, 23
  Liouville, 35
  maximal ergodic, 40
  Pólya, on random walk, 289
  Poincaré, on recurrence, 35
  of renewal theory, 129
Transformation
  Bernoulli, 45
  ergodic, 37
  Esscher, 212
    conditional, 214
  Kolmogorov, 45
  measurable, 34
  measure-preserving, 34
  mixing, 38
  metrically transitive, 37
Transition
  function, 242
  matrix, 259–264
    algebraic properties, 259
  operator, one-step, 296
  probabilities, 237, 256
Trapezoidal rule, 199
Two-pointed conditional distribution, 217

W
Wald's identity, 124
  fundamental, 126
Water level, 53
White noise, 50, 67, 76
Wiener process, 184
Wold's expansion, 78, 81

Z
Zero–one law, 1
  Borel, 3
  Hewitt–Savage, 5
  Kolmogorov, 3, 152, 169

© Springer Science+Business Media, LLC, part of Springer Nature 2019
A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95,
https://doi.org/10.1007/978-0-387-72208-5