
Graduate Texts in Mathematics

Albert N. Shiryaev

Probability-2
Third Edition
Graduate Texts in Mathematics 95
Graduate Texts in Mathematics

Series Editors:

Sheldon Axler
San Francisco State University, San Francisco, CA, USA

Kenneth Ribet
University of California, Berkeley, CA, USA

Advisory Board:

Alejandro Adem, University of British Columbia


David Eisenbud, University of California, Berkeley & MSRI
Brian C. Hall, University of Notre Dame
J.F. Jardine, University of Western Ontario
Jeffrey C. Lagarias, University of Michigan
Ken Ono, Emory University
Jeremy Quastel, University of Toronto
Fadil Santosa, University of Minnesota
Barry Simon, California Institute of Technology
Ravi Vakil, Stanford University
Steven H. Weintraub, Lehigh University

Graduate Texts in Mathematics bridge the gap between passive study and creative
understanding, offering graduate-level introductions to advanced topics in mathe-
matics. The volumes are carefully written as teaching aids and highlight character-
istic features of the theory. Although these books are frequently used as textbooks
in graduate courses, they are also suitable for individual study.

More information about this series at http://www.springer.com/series/136


Albert N. Shiryaev

Probability-2
Third Edition

Translated by R.P. Boas† and D.M. Chibisov

Albert N. Shiryaev
Department of Probability Theory
and Mathematical Statistics
Steklov Mathematical Institute and
Lomonosov Moscow State University
Moscow, Russia

Translated by R.P. Boas† and D.M. Chibisov

ISSN 0072-5285 ISSN 2197-5612 (electronic)


Graduate Texts in Mathematics
ISBN 978-0-387-72207-8 ISBN 978-0-387-72208-5 (eBook)
https://doi.org/10.1007/978-0-387-72208-5

Library of Congress Control Number: 2018953349

Mathematics Subject Classification: 60Axx, 60Exx, 60Fxx, 60Gxx, 60Jxx, 62Lxx

© Springer Science+Business Media New York 1984, 1996


© Springer Science+Business Media, LLC, part of Springer Nature 2019
Originally published in one volume.
Translation from the Russian language edition: Veroyatnost – 2 (fourth edition) by Albert N. Shiryaev
© Shiryaev, A. N. 2007 and © MCCME 2007. All Rights Reserved.
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Science+Business Media, LLC
part of Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface to the Third English Edition

The present edition is a translation of the fourth Russian edition of 2007, with the
previous three published in 1980, 1989, and 2004. The English translations of the
first two appeared in 1984 and 1996. The third and fourth Russian editions, extended
compared to the second edition, were published in two volumes titled Probability-1
and Probability-2. Accordingly, the present edition consists of two volumes: this
Vol. 2, titled Probability-2, contains Chaps. 4–8, and Chaps. 1–3 are contained in
Vol. 1, titled Probability-1, which was published in 2016.
This English translation of Probability-2 was prepared by the editor and transla-
tor Prof. D. M. Chibisov, Professor of the Steklov Mathematical Institute. A former
student of N. V. Smirnov, he has a broad view of probability and mathematical statis-
tics, which enabled him not only to translate the parts that had not been translated
before, but also to edit both the previous translation and the Russian text, making in
them quite a number of corrections and amendments.
The author is sincerely grateful to D. M. Chibisov for the translation and scien-
tific editing of this book.

Moscow, Russia A. Shiryaev


2018

Preface to the Fourth Russian Edition

A university course on probability and statistics usually consists of three one-semester parts: probability theory, random processes, and mathematical statistics.
The book Probability-1 covered the material normally included in probability
theory.
This book, Probability-2, contains extensive material for a course on random
processes in the part dealing with discrete time processes, i.e., random sequences.
(The reader interested in random processes with continuous time may refer to [12],
which is closely related to Probability-1 and Probability-2.)
Chapter 4, which opens this book, is focused mostly on the properties of sums of
independent random variables that hold with probability one (e.g., “zero–one” laws,
the strong law of large numbers, the law of the iterated logarithm).
Chapters 5 and 6 treat the strict and wide sense stationary random sequences.


In Chaps. 7 and 8, we treat random sequences that form martingales and
Markov chains. These classes of processes make it possible to study the behavior
of various stochastic systems in the “future” depending on their “past” and
“present,” which is why these processes play a very important role in modern
probability theory and its applications.
The book concludes with a Historical Review of the Development of Mathemat-
ical Theory of Probability.

Moscow, Russia A. Shiryaev


2003
Contents

Preface to the Third English Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Preface to the Fourth Russian Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

4 Sequences and Sums of Independent Random Variables . . . . . . . . . . . 1


1 Zero–One Laws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Convergence of Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Strong Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4 Law of the Iterated Logarithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Probabilities of Large Deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Stationary (Strict Sense) Random Sequences


and Ergodic Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1 Stationary (Strict Sense) Random Sequences: Measure-Preserving
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2 Ergodicity and Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Ergodic Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6 Stationary (Wide Sense) Random Sequences: L2 -Theory . . . . . . . . . . 47


1 Spectral Representation of the Covariance Function . . . . . . . . . . . . . 47
2 Orthogonal Stochastic Measures and Stochastic Integrals . . . . . . . . 56
3 Spectral Representation of Stationary (Wide Sense) Sequences . . . 61
4 Statistical Estimation of Covariance Function
and Spectral Density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5 Wold’s Expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
6 Extrapolation, Interpolation, and Filtering . . . . . . . . . . . . . . . . . . . . . 85
7 The Kalman–Bucy Filter and Its Generalizations . . . . . . . . . . . . . . . 95

7 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
1 Definitions of Martingales and Related Concepts . . . . . . . . . . . . . . . 107
2 Preservation of Martingale Property Under a Random
Time Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

3 Fundamental Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


4 General Theorems on Convergence of Submartingales
and Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5 Sets of Convergence of Submartingales and Martingales . . . . . . . . . 156
6 Absolute Continuity and Singularity of Probability Distributions
on a Measurable Space with Filtration . . . . . . . . . . . . . . . . . . . . . . . . 164
7 Asymptotics of the Probability of the Outcome of a Random Walk
with Curvilinear Boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
8 Central Limit Theorem for Sums of Dependent
Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9 Discrete Version of Itô’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
10 Application of Martingale Methods to Calculation of Probability
of Ruin in Insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
11 Fundamental Theorems of Stochastic Financial Mathematics: The
Martingale Characterization of the Absence of Arbitrage . . . . . . . . . 207
12 Hedging in Arbitrage-Free Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
13 Optimal Stopping Problems: Martingale Approach . . . . . . . . . . . . . . 228

8 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237


1 Definitions and Basic Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
2 Generalized Markov and Strong Markov Properties . . . . . . . . . . . . . 249
3 Limiting, Ergodic, and Stationary Probability Distributions
for Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
4 Classification of States of Markov Chains in Terms of Algebraic
Properties of Matrices of Transition Probabilities . . . . . . . . . . . . . . . 259
5 Classification of States of Markov Chains in Terms of Asymptotic
Properties of Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 265
6 Limiting, Stationary, and Ergodic Distributions for Countable
Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
7 Limiting, Stationary, and Ergodic Distributions for Finite Markov
Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
8 Simple Random Walk as a Markov Chain . . . . . . . . . . . . . . . . . . . . . 284
9 Optimal Stopping Problems for Markov Chains . . . . . . . . . . . . . . . . 296

Development of Mathematical Theory of Probability:


Historical Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313

Historical and Bibliographical Notes (Chaps. 4–8) . . . . . . . . . . . . . . . . . . . . . 333

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Table of Contents of Probability-1

Preface to the Third English Edition


Preface to the Fourth Russian Edition
Preface to the Third Russian Edition
Preface to the Second Edition
Preface to the First Edition
Introduction

1 Elementary Probability Theory


1 Probabilistic Model of an Experiment with a Finite Number of Outcomes
2 Some Classical Models and Distributions
3 Conditional Probability: Independence
4 Random Variables and Their Properties
5 The Bernoulli Scheme: I—The Law of Large Numbers
6 The Bernoulli Scheme: II—Limit Theorems (Local, de Moivre–Laplace,
Poisson)
7 Estimating the Probability of Success in the Bernoulli Scheme
8 Conditional Probabilities and Expectations with Respect to Decomposi-
tions
9 Random Walk: I—Probabilities of Ruin and Mean Duration in Coin Toss-
ing
10 Random Walk: II—Reflection Principle—Arcsine Law
11 Martingales: Some Applications to the Random Walk
12 Markov Chains: Ergodic Theorem, Strong Markov Property
13 Generating Functions
14 Inclusion–Exclusion Principle
2 Mathematical Foundations of Probability Theory
1 Kolmogorov’s Axioms
2 Algebras and σ-Algebras: Measurable Spaces


3 Methods of Introducing Probability Measures on Measurable Spaces


4 Random Variables: I
5 Random Elements
6 Lebesgue Integral: Expectation
7 Conditional Probabilities and Conditional Expectations with Respect to a
σ-Algebra
8 Random Variables: II
9 Construction of a Process with Given Finite-Dimensional Distributions
10 Various Kinds of Convergence of Sequences of Random Variables
11 The Hilbert Space of Random Variables with Finite Second Moment
12 Characteristic Functions
13 Gaussian Systems
3 Convergence of Probability Measures. Central Limit Theorem
1 Weak Convergence of Probability Measures and Distributions
2 Relative Compactness and Tightness of Families of Probability Distribu-
tions
3 Proof of Limit Theorems by the Method of Characteristic Functions
4 Central Limit Theorem: I
5 Central Limit Theorem for Sums of Independent Random Variables: II
6 Infinitely Divisible and Stable Distributions
7 Metrizability of Weak Convergence
8 On the Connection of Weak Convergence of Measures
9 The Distance in Variation Between Probability Measures
10 Contiguity of Probability Measures
11 Rate of Convergence in the Central Limit Theorem
12 Rate of Convergence in Poisson’s Theorem
13 Fundamental Theorems of Mathematical Statistics

Historical and Bibliographical Notes


References
Keyword Index
Symbol Index
Chapter 4
Sequences and Sums of Independent
Random Variables

1. Zero–One Laws

The concept of mutual independence of two or more experiments holds, in a certain sense,
a central position in the theory of probability. . . . Historically, the independence of exper-
iments and random variables represents the very mathematical concept that has given the
theory of probability its peculiar stamp.
A. N. Kolmogorov, Foundations of Probability Theory [50]
1. The series ∑_{n=1}^∞ (1/n) diverges, while the series ∑_{n=1}^∞ (−1)^n (1/n)
converges. We ask the following question. What can we say about the convergence
or divergence of a series ∑_{n=1}^∞ (ξn/n), where ξ1, ξ2, . . . is a sequence of
independent identically distributed Bernoulli random variables with
P(ξ1 = +1) = P(ξ1 = −1) = 1/2? In other words, what can be said about the
convergence of a series whose general term is ±1/n, where the signs are chosen
in a random manner, according to the sequence ξ1, ξ2, . . .?
Let

    A1 = {ω : ∑_{n=1}^∞ (ξn/n) converges}

be the set of sample points for which ∑_{n=1}^∞ (ξn/n) converges (to a finite
number), and consider the probability P(A1) of this set. It is far from clear, to
begin with, what values this probability might have. However, it is a remarkable
fact that we are able to say that this probability can take only two values, 0 or 1.
This is a corollary of Kolmogorov's zero–one law, whose statement and proof form
the main content of this section.
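Before turning to the theory, it is instructive to look at simulated sample paths of this series. The following Python sketch is an illustration only: `tail_is_settled` is an ad hoc helper, and the tolerance is an arbitrary numerical proxy for "the partial sums have settled down."

```python
import random

def tail_is_settled(n_terms, seed, tol=0.05):
    """Crude convergence test for one sample path of sum xi_n / n:
    do the partial sums over the second half of the range stay in a band of width tol?"""
    rng = random.Random(seed)
    s = 0.0
    lo = hi = None
    for n in range(1, n_terms + 1):
        s += (1 if rng.random() < 0.5 else -1) / n  # xi_n = ±1 equiprobable
        if n > n_terms // 2:
            lo = s if lo is None else min(lo, s)
            hi = s if hi is None else max(hi, s)
    return hi - lo < tol

# By the zero-one law, P(A1) is 0 or 1; empirically every path looks convergent,
# which is consistent with P(A1) = 1 (confirmed in Sect. 2, since sum 1/n^2 < inf).
settled = sum(tail_is_settled(20_000, seed=s) for s in range(100))
```

In a run over 100 independent paths, essentially all of them pass this crude test, in agreement with the series converging with probability 1.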

2. Let (Ω, F, P) be a probability space, and let ξ1, ξ2, . . . be a sequence of random
variables. Let Fn∞ = σ(ξn, ξn+1, . . .) be the σ-algebra generated by ξn, ξn+1, . . .,
and write

    X = ⋂_{n=1}^∞ Fn∞.

Since an intersection of σ-algebras is again a σ-algebra, X is a σ-algebra. It is


called a tail algebra (or terminal or asymptotic algebra), because every event A ∈ X
is independent of the values of ξ1 , . . . , ξn for every finite number n, and is deter-
mined, so to speak, only by the behavior of the infinitely remote values of ξ1 , ξ2 , . . ..
Since, for every k ≥ 1,

    A1 ≡ {∑_{n=1}^∞ (ξn/n) converges} = {∑_{n=k}^∞ (ξn/n) converges} ∈ Fk∞,

we have A1 ∈ ⋂_k Fk∞ ≡ X. In the same way,

    A2 = {∑_{n=1}^∞ ξn converges} ∈ X.

The following events are also tail events:

    A3 = {ξn ∈ In for infinitely many n} (= lim sup_n {ξn ∈ In}),

where In ∈ B(R), n ≥ 1;

    A4 = {lim sup_n ξn < ∞};
    A5 = {lim sup_n (ξ1 + · · · + ξn)/n < ∞};
    A6 = {lim sup_n (ξ1 + · · · + ξn)/n < c};
    A7 = {Sn/n converges}, where Sn = ξ1 + · · · + ξn;
    A8 = {lim sup_n Sn/√(2n log log n) = 1}.

On the other hand,

    B1 = {ξn = 0 for all n ≥ 1},
    B2 = {lim_n (ξ1 + · · · + ξn) exists and is less than c}

are examples of events that do not belong to X.


Let us now suppose that our random variables are independent. Then it follows
from the Borel–Cantelli lemma that

    P(A3) = 0 ⇔ ∑ P(ξn ∈ In) < ∞,
    P(A3) = 1 ⇔ ∑ P(ξn ∈ In) = ∞.

Therefore the probability of A3 can take only the value 0 or 1, according to the
convergence or divergence of ∑ P(ξn ∈ In). This is Borel's zero–one law, which is
a particular case of the following theorem.
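Borel's zero–one law is easy to probe numerically. In the sketch below (our own illustration, not from the text), the events are {Un < pn} for independent uniform Un, so that P(ξn ∈ In) = pn; the total number of events that occur stays small when ∑ pn < ∞, while for pn = 1/n it grows roughly like log N.

```python
import random

def count_occurrences(probs, seed):
    """Count how many of the independent events A_n = {U_n < p_n} occur,
    where U_1, U_2, ... are independent Uniform[0,1] variables."""
    rng = random.Random(seed)
    return sum(1 for p in probs if rng.random() < p)

N = 200_000
# sum p_n < infinity (p_n = 1/n^2): with probability 1 only finitely many occur
few = count_occurrences([1 / n**2 for n in range(1, N + 1)], seed=2)
# sum p_n = infinity (p_n = 1/n): infinitely many occur; the count grows ~ log N
many = count_occurrences([1 / n for n in range(1, N + 1)], seed=2)
```

Typical values are a handful of occurrences in the first case versus on the order of log N ≈ 12 in the second, matching the two alternatives of the law.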

Theorem 1 (Kolmogorov’s Zero–One Law). Let ξ1 , ξ2 , . . . be a sequence of inde-


pendent random variables, and let A ∈ X . Then P(A) can only have a value of zero
or one.

PROOF. The idea of the proof is to show that every tail event A is independent of
itself, and therefore P(A ∩ A) = P(A) · P(A), i.e., P(A) = P²(A), so that P(A) = 0
or 1.
If A ∈ X, then A ∈ F1∞ = σ{ξ1, ξ2, . . .} = σ(⋃_n F1n), where F1n =
σ{ξ1, . . . , ξn}, and we can find (Problem 8, Sect. 3, Chap. 2, Vol. 1) sets
An ∈ F1n, n ≥ 1, such that P(A △ An) → 0, n → ∞. Hence

    P(An) → P(A),  P(An ∩ A) → P(A).    (1)

But if A ∈ X, the events An and A are independent,

    P(A ∩ An) = P(A) P(An),

for every n ≥ 1. Hence (1) implies that P(A) = P²(A), and therefore P(A) = 0
or 1.
This completes the proof of the theorem.

Corollary. Let η be a random variable that is measurable with respect to the tail
σ-algebra X , i.e., {η ∈ B} ∈ X , B ∈ B(R). Then η is degenerate, i.e., there is a
constant c such that P(η = c) = 1.

3. Theorem 2 below provides an example of a nontrivial application of Kolmogorov's
zero–one law. Let ξ1, ξ2, . . . be a sequence of independent Bernoulli random
variables with P(ξn = 1) = p, P(ξn = −1) = q, p + q = 1, n ≥ 1, and let
Sn = ξ1 + · · · + ξn. It seems intuitively clear that in the symmetric case
(p = 1/2) a “typical” path of the random walk Sn, n ≥ 1, will cross zero infinitely
often, whereas when p ≠ 1/2 it will go off to infinity. Let us give a precise
formulation.

Theorem 2. (a) If p = 1/2, then P(Sn = 0 i.o.) = 1.
(b) If p ≠ 1/2, then P(Sn = 0 i.o.) = 0.

PROOF. We first observe that the event B = {Sn = 0 i.o.} is not a tail event, i.e.,
B ∉ X = ⋂ Fn∞, Fn∞ = σ{ξn, ξn+1, . . .}. Consequently it is, in principle, not
clear that B can take only the value 0 or 1.
Statement (b) is easily proved by applying (the first part of) the Borel–Cantelli
lemma. In fact, if B2n = {S2n = 0}, then, by Stirling's formula (see (6), Sect. 2,
Chap. 1, Vol. 1),

    P(B2n) = C_{2n}^n p^n q^n ∼ (4pq)^n / √(πn),

and, since 4pq < 1 when p ≠ 1/2, we have ∑ P(B2n) < ∞. Consequently,
P(Sn = 0 i.o.) = 0.
To prove (a), it is enough to prove that the event

    A = {lim sup_n Sn/√n = +∞, lim inf_n Sn/√n = −∞}

has probability 1 (since A ⊆ B).
Let Ac = Ac′ ∩ Ac″, where

    Ac′ = {lim sup_n Sn/√n ≥ c},  Ac″ = {lim inf_n Sn/√n ≤ −c}.

Then Ac ↓ A, c → ∞, and all the events A, Ac′, Ac″ are tail events. Let us show
that P(Ac′) = P(Ac″) = 1 for each c > 0. Since Ac′ ∈ X and Ac″ ∈ X, it is
sufficient to show only that P(Ac′) > 0, P(Ac″) > 0. But by Problem 5,

    P(lim inf_n Sn/√n ≤ −c) = P(lim sup_n Sn/√n ≥ c)
        ≥ lim sup_n P(Sn/√n ≥ c) > 0,

where the first equality holds by the symmetry of the walk (p = 1/2) and the last
inequality follows from the de Moivre–Laplace theorem (Sect. 6, Chap. 1, Vol. 1).
Thus, P(Ac) = 1 for all c > 0, and therefore P(A) = lim_{c→∞} P(Ac) = 1.
This completes the proof of the theorem.
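The dichotomy of Theorem 2 shows up clearly in simulation. The sketch below is our own illustration (the helper `zeros_of_walk` and the step counts are arbitrary choices): it counts visits to zero over 100000 steps for a symmetric and for a biased walk.

```python
import random

def zeros_of_walk(p, n_steps, seed):
    """Count visits of the walk S_n to 0 when steps are +1 w.p. p, -1 w.p. 1-p."""
    rng = random.Random(seed)
    s, zeros = 0, 0
    for _ in range(n_steps):
        s += 1 if rng.random() < p else -1
        if s == 0:
            zeros += 1
    return zeros

# p = 1/2: returns to 0 keep accumulating (on the order of sqrt(n) of them);
# p != 1/2: the walk drifts off and the number of returns stays bounded.
symmetric = [zeros_of_walk(0.5, 100_000, seed=s) for s in range(3)]
biased = [zeros_of_walk(0.6, 100_000, seed=s) for s in range(3)]
```

For p = 1/2 one typically sees a few hundred returns over 100000 steps, while for p = 0.6 only a handful occur before the drift carries the path away for good.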

4. Let us observe again that B = {Sn = 0 i.o.} is not a tail event. Nevertheless, it
follows from Theorem 2 that, for a Bernoulli scheme, the probability of this event,
just as for tail events, takes only the values 0 and 1. This phenomenon is not
accidental: it is a corollary of the Hewitt–Savage zero–one law, which, for
independent identically distributed random variables, extends the result of
Theorem 1 to the class of “symmetric” events (which includes the class of tail
events).
Let us give the essential definitions. A one-to-one mapping π = (π1, π2, . . .) of
the set (1, 2, . . .) onto itself is said to be a finite permutation if πn = n for
all but finitely many n.
If ξ = (ξ1, ξ2, . . .) is a sequence of random variables, π(ξ) denotes the sequence
(ξπ1, ξπ2, . . .). If A is the event {ξ ∈ B}, B ∈ B(R∞), then π(A) denotes the
event {π(ξ) ∈ B}.
We call an event A = {ξ ∈ B}, B ∈ B(R∞), symmetric if π(A) coincides with A
for every finite permutation π.
An example of a symmetric event is A = {Sn = 0 i.o.}, where Sn = ξ1 + · · · + ξn.
Moreover, it can be shown (Problem 4) that every event in the tail σ-algebra
X(S) = ⋂ Fn∞(S), where Fn∞(S) = σ{Sn, Sn+1, . . .} and S1 = ξ1,
S2 = ξ1 + ξ2, . . ., is symmetric.

Theorem 3 (Hewitt–Savage Zero–One Law). Let ξ1 , ξ2 , . . . be a sequence of inde-


pendent identically distributed random variables and A = {ξ ∈ B} a symmetric
event. Then P(A) = 0 or 1.
PROOF. Let A = {ξ ∈ B} be a symmetric event. Choose sets Bn ∈ B(R^n) (see
Problem 8 in Sect. 3, Chap. 2, Vol. 1) such that, for An = {ω : (ξ1, . . . , ξn) ∈ Bn},

    P(A △ An) → 0, n → ∞.    (2)

Since the random variables ξ1, ξ2, . . . are independent identically distributed, the
probability distributions Pξ(B) = P(ξ ∈ B) and Pπn(ξ)(B) = P(πn(ξ) ∈ B)
coincide, where πn(ξ) = (ξn+1, . . . , ξ2n, ξ1, . . . , ξn, ξ2n+1, ξ2n+2, . . .) for all
n ≥ 1. Therefore

    P(A △ An) = Pξ(B △ Bn) = Pπn(ξ)(B △ Bn).    (3)

Since A is symmetric, we have

    A ≡ {ξ ∈ B} = πn(A) ≡ {πn(ξ) ∈ B}.

Therefore

    Pπn(ξ)(B △ Bn) = P({πn(ξ) ∈ B} △ {πn(ξ) ∈ Bn})
        = P({ξ ∈ B} △ {πn(ξ) ∈ Bn}) = P(A △ πn(An)).    (4)

Hence, by (3) and (4),

    P(A △ An) = P(A △ πn(An)).    (5)

By (2), this implies that

    P(A △ (An ∩ πn(An))) → 0, n → ∞.    (6)

Therefore we conclude from (2), (5), and (6) that

    P(An) → P(A),  P(πn(An)) → P(A),  P(An ∩ πn(An)) → P(A).    (7)

Moreover, since ξ1, ξ2, . . . are independent,

    P(An ∩ πn(An)) = P{(ξ1, . . . , ξn) ∈ Bn, (ξn+1, . . . , ξ2n) ∈ Bn}
        = P{(ξ1, . . . , ξn) ∈ Bn} · P{(ξn+1, . . . , ξ2n) ∈ Bn}
        = P(An) P(πn(An)),

whence, by (7),

    P(A) = P²(A),

and therefore P(A) = 0 or 1.
This completes the proof of the theorem.



5. PROBLEMS
1. Prove the corollary to Theorem 1.
2. Show that if (ξn )n≥1 is a sequence of independent random variables, then the
random variables lim sup ξn and lim inf ξn are degenerate.
3. Let (ξn ) be a sequence of independent random variables, Sn = ξ1 + · · · + ξn ,
and let the constants bn satisfy 0 < bn ↑ ∞. Show that the random variables
lim sup (Sn /bn ) and lim inf (Sn /bn ) are degenerate.
4. Let Sn = ξ1 + · · · + ξn, n ≥ 1, and

    X(S) = ⋂_n Fn∞(S),  Fn∞(S) = σ{Sn, Sn+1, . . .}.

Show that every event in X(S) is symmetric.


5. Let (ξn ) be a sequence of random variables. Show that {lim sup ξn > c} ⊇
lim sup {ξn > c} for each c > 0.
6. Give an example of a tail event whose probability is strictly greater than 0 and
less than 1.
7. Let ξ1, ξ2, . . . be independent random variables with E ξ1 = 0, E ξ1² = 1 that
satisfy the central limit theorem (P{Sn/√n ≤ x} → Φ(x), x ∈ R, where
Sn = ξ1 + · · · + ξn). Show that

    lim sup_{n→∞} n^{−1/2} Sn = +∞ (P-a.s.).

In particular, this property holds for a sequence of independent identically
distributed random variables (with E ξ1 = 0, E ξ1² = 1).
8. Let ξ1, ξ2, . . . be independent identically distributed random variables with
E |ξ1| > 0. Show that

    lim sup_{n→∞} |∑_{k=1}^n ξk| = +∞ (P-a.s.).

2. Convergence of Series

1. Let us suppose that ξ1, ξ2, . . . is a sequence of independent random variables,
Sn = ξ1 + · · · + ξn, and let A be the set of sample points ω for which ∑ ξn(ω)
converges to a finite limit. It follows from Kolmogorov's zero–one law that
P(A) = 0 or 1, i.e., the series ∑ ξn converges or diverges with probability 1. The
object of this section is to give criteria that determine whether a sum of
independent random variables converges or diverges.

Theorem 1 (Kolmogorov and Khinchin). (a) Let E ξn = 0, n ≥ 1. Then, if

    ∑ E ξn² < ∞,    (1)

the series ∑ ξn converges with probability 1.
(b) If the ξn are uniformly bounded (i.e., P(|ξn| ≤ c) = 1 for some c < ∞ and
all n ≥ 1), then the converse is true: the convergence of ∑ ξn with probability 1
implies (1).

The proof depends on

Kolmogorov's Inequalities. (a) Let ξ1, ξ2, . . . , ξn be independent random
variables with E ξi = 0, E ξi² < ∞, 1 ≤ i ≤ n. Then for every ε > 0

    P(max_{1≤k≤n} |Sk| ≥ ε) ≤ E Sn² / ε².    (2)

(b) If also P(|ξi| ≤ c) = 1, 1 ≤ i ≤ n, then

    P(max_{1≤k≤n} |Sk| ≥ ε) ≥ 1 − (c + ε)² / E Sn².    (3)

PROOF. (a) Put

    A = {max_{1≤k≤n} |Sk| ≥ ε},
    Ak = {|Si| < ε, i = 1, . . . , k − 1, |Sk| ≥ ε}, 1 ≤ k ≤ n.

Then A = ∑ Ak (the events Ak are pairwise disjoint) and

    E Sn² ≥ E Sn² I_A = ∑ E Sn² I_{Ak}.

But

    E Sn² I_{Ak} = E (Sk + (ξ_{k+1} + · · · + ξn))² I_{Ak}
        = E Sk² I_{Ak} + 2 E Sk (ξ_{k+1} + · · · + ξn) I_{Ak}
          + E (ξ_{k+1} + · · · + ξn)² I_{Ak} ≥ E Sk² I_{Ak},

since

    E Sk (ξ_{k+1} + · · · + ξn) I_{Ak} = E Sk I_{Ak} · E (ξ_{k+1} + · · · + ξn) = 0

because of independence and the conditions E ξi = 0, 1 ≤ i ≤ n. Hence

    E Sn² ≥ ∑ E Sk² I_{Ak} ≥ ε² ∑ P(Ak) = ε² P(A),

which proves the first inequality.


(b) To prove (3), we observe that

    E Sn² I_A = E Sn² − E Sn² I_Ā ≥ E Sn² − ε² P(Ā) = E Sn² − ε² + ε² P(A).    (4)

On the other hand, on the set Ak,

    |S_{k−1}| ≤ ε,  |Sk| ≤ |S_{k−1}| + |ξk| ≤ ε + c,

and therefore

    E Sn² I_A = ∑_k E Sk² I_{Ak} + ∑_k E (I_{Ak} (Sn − Sk)²)
        ≤ (ε + c)² ∑_k P(Ak) + ∑_{k=1}^n P(Ak) ∑_{j=k+1}^n E ξj²
        ≤ P(A) [(ε + c)² + ∑_{j=1}^n E ξj²] = P(A) [(ε + c)² + E Sn²].    (5)

From (4) and (5) we obtain

    P(A) ≥ (E Sn² − ε²) / ((ε + c)² + E Sn² − ε²)
        = 1 − (ε + c)² / ((ε + c)² + E Sn² − ε²) ≥ 1 − (ε + c)² / E Sn².

This completes the proof of (3).
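Inequality (2) can be checked by direct simulation. The sketch below is our own illustration (the ±1 steps, the trial count, and the helper names are arbitrary choices): it estimates the left-hand side of (2) by Monte Carlo and compares it with the bound E Sn²/ε², using that E Sn² = n for ±1 steps.

```python
import random

def max_abs_partial_sum(n, rng):
    """max_k |S_k| for one path of n independent ±1 steps."""
    s, m = 0, 0
    for _ in range(n):
        s += 1 if rng.random() < 0.5 else -1
        m = max(m, abs(s))
    return m

def estimate_lhs(n, eps, n_trials, seed):
    """Monte Carlo estimate of P(max_{1<=k<=n} |S_k| >= eps)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_trials) if max_abs_partial_sum(n, rng) >= eps)
    return hits / n_trials

n, eps = 100, 25
lhs = estimate_lhs(n, eps, n_trials=2000, seed=4)
rhs = n / eps**2   # Kolmogorov's bound E S_n^2 / eps^2, since E S_n^2 = n here
```

The empirical probability comes out far below the bound 0.16, as expected: inequality (2) is a worst-case estimate, not an approximation.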




PROOF OF THEOREM 1. (a) By Theorem 4 in Sect. 10, Chap. 2, Vol. 1, the
sequence (Sn)n≥1 converges with probability 1 if and only if it is fundamental
with probability 1. By Theorem 1 of Sect. 10, Chap. 2, Vol. 1, the sequence
(Sn)n≥1 is fundamental (P-a.s.) if and only if

    P(sup_{k≥1} |S_{n+k} − Sn| ≥ ε) → 0, n → ∞.    (6)

By (2),

    P(sup_{k≥1} |S_{n+k} − Sn| ≥ ε) = lim_{N→∞} P(max_{1≤k≤N} |S_{n+k} − Sn| ≥ ε)
        ≤ lim_{N→∞} ∑_{k=n+1}^{n+N} E ξk² / ε² = ∑_{k=n+1}^∞ E ξk² / ε².

Therefore (6) is satisfied if ∑_{k=1}^∞ E ξk² < ∞, and consequently ∑ ξk
converges with probability 1.
(b) Let ∑ ξk converge. Then, by (6), for sufficiently large n,

    P(sup_{k≥1} |S_{n+k} − Sn| ≥ ε) < 1/2.    (7)

By (3),

    P(sup_{k≥1} |S_{n+k} − Sn| ≥ ε) ≥ 1 − (c + ε)² / ∑_{k=n+1}^∞ E ξk².

Therefore, if we suppose that ∑_{k=1}^∞ E ξk² = ∞, then we obtain

    P(sup_{k≥1} |S_{n+k} − Sn| ≥ ε) = 1,

which contradicts (7).
This completes the proof of the theorem.


EXAMPLE. If ξ1, ξ2, . . . is a sequence of independent Bernoulli random variables
with P(ξn = +1) = P(ξn = −1) = 1/2, then the series ∑ an ξn, with |an| ≤ c,
converges with probability 1 if and only if ∑ an² < ∞.
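The dichotomy in this example is easy to see numerically. In the sketch below (our own illustration; the "spread of the late partial sums" is only a heuristic proxy for convergence), the choice an = 1/n gives ∑ an² < ∞, while an = 1/√n gives ∑ an² = ∞.

```python
import random

def late_spread(coeffs, seed):
    """Max - min of the partial sums of sum a_n * xi_n (fair random signs)
    over the second half of the index range."""
    rng = random.Random(seed)
    s = 0.0
    half = len(coeffs) // 2
    lo = hi = None
    for i, a in enumerate(coeffs, start=1):
        s += a if rng.random() < 0.5 else -a
        if i > half:
            lo = s if lo is None else min(lo, s)
            hi = s if hi is None else max(hi, s)
    return hi - lo

N = 100_000
# a_n = 1/n: sum a_n^2 < infinity, so the late partial sums barely move
converging = late_spread([1 / n for n in range(1, N + 1)], seed=5)
# a_n = 1/sqrt(n): sum a_n^2 = infinity, so the partial sums keep wandering
diverging = late_spread([1 / n**0.5 for n in range(1, N + 1)], seed=5)
```

The convergent case settles into a band of width on the order of 0.01, while the divergent case keeps fluctuating on the order of 1 over the same range of indices.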

2. Theorem 2 (Kolmogorov–Khinchin's Two-Series Theorem). A sufficient condition
for the convergence of the series ∑ ξn of independent random variables with
probability 1 is that both series ∑ E ξn and ∑ Var ξn converge. If
P(|ξn| ≤ c) = 1, n ≥ 1, for some c > 0, this condition is also necessary.

PROOF. If ∑ Var ξn < ∞, then, by Theorem 1, the series ∑ (ξn − E ξn)
converges (P-a.s.). But by hypothesis the series ∑ E ξn converges; hence ∑ ξn
also converges (P-a.s.).
To prove the necessity, we use the following symmetrization method. In addition
to the sequence ξ1, ξ2, . . ., we consider a different sequence, ξ̃1, ξ̃2, . . ., of
independent random variables, independent of ξ1, ξ2, . . ., such that ξ̃n has the
same distribution as ξn, n ≥ 1. (When the original sample space is sufficiently
rich, the existence of such a sequence follows from Corollary 1 to Theorem 1 of
Sect. 9, Chap. 2, Vol. 1. It can also be shown that this assumption involves no
loss of generality.)
Then, if ∑ ξn converges (P-a.s.), the series ∑ ξ̃n also converges, and hence so
does ∑ (ξn − ξ̃n). But E (ξn − ξ̃n) = 0 and P(|ξn − ξ̃n| ≤ 2c) = 1. Therefore
∑ Var(ξn − ξ̃n) < ∞ by Theorem 1 (b). In addition,

    ∑ Var ξn = (1/2) ∑ Var(ξn − ξ̃n) < ∞.

Consequently, by Theorem 1 (a), ∑ (ξn − E ξn) converges with probability 1, and
therefore ∑ E ξn converges.
Thus, if ∑ ξn converges (P-a.s.) (and P(|ξn| ≤ c) = 1, n ≥ 1), then both
∑ E ξn and ∑ Var ξn converge.
This completes the proof of the theorem.

3. The following theorem provides a necessary and sufficient condition for the
convergence of ∑ ξn without any boundedness condition on the random variables.
Let c be a constant and

    ξ^c = ξ if |ξ| ≤ c,  ξ^c = 0 if |ξ| > c.

Theorem 3 (Kolmogorov's Three-Series Theorem). Let ξ1, ξ2, . . . be a sequence of
independent random variables. A necessary condition for the convergence of ∑ ξn
with probability 1 is that the series

    ∑ E ξn^c,  ∑ Var ξn^c,  ∑ P(|ξn| ≥ c)

converge for every c > 0; a sufficient condition is that these series converge for
some c > 0.
PROOF. Sufficiency. By the two-series theorem, ∑ ξn^c converges with
probability 1. But if ∑ P(|ξn| ≥ c) < ∞, then ∑ I(|ξn| ≥ c) < ∞ with
probability 1 by the Borel–Cantelli lemma. Consequently, ξn = ξn^c for all n with
at most finitely many exceptions. Therefore ∑ ξn also converges (P-a.s.).
Necessity. If ∑ ξn converges (P-a.s.), then ξn → 0 (P-a.s.), and therefore, for
every c > 0, at most a finite number of the events {|ξn| ≥ c} can occur (P-a.s.).
Therefore ∑ I(|ξn| ≥ c) < ∞ (P-a.s.), and, by the second part of the
Borel–Cantelli lemma, ∑ P(|ξn| ≥ c) < ∞. Moreover, the convergence of ∑ ξn
implies the convergence of ∑ ξn^c. Therefore, by the two-series theorem, both of
the series ∑ E ξn^c and ∑ Var ξn^c converge.
This completes the proof of the theorem.
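As a worked illustration of how the three series are used (our own example, not from the text): take ξn = εn n^(−α) with independent fair random signs εn = ±1 and truncation level c = 1. By symmetry E ξn^c = 0, so everything reduces to the variance series ∑ n^(−2α), and ∑ ξn converges a.s. precisely when α > 1/2. The sketch below accumulates partial sums of the three series numerically.

```python
def three_series_terms(alpha, c=1.0, n_max=100_000):
    """Partial sums of Kolmogorov's three series for xi_n = ±n^(-alpha)
    with fair random signs, truncated at level c."""
    s_mean, s_var, s_tail = 0.0, 0.0, 0.0
    for n in range(1, n_max + 1):
        x = n ** (-alpha)
        if x <= c:
            s_var += x * x      # Var xi_n^c = x^2 (E xi_n^c = 0 by symmetry)
        else:
            s_tail += 1.0       # P(|xi_n| >= c) = 1 when |xi_n| = x > c
    return s_mean, s_var, s_tail

# alpha = 0.6: variance series ~ sum n^(-1.2) converges => sum xi_n converges a.s.
_, v_conv, t_conv = three_series_terms(0.6)
# alpha = 0.4: variance series ~ sum n^(-0.8) diverges => sum xi_n diverges a.s.
_, v_div, t_div = three_series_terms(0.4)
```

Since these ξn are bounded by 1, the necessity part of the theorem applies as well: divergence of the variance series for α ≤ 1/2 implies that ∑ ξn diverges with probability 1.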

Corollary. Let ξ1, ξ2, . . . be independent random variables with E ξn = 0. Then,
if

    ∑ E [ξn² / (1 + |ξn|)] < ∞,

the series ∑ ξn converges with probability 1.

For the proof we observe that

    ∑ E [ξn² / (1 + |ξn|)] < ∞ ⇔ ∑ E [ξn² I(|ξn| ≤ 1) + |ξn| I(|ξn| > 1)] < ∞.

Therefore if ξn¹ = ξn I(|ξn| ≤ 1), then we have

    ∑ E (ξn¹)² < ∞.

Since E ξn = 0, we have

    |E ξn¹| = |E ξn I(|ξn| ≤ 1)| = |E ξn I(|ξn| > 1)| ≤ E |ξn| I(|ξn| > 1),

and hence ∑ |E ξn¹| < ∞. Therefore both ∑ E ξn¹ and ∑ Var ξn¹ converge.
Moreover, by Chebyshev's inequality,

    P{|ξn| > 1} = P{|ξn| I(|ξn| > 1) > 1} ≤ E |ξn| I(|ξn| > 1).

Therefore ∑ P(|ξn| > 1) < ∞. Hence the convergence of ∑ ξn follows from the
three-series theorem.

4. PROBLEMS

1. Let ξ1, ξ2, . . . be a sequence of independent random variables, Sn = ξ1 + · · · + ξn. Show, using the three-series theorem, that:
(a) If Σ ξn² < ∞ (P-a.s.), then Σ ξn converges with probability 1 if and only if Σ E ξn I(|ξn| ≤ 1) converges;
(b) If Σ ξn converges (P-a.s.), then Σ ξn² < ∞ (P-a.s.) if and only if

    Σ (E |ξn| I(|ξn| ≤ 1))² < ∞.

2. Let ξ1, ξ2, . . . be a sequence of independent random variables. Show that Σ ξn² < ∞ (P-a.s.) if and only if

    Σ E [ξn²/(1 + ξn²)] < ∞.

3. Let ξ1, ξ2, . . . be a sequence of independent random variables. Then the following three conditions are equivalent:
(a) the series Σ ξn converges with probability 1;
(b) the series Σ ξn converges in probability;
(c) the series Σ ξn converges in distribution.
4. Give an example showing that in Theorems 1 and 2 we cannot dispense with the uniform boundedness condition (P{|ξn| ≤ c} = 1 for some c > 0).
5. Let ξ1, . . . , ξn be independent identically distributed random variables such that E ξ1 = 0, E ξ1² < ∞, and let Sn = ξ1 + · · · + ξn. Prove the following one-sided analog (A. V. Marshall) of Kolmogorov's inequality (2):

    P{max_{1≤k≤n} Sk ≥ ε} ≤ E Sn²/(ε² + E Sn²).

6. Let ξ1, ξ2, . . . be a sequence of (arbitrary) random variables. Show that if Σ_{n≥1} E |ξn| < ∞, then Σ_{n≥1} ξn converges absolutely with probability 1.
7. Let ξ1, ξ2, . . . be independent random variables with a symmetric distribution. Show that

    E [(Σ_n ξn)² ∧ 1] ≤ Σ_n E (ξn² ∧ 1).

8. Let ξ1, ξ2, . . . be independent random variables with finite second moments. Show that Σ ξn converges in L² if and only if Σ E ξn and Σ Var ξn converge.
9. Let ξ1, ξ2, . . . be independent random variables such that the series Σ ξn converges a.s. Show that the value of this series is independent of the order of its terms if and only if Σ |E (ξn; |ξn| ≤ 1)| < ∞.
10. Let ξ1, ξ2, . . . be independent random variables with E ξn = 0, n ≥ 1, and

    Σ_{n=1}^∞ E [ξn² I(|ξn| ≤ 1) + |ξn| I(|ξn| > 1)] < ∞.

Then Σ_{n=1}^∞ ξn converges P-a.s.
11. Let A1, A2, . . . be independent events with P(An) > 0, n ≥ 1, and Σ_{n=1}^∞ P(An) = ∞. Show that

    Σ_{j=1}^n I(Aj) / Σ_{j=1}^n P(Aj) → 1 (P-a.s.) as n → ∞.

12. Let ξ1, ξ2, . . . be independent random variables with expectations E ξn and variances σn² such that lim_n E ξn = c and Σ_{n=1}^∞ σn^{−2} = ∞. Show that in this case

    (Σ_{j=1}^n ξj/σj²) / (Σ_{j=1}^n 1/σj²) → c (P-a.s.) as n → ∞.

13. Let ξ1, ξ2, . . . , ξn be independent random variables with E ξi = 0, i ≤ n, and let Sk = ξ1 + ξ2 + · · · + ξk. Prove Etemadi's inequality

    P{max_{1≤k≤n} |Sk| ≥ 3ε} ≤ 3 max_{1≤k≤n} P(|Sk| ≥ ε)

and deduce from it Kolmogorov's inequality (with an extra factor 27):

    P{max_{1≤k≤n} |Sk| ≥ 3ε} ≤ (27/ε²) E Sn².

3. Strong Law of Large Numbers

1. Let ξ1, ξ2, . . . be a sequence of independent random variables with finite second moments; let Sn = ξ1 + · · · + ξn. By Problem 2 in Sect. 3, Chap. 3, Vol. 1, if the variances Var ξi are uniformly bounded, we have the (weak) law of large numbers:

    (Sn − E Sn)/n →P 0,  n → ∞. (1)
A strong law of large numbers is a proposition in which convergence in proba-
bility is replaced by convergence with probability 1.
One of the earliest results in this direction is the following theorem.

Theorem 1 (Cantelli). Let ξ1, ξ2, . . . be independent random variables with finite fourth moments, and let

    E |ξn − E ξn|⁴ ≤ C,  n ≥ 1,

for some constant C. Then, as n → ∞,

    (Sn − E Sn)/n → 0 (P-a.s.). (2)
PROOF. Without loss of generality, we may assume that E ξn = 0 for n ≥ 1. By the corollary to Theorem 1, Sect. 10 of Chap. 2, Vol. 1, we will have Sn/n → 0 (P-a.s.), provided that

    Σ P{|Sn/n| ≥ ε} < ∞

for every ε > 0. In turn, by Chebyshev's inequality, this will follow from

    Σ E |Sn/n|⁴ < ∞.

Let us show that this condition is actually satisfied under our hypotheses.
We have

    Sn⁴ = (ξ1 + · · · + ξn)⁴ = Σ_{i=1}^n ξi⁴ + (4!/(2!2!)) Σ_{i<j} ξi²ξj² + (4!/(2!1!1!)) Σ_{i≠j, i≠k, j<k} ξi²ξjξk
        + 4! Σ_{i<j<k<l} ξiξjξkξl + (4!/(3!1!)) Σ_{i≠j} ξi³ξj.

Remembering that E ξk = 0, k ≥ 1, we then obtain

    E Sn⁴ = Σ_{i=1}^n E ξi⁴ + 6 Σ_{i<j} E ξi² E ξj² ≤ nC + 6 Σ_{i<j} √(E ξi⁴ · E ξj⁴)
        ≤ nC + 6 · (n(n − 1)/2) · C = (3n² − 2n)C < 3n²C.

Consequently,

    Σ E (Sn/n)⁴ ≤ 3C Σ (1/n²) < ∞.

This completes the proof of the theorem. □

2. The hypotheses of Theorem 1 can be considerably weakened by the use of more precise methods.

Theorem 2 (Kolmogorov). Let ξ1, ξ2, . . . be a sequence of independent random variables with finite second moments, and let there be positive numbers bn such that bn ↑ ∞ and

    Σ (Var ξn)/bn² < ∞. (3)

Then

    (Sn − E Sn)/bn → 0 (P-a.s.). (4)

In particular, if

    Σ (Var ξn)/n² < ∞, (5)

then

    (Sn − E Sn)/n → 0 (P-a.s.). (6)
For the proof of this, and of Theorem 3 in what follows, we need two lemmas.

Lemma 1 (Toeplitz). Let {an} be a sequence of nonnegative numbers, bn = Σ_{i=1}^n ai, b1 = a1 > 0, and bn ↑ ∞, n → ∞. Let {xn}n≥1 be a sequence of numbers converging to x. Then

    (1/bn) Σ_{j=1}^n aj xj → x. (7)

In particular, if an = 1, then

    (x1 + · · · + xn)/n → x. (8)
PROOF. Let ε > 0, and let n0 = n0(ε) be such that |xn − x| ≤ ε/2 for all n ≥ n0. Choose n1 > n0 such that

    (1/b_{n1}) Σ_{j=1}^{n0} aj |xj − x| < ε/2.

Then, for n > n1,

    |(1/bn) Σ_{j=1}^n aj xj − x| ≤ (1/bn) Σ_{j=1}^n aj |xj − x|
      = (1/bn) Σ_{j=1}^{n0} aj |xj − x| + (1/bn) Σ_{j=n0+1}^n aj |xj − x|
      ≤ (1/b_{n1}) Σ_{j=1}^{n0} aj |xj − x| + (1/bn) Σ_{j=n0+1}^n aj |xj − x|
      ≤ ε/2 + ((bn − b_{n0})/bn) · (ε/2) ≤ ε.

This completes the proof of the lemma. □
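As a quick numerical illustration (not part of the text), Toeplitz's lemma can be checked for a concrete choice of weights; the choice aj = j and xj = 1 + 1/j below is an arbitrary example with limit x = 1.

```python
# Numerical check of Toeplitz's lemma: if x_n -> x and b_n = a_1 + ... + a_n increases
# to infinity, then (1/b_n) * sum_{j<=n} a_j x_j -> x.  Illustrative choice:
# a_j = j (so b_n grows without bound) and x_j = 1 + 1/j (so x_j -> 1).

def toeplitz_averages(a, x):
    """Return the weighted averages (1/b_n) * sum_{j<=n} a_j x_j for each n."""
    out, s, b = [], 0.0, 0.0
    for aj, xj in zip(a, x):
        s += aj * xj
        b += aj
        out.append(s / b)
    return out

n = 10_000
a = [j for j in range(1, n + 1)]
x = [1.0 + 1.0 / j for j in range(1, n + 1)]
avgs = toeplitz_averages(a, x)
print(avgs[-1])  # close to the limit x = 1
```

Here the weighted average equals 1 + 2/(n + 1) exactly, so the convergence in (7) is visible directly.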

Lemma 2 (Kronecker). Let {bn} be a sequence of positive increasing numbers, bn ↑ ∞, n → ∞, and let {xn} be a sequence of numbers such that Σ xn converges. Then

    (1/bn) Σ_{j=1}^n bj xj → 0,  n → ∞.

In particular, if bn = n, xn = yn/n, and Σ (yn/n) converges, then

    (y1 + · · · + yn)/n → 0,  n → ∞. (9)
PROOF. Let b0 = 0, S0 = 0, Sn = Σ_{j=1}^n xj. Then (by summation by parts)

    Σ_{j=1}^n bj xj = Σ_{j=1}^n bj (Sj − Sj−1) = bn Sn − b0 S0 − Σ_{j=1}^n Sj−1 (bj − bj−1),

and therefore (setting aj = bj − bj−1),

    (1/bn) Σ_{j=1}^n bj xj = Sn − (1/bn) Σ_{j=1}^n Sj−1 aj → 0,

since, if Sn → x, then, by Toeplitz's lemma,

    (1/bn) Σ_{j=1}^n Sj−1 aj → x.

This establishes the lemma. □
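Kronecker's lemma can likewise be illustrated numerically (an arbitrary example, not from the text): the alternating series with terms xn = (−1)^{n+1}/√n converges, so with bn = n the averages (1/n) Σ j·xj must tend to 0.

```python
# Numerical check of Kronecker's lemma with b_n = n and x_n = (-1)^(n+1)/sqrt(n):
# the series sum x_n converges (alternating series test), hence
# (1/n) * sum_{j<=n} j * x_j must tend to 0 as n grows.
import math

def kronecker_average(x):
    """Return (1/n) * sum_{j<=n} j * x_j for n = len(x)."""
    s = sum(j * xj for j, xj in enumerate(x, start=1))
    return s / len(x)

x = [(-1) ** (j + 1) / math.sqrt(j) for j in range(1, 10_001)]
print(kronecker_average(x))  # close to 0
```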


PROOF OF THEOREM 2. Since

    (Sn − E Sn)/bn = (1/bn) Σ_{k=1}^n bk · (ξk − E ξk)/bk,

a sufficient condition for (4) is, by Kronecker's lemma, that the series Σ [(ξk − E ξk)/bk] converges (P-a.s.). But this series does converge by (3) and Theorem 1 of Sect. 2.
This completes the proof of the theorem. □

EXAMPLE 1. Let ξ1, ξ2, . . . be a sequence of independent Bernoulli random variables with P(ξn = 1) = P(ξn = −1) = 1/2. Then, since Σ [1/(n log² n)] < ∞, we have, taking bn = √n log n in Theorem 2,

    Sn/(√n log n) → 0 (P-a.s.). (10)
3. In the case where the variables ξ1 , ξ2 , . . . are not only independent but also
identically distributed, we can obtain a strong law of large numbers without requir-
ing (as in Theorem 2) the existence of the second moment, provided that the first
absolute moment exists.
Theorem 3 (Kolmogorov). Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables with E |ξ1| < ∞. Then

    Sn/n → m (P-a.s.), (11)

where m = E ξ1.

For the proof we need the following lemma.

Lemma 3. Let ξ be a nonnegative random variable. Then

    Σ_{n=1}^∞ P(ξ ≥ n) ≤ E ξ ≤ 1 + Σ_{n=1}^∞ P(ξ ≥ n). (12)

The proof consists of the following chain of relations:

    Σ_{n=1}^∞ P(ξ ≥ n) = Σ_{n=1}^∞ Σ_{k≥n} P(k ≤ ξ < k + 1)
      = Σ_{k=1}^∞ k P(k ≤ ξ < k + 1) = Σ_{k=0}^∞ E[k I(k ≤ ξ < k + 1)]
      ≤ Σ_{k=0}^∞ E[ξ I(k ≤ ξ < k + 1)] = E ξ
      ≤ Σ_{k=0}^∞ E[(k + 1) I(k ≤ ξ < k + 1)] = Σ_{k=0}^∞ (k + 1) P(k ≤ ξ < k + 1)
      = Σ_{n=1}^∞ P(ξ ≥ n) + Σ_{k=0}^∞ P(k ≤ ξ < k + 1) = Σ_{n=1}^∞ P(ξ ≥ n) + 1.

(Or one can use formula (69) with n = 1 of Sect. 6, Chap. 2, Vol. 1.)

PROOF OF THEOREM 3. By Lemma 3 and the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1),

    E |ξ1| < ∞ ⇔ Σ P{|ξ1| ≥ n} < ∞ ⇔ Σ P{|ξn| ≥ n} < ∞ ⇔ P{|ξn| ≥ n i.o.} = 0.

Hence |ξn| < n, except for a finite number of n, with probability 1.
Let us put

    ξ̃n = ξn if |ξn| < n,   ξ̃n = 0 if |ξn| ≥ n,

and suppose that E ξn = 0, n ≥ 1. Then ξn ≠ ξ̃n for only finitely many n (P-a.s.), and therefore (ξ1 + · · · + ξn)/n → 0 (P-a.s.) if and only if (ξ̃1 + · · · + ξ̃n)/n → 0 (P-a.s.). Note that in general E ξ̃n ≠ 0, but

    E ξ̃n = E ξn I(|ξn| < n) = E ξ1 I(|ξ1| < n) → E ξ1 = 0.

Hence, by Toeplitz's lemma,

    (1/n) Σ_{k=1}^n E ξ̃k → 0,  n → ∞,

and consequently, (ξ1 + · · · + ξn)/n → 0 (P-a.s.) as n → ∞ if and only if

    [(ξ̃1 − E ξ̃1) + · · · + (ξ̃n − E ξ̃n)]/n → 0 (P-a.s.). (13)

Write ξ̄n = ξ̃n − E ξ̃n. By Kronecker's lemma, (13) will be established if Σ (ξ̄n/n) converges (P-a.s.). In turn, by Theorem 1 of Sect. 2, this will follow if we show that, when E |ξ1| < ∞, the series Σ (Var ξ̄n/n²) converges.
We have

    Σ_{n=1}^∞ (Var ξ̄n)/n² ≤ Σ_{n=1}^∞ (E ξ̃n²)/n² = Σ_{n=1}^∞ (1/n²) E[ξn² I(|ξn| < n)]
      = Σ_{n=1}^∞ (1/n²) E[ξ1² I(|ξ1| < n)] = Σ_{n=1}^∞ (1/n²) Σ_{k=1}^n E[ξ1² I(k − 1 ≤ |ξ1| < k)]
      = Σ_{k=1}^∞ E[ξ1² I(k − 1 ≤ |ξ1| < k)] · Σ_{n=k}^∞ (1/n²)
      ≤ 2 Σ_{k=1}^∞ (1/k) E[ξ1² I(k − 1 ≤ |ξ1| < k)]
      ≤ 2 Σ_{k=1}^∞ E[|ξ1| I(k − 1 ≤ |ξ1| < k)] = 2 E |ξ1| < ∞.

This completes the proof of the theorem. □
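A simulation sketch of Theorem 3 (not part of the text): for i.i.d. variables with E |ξ1| < ∞, the averages Sn/n settle near m = E ξ1. The exponential distribution with mean 2 used below is an arbitrary choice made only for this illustration.

```python
# Simulation of Kolmogorov's strong law of large numbers (Theorem 3):
# S_n/n -> m = E xi_1 for i.i.d. summands with finite mean.
# Assumption for this sketch only: xi_i ~ Exponential with mean m = 2.
import random

random.seed(0)
m = 2.0
n = 200_000
s = 0.0
for _ in range(n):
    s += random.expovariate(1.0 / m)  # one exponential draw with mean m
print(s / n)  # close to m = 2
```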



Remark 1. The theorem admits a converse in the following sense. Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables such that

    (ξ1 + · · · + ξn)/n → C

with probability 1, where C is a (finite) constant. Then E |ξ1| < ∞ and C = E ξ1.
In fact, if Sn/n → C (P-a.s.), then

    ξn/n = Sn/n − ((n − 1)/n) · S_{n−1}/(n − 1) → 0 (P-a.s.),

and therefore P(|ξn| > n i.o.) = 0. By the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1),

    Σ P(|ξ1| > n) < ∞,

and by Lemma 3 we have E |ξ1| < ∞. Then it follows from the theorem that C = E ξ1.
Consequently, for independent identically distributed random variables, the condition E |ξ1| < ∞ is necessary and sufficient for the convergence (with probability 1) of the ratio Sn/n to a finite limit.
Remark 2. If the expectation m = E ξ1 exists but is not necessarily finite, the conclusion (11) of the theorem remains valid.
In fact, let, for example, E ξ1⁻ < ∞ and E ξ1⁺ = ∞. With C > 0, put

    Sn^C = Σ_{i=1}^n ξi I(ξi ≤ C).

Then (P-a.s.)

    lim inf_n Sn/n ≥ lim inf_n Sn^C/n = E ξ1 I(ξ1 ≤ C).

But as C → ∞,

    E ξ1 I(ξ1 ≤ C) → E ξ1 = ∞,

and therefore Sn/n → +∞ (P-a.s.).
Remark 3. Theorem 3 asserts the convergence Sn/n → m (P-a.s.). Note that, besides the convergence almost surely (a.s.), in this case the convergence in the mean, Sn/n →L¹ m, also holds, i.e., E |Sn/n − m| → 0, n → ∞. This follows from the ergodic Theorem 3 of Sect. 3, Chap. 5. But in the case under consideration of independent identically distributed random variables ξ1, ξ2, . . . and Sn = ξ1 + ξ2 + · · · + ξn, this can be proved directly (Problem 7) without invoking the ergodic theorem.
4. Let us give some applications of the strong law of large numbers.
EXAMPLE 2 (Application to number theory). Let Ω = [0, 1), let B be the σ-algebra of Borel subsets of Ω, and let P be Lebesgue measure on [0, 1). Consider the binary expansions ω = 0.ω1ω2 . . . of numbers ω ∈ Ω (with infinitely many 0s), and define random variables ξ1(ω), ξ2(ω), . . . by putting ξn(ω) = ωn. Since, for all n ≥ 1 and all x1, . . . , xn taking a value 0 or 1,

    {ω : ξ1(ω) = x1, . . . , ξn(ω) = xn}
      = {ω : x1/2 + x2/2² + · · · + xn/2ⁿ ≤ ω < x1/2 + · · · + xn/2ⁿ + 1/2ⁿ},

the P-measure of this set is 1/2ⁿ. It follows that ξ1, ξ2, . . . is a sequence of independent identically distributed random variables with

    P(ξ1 = 0) = P(ξ1 = 1) = 1/2.

Hence, by the strong law of large numbers, we have the following result of Borel: almost every number in [0, 1) is normal, in the sense that with probability 1 the proportion of zeros and ones in its binary expansion tends to 1/2, i.e.,

    (1/n) Σ_{k=1}^n I(ξk = 1) → 1/2 (P-a.s.).
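A quick numerical illustration of Borel's result (not part of the text): drawing the binary digits ξk directly as independent fair bits, which is exactly the probability model above, the proportion of ones among the first n digits settles near 1/2.

```python
# Illustration of Borel's normal number theorem: the fraction of ones among the
# first n binary digits of a "randomly chosen" omega in [0, 1) tends to 1/2.
# The digits xi_k are simulated as independent fair bits.
import random

random.seed(1)
n = 100_000
ones = sum(random.getrandbits(1) for _ in range(n))
print(ones / n)  # close to 1/2
```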

EXAMPLE 3 (The Monte Carlo method). Let f(x) be a continuous function defined on [0, 1], with values in [0, 1]. The following idea is the foundation of the statistical method of calculating ∫₀¹ f(x) dx (the Monte Carlo method). Let ξ1, η1, ξ2, η2, . . . be a sequence of independent random variables uniformly distributed on [0, 1]. Put

    ρi = 1 if f(ξi) > ηi,   ρi = 0 if f(ξi) ≤ ηi.

It is clear that

    E ρ1 = P{f(ξ1) > η1} = ∫₀¹ f(x) dx.

By the strong law of large numbers (Theorem 3),

    (1/n) Σ_{i=1}^n ρi → ∫₀¹ f(x) dx (P-a.s.).

Consequently, we can approximate the integral ∫₀¹ f(x) dx by taking a simulation consisting of n pairs of random variables (ξi, ηi), i ≥ 1, and then calculating the ρi and (1/n) Σ_{i=1}^n ρi.
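The procedure of Example 3 can be sketched directly; the choice f(x) = x², for which the exact integral is 1/3, is an arbitrary test case.

```python
# Hit-or-miss Monte Carlo for the integral of f over [0, 1], as in Example 3:
# rho_i = 1 when the uniform point (xi_i, eta_i) falls below the graph of f,
# and the average of the rho_i estimates the integral.
import random

random.seed(2)

def mc_integral(f, n):
    hits = 0
    for _ in range(n):
        xi, eta = random.random(), random.random()
        if f(xi) > eta:
            hits += 1
    return hits / n

est = mc_integral(lambda x: x * x, 200_000)
print(est)  # close to the exact value 1/3
```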
EXAMPLE 4 (The strong law of large numbers for a renewal process). Let N = (Nt)t≥0 be a renewal process introduced in Subsection 4 of Sect. 9, Chap. 2, Vol. 1: Nt = Σ_{n=1}^∞ I(Tn ≤ t), Tn = σ1 + · · · + σn, where σ1, σ2, . . . is a sequence of independent identically distributed positive random variables. We assume now that μ = E σ1 < ∞.
Under this condition, the process N satisfies the strong law of large numbers:

    Nt/t → 1/μ (P-a.s.),  t → ∞. (14)

For the proof, we observe that on the set {Nt > 0} the fact that T_{Nt} ≤ t < T_{Nt+1}, t ≥ 0, implies the inequalities

    T_{Nt}/Nt ≤ t/Nt < (T_{Nt+1}/(Nt + 1)) · (1 + 1/Nt). (15)

Clearly, Nt = Nt(ω) → ∞ (P-a.s.) as t → ∞. At the same time, by Theorem 3,

    Tn(ω)/n = (σ1(ω) + · · · + σn(ω))/n → μ (P-a.s.),  n → ∞.

Therefore we also have

    T_{Nt(ω)}(ω)/Nt(ω) → μ (P-a.s.),  t → ∞,

and hence we see from (15) that there exists (P-a.s.) the limit lim_{t→∞} t/Nt, which is equal to μ, which proves the strong law of large numbers (14).
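The renewal law (14) is easy to see in simulation (not part of the text); the exponential interarrival times with mean μ = 2 below are an arbitrary assumption for this sketch, so the expected answer is 1/μ = 0.5.

```python
# Simulation of the renewal strong law (14): N_t/t -> 1/mu as t -> infinity.
# Assumption for this sketch: interarrival times sigma_i ~ Exponential with mean mu = 2.
import random

random.seed(3)
mu, t = 2.0, 50_000.0
clock, n_t = 0.0, 0
while True:
    clock += random.expovariate(1.0 / mu)  # next interarrival time sigma
    if clock > t:
        break
    n_t += 1  # one more renewal before time t
print(n_t / t)  # close to 1/mu = 0.5
```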

5. PROBLEMS

1. Show that E ξ² < ∞ if and only if Σ_{n=1}^∞ n P(|ξ| > n) < ∞.
2. Supposing that ξ1, ξ2, . . . are independent identically distributed, show that if E |ξ1|^α < ∞ for some α, 0 < α < 1, then Sn/n^{1/α} → 0 (P-a.s.), and if E |ξ1|^β < ∞ for some β, 1 ≤ β < 2, then (Sn − n E ξ1)/n^{1/β} → 0 (P-a.s.).
3. Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables, and let E |ξ1| = ∞. Show that

    lim sup_n |Sn/n − an| = ∞ (P-a.s.)

for every sequence of constants {an}.
4. Are all rational numbers in [0, 1) normal (in the sense of Example 2)?
5. Give an example of a sequence of independent random variables ξ1, ξ2, . . . such that the limit lim_{n→∞}(Sn/n) does exist in probability but does not exist with probability 1.
6. (N. Etemadi) Show that Theorem 3 remains valid with the independence condition on ξ1, ξ2, . . . replaced by their pairwise independence.
7. Show that under the conditions of Theorem 3, convergence in the mean (i.e., E |(Sn/n) − m| → 0, n → ∞) also holds.
8. Let ξ1, ξ2, . . . be independent identically distributed random variables with E ξ1² < ∞. Show that

    n P{|ξ1| ≥ ε√n} → 0 and (1/√n) max_{k≤n} |ξk| →P 0.

9. Consider decimal expansions of the numbers ω = 0.ω1ω2 . . . in [0, 1).
(a) Carry over to this case the strong law of large numbers obtained in Subsection 4 for binary expansions.
(b) Show that rational numbers are not normal (in the Borel sense), i.e., in their decimal expansion (ξk(ω) = ωk, k ≥ 1),

    (1/n) Σ_{k=1}^n I(ξk(ω) = i) does not tend to 1/10 (P-a.s.) for any i = 0, 1, . . . , 9.
(c) Show that the Champernowne number ω = 0.12345678910111213 . . . , containing all the integers in a row, is normal (in the sense of Example 2).
10. (a) Let ξ1, ξ2, . . . be a sequence of independent random variables such that P{ξn = ±n^a} = 1/2. Show that this sequence satisfies the strong law of large numbers if and only if a < 1/2.
(b) Let f = f(x) be a bounded continuous function on (0, ∞). Show that, for any a > 0 and x > 0,

    lim_{n→∞} Σ_{k=0}^∞ f(x + k/n) e^{−an} (an)^k/k! = f(x + a).

11. Prove that Kolmogorov's law of large numbers (Theorem 3) can be restated in the following form: Let ξ1, ξ2, . . . be independent identically distributed random variables; then

    E |ξ1| < ∞ ⇔ n⁻¹ Sn → E ξ1 (P-a.s.),
    E |ξ1| = ∞ ⇔ lim sup n⁻¹ |Sn| = +∞ (P-a.s.).

Prove that the first statement remains true with independence replaced by pairwise independence.
12. Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables. Show that

    E sup_n |ξn/n| < ∞ ⇔ E |ξ1| log⁺ |ξ1| < ∞.

13. Let Sn = ξ1 + · · · + ξn, n ≥ 1, where ξ1, ξ2, . . . is a sequence of independent identically distributed random variables with E ξ1 = 0, E |ξ1| > 0. Show that lim sup n^{−1/2} Sn = ∞, lim inf n^{−1/2} Sn = −∞ (P-a.s.).
14. Let Sn = ξ1 + · · · + ξn, n ≥ 1, where ξ1, ξ2, . . . is a sequence of independent identically distributed random variables. Show that for any α ∈ (0, 1/2] one of the following properties holds:
(a) n^{−α} Sn → ∞ (P-a.s.);
(b) n^{−α} Sn → −∞ (P-a.s.);
(c) lim sup n^{−α} Sn = ∞, lim inf n^{−α} Sn = −∞ (P-a.s.).
15. Let S0 = 0 and Sn = ξ1 + · · · + ξn, n ≥ 1, where ξ1, ξ2, . . . is a sequence of independent identically distributed random variables. Show that:
(a) For any ε > 0

    Σ_{n=1}^∞ P{|Sn| ≥ nε} < ∞ ⇔ E ξ1 = 0, E ξ1² < ∞;

(b) If E ξ1 < 0, then for p > 1

    E (sup_{n≥0} Sn)^{p−1} < ∞ ⇔ E (ξ1⁺)^p < ∞;

(c) If E ξ1 = 0 and 1 < p ≤ 2, then for a constant Cp

    Σ_{n=1}^∞ P{max_{k≤n} Sk ≥ n} ≤ Cp E |ξ1|^p,   Σ_{n=1}^∞ P{max_{k≤n} |Sk| ≥ n} ≤ 2Cp E |ξ1|^p;

(d) If E ξ1 = 0, σ² = E ξ1² < ∞, and M(ε) = sup_{n≥0}(Sn − nε), ε > 0, then

    lim_{ε→0} ε E M(ε) = σ²/2.

4. Law of the Iterated Logarithm

1. Let ξ1, ξ2, . . . be a sequence of independent Bernoulli random variables with P(ξn = 1) = P(ξn = −1) = 1/2; let Sn = ξ1 + · · · + ξn. It follows from the proof of Theorem 2, Sect. 1, that

    lim sup Sn/√n = +∞,   lim inf Sn/√n = −∞, (1)

with probability 1. On the other hand, by (10) of Sect. 3,

    Sn/(√n log n) → 0 (P-a.s.). (2)

Let us compare these results.
It follows from (1) that with probability 1 the paths of (Sn)n≥1 intersect the "curves" ±ε√n infinitely often for any given ε > 0; but at the same time, (2) shows that they only finitely often leave the region bounded by the curves ±ε√n log n. These two results yield useful information on the amplitude of the oscillations of the symmetric random walk (Sn)n≥1. The law of the iterated logarithm, which we present in what follows, sharpens this picture of the amplitude of the oscillations of (Sn)n≥1.

Definition. We call a function ϕ∗ = ϕ∗ (n), n ≥ 1, upper (for (Sn )n≥1 ) if, with
probability 1, Sn ≤ ϕ∗ (n) for all n from some n = n0 (ω) on.
We call a function ϕ∗ = ϕ∗ (n), n ≥ 1, lower (for (Sn )n≥1 ) if, with probability 1,
Sn > ϕ∗ (n) for infinitely many n.

Using these definitions, and appealing to (1) and (2), we can say that every function ϕ* = ε√n log n, ε > 0, is upper, whereas every function ϕ∗ = ε√n, ε > 0, is lower.
Let ϕ = ϕ(n) be a function and ϕε* = (1 + ε)ϕ, ϕ*ε = (1 − ε)ϕ, where ε > 0. Then it is easily seen that

    {lim sup_n Sn/ϕ(n) ≤ 1} = {lim_n sup_{m≥n} Sm/ϕ(m) ≤ 1}
      ⇔ {sup_{m≥n1(ε,ω)} Sm/ϕ(m) ≤ 1 + ε for any ε > 0 and some n1(ε, ω)}
      ⇔ {Sm ≤ (1 + ε)ϕ(m) for any ε > 0, from some n1(ε, ω) on}. (3)

In the same way,

    {lim sup_n Sn/ϕ(n) ≥ 1} = {lim_n sup_{m≥n} Sm/ϕ(m) ≥ 1}
      ⇔ {sup_{m≥n} Sm/ϕ(m) ≥ 1 − ε for any ε > 0 and every n}
      ⇔ {for any ε > 0, Sm ≥ (1 − ε)ϕ(m) for infinitely many m}. (4)

It follows from (3) and (4) that to verify that each function ϕε* = (1 + ε)ϕ, ε > 0, is upper, we must show that

    P{lim sup Sn/ϕ(n) ≤ 1} = 1, (5)

and to show that ϕ*ε = (1 − ε)ϕ, ε > 0, is lower, we must show that

    P{lim sup Sn/ϕ(n) ≥ 1} = 1. (6)

2. Theorem 1 (Law of the Iterated Logarithm). Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables with E ξi = 0 and E ξi² = σ² > 0. Then

    P{lim sup Sn/ψ(n) = 1} = 1, (7)

where

    ψ(n) = √(2σ²n log log n). (8)

For uniformly bounded random variables, the law of the iterated logarithm was established in 1924 by Khinchin [46]. In 1929 Kolmogorov [48] generalized this result to a wide class of independent variables. Under the conditions of Theorem 1, the law of the iterated logarithm was established by Hartman and Wintner [40].
Since the proof of Theorem 1 is rather complicated, we shall confine ourselves to the special case where the random variables ξn are normal, ξn ∼ N(0, 1), n ≥ 1.
We begin by proving two auxiliary results.

Lemma 1. Let ξ1, . . . , ξn be independent random variables that are symmetrically distributed (P(ξk ∈ B) = P(−ξk ∈ B) for every B ∈ B(R), k ≤ n). Then for every real number a > 0

    P{max_{1≤k≤n} Sk > a} ≤ 2 P(Sn > a). (9)

PROOF. Let Ak = {Si ≤ a, i ≤ k − 1; Sk > a}, A = {max_{1≤k≤n} Sk > a}, and B = {Sn > a}. Since Ak ∩ B ⊇ Ak ∩ {Sn ≥ Sk}, we have

    P(Ak ∩ B) ≥ P(Ak ∩ {Sn ≥ Sk}) = P(Ak) P(Sn ≥ Sk) = P(Ak) P(ξk+1 + · · · + ξn ≥ 0).

By the symmetry of the distributions of the random variables ξ1, . . . , ξn, we have

    P(ξk+1 + · · · + ξn > 0) = P(ξk+1 + · · · + ξn < 0).

Hence P(ξk+1 + · · · + ξn ≥ 0) ≥ 1/2, and therefore

    P(B) ≥ Σ_{k=1}^n P(Ak ∩ B) ≥ (1/2) Σ_{k=1}^n P(Ak) = (1/2) P(A),

which establishes (9) (cf. the proof in Subsection 3 of Sect. 2, Chap. 8). □



Lemma 2. Let Sn ∼ N(0, σ²(n)), σ²(n) ↑ ∞, and let a(n), n ≥ 1, satisfy a(n)/σ(n) → ∞, n → ∞. Then

    P(Sn > a(n)) ∼ (σ(n)/(√(2π) a(n))) exp{−a²(n)/(2σ²(n))}. (10)

The proof follows from the asymptotic formula

    (1/√(2π)) ∫_x^∞ e^{−y²/2} dy ∼ (1/(√(2π) x)) e^{−x²/2},  x → ∞,

since Sn/σ(n) ∼ N(0, 1).

PROOF OF THEOREM 1 (for ξi ∼ N(0, 1)). Let us first establish (5). Let ε > 0, λ = 1 + ε, nk = λ^k, where k ≥ k0, and k0 is chosen so that log log nk0 is defined. We also define

    Ak = {Sn > λψ(n) for some n ∈ (nk, nk+1]} (11)

and put

    A = {Ak i.o.} = {Sn > λψ(n) for infinitely many n}.

In accordance with (3), we can establish (5) by showing that P(A) = 0. Let us show that Σ P(Ak) < ∞. Then P(A) = 0 by the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1).
From (11), (9), and (10) we find that

    P(Ak) ≤ P{Sn > λψ(nk) for some n ∈ (nk, nk+1]}
      ≤ P{Sn > λψ(nk) for some n ≤ nk+1}
      ≤ 2 P{S_{nk+1} > λψ(nk)} ∼ (2√(nk+1)/(√(2π) λψ(nk))) exp{−λ²ψ²(nk)/(2nk+1)}
      ≤ C1 exp(−λ log log λ^k) ≤ C2 e^{−λ log k} = C2 k^{−λ},

where C1 and C2 are constants. But Σ_{k=1}^∞ k^{−λ} < ∞ (since λ = 1 + ε > 1), and therefore Σ P(Ak) < ∞.
Consequently, (5) is established.


We turn now to the proof of (6). In accordance with (4), we must show that, with λ = 1 − ε, ε > 0, we have with probability 1 that Sn ≥ λψ(n) for infinitely many n.
Let us apply (5), which we just proved, to the sequence (−Sn)n≥1. Then we find that for all n, with finitely many exceptions, −Sn ≤ 2ψ(n) (P-a.s.). Consequently, if nk = N^k, N > 1, then for sufficiently large k,

    S_{nk−1} ≥ −2ψ(nk−1),

and hence

    S_{nk} ≥ Yk − 2ψ(nk−1), (12)

where Yk = S_{nk} − S_{nk−1}.
Hence, if we show that for infinitely many k

    Yk > λψ(nk) + 2ψ(nk−1), (13)

this and (12) show that (P-a.s.) S_{nk} > λψ(nk) for infinitely many k. Take some λ′ ∈ (λ, 1). Then there is an N > 1 such that for all k

    λ′[2(N^k − N^{k−1}) log log N^k]^{1/2} > λ(2N^k log log N^k)^{1/2} + 2(2N^{k−1} log log N^{k−1})^{1/2} ≡ λψ(N^k) + 2ψ(N^{k−1}).

It is now enough to show that

    Yk > λ′[2(N^k − N^{k−1}) log log N^k]^{1/2} (14)

for infinitely many k. Evidently Yk ∼ N(0, N^k − N^{k−1}). Therefore, by Lemma 2,

    P{Yk > λ′[2(N^k − N^{k−1}) log log N^k]^{1/2}} ∼ (1/(√(2π) λ′(2 log log N^k)^{1/2})) e^{−(λ′)² log log N^k}
      ≥ C1 k^{−(λ′)²}/(log k)^{1/2} ≥ C2/(k log k).

Since Σ 1/(k log k) = ∞ and the increments Yk are independent, it follows from the second part of the Borel–Cantelli lemma that, with probability 1, inequality (14) is satisfied for infinitely many k, so that (6) is established.
This completes the proof of the theorem. □

Remark 1. Applying (7) to the random variables (−Sn)n≥1, we find that (P-a.s.)

    lim inf Sn/ψ(n) = −1. (15)

It follows from (7) and (15) that the law of the iterated logarithm can be put in the form

    P{lim sup |Sn|/ψ(n) = 1} = 1. (16)

Remark 2. The law of the iterated logarithm says that for every ε > 0 each function ψε* = (1 + ε)ψ is upper and ψ*ε = (1 − ε)ψ is lower.
The conclusion (7) is also equivalent to the statement that, for each ε > 0,

    P{|Sn| ≥ (1 − ε)ψ(n) i.o.} = 1,
    P{|Sn| ≥ (1 + ε)ψ(n) i.o.} = 0.

3. PROBLEMS

1. Let ξ1, ξ2, . . . be a sequence of independent random variables with ξn ∼ N(0, 1). Show that

    P{lim sup ξn/√(2 log n) = 1} = 1.

2. Let ξ1, ξ2, . . . be a sequence of independent random variables, distributed according to Poisson's law with parameter λ > 0. Show that (regardless of λ)

    P{lim sup (ξn log log n)/log n = 1} = 1.

3. Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables with

    E e^{itξ1} = e^{−|t|^α},  0 < α < 2.

Show that

    P{lim sup (|Sn|/n^{1/α})^{1/(log log n)} = e^{1/α}} = 1.

4. Establish the following generalization of (9). Let ξ1, . . . , ξn be independent random variables, and let S0 = 0, Sk = ξ1 + · · · + ξk. Then Lévy's inequality

    P{max_{0≤k≤n} [Sk + μ(Sn − Sk)] > a} ≤ 2 P(Sn > a)

holds for every real a > 0, where μ(ξ) is the median of ξ, i.e., a constant such that

    P(ξ ≥ μ(ξ)) ≥ 1/2,   P(ξ ≤ μ(ξ)) ≥ 1/2.

5. Let ξ1, . . . , ξn be independent random variables, and let S0 = 0, Sk = ξ1 + · · · + ξk. Prove that:
(a) (In addition to Problem 4)

    P{max_{1≤k≤n} |Sk + μ(Sn − Sk)| ≥ a} ≤ 2 P{|Sn| ≥ a},

where μ(ξ) is the median of ξ;
(b) If ξ1, . . . , ξn are identically distributed and symmetric, then

    1 − e^{−n P{|ξ1|>x}} ≤ P{max_{1≤k≤n} |ξk| > x} ≤ 2 P{|Sn| > x}.

6. Let ξ1, . . . , ξn be independent random variables with E ξi = 0, 1 ≤ i ≤ n, and let Sk = ξ1 + · · · + ξk. Show that

    P{max_{1≤k≤n} Sk > a} ≤ 2 P{Sn ≥ a − E |Sn|} for a > 0.

7. Let ξ1, . . . , ξn be independent random variables such that E ξi = 0, σ² = E ξi² < ∞, and |ξi| ≤ C (P-a.s.), i ≤ n. Let Sn = ξ1 + · · · + ξn. Show that

    E e^{xSn} ≤ exp{2⁻¹nx²σ²(1 + xC)} for any 0 ≤ x ≤ 2C⁻¹.

Under the same assumptions, show that if (an) is a sequence of real numbers such that an/√n → ∞ but an = o(n), then for any ε > 0 and sufficiently large n

    P{Sn > an} > exp{−(an²/(2nσ²))(1 + ε)}.

8. Let ξ1, . . . , ξn be independent random variables such that E ξi = 0, |ξi| ≤ C (P-a.s.), i ≤ n. Let Dn = Σ_{i=1}^n Var ξi. Show that Sn = ξ1 + · · · + ξn satisfies the inequality (Yu. V. Prohorov)

    P{Sn ≥ a} ≤ exp{−(a/(2C)) arcsinh(aC/(2Dn))},  a ∈ R.

5. Probabilities of Large Deviations

1. Consider the Bernoulli scheme treated in Sect. 6, Chap. 1, Vol. 1. For this scheme, the de Moivre–Laplace theorem provides an approximation for the probabilities of standard (normal) deviations |Sn − np| ≥ ε√n, i.e., deviations of Sn from the central value np by a quantity of order √n. In the same Sect. 6, Chap. 1, Vol. 1, we gave a bound for probabilities of so-called large deviations |Sn − np| ≥ εn, i.e., deviations of Sn from np of order n:

    P{|Sn/n − p| ≥ ε} ≤ 2e^{−2nε²} (1)

(see (42) in Sect. 6, Chap. 1, Vol. 1). From this, of course, there follow the inequalities

    P{sup_{m≥n} |Sm/m − p| ≥ ε} ≤ Σ_{m≥n} P{|Sm/m − p| ≥ ε} ≤ (2/(1 − e^{−2ε²})) e^{−2nε²}, (2)

which provide an idea of the rate of convergence of Sn/n to p with probability 1.
We now consider the question of the validity of formulas of the types (1) and (2)
in a more general situation, when Sn = ξ1 + · · · + ξn is a sum of independent
identically distributed random variables.
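Inequality (1) can be checked numerically against the exact binomial tail; the parameters n, p, ε below are arbitrary choices for this illustration.

```python
# Deterministic check of the large-deviation bound (1) in the Bernoulli scheme:
# the exact tail P{|S_n/n - p| >= eps} never exceeds 2*exp(-2*n*eps^2).
import math

def binom_pmf(n, k, p):
    """Probability that a Binomial(n, p) variable equals k."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, eps = 100, 0.5, 0.1
exact = sum(binom_pmf(n, k, p) for k in range(n + 1) if abs(k / n - p) >= eps)
bound = 2 * math.exp(-2 * n * eps**2)
print(exact, bound)  # the exact tail lies below the bound
```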

2. We say that a random variable ξ satisfies Cramér's condition if there is a neighborhood of zero such that for any λ in this neighborhood

    E e^{λ|ξ|} < ∞ (3)

(it can be shown that this condition is equivalent to an exponential decrease of P(|ξ| > x) as x → ∞).
Let

    ϕ(λ) = E e^{λξ} and ψ(λ) = log ϕ(λ). (4)

On the interior of the set

    Λ = {λ ∈ R : ψ(λ) < ∞} (5)

the function ψ(λ) is convex (from below) and infinitely differentiable. We also notice that

    ψ(0) = 0,   ψ′(0) = m (= E ξ),   ψ″(λ) ≥ 0.

We define the function

    H(a) = sup_λ [aλ − ψ(λ)],  a ∈ R, (6)

called the Cramér transform (of the distribution function F = F(x) of the random variable ξ). The function H(a) is also convex (from below), and its minimum is zero, attained at a = m.
If a > m, we have

    H(a) = sup_{λ>0} [aλ − ψ(λ)].

Then

    P{ξ ≥ a} ≤ inf_{λ>0} E e^{λ(ξ−a)} = inf_{λ>0} e^{−[aλ−ψ(λ)]} = e^{−H(a)}. (7)
Similarly, for a < m we have H(a) = sup_{λ<0} [aλ − ψ(λ)] and

    P{ξ ≤ a} ≤ e^{−H(a)}. (8)

Consequently (cf. (42) in Sect. 6, Chap. 1, Vol. 1),

    P{|ξ − m| ≥ ε} ≤ 2e^{−min{H(m−ε), H(m+ε)}}. (9)

If ξ, ξ1, . . . , ξn are independent identically distributed random variables that satisfy Cramér's condition (3), Sn = ξ1 + · · · + ξn, ψn(λ) = log E exp(λSn/n), ψ(λ) = log E e^{λξ}, and

    Hn(a) = sup_λ [aλ − ψn(λ)], (10)

then

    Hn(a) = nH(a) = n sup_λ [aλ − ψ(λ)],

and inequalities (7), (8), and (9) assume the following forms:

    P{Sn/n ≥ a} ≤ e^{−nH(a)},  a > m, (11)
    P{Sn/n ≤ a} ≤ e^{−nH(a)},  a < m, (12)
    P{|Sn/n − m| ≥ ε} ≤ 2e^{−min{H(m−ε), H(m+ε)}·n}. (13)

Remark 1. Results of the type

    P{|Sn/n − m| ≥ ε} ≤ ae^{−bn}, (14)

where a > 0 and b > 0, indicate exponential convergence "adjusted" by the constants a and b. In the theory of large deviations, such results are often presented in a somewhat different, "cruder," form,

    lim sup_n (1/n) log P{|Sn/n − m| ≥ ε} < 0, (15)

that clearly arises from (14) and refers to the "exponential" rate of convergence, but without specifying the values of the constants a and b.
Now we turn to the question of upper bounds for the probabilities

    P{sup_{k≥n} Sk/k > a},   P{inf_{k≥n} Sk/k < a},   P{sup_{k≥n} |Sk/k − m| > ε},

which can provide definite bounds on the rate of convergence in the strong law of large numbers.
Let us suppose that the independent identically distributed nondegenerate random variables ξ, ξ1, ξ2, . . . satisfy Cramér's condition (3).
We fix n ≥ 1 and set

    κ = min{k ≥ n : Sk/k > a},

taking κ = ∞ if Sk/k ≤ a for all k ≥ n.
In addition, let a and λ > 0 satisfy

    λa − log ϕ(λ) ≥ 0. (16)

Then

    P{sup_{k≥n} Sk/k > a} = P{∪_{k≥n} {Sk/k > a}}
      = P{Sκ/κ > a, κ < ∞} = P{e^{λSκ} > e^{λaκ}, κ < ∞}
      = P{e^{λSκ − κ log ϕ(λ)} > e^{κ(λa − log ϕ(λ))}, κ < ∞}
      ≤ P{e^{λSκ − κ log ϕ(λ)} > e^{n(λa − log ϕ(λ))}, κ < ∞}
      ≤ P{sup_{k≥n} e^{λSk − k log ϕ(λ)} ≥ e^{n(λa − log ϕ(λ))}}. (17)

To take the final step, we notice that the sequence of random variables

    e^{λSk − k log ϕ(λ)},  k ≥ 1,

forms a martingale with respect to the flow of σ-algebras Fk = σ{ξ1, . . . , ξk}, k ≥ 1. (For more details, see Chap. 7 and, in particular, Example 2 in Sect. 1 therein.) Then it follows from inequality (8) in Sect. 3, Chap. 7, that

    P{sup_{k≥n} e^{λSk − k log ϕ(λ)} ≥ e^{n(λa − log ϕ(λ))}} ≤ e^{−n(λa − log ϕ(λ))},

and consequently (assuming (16)) we obtain the inequality

    P{sup_{k≥n} Sk/k > a} ≤ e^{−n(λa − log ϕ(λ))}. (18)

Let a > m. Since the function f(λ) = aλ − log ϕ(λ) has the properties f(0) = 0, f′(0) > 0, there is a λ > 0 for which (16) is satisfied, and consequently we obtain from (18) that if a > m, then

    P{sup_{k≥n} Sk/k > a} ≤ e^{−n sup_{λ>0}[λa − log ϕ(λ)]} = e^{−nH(a)}. (19)
Similarly, if a < m, then

    P{inf_{k≥n} Sk/k < a} ≤ e^{−n sup_{λ<0}[λa − log ϕ(λ)]} = e^{−nH(a)}. (20)

From (19) and (20) we obtain

    P{sup_{k≥n} |Sk/k − m| > ε} ≤ 2e^{−min[H(m−ε), H(m+ε)]·n}. (21)

Remark 2. The fact that the right-hand sides of inequalities (11) and (19) are the same suggests that this situation is not accidental. Indeed, the explanation lies in the property that the sequences (Sk/k)_{n≤k≤N} form, for every n ≤ N, reversed martingales (see Problem 5 in Sect. 1, Chap. 7, and Example 4 in Sect. 11, Chap. 1, Vol. 1).

2. PROBLEMS
1. Carry out the proof of inequalities (8) and (20).
2. Verify that under condition (3), the function ψ(λ) is convex (from below) on the
interior of the set Λ (see (5)) (and strictly convex provided ξ is nondegenerate)
and infinitely differentiable.
3. Assuming that ξ is nondegenerate, prove that the function H(a) is differentiable
on the whole real line and is convex (from below).
4. Prove the following inversion formula for Cramér’s transform:

       ψ(λ) = sup_a [λa − H(a)]

   (for all λ, except, possibly, the endpoints of the set Λ = {λ : ψ(λ) < ∞}).
5. Let Sn = ξ1 + · · · + ξn , where ξ1 , . . . , ξn , n ≥ 1, are independent identically
distributed simple random variables with E ξ1 < 0, P{ξ1 > 0} > 0. Let ϕ(λ) =
E eλξ1 and inf λ ϕ(λ) = ρ (0 < ρ < 1).
   Show that the following result (Chernoff’s theorem) holds:

       lim_n (1/n) log P{Sn ≥ 0} = log ρ.                                (22)
6. Using (22), prove that in the Bernoulli scheme (P{ξ1 = 1} = p, P{ξ1 = 0} = q)
       lim_n (1/n) log P{Sn ≥ nx} = −H(x),                               (23)

   for p < x < 1, where (cf. notation in Sect. 6, Chap. 1, Vol. 1)

       H(x) = x log (x/p) + (1 − x) log ((1 − x)/(1 − p)).

7. Let Sn = ξ1 + · · · + ξn , n ≥ 1, where ξ1 , ξ2 , . . . are independent identically dis-
   tributed random variables with E ξ1 = 0, Var ξ1 = 1. Let (xn )n≥1 be a sequence
   such that xn → ∞ and xn/√n → 0 as n → ∞. Show that

       P{Sn ≥ xn √n} = e^{−(xn^2/2)(1 + yn)},

   where yn → 0, n → ∞.
8. Derive from (23) that in the Bernoulli case (P{ξ1 = 1} = p, P{ξ1 = 0} = q)
we have:
   (a) For p < x < 1 and xn = n(x − p),

       P{Sn ≥ np + xn } = exp{ −nH(p + xn/n)(1 + o(1)) };                (24)

   (b) For xn = an √(npq) with an → ∞, an/√n → 0,

       P{Sn ≥ np + xn } = exp{ −(xn^2/(2npq))(1 + o(1)) }.               (25)
Compare (24) with (25) and both of them with the corresponding results in Sect. 6
of Chap. 1, Vol. 1.
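Chernoff's limit (22)–(23) can be checked numerically in the Bernoulli scheme, since the binomial tail is exactly computable. The sketch below is an illustration of ours (the parameters p = 0.3, x = 0.5 and the helper names are not from the text); it compares (1/n) log P{Sn ≥ nx} with −H(x):

```python
import math

# Numerical illustration of (23) for the Bernoulli scheme; the parameters
# and helper names here are illustrative choices, not the book's.

def log_binom_tail(n, x, p):
    # exact log P{S_n >= n x}, S_n ~ Binomial(n, p), computed in log space
    m = math.ceil(n * x)
    logs = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
        for k in range(m, n + 1)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs))

def H(x, p):
    # rate function of (23)
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))

p, x = 0.3, 0.5
for n in (50, 200, 800):
    print(n, round(log_binom_tail(n, x, p) / n, 4))
print(round(-H(x, p), 4))
```

The printed values approach −H(x) from below as n grows, consistent with (23) and with the exact Chernoff inequality P{Sn ≥ nx} ≤ e^{−nH(x)}.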
Chapter 5
Stationary (Strict Sense) Random
Sequences and Ergodic Theory

In the strict sense, the theory [of stationary stochastic processes] can be stated outside the
framework of probability theory as the theory of one-parameter groups of transformations
of a measure space that preserve the measure; this theory is very close to the general theory
of dynamical systems and to ergodic theory.
Encyclopaedia of Mathematics [42, Vol. 8, p. 479].

1. Stationary (Strict Sense) Random Sequences:
Measure-Preserving Transformations

1. Let (Ω, F , P) be a probability space and ξ = (ξ1 , ξ2 , . . .) a sequence of ran-


dom variables or, as we say, a random sequence. Let θk ξ denote the sequence
(ξk+1 , ξk+2 , . . .).

Definition 1. A random sequence ξ is stationary (in the strict sense) if the probabil-
ity distributions of θk ξ and ξ are the same for every k ≥ 1:

P((ξ1 , ξ2 , . . .) ∈ B) = P((ξk+1 , ξk+2 , . . .) ∈ B), B ∈ B(R∞ ).

The simplest example is a sequence ξ = (ξ1 , ξ2 , . . .) of independent identically


distributed random variables. Starting from such a sequence, we can construct a
broad class of stationary sequences η = (η1 , η2 , . . .) by choosing any Borel function
g(x1 , . . . , xn ) and setting ηk = g(ξk , ξk+1 , . . . , ξk+n−1 ).
If ξ = (ξ1 , ξ2 , . . .) is a sequence of independent identically distributed random
variables with E |ξ1 | < ∞ and E ξ1 = m, then the strong law of large numbers tells
us that, with probability 1,

ξ1 + · · · + ξ n
→ m, n → ∞.
n

© Springer Science+Business Media, LLC, part of Springer Nature 2019 33


A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95,
https://doi.org/10.1007/978-0-387-72208-5 2

In 1931, Birkhoff [6] obtained a remarkable generalization of this fact, which was
stated as a theorem of statistical mechanics dealing with the behavior of the “relative
residence time” of dynamical systems described by differential equations admitting
an integral invariant (“conservative systems”). Soon after, in 1932, Khinchin [47]
obtained an extension of Birkhoff’s theorem to a more general case of “stationary
motions of a multidimensional space within itself preserving the measure of a set.”
The following presentation of Birkhoff’s and Khinchin’s results will combine the
ideas of the theory of “dynamical systems” and the theory of “stationary (in the
strict sense) random sequences.”
In this presentation we will primarily concentrate on the “ergodic” results of
these theories.
2. Let (Ω, F , P) be a (complete) probability space.

Definition 2. A transformation T of Ω into itself is measurable if, for every A ∈ F ,

T −1 A = {ω : Tω ∈ A} ∈ F .

Definition 3. A measurable transformation T is a measure-preserving transforma-


tion (or morphism) if, for every A ∈ F,

P(T −1 A) = P(A).

Let T be a measure-preserving transformation, T n its nth iterate, and ξ1 = ξ1 (ω)


a random variable. Set ξn (ω) = ξ1 (T n−1 ω), n ≥ 2, and consider the sequence
ξ = (ξ1 , ξ2 , . . .). We claim that this sequence is stationary.
In fact, let A = {ω : ξ ∈ B} and A1 = {ω : θ1 ξ ∈ B}, where B ∈ B(R∞ ). Then
ω ∈ A1 if and only if Tω ∈ A, i.e., A1 = T −1 A. But P(T −1 A) = P(A), hence
P(A1 ) = P(A). Similarly, P(Ak ) = P(A) for any Ak = {ω : θk ξ ∈ B}, k ≥ 2.
Thus we can use measure-preserving transformations to construct stationary (in
strict sense) random sequences.
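A quick simulation makes this construction concrete. In the sketch below, the choices are purely illustrative (T is a rotation of [0, 1) by an irrational λ, ξ1(ω) = cos 2πω, and the sample size is arbitrary); it checks empirically that ξ1(ω) and ξ2(ω) = ξ1(Tω) have the same distribution when ω is sampled from the invariant (Lebesgue) measure.

```python
import math
import random

# Empirical check that ξ_n(ω) = ξ_1(T^{n-1} ω) is (distributionally) stationary
# when T preserves the measure; T, ξ_1, and λ below are illustrative choices.

LAM = math.sqrt(2) - 1.0          # irrational rotation amount

def T(w):
    return (w + LAM) % 1.0        # measure-preserving rotation of [0, 1)

def xi1(w):
    return math.cos(2 * math.pi * w)

rng = random.Random(1)
samples = [rng.random() for _ in range(200_000)]   # ω ~ Lebesgue on [0, 1)

# Compare the empirical distribution functions of ξ1(ω) and ξ1(Tω) at one
# point; since T preserves Lebesgue measure, both estimate the same probability.
f1 = sum(xi1(w) <= 0.3 for w in samples) / len(samples)
f2 = sum(xi1(T(w)) <= 0.3 for w in samples) / len(samples)
print(abs(f1 - f2) < 0.01)
```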
In a certain sense, there is a converse result: for every stationary sequence ξ
considered on (Ω, F , P) we can construct a new probability space (Ω̃, F˜ , P̃), a
random variable ξ˜1 (ω̃), and a measure-preserving transformation T̃, such that the
distribution of ξ˜ = {ξ˜1 (ω̃), ξ˜1 (T̃ ω̃), . . .} coincides with the distribution of ξ =
{ξ1 (ω), ξ2 (ω), . . . }.
In fact, take Ω̃ to be the coordinate space R∞ , and set F˜ = B(R∞ ), P̃ = Pξ ,
where Pξ (B) = P{ω : ξ ∈ B}, B ∈ B(R∞ ). The action of T̃ on Ω̃ is given by

T̃(x1 , x2 , . . .) = (x2 , x3 , . . .).

If ω̃ = (x1 , x2 , . . .), set

ξ˜1 (ω̃) = x1 , ξ˜n (ω̃) = ξ˜1 (T̃ n−1 ω̃), n ≥ 2.

Now let A = {ω̃ : (x1 , . . . , xk ) ∈ B}, B ∈ B(Rk ), and

T̃ −1 A = {ω̃ : (x2 , . . . , xk+1 ) ∈ B}.



Then the property of being stationary means that

P̃(A) = P{ω : (ξ1 , . . . , ξk ) ∈ B} = P{ω : (ξ2 , . . . , ξk+1 ) ∈ B} = P̃(T̃ −1 A),

i.e., T̃ is a measure-preserving transformation. Since P̃{ω̃ : (ξ˜1 , . . . , ξ˜k ) ∈ B} =


P{ω : (ξ1 , . . . , ξk ) ∈ B} for every k, it follows that ξ and ξ˜ have the same distribu-
tion.
What follows are some examples of measure-preserving transformations.

EXAMPLE 1. Let Ω = {ω1 , . . . , ωn } consist of n points (a finite number), n ≥ 2,


let F be the collection of its subsets, and let Tωi = ωi+1 , 1 ≤ i ≤ n − 1, and
Tωn = ω1 . If P(ωi ) = 1/n, then the transformation T is measure-preserving.

EXAMPLE 2. If Ω = [0, 1), F = B([0, 1)), P is the Lebesgue measure, λ ∈ [0, 1),
then Tx = (x + λ) mod 1 is a measure-preserving transformation.

Let us consider the physical hypotheses that lead to the consideration of measure-
preserving transformations.
Suppose that Ω is the phase space of a system that evolves (in discrete time)
according to a given law of motion. If ω is the state at instant n = 1, then T n ω,
where T is the translation operator induced by the given law of motion, is the state
attained by the system after n steps. Moreover, if A is some set of states ω, then
T −1 A = {ω : Tω ∈ A} is, by definition, the set of states ω that lead to A in one step.
Therefore, if we interpret Ω as an incompressible fluid, the condition P(T −1 A) =
P(A) can be thought of as the rather natural condition of conservation of volume.
(For the classical conservative Hamiltonian systems, Liouville’s theorem asserts that
the corresponding transformation T preserves the Lebesgue measure.)

3. One of the earliest results on measure-preserving transformations was Poincaré’s


recurrence theorem [63].

Theorem 1. Let (Ω, F , P) be a probability space, let T be a measure-preserving


transformation, and let A ∈ F . Then, for almost every point ω ∈ A, we have
T n ω ∈ A for infinitely many n ≥ 1.

PROOF. Let C = {ω ∈ A : T^n ω ∉ A for all n ≥ 1}. Since C ∩ T^{−n} C = ∅ for
all n ≥ 1, we have T^{−m} C ∩ T^{−(m+n)} C = T^{−m} (C ∩ T^{−n} C) = ∅. Therefore the
sequence {T^{−n} C}_{n≥0} consists of pairwise disjoint sets of equal measure. But

    Σ_{n=0}^{∞} P(C) = Σ_{n=0}^{∞} P(T^{−n} C) ≤ P(Ω) = 1,

and consequently P(C) = 0. Therefore, for almost
every point ω ∈ A, for at least one n ≥ 1, we have T n ω ∈ A. We will show that,
consequently, T n ω ∈ A for infinitely many n.
Let us apply the preceding result to T k , k ≥ 1. Then for every ω ∈ A \ N, where
N is a set of probability zero, which is the union of the corresponding sets related
to the various values of k, there is an nk such that (T k )nk ω ∈ A. It is then clear that
T n ω ∈ A for infinitely many n. This completes the proof of the theorem.
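Recurrence can be watched directly for the rotation of Example 2. In the sketch below (the rotation amount, the set A = [0, 0.1), the starting point, and the orbit length are illustrative choices of ours), a point of A returns to A again and again:

```python
import math

# Poincaré recurrence observed numerically for a rotation of [0, 1);
# the rotation amount, the set A = [0, 0.1), and the orbit length are
# illustrative choices.

LAM = (math.sqrt(5) - 1) / 2           # irrational rotation amount

def T(w):
    return (w + LAM) % 1.0

w = 0.05                               # a point of A = [0, 0.1)
returns = 0
for _ in range(10_000):
    w = T(w)
    if w < 0.1:                        # the orbit re-enters A
        returns += 1
print(returns > 500)
```

In fact the orbit visits A with limiting frequency P(A) = 0.1, which anticipates the ergodic theorems of Sect. 3.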



Corollary. Let ξ(ω) ≥ 0. Then

    Σ_{k=0}^{∞} ξ(T^k ω) = ∞   (P-a.s.)

on the set {ω : ξ(ω) > 0}.
In fact, let An = {ω : ξ(ω) ≥ 1/n}. Then, by the theorem, Σ_{k=0}^{∞} ξ(T^k ω) = ∞
(P-a.s.) on An , and the required result follows by letting n → ∞.

Remark. The theorem remains valid if we replace the probability measure P by


any finite measure μ with μ(Ω) < ∞.

4. PROBLEMS
1. Let T be a measure-preserving transformation and ξ = ξ(ω) a random variable
whose expectation E ξ(ω) exists. Show that E ξ(ω) = E ξ(Tω).
2. Show that the transformations in Examples 1 and 2 are measure-preserving.
3. Let Ω = [0, 1), F = B([0, 1)), and let P be a measure whose distribution func-
tion is continuous. Show that the transformations Tx = λx, 0 < λ < 1, and
Tx = x^2 are not measure-preserving.
4. Let Ω be the set of all sequences ω = (. . . , ω−1 , ω0 , ω1 , . . . ) of real numbers, F
the σ-algebra generated by measurable cylinders {ω : (ωk , . . . , ωk+n−1 ) ∈ Bn },
where n = 1, 2, . . . , k = 0, ±1, ±2, . . . , and Bn ∈ B(Rn ). Let P be a probability
measure on (Ω, F ), and let T be the two-sided transformation defined by

T(. . . , ω−1 , ω0 , ω1 , . . . ) = (. . . , ω0 , ω1 , ω2 , . . . ).

Show that T is measure-preserving if and only if

P{ω : (ω0 , . . . , ωn−1 ) ∈ Bn } = P{ω : (ωk , . . . , ωk+n−1 ) ∈ Bn }

for all n = 1, 2, . . . , k = 0, ±1, ±2, . . . , and Bn ∈ B(Rn ).


5. Let ξ0 , ξ1 , . . . be a stationary sequence of random elements taking values in a
Borel space S (see Definition 9 in Sect. 7, Chap. 2, Vol. 1). Show that one can con-
struct (maybe on an extended probability space) random elements ξ−1 , ξ−2 , . . .
with values in S such that the two-sided sequence . . . , ξ−1 , ξ0 , ξ1 , . . . is station-
ary.
6. Let T be a measurable transformation on (Ω, F , P), and let E be a π-system
of subsets of Ω that generates F (i.e., π(E ) = F ). Prove that if the equality
P(T −1 A) = P(A) holds for all A ∈ E , then it holds also for all A ∈ F (= π(E )).
7. Let T be a measure-preserving transformation on (Ω, F , P), and let G be a sub-
σ-algebra of F . Show that for any A ∈ F

P(A | G )(Tω) = P(T −1 A | T −1 G )(ω) (P-a.s.). (1)

In particular, let Ω = R∞ be the space of numerical sequences ω = (ω0 , ω1 , . . . )


and ξk (ω) = ωk . Let T be the shift transformation T(ω0 , ω1 , . . . ) = (ω1 , ω2 , . . . )

(in other words, if ξk (ω) = ωk , then ξk (Tω) = ωk+1 ). Then (1) becomes

P(A | ξn )(Tω) = P(T −1 A | ξn+1 )(ω) (P-a.s.).

2. Ergodicity and Mixing

1. In this section, T denotes a measure-preserving transformation on the probability


space (Ω, F , P).

Definition 1. A set A ∈ F is invariant if T^{−1}A = A. A set A ∈ F is almost
invariant if A and T^{−1}A differ only by a set of measure zero, i.e., P(A △ T^{−1}A) = 0.

It is easily verified that the classes I and I ∗ of invariant or almost invariant


sets, respectively, are σ-algebras.

Definition 2. A measure-preserving transformation T is ergodic (or metrically tran-


sitive) if every invariant set A has measure either zero or one.

Definition 3. A random variable η = η(ω) is invariant (or almost invariant) if


η(ω) = η(Tω) for all ω ∈ Ω (or for almost all ω ∈ Ω).

The following lemma establishes a connection between invariant and almost in-
variant sets.

Lemma 1. If A is an almost invariant set, then there is an invariant set B such that
P(A △ B) = 0.

PROOF. Let B = lim sup T^{−n} A. Then T^{−1}B = lim sup T^{−(n+1)} A = B, i.e., B ∈ I .
It is easily seen that A △ B ⊆ ∪_{k=0}^{∞} (T^{−k}A △ T^{−(k+1)}A). But

    P(T^{−k}A △ T^{−(k+1)}A) = P(A △ T^{−1}A) = 0.

Hence P(A △ B) = 0.

Lemma 2. A transformation T is ergodic if and only if every almost invariant set


has measure zero or one.

PROOF. Let A ∈ I ∗ ; then, by Lemma 1, there is an invariant set B such that
P(A △ B) = 0. But T is ergodic, and therefore P(B) = 0 or 1. Therefore P(A) = 0
or 1. The converse is evident, since I ⊆ I ∗ .

Theorem 1. Let T be a measure-preserving transformation. Then the following con-


ditions are equivalent:
(1) T is ergodic;

(2) Every almost invariant random variable is P-a.s. constant;


(3) Every invariant random variable is P-a.s. constant.
PROOF. (1) ⇔ (2). Let T be ergodic and ξ almost invariant, i.e., ξ(ω) = ξ(Tω)
(P-a.s.). Then for every c ∈ R we have Ac = {ω : ξ(ω) ≤ c} ∈ I ∗ , and then
P(Ac ) = 0 or 1 by Lemma 2. Let C = sup{c : P(Ac ) = 0}. Since Ac ↑ Ω as c ↑ ∞
and Ac ↓ ∅ as c ↓ −∞, we have |C| < ∞. Then

    P{ω : ξ(ω) < C} = P( ∪_{n=1}^{∞} {ξ(ω) ≤ C − 1/n} ) = 0.

And, similarly, P{ω : ξ(ω) > C} = 0. Consequently, P{ω : ξ(ω) = C} = 1.


(2) ⇒ (3). Evident.
(3) ⇒ (1). Let A ∈ I ; then IA is an invariant random variable, and therefore
(P-a.s.) IA = 0 or IA = 1, whence P(A) = 0 or 1.


Remark 1. The conclusion of the theorem remains valid in the case where “random
variable” is replaced by “bounded random variable.”
We illustrate the theorem with the following example.
EXAMPLE. Let Ω = [0, 1), F = B([0, 1)), let P be the Lebesgue measure, and let
Tω = (ω + λ) mod 1. Let us show that T is ergodic if and only if λ is irrational.
Let ξ = ξ(ω) be an invariant random variable with E ξ^2 (ω) < ∞. Then we know
that the Fourier series Σ_{n=−∞}^{∞} cn e^{2πinω} of ξ(ω) converges in the mean-square
sense, Σ |cn |^2 < ∞, and, because T is a measure-preserving transformation (Example 2,
Sect. 1), we have (Problem 1, Sect. 1) that, since the random variable ξ is invariant,

    cn = E ξ(ω)e^{−2πinω} = E ξ(Tω)e^{−2πinTω} = e^{−2πinλ} E ξ(Tω)e^{−2πinω}
       = e^{−2πinλ} E ξ(ω)e^{−2πinω} = cn e^{−2πinλ}.

Thus, cn (1 − e^{−2πinλ}) = 0. By hypothesis, λ is irrational, and therefore e^{−2πinλ} ≠ 1
for all n ≠ 0. Therefore cn = 0, n ≠ 0, so ξ(ω) = c0 (P-a.s.), and T is ergodic by
Theorem 1.
On the other hand, let λ be rational, i.e., λ = k/m, where k and m are integers.
Consider the set

    A = ∪_{k=0}^{m−1} { ω : 2k/(2m) ≤ ω < (2k+1)/(2m) }.

It is clear that this set is invariant; but P(A) = 1/2. Consequently, T is not ergodic.
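This dichotomy can be seen numerically through time averages, anticipating the ergodic theorems of Sect. 3. In the sketch below (an illustration of ours: the test function cos 8πω and the particular values of λ are arbitrary choices), the time average along an irrational rotation approaches the space average 0, while for λ = 1/4 the harmonic cos 8πω is invariant and the average stays at its initial value:

```python
import math

# Time averages of f(ω) = cos(8πω) under rotations Tω = (ω + λ) mod 1.
# f is invariant when λ = 1/4 (since 4λ is an integer), so the time average
# then equals f(ω0) and depends on the starting point: T is not ergodic.
# The function f and the constants below are illustrative choices.

def time_average(lam, w0, n):
    w, total = w0, 0.0
    for _ in range(n):
        total += math.cos(8 * math.pi * w)
        w = (w + lam) % 1.0
    return total / n

n, w0 = 100_000, 0.1
avg_irr = time_average((math.sqrt(5) - 1) / 2, w0, n)    # irrational λ
avg_rat = time_average(0.25, w0, n)                      # rational λ = 1/4
print(abs(avg_irr) < 1e-3)                               # ≈ ∫ f dω = 0
print(abs(avg_rat - math.cos(0.8 * math.pi)) < 1e-6)     # stuck at f(ω0)
```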
2. Definition 4. A measure-preserving transformation is mixing (or has the mixing
property) if, for all A and B ∈ F ,

    lim_{n→∞} P(A ∩ T^{−n}B) = P(A) P(B).                                (1)

The following theorem establishes a connection between ergodicity and mixing.



Theorem 2. Every mixing transformation T is ergodic.


PROOF. Let A ∈ F , B ∈ I . Then B = T −n B, n ≥ 1, and therefore

P(A ∩ T −n B) = P(A ∩ B)

for all n ≥ 1. Because of (1), P(A ∩ B) = P(A) P(B). Hence we find, when A = B,
that P(B) = P2 (B), and consequently P(B) = 0 or 1. This completes the proof.
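Mixing can be observed numerically. A standard mixing example (the map, the sets A and B, and the sample size below are illustrative choices of ours, not taken from the text) is the doubling map Tω = 2ω (mod 1) on ([0, 1), Lebesgue measure), for which P(A ∩ T^{−n}B) settles at P(A) P(B) already for moderate n:

```python
import random

# Monte Carlo check of the mixing property (1) for the doubling map
# Tω = 2ω (mod 1) with A = [0, 0.5), B = [0, 0.3); all choices are
# illustrative.

def T_iter(w, n):
    for _ in range(n):
        w = (2 * w) % 1.0
    return w

rng = random.Random(2)
n, samples = 10, 200_000
hits = 0
for _ in range(samples):
    w = rng.random()
    if w < 0.5 and T_iter(w, n) < 0.3:       # ω ∈ A and T^n ω ∈ B
        hits += 1
estimate = hits / samples                    # ≈ P(A ∩ T^{-n}B)
print(abs(estimate - 0.5 * 0.3) < 0.01)      # ≈ P(A) P(B) = 0.15
```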

3. PROBLEMS
1. Show that a random variable ξ is invariant if and only if it is I -measurable.
2. Show that a set A is almost invariant if and only if P(T −1 A \ A) = 0.
3. Show that a transformation T is mixing if and only if, for all random variables ξ
and η with E ξ 2 < ∞ and E η 2 < ∞,

E ξ(T n ω)η(ω) → E ξ(ω) E η(ω), n → ∞.

4. Give an example of a measure-preserving ergodic transformation that is not mix-


ing.
5. Let T be a measure-preserving transformation on (Ω, F , P). Let A be an algebra
of subsets of Ω and σ(A ) = F . Suppose that Definition 1 requires only that the
property
lim P(A ∩ T −n B) = P(A) P(B)
n→∞

be satisfied for sets A and B in A . Show that this property will then hold for all
A and B in F = σ(A ) (and therefore the transformation T is mixing).
Show that this statement remains true if A is a π-system such that π(A ) = F .
6. Let A be an almost invariant set. Show that ω ∈ A (P-a.s.) if and only if T n ω ∈ A
for all n = 1, 2, . . . (cf. Theorem 1 in Sect. 1.)
7. Give examples of measure-preserving transformations T on (Ω, F , P) such that
(a) A ∈ F does not imply that TA ∈ F and (b) A ∈ F and TA ∈ F do not
imply that P(A) = P(TA).
8. Let T be a measurable transformation on (Ω, F ), and let P be the set of proba-
bility measures P with respect to which T is measure-preserving. Show that:
(a) The set P is convex;
(b) T is an ergodic transformation with respect to P if and only if P is an
extreme point of P (i.e., P cannot be represented as P = λ1 P1 +λ2 P2
with λ1 > 0, λ2 > 0, λ1 + λ2 = 1, P1 ≠ P2 , and P1 , P2 ∈ P).

3. Ergodic Theorems

1. Theorem 1 (Birkhoff and Khinchin). Let T be a measure-preserving transforma-


tion and ξ = ξ(ω) a random variable with E |ξ| < ∞. Then (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) = E(ξ | I ),                      (1)

where I is the invariant σ-algebra. If also T is ergodic, then (P-a.s.)

1
n−1
lim ξ(T k ω) = E ξ. (2)
n n
k=0

The proof given below is based on the following proposition, whose simple proof
was given by Garsia [28].
Lemma (Maximal Ergodic Theorem). Let T be a measure-preserving transforma-
tion, let ξ be a random variable with E |ξ| < ∞, and let

Sk (ω) = ξ(ω) + ξ(Tω) + · · · + ξ(T k−1 ω),


Mk (ω) = max{0, S1 (ω), . . . , Sk (ω)}.

Then
E[ξ(ω)I{Mn >0} (ω)] ≥ 0
for every n ≥ 1.

PROOF. If n > k, we have Mn (Tω) ≥ Sk (Tω), and therefore ξ(ω) + Mn (Tω) ≥


ξ(ω)+Sk (Tω) = Sk+1 (ω). Since it is evident that ξ(ω) = S1 (ω) ≥ S1 (ω)−Mn (Tω),
we have
ξ(ω) ≥ max{S1 (ω), . . . , Sn (ω)} − Mn (Tω).
Therefore, since {Mn (ω) > 0} = {max(S1 (ω), . . . , Sn (ω)) > 0},

E[ξ(ω)I{Mn >0} (ω)] ≥ E[(max(S1 (ω), . . . , Sn (ω)) − Mn (Tω))I{Mn >0} (ω)]


≥ E{(Mn (ω) − Mn (Tω))I{Mn (ω)>0} } ≥ E{Mn (ω) − Mn (Tω)} = 0,

where we have used the fact that if T is a measure-preserving transformation, then


E Mn (ω) = E Mn (Tω) (Problem 1 in Sect. 1).
This completes the proof of the lemma.


PROOF OF THEOREM 1. Let us suppose that E(ξ | I ) = 0 (otherwise, replace ξ by
ξ − E(ξ | I )).
Let η̄ = lim sup(Sn /n) and η̲ = lim inf(Sn /n). It will be enough to establish that

    0 ≤ η̲ ≤ η̄ ≤ 0   (P-a.s.).

Consider the random variable η̄ = η̄(ω). Since η̄(ω) = η̄(Tω), the variable η̄ is
invariant, and consequently, for every ε > 0, the set Aε = {η̄(ω) > ε} is also
invariant. Let us introduce the new random variable

    ξ ∗ (ω) = (ξ(ω) − ε) I_{Aε}(ω),



and set

Sk∗ (ω) = ξ ∗ (ω) + · · · + ξ ∗ (T k−1 ω), Mk∗ (ω) = max(0, S1∗ , . . . , Sk∗ ).

Then, by the lemma,


E[ξ ∗ I{Mn∗ >0} ] ≥ 0
for every n ≥ 1. But as n → ∞,

    {Mn∗ > 0} = { max_{1≤k≤n} Sk∗ > 0 } ↑ { sup_{k≥1} Sk∗ > 0 } = { sup_{k≥1} Sk∗/k > 0 }
              = { sup_{k≥1} Sk/k > ε } ∩ Aε = Aε ,

where the last equation follows because sup_{k≥1} (Sk/k) ≥ η̄ and Aε = {ω : η̄ > ε}.
Moreover, E |ξ ∗ | ≤ E |ξ| + ε. Hence, by the dominated convergence theorem,

    0 ≤ E[ξ ∗ I_{Mn∗ >0} ] → E[ξ ∗ I_{Aε} ].

Thus,

    0 ≤ E[ξ ∗ I_{Aε} ] = E[(ξ − ε) I_{Aε} ] = E[ξ I_{Aε} ] − ε P(Aε )
      = E[E(ξ | I ) I_{Aε} ] − ε P(Aε ) = −ε P(Aε ),

so that P(Aε ) = 0, and therefore P(η̄ ≤ 0) = 1.


Similarly, if we consider −ξ(ω) instead of ξ(ω), we find that

    lim sup ( −Sn/n ) = − lim inf ( Sn/n ) = −η̲,

and P(−η̲ ≤ 0) = 1, i.e., P(η̲ ≥ 0) = 1. Therefore 0 ≤ η̲ ≤ η̄ ≤ 0 (P-a.s.), and the
first part of the theorem is established.
To prove the second part, we observe that since E(ξ | I ) is an invariant random
variable, we have E(ξ | I ) = E ξ (P-a.s.) in the ergodic case.
This completes the proof of the theorem.

Corollary. A measure-preserving transformation T is ergodic if and only if, for all


A and B ∈ F ,

    lim_n (1/n) Σ_{k=0}^{n−1} P(A ∩ T^{−k}B) = P(A) P(B).                (3)

To prove the ergodicity of T, we let A = B ∈ I in (3). Then A ∩ T −k B = B, and


therefore P(B) = P2 (B), i.e., P(B) = 0 or 1. Conversely, let T be ergodic. Then, if
we apply (2) to the random variable ξ = IB (ω), where B ∈ F , we find that (P-a.s.)

    lim_n (1/n) Σ_{k=0}^{n−1} I_{T^{−k}B} (ω) = P(B).

If we now integrate both sides over A ∈ F and use the dominated convergence
theorem, we obtain (3), as required.

2. We now show that, under the hypotheses of Theorem 1, there is not only almost
sure convergence in (1) and (2), but also convergence in the mean. (This result will
be used subsequently in the proof of Theorem 3.)

Theorem 2. Let T be a measure-preserving transformation, and let ξ = ξ(ω) be a


random variable with E |ξ| < ∞. Then

    E | (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) − E(ξ | I ) | → 0,   n → ∞.         (4)

If also T is ergodic, then

    E | (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) − E ξ | → 0,   n → ∞.               (5)

PROOF. For every ε > 0 there is a bounded random variable η (|η(ω)| ≤ M) such
that E |ξ − η| ≤ ε. Then

    E | (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) − E(ξ | I ) |
        ≤ E | (1/n) Σ_{k=0}^{n−1} (ξ(T^k ω) − η(T^k ω)) |
        + E | (1/n) Σ_{k=0}^{n−1} η(T^k ω) − E(η | I ) | + E | E(ξ | I ) − E(η | I ) |.   (6)

Since |η| ≤ M, by the dominated convergence theorem and using (1), we find that
the second term on the right-hand side of (6) tends to zero as n → ∞. The first
and third terms are each at most ε. Hence, for sufficiently large n, the left-hand side
of (6) is less than 3ε, so that (4) is proved. Finally, if T is ergodic, then (5) follows
from (4) and the remark that E(ξ | I ) = E ξ (P-a.s.).
This completes the proof of the theorem.

3. We now turn to the question of the validity of the ergodic theorem for station-
ary (in strict sense) random sequences ξ = (ξ1 , ξ2 , . . .) defined on a probabil-
ity space (Ω, F , P). In general, (Ω, F , P) need not carry any measure-preserving
transformations, so that it is not possible to apply Theorem 1 directly. However, as
we observed in Sect. 1, we can construct a coordinate probability space (Ω̃, F˜ , P̃),
random variables ξ˜ = (ξ˜1 , ξ˜2 , . . .), and a measure-preserving transformation T̃ such
that ξ˜n (ω̃) = ξ˜1 (T̃ n−1 ω̃) and the distributions of ξ and ξ˜ are the same. Since such

properties as almost sure convergence and convergence in the mean are determined
by the probability distribution, from the convergence of (1/n) Σ_{k=1}^{n} ξ̃1 (T̃^{k−1} ω̃)
(P̃-a.s. and in the mean) to a random variable η̃ it follows that (1/n) Σ_{k=1}^{n} ξk (ω)
also converges (P-a.s. and in the mean) to a random variable η with the same
distribution as η̃. It follows from Theorem 1 that if Ẽ |ξ̃1 | < ∞, then η̃ = Ẽ(ξ̃1 | I˜),
where I˜ is the collection of invariant sets (Ẽ is the expectation with respect to the
measure P̃). We now describe the structure of η.

Definition 1. A set A ∈ F is invariant with respect to the sequence ξ if there is a


set B ∈ B(R∞ ) such that for n ≥ 1

A = {ω : (ξn , ξn+1 , . . .) ∈ B}.

The collection of all such invariant sets is a σ-algebra, denoted by Iξ .

Definition 2. A stationary sequence ξ is ergodic if the measure of every invariant


set is either 0 or 1.

Let us now show that if the random variable η is the limit (P-a.s. and in the
mean) of (1/n) Σ_{k=1}^{n} ξk (ω), n → ∞, then it can be taken equal to E(ξ1 | Iξ ). To this
end, notice that we can set

    η(ω) = lim sup_n (1/n) Σ_{k=1}^{n} ξk (ω).                           (7)

It follows from the definition of lim sup that for the random variable η(ω) so
defined, the sets {ω : η(ω) < y}, y ∈ R, are invariant, and therefore η is Iξ -
measurable. Now, let A ∈ Iξ . Then, since E | (1/n) Σ_{k=1}^{n} ξk − η | → 0, we have
for η defined by (7)

    (1/n) Σ_{k=1}^{n} ∫_A ξk d P → ∫_A η d P .                           (8)

Let B ∈ B(R^∞) be such that A = {ω : (ξk , ξk+1 , . . .) ∈ B} for all k ≥ 1. Then,
since ξ is stationary,

    ∫_A ξk d P = ∫_{{ω : (ξk ,ξk+1 ,...)∈B}} ξk d P = ∫_{{ω : (ξ1 ,ξ2 ,...)∈B}} ξ1 d P = ∫_A ξ1 d P .

Hence it follows from (8) that for all A ∈ Iξ ,

    ∫_A ξ1 d P = ∫_A η d P,

which implies (see (1) in Sect. 7, Chap. 2, Vol. 1) that (η being Iξ -measurable) η =
E(ξ1 | Iξ ). Here E(ξ1 | Iξ ) = E ξ1 if ξ is ergodic.
Therefore we have proved the following theorem.

Theorem 3 (Ergodic Theorem). Let ξ = (ξ1 , ξ2 , . . .) be a stationary (strict sense)


random sequence with E |ξ1 | < ∞. Then (P-a.s. and in the mean)

    lim_n (1/n) Σ_{k=1}^{n} ξk (ω) = E(ξ1 | Iξ ).

If ξ is also an ergodic sequence, then (P-a.s. and in the mean)

    lim_n (1/n) Σ_{k=1}^{n} ξk (ω) = E ξ1 .
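Theorem 3 is easy to observe numerically. The sketch below is an illustration of ours (the two-state Markov chain, its transition probabilities, and the run length are arbitrary choices, not from the book): it simulates a stationary ergodic sequence and compares the time average with E ξ1.

```python
import random

# Ergodic averages for a stationary two-state Markov chain (states 0 and 1),
# started from its stationary distribution; an irreducible aperiodic chain of
# this kind gives a stationary ergodic sequence.  All numerical choices here
# are illustrative.

P01, P10 = 0.3, 0.2                       # transition probabilities 0->1, 1->0
pi1 = P01 / (P01 + P10)                   # stationary probability of state 1 (= E ξ1)

rng = random.Random(3)
n = 200_000
state = 1 if rng.random() < pi1 else 0    # ξ1 ~ stationary distribution
total = 0
for _ in range(n):
    total += state
    flip = P01 if state == 0 else P10
    if rng.random() < flip:
        state = 1 - state
print(abs(total / n - pi1) < 0.02)        # time average ≈ E ξ1 = 0.6
```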

4. PROBLEMS
1. Let ξ = (ξ1 , ξ2 , . . .) be a Gaussian stationary sequence with E ξn = 0 and
covariance function R(n) = E ξk+n ξk . Show that R(n) → 0 is a sufficient con-
dition for the measure-preserving transformation related to ξ to be mixing (and,
hence, ergodic).
2. Show that for every sequence ξ = (ξ1 , ξ2 , . . .) of independent identically dis-
tributed random variables the corresponding measure-preserving transformation
is mixing.
3. Show that a stationary sequence ξ is ergodic if and only if

       (1/n) Σ_{i=1}^{n} I_B (ξi , . . . , ξi+k−1 ) → P((ξ1 , . . . , ξk ) ∈ B)   (P-a.s.)

for every B ∈ B(Rk ), k = 1, 2, . . . .


4. Let P and P̄ be two measures on the space (Ω, F ) such that the measure-
preserving transformation T is ergodic with respect to each of them. Prove that,
then, either P = P̄ or P ⊥ P̄.
5. Let T be a measure-preserving transformation on (Ω, F , P) and A an algebra
of subsets of Ω such that σ(A ) = F . Let

       I_A^{(n)} = (1/n) Σ_{k=0}^{n−1} I_A (T^k ω).

Prove that T is ergodic if and only if one of the following conditions holds:
   (a) I_A^{(n)} → P(A) in probability for any A ∈ A ;
   (b) lim_n (1/n) Σ_{k=0}^{n−1} P(A ∩ T^{−k}B) = P(A) P(B) for all A, B ∈ A ;
   (c) I_A^{(n)} → P(A) in probability for any A ∈ F .
6. Let T be a measure-preserving transformation on (Ω, F , P). Prove that T is
   ergodic (with respect to P) if and only if there is no measure P̄ ≠ P on (Ω, F )
   such that P̄ ≪ P and T is measure-preserving with respect to P̄.
7. (Bernoullian shifts.) Let S be a finite set (say, S = {1, 2, . . . , N}), and let Ω =
S∞ be the space of sequences ω = (ω0 , ω1 , . . . ) with ωi ∈ S. Set ξk (ω) = ωk ,

and define the shift transformation T(ω0 , ω1 , . . . ) = (ω1 , ω2 , . . . ), or, in terms


of ξk , ξk (Tω) = ωk+1 if ξk (ω) = ωk . Suppose that for i ∈ {1, 2, . . . , N} there are
nonnegative numbers pi such that Σ_{i=1}^{N} pi = 1 (i.e., (p1 , . . . , pN ) is a probability
distribution). Define the probability measure P on (S∞ , B(S∞ )) (see Sect. 3,
Chap. 2, Vol. 1) such that

P{ω : (ω1 , . . . , ωk ) = (u1 , . . . , uk )} = pu1 . . . puk .

In other words, this probability measure is introduced to provide the indepen-


dence of ξ0 (ω), ξ1 (ω), . . . . The shift transformation T (relative to this measure
P) is called the Bernoullian shift or the Bernoulli transformation.
Show that the Bernoulli transformation is mixing.
8. Let T be a measure-preserving transformation on (Ω, F , P). Use the notation
T −n F = {T −n A : A ∈ F }. We say that the σ-algebra


    F−∞ = ∩_{n=1}^{∞} T^{−n} F

is trivial (P-trivial) if every set in F−∞ has measure 0 or 1 (such transfor-


mations are referred to as Kolmogorov transformations). Prove that the Kol-
mogorov transformations are ergodic and, what is more, mixing.
9. Let 1 ≤ p < ∞, and let T be a measure-preserving transformation on a proba-
bility space (Ω, F , P). Consider a random variable ξ(ω) ∈ Lp (Ω, F , P).
Prove the following ergodic theorem in Lp (Ω, F , P) (von Neumann). There
   exists a random variable η(ω) such that

       E | (1/n) Σ_{k=0}^{n−1} ξ(T^k ω) − η(ω) |^p → 0,   n → ∞.

10. Borel’s normality theorem (Example 3 in Sect. 3, Chap. 4) states that the fraction
of ones and zeros in the binary expansion of a number ω in [0, 1) converges to
1/2 almost everywhere (with respect to the Lebesgue measure). Prove this result
by considering the transformation T : [0, 1) → [0, 1) defined by

T(ω) = 2ω (mod 1),

and using the ergodic Theorem 1.


11. As in Problem 10, let ω ∈ [0, 1). Consider the transformation T : [0, 1) → [0, 1)
    defined by

        T(ω) = 0 if ω = 0,   T(ω) = {1/ω} if ω ≠ 0,

    where {x} is the fractional part of x.

Show that T preserves the Gaussian measure P = P(·) on [0, 1) defined by



1 dx
P(A) = , A ∈ B([0, 1)).
log 2 A 1 + x

12. Show by an example that Poincaré’s recurrence theorem (Subsection 3 of
    Sect. 1) is, in general, false for measure spaces with infinite measure.
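The invariance asked for in Problem 11 can be checked by simulation. The sketch below is our own illustration (the sampling trick via the distribution function F(t) = log₂(1 + t), the cutoff t = 0.5, and the sample size are arbitrary choices): if X is drawn from the measure of Problem 11, then T(X) should again have that distribution.

```python
import math
import random

# Monte Carlo check that the map T(ω) = {1/ω} preserves the measure
# P(A) = (1/log 2) ∫_A dx/(1+x) of Problem 11.  Its distribution function
# is F(t) = log2(1+t), so X = 2^U − 1 with U uniform samples from P.
# The cutoff t = 0.5 and the sample size are illustrative choices.

rng = random.Random(4)
n, t = 200_000, 0.5
target = math.log2(1 + t)                 # P([0, t]) = log2(1.5)

hits = 0
for _ in range(n):
    x = 2.0 ** rng.random() - 1.0         # X ~ P
    y = (1.0 / x) % 1.0 if x > 0 else 0.0 # Y = T(X)
    hits += y <= t
print(abs(hits / n - target) < 0.01)      # T(X) again has distribution P
```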
Chapter 6
Stationary (Wide Sense) Random
Sequences: L2 -Theory

The [spectral] decomposition provides grounds for considering any stationary stochastic
process in the wide sense as a superposition of a set of non-correlated harmonic oscillations
with random amplitudes and phases.
Encyclopaedia of Mathematics [42, Vol. 8, p. 480].

1. Spectral Representation of the Covariance Function

1. According to the definition given in the preceding chapter, a random sequence


ξ = (ξ1 , ξ2 , . . .) is stationary in the strict sense if, for every set B ∈ B(R∞ ) and
every n ≥ 1,

P{(ξ1 , ξ2 , . . .) ∈ B} = P{(ξn+1 , ξn+2 , . . .) ∈ B}. (1)

It follows, in particular, that if E ξ12 < ∞, then E ξn does not depend on n:

E ξn = E ξ1 , (2)

and the covariance Cov(ξn+m , ξn ) = E(ξn+m − E ξn+m )(ξn − E ξn ) depends only
on m:

    Cov(ξn+m , ξn ) = Cov(ξ1+m , ξ1 ).                                   (3)
In this chapter we study sequences that are stationary in the wide sense (having
finite second moments), namely, those for which (1) is replaced by the (weaker)
conditions (2) and (3).
The random variables ξn are understood to be defined for n ∈ Z = {0, ± 1, . . .}
and to be complex-valued. The latter assumption not only does not complicate the
theory but makes it more elegant. It is also clear that results for real random variables


can easily be obtained as special cases of the corresponding results for complex
random variables.
Let H^2 = H^2 (Ω, F , P) be the space of (complex) random variables ξ = α + iβ,
α, β ∈ R, with E |ξ|^2 < ∞, where |ξ|^2 = α^2 + β^2 . If ξ and η ∈ H^2 , then we set

    (ξ, η) = E ξ η̄,                                                      (4)

where η̄ = γ − iδ is the complex conjugate of η = γ + iδ, and

    ‖ξ‖ = (ξ, ξ)^{1/2} .                                                 (5)

As for real random variables, the space H 2 (more precisely, the space of equiva-
lence classes of random variables; cf. Sects. 10 and 11 of Chap. 2, Vol. 1) is complete
under the scalar product (ξ, η) and norm ‖ξ‖. In accordance with the terminology of
functional analysis, H 2 is called the complex (or unitary) Hilbert space (of random
variables considered on the probability space (Ω, F , P)).
If ξ, η ∈ H^2 , their covariance is

    Cov(ξ, η) = E(ξ − E ξ)(η̄ − E η̄).                                    (6)

It follows from (4) and (6) that if E ξ = E η = 0, then

Cov(ξ, η) = (ξ, η). (7)

Definition. A sequence of complex random variables ξ = (ξn )_{n∈Z} with E |ξn |^2 <
∞, n ∈ Z, is stationary (in the wide sense) if, for all n ∈ Z,

    E ξn = E ξ0 ,
    Cov(ξk+n , ξk ) = Cov(ξn , ξ0 ),   k ∈ Z.                            (8)

As a matter of convenience, we shall always suppose that E ξ0 = 0. This involves


no loss of generality but does make it possible (by (7)) to identify the covariance
with the scalar product and, hence, to apply the methods and results of the theory of
Hilbert spaces.
Let us write
R(n) = Cov(ξn , ξ0 ), n ∈ Z, (9)
and (assuming R(0) = E |ξ0 |^2 ≠ 0)

    ρ(n) = R(n)/R(0),   n ∈ Z.                                           (10)

We call R(n) the covariance function and ρ(n) the correlation function of the se-
quence ξ (assumed stationary in the wide sense).

It follows immediately from (9) that R(n) is positive semidefinite, i.e., for all
complex numbers a1 , . . . , am and t1 , . . . , tm ∈ Z, m ≥ 1, we have

    Σ_{i,j=1}^{m} ai āj R(ti − tj ) ≥ 0,                                 (11)

since the left-hand side of (11) is equal to ‖ Σ_{i=1}^{m} ai ξ_{ti} ‖^2 . It is then easy to deduce
(either from (11) or directly from (9)) the following properties of the covariance
function (see Problem 1):

    R(0) ≥ 0,   R(−n) = R̄(n),   |R(n)| ≤ R(0),
    |R(n) − R(m)|^2 ≤ 2R(0)[R(0) − Re R(n − m)].                         (12)
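Properties (11)–(12) are easy to verify numerically for a concrete covariance function. The sketch below is an illustration of ours: it takes R(n) = Σ_k σ_k² e^{iλ_k n}, the harmonic form that appears in Example 2 below, with arbitrarily chosen frequencies and intensities, and checks R(−n) = R̄(n), |R(n)| ≤ R(0), and the nonnegativity of the quadratic form in (11) for a few random coefficient vectors.

```python
import cmath
import random

# Check of properties (11)-(12) for a covariance function of harmonic form
# R(n) = Σ_k σ_k² e^{i λ_k n}; the frequencies and intensities are
# illustrative choices.

LAMBDAS = [-1.2, 0.4, 2.0]          # frequencies λ_k in [-π, π)
SIGMA2 = [0.5, 1.0, 0.25]           # intensities σ_k² = E|z_k|²

def R(n):
    return sum(s * cmath.exp(1j * lam * n) for s, lam in zip(SIGMA2, LAMBDAS))

# (12): R(-n) is the complex conjugate of R(n), and |R(n)| ≤ R(0)
for n in range(1, 20):
    assert abs(R(-n) - R(n).conjugate()) < 1e-12
    assert abs(R(n)) <= R(0).real + 1e-12

# (11): the quadratic form Σ a_i ā_j R(t_i − t_j) is nonnegative
rng = random.Random(5)
t = list(range(8))
for _ in range(50):
    a = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in t]
    form = sum(a[i] * a[j].conjugate() * R(t[i] - t[j])
               for i in range(len(t)) for j in range(len(t)))
    assert form.real >= -1e-10 and abs(form.imag) < 1e-10
print("ok")
```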

2. Let us give some examples of stationary sequences ξ = (ξn )n∈Z . (From now on,
the words “in the wide sense” and the statement n ∈ Z will often be omitted.)

EXAMPLE 1. Let ξn = ξ0 · g(n), where E ξ0 = 0, E |ξ0|² = 1, and g = g(n) is a function. The sequence ξ = (ξn) will be stationary if and only if g(k + n)ḡ(k) depends only on n. Hence it is easy to see that there is a λ such that

g(n) = g(0)e^{iλn}.

Consequently, the sequence of random variables

ξn = ξ0 · g(0)e^{iλn}

is stationary with

R(n) = |g(0)|² e^{iλn}.

In particular, the random “constant” ξn ≡ ξ0 is a stationary sequence.

Remark. In connection with this example, notice that, since eiλn = ein(λ+2πk) , k =
±1, ±2, . . ., the (circular) frequency λ is defined up to a multiple of 2π. Following
tradition, we will assume henceforth that λ ∈ [−π, π].

EXAMPLE 2 (An almost periodic sequence). Let

ξn = ∑_{k=1}^{N} z_k e^{iλ_k n},  (13)

where z1, . . . , zN are orthogonal (E z_i z̄_j = 0, i ≠ j) random variables with zero means and E |z_k|² = σ_k² > 0; −π ≤ λ_k < π, k = 1, . . . , N; λ_i ≠ λ_j, i ≠ j. The sequence ξ = (ξn) is stationary with

R(n) = ∑_{k=1}^{N} σ_k² e^{iλ_k n}.  (14)
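The stationarity claim in (13)–(14) can be observed numerically. The sketch below is illustrative only: the frequencies, intensities, sample size, and the choice of complex Gaussian amplitudes are assumptions, not part of the text. It estimates Cov(ξ_{k+n}, ξ_k) by Monte Carlo and compares it with ∑ σ_k² e^{iλ_k n}:

```python
import numpy as np

rng = np.random.default_rng(0)

lams = np.array([-1.0, 0.5, 2.0])   # frequencies lambda_k in [-pi, pi) (illustrative)
sig2 = np.array([1.0, 2.0, 0.5])    # intensities sigma_k^2 = E|z_k|^2 (illustrative)

M = 200_000                          # Monte Carlo replications
# Orthogonal amplitudes z_k: independent complex Gaussians, E z_k = 0, E|z_k|^2 = sigma_k^2
z = (rng.standard_normal((M, 3)) + 1j * rng.standard_normal((M, 3))) * np.sqrt(sig2 / 2)

def xi(n):
    """One sample of xi_n = sum_k z_k e^{i lambda_k n} per replication."""
    return (z * np.exp(1j * lams * n)).sum(axis=1)

n, k = 4, 3
emp = np.mean(xi(k + n) * np.conj(xi(k)))       # empirical Cov(xi_{k+n}, xi_k)
theory = (sig2 * np.exp(1j * lams * n)).sum()   # formula (14)
assert abs(emp - theory) < 0.05                 # agreement; the value does not depend on k
```

Repeating the comparison for other values of k gives the same limit, which is exactly the wide-sense stationarity of (13).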
50 6 Stationary (Wide Sense) Random Sequences: L2 -Theory

As a generalization of (13) we now suppose that

ξn = ∑_{k=−∞}^{∞} z_k e^{iλ_k n},  (15)

where the z_k, k ∈ Z, have the same properties as in (13). If we suppose that ∑_{k=−∞}^{∞} σ_k² < ∞, the series on the right-hand side of (15) converges in mean square, and

R(n) = ∑_{k=−∞}^{∞} σ_k² e^{iλ_k n}.  (16)

Let us introduce the function

F(λ) = ∑_{k : λ_k ≤ λ} σ_k².  (17)

Then the covariance function (16) can be written as a Lebesgue–Stieltjes integral:

R(n) = ∫_{−π}^{π} e^{iλn} dF(λ)  ( = ∫_{[−π,π)} e^{iλn} dF(λ) ).  (18)

The stationary sequence (15) is represented as a sum of “harmonics” e^{iλ_k n} with “frequencies” λ_k and random “amplitudes” z_k of “intensities” σ_k² = E |z_k|². Consequently, the values of F(λ) provide complete information on the “spectrum” of the
sequence ξ, i.e., on the intensity with which each frequency appears in (15). By (18),
the values of F(λ) also completely determine the structure of the covariance func-
tion R(n).
Up to a constant multiple, a (nondegenerate) F(λ) is evidently a distribution
function, which in the examples considered so far has been piecewise constant. It
is quite remarkable that the covariance function of every stationary (wide sense)
random sequence can be represented (see theorem in Subsection 3) in the form
(18), where F(λ) is a distribution function (up to normalization) whose support is
concentrated on [−π, π), i.e., F(λ) = 0 for λ < −π and F(λ) = F(π) for λ > π.
The result on the integral representation of the covariance function, if compared
with (15) and (16), suggests that every stationary sequence also admits an “integral”
representation. This is in fact the case, as will be shown in Sect. 3 using what we
shall learn to call stochastic integrals with respect to orthogonal stochastic measures
(Sect. 2).

EXAMPLE 3 (White noise). Let ε = (εn) be an orthonormal sequence of random variables, E εn = 0, E ε_i ε̄_j = δ_ij, where δ_ij is the Kronecker delta. Such a sequence is evidently stationary, and

R(n) = 1 for n = 0,  R(n) = 0 for n ≠ 0.

Observe that R(n) can be represented in the form

R(n) = ∫_{−π}^{π} e^{iλn} dF(λ),  (19)

where

F(λ) = ∫_{−π}^{λ} f(v) dv,  f(λ) = 1/(2π),  −π ≤ λ < π.  (20)
Comparison of the spectral functions (17) and (20) shows that, whereas the spec-
trum in Example 2 is discrete, in the present example it is absolutely continuous
with constant “spectral density” f (λ) ≡ 1/2π. In this sense we can say that the se-
quence ε = (εn ) “consists of harmonics of equal intensities.” It is just this property
that has led to calling such a sequence ε = (εn ) “white noise” by analogy with white
light, which consists of different frequencies with the same intensities.
EXAMPLE 4 (Moving Averages). Starting from the white noise ε = (εn) introduced in Example 3, let us form the new sequence

ξn = ∑_{k=−∞}^{∞} a_k ε_{n−k},  (21)

where the a_k are complex numbers such that ∑_{k=−∞}^{∞} |a_k|² < ∞. From (21) we obtain

Cov(ξ_{n+m}, ξ_m) = Cov(ξn, ξ0) = ∑_{k=−∞}^{∞} a_{n+k} ā_k,

so that ξ = (ξk ) is a stationary sequence, which we call the sequence obtained from
ε = (εk ) by a (two-sided) moving average.
In the special case where the a_k of negative index are zero, i.e.,

ξn = ∑_{k=0}^{∞} a_k ε_{n−k},

the sequence ξ = (ξn) is a one-sided moving average. If, in addition, a_k = 0 for k > p, i.e., if

ξn = a_0 εn + a_1 ε_{n−1} + · · · + a_p ε_{n−p},  (22)

then ξ = (ξn) is a moving average of order p.
It can be shown (Problem 3) that (22) has a covariance function of the form R(n) = ∫_{−π}^{π} e^{iλn} f(λ) dλ, where the spectral density is

f(λ) = (1/2π) |P(e^{−iλ})|²  (23)

with

P(z) = a_0 + a_1 z + · · · + a_p z^p.
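The two descriptions of the covariance of (22) — the direct sum ∑ a_{n+k} ā_k and the spectral integral with density (23) — can be compared numerically. In the sketch below the (real) coefficients are an arbitrary illustrative choice:

```python
import numpy as np

a = np.array([1.0, -0.6, 0.25])     # a_0, a_1, a_2: illustrative real coefficients, p = 2

def R_direct(n):
    """Covariance of the moving average (22): sum_k a_{|n|+k} a_k (real coefficients)."""
    n = abs(n)
    return float(sum(a[n + k] * a[k] for k in range(len(a) - n))) if n < len(a) else 0.0

lam = np.linspace(-np.pi, np.pi, 40_000, endpoint=False)
h = lam[1] - lam[0]
P = np.polyval(a[::-1], np.exp(-1j * lam))   # P(z) = a_0 + a_1 z + a_2 z^2 at z = e^{-i lam}
f = np.abs(P) ** 2 / (2 * np.pi)             # spectral density (23)

for n in range(-4, 5):
    R_spec = ((np.exp(1j * lam * n) * f).sum() * h).real
    assert abs(R_spec - R_direct(n)) < 1e-8
```

Since the integrand is a trigonometric polynomial, the equispaced Riemann sum is essentially exact here.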

EXAMPLE 5 (Autoregression). Again let ε = (εn) be white noise. We say that a random sequence ξ = (ξn) is described by an autoregressive model of order q if

ξn + b_1 ξ_{n−1} + · · · + b_q ξ_{n−q} = εn.  (24)

Under what conditions on b_1, . . . , b_q can we say that (24) has a stationary solution? To find an answer, let us begin with the case q = 1:

ξn = αξ_{n−1} + εn,  (25)

where α = −b_1. If |α| < 1, then it is easy to verify that the stationary sequence ξ̃ = (ξ̃n) with

ξ̃n = ∑_{j=0}^{∞} α^j ε_{n−j}  (26)

is a solution of (25). (The series on the right-hand side of (26) converges in mean square.) Let us now show that, in the class of stationary sequences ξ = (ξn) (with finite second moments), this is the only solution. In fact, we find from (25), by successive iteration, that

ξn = αξ_{n−1} + εn = α[αξ_{n−2} + ε_{n−1}] + εn = · · · = α^k ξ_{n−k} + ∑_{j=0}^{k−1} α^j ε_{n−j}.

Hence it follows that

E |ξn − ∑_{j=0}^{k−1} α^j ε_{n−j}|² = E |α^k ξ_{n−k}|² = |α|^{2k} E |ξ_{n−k}|² = |α|^{2k} E |ξ0|² → 0,  k → ∞.

Therefore, when |α| < 1, a stationary solution of (25) exists and is representable as the one-sided moving average (26).
There is a similar result for every q > 1: if all the zeros of the polynomial

Q(z) = 1 + b_1 z + · · · + b_q z^q  (27)

lie outside the unit disk, then the autoregression equation (24) has a unique stationary solution, which is representable as a one-sided moving average (Problem 2).
Here the covariance function R(n) can be represented (Problem 3) in the form

R(n) = ∫_{−π}^{π} e^{iλn} dF(λ),  F(λ) = ∫_{−π}^{λ} f(v) dv,  (28)

where

f(λ) = (1/2π) · 1/|Q(e^{−iλ})|².  (29)

In the special case q = 1, we find easily from (25) that E ξ0 = 0,

E |ξ0|² = 1/(1 − |α|²)  and  R(n) = α^n/(1 − |α|²),  n ≥ 0

(when n < 0, we have R(n) = R̄(−n)). Here

f(λ) = (1/2π) · 1/|1 − αe^{−iλ}|².
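Both claims for the case q = 1 — the closed-form covariance and its spectral density — lend themselves to a numerical check. The sketch below (the parameter value, real Gaussian noise, and sample size are arbitrary assumptions) compares the spectral integral with α^n/(1 − α²) and the simulated variance of (25) with 1/(1 − α²):

```python
import numpy as np

alpha = 0.7                                  # illustrative value with |alpha| < 1
lam = np.linspace(-np.pi, np.pi, 40_000, endpoint=False)
h = lam[1] - lam[0]
f = 1 / (2 * np.pi * np.abs(1 - alpha * np.exp(-1j * lam)) ** 2)  # density above

# spectral integral reproduces R(n) = alpha^n / (1 - alpha^2), n >= 0
for n in range(5):
    R_spec = ((np.exp(1j * lam * n) * f).sum() * h).real
    assert abs(R_spec - alpha ** n / (1 - alpha ** 2)) < 1e-6

# the recursion (25) driven by real white noise settles to variance 1/(1 - alpha^2)
rng = np.random.default_rng(1)
eps = rng.standard_normal(200_000)
xi = np.zeros_like(eps)
for t in range(1, len(eps)):
    xi[t] = alpha * xi[t - 1] + eps[t]
assert abs(xi[1000:].var() - 1 / (1 - alpha ** 2)) < 0.05
```

Discarding the first 1000 samples removes the influence of the nonstationary initial condition ξ0 = 0, in line with the uniqueness argument above.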

EXAMPLE 6. This example illustrates how autoregression arises in the construction of probabilistic models in hydrology. Consider a body of water. We try to construct a probabilistic model of the deviations of the level of the water from its average value because of variations in the inflow and evaporation from the surface.
If we take a year as the unit of time and let Hn denote the water level in year n,
we obtain the following balance equation:
Hn+1 = Hn − KS(Hn ) + Σn+1 , (30)
where Σn+1 is the inflow in year (n + 1), S(H) is the area of the surface of the water
at level H, and K is the coefficient of evaporation.
Let ξn = Hn − H̄ be the deviation from the mean level H̄ (which is obtained from observations over many years), and suppose that S(H) = S(H̄) + c(H − H̄). Then it follows from the balance equation that ξn satisfies

ξ_{n+1} = αξn + ε_{n+1}  (31)

with α = 1 − cK and εn = Σn − KS(H̄). It is natural to assume that the random
variables εn have zero means and, as a first order approximation, are uncorrelated
and identically distributed. Then, as we showed in Example 5, Eq. (31) has (for
|α| < 1) a unique stationary solution, which we think of as the steady-state solution
(with respect to time in years) of the oscillations of the level in the body of water.
As an example of practical conclusions that can be drawn from a (theoretical)
model (31), we call attention to the possibility of predicting the level for the follow-
ing year from the results of the observations of the present and preceding years. It
turns out (see also Example 2 in Sect. 6) that (in the mean-square sense) the optimal
linear estimator of ξn+1 in terms of the values of . . . , ξn−1 , ξn is simply αξn .
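The optimality of αξn as a one-step predictor is easy to observe empirically: since ξ_{n+1} − αξn = ε_{n+1} is orthogonal to the past, no other linear coefficient can do better. A simulation sketch (all numerical choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.8                          # plays the role of 1 - cK; illustrative, |alpha| < 1
eps = rng.standard_normal(200_000)
xi = np.zeros_like(eps)
for t in range(1, len(eps)):
    xi[t] = alpha * xi[t - 1] + eps[t]
xi = xi[1000:]                       # drop the transient so the sequence is near-stationary

def mse(c):
    """Empirical mean-square error of the linear predictor c * xi_n for xi_{n+1}."""
    return np.mean((xi[1:] - c * xi[:-1]) ** 2)

assert mse(alpha) < mse(alpha + 0.1)     # any other coefficient does worse
assert mse(alpha) < mse(alpha - 0.1)
assert abs(mse(alpha) - 1.0) < 0.02      # residual error = E eps^2 = 1
```

The residual mean-square error 1 is exactly the variance of the innovation ε_{n+1}, which cannot be predicted from the past.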
EXAMPLE 7 (Autoregression and moving average (mixed model)). If we suppose that the right-hand side of (24) contains a_0 εn + a_1 ε_{n−1} + · · · + a_p ε_{n−p} instead of εn, we obtain a mixed model with autoregression and moving average of order (p, q):

ξn + b_1 ξ_{n−1} + · · · + b_q ξ_{n−q} = a_0 εn + a_1 ε_{n−1} + · · · + a_p ε_{n−p}.  (32)

Under the same hypotheses as in Example 5 on the zeros of Q(z) (see (27)), it will be shown later (Corollary 6, Sect. 3) that (32) has a stationary solution ξ = (ξn) for which the covariance function is R(n) = ∫_{−π}^{π} e^{iλn} dF(λ) with F(λ) = ∫_{−π}^{λ} f(v) dv, where

f(λ) = (1/2π) |P(e^{−iλ})/Q(e^{−iλ})|²

with P and Q as in (23) and (27).
3. Theorem (Herglotz). Let R(n) be the covariance function of a stationary (wide sense) random sequence with zero mean. Then there is, on ([−π, π), B([−π, π))), a finite measure F = F(B), B ∈ B([−π, π)), such that for every n ∈ Z

R(n) = ∫_{−π}^{π} e^{iλn} F(dλ),  (33)

where the integral is understood as the Lebesgue–Stieltjes integral over [−π, π).

PROOF. For N ≥ 1 and λ ∈ [−π, π], set

f_N(λ) = (1/2πN) ∑_{k=1}^{N} ∑_{l=1}^{N} R(k − l) e^{−ikλ} e^{ilλ}.  (34)

Since R(n) is positive semidefinite, f_N(λ) is nonnegative. Since there are N − |m| pairs (k, l) for which k − l = m, we have

f_N(λ) = (1/2π) ∑_{|m|<N} (1 − |m|/N) R(m) e^{−imλ}.  (35)

Let

F_N(B) = ∫_B f_N(λ) dλ,  B ∈ B([−π, π)).

Then

∫_{−π}^{π} e^{iλn} F_N(dλ) = ∫_{−π}^{π} e^{iλn} f_N(λ) dλ = (1 − |n|/N) R(n) if |n| < N, and 0 if |n| ≥ N.  (36)

The measures F_N, N ≥ 1, are supported on the interval [−π, π] and F_N([−π, π]) = R(0) < ∞ for all N ≥ 1. Consequently, the family of measures {F_N}, N ≥ 1, is tight, and by Prokhorov's theorem (Theorem 1 of Sect. 2, Chap. 3, Vol. 1) there are a sequence {N_k} ⊆ {N} and a measure F such that F_{N_k} → F weakly. (The concepts of tightness, relative compactness, and weak convergence, together with Prokhorov's theorem, can be extended in an obvious way from probability measures to any finite measures.)
It then follows from (36) that

∫_{−π}^{π} e^{iλn} F(dλ) = lim_{k→∞} ∫_{−π}^{π} e^{iλn} F_{N_k}(dλ) = R(n).
The measure F so constructed is supported on [−π, π]. Without changing the integral ∫_{−π}^{π} e^{iλn} F(dλ), we can redefine F by transferring the “mass” F({π}), which is concentrated at π, to −π. The resulting new measure (which we again denote by F) will be supported on [−π, π). (Regarding the choice of [−π, π) as the domain of λ, see the Remark to Example 1.)
This completes the proof of the theorem.
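The kernel construction (34)–(36) can be made concrete numerically. In the sketch below the covariance is taken, for illustration only, to be that of the AR(1) sequence of Example 5 (an assumption for concreteness); the check confirms that f_N is nonnegative and that (36) holds:

```python
import numpy as np

alpha, N = 0.6, 200
R = lambda n: alpha ** abs(n) / (1 - alpha ** 2)   # covariance of Example 5, q = 1

lam = np.linspace(-np.pi, np.pi, 20_000, endpoint=False)
h = lam[1] - lam[0]

# f_N from (35); real because R(-m) = R(m), nonnegative because R is positive semidefinite
fN = sum((1 - abs(m) / N) * R(m) * np.exp(-1j * m * lam)
         for m in range(-(N - 1), N)).real / (2 * np.pi)
assert fN.min() > -1e-12

# (36): the n-th Fourier coefficient of F_N equals (1 - |n|/N) R(n) for |n| < N
for n in (0, 1, 5):
    coeff = ((np.exp(1j * lam * n) * fN).sum() * h).real
    assert abs(coeff - (1 - n / N) * R(n)) < 1e-8
```

As N grows, f_N converges to the spectral density of Example 5, which is the content of the theorem for this particular covariance.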


Remark 1. The measure F = F(B) involved in (33) is known as the spectral measure, and F(λ) = F([−π, λ]) as the spectral function, of the stationary sequence with covariance function R(n).
In the preceding Example 2, the spectral measure was discrete (concentrated at λ_k, k = 0, ±1, . . .). In Examples 3–6, the spectral measures were absolutely continuous.
Remark 2. The spectral measure F is uniquely defined by the covariance function.
In fact, let F_1 and F_2 be two spectral measures with

∫_{−π}^{π} e^{iλn} F_1(dλ) = ∫_{−π}^{π} e^{iλn} F_2(dλ),  n ∈ Z.

Since every bounded continuous function g(λ) can be uniformly approximated on [−π, π) by trigonometric polynomials, we have

∫_{−π}^{π} g(λ) F_1(dλ) = ∫_{−π}^{π} g(λ) F_2(dλ).

It follows (cf. proof of Theorem 2 in Sect. 12, Chap. 2, Vol. 1) that F1 (B) = F2 (B)
for all B ∈ B([−π, π)).
Remark 3. If ξ = (ξn) is a stationary sequence of real random variables ξn, then R(n) = R(−n), and therefore

R(n) = (R(n) + R(−n))/2 = ∫_{−π}^{π} cos λn F(dλ).

4. PROBLEMS
1. Derive (12) from (11).
2. Prove that the autoregression Eq. (24) has a unique stationary solution representable as a one-sided moving average if all the zeros of the polynomial Q(z) defined by (27) lie outside the unit disk.
3. Show that the spectral functions of the sequences (22) and (24) have densities specified by (23) and (29), respectively.
4. Show that if ∑_{n=−∞}^{+∞} |R(n)|² < ∞, then the spectral function F(λ) has a density f(λ) given by

f(λ) = (1/2π) ∑_{n=−∞}^{∞} e^{−iλn} R(n),

where the series converges in the complex space L² = L²([−π, π), B([−π, π)), λ) with λ the Lebesgue measure.
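For Problem 4, the inversion formula can be sanity-checked numerically on a covariance whose density is known in closed form; the sketch below uses the AR(1) covariance of Example 5 (the parameter value and the truncation at |n| = 400 are arbitrary assumptions; the neglected tail is tiny since R(n) decays geometrically):

```python
import numpy as np

alpha = 0.5
lam = np.linspace(-np.pi, np.pi, 2_000, endpoint=False)

ns = np.arange(-400, 401)                      # truncated index range of the series
R = alpha ** np.abs(ns) / (1 - alpha ** 2)     # R(n) = alpha^{|n|}/(1 - alpha^2)
f_series = (R * np.exp(-1j * np.outer(lam, ns))).sum(axis=1).real / (2 * np.pi)
f_closed = 1 / (2 * np.pi * np.abs(1 - alpha * np.exp(-1j * lam)) ** 2)  # density (29), q = 1
assert np.max(np.abs(f_series - f_closed)) < 1e-6
```

The uniform agreement on the grid illustrates the L² convergence asserted in the problem (here the convergence is in fact uniform, because ∑ |R(n)| < ∞).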

2. Orthogonal Stochastic Measures and Stochastic Integrals

1. As we observed in Sect. 1, the integral representation of the covariance function and the example of a stationary sequence

ξn = ∑_{k=−∞}^{∞} z_k e^{iλ_k n}  (1)

with pairwise orthogonal random variables z_k, k ∈ Z, suggest the possibility of representing an arbitrary stationary sequence as a corresponding integral generalization of (1).
If we set

Z(λ) = ∑_{k : λ_k ≤ λ} z_k,  (2)

we can rewrite (1) in the form

ξn = ∑_{k=−∞}^{∞} e^{iλ_k n} ΔZ(λ_k),  (3)

where ΔZ(λ_k) ≡ Z(λ_k) − Z(λ_k−) = z_k.


The right-hand side of (3) reminds us of an approximating sum for an integral ∫_{−π}^{π} e^{iλn} dZ(λ) of the Riemann–Stieltjes type. However, in the present case, Z(λ) is a random function (it also depends on ω). And it will be seen that, for an integral representation of a general stationary sequence, we need to use functions Z(λ) that do not have bounded variation for each ω. Consequently, the simple interpretation of ∫_{−π}^{π} e^{iλn} dZ(λ) as a Riemann–Stieltjes integral for each ω is inapplicable.

2. By analogy with the general ideas of the Lebesgue, Lebesgue–Stieltjes, and Riemann–Stieltjes integrals (Sect. 6, Chap. 2, Vol. 1), we begin by defining a stochastic measure.
Let (Ω, F, P) be a probability space, and let E be a set, with an algebra E_0 of its subsets and the σ-algebra E generated by E_0, E = σ(E_0).
Definition 1. A complex-valued function Z(Δ) = Z(ω; Δ), defined for ω ∈ Ω and Δ ∈ E_0, is a finitely additive stochastic measure if:
(1) E |Z(Δ)|² < ∞ for every Δ ∈ E_0;
(2) for every pair Δ_1 and Δ_2 of disjoint sets in E_0,

Z(Δ_1 + Δ_2) = Z(Δ_1) + Z(Δ_2) (P-a.s.).  (4)

Definition 2. A finitely additive stochastic measure Z(Δ) is an elementary stochastic measure if, for all disjoint sets Δ_1, Δ_2, . . . of E_0 such that Δ = ∑_{k=1}^{∞} Δ_k ∈ E_0,

E |Z(Δ) − ∑_{k=1}^{n} Z(Δ_k)|² → 0,  n → ∞.  (5)

Remark 1. In this definition of an elementary stochastic measure on subsets of E_0, it is assumed that its values are in the Hilbert space H² = H²(Ω, F, P) and that countable additivity is understood in the mean-square sense (5). There are other definitions of stochastic measures, without the requirement of the existence of second moments, where countable additivity is defined (for example) in terms of convergence in probability or with probability 1.

Remark 2. In analogy with nonstochastic measures, one can show that for finitely
additive stochastic measures the condition (5) of countable additivity (in the mean-
square sense) is equivalent to continuity (in the mean-square sense) at “zero”:

E |Z(Δn )|2 → 0, Δn ↓ ∅, Δn ∈ E0 . (6)

A particularly important class of elementary stochastic measures consists of those that are orthogonal according to the following definition.

Definition 3. An elementary stochastic measure Z(Δ), Δ ∈ E_0, is orthogonal (or a measure with orthogonal values) if

E Z(Δ_1)Z̄(Δ_2) = 0  (7)

for every pair of disjoint sets Δ_1 and Δ_2 in E_0 or, equivalently, if

E Z(Δ_1)Z̄(Δ_2) = E |Z(Δ_1 ∩ Δ_2)|²  (8)

for all Δ_1 and Δ_2 in E_0.

We write
m(Δ) = E |Z(Δ)|2 , Δ ∈ E0 . (9)
For elementary orthogonal stochastic measures, the set function m = m(Δ), Δ ∈
E0 , is, as is easily verified, a finite measure, and, consequently, by Carathéodory’s
theorem (Sect. 3, Chap. 2, Vol. 1), it can be extended to (E, E ). The resulting mea-
sure will again be denoted by m = m(Δ) and called the structure function (of the
elementary orthogonal stochastic measure Z = Z(Δ), Δ ∈ E0 ).
The following question now arises naturally: since the set function m = m(Δ)
defined on (E, E0 ) admits an extension to (E, E ), where E = σ(E0 ), can an elemen-
tary orthogonal stochastic measure Z = Z(Δ), Δ ∈ E0 , be extended to sets Δ in E
in such a way that E |Z(Δ)|2 = m(Δ), Δ ∈ E ?
The answer is affirmative, as follows from the construction given below. This
construction, at the same time, leads to the stochastic integral that we need for the
integral representation of stationary sequences.

3. Let Z = Z(Δ) be an elementary orthogonal stochastic measure, Δ ∈ E_0, with structure function m = m(Δ), Δ ∈ E. For every function

f(λ) = ∑ f_k I_{Δ_k}(λ),  Δ_k ∈ E_0,  (10)

with only a finite number of different (complex) values, we define the random variable

I(f) = ∑ f_k Z(Δ_k).

Let L² = L²(E, E, m) be the Hilbert space of complex-valued functions with the scalar product

⟨f, g⟩ = ∫_E f(λ)ḡ(λ) m(dλ)

and the norm ‖f‖ = ⟨f, f⟩^{1/2}, and let H² = H²(Ω, F, P) be the Hilbert space of complex-valued random variables with the scalar product

(ξ, η) = E ξη̄

and the norm ‖ξ‖ = (ξ, ξ)^{1/2}.
Then it is clear that, for every pair of functions f and g of the form (10),

(I(f), I(g)) = ⟨f, g⟩

and

‖I(f)‖² = ‖f‖² = ∫_E |f(λ)|² m(dλ).

Now let f ∈ L², and let {f_n} be functions of the type (10) such that ‖f − f_n‖ → 0, n → ∞ (Problem 2). Consequently,

‖I(f_n) − I(f_m)‖ = ‖f_n − f_m‖ → 0,  n, m → ∞.

Therefore the sequence {I(f_n)} is fundamental in the mean-square sense and, by Theorem 7 in Sect. 10, Chap. 2, Vol. 1, there is a random variable (denoted by I(f)) such that I(f) ∈ H² and ‖I(f_n) − I(f)‖ → 0, n → ∞.
The random variable I(f) constructed in this way is uniquely defined (up to stochastic equivalence) and is independent of the choice of the approximating sequence {f_n}. We call it the stochastic integral of f ∈ L² with respect to the elementary orthogonal stochastic measure Z and denote it by

I(f) = ∫_E f(λ) Z(dλ).

We note the following basic properties of the stochastic integral I(f); these are direct consequences of its construction. Let g, f, and f_n ∈ L². Then

(I(f), I(g)) = ⟨f, g⟩;  (11)
‖I(f)‖ = ‖f‖;  (12)
I(af + bg) = aI(f) + bI(g) (P-a.s.),  (13)

where a and b are constants; and

‖I(f_n) − I(f)‖ → 0  (14)

if ‖f_n − f‖ → 0, n → ∞.
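Properties (7), (11), and (12) can be observed by Monte Carlo on the simplest orthogonal stochastic measure: the purely atomic one suggested by Example 2 of Sect. 1, Z(Δ) = ∑_{λ_k ∈ Δ} z_k. Everything in the sketch below (atoms, intensities, Gaussian amplitudes, the test function f) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
lams = np.array([-2.0, -0.3, 1.1, 2.5])   # atoms of a purely discrete measure (illustrative)
sig2 = np.array([0.5, 1.0, 2.0, 0.25])    # structure function: m({lambda_k}) = sigma_k^2

M = 300_000
z = (rng.standard_normal((M, 4)) + 1j * rng.standard_normal((M, 4))) * np.sqrt(sig2 / 2)

def Z(mask):
    """Z(Delta) = sum of the z_k with lambda_k in Delta (mask selects the atoms)."""
    return z[:, mask].sum(axis=1)

def I(f):
    """Stochastic integral: for an atomic measure, I(f) = sum_k f(lambda_k) z_k."""
    return (f(lams) * z).sum(axis=1)

# orthogonality (7) for disjoint Delta_1 and Delta_2
d1 = np.array([True, True, False, False])
assert abs(np.mean(Z(d1) * np.conj(Z(~d1)))) < 0.02

# isometry (12): E|I(f)|^2 = integral of |f|^2 dm = sum_k |f(lambda_k)|^2 sigma_k^2
f = lambda lam: np.exp(1j * lam) + lam ** 2
emp = np.mean(np.abs(I(f)) ** 2)
exact = (np.abs(f(lams)) ** 2 * sig2).sum()
assert abs(emp - exact) / exact < 0.02
```

For an atomic measure every f is already of the simple form (10), so no approximation step is needed; the general construction extends this picture by L² limits.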

4. Let us use the preceding definition of the stochastic integral to extend the elementary stochastic measure Z(Δ), Δ ∈ E_0, to sets in E = σ(E_0).
Since the measure m is assumed to be finite, we have I_Δ = I_Δ(λ) ∈ L² for all Δ ∈ E. Write Z̃(Δ) = I(I_Δ). It is clear that Z̃(Δ) = Z(Δ) for Δ ∈ E_0. It follows from (13) that if Δ_1 ∩ Δ_2 = ∅ for Δ_1 and Δ_2 ∈ E, then

Z̃(Δ_1 + Δ_2) = Z̃(Δ_1) + Z̃(Δ_2) (P-a.s.),

and it follows from (12) that

E |Z̃(Δ)|² = m(Δ),  Δ ∈ E.

Let us show that the random set function Z̃(Δ), Δ ∈ E, is countably additive in the mean-square sense. In fact, let Δ_k ∈ E and Δ = ∑_{k=1}^{∞} Δ_k. Then

Z̃(Δ) − ∑_{k=1}^{n} Z̃(Δ_k) = I(g_n),

where

g_n(λ) = I_Δ(λ) − ∑_{k=1}^{n} I_{Δ_k}(λ) = I_{Σ_n}(λ),  Σ_n = ∑_{k=n+1}^{∞} Δ_k.

But

E |I(g_n)|² = ‖g_n‖² = m(Σ_n) ↓ 0,  n → ∞,

i.e.,

E |Z̃(Δ) − ∑_{k=1}^{n} Z̃(Δ_k)|² → 0,  n → ∞.

It also follows from (11) that

(Z̃(Δ_1), Z̃(Δ_2)) = 0

when Δ_1 ∩ Δ_2 = ∅, Δ_1, Δ_2 ∈ E.
Thus, our function Z̃(Δ), defined for Δ ∈ E, is countably additive in the mean-square sense and coincides with Z(Δ) on the sets Δ ∈ E_0. We shall call Z̃(Δ), Δ ∈ E, an orthogonal stochastic measure (since it is an extension of the elementary orthogonal stochastic measure Z(Δ)) with respect to the structure function m(Δ), Δ ∈ E; and we call the integral I(f) = ∫_E f(λ) Z̃(dλ), defined earlier, a stochastic integral with respect to this measure.

5. We now consider the case (E, E ) = (R, B(R)), which is the most important for
our purposes. As we know (Theorem 1, Sect. 3, Chap. 2, Vol. 1), there is a one-to-one
correspondence between finite measures m = m(Δ) on (R, B(R)) and (generalized)
distribution functions G = G(x), with m(a, b] = G(b) − G(a).
It turns out that there is something similar for orthogonal stochastic measures.
We introduce the following definition.

Definition 4. A set of (complex-valued) random variables {Z_λ}, λ ∈ R, defined on (Ω, F, P), is a random process with orthogonal increments if:
(1) E |Z_λ|² < ∞, λ ∈ R;
(2) for every λ ∈ R,

E |Z_λ − Z_{λ_n}|² → 0,  λ_n ↓ λ,  λ_n ∈ R;

(3) whenever λ_1 < λ_2 < λ_3 < λ_4,

E (Z_{λ_4} − Z_{λ_3})(Z̄_{λ_2} − Z̄_{λ_1}) = 0.

Condition (3) is the condition of orthogonal increments. Condition (1) means that
Zλ ∈ H 2 . Finally, condition (2) is included for technical reasons; it is a requirement
of continuity on the right (in the mean-square sense) at each λ ∈ R.
Let Z = Z(Δ) be an orthogonal stochastic measure with respect to the structure function m = m(Δ), which is a finite measure with (generalized) distribution function G(λ). Let us set

Z_λ = Z(−∞, λ].

Then

E |Z_λ|² = m(−∞, λ] = G(λ) < ∞,  E |Z_λ − Z_{λ_n}|² = m(λ, λ_n] ↓ 0,  λ_n ↓ λ,

and (evidently) (3) is also satisfied. Thus, {Z_λ} is a process with orthogonal increments.
On the other hand, let G(λ) be a generalized distribution function, G(−∞) = 0, G(+∞) < ∞, and let {Z_λ} be a process with orthogonal increments such that E |Z_λ|² = G(λ). Set

Z(Δ) = Z_b − Z_a

when Δ = (a, b]. Let E_0 be the algebra generated by the sets Δ = ∑_{k=1}^{n} (a_k, b_k] with disjoint (a_k, b_k], and let

Z(Δ) = ∑_{k=1}^{n} Z(a_k, b_k].

It is clear that

E |Z(Δ)|² = m(Δ),

where m(Δ) = ∑_{k=1}^{n} [G(b_k) − G(a_k)], and

E Z(Δ_1)Z̄(Δ_2) = 0

for disjoint intervals Δ_1 = (a_1, b_1] and Δ_2 = (a_2, b_2].


Due to the continuity on the right of G(λ), λ ∈ R, this implies that Z = Z(Δ), Δ ∈ E_0, is an elementary stochastic measure with orthogonal values. The set function m = m(Δ), Δ ∈ E_0, has a unique extension to a measure on E = B(R), and it follows from the preceding constructions that Z = Z(Δ), Δ ∈ E_0, can also be extended to the sets Δ ∈ E, where E = B(R), with E |Z(Δ)|² = m(Δ), Δ ∈ B(R).

Therefore there is a one-to-one correspondence between processes {Z_λ}, λ ∈ R, with orthogonal increments and E |Z_λ|² = G(λ), G(−∞) = 0, G(+∞) < ∞, and orthogonal stochastic measures Z = Z(Δ), Δ ∈ B(R), with structure functions m = m(Δ). The correspondence is given by

Z_λ = Z(−∞, λ],  G(λ) = m(−∞, λ]

and

Z(a, b] = Z_b − Z_a,  m(a, b] = G(b) − G(a).

By analogy with the usual notation of the theory of Lebesgue–Stieltjes and Riemann–Stieltjes integration (Subsections 9 and 11 of Sect. 6, Chap. 2, Vol. 1), the stochastic integral ∫_R f(λ) dZ_λ, where {Z_λ} is a process with orthogonal increments, means the stochastic integral ∫_R f(λ) Z(dλ) with respect to the orthogonal stochastic measure corresponding to {Z_λ}.
6. PROBLEMS
1. Prove the equivalence of (5) and (6).
2. Let f ∈ L². Using the results of Chap. 2, Vol. 1 (Theorem 1 in Sect. 4, the Corollary to Theorem 3 of Sect. 6, and Problem 8 of Sect. 3), prove that there is a sequence of functions f_n of the form (10) such that ‖f − f_n‖ → 0, n → ∞.
3. Establish the following properties of an orthogonal stochastic measure Z(Δ) with structure function m(Δ):

E |Z(Δ_1) − Z(Δ_2)|² = m(Δ_1 △ Δ_2),
Z(Δ_1 \ Δ_2) = Z(Δ_1) − Z(Δ_1 ∩ Δ_2) (P-a.s.),
Z(Δ_1 △ Δ_2) = Z(Δ_1) + Z(Δ_2) − 2Z(Δ_1 ∩ Δ_2) (P-a.s.),

where Δ_1 △ Δ_2 denotes the symmetric difference (Δ_1 \ Δ_2) ∪ (Δ_2 \ Δ_1).

3. Spectral Representation of Stationary (Wide Sense) Sequences

1. If ξ = (ξn) is a stationary sequence with E ξn = 0, n ∈ Z, then, by the theorem of Sect. 1, there is a finite measure F = F(Δ) on ([−π, π), B([−π, π))) such that the covariance function R(n) = Cov(ξ_{k+n}, ξ_k) admits the spectral representation

R(n) = ∫_{−π}^{π} e^{iλn} F(dλ).  (1)

The following result provides the corresponding spectral representation of the sequence ξ = (ξn), n ∈ Z, itself.

Theorem 1. There is an orthogonal stochastic measure Z = Z(Δ), Δ ∈ B([−π, π)), such that for every n ∈ Z (P-a.s.)

ξn = ∫_{−π}^{π} e^{iλn} Z(dλ)  ( = ∫_{[−π,π)} e^{iλn} Z(dλ) ).  (2)

Moreover, E Z(Δ) = 0, E |Z(Δ)|² = F(Δ).


PROOF. The simplest proof is based on properties of Hilbert spaces.
Let L²(F) = L²(E, E, F) be a Hilbert space of complex functions, E = [−π, π), E = B([−π, π)), with the scalar product

⟨f, g⟩ = ∫_{−π}^{π} f(λ)ḡ(λ) F(dλ),  (3)

and let L₀²(F) be the linear manifold (L₀²(F) ⊆ L²(F)) spanned by the functions e_n = e_n(λ), n ∈ Z, where e_n(λ) = e^{iλn}.
Observe that, since E = [−π, π) and F is finite, the closure of L₀²(F) coincides (Problem 1) with L²(F):

L̄₀²(F) = L²(F).
Also, let L₀²(ξ) be the linear manifold spanned by the random variables ξn, n ∈ Z, and let L²(ξ) be its closure in the mean-square sense (with respect to P).
We establish a one-to-one correspondence between the elements of L₀²(F) and L₀²(ξ), denoted by “↔,” by setting

e_n ↔ ξ_n,  n ∈ Z,  (4)

and defining it for elements in general (more precisely, for equivalence classes of elements) by linearity:

∑ α_n e_n ↔ ∑ α_n ξ_n  (5)

(here we suppose that only finitely many of the complex numbers α_n are different from zero).
Observe that (5) is a consistent definition, in the sense that ∑ α_n e_n = 0 almost everywhere with respect to F if and only if ∑ α_n ξ_n = 0 (P-a.s.).
The correspondence “↔” is an isometry, i.e., it preserves scalar products. In fact, by (3),

⟨e_n, e_m⟩ = ∫_{−π}^{π} e_n(λ)ē_m(λ) F(dλ) = ∫_{−π}^{π} e^{iλ(n−m)} F(dλ)
  = R(n − m) = E ξ_n ξ̄_m = (ξ_n, ξ_m),

and similarly,

⟨∑ α_n e_n, ∑ β_n e_n⟩ = (∑ α_n ξ_n, ∑ β_n ξ_n).  (6)

Now let η ∈ L²(ξ). Since L²(ξ) is the closure of L₀²(ξ), there is a sequence {η_n} such that η_n ∈ L₀²(ξ) and ‖η_n − η‖ → 0, n → ∞. Consequently, {η_n} is a fundamental sequence, and therefore so is the sequence {f_n}, where f_n ∈ L₀²(F) and f_n ↔ η_n. The space L²(F) is complete, and consequently there is an f ∈ L²(F) such that ‖f_n − f‖ → 0.
There is an evident converse: if f ∈ L²(F) and ‖f − f_n‖ → 0, f_n ∈ L₀²(F), then there is an element η of L²(ξ) such that ‖η − η_n‖ → 0, η_n ∈ L₀²(ξ), and η_n ↔ f_n.

Up to now, the isometry “↔” has been defined only between elements of L₀²(ξ) and L₀²(F). We extend it by continuity, taking f ↔ η when f and η are the elements considered earlier. It is easily verified that the correspondence obtained in this way is one-to-one (between classes of equivalent random variables and of functions), is linear, and preserves scalar products.
Consider the function f(λ) = I_Δ(λ), where Δ ∈ B([−π, π)), λ ∈ [−π, π), and let Z(Δ) be the element of L²(ξ) such that I_Δ(λ) ↔ Z(Δ). It is clear that ‖I_Δ(λ)‖² = F(Δ), and therefore E |Z(Δ)|² = F(Δ). Since E ξ_n = 0, n ∈ Z, every element of L₀²(ξ) (and hence of L²(ξ)) has zero expectation. In particular, E Z(Δ) = 0. Moreover, if Δ_1 ∩ Δ_2 = ∅, we have E Z(Δ_1)Z̄(Δ_2) = 0 and E |Z(Δ) − ∑_{k=1}^{n} Z(Δ_k)|² → 0, n → ∞, where Δ = ∑_{k=1}^{∞} Δ_k.
Hence the family of elements Z(Δ), Δ ∈ B([−π, π)), forms an orthogonal stochastic measure, with respect to which (according to Sect. 2) we can define the stochastic integral

I(f) = ∫_{−π}^{π} f(λ) Z(dλ),  f ∈ L²(F).

Let f ∈ L²(F) and η ↔ f. Denote the element η by Φ(f) (more precisely, select single representatives from the corresponding equivalence classes of random variables or functions). Let us show that (P-a.s.)

I(f) = Φ(f).  (7)

In fact, if

f(λ) = ∑ α_k I_{Δ_k}(λ)  (8)

is a finite linear combination of functions I_{Δ_k}(λ), Δ_k = (a_k, b_k], then, by the very definition of the stochastic integral, I(f) = ∑ α_k Z(Δ_k), which is evidently equal to Φ(f). Therefore (7) is valid for functions of the form (8). But if f ∈ L²(F) and ‖f_n − f‖ → 0, where the f_n are functions of the form (8), then ‖Φ(f_n) − Φ(f)‖ → 0 and ‖I(f_n) − I(f)‖ → 0 (by (14) of Sect. 2). Therefore Φ(f) = I(f) (P-a.s.).
Consider the function f(λ) = e^{iλn}. Then Φ(e^{iλn}) = ξ_n by (4), while, on the other hand, I(e^{iλn}) = ∫_{−π}^{π} e^{iλn} Z(dλ). Therefore

ξ_n = ∫_{−π}^{π} e^{iλn} Z(dλ),  n ∈ Z (P-a.s.),

by (7). This completes the proof of the theorem.




Corollary 1. Let ξ = (ξn) be a stationary sequence of real random variables ξn, n ∈ Z. Then the stochastic measure Z = Z(Δ) involved in the spectral representation (2) has the property that

Z̄(Δ) = Z(−Δ)  (9)

for every Δ ∈ B([−π, π)), where −Δ = {λ : −λ ∈ Δ}.


 
In fact, let f (λ) = αk eiλk and η = αk ξk (finite sums). Then f ↔ η, and
therefore  
η= α k ξk ↔ αk eiλk = f (−λ). (10)

Since IΔ (λ) ↔ Z(Δ), it follows from (10) that IΔ (−λ) ↔ Z(Δ) (or, equivalently,
I−Δ (λ) ↔ Z(Δ)). On the other hand, I−Δ (λ) ↔ Z(−Δ). Therefore Z(Δ) =
Z(−Δ) (P-a.s.).
Corollary 2. Again let ξ = (ξn) be a stationary sequence of real random variables ξn, and let Z(Δ) = Z_1(Δ) + iZ_2(Δ). Then

E Z_1(Δ_1)Z_2(Δ_2) = 0  (11)

for every Δ_1 and Δ_2 in B([−π, π)); and if Δ_1 ∩ Δ_2 = ∅ and (−Δ_1) ∩ Δ_2 = ∅, then

E Z_1(Δ_1)Z_1(Δ_2) = 0,  E Z_2(Δ_1)Z_2(Δ_2) = 0.  (12)

In fact, since Z̄(Δ) = Z(−Δ), we have

Z_1(−Δ) = Z_1(Δ),  Z_2(−Δ) = −Z_2(Δ).  (13)

Moreover, since E Z(Δ_1)Z̄(Δ_2) = E |Z(Δ_1 ∩ Δ_2)|², we have Im E Z(Δ_1)Z̄(Δ_2) = 0, i.e.,

E Z_1(Δ_1)Z_2(Δ_2) − E Z_2(Δ_1)Z_1(Δ_2) = 0.  (14)

If we take the interval −Δ_1 instead of Δ_1, we therefore obtain

E Z_1(−Δ_1)Z_2(Δ_2) − E Z_2(−Δ_1)Z_1(Δ_2) = 0,

which, by (13), can be transformed into

E Z_1(Δ_1)Z_2(Δ_2) + E Z_2(Δ_1)Z_1(Δ_2) = 0.  (15)

Then (11) follows from (14) and (15).
When Δ_1 ∩ Δ_2 = ∅ and (−Δ_1) ∩ Δ_2 = ∅, we have E Z(Δ_1)Z̄(Δ_2) = 0, whence Re E Z(Δ_1)Z̄(Δ_2) = 0 and Re E Z(−Δ_1)Z̄(Δ_2) = 0, which, with (13), provides an evident proof of (12).
Corollary 3. Let ξ = (ξn) be a Gaussian sequence. Then, for any Δ_1, . . . , Δ_k, the vector (Z_1(Δ_1), . . . , Z_1(Δ_k), Z_2(Δ_1), . . . , Z_2(Δ_k)) is normally distributed.
In fact, the linear manifold L₀²(ξ) consists of (complex-valued) Gaussian random variables η, i.e., the vector (Re η, Im η) has a Gaussian distribution. Then, according to Subsection 5 of Sect. 13, Chap. 2, Vol. 1, the closure of L₀²(ξ) also consists of Gaussian variables. It follows from Corollary 2 that, when ξ = (ξn) is a Gaussian sequence, the real and imaginary parts Z_1 and Z_2 are independent in the sense that the families of random variables (Z_1(Δ_1), . . . , Z_1(Δ_k)) and (Z_2(Δ_1), . . . , Z_2(Δ_k)) are independent. It also follows from (12) that if Δ_i ∩ Δ_j = (−Δ_i) ∩ Δ_j = ∅, i, j = 1, . . . , k, i ≠ j, then the random variables Z_i(Δ_1), . . . , Z_i(Δ_k) are mutually independent, i = 1, 2.

Corollary 4. If ξ = (ξn) is a stationary sequence of real random variables, then (P-a.s.)

ξn = ∫_{−π}^{π} cos λn Z_1(dλ) − ∫_{−π}^{π} sin λn Z_2(dλ).  (16)

Remark. If {Z_λ}, λ ∈ [−π, π), is a process with orthogonal increments corresponding to an orthogonal stochastic measure Z = Z(Δ), then, in accordance with Sect. 2, the spectral representation (2) can also be written in the following form:

ξn = ∫_{−π}^{π} e^{iλn} dZ_λ,  n ∈ Z.  (17)

2. Let ξ = (ξn) be a stationary sequence with the spectral representation (2), and let η ∈ L²(ξ). The following theorem describes the structure of such random variables.

Theorem 2. If η ∈ L²(ξ), then there is a function ϕ ∈ L²(F) such that (P-a.s.)

η = ∫_{−π}^{π} ϕ(λ) Z(dλ).  (18)

PROOF. If

η_n = ∑_{|k|≤n} α_k ξ_k,  (19)

then, by (2),

η_n = ∫_{−π}^{π} ( ∑_{|k|≤n} α_k e^{iλk} ) Z(dλ),  (20)

i.e., (18) is satisfied by

ϕ_n(λ) = ∑_{|k|≤n} α_k e^{iλk}.  (21)

In the general case, where η ∈ L²(ξ), there are variables η_n of type (19) such that ‖η − η_n‖ → 0, n → ∞. But then ‖ϕ_n − ϕ_m‖ = ‖η_n − η_m‖ → 0, n, m → ∞. Consequently, {ϕ_n} is fundamental in L²(F), and therefore there is a function ϕ ∈ L²(F) such that ‖ϕ − ϕ_n‖ → 0, n → ∞.
By property (14) of Sect. 2, we have ‖I(ϕ_n) − I(ϕ)‖ → 0, and since η_n = I(ϕ_n), we also have η = I(ϕ) (P-a.s.).
This completes the proof of the theorem.

Remark. Let H_0(ξ) and H_0(F) be the respective closed linear manifolds spanned by the variables ξ⁰ = (ξn)_{n≤0} and by the functions e⁰ = (e_n)_{n≤0}. Then, if η ∈ H_0(ξ), there is a function ϕ ∈ H_0(F) such that (P-a.s.) η = ∫_{−π}^{π} ϕ(λ) Z(dλ).

3. Formula (18) describes the structure of the random variables that are obtained
from ξn , n ∈ Z, by linear transformations, i.e., in the form of finite sums (19) and
their mean-square limits.
66 6 Stationary (Wide Sense) Random Sequences: L2 -Theory

A special but important class of such linear transformations is defined by means


of what are known as (linear) filters. Let us suppose that, at instant m, a system
(filter) receives as input a signal xm , and that the output of the system is, at instant n,
the signal h(n − m)xm , where h = h(s), s ∈ Z, is a complex-valued function called
the impulse response (of the filter).
Therefore the total signal obtained at the output can be represented in the form

    yn = Σ_{m=−∞}^{∞} h(n − m) xm.    (22)

For physically realizable systems, the value of the output at instant n is deter-
mined only by the “past” values of the input, i.e., the values xm for m ≤ n. It is
therefore natural to call a filter with the impulse response h(s) physically realizable
if h(s) = 0 for all s < 0, in other words if

    yn = Σ_{m=−∞}^{n} h(n − m) xm = Σ_{m=0}^{∞} h(m) x_{n−m}.    (23)

An important spectral characteristic of a filter with the impulse response h is its
Fourier transform

    ϕ(λ) = Σ_{m=−∞}^{∞} e^{−iλm} h(m),    (24)

known as the frequency characteristic or transfer function of the filter.


Let us now take up conditions, about which nothing has been said so far, for
the convergence of the series in (22) and (24). Let us suppose that the input is a
stationary random sequence ξ = (ξn ), n ∈ Z, with covariance function R(n) and
spectral decomposition (2). Then, if


    Σ_{k,l=−∞}^{∞} h(k) R(k − l) h(l) < ∞,    (25)
the series Σ_{m=−∞}^{∞} h(n − m) ξm converges in mean square, and therefore there is a
stationary sequence η = (ηn) with

    ηn = Σ_{m=−∞}^{∞} h(n − m) ξm = Σ_{m=−∞}^{∞} h(m) ξ_{n−m}.    (26)

In terms of the spectral measure, (25) is evidently equivalent to saying that ϕ(λ) ∈
L²(F), i.e.,

    ∫_{−π}^{π} |ϕ(λ)|² F(dλ) < ∞.    (27)

Under (25) or (27), we obtain the spectral representation

    ηn = ∫_{−π}^{π} e^{iλn} ϕ(λ) Z(dλ), n ∈ Z,    (28)

of η from (26) and (2). Consequently, the covariance function Rη (n) of η is given
by the formula

    Rη(n) = ∫_{−π}^{π} e^{iλn} |ϕ(λ)|² F(dλ).    (29)

In particular, if the input to a filter with frequency characteristic ϕ = ϕ(λ) is taken to


be white noise ε = (εn ), the output will be a stationary sequence (moving average)


    ηn = Σ_{m=−∞}^{∞} h(m) ε_{n−m}    (30)

with spectral density

    fη(λ) = (1/2π) |ϕ(λ)|².
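The relation fη(λ) = (1/2π)|ϕ(λ)|² is easy to evaluate numerically. The following sketch (the impulse response h is an arbitrary illustrative choice, not taken from the text) computes the frequency characteristic (24) and the output spectral density for white-noise input:

```python
import cmath

# Hypothetical impulse response of a physically realizable filter
# (illustrative coefficients only): h(0), h(1), h(2); h(s) = 0 for s < 0.
h = [0.5, 0.3, 0.2]

def transfer(lam, h):
    """Frequency characteristic phi(lam) = sum_m e^{-i lam m} h(m), Eq. (24)."""
    return sum(h[m] * cmath.exp(-1j * lam * m) for m in range(len(h)))

def f_eta(lam, h):
    """Spectral density of the output when the input is white noise:
    f_eta(lam) = (1/2 pi) |phi(lam)|^2."""
    return abs(transfer(lam, h)) ** 2 / (2 * cmath.pi)

# At lam = 0 the transfer function reduces to the sum of the coefficients.
print(transfer(0.0, h))   # -> (1+0j)
print(f_eta(0.0, h))      # -> 1/(2*pi) ≈ 0.159
```

The same two functions evaluate the output density of any absolutely summable impulse response.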

The following theorem shows that, in a certain sense, every stationary sequence
with a spectral density is obtainable by means of a moving average.

Theorem 3. Let η = (ηn ) be a stationary sequence with spectral density fη (λ). Then
(possibly at the expense of enlarging the original probability space) we can find a
sequence ε = (εn ) representing white noise, and a filter, such that the representation
(30) holds.

PROOF. For a given (nonnegative) function fη(λ) we can find a function ϕ(λ) such
that fη(λ) = (1/2π)|ϕ(λ)|². Since ∫_{−π}^{π} fη(λ) dλ < ∞, we have ϕ(λ) ∈ L²(dμ),
where dμ is the Lebesgue measure on [−π, π). Hence ϕ can be represented as a
Fourier series (24) with h(m) = (1/2π) ∫_{−π}^{π} e^{imλ} ϕ(λ) dλ, where convergence is
understood in the sense that

    ∫_{−π}^{π} | ϕ(λ) − Σ_{|m|≤n} e^{−iλm} h(m) |² dλ → 0, n → ∞.

Let
    ηn = ∫_{−π}^{π} e^{iλn} Z(dλ), n ∈ Z.

Besides the measure Z = Z(Δ), we introduce another, independent of Z, orthogonal


stochastic measure Z̃ = Z̃(Δ) with E |Z̃(a, b]|2 = (b − a)/2π. (The possibility of
constructing such a measure depends, in general, on having a sufficiently “rich”
original probability space.) Let us set
 
    Z̄(Δ) = ∫_Δ ϕ⊕(λ) Z(dλ) + ∫_Δ [1 − ϕ⊕(λ)ϕ(λ)] Z̃(dλ),

where
    a⊕ = a^{−1} if a ≠ 0,  a⊕ = 0 if a = 0.

The stochastic measure Z̄ = Z̄(Δ) is a measure with orthogonal values, and for
every Δ = (a, b], we have

    E |Z̄(Δ)|² = (1/2π) ∫_Δ |ϕ⊕(λ)|² |ϕ(λ)|² dλ + (1/2π) ∫_Δ |1 − ϕ⊕(λ)ϕ(λ)|² dλ = |Δ|/2π,

where |Δ| = b − a. Therefore the stationary sequence ε = (εn), n ∈ Z, with

    εn = ∫_{−π}^{π} e^{iλn} Z̄(dλ),

is a white noise.
We now observe that

    ∫_{−π}^{π} e^{iλn} ϕ(λ) Z̄(dλ) = ∫_{−π}^{π} e^{iλn} Z(dλ) = ηn    (31)

and, on the other hand, by the definition of ϕ(λ) and property (14) in Sect. 2, we have
(P-a.s.)

    ∫_{−π}^{π} e^{iλn} ϕ(λ) Z̄(dλ) = ∫_{−π}^{π} e^{iλn} ( Σ_{m=−∞}^{∞} e^{−iλm} h(m) ) Z̄(dλ)
        = Σ_{m=−∞}^{∞} h(m) ∫_{−π}^{π} e^{iλ(n−m)} Z̄(dλ) = Σ_{m=−∞}^{∞} h(m) ε_{n−m},

which, together with (31), establishes representation (30).


This completes the proof of the theorem.


Remark. If fη (λ) > 0 (almost everywhere with respect to Lebesgue measure), the
introduction of the auxiliary measure Z̃ = Z̃(Δ) becomes unnecessary (since then
1 − ϕ⊕ (λ)ϕ(λ) = 0 almost everywhere with respect to Lebesgue measure), and the
reservation concerning the necessity of extending the original probability space can
be omitted.
Corollary 5. Let the spectral density fη(λ) > 0 (almost everywhere with respect to
Lebesgue measure) and

    fη(λ) = (1/2π) |ϕ(λ)|²,

where
    ϕ(λ) = Σ_{k=0}^{∞} e^{−iλk} h(k),  Σ_{k=0}^{∞} |h(k)|² < ∞.

Then the sequence η admits a representation as a one-sided moving average,

    ηn = Σ_{m=0}^{∞} h(m) ε_{n−m}.

In particular, let P(z) = a0 + a1z + · · · + ap z^p. Then the sequence η = (ηn) with
spectral density

    fη(λ) = (1/2π) |P(e^{−iλ})|²

can be represented in the form

    ηn = a0 εn + a1 ε_{n−1} + · · · + ap ε_{n−p}.

Corollary 6. Let ξ = (ξn) be a stationary sequence with rational spectral density

    fξ(λ) = (1/2π) | P(e^{−iλ}) / Q(e^{−iλ}) |²,    (32)

where P(z) = a0 + a1 z + · · · + ap zp , Q(z) = 1 + b1 z + · · · + bq zq .


If Q(z) has no zeros on {z : |z| = 1}, there is a white noise ε = (εn) such that
(P-a.s.)
ξn + b1 ξn−1 + · · · + bq ξn−q = a0 εn + a1 εn−1 + · · · + ap εn−p . (33)

Conversely, every stationary sequence ξ = (ξn ) that satisfies this equation with
some white noise ε = (εn ) and some polynomial Q(z) with no zeros on {z : |z| = 1}
has a spectral density (32).
In fact, let ηn = ξn + b1 ξn−1 + · · · + bq ξn−q . Then fη (λ) = (1/2π)|P(e−iλ )|2 ,
and the required representation follows from Corollary 5.
On the other hand, if (33) holds and Fξ (λ) and Fη (λ) are the spectral functions
of ξ and η, then
    Fη(λ) = ∫_{−π}^{λ} |Q(e^{−iν})|² dFξ(ν) = (1/2π) ∫_{−π}^{λ} |P(e^{−iν})|² dν.

Since |Q(e^{−iν})|² > 0, it follows that Fξ(λ) has a density defined by (32).
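The rational density (32) can be evaluated directly from the two polynomials. In the sketch below the coefficients are illustrative choices (not from the text); Q has its zero at z = 2, off the unit circle, so Corollary 6 applies:

```python
import cmath

# Hypothetical ARMA coefficients: P(z) = 1 + 0.4 z, Q(z) = 1 - 0.5 z, i.e.,
# the equation xi_n - 0.5 xi_{n-1} = eps_n + 0.4 eps_{n-1}.
a = [1.0, 0.4]     # coefficients of P
b = [1.0, -0.5]    # coefficients of Q (constant term 1)

def poly(coeffs, z):
    """Evaluate a polynomial with the given coefficients at a complex point z."""
    return sum(c * z ** k for k, c in enumerate(coeffs))

def f_xi(lam):
    """Rational spectral density (32): (1/2 pi) |P(e^{-i lam}) / Q(e^{-i lam})|^2."""
    z = cmath.exp(-1j * lam)
    return abs(poly(a, z) / poly(b, z)) ** 2 / (2 * cmath.pi)

# At lam = 0: |P(1)/Q(1)|^2 / (2 pi) = (1.4/0.5)^2 / (2 pi) ≈ 1.248
print(f_xi(0.0))
```

Since the zero of Q lies strictly outside the unit circle, the density is bounded on [−π, π).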

4. The following mean-square ergodic theorem can be thought of as an analog of


the law of large numbers for stationary (wide sense) random sequences.
Theorem 4. Let ξ = (ξn ), n ∈ Z, be a stationary sequence with E ξn = 0, covari-
ance function (1), and spectral representation (2). Then

    (1/n) Σ_{k=0}^{n−1} ξk → Z({0}) (in L²)    (34)

and

    (1/n) Σ_{k=0}^{n−1} R(k) → F({0}).    (35)
PROOF. By (2),
 
    (1/n) Σ_{k=0}^{n−1} ξk = ∫_{−π}^{π} ( (1/n) Σ_{k=0}^{n−1} e^{ikλ} ) Z(dλ) = ∫_{−π}^{π} ϕn(λ) Z(dλ),

where

    ϕn(λ) = (1/n) Σ_{k=0}^{n−1} e^{ikλ} = { 1, λ = 0;  (1/n)(e^{inλ} − 1)/(e^{iλ} − 1), λ ≠ 0. }    (36)
It is clear that |ϕn (λ)| ≤ 1.
Moreover, ϕn(λ) → I_{{0}}(λ) in L²(F), and therefore, by (14) of Sect. 2,

    ∫_{−π}^{π} ϕn(λ) Z(dλ) → ∫_{−π}^{π} I_{{0}}(λ) Z(dλ) = Z({0}) (in L²),

which establishes (34).


Relation (35) can be proved in a similar way.
This completes the proof of the theorem.

Corollary. If the spectral function is continuous at zero, i.e., F({0}) = 0, then
Z({0}) = 0 (P-a.s.) and, by (34) and (35),

    (1/n) Σ_{k=0}^{n−1} R(k) → 0  ⇒  (1/n) Σ_{k=0}^{n−1} ξk → 0 (in L²).

Since

    | (1/n) Σ_{k=0}^{n−1} R(k) |² = | E[ ( (1/n) Σ_{k=0}^{n−1} ξk ) ξ̄0 ] |² ≤ E |ξ0|² · E | (1/n) Σ_{k=0}^{n−1} ξk |²,

the converse implication also holds:

    (1/n) Σ_{k=0}^{n−1} ξk → 0 (in L²)  ⇒  (1/n) Σ_{k=0}^{n−1} R(k) → 0.

Therefore the condition (1/n) Σ_{k=0}^{n−1} R(k) → 0 is necessary and sufficient for the
convergence (in the mean-square sense) of the arithmetic means (1/n) Σ_{k=0}^{n−1} ξk to
zero. It follows that if the original sequence ξ = (ξn) has expectation m (that is,
E ξ0 = m), then

    (1/n) Σ_{k=0}^{n−1} R(k) → 0  ⇔  (1/n) Σ_{k=0}^{n−1} ξk → m (in L²),    (37)

where R(n) = E(ξn − E ξn )(ξ0 − E ξ0 ).


Let us also observe that if Z({0}) ≠ 0 with a positive probability and m = 0,
then ξn “contains a random constant α”:

    ξn = α + ηn,

where α = Z({0}) and the measure Zη = Zη(Δ) in the spectral representation
ηn = ∫_{−π}^{π} e^{iλn} Zη(dλ) is such that Zη({0}) = 0 (P-a.s.). Conclusion (34) means that
the arithmetic mean converges in mean square to precisely this random constant α.
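Relation (37) is easy to illustrate numerically. The sketch below uses a hypothetical MA(1) sequence with an illustrative coefficient 0.5 and mean m = 2; here R(k) = 0 for |k| > 1, so (1/n) Σ R(k) → 0 and the arithmetic means approach m:

```python
import random

# Illustrative MA(1) sequence xi_n = m + eps_n + 0.5 eps_{n-1} with iid
# Gaussian noise; its covariance vanishes for lags |k| > 1.
random.seed(0)
m, n = 2.0, 10000
eps = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
xi = [m + eps[k + 1] + 0.5 * eps[k] for k in range(n)]

mean = sum(xi) / n
print(mean)   # close to m = 2.0
```

The deviation of the arithmetic mean from m is of order sqrt((R(0) + 2R(1))/n), here about 0.015.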
5. PROBLEMS
1. Show that L02 (F) = L2 (F) (for the notation see the proof of Theorem 1).
2. Let ξ = (ξn ) be a stationary sequence with the property that ξn+N = ξn for some
N and all n. Show that the spectral representation of such a sequence reduces to
(13) of Sect. 1.
3. Let ξ = (ξn) be a stationary sequence such that E ξn = 0 and

    (1/N²) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} R(k − l) = (1/N) Σ_{|k|≤N−1} R(k) (1 − |k|/N) ≤ C N^{−α}

for some C > 0, α > 0. Use the Borel–Cantelli lemma to show that then

    (1/N) Σ_{k=0}^{N} ξk → 0 (P-a.s.).

4. Let the spectral density fξ(λ) of the sequence ξ = (ξn) be rational,

    fξ(λ) = (1/2π) |P_{n−1}(e^{−iλ})|² / |Q_n(e^{−iλ})|²,    (38)

where P_{n−1}(z) = a0 + a1z + · · · + a_{n−1}z^{n−1} and Q_n(z) = 1 + b1z + · · · + b_n z^n,
and no zeros of Q_n(z) lie on the unit circle.
Show that there is a white noise ε = (ε_m), m ∈ Z, such that the sequence
(ξ_m) is a component of an n-dimensional sequence (ξ_m^1, ξ_m^2, . . . , ξ_m^n), ξ_m^1 = ξ_m,
satisfying the system of equations

    ξ_{m+1}^i = ξ_m^{i+1} + β_i ε_{m+1},  i = 1, . . . , n − 1,
    ξ_{m+1}^n = − Σ_{j=0}^{n−1} b_{n−j} ξ_m^{j+1} + β_n ε_{m+1},    (39)

where β_1 = a0, β_i = a_{i−1} − Σ_{k=1}^{i−1} β_k b_{i−k}.

4. Statistical Estimation of Covariance Function


and Spectral Density

1. Problems of the statistical estimation of various characteristics of the probability


distributions of random sequences arise in the most diverse branches of science (e.g.,
geophysics, medicine, economics). The material presented in this section will give
the reader an idea of the concepts and methods of estimation and of the difficulties
that are encountered.

To begin with, let ξ = (ξn), n ∈ Z, be a sequence, stationary in the wide
sense (for simplicity, real), with expectation E ξn = m and covariance R(n) =
∫_{−π}^{π} e^{iλn} F(dλ).
Suppose we have the results x0, x1, . . . , x_{N−1} of observing the random variables
ξ0, ξ1, . . . , ξ_{N−1}. How are we then to construct a “good” estimator of the (unknown)
mean value m?
Let us set

    m_N(x) = (1/N) Σ_{k=0}^{N−1} x_k.    (1)

Then it follows from the elementary properties of the expectation that this is a
“good” estimator of m in the sense that “in the average over all possible realiza-
tions of data x0, . . . , x_{N−1}” it is unbiased, i.e.,

    E m_N(ξ) = (1/N) Σ_{k=0}^{N−1} E ξk = m.    (2)

In addition, it follows from Theorem 4 of Sect. 3 that when (1/N) Σ_{k=0}^{N−1} R(k) → 0,
N → ∞, our estimator is consistent (in mean square), i.e.,

    E |m_N(ξ) − m|² → 0, N → ∞.    (3)

Next we take up the problem of estimating the covariance function R(n), the
spectral function F(λ) = F([−π, λ]), and the spectral density f (λ), all under the
assumption that m = 0.

Since R(n) = E ξ_{n+k} ξk, it is natural to estimate this function on the basis of the N
observations x0, x1, . . . , x_{N−1} (when 0 ≤ n < N) by

    R̂_N(n; x) = (1/(N − n)) Σ_{k=0}^{N−n−1} x_{n+k} x_k.

It is clear that this estimator is unbiased in the sense that

E R̂N (n; ξ) = R(n), 0 ≤ n < N.
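The estimator R̂_N(n; x) is straightforward to compute. A minimal sketch, applied to simulated Gaussian white noise (an illustrative choice, for which R(0) = 1 and R(n) = 0 for n ≠ 0):

```python
import random

# Covariance estimator R_hat_N(n; x) = (1/(N - n)) sum_k x_{n+k} x_k,
# applied here to simulated white noise.
random.seed(1)
N = 20000
x = [random.gauss(0.0, 1.0) for _ in range(N)]

def R_hat(n, x):
    N = len(x)
    return sum(x[n + k] * x[k] for k in range(N - n)) / (N - n)

print(R_hat(0, x))   # close to R(0) = 1
print(R_hat(5, x))   # close to R(5) = 0
```

For this sample size the fluctuations of both values are of order 0.01.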

Let us now consider the question of its consistency. If we replace ξk in (37) of


Sect. 3 by ζk = ξn+k ξk and suppose that for each integer n the sequence ζ = (ζk )k∈Z
is wide-sense stationary (which implies, in particular, that E ξ04 < ∞), we find that
the condition

    (1/N) Σ_{k=0}^{N−1} E[ξ_{n+k}ξk − R(n)][ξn ξ0 − R(n)] → 0, N → ∞,    (4)

is necessary and sufficient for

E |R̂N (n; ξ) − R(n)|2 → 0, N → ∞. (5)

Let us suppose that the original sequence ξ = (ξn ) is Gaussian (with zero mean
and covariance R(n)). Then, proceeding analogously to (51) of Sect. 12, Chap. 2,
Vol. 1, we obtain

E[ξn+k ξk − R(n)][ξn ξ0 − R(n)] = E ξn+k ξk ξn ξ0 − R2 (n)


= E ξn+k ξk · E ξn ξ0 + E ξn+k ξn · E ξk ξ0
+ E ξn+k ξ0 · E ξk ξn − R2 (n)
= R2 (k) + R(n + k)R(n − k).

Therefore, in the Gaussian case, condition (4) is equivalent to

    (1/N) Σ_{k=0}^{N−1} [R²(k) + R(n + k)R(n − k)] → 0, N → ∞.    (6)

Since |R(n + k)R(n − k)| ≤ |R(n + k)|² + |R(n − k)|², the condition

    (1/N) Σ_{k=0}^{N−1} R²(k) → 0, N → ∞,    (7)

implies (6). Conversely, if (6) holds for n = 0, then (7) is satisfied.


We have now established the following theorem.

Theorem. Let ξ = (ξn ) be a Gaussian stationary sequence with E ξn = 0 and co-


variance function R(n). Then (7) is a necessary and sufficient condition that, for
every n ≥ 0, the estimator R̂N (n; x) is mean-square consistent (i.e., that (5) is satis-
fied).

Remark. If we use the spectral representation of the covariance function, we obtain


    (1/N) Σ_{k=0}^{N−1} R²(k) = ∫_{−π}^{π} ∫_{−π}^{π} (1/N) Σ_{k=0}^{N−1} e^{i(λ−ν)k} F(dλ) F(dν)
        = ∫_{−π}^{π} ∫_{−π}^{π} f_N(λ, ν) F(dλ) F(dν),

where (cf. (36) of Sect. 3)

    f_N(λ, ν) = { 1, λ = ν;  (1 − e^{i(λ−ν)N}) / (N[1 − e^{i(λ−ν)}]), λ ≠ ν. }

But as N → ∞,

    f_N(λ, ν) → f(λ, ν) = { 1, λ = ν;  0, λ ≠ ν. }

Therefore

    (1/N) Σ_{k=0}^{N−1} R²(k) → ∫_{−π}^{π} ∫_{−π}^{π} f(λ, ν) F(dλ) F(dν) = ∫_{−π}^{π} F({λ}) F(dλ) = Σ_λ F²({λ}),

where the sum over λ contains at most a countable number of terms since the mea-
sure F is finite.
Hence (7) is equivalent to

    Σ_λ F²({λ}) = 0,    (8)

which means that the spectral function F(λ) = F([−π, λ]) is continuous.
2. We now turn to the problem of finding estimators for the spectral function F(λ)
and the spectral density f (λ) (under the assumption that they exist).
A method that naturally suggests itself for estimating the spectral density follows
from the proof of Herglotz’s theorem that we gave earlier. Recall that the function

    f_N(λ) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R(n) e^{−iλn}    (9)

introduced in Sect. 1 has the property that the function

    F_N(λ) = ∫_{−π}^{λ} f_N(ν) dν

converges on the whole to the spectral function F(λ). Therefore, if F(λ) has a den-
sity f(λ), then we have

    ∫_{−π}^{λ} f_N(ν) dν → ∫_{−π}^{λ} f(ν) dν    (10)

for each λ ∈ [−π, π).


Starting from these facts and recalling that an estimator for R(n) (on the basis of
the observations x0, x1, . . . , x_{N−1}) is R̂_N(n; x), we take as an estimator for f(λ) the
function

    f̂_N(λ; x) = (1/2π) Σ_{|n|<N} (1 − |n|/N) R̂_N(n; x) e^{−iλn},    (11)

setting R̂_N(n; x) = R̂_N(|n|; x) for |n| < N.



The function f̂_N(λ; x) is known as a periodogram. It is easily verified that it can
also be represented in the following more convenient form:

    f̂_N(λ; x) = (1/(2πN)) | Σ_{n=0}^{N−1} x_n e^{−iλn} |².    (12)
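The equivalence of (11) and (12) can be checked numerically on arbitrary data; the sketch below compares the two expressions at one frequency (the data are simulated and purely illustrative):

```python
import cmath, random

# Compare the two expressions (11) and (12) for the periodogram.
random.seed(2)
N = 64
x = [random.gauss(0.0, 1.0) for _ in range(N)]

def R_hat(n):
    """Covariance estimator for 0 <= n < N."""
    return sum(x[n + k] * x[k] for k in range(N - n)) / (N - n)

def periodogram_11(lam):
    # (1/2 pi) sum_{|n|<N} (1 - |n|/N) R_hat(|n|) e^{-i lam n}; for real data
    # the terms n and -n combine into 2 cos(lam n).
    s = R_hat(0) / (2 * cmath.pi)
    for n in range(1, N):
        w = (1 - n / N) * R_hat(n) / (2 * cmath.pi)
        s += w * (cmath.exp(-1j * lam * n) + cmath.exp(1j * lam * n)).real
    return s

def periodogram_12(lam):
    # (1/(2 pi N)) |sum_n x_n e^{-i lam n}|^2
    return abs(sum(x[n] * cmath.exp(-1j * lam * n) for n in range(N))) ** 2 / (2 * cmath.pi * N)

lam = 0.7
print(abs(periodogram_11(lam) - periodogram_12(lam)))   # ~ 0: the two forms agree
```

The agreement is an exact algebraic identity, not an approximation, so the difference is at the level of rounding error.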

Since E R̂N (n; ξ) = R(n), |n| < N, we have

E f̂N (λ; ξ) = fN (λ).

If the spectral function F(λ) has density f(λ), then, since f_N(λ) can also be written
in the form (34) of Sect. 1, we find that

    f_N(λ) = (1/(2πN)) Σ_{k=0}^{N−1} Σ_{l=0}^{N−1} ∫_{−π}^{π} e^{iν(k−l)} e^{iλ(l−k)} f(ν) dν
        = ∫_{−π}^{π} (1/(2πN)) | Σ_{k=0}^{N−1} e^{i(ν−λ)k} |² f(ν) dν.

The function

    Φ_N(λ) = (1/(2πN)) | Σ_{k=0}^{N−1} e^{iλk} |² = (1/(2πN)) | sin(λN/2) / sin(λ/2) |²

is the Fejér kernel. It is known, from the properties of this function, that for almost
every λ (with respect to Lebesgue measure)
    ∫_{−π}^{π} Φ_N(λ − ν) f(ν) dν → f(λ).    (13)

Therefore, for almost every λ ∈ [−π, π),

E f̂N (λ; ξ) → f (λ); (14)

in other words, the estimator f̂N (λ; x) of f (λ) on the basis of x0 , x1 , . . . , xN−1 is
asymptotically unbiased.

In this sense, the estimator f̂N (λ; x) could be considered “good.” However, at
the individual observed values x0 , . . . , xN−1 the values of the periodogram f̂N (λ; x)
usually turn out to be far from the actual values f (λ). In fact, let ξ = (ξn ) be a sta-
tionary sequence of independent Gaussian random variables, ξn ∼ N (0, 1). Then
f(λ) ≡ 1/2π and

    f̂_N(λ; ξ) = (1/2π) | (1/√N) Σ_{k=0}^{N−1} ξk e^{−iλk} |².

Therefore for λ = 0 we have that 2π f̂N (0, ξ) coincides in distribution with the square
of the Gaussian random variable η ∼ N (0, 1). Hence, for every N,

    E |f̂_N(0; ξ) − f(0)|² = (1/(4π²)) E |η² − 1|² > 0.
Moreover, an easy calculation shows that if f(λ) is the spectral density of a station-
ary sequence ξ = (ξn) that is constructed as a moving average:

    ξn = Σ_{k=0}^{∞} ak ε_{n−k}    (15)

with Σ_{k=0}^{∞} |ak| < ∞, Σ_{k=0}^{∞} |ak|² < ∞, where ε = (εn) is white noise with E ε0⁴ < ∞,
then

    lim_{N→∞} E |f̂_N(λ; ξ) − f(λ)|² = { 2f²(0), λ = 0, ±π;  f²(λ), λ ≠ 0, ±π. }    (16)
Hence it is clear that the periodogram cannot be a satisfactory estimator of the
spectral density. To improve the situation, one often uses an estimator for f (λ) of
the form

    f̂_N^W(λ; x) = ∫_{−π}^{π} W_N(λ − ν) f̂_N(ν; x) dν,    (17)

which is obtained from the periodogram f̂N (λ; x) by means of a smoothing function
WN (λ), which we call a spectral window. Natural requirements on WN (λ) are as
follows:
(a) W_N(λ) has a sharp maximum at λ = 0;
(b) ∫_{−π}^{π} W_N(λ) dλ = 1;
(c) E |f̂_N^W(λ; ξ) − f(λ)|² → 0, N → ∞, λ ∈ [−π, π).
By (14) and (b), the estimators f̂NW (λ; ξ) are asymptotically unbiased. Condition (c)
is the condition of consistency in mean square, which, as we showed above, is vio-
lated for the periodogram. Finally, condition (a) ensures that the required frequency
λ is “picked out” from the periodogram.
Let us give some examples of estimators of the form (17).
Bartlett’s estimator is based on the spectral window

    W_N(λ) = a_N B(a_N λ),

where a_N ↑ ∞, a_N/N → 0, N → ∞, and

    B(λ) = (1/2π) | sin(λ/2) / (λ/2) |².

Parzen’s estimator uses the spectral window

    W_N(λ) = a_N P(a_N λ),

where the a_N are the same as before and

    P(λ) = (3/8π) | sin(λ/4) / (λ/4) |⁴.

Zhurbenko’s estimator is constructed from a spectral window of the form

    W_N(λ) = a_N Z(a_N λ)

with

    Z(λ) = { −((α + 1)/(2α)) |λ|^α + (α + 1)/(2α), |λ| ≤ 1;  0, |λ| > 1, }

where 0 < α ≤ 2 and the a_N are selected in a particular way.
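As a numerical sketch of the smoothed estimator (17), the following code convolves the periodogram of simulated white noise with Bartlett's window, approximating the integral by a Riemann sum (the choices a_N = 16 and the grid size are illustrative):

```python
import math, cmath, random

random.seed(3)
N = 256
x = [random.gauss(0.0, 1.0) for _ in range(N)]  # white noise, true f(lam) = 1/(2 pi)

def periodogram(lam):
    # Eq. (12)
    s = sum(x[n] * cmath.exp(-1j * lam * n) for n in range(N))
    return abs(s) ** 2 / (2 * math.pi * N)

def bartlett_B(lam):
    # B(lam) = (1/2 pi)(sin(lam/2)/(lam/2))^2
    if lam == 0.0:
        return 1 / (2 * math.pi)
    return (math.sin(lam / 2) / (lam / 2)) ** 2 / (2 * math.pi)

def smoothed(lam, a_N, grid=400):
    # Riemann-sum approximation of (17) with W_N(lam) = a_N B(a_N lam)
    h = 2 * math.pi / grid
    total = 0.0
    for j in range(grid):
        nu = -math.pi + (j + 0.5) * h
        total += a_N * bartlett_B(a_N * (lam - nu)) * periodogram(nu) * h
    return total

print(smoothed(0.0, 16))   # near the true value 1/(2 pi) ≈ 0.159
```

Averaging over the window's bandwidth reduces the variance of the raw periodogram, which is exactly the effect that condition (c) requires.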
We shall not spend any more time on problems of estimating spectral densities;
we merely note that there is an extensive statistical literature dealing with the con-
struction of spectral windows and the comparison of the corresponding estimators
f̂NW (λ; x). (See, e.g., [36, 37, 38].)

3. We now consider the problem of estimating the spectral function F(λ) =


F([−π, λ]). We begin by defining

    F_N(λ) = ∫_{−π}^{λ} f_N(ν) dν,  F̂_N(λ; x) = ∫_{−π}^{λ} f̂_N(ν; x) dν,

where f̂N (ν; x) is the periodogram constructed with (x0 , x1 , . . . , xN−1 ).


It follows from the proof of Herglotz’s theorem (Sect. 1) that

    ∫_{−π}^{π} e^{iλn} dF_N(λ) → ∫_{−π}^{π} e^{iλn} dF(λ)

for every n ∈ Z. Hence it follows (cf. corollary to Theorem 1 of Sect. 3, Chap. 3,


Vol. 1) that FN ⇒ F, i.e., FN (λ) converges to F(λ) at each point of continuity of
F(λ).
Observe that

    ∫_{−π}^{π} e^{iλn} dF̂_N(λ; ξ) = R̂_N(n; ξ) (1 − |n|/N)
for all |n| < N. Therefore, if we suppose that R̂N (n; ξ) converges to R(n) with
probability 1 (or in mean square) as N → ∞, we have
 π  π
e dF̂N (λ; ξ) →
iλn
eiλn dF(λ) (P-a.s.)
−π −π

and therefore F̂N (λ; ξ) ⇒ F(λ) (P-a.s.) (or in mean square).


It is then easy to deduce (if necessary, passing from a sequence to a subsequence)
that if R̂N (n; ξ) → R(n) in probability, then F̂N (λ; ξ) ⇒ F(λ) in probability.
4. PROBLEMS
1. In (15) let εn ∼ N(0, 1). Show that

    (N − |n|) Var R̂_N(n; ξ) → 2π ∫_{−π}^{π} (1 + e^{2inλ}) f²(λ) dλ

for every n, as N → ∞.

2. Establish (16) and the following generalization:

    lim_{N→∞} Cov(f̂_N(λ; ξ), f̂_N(ν; ξ)) = { 2f²(0), λ = ν = 0, ±π;  f²(λ), λ = ν ≠ 0, ±π;  0, λ ≠ ν. }

5. Wold’s Expansion

1. In contrast to representation (2) of Sect. 3, which gives an expansion of a sta-


tionary sequence in the frequency domain, Wold’s expansion operates in the time
domain. The main point of this expansion is that a stationary sequence ξ = (ξn ),
n ∈ Z, can be represented as the sum of two stationary sequences, one of which is
completely predictable (in the sense that its values are completely determined by its
“past”), whereas the second does not have this property.
We begin with some notation. Let Hn(ξ) = L²(ξⁿ) and H(ξ) = L²(ξ) be the
closed linear manifolds, spanned respectively by ξⁿ = (. . . , ξ_{n−1}, ξn) and ξ =
(. . . , ξ_{n−1}, ξn, . . .). Let

    S(ξ) = ⋂_n Hn(ξ).

For every η ∈ H(ξ), denote by

π̂n (η) = Ê(η | Hn (ξ))

the projection of η on the subspace Hn (ξ) (Sect. 11, Chap. 2, Vol. 1). We also write

π̂−∞ (η) = Ê(η | S(ξ)).

Every element η ∈ H(ξ) can be represented as


η = π̂−∞ (η) + (η − π̂−∞ (η)),

where η − π̂−∞ (η) ⊥ π̂−∞ (η). Therefore H(ξ) is represented as the orthogonal
sum
H(ξ) = S(ξ) ⊕ R(ξ),
where S(ξ) consists of the elements π̂−∞ (η) with η ∈ H(ξ), and R(ξ) consists of
the elements of the form η − π̂−∞ (η).
We shall now assume that E ξn = 0 and Var ξn > 0. Then H(ξ) is automatically
nontrivial (contains elements different from zero).
Definition 1. A stationary sequence ξ = (ξn ) is regular if

H(ξ) = R(ξ)

and singular if
H(ξ) = S(ξ).

Remark 1. Singular sequences are also called deterministic and regular sequences
are called purely or completely nondeterministic. If S(ξ) is a proper subspace of
H(ξ), we just say that ξ is nondeterministic.
Theorem 1. Every stationary (wide sense) random sequence ξ has a unique decom-
position,
ξn = ξnr + ξns , (1)
where ξ r = (ξnr ) is regular and ξ s = (ξns ) is singular. Here ξ r and ξ s are orthogonal
(ξnr ⊥ ξms for all n and m).
PROOF. We define
ξns = Ê(ξn | S(ξ)), ξnr = ξn − ξns .
Since ξnr ⊥ S(ξ) for every n, we have S(ξ r ) ⊥ S(ξ). On the other hand, S(ξ r ) ⊆ S(ξ),
and therefore S(ξ r ) is trivial (contains only random sequences that coincide almost
surely with zero). Consequently, ξ r is regular.
Moreover, Hn (ξ) ⊆ Hn (ξ s ) ⊕ Hn (ξ r ) and Hn (ξ s ) ⊆ Hn (ξ), Hn (ξ r ) ⊆ Hn (ξ).
Therefore Hn (ξ) = Hn (ξ s ) ⊕ Hn (ξ r ), and hence

S(ξ) ⊆ Hn (ξ s ) ⊕ Hn (ξ r ) (2)

for every n. Since ξnr ⊥ S(ξ), it follows from (2) that

S(ξ) ⊆ Hn (ξ s ),

and therefore S(ξ) ⊆ S(ξ s ) ⊆ H(ξ s ). But ξns ∈ S(ξ); hence H(ξ s ) ⊆ S(ξ), and
consequently
S(ξ) = S(ξ s ) = H(ξ s ),
which means that ξ s is singular.
The orthogonality of ξ^s and ξ^r follows in an obvious way from ξn^s ∈ S(ξ) and
ξn^r ⊥ S(ξ).

This completes the proof of the theorem.




Remark 2. Decomposition (1) into regular and singular parts is unique (Problem 4).
2. Definition 2. Let ξ = (ξn ) be a nondegenerate stationary sequence. A random
sequence ε = (εn ) is an innovation sequence (for ξ) if
(a) ε = (εn ) consists of pairwise orthogonal random variables with E εn = 0,
E |εn |2 = 1;
(b) Hn (ξ) = Hn (ε) for all n ∈ Z.
Remark 3. The reason for the term “innovation” is that εn+1 provides, so to speak,
new “information” not contained in Hn (ξ) (in other words, “innovates” in Hn (ξ) the
information that is needed for forming Hn+1 (ξ)).
The following important theorem establishes a connection between one-sided
moving averages (Example 4 in Sect. 1) and regular sequences.

Theorem 2. A necessary and sufficient condition for a nondegenerate sequence ξ to


be regular is that there are an innovation sequence ε = (εn) and a sequence (an) of
complex numbers, n ≥ 0, with Σ_{n=0}^{∞} |an|² < ∞, such that

    ξn = Σ_{k=0}^{∞} ak ε_{n−k} (P-a.s.).    (3)

PROOF. Necessity. We represent Hn (ξ) in the form

Hn (ξ) = Hn−1 (ξ) ⊕ Bn .

Since Hn (ξ) is spanned by elements of Hn−1 (ξ) and elements of the form βξn , where
β is a complex number, the dimension of Bn is either zero or one. But the space
Hn (ξ) is different from Hn−1 (ξ) for any value of n. In fact, if Bn is trivial for some
n, then, by stationarity, Bk is trivial for all k, hence H(ξ) = S(ξ), contradicting the
assumption that ξ is regular. Thus, Bn has the dimension dim Bn = 1.
Let ηn be a nonzero element of Bn. Set

    εn = ηn / ‖ηn‖,

where ‖ηn‖² = E |ηn|² > 0.


For given n and k ≥ 0, consider the decomposition

Hn (ξ) = Hn−k (ξ) ⊕ Bn−k+1 ⊕ · · · ⊕ Bn .

Then ε_{n−k}, . . . , εn is an orthogonal basis in B_{n−k+1} ⊕ · · · ⊕ Bn and

    ξn = Σ_{j=0}^{k−1} aj ε_{n−j} + π̂_{n−k}(ξn),    (4)

where aj = E ξn ε̄_{n−j}.
By Bessel’s inequality (6), Sect. 11, Chap. 2, Vol. 1,

    Σ_{j=0}^{∞} |aj|² ≤ ‖ξn‖² < ∞.

It follows that Σ_{j=0}^{∞} aj ε_{n−j} converges in mean square, and then, by (4), Eq. (3) will
be established as soon as we show that π̂_{n−k}(ξn) → 0 in L², k → ∞.
It is enough to consider the case n = 0. Let π̂_i = π̂_i(ξ0). Since

    π̂_{−k} = π̂_0 + Σ_{i=1}^{k} [π̂_{−i} − π̂_{−i+1}],

and the terms that appear in this sum are orthogonal, we have for every k ≥ 0

    Σ_{i=1}^{k} ‖π̂_{−i} − π̂_{−i+1}‖² = ‖ Σ_{i=1}^{k} (π̂_{−i} − π̂_{−i+1}) ‖² = ‖π̂_{−k} − π̂_0‖² ≤ 4‖ξ0‖² < ∞.

Therefore the limit lim_{k→∞} π̂_{−k} exists (in mean square). Now π̂_{−k} ∈ H_{−k}(ξ) for
each k, and therefore the limit in question must belong to ⋂_{k≥0} H_{−k}(ξ) = S(ξ).
But, by assumption, S(ξ) is trivial, and therefore π̂_{−k} → 0 in L², k → ∞.
Sufficiency. Let the nondegenerate sequence ξ have a representation (3), where
ε = (εn) is an orthonormal system (not necessarily satisfying the condition Hn(ξ) =
Hn(ε), n ∈ Z). Then Hn(ξ) ⊆ Hn(ε), and therefore S(ξ) = ⋂_k Hk(ξ) ⊆ Hn(ε)
for every n. But ε_{n+1} ⊥ Hn(ε), and therefore ε_{n+1} ⊥ S(ξ), and at the same time
ε = (εn) is a basis in H(ξ). It follows that S(ξ) is trivial, and consequently ξ is
regular.
This completes the proof of the theorem.

Remark 4. It follows from the proof that a nondegenerate sequence ξ is regular if


and only if it admits a representation as a one-sided moving average,

    ξn = Σ_{k=0}^{∞} ãk ε̃_{n−k},    (5)

where ε̃ = ε̃n is an orthonormal system (see the definition in Example 4 of Sect. 1).
In this sense, the conclusion of Theorem 2 says more, specifically that for a regular
sequence ξ there exist a = (an ) and an orthonormal system ε = (εn ) such that not
only (5) but also (3) is satisfied, with Hn (ξ) = Hn (ε), n ∈ Z.

The following theorem is an immediate corollary of Theorems 1 and 2.

Theorem 3 (Wold’s Expansion). If ξ = (ξn) is a nondegenerate stationary sequence,
then

    ξn = ξn^s + Σ_{k=0}^{∞} ak ε_{n−k},    (6)

where Σ_{k=0}^{∞} |ak|² < ∞ and ε = (εn) is an innovation sequence (for ξ^r).

3. The significance of the concepts introduced here (regular and singular sequences)
becomes particularly clear if we consider the following (linear) extrapolation prob-
lem, for whose solution the Wold expansion (6) is especially useful.
Let H0(ξ) = L²(ξ⁰) be the closed linear manifold spanned by the variables
ξ⁰ = (. . . , ξ_{−1}, ξ0). Consider the problem of constructing an optimal (least-squares)
linear estimator ξ̂n of ξn in terms of the “past” ξ⁰ = (. . . , ξ_{−1}, ξ0).
It follows from Sect. 11, Chap. 2, Vol. 1, that

ξˆn = Ê(ξn | H0 (ξ)). (7)



(In the notation of Subsection 1, ξ̂n = π̂_0(ξn).) Since ξ^r and ξ^s are orthogonal and
H0(ξ) = H0(ξ^r) ⊕ H0(ξ^s), we obtain, by using (6),

    ξ̂n = Ê(ξn^s + ξn^r | H0(ξ)) = Ê(ξn^s | H0(ξ)) + Ê(ξn^r | H0(ξ))
        = Ê(ξn^s | H0(ξ^r) ⊕ H0(ξ^s)) + Ê(ξn^r | H0(ξ^r) ⊕ H0(ξ^s))
        = Ê(ξn^s | H0(ξ^s)) + Ê(ξn^r | H0(ξ^r))
        = ξn^s + Ê( Σ_{k=0}^{∞} ak ε_{n−k} | H0(ξ^r) ).

In (6), the sequence ε = (εn) is an innovation sequence for ξ^r = (ξn^r), and therefore
H0(ξ^r) = H0(ε). Therefore

    ξ̂n = ξn^s + Ê( Σ_{k=0}^{∞} ak ε_{n−k} | H0(ε) ) = ξn^s + Σ_{k=n}^{∞} ak ε_{n−k}    (8)

and the mean-square error of predicting ξn by ξ⁰ = (. . . , ξ_{−1}, ξ0) is

    σn² = E |ξn − ξ̂n|² = Σ_{k=0}^{n−1} |ak|².    (9)

We can draw two important conclusions.


(a) If ξ is singular, then for every n ≥ 1 the error (in the extrapolation) σn²
    is zero; in other words, we can predict ξn without error from its “past”
    ξ⁰ = (. . . , ξ_{−1}, ξ0).
(b) If ξ is regular, then σn² ≤ σ_{n+1}² and

    lim_{n→∞} σn² = Σ_{k=0}^{∞} |ak|².    (10)

Since

    Σ_{k=0}^{∞} |ak|² = E |ξn|²,

it follows from (10) and (9) that

    ξ̂n → 0 in L², n → ∞,

i.e., as n increases, the prediction of ξn in terms of ξ⁰ = (. . . , ξ_{−1}, ξ0) becomes
trivial (reducing simply to E ξn = 0).

4. Let us suppose that ξ is a nondegenerate regular stationary sequence. According


to Theorem 2, every such sequence admits a representation as a one-sided moving
average,


    ξn = Σ_{k=0}^{∞} ak ε_{n−k},    (11)

where Σ_{k=0}^{∞} |ak|² < ∞, and the orthonormal sequence ε = (εn) has the important
property that

    Hn(ξ) = Hn(ε), n ∈ Z.    (12)
The representation (11) means (Subsection 3, Sect. 3) that ξn can be interpreted as
the output signal of a physically realizable filter with impulse response a = (ak ),
k ≥ 0, when the input is ε = (εn ).
Like any sequence of two-sided moving averages, a regular sequence has a spec-
tral density f (λ). But since a regular sequence admits a representation as a one-sided
moving average, it is possible to obtain additional information about the properties
of the spectral density.
In the first place, it is clear that

    f(λ) = (1/2π) |ϕ(λ)|²,

where

    ϕ(λ) = Σ_{k=0}^{∞} e^{−iλk} ak,  Σ_{k=0}^{∞} |ak|² < ∞.    (13)

Set

    Φ(z) = Σ_{k=0}^{∞} ak z^k.    (14)

This function is analytic in the open domain |z| < 1, and since Σ_{k=0}^{∞} |ak|² < ∞, it
belongs to the Hardy class H², the class of functions g = g(z), analytic in |z| < 1,
satisfying

    sup_{0≤r<1} (1/2π) ∫_{−π}^{π} |g(re^{iθ})|² dθ < ∞.    (15)

In fact,

    (1/2π) ∫_{−π}^{π} |Φ(re^{iθ})|² dθ = Σ_{k=0}^{∞} |ak|² r^{2k}

and

    sup_{0≤r<1} Σ_{k=0}^{∞} |ak|² r^{2k} ≤ Σ_{k=0}^{∞} |ak|² < ∞.

It is shown in the theory of functions of a complex variable (e.g., [64]) that the
boundary function Φ(e^{iλ}), −π ≤ λ < π, of a function Φ ∈ H², not identically zero,
has the property that

    ∫_{−π}^{π} log |Φ(e^{−iλ})| dλ > −∞.    (16)

In our case,

    f(λ) = (1/2π) |Φ(e^{−iλ})|²,

where Φ ∈ H². Therefore

    log f(λ) = − log 2π + 2 log |Φ(e^{−iλ})|,

and consequently the spectral density f(λ) of a regular process satisfies

    ∫_{−π}^{π} log f(λ) dλ > −∞.    (17)

On the other hand, let the spectral density f(λ) satisfy (17). It again follows
from the theory of functions of a complex variable that there is then a function
Φ(z) = Σ_{k=0}^{∞} ak z^k in the Hardy class H² such that (almost everywhere with respect
to Lebesgue measure)

    f(λ) = (1/2π) |Φ(e^{−iλ})|².

Therefore, if we set ϕ(λ) = Φ(e^{−iλ}), we obtain

    f(λ) = (1/2π) |ϕ(λ)|²,
where ϕ(λ) is given by (13). Then it follows from Corollary 5, Sect. 3, that ξ ad-
mits a representation as a one-sided moving average (11), where ε = (εn ) is an
orthonormal sequence. From this and from Remark 4 it follows that ξ is regular.
Thus, we have the following theorem.

Theorem 4 (Kolmogorov). Let ξ be a nondegenerate regular stationary sequence.
Then there is a spectral density f(λ) such that

    ∫_{−π}^{π} log f(λ) dλ > −∞.    (18)


Conversely, if ξ is a stationary sequence with a spectral density satisfying (18),
the sequence is regular.
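Condition (18) is easy to check numerically for a given density. The sketch below integrates log f for the moving-average density f(λ) = (1/2π)|1 + b e^{−iλ}|² with the illustrative value b = 0.5; by Jensen's formula the exact value of the integral is −2π log 2π, which is finite, so such a sequence is regular:

```python
import math

# Spectral density of an MA(1)-type sequence (illustrative coefficient b):
# f(lam) = (1/2 pi) |1 + b e^{-i lam}|^2 = (1 + 2 b cos(lam) + b^2) / (2 pi).
b = 0.5

def f(lam):
    return (1 + 2 * b * math.cos(lam) + b * b) / (2 * math.pi)

# Midpoint-rule approximation of the integral of log f over [-pi, pi).
grid = 100000
h = 2 * math.pi / grid
integral = sum(math.log(f(-math.pi + (j + 0.5) * h)) * h for j in range(grid))
print(integral)   # finite; for |b| < 1 it equals -2 pi log(2 pi) ≈ -11.55
```

If instead f vanished on a set of positive measure, log f would be −∞ there, the integral would diverge, and the sequence could not be regular.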

5. PROBLEMS
1. Show that a stationary sequence with discrete spectrum (piecewise-constant
spectral function F(λ)) is singular.
2. Let σn2 = E |ξn − ξˆn |2 , ξˆn = Ê(ξn | H0 (ξ)). Show that if σn2 = 0 for some n ≥ 1,
the sequence is singular; if σn2 → R(0) as n → ∞, the sequence is regular.
3. Show that the stationary sequence ξ = (ξn ), ξn = einϕ , where ϕ is a uniform
random variable on [0, 2π], is regular. Find the estimator ξˆn and its mean-square
error σn2 , and show that the nonlinear estimator
    ξ̃n = (ξ0 / ξ_{−1})^n

provides an error-free prediction of ξn by the “past” ξ 0 = (. . . , ξ−1 , ξ0 ), i.e.,

E |ξ˜n − ξn |2 = 0, n ≥ 1.

4. Prove that decomposition (1) into regular and singular components is unique.

6. Extrapolation, Interpolation, and Filtering

1. Extrapolation. According to the preceding section, a singular sequence admits
an error-free prediction (extrapolation) of ξn, n ≥ 1, in terms of the “past,”
ξ⁰ = (. . . , ξ_{−1}, ξ0). Consequently, it is reasonable, when considering the problem
of extrapolation for arbitrary stationary sequences, to begin with the case of regular
sequences.
According to Theorem 2 of Sect. 5, every regular sequence ξ = (ξn) admits a
representation as a one-sided moving average,

    ξn = Σ_{k=0}^{∞} ak ε_{n−k}    (1)
with Σ_{k=0}^{∞} |ak|² < ∞ and some innovation sequence ε = (εn). It follows from
Sect. 5 that the representation (1) solves the problem of finding the optimal (linear)
estimator ξ̂n = Ê(ξn | H0(ξ)) since, by (8) of Sect. 5,

    ξ̂n = Σ_{k=n}^{∞} ak ε_{n−k}    (2)

and

    σn² = E |ξn − ξ̂n|² = Σ_{k=0}^{n−1} |ak|².    (3)

However, this can be considered only a theoretical solution, for the following rea-
sons.
The sequences that we consider are ordinarily not given to us by means of their
representations (1), but by their covariance functions R(n) or the spectral densities
f (λ) (which exist for regular sequences). Hence a solution (2) can only be regarded
as satisfactory if the coefficients ak are given in terms of R(n) or of f (λ), and the εk
in terms of . . . , ξk−1 , ξk .
Without discussing the problem in general, we consider only the special case (of
interest in applications) when the spectral density has the form
    f(λ) = (1/2π) |Φ(e^{−iλ})|²,   (4)

86 6 Stationary (Wide Sense) Random Sequences: L2 -Theory
where Φ(z) = Σ_{k=0}^∞ b_k z^k has a radius of convergence r > 1 and no zeros in |z| ≤ 1.
Let

    ξn = ∫_{−π}^{π} e^{iλn} Z(dλ)   (5)

be the spectral representation of ξ = (ξn), n ∈ Z.

Theorem 1. If the spectral density of ξ has the form (4), then the optimal (linear)
estimator ξˆn of ξn in terms of ξ 0 = (. . . , ξ−1 , ξ0 ) is given by

    ξ̂n = ∫_{−π}^{π} ϕ̂n(λ) Z(dλ),   (6)

where

    ϕ̂n(λ) = e^{iλn} Φn(e^{−iλ})/Φ(e^{−iλ})   (7)

and

    Φn(z) = Σ_{k=n}^∞ b_k z^k.


PROOF. According to Remark 4 on Theorem 2 of Sect. 3, every variable ξ̃n ∈ H0(ξ) admits a representation in the form

    ξ̃n = ∫_{−π}^{π} ϕ̃n(λ) Z(dλ),  ϕ̃n ∈ H0(F),   (8)


where H0(F) is the closed linear manifold spanned by the functions e_n = e^{iλn} for n ≤ 0 (here F(λ) = ∫_{−π}^{λ} f(ν) dν).
Since (Sect. 2)

    E|ξn − ξ̃n|² = E|∫_{−π}^{π} (e^{iλn} − ϕ̃n(λ)) Z(dλ)|²
                 = ∫_{−π}^{π} |e^{iλn} − ϕ̃n(λ)|² f(λ) dλ,


the proof that (6) is optimal reduces to proving that

    inf_{ϕ̃n ∈ H0(F)} ∫_{−π}^{π} |e^{iλn} − ϕ̃n(λ)|² f(λ) dλ = ∫_{−π}^{π} |e^{iλn} − ϕ̂n(λ)|² f(λ) dλ.   (9)


It follows from Hilbert-space theory (Sect. 11, Chap. 2, Vol. 1) that the optimal
function ϕ̂n (λ) (in the sense of (9)) is determined by the two conditions

(i) ϕ̂n (λ) ∈ H0 (F),


(10)
(ii) eiλn − ϕ̂n (λ) ⊥ H0 (F).
Since

    e^{iλn} Φn(e^{−iλ}) = e^{iλn} [b_n e^{−iλn} + b_{n+1} e^{−iλ(n+1)} + ⋯] ∈ H0(F)

and, in a similar way, 1/Φ(e^{−iλ}) ∈ H0(F), the function ϕ̂n(λ) defined in (7) belongs
to H0 (F). Therefore in proving that ϕ̂n (λ) is optimal, it is sufficient to verify that,
for every m ≥ 0,
    e^{iλn} − ϕ̂n(λ) ⊥ e^{−iλm},

i.e.,

    I_{n,m} ≡ ∫_{−π}^{π} [e^{iλn} − ϕ̂n(λ)] e^{iλm} f(λ) dλ = 0,  m ≥ 0.

The following chain of equations shows that this is actually the case:

    I_{n,m} = (1/2π) ∫_{−π}^{π} e^{iλ(n+m)} [1 − Φn(e^{−iλ})/Φ(e^{−iλ})] |Φ(e^{−iλ})|² dλ
            = (1/2π) ∫_{−π}^{π} e^{iλ(n+m)} [Φ(e^{−iλ}) − Φn(e^{−iλ})] \overline{Φ(e^{−iλ})} dλ
            = (1/2π) ∫_{−π}^{π} e^{iλ(n+m)} (Σ_{k=0}^{n−1} b_k e^{−iλk}) (Σ_{l=0}^∞ b̄_l e^{iλl}) dλ
            = (1/2π) ∫_{−π}^{π} e^{iλm} (Σ_{k=0}^{n−1} b_k e^{iλ(n−k)}) (Σ_{l=0}^∞ b̄_l e^{iλl}) dλ = 0,

where the last equation follows because, for m ≥ 0 and r ≥ 1,

    ∫_{−π}^{π} e^{iλm} e^{iλr} dλ = 0.


This completes the proof of the theorem.
Remark 1. Expanding ϕ̂n(λ) in a Fourier series

    ϕ̂n(λ) = C0 + C−1 e^{−iλ} + C−2 e^{−2iλ} + ⋯ ,

we find that the predicted value ξ̂n of ξn, n ≥ 1, in terms of the past, ξ0 = (. . . , ξ−1, ξ0), is given by the formula

    ξ̂n = C0 ξ0 + C−1 ξ−1 + C−2 ξ−2 + ⋯ .

Remark 2. A typical example of a spectral density represented in the form (4) is the rational function

    f(λ) = (1/2π) |P(e^{−iλ})/Q(e^{−iλ})|²,

where the polynomials P(z) = a0 + a1 z + ⋯ + a_p z^p and Q(z) = 1 + b1 z + ⋯ + b_q z^q have no zeros in {z : |z| ≤ 1}.
In fact, in this case it is enough to set Φ(z) = P(z)/Q(z). Then Φ(z) = Σ_{k=0}^∞ C_k z^k, and the radius of convergence of this series is greater than one.
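To make the expansion concrete, here is a small sketch of ours (the function name and the test polynomials are our own choices): the coefficients C_k of Φ(z) = P(z)/Q(z) can be obtained by formal power-series division, and when Q has no zeros in |z| ≤ 1 they decay geometrically.

```python
def series_div(P, Q, n_terms):
    """Taylor coefficients of P(z)/Q(z) by formal power-series division.

    P, Q are coefficient lists [c0, c1, ...] with Q[0] != 0.
    """
    C = []
    r = list(P) + [0.0] * n_terms   # running remainder
    for k in range(n_terms):
        c = r[k] / Q[0]
        C.append(c)
        for j, q in enumerate(Q):
            if k + j < len(r):
                r[k + j] -= c * q
    return C

# Q(z) = 1 + 0.5 z has its only zero at z = -2, outside |z| <= 1,
# so the coefficients decay like 2^{-k}: here C_k = (-0.5)^k.
C = series_div([1.0], [1.0, 0.5], 8)
```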

Let us illustrate Theorem 1 with two examples.

EXAMPLE 1. Let the spectral density be

    f(λ) = (1/2π)(5 + 4 cos λ).

The corresponding covariance function R(n) has the shape of a triangle with

    R(0) = 5,  R(±1) = 2,  R(n) = 0 for |n| ≥ 2.   (11)

Since this spectral density can be represented in the form

    f(λ) = (1/2π) |2 + e^{−iλ}|²,

we may apply Theorem 1. We find easily that

    ϕ̂1(λ) = e^{iλ} e^{−iλ}/(2 + e^{−iλ}),  ϕ̂n(λ) = 0 for n ≥ 2.   (12)

Therefore ξ̂n = 0 for all n ≥ 2, i.e., the (linear) prediction of ξn in terms of ξ0 = (. . . , ξ−1, ξ0) is trivial, which is not at all surprising if we observe that, by (11), the correlation between ξn and any of ξ0, ξ−1, . . . is zero for n ≥ 2.
For n = 1, we find from (6) and (12) that

    ξ̂1 = ∫_{−π}^{π} e^{iλ} (e^{−iλ}/(2 + e^{−iλ})) Z(dλ) = (1/2) ∫_{−π}^{π} (1 + ½e^{−iλ})^{−1} Z(dλ)
        = Σ_{k=0}^∞ ((−1)^k/2^{k+1}) ∫_{−π}^{π} e^{−ikλ} Z(dλ)
        = Σ_{k=0}^∞ (−1)^k ξ_{−k}/2^{k+1} = ½ ξ0 − ¼ ξ−1 + ⋯ .
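As a numerical cross-check of ours (not part of the original text), the covariances (11) can be recovered from f(λ) = (5 + 4 cos λ)/(2π) via R(n) = ∫_{−π}^{π} e^{iλn} f(λ) dλ; a rectangle rule over one full period is essentially exact for trigonometric polynomials.

```python
import numpy as np

# R(n) = ∫_{-π}^{π} e^{iλn} f(λ) dλ for f(λ) = (5 + 4 cos λ) / (2π)
N = 4096
lam = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = (5 + 4 * np.cos(lam)) / (2 * np.pi)
dlam = 2 * np.pi / N

def R(n):
    return ((np.exp(1j * lam * n) * f).sum() * dlam).real

cov = [R(n) for n in range(4)]   # ≈ [5, 2, 0, 0], matching (11)
```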

EXAMPLE 2. Let the covariance function be

    R(n) = a^n,  |a| < 1.

Then (see Example 5 in Sect. 1)

    f(λ) = (1/2π) (1 − |a|²)/|1 − a e^{−iλ}|²,

i.e.,

    f(λ) = (1/2π) |Φ(e^{−iλ})|²,
where

    Φ(z) = (1 − |a|²)^{1/2}/(1 − az) = (1 − |a|²)^{1/2} Σ_{k=0}^∞ (az)^k,

from which ϕ̂n(λ) = a^n, and therefore

    ξ̂n = a^n ∫_{−π}^{π} Z(dλ) = a^n ξ0.

In other words, to predict the value of ξn from the observations ξ0 = (. . . , ξ−1, ξ0), it is sufficient to know only the last observation ξ0.
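A similar sanity check of ours for this example: integrating e^{iλn} against f(λ) = (1 − a²)/(2π|1 − a e^{−iλ}|²) (taking a real; a = 0.6 is our choice) reproduces R(n) = a^n.

```python
import numpy as np

a = 0.6
N = 4096
lam = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = (1 - a**2) / (2 * np.pi * np.abs(1 - a * np.exp(-1j * lam))**2)
dlam = 2 * np.pi / N

def R(n):
    return ((np.exp(1j * lam * n) * f).sum() * dlam).real

errs = [abs(R(n) - a**n) for n in range(6)]   # should all be tiny
```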

Remark 3. It follows from the Wold expansion of a regular sequence ξ = (ξn) with

    ξn = Σ_{k=0}^∞ a_k ε_{n−k}   (13)

that the spectral density f(λ) admits the representation

    f(λ) = (1/2π) |Φ(e^{−iλ})|²,   (14)

where

    Φ(z) = Σ_{k=0}^∞ a_k z^k.   (15)

It is evident that the converse also holds, that is, if f (λ) admits the representa-
tion (14) with a function Φ(z) of the form (15), then the Wold expansion of ξn
has the form (13). Therefore the problem of representing the spectral density in the
form (14) and the problem of determining the coefficients ak in the Wold expansion
are equivalent.

The assumptions that Φ(z) in Theorem 1 has no zeros for |z| ≤ 1 and that r > 1
are in fact not essential. In other words, if the spectral density of a regular sequence
is represented in the form (14), then the optimal estimator ξˆn (in the mean-square
sense) for ξn in terms of ξ 0 = (. . . , ξ−1 , ξ0 ) is determined by formulas (6) and (7).

Remark 4. Theorem 1 (with the preceding Remark 3) solves the prediction problem
for regular sequences. Let us show that in fact the same answer remains valid for
arbitrary stationary sequences. More precisely, let

    ξn = ξn^s + ξn^r,  ξn = ∫_{−π}^{π} e^{iλn} Z(dλ),  F(Δ) = E|Z(Δ)|²,

and let f^r(λ) = (1/2π)|Φ(e^{−iλ})|² be the spectral density of the regular sequence ξ^r = (ξn^r). Then ξ̂n is determined by (6) and (7).
In fact, let (see Subsection 3 of Sect. 5)

    ξ̂n = ∫_{−π}^{π} ϕ̂n(λ) Z(dλ),  ξ̂n^r = ∫_{−π}^{π} ϕ̂n^r(λ) Z^r(dλ),

where Z r (Δ) is the orthogonal stochastic measure in the representation of the regu-
lar sequence ξ r . Then

    E|ξn − ξ̂n|² = ∫_{−π}^{π} |e^{iλn} − ϕ̂n(λ)|² F(dλ)
                 ≥ ∫_{−π}^{π} |e^{iλn} − ϕ̂n(λ)|² f^r(λ) dλ ≥ ∫_{−π}^{π} |e^{iλn} − ϕ̂n^r(λ)|² f^r(λ) dλ
                 = E|ξn^r − ξ̂n^r|².   (16)

But ξn − ξ̂n = ξn^r − ξ̂n^r. Hence E|ξn − ξ̂n|² = E|ξn^r − ξ̂n^r|², and it follows from (16) that we may take ϕ̂n(λ) to be ϕ̂n^r(λ).

2. Interpolation. Suppose that ξ = (ξn) is a regular sequence with spectral density f(λ). The simplest interpolation problem is that of constructing the optimal (mean-square) linear estimator for ξ0 from the results of the measurements {ξn, n = ±1, ±2, . . .}, i.e., with ξ0 omitted.
Let H⁰(ξ) be the closed linear manifold spanned by ξn, n ≠ 0. Then, according to Theorem 2 of Sect. 3, every random variable η ∈ H⁰(ξ) can be represented in the form

    η = ∫_{−π}^{π} ϕ(λ) Z(dλ),

where ϕ belongs to H⁰(F), the closed linear manifold spanned by the functions e^{iλn}, n ≠ 0. The estimator

    ξ̌0 = ∫_{−π}^{π} ϕ̌(λ) Z(dλ)   (17)

will be optimal if and only if

    inf_{η ∈ H⁰(ξ)} E|ξ0 − η|² = inf_{ϕ ∈ H⁰(F)} ∫_{−π}^{π} |1 − ϕ(λ)|² F(dλ)
                               = ∫_{−π}^{π} |1 − ϕ̌(λ)|² F(dλ) = E|ξ0 − ξ̌0|².

It follows from the perpendicularity properties of the Hilbert space H⁰(F) that ϕ̌(λ) is completely determined (compare (10)) by the two conditions

    (i)  ϕ̌(λ) ∈ H⁰(F),
    (ii) 1 − ϕ̌(λ) ⊥ H⁰(F).   (18)
Theorem 2 (Kolmogorov). Let ξ = (ξn) be a regular sequence such that

    ∫_{−π}^{π} dλ/f(λ) < ∞.   (19)

Then

    ϕ̌(λ) = 1 − α/f(λ),   (20)

where

    α = 2π / ∫_{−π}^{π} dλ/f(λ),   (21)

and the interpolation error δ² = E|ξ0 − ξ̌0|² is given by δ² = 2πα.

PROOF. We shall give the proof only under very stringent hypotheses on the spectral density, specifically that

    0 < c ≤ f(λ) ≤ C < ∞.   (22)

It follows from condition (ii) of (18) that

    ∫_{−π}^{π} [1 − ϕ̌(λ)] e^{inλ} f(λ) dλ = 0   (23)
for every n ≠ 0. By (22), the function [1 − ϕ̌(λ)] f(λ) belongs to the Hilbert space L²([−π, π], B[−π, π], μ) with Lebesgue measure μ. In this space the functions {e^{inλ}/√(2π), n = 0, ±1, . . .} form an orthonormal basis (Problem 10, Sect. 12, Chap. 2, Vol. 1). Hence it follows from (23) that [1 − ϕ̌(λ)] f(λ) is a constant, which we denote by α.
Thus, the second condition in (18) leads to the conclusion that

    ϕ̌(λ) = 1 − α/f(λ).   (24)

Starting from the first condition in (18), we now determine α.

By (22), we have ϕ̌ ∈ L², and the condition ϕ̌ ∈ H⁰(F) is equivalent to the condition that ϕ̌ belongs to the closed (in the L² norm) linear manifold spanned by the functions e^{iλn}, n ≠ 0. Hence it is clear that the zeroth coefficient in the expansion of ϕ̌(λ) must be zero. Therefore

    0 = ∫_{−π}^{π} ϕ̌(λ) dλ = 2π − α ∫_{−π}^{π} dλ/f(λ),

and hence α is determined by (21).

Finally,

    δ² = E|ξ0 − ξ̌0|² = ∫_{−π}^{π} |1 − ϕ̌(λ)|² f(λ) dλ
       = |α|² ∫_{−π}^{π} (f(λ)/f²(λ)) dλ = 4π² / ∫_{−π}^{π} dλ/f(λ).

This completes the proof (under condition (22)).

Corollary. If

    ϕ̌(λ) = Σ_{0<|k|≤N} c_k e^{iλk},

then

    ξ̌0 = Σ_{0<|k|≤N} c_k ∫_{−π}^{π} e^{iλk} Z(dλ) = Σ_{0<|k|≤N} c_k ξ_k.

EXAMPLE 3. Let f(λ) be the spectral density in Example 2 above. Then an easy calculation shows that

    ξ̌0 = ∫_{−π}^{π} (a/(1 + |a|²)) [e^{iλ} + e^{−iλ}] Z(dλ) = (a/(1 + |a|²)) [ξ1 + ξ−1],

and the interpolation error is

    δ² = (1 − |a|²)/(1 + |a|²).
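One can confirm this numerically (our own check): computing α from (21) for the density of Example 2 with a real (a = 0.6 below, our choice) and setting δ² = 2πα recovers (1 − a²)/(1 + a²).

```python
import numpy as np

a = 0.6
N = 4096
lam = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = (1 - a**2) / (2 * np.pi * np.abs(1 - a * np.exp(-1j * lam))**2)
dlam = 2 * np.pi / N

alpha = 2 * np.pi / ((1.0 / f).sum() * dlam)   # formula (21)
delta2 = 2 * np.pi * alpha                      # interpolation error, Theorem 2
expected = (1 - a**2) / (1 + a**2)
```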

3. Filtering. Let (θ, ξ) = ((θn), (ξn)), n ∈ Z, be a partially observed sequence, where θ = (θn) and ξ = (ξn) are, respectively, the unobserved and observed components. Each of the sequences θ and ξ will be supposed stationary (wide sense) with zero means and spectral representations

    θn = ∫_{−π}^{π} e^{iλn} Zθ(dλ)  and  ξn = ∫_{−π}^{π} e^{iλn} Zξ(dλ).

We write

    Fθ(Δ) = E|Zθ(Δ)|²,  Fξ(Δ) = E|Zξ(Δ)|²

and

    Fθξ(Δ) = E Zθ(Δ) \overline{Zξ(Δ)}.

In addition, we suppose that θ and ξ are connected in a stationary way, i.e., that their cross-covariance Cov(θn, ξm) = E θn ξ̄m depends only on the difference n − m. Let Rθξ(n) = E θn ξ̄0; then

    Rθξ(n) = ∫_{−π}^{π} e^{iλn} Fθξ(dλ).

The filtering problem of interest is the construction of the optimal (mean-square) linear estimator θ̂n of θn in terms of some observation of the sequence ξ.

The problem is easily solved under the assumption that θn is to be constructed from all the values ξm, m ∈ Z. In fact, since θ̂n = Ê(θn | H(ξ)), there is a function ϕ̂n(λ) such that

    θ̂n = ∫_{−π}^{π} ϕ̂n(λ) Zξ(dλ).   (25)
As in Subsections 1 and 2, the conditions to impose on the optimal ϕ̂n(λ) are that

    (i)  ϕ̂n(λ) ∈ H(Fξ),
    (ii) θn − θ̂n ⊥ H(ξ).

From the latter condition we find

    ∫_{−π}^{π} e^{iλ(n−m)} Fθξ(dλ) − ∫_{−π}^{π} e^{−iλm} ϕ̂n(λ) Fξ(dλ) = 0   (26)

for every m ∈ Z. Therefore, if we suppose that Fθξ(λ) and Fξ(λ) have densities fθξ(λ) and fξ(λ), we find from (26) that

    ∫_{−π}^{π} e^{iλ(n−m)} [fθξ(λ) − e^{−iλn} ϕ̂n(λ) fξ(λ)] dλ = 0.

If fξ(λ) > 0 (almost everywhere with respect to Lebesgue measure), we find immediately that

    ϕ̂n(λ) = e^{iλn} ϕ̂(λ),   (27)

where

    ϕ̂(λ) = fθξ(λ) · fξ^⊕(λ)

and fξ^⊕(λ) is the “pseudoinverse” of fξ(λ), i.e.,

    fξ^⊕(λ) = [fξ(λ)]^{−1} if fξ(λ) > 0,  and  fξ^⊕(λ) = 0 if fξ(λ) = 0.

Then the filtering error is

    E|θn − θ̂n|² = ∫_{−π}^{π} [fθ(λ) − |fθξ(λ)|² fξ^⊕(λ)] dλ.   (28)

As is easily verified, ϕ̂ ∈ H(Fξ ), and consequently the estimator (25), with the
function (27), is optimal.

EXAMPLE 4 (Detection of a signal in the presence of noise). Let ξn = θn + ηn, where the signal θ = (θn) and the noise η = (ηn) are uncorrelated sequences with spectral densities fθ(λ) and fη(λ). Then

    θ̂n = ∫_{−π}^{π} e^{iλn} ϕ̂(λ) Zξ(dλ),

where

    ϕ̂(λ) = fθ(λ) [fθ(λ) + fη(λ)]^⊕,
and the filtering error is

    E|θn − θ̂n|² = ∫_{−π}^{π} fθ(λ) fη(λ) [fθ(λ) + fη(λ)]^⊕ dλ.
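As a numerical illustration of ours (anticipating the densities from the second part of Problem 4 below), the error integral can be evaluated directly. Since fθfη/(fθ + fη) ≤ fη, the error is bounded by the noise power ∫fη dλ = 1, while the signal power ∫fθ dλ = Rθ(0) = 5 from Example 1.

```python
import numpy as np

N = 4096
lam = np.linspace(-np.pi, np.pi, N, endpoint=False)
dlam = 2 * np.pi / N
f_theta = np.abs(2 + np.exp(-1j * lam))**2 / (2 * np.pi)   # signal density
f_eta = np.full(N, 1 / (2 * np.pi))                         # white noise

err = (f_theta * f_eta / (f_theta + f_eta)).sum() * dlam    # filtering error
signal_power = f_theta.sum() * dlam                         # = Rθ(0) = 5
```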
The solution (25) obtained earlier can now be used to construct an optimal estimator θ̃n+m of θn+m based on observations ξk, k ≤ n, where m is a given number in Z. Let us suppose that ξ = (ξn) is regular, with spectral density

    f(λ) = (1/2π) |Φ(e^{−iλ})|²,

where Φ(z) = Σ_{k=0}^∞ a_k z^k. By the Wold expansion,

    ξn = Σ_{k=0}^∞ a_k ε_{n−k},

where ε = (εn) is white noise with the spectral decomposition

    εn = ∫_{−π}^{π} e^{iλn} Zε(dλ).

Since

    θ̃n+m = Ê[θn+m | Hn(ξ)] = Ê[Ê[θn+m | H(ξ)] | Hn(ξ)] = Ê[θ̂n+m | Hn(ξ)]

and

    θ̂n+m = ∫_{−π}^{π} e^{iλ(n+m)} ϕ̂(λ) Φ(e^{−iλ}) Zε(dλ) = Σ_{k=−∞}^∞ â_{n+m−k} ε_k,

where

    â_k = (1/2π) ∫_{−π}^{π} e^{iλk} ϕ̂(λ) Φ(e^{−iλ}) dλ,   (29)

we have

    θ̃n+m = Ê[ Σ_{k=−∞}^∞ â_{n+m−k} ε_k | Hn(ξ) ].

But Hn(ξ) = Hn(ε), and therefore

    θ̃n+m = Σ_{k≤n} â_{n+m−k} ε_k = ∫_{−π}^{π} [ Σ_{k≤n} â_{n+m−k} e^{iλk} ] Zε(dλ)
          = ∫_{−π}^{π} e^{iλn} [ Σ_{l=0}^∞ â_{l+m} e^{−iλl} ] Φ^⊕(e^{−iλ}) Zξ(dλ),

where Φ^⊕ is the pseudoinverse of Φ.

We have therefore established the following theorem.

Theorem 3. If the sequence ξ = (ξn ) under observation is regular, then the optimal
(mean-square) linear estimator θ̃n+m of θn+m in terms of ξk , k ≤ n, is given by
    θ̃n+m = ∫_{−π}^{π} e^{iλn} H_m(e^{−iλ}) Zξ(dλ),   (30)

where

    H_m(e^{−iλ}) = [ Σ_{l=0}^∞ â_{l+m} e^{−iλl} ] Φ^⊕(e^{−iλ})   (31)

and the coefficients â_k are defined by (29).


4. PROBLEMS
1. Show that the conclusion of Theorem 1 remains valid even without the hypothe-
ses that Φ(z) has a radius of convergence r > 1 and that the zeros of Φ(z) all lie
in |z| > 1.
2. Show that, for a regular process, the function Φ(z) involved in (4) can be represented in the form

    Φ(z) = √(2π) exp{ ½ c0 + Σ_{k=1}^∞ c_k z^k },  |z| < 1,

where

    c_k = (1/2π) ∫_{−π}^{π} e^{ikλ} log f(λ) dλ.

Deduce from this formula that the one-step prediction error σ1² = E|ξ̂1 − ξ1|² is given by the Szegő–Kolmogorov formula

    σ1² = 2π exp{ (1/2π) ∫_{−π}^{π} log f(λ) dλ }.

3. Prove Theorem 2 without assuming (22).
4. Let a signal θ and a noise η, not correlated with each other, have spectral densities

    fθ(λ) = (1/2π) · 1/|1 + b1 e^{−iλ}|²  and  fη(λ) = (1/2π) · 1/|1 + b2 e^{−iλ}|².

Using Theorem 3, find an estimator θ̃n+m for θn+m in terms of ξk, k ≤ n, where ξk = θk + ηk. Consider the same problem for the spectral densities

    fθ(λ) = (1/2π) |2 + e^{−iλ}|²  and  fη(λ) = 1/(2π).
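A quick numerical illustration of ours for the Szegő–Kolmogorov formula of Problem 2: for the density f(λ) = |2 + e^{−iλ}|²/(2π) of Example 1, Φ(z) = 2 + z has no zeros in |z| ≤ 1, so the one-step error is |b0|² = 4, and the formula reproduces this value.

```python
import numpy as np

# σ1² = 2π exp{(1/2π) ∫ log f(λ) dλ} for f(λ) = |2 + e^{-iλ}|² / (2π)
N = 4096
lam = np.linspace(-np.pi, np.pi, N, endpoint=False)
f = np.abs(2 + np.exp(-1j * lam))**2 / (2 * np.pi)

# the mean over an equally spaced grid approximates (1/2π) ∫_{-π}^{π} · dλ
sigma1_sq = 2 * np.pi * np.exp(np.log(f).mean())   # should be ≈ 4
```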

7. The Kalman–Bucy Filter and Its Generalizations

1. From a computational point of view, the solution presented earlier for the problem
of filtering out an unobservable component θ by means of observations of ξ is not
practical since, because it is expressed in terms of the spectrum, it has to be carried
out by analog devices. In the method proposed by Kalman and Bucy, the synthesis
of the optimal filter is carried out recursively; this makes it possible to do it with a
digital computer. There are also other reasons for the wide use of the Kalman–Bucy
filter, one being that it still “works” even without the assumption that the sequence
(θ, ξ) is stationary.
We shall present not only the usual Kalman–Bucy method but also its generaliza-
tions in which the recurrent equations for (θ, ξ) have coefficients that may depend
on all the data observed in the past.
Thus, let us suppose that (θ, ξ) = ((θn ), (ξn )) is a partially observed sequence,
and let
θn = (θ1 (n), . . . , θk (n)) and ξn = (ξ1 (n), . . . , ξl (n))
be governed by the recurrent equations

    θn+1 = a0(n, ξ) + a1(n, ξ)θn + b1(n, ξ)ε1(n+1) + b2(n, ξ)ε2(n+1),
    ξn+1 = A0(n, ξ) + A1(n, ξ)θn + B1(n, ξ)ε1(n+1) + B2(n, ξ)ε2(n+1).   (1)

Here

ε1 (n) = (ε11 (n), . . . , ε1k (n)) and ε2 (n) = (ε21 (n), . . . , ε2l (n))

are independent Gaussian vectors with independent components, each of which is normally distributed with parameters 0 and 1; a0(n, ξ) = (a01(n, ξ), . . . , a0k(n, ξ)) and A0(n, ξ) = (A01(n, ξ), . . . , A0l(n, ξ)) are vector functions with nonanticipative dependence on ξ = (ξ0, . . . , ξn), i.e., for a given n the functions a01(n, ξ), . . . , A0l(n, ξ) depend only on ξ0, . . . , ξn; the matrix functions

    b1(n, ξ) = ‖b_ij^{(1)}(n, ξ)‖,  b2(n, ξ) = ‖b_ij^{(2)}(n, ξ)‖,
    B1(n, ξ) = ‖B_ij^{(1)}(n, ξ)‖,  B2(n, ξ) = ‖B_ij^{(2)}(n, ξ)‖,
    a1(n, ξ) = ‖a_ij^{(1)}(n, ξ)‖,  A1(n, ξ) = ‖A_ij^{(1)}(n, ξ)‖

have orders k × k, k × l, l × k, l × l, k × k, l × k, respectively, and also depend on ξ nonanticipatively. We also suppose that the initial vector (θ0, ξ0) is independent of the sequences ε1 = (ε1(n)) and ε2 = (ε2(n)).
To simplify the presentation, we shall frequently not indicate the dependence of
the coefficients on ξ.
So that the system (1) will have a solution with finite second moments, we assume that E(‖θ0‖² + ‖ξ0‖²) < ∞, where ‖x‖² = Σ_{i=1}^k x_i² for x = (x1, . . . , xk); that |a_ij^{(1)}(n, ξ)| ≤ C and |A_ij^{(1)}(n, ξ)| ≤ C; and that if g(n, ξ) is any of the functions a_{0i}, A_{0j}, b_ij^{(1)}, b_ij^{(2)}, B_ij^{(1)}, B_ij^{(2)}, then E|g(n, ξ)|² < ∞, n = 0, 1, . . . . Under these assumptions, E(‖θn‖² + ‖ξn‖²) < ∞ for all n ≥ 0.
Now let Fnξ = σ{ξ0 , . . . , ξn } be the smallest σ-algebra generated by ξ0 , . . . , ξn
and
mn = E(θn | Fnξ ), γn = E[(θn − mn )(θn − mn )∗ | Fnξ ].
7 The Kalman–Bucy Filter and Its Generalizations 97

According to Theorem 1, Sect. 8, Chap. 2, Vol. 1, mn = (m1(n), . . . , mk(n)) is an optimal estimator (in the mean-square sense) for the vector θn = (θ1(n), . . . , θk(n)), and E γn = E[(θn − mn)(θn − mn)*] is the matrix of errors of observation. Determining these matrices for arbitrary sequences (θ, ξ) governed by Eqs. (1) is a very difficult problem. However, under a further supplementary condition on (θ0, ξ0), namely, that the conditional distribution P(θ0 ≤ a | ξ0) is Gaussian,

    P(θ0 ≤ a | ξ0) = (1/√(2πγ0)) ∫_{−∞}^{a} exp{−(x − m0)²/(2γ0)} dx,   (2)

with parameters m0 = m0(ξ0), γ0 = γ0(ξ0), we can derive a system of recurrent equations for mn and γn that also includes the Kalman–Bucy filter equations.
To begin with, let us establish an important auxiliary result.

Lemma 1. Under the assumptions made earlier about the coefficients of (1), to-
gether with (2), the sequence (θ, ξ) is conditionally Gaussian, i.e., the conditional
distribution function
P{θ0 ≤ a0 , . . . , θn ≤ an | Fnξ }
is (P-a.s.) the distribution function of an n-dimensional Gaussian vector whose
mean and covariance matrix depend on (ξ0 , . . . , ξn ).

PROOF. We prove only the Gaussian character of P(θn ≤ a | Fnξ ); this is enough to
let us obtain equations for mn and γn .
First we observe that (1) implies that the conditional distribution

    P(θn+1 ≤ a, ξn+1 ≤ x | Fnξ, θn = b)

is Gaussian with mean-value vector

    𝔸0 + 𝔸1 b = ( a0 + a1 b )
                 ( A0 + A1 b )

and covariance matrix

    𝔹 = ( b∘b      b∘B )
        ( (b∘B)*   B∘B ),

where b∘b = b1 b1* + b2 b2*, b∘B = b1 B1* + b2 B2*, B∘B = B1 B1* + B2 B2*.

Let ζn = (θn, ξn) and t = (t1, . . . , tk+l). Then

    E[exp(it* ζn+1) | Fnξ, θn] = exp{it*(𝔸0(n, ξ) + 𝔸1(n, ξ)θn) − ½ t* 𝔹(n, ξ) t}.   (3)

Suppose now that the conclusion of the lemma holds for some n ≥ 0. Then

    E[exp(it* 𝔸1(n, ξ)θn) | Fnξ] = exp{it* 𝔸1(n, ξ)mn − ½ t*(𝔸1(n, ξ) γn 𝔸1*(n, ξ)) t}.   (4)
Let us show that (4) is also valid when n is replaced by n + 1.


98 6 Stationary (Wide Sense) Random Sequences: L2 -Theory

From (3) and (4) we have


1
E[exp(it∗ ζn+1 ) | Fnξ ] = exp it∗ (A0 (n, ξ) + A1 (n, ξ)mn )
2
− 21 t∗ B(n, ξ)t − 12 t∗ (A1 (n, ξ)γn A∗1 (n, ξ))t .

Hence the conditional distribution

    P(θn+1 ≤ a, ξn+1 ≤ x | Fnξ)   (5)

is Gaussian.
As in the proof of the theorem on normal correlation (Theorem 2 in Sect. 13,
Chap. 2, Vol. 1) we can verify that there is a matrix C such that the vector

η = [θn+1 − E(θn+1 | Fnξ )] − C[ξn+1 − E(ξn+1 | Fnξ )]

has the property that (P-a.s.)

E[η(ξn+1 − E(ξn+1 | Fnξ ))∗ | Fnξ ] = 0.

This implies that the conditionally Gaussian vectors η and ξn+1 , considered under
the condition Fnξ , are independent, i.e., (P-a.s.)

P(η ∈ A, ξn+1 ∈ B | Fnξ ) = P(η ∈ A | Fnξ ) · P(ξn+1 ∈ B | Fnξ )

for all A ∈ B(Rk ), B ∈ B(Rl ).


Therefore, if s = (s1, . . . , sk), then
E[exp(is∗ θn+1 ) | Fnξ , ξn+1 ]
= E{exp(is∗ [E(θn+1 | Fnξ ) + η + C[ξn+1 − E(ξn+1 | Fnξ )]]) | Fnξ , ξn+1 }
= exp{is∗ [E(θn+1 | Fnξ ) + C[ξn+1 − E(ξn+1 | Fnξ )]]}
× E[exp(is∗ η) | Fnξ , ξn+1 ]
= exp{is∗ [E(θn+1 | Fnξ ) + C[ξn+1 − E(ξn+1 | Fnξ )]]}
× E(exp(is∗ η) | Fnξ ). (6)

By (5), the conditional distribution P(η ≤ y | Fnξ ) is Gaussian. With (6), this
ξ
shows that the conditional distribution P(θn+1 ≤ a | Fn+1 ) is also Gaussian.
This completes the proof of the lemma.


Theorem 1. Let (θ, ξ) be a partially observed sequence that satisfies the system (1) and condition (2). Then (mn, γn) obey the following recursion relations:

    mn+1 = [a0 + a1 mn] + [b∘B + a1 γn A1*][B∘B + A1 γn A1*]^⊕ [ξn+1 − A0 − A1 mn],   (7)
    γn+1 = [a1 γn a1* + b∘b] − [b∘B + a1 γn A1*][B∘B + A1 γn A1*]^⊕ [b∘B + a1 γn A1*]*.   (8)
PROOF. From (1),

    E(θn+1 | Fnξ) = a0 + a1 mn,  E(ξn+1 | Fnξ) = A0 + A1 mn   (9)

and

    θn+1 − E(θn+1 | Fnξ) = a1[θn − mn] + b1 ε1(n+1) + b2 ε2(n+1),
    ξn+1 − E(ξn+1 | Fnξ) = A1[θn − mn] + B1 ε1(n+1) + B2 ε2(n+1).   (10)

Let us write

    d11 = Cov(θn+1, θn+1 | Fnξ) = E{[θn+1 − E(θn+1 | Fnξ)][θn+1 − E(θn+1 | Fnξ)]* | Fnξ},
    d12 = Cov(θn+1, ξn+1 | Fnξ) = E{[θn+1 − E(θn+1 | Fnξ)][ξn+1 − E(ξn+1 | Fnξ)]* | Fnξ},
    d22 = Cov(ξn+1, ξn+1 | Fnξ) = E{[ξn+1 − E(ξn+1 | Fnξ)][ξn+1 − E(ξn+1 | Fnξ)]* | Fnξ}.

Then, by (10),

    d11 = a1 γn a1* + b∘b,  d12 = a1 γn A1* + b∘B,  d22 = A1 γn A1* + B∘B.   (11)

By the theorem on normal correlation (see Theorem 2 and Problem 4 in Sect. 13, Chap. 2, Vol. 1),

    mn+1 = E(θn+1 | Fnξ, ξn+1) = E(θn+1 | Fnξ) + d12 d22^⊕ (ξn+1 − E(ξn+1 | Fnξ))

and

    γn+1 = Cov(θn+1, θn+1 | Fnξ, ξn+1) = d11 − d12 d22^⊕ d12*.
If we then use the expressions from (9) for E(θn+1 | Fnξ ) and E(ξn+1 | Fnξ ) and
those for d11 , d12 , d22 from (11), we obtain the required recursion formulas (7) and
(8).
This completes the proof of the theorem.
Corollary 1. If the coefficients a0(n, ξ), . . . , B2(n, ξ) in (1) are independent of ξ, the corresponding method is known as the Kalman–Bucy method, and Eqs. (7) and (8) for mn and γn describe the Kalman–Bucy filter. It is important to observe that in this case the conditional and unconditional error matrices γn agree, i.e.,

    γn ≡ E γn = E[(θn − mn)(θn − mn)*].

Corollary 2. Suppose that a partially observed sequence (θn, ξn) has the property that θn satisfies the first equation in (1), while ξn satisfies

    ξn = Ã0(n−1, ξ) + Ã1(n−1, ξ)θn + B̃1(n−1, ξ)ε1(n) + B̃2(n−1, ξ)ε2(n).   (12)
100 6 Stationary (Wide Sense) Random Sequences: L2 -Theory

Then evidently

    ξn+1 = Ã0(n, ξ) + Ã1(n, ξ)[a0(n, ξ) + a1(n, ξ)θn + b1(n, ξ)ε1(n+1) + b2(n, ξ)ε2(n+1)]
           + B̃1(n, ξ)ε1(n+1) + B̃2(n, ξ)ε2(n+1),

and with the notation

    A0 = Ã0 + Ã1 a0,  A1 = Ã1 a1,
    B1 = Ã1 b1 + B̃1,  B2 = Ã1 b2 + B̃2,

we find that the case under consideration also obeys the model (1) and that mn and γn satisfy (7) and (8).
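To make the constant-coefficient case concrete, here is a minimal scalar sketch of ours of the filter (7)–(8) for the model θ_{n+1} = aθn + bε1(n+1), ξ_{n+1} = Aθn + Bε2(n+1) (the model of Problem 4 below; the numerical values are our own choices). Since the noises are independent, b∘B = 0, b∘b = b², B∘B = B²; the code also checks that γn converges to the positive root of the Riccati equation stated in Problem 4.

```python
import numpy as np

a, b, A, B = 0.8, 1.0, 1.0, 1.0
rng = np.random.default_rng(0)

theta, m, gamma = 0.0, 0.0, 1.0
for _ in range(500):
    theta_new = a * theta + b * rng.standard_normal()       # state equation
    xi_new = A * theta + B * rng.standard_normal()          # observation
    gain = a * gamma * A / (B**2 + A**2 * gamma)
    m = a * m + gain * (xi_new - A * m)                     # recursion (7)
    gamma = a**2 * gamma + b**2 - gain * (a * gamma * A)    # recursion (8)
    theta = theta_new

# positive root of γ² + (B²(1-a²)/A² - b²)γ - b²B²/A² = 0 (cf. Problem 4)
p = B**2 * (1 - a**2) / A**2 - b**2
gamma_lim = (-p + np.sqrt(p**2 + 4 * b**2 * B**2 / A**2)) / 2
```

Note that the γ-recursion contains no random terms, so the error sequence can be precomputed independently of the observations.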
2. We now consider a linear model (cf. (1))

    θn+1 = a0 + a1 θn + a2 ξn + b1 ε1(n+1) + b2 ε2(n+1),
    ξn+1 = A0 + A1 θn + A2 ξn + B1 ε1(n+1) + B2 ε2(n+1),   (13)

where the coefficients a0 , . . . , B2 may depend on n (but not on ξ), and εij (n) are
independent Gaussian random variables with E εij (n) = 0 and E ε2ij (n) = 1.
Let (13) be solved with initial values (θ0, ξ0) such that the conditional distribution P(θ0 ≤ a | ξ0) is Gaussian with parameters m0 = E(θ0 | ξ0) and γ0 = Cov(θ0, θ0 | ξ0). Then, by the theorem on normal correlation and (7) and (8), the optimal estimator mn = E(θn | Fnξ) is a linear function of ξ0, ξ1, . . . , ξn.
This remark makes it possible to prove the following important statement about
the structure of the optimal linear filter without the assumption that the random
variables involved are Gaussian.
Theorem 2. Let (θ, ξ) = (θn, ξn)_{n≥0} be a partially observed sequence that satisfies (13), where εij(n) are uncorrelated random variables with E εij(n) = 0, E εij²(n) = 1, and the components of the initial vector (θ0, ξ0) have finite second moments. Then the optimal linear estimator m̂n = Ê(θn | ξ0, . . . , ξn) satisfies (7) with a0(n, ξ) = a0(n) + a2(n)ξn, A0(n, ξ) = A0(n) + A2(n)ξn, and the error matrix

    γ̂n = E[(θn − m̂n)(θn − m̂n)*]

satisfies (8) with initial values

    m̂0 = Cov(θ0, ξ0) Cov^⊕(ξ0, ξ0) · ξ0,
    γ̂0 = Cov(θ0, θ0) − Cov(θ0, ξ0) Cov^⊕(ξ0, ξ0) Cov*(θ0, ξ0).   (14)

For the proof of this theorem, we need the following lemma, which reveals the
role of the Gaussian case in determining optimal linear estimators.
Lemma 2. Let (α, β) be a two-dimensional random vector with E(α2 + β 2 ) <
∞, and (α̃, β̃) a two-dimensional Gaussian vector with the same first and second
moments as (α, β), i.e.,

    E α̃^i = E α^i,  E β̃^i = E β^i,  i = 1, 2;  E α̃β̃ = E αβ.

Let λ(b) be a linear function of b such that

λ(b) = E(α̃ | β̃ = b).

Then λ(β) is the optimal (in the mean-square sense) linear estimator of α in terms
of β, i.e.,
Ê(α | β) = λ(β).
Here E λ(β) = E α.
PROOF. We first observe that the existence of a linear function λ(b) coinciding with E(α̃ | β̃ = b) follows from the theorem on normal correlation. Moreover, let λ̄(b) be any other linear estimator. Then

    E[α̃ − λ̄(β̃)]² ≥ E[α̃ − λ(β̃)]²,

and since λ(b) and λ̄(b) are linear and the hypotheses of the lemma are satisfied, we have

    E[α − λ̄(β)]² = E[α̃ − λ̄(β̃)]² ≥ E[α̃ − λ(β̃)]² = E[α − λ(β)]²,

which shows that λ(β) is optimal in the class of linear estimators. Finally,

E λ(β) = E λ(β̃) = E[E(α̃ | β̃)] = E α̃ = E α.

This completes the proof of the lemma.




PROOF OF THEOREM 2. We consider, besides (13), the system

    θ̃n+1 = a0 + a1 θ̃n + a2 ξ̃n + b1 ε̃11(n+1) + b2 ε̃12(n+1),
    ξ̃n+1 = A0 + A1 θ̃n + A2 ξ̃n + B1 ε̃21(n+1) + B2 ε̃22(n+1),   (15)

where ε̃ij (n) are independent Gaussian random variables with E ε̃ij (n) = 0 and
E ε̃2ij (n) = 1. Let (θ̃0 , ξ˜0 ) also be a Gaussian vector that has the same first mo-
ments and covariance as (θ0 , ξ0 ) and is independent of ε̃ij (n). Then, since (15) is
linear, the vector (θ̃0 , . . . , θ̃n , ξ˜0 , . . . , ξ˜n ) is Gaussian, and therefore the conclusion
of the theorem follows from Lemma 2 (more precisely, from its multidimensional
analog) and the theorem on normal correlation.
This completes the proof of the theorem.


3. Let us consider some illustrations of Theorems 1 and 2.
EXAMPLE 1. Let θ = (θn) and η = (ηn) be two stationary (wide sense) uncorrelated random sequences with E θn = E ηn = 0 and spectral densities

    fθ(λ) = (1/2π) · 1/|1 + b1 e^{−iλ}|²  and  fη(λ) = (1/2π) · 1/|1 + b2 e^{−iλ}|²,

where |b1| < 1, |b2| < 1.


We shall interpret θ as an informative signal and η as noise and suppose that observation produces a sequence ξ = (ξn) with

    ξn = θn + ηn.

According to Corollary 2 to Theorem 3 in Sect. 3, there are (mutually uncorrelated) white noises ε1 = (ε1(n)) and ε2 = (ε2(n)) such that

    θn+1 + b1 θn = ε1(n+1),  ηn+1 + b2 ηn = ε2(n+1).

Then
ξn+1 = θn+1 + ηn+1 = −b1 θn − b2 ηn + ε1 (n + 1) + ε2 (n + 1)
= −b2 (θn + ηn ) − θn (b1 − b2 ) + ε1 (n + 1) + ε2 (n + 1)
= −b2 ξn − (b1 − b2 )θn + ε1 (n + 1) + ε2 (n + 1).

Hence θ and ξ satisfy the recursion relations

    θn+1 = −b1 θn + ε1(n+1),
    ξn+1 = −(b1 − b2) θn − b2 ξn + ε1(n+1) + ε2(n+1),   (16)

and, according to Theorem 2, mn = Ê(θn | ξ0, . . . , ξn) and γn = E(θn − mn)² satisfy the following system of recursion equations for optimal linear filtering:

    mn+1 = −b1 mn + ((1 + b1(b1 − b2)γn)/(2 + (b1 − b2)²γn)) [ξn+1 + (b1 − b2)mn + b2 ξn],
    γn+1 = b1² γn + 1 − (1 + b1(b1 − b2)γn)²/(2 + (b1 − b2)²γn).   (17)

Let us find the initial conditions under which we should solve this system. Write d11 = E θn², d12 = E θn ξn, d22 = E ξn². Then we find from (16) that

    d11 = b1² d11 + 1,
    d12 = b1(b1 − b2) d11 + b1 b2 d12 + 1,
    d22 = (b1 − b2)² d11 + b2² d22 + 2 b2 (b1 − b2) d12 + 2,

from which

    d11 = 1/(1 − b1²),  d12 = 1/(1 − b1²),  d22 = (2 − b1² − b2²)/((1 − b1²)(1 − b2²)),

which, by (14), leads to the following initial values:

    m0 = (d12/d22) ξ0 = ((1 − b2²)/(2 − b1² − b2²)) ξ0,
    γ0 = d11 − d12²/d22 = 1/(1 − b1²) − (1 − b2²)/((1 − b1²)(2 − b1² − b2²)) = 1/(2 − b1² − b2²).   (18)
Thus the optimal (in the least-squares sense) linear estimators mn of the signal θn in terms of ξ0, . . . , ξn, together with the mean-square errors γn, are determined by the system of recurrent equations (17), solved under the initial conditions (18). Observe that the equation for γn contains no random components; consequently the numbers γn, which are needed for finding mn, can be calculated in advance, before solving the filtering problem.
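The stationary moments can be double-checked numerically (a sketch of ours; the values of b1, b2 are our own choices): solve the three linear equations displayed above and compare with the closed forms. Note that d12 = E θnξn = E θn² = d11, because θ and η are uncorrelated.

```python
import numpy as np

b1, b2 = 0.5, -0.3

# d11 = b1² d11 + 1;  d12 = b1(b1-b2) d11 + b1 b2 d12 + 1;
# d22 = (b1-b2)² d11 + b2² d22 + 2 b2 (b1-b2) d12 + 2
M = np.array([
    [1 - b1**2,        0.0,                 0.0      ],
    [-b1 * (b1 - b2),  1 - b1 * b2,         0.0      ],
    [-(b1 - b2)**2,    -2 * b2 * (b1 - b2), 1 - b2**2],
])
d = np.linalg.solve(M, np.array([1.0, 1.0, 2.0]))

closed = np.array([
    1 / (1 - b1**2),
    1 / (1 - b1**2),
    (2 - b1**2 - b2**2) / ((1 - b1**2) * (1 - b2**2)),
])
```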

EXAMPLE 2. This example is instructive because it shows that the result of Theo-
rem 2 can be applied to find the optimal linear filter in a case where the sequence
(θ, ξ) is described by a (nonlinear) system that is different from (13).
Let ε1 = (ε1 (n)) and ε2 = (ε2 (n)) be two independent Gaussian sequences of
independent random variables with E εi (n) = 0 and E ε2i (n) = 1, n ≥ 1. Consider a
pair of sequences (θ, ξ) = (θn , ξn ), n ≥ 0, with

    θn+1 = aθn + (1 + θn)ε1(n+1),
    ξn+1 = Aθn + ε2(n+1).   (19)

We shall suppose that θ0 is independent of (ε1, ε2) and that θ0 ∼ N(m0, γ0).

System (19) is nonlinear, and Theorem 2 is not immediately applicable. However, if we set

    ε̃1(n+1) = ((1 + θn)/√(E(1 + θn)²)) ε1(n+1),

we observe that E ε̃1(n) = 0, E ε̃1(n)ε̃1(m) = 0 for n ≠ m, and E ε̃1²(n) = 1. Hence we have reduced (19) to a linear system,

    θn+1 = a1 θn + b1 ε̃1(n+1),
    ξn+1 = A1 θn + ε2(n+1),   (20)

where a1 = a, A1 = A, b1 = √(E(1 + θn)²), and {ε̃1(n)} is a sequence of uncorrelated random variables.
Now (20) is a linear system of the same type as (13), and consequently the optimal linear estimator m̂n = Ê(θn | ξ0, . . . , ξn) and its error γ̂n can be determined from (7) and (8) via Theorem 2, applied in the following form in the present case:

    m̂n+1 = a1 m̂n + (a1 A1 γ̂n/(1 + A1² γ̂n)) [ξn+1 − A1 m̂n],
    γ̂n+1 = (a1² γ̂n + b1²(n)) − (a1 A1 γ̂n)²/(1 + A1² γ̂n),

where b1(n) = √(E(1 + θn)²) must be found from the first equation in (19).
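Since ε1(n+1) is independent of θn with zero mean and unit variance, b1(n)² = E(1 + θn)² = 1 + 2Eθn + Eθn² propagates by elementary moment recursions: Eθ_{n+1} = aEθn and Eθ²_{n+1} = a²Eθn² + E(1 + θn)². A small sketch of ours (the parameter values are our own choices):

```python
def b1_squared(a, m0, gamma0, n_steps):
    """b1(n)^2 = E(1 + theta_n)^2 for theta_{n+1} = a theta_n + (1 + theta_n) eps(n+1)."""
    mu, s = m0, gamma0 + m0**2        # E θ0 and E θ0²
    out = []
    for _ in range(n_steps):
        out.append(1 + 2 * mu + s)
        # μ_{n+1} = a μ_n;  s_{n+1} = a² s_n + (1 + 2 μ_n + s_n)
        mu, s = a * mu, a**2 * s + (1 + 2 * mu + s)
    return out

vals = b1_squared(a=0.5, m0=0.0, gamma0=1.0, n_steps=3)   # [2.0, 3.25, 4.8125]
```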

EXAMPLE 3 (Estimators for parameters). Let θ = (θ1 , . . . , θk ) be a Gaussian vector


with E θ = m0 and Cov(θ, θ) = γ0 . Suppose that (with known m0 and γ0 ) we look
for the optimal estimator of θ in terms of observations on an l-dimensional sequence
ξ = (ξn ), n ≥ 0, with
ξn+1 = A0 (n, ξ) + A1 (n, ξ)θ + B1 (n, ξ)ε1 (n + 1), ξ0 = 0, (21)

where ε1 is as in (1).
Then we have from (7) and (8) that mn = E(θ | Fnξ ) and γn can be found from

mn+1 = mn + γn A∗1 (n, ξ)[(B1 B∗1 )(n, ξ) + A1 (n, ξ)γn A∗1 (n, ξ)]⊕
× [ξn+1 − A0 (n, ξ) − A1 (n, ξ)mn ], (22)
γn+1 = γn − γn A∗1 (n, ξ)[(B1 B∗1 )(n, ξ) + A1 (n, ξ)γn A∗1 (n, ξ)]⊕ A1 (n, ξ)γn .

If the matrices B1B1* are nonsingular for all n and ξ, the solution of (22) is given by

    mn+1 = [E + γ0 Σ_{i=0}^n A1*(i, ξ)(B1B1*)^{−1}(i, ξ) A1(i, ξ)]^{−1}
              × [m0 + γ0 Σ_{i=0}^n A1*(i, ξ)(B1B1*)^{−1}(i, ξ)(ξ_{i+1} − A0(i, ξ))],   (23)
    γn+1 = [E + γ0 Σ_{i=0}^n A1*(i, ξ)(B1B1*)^{−1}(i, ξ) A1(i, ξ)]^{−1} γ0,

where E is the identity matrix.
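In the simplest scalar case A0 = 0, A1 = B1 = 1 (an illustration of ours; cf. Problem 3 below), the recursions (22) and the closed form (23) can be compared directly: (22) becomes m_{n+1} = mn + γn(1 + γn)^{−1}(ξ_{n+1} − mn), γ_{n+1} = γn/(1 + γn), while (23) gives m_{n+1} = (m0 + γ0 Σξ_{i+1})/(1 + (n+1)γ0) and γ_{n+1} = γ0/(1 + (n+1)γ0).

```python
m0, gamma0 = 0.3, 2.0
xs = [1.0, -0.5, 2.0, 0.7]     # arbitrary "observations" ξ1, ξ2, ... (our choice)

m, gamma = m0, gamma0
rec = []
for x in xs:                   # recursions (22), scalar case
    m = m + gamma / (1 + gamma) * (x - m)
    gamma = gamma / (1 + gamma)
    rec.append((m, gamma))

closed = [                     # closed form (23), scalar case
    ((m0 + gamma0 * sum(xs[:n + 1])) / (1 + (n + 1) * gamma0),
     gamma0 / (1 + (n + 1) * gamma0))
    for n in range(len(xs))
]
```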

4. PROBLEMS
1. Show that the vectors mn and θn − mn in (1) are uncorrelated:

    E[mn*(θn − mn)] = 0.

2. In (1)–(2), let γ0 and the coefficients other than a0(n, ξ) and A0(n, ξ) be independent of “chance” (i.e., of ξ). Show that then the conditional covariance γn is also independent of “chance”: γn = E γn.
3. Show that the solution of (22) is given by (23).
4. Let (θ, ξ) = (θn , ξn ) be a Gaussian sequence satisfying the following special case
of (1):
    θn+1 = aθn + bε1(n+1),  ξn+1 = Aθn + Bε2(n+1).

Show that if A ≠ 0, b ≠ 0, B ≠ 0, then the limiting error of filtering, γ = lim_{n→∞} γn, exists and is determined as the positive root of the equation

    γ² + (B²(1 − a²)/A² − b²) γ − b²B²/A² = 0.
5. (Interpolation, [54, 13.3]) Let (θ, ξ) be a partially observed sequence governed
   by recurrence relations (1) and (2). Suppose that the conditional distribution

       πa (m, m) = P(θm ≤ a | Fmξ )

   of θm is Gaussian.
   (a) Show that the conditional distribution

           πa (m, n) = P(θm ≤ a | Fnξ ),    n ≥ m,

       is also Gaussian, πa (m, n) ∼ N (μ(m, n), γ(m, n)).
   (b) Find the interpolation estimator μ(m, n) (of θm given Fnξ ) and the matrix
       γ(m, n).
6. (Extrapolation, [54, 13.4]) In (1) and (2), let

       a0 (n, ξ) = a0 (n) + a2 (n)ξn ,    a1 (n, ξ) = a1 (n),
       A0 (n, ξ) = A0 (n) + A2 (n)ξn ,    A1 (n, ξ) = A1 (n).

   (a) Show that in this case the distribution πa,b (m, n) = P(θn ≤ a, ξn ≤ b | Fmξ )
       is Gaussian (n ≥ m).
   (b) Find the extrapolation estimators E(θn | Fmξ ) and E(ξn | Fmξ ).
7. (Optimal control, [54, 14.3]) Consider a “controlled” partially observed system
   (θn , ξn )0≤n≤N , where

       θn+1 = un + θn + bε1 (n + 1),
       ξn+1 = θn + ε2 (n + 1).

   Here the “control” un is Fnξ -measurable and satisfies E un² < ∞ for all 0 ≤ n ≤
   N − 1. The variables ε1 (n) and ε2 (n), n = 1, . . . , N, are the same as in (1), (2);
   ξ0 = 0, θ0 ∼ N (m, γ).
   We say that the “control” u∗ = (u∗0 , . . . , u∗N−1 ) is optimal if V(u∗ ) = inf_u V(u),
   where

       V(u) = E [ Σ_{n=0}^{N−1} (θn² + un²) + θN² ].

   Show that

       u∗n = −[1 + Pn+1 ]^+ Pn+1 m∗n ,    n = 0, . . . , N − 1,

   where a^+ = a^{-1} for a ≠ 0 and a^+ = 0 for a = 0,
   (Pn )0≤n≤N are found from the recurrence relations

       Pn = 1 + Pn+1 − P_{n+1}² [1 + Pn+1 ]^+ ,    PN = 1,

   and (m∗n ) are determined by

       m∗_{n+1} = u∗n + γ∗n (1 + γ∗n )^+ (ξn+1 − m∗n ),    0 ≤ n ≤ N − 1,

   with m∗0 = m and (γ∗n ) by

       γ∗_{n+1} = γ∗n + 1 − (γ∗n )² (1 + γ∗n )^+ ,    0 ≤ n ≤ N − 1,

   with γ∗0 = γ.
Chapter 7
Martingales

1. Definitions of Martingales and Related Concepts

Martingale theory illustrates the history of mathematical probability; the basic definitions
are inspired by crude notions of gambling, but the theory has become a sophisticated tool
of modern abstract mathematics, drawing from and contributing to other fields.
J. L. Doob [19]

1. The study of the dependence between random variables arises in various ways in
probability theory. In the theory of stationary (wide sense) random sequences, the
basic indicator of dependence is the covariance function, and the inferences made in
this theory are determined by the properties of that function. In the theory of Markov
chains (Sect. 12 of Chap. 1, Vol. 1 and Chap. 8) the basic dependence is supplied by
the transition function, which completely determines the development of the random
variables involved in Markov dependence.
In this chapter (see also Sect. 11 of Chap. 1, Vol. 1) we single out a rather wide
class of sequences of random variables (martingales and their generalizations) for
which dependence can be studied by methods based on the properties of conditional
expectations.

2. Let (Ω, F , P) be a given probability space with a filtration (flow), i.e., with a
family (Fn ) of σ-algebras Fn , n ≥ 0, such that F0 ⊆ F1 ⊆ . . . ⊆ F (“filtered
probability space”).
Let X0 , X1 , . . . be a sequence of random variables defined on (Ω, F , P). If, for
each n ≥ 0, the variable Xn is Fn -measurable, the set X = (Xn , Fn )n≥0 , or simply
X = (Xn , Fn ), is called a stochastic sequence.
If a stochastic sequence X = (Xn , Fn ) has the property that, for each n ≥ 1, the
variable Xn is Fn−1 -measurable, we write X = (Xn , Fn−1 ), taking F−1 = F0 , and
call X a predictable sequence. We call such a sequence increasing if X0 = 0 and
Xn ≤ Xn+1 (P-a.s.).

© Springer Science+Business Media, LLC, part of Springer Nature 2019
A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95,
https://doi.org/10.1007/978-0-387-72208-5_4
Definition 1. A stochastic sequence X = (Xn , Fn ) is a martingale, or a submartin-
gale, if, for all n ≥ 0,

    E |Xn | < ∞                                                                  (1)

and

    E(Xn+1 | Fn ) = Xn (P-a.s.)   (martingale)
or                                                                               (2)
    E(Xn+1 | Fn ) ≥ Xn (P-a.s.)   (submartingale).
A stochastic sequence X = (Xn , Fn ) is a supermartingale if the sequence −X =
(−Xn , Fn ) is a submartingale.
In the special case where Fn = FnX , where FnX = σ{X0 , . . . , Xn }, and the
stochastic sequence X = (Xn , Fn ) is a martingale (or submartingale), we say that
the sequence (Xn )n≥0 itself is a martingale (or submartingale).
It is easy to deduce from the properties of conditional expectations that (2) is
equivalent to the property that, for every n ≥ 0 and A ∈ Fn ,

    ∫_A Xn+1 d P = ∫_A Xn d P
or                                                                               (3)
    ∫_A Xn+1 d P ≥ ∫_A Xn d P .
EXAMPLE 1. If (ξn )n≥0 is a sequence of independent random variables such that
E |ξn | < ∞, E ξn = 0, and Xn = ξ0 + · · · + ξn , Fn = σ{ξ0 , . . . , ξn }, the stochastic
sequence X = (Xn , Fn ) is a martingale.
EXAMPLE 2. If (ξn )n≥0 is a sequence of independent random variables such that
E |ξn | < ∞ and E ξn = 1, the stochastic sequence (Xn , Fn ) with Xn = ∏_{k=0}^{n} ξk ,
Fn = σ{ξ0 , . . . , ξn } is also a martingale.
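Examples 1 and 2 can be checked by brute force: enumerating all trajectories of a short sequence shows that E Xn stays equal to E X0 exactly. The specific distributions below (±1 steps for the sum, {0, 2}-valued factors with mean 1 for the product) are our choice of the simplest cases.

```python
from itertools import product

def mean_sum(n):
    # Example 1 with ξi = ±1 equiprobable: E(ξ1 + ... + ξn) over all 2^n paths
    return sum(sum(path) for path in product((-1, 1), repeat=n)) / 2**n

def mean_product(n):
    # Example 2 with ξi in {0, 2} equiprobable (so E ξi = 1): E(ξ1 ... ξn)
    total = 0
    for path in product((0, 2), repeat=n):
        value = 1
        for x in path:
            value *= x
        total += value
    return total / 2**n
```

Both means are constant in n, as the martingale property E Xn = E X0 requires.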
EXAMPLE 3. Let ξ be a random variable with E |ξ| < ∞ and

    F0 ⊆ F1 ⊆ · · · ⊆ F .

Then the sequence X = (Xn , Fn ) with Xn = E(ξ | Fn ) is a martingale, called Lévy’s
martingale.
EXAMPLE 4. If (ξn )n≥0 is a sequence of nonnegative integrable random variables,
the sequence (Xn ) with Xn = ξ0 + · · · + ξn is a submartingale.
EXAMPLE 5. If X = (Xn , Fn ) is a martingale and g(x) is convex downward with
E |g(Xn )| < ∞, n ≥ 0, then the stochastic sequence (g(Xn ), Fn ) is a submartingale
(as follows from Jensen’s inequality; see Sect. 6 of Chap. 2, Vol. 1).
If X = (Xn , Fn ) is a submartingale and g(x) is convex downward and nonde-
creasing, with E |g(Xn )| < ∞ for all n ≥ 0, then (g(Xn ), Fn ) is also a submartin-
gale.
Assumption (1) in Definition 1 ensures the existence of the conditional expecta-
tions E(Xn+1 | Fn ), n ≥ 0. However, these expectations can also exist without the
assumption that E |Xn+1 | < ∞. Recall that, according to Sect. 7 of Chap. 2, Vol. 1,
E(X^+_{n+1} | Fn ) and E(X^−_{n+1} | Fn ) are always defined. Let us write A = B
(P-a.s.) when P(A △ B) = 0. Then if

    {ω : E(X^+_{n+1} | Fn ) < ∞} ∪ {ω : E(X^−_{n+1} | Fn ) < ∞} = Ω    (P-a.s.),

we say that E(Xn+1 | Fn ) is also defined and is given by

    E(Xn+1 | Fn ) = E(X^+_{n+1} | Fn ) − E(X^−_{n+1} | Fn ).

After this, the following definition is natural.

Definition 2. A stochastic sequence X = (Xn , Fn ) is a generalized martingale (or
submartingale) if the conditional expectations E(Xn+1 | Fn ) are defined for every
n ≥ 0 and the corresponding condition (2) is satisfied.
Notice that it follows from this definition that E(X^−_{n+1} | Fn ) < ∞ (P-a.s.) for a
generalized submartingale and that E(|Xn+1 | | Fn ) < ∞ (P-a.s.) for a generalized
martingale.
3. In the following definition we introduce the concept of a Markov time, which
plays a very important role in the subsequent theory.

Definition 3. A random variable τ = τ(ω) with values in the set {0, 1, . . . , +∞}
is a Markov time (with respect to (Fn )) (or a random variable independent of the
future) if, for each n ≥ 0,
{τ = n} ∈ Fn . (4)
When P(τ < ∞) = 1, a Markov time τ is called a stopping time.

Let X = (Xn , Fn ) be a stochastic sequence, and let τ be a Markov time (with
respect to (Fn )). We write

    Xτ (ω) = Σ_{n=0}^{∞} Xn (ω)I{τ=n} (ω)

(hence we set X∞ = 0 and Xτ = 0 on the set {ω : τ = ∞}).


Then, for every B ∈ B(R),

    {ω : Xτ ∈ B} = {ω : X∞ ∈ B, τ = ∞} + Σ_{n=0}^{∞} {Xn ∈ B, τ = n} ∈ F ,

and consequently, Xτ = Xτ(ω) (ω) is a random variable.
EXAMPLE 6. Let X = (Xn , Fn ) be a stochastic sequence, and let B ∈ B(R). Then
the time of first hitting the set B, that is,

    τB = min{n ≥ 0 : Xn ∈ B}

(with τB = +∞ if {·} = ∅), is a Markov time, since

    {τB = n} = {X0 ∉ B, . . . , Xn−1 ∉ B, Xn ∈ B} ∈ Fn

for every n ≥ 0.
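The first hitting time is computed by scanning the trajectory; on a finite path we return None in place of τB = +∞. A sketch under these conventions (names ours):

```python
def hitting_time(xs, in_B):
    """tau_B = min{n >= 0 : X_n in B}; None stands for +infinity on a finite path.

    Whether tau_B = n is decided by X_0, ..., X_n alone, which is exactly
    the Markov-time property {tau_B = n} in F_n."""
    for n, x in enumerate(xs):
        if in_B(x):
            return n
    return None
```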
EXAMPLE 7. Let X = (Xn , Fn ) be a martingale (or submartingale) and τ a Markov
time (with respect to (Fn )). Then the “stopped” sequence X^τ = (Xn∧τ , Fn ) is also
a martingale (or submartingale).
In fact, the equation

    Xn∧τ = Σ_{m=0}^{n−1} Xm I{τ=m} + Xn I{τ≥n}

implies that the variables Xn∧τ are Fn -measurable, are integrable, and satisfy

    X_{(n+1)∧τ} − Xn∧τ = I{τ>n} (Xn+1 − Xn ),

whence

    E[X_{(n+1)∧τ} − Xn∧τ | Fn ] = I{τ>n} E[Xn+1 − Xn | Fn ] = 0 (or ≥ 0).

Every flow (Fn ) and Markov time τ corresponding to it generate a collection of
sets
    Fτ = {A ∈ F : A ∩ {τ = n} ∈ Fn for all n ≥ 0}.

It is clear that Ω ∈ Fτ and that Fτ is closed under countable unions. Moreover, if
A ∈ Fτ , then Ω\A ∈ Fτ , since (Ω\A) ∩ {τ = n} = {τ = n}\(A ∩ {τ = n}) ∈ Fn .
Hence it follows that Fτ is a σ-algebra.
If we think of Fn as the collection of events observed up to time n (inclusive),
then Fτ can be thought of as the collection of events observed up to the “random”
time τ.
It is easy to show (Problem 3) that the random variables τ and Xτ are Fτ -
measurable.

4. Definition 4. A stochastic sequence X = (Xn , Fn ) is a local martingale (or sub-
martingale) if there is a (localizing) sequence (τk )k≥1 of finite Markov times such
that τk ≤ τk+1 (P-a.s.), τk ↑ ∞ (P-a.s.) as k → ∞, and every “stopped” sequence
X^{τk} = (Xτk ∧n I{τk >0} , Fn ) is a martingale (or submartingale).
In Theorem 1 below, we show that in fact the class of local martingales coincides
with the class of generalized martingales. Moreover, every local martingale can be
obtained as a “martingale transform” from a martingale and a predictable sequence.
Definition 5. Let Y = (Yn , Fn )n≥0 be a stochastic sequence and V = (Vn , Fn−1 )n≥0
a predictable sequence (F−1 = F0 ). The stochastic sequence V · Y = ((V · Y)n , Fn )
with

    (V · Y)n = V0 Y0 + Σ_{i=1}^{n} Vi ΔYi ,                                      (5)

where ΔYi = Yi − Yi−1 , is called the transform of Y by V. If, in addition, Y is a
martingale (or a local martingale), we say that V · Y is a martingale transform.

Theorem 1. Let X = (Xn , Fn )n≥0 be a stochastic sequence, and let X0 = 0 (P-a.s.).
The following conditions are equivalent:
(a) X is a local martingale;
(b) X is a generalized martingale;
(c) X is a martingale transform, i.e., there are a predictable sequence V =
    (Vn , Fn−1 ) with V0 = 0 and a martingale Y = (Yn , Fn ) with Y0 = 0 such
    that X = V · Y.

PROOF. (a) ⇒ (b). Let X be a local martingale, and let (τk ) be a localizing se-
quence of Markov times for X. Then, for every m ≥ 0,

    E[|Xm∧τk |I{τk >0} ] < ∞,                                                    (6)

and therefore

    E[|X_{(n+1)∧τk}|I{τk >n} ] = E[|Xn+1 |I{τk >n} ] < ∞.                        (7)

The random variable I{τk >n} is Fn -measurable. Hence it follows from (7) that

    E[|Xn+1 |I{τk >n} | Fn ] = I{τk >n} E[|Xn+1 | | Fn ] < ∞    (P-a.s.).

Here I{τk >n} → 1 (P-a.s.) as k → ∞, and therefore

    E[|Xn+1 | | Fn ] < ∞    (P-a.s.).                                            (8)

Under this condition, E[Xn+1 | Fn ] is defined, and it remains only to show that
E[Xn+1 | Fn ] = Xn (P-a.s.).
To do this, we need to show that

    ∫_A Xn+1 d P = ∫_A Xn d P

for A ∈ Fn . By Problem 7, Sect. 7, Chap. 2, Vol. 1, we have E[|Xn+1 | | Fn ] < ∞
(P-a.s.) if and only if the measure ∫_A |Xn+1 | d P, A ∈ Fn , is σ-finite. Let us show
that the measure ∫_A |Xn | d P, A ∈ Fn , is also σ-finite.
Since X^{τk} is a martingale, |X^{τk}| = (|Xτk ∧n |I{τk >0} , Fn ) is a submartingale,
and therefore (since {τk > n} ∈ Fn )
    ∫_{A∩{τk >n}} |Xn | d P = ∫_{A∩{τk >n}} |Xn∧τk |I{τk >0} d P
        ≤ ∫_{A∩{τk >n}} |X_{(n+1)∧τk}|I{τk >0} d P = ∫_{A∩{τk >n}} |Xn+1 | d P .
Letting k → ∞, we have

    ∫_A |Xn | d P ≤ ∫_A |Xn+1 | d P,

from which there follows the required σ-finiteness of the measure ∫_A |Xn | d P,
A ∈ Fn .
Let A ∈ Fn have the property ∫_A |Xn+1 | d P < ∞. Then, by Lebesgue’s theorem
on dominated convergence, we may take limits in the relation

    ∫_{A∩{τk >n}} Xn d P = ∫_{A∩{τk >n}} Xn+1 d P,

which is valid since X is a local martingale. Therefore

    ∫_A Xn d P = ∫_A Xn+1 d P

for all A ∈ Fn such that ∫_A |Xn+1 | d P < ∞. It then follows that the preceding
relation also holds for every A ∈ Fn , and therefore E(Xn+1 | Fn ) = Xn (P-a.s.).
(b) ⇒ (c). Let ΔXn = Xn − Xn−1 , X0 = 0, and V0 = 0, Vn = E[|ΔXn | | Fn−1 ],
n ≥ 1. We set

    Wn = Vn⊕ ,  where  Vn⊕ = Vn^{-1} if Vn ≠ 0 and Vn⊕ = 0 if Vn = 0,

Y0 = 0, and Yn = Σ_{i=1}^{n} Wi ΔXi , n ≥ 1. It is clear that

    E[|ΔYn | | Fn−1 ] ≤ 1,    E[ΔYn | Fn−1 ] = 0,

and consequently, Y = (Yn , Fn ) is a martingale. Moreover, X0 = V0 · Y0 = 0 and
Δ(V · Y)n = ΔXn . Therefore
    X = V · Y.
(c)⇒(a). Let X = V · Y, where V is a predictable sequence, Y is a martingale, and
V0 = Y0 = 0. Set
τk = min{n ≥ 0 : |Vn+1 | > k}
letting τk = ∞ if the set {·} = ∅. Since Vn+1 is Fn -measurable, the variables τk
are Markov times for every k ≥ 1.
Consider the sequence X^{τk} = ((V · Y)n∧τk I{τk >0} , Fn ). On the set {τk > 0}, the
inequality |Vn∧τk | ≤ k is in effect. Hence it follows that E |(V · Y)n∧τk I{τk >0} | < ∞
for every n ≥ 1. In addition, for n ≥ 1,

    E{[(V · Y)_{(n+1)∧τk} − (V · Y)n∧τk ] I{τk >0} | Fn }
        = I{τk >0} V_{(n+1)∧τk} · E{Y_{(n+1)∧τk} − Yn∧τk | Fn } = 0,

since (Example 7) E{Y_{(n+1)∧τk} − Yn∧τk | Fn } = 0.
Thus for every k ≥ 1 the “stopped” sequence X^{τk} is a martingale, τk ↑ ∞ (P-a.s.),
and consequently X is a local martingale.
This completes the proof of the theorem.

5. EXAMPLE 8. Let (ηn )n≥1 be a sequence of independent identically distributed
Bernoulli random variables with P(ηn = 1) = p, P(ηn = −1) = q, p + q = 1.
We interpret the event {ηn = 1} as the success (gain) and {ηn = −1} as the failure
(loss) of a player at the nth turn. Let us suppose that the player’s stake at the nth turn
is Vn . Then the player’s total gain through the nth turn is

    Xn = Σ_{i=1}^{n} Vi ηi = Xn−1 + Vn ηn ,    X0 = 0.

It is quite natural to suppose that the amount Vn at the nth turn may depend on the
results of the preceding turns, i.e., on V1 , . . . , Vn−1 and on η1 , . . . , ηn−1 . In other
words, if we put F0 = {∅, Ω} and Fn = σ{η1 , . . . , ηn }, then Vn is an Fn−1 -
measurable random variable, i.e., the sequence V = (Vn , Fn−1 ) that determines the
player’s “strategy” is predictable. Putting Yn = η1 + · · · + ηn , we find that

    Xn = Σ_{i=1}^{n} Vi ΔYi ,

i.e., the sequence X = (Xn , Fn ) with X0 = 0 is the transform of Y by V.


From the player’s point of view, the game in question is fair (or favorable or
unfavorable) if, at every stage, the conditional expectation

    E(Xn+1 − Xn | Fn ) = 0    (or ≥ 0 or ≤ 0).

Moreover, it is clear that the game is

    fair if p = q = 1/2,
    favorable if p > q,
    unfavorable if p < q.
Since X = (Xn , Fn ) is a

    martingale if p = q = 1/2,
    submartingale if p > q,
    supermartingale if p < q,
we can say that the assumption that the game is fair (or favorable or unfavorable)
corresponds to the assumption that the sequence X is a martingale (or submartingale
or supermartingale).
Let us now consider the special class of strategies V = (Vn , Fn−1 )n≥1 with
V1 = 1 and (for n > 1)

    Vn = 2^{n−1} if η1 = −1, . . . , ηn−1 = −1, and Vn = 0 otherwise.            (9)

In such a strategy, a player, having started with a stake V1 = 1, doubles the stake
after a loss and drops out of the game immediately after a win.
If η1 = −1, . . . , ηn = −1, the total loss to the player after n turns will be

    Σ_{i=1}^{n} 2^{i−1} = 2^n − 1.

Therefore, if also ηn+1 = 1, then we have

    Xn+1 = Xn + Vn+1 = −(2^n − 1) + 2^n = 1.

Let τ = min{n ≥ 1 : Xn = 1}. If p = q = 1/2, i.e., the game in question is fair,
then P(τ = n) = (1/2)^n , P(τ < ∞) = 1, P(Xτ = 1) = 1, and E Xτ = 1. Therefore,
even for a fair game, by applying the strategy (9), a player can in a finite time (with
probability 1) complete the game “successfully,” increasing his capital by one unit
(E Xτ = 1 > X0 = 0).
In gambling practice, this system (doubling the stakes after a loss and dropping
out of the game after a win) is called a martingale. This is the origin of the mathe-
matical term “martingale.”
Remark. When p = q = 1/2, the sequence X = (Xn , Fn ) with X0 = 0 is a martin-
gale, and therefore

    E Xn = E X0 = 0 for every n ≥ 1.

We may therefore expect that this equation will be preserved if the instant n is
replaced by a random instant τ. It will appear later (Theorem 1 in Sect. 2) that
E Xτ = E X0 in “typical” situations. Violations of this equation (as in the game
discussed above) arise in what we may describe as physically unrealizable situa-
tions, when either τ or |Xn | takes values that are much too large. (Note that the game
discussed above would be physically unrealizable since it supposes an unbounded
time for playing and an unbounded initial capital for the player.)
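The bookkeeping of strategy (9) can be checked exhaustively over all 2^N outcomes of a fair game of length N (our construction): every path with at least one win ends with total gain 1, the single all-loss path ends at −(2^N − 1), and the average over paths is exactly 0, i.e. the stopped game still satisfies E X_{τ∧N} = E X0 .

```python
from itertools import product

def doubling_gain(path):
    """Total gain over a ±1 path under strategy (9): start with stake 1,
    double the stake after each loss, drop out after the first win."""
    x, stake = 0, 1
    for eta in path:
        x += stake * eta
        stake = 0 if eta == 1 else 2 * stake
    return x

N = 10
gains = [doubling_gain(p) for p in product((-1, 1), repeat=N)]
```

The "sure win" conditional on stopping in time is paid for by the single huge loss, which is exactly why E X_{τ∧N} remains 0.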
6. Definition 6. A stochastic sequence ξ = (ξn , Fn ) is a martingale difference if
E |ξn | < ∞ for all n ≥ 0 and

    E(ξn+1 | Fn ) = 0    (P-a.s.).                                               (10)

The connection between martingales and martingale differences is clear from
Definitions 1 and 6. That is, if X = (Xn , Fn ) is a martingale, then ξ = (ξn , Fn ) with
ξ0 = X0 and ξn = ΔXn , n ≥ 1, is a martingale difference. In turn, if ξ = (ξn , Fn ) is
a martingale difference, then X = (Xn , Fn ) with Xn = ξ0 + · · · + ξn is a martingale.
In agreement with this terminology, every sequence ξ = (ξn )n≥0 of independent
integrable random variables with E ξn = 0 is a martingale difference (with Fn =
σ{ξ0 , ξ1 , . . . , ξn }).

7. The following theorem elucidates the structure of submartingales (or super-
martingales).
Theorem 2 (Doob). Let X = (Xn , Fn ) be a submartingale. Then there are a mar-
tingale m = (mn , Fn ) and a predictable increasing sequence A = (An , Fn−1 ) such
that for every n ≥ 0, Doob’s decomposition

Xn = mn + An (P -a.s.) (11)

holds. A decomposition of this kind is unique.


PROOF. Let us put m0 = X0 , A0 = 0, and

    mn = m0 + Σ_{j=0}^{n−1} [Xj+1 − E(Xj+1 | Fj )],                              (12)

    An = Σ_{j=0}^{n−1} [E(Xj+1 | Fj ) − Xj ].                                    (13)

It is evident that m and A, defined in this way, have the required properties. In addi-
tion, let Xn = m′n + A′n , where m′ = (m′n , Fn ) is a martingale and A′ = (A′n , Fn−1 )
is a predictable increasing sequence. Then

    A′_{n+1} − A′n = (An+1 − An ) + (mn+1 − mn ) − (m′_{n+1} − m′n ),

and if we take conditional expectations on both sides, we find that (P-a.s.) A′_{n+1} −
A′n = An+1 − An . But A′0 = A0 = 0, and therefore A′n = An and m′n = mn (P-a.s.)
for all n ≥ 0.
This completes the proof of the theorem.
It follows from (11) that the sequence A = (An , Fn−1 ) compensates X = (Xn , Fn )
so that it becomes a martingale. This observation justifies the following definition.

Definition 7. A predictable increasing sequence A = (An , Fn−1 ) appearing in the
Doob decomposition (11) is called a compensator (of the submartingale X).
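For a concrete illustration (our choice, not from the text): if Sn is a symmetric ±1 random walk, then Xn = Sn² is a submartingale, and (13) gives the compensator An = n, since E(Xj+1 | Fj ) − Xj = ((Sj + 1)² + (Sj − 1)²)/2 − Sj² = 1. Both facts can be checked exhaustively:

```python
from itertools import product

def compensator_increment(s):
    # E(X_{j+1} | F_j) - X_j for X = S^2, averaging over the fair step η = ±1
    return ((s + 1) ** 2 + (s - 1) ** 2) / 2 - s ** 2

def mean_square(n):
    # E S_n^2 over all 2^n equiprobable ±1 paths; equals n, so the martingale
    # part m_n = S_n^2 - n of the Doob decomposition has mean 0
    return sum(sum(p) ** 2 for p in product((-1, 1), repeat=n)) / 2 ** n
```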
The Doob decomposition plays a key role in the study of square-integrable mar-
tingales M = (Mn , Fn ), i.e., martingales for which E Mn² < ∞, n ≥ 0; this depends
on the observation that the stochastic sequence M² = (Mn² , Fn ) is a submartingale.
According to Theorem 2, there are a martingale m = (mn , Fn ) and a predictable
increasing sequence ⟨M⟩ = (⟨M⟩n , Fn−1 ) such that

    Mn² = mn + ⟨M⟩n .                                                            (14)

The sequence ⟨M⟩ is called the quadratic characteristic of M and, in many re-
spects, determines its structure and properties.
It follows from (13) that

    ⟨M⟩n = Σ_{j=1}^{n} E[(ΔMj )² | Fj−1 ]                                        (15)

and, for all l ≤ k,

    E[(Mk − Ml )² | Fl ] = E[Mk² − Ml² | Fl ] = E[⟨M⟩k − ⟨M⟩l | Fl ].            (16)

In particular, if M0 = 0 (P-a.s.), then

    E Mk² = E⟨M⟩k .                                                              (17)

It is useful to observe that if M0 = 0 and Mn = ξ1 + · · · + ξn , where (ξn ) is
a sequence of independent random variables with E ξi = 0 and E ξi² < ∞, the
quadratic characteristic

    ⟨M⟩n = E Mn² = Var ξ1 + · · · + Var ξn                                       (18)

is not random and, indeed, coincides with the variance.


If X = (Xn , Fn ) and Y = (Yn , Fn ) are square-integrable martingales, we put

X, Yn = 14 [X + Yn − X − Yn ]. (19)

It is easily verified that (Xn Yn −X, Yn , Fn ) is a martingale, and therefore, for l ≤ k,

E[(Xk − Xl )(Yk − Yl ) | Fl ] = E[X, Yk − X, Yl | Fl ]. (20)

In the case when Xn = ξ1 + · · · + ξn and Yn = η1 + · · · + ηn , where (ξn ) and
(ηn ) are sequences of independent random variables with E ξi = E ηi = 0 and
E ξi² < ∞, E ηi² < ∞, the variable ⟨X, Y⟩n is given by

    ⟨X, Y⟩n = Σ_{i=1}^{n} Cov(ξi , ηi ).

The sequence ⟨X, Y⟩ = (⟨X, Y⟩n , Fn−1 ), defined in (19), is often called the mu-
tual characteristic of the (square-integrable) martingales X and Y. It is easy to show
(cf. (15)) that

    ⟨X, Y⟩n = Σ_{i=1}^{n} E[ΔXi ΔYi | Fi−1 ].

In the theory of martingales, an important role is also played by the quadratic
covariation,

    [X, Y]n = Σ_{i=1}^{n} ΔXi ΔYi ,

and the quadratic variation,

    [X]n = Σ_{i=1}^{n} (ΔXi )² ,

which can be defined for all random sequences X = (Xn )n≥1 and Y = (Yn )n≥1 .

8. In connection with Theorem 1, it is natural to ask when a local martingale (and
hence a generalized martingale or a martingale transform) is in fact a martingale.

Theorem 3. (1) Suppose that a stochastic sequence X = (Xn , Fn )n≥0 is a local
martingale (with X0 = 0 or, more generally, with E |X0 | < ∞).
If E Xn^− < ∞, n ≥ 0, or E Xn^+ < ∞, n ≥ 0, then X = (Xn , Fn )n≥0 is a
martingale.
(2) Let X = (Xn , Fn )0≤n≤N be a local martingale, N < ∞, and either E XN^− <
∞ or E XN^+ < ∞. Then X = (Xn , Fn )0≤n≤N is a martingale.

PROOF. (1) Let us show that either of the conditions E Xn^− < ∞, n ≥ 0, or
E Xn^+ < ∞, n ≥ 0, implies that E |Xn | < ∞, n ≥ 0.
Indeed, let, for example, E Xn^− < ∞ for all n ≥ 0. Then, by Fatou’s lemma,

    E Xn^+ = E lim inf_k X^+_{n∧τk} ≤ lim inf_k E X^+_{n∧τk}
           = lim inf_k [E X_{n∧τk} + E X^−_{n∧τk}]
           = E X0 + lim inf_k E X^−_{n∧τk} ≤ | E X0 | + Σ_{k=0}^{n} E Xk^− < ∞.

Therefore E |Xn | < ∞, n ≥ 0.


To prove the martingale property E(Xn+1 | Fn ) = Xn , n ≥ 0, let us observe that
for any Markov time τk we have

    |X_{(n+1)∧τk}| ≤ Σ_{i=0}^{n+1} |Xi |,

where Σ_{i=0}^{n+1} E |Xi | < ∞.
Therefore, taking the limit as k → ∞, τk ↑ ∞ (P-a.s.), in the equality
E(X_{(n+1)∧τk} | Fn ) = Xn∧τk , we obtain by Lebesgue’s dominated convergence
theorem that E(Xn+1 | Fn ) = Xn (P-a.s.).
(2) Assume, for example, that E XN^− < ∞. We will then show that E Xn^− < ∞
for all n < N.
Indeed, since a local martingale is a generalized martingale, we have Xn =
E(Xn+1 | Fn ), where E(|Xn+1 | | Fn ) < ∞ (P-a.s.). Then, by Jensen’s inequal-
ity for conditional expectations (see Problem 5 in Sect. 7, Chap. 2, Vol. 1), Xn^− ≤
E(X^−_{n+1} | Fn ). Therefore E Xn^− ≤ E X^−_{n+1} ≤ · · · ≤ E XN^− < ∞.
Thus the desired martingale property of the local martingale X = (Xn , Fn )0≤n≤N
follows from conclusion (1).


9. PROBLEMS
1. Show that (2) and (3) are equivalent.
2. Let σ and τ be Markov times. Show that τ + σ, τ ∧ σ, and τ ∨ σ are also Markov
times; in addition, if P(σ ≤ τ) = 1, then Fσ ⊆ Fτ (see Example 7 for the
definition of Fτ ).
3. Show that τ and Xτ are Fτ -measurable.
4. Let Y = (Yn , Fn ) be a martingale (or submartingale), let V = (Vn , Fn−1 ) be
a predictable sequence, and let (V · Y)n be integrable random variables, n ≥ 0.
Show that V · Y is a martingale (or submartingale).
5. Let G1 ⊇ G2 ⊇ · · · be a nonincreasing family of σ-algebras, and let ξ be an
integrable random variable. Show that (Xn )n≥1 with Xn = E(ξ | Gn ) is a reversed
martingale, i.e.,

E(Xn | Xn+1 , Xn+2 , . . .) = Xn+1 (P -a.s.)

for every n ≥ 1.
6. Let ξ1 , ξ2 , . . . be independent random variables with

       P(ξi = 0) = P(ξi = 2) = 1/2 and Xn = ∏_{i=1}^{n} ξi .

Show that there does not exist an integrable random variable ξ and a nondecreas-
ing family (Fn ) of σ-algebras such that Xn = E(ξ | Fn ). This example shows
that not every martingale (Xn )n≥1 can be represented in the form (E(ξ | Fn ))n≥1
(cf. Example 3 in Sect. 11, Chap. 1, Vol. 1).
7. (a) Let ξ1 , ξ2 , . . . be independent random variables with E |ξn | < ∞, E ξn = 0,
   n ≥ 1. Show that for any k ≥ 1 the sequence

       X_n^{(k)} = Σ_{1≤i1 <···<ik ≤n} ξ_{i1} . . . ξ_{ik} ,    n ≥ k,

   is a martingale.
   (b) Let ξ1 , ξ2 , . . . be integrable random variables such that

           E(ξn+1 | ξ1 , . . . , ξn ) = (ξ1 + · · · + ξn )/n (= Xn ).

       Prove that the sequence X1 , X2 , . . . is a martingale.
8. Give an example of a martingale (Xn , Fn )n≥1 such that the family {Xn , n ≥ 1}
is not uniformly integrable.
9. Let X = (Xn )n≥0 be a Markov chain (Sect. 1, Chap. 8) with a countable state
   space E = {i, j, . . . } and transition probabilities pij . Let ψ = ψ(x), x ∈ E, be a
   bounded function such that Σ_{j∈E} pij ψ(j) ≤ λψ(i) for some λ > 0 and all i ∈ E.
   Show that the sequence (λ^{−n} ψ(Xn ))n≥0 is a supermartingale.

2. Preservation of Martingale Property Under a Random Time Change

1. If X = (Xn , Fn )n≥0 is a martingale, then we have

    E Xn = E X0                                                                  (1)

for every n ≥ 1. Is this property preserved if the time n is replaced by a Markov
time τ? Example 8 of the preceding section shows that, in general, the answer is no:
there exist a martingale X and a Markov time τ (finite with probability 1) such that

    E Xτ ≠ E X0 .                                                                (2)

The following basic theorem describes the “typical” situation in which, in par-
ticular, E Xτ = E X0 . (We let Xτ = 0 on the set {τ = ∞}.)
Theorem 1 (Doob). (a) Let X = (Xn , Fn )n≥0 be a submartingale, and τ and σ
finite (P-a.s.) stopping times for which E Xτ and E Xσ are defined (e.g., such that
E |Xτ | < ∞ and E |Xσ | < ∞). Assume that

    lim inf_{m→∞} E[Xm^+ I(τ > m)] = 0.                                          (3)

Then
    E(Xτ | Fσ ) ≥ Xτ∧σ    (P-a.s.)                                               (4)

or, equivalently,

    E(Xτ | Fσ ) ≥ Xσ    ({τ ≥ σ}; P-a.s.).
(b) Let M = (Mn , Fn )n≥0 be a martingale, and τ and σ finite (P-a.s.) stop-
ping times for which E Mτ and E Mσ are defined (e.g., such that E |Mτ | < ∞ and
E |Mσ | < ∞). Assume that

    lim inf_{m→∞} E[|Mm |I(τ > m)] = 0.                                          (5)

Then
    E(Mτ | Fσ ) = Mτ∧σ    (P-a.s.)                                               (6)

or, equivalently,

    E(Mτ | Fσ ) = Mσ    ({τ ≥ σ}; P-a.s.).

PROOF. (a) We must show that, for every A ∈ Fσ ,

    E Xτ I(A, τ ≥ σ) ≥ E Xσ I(A, τ ≥ σ),                                         (7)

where I(A, τ ≥ σ) is the indicator function of the set A ∩ {τ ≥ σ}.
To prove (7), it suffices to show that for any n ≥ 0

    E Xτ I(A, τ ≥ σ, σ = n) ≥ E Xσ I(A, τ ≥ σ, σ = n),

i.e., that

    E Xτ I(B, τ ≥ n) ≥ E Xn I(B, τ ≥ n),    B = A ∩ {σ = n}.
Using the property B ∩ {τ > n} ∈ Fn and the fact that the process X =
(Xn , Fn )n≥0 is a submartingale, we find by iterating in n that for any m ≥ n

    E Xn I(B, τ ≥ n) = E Xn I(B, τ = n) + E Xn I(B, τ > n)
        ≤ E Xn I(B, τ = n) + E[E(Xn+1 | Fn ) I(B, τ > n)]
        = E Xn I(B, τ = n) + E Xn+1 I(B, τ ≥ n + 1)
        = E Xτ I(B, n ≤ τ ≤ n + 1) + E Xn+1 I(B, τ > n + 1)
        ≤ E Xτ I(B, n ≤ τ ≤ n + 1) + E Xn+2 I(B, τ ≥ n + 2)
        ≤ · · · ≤ E Xτ I(B, n ≤ τ ≤ m) + E Xm I(B, τ > m).

Consequently,

E Xτ I(B, n ≤ τ ≤ m) ≥ E Xn I(B, τ ≥ n) − E Xm I(B, τ > m). (8)

By assumption, E Xτ is defined. Therefore the set function Q(C) = E Xτ I(C) of
events C is countably additive (Subsection 8 in Sect. 6, Chap. 2, Vol. 1), and hence
there exists the limit limm→∞ E Xτ I(B, n ≤ τ ≤ m). Therefore, since the Markov
time τ is finite (P-a.s.), inequality (8) implies that

    E Xτ I(B, τ ≥ n) ≥ lim sup_{m→∞} [E Xn I(B, τ ≥ n) − E Xm I(B, τ > m)]
        = E Xn I(B, τ ≥ n) − lim inf_{m→∞} E Xm I(B, τ > m)
        ≥ E Xn I(B, τ ≥ n) − lim inf_{m→∞} E Xm^+ I(B, τ > m)
        = E Xn I(B, τ ≥ n).

Thus, we have

    E Xτ I(B, σ = n, τ ≥ n) ≥ E Xn I(B, σ = n, τ ≥ n)

or

    E Xτ I(A, τ ≥ σ, σ = n) ≥ E Xσ I(A, τ ≥ σ, σ = n).

Hence, using the assumption P{σ < ∞} = 1 and the fact that the expectations E Xτ
and E Xσ are defined, we obtain the desired inequality (7).
(b) Let M = (Mn , Fn )n≥0 be a martingale satisfying (5). This condition implies
that

    lim inf_{m→∞} E[Mm^+ I(τ > m)] = lim inf_{m→∞} E[Mm^− I(τ > m)] = 0.

Setting X = M and X = −M in (a), we find that (P-a.s.)

    E[Mτ | Fσ ] ≥ Mτ∧σ and E[−Mτ | Fσ ] ≥ −Mτ∧σ ,

with the latter inequality telling us that E[Mτ | Fσ ] ≤ Mτ∧σ . Hence E[Mτ | Fσ ] =
Mτ∧σ (P-a.s.), which is precisely equality (6).


Corollary 1. Let τ and σ be stopping times such that

    P{σ ≤ τ ≤ N} = 1

for some N. Then for a submartingale X we have

    E X0 ≤ E Xσ ≤ E Xτ ≤ E XN ,

and for a martingale M

    E M0 = E Mσ = E Mτ = E MN .

Corollary 2. Let X = (Xn , Fn )n≥0 be a submartingale. If the family of random
variables {Xn , n ≥ 0} is uniformly integrable (in particular, if |Xn | ≤ c (P-a.s.),
n ≥ 0, for some c), then for any finite (P-a.s.) stopping times τ and σ inequality (4)
holds, and if P{σ ≤ τ} = 1, then

    E X0 ≤ E Xσ ≤ E Xτ .

Moreover, if X = M is a martingale, then equality (6) holds, and if P{σ ≤ τ} = 1,
then
    E M0 = E Mσ = E Mτ .
For the proof, let us observe that properties (3) and (5) follow from Lemma 2
in Subsection 5, Sect. 6, Chap. 2, Vol. 1, and the fact that P{τ > m} → 0 as m → ∞.
We will now show that the expectations E |Xτ | and E |Xσ | are finite. To prove
this, it suffices to show that

    E |Xτ | ≤ 3 sup_N E |XN |                                                    (9)

(and similarly for σ) because, due to inequality (16) of Sect. 6, Chap. 2, Vol. 1, the
assumption of uniform integrability of {Xn , n ≥ 0} implies that supN E |XN | < ∞;
hence the required inequality E |Xτ | < ∞ (and, similarly, E |Xσ | < ∞) will follow
from (9).
Corollary 1 applied to the bounded stopping time τN = τ ∧ N implies

    E X0 ≤ E XτN .

Therefore

    E |XτN | = 2 E X^+_{τN} − E XτN ≤ 2 E X^+_{τN} − E X0 .                      (10)
The sequence X^+ = (Xn^+ , Fn )n≥0 is a submartingale (see Example 5 in Sect. 1);
hence

    E X^+_{τN} = Σ_{j=0}^{N} E[Xj^+ I(τ = j)] + E[XN^+ I(τ > N)]
              ≤ Σ_{j=0}^{N} E[XN^+ I(τ = j)] + E[XN^+ I(τ > N)]
              = E XN^+ ≤ E |XN | ≤ sup_m E |Xm |,

which, combined with the inequality in (10), yields

    E |XτN | ≤ 3 sup_m E |Xm |.

Hence we obtain by Fatou’s lemma (Theorem 2 (a) in Sect. 6, Chap. 2, Vol. 1)

E |Xτ | = E lim |XτN | = E lim inf |XτN | ≤ lim inf E |XτN | ≤ 3 sup E |XN |,
N N N N

which proves (9).

Remark 1. The martingale X = (Xn , Fn )n≥0 (with p = q = 1/2) in Example 8 of
the previous section was shown to satisfy

    E |Xm | I(τ > m) = (2^m − 1) P{τ > m} = (2^m − 1) · 2^{−m} → 1,    m → ∞.

Therefore condition (5) fails here. It is of interest to notice that property (6) fails
here as well, since, as was shown in that example, there is a stopping time τ such that
E Xτ = 1 > X0 = 0. In this sense, condition (5) (together with the condition that
E Xσ and E Xτ are defined) is not only sufficient for (6), but also “almost necessary.”

2. The following proposition, which we shall deduce from Theorem 1, is often useful
in applications.

Theorem 2. Let X = (Xn ) be a martingale (or submartingale) and τ a stopping
time (with respect to (FnX ), where FnX = σ{X0 , . . . , Xn }). Suppose that E τ < ∞
and that for every n ≥ 0 and some constant C

    E{|Xn+1 − Xn | | FnX } ≤ C    ({τ ≥ n}; P-a.s.).

Then E |Xτ | < ∞ and

    E Xτ = E X0                                                                  (11)

(with “=” replaced by “≥” for a submartingale).

PROOF. We first verify that the stopping time τ has the properties

    E |Xτ | < ∞  and  lim inf_{n→∞} ∫_{τ>n} |Xn | d P = 0,

which by Theorem 1 imply (11).


Let

    Y0 = |X0 |,    Yj = |Xj − Xj−1 |,    j ≥ 1.

Then |Xτ | ≤ Σ_{j=0}^{τ} Yj and

    E |Xτ | ≤ E Σ_{j=0}^{τ} Yj = ∫_Ω Σ_{j=0}^{τ} Yj d P = Σ_{n=0}^{∞} ∫_{τ=n} Σ_{j=0}^{n} Yj d P
            = Σ_{n=0}^{∞} Σ_{j=0}^{n} ∫_{τ=n} Yj d P = Σ_{j=0}^{∞} Σ_{n=j}^{∞} ∫_{τ=n} Yj d P = Σ_{j=0}^{∞} ∫_{τ≥j} Yj d P .

The set {τ ≥ j} = Ω\{τ < j} ∈ F^X_{j−1} , j ≥ 1. Therefore

    ∫_{τ≥j} Yj d P = ∫_{τ≥j} E[Yj | X0 , . . . , Xj−1 ] d P ≤ C P{τ ≥ j}

for j ≥ 1, and hence

    E |Xτ | ≤ E Σ_{j=0}^{τ} Yj ≤ E |X0 | + C Σ_{j=1}^{∞} P{τ ≥ j} = E |X0 | + C E τ < ∞.    (12)

Moreover, if τ > n, then

    Σ_{j=0}^{n} Yj ≤ Σ_{j=0}^{τ} Yj ,

and therefore

    ∫_{τ>n} |Xn | d P ≤ ∫_{τ>n} Σ_{j=0}^{τ} Yj d P .

Hence, since (by (12)) E Σ_{j=0}^{τ} Yj < ∞ and {τ > n} ↓ ∅, n → ∞, the dominated
convergence theorem yields

    lim inf_{n→∞} ∫_{τ>n} |Xn | d P ≤ lim inf_{n→∞} ∫_{τ>n} Σ_{j=0}^{τ} Yj d P = 0.

Hence the hypotheses of Theorem 1 are satisfied, and (11) follows, as required.
This completes the proof of the theorem.

3. Here we present some applications of the preceding theorems.


Theorem 3 (Wald’s Identities). Let ξ1 , ξ2 , . . . be independent identically dis-
tributed random variables with E |ξi | < ∞, and let τ be a stopping time (with respect
to (Fnξ ), where Fnξ = σ{ξ1 , . . . , ξn }, τ ≥ 1) with E τ < ∞. Then

    E(ξ1 + · · · + ξτ ) = E ξ1 · E τ.                                            (13)

If also E ξi² < ∞, then

    E{(ξ1 + · · · + ξτ ) − τ E ξ1 }² = Var ξ1 · E τ.                             (14)

PROOF. Let X = (Xn, Fnξ)n≥1, where Xn = (ξ1 + · · · + ξn) − n E ξ1. It is clear that X is a martingale with

E[|Xn+1 − Xn | | X1 , . . . , Xn ] = E[|ξn+1 − E ξ1 | | ξ1 , . . . , ξn ]
= E |ξn+1 − E ξ1 | ≤ 2 E |ξ1 | < ∞.

Therefore, by Theorem 2, E Xτ = E X0 = 0, and (13) is established.


We will give three proofs of Wald’s second identity (14).
The first proof. Let ηi = ξi − E ξi , Sn = η1 + · · · + ηn . We must show that

E Sτ² = E η1² · E τ.

Put τ(n) = τ ∧ n (= min(τ, n)).


Since

Sn² = Σ_{i=1}^n ηi² + 2 Σ_{1≤i<j≤n} ηi ηj,

the sequence (Sn² − Σ_{i=1}^n ηi², Fnξ)n≥1 is a martingale with zero expectation.
By Corollary 1 we have

E S²_{τ(n)} = E Σ_{i=1}^{τ(n)} ηi²

and by Wald’s first identity (13)

E Σ_{i=1}^{τ(n)} ηi² = E η1² · E τ(n),

so that E S²_{τ(n)} = E η1² · E τ(n).

In a similar way we obtain that

E(S_{τ(n)} − S_{τ(m)})² = E η1² · E(τ(n) − τ(m)) → 0

as m, n → ∞, since E τ < ∞ by assumption. Hence the sequence {S_{τ(n)}}n≥1 is
fundamental (i.e., a Cauchy sequence) in L² (see Subsection 5 of Sect. 10, Chap. 2,
Vol. 1), so, by Theorem 7 of Sect. 10, Chap. 2, Vol. 1, there is a random variable S
such that E(S_{τ(n)} − S)² → 0, n → ∞. This implies (Problem 1 in Sect. 11, Chap. 2,
Vol. 1) that E S²_{τ(n)} → E S², n → ∞. As was shown earlier, E S²_{τ(n)} = E η1² · E τ(n);
therefore, letting n → ∞, we obtain that E S² = E η1² · E τ.

It remains to identify the random variable S. Let us observe that with probability 1
there is a subsequence {n′} ⊆ {n} such that both S_{τ(n′)} → S and τ(n′) → τ. But
then it is clear that S_{τ(n′)} → Sτ with probability 1. Therefore S and Sτ are the same
almost surely; hence E Sτ² = E η1² · E τ, which was to be proved.
The second proof. By Fatou’s lemma (Theorem 2 (a), Sect. 6, Chap. 2, Vol. 1), we
obtain from the equality E S²_{τ(n)} = E η1² · E τ(n) established above that

E Sτ² = E lim inf S²_{τ(n)} ≤ lim inf E S²_{τ(n)} = E η1² · E τ.

The required equality E Sτ² = E η1² · E τ will be proved if we show that

E S²_{τ(n)} ≤ E Sτ²

for any n ≥ 1.
Notice, using Wald’s first identity (13), that

E |Sτ | = E |η1 + · · · + ητ | ≤ E(|η1 | + · · · + |ητ |) = E |η1 | · E τ < ∞,

so

E |Sn |I(τ > n) = E |η1 + · · · + ηn | I(τ > n) ≤ E(|η1 | + · · · + |ηn |) I(τ > n)
≤ E(|η1 | + · · · + |ητ |) I(τ > n) → 0 as n → ∞.

Applying Theorem 1 to the submartingale (|Sn|, Fnξ)n≥1, we find that, on the set {τ ≥ n},

E(|Sτ| | Fnξ) ≥ |Sn| (P-a.s.).
Hence, by Jensen’s inequality for conditional expectations (Problem 5, Sect. 7,
Chap. 2, Vol. 1), we obtain that on the set {τ ≥ n}

E(Sτ2 | Fnξ ) ≥ Sn2 = Sτ(n)


2
(P -a.s.).

And on the complementary set {τ < n} we have E(Sτ2 | Fnξ ) = Sτ2 = Sτ(n)
2
. Thus
(P-a.s.)
E(Sτ2 | Fnξ ) ≥ Sτ(n)
2

and hence E Sτ2 ≥ E Sτ(n)


2
, as required.
126 7 Martingales
The third proof. We see from the first proof that (Sn² − Σ_{i=1}^n ηi², Fnξ)n≥1 is a
martingale and

E S²_{τ(n)} = E η1² · E τ(n)

for τ(n) = τ ∧ n. Since E τ(n) → E τ, we only have to show that E S²_{τ(n)} → E Sτ².
For that, it suffices to establish that

E sup_n S²_{τ(n)} < ∞,

because the required convergence will then follow by Lebesgue’s dominated con-
vergence theorem (Theorem 3, Sect. 6, Chap. 2, Vol. 1).
For the proof of this inequality we will use the “maximal inequality” (14) to be
given in Sect. 3 below. This inequality applied to the martingale (S_{τ(k)}, Fkξ)k≥1
yields

E max_{1≤k≤n} S²_{τ(k)} ≤ 4 E S²_{τ(n)} ≤ 4 sup_n E S²_{τ(n)}.

Hence, using the monotone convergence theorem (Theorem 1 of Sect. 6, Chap. 2,
Vol. 1), we obtain

E sup_{k≥1} S²_{τ(k)} ≤ 4 sup_n E S²_{τ(n)}.

But
E S²_{τ(n)} = E η1² · E τ(n) ≤ E η1² · E τ < ∞.

Therefore
E sup_n S²_{τ(n)} ≤ 4 E η1² · E τ < ∞,

as was to be shown.

Corollary. Let ξ1, ξ2, . . . be independent identically distributed random variables
with

P(ξi = 1) = P(ξi = −1) = 1/2, Sn = ξ1 + · · · + ξn,

and τ = inf{n ≥ 1 : Sn = 1}. Then P{τ < ∞} = 1 (see, for example, (20) in
Sect. 9, Chap. 1, Vol. 1) and therefore P(Sτ = 1) = 1, E Sτ = 1. Hence it follows
from (13) that E τ = ∞.
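Wald’s first identity (13) lends itself to a quick numerical check. The following Python sketch is illustrative only: the drifted Bernoulli step distribution, the barrier, and the time cap are our own choices (the cap makes E τ < ∞ automatic, and the capped exit time is still a stopping time).

```python
import random

def stopped_walk(p, barrier, cap, rng):
    # Run S_n = xi_1 + ... + xi_n with P(xi = 1) = p, P(xi = -1) = 1 - p,
    # until |S_n| >= barrier or n == cap; return (S_tau, tau).
    s, n = 0, 0
    while abs(s) < barrier and n < cap:
        s += 1 if rng.random() < p else -1
        n += 1
    return s, n

rng = random.Random(0)
p, trials = 0.6, 20000
mean_s_tau = 0.0
mean_tau = 0.0
for _ in range(trials):
    s, n = stopped_walk(p, 3, 100, rng)
    mean_s_tau += s / trials
    mean_tau += n / trials

e_xi = p - (1 - p)                  # E xi_1 = 0.2
# the two sides of (13): E S_tau and E xi_1 * E tau
print(mean_s_tau, e_xi * mean_tau)
```

With 20 000 trials the Monte Carlo error of both sides is well below the tolerance used here.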

Theorem 4 (Wald’s Fundamental Identity). Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables, Sn = ξ1 + · · · + ξn, n ≥ 1. Let
ϕ(t) = E e^{tξ1}, t ∈ R, and let ϕ(t0) exist for some t0 ≠ 0 with ϕ(t0) ≥ 1.
If τ is a stopping time (with respect to (Fnξ), Fnξ = σ{ξ1, . . . , ξn}, τ ≥ 1), such
that |Sn| ≤ C ({τ ≥ n}; P-a.s.) and E τ < ∞, then

E [e^{t0 Sτ} / (ϕ(t0))^τ] = 1. (15)

PROOF. Take
Yn = e^{t0 Sn} (ϕ(t0))^{−n}.

Then Y = (Yn, Fnξ)n≥1 is a martingale with E Yn = 1 and, on the set {τ ≥ n},

E{|Yn+1 − Yn| | Y1, . . . , Yn} = Yn E{ |e^{t0 ξn+1}/ϕ(t0) − 1| | ξ1, . . . , ξn }
  = Yn · E |e^{t0 ξ1} (ϕ(t0))^{−1} − 1| ≤ C < ∞ (P-a.s.),

where C is a constant. Therefore Theorem 2 is applicable, and (15) follows since


E Y1 = 1.
This completes the proof.
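Identity (15) can likewise be checked by simulation. In the sketch below the symmetric ±1 steps, the exit interval (−3, 3), and t0 = 0.5 are our own choices; with them ϕ(t0) = cosh t0 ≥ 1 and |Sn| ≤ 3 on {τ ≥ n}, so the hypotheses of Theorem 4 hold.

```python
import math
import random

t0 = 0.5
phi = math.cosh(t0)        # phi(t0) = E exp(t0 * xi_1) for xi = +-1 equiprobable

def exit_time(rng, A=-3, B=3):
    # symmetric walk until it leaves (A, B); |S_n| <= 3 on {tau >= n}
    s, n = 0, 0
    while A < s < B:
        s += 1 if rng.random() < 0.5 else -1
        n += 1
    return s, n

rng = random.Random(1)
trials = 20000
est = 0.0                  # Monte Carlo estimate of E[exp(t0 * S_tau) / phi^tau]
for _ in range(trials):
    s, n = exit_time(rng)
    est += math.exp(t0 * s) / phi ** n / trials
print(est)
```

By (15) the estimate should be close to 1; the summands are bounded (|Sτ| ≤ 3, ϕ(t0) ≥ 1), so the Monte Carlo error is small.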


EXAMPLE 1. This example illustrates how the preceding theorems can be used to
find the probabilities of ruin and the mean duration of play (Sect. 9, Chap. 1, Vol. 1).
Let ξ1, ξ2, . . . be a sequence of independent Bernoulli random variables with
P(ξi = 1) = p, P(ξi = −1) = q, p + q = 1, Sn = ξ1 + · · · + ξn, and

τ = min{n ≥ 1 : Sn = B or A}, (16)

where (−A) and B are positive integers.


It follows from (20) (Sect. 9, Chap. 1, Vol. 1) that P(τ < ∞) = 1 and E τ < ∞.
Then, if α = P(Sτ = A), β = P(Sτ = B), we have α + β = 1. If p = q = 1/2, we
obtain from (13)

0 = E Sτ = αA + βB,

whence
α = B/(B + |A|), β = |A|/(B + |A|).

Applying (14), we obtain

E τ = E Sτ² = αA² + βB² = |AB|.

However, if p ≠ q, then, by considering the martingale ((q/p)^{Sn})n≥1, we find
that

E (q/p)^{Sτ} = E (q/p)^{S1} = 1,

and therefore
α (q/p)^A + β (q/p)^B = 1.

Together with the equation α + β = 1, this yields

α = ((q/p)^B − 1) / ((q/p)^B − (q/p)^A), β = (1 − (q/p)^A) / ((q/p)^B − (q/p)^A). (17)

Finally, since E Sτ = (p − q) E τ, we find

E τ = E Sτ / (p − q) = (αA + βB) / (p − q),
where α and β are defined by (17).
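The closed-form answers of Example 1 are convenient to check in code. The sketch below evaluates (17) and E τ = (αA + βB)/(p − q) for one parameter choice of ours (p = 0.6, A = −3, B = 4) and compares them with a direct simulation of the game.

```python
import random

p, q = 0.6, 0.4
A, B = -3, 4                 # (-A) and B positive integers, as in (16)
x = q / p

alpha = (x ** B - 1) / (x ** B - x ** A)   # P(S_tau = A), formula (17)
beta = (1 - x ** A) / (x ** B - x ** A)    # P(S_tau = B), formula (17)
e_tau = (alpha * A + beta * B) / (p - q)   # mean duration of play

rng = random.Random(2)
trials = 20000
hits_A, tau_sum = 0, 0
for _ in range(trials):
    s, n = 0, 0
    while A < s < B:
        s += 1 if rng.random() < p else -1
        n += 1
    hits_A += (s == A)
    tau_sum += n
print(alpha, hits_A / trials, e_tau, tau_sum / trials)
```

The first two asserted identities are exactly the two linear equations from which (17) was derived; the simulation then confirms the numerical values.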

EXAMPLE 2. In the example considered above, let p = q = 1/2. Let us show that for
τ defined in (16) and every λ with 0 < λ < π/(B + |A|)

E (cos λ)^{−τ} = cos(λ(B + A)/2) / cos(λ(B + |A|)/2). (18)

For this purpose we consider the martingale X = (Xn, Fnξ)n≥0 with

Xn = (cos λ)^{−n} cos(λ(Sn − (B + A)/2)) (19)

and S0 = 0. It is clear that

E Xn = E X0 = cos(λ(B + A)/2). (20)

Let us show that the family {X_{n∧τ}} is uniformly integrable. For this purpose we
observe that, by Corollary 1 to Theorem 1, for 0 < λ < π/(B + |A|),

E X0 = E X_{n∧τ} = E (cos λ)^{−(n∧τ)} cos(λ(S_{n∧τ} − (B + A)/2))
     ≥ E (cos λ)^{−(n∧τ)} cos(λ(B − A)/2).

Therefore, by (20),

E (cos λ)^{−(n∧τ)} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2),

and consequently, by Fatou’s lemma,

E (cos λ)^{−τ} ≤ cos(λ(B + A)/2) / cos(λ(B + |A|)/2). (21)

Consequently, by (19),
|X_{n∧τ}| ≤ (cos λ)^{−τ}.

With (21), this establishes the uniform integrability of the family {X_{n∧τ}}. Then, by
Corollary 2 to Theorem 1,

cos(λ(B + A)/2) = E X0 = E Xτ = E (cos λ)^{−τ} cos(λ(B − A)/2),

from which the required equality (18) follows.
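Identity (18) is also easy to test numerically. In this sketch the barriers A = −2, B = 2 and the value λ = 0.3 < π/4 are our own illustrative choices.

```python
import math
import random

A, B = -2, 2
lam = 0.3                  # must satisfy 0 < lam < pi / (B + |A|) = pi / 4
rhs = math.cos(lam * (B + A) / 2) / math.cos(lam * (B + abs(A)) / 2)

rng = random.Random(3)
trials = 20000
est = 0.0                  # Monte Carlo estimate of E (cos lam)^(-tau)
for _ in range(trials):
    s, n = 0, 0
    while A < s < B:       # tau from (16) with p = q = 1/2
        s += 1 if rng.random() < 0.5 else -1
        n += 1
    est += math.cos(lam) ** (-n) / trials
print(est, rhs)
```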


4. As an application of Wald’s identity (13), we will give the proof of the “elementary theorem” of renewal theory: If N = (Nt)t≥0 is a renewal process (Nt =
Σ_{n=1}^∞ I(Tn ≤ t), Tn = σ1 + · · · + σn, where σ1, σ2, . . . is a sequence of independent
identically distributed random variables (Subsection 4, Sect. 9, Chap. 2, Vol. 1)) and
μ = E σ1 < ∞, then the renewal function m(t) = E Nt satisfies

m(t)/t → 1/μ, t → ∞. (22)

(Recall that the process N = (Nt)t≥0 itself obeys the strong law of large numbers:

Nt/t → 1/μ (P-a.s.), t → ∞;

see Example 4 in Sect. 3, Chap. 4.)
To prove (22), we will show that

lim inf_{t→∞} m(t)/t ≥ 1/μ and lim sup_{t→∞} m(t)/t ≤ 1/μ. (23)

To this end we notice that

T_{Nt} ≤ t < T_{Nt+1}, t > 0. (24)

Since for any n ≥ 1

{Nt + 1 ≤ n} = {Nt ≤ n − 1} = {Nt < n} = {Tn > t} = {Σ_{k=1}^n σk > t} ∈ Fn,

where Fn is the σ-algebra generated by σ1, . . . , σn, we have that Nt + 1 (but not Nt)
for any fixed t > 0 is a Markov time. Then Wald’s identity (13) implies that

E T_{Nt+1} = μ[m(t) + 1]. (25)

Hence we see from the right inequality in (24) that t < μ[m(t) + 1], i.e.,

m(t)/t > 1/μ − 1/t, (26)
whence, letting t → ∞, we obtain the first inequality in (23).
Next, the left inequality in (24) implies that t ≥ E TNt . Since TNt +1 = TNt +σNt +1 ,
we have

t ≥ E TNt = E(TNt +1 − σNt +1 ) = μ[m(t) + 1] − E σNt +1 . (27)



If we assume that the variables σi are bounded from above (σi ≤ c), then (27)
implies that t ≥ μ[m(t) + 1] − c, and hence

m(t)/t ≤ 1/μ + (1/t) · (c − μ)/μ. (28)

Then the second inequality in (23) would follow.
To discard the restriction σi ≤ c, i ≥ 1, we introduce, for some c > 0, the
variables

σic = σi I(σi < c) + c I(σi ≥ c)

and define the related renewal process N^c = (Ntc)t≥0 with Ntc = Σ_{n=1}^∞ I(Tnc ≤ t),
Tnc = σ1c + · · · + σnc. Since σic ≤ σi, i ≥ 1, we have Ntc ≥ Nt; hence

mc(t) = E Ntc ≥ E Nt = m(t).

Then we see from (28) that

m(t)/t ≤ mc(t)/t ≤ 1/μc + (1/t) · (c − μc)/μc,

where μc = E σ1c. Therefore

lim sup_{t→∞} m(t)/t ≤ 1/μc.
Letting now c → ∞ and using that μc → μ, we obtain the required second inequal-
ity in (23).
Thus (22) is established.
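The elementary renewal theorem (22) can be observed numerically. The sketch below uses exponential interarrival times with mean μ = 2 (our choice; in that special case m(t) = t/μ exactly) and estimates m(t)/t at t = 50.

```python
import random

mu = 2.0                   # E sigma_1
t = 50.0
rng = random.Random(4)
trials = 2000
total = 0
for _ in range(trials):
    clock, n = 0.0, 0
    while True:
        clock += rng.expovariate(1.0 / mu)   # sigma_i exponential with mean mu
        if clock > t:
            break
        n += 1             # N_t counts the renewals with T_n <= t
    total += n
ratio = total / trials / t # estimate of m(t) / t
print(ratio, 1.0 / mu)
```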
Remark. For more general results of renewal theory see, for example, [10, Chap.
9], [25, Chap. 13].
5. PROBLEMS
1. Show that
E |Xτ| ≤ lim_{N→∞} E |XN|

for any martingale or nonnegative submartingale X = (Xn, Fn)n≥0 and any finite
(P-a.s.) stopping time τ. (Compare with the inequality E |Xτ| ≤ 3 sup_N E |XN| in
Corollary 2 to Theorem 1.)
2. Let X = (Xn, Fn)n≥0 be a square-integrable martingale, E X0 = 0, τ a stopping
time, and

lim inf_{n→∞} ∫_{τ>n} Xn² d P = 0.

Show that

E Xτ² = E Σ_{j=0}^τ (ΔXj)²,

where ΔX0 = X0, ΔXj = Xj − Xj−1, j ≥ 1.



3. Let X = (Xn, Fn)n≥0 be a supermartingale such that Xn ≥ E(ξ | Fn) (P-a.s.),
n ≥ 0, where E |ξ| < ∞. Show that for stopping times σ and τ with P{σ ≤ τ} = 1
the following relation holds:

Xσ ≥ E(Xτ | Fσ ) (P -a.s.).

4. Let ξ1, ξ2, . . . be a sequence of independent identically distributed random variables with P(ξ1 = 1) = P(ξ1 = −1) = 1/2, a and b positive numbers, b > a,

Xn = a Σ_{k=1}^n I(ξk = +1) − b Σ_{k=1}^n I(ξk = −1),

and
τ = min{n ≥ 1 : Xn ≤ −r}, r > 0.

Show that E e^{λτ} < ∞ for λ ≤ α0 and E e^{λτ} = ∞ for λ > α0, where

α0 = (b/(a + b)) log(2b/(a + b)) + (a/(a + b)) log(2a/(a + b)).
5. Let ξ1, ξ2, . . . be a sequence of independent random variables with E ξi = 0,
Var ξi = σi², Sn = ξ1 + · · · + ξn, Fnξ = σ{ξ1, . . . , ξn}. Prove the following
generalizations of Wald’s identities (13) and (14): If E Σ_{j=1}^τ E |ξj| < ∞, then
E Sτ = 0; if E Σ_{j=1}^τ E ξj² < ∞, then

E Sτ² = E Σ_{j=1}^τ ξj² = E Σ_{j=1}^τ σj². (29)

6. Let X = (Xn, Fn)n≥1 be a square-integrable martingale and τ a stopping time.
Establish the inequality

E Xτ² ≤ E Σ_{n=1}^τ (ΔXn)².

Show that if

lim inf_{n→∞} E(Xn² I(τ > n)) < ∞ or lim inf_{n→∞} E(|Xn| I(τ > n)) = 0,

then E Xτ² = E Σ_{n=1}^τ (ΔXn)².
7. Let X = (Xn , Fn )n≥1 be a submartingale and τ1 ≤ τ2 ≤ . . . stopping times such
that E Xτm are defined and

lim inf_{n→∞} E(Xn+ I(τm > n)) = 0, m ≥ 1.

Prove that the sequence (Xτm , Fτm )m≥1 is a submartingale. (As usual, Fτm =
{A ∈ F : A ∩ {τm = j} ∈ Fj , j ≥ 1}.)

3. Fundamental Inequalities

1. Let X = (Xn, Fn)n≥0 be a stochastic sequence,

Xn* = max_{0≤j≤n} |Xj|, ‖Xn‖p = (E |Xn|^p)^{1/p}, p > 0.

In Theorems 1–3 below, we present Doob’s fundamental maximal inequalities for


probabilities and maximal inequalities in Lp for submartingales, supermartingales,
and martingales.
Theorem 1. I. Let X = (Xn, Fn)n≥0 be a submartingale. Then for all λ > 0

λ P{max_{k≤n} Xk ≥ λ} ≤ E[Xn+ I{max_{k≤n} Xk ≥ λ}] ≤ E Xn+, (1)

λ P{min_{k≤n} Xk ≤ −λ} ≤ E[Xn I{min_{k≤n} Xk > −λ}] − E X0 ≤ E Xn+ − E X0, (2)

λ P{max_{k≤n} |Xk| ≥ λ} ≤ 3 max_{k≤n} E |Xk|. (3)

II. Let Y = (Yn, Fn)n≥0 be a supermartingale. Then for all λ > 0

λ P{max_{k≤n} Yk ≥ λ} ≤ E Y0 − E[Yn I{max_{k≤n} Yk < λ}] ≤ E Y0 + E Yn−, (4)

λ P{min_{k≤n} Yk ≤ −λ} ≤ −E[Yn I{min_{k≤n} Yk ≤ −λ}] ≤ E Yn−, (5)

λ P{max_{k≤n} |Yk| ≥ λ} ≤ 3 max_{k≤n} E |Yk|. (6)

III. Let Y = (Yn, Fn)n≥0 be a nonnegative supermartingale. Then for all λ > 0

λ P{max_{k≤n} Yk ≥ λ} ≤ E Y0, (7)

λ P{sup_{k≥n} Yk ≥ λ} ≤ E Yn. (8)

Theorem 2. Let X = (Xn, Fn)n≥0 be a nonnegative submartingale. Then for p ≥ 1
we have the following inequalities:
if p > 1,

‖Xn‖p ≤ ‖Xn*‖p ≤ (p/(p − 1)) ‖Xn‖p; (9)

if p = 1,

‖Xn‖1 ≤ ‖Xn*‖1 ≤ (e/(e − 1)) {1 + ‖Xn log+ Xn‖1}. (10)
Theorem 3. Let X = (Xn, Fn)n≥0 be a martingale, λ > 0 and p ≥ 1. Then

P{max_{k≤n} |Xk| ≥ λ} ≤ E |Xn|^p / λ^p, (11)

and if p > 1,

‖Xn‖p ≤ ‖Xn*‖p ≤ (p/(p − 1)) ‖Xn‖p. (12)

In particular, if p = 2,

P{max_{k≤n} |Xk| ≥ λ} ≤ E |Xn|² / λ², (13)

E max_{k≤n} Xk² ≤ 4 E Xn². (14)

PROOF OF THEOREM 1. Since a submartingale with the opposite sign is a supermartingale, (1)–(3) follow from (4)–(6). Therefore we consider the case of a supermartingale Y = (Yn, Fn)n≥0.
Let us set τ = min{k ≤ n : Yk ≥ λ}, with τ = n if max_{k≤n} Yk < λ. Then, by (6),
Sect. 2,

E Y0 ≥ E Yτ = E[Yτ; max_{k≤n} Yk ≥ λ] + E[Yτ; max_{k≤n} Yk < λ]
     ≥ λ P{max_{k≤n} Yk ≥ λ} + E[Yn; max_{k≤n} Yk < λ],

which proves (4).


Now let us set σ = min{k ≤ n : Yk ≤ −λ} and take σ = n if min_{k≤n} Yk > −λ.
Again, by (6), Sect. 2,

E Yn ≤ E Yσ = E[Yσ; min_{k≤n} Yk ≤ −λ] + E[Yσ; min_{k≤n} Yk > −λ]
     ≤ −λ P{min_{k≤n} Yk ≤ −λ} + E[Yn; min_{k≤n} Yk > −λ].

Hence

λ P{min_{k≤n} Yk ≤ −λ} ≤ −E[Yn; min_{k≤n} Yk ≤ −λ] ≤ E Yn−,

which proves (5).


To prove (6), we notice that Y− = (−Y)+ is a submartingale. Then, by (4) and
(1),

λ P{max_{k≤n} |Yk| ≥ λ} ≤ λ P{max_{k≤n} Yk+ ≥ λ} + λ P{max_{k≤n} Yk− ≥ λ}
  = λ P{max_{k≤n} Yk ≥ λ} + λ P{max_{k≤n} Yk− ≥ λ}
  ≤ E Y0 + 2 E Yn− ≤ 3 max_{k≤n} E |Yk|.

Inequality (7) follows from (4).
To prove (8), we set γ = min{k ≥ n : Yk ≥ λ}, taking γ = ∞ if Yk < λ for all
k ≥ n. Now let n < N < ∞. Then, by (6), Sect. 2,

E Yn ≥ E Y_{γ∧N} ≥ E[Y_{γ∧N} I(γ ≤ N)] ≥ λ P{γ ≤ N},

from which, as N → ∞,

E Yn ≥ λ P{γ < ∞} = λ P{sup_{k≥n} Yk ≥ λ}.


PROOF OF THEOREM 2. The first inequalities in (9) and (10) are evident.
To prove the second inequality in (9), we first suppose that

‖Xn*‖p < ∞ (15)

and use the fact that, for every nonnegative random variable ξ and for r > 0,

E ξ^r = r ∫_0^∞ t^{r−1} P(ξ ≥ t) dt (16)

(see (69) in Sect. 6, Chap. 2, Vol. 1). Then we obtain, by (1) and Fubini’s theorem,
that for p > 1

E(Xn*)^p = p ∫_0^∞ t^{p−1} P{Xn* ≥ t} dt ≤ p ∫_0^∞ t^{p−2} [∫_{Xn*≥t} Xn d P] dt
  = p ∫_0^∞ t^{p−2} [∫_Ω Xn I{Xn* ≥ t} d P] dt
  = p ∫_Ω Xn [∫_0^{Xn*} t^{p−2} dt] d P = (p/(p − 1)) E[Xn (Xn*)^{p−1}]. (17)

Hence, by Hölder’s inequality,

E(Xn*)^p ≤ q ‖Xn‖p · ‖(Xn*)^{p−1}‖q = q ‖Xn‖p [E(Xn*)^p]^{1/q}, (18)

where q = p/(p − 1).


If (15) is satisfied, we immediately obtain the second inequality in (9) from (18).
However, if (15) is not satisfied, we proceed as follows. In (17), instead of Xn∗ ,
we consider (Xn∗ ∧ L), where L is a constant. Then we obtain

E(Xn∗ ∧ L)p ≤ q E[Xn (Xn∗ ∧ L)p−1 ] ≤ qXn p [E(Xn∗ ∧ L)p ]1/q ,

from which it follows, by the inequality E(Xn∗ ∧ L)p ≤ Lp < ∞, that

E(Xn∗ ∧ L)p ≤ qp E Xnp = qp Xn pp ,

and therefore
E(Xn∗ )p = lim E(Xn∗ ∧ L)p ≤ qp Xn pp .
L→∞

We now prove the second inequality in (10). Again applying (1), we obtain

E Xn* − 1 ≤ E(Xn* − 1)+ = ∫_0^∞ P{Xn* − 1 ≥ t} dt
  ≤ ∫_0^∞ (1/(1 + t)) [∫_{Xn*≥1+t} Xn d P] dt = E[Xn ∫_0^{Xn*−1} dt/(1 + t)] = E Xn log Xn*.

Since, for arbitrary a ≥ 0 and b > 0,

a log b ≤ a log+ a + b e^{−1}, (19)

we have

E Xn* − 1 ≤ E Xn log Xn* ≤ E Xn log+ Xn + e^{−1} E Xn*.

If E Xn* < ∞, we immediately obtain the second inequality in (10).
However, if E Xn* = ∞, we proceed, as above, by replacing Xn* with Xn* ∧ L.
This proves the theorem.

PROOF OF THEOREM 3. The proof follows from the remark that |X|^p = (|Xn|^p, Fn), p ≥ 1, is a nonnegative submartingale (if E |Xn|^p < ∞, n ≥ 0), and from inequalities (1) and (9).

Corollary of Theorem 3. Let Xn = ξ0 + · · · + ξn, n ≥ 0, where (ξk)k≥0 is a sequence
of independent random variables with E ξk = 0 and E ξk² < ∞. Then inequality (13)
becomes Kolmogorov’s inequality (Sect. 2, Chap. 4).
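As a quick empirical illustration of (13) (and hence of Kolmogorov’s inequality), the following sketch compares the frequency of {max_{k≤n} |Xk| ≥ λ} with the bound E |Xn|²/λ² for a symmetric ±1 walk; the walk length, threshold, and trial count are our own choices.

```python
import random

n, lam = 100, 25.0
rng = random.Random(5)
trials = 10000
exceed = 0
sum_sq = 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):     # X_k = xi_1 + ... + xi_k, xi = +-1 equiprobable
        s += 1 if rng.random() < 0.5 else -1
        m = max(m, abs(s))
    exceed += (m >= lam)
    sum_sq += s * s
freq = exceed / trials
bound = (sum_sq / trials) / lam ** 2   # empirical E X_n^2 / lam^2 (true value 100/625)
print(freq, bound)
```

The observed frequency is far below the bound, as (13) guarantees in expectation.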
2. Let X = (Xn, Fn) be a nonnegative submartingale and

Xn = Mn + An

its Doob decomposition. Then, since E Mn = 0, it follows from (1) that

P{Xn* ≥ ε} ≤ E An / ε.
Theorem 4, below, shows that this inequality is valid, not only for submartingales,
but also for the wider class of sequences that have the property of domination in the
following sense.
Definition. Let X = (Xn , Fn ) be a nonnegative stochastic sequence and A =
(An , Fn−1 ) an increasing predictable sequence. We shall say that X is dominated
by sequence A if
E Xτ ≤ E Aτ (20)
for every stopping time τ.
Theorem 4. If X = (Xn, Fn) is a nonnegative stochastic sequence dominated by an
increasing predictable sequence A = (An, Fn−1), then for λ > 0, a > 0, and any
stopping time τ,

P{Xτ* ≥ λ} ≤ (1/λ) E Aτ, (21)

P{Xτ* ≥ λ} ≤ (1/λ) E(Aτ ∧ a) + P(Aτ ≥ a), (22)

‖Xτ*‖p ≤ ((2 − p)/(1 − p))^{1/p} ‖Aτ‖p, 0 < p < 1. (23)

PROOF. We set
σn = min{j ≤ τ ∧ n : Xj ≥ λ},

taking σn = τ ∧ n if {·} = ∅. Then

E Aτ ≥ E A_{σn} ≥ E X_{σn} ≥ ∫_{X*_{τ∧n}>λ} X_{σn} d P ≥ λ P{X*_{τ∧n} > λ},

from which

P{X*_{τ∧n} > λ} ≤ (1/λ) E Aτ,

and we obtain (21) by Fatou’s lemma.
For the proof of (22), we introduce the time

γ = min{j : Aj+1 ≥ a},

setting γ = ∞ if {·} = ∅. Then

P{Xτ* ≥ λ} = P{Xτ* ≥ λ, Aτ < a} + P{Xτ* ≥ λ, Aτ ≥ a}
  ≤ P{I{Aτ<a} Xτ* ≥ λ} + P{Aτ ≥ a}
  ≤ P{X*_{τ∧γ} ≥ λ} + P{Aτ ≥ a} ≤ (1/λ) E A_{τ∧γ} + P{Aτ ≥ a}
  ≤ (1/λ) E(Aτ ∧ a) + P{Aτ ≥ a},

where we used (21) and the inequality I{Aτ<a} Xτ* ≤ X*_{τ∧γ}.

Finally, by (22),

‖Xτ*‖p^p = E(Xτ*)^p = ∫_0^∞ P{(Xτ*)^p ≥ t} dt = ∫_0^∞ P{Xτ* ≥ t^{1/p}} dt
  ≤ ∫_0^∞ t^{−1/p} E[Aτ ∧ t^{1/p}] dt + ∫_0^∞ P{Aτ^p ≥ t} dt
  = E ∫_0^{Aτ^p} dt + E ∫_{Aτ^p}^∞ Aτ t^{−1/p} dt + E Aτ^p = ((2 − p)/(1 − p)) E Aτ^p.

This completes the proof.




Remark. Let us suppose that the hypotheses of Theorem 4 are satisfied, except that
the sequence A = (An, Fn)n≥0 is not necessarily predictable but has the property
that for some positive constant c

P{sup_{k≥1} |ΔAk| ≤ c} = 1,

where ΔAk = Ak − Ak−1. Then the following inequality is satisfied (cf. (22)):

P{Xτ* ≥ λ} ≤ (1/λ) E[Aτ ∧ (a + c)] + P{Aτ ≥ a}. (24)

The proof is analogous to that of (22). We have only to replace the time γ =
min{j : Aj+1 ≥ a} with γ = min{j : Aj ≥ a} and notice that Aγ ≤ a + c.
Corollary. Let the sequences X^k = (X^k_n, F^k_n) and A^k = (A^k_n, F^k_n), n ≥ 0, k ≥ 1,
satisfy the hypotheses of Theorem 4 or the remark. Also, let (τk)k≥1 be a sequence
of stopping times (with respect to F^k = (F^k_n)) such that A^k_{τk} → 0 in probability.
Then (X^k)*_{τk} → 0 in probability.
3. In this subsection we present (without proofs, but with applications) a num-
ber of significant inequalities for martingales. These generalize the inequalities of
Khinchin and of Marcinkiewicz and Zygmund for sums of independent random
variables stated below.
Khinchin’s Inequalities. Let ξ1, ξ2, . . . be independent identically distributed
Bernoulli random variables with P(ξi = 1) = P(ξi = −1) = 1/2, and let (cn)n≥1 be
a sequence of numbers.
Then for every p, 0 < p < ∞, there are universal constants Ap and Bp (indepen-
dent of (cn )) such that
A_p (Σ_{j=1}^n c_j²)^{1/2} ≤ ‖Σ_{j=1}^n c_j ξ_j‖_p ≤ B_p (Σ_{j=1}^n c_j²)^{1/2} (25)

for every n ≥ 1.
Marcinkiewicz and Zygmund’s Inequalities. If ξ1 , ξ2 , . . . is a sequence of inde-
pendent integrable random variables with E ξi = 0, then for p ≥ 1 there are univer-
sal constants Ap and Bp (independent of (ξn )) such that
A_p ‖(Σ_{j=1}^n ξ_j²)^{1/2}‖_p ≤ ‖Σ_{j=1}^n ξ_j‖_p ≤ B_p ‖(Σ_{j=1}^n ξ_j²)^{1/2}‖_p (26)

for every n ≥ 1.
The sequences X = (Xn) with Xn = Σ_{j=1}^n c_j ξ_j and Xn = Σ_{j=1}^n ξ_j in (25) and
(26) are martingales involving independent ξ_j. It is natural to ask whether these
inequalities can be extended to arbitrary martingales.
The first result in this direction was obtained by Burkholder.

Burkholder’s Inequalities. If X = (Xn, Fn) is a martingale, then for every p > 1
there are universal constants A_p and B_p (independent of X) such that

A_p ‖√[X]_n‖_p ≤ ‖X_n‖_p ≤ B_p ‖√[X]_n‖_p (27)

for every n ≥ 1, where [X]_n is the quadratic variation of X_n:

[X]_n = Σ_{j=1}^n (ΔX_j)², X0 = 0. (28)

The constants A_p and B_p can be taken to have the values

A_p = [18 p^{3/2}/(p − 1)]^{−1}, B_p = 18 p^{3/2}/(p − 1)^{1/2}.

It follows from (27), using (12), that

A_p ‖√[X]_n‖_p ≤ ‖X_n*‖_p ≤ B_p* ‖√[X]_n‖_p, (29)

where
A_p = [18 p^{3/2}/(p − 1)]^{−1}, B_p* = 18 p^{5/2}/(p − 1)^{3/2}.
Burkholder’s inequalities (27) hold for p > 1, whereas the Marcinkiewicz–
Zygmund inequalities (26) also hold when p = 1. What can we say about the
validity of (27) for p = 1? It turns out that a direct generalization to p = 1 is
impossible, as the following example shows.

EXAMPLE. Let ξ1, ξ2, . . . be independent Bernoulli random variables with P(ξi =
1) = P(ξi = −1) = 1/2, and let

Xn = Σ_{j=1}^{n∧τ} ξ_j,

where
τ = min{n ≥ 1 : Σ_{j=1}^n ξ_j = 1}.

The sequence X = (Xn, Fnξ) is a martingale, with

‖Xn‖1 = E |Xn| = 2 E Xn+ → 2, n → ∞.

But

‖√[X]_n‖1 = E √[X]_n = E (Σ_{j=1}^{τ∧n} 1)^{1/2} = E √(τ ∧ n) → ∞.

Consequently, the first inequality in (27) fails.


It turns out that when p = 1, we must generalize (29) rather than (27) (which is
equivalent when p > 1).

Davis’ Inequality. If X = (Xn, Fn) is a martingale, there are universal constants
A and B, 0 < A < B < ∞, such that

A ‖√[X]_n‖1 ≤ ‖Xn*‖1 ≤ B ‖√[X]_n‖1, (30)

i.e.,

A E (Σ_{j=1}^n (ΔX_j)²)^{1/2} ≤ E max_{1≤j≤n} |X_j| ≤ B E (Σ_{j=1}^n (ΔX_j)²)^{1/2}.

Corollary 1. Let ξ1, ξ2, . . . be independent identically distributed random variables, Sn = ξ1 + · · · + ξn. If E |ξ1| < ∞ and E ξ1 = 0, then according to Wald’s
identity (13) (Sect. 2), we have

E Sτ = 0 (31)

for every stopping time τ (with respect to (Fnξ)) for which E τ < ∞.
If we assume additionally that E |ξ1|^r < ∞, where 1 < r ≤ 2, then the condition
E τ^{1/r} < ∞ is sufficient for (31).
For the proof, we set τn = τ ∧ n, Y = sup_n |S_{τn}|, and let m = [t^r] (the integral
part of t^r) for t > 0. By Corollary 1 to Theorem 1 (Sect. 2), we have E S_{τn} = 0.
Therefore a sufficient condition for E Sτ = 0 is (by the dominated convergence
theorem) that E sup_n |S_{τn}| < ∞.
Using (1) and (27), we obtain

P(Y ≥ t) = P(τ ≥ t^r, Y ≥ t) + P(τ < t^r, Y ≥ t)
  ≤ P(τ ≥ t^r) + P{max_{1≤j≤m} |S_{τj}| ≥ t}
  ≤ P(τ ≥ t^r) + t^{−r} E |S_{τm}|^r
  ≤ P(τ ≥ t^r) + t^{−r} B_r^r E (Σ_{j=1}^{τm} ξ_j²)^{r/2}
  ≤ P(τ ≥ t^r) + t^{−r} B_r^r E Σ_{j=1}^{τm} |ξ_j|^r.

Notice that (with F0ξ = {∅, Ω})

E Σ_{j=1}^{τm} |ξ_j|^r = Σ_{j=1}^∞ E[I(j ≤ τm) |ξ_j|^r]
  = Σ_{j=1}^∞ E E[I(j ≤ τm) |ξ_j|^r | F^ξ_{j−1}]
  = Σ_{j=1}^∞ E[I(j ≤ τm) E(|ξ_j|^r | F^ξ_{j−1})] = E Σ_{j=1}^{τm} E |ξ_j|^r = μ_r E τm,

where μ_r = E |ξ1|^r. Consequently,

P(Y ≥ t) ≤ P(τ ≥ t^r) + t^{−r} B_r^r μ_r E τm
  = P(τ ≥ t^r) + B_r^r μ_r t^{−r} [m P(τ ≥ t^r) + ∫_{τ<t^r} τ d P]
  ≤ (1 + B_r^r μ_r) P(τ ≥ t^r) + B_r^r μ_r t^{−r} ∫_{τ<t^r} τ d P

and therefore

E Y = ∫_0^∞ P(Y ≥ t) dt ≤ (1 + B_r^r μ_r) E τ^{1/r} + B_r^r μ_r ∫_0^∞ t^{−r} [∫_{τ<t^r} τ d P] dt
  = (1 + B_r^r μ_r) E τ^{1/r} + B_r^r μ_r ∫_Ω τ [∫_{τ^{1/r}}^∞ t^{−r} dt] d P
  = (1 + B_r^r μ_r + B_r^r μ_r/(r − 1)) E τ^{1/r} < ∞.

Corollary 2. Let M = (Mn) be a martingale with E |Mn|^{2r} < ∞ for some r ≥ 1
and such that (with M0 = 0)

Σ_{n=1}^∞ E |ΔMn|^{2r} / n^{1+r} < ∞. (32)

Then (cf. Theorem 2 in Sect. 3, Chap. 4) we have the strong law of large numbers:

Mn/n → 0 (P-a.s.), n → ∞. (33)
When r = 1, the proof follows the same lines as the proof of Theorem 2 in
Sect. 3, Chap. 4. In fact, let

m_n = Σ_{k=1}^n ΔM_k / k.

Then

Mn/n = (Σ_{k=1}^n ΔM_k)/n = (1/n) Σ_{k=1}^n k Δm_k

and, by Kronecker’s lemma (Sect. 3, Chap. 4), a sufficient condition for the limit
relation (P-a.s.)

(1/n) Σ_{k=1}^n k Δm_k → 0, n → ∞,

is that the limit lim_n m_n exists and is finite (P-a.s.), which in turn (Theorems 1 and
4 in Sect. 10, Chap. 2, Vol. 1) is true if and only if

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} → 0, n → ∞. (34)

By (1),

P{sup_{k≥1} |m_{n+k} − m_n| ≥ ε} ≤ ε^{−2} Σ_{k=n}^∞ E(ΔM_k)²/k².

Hence the required result follows from (32) and (34).


Now let r > 1. Then statement (33) is equivalent (Theorem 1 of Sect. 10, Chap. 2,
Vol. 1) to the statement that

ε^{2r} P{sup_{j≥n} |M_j|/j ≥ ε} → 0, n → ∞, (35)

for every ε > 0. By inequality (52) of Problem 1,

ε^{2r} P{sup_{j≥n} |M_j|/j ≥ ε} = ε^{2r} lim_{m→∞} P{max_{n≤j≤m} |M_j|^{2r}/j^{2r} ≥ ε^{2r}}
  ≤ (1/n^{2r}) E |M_n|^{2r} + Σ_{j≥n+1} (1/j^{2r}) E(|M_j|^{2r} − |M_{j−1}|^{2r}).

It follows from Kronecker’s lemma that

lim_{n→∞} (1/n^{2r}) E |M_n|^{2r} = 0.

Hence, to prove (35), we need only prove that

Σ_{j≥2} (1/j^{2r}) E(|M_j|^{2r} − |M_{j−1}|^{2r}) < ∞. (36)

We have

I_N = Σ_{j=2}^N (1/j^{2r}) [E |M_j|^{2r} − E |M_{j−1}|^{2r}]
  ≤ Σ_{j=2}^N [1/(j − 1)^{2r} − 1/j^{2r}] E |M_{j−1}|^{2r} + E |M_N|^{2r}/N^{2r}.

By Burkholder’s inequality (27) and Hölder’s inequality,

E |M_j|^{2r} ≤ B_{2r}^{2r} E (Σ_{i=1}^j (ΔM_i)²)^r ≤ B_{2r}^{2r} j^{r−1} Σ_{i=1}^j E |ΔM_i|^{2r}.
2r |ΔMi |2r .
i=1 i=1

Hence

I_N ≤ B_{2r}^{2r} Σ_{j=2}^{N−1} [1/j^{2r} − 1/(j + 1)^{2r}] j^{r−1} Σ_{i=1}^j E |ΔM_i|^{2r} + E |M_N|^{2r}/N^{2r}
  ≤ C1 Σ_{j=2}^{N−1} (1/j^{r+2}) Σ_{i=1}^j E |ΔM_i|^{2r} + E |M_N|^{2r}/N^{2r}
  ≤ C2 + C3 Σ_{j=2}^N E |ΔM_j|^{2r}/j^{r+1}

(the C_k are constants). By (32), this establishes (36).

4. The sequence of random variables {Xn }n≥1 has a limit lim Xn (finite or infinite)
with probability 1 if and only if the number of “oscillations between two arbitrary
rational numbers a and b, a < b” is finite with probability 1. In what follows, Theo-
rem 5 provides an upper bound for the number of “oscillations” for submartingales.
In the next section, this will be applied to prove the fundamental result on their
convergence.
Let us choose two numbers a and b, a < b, and define the following times in
terms of the stochastic sequence X = (Xn , Fn ):

τ0 = 0,
τ1 = min{n > 0 : Xn ≤ a},
τ2 = min{n > τ1 : Xn ≥ b},
··· ··· ························
τ2m−1 = min{n > τ2m−2 : Xn ≤ a},
τ2m = min{n > τ2m−1 : Xn ≥ b},
··· ··· ························

taking τk = ∞ if the corresponding set {·} is empty.


In addition, for each n ≥ 1 we define the random variables

βn(a, b) = 0 if τ2 > n; βn(a, b) = max{m : τ2m ≤ n} if τ2 ≤ n.

In words, βn (a, b) is the number of upcrossings of [a, b] by the sequence


X1 , . . . , Xn .

Theorem 5 (Doob). Let X = (Xn, Fn)n≥1 be a submartingale. Then, for every n ≥ 1,

E βn(a, b) ≤ E[Xn − a]+ / (b − a). (37)
PROOF. The number of upcrossings of [a, b] by X = (Xn, Fn) is equal to the
number of upcrossings of [0, b − a] by the nonnegative submartingale X+ = ((Xn − a)+, Fn). Hence it is sufficient to suppose that X is nonnegative with a = 0
and show that

E βn(0, b) ≤ E Xn / b. (38)

Set X0 = 0, F0 = {∅, Ω}, and for i = 1, 2, . . . , let

ϕi = 1 if τm < i ≤ τm+1 for some odd m; ϕi = 0 if τm < i ≤ τm+1 for some even m.

It is easily seen that

b βn(0, b) ≤ Σ_{i=1}^n ϕi [Xi − Xi−1]

and

{ϕi = 1} = ⋃_{odd m} [{τm < i} \ {τm+1 < i}] ∈ Fi−1.

Therefore

b E βn(0, b) ≤ E Σ_{i=1}^n ϕi [Xi − Xi−1] = Σ_{i=1}^n ∫_{ϕi=1} (Xi − Xi−1) d P
  = Σ_{i=1}^n ∫_{ϕi=1} E(Xi − Xi−1 | Fi−1) d P
  = Σ_{i=1}^n ∫_{ϕi=1} [E(Xi | Fi−1) − Xi−1] d P
  ≤ Σ_{i=1}^n ∫_Ω [E(Xi | Fi−1) − Xi−1] d P = E Xn,

which establishes (38).
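The upcrossing count βn(a, b) defined through the times τ1, τ2, . . . above can be computed by a simple scan, which also makes bound (37) easy to observe empirically; the random-walk parameters below are our own choices.

```python
import random

def upcrossings(xs, a, b):
    # beta_n(a, b): each time the sequence, after being at a level <= a,
    # subsequently reaches a level >= b (the times tau_1, tau_2, ... of the text)
    count, below = 0, False
    for x in xs:
        if not below and x <= a:
            below = True
        elif below and x >= b:
            count += 1
            below = False
    return count

rng = random.Random(6)
trials, n, a, b = 4000, 200, 0.0, 5.0
lhs = rhs = 0.0
for _ in range(trials):
    s, path = 0, []
    for _ in range(n):                        # a symmetric random walk (a
        s += 1 if rng.random() < 0.5 else -1  # martingale, hence a submartingale)
        path.append(s)
    lhs += upcrossings(path, a, b) / trials           # estimate of E beta_n(a, b)
    rhs += max(path[-1] - a, 0.0) / (b - a) / trials  # estimate of E[X_n - a]^+ / (b - a)
print(lhs, rhs)
```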

5. In this subsection we discuss some of the simplest inequalities for the probabilities
of large deviations for square-integrable martingales.
Let M = (Mn, Fn)n≥0 be a square-integrable martingale with quadratic characteristic ⟨M⟩ = (⟨M⟩n, Fn−1), setting ⟨M⟩0 = 0. If we apply inequality (22) to
Xn = Mn², An = ⟨M⟩n, we find that for a > 0 and b > 0

P{max_{k≤n} |Mk| ≥ an} = P{max_{k≤n} Mk² ≥ (an)²}
  ≤ (1/(an)²) E[⟨M⟩n ∧ (bn)] + P{⟨M⟩n ≥ bn}. (39)

In fact, at least in the case where |ΔMn | ≤ C for all n and ω ∈ Ω, this inequality
can be substantially improved using the ideas explained in Sect. 5 of Chap. 4 for
estimating the probabilities of large deviations for sums of independent identically
distributed random variables.
Let us recall that in Sect. 5, Chap. 4, when we introduced the corresponding in-
equalities, the essential point was to use the property that the sequence

(eλSn /[ϕ(λ)]n , Fn )n≥1 , Fn = σ{ξ1 , . . . , ξn }, (40)



formed a nonnegative martingale, to which we could apply inequality (8). If we now


take Mn instead of Sn , by analogy with (40), then

(e^{λMn}/En(λ), Fn)n≥1

will be a nonnegative martingale, where

En(λ) = Π_{j=1}^n E(e^{λΔMj} | Fj−1) (41)

is called the stochastic exponential (see also Subsection 13, Sect. 6, Chap. 2, Vol. 1).
This expression is rather complicated. At the same time, in using (8) it is not
necessary for the sequence to be a martingale. It is enough for it to be a nonnega-
tive supermartingale. Here we can arrange this by forming a sequence (Zn (λ), Fn )
((43), below), which sufficiently simply depends on Mn and Mn , and to which we
can apply the method used in Sect. 5, Chap. 4.
Lemma 1. Let M = (Mn, Fn)n≥0 be a square-integrable martingale, M0 = 0,
ΔM0 = 0, and |ΔMn(ω)| ≤ c for all n and ω. Let λ > 0,

ψc(λ) = (e^{λc} − 1 − λc)/c² for c > 0, ψ0(λ) = λ²/2, (42)

and
Zn(λ) = e^{λMn − ψc(λ)⟨M⟩n}. (43)

Then for every c ≥ 0 the sequence Z(λ) = (Zn(λ), Fn)n≥0 is a nonnegative
supermartingale.
PROOF. For |x| ≤ c,

e^{λx} − 1 − λx = (λx)² Σ_{m≥2} (λx)^{m−2}/m! ≤ (λx)² Σ_{m≥2} (λc)^{m−2}/m! ≤ x² ψc(λ).

Using this inequality and the following representation (Zn = Zn(λ)),

ΔZn = Zn−1 [(e^{λΔMn} − 1) e^{−Δ⟨M⟩n ψc(λ)} + (e^{−Δ⟨M⟩n ψc(λ)} − 1)],

we find that

E(ΔZn | Fn−1)
  = Zn−1 [E(e^{λΔMn} − 1 | Fn−1) e^{−Δ⟨M⟩n ψc(λ)} + (e^{−Δ⟨M⟩n ψc(λ)} − 1)]
  = Zn−1 [E(e^{λΔMn} − 1 − λΔMn | Fn−1) e^{−Δ⟨M⟩n ψc(λ)} + (e^{−Δ⟨M⟩n ψc(λ)} − 1)]
  ≤ Zn−1 [ψc(λ) E((ΔMn)² | Fn−1) e^{−Δ⟨M⟩n ψc(λ)} + (e^{−Δ⟨M⟩n ψc(λ)} − 1)]
  = Zn−1 [ψc(λ) Δ⟨M⟩n e^{−Δ⟨M⟩n ψc(λ)} + (e^{−Δ⟨M⟩n ψc(λ)} − 1)] ≤ 0, (44)

where we have also used the fact that x e^{−x} + (e^{−x} − 1) ≤ 0 for x ≥ 0.

We see from (44) that


E(Zn | Fn−1 ) ≤ Zn−1 ,
i.e., Z(λ) = (Zn (λ), Fn ) is a supermartingale.
This establishes the lemma.


Let the hypotheses of the lemma be satisfied. Then we can always find λ > 0 for
which, for given a > 0 and b > 0, we have aλ − bψc(λ) > 0. From this we obtain

P{max_{k≤n} Mk ≥ an} = P{max_{k≤n} e^{λMk} ≥ e^{λan}}
  ≤ P{max_{k≤n} e^{λMk − ψc(λ)⟨M⟩k} ≥ e^{λan − ψc(λ)⟨M⟩n}}
  = P{max_{k≤n} e^{λMk − ψc(λ)⟨M⟩k} ≥ e^{λan − ψc(λ)⟨M⟩n}, ⟨M⟩n ≤ bn}
    + P{max_{k≤n} e^{λMk − ψc(λ)⟨M⟩k} ≥ e^{λan − ψc(λ)⟨M⟩n}, ⟨M⟩n > bn}
  ≤ P{max_{k≤n} e^{λMk − ψc(λ)⟨M⟩k} ≥ e^{λan − ψc(λ)bn}} + P{⟨M⟩n > bn}
  ≤ e^{−n(λa − bψc(λ))} + P{⟨M⟩n > bn}, (45)

where the last inequality follows from (7).


Let us write (compare with H(a) in Sect. 5, Chap. 4)

Hc(a, b) = sup_{λ>0} [aλ − bψc(λ)].

Then it follows from (45) that

P{max_{k≤n} Mk ≥ an} ≤ P{⟨M⟩n > bn} + e^{−nHc(a,b)}. (46)

Passing from M to −M, we find that the right-hand side of (46) also provides an
upper bound for the probability P{min_{k≤n} Mk ≤ −an}. Consequently,

P{max_{k≤n} |Mk| ≥ an} ≤ 2 P{⟨M⟩n > bn} + 2 e^{−nHc(a,b)}. (47)

Thus, we have proved the following theorem.


Theorem 6. Let M = (Mn , Fn ) be a martingale with uniformly bounded steps, i.e.,
|ΔMn | ≤ c for some constant c > 0 and all n and ω. Then for every a > 0 and
b > 0, we have inequalities (46) and (47).
Remark 2.

Hc(a, b) = (1/c)(a + b/c) log(1 + ac/b) − a/c. (48)
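Closed form (48) can be confirmed against a direct numerical maximization of aλ − bψc(λ); the parameters a = 1, b = 2, c = 0.5 below are our own illustrative choices.

```python
import math

def psi(lam, c):
    # the function psi_c(lambda) of (42)
    if c > 0:
        return (math.exp(lam * c) - 1 - lam * c) / c ** 2
    return lam ** 2 / 2

def h_closed(a, b, c):
    # closed form (48) for H_c(a, b) = sup_{lam > 0} [a*lam - b*psi_c(lam)]
    return (1 / c) * (a + b / c) * math.log(1 + a * c / b) - a / c

a, b, c = 1.0, 2.0, 0.5
# numeric sup over lam on a fine grid covering (0, 4); the maximizer is
# lam* = (1/c) log(1 + ac/b), which lies well inside this interval
h_grid = max(a * lam - b * psi(lam, c)
             for lam in (i * 1e-4 for i in range(1, 40000)))
print(h_grid, h_closed(a, b, c))
```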

6. Under the hypotheses of Theorem 6, we now consider the question of estimating
probabilities of the type

P{sup_{k≥n} Mk/⟨M⟩k > a},

which characterize, in particular, the rate of convergence in the strong law of large
numbers for martingales (see also Theorem 4 in Sect. 5).
Proceeding as in Sect. 5, Chap. 4, we find that for every a > 0 there is a λ > 0
for which aλ − ψc(λ) > 0. Then, for every b > 0,

P{sup_{k≥n} Mk/⟨M⟩k > a} ≤ P{sup_{k≥n} e^{λMk − ψc(λ)⟨M⟩k} > e^{[aλ−ψc(λ)]⟨M⟩n}}
  ≤ P{sup_{k≥n} e^{λMk − ψc(λ)⟨M⟩k} > e^{[aλ−ψc(λ)]bn}} + P{⟨M⟩n < bn}
  ≤ e^{−bn[aλ−ψc(λ)]} + P{⟨M⟩n < bn}, (49)

from which

P{sup_{k≥n} Mk/⟨M⟩k > a} ≤ P{⟨M⟩n < bn} + e^{−nHc(ab,b)}, (50)

P{sup_{k≥n} |Mk|/⟨M⟩k > a} ≤ 2 P{⟨M⟩n < bn} + 2 e^{−nHc(ab,b)}. (51)

We have therefore proved the following theorem.


Theorem 7. Let the hypotheses of the preceding theorem be satisfied. Then inequal-
ities (50) and (51) are satisfied for all a > 0 and b > 0.
Remark 3. A comparison of (51) with estimate (21) in Sect. 5, Chap. 4, for the case of a Bernoulli scheme, p = 1/2, M_n = S_n − (n/2), b = 1/4, c = 1/2, shows that for small ε > 0 it leads to a similar result:

    P{ sup_{k≥n} |M_k/⟨M⟩_k| > ε } = P{ sup_{k≥n} |S_k − (k/2)|/(k/4) > ε } ≤ 2e^{−(ε²/8)n}.
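Inequality (47) is easy to probe by simulation in the Bernoulli setting of Remark 3. In the sketch below (parameter values are illustrative, not from the text), ⟨M⟩_n = n/4 exactly, so for b > 1/4 the term P{⟨M⟩_n > bn} vanishes and (47) reduces to a pure exponential bound via (48).

```python
import math
import random

# Monte Carlo probe of (47) for the Bernoulli scheme with p = 1/2:
# M_k = S_k - k/2 has steps of size c = 1/2 and <M>_n = n/4 exactly,
# so with b = 0.3 > 1/4 the bound (47) reads
#   P{max_{k<=n} |M_k| >= a*n} <= 2*exp(-n*H_c(a, b)).
# The values of a, b, n below are illustrative assumptions.

def H_c(a, b, c):
    # closed form (48)
    return (1.0 / c) * (a + b / c) * math.log(1.0 + a * c / b) - a / c

def empirical_prob(a, n, trials, rng):
    hits = 0
    for _ in range(trials):
        m = 0.0
        for _ in range(n):
            m += rng.choice((0.5, -0.5))  # one martingale step
            if abs(m) >= a * n:
                hits += 1
                break
    return hits / trials

rng = random.Random(0)
a, b, c, n = 0.15, 0.3, 0.5, 200
bound = 2.0 * math.exp(-n * H_c(a, b, c))
freq = empirical_prob(a, n, trials=2000, rng=rng)
print(freq, "<=", bound)
```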

7. PROBLEMS

1. Let X = (X_n, F_n) be a nonnegative submartingale, and let V = (V_n, F_{n−1}) be a predictable sequence such that 0 ≤ V_{n+1} ≤ V_n ≤ C (P-a.s.), where C is a constant. Establish the following generalization of (1):

    ε P{ max_{1≤j≤n} V_j X_j ≥ ε } + ∫_{{max_{1≤j≤n} V_j X_j < ε}} V_n X_n d P ≤ E Σ_{j=1}^n V_j ΔX_j.    (52)

2. Establish Krickeberg’s decomposition: Every martingale X = (X_n, F_n) with sup_n E |X_n| < ∞ can be represented as the difference of two nonnegative martingales.

3. Let ξ_1, ξ_2, . . . be a sequence of independent random variables, S_n = ξ_1 + · · · + ξ_n, and S_{m,n} = Σ_{j=m+1}^n ξ_j. Establish Ottaviani’s inequality:

    P{ max_{1≤j≤n} |S_j| > 2ε } ≤ P{|S_n| > ε} / min_{1≤j≤n} P{|S_{j,n}| ≤ ε},

and deduce (assuming E ξ_i = 0, i ≥ 1) that

    ∫_0^∞ P{ max_{1≤j≤n} |S_j| > 2t } dt ≤ 2 E |S_n| + 2 ∫_{2 E |S_n|}^∞ P{|S_n| > t} dt.    (53)

4. Let ξ_1, ξ_2, . . . be a sequence of independent random variables with E ξ_i = 0. Use (53) to show that in this case inequality (10) can be strengthened to

    E S_n^* ≤ 8 E |S_n|.

5. Verify formula (16).


6. Establish inequality (19).
7. Let the σ-algebras F_0, . . . , F_n be such that F_0 ⊆ F_1 ⊆ · · · ⊆ F_n, and let the events A_k ∈ F_k, k = 1, . . . , n. Use (22) to establish Dvoretzky’s inequality: For each ε > 0,

    P( ⋃_{k=1}^n A_k ) ≤ ε + P{ Σ_{k=1}^n P(A_k | F_{k−1}) > ε }.

8. Let X = (X_n)_{n≥1} be a square-integrable martingale and (b_n)_{n≥1} a nondecreasing sequence of positive real numbers. Prove the following Hájek–Rényi inequality:

    P{ max_{1≤k≤n} |X_k|/b_k ≥ λ } ≤ (1/λ²) Σ_{k=1}^n E(ΔX_k)²/b_k²,    ΔX_k = X_k − X_{k−1},  X_0 = 0.

9. Let X = (X_n)_{n≥1} be a submartingale and g(x) a nonnegative increasing convex function. Then, for any t > 0 and real x,

    P{ max_{1≤k≤n} X_k ≥ x } ≤ E g(tX_n) / g(tx).

In particular,

    P{ max_{1≤k≤n} X_k ≥ x } ≤ e^{−tx} E e^{tX_n}.

10. Let ξ_1, ξ_2, . . . be independent random variables with E ξ_n = 0, E ξ_n² = 1, n ≥ 1. Let

    τ = min{ n ≥ 1 : Σ_{i=1}^n ξ_i > 0 }.

Prove that E τ^{1/2} < ∞.



11. Let ξ = (ξ_n)_{n≥1} be a martingale difference and 1 < p ≤ 2. Show that

    E sup_{n≥1} | Σ_{j=1}^n ξ_j |^p ≤ C_p Σ_{j=1}^∞ E |ξ_j|^p

for a constant C_p.
12. Let X = (X_n)_{n≥1} be a martingale with E X_n = 0 and E X_n² < ∞. As a generalization of Problem 5 of Sect. 2, Chap. 4, show that for any n ≥ 1 and ε > 0

    P{ max_{1≤k≤n} X_k > ε } ≤ E X_n² / (ε² + E X_n²).

4. General Theorems on Convergence of Submartingales and Martingales

1. The following result, which is fundamental for all problems about the conver-
gence of submartingales, can be thought of as an analog of the fact that in real
analysis a bounded monotonic sequence of numbers has a (finite) limit.

Theorem 1 (Doob). Let X = (X_n, F_n) be a submartingale with

    sup_n E |X_n| < ∞.    (1)

Then with probability 1 the limit lim X_n = X_∞ exists and E |X_∞| < ∞.

PROOF. Suppose that

P(lim sup Xn > lim inf Xn ) > 0. (2)

Then, since

    {lim sup X_n > lim inf X_n} = ⋃_{a<b} {lim sup X_n > b > a > lim inf X_n}

(here a and b are rational numbers), there are values a and b such that

    P{lim sup X_n > b > a > lim inf X_n} > 0.    (3)

Let β_n(a, b) be the number of upcrossings of (a, b) by the sequence X_1, . . . , X_n, and let β_∞(a, b) = lim_n β_n(a, b). By (37), Sect. 3,

    E β_n(a, b) ≤ E[X_n − a]^+ / (b − a) ≤ (E X_n^+ + |a|) / (b − a),

and therefore

    E β_∞(a, b) = lim_n E β_n(a, b) ≤ (sup_n E X_n^+ + |a|) / (b − a) < ∞,

which follows from (1) and the remark that

    sup_n E |X_n| < ∞ ⇔ sup_n E X_n^+ < ∞

for submartingales (since E X_n^+ ≤ E |X_n| = 2 E X_n^+ − E X_n ≤ 2 E X_n^+ − E X_1). But the condition E β_∞(a, b) < ∞ contradicts assumption (3). Hence lim X_n = X_∞ exists with probability 1, and then, by Fatou’s lemma,

    E |X_∞| ≤ sup_n E |X_n| < ∞.

This completes the proof of the theorem.



Corollary 1. If X is a nonpositive submartingale, then with probability 1 the limit lim X_n exists and is finite.

Corollary 2. If X = (X_n, F_n)_{n≥1} is a nonpositive submartingale, then the sequence X = (X_n, F_n) with 1 ≤ n ≤ ∞, where X_∞ = lim X_n and F_∞ = σ(⋃ F_n), is a (nonpositive) submartingale.

In fact, by Fatou’s lemma,

    E X_∞ = E lim X_n ≥ lim sup E X_n ≥ E X_1 > −∞

and (P-a.s.)

    E(X_∞ | F_m) = E(lim X_n | F_m) ≥ lim sup E(X_n | F_m) ≥ X_m.

Corollary 3. If X = (X_n, F_n) is a nonnegative supermartingale (or, in particular, a nonnegative martingale), then lim X_n exists with probability 1.

In fact, in that case,

    sup_n E |X_n| = sup_n E X_n = E X_1 < ∞,

and Theorem 1 is applicable.

2. Let ξ_1, ξ_2, . . . be a sequence of independent random variables with P(ξ_i = 0) = P(ξ_i = 2) = 1/2. Then X = (X_n, F_n^ξ) with X_n = Π_{i=1}^n ξ_i and F_n^ξ = σ{ξ_1, . . . , ξ_n} is a martingale with E X_n = 1 and X_n → X_∞ ≡ 0 (P-a.s.). At the same time, it is clear that E |X_n − X_∞| = 1, and therefore X_n does not converge to X_∞ in L¹. Thus condition (1) does not in general guarantee the convergence of X_n to X_∞ in the L¹ sense.

Theorem 2 below shows that if hypothesis (1) is strengthened to uniform integrability of the family {X_n} (from which (1) follows by (16) of Subsection 5, Sect. 6, Chap. 2, Vol. 1), then, besides almost sure convergence, we also have convergence in L¹.
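The example above is easy to simulate; the sketch below (not from the text) makes the L¹ breakdown concrete: every simulated path of X_n = ξ_1 · · · ξ_n is eventually absorbed at 0, even though E X_n = 1 for every n, because the value 2^n is carried by an event of probability 2^{−n}.

```python
import random

# The product martingale X_n = xi_1 * ... * xi_n with
# P(xi = 0) = P(xi = 2) = 1/2: E X_n = 1 for every n, yet X_n -> 0 a.s.
# (X_n is nonzero only if all n factors equal 2, which has
# probability 2^{-n}).  Simulation sketch, not from the text.

def simulate_paths(n, paths, rng):
    finals = []
    for _ in range(paths):
        x = 1.0
        for _ in range(n):
            x *= rng.choice((0.0, 2.0))
        finals.append(x)
    return finals

rng = random.Random(1)
finals = simulate_paths(n=60, paths=1000, rng=rng)
print(sum(finals) / len(finals))  # the sample mean is (almost surely) 0,
                                  # although E X_60 = 1: no L1 convergence
```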

Theorem 2. Let X = (X_n, F_n) be a uniformly integrable submartingale (that is, the family {X_n} is uniformly integrable). Then there is a random variable X_∞ with E |X_∞| < ∞ such that as n → ∞,

    X_n → X_∞ (P-a.s.),    (4)
    X_n → X_∞ in L¹.    (5)

Moreover, the sequence X = (X_n, F_n), 1 ≤ n ≤ ∞, with F_∞ = σ(⋃ F_n) is also a submartingale.

PROOF. Statement (4) follows from Theorem 1, and (5) follows from (4) and Theorem 4 (Sect. 6, Chap. 2, Vol. 1).
Moreover, if A ∈ F_n and m ≥ n, then

    E I_A |X_m − X_∞| → 0, m → ∞,

and therefore

    lim_{m→∞} ∫_A X_m d P = ∫_A X_∞ d P.

The sequence (∫_A X_m d P)_{m≥n} is nondecreasing, and therefore

    ∫_A X_n d P ≤ ∫_A X_m d P ≤ ∫_A X_∞ d P,

whence X_n ≤ E(X_∞ | F_n) (P-a.s.) for n ≥ 1.
This completes the proof of the theorem.

Corollary. If X = (X_n, F_n) is a submartingale and, for some p > 1,

    sup_n E |X_n|^p < ∞,    (6)

then there is an integrable random variable X_∞ for which (4) and (5) are satisfied.

For the proof, it is enough to observe that, by Lemma 3 of Sect. 6, Chap. 2, Vol. 1, condition (6) guarantees the uniform integrability of the family {X_n}.

3. We now present a theorem on the continuity properties of conditional expectations. This was one of the very first results concerning the convergence of martingales.

Theorem 3 (P. Lévy). Let (Ω, F, P) be a probability space, and let (F_n)_{n≥1} be a nondecreasing family of σ-algebras, F_1 ⊆ F_2 ⊆ · · · ⊆ F. Let ξ be a random variable with E |ξ| < ∞ and F_∞ = σ(⋃_n F_n). Then, both P-a.s. and in the L¹ sense,

    E(ξ | F_n) → E(ξ | F_∞), n → ∞.    (7)

PROOF. Let X_n = E(ξ | F_n), n ≥ 1. Then, with a > 0 and b > 0,

    ∫_{{|X_n|≥a}} |X_n| d P ≤ ∫_{{|X_n|≥a}} E(|ξ| | F_n) d P = ∫_{{|X_n|≥a}} |ξ| d P
        = ∫_{{|X_n|≥a, |ξ|≤b}} |ξ| d P + ∫_{{|X_n|≥a, |ξ|>b}} |ξ| d P
        ≤ b P{|X_n| ≥ a} + ∫_{{|ξ|>b}} |ξ| d P
        ≤ (b/a) E |ξ| + ∫_{{|ξ|>b}} |ξ| d P.

Letting a → ∞ and then b → ∞, we obtain

    lim_{a→∞} sup_n ∫_{{|X_n|≥a}} |X_n| d P = 0,

i.e., the family {X_n} is uniformly integrable. Therefore, by Theorem 2, there is a random variable X_∞ such that X_n = E(ξ | F_n) → X_∞ (P-a.s. and in the L¹ sense). Hence we only have to show that

    X_∞ = E(ξ | F_∞) (P-a.s.).

Let m ≥ n and A ∈ F_n. Then

    ∫_A X_m d P = ∫_A X_n d P = ∫_A E(ξ | F_n) d P = ∫_A ξ d P.

Since the family {X_n} is uniformly integrable and since, by Theorem 5, Sect. 6, Chap. 2, Vol. 1, we have E I_A |X_m − X_∞| → 0 as m → ∞, it follows that

    ∫_A X_∞ d P = ∫_A ξ d P.    (8)

This equation is satisfied for all A ∈ F_n and, therefore, for all A ∈ ⋃_{n=1}^∞ F_n. Since E |X_∞| < ∞ and E |ξ| < ∞, the left-hand and right-hand sides of (8) are σ-additive measures, possibly taking negative as well as positive values, but finite and agreeing on the algebra ⋃_{n=1}^∞ F_n. Because of the uniqueness of the extension of a σ-additive measure from an algebra to the smallest σ-algebra containing it (Carathéodory’s theorem, Sect. 3, Chap. 2, Vol. 1), Eq. (8) remains valid for sets A ∈ F_∞ = σ(⋃ F_n). Thus

    ∫_A X_∞ d P = ∫_A ξ d P = ∫_A E(ξ | F_∞) d P, A ∈ F_∞.    (9)

Since X_∞ and E(ξ | F_∞) are F_∞-measurable, it follows from Property I of Subsection 3, Sect. 6, Chap. 2, Vol. 1, and from (9) that X_∞ = E(ξ | F_∞) (P-a.s.).
This completes the proof of the theorem.



Corollary. A stochastic sequence X = (X_n, F_n) is a uniformly integrable martingale if and only if there is a random variable ξ with E |ξ| < ∞ such that X_n = E(ξ | F_n) for all n ≥ 1. Here X_n → E(ξ | F_∞) (both P-a.s. and in the L¹ sense) as n → ∞.

In fact, if X = (X_n, F_n) is a uniformly integrable martingale, then, by Theorem 2, there is an integrable random variable X_∞ such that X_n → X_∞ (P-a.s. and in the L¹ sense) and X_n = E(X_∞ | F_n). As the random variable ξ we may take the F_∞-measurable variable X_∞.
The converse follows from Theorem 3.

4. We now turn to some applications of these theorems.

EXAMPLE 1. The zero–one law. Let ξ_1, ξ_2, . . . be a sequence of independent random variables, F_n^ξ = σ{ξ_1, . . . , ξ_n}, let X be the σ-algebra of the “tail” events, and A ∈ X. By Theorem 3, we have E(I_A | F_n^ξ) → E(I_A | F_∞^ξ) = I_A (P-a.s.). But I_A and (ξ_1, . . . , ξ_n) are independent, so that E(I_A | F_n^ξ) = E I_A. Therefore I_A = E I_A (P-a.s.), and we find that either P(A) = 0 or P(A) = 1.

The next two examples illustrate possible applications of the preceding results to
convergence theorems in analysis.

EXAMPLE 2. If f = f(x) satisfies a Lipschitz condition on [0, 1), it is absolutely continuous and, as is shown in courses in analysis, there is a (Lebesgue) integrable function g = g(x) such that

    f(x) − f(0) = ∫_0^x g(y) dy.    (10)

(In this sense, g(x) is a “derivative” of f(x).) Let us show how this result can be deduced from Theorem 1.
Let Ω = [0, 1), F = B([0, 1)), and let P denote Lebesgue measure. Put

    ξ_n(x) = Σ_{k=1}^{2^n} ((k−1)/2^n) I{ (k−1)/2^n ≤ x < k/2^n },

F_n = σ{ξ_1, . . . , ξ_n} = σ{ξ_n}, and

    X_n = [f(ξ_n + 2^{−n}) − f(ξ_n)] / 2^{−n}.

Since for a given ξ_n the random variable ξ_{n+1} takes only the values ξ_n and ξ_n + 2^{−(n+1)} with conditional probabilities equal to 1/2, we have

    E[X_{n+1} | F_n] = E[X_{n+1} | ξ_n] = 2^{n+1} E[f(ξ_{n+1} + 2^{−(n+1)}) − f(ξ_{n+1}) | ξ_n]
        = 2^{n+1} { (1/2)[f(ξ_n + 2^{−(n+1)}) − f(ξ_n)] + (1/2)[f(ξ_n + 2^{−n}) − f(ξ_n + 2^{−(n+1)})] }
        = 2^n {f(ξ_n + 2^{−n}) − f(ξ_n)} = X_n.

It follows that X = (X_n, F_n) is a martingale, and it is uniformly integrable since |X_n| ≤ L, where L is the Lipschitz constant: |f(x) − f(y)| ≤ L|x − y|. Observe that F = B([0, 1)) = σ(⋃ F_n). Therefore, by the corollary to Theorem 3, there is an F-measurable function g = g(x) such that X_n → g (P-a.s.) and

    X_n = E[g | F_n].    (11)

Consider the set B = [0, k/2^n). Then, by (11),

    f(k/2^n) − f(0) = ∫_0^{k/2^n} X_n dx = ∫_0^{k/2^n} g(x) dx,

and since n and k are arbitrary, we obtain the required equation (10).
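The dyadic martingale of Example 2 can be sketched numerically; the test function f(x) = x², for which X_n(x) = 2ξ_n(x) + 2^{−n} → 2x = g(x), is an illustrative choice not taken from the text.

```python
# Dyadic difference quotients from Example 2 (sketch).  Here
# xi_n(x) = floor(2^n x)/2^n and
#   X_n(x) = (f(xi_n(x) + 2^{-n}) - f(xi_n(x))) / 2^{-n}.

def xi_n(x, n):
    # left endpoint of the dyadic interval of rank n containing x
    return int(x * 2 ** n) / 2 ** n

def X_n(f, x, n):
    h = 2.0 ** (-n)
    return (f(xi_n(x, n) + h) - f(xi_n(x, n))) / h

f = lambda x: x * x
print(X_n(f, 0.3, 5), X_n(f, 0.3, 20))  # approaching g(0.3) = 0.6
```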

EXAMPLE 3. Let Ω = [0, 1), F = B([0, 1)), and let P denote Lebesgue measure. Consider the Haar system {H_n(x)}_{n≥1}, as defined in Example 3 of Sect. 11, Chap. 2, Vol. 1. Put F_n = σ{H_1, . . . , H_n}, and observe that σ(⋃ F_n) = F. From the properties of conditional expectations and the structure of the Haar functions, it is easy to deduce that

    E[f(x) | F_n] = Σ_{k=1}^n a_k H_k(x) (P-a.s.)    (12)

for every Borel function f ∈ L¹, where

    a_k = (f, H_k) = ∫_0^1 f(x) H_k(x) dx.

In other words, the conditional expectation E[f(x) | F_n] is a partial sum of the Fourier series of f(x) in the Haar system. Then, if we apply Theorem 3 to the martingale (E(f | F_n), F_n), we find that, as n → ∞,

    Σ_{k=1}^n (f, H_k) H_k(x) → f(x) (P-a.s.)

and

    ∫_0^1 | Σ_{k=1}^n (f, H_k) H_k(x) − f(x) | dx → 0.
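The convergence of the Haar–Fourier partial sums can be illustrated as follows. This is a sketch assuming the standard L²-normalized Haar system with the usual indexing H_1 = 1 and H_{2^j+k} supported on [(k−1)/2^j, k/2^j); these conventions are assumptions, not spelled out in this excerpt. A partial sum of order 2^J reproduces the dyadic average E[f | F_{2^J}](x).

```python
# Haar partial sums of Example 3 (sketch, standard indexing assumed).

def haar(n, x):
    if n == 1:
        return 1.0
    j = 0
    while 2 ** (j + 1) < n:  # find j with 2^j < n <= 2^{j+1}
        j += 1
    k = n - 2 ** j           # 1 <= k <= 2^j
    left, mid, right = (k - 1) / 2 ** j, (k - 0.5) / 2 ** j, k / 2 ** j
    if left <= x < mid:
        return 2.0 ** (j / 2)
    if mid <= x < right:
        return -(2.0 ** (j / 2))
    return 0.0

def inner(f, n, grid=4096):
    # (f, H_n) via a midpoint sum on a dyadic grid much finer than
    # the steps of H_n (exact for piecewise-linear f)
    return sum(f((i + 0.5) / grid) * haar(n, (i + 0.5) / grid)
               for i in range(grid)) / grid

def partial_sum(f, N, x):
    return sum(inner(f, n) * haar(n, x) for n in range(1, N + 1))

print(partial_sum(lambda t: t, 64, 0.3))  # the average of t over [19/64, 20/64)
```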

EXAMPLE 4. Let (ξ_n)_{n≥1} be a sequence of random variables. By Theorem 2 of Sect. 10, Chap. 2, Vol. 1, the P-a.s. convergence of the series Σ ξ_n implies its convergence in probability and in distribution. It turns out that if the random variables ξ_1, ξ_2, . . . are independent, the converse is also valid: the convergence in distribution of the series Σ ξ_n of independent random variables implies its convergence in probability and with probability 1.
Let S_n = ξ_1 + · · · + ξ_n, n ≥ 1, and S_n →^d S. Then E e^{itS_n} → E e^{itS} for every real number t. It is clear that there is a δ > 0 such that |E e^{itS}| > 0 for all |t| < δ. Choose

t_0 so that |t_0| < δ. Then there is an n_0 = n_0(t_0) such that |E e^{it_0 S_n}| ≥ c > 0 for all n ≥ n_0, where c is a constant.
For n ≥ n_0, we form the sequence X = (X_n, F_n) with

    X_n = e^{it_0 S_n} / E e^{it_0 S_n},    F_n = σ{ξ_1, . . . , ξ_n}.

Since ξ_1, ξ_2, . . . were assumed to be independent, the sequence X = (X_n, F_n) is a martingale with

    sup_{n≥n_0} E |X_n| ≤ c^{−1} < ∞.

Then it follows from Theorem 1 that with probability 1 the limit lim_n X_n exists and is finite. Therefore the limit lim_{n→∞} e^{it_0 S_n} also exists with probability 1. Consequently, we can assert that there is a δ > 0 such that for each t in the set T = {t : |t| < δ} the limit lim_n e^{itS_n} exists with probability 1.
Let T × Ω = {(t, ω) : t ∈ T, ω ∈ Ω}, let B(T) be the σ-algebra of Lebesgue sets on T, and let λ be Lebesgue measure on (T, B(T)). Also, let

    C = { (t, ω) ∈ T × Ω : lim_n e^{itS_n(ω)} exists }.

It is clear that C ∈ B(T) ⊗ F.
It was shown earlier that P(C_t) = 1 for every t ∈ T, where C_t = {ω ∈ Ω : (t, ω) ∈ C} is the section of C at point t. By Fubini’s theorem (Theorem 8 of Sect. 6, Chap. 2, Vol. 1),

    ∫_{T×Ω} I_C(t, ω) d(λ × P) = ∫_T ( ∫_Ω I_C(t, ω) d P ) dλ = ∫_T P(C_t) dλ = λ(T) = 2δ > 0.

On the other hand, again by Fubini’s theorem,

    λ(T) = ∫_{T×Ω} I_C(t, ω) d(λ × P) = ∫_Ω ( ∫_T I_C(t, ω) dλ ) d P = ∫_Ω λ(C_ω) d P,

where C_ω = {t : (t, ω) ∈ C}.
Hence it follows that there is a set Ω̃ with P(Ω̃) = 1 such that λ(C_ω) = λ(T) = 2δ > 0 for all ω ∈ Ω̃.
Consequently, we may say that for every ω ∈ Ω̃ the limit lim_n e^{itS_n} exists for all t ∈ C_ω. In addition, the measure of C_ω is positive. From this and Problem 8 it follows that the limit lim_n S_n(ω) exists and is finite for ω ∈ Ω̃. Since P(Ω̃) = 1, the limit lim_n S_n(ω) exists and is finite with probability 1.
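A small simulation (not from the text) illustrating the phenomenon of Example 4: for independent ξ_n = ±1/n with fair signs, Σ Var ξ_n = Σ 1/n² < ∞, so S_n converges a.s., and with it e^{itS_n} for every t.

```python
import random

# Sketch for Example 4 (illustrative assumption: xi_n = +/- 1/n with
# fair, independent signs).  The partial sums S_n settle down, and
# with them e^{i t S_n}.

def partial_sums(N, rng):
    s, out = 0.0, []
    for n in range(1, N + 1):
        s += rng.choice((1.0, -1.0)) / n
        out.append(s)
    return out

rng = random.Random(2)
S = partial_sums(5000, rng)
tail_osc = max(S[1000:]) - min(S[1000:])  # oscillation of S_n beyond n = 1000
print(tail_osc)  # small: the tail sum has standard deviation of order 0.03
```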
5. PROBLEMS

1. Let {G_n} be a nonincreasing family of σ-algebras, G_1 ⊇ G_2 ⊇ · · ·, let G_∞ = ⋂ G_n, and let η be an integrable random variable. Establish the following analog of Theorem 3: As n → ∞,

    E(η | G_n) → E(η | G_∞) (P-a.s. and in the L¹ sense).

2. Let ξ_1, ξ_2, . . . be a sequence of independent identically distributed random variables with E |ξ_1| < ∞ and E ξ_1 = m; let S_n = ξ_1 + · · · + ξ_n. Having shown (Problem 2, Sect. 7, Chap. 2, Vol. 1) that

    E(ξ_1 | S_n, S_{n+1}, . . .) = E(ξ_1 | S_n) = S_n/n (P-a.s.),

deduce from Problem 1 a stronger form of the law of large numbers: As n → ∞,

    S_n/n → m (P-a.s. and in the L¹ sense).
3. Establish the following result, which combines Lebesgue’s dominated convergence theorem and P. Lévy’s theorem. Let {ξ_n}_{n≥1} be a sequence of random variables such that ξ_n → ξ (P-a.s.), |ξ_n| ≤ η, E η < ∞, and let {F_m}_{m≥1} be a nondecreasing family of σ-algebras with F_∞ = σ(⋃ F_n). Then

    lim_{n,m→∞} E(ξ_n | F_m) = E(ξ | F_∞) (P-a.s.).

4. Establish formula (12).
5. Let Ω = [0, 1), F = B([0, 1)), let P denote Lebesgue measure, and let f = f(x) ∈ L¹. Set

    f_n(x) = 2^n ∫_{k2^{−n}}^{(k+1)2^{−n}} f(y) dy,    k2^{−n} ≤ x < (k + 1)2^{−n}.

Show that f_n(x) → f(x) (P-a.s.).
6. Let Ω = [0, 1), F = B([0, 1)), let P denote Lebesgue measure, and let f = f(x) ∈ L¹. Continue this function periodically on [0, 2), and set

    f_n(x) = 2^{−n} Σ_{j=1}^{2^n} f(x + j2^{−n}).

Show that f_n(x) → f(x) (P-a.s.).


7. Prove that Theorem 1 remains valid for generalized submartingales X = (X_n, F_n), if inf_m sup_{n≥m} E(X_n^+ | F_m) < ∞ (P-a.s.).
8. Let (a_n)_{n≥1} be a sequence of real numbers such that for all real numbers t with |t| < δ, δ > 0, the limit lim_n e^{ita_n} exists. Prove that then the limit lim a_n exists and is finite.
9. Let F = F(x), x ∈ R, be a distribution function, and let α ∈ (0, 1). Suppose that there exists θ ∈ R such that F(θ) = α. Let us construct the sequence X_1, X_2, . . . so that

    X_{n+1} = X_n − n^{−1}(Y_n − α),

where Y_1, Y_2, . . . are random variables such that

    P(Y_n = y | X_1, . . . , X_n; Y_1, . . . , Y_{n−1}) = F(X_n) if y = 1, and 1 − F(X_n) if y = 0

(the Robbins–Monro procedure). Prove the following result of stochastic approximation theory: E |X_n − θ|² → 0, n → ∞.
10. Let X = (X_n, F_n)_{n≥1} be a submartingale such that E(X_τ I(τ < ∞)) < ∞ for any stopping time τ. Show that with probability 1 there exists the limit lim_n X_n.
11. Let X = (X_n, F_n)_{n≥1} be a martingale and F_∞ = σ(⋃_{n=1}^∞ F_n). Prove that if the sequence (X_n)_{n≥1} is uniformly integrable, then the limit X_∞ = lim_n X_n exists (P-a.s.) and the “closed” sequence X = (X_n, F_n)_{1≤n≤∞} is a martingale.
12. Assume that X = (X_n, F_n)_{n≥1} is a submartingale, and let F_∞ = σ(⋃_{n=1}^∞ F_n). Prove that if (X_n^+)_{n≥1} is uniformly integrable, then the limit X_∞ = lim_n X_n exists (P-a.s.) and the “closed” sequence X = (X_n, F_n)_{1≤n≤∞} is a submartingale.
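The Robbins–Monro recursion of Problem 9 is easy to simulate. In the sketch below the choices F(x) = 1/(1 + e^{−4x}), α = 1/2 (so that θ = 0), the starting point, and the run lengths are illustrative assumptions, not taken from the text.

```python
import math
import random

# Robbins-Monro sketch for Problem 9:
#   X_{n+1} = X_n - (Y_n - alpha)/n,  P(Y_n = 1 | past) = F(X_n).
# The recursion drives X_n toward the root theta of F(x) = alpha.

def robbins_monro(F, alpha, steps, rng, x0=1.0):
    x = x0
    for n in range(1, steps + 1):
        y = 1.0 if rng.random() < F(x) else 0.0
        x -= (y - alpha) / n
    return x

F = lambda x: 1.0 / (1.0 + math.exp(-4.0 * x))  # F(0) = 1/2, so theta = 0
rng = random.Random(3)
trials = 200
mse = sum(robbins_monro(F, 0.5, 2000, rng) ** 2 for _ in range(trials)) / trials
print(mse)  # small, in line with E|X_n - theta|^2 -> 0
```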

5. Sets of Convergence of Submartingales and Martingales

1. Let X = (X_n, F_n) be a stochastic sequence. Let us denote by {X_n →} or {−∞ < lim X_n < ∞} the set of sample points for which lim X_n exists and is finite. Let us also write A ⊆ B (P-a.s.) if P(I_A ≤ I_B) = 1. We will also write {X_n ↛} for Ω \ {X_n →}, and A = B (P-a.s.) if P(A △ B) = 0.
If X is a submartingale and sup E |X_n| < ∞ (or, equivalently, if sup E X_n^+ < ∞), then according to Theorem 1 of Sect. 4, we have

    {X_n →} = Ω (P-a.s.), i.e., P{X_n ↛} = 0.

Let us consider the structure of the sets {X_n →} of convergence for submartingales when the hypothesis sup E |X_n| < ∞ is not satisfied.
Let a > 0, and τ_a = min{n ≥ 1 : X_n > a}, with τ_a = ∞ if {·} = ∅.

Definition. A stochastic sequence X = (X_n, F_n) belongs to class C+ (X ∈ C+) if

    E (ΔX_{τ_a})^+ I{τ_a < ∞} < ∞    (1)

for every a > 0, where ΔX_n = X_n − X_{n−1}, X_0 = 0.

It is evident that X ∈ C+ if

    E sup_n |ΔX_n| < ∞    (2)

or, all the more so, if

    |ΔX_n| ≤ C < ∞ (P-a.s.)    (3)

for all n ≥ 1.

Theorem 1. If the submartingale X ∈ C+, then

    {sup X_n < ∞} = {X_n →} (P-a.s.).    (4)

PROOF. The inclusion {X_n →} ⊆ {sup X_n < ∞} is evident. To establish the opposite inclusion, we consider the stopped submartingale X^{τ_a} = (X_{τ_a ∧ n}, F_n). Then, by (1),

    sup_n E X^+_{τ_a ∧ n} ≤ a + E[X^+_{τ_a} · I{τ_a < ∞}]
        ≤ 2a + E[(ΔX_{τ_a})^+ · I{τ_a < ∞}] < ∞,    (5)

and therefore, by Theorem 1 from Sect. 4,

    {τ_a = ∞} ⊆ {X_n →} (P-a.s.).

But ⋃_{a>0} {τ_a = ∞} = {sup X_n < ∞}; hence {sup X_n < ∞} ⊆ {X_n →} (P-a.s.).
This completes the proof of the theorem.

Corollary. Let X be a martingale with E sup |ΔX_n| < ∞. Then (P-a.s.)

    {X_n →} ∪ {lim inf X_n = −∞, lim sup X_n = +∞} = Ω.    (6)

In fact, if we apply Theorem 1 to X and to −X, we find that (P-a.s.)

    {lim sup X_n < ∞} = {sup X_n < ∞} = {X_n →},
    {lim inf X_n > −∞} = {inf X_n > −∞} = {X_n →}.

Therefore (P-a.s.)

    {lim sup X_n < ∞} ∪ {lim inf X_n > −∞} = {X_n →},

which establishes (6).

Statement (6) means that, provided that E sup |ΔX_n| < ∞, either almost all trajectories of the martingale X have finite limits or all behave very badly, in the sense that lim sup X_n = +∞ and lim inf X_n = −∞.

2. If ξ_1, ξ_2, . . . is a sequence of independent random variables with E ξ_i = 0 and |ξ_i| ≤ c < ∞, then, by Theorem 1 from Sect. 2, Chap. 4, the series Σ ξ_i converges (P-a.s.) if and only if Σ E ξ_i² < ∞. The sequence X = (X_n, F_n) with X_n = ξ_1 + · · · + ξ_n and F_n = σ{ξ_1, . . . , ξ_n} is a square-integrable martingale with ⟨X⟩_n = Σ_{i=1}^n E ξ_i², and the proposition just stated can be interpreted as follows:

    {⟨X⟩_∞ < ∞} = {X_n →} = Ω (P-a.s.),

where ⟨X⟩_∞ = lim_n ⟨X⟩_n.



The following propositions extend this result to more general martingales and submartingales.

Theorem 2. Let X = (X_n, F_n) be a submartingale and

    X_n = m_n + A_n

its Doob decomposition.
(a) If X is a nonnegative submartingale, then (P-a.s.)

    {A_∞ < ∞} ⊆ {X_n →} ⊆ {sup X_n < ∞}.    (7)

(b) If X ∈ C+, then (P-a.s.)

    {X_n →} = {sup X_n < ∞} ⊆ {A_∞ < ∞}.    (8)

(c) If X is a nonnegative submartingale and X ∈ C+, then (P-a.s.)

    {X_n →} = {sup X_n < ∞} = {A_∞ < ∞}.    (9)

PROOF. (a) The second inclusion in (7) is obvious. To establish the first inclusion, we introduce the times

    σ_a = min{n ≥ 1 : A_{n+1} > a}, a > 0,

taking σ_a = +∞ if {·} = ∅. Then A_{σ_a} ≤ a, and, by Corollary 1 to Theorem 1, Sect. 2, we have

    E X_{n∧σ_a} = E A_{n∧σ_a} ≤ a.

Let Y_n^a = X_{n∧σ_a}. Then Y^a = (Y_n^a, F_n) is a submartingale with sup E Y_n^a ≤ a < ∞. Since the submartingale Y^a is nonnegative, it follows from Theorem 1 in Sect. 4 that

    {A_∞ ≤ a} = {σ_a = ∞} ⊆ {X_n →} (P-a.s.).

Therefore (P-a.s.),

    {A_∞ < ∞} = ⋃_{a>0} {A_∞ ≤ a} ⊆ {X_n →}.

(b) The first equation follows from Theorem 1. To prove the second, we notice that, in accordance with (5),

    E A_{τ_a ∧ n} = E X_{τ_a ∧ n} ≤ E X^+_{τ_a ∧ n} ≤ 2a + E[(ΔX_{τ_a})^+ I{τ_a < ∞}],

and therefore

    E A_{τ_a} = E lim_n A_{τ_a ∧ n} < ∞.

Hence {τ_a = ∞} ⊆ {A_∞ < ∞}, and we obtain the required conclusion since ⋃_{a>0} {τ_a = ∞} = {sup X_n < ∞}.

(c) This is an immediate consequence of (a) and (b).
This completes the proof of the theorem.

Remark. The hypothesis that X is nonnegative can be replaced by the hypothesis sup_n E X_n^− < ∞.

Corollary 1. Let X_n = ξ_1 + · · · + ξ_n, where ξ_i ≥ 0, E ξ_i < ∞, the ξ_i are F_i-measurable, and F_0 = {∅, Ω}. Then (P-a.s.)

    { Σ_{n=1}^∞ E(ξ_n | F_{n−1}) < ∞ } ⊆ {X_n →},    (10)

and if, in addition, E sup_n ξ_n < ∞, then (P-a.s.)

    { Σ_{n=1}^∞ E(ξ_n | F_{n−1}) < ∞ } = {X_n →}.    (11)

Corollary 2 (Borel–Cantelli–Lévy Lemma). If the events B_n ∈ F_n, then, setting ξ_n = I_{B_n} in (11), we find that (P-a.s.)

    { Σ_{n=1}^∞ P(B_n | F_{n−1}) < ∞ } = { Σ_{n=1}^∞ I_{B_n} < ∞ }.    (12)
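The conditional Borel–Cantelli lemma (12) can be illustrated with genuinely adapted events. The sketch below assumes i.i.d. Uniform(0, 1) variables U_n (an illustrative choice, not from the text): for the record events B_n = {U_n > max(U_1, . . . , U_{n−1})} one has P(B_n | F_{n−1}) = 1 − M_{n−1}, with M_{n−1} the running maximum, and both series in (12) diverge together, at logarithmic rate.

```python
import random

# Records illustration of (12): both sum_n P(B_n | F_{n-1}) and
# sum_n I(B_n) grow like log N along the same sample path.

def records_vs_conditional(N, rng):
    m, records, cond_sum = 0.0, 0, 0.0
    for _ in range(N):
        cond_sum += 1.0 - m  # P(B_n | F_{n-1})
        u = rng.random()
        if u > m:
            records += 1
            m = u
    return records, cond_sum

rng = random.Random(5)
records, cond_sum = records_vs_conditional(100000, rng)
print(records, cond_sum)  # both of order log(100000), roughly 12
```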

3. Theorem 3. Let M = (M_n, F_n)_{n≥1} be a square-integrable martingale. Then (P-a.s.)

    {⟨M⟩_∞ < ∞} ⊆ {M_n →}.    (13)

If also E sup |ΔM_n|² < ∞, then (P-a.s.)

    {⟨M⟩_∞ < ∞} = {M_n →},    (14)

where

    ⟨M⟩_∞ = Σ_{n=1}^∞ E((ΔM_n)² | F_{n−1})    (15)

with M_0 = 0, F_0 = {∅, Ω}.

PROOF. Consider the two submartingales M² = (M_n², F_n) and (M + 1)² = ((M_n + 1)², F_n). Let their Doob decompositions be

    M_n² = m_n + A_n,    (M_n + 1)² = m_n′ + A_n′.

Then A_n and A_n′ are the same, since

    A_n′ = Σ_{k=1}^n E(Δ(M_k + 1)² | F_{k−1}) = Σ_{k=1}^n E(ΔM_k² | F_{k−1}) = A_n

because the linear term in E(Δ(M_k + 1)² | F_{k−1}) vanishes. Hence (7) implies that (P-a.s.)

    {⟨M⟩_∞ < ∞} = {A_∞ < ∞} ⊆ {M_n² →} ∩ {(M_n + 1)² →} = {M_n →}.

Because of (9), Eq. (14) will be established if we show that the condition E sup |ΔM_n|² < ∞ guarantees that M² belongs to C+.
Let τ_a = min{n ≥ 1 : M_n² > a}, a > 0. Then, on the set {τ_a < ∞},

    |ΔM²_{τ_a}| = |M²_{τ_a} − M²_{τ_a−1}| ≤ |M_{τ_a} − M_{τ_a−1}|² + 2|M_{τ_a−1}| · |M_{τ_a} − M_{τ_a−1}| ≤ (ΔM_{τ_a})² + 2a^{1/2} |ΔM_{τ_a}|,

whence

    E |ΔM²_{τ_a}| I{τ_a < ∞} ≤ E(ΔM_{τ_a})² I{τ_a < ∞} + 2a^{1/2} [E(ΔM_{τ_a})² I{τ_a < ∞}]^{1/2}
        ≤ E sup |ΔM_n|² + 2a^{1/2} [E sup |ΔM_n|²]^{1/2} < ∞.

This completes the proof of the theorem.




As an illustration of this theorem, we present the following result, which can be considered as a kind of strong law of large numbers for square-integrable martingales (cf. Theorem 2 in Sect. 3, Chap. 4).

Theorem 4. Let M = (M_n, F_n) be a square-integrable martingale, and let A = (A_n, F_{n−1}) be a predictable increasing sequence with A_1 ≥ 1, A_∞ = ∞ (P-a.s.). If (P-a.s.)

    Σ_{i=1}^∞ E[(ΔM_i)² | F_{i−1}] / A_i² < ∞,    (16)

then

    M_n/A_n → 0, n → ∞,    (17)

with probability 1.
In particular, if ⟨M⟩ = (⟨M⟩_n, F_{n−1}) is the quadratic characteristic of the square-integrable martingale M = (M_n, F_n), and ⟨M⟩_∞ = ∞ (P-a.s.), then with probability 1

    M_n/⟨M⟩_n → 0, n → ∞.    (18)

PROOF. Consider the square-integrable martingale m = (m_n, F_n) with

    m_n = Σ_{i=1}^n ΔM_i/A_i.

Then

    ⟨m⟩_n = Σ_{i=1}^n E[(ΔM_i)² | F_{i−1}] / A_i².    (19)

Since

    M_n/A_n = ( Σ_{k=1}^n A_k Δm_k ) / A_n,

we have, by Kronecker’s lemma (Sect. 3, Chap. 4), M_n/A_n → 0 (P-a.s.) if the limit lim_n m_n exists (finite) with probability 1. By (13),

    {⟨m⟩_∞ < ∞} ⊆ {m_n →}.    (20)

Therefore it follows from (19) that (16) is a sufficient condition for (17).
If now A_n = ⟨M⟩_n, then (16) is automatically satisfied (Problem 6).
This completes the proof of the theorem.


EXAMPLE. Consider a sequence ξ_1, ξ_2, . . . of independent random variables with E ξ_i = 0, Var ξ_i = V_i > 0, and let the sequence X = {X_n}_{n≥0} be defined recursively by

    X_{n+1} = θX_n + ξ_{n+1},    (21)

where X_0 is independent of ξ_1, ξ_2, . . . and θ is an unknown parameter, −∞ < θ < ∞.
We interpret X_n as the result of an observation made at time n and ask for an estimator of the unknown parameter θ. As an estimator of θ in terms of X_0, X_1, . . . , X_n, we take

    θ̂_n = [ Σ_{k=0}^{n−1} X_k X_{k+1}/V_{k+1} ] / [ Σ_{k=0}^{n−1} X_k²/V_{k+1} ],    (22)

taking this to be 0 if the denominator is 0. (The quantity θ̂_n is the least-squares estimator of θ.)
It is clear from (21) and (22) that

    θ̂_n = θ + M_n/A_n,

where

    M_n = Σ_{k=0}^{n−1} X_k ξ_{k+1}/V_{k+1},    A_n = ⟨M⟩_n = Σ_{k=0}^{n−1} X_k²/V_{k+1}.

Therefore, if the true value of the unknown parameter is θ, then

    P(θ̂_n → θ) = 1    (23)

if and only if (P-a.s.)

    M_n/A_n → 0, n → ∞.    (24)

Let us show that the conditions

    sup_n V_{n+1}/V_n < ∞,    Σ_{n=1}^∞ E( ξ_n²/V_n ∧ 1 ) = ∞    (25)

are sufficient for (24), and therefore sufficient for (23). We have

    Σ_{n=1}^∞ ( ξ_n²/V_n ∧ 1 ) ≤ Σ_{n=1}^∞ ξ_n²/V_n = Σ_{n=1}^∞ (X_n − θX_{n−1})²/V_n
        ≤ 2 [ Σ_{n=1}^∞ X_n²/V_n + θ² Σ_{n=1}^∞ X²_{n−1}/V_n ] ≤ 2 [ sup_n (V_{n+1}/V_n) + θ² ] ⟨M⟩_∞,

which follows because

    Σ_{n=1}^∞ X_n²/V_n = Σ_{n=1}^∞ (X_n²/V_{n+1}) (V_{n+1}/V_n) ≤ sup_n (V_{n+1}/V_n) Σ_{n=1}^∞ X_n²/V_{n+1} ≤ sup_n (V_{n+1}/V_n) · ⟨M⟩_∞,

where ⟨M⟩_n = Σ_{k=0}^{n−1} X_k²/V_{k+1} by definition.
Therefore

    { Σ_{n=1}^∞ (ξ_n²/V_n ∧ 1) = ∞ } ⊆ {⟨M⟩_∞ = ∞}.

By the three-series theorem (Theorem 3, Sect. 2, Chap. 4) the divergence of Σ_{n=1}^∞ E((ξ_n²/V_n) ∧ 1) guarantees the divergence (P-a.s.) of Σ_{n=1}^∞ ((ξ_n²/V_n) ∧ 1). Therefore P{⟨M⟩_∞ = ∞} = 1, and (24) follows directly from Theorem 4.
Estimators θ̂_n, n ≥ 1, with property (23) are said to be strongly consistent; compare the notion of consistency in Sect. 4, Chap. 1, Vol. 1.
In Subsection 5 of the next section we continue the discussion of this example for Gaussian variables ξ_1, ξ_2, . . . .
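The strong consistency asserted in (23) shows up clearly in simulation. The sketch below implements the least-squares estimator (22) with Gaussian noise and V_i ≡ 1; the value θ = 0.5 and the sample size are illustrative choices, not data from the text.

```python
import random

# Simulate X_{n+1} = theta*X_n + xi_{n+1} with i.i.d. N(0, 1) noise
# (V_i = 1) and form the least-squares estimator (22).

def lse_theta(theta, n, rng):
    x, num, den = 0.0, 0.0, 0.0
    for _ in range(n):
        x_next = theta * x + rng.gauss(0.0, 1.0)
        num += x * x_next  # accumulates sum X_k X_{k+1} / V_{k+1}
        den += x * x       # accumulates sum X_k^2 / V_{k+1}
        x = x_next
    return num / den if den > 0 else 0.0

rng = random.Random(4)
est = lse_theta(0.5, 5000, rng)
print(est)  # close to the true theta = 0.5
```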
Theorem 5. Let X = (X_n, F_n) be a submartingale, and let

    X_n = m_n + A_n

be its Doob decomposition. If |ΔX_n| ≤ C, then (P-a.s.)

    {⟨m⟩_∞ + A_∞ < ∞} = {X_n →},    (26)

or, equivalently,

    { Σ_{n=1}^∞ E[ΔX_n + (ΔX_n)² | F_{n−1}] < ∞ } = {X_n →}.    (27)

PROOF. Since

    A_n = Σ_{k=1}^n E(ΔX_k | F_{k−1})    (28)

and

    m_n = Σ_{k=1}^n [ΔX_k − E(ΔX_k | F_{k−1})],    (29)

it follows from the assumption that |ΔX_k| ≤ C that the martingale m = (m_n, F_n) is square-integrable with |Δm_n| ≤ 2C. Then, by (13),

    {⟨m⟩_∞ + A_∞ < ∞} ⊆ {X_n →} (P-a.s.)    (30)

and, according to (8),

    {X_n →} ⊆ {A_∞ < ∞} (P-a.s.).

Therefore, by (14) and (30),

    {X_n →} = {X_n →} ∩ {A_∞ < ∞} = {X_n →} ∩ {A_∞ < ∞} ∩ {m_n →}
        = {X_n →} ∩ {A_∞ < ∞} ∩ {⟨m⟩_∞ < ∞}
        = {X_n →} ∩ {A_∞ + ⟨m⟩_∞ < ∞} = {A_∞ + ⟨m⟩_∞ < ∞}.

Finally, the equivalence of (26) and (27) follows because, by (29),

    ⟨m⟩_n = Σ_{k=1}^n { E[(ΔX_k)² | F_{k−1}] − [E(ΔX_k | F_{k−1})]² },

and the convergence of the series Σ_{k=1}^∞ E((ΔX_k)² | F_{k−1}) of nonnegative terms implies the convergence of Σ_{k=1}^∞ [E(ΔX_k | F_{k−1})]². This completes the proof.

4. Kolmogorov’s three-series theorem (Theorem 3, Sect. 2, Chap. 4) gives a necessary and sufficient condition for the convergence, with probability 1, of a series Σ ξ_n of independent random variables. The following theorem, whose proof is based on Theorems 2 and 3, describes sets of convergence of Σ ξ_n without the assumption that the random variables ξ_1, ξ_2, . . . are independent.

Theorem 6. Let ξ = (ξ_n, F_n), n ≥ 1, be a stochastic sequence, let F_0 = {∅, Ω}, and let c be a positive constant. Then the series Σ ξ_n converges on the set A of sample points for which the three series

    Σ P(|ξ_n| ≥ c | F_{n−1}),    Σ E(ξ_n^c | F_{n−1}),    Σ Var(ξ_n^c | F_{n−1})

converge, where ξ_n^c = ξ_n I(|ξ_n| ≤ c).

PROOF. Let X_n = Σ_{k=1}^n ξ_k. Since (on the set A) the series Σ P(|ξ_n| ≥ c | F_{n−1}) converges, by Corollary 2 of Theorem 2, and by the convergence of the series Σ E(ξ_n^c | F_{n−1}), we have

    A ∩ {X_n →} = A ∩ { Σ_{k=1}^n ξ_k I(|ξ_k| ≤ c) → }
        = A ∩ { Σ_{k=1}^n [ξ_k I(|ξ_k| ≤ c) − E(ξ_k I(|ξ_k| ≤ c) | F_{k−1})] → }.    (31)

Let η_k = ξ_k^c − E(ξ_k^c | F_{k−1}), and let Y_n = Σ_{k=1}^n η_k. Then Y = (Y_n, F_n) is a square-integrable martingale with |η_k| ≤ 2c. By Theorem 5 we have

    A ⊆ { Σ Var(ξ_n^c | F_{n−1}) < ∞ } = {⟨Y⟩_∞ < ∞} = {Y_n →}.    (32)

Then it follows from (31) that

    A ∩ {X_n →} = A,

and therefore A ⊆ {X_n →}. This completes the proof.




5. PROBLEMS

1. Show that if a submartingale X = (X_n, F_n) satisfies E sup_n |X_n| < ∞, then it belongs to class C+.
2. Show that Theorems 1 and 2 remain valid for generalized submartingales.
3. Show that generalized submartingales satisfy (P-a.s.) the inclusion

    { inf_m sup_{n≥m} E(X_n^+ | F_m) < ∞ } ⊆ {X_n →}.

4. Show that the corollary of Theorem 1 remains valid for generalized martingales.
5. Show that every generalized submartingale of class C+ is a local submartingale.
6. Let a_n > 0, n ≥ 1, and let b_n = Σ_{k=1}^n a_k. Show that

    Σ_{n=1}^∞ a_n/b_n² < ∞.

6. Absolute Continuity and Singularity of Probability Distributions on a Measurable Space with Filtration

1. Let (Ω, F) be a measurable space on which there is defined a family (F_n)_{n≥1} of σ-algebras such that F_1 ⊆ F_2 ⊆ · · · ⊆ F and

    F = σ( ⋃_{n=1}^∞ F_n ).    (1)

Let us suppose that two probability measures P and P̃ are given on (Ω, F). Let us write

    P_n = P | F_n,    P̃_n = P̃ | F_n

for the restrictions of these measures to F_n, i.e., let P_n and P̃_n be measures on (Ω, F_n), and for B ∈ F_n let

    P_n(B) = P(B),    P̃_n(B) = P̃(B).



Recall that the probability measure P̃ is absolutely continuous with respect to P (notation: P̃ ≪ P) if P̃(A) = 0 whenever P(A) = 0, A ∈ F.
When P̃ ≪ P and P ≪ P̃, the measures P̃ and P are equivalent (notation: P̃ ∼ P).
The measures P̃ and P are singular (or orthogonal) if there is a set A ∈ F such that P̃(A) = 1 and P(A) = 0 (notation: P̃ ⊥ P).

Definition 1. We say that P̃ is locally absolutely continuous with respect to P (notation: P̃ ≪^loc P) if

    P̃_n ≪ P_n    (2)

for every n ≥ 1.

The fundamental question that we shall consider in this section is the determination of conditions under which local absolute continuity P̃ ≪^loc P implies one of the properties P̃ ≪ P, P̃ ∼ P, P̃ ⊥ P. It will become clear that martingale theory is the mathematical apparatus that lets us give definitive answers to these questions.
Recall that the problems of absolute continuity and singularity were considered in Sect. 9, Chap. 3, Vol. 1, for arbitrary probability measures. It was shown that the corresponding tests could be stated in terms of the Hellinger integrals (Theorems 2 and 3 therein). The results about absolute continuity and singularity for locally absolutely continuous measures to be stated below could be obtained using those tests. This approach is revealed in the monographs [34, 43]. Here we prefer another presentation, which enables us to better illustrate the possibilities of using the results on the sets of convergence of submartingales obtained in Sect. 5. (Note that throughout this section we assume the property of local absolute continuity. This is done only to simplify the presentation. The reader is referred to [34, 43] for the general case.)
Let us then suppose that P̃ ≪^loc P. Denote by

    z_n = dP̃_n/dP_n

the Radon–Nikodým derivative of P̃_n with respect to P_n. It is clear that z_n is F_n-measurable; and if A ∈ F_n, then

    ∫_A z_{n+1} d P = ∫_A (dP̃_{n+1}/dP_{n+1}) d P = P̃_{n+1}(A) = P̃_n(A) = ∫_A (dP̃_n/dP_n) d P = ∫_A z_n d P.

It follows that, with respect to P, the stochastic sequence z = (z_n, F_n)_{n≥1} is a martingale.
The following theorem is the key to problems on absolute continuity and singularity.

Theorem 1. Let P̃ ≪^loc P.
(a) Then with (1/2)(P̃ + P)-probability 1 there exists the limit lim_n z_n, to be denoted by z_∞, such that

    P(z_∞ = ∞) = 0.

(b) The Lebesgue decomposition

    P̃(A) = ∫_A z_∞ d P + P̃(A ∩ {z_∞ = ∞}), A ∈ F,    (3)

holds, and the measures P̃(A ∩ {z_∞ = ∞}) and P(A), A ∈ F, are singular.

PROOF. Let us notice first that, according to the classical Lebesgue decomposition (see (29) in Sect. 9, Chap. 3, Vol. 1) of an arbitrary probability measure P̃ with respect to a probability measure P, the following representation holds:

    P̃(A) = ∫_A (z̃/z) d P + P̃(A ∩ {z = 0}), A ∈ F,    (4)

where

    z = dP/dQ,    z̃ = dP̃/dQ,

and the measure Q can be taken, for example, to be Q = (1/2)(P + P̃). Conclusion (3) can be thought of as a specialization of decomposition (4) under the assumption that P̃ ≪^loc P, i.e., P̃_n ≪ P_n.
Let

    z_n = dP_n/dQ_n,    z̃_n = dP̃_n/dQ_n,    Q_n = (1/2)(P_n + P̃_n).
The sequences (𝔷_n, F_n) and (𝔷̃_n, F_n) are martingales with respect to Q such that 0 ≤ 𝔷_n ≤ 2, 0 ≤ 𝔷̃_n ≤ 2. Therefore, by Theorem 2, Sect. 4, there exist the limits

𝔷_∞ ≡ lim_n 𝔷_n,   𝔷̃_∞ ≡ lim_n 𝔷̃_n,   (5)

both Q-a.s. and in the sense of convergence in L¹(Ω, F, Q).

The convergence in L¹(Ω, F, Q) implies, in particular, that for any A ∈ F_m

∫_A 𝔷̃_∞ dQ = lim_{n↑∞} ∫_A 𝔷̃_n dQ = ∫_A 𝔷̃_m dQ = P̃_m(A) = P̃(A).

Then we obtain by Carathéodory's theorem (Sect. 3, Chap. 2, Vol. 1) that for any A ∈ F = σ(∪_n F_n)

∫_A 𝔷̃_∞ dQ = P̃(A),
6 Absolute Continuity and Singularity of Probability Distributions. . . 167

i.e., dP̃/dQ = 𝔷̃_∞, and, similarly,

∫_A 𝔷_∞ dQ = P(A),

i.e., dP/dQ = 𝔷_∞.
Thus, we have established the result that was to be expected: if the measures P and Q are defined on F = σ(∪_n F_n) and P_n, Q_n are the restrictions of these measures to F_n, then

lim_n (dP_n/dQ_n) = dP/dQ

(Q-a.s. and in L¹(Ω, F, Q)). Similarly,

lim_n (dP̃_n/dQ_n) = dP̃/dQ.
In the special case under consideration, where P̃_n ≪ P_n, n ≥ 1, it is not hard to show that (Q-a.s.)

z_n = 𝔷̃_n / 𝔷_n,   (6)

and Q{𝔷_n = 0, 𝔷̃_n = 0} ≤ ½[P{𝔷_n = 0} + P̃{𝔷̃_n = 0}] = 0, so that (6) Q-a.s. does not involve an indeterminacy of the form 0/0. (An expression of the form a/0 with a > 0 is, as usual, set equal to +∞.) It is useful to note that, since (𝔷_n, F_n) is a nonnegative martingale, relation (5) of Sect. 2 implies that if 𝔷_τ = 0, then 𝔷_n = 0 for all n ≥ τ (Q-a.s.). Of course, the same holds also for (𝔷̃_n, F_n). Therefore the points 0 and +∞ are "absorbing states" for the sequence (z_n)_{n≥1}.
It follows from (5) and (6) that the limit

z_∞ ≡ lim_n z_n = (lim_n 𝔷̃_n) / (lim_n 𝔷_n) = 𝔷̃_∞ / 𝔷_∞   (7)

exists Q-a.s. Since P{𝔷_∞ = 0} = ∫_{{𝔷_∞ = 0}} 𝔷_∞ dQ = 0, we have P{z_∞ = ∞} = 0, which proves conclusion (a).
For the proof of (3) we use the general decomposition (4). In our setup, by what has been proved, we have z = dP/dQ = 𝔷_∞ and z̃ = dP̃/dQ = 𝔷̃_∞ (Q-a.s.); hence (4) yields

P̃(A) = ∫_A (𝔷̃_∞/𝔷_∞) dP + P̃(A ∩ {𝔷_∞ = 0}).

In view of (7) and the fact that P̃{𝔷̃_∞ = 0} = 0, we obtain the required decomposition (3). Note that, since P{z_∞ < ∞} = 1, the measures

P(A) ≡ P(A ∩ {z_∞ < ∞})   and   P̃(A ∩ {z_∞ = ∞}),   A ∈ F,

are singular.



The Lebesgue decomposition (3) implies the following useful tests for absolute
continuity or singularity of locally absolutely continuous probability measures.
Theorem 2. Let P̃ ≪^loc P, i.e., P̃_n ≪ P_n, n ≥ 1. Then

P̃ ≪ P ⇔ E z_∞ = 1 ⇔ P̃(z_∞ < ∞) = 1,   (8)

P̃ ⊥ P ⇔ E z_∞ = 0 ⇔ P̃(z_∞ = ∞) = 1,   (9)
where E denotes averaging with respect to P.

PROOF. Setting A = Ω in (3), we find that

Ez∞ = 1 ⇔ P̃(z∞ = ∞) = 0, (10)


Ez∞ = 0 ⇔ P̃(z∞ = ∞) = 1. (11)

If P̃(z∞ = ∞) = 0, it again follows from (3) that P̃  P.


Conversely, let P̃  P. Then, since P(z∞ = ∞) = 0, we have P̃(z∞ = ∞) = 0.
In addition, if P̃ ⊥ P, there is a set B ∈ F with P̃(B) = 1 and P(B) = 0. Then
P̃(B ∩ (z∞ = ∞)) = 1 by (3), and therefore P̃(z∞ = ∞) = 1. If, on the other hand,
P̃(z∞ = ∞) = 1, the property P̃ ⊥ P is evident, since P(z∞ = ∞) = 0.
This completes the proof of the theorem.

2. It is clear from Theorem 2 that the tests for absolute continuity or singularity can
be expressed in terms of either P (verify the equation Ez∞ = 1 or Ez∞ = 0) or P̃
(verify that P̃(z∞ < ∞) = 1 or that P̃(z∞ = ∞) = 1).
By Theorem 5 in Sect. 6, Chap. 2, Vol. 1, the condition Ez∞ = 1 is equivalent to
the uniform integrability (with respect to P) of the family {zn }n≥1 . This allows us
to give simple sufficient conditions for the absolute continuity P̃ ≪ P. For example, if

sup_n E[z_n log⁺ z_n] < ∞   (12)

or, for some ε > 0,

sup_n E z_n^{1+ε} < ∞,   (13)

then, by Lemma 3 in Sect. 6, Chap. 2, Vol. 1, the family of random variables {z_n}_{n≥1} is uniformly integrable, and therefore P̃ ≪ P.
In many cases, it is preferable to verify the property of absolute continuity or
of singularity using a test in terms of P̃, since then the question is reduced to the
investigation of the probability of the “tail” event {z∞ < ∞}, where one can use
propositions like the zero–one law.
Let us show, by way of illustration, that the Kakutani dichotomy can be deduced
from Theorem 2.
Let ξ = (ξ1 , ξ2 , . . .) and ξ˜ = (ξ˜1 , ξ˜2 , . . .) be sequences of independent random
variables defined on a probability space (Ω, F , P).

Let (R∞ , B∞ ) be the measurable space of sequences x = (x1 , x2 , . . .) of real


numbers with B∞ = B(R∞ ), and let Bn = σ{x1 , . . . , xn }.
Let P and P̃ be the probability distributions on (R∞, B∞) for ξ and ξ̃, respectively, i.e.,

P(B) = P{ξ ∈ B},   P̃(B) = P{ξ̃ ∈ B},   B ∈ B∞.

Also, let
Pn = P | Bn , P̃n = P̃ | Bn
be the restrictions of P and P̃ to Bn , and let

Pξn (A) = P(ξn ∈ A), Pξ̃n (A) = P(ξ˜n ∈ A), A ∈ B(R1 ).

Theorem 3 (Kakutani Dichotomy). Let ξ = (ξ_1, ξ_2, . . .) and ξ̃ = (ξ̃_1, ξ̃_2, . . .) be sequences of independent random variables for which

P_{ξ̃_n} ≪ P_{ξ_n},   n ≥ 1.   (14)

Then either P̃ ≪ P or P̃ ⊥ P.
PROOF. Condition (14) is evidently equivalent to P̃_n ≪ P_n, n ≥ 1, i.e., P̃ ≪^loc P. It is clear that

z_n = dP̃_n/dP_n = q_1(x_1) · · · q_n(x_n),

where

q_i(x_i) = (dP_{ξ̃_i}/dP_{ξ_i})(x_i).   (15)

Consequently,

{x : z_∞ < ∞} = {x : log z_∞ < ∞} = {x : Σ_{i=1}^∞ log q_i(x_i) < ∞}.

The event {x : Σ_{i=1}^∞ log q_i(x_i) < ∞} is a tail event. Therefore, by the Kolmogorov zero–one law (Theorem 1, Sect. 1, Chap. 4), the probability P̃{x : z_∞ < ∞} has only two values (0 or 1), and therefore, by Theorem 2, either P̃ ⊥ P or P̃ ≪ P.
This completes the proof of the theorem.


3. The following theorem provides, in “predictable” terms, a test for absolute conti-
nuity or singularity.
Theorem 4. Let P̃ ≪^loc P, and let

α_n = z_n z_{n−1}^⊕,   n ≥ 1,

with z_0 = 1, where a^⊕ = a^{−1} for a ≠ 0 and 0^⊕ = 0. Then (with F_0 = {∅, Ω})


P̃ ≪ P ⇔ P̃{ Σ_{n=1}^∞ [1 − E(√α_n | F_{n−1})] < ∞ } = 1,   (16)

P̃ ⊥ P ⇔ P̃{ Σ_{n=1}^∞ [1 − E(√α_n | F_{n−1})] = ∞ } = 1.   (17)

PROOF. Since

P̃_n{z_n = 0} = ∫_{{z_n = 0}} z_n dP = 0,

we have (P̃-a.s.)

z_n = ∏_{k=1}^n α_k = exp{ Σ_{k=1}^n log α_k }.   (18)

Setting A = {z_∞ = 0} in (3), we find that P̃{z_∞ = 0} = 0. Therefore, by (18), we have (P̃-a.s.)

{z_∞ < ∞} = {0 < z_∞ < ∞} = {0 < lim z_n < ∞} = { −∞ < lim Σ_{k=1}^n log α_k < ∞ }.   (19)

Let us introduce the function

u(x) = x for |x| ≤ 1,   u(x) = sign x for |x| > 1.

Then

{ −∞ < lim Σ_{k=1}^n log α_k < ∞ } = { −∞ < lim Σ_{k=1}^n u(log α_k) < ∞ }.   (20)

Let Ẽ denote averaging with respect to P̃, and let η be an F_n-measurable integrable random variable. It follows from the properties of conditional expectations (Problem 4) that

z_{n−1} Ẽ(η | F_{n−1}) = E(η z_n | F_{n−1})   (P- and P̃-a.s.),   (21)

Ẽ(η | F_{n−1}) = z_{n−1}^⊕ E(η z_n | F_{n−1})   (P̃-a.s.).   (22)

Recalling that α_n = z_{n−1}^⊕ z_n, we obtain the following useful formula for "recalculation of conditional expectations" (see (44) in Sect. 7, Chap. 2, Vol. 1):

Ẽ(η | F_{n−1}) = E(α_n η | F_{n−1})   (P̃-a.s.).   (23)

From this it follows, in particular, that

E(α_n | F_{n−1}) = 1   (P̃-a.s.).   (24)



By (23),

Ẽ[u(log αn ) | Fn−1 ] = E[αn u(log αn ) | Fn−1 ] (P̃-a.s.).

Since xu(log x) ≥ x − 1 for x ≥ 0, we have, by (24),

Ẽ[u(log αn ) | Fn−1 ] ≥ 0 (P̃-a.s.).

It follows that the stochastic sequence X = (X_n, F_n) with

X_n = Σ_{k=1}^n u(log α_k)

is a submartingale with respect to P̃, and |ΔX_n| = |u(log α_n)| ≤ 1.

Then, by Theorem 5 in Sect. 5, we have (P̃-a.s.)

{ −∞ < lim Σ_{k=1}^n u(log α_k) < ∞ } = { Σ_{k=1}^∞ Ẽ[u(log α_k) + u²(log α_k) | F_{k−1}] < ∞ }.   (25)

Hence we find, by combining (19), (20), (23), and (25), that (P̃-a.s.)

{z_∞ < ∞} = { Σ_{k=1}^∞ Ẽ[u(log α_k) + u²(log α_k) | F_{k−1}] < ∞ }
  = { Σ_{k=1}^∞ E[α_k u(log α_k) + α_k u²(log α_k) | F_{k−1}] < ∞ },

and consequently, by Theorem 2,

P̃ ≪ P ⇔ P̃{ Σ_{k=1}^∞ E[α_k u(log α_k) + α_k u²(log α_k) | F_{k−1}] < ∞ } = 1,   (26)

P̃ ⊥ P ⇔ P̃{ Σ_{k=1}^∞ E[α_k u(log α_k) + α_k u²(log α_k) | F_{k−1}] = ∞ } = 1.   (27)

We now observe that by (24),

E[(1 − √α_n)² | F_{n−1}] = 2 E[1 − √α_n | F_{n−1}]   (P̃-a.s.),

and for x ≥ 0 there are constants A and B (0 < A < B < ∞) such that

A(1 − √x)² ≤ x u(log x) + x u²(log x) + 1 − x ≤ B(1 − √x)².   (28)

Hence (16) and (17) follow from (26), (27) and (24), (28).
This completes the proof of the theorem.


Corollary 1. If, for all n ≥ 1, the σ-algebras σ(α_n) and F_{n−1} are independent with respect to P (or P̃), and P̃ ≪^loc P, then we have the following dichotomy: either P̃ ≪ P or P̃ ⊥ P. Correspondingly,

P̃ ≪ P ⇔ Σ_{n=1}^∞ [1 − E√α_n] < ∞,

P̃ ⊥ P ⇔ Σ_{n=1}^∞ [1 − E√α_n] = ∞.

In particular, in the Kakutani situation (see Theorem 3) α_n = q_n(x_n) and

P̃ ≪ P ⇔ Σ_{n=1}^∞ [1 − E√(q_n(x_n))] < ∞,

P̃ ⊥ P ⇔ Σ_{n=1}^∞ [1 − E√(q_n(x_n))] = ∞.
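The last criterion can be evaluated in closed form for Gaussian mean shifts: if q_n is the density of N(μ_n, 1) relative to N(0, 1), a standard Gaussian computation gives E√(q_n(x_n)) = e^{−μ_n²/8}, so P̃ ≪ P if and only if Σ μ_n² < ∞. A minimal numerical sketch (an illustration, not from the book):

```python
import math

def kakutani_term(mu):
    # For q_n = dN(mu,1)/dN(0,1) one has E sqrt(q_n(x_n)) = exp(-mu^2/8),
    # so this is the n-th term 1 - E sqrt(q_n) of the criterion series.
    return 1.0 - math.exp(-mu * mu / 8.0)

# mu_n = 1/n: sum mu_n^2 < infinity, the series converges  =>  P~ << P
convergent = sum(kakutani_term(1.0 / n) for n in range(1, 100001))

# mu_n = n^(-1/2): sum mu_n^2 = infinity, the partial sums grow
# without bound (like (log N)/8)  =>  P~ ⊥ P
divergent = [sum(kakutani_term(n ** -0.5) for n in range(1, N + 1))
             for N in (10**2, 10**3, 10**4)]
```

The first series stabilizes near 0.2, while the partial sums in the second case keep growing, in accordance with the dichotomy.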

Corollary 2. Let P̃ ≪^loc P. Then

P̃{ Σ_{n=1}^∞ E(α_n log α_n | F_{n−1}) < ∞ } = 1 ⇒ P̃ ≪ P.

For the proof, it is enough to notice that

x log x + (3/2)(1 − x) ≥ 1 − x^{1/2}   for all x ≥ 0,   (29)

and apply (16) and (24).


Corollary 3. Since the series Σ_{n=1}^∞ [1 − E(√α_n | F_{n−1})], which has nonnegative (P̃-a.s.) terms, converges or diverges with the series Σ_n |log E(√α_n | F_{n−1})|, conclusions (16) and (17) of Theorem 4 can be put in the form

P̃ ≪ P ⇔ P̃{ Σ_{n=1}^∞ |log E(√α_n | F_{n−1})| < ∞ } = 1,   (30)

P̃ ⊥ P ⇔ P̃{ Σ_{n=1}^∞ |log E(√α_n | F_{n−1})| = ∞ } = 1.   (31)

Corollary 4. Let there exist constants A and B such that 0 ≤ A < 1, B ≥ 0, and

P{1 − A ≤ α_n ≤ 1 + B} = 1,   n ≥ 1.

Then, if P̃ ≪^loc P, we have

P̃ ≪ P ⇔ P̃{ Σ_{n=1}^∞ E[(1 − α_n)² | F_{n−1}] < ∞ } = 1,

P̃ ⊥ P ⇔ P̃{ Σ_{n=1}^∞ E[(1 − α_n)² | F_{n−1}] = ∞ } = 1.

For the proof it is enough to notice that if x ∈ [1 − A, 1 + B], where 0 ≤ A < 1, B ≥ 0, there are constants c and C (0 < c < C < ∞) such that

c(1 − x)² ≤ (1 − √x)² ≤ C(1 − x)².   (32)

4. Using the notation of Subsection 2, let us suppose that ξ = (ξ_1, ξ_2, . . .) and ξ̃ = (ξ̃_1, ξ̃_2, . . .) are Gaussian sequences and P̃_n ∼ P_n, n ≥ 1. Let us show that, for such sequences, the "predictable" test given above implies the Hájek–Feldman dichotomy: either P̃ ∼ P or P̃ ⊥ P.

By the theorem on normal correlation (Theorem 2 of Sect. 13, Chap. 2, Vol. 1) the conditional expectations E(x_n | B_{n−1}) and Ẽ(x_n | B_{n−1}), where E and Ẽ are expectations with respect to P and P̃, respectively, are linear functions of x_1, . . . , x_{n−1}. We denote these linear functions by a_{n−1}(x) and ã_{n−1}(x) (where a_0(x) = a_0, ã_0(x) = ã_0 are constants) and put

b_{n−1} = (E[x_n − a_{n−1}(x)]²)^{1/2},   b̃_{n−1} = (Ẽ[x_n − ã_{n−1}(x)]²)^{1/2}.

Again by the theorem on normal correlation, there are sequences ε = (ε_1, ε_2, . . .) and ε̃ = (ε̃_1, ε̃_2, . . .) of independent Gaussian random variables with zero means and unit variances, such that (P-a.s.)

ξ_n = a_{n−1}(ξ) + b_{n−1} ε_n,   ξ̃_n = ã_{n−1}(ξ̃) + b̃_{n−1} ε̃_n.   (33)

Notice that if b_{n−1} = 0 or b̃_{n−1} = 0, it is generally necessary to extend the probability space in order to construct (ε_n) or (ε̃_n). However, if b_{n−1} = 0, the distribution of the vector (x_1, . . . , x_n) will be concentrated (P-a.s.) on the linear manifold x_n = a_{n−1}(x), and since by hypothesis P̃_n ∼ P_n, we have b̃_{n−1} = 0, a_{n−1}(x) = ã_{n−1}(x), and α_n(x) = 1 (P- and P̃-a.s.). Hence we may suppose without loss of generality that b_n² > 0 and b̃_n² > 0 for all n ≥ 1, since otherwise the contribution of the corresponding terms to the sum Σ_{n=1}^∞ [1 − E(√α_n | B_{n−1})] (see (16) and (17)) is zero.

Using the Gaussian hypothesis, we find from (33) that, for n ≥ 1,

α_n = d_{n−1} exp{ (x_n − a_{n−1}(x))² / (2b_{n−1}²) − (x_n − ã_{n−1}(x))² / (2b̃_{n−1}²) },   (34)
where d_n = |b_n / b̃_n| and

a_0 = Eξ_1,   ã_0 = Eξ̃_1,   b_0² = Var ξ_1,   b̃_0² = Var ξ̃_1.

From (34),

log E(α_n^{1/2} | B_{n−1}) = ½ log[ 2d_{n−1} / (1 + d_{n−1}²) ] − ¼ · ( d_{n−1}² / (1 + d_{n−1}²) ) · ( (a_{n−1}(x) − ã_{n−1}(x)) / b_{n−1} )².
Since log[2d_{n−1}/(1 + d_{n−1}²)] ≤ 0, statement (30) can be written in the form

P̃ ≪ P ⇔ P̃{ Σ_{n=1}^∞ [ ½ log( (1 + d_{n−1}²) / (2d_{n−1}) ) + ¼ ( d_{n−1}² / (1 + d_{n−1}²) ) ( (a_{n−1}(x) − ã_{n−1}(x)) / b_{n−1} )² ] < ∞ } = 1.   (35)

The series

Σ_n log( (1 + d_{n−1}²) / (2d_{n−1}) )   and   Σ_n (d_{n−1}² − 1)²

converge or diverge together; hence it follows from (35) that

P̃ ≪ P ⇔ P̃{ Σ_{n=0}^∞ [ Δ_n²(x)/b_n² + (b̃_n²/b_n² − 1)² ] < ∞ } = 1,   (36)

where Δ_n(x) = a_n(x) − ã_n(x).


Since a_n(x) and ã_n(x) are linear, the sequence of random variables {Δ_n(x)/b_n}_{n≥0} is a Gaussian system (with respect to both P̃ and P). As follows from the lemma that will be proved below,

P̃{ Σ_n (Δ_n(x)/b_n)² < ∞ } = 1 ⇔ Σ_n Ẽ(Δ_n(x)/b_n)² < ∞.   (37)

Hence it follows from (36) that

P̃ ≪ P ⇔ Σ_{n=0}^∞ [ Ẽ(Δ_n(x)/b_n)² + (b̃_n²/b_n² − 1)² ] < ∞,

and in a similar way

P̃ ⊥ P ⇔ P̃{ Σ_{n=0}^∞ [ (Δ_n(x)/b_n)² + (b̃_n²/b_n² − 1)² ] < ∞ } = 0
  ⇔ Σ_{n=0}^∞ [ Ẽ(Δ_n(x)/b_n)² + (b̃_n²/b_n² − 1)² ] = ∞.
Then it is clear that if P̃ and P are not singular measures, we have P̃ ≪ P. But by hypothesis, P̃_n ∼ P_n, n ≥ 1; hence by symmetry, we have P ≪ P̃. Therefore we have the following theorem.

Theorem 5 (Hájek–Feldman Dichotomy). Let ξ = (ξ_1, ξ_2, . . .) and ξ̃ = (ξ̃_1, ξ̃_2, . . .) be Gaussian sequences whose finite-dimensional distributions are equivalent: P̃_n ∼ P_n, n ≥ 1. Then either P̃ ∼ P or P̃ ⊥ P. Moreover,

P̃ ∼ P ⇔ Σ_{n=0}^∞ [ Ẽ(Δ_n(x)/b_n)² + (b̃_n²/b_n² − 1)² ] < ∞,

P̃ ⊥ P ⇔ Σ_{n=0}^∞ [ Ẽ(Δ_n(x)/b_n)² + (b̃_n²/b_n² − 1)² ] = ∞.   (38)
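For independent Gaussian observations the criterion (38) reduces to an explicitly computable series. The following sketch (an illustration, not from the book; here the Δ_n = a_n − ã_n are nonrandom mean differences and b_n, b̃_n are the given standard deviations) classifies two simple pairs of product measures:

```python
def hf_series(deltas, bs, b_tildes):
    # Partial sum of the series in (38) for independent Gaussians:
    # with nonrandom means, E~(Delta_n/b_n)^2 is simply (Delta_n/b_n)^2.
    return sum((d / b) ** 2 + ((bt * bt) / (b * b) - 1.0) ** 2
               for d, b, bt in zip(deltas, bs, b_tildes))

N = 100000
# Mean shifts 1/n, equal unit variances: the series converges, so the
# measures are equivalent (the sum approaches pi^2/6).
equiv = hf_series([1.0 / n for n in range(1, N + 1)], [1.0] * N, [1.0] * N)

# A constant variance mismatch: every term contributes the same positive
# amount (b~^2/b^2 - 1)^2, the series diverges, so the measures are singular.
singular = hf_series([0.0] * N, [1.0] * N, [1.1] * N)
```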

Lemma. Let β = (β_n)_{n≥1} be a Gaussian sequence defined on (Ω, F, P). Then

P{ Σ_{n=1}^∞ β_n² < ∞ } > 0 ⇔ P{ Σ_{n=1}^∞ β_n² < ∞ } = 1 ⇔ Σ_{n=1}^∞ Eβ_n² < ∞.   (39)

PROOF. The implications (⇐) are obvious. To establish the implications (⇒), we first suppose that Eβ_n = 0, n ≥ 1. Here it is enough to show that

E Σ_{n=1}^∞ β_n² ≤ ( E exp{ −Σ_{n=1}^∞ β_n² } )^{−2},   (40)

since then the condition P{ Σ_{n=1}^∞ β_n² < ∞ } > 0 will imply that the right-hand side of (40) is finite. Therefore Σ_{n=1}^∞ Eβ_n² < ∞, and hence P{ Σ_{n=1}^∞ β_n² < ∞ } = 1 by the implication (⇐).
Select an n ≥ 1. Then it follows from Sects. 11 and 13, Chap. 2, Vol. 1, that there are independent Gaussian random variables β_{k,n}, k = 1, . . . , r ≤ n, with Eβ_{k,n} = 0, such that

Σ_{k=1}^n β_k² = Σ_{k=1}^r β_{k,n}².

If we write Eβ_{k,n}² = λ_{k,n}, we easily see that

E Σ_{k=1}^r β_{k,n}² = Σ_{k=1}^r λ_{k,n}   (41)

and

E exp{ −Σ_{k=1}^r β_{k,n}² } = ∏_{k=1}^r (1 + 2λ_{k,n})^{−1/2}.   (42)
Comparing the right-hand sides of (41) and (42), we obtain

E Σ_{k=1}^n β_k² = Σ_{k=1}^r λ_{k,n} ≤ ( E exp{ −Σ_{k=1}^r β_{k,n}² } )^{−2} = ( E exp{ −Σ_{k=1}^n β_k² } )^{−2},

from which, by letting n → ∞, we obtain the required inequality (40).



Now suppose that Eβ_n ≢ 0.

Let us consider another sequence, β̃ = (β̃_n)_{n≥1}, with the same distribution as β = (β_n)_{n≥1} but independent of it (extending the original probability space if necessary). If P{ Σ_{n=1}^∞ β_n² < ∞ } > 0, then P{ Σ_{n=1}^∞ (β_n − β̃_n)² < ∞ } > 0, and by what we have proved

2 Σ_{n=1}^∞ E(β_n − Eβ_n)² = Σ_{n=1}^∞ E(β_n − β̃_n)² < ∞.

Since

(Eβ_n)² ≤ 2β_n² + 2(β_n − Eβ_n)²,

we have Σ_{n=1}^∞ (Eβ_n)² < ∞, and therefore

Σ_{n=1}^∞ Eβ_n² = Σ_{n=1}^∞ (Eβ_n)² + Σ_{n=1}^∞ E(β_n − Eβ_n)² < ∞.

This completes the proof of the lemma.



5. We continue the discussion of the example in Subsection 3 of the preceding section, assuming that ξ_0, ξ_1, . . . are independent Gaussian random variables with Eξ_i = 0, Var ξ_i = V_i > 0. Again we let

X_{n+1} = θX_n + ξ_{n+1}

for n ≥ 0, where X_0 = ξ_0, and the unknown parameter θ that is to be estimated has values in R. Let θ̂_n be the least-squares estimator.

Theorem 6. A necessary and sufficient condition for the estimator θ̂_n, n ≥ 1, to be strongly consistent is that

Σ_{n=0}^∞ V_n / V_{n+1} = ∞.   (43)

PROOF. Sufficiency. Let P_θ denote the probability distribution on (R∞, B∞) corresponding to the sequence (X_0, X_1, . . .) when the true value of the unknown parameter is θ. Let E_θ denote an average with respect to P_θ.

We have already seen that

θ̂_n = θ + M_n / ⟨M⟩_n,

where

M_n = Σ_{k=0}^{n−1} X_k ξ_{k+1} / V_{k+1},   ⟨M⟩_n = Σ_{k=0}^{n−1} X_k² / V_{k+1}.

According to the lemma from the preceding subsection,

P_θ(⟨M⟩_∞ = ∞) = 1 ⇔ E_θ ⟨M⟩_∞ = ∞,

i.e., ⟨M⟩_∞ = ∞ (P_θ-a.s.) if and only if

Σ_{k=0}^∞ E_θ X_k² / V_{k+1} = ∞.   (44)

But

E_θ X_k² = Σ_{i=0}^k θ^{2i} V_{k−i}

and

Σ_{k=0}^∞ E_θ X_k² / V_{k+1} = Σ_{k=0}^∞ (1/V_{k+1}) Σ_{i=0}^k θ^{2i} V_{k−i}
  = Σ_{k=0}^∞ θ^{2k} Σ_{i=k}^∞ V_{i−k}/V_{i+1} = Σ_{i=0}^∞ V_i/V_{i+1} + Σ_{k=1}^∞ θ^{2k} Σ_{i=k}^∞ V_{i−k}/V_{i+1}.   (45)

Hence (44) follows from (43), and therefore, by Theorem 4, the estimator θ̂_n, n ≥ 1, is strongly consistent for every θ.
Necessity. For all θ ∈ R, let P_θ(θ̂_n → θ) = 1. Let us show that if θ_1 ≠ θ_2, the measures P_{θ_1} and P_{θ_2} are singular (P_{θ_1} ⊥ P_{θ_2}). In fact, since the sequence (X_0, X_1, . . .) is Gaussian, by Theorem 5 the measures P_{θ_1} and P_{θ_2} are either singular or equivalent. But they cannot be equivalent, since if P_{θ_1} ∼ P_{θ_2} but P_{θ_1}(θ̂_n → θ_1) = 1, then also P_{θ_2}(θ̂_n → θ_1) = 1. However, by hypothesis, P_{θ_2}(θ̂_n → θ_2) = 1 and θ_2 ≠ θ_1. Therefore P_{θ_1} ⊥ P_{θ_2} for θ_1 ≠ θ_2.

According to (38),

P_{θ_1} ⊥ P_{θ_2} ⇔ (θ_1 − θ_2)² Σ_{k=0}^∞ E_{θ_1} X_k² / V_{k+1} = ∞

for θ_1 ≠ θ_2. Taking θ_1 = 0 and θ_2 ≠ 0, we obtain from (45) that

P_0 ⊥ P_{θ_2} ⇔ Σ_{i=0}^∞ V_i / V_{i+1} = ∞,

which establishes the necessity of (43).

This completes the proof of the theorem.
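Criterion (43) is easy to probe by simulation. In the sketch below (hypothetical code, not from the book) V_n ≡ 1, so Σ V_n/V_{n+1} = ∞ and Theorem 6 predicts that the least-squares estimator converges to the true θ:

```python
import random

def ar1_lse(theta, n, seed=0):
    # Least-squares estimator for X_{k+1} = theta X_k + xi_{k+1} with
    # V_i = Var xi_i = 1: theta_hat = sum X_k X_{k+1} / sum X_k^2.
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)            # X_0 = xi_0
    num = den = 0.0
    for _ in range(n):
        x_next = theta * x + rng.gauss(0.0, 1.0)
        num += x * x_next
        den += x * x
        x = x_next
    return num / den

theta_hat = ar1_lse(0.5, 200000)       # close to 0.5 for large n
```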



6. PROBLEMS
1. Prove (6).
2. Let P̃n ∼ Pn , n ≥ 1. Show that

P̃ ∼ P ⇔ P̃{z∞ < ∞} = P{z∞ > 0} = 1,


P̃ ⊥ P ⇔ P̃{z∞ = ∞} = 1 or P{z∞ = 0} = 1.

3. Let P̃_n ≪ P_n, n ≥ 1, let τ be a stopping time (with respect to (F_n)), and let P̃_τ = P̃ | F_τ and P_τ = P | F_τ be the restrictions of P̃ and P to the σ-algebra F_τ. Show that P̃_τ ≪ P_τ if and only if {τ = ∞} ⊆ {z_∞ < ∞} (P̃-a.s.). (In particular, if P̃{τ < ∞} = 1, then P̃_τ ≪ P_τ.)
4. Prove the “recalculation formulas” (21) and (22).
5. Verify (28), (29), and (32).
6. Prove (34).
7. In Subsection 2, let the sequences ξ = (ξ_1, ξ_2, . . .) and ξ̃ = (ξ̃_1, ξ̃_2, . . .) consist of independent identically distributed random variables. Show that if P_{ξ̃_1} ≪ P_{ξ_1}, then P̃ ≪ P if and only if the measures P_{ξ̃_1} and P_{ξ_1} coincide. If, however, P_{ξ̃_1} ≪ P_{ξ_1} and P_{ξ̃_1} ≠ P_{ξ_1}, then P̃ ⊥ P.

7. Asymptotics of the Probability of the Outcome of a Random Walk with Curvilinear Boundary

1. Let ξ1 , ξ2 , . . . be a sequence of independent identically distributed random vari-


ables. Let Sn = ξ1 + · · · + ξn , let g = g(n) be a “boundary,” n ≥ 1, and let

τ = min{n ≥ 1 : Sn < g(n)}

be the first time at which the random walk (Sn ) is found below the boundary g =
g(n). (As usual, τ = ∞ if {·} = ∅.)
It is difficult to discover the exact form of the distribution of the time τ. In the
present section we find the asymptotic form of the probability P(τ > n) as n → ∞,
for a wide class of boundaries g = g(n) and assuming that the ξi are normally
distributed. The method of proof is based on the idea of an absolutely continuous
change of measure together with a number of the properties of martingales and
Markov times that were presented earlier.

Theorem 1. Let ξ1 , ξ2 , . . . be independent identically distributed random variables


with ξi ∼ N (0, 1). Suppose that g = g(n) is such that g(1) < 0 and, for n ≥ 2,

0 ≤ Δg(n + 1) ≤ Δg(n), (1)

where Δg(n) = g(n) − g(n − 1) and


log n = o( Σ_{k=2}^n [Δg(k)]² ),   n → ∞.   (2)

Then

P(τ > n) = exp{ −½ Σ_{k=2}^n [Δg(k)]² (1 + o(1)) },   n → ∞.   (3)

Before starting the proof, let us observe that (1) and (2) are satisfied if, for example,

g(n) = a n^ν + b,   ½ < ν ≤ 1,   a + b < 0,   a > 0,

or (for sufficiently large n)

g(n) = n^ν L(n),   ½ ≤ ν ≤ 1,

where L(n) is a slowly varying function (e.g., L(n) = C(log n)^β, C > 0, with arbitrary β for ½ < ν < 1 or with β > 0 for ν = ½).
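Conditions (1) and (2) for a concrete boundary can also be verified numerically; the sketch below (an illustration only, with an arbitrarily chosen boundary) checks them for g(n) = 2n^{3/4} − 3:

```python
import math

def check_boundary(g, N):
    # Condition (1): the increments Delta g(n) are nonnegative and
    # nonincreasing; condition (2): log N is small compared with the
    # sum of squared increments.
    dg = [g(n) - g(n - 1) for n in range(2, N + 1)]
    monotone = all(0.0 <= dg[i + 1] <= dg[i] + 1e-12
                   for i in range(len(dg) - 1))
    ratio = math.log(N) / sum(d * d for d in dg)
    return monotone, ratio

ok, ratio = check_boundary(lambda n: 2.0 * n ** 0.75 - 3.0, 10**5)
# ok is True, and ratio is small, consistent with
# log n = o(sum of [Delta g(k)]^2)
```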

2. We shall need the following two auxiliary propositions for the proof of Theo-
rem 1.
Let us suppose that ξ1 , ξ2 , . . . is a sequence of independent identically distributed
random variables, ξi ∼ N (0, 1). Let F0 = {∅, Ω}, Fn = σ{ξ1 , . . . , ξn }, and let
α = (αn , Fn−1 ) be a predictable sequence with P(|αn | ≤ C) = 1, n ≥ 1, where C
is a constant. Form the sequence z = (z_n, F_n) with

z_n = exp{ Σ_{k=1}^n α_k ξ_k − ½ Σ_{k=1}^n α_k² },   n ≥ 1.   (4)

It is easily verified that (with respect to P) the sequence z = (zn , Fn ) is a martingale


with E zn = 1, n ≥ 1.
Choose a value n ≥ 1 and introduce a probability measure P̃n on the measurable
space (Ω, Fn ) by putting

P̃n (A) = E I(A)zn , A ∈ Fn . (5)

Lemma 1 (Discrete version of Girsanov’s theorem). With respect to P̃n , the random
variables ξ˜k = ξk − αk , 1 ≤ k ≤ n, are independent and normally distributed,
ξ˜k ∼ N (0, 1).
PROOF. Let Ẽ_n denote the expectation with respect to P̃_n. Then for λ_k ∈ R, 1 ≤ k ≤ n,

Ẽ_n exp{ i Σ_{k=1}^n λ_k ξ̃_k } = E exp{ i Σ_{k=1}^n λ_k ξ̃_k } z_n
  = E [ exp{ i Σ_{k=1}^{n−1} λ_k ξ̃_k } z_{n−1} · E( exp{ iλ_n(ξ_n − α_n) + α_n ξ_n − α_n²/2 } | F_{n−1} ) ]
  = E [ exp{ i Σ_{k=1}^{n−1} λ_k ξ̃_k } z_{n−1} ] exp{ −½λ_n² } = · · · = exp{ −½ Σ_{k=1}^n λ_k² }.

Now the desired conclusion follows from Theorem 4, Sect. 12, Chap. 2, Vol. 1.
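Lemma 1 can be checked by direct Monte Carlo: under the reweighted measure P̃_n the variables ξ_k acquire mean α_k, so in particular Ẽ_n ξ_1 = E ξ_1 z_n = α_1. A toy sketch (constant α_k ≡ α is an assumption made here for simplicity; not from the book):

```python
import math
import random

def tilted_mean(alpha, n, m, seed=1):
    # Estimate E xi_1 z_n, which by the discrete Girsanov lemma equals
    # the P~_n-mean of xi_1, i.e. alpha (since xi_1 - alpha ~ N(0,1)
    # under P~_n).
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(m):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = math.exp(alpha * sum(xs) - 0.5 * n * alpha * alpha)
        acc += xs[0] * z
    return acc / m

estimate = tilted_mean(0.3, 5, 200000)   # should be near alpha = 0.3
```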

Lemma 2. Let X = (Xn , Fn )n≥1 be a square-integrable martingale with mean zero
and
σ = min{n ≥ 1 : Xn ≤ −b},
where b is a constant, b > 0. Suppose that

P(X1 < −b) > 0.

Then there is a constant C > 0 such that, for all n ≥ 1,

P(σ > n) ≥ C / E X_n².   (6)

PROOF. By Corollary 1 to Theorem 1 in Sect. 2, we have E Xσ∧n = 0, whence

− E I(σ ≤ n)Xσ = E I(σ > n)Xn . (7)

On the set {σ ≤ n}
−Xσ ≥ b > 0.
Therefore, for n ≥ 1,

− E I(σ ≤ n) X_σ ≥ b P(σ ≤ n) ≥ b P(σ = 1) ≥ b P(X_1 < −b) > 0.   (8)

On the other hand, by the Cauchy–Schwarz inequality,

E I(σ > n) Xn ≤ [P(σ > n) · E Xn2 ]1/2 , (9)

which, with (7) and (8), leads to the required inequality with

C = (b P(X1 < −b))2 .




PROOF OF THEOREM 1. It is enough to show that

lim inf_{n→∞} [ log P(τ > n) / Σ_{k=2}^n [Δg(k)]² ] ≥ −½   (10)

and

lim sup_{n→∞} [ log P(τ > n) / Σ_{k=2}^n [Δg(k)]² ] ≤ −½.   (11)

For this purpose we consider the (nonrandom) sequence (α_n)_{n≥1} with

α_1 = 0,   α_n = Δg(n),   n ≥ 2,

and the probability measures (P̃_n)_{n≥1} defined by (5). Then, by Hölder's inequality,

P̃_n(τ > n) = E I(τ > n) z_n ≤ (P(τ > n))^{1/q} (E z_n^p)^{1/p},   (12)

where p > 1 and q = p/(p − 1). The last factor is easily calculated explicitly:

(E z_n^p)^{1/p} = exp{ ((p − 1)/2) Σ_{k=2}^n [Δg(k)]² }.   (13)

Now let us estimate the probability P̃_n(τ > n) that appears on the left-hand side of (12). We have

P̃_n(τ > n) = P̃_n(S_k ≥ g(k), 1 ≤ k ≤ n) = P̃_n(S̃_k ≥ g(1), 1 ≤ k ≤ n),

where S̃_k = Σ_{i=1}^k ξ̃_i, ξ̃_i = ξ_i − α_i. By Lemma 1, the variables ξ̃_1, . . . , ξ̃_n are independent and normally distributed, ξ̃_i ∼ N(0, 1), with respect to the measure P̃_n. Then, by Lemma 2 (applied to b = −g(1), P = P̃_n, X_n = S̃_n), we find that

P̃_n(τ > n) ≥ C/n,   (14)

where C is a constant.
Then it follows from (12)–(14) that, for every p > 1,

P(τ > n) ≥ C_p exp{ −(p/2) Σ_{k=2}^n [Δg(k)]² − (p/(p − 1)) log n },   (15)

where C_p is a constant. Then (15) implies the lower bound (10) by the hypotheses of the theorem, since p > 1 is arbitrary.
To obtain the upper bound (11), we first observe that since z_n > 0 (P- and P̃-a.s.), we have, by (5),

P(τ > n) = Ẽ_n I(τ > n) z_n^{−1},   (16)

where Ẽ_n denotes an average with respect to P̃_n.

In the case under consideration α_1 = 0, α_n = Δg(n), n ≥ 2, and therefore for n ≥ 2

z_n^{−1} = exp{ −Σ_{k=2}^n Δg(k)·ξ_k + ½ Σ_{k=2}^n [Δg(k)]² }.

By the formula for summation by parts (see the proof of Lemma 2 in Sect. 3, Chap. 4)

Σ_{k=2}^n Δg(k)·ξ_k = Δg(n)·S_n − Σ_{k=2}^n S_{k−1} Δ(Δg(k)).

Hence, if we recall that, by hypothesis, Δg(k) ≥ 0 and Δ(Δg(k)) ≤ 0, we find that, on the set {τ > n} = {S_k ≥ g(k), 1 ≤ k ≤ n},

Σ_{k=2}^n Δg(k)·ξ_k ≥ Δg(n)·g(n) − Σ_{k=3}^n g(k − 1) Δ(Δg(k)) − ξ_1 Δg(2)
  = Σ_{k=2}^n [Δg(k)]² + g(1)Δg(2) − ξ_1 Δg(2).

Thus, by (16),

P(τ > n) ≤ exp{ −½ Σ_{k=2}^n [Δg(k)]² − g(1)Δg(2) } Ẽ_n I(τ > n) e^{ξ_1 Δg(2)},

where

Ẽ_n I(τ > n) e^{ξ_1 Δg(2)} ≤ Ẽ_n e^{ξ_1 Δg(2)} = E z_n e^{ξ_1 Δg(2)} = E e^{ξ_1 Δg(2)} < ∞.

Therefore

P(τ > n) ≤ C exp{ −½ Σ_{k=2}^n [Δg(k)]² },

where C is a positive constant; this establishes the upper bound (11).


This completes the proof of the theorem.

3. The idea of an absolutely continuous change of measure can be used to study


similar problems, including the case of a two-sided boundary. We present (without
proof) a result in this direction.

Theorem 2. Let ξ_1, ξ_2, . . . be independent identically distributed random variables with ξ_i ∼ N(0, 1). Suppose that f = f(n) is a positive function such that

f(n) → ∞,   n → ∞,

and

Σ_{k=2}^n [Δf(k)]² = o( Σ_{k=1}^n f^{−2}(k) ),   n → ∞.

Then for

σ = min{n ≥ 1 : |S_n| ≥ f(n)}

we have

P(σ > n) = exp{ −(π²/8) Σ_{k=1}^n f^{−2}(k) (1 + o(1)) },   n → ∞.   (17)

4. PROBLEMS
1. Show that the sequence defined in (4) is a martingale. Is it still true without the
condition |αn | ≤ c (P-a.s.), n ≥ 1?
2. Establish (13).
3. Prove (17).

8. Central Limit Theorem for Sums of Dependent Random Variables

1. In Sect. 4, Chap. 3, Vol. 1, the central limit theorem for sums Sn = ξn1 + · · · + ξnn ,
n ≥ 1, of random variables ξn1 , . . . , ξnn was established under the assumptions
of their independence, finiteness of second moments, and asymptotic negligibility
of their terms. In this section, we give up both the assumption of independence
and even that of the finiteness of the absolute first-order moments. However, the
asymptotic negligibility of the terms will be retained.
Thus, we suppose that on the probability space (Ω, F, P) there are given stochastic sequences

ξ^n = (ξ_{nk}, F_k^n),   0 ≤ k ≤ n,   n ≥ 1,

with ξ_{n0} = 0, F_0^n = {∅, Ω}, F_k^n ⊆ F_{k+1}^n ⊆ F (k + 1 ≤ n). We set

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk},   0 ≤ t ≤ 1.

Theorem 1. For a given t, 0 < t ≤ 1, let the following conditions be satisfied: for each ε ∈ (0, 1], as n → ∞,

(A) Σ_{k=1}^{[nt]} P(|ξ_{nk}| > ε | F_{k−1}^n) →^P 0,

(B) Σ_{k=1}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ ε) | F_{k−1}^n] →^P 0,

(C) Σ_{k=1}^{[nt]} Var[ξ_{nk} I(|ξ_{nk}| ≤ ε) | F_{k−1}^n] →^P σ_t², where σ_t² ≥ 0.

Then

X_t^n →^d N(0, σ_t²).
Remark 1. Hypotheses (A) and (B) guarantee that X_t^n can be represented in the form X_t^n = Y_t^n + Z_t^n with Z_t^n →^P 0 and Y_t^n = Σ_{k=0}^{[nt]} η_{nk}, where the sequence η^n = (η_{nk}, F_k^n) is a martingale difference, E(η_{nk} | F_{k−1}^n) = 0, with |η_{nk}| ≤ c uniformly for 1 ≤ k ≤ n and n ≥ 1. Consequently, in the cases under consideration, the proof reduces to proving the central limit theorem for martingale differences.

In the case where the variables ξ_{n1}, . . . , ξ_{nn} are independent, conditions (A), (B), and (C), with t = 1 and σ² = σ_1², become

(a) Σ_{k=1}^n P(|ξ_{nk}| > ε) → 0,

(b) Σ_{k=1}^n E[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → 0,

(c) Σ_{k=1}^n Var[ξ_{nk} I(|ξ_{nk}| ≤ ε)] → σ².

These are well known; see the book by Gnedenko and Kolmogorov [33]. Hence we have the following corollary to Theorem 1.

Corollary. If ξ_{n1}, . . . , ξ_{nn} are independent random variables, n ≥ 1, then

(a), (b), (c) ⇒ X_1^n = Σ_{k=1}^n ξ_{nk} →^d N(0, σ²).

Remark 2. In hypothesis (C), the case σt2 = 0 is not excluded. Hence, in particular,
d
Theorem 1 yields a convergence condition to the degenerate distribution (Xtn → 0).
Remark 3. The method used to prove Theorem 1 lets us state and prove the following more general proposition.

Let 0 = t_0 < t_1 < t_2 < · · · < t_j ≤ 1, 0 = σ_{t_0}² ≤ σ_{t_1}² ≤ σ_{t_2}² ≤ · · · ≤ σ_{t_j}², and let ε_1, . . . , ε_j be independent Gaussian random variables with zero means and E ε_k² = σ_{t_k}² − σ_{t_{k−1}}². Form the Gaussian vector (W_{t_1}, . . . , W_{t_j}) with W_{t_k} = ε_1 + · · · + ε_k.

Let conditions (A), (B), and (C) be satisfied for t = t_1, . . . , t_j. Then the joint distribution P^n_{t_1,...,t_j} of the random variables (X_{t_1}^n, . . . , X_{t_j}^n) converges weakly to the Gaussian distribution P_{t_1,...,t_j} of the variables (W_{t_1}, . . . , W_{t_j}):

P^n_{t_1,...,t_j} →^w P_{t_1,...,t_j}.

Remark 4. Let (σt2 )0≤t≤1 be a continuous nondecreasing function, σ02 = 0. Let


W = (Wt )0≤t≤1 denote the Brownian motion process (the Wiener process) with
E Wt = 0 and E Wt2 = σt2 . This process was defined in Sect. 13, Chap. 2, Vol. 1, for
σt2 = t. In the general case, this process is defined in a similar way as the Gaus-
sian process W = (Wt )0≤t≤1 with independent increments, W0 = 0, and covariance
function r(s, t) = min(σs2 , σt2 ). It is shown in the general theory of stochastic pro-
cesses that there always exists such a process with continuous paths. (In the case
σt2 = t, this process is called standard Brownian motion.)
If we denote by Pn and P the distributions of the processes X n and W in the
functional space (D, B(D)) (Subsection 7, Sect. 2, Chap. 2, Vol. 1), then we can
say that conditions (A), (B), and (C), fulfilled for all 0 < t ≤ 1, ensure not only
w
the convergence of finite-dimensional distributions (Pnt1 ,...,tj → Pt1 ,...,tj , t1 < t2 <
· · · < tj ≤ t, j = 1, 2, . . . ) stated earlier, but also the functional convergence, i.e.,
the weak convergence of the distributions Pn of the processes X n to the distribution

of the process W. (For details, see [4, 55, 43].) This result is usually called the
functional central limit theorem or the invariance principle (when ξn1 , . . . , ξnn are
independent, the latter is referred to as the Donsker–Prohorov invariance principle).
2. Theorem 2. 1. Condition (A) is equivalent to the uniform asymptotic negligibility condition

(A*) max_{1≤k≤[nt]} |ξ_{nk}| →^P 0.

2. Assuming (A) or (A*), condition (C) is equivalent to

(C*) Σ_{k=0}^{[nt]} [ξ_{nk} − E(ξ_{nk} I(|ξ_{nk}| ≤ 1) | F_{k−1}^n)]² →^P σ_t².

(The value of t in (A*) and (C*) is the same as in (A) and (C).)

Theorem 3. For each n ≥ 1 let the sequence

ξ^n = (ξ_{nk}, F_k^n),   1 ≤ k ≤ n,

be a square-integrable martingale difference:

E ξ_{nk}² < ∞,   E(ξ_{nk} | F_{k−1}^n) = 0.

Suppose that the Lindeberg condition is satisfied: for any ε > 0,

(L) Σ_{k=0}^{[nt]} E[ξ_{nk}² I(|ξ_{nk}| > ε) | F_{k−1}^n] →^P 0.

Then (C) is equivalent to

⟨X^n⟩_t →^P σ_t²,   (1)

where the quadratic characteristic

⟨X^n⟩_t = Σ_{k=0}^{[nt]} E(ξ_{nk}² | F_{k−1}^n),   (2)

and (C*) is equivalent to

[X^n]_t →^P σ_t²,   (3)

where the quadratic variation

[X^n]_t = Σ_{k=0}^{[nt]} ξ_{nk}².   (4)
The next theorem is a corollary of Theorems 1–3.

Theorem 4. Let the square-integrable martingale differences ξ^n = (ξ_{nk}, F_k^n), n ≥ 1, satisfy (for a given t, 0 < t ≤ 1) the Lindeberg condition (L). Then

Σ_{k=0}^{[nt]} E(ξ_{nk}² | F_{k−1}^n) →^P σ_t² ⇒ X_t^n →^d N(0, σ_t²),   (5)

Σ_{k=0}^{[nt]} ξ_{nk}² →^P σ_t² ⇒ X_t^n →^d N(0, σ_t²).   (6)
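A quick simulation illustrates Theorem 4. Take a martingale-difference array with genuinely dependent terms, ξ_{nk} = s_{k−1} ε_k / √n, where the ε_k = ±1 are fair coin tosses and s_{k−1} ∈ {−1, 1} depends on the past; then [X^n]_1 = 1 identically, and (6) gives X_1^n →^d N(0, 1). (An illustrative sketch, not from the book.)

```python
import random

def dependent_sum(n, rng):
    # xi_nk = s_{k-1} * eps_k / sqrt(n) is a martingale difference,
    # since s_{k-1} is measurable w.r.t. the past and E eps_k = 0; its
    # square is 1/n, so the quadratic variation [X^n]_1 equals 1 exactly.
    s, path, total = 1.0, 0.0, 0.0
    inv = 1.0 / n ** 0.5
    for _ in range(n):
        eps = 1.0 if rng.random() < 0.5 else -1.0
        total += s * eps * inv
        path += eps
        s = 1.0 if path >= 0 else -1.0
    return total

rng = random.Random(2)
samples = [dependent_sum(400, rng) for _ in range(10000)]
mean = sum(samples) / len(samples)
second_moment = sum(x * x for x in samples) / len(samples)
```

Empirically the mean is near 0 and the second moment near 1, matching the limiting N(0, 1) law despite the dependence between the terms.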

3. PROOF OF THEOREM 1. Let us represent X_t^n in the form

X_t^n = Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| ≤ 1) + Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1)
  = Σ_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F_{k−1}^n] + Σ_{k=0}^{[nt]} ξ_{nk} I(|ξ_{nk}| > 1)
  + Σ_{k=0}^{[nt]} { ξ_{nk} I(|ξ_{nk}| ≤ 1) − E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F_{k−1}^n] }.   (7)

We define

B_t^n = Σ_{k=0}^{[nt]} E[ξ_{nk} I(|ξ_{nk}| ≤ 1) | F_{k−1}^n],

μ_{nk}(Γ) = I(ξ_{nk} ∈ Γ),   ν_k^n(Γ) = P(ξ_{nk} ∈ Γ | F_{k−1}^n),   (8)

where Γ is a set from the smallest σ-algebra B_0 = σ(A_0) generated by the system of sets A_0 in R_0 = R \ {0}, which consists of finite sums of disjoint intervals (a, b] not containing the point {0}, and P(ξ_{nk} ∈ Γ | F_{k−1}^n) is a regular conditional distribution of ξ_{nk} given the σ-algebra F_{k−1}^n.

Then (7) can be rewritten in the following form:

X_t^n = B_t^n + Σ_{k=1}^{[nt]} ∫_{|x|>1} x dμ_{nk} + Σ_{k=1}^{[nt]} ∫_{|x|≤1} x d(μ_{nk} − ν_k^n),   (9)

which is known as the canonical decomposition of (X_t^n, F_t^n). (The integrals are to be understood as Lebesgue–Stieltjes integrals, defined for every sample point.)
According to (B), we have B_t^n →^P 0. Let us show that (A) implies

Σ_{k=1}^{[nt]} ∫_{|x|>1} |x| dμ_{nk} →^P 0.   (10)

We have

Σ_{k=1}^{[nt]} ∫_{|x|>1} |x| dμ_{nk} = Σ_{k=1}^{[nt]} |ξ_{nk}| I(|ξ_{nk}| > 1).   (11)

For every δ ∈ (0, 1),

{ Σ_{k=1}^{[nt]} |ξ_{nk}| I(|ξ_{nk}| > 1) > δ } = { Σ_{k=1}^{[nt]} I(|ξ_{nk}| > 1) > δ },   (12)

since each sum is greater than δ if |ξ_{nk}| > 1 for at least one k. It is clear that

Σ_{k=1}^{[nt]} I(|ξ_{nk}| > 1) = Σ_{k=1}^{[nt]} ∫_{|x|>1} dμ_{nk}   (≡ U_{[nt]}^n).

By (A),

V_{[nt]}^n ≡ Σ_{k=1}^{[nt]} ∫_{|x|>1} dν_k^n →^P 0,   (13)

and V_k^n is F_{k−1}^n-measurable. Then, by the corollary to Theorem 4 in Sect. 3,

V_{[nt]}^n →^P 0 ⇒ U_{[nt]}^n →^P 0.   (14)

Note that by the same corollary and the inequality ΔU_{[nt]}^n ≤ 1, we also have the converse implication:

U_{[nt]}^n →^P 0 ⇒ V_{[nt]}^n →^P 0,   (15)

which will be needed in the proof of Theorem 2.

The required proposition (10) now follows from (11)–(14).
Thus
Xtn = Ytn + Ztn , (16)
where

    Y_t^n = \sum_{k=1}^{[nt]} \int_{\{|x|\le 1\}} x \, d(\mu_{nk} - \nu_k^n),   (17)

and

    Z_t^n = B_{nt} + \sum_{k=1}^{[nt]} \int_{\{|x|>1\}} x \, d\mu_{nk} \xrightarrow{\mathsf{P}} 0.   (18)

It then follows by Problem 1 that to establish X_t^n →d N(0, σ_t²), we need only show that

    Y_t^n \xrightarrow{d} \mathscr{N}(0, \sigma_t^2).   (19)

Let us represent Y_t^n in the form

    Y_t^n = \gamma_{[nt]}^n(\varepsilon) + \Delta_{[nt]}^n(\varepsilon), \quad \varepsilon \in (0, 1],

where

    \gamma_{[nt]}^n(\varepsilon) = \sum_{k=1}^{[nt]} \int_{\{\varepsilon<|x|\le 1\}} x \, d(\mu_{nk} - \nu_k^n),   (20)

    \Delta_{[nt]}^n(\varepsilon) = \sum_{k=1}^{[nt]} \int_{\{|x|\le \varepsilon\}} x \, d(\mu_{nk} - \nu_k^n).   (21)

As in the proof of (10), it is easily verified that, because of (A), we have

    \gamma_{[nt]}^n(\varepsilon) \xrightarrow{\mathsf{P}} 0, \quad n \to \infty.

The sequence Δⁿ(ε) = (Δ_k^n(ε), F_k^n), 1 ≤ k ≤ n, is a square-integrable martingale with quadratic characteristic

    \langle \Delta^n(\varepsilon) \rangle_k = \sum_{i=1}^{k} \Big[ \int_{\{|x|\le\varepsilon\}} x^2 \, d\nu_i^n - \Big( \int_{\{|x|\le\varepsilon\}} x \, d\nu_i^n \Big)^2 \Big] = \sum_{i=1}^{k} \operatorname{Var}[\xi_{ni} I(|\xi_{ni}| \le \varepsilon) \mid \mathscr{F}_{i-1}^n].

Because of (C),

    \langle \Delta^n(\varepsilon) \rangle_{[nt]} \xrightarrow{\mathsf{P}} \sigma_t^2.

Hence, for every ε ∈ (0, 1],

    \max\{ \gamma_{[nt]}^n(\varepsilon), \; |\langle \Delta^n(\varepsilon) \rangle_{[nt]} - \sigma_t^2| \} \xrightarrow{\mathsf{P}} 0.

By Problem 2 there is then a sequence of numbers ε_n ↓ 0 such that

    \gamma_{[nt]}^n(\varepsilon_n) \xrightarrow{\mathsf{P}} 0, \qquad \langle \Delta^n(\varepsilon_n) \rangle_{[nt]} \xrightarrow{\mathsf{P}} \sigma_t^2.

Therefore, again by Problem 1, it is enough to prove that

    M_{[nt]}^n \xrightarrow{d} \mathscr{N}(0, \sigma_t^2),   (22)

where

    M_k^n = \Delta_k^n(\varepsilon_n) = \sum_{i=1}^{k} \int_{\{|x|\le\varepsilon_n\}} x \, d(\mu_{ni} - \nu_i^n).   (23)

For Γ ∈ B_0, let

    \tilde\mu_k^n(\Gamma) = I(\Delta M_k^n \in \Gamma), \qquad \tilde\nu_k^n(\Gamma) = \mathsf{P}(\Delta M_k^n \in \Gamma \mid \mathscr{F}_{k-1}^n)

be a regular conditional probability, where ΔM_k^n = M_k^n − M_{k−1}^n, k ≥ 1, M_0^n = 0. Then the square-integrable martingale Mⁿ = (M_k^n, F_k^n), 1 ≤ k ≤ n, can evidently be written in the form

    M_k^n = \sum_{i=1}^{k} \Delta M_i^n = \sum_{i=1}^{k} \int_{\{|x|\le 2\varepsilon_n\}} x \, d\tilde\mu_i^n.

(Notice that |ΔM_i^n| ≤ 2ε_n by (23).)


To establish (22), we have, by Theorem 1 (Sect. 3, Chap. 3, Vol. 1), to show that, for every real λ,

    \mathsf{E} \exp\{ i\lambda M_{[nt]}^n \} \to \exp(-\tfrac{1}{2}\lambda^2 \sigma_t^2).   (24)

Set

    G_k^n = \sum_{j=1}^{k} \int_{\{|x|\le 2\varepsilon_n\}} (e^{i\lambda x} - 1) \, d\tilde\nu_j^n

and

    \mathscr{E}_k^n(G^n) = \prod_{j=1}^{k} (1 + \Delta G_j^n).

Observe that

    1 + \Delta G_k^n = 1 + \int_{\{|x|\le 2\varepsilon_n\}} (e^{i\lambda x} - 1) \, d\tilde\nu_k^n = \mathsf{E}[\exp(i\lambda \Delta M_k^n) \mid \mathscr{F}_{k-1}^n],

and consequently,

    \mathscr{E}_k^n(G^n) = \prod_{j=1}^{k} \mathsf{E}[\exp(i\lambda \Delta M_j^n) \mid \mathscr{F}_{j-1}^n].

By the lemma to be proved in Subsection 4, (24) will follow if, for every real λ,

    |\mathscr{E}_{[nt]}^n(G^n)| = \Big| \prod_{j=1}^{[nt]} \mathsf{E}[\exp(i\lambda \Delta M_j^n) \mid \mathscr{F}_{j-1}^n] \Big| \ge c(\lambda) > 0   (25)

and

    \mathscr{E}_{[nt]}^n(G^n) \xrightarrow{\mathsf{P}} \exp(-\tfrac{1}{2}\lambda^2 \sigma_t^2).   (26)

To see this, we represent \mathscr{E}_k^n(G^n) in the form

    \mathscr{E}_k^n(G^n) = \exp(G_k^n) \cdot \prod_{j=1}^{k} (1 + \Delta G_j^n) \exp(-\Delta G_j^n).

(Compare the function \mathscr{E}_t(A) defined by (76) of Sect. 6, Chap. 2, Vol. 1.)

Since

    \int_{\{|x|\le 2\varepsilon_n\}} x \, d\tilde\nu_j^n = \mathsf{E}(\Delta M_j^n \mid \mathscr{F}_{j-1}^n) = 0,

we have

    G_k^n = \sum_{j=1}^{k} \int_{\{|x|\le 2\varepsilon_n\}} (e^{i\lambda x} - 1 - i\lambda x) \, d\tilde\nu_j^n.   (27)

Therefore

    |\Delta G_k^n| \le \int_{\{|x|\le 2\varepsilon_n\}} |e^{i\lambda x} - 1 - i\lambda x| \, d\tilde\nu_k^n \le \tfrac{1}{2}\lambda^2 \int_{\{|x|\le 2\varepsilon_n\}} x^2 \, d\tilde\nu_k^n \le \tfrac{1}{2}\lambda^2 (2\varepsilon_n)^2 \to 0   (28)

and

    \sum_{j=1}^{k} |\Delta G_j^n| \le \tfrac{1}{2}\lambda^2 \sum_{j=1}^{k} \int_{\{|x|\le 2\varepsilon_n\}} x^2 \, d\tilde\nu_j^n = \tfrac{1}{2}\lambda^2 \langle M^n \rangle_k.   (29)

By (C),

    \langle M^n \rangle_{[nt]} \xrightarrow{\mathsf{P}} \sigma_t^2.   (30)

Suppose first that ⟨Mⁿ⟩_{[nt]} ≤ a (P-a.s.). Then, by (28), (29), and Problem 3,

    \prod_{k=1}^{[nt]} (1 + \Delta G_k^n) \exp(-\Delta G_k^n) \xrightarrow{\mathsf{P}} 1, \quad n \to \infty,

and therefore, to establish (26), we only have to show that

    G_{[nt]}^n \xrightarrow{\mathsf{P}} -\tfrac{1}{2}\lambda^2 \sigma_t^2,   (31)

i.e., in view of (27), (29), and (30), that

    \sum_{k=1}^{[nt]} \int_{\{|x|\le 2\varepsilon_n\}} (e^{i\lambda x} - 1 - i\lambda x + \tfrac{1}{2}\lambda^2 x^2) \, d\tilde\nu_k^n \xrightarrow{\mathsf{P}} 0.   (32)

But

    |e^{i\lambda x} - 1 - i\lambda x + \tfrac{1}{2}\lambda^2 x^2| \le \tfrac{1}{6}|\lambda x|^3,

and therefore

    \sum_{k=1}^{[nt]} \int_{\{|x|\le 2\varepsilon_n\}} |e^{i\lambda x} - 1 - i\lambda x + \tfrac{1}{2}\lambda^2 x^2| \, d\tilde\nu_k^n \le \tfrac{1}{6}|\lambda|^3 (2\varepsilon_n) \sum_{k=1}^{[nt]} \int_{\{|x|\le 2\varepsilon_n\}} x^2 \, d\tilde\nu_k^n = \tfrac{1}{3}\varepsilon_n |\lambda|^3 \langle M^n \rangle_{[nt]} \le \tfrac{1}{3}\varepsilon_n |\lambda|^3 a \to 0, \quad n \to \infty.

Therefore, if ⟨Mⁿ⟩_{[nt]} ≤ a (P-a.s.), (31) is established and, consequently, so is (26).

Let us now verify (25). Since |e^{iλx} − 1 − iλx| ≤ ½(λx)², we find from (28) that, for sufficiently large n,

    |\mathscr{E}_k^n(G^n)| = \Big| \prod_{j=1}^{k} (1 + \Delta G_j^n) \Big| \ge \prod_{j=1}^{k} (1 - \tfrac{1}{2}\lambda^2 \Delta\langle M^n \rangle_j) = \exp\Big( \sum_{j=1}^{k} \log(1 - \tfrac{1}{2}\lambda^2 \Delta\langle M^n \rangle_j) \Big).

But

    \log(1 - \tfrac{1}{2}\lambda^2 \Delta\langle M^n \rangle_j) \ge - \frac{\tfrac{1}{2}\lambda^2 \Delta\langle M^n \rangle_j}{1 - \tfrac{1}{2}\lambda^2 \Delta\langle M^n \rangle_j},

and Δ⟨Mⁿ⟩_j ≤ (2ε_n)² ↓ 0, n → ∞. Therefore there is an n_0 = n_0(λ) such that for all n ≥ n_0(λ),

    |\mathscr{E}_k^n(G^n)| \ge \exp\{ -\lambda^2 \langle M^n \rangle_k \},

and therefore

    |\mathscr{E}_{[nt]}^n(G^n)| \ge \exp\{ -\lambda^2 \langle M^n \rangle_{[nt]} \} \ge e^{-\lambda^2 a}.

Hence the theorem is proved under the assumption that ⟨Mⁿ⟩_{[nt]} ≤ a (P-a.s.). To remove this assumption, we proceed as follows.
Let

    \tau_n = \min\{ k \le [nt] : \langle M^n \rangle_k \ge \sigma_t^2 + 1 \},

taking τ_n = ∞ if ⟨Mⁿ⟩_{[nt]} < σ_t² + 1. Then, for M̄_k^n = M_{k∧τ_n}^n, we have

    \langle \bar M^n \rangle_{[nt]} = \langle M^n \rangle_{[nt]\wedge\tau_n} \le 1 + \sigma_t^2 + (2\varepsilon_n)^2 \le 1 + \sigma_t^2 + (2\varepsilon_1)^2 \ (= a),

and by what has been proved,

    \mathsf{E} \exp\{ i\lambda \bar M_{[nt]}^n \} \to \exp(-\tfrac{1}{2}\lambda^2 \sigma_t^2).

But

    \lim_n | \mathsf{E}\{ \exp(i\lambda M_{[nt]}^n) - \exp(i\lambda \bar M_{[nt]}^n) \} | \le 2 \lim_n \mathsf{P}(\tau_n < \infty) = 0.

Consequently,

    \lim_n \mathsf{E} \exp(i\lambda M_{[nt]}^n) = \lim_n \mathsf{E}\{ \exp(i\lambda M_{[nt]}^n) - \exp(i\lambda \bar M_{[nt]}^n) \} + \lim_n \mathsf{E} \exp(i\lambda \bar M_{[nt]}^n) = \exp(-\tfrac{1}{2}\lambda^2 \sigma_t^2).

This completes the proof of Theorem 1.



Remark. To prove the statement made in Remark 2 to Theorem 1, we need to show (using the Cramér–Wold method [4]) that, for all real numbers λ_1, . . . , λ_j,

    \mathsf{E} \exp\Big\{ i \Big( \lambda_1 M_{[nt_1]}^n + \sum_{k=2}^{j} \lambda_k ( M_{[nt_k]}^n - M_{[nt_{k-1}]}^n ) \Big) \Big\} \to \exp\Big\{ -\tfrac{1}{2}\lambda_1^2 \sigma_{t_1}^2 - \tfrac{1}{2} \sum_{k=2}^{j} \lambda_k^2 ( \sigma_{t_k}^2 - \sigma_{t_{k-1}}^2 ) \Big\}.

The proof of this is similar to that of (24), replacing (M_k^n, F_k^n) by the square-integrable martingales (M̂_k^n, F_k^n),

    \hat M_k^n = \sum_{i=1}^{k} \nu_i \, \Delta M_i^n,

where ν_i = λ_1 for i ≤ [nt_1] and ν_i = λ_k for [nt_{k−1}] < i ≤ [nt_k], k = 2, . . . , j.

4. In this subsection we prove a simple lemma that lets us reduce the verification of
(24) to the verification of (25) and (26).
Let ηⁿ = (η_{nk}, F_k^n), 1 ≤ k ≤ n, n ≥ 1, be stochastic sequences, let

    Y^n = \sum_{k=1}^{n} \eta_{nk},

let

    \mathscr{E}^n(\lambda) = \prod_{k=1}^{n} \mathsf{E}[\exp(i\lambda \eta_{nk}) \mid \mathscr{F}_{k-1}^n], \quad \lambda \in \mathbb{R},

and let Y be a random variable with

    \mathscr{E}(\lambda) = \mathsf{E}\, e^{i\lambda Y}, \quad \lambda \in \mathbb{R}.

Lemma. If (for a given λ) |𝓔ⁿ(λ)| ≥ c(λ) > 0, n ≥ 1, a sufficient condition for the limit relation

    \mathsf{E}\, e^{i\lambda Y^n} \to \mathsf{E}\, e^{i\lambda Y}   (33)

is that

    \mathscr{E}^n(\lambda) \xrightarrow{\mathsf{P}} \mathscr{E}(\lambda).   (34)

PROOF. Let

    m^n(\lambda) = \frac{e^{i\lambda Y^n}}{\mathscr{E}^n(\lambda)}.

Then |mⁿ(λ)| ≤ c^{−1}(λ) < ∞, and it is easily verified that

    \mathsf{E}\, m^n(\lambda) = 1.

Hence, using E mⁿ(λ) = 1, (34), and the Lebesgue dominated convergence theorem,

    | \mathsf{E}\, e^{i\lambda Y^n} - \mathsf{E}\, e^{i\lambda Y} | = | \mathsf{E}( e^{i\lambda Y^n} - m^n(\lambda)\,\mathscr{E}(\lambda) ) | = | \mathsf{E}( m^n(\lambda) [ \mathscr{E}^n(\lambda) - \mathscr{E}(\lambda) ] ) | \le c^{-1}(\lambda)\, \mathsf{E}\, | \mathscr{E}^n(\lambda) - \mathscr{E}(\lambda) | \to 0, \quad n \to \infty.


Remark 5. It follows from (33) and the hypothesis |𝓔ⁿ(λ)| ≥ c(λ) > 0 that 𝓔(λ) ≠ 0. In fact, the conclusion of the lemma remains valid without the assumption that |𝓔ⁿ(λ)| ≥ c(λ) > 0, if restated in the following form: if 𝓔ⁿ(λ) →P 𝓔(λ) and 𝓔(λ) ≠ 0, then (33) holds (Problem 5).
5. PROOF OF THEOREM 2. 1. Let 0 < ε < 1, δ ∈ (0, ε), and for simplicity let t = 1. Since

    \max_{1\le k\le n} |\xi_{nk}| \le \varepsilon + \sum_{k=1}^{n} |\xi_{nk}| \, I(|\xi_{nk}| > \varepsilon)

and

    \Big\{ \sum_{k=1}^{n} |\xi_{nk}| \, I(|\xi_{nk}| > \varepsilon) > \delta \Big\} = \Big\{ \sum_{k=1}^{n} I(|\xi_{nk}| > \varepsilon) > \delta \Big\},

we have

    \mathsf{P}\Big\{ \max_{1\le k\le n} |\xi_{nk}| > \varepsilon + \delta \Big\} \le \mathsf{P}\Big\{ \sum_{k=1}^{n} I(|\xi_{nk}| > \varepsilon) > \delta \Big\} = \mathsf{P}\Big\{ \sum_{k=1}^{n} \int_{\{|x|>\varepsilon\}} d\mu_{nk} > \delta \Big\}.

If (A) is satisfied, i.e.,

    \mathsf{P}\Big\{ \sum_{k=1}^{n} \int_{\{|x|>\varepsilon\}} d\nu_k^n > \delta \Big\} \to 0,

then (cf. (10)) we also have

    \mathsf{P}\Big\{ \sum_{k=1}^{n} \int_{\{|x|>\varepsilon\}} d\mu_{nk} > \delta \Big\} \to 0.

Therefore (A) ⇒ (A∗).


Conversely, let

    \sigma_n = \min\{ k \le n : |\xi_{nk}| \ge \varepsilon/2 \},

supposing that σ_n = ∞ if max_{1≤k≤n} |ξ_{nk}| < ε/2. By (A∗), lim_n P(σ_n < ∞) = 0.
Now observe that, for every δ ∈ (0, 1), the sets

    \Big\{ \sum_{k=1}^{n\wedge\sigma_n} I(|\xi_{nk}| \ge \varepsilon/2) > \delta \Big\} \quad\text{and}\quad \Big\{ \max_{1\le k\le n\wedge\sigma_n} |\xi_{nk}| \ge \tfrac{1}{2}\varepsilon \Big\}

coincide, and by (A∗),

    \sum_{k=1}^{n\wedge\sigma_n} I(|\xi_{nk}| \ge \varepsilon/2) = \sum_{k=1}^{n\wedge\sigma_n} \int_{\{|x|\ge\varepsilon/2\}} d\mu_{nk} \xrightarrow{\mathsf{P}} 0.

Therefore, by (15),

    \sum_{k=1}^{n\wedge\sigma_n} \int_{\{|x|\ge\varepsilon\}} d\nu_k^n \le \sum_{k=1}^{n\wedge\sigma_n} \int_{\{|x|\ge\varepsilon/2\}} d\nu_k^n \xrightarrow{\mathsf{P}} 0,

which, together with the property lim_n P(σ_n < ∞) = 0, proves that (A∗) ⇒ (A).
2. Again suppose that t = 1. Choose an ε ∈ (0, 1] and consider the square-integrable martingales (see (21))

    \Delta^n(\delta) = (\Delta_k^n(\delta), \mathscr{F}_k^n), \quad 1 \le k \le n,

with δ ∈ (0, ε]. For the given ε ∈ (0, 1], we have, according to (C),

    \langle \Delta^n(\varepsilon) \rangle_n \xrightarrow{\mathsf{P}} \sigma_1^2.

It is then easily deduced from (A) that for every δ ∈ (0, ε]

    \langle \Delta^n(\delta) \rangle_n \xrightarrow{\mathsf{P}} \sigma_1^2.   (35)

Let us show that from (C∗) and (A) or, equivalently, from (C∗) and (A∗), it follows that, for every δ ∈ (0, ε],

    [\Delta^n(\delta)]_n \xrightarrow{\mathsf{P}} \sigma_1^2,   (36)

where

    [\Delta^n(\delta)]_n = \sum_{k=1}^{n} \Big( \xi_{nk} I(|\xi_{nk}| \le \delta) - \int_{\{|x|\le\delta\}} x \, d\nu_k^n \Big)^2.

In fact, it is easily verified that, by (A),

    [\Delta^n(\delta)]_n - [\Delta^n(1)]_n \xrightarrow{\mathsf{P}} 0.   (37)

But

    \Big| \sum_{k=1}^{n} \Big( \xi_{nk} - \int_{\{|x|\le 1\}} x \, d\nu_k^n \Big)^2 - \sum_{k=1}^{n} \Big( \xi_{nk} I(|\xi_{nk}| \le 1) - \int_{\{|x|\le 1\}} x \, d\nu_k^n \Big)^2 \Big|
        \le \sum_{k=1}^{n} I(|\xi_{nk}| > 1) \Big( \xi_{nk}^2 + 2|\xi_{nk}| \Big| \int_{\{|x|\le 1\}} x \, d(\mu_{nk} - \nu_k^n) \Big| \Big)
        \le 5 \sum_{k=1}^{n} I(|\xi_{nk}| > 1) \, \xi_{nk}^2
        \le 5 \max_{1\le k\le n} \xi_{nk}^2 \cdot \sum_{k=1}^{n} \int_{\{|x|>1\}} d\mu_{nk} \xrightarrow{\mathsf{P}} 0.   (38)

Hence (36) follows from (37) and (38).


Consequently, to establish the equivalence of (C) and (C∗), it is enough to establish that both (C) (for a given ε ∈ (0, 1]) and (C∗) imply that, for every a > 0,

    \lim_{\delta\to 0} \limsup_n \mathsf{P}\{ \, |[\Delta^n(\delta)]_n - \langle \Delta^n(\delta) \rangle_n| > a \, \} = 0.   (39)

Let

    m_k^n(\delta) = [\Delta^n(\delta)]_k - \langle \Delta^n(\delta) \rangle_k, \quad 1 \le k \le n.

The sequence mⁿ(δ) = (m_k^n(δ), F_k^n) is a square-integrable martingale, and (mⁿ(δ))² is dominated (in the sense of the definition from Sect. 3) by the sequences [mⁿ(δ)] and ⟨mⁿ(δ)⟩.
It is clear that

    [m^n(\delta)]_n = \sum_{k=1}^{n} (\Delta m_k^n(\delta))^2 \le \max_{1\le k\le n} |\Delta m_k^n(\delta)| \cdot \{ [\Delta^n(\delta)]_n + \langle \Delta^n(\delta) \rangle_n \} \le 3\delta^2 \{ [\Delta^n(\delta)]_n + \langle \Delta^n(\delta) \rangle_n \}.   (40)

Since [Δⁿ(δ)] and ⟨Δⁿ(δ)⟩ dominate each other, it follows from (40) that (mⁿ(δ))² is dominated by the sequences 6δ²[Δⁿ(δ)] and 6δ²⟨Δⁿ(δ)⟩.
Hence, if (C) is satisfied, then for all sufficiently small δ (namely, those with δ ≤ ε and 6δ²(σ_1² + 1) < b)

    \lim_n \mathsf{P}(6\delta^2 \langle \Delta^n(\delta) \rangle_n > b) = 0,

and hence, by the corollary to Theorem 4 (Sect. 3), we have (39).


On the other hand, if (C∗) is satisfied, then for the same values of δ,

    \lim_n \mathsf{P}(6\delta^2 [\Delta^n(\delta)]_n > b) = 0.   (41)

Since |Δ[Δⁿ(δ)]_k| ≤ (2δ)², the validity of (39) follows from (41) and another appeal to the corollary to Theorem 4 (Sect. 3).
This completes the proof of Theorem 2.
6. PROOF OF THEOREM 3. On account of the Lindeberg condition (L), the equiv-
alence of (C) and (1), and of (C∗ ) and (3), can be established by direct calculation
(Problem 6).



7. PROOF OF THEOREM 4. Condition (A) follows from the Lindeberg condition (L). As for condition (B), it is sufficient to observe that when ξⁿ is a martingale difference, the variables B_{nt} that appear in the canonical decomposition (9) can be represented in the form

    B_{nt} = - \sum_{k=0}^{[nt]} \int_{\{|x|>1\}} x \, d\nu_k^n.

Therefore B_{nt} →P 0 by the Lindeberg condition (L).
8. The fundamental theorem of the present section, namely, Theorem 1, was proved under the hypothesis that the terms that are summed are uniformly asymptotically infinitesimal. It is natural to ask for conditions of the central limit theorem without such a hypothesis. For independent random variables, an example of such a theorem was Theorem 1 in Sect. 5, Chap. 3, Vol. 1 (assuming finite second moments).
We quote (without proof) an analog of this theorem, restricting ourselves to sequences ξⁿ = (ξ_{nk}, F_k^n), 1 ≤ k ≤ n, that are square-integrable martingale differences (E ξ_{nk}² < ∞, E(ξ_{nk} | F_{k−1}^n) = 0).
Let F_{nk}(x) = P(ξ_{nk} ≤ x | F_{k−1}^n) be a regular distribution function of ξ_{nk} given F_{k−1}^n, and let Δ_{nk} = E(ξ_{nk}² | F_{k−1}^n).

Theorem 5. If a square-integrable martingale difference ξⁿ = (ξ_{nk}, F_k^n), 0 ≤ k ≤ n, n ≥ 1, satisfies the conditions

    \sum_{k=0}^{[nt]} \Delta_{nk} \xrightarrow{\mathsf{P}} \sigma_t^2, \quad 0 \le \sigma_t^2 < \infty, \ 0 \le t \le 1,

and for every ε > 0

    \sum_{k=0}^{[nt]} \int_{\{|x|>\varepsilon\}} |x| \, \Big| F_{nk}(x) - \Phi\Big( \frac{x}{\sqrt{\Delta_{nk}}} \Big) \Big| \, dx \xrightarrow{\mathsf{P}} 0,

then

    X_t^n \xrightarrow{d} \mathscr{N}(0, \sigma_t^2).
9. PROBLEMS

1. Let ξ_n = η_n + ζ_n, n ≥ 1, where η_n →d η and ζ_n →d 0. Prove that ξ_n →d η.
2. Let (ξ_n(ε)), n ≥ 1, ε > 0, be a family of random variables such that ξ_n(ε) →P 0 for each ε > 0 as n → ∞. Using, for example, Problem 11 from Sect. 10, Chap. 2, Vol. 1, prove that there is a sequence ε_n ↓ 0 such that ξ_n(ε_n) →P 0.
3. Let (α_k^n), 1 ≤ k ≤ n, n ≥ 1, be complex-valued random variables such that (P-a.s.)

    \sum_{k=1}^{n} |\alpha_k^n| \le C, \qquad \max_{1\le k\le n} |\alpha_k^n| \le a_n \downarrow 0.

Show that then (P-a.s.)

    \lim_n \prod_{k=1}^{n} (1 + \alpha_k^n) \exp(-\alpha_k^n) = 1.

4. Prove the statement made in Remark 2 to Theorem 1.


5. Prove the statement made in Remark 5 to the lemma.
6. Prove Theorem 3.
7. Prove Theorem 5.

9. Discrete Version of Itô’s Formula

1. In the stochastic analysis of Brownian motion and other related processes (e.g.,
martingales, local martingales, semimartingales) Itô’s change-of-variables formula
plays a key role. In this section, we present a discrete (in time) version of this for-
mula and show briefly how Itô’s formula for Brownian motion could be derived
from it using a limiting procedure.

2. Let X = (Xn )0≤n≤N and Y = (Yn )0≤n≤N be two sequences of random variables
on the probability space (Ω, F , P), X0 = Y0 = 0, and

[X, Y] = ([X, Y]n )0≤n≤N ,

where

    [X, Y]_n = \sum_{i=1}^{n} \Delta X_i \, \Delta Y_i   (1)

is the quadratic covariation of X and Y (Sect. 1).


Also, suppose that F = F(x) is an absolutely continuous function,
 x
F(x) = F(0) + f (y) dy, (2)
0

where f = f(y), y ∈ R, is a Borel function such that

    \int_{\{|y| \le c\}} |f(y)| \, dy < \infty \quad \text{for every } c > 0.

The change-of-variables formula in which we are interested concerns the possibility


of representing the sequence

F(X) = (F(Xn ))0≤n≤N (3)

in terms of “natural” functionals of the sequence X = (Xn )0≤n≤N .



Given the function f = f(x) as in (2), consider the quadratic covariation [X, f(X)] of the sequences X and f(X) = (f(X_n))_{0≤n≤N}. By (1),

    [X, f(X)]_n = \sum_{k=1}^{n} \Delta f(X_k) \, \Delta X_k = \sum_{k=1}^{n} (f(X_k) - f(X_{k-1}))(X_k - X_{k-1}).   (4)

We introduce two “discrete integrals” (cf. Definition 5 in Sect. 1):

    I_n(X, f(X)) = \sum_{k=1}^{n} f(X_{k-1}) \, \Delta X_k, \quad 1 \le n \le N,   (5)

    \tilde I_n(X, f(X)) = \sum_{k=1}^{n} f(X_k) \, \Delta X_k, \quad 1 \le n \le N.   (6)

Then

    [X, f(X)]_n = \tilde I_n(X, f(X)) - I_n(X, f(X)).   (7)
(For n = 0, we set I0 = Ĩ0 = 0.)
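The identity (7) is purely algebraic and can be checked directly; here is a small sketch of ours, with an arbitrary sample path and f(x) = x² (our choice of f, not the book's):

```python
# Check [X, f(X)]_n = Ĩ_n - I_n  (formula (7)) for an arbitrary path.
X = [0.0, 0.5, -0.3, 1.1, 0.9, 1.4]   # X_0 = 0, as assumed in the text
f = lambda x: x * x

def dX(k):  # ΔX_k = X_k - X_{k-1}
    return X[k] - X[k - 1]

n = len(X) - 1
I   = sum(f(X[k - 1]) * dX(k) for k in range(1, n + 1))  # forward integral (5)
Itl = sum(f(X[k])     * dX(k) for k in range(1, n + 1))  # backward integral (6)
cov = sum((f(X[k]) - f(X[k - 1])) * dX(k) for k in range(1, n + 1))  # (4)

print(abs(cov - (Itl - I)) < 1e-12)   # → True
```

The point of the check: each summand of (4) is (f(X_k) − f(X_{k−1}))ΔX_k = f(X_k)ΔX_k − f(X_{k−1})ΔX_k, so the covariation telescopes into the difference of the backward and forward integrals.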
For a fixed N, we introduce a new (reversed) sequence X̃ = (X̃_n)_{0≤n≤N} with

    \tilde X_n = X_{N-n}.   (8)

Then, clearly,

    \tilde I_N(X, f(X)) = -I_N(\tilde X, f(\tilde X)),

and, analogously,

    \tilde I_n(X, f(X)) = -\{ I_N(\tilde X, f(\tilde X)) - I_{N-n}(\tilde X, f(\tilde X)) \}.

From this and (7) we obtain

    [X, f(X)]_N = -\{ I_N(\tilde X, f(\tilde X)) + I_N(X, f(X)) \},

and for 0 < n < N we have

    [X, f(X)]_n = -\{ I_N(\tilde X, f(\tilde X)) - I_{N-n}(\tilde X, f(\tilde X)) \} - I_n(X, f(X))
                = -\Big( \sum_{k=N-n+1}^{N} f(\tilde X_{k-1}) \, \Delta \tilde X_k + \sum_{k=1}^{n} f(X_{k-1}) \, \Delta X_k \Big).   (9)

Remark 1. We note that the structures of the right-hand sides of (7) and (9) are dif-
ferent. Equation (7) contains two different forms of “discrete integral.” The integral
In (X, f (X)) is a “forward integral” in the sense that the value f (Xk−1 ) of f at the left
end of the interval [k − 1, k] is multiplied by the increment ΔXk = Xk − Xk−1 on this
interval, whereas in Ĩn (X, f (X)) the increment ΔXk is multiplied by the value f (Xk )
at the right end of [k − 1, k].

Thus, (7) contains both the “forward integral” In (X, f (X)) and the “backward
integral” Ĩn (X, f (X)), while in (9), both integrals are “forward integrals,” over two
different sequences X and X̃.

3. Since for any function g = g(x)

    g(X_{k-1}) + \tfrac{1}{2}[g(X_k) - g(X_{k-1})] - \tfrac{1}{2}[g(X_k) + g(X_{k-1})] = 0,

it is clear that

    F(X_n) = F(X_0) + \sum_{k=1}^{n} g(X_{k-1}) \, \Delta X_k + \tfrac{1}{2}[X, g(X)]_n
             + \sum_{k=1}^{n} \Big( (F(X_k) - F(X_{k-1})) - \frac{g(X_{k-1}) + g(X_k)}{2} \, \Delta X_k \Big).   (10)

In particular, if g(x) = f(x), where f(x) is the function of (2), then

    F(X_n) = F(X_0) + I_n(X, f(X)) + \tfrac{1}{2}[X, f(X)]_n + R_n(X, f(X)),   (11)

where

    R_n(X, f(X)) = \sum_{k=1}^{n} \int_{X_{k-1}}^{X_k} \Big( f(x) - \frac{f(X_{k-1}) + f(X_k)}{2} \Big) dx.   (12)

From analysis, it is well known that if the function f′′(x) is continuous, then the following formula (the “trapezoidal rule” error representation) holds:

    \int_a^b \Big( f(x) - \frac{f(a)+f(b)}{2} \Big) dx = \int_a^b \frac{f''(\xi(x))}{2} (x-a)(x-b) \, dx
        = \frac{(b-a)^3}{2} \int_0^1 x(x-1) \, f''(\xi(a + x(b-a))) \, dx
        = -\frac{(b-a)^3}{12} \, f''(\eta),

where ξ(x) and η are “intermediate” points in the interval [a, b].
Thus, in (12),

    R_n(X, f(X)) = -\frac{1}{12} \sum_{k=1}^{n} f''(\eta_k) (\Delta X_k)^3,

where η_k lies between X_{k−1} and X_k, whence

    |R_n(X, f(X))| \le \frac{1}{12} \sup |f''(\eta)| \sum_{k=1}^{n} |\Delta X_k|^3,   (13)

where the supremum is taken over all η such that

min(X0 , X1 , . . . , Xn ) ≤ η ≤ max(X0 , X1 , . . . , Xn ).
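Formula (11) with the remainder (12) is an exact identity, and the remainder obeys the bound (13); the following check is our own sketch, with F(x) = x³/3, so f(x) = x² and f′′(x) = 2 (constant), and an arbitrary path:

```python
from math import isclose

# F(x) = x^3/3, hence f = F' with f(x) = x^2 and f''(x) = 2 (our example).
F = lambda x: x ** 3 / 3
f = lambda x: x * x

X = [0.0, 0.4, -0.2, 0.7, 0.3, 1.0]
n = len(X) - 1

I = cov = R = cube = 0.0
for k in range(1, n + 1):
    a, b = X[k - 1], X[k]
    d = b - a
    I += f(a) * d                        # forward integral I_n, cf. (5)
    cov += (f(b) - f(a)) * d             # quadratic covariation [X, f(X)]_n, cf. (4)
    R += (b**3 - a**3) / 3 - (f(a) + f(b)) / 2 * d   # remainder (12), in closed form
    cube += abs(d) ** 3

lhs = F(X[n])
rhs = F(X[0]) + I + 0.5 * cov + R        # discrete Itô formula (11)
bound = 2 * cube / 12                    # estimate (13) with sup|f''| = 2
print(isclose(lhs, rhs, abs_tol=1e-12), abs(R) <= bound)   # → True True
```

For this f the remainder is exactly −(1/6) Σ (ΔX_k)³, in agreement with the trapezoidal-rule representation above.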

We shall refer to formula (11) as the discrete analog of Itô’s formula. We note that
the right-hand side of this formula contains the following three “natural” ingredi-
ents: “the discrete integral” In (X, f (X)), the quadratic covariation [X, f (X)]n , and
the “remainder” term Rn (X, f (X)), which is so termed because it goes to zero in the
limit transition to the continuous time (see Subsection 5 for details).
4.
EXAMPLE 1. If f(x) = a + bx, then R_n(X, f(X)) = 0, and formula (11) takes the following form:

    F(X_n) = F(X_0) + I_n(X, f(X)) + \tfrac{1}{2}[X, f(X)]_n.   (14)

(Compare with formula (19) below.)

EXAMPLE 2. Let

    f(x) = \operatorname{sign} x = \begin{cases} 1, & x > 0, \\ 0, & x = 0, \\ -1, & x < 0, \end{cases}

and let F(x) = |x|. Let X_k = S_k, where

    S_k = \xi_1 + \xi_2 + \cdots + \xi_k

with ξ_1, ξ_2, . . . independent Bernoulli random variables taking the values ±1 with probability 1/2.
If we also set S_0 = 0, we obtain from (11) that

    |S_n| = \sum_{k=1}^{n} (\operatorname{sign} S_{k-1}) \, \Delta S_k + N_n,   (15)

where

    N_n = \#\{0 \le k < n : S_k = 0\}

is the number of zeros in the sequence S_0, S_1, . . . , S_{n−1}.
We note that the sequence of discrete integrals \big( \sum_{k=1}^{n} (\operatorname{sign} S_{k-1}) \Delta S_k \big)_{n\ge1} involved in (15) forms a martingale, and therefore

    \mathsf{E}\,|S_n| = \mathsf{E}\, N_n.   (16)

Since (Problem 2)

    \mathsf{E}\,|S_n| \sim \sqrt{2n/\pi}, \quad n \to \infty,   (17)

(16) yields

    \mathsf{E}\, N_n \sim \sqrt{2n/\pi}, \quad n \to \infty.   (18)

In other words, the average number of “draws” in the random walk S_0, S_1, . . . , S_n has order of growth √n rather than n, which could seem more natural at first glance. Note that the property (18) is closely related to the arcsine law (Sect. 10, Chap. 1, Vol. 1), since it is actually its consequence.
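The identity (16) and the asymptotics (17)–(18) can be verified without simulation, since both sides are finite sums over the binomial distribution of S_k; the script below is our own check (the function names are ours):

```python
from math import comb, sqrt, pi

def E_abs_S(n):
    # E|S_n| = sum_k C(n, k)|2k - n| / 2^n  (k steps of +1, n-k steps of -1)
    return sum(comb(n, k) * abs(2 * k - n) for k in range(n + 1)) / 2 ** n

def E_N(n):
    # E N_n = sum_{j=0}^{n-1} P(S_j = 0); P(S_j = 0) = C(j, j/2)/2^j for even j
    return sum(comb(j, j // 2) / 2 ** j for j in range(0, n, 2))

n = 500
lhs, rhs = E_abs_S(n), E_N(n)
ratio = lhs / sqrt(2 * n / pi)   # should approach 1 by (17)
print(abs(lhs - rhs) < 1e-9, round(ratio, 3))
```

For n = 500 the two expectations agree to floating-point precision, and the ratio to √(2n/π) is already very close to 1.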

5. Let B = (B_t)_{0≤t≤1} be a standard (B_0 = 0, E B_t = 0, E B_t² = t) Brownian motion (Sect. 13, Chap. 2, Vol. 1), and let X_k = B_{k/n}, k = 0, 1, . . . , n. Then application of formula (11) leads to the following result:

    F(B_1) = F(B_0) + \sum_{k=1}^{n} f(B_{(k-1)/n}) \, \Delta B_{k/n} + \tfrac{1}{2}[f(B_{\cdot/n}), B_{\cdot/n}]_n + R_n(B_{\cdot/n}, f(B_{\cdot/n})).   (19)
It is known from the stochastic calculus of Brownian motion (e.g., [75, 32]) that

    \sum_{k=1}^{n} |B_{k/n} - B_{(k-1)/n}|^3 \xrightarrow{\mathsf{P}} 0, \quad n \to \infty.   (20)

Therefore, if f = f(x) is twice differentiable and |f′′(x)| ≤ C, x ∈ R, for some C > 0, then we obtain from (13) that R_n(B_{·/n}, f(B_{·/n})) →P 0.
Appealing again to Brownian motion theory, we obtain that for any Borel function f = f(x) ∈ L²_loc (i.e., such that \int_{\{|x|\le C\}} f^2(x)\,dx < \infty for any C > 0) there exists the limit (in probability) of the “discrete integrals” \sum_{k=1}^{n} f(B_{(k-1)/n}) \Delta B_{k/n}. This limit is denoted by \int_0^1 f(B_s)\,dB_s and called Itô’s stochastic integral with respect to Brownian motion.
Therefore, turning to (19), we see that R_n(B_{·/n}, f(B_{·/n})) →P 0 and the “discrete integrals” \sum_{k=1}^{n} f(B_{(k-1)/n}) \Delta B_{k/n} converge (in probability) to the “stochastic integral” \int_0^1 f(B_s)\,dB_s; hence there exists the limit in probability of the quadratic covariations

    [B_{\cdot/n}, f(B_{\cdot/n})] \ (= [f(B_{\cdot/n}), B_{\cdot/n}]),

which can naturally be denoted by [B, f(B)]_1.

Thus, if f = f(x) is twice differentiable, |f′′(x)| ≤ C, x ∈ R, and f ∈ L²_loc, then

    F(B_1) = F(0) + \int_0^1 f(B_s) \, dB_s + \tfrac{1}{2}[B, f(B)]_1.   (21)

We have here

    [B, f(B)]_1 = \int_0^1 f'(B_s) \, ds,   (22)

and therefore

    F(B_1) = F(0) + \int_0^1 f(B_s) \, dB_s + \tfrac{1}{2} \int_0^1 f'(B_s) \, ds,   (23)

or, in a more standard form,

    F(B_1) = F(0) + \int_0^1 F'(B_s) \, dB_s + \tfrac{1}{2} \int_0^1 F''(B_s) \, ds.   (24)

This formula (for F ∈ C2 ) is referred to as Itô’s change-of-variables formula for


Brownian motion.

6. PROBLEMS
1. Prove formula (15).
2. Establish that property (17) is true.
3. Prove formula (22).
4. Try to prove that (24) holds for any F ∈ C2 .

10. Application of Martingale Methods to Calculation of


Probability of Ruin in Insurance

1. The material studied in this section is a good illustration of the fact that the theory
of martingales provides a simple way of estimating the risk faced by an insurance
company.
We shall assume that the evolution of the capital of a certain insurance company
is described by a random process X = (Xt )t≥0 . The initial capital is X0 = u > 0.
Insurance payments arrive continuously at a constant rate c > 0 (in time Δt the
amount arriving is cΔt) and claims are received at random times T1 , T2 , . . . (0 <
T1 < T2 < · · · ), where the amounts to be paid out at these times are described by
nonnegative random variables ξ1 , ξ2 , . . . .
Thus, taking into account receipts and claims, the capital Xt at time t > 0 is
determined by the formula

    X_t = u + ct - S_t,   (1)

where

    S_t = \sum_{i \ge 1} \xi_i \, I(T_i \le t).   (2)

We denote by
T = inf{t ≥ 0 : Xt ≤ 0}
the first time at which the insurance company’s capital becomes less than or equal
to zero (“time of ruin”). Of course, if Xt > 0 for all t ≥ 0, then the time T is set
equal to +∞.

One of the main questions relating to the operation of an insurance company


is the calculation (or estimation) of the probability of ruin, P(T < ∞), and the
probability of ruin before time t, P(T ≤ t) (inclusively).

2. This is a rather complicated problem. However, it can be solved (partially) in the


framework of the classical Cramér–Lundberg model characterized by the following
assumptions.

A. The times T1 , T2 , . . . at which claims are received are such that the variables
(T0 ≡ 0)
σi = Ti − Ti−1 , i ≥ 1,
are independent identically distributed random variables having an exponential dis-
tribution with density λe−λt , t ≥ 0 (see Table 2.3 in Sect. 3, Chap. 2, Vol. 1).

B. The random variables ξ_1, ξ_2, . . . are independent identically distributed with distribution function F(x) = P(ξ_1 ≤ x) such that F(0) = 0 and

    \mu = \int_0^\infty x \, dF(x) < \infty.

C. The sequences (T1 , T2 , . . .) and (ξ1 , ξ2 , . . .) are independent sequences (in the
sense of Definition 6 of Sect. 5, Chap. 2, Vol. 1).
Denote by

    N_t = \sum_{i \ge 1} I(T_i \le t)   (3)

the process describing the number of claims before time t (inclusively), N_0 = 0.


Since

    \{T_k > t\} = \{\sigma_1 + \cdots + \sigma_k > t\} = \{N_t < k\}, \quad k \ge 1,

under assumption A, we find that, according to Problem 6 in Sect. 8, Chap. 2, Vol. 1,

    \mathsf{P}(N_t < k) = \mathsf{P}(\sigma_1 + \cdots + \sigma_k > t) = e^{-\lambda t} \sum_{i=0}^{k-1} \frac{(\lambda t)^i}{i!},

whence

    \mathsf{P}(N_t = k) = e^{-\lambda t} \frac{(\lambda t)^k}{k!}, \quad k = 0, 1, \ldots,   (4)

i.e., the random variable N_t has the Poisson distribution (see Table 2.2 in Sect. 3, Chap. 2, Vol. 1) with parameter λt. Here, E N_t = λt.
The Poisson process N = (Nt )t≥0 constructed in this way is a special case of
a renewal process (Subsection 4, Sect. 9, Chap. 2, Vol. 1). The trajectories of this
process are discontinuous (specifically, piecewise-constant, continuous on the right,
and with unit jumps). Like Brownian motion (Sect. 13, Chap. 2, Vol. 1) having con-
tinuous trajectories, this process plays a fundamental role in the theory of random
processes. From these two processes can be built random processes of rather com-
plicated probabilistic structure. (We mention processes with independent increments
as a typical example of these; see, e.g., [31, 75, 68].)

3. From assumption C we find that

    \mathsf{E}(X_t - X_0) = ct - \mathsf{E}\, S_t = ct - \mathsf{E} \sum_i \xi_i I(T_i \le t) = ct - \sum_i \mathsf{E}\, \xi_i I(T_i \le t)
        = ct - \sum_i \mathsf{E}\, \xi_i \cdot \mathsf{E}\, I(T_i \le t) = ct - \mu \sum_i \mathsf{P}(T_i \le t)
        = ct - \mu \sum_i \mathsf{P}(N_t \ge i) = ct - \mu \, \mathsf{E}\, N_t = t(c - \lambda\mu).

Thus, we see that, in the case under consideration, a natural requirement for an
insurance company to operate with a clear profit (i.e., E(Xt − X0 ) > 0, t > 0) is
that
c > λμ. (5)
In the following analysis, an important role is played by the function

    h(z) = \int_0^\infty (e^{zx} - 1) \, dF(x), \quad z \ge 0,   (6)

which is equal to F̂(−z) − 1, where

    \hat F(s) = \int_0^\infty e^{-sx} \, dF(x)

is the Laplace–Stieltjes transform of F (with s a complex number).


Using the notation

    g(z) = \lambda h(z) - cz, \qquad \xi_0 = 0,

we find that for any r > 0 with h(r) < ∞,

    \mathsf{E}\, e^{-r(X_t - X_0)} = \mathsf{E}\, e^{-r(X_t - u)} = e^{-rct} \cdot \mathsf{E}\, e^{r \sum_{i=0}^{N_t} \xi_i}
        = e^{-rct} \sum_{n=0}^{\infty} \mathsf{E}\big[ e^{r \sum_{i=0}^{N_t} \xi_i} \mid N_t = n \big] \mathsf{P}(N_t = n)
        = e^{-rct} \sum_{n=0}^{\infty} (1 + h(r))^n \, \frac{e^{-\lambda t} (\lambda t)^n}{n!}
        = e^{-rct} \cdot e^{\lambda t h(r)} = e^{t[\lambda h(r) - cr]} = e^{t g(r)}.

Analogously, it can be shown that for any s < t

    \mathsf{E}\, e^{-r(X_t - X_s)} = e^{(t-s) g(r)}.   (7)



Let F_t^X = σ(X_s, s ≤ t). Since the process X = (X_t)_{t≥0} is a process with independent increments (Problem 2), we have (P-a.s.)

    \mathsf{E}(e^{-r(X_t - X_s)} \mid \mathscr{F}_s^X) = \mathsf{E}\, e^{-r(X_t - X_s)} = e^{(t-s) g(r)},

hence (P-a.s.)

    \mathsf{E}(e^{-r X_t - t g(r)} \mid \mathscr{F}_s^X) = e^{-r X_s - s g(r)}.   (8)

Using the notation

    Z_t = e^{-r X_t - t g(r)}, \quad t \ge 0,   (9)

we see that property (8) can be rewritten in the form

    \mathsf{E}(Z_t \mid \mathscr{F}_s^X) = Z_s, \quad s \le t \quad (\mathsf{P}\text{-a.s.}).   (10)

It is natural to say, by analogy with Definition 1 in Sect. 1, that the process Z = (Z_t)_{t≥0} is a martingale (with respect to the “flow” (F_t^X)_{t≥0} of σ-algebras). Notice that in this case E|Z_t| < ∞, t ≥ 0 (cf. (1) in Sect. 1).
By analogy with Definition 3 in Sect. 1, we shall say that the random variable τ =
τ(ω) with values in [0, +∞] is a Markov time, or a random variable independent of
the future (relative to the “flow” of σ-algebras (FtX )t≥0 ) if for each t ≥ 0 the set

{τ(ω) ≤ t} ∈ FtX .

It turns out that for martingales with continuous time, which are considered now,
Theorem 1 from Sect. 2 remains valid (with self-evident changes to the notation). In
particular,
E Zt∧τ = E Z0 (11)
for any Markov time τ.
Let τ = T. Then, by virtue of (9), we find from (11) that, for any t > 0,

    e^{-ru} = \mathsf{E}\, e^{-r X_{t\wedge T} - (t\wedge T) g(r)}
            \ge \mathsf{E}[\, e^{-r X_{t\wedge T} - (t\wedge T) g(r)} \mid T \le t \,] \, \mathsf{P}(T \le t)
            = \mathsf{E}[\, e^{-r X_T - T g(r)} \mid T \le t \,] \, \mathsf{P}(T \le t)
            \ge \mathsf{E}[\, e^{-T g(r)} \mid T \le t \,] \, \mathsf{P}(T \le t) \ge \min_{0\le s\le t} e^{-s g(r)} \, \mathsf{P}(T \le t)

(the next-to-last inequality holds because X_T ≤ 0 on {T ≤ t}, so e^{−rX_T} ≥ 1). Therefore

    \mathsf{P}(T \le t) \le \frac{e^{-ru}}{\min_{0\le s\le t} e^{-s g(r)}} = e^{-ru} \max_{0\le s\le t} e^{s g(r)}.   (12)

Let us consider the function

g(r) = λh(r) − cr

in more detail. Clearly, g(0) = 0, g (0) = λμ−c < 0 (by virtue of (5)) and g (r) =
λh (r) ≥ 0. Thus, there exists a unique positive value r = R with g(R) = 0.

Note that for r > 0

    \int_0^\infty e^{rx} (1 - F(x)) \, dx = \int_0^\infty e^{rx} \Big( \int_x^\infty dF(y) \Big) dx = \int_0^\infty \Big( \int_0^y e^{rx} \, dx \Big) dF(y) = \frac{1}{r} \int_0^\infty (e^{ry} - 1) \, dF(y) = \frac{1}{r}\, h(r).

From this and λh(R) − cR = 0 we conclude that R is the (unique) root of the equation

    \frac{\lambda}{c} \int_0^\infty e^{rx} (1 - F(x)) \, dx = 1.   (13)
Let us set r = R in (12). Then we obtain, for any t > 0,

P(T ≤ t) ≤ e−Ru , (14)

whence
P(T < ∞) ≤ e−Ru . (15)
Hence we have proved the following theorem.

Theorem. Suppose that in the Cramér–Lundberg model assumptions A, B, C and


property (5) are satisfied (i.e., λμ < c). Then the ruin probabilities P(T ≤ t) and
P(T < ∞) satisfy (14) and (15), where R is the positive (and unique) root of Eq.
(13).
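For exponentially distributed claims, F(x) = 1 − e^{−x/μ}, one has h(r) = rμ/(1 − rμ) for r < 1/μ, and Eq. (13) has the closed-form root R = 1/μ − λ/c. The sketch below (all parameter values are our own illustration) finds R by bisection on g(r) = λh(r) − cr, compares it with the closed form, and then checks the bound (14) by simulating the Cramér–Lundberg model:

```python
import math, random

lam, mu, c, u = 1.0, 1.0, 1.5, 5.0        # illustrative parameters; c > lam*mu

# g(r) = lam*h(r) - c*r with h(r) = r*mu/(1 - r*mu) for exponential claims
g = lambda r: lam * r * mu / (1 - r * mu) - c * r

lo, hi = 1e-6, 0.999 / mu                 # g(lo) < 0 < g(hi); bisect for R
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
R = (lo + hi) / 2

random.seed(1)
trials, ruined = 2000, 0
for _ in range(trials):
    x = u
    for _ in range(1000):                  # claims up to a long finite horizon
        x += c * random.expovariate(lam)   # premiums accrued between claims
        x -= random.expovariate(1 / mu)    # claim of mean mu
        if x <= 0:
            ruined += 1
            break
print(round(R, 6), ruined / trials, "<=", round(math.exp(-R * u), 4))
```

With these values R = 1/3, so the Lundberg bound e^{−Ru} ≈ 0.189; the simulated ruin frequency stays well below it (for exponential claims the exact ruin probability is in fact (λμ/c)e^{−Ru}).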

4. In the foregoing proof, we used relation (11), which, as we said, follows from
a continuous-time analog of Theorem 1, Sect. 2 (on preservation of the martingale
property under random time change). The proof of this continuous-time result can be
found, for example, in [54, Sect. 3.2]. However, if we assumed that σi , i = 1, 2, . . . ,
had a (discrete) geometric (rather than an exponential) distribution (P{σi = k} =
qk−1 p, k ≥ 1), then Theorem 1 of Sect. 2 would suffice.
The derivations in this section, which appeal to the theory of random processes
with continuous time, demonstrate, in particular, how mathematical models with
continuous time arise in applied problems.

5. PROBLEMS
1. Prove that the process N = (Nt )t≥0 (under assumption A) is a process with
independent increments.
2. Prove that X = (Xt )t≥0 is also a process with independent increments.
3. Consider the Cramér–Lundberg model and obtain an analog of the foregoing
theorem assuming that the variables σi , i = 1, 2, . . ., have a geometric (rather
than exponential) distribution (P(σi = k) = qk−1 p, k = 1, 2, . . .).

11. Fundamental Theorems of Stochastic Financial


Mathematics: The Martingale Characterization
of the Absence of Arbitrage

1. In the previous section we applied the martingale theory to the proof of the
Cramér–Lundberg theorem, which is a basic result of the mathematical theory of
insurance. In this section the martingale theory will be applied to the problem of ab-
sence of arbitrage in a financial market in the situation of stochastic indeterminacy.
In what follows, Theorems 1 and 2, which are called the fundamental theorems of
arbitrage theory in stochastic financial mathematics, are of particular interest be-
cause they state conditions for the absence of arbitrage in martingale terms (in a
sense to be explained later) in the markets under consideration as well as conditions
that guarantee the possibility of meeting financial obligations. (For a more detailed
exposition of the financial mathematics, see [71].)
2. Let us give some definitions. It will be assumed throughout that we are given a
filtered probability space (Ω, F , (Fn )n≥0 , P), which describes the stochastic inde-
terminacy of the evolution of prices, financial indexes, and other financial indicators.
The totality of events in F_n will be interpreted as the information available at time n (inclusive); for example, F_n may comprise information about the particular values of some financial assets or financial indexes.
The main object of the fundamental theorems will be the concept of a (B, S)-
market, defined as follows.
Let B = (Bn )n≥0 and S = (Sn )n≥0 be positive random sequences. It is assumed
that Bn for every n ≥ 0 is Fn−1 -measurable, whereas Sn is Fn -measurable. For
simplicity, we assume that the initial σ-algebra F0 is trivial, i.e., F0 = {∅, Ω}
(Sect. 2, Chap. 2, Vol. 1). Therefore B0 and S0 are constants. In the terminology of
Sect. 1, B = (Bn )n≥0 and S = (Sn )n≥0 are stochastic sequences, and moreover, the
sequence B = (Bn )n≥0 is predictable (since Bn are Fn−1 -measurable).
The financial meaning of B = (Bn )n≥0 is that it describes the evolution of a bank
account with initial value B0 . The fact that Bn is Fn−1 -measurable means that the
state of the bank account at time n (say, “today”) becomes already known at time
n − 1 (“yesterday”).
If we let

    r_n = \frac{\Delta B_n}{B_{n-1}}, \quad n \ge 1,   (1)

with ΔB_n = B_n − B_{n−1}, then we obviously get

    B_n = (1 + r_n) B_{n-1}, \quad n \ge 1,   (2)

where rn are Fn−1 -measurable and satisfy rn > −1 (since Bn > 0 by assumption).
In the financial literature rn are called the (bank) interest rates.

The sequence S = (S_n)_{n≥0} differs from B = (B_n)_{n≥0} in that S_n is F_n-measurable, in contrast to the F_{n−1}-measurability of B_n. This reflects the situation with stock prices, whose actual value at time n becomes known only when it is announced (i.e., “today” rather than “yesterday” as for a bank account).
Similarly to the bank interest rate, we can define the market interest rate

    \rho_n = \frac{\Delta S_n}{S_{n-1}}, \quad n \ge 1,   (3)

for the stock S = (S_n)_{n≥0}. Clearly, then

    S_n = (1 + \rho_n) S_{n-1},   (4)

with ρ_n > −1, since all S_n > 0 (by assumption).
It follows from (2) and (4) that

    B_n = B_0 \prod_{k=1}^{n} (1 + r_k),   (5)

    S_n = S_0 \prod_{k=1}^{n} (1 + \rho_k).   (6)

By definition, the pair of processes B = (Bn )n≥0 and S = (Sn )n≥0 introduced
in the foregoing form a financial (B, S)-market consisting of two assets, the bank
account B and the stock S.

Remark. It is clear that this (B, S)-market is merely a simple model of real financial
markets, which usually consist of many assets of a diverse nature (e.g., [71]). Nev-
ertheless, even this simple example demonstrates that the methods of martingale
theory are very efficient in the treatment of many issues of a financial and eco-
nomic nature (including, for example, the question about the absence of arbitrage in
a (B, S)-market, which will be solved by the first fundamental theorem.)

3. Now we provide a definition of an investment portfolio and its value and define
the important notion of a self-financing portfolio.
Let (Ω, F , (Fn )n≥0 , P) be a basic filtered probability space with F0 = {∅, Ω},
and let π = (β, γ) be a pair of predictable sequences β = (βn )n≥0 , γ = (γn )n≥0 . We
impose no other restrictions on βn and γn , n ≥ 0, except that they are predictable,
i.e., Fn−1 -measurable (F−1 = F0 ). In particular, they can take fractional and
negative values.
The meaning of βn is the amount of “units” in a bank account, and that of γn is
the amount of shares in an investor’s possession at time n.
We will call π = (β, γ) the investment portfolio in the (B, S)-market under con-
sideration.
We associate with each portfolio π = (β, γ) the corresponding value X π =
π
(Xn )n≥0 by setting
Xnπ = βn Bn + γn Sn (7)

and interpreting βn Bn as the amount of money in the bank account and γn Sn as the
total price of the stock at time n. The intuitive meaning of the predictability of β and
γ is also clear: the investment portfolio “for tomorrow” must be composed “today.”
The following important notion of a self-financing portfolio expresses the idea of
considering the (B, S)-markets that admit neither outflow nor inflow of capital. The
formal definition is as follows.
Using the formula of discrete differentiation (Δ(an bn) = an Δbn + bn−1 Δan), we find that the increment ΔXnπ (= Xnπ − Xn−1π) of the value is representable as

    ΔXnπ = [βn ΔBn + γn ΔSn] + [Bn−1 Δβn + Sn−1 Δγn].   (8)

The real change of the value may be caused only by market-based changes in
the bank account and the stock price, related to the quantity βn ΔBn + γn ΔSn . The
second expression on the right-hand side of (8), i.e., Bn−1 Δβn + Sn−1 Δγn, is Fn−1-measurable and cannot affect Xn−1π at time n. Therefore it must be equal to zero.
In general, the value can vary not only because of market-based changes in in-
terest rates (rn and ρn , n ≥ 1) but also due to, say, inflow of capital from outside
or outflow of capital for operating expenditures, and so on. We will not take into
account such possibilities; in addition, we will consider (in accordance with the
foregoing discussion) only portfolios π = (β, γ) satisfying the condition

ΔXnπ = βn ΔBn + γn ΔSn (9)

for all n ≥ 1.
In stochastic financial mathematics such portfolios are called self-financing.
4. It follows from (9) that a self-financing portfolio π = (β, γ) satisfies

    Xnπ = X0π + ∑_{k=1}^{n} (βk ΔBk + γk ΔSk),   (10)

and since

    Δ(Xnπ/Bn) = γn Δ(Sn/Bn),   (11)

we have

    Xnπ/Bn = X0π/B0 + ∑_{k=1}^{n} γk Δ(Sk/Bk).   (12)

Let us fix an N ≥ 1 and consider the evolution of the (B, S)-market at times
n = 0, 1, . . . , N.
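As a numerical illustration (ours, not from the text; all parameters below are made up), the self-financing recursion (9) and the discounted-value identity (12) can be checked on a simulated path of the (B, S)-market:

```python
import numpy as np

rng = np.random.default_rng(0)
N, r = 5, 0.05
B = (1 + r) ** np.arange(N + 1)                  # bank account (5), with B0 = 1
rho = rng.choice([-0.1, 0.2], size=N)            # stock returns rho_n > -1
S = 100.0 * np.concatenate([[1.0], np.cumprod(1 + rho)])   # stock prices (6)

gamma = rng.normal(size=N + 1)                   # arbitrary stock holdings gamma_n
X = np.empty(N + 1)
X[0] = 10.0                                      # initial capital X_0
for n in range(1, N + 1):
    # rebalance at time n-1 under the budget constraint X_{n-1} = beta_n B_{n-1} + gamma_n S_{n-1}
    beta_n = (X[n - 1] - gamma[n] * S[n - 1]) / B[n - 1]
    # self-financing dynamics (9)
    X[n] = X[n - 1] + beta_n * (B[n] - B[n - 1]) + gamma[n] * (S[n] - S[n - 1])

# identity (12): X_n/B_n = X_0/B_0 + sum_{k<=n} gamma_k * Delta(S_k/B_k)
rhs = X[0] / B[0] + np.concatenate([[0.0], np.cumsum(gamma[1:] * np.diff(S / B))])
assert np.allclose(X / B, rhs)
```

Here βn is recomputed from the capital available at time n − 1, so the rebalancing injects and withdraws no money, which is exactly the self-financing condition.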

Definition 1. We say that a self-financing portfolio π = (β, γ) provides an arbitrage opportunity at time N if X0π = 0, XNπ ≥ 0 (P-a.s.), and XNπ > 0 with a positive P-probability, i.e., P{XNπ > 0} > 0.

Definition 2. We say that there is no arbitrage on the (B, S)-market (at time N), or that this market is arbitrage-free, if, for any portfolio π = (β, γ) with X0π = 0 and P{XNπ ≥ 0} = 1, it holds that P{XNπ = 0} = 1, i.e., the event {XNπ > 0} may occur only with zero P-probability.
210 7 Martingales
The financial meaning of these definitions is that it is impossible to obtain any
risk-free income in an arbitrage-free market.
Clearly, the property of a (B, S)-market to be arbitrage-free, and hence to be in
a certain sense “fair” or “rational,” depends on the probabilistic properties of the
sequences B = (Bn )n≤N and S = (Sn )n≤N , as well as on the assumptions regarding
the structure of the filtered probability space (Ω, F , (Fn )n≤N , P).
Remarkably, the theory of martingales enables us to effectively state conditions
that guarantee the absence of arbitrage opportunities.
Theorem 1 (First Fundamental Theorem). Assume that stochastic indeterminacy is described by a filtered probability space (Ω, F, (Fn)n≤N, P) with F0 = {∅, Ω}, FN = F.
A (B, S)-market defined on (Ω, F, (Fn)n≤N, P) is arbitrage-free if and only if there exists a measure P̃ on (Ω, F) equivalent to P (P̃ ∼ P) such that the discounted sequence S/B = (Sn/Bn)n≤N is a martingale with respect to this measure, i.e.,

    Ẽ |Sn/Bn| < ∞,  n ≤ N,

and

    Ẽ(Sn/Bn | Fn−1) = Sn−1/Bn−1,  n ≤ N,

where Ẽ is the expectation with respect to P̃.
Remark 1. The statement of the theorem remains valid also for vector processes S = (S1, . . . , Sd) with d < ∞ [71, Chap. V, Sect. 2b].

Remark 2. For obvious reasons, the measure P̃ involved in the theorem is called the martingale measure.

Denote by M(P) = {P̃ ∼ P : S/B is a P̃-martingale} the class of measures P̃ that are equivalent to P and such that the sequence S/B = (Sn/Bn)n≤N is a martingale with respect to P̃.
We will write NA for the absence of arbitrage (no arbitrage). Using this notation, the conclusion of Theorem 1 can be written as

    NA ⇐⇒ M(P) ≠ ∅.   (13)

PROOF OF THEOREM 1. Sufficiency. Let P̃ ∈ M(P) be a martingale measure and π = (β, γ) a self-financing portfolio with X0π = β0 B0 + γ0 S0 = 0. Then (12) implies

    Xnπ/Bn = ∑_{k=1}^{n} γk Δ(Sk/Bk),  1 ≤ n ≤ N.   (14)
The sequence S/B = (Sk/Bk)k≤N is a P̃-martingale; therefore the sequence Gπ = (Gπn)0≤n≤N with Gπ0 = 0 and Gπn = ∑_{k=1}^{n} γk Δ(Sk/Bk), 1 ≤ n ≤ N, is a martingale transform. Hence the sequence (Xnπ/Bn)0≤n≤N is also a martingale transform.
When testing for arbitrage or its absence, we must consider portfolios π such that not only X0π = 0, but also XNπ ≥ 0 (P-a.s.). Since P̃ ∼ P and BN > 0 (P- and P̃-a.s.), we obtain that P̃{XNπ/BN ≥ 0} = 1.
Then, applying Theorem 3 in Sect. 1 to the martingale transform (Xnπ/Bn)0≤n≤N, we obtain that this sequence is in fact a P̃-martingale. Thus, Ẽ(XNπ/BN) = Ẽ(X0π/B0) = 0, and since P̃{XNπ/BN ≥ 0} = 1, we have P̃{XNπ/BN = 0} = 1.
Hence we see that XNπ = 0 (P̃- and hence P-a.s.) for any self-financing portfolio π with X0π = 0 and XNπ ≥ 0 (P-a.s.), which by definition means the absence of arbitrage opportunities.
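The mechanism of this argument can be seen in a small Monte Carlo sketch (ours, with illustrative parameters): under a martingale measure, the terminal gain of any predictable strategy started from zero capital has zero expectation, so there is no free lunch on average.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 200_000, 4                       # Monte Carlo paths, time horizon
a, b, r = -0.1, 0.2, 0.0                # with r = 0 we may take B_n = 1
p_mart = (r - a) / (b - a)              # martingale probability of the move 'b'

rho = np.where(rng.random((M, N)) < p_mart, b, a)
S = 100.0 * np.cumprod(1 + rho, axis=1)
S = np.hstack([np.full((M, 1), 100.0), S])       # prepend S_0 = 100

# a predictable strategy: gamma_k depends only on S_{k-1}
gamma = np.where(S[:, :-1] > 100.0, 1.0, -1.0)
X_N = np.sum(gamma * np.diff(S, axis=1), axis=1) # X_N = sum gamma_k * Delta S_k
print(abs(X_N.mean()))                           # ≈ 0 up to Monte Carlo error
```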
Necessity. We will give the proof only for the one-step model of a (B, S)-market,
i.e., for N = 1. But even this simple case will enable us to demonstrate the idea
of the proof, which consists in an explicit construction of a martingale measure
using the absence of arbitrage. We will construct this measure using the Esscher
transform (see subsequent discussion). (For the proof in the general case N ≥ 1 see
[71, Chapter V, Sect. 2d].)
Without loss of generality we can assume that B0 = B1 = 1. In the current setup,
the absence of arbitrage opportunities reduces (Problem 1) to the condition

P{ΔS1 > 0} > 0 and P{ΔS1 < 0} > 0. (15)

(We exclude the trivial case P{ΔS1 = 0} = 1.)


We must derive from this that there exists an equivalent martingale measure P̃, i.e., such that P̃ ∼ P and Ẽ|ΔS1| < ∞, Ẽ ΔS1 = 0.
This immediately follows from the following lemma, which is also of interest in
its own right for probability theory.

Lemma 1. Let (Ω, F) = (R, B(R)), and let X = X(ω) be the coordinate random variable (X(ω) = ω). Let P be a probability measure on (Ω, F) such that

    P{X > 0} > 0 and P{X < 0} > 0.   (16)

Then there exists a probability measure P̃ ∼ P on (Ω, F) such that, for any real a,

    Ẽ e^{aX} < ∞.   (17)

In particular, Ẽ|X| < ∞ and, moreover,

    Ẽ X = 0.   (18)
PROOF. Define the measure Q = Q(dx) with Q(dx) = c e^{−x²} P(dx) and normalizing constant c = (E e^{−X²})^{−1}.
For any real a, set

    φ(a) = EQ e^{aX},   (19)

where EQ is the expectation related to Q.
Let

    Za(x) = e^{ax}/φ(a).   (20)

Since Za(x) > 0 and EQ Za(X) = 1, the measure P̃a with

    P̃a(dx) = Za(x) Q(dx)   (21)

is a probability measure for any real a. Clearly, P̃a ∼ Q ∼ P.


Remark 3. The transformation x ↦ e^{ax}/φ(a) is known as the Esscher transform. As we will see later, the measure P̃ = P̃a∗ for a certain value a∗ possesses the martingale property (18). This measure is referred to as the Esscher measure or the martingale Esscher measure.

Now we return to the proof of Theorem 1. The function φ = φ(a) defined for all real a is strictly convex, since φ″(a) > 0. Let φ∗ = inf{φ(a) : a ∈ R}. The following two cases are possible: (i) there exists a∗ such that φ(a∗) = φ∗, and (ii) there is no such (finite) a∗.
In the first case, φ′(a∗) = 0. Therefore

    E_{P̃a∗} X = EQ [X e^{a∗X}/φ(a∗)] = φ′(a∗)/φ(a∗) = 0,

and we can take the measure P̃a∗ for the required measure P̃.
So far we have not used the no-arbitrage assumption (16). It is not hard to show (Problem 2) that this assumption excludes possibility (ii). Therefore there remains only the first possibility, which has already been considered.
Thus, we have proved the necessity part (which consists in the existence of a
martingale measure) for N = 1. For the general case N ≥ 1 the reader is referred,
as stated earlier, to [71, Chap. V, Sect. 2d].
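For a distribution with finitely many atoms, the two steps of the proof — passing to Q(dx) = c e^{−x²} P(dx) and minimizing φ — can be carried out numerically. A sketch (the three-point distribution of X = ΔS1 below is made up; it charges both half-lines, as (16) requires):

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])   # hypothetical support of X = Delta S_1
p = np.array([0.3, 0.5, 0.2])    # P charges both {X > 0} and {X < 0}, cf. (16)

q = p * np.exp(-x**2); q /= q.sum()             # Q(dx) = c e^{-x^2} P(dx)
phi  = lambda a: np.sum(q * np.exp(a * x))      # phi(a) = E_Q e^{aX}, cf. (19)
dphi = lambda a: np.sum(q * x * np.exp(a * x))  # phi'(a), increasing by convexity

lo, hi = -50.0, 50.0                 # phi' changes sign on this interval by (16)
for _ in range(200):                 # bisection for phi'(a*) = 0
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if dphi(mid) < 0 else (lo, mid)
a_star = (lo + hi) / 2

p_tilde = q * np.exp(a_star * x) / phi(a_star)  # Esscher measure (20)-(21)
assert abs(p_tilde.sum() - 1) < 1e-9
assert abs(np.sum(p_tilde * x)) < 1e-9          # martingale property (18)
```

Since Q has Gaussian-type tails, every exponential moment of the resulting measure is finite, which is the content of (17).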


5. Now we give some examples of arbitrage-free (B, S)-markets.

EXAMPLE 1. Suppose that the (B, S)-market is described by (5) and (6) with 1 ≤
k ≤ N, where rk = r (a constant) for all 1 ≤ k ≤ N and ρ = (ρ1 , ρ2 , . . . , ρN ) is a
sequence of independent identically distributed Bernoulli random variables taking
values a and b (a < b) with probabilities P{ρ1 = a} = q, P{ρ1 = b} = p, p + q = 1, 0 < p < 1. Moreover, assume that

    −1 < a < r < b.   (22)

This model of a (B, S)-market is known as the CRR model, after the names of its authors J. C. Cox, S. A. Ross, and M. Rubinstein; for more details see [71].
Since in this model

    Sn/Bn = ((1 + ρn)/(1 + r)) (Sn−1/Bn−1),

it is clear that the martingale measure P̃ must satisfy

    Ẽ (1 + ρn)/(1 + r) = 1,

i.e., Ẽ ρn = r.
Use the notation p̃ = P̃{ρn = b}, q̃ = P̃{ρn = a}; then for any n ≥ 1

    p̃ + q̃ = 1,  b p̃ + a q̃ = r.

Hence

    p̃ = (r − a)/(b − a),  q̃ = (b − r)/(b − a).   (23)
In this case the whole “randomness” is determined by the Bernoulli sequence
ρ = (ρ1 , ρ2 , . . . , ρN ). We let Ω = {a, b}N , i.e., we assume that the space of elemen-
tary outcomes consists of sequences (x1 , . . . , xN ) with xi = a or b. (Assuming this
specific “coordinate” structure of Ω does not restrict generality; in this connection,
see the end of the proof of sufficiency in Theorem 2, Subsection 6.)
As an exercise (Problem 3) we suggest showing that the measure P̃ defined by

    P̃(x1, . . . , xN) = p̃^{νb(x1,...,xN)} q̃^{N−νb(x1,...,xN)},   (24)

where νb(x1, . . . , xN) = ∑_{i=1}^{N} Ib(xi) (the number of xi's equal to b), is a martingale measure, and this measure is unique. It is clear from (24) that P̃{ρn = b} = p̃ and P̃{ρn = a} = q̃.
Thus, by Theorem 1, the CRR model is an example of an arbitrage-free market.
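In code, formula (23) and the martingale property Ẽρn = r amount to the following check (the values of a, b, r below are illustrative):

```python
a, b, r = -0.1, 0.2, 0.05        # any values with -1 < a < r < b, cf. (22)
p_t = (r - a) / (b - a)          # p~ = P~{rho_n = b}, formula (23)
q_t = (b - r) / (b - a)          # q~ = P~{rho_n = a}

assert 0 < p_t < 1 and 0 < q_t < 1          # P~ is equivalent to P
assert abs(p_t + q_t - 1) < 1e-12
assert abs(b * p_t + a * q_t - r) < 1e-12   # E~ rho_n = r
```

Condition (22) is precisely what makes both probabilities strictly positive, i.e., P̃ ∼ P.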

EXAMPLE 2. Let the (B, S)-market have the structure Bn = 1 for all n = 0, 1, . . . , N and

    Sn = S0 exp(∑_{k=1}^{n} ρ̂k),  1 ≤ n ≤ N.   (25)

Let ρ̂k = μk + σk εk, where μk and σk > 0 are Fk−1-measurable and (ε1, . . . , εN) are independent standard Gaussian random variables, εk ∼ N(0, 1).
We will construct the required Esscher measure P̃ (on (Ω, FN)) by means of the conditional Esscher transform. That is, let P̃(dω) = ZN(ω) P(dω), where ZN(ω) = ∏_{1≤k≤N} zk(ω) with

    zk(ω) = e^{ak ρ̂k} / E(e^{ak ρ̂k} | Fk−1)   (26)

(F0 = {∅, Ω}) and where the Fk−1-measurable random variables ak = ak(ω) are to be chosen so that the sequence (Sn)0≤n≤N is a P̃-martingale.
In view of (25), P̃ is a martingale measure if and only if

    E[e^{(an+1)ρ̂n} | Fn−1] = E[e^{an ρ̂n} | Fn−1],  1 ≤ n ≤ N   (27)

(with respect to the initial measure P).


Since ρ̂n = μn + σn εn, we find from (27) that an must be chosen so that

    μn + σn²/2 = −an σn²,

i.e.,

    an = −μn/σn² − 1/2.

With this choice of an, 1 ≤ n ≤ N, the density ZN(ω) is given by the formula

    ZN(ω) = exp{−∑_{n=1}^{N} [(μn/σn + σn/2) εn + (1/2)(μn/σn + σn/2)²]}.   (28)

If μn = −σn²/2 for all 1 ≤ n ≤ N from the outset, then P̃ = P. In other words, in this case the initial measure P itself is a martingale measure.
Thus the (B, S)-market with B = (Bn)0≤n≤N such that Bn ≡ 1 and S = (Sn)0≤n≤N as specified by (25) is, as in Example 1, arbitrage-free. In Problem 4, we propose to examine whether the martingale measure P̃ constructed earlier is unique.
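The choice of an can be verified with the Gaussian moment generating function E e^{tρ̂n} = exp(tμn + t²σn²/2); condition (27) then reduces to μn + (2an + 1)σn²/2 = 0. A quick check (the values of μn and σn are made up):

```python
import math

mu, sigma = 0.03, 0.2              # illustrative F_{k-1}-measurable values
a = -mu / sigma**2 - 0.5           # a_n from the text

mgf = lambda t: math.exp(t * mu + t**2 * sigma**2 / 2)
# condition (27): E e^{(a+1) rho} = E e^{a rho}
assert abs(mgf(a + 1) - mgf(a)) < 1e-12
```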
6. The notion of a complete (B, S)-market to be introduced in what follows is of
great interest to stochastic financial mathematics because (irrespective of whether
or not the market is arbitrage-free) it is related to the natural question of whether,
for a given FN -measurable contingent claim fN , there is a self-financing portfolio π
such that the corresponding capital XNπ “offsets” (or is at least no less than) fN .
Definition 3. A (B, S)-market is said to be complete (relative to time instant N) or
N-complete if any bounded FN -measurable contingent claim fN is replicable, i.e.,
there exists a self-financing portfolio π such that XNπ = fN (P-a.s.).
Theorem 2 (Second Fundamental Theorem). Similarly to Theorem 1, we assume that (Ω, F, (Fn)0≤n≤N, P) is a filtered probability space, F0 = {∅, Ω}, FN = F, and the (B, S)-market defined on this space is arbitrage-free (M(P) ≠ ∅). Then this market is complete if and only if there exists a unique equivalent martingale measure (|M(P)| = 1).

PROOF. Necessity. Let the market at hand be complete. This means that for any
FN -measurable contingent claim fN there is a self-financing portfolio π = (β, γ)
such that XNπ = fN (P-a.s.). Without loss of generality we may assume that Bn = 1,
0 ≤ n ≤ N. Hence we see from (10) that

    fN = XNπ = X0π + ∑_{k=1}^{N} γk ΔSk.   (29)

Since the market is arbitrage-free by assumption, the set of martingale measures is nonempty, M(P) ≠ ∅. We will show that the completeness assumption implies the uniqueness of the martingale measure (|M(P)| = 1).
Let P1 and P2 be two martingale measures. Then (∑_{k=1}^{n} γk ΔSk)1≤n≤N is a martingale transform with respect to either of these measures.
Take a set A ∈ FN, and let fN(ω) = IA(ω). Since for some π

    IA(ω) = XNπ = X0π + ∑_{k=1}^{N} γk ΔSk   (P-a.s.),

we conclude from Theorem 3 in Sect. 1 that the sequence (∑_{k=1}^{n} γk ΔSk)1≤n≤N is a martingale with respect to either of the measures P1 and P2. Therefore

    E_{Pi} IA(ω) = x,  i = 1, 2,   (30)

where E_{Pi} is the expectation with respect to Pi and x = X0π, which is a constant since F0 = {∅, Ω}. Now (30) implies that P1(A) = P2(A) for any set A ∈ FN. Hence the uniqueness of the martingale measure is established.
The proof of sufficiency is more complicated and will be carried out in several steps. We consider an arbitrage-free (B, S)-market (M(P) ≠ ∅) such that the martingale measure is unique (|M(P)| = 1).
It is worth noting that both assumptions of the uniqueness of the martingale mea-
sure and the completeness of the market are strong restrictions. What is more, it
turns out that these assumptions imply that the trajectories S = (Sn )0≤n≤N are
“conditionally two-pointed,” which will be explained subsequently. (This may be
exemplified by the CRR model ΔSn = ρn Sn−1 , where ρn takes only two values, so
that the conditional probabilities P(ΔSn ∈ · | Fn−1 ) “sit” on two points, aSn−1 and
bSn−1 .)
The uniqueness of the martingale measure (| M(P)| = 1) also imposes restric-
tions on the structure of the filtration (Fn )n≤N . Under this condition the σ-algebras
Fn must be generated by the prices S0 , S1 , . . . , Sn (assuming that Bk ≡ 1, k ≤ n).
In this regard, see the diagram on p. 610 of [71] and Chap. V, Sect. 4e, therein.
As an intermediate result for establishing the implication “|M(P)| = 1 ⇒ completeness” we will prove the following useful lemma, which provides an equivalent characterization of completeness of an arbitrage-free market.

Lemma 2. An arbitrage-free (B, S)-market is complete if and only if there exists a measure P̃ in the set M(P) of all martingale measures such that any bounded martingale m = (mn, Fn, P̃)0≤n≤N admits an “S/B-representation”:

    mn = m0 + ∑_{k=1}^{n} γk∗ Δ(Sk/Bk)   (31)

with predictable γk∗, 1 ≤ k ≤ n.

PROOF. We consider an arbitrage-free complete (B, S)-market. (Without loss of generality, assume that Bn = 1, 0 ≤ n ≤ N.)
Take an arbitrary measure P̃ ∈ M(P), and let m = (mn, Fn, P̃)0≤n≤N be a bounded martingale (|mn| ≤ c, 0 ≤ n ≤ N). Set fN = mN. Then, by the definition of completeness (Definition 3), there is a portfolio π∗ = (β∗, γ∗) such that XNπ∗ = fN and for all 0 ≤ n ≤ N

    Xnπ∗ = x + ∑_{k=1}^{n} γk∗ ΔSk,   (32)

with x = X0π∗.
∗ ∗ ∗
Since XNπ = fN ≤ c, the sequence X π = (Xnπ , Fn , P) 8 0≤n≤N is a martingale

(Theorem 3, Sect. 1). Thus, we have two martingales, m and X π , with the same ter-

minal value fN (XNπ = mN = fN ). But by the definition of the martingale property,
∗ ∗
mn = E(mN | Fn ) and Xnπ = E(XNπ | Fn ), 0 ≤ n ≤ N. Therefore the Lévy martin-

gales m and X π are the same, and by (32) the martingale m = (mn , Fn , P)8 0≤n≤N
admits the “S-representation”


n
mn = x + γk∗ ΔSk , 1 ≤ n ≤ N, (33)
k=1

with x = m0 .
Let us now prove the reverse statement (S-representation ⇒ completeness). By assumption, there exists a measure P̃ ∈ M(P) such that any bounded P̃-martingale admits an S-representation.
Take for such a martingale X = (Xn, Fn, P̃)0≤n≤N the martingale with Xn = Ẽ(fN | Fn), where Ẽ is the expectation with respect to P̃ and fN is the contingent claim involved in Definition 3, for which we must find a self-financing portfolio π such that XNπ = fN (P̃- and P-a.s.).
For the (bounded) martingale X = (Xn, Fn, P̃)0≤n≤N consider its S-representation

    Xn = X0 + ∑_{k=1}^{n} γk ΔSk   (34)

with some Fk−1-measurable variables γk.
Let us show that this implies the existence of a self-financing portfolio π̃ = (β̃, γ̃) such that Xnπ̃ = Xn for all 0 ≤ n ≤ N and, in particular, fN = XN = XNπ̃ admits the representation

    fN = X0π̃ + ∑_{k=1}^{N} γ̃k ΔSk,   (35)

as required in Definition 3.
Using representation (34), set γ̃n = γn and define

    β̃n = Xn − γn Sn.   (36)

Then (34) implies that the β̃n are Fn−1-measurable. Moreover,

    Sn−1 Δγ̃n + Δβ̃n = Sn−1 Δγn + ΔXn − Δ(γn Sn) = Sn−1 Δγn + γn ΔSn − Δ(γn Sn) = 0.

Thus, according to Subsection 3, the portfolio π̃ = (β̃, γ̃) so constructed is self-financing and XNπ̃ = fN, i.e., the completeness property is fulfilled.


With this lemma, we see that to complete the proof of the theorem, we must establish the implication {3} in the following chain of implications:

               {3}                     {2}                   {1}
    |M(P)| = 1  =⇒  S-representation  ⇐⇒  completeness  =⇒  |M(P)| = 1.

(Implication {1} was established in the proof of necessity and implication {2} in the foregoing lemma.)
To make the proof of {3} more transparent, we will consider the particular case
of a (B, S)-market described by the CRR model.
As was pointed out earlier (Example 1), in this model the martingale measure P̃ is unique (|M(P)| = 1). So we need to understand why in this case the S-representation (with respect to the martingale measure P̃) holds. We have already indicated that the key reason for that is the fact that the ρn in (4) take only two values, a and b, and therefore the conditional distributions P(ΔSn ∈ · | Fn−1) are two-pointed.
Thus we will consider the CRR model introduced in Example 1 and assume additionally that Fn = σ(ρ1, . . . , ρn) for 1 ≤ n ≤ N and F0 = {∅, Ω}. Let P̃ denote the martingale measure on (Ω, FN) defined by (24).
Let X = (Xn, Fn, P̃)0≤n≤N be a bounded martingale. Then there are functions gn = gn(x1, . . . , xn) such that Xn(ω) = gn(ρ1(ω), . . . , ρn(ω)), so that

    ΔXn = gn(ρ1, . . . , ρn) − gn−1(ρ1, . . . , ρn−1).

Since Ẽ(ΔXn | Fn−1) = 0, we have

    p̃ gn(ρ1, . . . , ρn−1, b) + q̃ gn(ρ1, . . . , ρn−1, a) = gn−1(ρ1, . . . , ρn−1),

i.e.,

    [gn(ρ1, . . . , ρn−1, b) − gn−1(ρ1, . . . , ρn−1)]/q̃ = [gn−1(ρ1, . . . , ρn−1) − gn(ρ1, . . . , ρn−1, a)]/p̃.   (37)

Since p̃ = (r − a)/(b − a), q̃ = (b − r)/(b − a), we find from (37) that

    [gn(ρ1, . . . , ρn−1, b) − gn−1(ρ1, . . . , ρn−1)]/(b − r) = [gn(ρ1, . . . , ρn−1, a) − gn−1(ρ1, . . . , ρn−1)]/(a − r).   (38)
Let μn({a}; ω) = I(ρn(ω) = a), μn({b}; ω) = I(ρn(ω) = b), and let

    Wn(ω, x) = gn(ρ1(ω), . . . , ρn−1(ω), x) − gn−1(ρ1(ω), . . . , ρn−1(ω)),
    Wn∗(ω, x) = Wn(ω, x)/(x − r).

Using this notation we obtain

    ΔXn(ω) = Wn(ω, ρn(ω)) = ∫ Wn(ω, x) μn(dx; ω) = ∫ (x − r) Wn∗(ω, x) μn(dx; ω).

By (38) the functions Wn∗(ω, x) do not depend on x. Therefore, denoting the expression on the left-hand side (or, equivalently, on the right-hand side) of (38) by γn∗(ω), we find that

    ΔXn(ω) = γn∗(ω)(ρn(ω) − r).   (39)
Therefore

    Xn(ω) = X0(ω) + ∑_{k=1}^{n} γk∗(ω)(ρk(ω) − r).   (40)

It is easily seen that

    Δ(Sn/Bn) = (Sn−1/Bn−1) · (ρn − r)/(1 + r).
Hence

    ρn − r = (1 + r)(Bn−1/Sn−1) Δ(Sn/Bn),

and consequently we see from (40) that

    Xn(ω) = X0(ω) + ∑_{k=1}^{n} γk(ω) Δ(Sk(ω)/Bk),   (41)

where

    γk(ω) = γk∗(ω)(1 + r) Bk−1/Sk−1.

The sequence S/B = (Sn/Bn)0≤n≤N is a martingale with respect to P̃. Thus, (41) is simply the “S/B-representation” for X with respect to the (basic) P̃-martingale S/B.
The key argument in the proof of {3} for the CRR model (where |M(P)| = 1) was the fact that the ρn take on only two values. However, it turns out that the uniqueness assumption on the martingale measure P̃ is so strong that in the general case it also implies that the variables ρn = ΔSn/Sn−1 are “two-pointed,” i.e., there exist predictable an = an(ω) and bn = bn(ω) such that

    P̃(ρn = an | Fn−1) + P̃(ρn = bn | Fn−1) = 1  (P-a.s.).   (42)

Taking this property for granted, the foregoing proof of the S/B-representation in the CRR model will “work” also in the general case. Thus, all that remains is to establish (42). We leave obtaining this result to the reader (Problem 5). Nevertheless, we give some heuristic arguments showing how the uniqueness of the martingale measure leads to two-pointed conditional distributions.
Let Q = Q(dx) be a probability distribution on (R, B(R)) and ξ = ξ(x) the coordinate random variable (ξ(x) = x). Let EQ |ξ| < ∞, EQ ξ = 0 (“martingale property”), and let the measure Q have the property that for any other measure Q̃ ∼ Q such that E_{Q̃}|ξ| < ∞ and E_{Q̃} ξ = 0, it holds that Q̃ = Q (“uniqueness of the martingale measure”).
We assert that in this case Q is supported on at most two points (a ≤ 0 and b ≥ 0)
that may stick together as one zero point (a = b = 0).
The aforementioned heuristic arguments, which make this assertion very likely,
are as follows.
Suppose that Q is supported on three points, x− ≤ x0 ≤ x+ , with masses
q− , q0 , q+ , respectively. The condition EQ ξ = 0 means that

q− x− + q0 x0 + q+ x+ = 0.

If x0 = 0, then q− x− + q+ x+ = 0.
Let

    q̃− = q−/2,  q̃0 = 1/2 + q0/2,  q̃+ = q+/2,   (43)

i.e., we move some parts of the masses q− and q+ from the points x− and x+ to x0.
It is seen from (43) that the corresponding measure Q̃ ∼ Q and E_{Q̃} ξ = 0, although Q̃ ≠ Q. But this contradicts the uniqueness assumption on the measure Q such that EQ ξ = 0.
Therefore the measure Q cannot be supported at three points (x−, x0, x+) with x0 = 0. In a similar way, utilizing the same idea of “moving masses,” the case x0 ≠ 0 is treated. (For more details, see [71, Chap. 5, Sect. 4e].)
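The mass-moving step (43) for the case x0 = 0 is easy to make concrete (the three masses below are made up):

```python
x = [-1.0, 0.0, 2.0]                        # three-point support with x0 = 0
q = [0.4, 0.4, 0.2]                         # E_Q xi = -0.4 + 0 + 0.4 = 0
qt = [q[0] / 2, 0.5 + q[1] / 2, q[2] / 2]   # the shifted masses (43)

mean = lambda w: sum(wi * xi for wi, xi in zip(w, x))
assert abs(mean(q)) < 1e-12 and abs(mean(qt)) < 1e-12   # both have mean zero
assert all(w > 0 for w in qt) and qt != q               # Q~ ~ Q, yet Q~ != Q
```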
7. Problems.
1. Show that for N = 1 the no-arbitrage condition is equivalent to inequalities (15).
(It is assumed that P{ΔS1 = 0} < 1.)
2. Show that possibility (ii) in the proof of Lemma 1 (Subsection 4) is excluded
by conditions (16).
3. Prove that the measure P̃ in Example 1 (Subsection 5) is a martingale measure, which is unique in the class M(P).
4. Explore the problem of uniqueness of the martingale measure constructed in
Example 2 (Subsection 5).
5. Prove that the assumption |M(P)| = 1 in the (B, S)-model implies the “conditional two-pointedness” of the distribution of Sn/Bn, 1 ≤ n ≤ N.

12. Hedging in Arbitrage-Free Models

1. Hedging is one of the basic methods of the dynamic control of investment port-
folios. We will set out some basic notions and results related to this method consid-
ering as an example the pricing of so-called option contracts (or simply options).

Options (as instruments of financial engineering), being derivative securities, are fairly risky. But at the same time they (along with other securities, e.g., forwards)
are successfully used not only for earning profit due to market price fluctuations but
also for protection (hedging) against unexpected changes in stock prices.
An option is a security (contract) issued by a financial institution that gives its
holder the right to buy or sell something valuable (e.g., a share, a bond, currency) at
a certain period or instant of time on specified terms.
Whereas an option gives the right to buy or sell something, the other financial
instrument, a forward contract (or a forward), is a commitment to buy or sell some-
thing of value at a certain time in the future at a price fixed at the moment of signing
the deal.
One of the main questions regarding option pricing concerns the price at which
the options are to be sold. Clearly, the seller wants to charge as much as possible,
while the buyer wants to pay as little as possible. What is the fair, rational price,
acceptable to both buyer and seller?
Naturally, this fair price must be “reasonable.” That is, the buyer must realize that
a lower price for the option may put the seller in a position where he is unable to
meet the obligations fixed by the agreement because of insufficient payment.

At the same time, the amount of this payment should not give the seller arbitrage
possibilities of a “free lunch” type, i.e., the chance to earn a risk-free profit.
Before defining what the fair price of an option should mean, we give the com-
monly accepted classification of options.
2. We will consider a (B, S)-market, B = (Bn )0≤n≤N , S = (Sn )0≤n≤N , operat-
ing at time instants n = 0, 1, . . . , N and defined on a filtered probability space
(Ω, F , (Fn )0≤n≤N , P) with F0 = {∅, Ω} and FN = F .
We will consider options written on stock with prices described by the random
sequence S = (Sn )0≤n≤N .
With regard to the time of their exercise, options are of two types: European and
American.
If an option can be exercised only at the time instant N fixed in the contract, then
N is called the time of its exercise, and this option is said to be of the European type.

Alternatively, if an option can be exercised at any Markov time (or stopping


time; see Definition 3 in Sect. 1) τ = τ(ω), taking values in the range {0, 1, . . . , N}
specified by the contract, then this option is of the American type.
According to the generally adopted terminology, there are two types of options:
1. buyer’s options (call options) and
2. seller’s options (put options).
The difference between these types is that call options grant the right to buy,
while put options grant the right to sell.
For definiteness, we consider examples of standard options of the European type.
These options are characterized by two constants: N, the time of exercise, and K,
the price (fixed by the contract) at which a certain asset (say, a share) can be bought
(buyer’s options) or for which it can be sold (seller’s options).
In the case of a buyer’s option, the buyer buys at time 0 from the seller an option
at price C. This option stipulates that the buyer can buy from the seller at time N the
share for price K. Let S0 and SN be the market prices of the share at times 0 and N.
If SN > K, then the buyer can buy the share for price K and sell it right away at the market price SN, thereby earning a profit SN − K. Otherwise, if SN < K, there is no sense in exercising the right to buy for price K, since the buyer can buy the share at the lower market price SN.
Thus, combining these two cases, we find that for buyer’s options, the buyer at
time N earns the profit
fN = (SN − K)+ , (1)
where a+ = max(a, 0). The buyer’s net return is equal to this quantity minus the
amount C that he has paid to the seller at time 0 (the negative value −C in the case
SN < K indicates a loss C).
In a similar way, the profit of the buyer of a put option is given by the formula

fN = (K − SN )+ . (2)

3. When defining the fair price in an arbitrage-free (B, S)-market we must distin-
guish between two cases, complete and incomplete markets.
Definition 1. Let a (B, S)-market be arbitrage-free and complete. The fair price of an option of the European type with an FN-measurable bounded (nonnegative) contingent claim fN is the price of perfect hedging,

    C(fN; P) = inf{x : ∃ π with X0π = x and XNπ = fN (P-a.s.)}.   (3)

A portfolio π is called a hedge of the contingent claim fN if XNπ ≥ fN with probability 1.
It follows from the results of Sect. 11 that in the case of complete arbitrage-free markets, for any bounded contingent claim there exists a perfect hedge π, i.e., one with XNπ = fN (P-a.s.). This is why in definition (3) we consider a (nonempty) class of portfolios with the property XNπ = fN (P-a.s.).
The following definition is natural for incomplete arbitrage-free markets.
Definition 2. Let a (B, S)-market be arbitrage-free. The fair price of an option of
the European type with FN -measurable bounded (nonnegative) contingent claim fN
is the superhedging price

C(fN ; P) = inf{x : ∃π with X0π = x and XNπ ≥ fN (P -a.s.)}. (4)

Note that this definition is correct, i.e., for any bounded function fN there always
exists a portfolio π with some initial capital x such that XNπ ≥ fN (P-a.s.).
4. Now we give a formula for the price C(fN ; P). We will prove it for complete
markets and refer the reader to specialized literature for incomplete markets (e.g.,
[71, Chap. VI, Sect. 1c]).
Theorem 1. (i) For a complete arbitrage-free (B, S)-market, the fair price of a European-type option with a contingent claim fN is

    C(fN; P) = B0 E_P̃ (fN/BN),   (5)

where E_P̃ is the expectation with respect to the (unique) martingale measure P̃.
(ii) For a general arbitrage-free (B, S)-market, the fair price of a European-type option with a contingent claim fN is

    C(fN; P) = sup_{P̃∈M(P)} B0 E_P̃ (fN/BN),   (6)

where the sup is taken over the set of all martingale measures M(P).
PROOF. (i) Let π be a perfect hedge with X0π = x and XNπ = fN (P-a.s.). Then (see (12) in Sect. 11)

    fN/BN = XNπ/BN = x/B0 + ∑_{k=1}^{N} γk Δ(Sk/Bk).   (7)
Hence, by Theorem 3 from Sect. 1,

    E_P̃ (fN/BN) = x/B0,   (8)

since the martingale transform (x/B0 + ∑_{k=1}^{n} γk Δ(Sk/Bk))1≤n≤N is such that at the terminal time N

    x/B0 + ∑_{k=1}^{N} γk Δ(Sk/Bk) = fN/BN ≥ 0.   (9)

Note that the left-hand side of (8) does not depend on the structure of a particular hedge π with initial value X0π = x. If we take another hedge π′ with initial value X0π′, then, according to (8), this initial value is again equal to B0 E_P̃ (fN/BN). Hence it is clear that the initial value x is the same for all perfect hedges, which proves (5).
(ii) Here we only prove the inequality

    sup_{P̃∈M(P)} B0 E_P̃ (fN/BN) ≤ C(fN; P).   (10)

(The proof of the reverse inequality relies on the so-called optional decomposition, which goes beyond the scope of this book; see, e.g., [71, Chap. VI, Sects. 1c and 2d].)
Suppose that the hedge π is such that X0π = x and XNπ ≥ fN (P-a.s.). Then (7) implies that

    x/B0 + ∑_{k=1}^{N} γk Δ(Sk/Bk) ≥ fN/BN ≥ 0.

Therefore, for any measure P̃ ∈ M(P),

    B0 E_P̃ (fN/BN) ≤ x

(cf. (8) and (9)). Hence, taking the supremum on the left-hand side over all measures P̃ ∈ M(P), we arrive at the required inequality (10).


5. Now we consider some definitions and results related to options of the American
type. For these options we must assume that we are given not a single contingent
claim fN related to time N, but a collection of claims f0 , f1 , . . . , fN whose meaning is
that once the buyer exercises the option at time n, the payoff (by the option seller to
the buyer) is determined by the (Fn-measurable) function fn = fn(ω).
If the buyer of an option decides to exercise the option at time τ = τ(ω), which
is a Markov time with values in {0, 1, . . . , N}, then the payoff is fτ(ω) (ω). Therefore
the seller of the option when composing his portfolio π must envisage that for any τ
the following hedging condition must hold: Xτπ ≥ fτ (P-a.s.).

This explains the expedience of the following definition.

Definition 3. Let a (B, S)-market be arbitrage-free. The fair price of an option of the American type with the system f = (fn)0≤n≤N of Fn-measurable nonnegative payoff functions fn is the upper superhedging price, i.e., the price

    C(f; P) = inf{x : ∃ π with X0π = x and Xnπ ≥ fn (P-a.s.), 0 ≤ n ≤ N}.   (11)

We state (without proof) an analog of Theorem 1 for American-type options.

Theorem 2. (i) For a complete arbitrage-free (B, S)-market, the fair price of an American-type option with a system of payoff functions f = (fn)0≤n≤N is given by

    C(f; P) = sup_{τ∈M0^N} B0 E_P̃ (fτ/Bτ),   (12)

where M0^N = {τ : τ ≤ N} is the class of stopping times (with respect to (Fn)0≤n≤N) and P̃ is the unique martingale measure.
(ii) In the general case of an (incomplete) arbitrage-free (B, S)-market, the fair price of an American-type option with a system of payoff functions f = (fn)0≤n≤N is given by

    C(f; P) = sup_{τ∈M0^N, P̃∈M(P)} B0 E_P̃ (fτ/Bτ),   (13)

where M(P) is the set of martingale measures P̃.

For the proof, see [71, Chap. VI, Sect. 2c].


6. The foregoing theorems enable one to determine the fair price of an option. Another important question is how the seller of an option should compose the hedging portfolio π* having received the premium C(f_N; P) or C(f; P).
For simplicity, we restrict ourselves to the case of a complete (B, S)-market of European-type options.

Theorem 3. Consider an arbitrage-free complete (B, S)-market. There exists a self-financing portfolio π* = (β*, γ*) with initial capital X_0^{π*} = C(f_N; P) implementing the perfect hedging of the terminal payoff f_N:

X_N^{π*} = f_N (P-a.s.).

The dynamics of the capital X_n^{π*} = β_n* B_n + γ_n* S_n, 0 ≤ n ≤ N, is determined by

$$X_n^{\pi^*} = B_n\,\mathsf E_{\tilde{\mathsf P}}\!\left(\frac{f_N}{B_N}\,\Big|\,\mathscr F_n\right). \qquad (14)$$

The component γ* = (γ_n*)_{0≤n≤N} of the hedge π* = (β*, γ*) is obtained from X^{π*} = (X_n^{π*})_{0≤n≤N} by the formula

$$\Delta\!\left(\frac{X_n^{\pi^*}}{B_n}\right) = \gamma_n^*\,\Delta\!\left(\frac{S_n}{B_n}\right) \qquad (15)$$

and the component β* = (β_n*)_{0≤n≤N} by the formula

$$X_n^{\pi^*} = \beta_n^* B_n + \gamma_n^* S_n. \qquad (16)$$

The proof follows directly from the proof of the implication “completeness” ⇒ “S/B-representation” in Lemma 2, Sect. 11, applied to the martingale m = (m_n)_{0≤n≤N} with m_n = E_P̃(f_N/B_N | F_n).
7. As an example of actual option pricing consider a (B, S)-market described by the CRR model,

$$B_n = B_{n-1}(1+r), \qquad S_n = S_{n-1}(1+\rho_n), \qquad (17)$$

where ρ_1, ..., ρ_N are independent identically distributed random variables taking two values, a and b, with −1 < a < r < b.
This market is arbitrage-free and complete (Problem 3, Sect. 11) with martingale measure P̃ such that P̃{ρ_n = b} = p̃, P̃{ρ_n = a} = q̃, where

$$\tilde p = \frac{r-a}{b-a}, \qquad \tilde q = \frac{b-r}{b-a}. \qquad (18)$$

(See Example 1 in Sect. 11, Subsection 5.)
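As a quick sanity check, the defining property of the measure in (18) can be verified numerically (a sketch with illustrative parameter values, not from the text):

```python
# Check that p~, q~ from (18) make the discounted stock a P~-martingale.
a, b, r = -0.1, 0.2, 0.05          # illustrative values with -1 < a < r < b
pt = (r - a) / (b - a)             # p~ = P~{rho_n = b}
qt = (b - r) / (b - a)             # q~ = P~{rho_n = a}

assert 0 < pt < 1 and 0 < qt < 1            # requires -1 < a < r < b
assert abs(pt + qt - 1.0) < 1e-12           # a probability measure
assert abs(pt * b + qt * a - r) < 1e-12     # E_P~ rho_n = r, hence
assert abs(pt * (1 + b) + qt * (1 + a) - (1 + r)) < 1e-12
# so E_P~[S_n/B_n | F_{n-1}] = (S_{n-1}/B_{n-1}) * E_P~(1 + rho_n)/(1 + r)
#                            = S_{n-1}/B_{n-1}: S/B is a P~-martingale.
```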
By formula (5) of Theorem 1, the fair price for this (B, S)-market is

$$\mathrm C(f_N;\mathsf P) = \mathsf E_{\tilde{\mathsf P}}\,\frac{f_N}{(1+r)^N}. \qquad (19)$$

And, according to Theorem 3, to compute the perfect hedging portfolio π* = (β*, γ*), we must first calculate

$$X_n^{\pi^*} = \mathsf E_{\tilde{\mathsf P}}\!\left(\frac{f_N}{(1+r)^N}\,\Big|\,\mathscr F_n\right) \qquad (20)$$

(with F_n = σ(ρ_1, ..., ρ_n), 1 ≤ n ≤ N, and F_0 = {∅, Ω}) and then find γ_n* and β_n* by (15) and (16).
Since X_0^{π*} = C(f_N; P), the problem amounts to finding the conditional expectations on the right-hand side of (20) for n = 0, 1, ..., N.
We will assume that the F_N-measurable function f_N has a “Markov” structure, i.e., f_N = f(S_N), where f = f(x) is a nonnegative function of x ≥ 0.
Use the notation

$$F_n(x;p) = \sum_{k=0}^{n} f\bigl(x(1+b)^k(1+a)^{n-k}\bigr)\, C_n^k\, p^k (1-p)^{n-k}. \qquad (21)$$

Taking into account that

$$\prod_{n<k\le N} (1+\rho_k) = (1+b)^{\Delta_N-\Delta_n}\,(1+a)^{(N-n)-(\Delta_N-\Delta_n)},$$

where Δ_n = δ_1 + · · · + δ_n, δ_k = (ρ_k − a)/(b − a), we obtain

$$\mathsf E_{\tilde{\mathsf P}}\, f\Bigl(x\prod_{n<k\le N}(1+\rho_k)\Bigr) = F_{N-n}(x;\tilde p), \qquad (22)$$

with p̃ = (r − a)/(b − a).
Using that S_N = S_n ∏_{n<k≤N}(1 + ρ_k), (21) and (20) imply that

$$X_n^{\pi^*} = \mathsf E_{\tilde{\mathsf P}}\!\left(\frac{f_N}{(1+r)^N}\,\Big|\,\mathscr F_n\right) = (1+r)^{-N} F_{N-n}(S_n;\tilde p). \qquad (23)$$

In particular,

$$\mathrm C(f_N;\mathsf P) = X_0^{\pi^*} = (1+r)^{-N} F_N(S_0;\tilde p). \qquad (24)$$
Finally, taking into account (23), we obtain from (15) that γ_n*, determined by

$$\Delta\!\left(\frac{X_n^{\pi^*}}{B_n}\right) = \gamma_n^*\,\Delta\!\left(\frac{S_n}{B_n}\right),$$

is given by

$$\gamma_n^* = (1+r)^{-(N-n)}\,\frac{F_{N-n}(S_{n-1}(1+b);\tilde p) - F_{N-n}(S_{n-1}(1+a);\tilde p)}{S_{n-1}(b-a)}. \qquad (25)$$

To find β_n*, note that B_{n−1}Δβ_n* + S_{n−1}Δγ_n* = 0 by the self-financing condition. Therefore

$$X_{n-1}^{\pi^*} = \beta_n^* B_{n-1} + \gamma_n^* S_{n-1}, \qquad (26)$$

and consequently,

$$\beta_n^* = \frac{X_{n-1}^{\pi^*} - \gamma_n^* S_{n-1}}{B_{n-1}}. \qquad (27)$$

Using this formula along with (23) and (25) we obtain

$$\beta_n^* = \frac{1}{B_N}\left\{F_{N-n+1}(S_{n-1};\tilde p) - \frac{1+r}{1+b}\,\bigl[F_{N-n}(S_{n-1}(1+b);\tilde p) - F_{N-n}(S_{n-1}(1+a);\tilde p)\bigr]\right\}. \qquad (28)$$
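The hedging recipe above can be verified path by path (a sketch, not from the text; illustrative parameters, B_0 = 1, and the call payoff f(x) = (x − K)^+ are assumptions). Here the capital is taken from (14): with B_0 = 1 it equals X_n^{π*} = B_n E_P̃(f_N/B_N | F_n) = (1+r)^{−(N−n)} F_{N−n}(S_n; p̃), whose discounted value X_n^{π*}/B_n is the conditional expectation appearing in (23); γ_n* is taken from (25) and β_n* from (27):

```python
from itertools import product
from math import comb

# Illustrative CRR parameters (not from the text); B_0 = 1, f(x) = (x - K)^+
a, b, r = -0.1, 0.2, 0.05
S0, K, N = 100.0, 100.0, 4
p = (r - a) / (b - a)                       # p~ from (18)
q = 1.0 - p                                 # q~

def F(m, x):
    """F_m(x; p~) from (21) with the call payoff f(x) = (x - K)^+."""
    return sum(max(x * (1 + b)**k * (1 + a)**(m - k) - K, 0.0)
               * comb(m, k) * p**k * q**(m - k) for k in range(m + 1))

def X(n, s):
    """Capital X_n^{pi*} = B_n E~(f_N/B_N | S_n = s) = (1+r)^{-(N-n)} F_{N-n}(s)."""
    return (1 + r)**(-(N - n)) * F(N - n, s)

# On every path of the binomial tree, the self-financed portfolio
# (gamma from (25), beta from (27)) replicates the terminal payoff.
for path in product([a, b], repeat=N):
    s = S0
    for n, rho in enumerate(path, start=1):
        gamma = (1 + r)**(-(N - n)) * (
            F(N - n, s * (1 + b)) - F(N - n, s * (1 + a))) / (s * (b - a))
        beta = (X(n - 1, s) - gamma * s) / (1 + r)**(n - 1)   # (27); B_n = (1+r)^n
        s *= 1 + rho
        cap = beta * (1 + r)**n + gamma * s                   # capital at time n
        assert abs(cap - X(n, s)) < 1e-9     # self-financing preserves (14)
    assert abs(cap - max(s - K, 0.0)) < 1e-9  # perfect hedge: X_N^{pi*} = f_N

print(round(X(0, S0), 4))  # the premium C(f_N; P) from (24)
```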

Let us see, finally, what the fair price C(fN ; P) is in the case of a standard buyer’s
(call) option when fN = (SN − K)+ .
12 Hedging in Arbitrage-Free Models 227

Let K_0 = K_0(a, b, N; S_0/K) be the smallest integer for which

$$S_0(1+a)^N\left(\frac{1+b}{1+a}\right)^{K_0} > K, \qquad (29)$$

i.e., let

$$K_0 = 1 + \left[\log\frac{K}{S_0(1+a)^N}\Big/\log\frac{1+b}{1+a}\right], \qquad (30)$$

where [x] is the integral part of x.
Using the notation

$$p^* = \frac{1+b}{1+r}\,\tilde p,$$

where p̃ = (r − a)/(b − a), and

$$B(K_0, N; p) = \sum_{k=K_0}^{N} C_N^k\, p^k (1-p)^{N-k}, \qquad (31)$$

it is not hard to derive from (24) the following formula (Cox–Ross–Rubinstein) for the fair price (denoted presently by C_N) of the standard call option:

$$C_N = S_0\, B(K_0, N; p^*) - K(1+r)^{-N} B(K_0, N; \tilde p). \qquad (32)$$

If K_0 > N, then C_N = 0.

Remark. Since

$$(K - S_N)^+ = (S_N - K)^+ - S_N + K,$$

the fair price of a standard seller’s (put) option, denoted by P_N (= C(f_N; P) with f_N = (K − S_N)^+), is given by

$$\mathrm P_N = \mathsf E_{\tilde{\mathsf P}}(1+r)^{-N}(K - S_N)^+ = C_N - \mathsf E_{\tilde{\mathsf P}}(1+r)^{-N} S_N + K(1+r)^{-N}.$$

Since $\mathsf E_{\tilde{\mathsf P}}(1+r)^{-N} S_N = S_0$, we obviously have the following identity (the call–put parity):

$$\mathrm P_N = C_N - S_0 + K(1+r)^{-N}. \qquad (33)$$
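A numerical cross-check of (32) and (33) (a sketch with illustrative parameter values, not from the text; K_0 is found directly from inequality (29) rather than via the logarithms in (30), to avoid boundary cases):

```python
from math import comb

# Illustrative parameters (not from the text)
a, b, r, S0, K, N = -0.1, 0.2, 0.05, 100.0, 110.0, 10
pt = (r - a) / (b - a)                 # p~ from (18)
ps = (1 + b) / (1 + r) * pt            # p* defined before (31)

# K0: smallest integer satisfying (29)
K0 = next(k for k in range(N + 2)
          if S0 * (1 + a)**N * ((1 + b) / (1 + a))**k > K)

def B(k0, n, p):                       # B(K0, N; p) from (31)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

CN = S0 * B(K0, N, ps) - K * (1 + r)**(-N) * B(K0, N, pt)   # CRR formula (32)

# Cross-check against the risk-neutral expectation (24) with f(x) = (x - K)^+
direct = (1 + r)**(-N) * sum(
    comb(N, k) * pt**k * (1 - pt)**(N - k)
    * max(S0 * (1 + b)**k * (1 + a)**(N - k) - K, 0.0)
    for k in range(N + 1))
assert abs(CN - direct) < 1e-9

# The put price via call-put parity (33) agrees with its direct expectation
PN = CN - S0 + K * (1 + r)**(-N)
put_direct = (1 + r)**(-N) * sum(
    comb(N, k) * pt**k * (1 - pt)**(N - k)
    * max(K - S0 * (1 + b)**k * (1 + a)**(N - k), 0.0)
    for k in range(N + 1))
assert abs(PN - put_direct) < 1e-9
```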

8. Problems
1. Find the price C(fN ; P) of a standard call option with fN = (SN − K)+ for the
model of the (B, S)-market considered in Example 2, Subsection 5, Sect. 11.
2. Try to prove the reverse inequality in (10).
3. Prove (12), and try to prove (13).
4. Give a detailed derivation of (23).
5. Prove (25) and (28).
6. Give a detailed derivation of (32).

13. Optimal Stopping Problems: Martingale Approach

1. We have already encountered an optimal stopping problem when we dealt with the fair price of an American-type option. That is, formula (12) in Sect. 12 shows that, to find this price, we must (under the simplified conditions B_n = 1, 0 ≤ n ≤ N, and P̃ = P) determine the quantity (also called a “price”)

$$V_0^N = \sup_{\tau\in\mathfrak M_0^N} \mathsf E\, f_\tau, \qquad (1)$$

where f = (f0 , f1 , . . . , fN ) is a sequence of Fn -measurable nonnegative functions


and τ = τ (ω) are Markov times (or stopping times) of class MN0 consisting of
random variables τ = τ (ω) taking values {0, 1, . . . , N} and such that for any n in
this set
{ω : τ (ω) = n} ∈ Fn . (2)
(In this section we assume given a filtered probability space (Ω, F , (Fn )n≥0 , P)
with F0 = {∅, Ω}.)
Along with problem (1), where the times τ = τ(ω) belong to M_0^N, the problem of finding the price

$$V_0^\infty = \sup_{\tau\in\mathfrak M_0^\infty} \mathsf E\, f_\tau \qquad (3)$$

is also of interest, where M_0^∞ = {τ : τ < ∞} and f = (f_0, f_1, ...) is a stochastic sequence of F_n-measurable random variables f_n, n ≥ 0, with E|f_τ| < ∞.
In both cases (1) and (3), the problem is not only in finding the prices V0N and V0∞ ,
but also in determining the optimal times (provided they exist) when the supremum
is attained.
In many problems it makes sense to consider also infinite Markov times (taking
also the value +∞). In this case when dealing with E fτ we should agree what we
mean by f∞ . One natural way is to take lim supn fn for f∞ . Another convention when
admitting infinite values for τ is to define the price as

$$\overline V_0 = \sup_{\tau\in\overline{\mathfrak M}_0^\infty} \mathsf E\, f_\tau I(\tau<\infty), \qquad (4)$$

where $\overline{\mathfrak M}_0^\infty = \{\tau : \tau \le \infty\}$ is the class of all Markov times. Obviously, $\overline V_0 = \sup_{\tau\in\overline{\mathfrak M}_0^\infty} \mathsf E f_\tau$ when letting f_∞ = 0 (cf. Sect. 1, Subsection 3).
In what follows we will only treat problem (1). (Regarding the case N = ∞, see
Sect. 9, Chap. 8.) If the probabilistic structure of the sequence f = (f0 , f1 , . . . , fN )
is not specified, the most efficient method of solving problems (1) and (3) is the
“martingale” method described in what follows. (We will always assume without
mention that E |fn | < ∞ for all n ≤ N.)
2. Thus, let N < ∞. This case may be treated by what is known as backward
induction, which is carried out here as follows.

Along with V_0^N, define the “prices”

$$V_n^N = \sup_{\tau\in\mathfrak M_n^N} \mathsf E\, f_\tau, \qquad (5)$$

where M_n^N = {τ : n ≤ τ ≤ N} is the class of stopping times such that n ≤ τ(ω) ≤ N for all ω ∈ Ω.
Moreover, define inductively the stochastic sequence v^N = (v_n^N)_{0≤n≤N} as follows:

$$v_N^N = f_N, \qquad v_n^N = \max(f_n,\ \mathsf E(v_{n+1}^N \mid \mathscr F_n)) \qquad (6)$$

for n = N − 1, ..., 0.
For 0 ≤ n ≤ N, define

$$\tau_n^N = \min\{n \le k \le N : f_k = v_k^N\}. \qquad (7)$$

Using this notation, the following theorem completely describes the solution of
the optimal stopping problems (1) and (5).
Theorem 1. Let f = (f_0, f_1, ..., f_N) be such that every f_n is F_n-measurable.
(i) For any n, 0 ≤ n ≤ N, the stopping time

$$\tau_n^N = \min\{n \le k \le N : v_k^N = f_k\} \qquad (8)$$

is optimal within the class M_n^N:

$$\mathsf E\, f_{\tau_n^N} = \sup_{\tau\in\mathfrak M_n^N} \mathsf E\, f_\tau \ (= V_n^N). \qquad (9)$$

(ii) The stopping times τ_n^N, 0 ≤ n ≤ N, are optimal also in the following “conditional” sense:

$$\mathsf E(f_{\tau_n^N} \mid \mathscr F_n) = \operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^N} \mathsf E(f_\tau \mid \mathscr F_n) \quad (\mathsf P\text{-a.s.}). \qquad (10)$$

The “stochastic prices” $\operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^N} \mathsf E(f_\tau\mid\mathscr F_n)$ are equal to v_n^N:

$$\operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^N} \mathsf E(f_\tau \mid \mathscr F_n) = v_n^N \quad (\mathsf P\text{-a.s.}) \qquad (11)$$

and

$$V_n^N = \mathsf E\, v_n^N. \qquad (12)$$

If n = 0, then

$$V_0^N = v_0^N. \qquad (13)$$

For n = N,

$$V_N^N = \mathsf E\, f_N. \qquad (14)$$
3. Before we proceed to the proof, let us recall the definition of the essential supre-
mum ess supα∈A ξα (ω) of a family of F -measurable random variables {ξα (ω), α ∈
A} involved in (10).

We need this concept because in the case of an uncountable set A the use of the
ordinary supα∈A ξα (ω) may, in general, give rise to functions (of ω ∈ Ω) that are
not F -measurable.
Indeed, for any c ∈ R,

$$\Bigl\{\omega : \sup_{\alpha\in A}\xi_\alpha(\omega) \le c\Bigr\} = \bigcap_{\alpha\in A}\{\omega : \xi_\alpha(\omega) \le c\}.$$

Here, the sets A_α = {ω : ξ_α(ω) ≤ c} belong to F (i.e., they are events). However, since the set A is uncountable, we are not guaranteed that $\bigcap_{\alpha\in A} A_\alpha \in \mathscr F$.
Definition. Let {ξα (ω), α ∈ A} be a family of random variables (i.e., of F -
measurable functions taking values in (−∞, +∞)). An extended random variable
ξ(ω) (an F -measurable function with values in (−∞, +∞]) is said to be the essen-
tial supremum of the family {ξα (ω), α ∈ A} (denoted ξ(ω) = ess supα∈A ξα (ω))
if
(a) ξ(ω) ≥ ξα (ω) (P-a.s.) for all α ∈ A;
(b) For any (extended) random variable η(ω) such that η(ω) ≥ ξα (ω) (P-a.s.) for
all α ∈ A we have ξ(ω) ≤ η(ω) (P-a.s.).
In other words, ξ(ω) is the smallest among all (extended) random variables ma-
jorizing ξα (ω) for all α ∈ A.
Of course, we must prove first of all that this definition is meaningful. This is
done by the following statement.
Lemma. For any family {ξα (ω), α ∈ A} of random variables there exists a random
variable (in general, extended) ξ(ω) (denoted by ess supα∈A ξα (ω)) with properties
(a) and (b) as in the definition.
There is a countable subset A0 ⊆ A with the property that this variable can be
taken to be
ξ(ω) = sup ξα (ω).
α∈A0

PROOF. First, assume that all ξ_α(ω), α ∈ A, are uniformly bounded (|ξ_α(ω)| ≤ c, ω ∈ Ω, α ∈ A).
For a finite subset B of the index set A, put $S(B) = \mathsf E\max_{\alpha\in B}\xi_\alpha(\omega)$. Let, further, S = sup S(B), where the supremum is taken over all finite subsets B ⊆ A. Denote by A_n, n ≥ 1, a finite set such that

$$\mathsf E\max_{\alpha\in A_n}\xi_\alpha(\omega) \ \ge\ S - \frac1n.$$

Let $A_0 = \bigcup_{n\ge1} A_n$. This set is countable, hence

$$\xi(\omega) = \sup_{\alpha\in A_0}\xi_\alpha(\omega)$$

is F -measurable, i.e., this is a random variable. (Note that |ξ(ω)| ≤ c, hence ξ(ω)
is an ordinary rather than extended random variable.)

This construction of ξ(ω) implies (Problem 1) that this random variable has prop-
erties (a) and (b) of the foregoing definition.
Therefore we have established the existence of the essential supremum for a uni-
formly bounded family {ξα (ω), α ∈ A}.
In the general case we first go from ξ_α(ω) to the bounded random variables ξ̃_α(ω) = arctan ξ_α(ω), for which |ξ̃_α(ω)| ≤ π/2, α ∈ A, ω ∈ Ω, and then we let ξ̃(ω) = ess sup_{α∈A} ξ̃_α(ω).
Then the random variable ξ(ω) = tan ξ̃(ω) will satisfy requirements (a) and (b) of the definition of the essential supremum (Problem 2). □
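A standard example (added for clarity; it is not from the text) of why the essential supremum, rather than the pointwise supremum, is the right notion: on Ω = [0,1] with Lebesgue measure P, take the family ξ_α = I_{{α}}, α ∈ A = [0,1]. Every ξ_α = 0 (P-a.s.), so

$$\operatorname*{ess\,sup}_{\alpha\in[0,1]}\xi_\alpha(\omega) = 0 \quad (\mathsf P\text{-a.s.}), \qquad\text{whereas}\qquad \sup_{\alpha\in[0,1]}\xi_\alpha(\omega) \equiv 1.$$

Here any countable A_0 ⊆ [0,1] may serve as the set in the lemma, since $\sup_{\alpha\in A_0}\xi_\alpha = I_{A_0} = 0$ (P-a.s.).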
4. PROOF OF THEOREM 1. Let us fix an index N. To simplify writing, it will often


be omitted.
If n = N, then vN = fN and τN = N, and properties (9)–(12) and (14) are obvious.
Now we will argue inductively.
Suppose that the theorem is established for n = N, N − 1, ..., k. Let us show that it holds then for n = k − 1.
Let τ ∈ M_{k−1} (= M_{k−1}^N) and A ∈ F_{k−1}. Define the stopping time τ̄ ∈ M_k by letting τ̄ = max(τ, k); note that f_τ = f_{τ̄} on {τ ≥ k}. Since τ̄ ∈ M_k and the event {τ ≥ k} ∈ F_{k−1}, we find that

$$\begin{aligned}
\mathsf E[I_A f_\tau] &= \mathsf E[I_{A\cap\{\tau=k-1\}} f_\tau] + \mathsf E[I_{A\cap\{\tau\ge k\}} f_{\bar\tau}]\\
&= \mathsf E[I_{A\cap\{\tau=k-1\}} f_\tau] + \mathsf E[I_{A\cap\{\tau\ge k\}}\,\mathsf E(f_{\bar\tau}\mid\mathscr F_{k-1})]\\
&= \mathsf E[I_{A\cap\{\tau=k-1\}} f_\tau] + \mathsf E[I_{A\cap\{\tau\ge k\}}\,\mathsf E(\mathsf E(f_{\bar\tau}\mid\mathscr F_k)\mid\mathscr F_{k-1})]\\
&\le \mathsf E[I_{A\cap\{\tau=k-1\}} f_{k-1}] + \mathsf E[I_{A\cap\{\tau\ge k\}}\,\mathsf E(v_k\mid\mathscr F_{k-1})] \le \mathsf E[I_A v_{k-1}]. \qquad (15)
\end{aligned}$$

In view of the F_{k−1}-measurability of the set A, this implies that for any τ ∈ M_{k−1}

$$\mathsf E(f_\tau\mid\mathscr F_{k-1}) \le v_{k-1} \quad (\mathsf P\text{-a.s.}). \qquad (16)$$
We will show now that for the Markov time τk−1

E( fτk−1 | Fk−1 ) = vk−1 , (17)

with P-probability 1. (If this equality is established, we obtain by (16) that (10) and
(11) hold also for n = k − 1.)
For that purpose it suffices to show that (15) holds for τ = τk−1 with equality
rather than inequality signs throughout.
Beginning as in (15) and using then that on the set {τ_{k−1} ≥ k} we have τ_{k−1} = τ_k by definition (8) and that (by the induction assumption) E(f_{τ_k} | F_k) = v_k (P-a.s.),
we obtain
E[IA fτk−1 ] = E[IA∩{τk−1 =k−1} fk−1 ] + E[IA∩{τk−1 ≥k} E( fτk−1 | Fk−1 )]
= E[IA∩{τk−1 =k−1} fk−1 ] + E[IA∩{τk−1 ≥k} E( fτk | Fk−1 )]
= E[IA∩{τk−1 =k−1} fk−1 ] + E[IA∩{τk−1 ≥k} E(vk | Fk−1 )] = E[IA vk−1 ],

where the last equality holds because vk−1 = max( fk−1 , E(vk | Fk−1 )) by defini-
tion, hence vk−1 = fk−1 on the set {τk−1 = k − 1} and vk−1 > fk−1 on the set
{τk−1 > k − 1} = {τk−1 ≥ k} (so that on this set vk−1 = E(vk | Fk−1 )).
Thus (17) is established. As was pointed out earlier, this property, together with
(16), implies that (10) and (11) hold.
It follows from these relations that (P-a.s.)

vn = E( fτn | Fn ) ≥ E( fτ | Fn ) (18)

for any τ ∈ Mn (= MNn ). Therefore, taking into account the convention vNn = vn ,
we find that
E vNn = E fτn ≥ sup E fτ = VnN , (19)
τ ∈MNn

which proves (9) and (12).


Property (13) is a particular case of (12) (for n = 0) due to the fact that vN0 is
a constant by (11) and since the σ-algebra F0 (= {∅, Ω}) is trivial. Finally, (14)
follows from definition (5) (for n = N).
5. To clarify the “martingale” nature of the optimal problem at hand, consider the
recurrence relations (6) for the sequence vN = (vN0 , vN1 , . . . , vNN ) with the “boundary”
condition vNN = fN .
We see from (6) that for every n = 0, 1, . . . , N − 1 (P-a.s.)

vNn ≥ fn , (20)
vNn ≥ E(vNn+1 | Fn ). (21)

The first inequality here means that the sequence vN majorizes the sequence
f = (f0 , f1 , . . . , fN ). The second inequality shows that vN is a supermartingale
with “terminal” value vNN = fN . Thus, we can say that vN = (vN0 , vN1 , . . . , vNN ) with
vNn ’s defined by (6) or by (11), is a supermartingale majorant for the sequence
f = (f0 , f1 , . . . , fN ).
In other words, this means that the sequence v^N belongs to the class of sequences γ^N = (γ_0^N, γ_1^N, ..., γ_N^N) with γ_N^N ≥ f_N satisfying (P-a.s.) the “variational inequalities”

$$\gamma_n^N \ge \max(f_n,\ \mathsf E(\gamma_{n+1}^N \mid \mathscr F_n)) \qquad (22)$$

for all n = 0, 1, ..., N − 1.
But the sequence vN possesses additionally the property that (22) holds for vN
not only with nonstrict inequality “≥” but with equality “=” (see (6)). This property
singles out the sequence vN among sequences γ N (with γNN ≥ fN ) as follows.
Theorem 2. The sequence vN is the least supermartingale majorant for the se-
quence f = (f0 , f1 , . . . , fN ).
PROOF. Since v_N^N = f_N and γ_N^N ≥ f_N, we have γ_N^N ≥ v_N^N. Together with (22) and (6) this implies that (P-a.s.)

$$\gamma_{N-1}^N \ge \max(f_{N-1},\ \mathsf E(\gamma_N^N \mid \mathscr F_{N-1})) \ge \max(f_{N-1},\ \mathsf E(v_N^N \mid \mathscr F_{N-1})) = v_{N-1}^N.$$

In a similar way we find that γnN ≥ vNn (P-a.s.) also for all n < N − 1.

Remark. The result of this theorem can be restated as follows: The solution vN =
(vN0 , vN1 , . . . , vNN ) of the recurrence system

vNn = max(fn , E(vNn+1 | Fn )), n < N,

with vNN = fN , is the smallest among solutions γ N = (γ0N , γ1N , . . . , γNN ) of the recur-
rence system of inequalities

$$\gamma_n^N \ge \max(f_n,\ \mathsf E(\gamma_{n+1}^N \mid \mathscr F_n)), \quad n < N, \qquad (23)$$

with γNN ≥ fN .

6. Theorems 1 and 2 not only describe the method of finding the price V0N =
sup E fτ , where sup is taken over the class of Markov times MN0 , but also enable
us to determine the optimal time τ0N , i.e., the time for which E fτ0N = V0N .
According to (8),

τ0N = min{0 ≤ k ≤ N : vNk = fk }. (24)

When solving specific optimal stopping problems, the following equivalent de-
scription of this stopping time τ0N is usable.
Let
DNn = {ω : vNn (ω) = fn (ω)} (25)
and
CnN = Ω \ DNn = {ω : vNn (ω) = E(vNn+1 | Fn )(ω)}.
Clearly, DNN = Ω, CNN = ∅, and

DN0 ⊆ DN1 ⊆ · · · ⊆ DNN = Ω,


C0N ⊇ C1N ⊇ · · · ⊇ CNN = ∅.

It follows from (24) and (25) that the stopping time τ0N can also be defined as

τ0N = min{0 ≤ k ≤ N : ω ∈ DNk }. (26)

It is natural to call DNk the “stopping sets” and CkN the “continuation of observa-
tion sets.” This terminology can be justified as follows.
Consider the time instant n = 0, and divide the set Ω into the sets D_0^N and C_0^N (Ω = D_0^N ∪ C_0^N, D_0^N ∩ C_0^N = ∅). If ω ∈ D_0^N, then τ_0^N(ω) = 0. In other words, “stopping” occurs at time n = 0. But if ω ∈ C_0^N, then τ_0^N(ω) ≥ 1. In the case where ω ∈ D_1^N ∩ C_0^N, we have τ_0^N(ω) = 1. The subsequent steps are considered in a similar manner. At time N the observations are certainly terminated.

7. Consider some examples.

EXAMPLE 1. Let f = ( f0 , f1 , . . . , fN ) be a martingale with f0 = 1. Then, according


to Corollary 1 to Theorem 1, Sect. 2, E fτ = 1 for any Markov time τ ∈ MN0 .
Therefore, in this case, V0N = supτ ∈MN0 E fτ = 1.
The functions v_n^N = f_n for all 1 ≤ n ≤ N, and v_0^N = 1. Then it is clear that τ_0^N = min{0 ≤ k ≤ N : f_k = v_k^N} = 0 and τ_n^N = n for any 1 ≤ n ≤ N.
Thus the optimal stopping problem for martingale sequences is solved actually in a trivial manner: the optimal stopping time is τ_0^N(ω) = 0, ω ∈ Ω (as well as, by the way, any other stopping time τ_n^N(ω) = n, ω ∈ Ω, 1 ≤ n ≤ N).

EXAMPLE 2. If f = ( f0 , f1 , . . . , fN ) is a submartingale, then E fτ ≤ E fN for any


τ ∈ MN0 (Theorem 1, Sect. 2). Thus the optimal stopping time here is τ ∗ ≡ N.
Since vNk = E(fN | Fk ) ≥ fk (P-a.s.), it is possible that τ0N (ω) may be less than N for
some ω. But in any case both stopping times τ0N and τ ∗ ≡ N are optimal. Although
τ ∗ ≡ N has a simple structure, τ0N nevertheless has a certain advantage, namely, it
is the smallest among possible stopping times, i.e., if τ̃ is also a stopping time in the
class MN0 , then P{τ0N ≤ τ̃ } = 1.

EXAMPLE 3. Let f = ( f0 , f1 , . . . , fN ) be a supermartingale. Then vNn = fn for all


0 ≤ n ≤ N. Therefore the optimal stopping time is (as in the martingale case)
τ0N = 0.

The preceding examples are fairly simple, and the problem of finding optimal
stopping times is solved in them actually without invoking the theory given by The-
orems 1 and 2. Their solution relies on the results on preservation of martingale,
submartingale, and supermartingale properties under time change by a Markov time
(Sect. 2). But in general finding the price V0N and the optimal stopping time τ0N may
be a very difficult problem.
Of great interest are the cases where the functions fn have the form

fn (ω) = f (Xn (ω)),

where X = (Xn )n≥0 is a Markov chain. As will be shown in Sect. 9 of Chap. 8, the
solution of optimal stopping problems reduces in fact to the solution of variational
inequalities and Wald–Bellman equations of dynamic programming.
We also provide therein (nontrivial) examples of complete solutions to some op-
timal stopping problems for Markov sequences.
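To make the backward induction (6)–(8) concrete, here is a small sketch (not from the text, with an illustratively chosen payoff): a fair ±1 random walk with f_n = |X_n|, which is a submartingale, so by Example 2 stopping at N is optimal and V_0^N = E|X_N|.

```python
from itertools import product

# Payoffs f_n = |X_n| for a fair +-1 walk started at 0 (illustrative choice)
N = 4
def f(n, path):                        # f_n depends on the path up to time n
    return abs(sum(path[:n]))

# Backward induction (6): v_N = f_N, v_n = max(f_n, E(v_{n+1} | F_n))
v = [dict() for _ in range(N + 1)]
for path in product([-1, 1], repeat=N):
    v[N][path] = f(N, path)
for n in range(N - 1, -1, -1):
    for path in product([-1, 1], repeat=n):
        cont = 0.5 * (v[n + 1][path + (-1,)] + v[n + 1][path + (1,)])
        v[n][path] = max(f(n, path), cont)

V0 = v[0][()]                          # price V_0^N = v_0^N, cf. (13)

# tau_0^N from (8): first k with v_k = f_k; Theorem 1 gives E f_{tau_0} = V_0
Ef_tau = sum(
    f(next(k for k in range(N + 1) if v[k][path[:k]] == f(k, path)), path) * 0.5**N
    for path in product([-1, 1], repeat=N))
assert abs(Ef_tau - V0) < 1e-12

# |X_n| is a submartingale, so stopping at N is optimal (cf. Example 2):
EfN = sum(abs(sum(path)) * 0.5**N for path in product([-1, 1], repeat=N))
assert abs(V0 - EfN) < 1e-12
print(V0)   # = E|X_4| = 1.5
```

The dictionaries v[n] play the role of the supermartingale majorant of Theorem 2, and the inner `next(...)` realizes the stopping rule (24): stop at the first time the majorant touches the payoff.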
8. Problems
1. Show that the random variable ξ(ω) = supα∈A0 ξα (ω) constructed in the proof
of the lemma (Subsection 3) satisfies requirements (a) and (b) in the definition of
essential supremum. (Hint: In the case α ∈ / A0 , consider E max(ξ(ω), ξα (ω)).)
2. Show that ξ(ω) = tan ξ(ω) ˜ (see the end of the proof of the lemma in Subsec-
tion 3) also satisfies requirements (a) and (b).

3. Let ξ_1, ξ_2, ... be a sequence of independent identically distributed random variables with E|ξ_1| < ∞. Consider the optimal stopping problem (in the class M_1^∞ = {τ : 1 ≤ τ < ∞}):

$$V^* = \sup_{\tau\in\mathfrak M_1^\infty} \mathsf E\Bigl(\max_{i\le\tau}\xi_i - c\tau\Bigr).$$

Let τ* = min{n ≥ 1 : ξ_n ≥ A*}, where A* is the unique root of the equation E(ξ_1 − A*)^+ = c. Show that whenever P{τ* < ∞} = 1, the stopping time τ* is optimal in the class of all finite stopping times τ for which E(max_{i≤τ} ξ_i − cτ) exists. Show also that V* = A*.
4. In this and the following problems, let

$$\mathfrak M_n^\infty = \{\tau : n \le \tau < \infty\}, \qquad V_n^\infty = \sup_{\tau\in\mathfrak M_n^\infty} \mathsf E\, f_\tau,$$
$$v_n^\infty = \operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^\infty} \mathsf E(f_\tau \mid \mathscr F_n), \qquad \tau_n^\infty = \min\{k \ge n : v_k^\infty = f_k\}.$$

Assuming that $\mathsf E\sup_n f_n^- < \infty$, show that the limiting random variables

$$\tilde v_n = \lim_{N\to\infty} v_n^N$$

have the following properties:
(a) For any τ ∈ M_n^∞,

ṽ_n ≥ E(f_τ | F_n);

(b) If the stopping time τ_n^∞ ∈ M_n^∞, then

$$\tilde v_n = \mathsf E(f_{\tau_n^\infty} \mid \mathscr F_n), \qquad \tilde v_n = v_n^\infty \ \Bigl(= \operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^\infty} \mathsf E(f_\tau \mid \mathscr F_n)\Bigr).$$

5. Let τ_n^∞ ∈ M_n^∞. Deduce from (a) and (b) of the previous problem that τ_n^∞ is the optimal stopping time in the sense that

$$\operatorname*{ess\,sup}_{\tau\in\mathfrak M_n^\infty} \mathsf E(f_\tau \mid \mathscr F_n) = \mathsf E(f_{\tau_n^\infty} \mid \mathscr F_n) \quad (\mathsf P\text{-a.s.})$$

and

$$\sup_{\tau\in\mathfrak M_n^\infty} \mathsf E\, f_\tau = \mathsf E\, f_{\tau_n^\infty},$$

i.e., V_n^∞ = E f_{τ_n^∞}.


Chapter 8
Markov Chains

The modern theory of Markov processes has its origins in the studies of A. A. Markov
(1906–1907) on sequences of experiments “connected in a chain” and in attempts to de-
scribe mathematically the physical phenomenon known as Brownian motion (L. Bachelier
1900, A. Einstein 1905).
E. B. Dynkin “Markov processes,” [21, Vol. 1]

1. Definitions and Basic Properties

1. In Sect. 12 of Chap. 1 we set out, for the case of finite probability spaces, the
fundamental ideas and principles behind the concept of Markov dependence (see
property (7) therein) of random variables, which is designed to describe the evolu-
tion of memoryless systems. In this chapter we extend this treatment to more general
probability spaces.
One of the main problems of the theory of Markov processes is the study of the
asymptotic behavior (as time goes to infinity) of memoryless systems. Remarkably,
under very broad assumptions, such a system evolves as if it “forgot” the initial state,
its behavior “stabilizes,” and the system reaches a “steady-state” regime. We will analyze in detail the asymptotic behavior of systems described as Markov chains with countably many states. To this end we will provide a classification of the states of
Markov chains according to the algebraic and asymptotic properties of their transi-
tion probabilities.
2. Let (Ω, F , (Fn )n≥0 , P) be a filtered probability space, i.e., a probability space
(Ω, F , P) with a specified filtration (flow) (Fn )n≥0 of σ-algebras Fn , n ≥ 0, such
that F0 ⊆ F1 ⊆ . . . ⊆ F . Intuitively, Fn describes the “information” available by
the time n (inclusive).
Let, further, (E, E ) be a measurable space representing the “state space,” where
systems under consideration take their values. For “technical reasons” (e.g., so that,
for any random element X_0(ω) and x ∈ E, the set {ω : X_0(ω) = x} ∈ F) it will be assumed that the σ-algebra E contains all singletons in E, i.e., sets consisting of one point. (Regarding this assumption, see Subsection 6 below.)

© Springer Science+Business Media, LLC, part of Springer Nature 2019. A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95, https://doi.org/10.1007/978-0-387-72208-5
The measurable spaces (E, E ) subject to this assumption are called the phase
spaces or the state spaces of the systems under consideration.

Definition 1 (Markov chain in wide sense). Let (Ω, F , (Fn )n≥0 , P) be a filtered
probability space and (E, E ) a phase space. A sequence X = (Xn )n≥0 of E-
valued Fn /E -measurable random elements Xn = Xn (ω), n ≥ 0, defined on
(Ω, F , (Fn )n≥0 , P), is a sequence of random variables with Markov dependence (or
a Markov chain) in the wide sense if for any n ≥ 0 and B ∈ E the following wide-
sense Markov property holds:

P(Xn+1 ∈ B | Fn ) (ω) = P(Xn+1 ∈ B | Xn (ω)) (P -a.s.). (1)

Let FnX = σ(X0 , X1 , . . . , Xn ) be the σ-algebra generated by X0 , X1 , . . . , Xn .


Since FnX ⊆ Fn and Xn are FnX -measurable, (1) implies the Markov property in
the strict sense (or simply Markov property):

P(Xn+1 ∈ B | FnX ) (ω) = P(Xn+1 ∈ B | Xn (ω)) (P -a.s.). (2)

For clarity (cf. Sect. 12 in Chap. 1, Vol. 1), this property is often written

P(Xn+1 ∈ B | X0 (ω), . . . , Xn (ω)) = P(Xn+1 ∈ B | Xn (ω)) (P -a.s.). (3)

The strict-sense Markov property (2) deduced from (1) suggests the definition of
the Markov dependence in the case where the flow (Fn )n≥0 is not specified a priori.

Definition 2 (Markov chain). Let (Ω, F , P) be a probability space and (E, E )


a phase space. A sequence X = (Xn )n≥0 of F /E -measurable E-valued random
elements Xn = Xn (ω) is a sequence of random variables with Markov dependence,
or a Markov chain, if for any n ≥ 0 and B ∈ E the strict-sense Markov property (2)
holds.

Remark. The introduction from the outset of a filtered probability space on which
a Markov chain in the wide sense is defined is useful in many problems where the
behavior of systems depends on a “flow of information” (Fn )n≥0 . For example, it
may happen that the first component X = (Xn )n≥0 of a “two-dimensional” process
(X, Y) = (Xn , Yn )n≥0 is not a Markov chain in the sense of (2), but nevertheless it is
a Markov chain in the sense of (1) with Fn = FnX,Y , n ≥ 0.
However, in the elementary exposition of the theory of Markov chains to be set
out in this chapter, the flow (Fn )n≥0 is not usually introduced and the presentation
is based on Definition 2.

3. The Markov property characterizes the “lack of aftereffect” (lack of memory) in


the evolution of a system whose states are described by the sequence X = (Xn )n≥0 .
In the case of a finite space Ω, this was stated in Sect. 12, Chap. 1 (Vol. 1) as the
property

P(F | PN) = P(F | N), (4)


where F stands for “future”, P for “past”, and N for “present” (“now”). It was pointed
out there that Markov systems also possess the property

P(PF | N) = P(P | N) P(F | N), (5)

interpretable as independence of past and future for a given present.


In the general case, the analogs of (4) and (5) are stated as properties (6) and (7) in
the following theorem, which gives various equivalent formulations of the Markov
property (in the sense of Definition 2). In this theorem the following notation is
used:
$$\mathscr F_{[0,n]}^X = \sigma(X_0, X_1, \ldots, X_n), \qquad \mathscr F_{[n,\infty)}^X = \sigma(X_n, X_{n+1}, \ldots), \qquad \mathscr F_{(n,\infty)}^X = \sigma(X_{n+1}, X_{n+2}, \ldots).$$
Theorem 1. The Markov property (2) is equivalent to either of the following two properties: for n ≥ 0,

$$\mathsf P(F \mid \mathscr F_{[0,n]}^X)(\omega) = \mathsf P(F \mid X_n(\omega)) \quad (\mathsf P\text{-a.s.}) \qquad (6)$$

for any future event $F \in \mathscr F_{(n,\infty)}^X$, or for n ≥ 1,

$$\mathsf P(PF \mid X_n(\omega)) = \mathsf P(P \mid X_n(\omega))\,\mathsf P(F \mid X_n(\omega)) \quad (\mathsf P\text{-a.s.}) \qquad (7)$$

for any future event $F \in \mathscr F_{(n,\infty)}^X$ and past event $P \in \mathscr F_{[0,n-1]}^X$.
PROOF. First of all, we prove the equivalence of (6) and (7).
(6) ⇒ (7). We have (P-a.s.)

$$\begin{aligned}
\mathsf P(P \mid X_n(\omega))\,\mathsf P(F \mid X_n(\omega)) &= \mathsf E(I_P \mid X_n(\omega))\,\mathsf E(I_F \mid X_n(\omega))\\
&= \mathsf E\{I_P\,\mathsf E(I_F \mid X_n(\omega)) \mid X_n(\omega)\} = \mathsf E\{I_P\,\mathsf E(I_F \mid \mathscr F_{[0,n]}^X)(\omega) \mid X_n(\omega)\}\\
&= \mathsf E\{\mathsf E(I_P I_F \mid \mathscr F_{[0,n]}^X)(\omega) \mid X_n(\omega)\} = \mathsf E\{I_P I_F \mid X_n(\omega)\} = \mathsf P(PF \mid X_n(\omega)).
\end{aligned}$$

(7) ⇒ (6). We must show that for any set C in $\mathscr F_{[0,n]}^X$

$$\mathsf E(I_C\,\mathsf P(F \mid X_n)) = \mathsf E(I_C\,\mathsf P(F \mid \mathscr F_{[0,n]}^X)). \qquad (6')$$

To this end, consider first a particular case of such a set, namely, a set PN, where $P \in \mathscr F_{[0,n-1]}^X$ and $N \in \sigma(X_n)$, and show that in this case (6′) follows from (7). Indeed,

$$\begin{aligned}
\mathsf E(I_{PN}\,\mathsf P(F \mid X_n)) &= \mathsf E(I_P I_N\,\mathsf E(I_F \mid X_n)) = \mathsf E(I_N\,\mathsf E(I_P\,\mathsf E(I_F \mid X_n) \mid X_n))\\
&\overset{(7)}{=} \mathsf E(I_N\,\mathsf E(I_P \mid X_n)\,\mathsf E(I_F \mid X_n)) = \mathsf E(I_N\,\mathsf P(P \mid X_n)\,\mathsf P(F \mid X_n)) = \mathsf E(I_N\,\mathsf P(PF \mid X_n))\\
&= \mathsf P(PNF) = \mathsf E(I_{PN}\,\mathsf P(F \mid \mathscr F_{[0,n]}^X)), \qquad (8)
\end{aligned}$$

i.e., the property (6′) holds for sets C of the form PN, where $P \in \mathscr F_{[0,n-1]}^X$ and $N \in \sigma(X_n)$. By means of monotone classes arguments (Sect. 2, Chap. 2, Vol. 1) we deduce that property (6′) is valid for any sets C in $\mathscr F_{[0,n]}^X$. Since the function P(F | X_n) is $\mathscr F_{[0,n]}^X$-measurable, (6′) implies that P(F | X_n) is a version of the conditional probability $\mathsf P(F \mid \mathscr F_{[0,n]}^X)$, i.e., (6) holds.
Let us turn to the proof of equivalence of (2) and (6), or, in view of the foregoing proof, that of (2) and (7). The implication (6) ⇒ (2) is obvious. Let us prove the implication (2) ⇒ (6), invoking again the monotone classes arguments.
The sets F in (6) belong to the σ-algebra $\mathscr F_{(n,\infty)}^X = \mathscr F_{[n+1,\infty)}^X$, the σ-algebra generated by the algebra $\bigcup_{k=1}^\infty \mathscr F_{[n+1,n+k]}^X$, where $\mathscr F_{[n+1,n+k]}^X = \sigma(X_{n+1},\ldots,X_{n+k})$. Therefore it is natural to start with the proof of (6) for sets F in the σ-algebras $\mathscr F_{[n+1,n+k]}^X$.
We will prove this by induction. If k = 1, then $\mathscr F_{[n+1,n+1]}^X = \sigma(X_{n+1})$, and (6) is the same as (2), which is assumed to hold.
Now let (6) hold for some k ≥ 1. Let us prove its validity for k + 1.
To this end, let us take a set $F \in \mathscr F_{[n+1,n+k+1]}^X$ of the form F = F¹ ∩ F², where $F^1 \in \mathscr F_{[n+1,n+k]}^X$ and $F^2 \in \sigma(X_{n+k+1})$. Then, using the induction assumption, we find that (P-a.s.)

$$\begin{aligned}
\mathsf P(F \mid \mathscr F_{[0,n]}^X) &= \mathsf E(I_F \mid \mathscr F_{[0,n]}^X) = \mathsf E[I_{F^1\cap F^2} \mid \mathscr F_{[0,n]}^X]\\
&= \mathsf E[I_{F^1}\,\mathsf E(I_{F^2} \mid \mathscr F_{[0,n+k]}^X) \mid \mathscr F_{[0,n]}^X]\\
&= \mathsf E[I_{F^1}\,\mathsf E(I_{F^2} \mid X_{n+k}) \mid \mathscr F_{[0,n]}^X] = \mathsf E[I_{F^1}\,\mathsf E(I_{F^2} \mid X_{n+k}) \mid X_n]\\
&= \mathsf E[I_{F^1}\,\mathsf E(I_{F^2} \mid \mathscr F_{[n,n+k]}^X) \mid X_n] = \mathsf E[\mathsf E(I_{F^1} I_{F^2} \mid \mathscr F_{[n,n+k]}^X) \mid X_n]\\
&= \mathsf E[I_{F^1} I_{F^2} \mid X_n] = \mathsf P(F^1 \cap F^2 \mid X_n) = \mathsf P(F \mid X_n). \qquad (9)
\end{aligned}$$

The fact that property (9) holds, as we proved, for the sets $F \in \mathscr F_{[n+1,n+k+1]}^X$ of the form F = F¹ ∩ F² with $F^1 \in \mathscr F_{[n+1,n+k]}^X$ and $F^2 \in \sigma(X_{n+k+1})$ implies (Problem 1a) that this property holds for any sets $F \in \mathscr F_{[n+1,n+k+1]}^X$. Hence we conclude (Problem 1b) that (9) is valid also for F in the algebra $\bigcup_{k=1}^\infty \mathscr F_{[n+1,n+k]}^X$, which implies in turn (Problem 1c) that this property is satisfied also for the σ-algebra $\sigma\bigl(\bigcup_{k=1}^\infty \mathscr F_{[n+1,n+k]}^X\bigr) = \mathscr F_{(n,\infty)}^X$. □
Remark. The reasoning in this proof is based on the principle of appropriate sets
(starting the proof with sets of a “simple” structure) by applying subsequently the
results on monotone classes (Sect. 2, Chap. 2, Vol. 1). In what follows this method
will be repeatedly used (e.g., proofs of Theorems 2 and 3, which, in particular,
enable one to recover the parts of the foregoing proof of Theorem 1 that were stated
as Problems 1a, 1b, and 1c).

4. As a classical example of a Markov chain, consider the random walk X = (X_n)_{n≥0} with

$$X_n = X_0 + S_n, \quad n \ge 1, \qquad (10)$$

where Sn = ξ1 +· · ·+ξn and X0 , ξ1 , ξ2 , . . . are independent random variables defined


on a probability space (Ω, F , P).
Theorem 2. Let F0 = σ(X0 ), Fn = σ(X0 , ξ1 , . . . , ξn ), n ≥ 1. The sequence
X = (Xn )n≥0 considered on the filtered probability space (Ω, F , (Fn )n≥0 , P) is
a Markov chain (in the wide as well in the strict sense), i.e.,

P(Xn+1 ∈ B | Fn )(ω) = P(Xn+1 ∈ B | Xn (ω)) (P -a.s.) (11)

for n ≥ 0 and B ∈ B(R), and

P(Xn+1 ∈ B | Xn (ω)) = Pn+1 (B − Xn (ω)) (P -a.s.), (12)

where
Pn+1 (A) = P{ξn+1 ∈ A} (13)
and
B − Xn (ω) = {y : y + Xn (ω) ∈ B}, B ∈ B(R).
PROOF. We will prove (11) and (12) simultaneously.
For discrete probability spaces, similar results were proved in Sect. 12, Chap. 1,
Vol. 1, and it may appear that the proof here should be rather simple, too. But, as
will be seen from the subsequent proof, the present situation is more complicated.
Let A be a set of the form A = {X0 ∈ B0 , ξ1 ∈ B1 , . . . , ξn ∈ Bn }, where Bi ∈ B(R),
i = 0, 1, . . . , n. By the definition of conditional probability P(Xn+1 ∈ B | Fn ) (ω)
(Sect. 7, Chap. 2, Vol. 1),

$$\begin{aligned}
\int_A \mathsf P(X_{n+1}\in B \mid \mathscr F_n)(\omega)\,\mathsf P(d\omega) &= \int_A I_{\{X_{n+1}\in B\}}(\omega)\,\mathsf P(d\omega)\\
&= \mathsf P\{X_0\in B_0,\ \xi_1\in B_1,\ \ldots,\ \xi_n\in B_n,\ X_{n+1}\in B\}\\
&= \int_{B_0\times\cdots\times B_n} P_{n+1}\bigl(B-(x_0+x_1+\cdots+x_n)\bigr)\,P_0(dx_0)\ldots P_n(dx_n)\\
&= \int_A P_{n+1}(B - X_n(\omega))\,\mathsf P(d\omega). \qquad (14)
\end{aligned}$$

Thus we have proved the equality

$$\int_A \mathsf P(X_{n+1}\in B \mid \mathscr F_n)(\omega)\,\mathsf P(d\omega) = \int_A P_{n+1}(B - X_n(\omega))\,\mathsf P(d\omega) \qquad (15)$$

for sets A ∈ Fn of the form A = {X0 ∈ B0 , ξ1 ∈ B1 , . . . , ξn ∈ Bn }.


Obviously, the system An of such sets is a π-system (Ω ∈ An and if A1 ∈ An
and A2 ∈ An , then also A1 ∩ A2 ∈ An ; see Definition 2 in Sect. 2, Chap. 2, Vol. 1).
Further, let L be the class of sets A ∈ Fn that satisfy (15).

Let us show that L is a λ-system (Definition 2, Sect. 2, Chap. 2, Vol. 1). It is


clear that Ω ∈ L , i.e., the property (λa ) of this definition is satisfied. The property
(λb ) of the same definition holds because of the additivity of the Lebesgue integral.
Finally, the property (λc ) of that definition follows from the theorem on monotone
convergence of Lebesgue integrals (Sect. 6, Chap. 2, Vol. 1).
Thus, L is a λ-system. Applying statement (c) of Theorem 2 in Sect. 2, Chap. 2,
Vol. 1, we obtain that σ(An ) ⊆ L . But σ(An ) = Fn , so property (15) is valid also
for sets A in Fn .
Consequently, taking into account that Pn+1 (B − Xn (ω)) (as a function of ω)
is Fn -measurable (Problem 2) we obtain from (15) (by the definition of condi-
tional probability) that Pn+1 (B − Xn (ω)) is a version of the conditional probabil-
ity P(Xn+1 ∈ B | Fn )(ω). Finally, using the “telescopic property” of conditional
expectations (see property H* in Sect. 7, Chap. 2, Vol. 1) we find that (P-a.s.)

P(Xn+1 ∈ B | Xn )(ω) = E[ I{Xn+1 ∈B} | Xn ](ω) = E[ E(I{Xn+1 ∈B} | Fn ) | Xn ](ω)
= E[ Pn+1 (B − Xn ) | Xn ](ω) = Pn+1 (B − Xn (ω)). (16)

Thus, both properties (11) and (12) are proved.




Remark. Properties (11) and (12) could also be deduced (Problem 3) directly from
Lemma 3 in Sect. 2, Chap. 2, Vol. 1. We carried out a detailed proof of these “al-
most obvious” properties to demonstrate once more the technique of the proof of
such assertions based on the principle of appropriate sets and results on monotone
classes.
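The content of (11)–(12) can also be probed numerically for a concrete walk. The following minimal sketch (the step distribution, the set B, and all variable names are illustrative choices, not part of the text) estimates P(Xn+1 ∈ B | Xn = x) by simulation and compares it with Pn+1 (B − x):

```python
import random

random.seed(0)

# Integer-valued walk X_{k+1} = X_k + xi_{k+1}; the step law below is an
# arbitrary illustrative choice (values -1, 0, +1 with probs 0.3, 0.2, 0.5).
STEPS, PROBS = (-1, 0, 1), (0.3, 0.2, 0.5)

def step():
    return random.choices(STEPS, weights=PROBS)[0]

N, n = 200_000, 3          # sample size; we condition on the state at time n
B = {0, 1}                 # target set for X_{n+1}
hits = {}                  # x -> [#{X_{n+1} in B}, #{X_n = x}]
for _ in range(N):
    x = 0                  # X_0 = 0
    for _ in range(n):
        x += step()
    x_next = x + step()
    c = hits.setdefault(x, [0, 0])
    c[1] += 1
    c[0] += x_next in B

# Compare the empirical conditional frequency with P_{n+1}(B - x),
# i.e., with P{xi in B - x}, for each frequently visited state x.
for x, (good, total) in hits.items():
    if total < 5000:
        continue           # skip rarely visited states
    theo = sum(p for s, p in zip(STEPS, PROBS) if s + x in B)
    assert abs(good / total - theo) < 0.03, (x, good / total, theo)
```

The point of the check is that the conditional probability depends on the past only through the increment distribution, exactly as (12) asserts.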
5. Consider the Markov property (1). If (E, E ) is a Borel space, then, by Theorem 3
in Sect. 7, Chap. 2, Vol. 1, for any n ≥ 0 there exists a regular conditional distribu-
tion Pn+1 (x; B) such that (P-a.s.)

P(Xn+1 ∈ B | Xn (ω)) = Pn+1 (Xn (ω); B), (17)

where the function Pn+1 (x; B), B ∈ E , x ∈ E, has the following properties (Defini-
tion 7, Sect. 7, Chap. 2, Vol. 1):
(a) For any x the set function Pn+1 (x, · ) is a measure on (E, E );
(b) For any B ∈ E the function Pn+1 ( · ; B) is E -measurable.
The functions Pn = Pn (x; B), n ≥ 1, are called transition functions (or Markov
kernels).
The case of special interest to us will be the one where all these transition func-
tions are the same, P1 = P2 = . . ., or, more precisely, when the conditional proba-
bilities P(Xn+1 ∈ B | Xn (ω)), n ≥ 0, have a common version of regular conditional
distribution P(x; B) such that (P-a.s.)

P(Xn+1 ∈ B | Xn (ω)) = P(Xn (ω); B) (18)

for all n ≥ 0 and B ∈ E .


1 Definitions and Basic Properties 243

If such a version P = P(x; B) exists (in which case we can set all Pn = P,
n ≥ 0), then the Markov chain is called homogeneous (in time) with transition
function P = P(x; B), x ∈ E, B ∈ E .
The intuitive meaning of the homogeneity property of Markov chains is clear:
the corresponding system evolves homogeneously in the sense that the probabilistic
mechanisms governing the transitions of the system remain the same for all time
instants n ≥ 0. (In the theory of dynamical systems, systems with this property are
said to be conservative.)
Besides the transition probabilities P1 , P2 , . . . , or the transition probability P for
homogeneous chains, the important characteristic of Markov chains is the initial
distribution π = π(B), B ∈ E , i.e., the probability distribution π(B) = P{X0 ∈ B},
B ∈ E.
The set of objects (π, P1 , P2 , . . . ) completely determines the probabilistic prop-
erties of the sequence X = (Xn )n≥0 , since all the finite-dimensional distributions of
this sequence are given by the formulas

P{X0 ∈ B} = π(B), B ∈ E,

and

P{(X0 , X1 , . . . , Xn ) ∈ B}
= ∫_{E×···×E} IB (x0 , x1 , . . . , xn ) π(dx0 ) P1 (x0 ; dx1 ) · · · Pn (xn−1 ; dxn ) (19)

for any n ≥ 1 and B ∈ B(En+1 ) (= E n+1 = E ⊗ · · · ⊗ E (n + 1) times).


Indeed, consider first a set B of the form B = B0 × · · · × Bn . Then for n = 1
we have, by the formula for total probability (see (5) in Sect. 7, Chap. 2, Vol. 1),

P{X0 ∈ B0 , X1 ∈ B1 } = ∫_Ω I{X0 ∈B0 } (ω) P(X1 ∈ B1 | X0 (ω)) P(dω)
= ∫_Ω I{X0 ∈B0 } (ω) P1 (X0 (ω); B1 ) P(dω)
= ∫_E IB0 (x0 ) P1 (x0 ; B1 ) π(dx0 ) = ∫_{E×E} IB0 ×B1 (x0 , x1 ) P1 (x0 ; dx1 ) π(dx0 ).

The further proof proceeds by induction:

P{X0 ∈ B0 , X1 ∈ B1 , . . . , Xn ∈ Bn }
= ∫_Ω I{X0 ∈B0 ,...,Xn−1 ∈Bn−1 } (ω) P(Xn ∈ Bn | X0 (ω), . . . , Xn−1 (ω)) P(dω)
= ∫_Ω I{X0 ∈B0 ,...,Xn−1 ∈Bn−1 } (ω) P(Xn ∈ Bn | Xn−1 (ω)) P(dω)
= ∫_Ω I{X0 ∈B0 ,...,Xn−1 ∈Bn−1 } (ω) Pn (Xn−1 (ω); Bn ) P(dω)
= ∫_{E×···×E} IB0 ×B1 ×···×Bn−1 (x0 , x1 , . . . , xn−1 )
    × Pn (xn−1 ; Bn ) P{X0 ∈ dx0 , . . . , Xn−1 ∈ dxn−1 }
= ∫_{E×···×E} IB0 ×B1 ×···×Bn−1 ×Bn (x0 , x1 , . . . , xn−1 , xn )
    × Pn (xn−1 ; dxn ) Pn−1 (xn−2 ; dxn−1 ) . . . P1 (x0 ; dx1 ) π(dx0 ),

which coincides with (19) for sets B of the form B = B0 × B1 × · · · × Bn . The


general case of the sets B ∈ B(En+1 ) is treated in the same way as in the proof of
the similar point in Theorem 2.
Using the results on monotone classes (Sect. 2, Chap. 2, Vol. 1), one can deduce
from (19) (Problem 4) that for any bounded B(En+1 )-measurable function h =
h(x0 , x1 , . . . , xn )
E h(X0 , X1 , . . . , Xn )
= ∫_{En+1} h(x0 , x1 , . . . , xn ) π(dx0 ) P1 (x0 ; dx1 ) · · · Pn (xn−1 ; dxn ). (20)
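For a countable phase space the integral in (20) becomes a sum over paths (x0 , x1 , . . . , xn ). A minimal numerical sketch, assuming a hypothetical two-state chain with made-up π, P1 , P2 (none of these numbers come from the text):

```python
import itertools

# Hypothetical finite chain: E = {0, 1}; pi is the initial distribution,
# P1, P2 are one-step transition matrices (rows indexed by current state).
pi = [0.6, 0.4]
P1 = [[0.9, 0.1], [0.2, 0.8]]
P2 = [[0.5, 0.5], [0.7, 0.3]]

def expect_h(h):
    """E h(X0, X1, X2) by formula (20): sum over all paths of
    h(x0, x1, x2) * pi(x0) * P1(x0; x1) * P2(x1; x2)."""
    return sum(
        h(x0, x1, x2) * pi[x0] * P1[x0][x1] * P2[x1][x2]
        for x0, x1, x2 in itertools.product(range(2), repeat=3)
    )

# Sanity check: the path weights form a probability distribution,
total = expect_h(lambda *xs: 1.0)
assert abs(total - 1.0) < 1e-12

# and taking h = indicator of {X2 = j} recovers the law of X2.
law_x2 = [expect_h(lambda x0, x1, x2, j=j: float(x2 == j)) for j in range(2)]
assert abs(sum(law_x2) - 1.0) < 1e-12
```

Replacing h by an indicator of a set B ⊆ E³ recovers the finite-dimensional distribution (19) as a special case.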
6. Thus, if we have a Markov chain (in the wide or strict sense), then, by means of
formula (19), we can recover the distribution Law(X0 , X1 , . . . , Xn ) of any collection
of random variables X0 , X1 , . . . , Xn , n ≥ 1, from its initial distribution π = π(B) =
P{X0 ∈ B}, B ∈ E , and its transition probabilities Pn (x; B), n ≥ 1, x ∈ E, B ∈ E .
Now we take another look at defining Markov chains. We will require that
they be completely determined by a given collection of distributions
(π, P1 , P2 , . . . ), where π is the probability distribution of the
initial state of the system and the functions Pn+1 = Pn+1 (x; B), n ≥ 0, satisfy-
ing (a) and (b) of Subsection 5, play the role of transition probabilities, i.e., the
probabilities that the system in state x at time n will get into the set B ∈ E at
time n + 1. Naturally, when our initial object is the collection (π, P1 , P2 , . . . ), the
question arises as to whether there is any Markov chain with initial distribution π
and transition probabilities P1 , P2 , . . . .
An (affirmative) answer to this question is virtually given by Kolmogorov’s the-
orem (Theorem 1 and Corollary 3 in Sect. 9, Chap. 2, Vol. 1), at least for E = Rd ,
and by Ionescu Tulcea’s theorem (Theorem 2 in Sect. 9, Chap. 2, Vol. 1) for arbitrary
measurable spaces (E, E ).

Following the proofs of these theorems, define first of all the measurable space
(Ω, F ) by setting (Ω, F ) = (E∞ , B(E∞ )), where E∞ = E × E × · · · , B(E∞ ) =
E ⊗ E ⊗ · · · ; in other words, we take the elementary events to be the “points”
ω = (x0 , x1 , . . . ), where xi ∈ E.
Define the flow (Fn )n≥0 by setting Fn = σ(x0 , x1 , . . . , xn ). The random vari-
ables Xn = Xn (ω) will be defined “canonically” by setting Xn (ω) = xn if ω =
(x0 , x1 , . . . ).
Ionescu Tulcea’s theorem states that for arbitrary measurable spaces (E, E ) (and
in particular for phase spaces under consideration) there exists a probability measure
Pπ on (Ω, F ) such that

Pπ {X0 ∈ B} = π(B), B ∈ E, (21)

and the finite-dimensional distributions for all n ≥ 1 are given by

Pπ {(X0 , X1 , . . . , Xn ) ∈ B}
= ∫_E π(dx0 ) ∫_E P1 (x0 ; dx1 ) . . . ∫_E IB (x0 , . . . , xn ) Pn (xn−1 ; dxn ). (22)

Theorem 3. The canonically defined sequence X = (Xn )n≥0 is a Markov chain


(in the sense of Definition 2) with respect to the measure Pπ specified by Ionescu
Tulcea’s theorem.
PROOF. We must prove that for n ≥ 0 and B ∈ E

Pπ (Xn+1 ∈ B | Fn )(ω) = Pπ (Xn+1 ∈ B | Xn (ω)) (Pπ -a.s.) (23)

and, moreover, for n ≥ 0

Pπ (Xn+1 ∈ B | Xn (ω)) = Pn+1 (Xn (ω); B) (Pπ -a.s.). (24)

We will prove this by using the principle of appropriate sets and the results on
monotone classes (Sect. 2, Chap. 2, Vol. 1).
As before, we take for appropriate sets the sets of a “simple” structure A ∈ Fn
of the form
A = {ω : X0 (ω) ∈ B0 , . . . , Xn (ω) ∈ Bn },
where Bi ∈ E , i = 0, 1, . . . , n, and let B ∈ E .
Then the construction of the measure Pπ (see (22)) implies

∫_A I{Xn+1 ∈B} (ω) Pπ (dω) = Pπ {X0 ∈ B0 , . . . , Xn ∈ Bn , Xn+1 ∈ B}
= ∫_{B0} π(dx0 ) ∫_{B1} P1 (x0 ; dx1 ) . . . ∫_{Bn} Pn (xn−1 ; dxn ) ∫_B Pn+1 (xn ; dxn+1 )
= ∫_A Pn+1 (Xn (ω); B) Pπ (dω). (25)

By the same arguments as in the proof of Theorem 2 (see the proof of (15) for
sets A ∈ Fn ) we find that (25) is also fulfilled for A ∈ Fn , i.e., for sets of the form
A = {ω : (X0 (ω), . . . , Xn (ω)) ∈ C}, where C ∈ B(En+1 ).
Since, by the definition of conditional probabilities (Sect. 7, Chap. 2, Vol. 1),

∫_A I{Xn+1 ∈B} (ω) Pπ (dω) = ∫_A Pπ (Xn+1 ∈ B | Fn )(ω) Pπ (dω), (26)

and the functions Pn+1 (Xn (ω); B) are Fn -measurable, we obtain the required prop-
erties (23) and (24) using (25) and the “telescopic” property of conditional expecta-
tions (see H* in Subsection 4, Sect. 7, Chap. 2, Vol. 1).


7. Thus, with any given collection of distributions (π, P1 , P2 , . . . ), we can associate
a Markov chain (to be denoted by X π = (Xn , Pπ )n≥0 ) with initial distribution π and
transition probabilities P1 , P2 , . . . (i.e., a chain with properties (21), (23), and (24)).
This chain proceeds as follows.
At the time instant n = 0, the initial state is randomly chosen according to the
distribution π. If the initial value X0 is equal to x, then at the first step the system
moves from this state to a state x1 with distribution P1 (x; · ), and so on.
Therefore the initial distribution π acts only at time n = 0, while the subsequent
evolution of the system is determined by the transition probabilities P1 , P2 , . . . .
Consequently, if the random choice of two initial distributions π1 and π2 results in
the same state x, then the behavior of the system will be the same (in probabilistic
terms), being determined only by the transition probabilities P1 , P2 , . . . . This can
also be expressed as follows.
Let Px denote the distribution Pπ corresponding to the case where π is supported
at a single point x: π(dy) = δx (dy), i.e., π({x}) = 1, where {x} is a singleton, which
belongs to E by the assumption about the phase space (E, E ) (Subsection 2).
Then (22) implies (Problem 4) that for any A ∈ B(E∞ ) and x ∈ E the probability
Px (A) is (for each π) a version of the conditional probability Pπ (A | X0 = x), i.e.,

Pπ (A | X0 = x) = Px (A) (Pπ -a.s.). (27)

For any x ∈ E the probabilities Px ( · ) are completely determined by the collection


of transition probabilities (P1 , P2 , . . . ).
Therefore, if we are primarily interested in knowing how the behavior of the sys-
tem depends on the transition probabilities (P1 , P2 , . . . ), we can restrict ourselves to
the probabilities Px ( · ), x ∈ E, obtaining, if needed, the probabilities Pπ ( · ) simply
by the integration

Pπ (A) = ∫_E Px (A) π(dx), A ∈ B(E∞ ). (28)

These arguments gave rise to an approach according to which the main object
in the “general theory of Markov processes” (see [21]) (with discrete time in the
present case) is not a particular Markov chain X π = (Xn , Pπ )n≥0 , but rather a family

of Markov chains X x = (Xn , Px )n≥0 with x ∈ E. (Nevertheless, instead of the words


“a family of Markov chains” one often says simply “a Markov chain” and writes
“X = (Xn , Fn , Px )” instead of “X x = (Xn , Px )n≥0 with x ∈ E.”)
Let us emphasize that all these considerations presume that the chains are defined
“canonically”: the space (Ω, F ) is taken to be (E∞ , E ∞ ), E ∞ = E ⊗ E ⊗ · · · , the
random variables Xn (ω) are defined so that Xn (ω) = xn if ω = (x0 , x1 , . . .). There-
fore in X x = (Xn , Px ), only the probability Px depends on x, whereas no dependence
of Xn on x is assumed. This implies that, according to the measure Px , all the trajec-
tories (Xn )n≥0 “start” at the point x, i.e., Px {X0 = x} = 1.
8. In the case of finite Markov chains (Sect. 12, Chap. 1, Vol. 1), their behavior was
analyzed by exploring the transition probabilities p_ij^(n) = P(Xn = j | X0 = i), which
were shown to satisfy the Kolmogorov–Chapman equation (see (13) therein), from
which, in turn, the forward and backward Kolmogorov equations ((16) and (15)
therein) were derived.
Now we turn to the Kolmogorov–Chapman equation for Markov chains with
arbitrary phase space (E, E ). We will restrict ourselves to homogeneous chains for
which P1 = P2 = · · · = P.
In this case, in view of (22),

Pπ {(X0 , X1 , . . . , Xn ) ∈ B}
= ∫_E π(dx0 ) ∫_E P(x0 ; dx1 ) . . . ∫_E IB (x0 , x1 , . . . , xn ) P(xn−1 ; dxn ). (29)

In particular, for n = 2 we have


 
Pπ {X0 ∈ B0 , X2 ∈ B2 } = ∫_{B0} ∫_E P(x1 ; B2 ) P(x0 ; dx1 ) π(dx0 ). (30)

Hence, by the Radon–Nikodym theorem (Sect. 6, Chap. 2, Vol. 1) and the defini-
tion of the conditional probabilities, we find that (π-a.s.)

Pπ (X2 ∈ B2 | X0 = x) = ∫_E P(x; dx1 ) P(x1 ; B2 ). (31)

Let us notice now that, by (27), Pπ (X2 ∈ B2 | X0 = x) = Px {X2 ∈ B2 } (π-a.s.),


where the probability Px {X2 ∈ B2 } has a simple meaning: this is the probability of
the transition of the system from state x at time n = 0 into set B2 at time n = 2, i.e.,
this is the transition probability for two steps.
Let P(n) (x; Bn ) = Px {Xn ∈ Bn } denote the transition probability for n steps.
Then, in view of the homogeneity of the chains at hand, P(1) (x; B1 ) = P(x; B1 ),
hence we find from (31) that (π-a.s.)

P(2) (x; B) = ∫_E P(1) (x; dx1 ) P(1) (x1 ; B), (32)

where B ∈ E .

In a similar manner, one can establish (Problem 5) that for any n ≥ 0, m ≥ 0
(π-a.s.)
P(n+m) (x; B) = ∫_E P(n) (x; dy) P(m) (y; B). (33)

This is the well-known Kolmogorov–Chapman equation, whose intuitive


meaning is quite clear: to compute the probability P(m+n) (x; B) of a transition from
the point x ∈ E to the set B ∈ E for n + m steps we must multiply the probability
P(n) (x; dy) of transition from x into an “infinitesimal” neighborhood dy of y ∈ E
for n steps by the probability of transition from y to B for m steps (with subsequent
integration over all “intermediate” points y).
Regarding the Kolmogorov–Chapman equation, which relates the transition
probabilities for a varying number of steps, we should point out that it is established
only up to "π-almost surely." In particular, this implies that it is not a relation
that holds for all x ∈ E. This should not come as a surprise because, as on many
previous occasions where we had to choose versions of conditional probabilities,
we are not guaranteed that these versions are such that the properties of interest are
fulfilled identically in x rather than only π-almost surely.
Nevertheless, it is possible to explicitly specify the versions for which the Kol-
mogorov–Chapman equation (33) is fulfilled for all x ∈ E.
This follows from the following assertions (Problem 6). Let the “transition prob-
abilities” P(n) (x; B) be defined as follows:
P(1) (x; B) = P(x; B)

and for n > 1
P(n) (x; B) = ∫_E P(x; dy) P(n−1) (y; B).
Then
(i) P(n) (x; B), n ≥ 1, are regular conditional probabilities on E for a fixed x;
(ii) P(n) (x; B) is equal to Px {Xn ∈ B}, hence it is a version of Pπ (Xn ∈
B | X0 = x) (π-a.s.);
(iii) For the functions P(n) (x; B), n ≥ 1, the Kolmogorov–Chapman equations
hold identically in x ∈ E.
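For a finite phase space the recursion above amounts to repeated matrix multiplication, and assertion (iii) can be checked identically in x. A sketch with an arbitrary three-state transition matrix (the matrix is an illustrative choice, not from the text):

```python
# Hypothetical three-state chain; row x of P is the measure P(x; .).
P = [[0.2, 0.5, 0.3],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
E = range(len(P))

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in E) for j in E] for i in E]

def n_step(n):
    """P^(n)(x; {j}) via the recursion P^(n) = P P^(n-1), n >= 1."""
    M = P
    for _ in range(n - 1):
        M = matmul(P, M)
    return M

# Kolmogorov-Chapman identically in x:
# P^(5)(x; {j}) = sum_y P^(2)(x; {y}) P^(3)(y; {j}) for every x and j.
P2, P3, P5 = n_step(2), n_step(3), n_step(5)
for x in E:
    for j in E:
        rhs = sum(P2[x][y] * P3[y][j] for y in E)
        assert abs(P5[x][j] - rhs) < 1e-12
```

For these explicitly constructed versions the identity is exact (up to floating-point error), with no "π-almost surely" qualification.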
9. Problems
1. Prove statements 1a, 1b, and 1c made in the proof of Theorem 1.
2. Prove that the function Pn+1 (B − Xn (ω)) in Theorem 2 is Fn -measurable in ω.
3. Deduce the properties (11) and (12) from Lemma 3 in Sect. 2, Chap. 2, Vol. 1.
4. Prove (20) and (27).
5. Establish the validity of (33).
6. Prove statements (i), (ii), and (iii) given at the end of Subsection 8.
7. Establish whether the Markov property (3) implies that

P(Xn+1 ∈ B | X0 ∈ B0 , X1 ∈ B1 , . . . , Xn ∈ Bn ) = P(Xn+1 ∈ B | Xn ∈ Bn ),

where B, B0 , B1 , . . . , Bn are subsets of E and P{X0 ∈ B0 , X1 ∈ B1 , . . . , Xn ∈


Bn } > 0.

2. Generalized Markov and Strong Markov Properties

1. In this section we mostly consider families X x = (Xn , Px )n≥0 , x ∈ E, of homo-


geneous Markov chains defined “canonically” on the coordinate space (Ω, F ) =
(E∞ , E ∞ ) and specified by a transition function P = P(x; B), x ∈ E, B ∈ E .
Let us define the shift operators θn : Ω → Ω on (Ω, F ) (cf. Sect. 1, Chap. 5) by
setting
θn (ω) = (xn , xn+1 , . . .)
for ω = (x0 , x1 , . . .).
If H = H(ω) is an F -measurable function, then H ◦ θn will denote the function
(H ◦ θn )(ω) defined by
(H ◦ θn ) (ω) = H(θn (ω)). (1)
Thus, if ω = (x0 , x1 , . . . ) and H = H(x0 , x1 , . . . ), then (H ◦ θn ) (x0 , x1 , . . . ) =
H(xn , xn+1 , . . . ).
The following theorem is virtually property (6) in Sect. 1 restated in the context
of the present case of a family of homogeneous Markov chains.
Theorem 1. Let X x = (Xn , Px )n≥0 , x ∈ E, be a family of homogeneous Markov
chains determined by a transition function P = P(x; B), x ∈ E, B ∈ E . Assume
that the probabilities Px {(X0 , X1 , . . . , Xn ) ∈ B} for B ∈ B(En+1 ) and n ≥ 0 are
determined by (22) of Sect. 1 with π(dy) = δ{x} (dy) and P1 = P2 = · · · = P.
Then for any initial distribution π, any n ≥ 0, and any bounded (or nonnega-
tive) F -measurable function H = H(ω) the following generalized Markov property
holds:
Eπ (H ◦ θn | FnX )(ω) = EXn (ω) H (Pπ -a.s.). (2)
Remark. Although the notation in the theorem is self-explanatory, let us note nev-
ertheless that Eπ is the expectation with respect to Pπ ( · ) = ∫_E Px ( · ) π(dx), and
EXn (ω) H is to be understood as follows. Take the expectation Ex H, i.e., the aver-
aging of H with respect to Px (denote it by ψ(x)), and then plug Xn (ω) for x into
the expression thus obtained, so that EXn (ω) H = ψ(Xn (ω)). (Note that Ex H is an
E -measurable function of x (Problem 1), so EXn (ω) H is a random variable, i.e., an
F -measurable function.)
PROOF. The proof of Theorem 1 again uses the principle of appropriate sets and
functions with subsequent application of results on monotone classes.
To prove (2), we must show that for any A ∈ FnX = σ(x0 , x1 , . . . , xn )
 
∫_A (H ◦ θn )(ω) Pπ (dω) = ∫_A (EXn (ω) H) Pπ (dω), (3)

or, in a more concise form,


Eπ (H ◦ θn ; A) = Eπ (EXn H; A), (4)

where Eπ (ξ; A) denotes Eπ (ξIA ) (Subsection 2, Sect. 6, Chap. 2, Vol. 1).



According to the principle of appropriate sets, consider sets A of a “simple” struc-
ture, that is, the sets A = {ω : x0 ∈ B0 , . . . , xn ∈ Bn }, Bi ∈ E , and a function
H = H(x0 , x1 , . . . , xm ), m ≥ 0 (more precisely, an FmX -measurable function H).
Then (4) becomes

Eπ (H(Xn , Xn+1 , . . . , Xn+m ); A) = Eπ (EXn H(X0 , X1 , . . . , Xm ); A). (5)

Using (22) of Sect. 1 we find that

Eπ (H(Xn , Xn+1 , . . . , Xn+m ); A) = Eπ (IA (X0 , . . . , Xn ) H(Xn , . . . , Xn+m ))
= ∫_{En+m+1} IA (x0 , . . . , xn ) H(xn , . . . , xn+m )
    × π(dx0 ) P(x0 ; dx1 ) · · · P(xn+m−1 ; dxn+m )
= ∫_{En+1} IA (x0 , . . . , xn ) π(dx0 ) P(x0 ; dx1 ) · · · P(xn−1 ; dxn )
    × ∫_{Em} H(xn , . . . , xn+m ) P(xn ; dxn+1 ) · · · P(xn+m−1 ; dxn+m )
= ∫_{En+1} IA (x0 , . . . , xn ) π(dx0 ) P(x0 ; dx1 ) · · · P(xn−1 ; dxn )
    × [ ∫_{Em} H(x, x1 , . . . , xm ) Px (dx1 , . . . , dxm ) ]_{x=xn} = Eπ (EXn H(X0 , . . . , Xm ); A),

where Px (dx1 , . . . , dxm ) = P(x; dx1 ) P(x1 ; dx2 ) · · · P(xm−1 ; dxm ).


Thus, (5) for the sets A = {ω : x0 ∈ B0 , x1 ∈ B1 , . . . , xn ∈ Bn } and functions H
of the form H = H(x0 , x1 , . . . , xm ) is established. The case of general A ∈ FnX is
treated (for a fixed m) in the same way as in Theorem 2 of Sect. 1.
It remains to show that the properties just proved remain true also for all F
(= E ∞ )-measurable bounded (or nonnegative) functions H = H(x0 , x1 , . . . ).
For that it suffices to prove that if A ∈ FnX , then (5) holds true for such functions,
i.e., that
Eπ (H(Xn , Xn+1 , . . . ); A) = Eπ (EXn H(X0 , X1 , . . . ); A). (6)
Having in mind an application of the principle of appropriate sets (Sect. 2,
Chap. 2, Vol. 1), denote by H the set of all bounded (or nonnegative) F -measurable
functions H = H(x0 , x1 , . . . ) for which (6) is true.
Denote by J the set of (cylindrical) sets of the form Im = {ω : x0 ∈ B0 , . . . , xm ∈
Bm } with some Bi ∈ E , i = 0, 1, . . . , m, m ≥ 0. Clearly, J is a π-system of sets in
F (= E ∞ ).
To verify that H has the required closure properties, we turn to the conditions of
Theorem 3 of Sect. 2, Chap. 2, Vol. 1.

Condition (h1 ) is fulfilled because IA ∈ H for A ∈ J by what was proved


earlier (take H(x0 , . . . , xm ) = IA (x0 , . . . , xm ) in (5)). Condition (h2 ) follows from
the additivity of the Lebesgue integral, and (h3 ) from the monotone convergence
theorem for Lebesgue integrals.
According to Theorem 3 just mentioned, H contains, then, all functions mea-
surable with respect to σ(J), which by definition is the σ-algebra E ∞ = B(E∞ )
(Subsections 4 and 8, Sect. 2, Chap. 2, Vol. 1).


2. Now we proceed to another generalization of the Markov property, the so-called
strong Markov property related to the change from “time n” to “random time τ .”
(The general setup will be the same as at the start of this section: (Ω, F ) =
(E∞ , E ∞ ) and so on.)
We will denote by τ = τ (ω) finite random variables τ (ω) such that for any n ≥ 0

{ω : τ (ω) = n} ∈ FnX .

According to the terminology used in Sect. 1, Chap. 7 (Definition 3), such a random
variable is called a (finite) Markov or stopping time.
We will associate with the flow (FnX )n≥0 and the stopping time τ the σ-algebra

FτX = {A ∈ F X : A ∩ {τ = n} ∈ FnX for all n ≥ 0},

where F X = σ(⋃n FnX ). The σ-algebra FτX is interpreted as the σ-algebra of events
observed on the “random interval” [0, τ ].

Theorem 2. Suppose that the conditions of Theorem 1 are fulfilled, and let τ =
τ (ω) be a finite Markov time. Then the following strong Markov property holds:

Eπ (H ◦ θτ | FτX ) = EXτ H (Pπ -a.s.). (7)

Before we proceed to the proof, let us comment on how EXτ H and H ◦ θτ must
be understood.
Let ψ(x) = Ex H. (We pointed out in Subsection 1 that ψ(x) is an E -measurable
function of x.) By EXτ H we mean ψ(Xτ ) = ψ(Xτ (ω) (ω)). As concerns (H ◦ θτ )(ω),
this is the random variable (H ◦ θτ (ω) )(ω) = H(θτ (ω) (ω)).

PROOF. Take a set A ∈ Fτ . As in Theorem 1, for the proof of (7) we must show
that
Eπ (H ◦ θτ ; A) = Eπ (EXτ H; A). (8)
Consider the left-hand side. We have


Eπ (H ◦ θτ ; A) = Σ_{n=0}^∞ Eπ (H ◦ θτ ; A ∩ {τ = n})
= Σ_{n=0}^∞ Eπ (H ◦ θn ; A ∩ {τ = n}). (9)

The right-hand side of (8) is

Eπ (EXτ H; A) = Σ_{n=0}^∞ Eπ (EXn H; A ∩ {τ = n}). (10)

Obviously, A ∩ {τ = n} ∈ FnX . Therefore, in view of (4), the right-hand sides in (9)


and (10) are the same, which proves the strong Markov property (7).


Corollary. If we let H(x0 , x1 , . . . ) = IA (x0 , x1 , . . . ), where A = {ω : (x0 , x1 , . . . ) ∈
B}, B ∈ E ∞ = B(E∞ ), we obtain from (7) the following widely used form of the
strong Markov property:

Pπ ((Xτ , Xτ +1 , . . . ) ∈ B | X0 , X1 , . . . , Xτ )
= PXτ {(X0 , X1 , . . . ) ∈ B} (Pπ -a.s.). (11)

Remark 1. If we analyze the proof of the strong Markov property (7), we can see
that in fact the following property also holds.
Let for any n ≥ 0 the real-valued functions Hn = Hn (ω) defined on Ω = E∞
be F -measurable (F = E ∞ ) and uniformly bounded (i.e., |Hn (ω)| ≤ c, n ≥ 0,
ω ∈ Ω). Then for any finite Markov time τ = τ (ω) (τ (ω) < ∞, ω ∈ Ω) the
following form of the strong Markov property holds (Problem 2):

Eπ [Ψτ | FτX ] = ψ(τ, Xτ ) (Pπ -a.s.), (12)

where Ψn (ω) = Hn (θn (ω)), ψ(n, x) = Ex Hn (see [21]).


Remark 2. We assumed earlier that τ = τ (ω) is a finite Markov time. If this is not
the case, i.e., τ (ω) ≤ ∞, ω ∈ Ω, then (12) must be changed as follows (Problem 3):

Eπ [Ψτ | FτX ] = ψ(τ, Xτ ) ({τ < ∞}; Pπ -a.s.). (13)

In other words, in this case, (12) holds Pπ -a.s. on the set {τ < ∞}.
3. Example (Related to the strong Markov property). When dealing with the law
of the iterated logarithm we used an inequality (Lemma 1 in Sect. 4, Chap. 4; see also
(14) in what follows) whose counterpart for the Brownian motion B = (Bt )t≤T is
the equality P{max0≤t≤T Bt > a} = 2 P{BT > a} = P{|BT | > a} ([12, Chap. 3]).
Let ξ1 , ξ2 , . . . be a sequence of independent identically distributed random vari-
ables with symmetric (about zero) distribution. Let X0 = x ∈ R, Xm = X0 + (ξ1 +
· · · + ξm ), m ≥ 1. As before, we denote by Px the probability distribution of the
sequence X = (Xm )m≥0 with X0 = x. (The space Ω is assumed to be specified
coordinate-wise, ω = (x0 , x1 , . . . ) and Xm (ω) = xm .)
According to (slightly modified) inequality (9) of Sect. 4, Chap. 4,

P0 { max0≤m≤n Xm > a } ≤ 2 P0 {Xn > a} (14)

for any a > 0.



Define the Markov time τ = τ (ω) by

τ (ω) = inf{0 ≤ m ≤ n : Xm (ω) > a}. (15)

(As usual, we set inf ∅ = ∞.) Let us demonstrate an “easy proof” of (14) using this
Markov time, which would be valid if such a (random) time could be treated in the
same manner as if it were nonrandom. We have (cf. proof of Lemma 1 in Sect. 4,
Chap. 4)

P0 {Xn > a} = P0 {(Xn − Xτ ∧n ) + Xτ ∧n > a}
≥ P0 {Xn − Xτ ∧n ≥ 0, Xτ ∧n > a} = P0 {Xn − Xτ ∧n ≥ 0} P0 {Xτ ∧n > a}
≥ (1/2) P0 {Xτ ∧n > a} = (1/2) P0 {τ ≤ n} = (1/2) P0 { max0≤m≤n Xm > a }, (16)

where we have used the seemingly “almost obvious” property that Xn − Xτ ∧n and
Xτ ∧n are independent, which is true for a deterministic time τ but is, in general,
false for a random τ (Problem 4). (This means that our “easy proof” is incorrect.)
Now we give a correct proof of (14) based on the strong Markov property (13).
Since {Xn > a} ⊆ {τ ≤ n}, we have

P0 {Xn > a} = E0 (I{Xn >a} ; τ ≤ n). (17)

Define the functions Hm = Hm (x0 , x1 , . . . ) by setting

Hm (x0 , x1 , . . . ) = 1, if m ≤ n and xn−m > a, and Hm (x0 , x1 , . . . ) = 0 otherwise.

It follows from this definition that on the set {τ ≤ n}

(Hτ ◦ θτ )(x0 , x1 , . . . ) = 1, if xn > a, and 0 otherwise. (18)

Since {Xn > a} ⊆ {τ ≤ n} and {τ ≤ n} ∈ Fτ , we obtain from (17)

P0 {Xn > a} = E0 (Hτ ◦ θτ ; τ ≤ n) = E0 (E0 (Hτ ◦ θτ | Fτ ); τ ≤ n). (19)

According to the strong Markov property (13), we have

E0 (Hτ ◦ θτ | Fτ ) = ψ(τ, Xτ ) (P0 -a.s.) (20)

on the set {τ ≤ n}. By definition, ψ(m, x) = Ex Hm , and we obtain for x > a

Ex Hm = Px {Xn−m > a} ≥ Px {Xn−m ≥ x} ≥ 1/2

(the last inequality follows from the symmetry of the distributions of ξ1 , ξ2 , . . . ).



Hence
E0 (Hτ ◦ θτ | Fτ ) ≥ 1/2 (P0 -a.s.) (21)
on the set {τ ≤ n}. Together with (19) and (20), this implies the required inequality
(14).
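Inequality (14) is easy to probe by simulation for a symmetric ±1 walk. A minimal Monte Carlo sketch (the horizon n, the level a, and the sample size are arbitrary choices for the illustration):

```python
import random

random.seed(1)

# Symmetric walk: steps +/-1 with probability 1/2 each, X_0 = 0.
n, a, N = 20, 3, 100_000
count_max, count_end = 0, 0
for _ in range(N):
    x, m = 0, 0
    for _ in range(n):
        x += random.choice((-1, 1))
        m = max(m, x)                  # running maximum of X_1,...,X_n
    count_max += m > a
    count_end += x > a

p_max = count_max / N
p_end = count_end / N
# Empirical version of (14): P0{max X_m > a} <= 2 P0{X_n > a}.
assert p_max <= 2 * p_end + 0.01       # small slack for Monte Carlo error
```

For this particular walk the reflection argument shows that the gap between the two sides of (14) is exactly P0 {Xn = a + 1}, so the bound is nearly tight.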
4. If we compare the Kolmogorov–Chapman Eq. (13) with Eq. (38), both in Sect. 12,
Chap. 1, Vol. 1, we can observe that they are very similar. Therefore it is of interest
to analyze the common points and the differences in their statements and proofs.
(We restrict ourselves to homogeneous Markov chains with discrete state space E.)
Using (1) and (2), we obtain for n ≥ 1, 1 ≤ k ≤ n, and i, j ∈ E, that

Pi {Xn = j} = Σ_{α∈E} Pi {Xn = j, Xk = α} = Σ_{α∈E} Ei [I(Xn = j) I(Xk = α)]
= Σ_{α∈E} Ei [Ei (I(Xn = j) I(Xk = α) | Fk )]
= Σ_{α∈E} Ei [I(Xk = α) Ei (I(Xn = j) | Fk )]
= Σ_{α∈E} Ei [I(Xk = α) Ei (I(Xn−k = j) ◦ θk | Fk )]   (by (1))
= Σ_{α∈E} Ei [I(Xk = α) EXk I(Xn−k = j)]   (by (2))
= Σ_{α∈E} Ei [I(Xk = α) Eα I(Xn−k = j)]
= Σ_{α∈E} Ei [I(Xk = α)] Eα [I(Xn−k = j)] = Σ_{α∈E} Pi {Xk = α} Pα {Xn−k = j}, (22)

which is exactly the Kolmogorov–Chapman Eq. (13) as in Sect. 12, Chap. 1, Vol. 1,
written there as
p_ij^(n) = Σ_{α∈E} p_iα^(k) p_αj^(n−k).

If we replace the time k in (22) with a Markov time τ (taking values 1, 2, . . . , n)


and use the strong Markov property (7) instead of Markov property (2), we obtain
(Problem 5) the following natural (generalized) form of the Kolmogorov–Chapman
equation:
Pi {Xn = j} = Σ_{α∈E} Pi {Xτ = α} Pα {Xn−τ = j}. (23)

Both in (22) and (23) the summation is done over the phase variable α ∈ E, whereas
in (38) of Sect. 12, Chap. 1, Vol. 1, the summation is over the time variable.
Having noticed this, assume that τ is a Markov time with values in {1, 2, . . . }.
Starting as in the derivation of (38) given earlier, we find that


Pi {Xn = j} = Σ_{k=1}^n Pi {Xn = j, τ = k} + Pi {Xn = j, τ ≥ n + 1}
= Σ_{k=1}^n Ei [I(Xn = j) I(τ = k)] + Pi {Xn = j, τ ≥ n + 1}
= Σ_{k=1}^n Ei [Ei (I(Xn = j) I(τ = k) | Fk )] + Pi {Xn = j, τ ≥ n + 1}
= Σ_{k=1}^n Ei [I(τ = k) Ei (I(Xn = j) | Fk )] + Pi {Xn = j, τ ≥ n + 1}
= Σ_{k=1}^n Ei [I(τ = k) Ei (I(Xn−k = j) ◦ θk | Fk )] + Pi {Xn = j, τ ≥ n + 1}
= Σ_{k=1}^n Ei [I(τ = k) EXk I(Xn−k = j)] + Pi {Xn = j, τ ≥ n + 1}. (24)

In Subsection 7 of Sect. 12, Chap. 1, Vol. 1, the role of τ was played by

τj = min{1 ≤ k ≤ n : Xk = j}

with the condition that τj = n + 1 if the set { · } = ∅. In this case, (24) simplifies to


Pi {Xn = j} = Σ_{k=1}^n Ei (I(τj = k) EXτj I(Xn−k = j))
= Σ_{k=1}^n Ei (I(τj = k) Ej I(Xn−k = j)) = Σ_{k=1}^n Ei [I(τj = k)] Ej [I(Xn−k = j)]
= Σ_{k=1}^n Pi {τj = k} Pj {Xn−k = j}

to become Eq. (38) in Sect. 12, Chap. 1, Vol. 1:

(n)

n
(k) (n−k)
pij = fij pjj . (25)
k=1
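For a small chain, decomposition (25) can be verified by brute force: enumerate all paths to compute the first-passage probabilities f_ij^(k) and compare both sides. The two-state matrix below is a hypothetical example, not taken from the text:

```python
from itertools import product

# Hypothetical two-state chain; p[i][j] are the one-step probabilities.
p = [[0.7, 0.3],
     [0.4, 0.6]]

def n_step(n):
    """The n-step matrix P^(n) (identity for n = 0)."""
    M = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        M = [[sum(M[i][k] * p[k][j] for k in (0, 1)) for j in (0, 1)]
             for i in (0, 1)]
    return M

def first_passage(i, j, n):
    """f_ij^(k), k = 1..n: probability that the chain started at i first
    hits j at time k, found by enumerating all paths of length n."""
    f = [0.0] * (n + 1)
    for path in product((0, 1), repeat=n):
        if j in path:
            k = path.index(j) + 1      # first hitting time of j
            w = p[i][path[0]]
            for s, t in zip(path, path[1:]):
                w *= p[s][t]
            f[k] += w                  # suffix weights sum to one
    return f

# Check (25): p_ij^(n) = sum_{k=1}^n f_ij^(k) p_jj^(n-k).
n, i, j = 8, 0, 1
f = first_passage(i, j, n)
lhs = n_step(n)[i][j]
rhs = sum(f[k] * n_step(n - k)[j][j] for k in range(1, n + 1))
assert abs(lhs - rhs) < 1e-12
```

The decomposition over the first hitting time is exactly the "summation over the time variable" contrasted with (22) and (23) in the text.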

Equation (24) makes it possible to also derive other useful formulas involv-
ing summation with respect to the time variable (in contrast to the Kolmogorov–
Chapman equation). For example, consider the Markov time

τ (α) = min{1 ≤ k ≤ n : Xk = α(k)},

where the (deterministic) function α = α(k), 1 ≤ k ≤ n, and the Markov chain are
such that Pi {τ (α) ≤ n} = 1 (for fixed i and n). Then (24) implies that


Pi {Xn = j} = Σ_{k=1}^n Ei [I(τ (α) = k) EXτ (α) I(Xn−k = j)]
= Σ_{k=1}^n Ei [I(τ (α) = k)] Eα(k) [I(Xn−k = j)],

i.e.,
Pi {Xn = j} = Σ_{k=1}^n Pi {τ (α) = k} Pα(k) {Xn−k = j}

(cf. (23)).
5. Problems
1. Prove that the function ψ(x) = Ex H introduced in the remark in Subsection 1
is E -measurable.
2. Prove (12).
3. Prove (13).
4. Is the independence property of Xn − Xτ ∧n and Xτ ∧n in the example in Subsec-
tion 3 true?
5. Prove (23).

3. Limiting, Ergodic, and Stationary Probability Distributions for Markov Chains

1. As mentioned in Sect. 1, the problem of the asymptotic behavior of memoryless


stochastic systems described by Markov chains is of great importance for the theory
of Markov random processes. One of the reasons for that is the fact that, under very
broad assumptions, the behavior of a Markov system “stabilizes” and the system
reaches a “steady-state” regime.
The limiting behavior of homogeneous Markov chains X = (Xn )n≥0 may be
studied in different aspects. For example, one can explore the Pπ -almost sure con-
vergence of functionals of the form (1/n) Σ_{m=0}^{n−1} f (Xm ) as n → ∞ for various functions
f = f (x), as was done in the ergodic theorem for strict-sense stationary random
sequences (Theorem 3 in Sect. 3, Chap. 5). It is also of interest to investigate the
conditions for the law of large numbers as in Sect. 12, Chap. 1, Vol. 1.
Instead of these types of questions concerning convergence almost sure or in
probability, in our exposition we will be mostly interested in the asymptotic be-
havior of the transition probabilities P(n) (x; A) for n steps as n → ∞ (see (10) in
Sect. 1) and in the existence of nontrivial stationary (invariant) measures q = q(A),
i.e., measures such that q(E) > 0 and

q(A) = ∫_E P(x; A) q(dx), (1)

where P(x; A) is the transition function (for one step).



Let us emphasize that definition (1) does not, in general, presume that q = q(A)
is a probability measure (q(E) = 1).
If this is a probability measure, it is said to be a stationary or invariant dis-
tribution. The meaning of this terminology is clear: If we take q for the initial
distribution π, i.e., assume that Pq {X0 ∈ A} = q(A), then (1) will imply that
Pq {Xn ∈ A} = q(A) for any n ≥ 1, i.e., this distribution remains invariant in
time.
It is easy to come up with an example where there is no stationary distribution
q = q(A), but there are stationary measures.

EXAMPLE. Let X = (Xn )n≥0 be the Markov chain generated by Bernoulli trials,
i.e., Xn+1 = Xn + ξn+1 , where ξ1 , ξ2 , . . . is a sequence of independent identically
distributed random variables with P{ξn = +1} = p, P{ξn = −1} = q. Let X0 = x,
x ∈ {0, ±1, . . . }. It is clear that the transition function here is

P(x; {x + 1}) = p, P(x; {x − 1}) = q.

It is not hard to verify that one of the solutions to (1) is the measure q(A) such that
q({x}) = 1 for any x ∈ {0, ±1, . . . }. If p = q > 0, then q(A) with q({x}) = (p/q)x
is another invariant measure. It is obvious that neither of them is a probability mea-
sure, and there is no invariant probability measure here.
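For this example, Eq. (1) reduces to the pointwise identity q({x}) = q({x−1}) p + q({x+1}) q, x ∈ {0, ±1, . . . }, which can be checked numerically. A Python sketch (the value p = 0.7 and the finite window of states are arbitrary choices made for the illustration):

```python
# Check (up to rounding) that the counting measure q({x}) = 1 and, for
# p != q, the measure q({x}) = (p/q)^x both satisfy the invariance equation
# q({x}) = q({x-1}) * p + q({x+1}) * q  of the Bernoulli random walk.
p = 0.7
q = 1 - p

def invariance_residual(measure, xs):
    """Max |m(x) - (m(x-1)*p + m(x+1)*q)| over the window xs."""
    return max(abs(measure(x) - (measure(x - 1) * p + measure(x + 1) * q))
               for x in xs)

window = range(-10, 11)
counting = lambda x: 1.0            # the counting measure q({x}) = 1
geometric = lambda x: (p / q) ** x  # the measure q({x}) = (p/q)^x

assert invariance_residual(counting, window) < 1e-9
assert invariance_residual(geometric, window) < 1e-9
```

Both residuals vanish up to rounding; neither measure is finite on the whole of E, in agreement with the absence of a stationary distribution here.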

This simple example shows that the existence of a stationary (invariant) distribu-
tion requires certain assumptions about Markov chains.
The main interest in the problem of convergence of transition probabilities
P(n) (x; A) as n → ∞ lies in the existence of a limit that is independent of the initial
state x. We must bear in mind that there may exist no limiting distribution at all, for
example, it may happen that lim P(n) (x; A) = 0 for any A ∈ E and any initial state
x ∈ E. For example, take p = 1 in the preceding example, i.e., consider the deter-
ministic motion to the right. (See also Examples 4 and 5 in Sect. 8; cf. Problem 6 in
Sect. 5.)
Establishing the conditions for the existence of stationary (invariant) distributions
and the convergence of transition probabilities (and obtaining their properties) for
arbitrary phase spaces (E, E ) is a very difficult problem (e.g., [9]). However, in the
case of a countable state space (for “countable Markov chains”), interesting results
in this area admit fairly transparent formulations. They will be presented in Sects. 6
and 7. But before that we will give a detailed classification of the states of countable
Markov chains according to the algebraic and asymptotic properties of transition
probabilities.
Let us point out that the questions concerning stationary distributions and
the existence of the limits limn P(n) (x; A) are closely related. Indeed, if the limit
limn P(n) (x; A) (= ν(A)) exists, does not depend on x, and is a measure (in A ∈ E ),
then we find from the Kolmogorov–Chapman equation

P(n+1)(x; A) = ∫_E P(n)(x; dy) P(y; A)

by (formally) taking the limit as n → ∞ that

ν(A) = ∫_E P(y; A) ν(dy).

Thus ν = ν(A) is then a stationary (invariant) measure.


2. Throughout the sequel, we assume that the Markov chains X = (Xn )n≥0 under
consideration take values in a countable phase space E = {1, 2, . . . }. For simplicity
of notation, we will denote the transition functions P(i, {j}) by pij (i, j ∈ E). The
transition probabilities (of a randomly moving “particle”) from state i to state j for
n steps will be denoted by pij^(n).
We will be interested in obtaining conditions under which the following state-
ments hold true:
A. For all j ∈ E there exist the limits

πj = lim_n pij^(n),

independent of the initial states i ∈ E;
B. These limiting values Π = (π1 , π2 , . . . ) form a probability distribution, i.e.,
πj ≥ 0 and Σ_{j∈E} πj = 1;
C. The Markov chain is ergodic, i.e., the limiting values Π = (π1 , π2 , . . . ) are
such that all πj > 0 and Σ_{j∈E} πj = 1;
D. There exists a unique stationary (invariant) probability distribution Q =
(q1 , q2 , . . . ), i.e., such that qj ≥ 0, Σ_{i∈E} qi = 1, and

qj = Σ_{i∈E} qi pij

for all j ∈ E.

Remark. The term “ergodicity” used here appeared already in Chap. 5 (ergodicity
as a metric transitivity property, the Birkhoff–Khinchin ergodic theorem). Formally,
these terms are related to different objects, but their common feature is that they
reflect the asymptotic behavior of various probabilistic characteristics as the time
parameter goes to infinity.
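Conditions A–D can be watched at work on a small example. The sketch below (in Python; the 2 × 2 matrix is an arbitrary ergodic example, not from the text) computes P^(n) for a large n and checks A–D for the resulting limit:

```python
# For P = [[0.9, 0.1], [0.4, 0.6]] the rows of P^(n) converge to a common
# limit (pi_1, pi_2), which is also the unique stationary distribution.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1], [0.4, 0.6]]
Pn = P
for _ in range(200):                 # P^(n) for a large n
    Pn = mat_mul(Pn, P)

# A: both rows agree, so the limit does not depend on the initial state.
assert abs(Pn[0][0] - Pn[1][0]) < 1e-12
pi = Pn[0]

# B and C: pi is a probability distribution with all entries positive.
assert abs(sum(pi) - 1.0) < 1e-9 and all(x > 0 for x in pi)

# D: pi is stationary, pi_j = sum_i pi_i p_ij.
for j in range(2):
    assert abs(pi[j] - sum(pi[i] * P[i][j] for i in range(2))) < 1e-12
```

Here π ≈ (0.8, 0.2), so this particular chain is ergodic.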

3. Problems.
1. Give examples of Markov chains for which the limits πj = lim_n pij^(n) exist and
(a) are independent of the initial state i and (b) depend on i.
2. Give examples of ergodic and nonergodic Markov chains.
3. Give examples where the stationary distribution is not ergodic.

4. Classification of States of Markov Chains in Terms of
Algebraic Properties of Matrices of Transition Probabilities

1. We will assume that the Markov chain under consideration has a countable set of
states E = {1, 2, . . . } and transition probabilities pij , i, j ∈ E. The matrix of these
probabilities will be denoted by P = ∥pij ∥ or, in expanded form,

    ∥ p11 p12 p13 · · · ∥
    ∥ p21 p22 p23 · · · ∥
P = ∥ . . . . . . . . . ∥
    ∥ pi1 pi2 pi3 · · · ∥
    ∥ . . . . . . . . . ∥

(Sometimes we will write ( · ) instead of ∥ · ∥ to denote matrices.)


In what follows we give the classification of the states of a Markov chain in
terms of the algebraic properties of the matrices of transition probabilities P and
P(n) , n ≥ 1.
The matrix of transition probabilities P completely determines transitions for one
step from one state to another, while transitions for n steps are determined (due to
the Markov property) by the matrices P(n) = ∥pij^(n)∥.

Fig. 36 Inessential and essential states

For example, the matrix

P = ∥ 1/2 1/2 ∥
    ∥  0   1  ∥

and the corresponding graph (Sect. 12, Chap. 1, Vol. 1) show that in the random
walk over states 0 and 1 driven by this matrix, the move 0 → 1 for one step is
possible (with probability 1/2), whereas the move 1 → 0 is impossible. Clearly, the
transition 1 → 0 is impossible for any number of steps, which can be seen from the
structure of the matrices

P(n) = ∥ 2^−n  1 − 2^−n ∥
       ∥  0       1     ∥

showing that p10^(n) = 0 for every n ≥ 1.

State 1 in this example is such that the particle can enter into it (from state 0),
but cannot leave it.
Consider the graph in Fig. 36, from which one can easily recover the transition
matrix P. It is clear from this graph that there are three states (the left-hand part of
the figure) such that leaving any of them, there is no way to return back.
With regard to the future behavior of the “particle” wandering in accordance with
this graph, these three states are inessential because the particle can leave them but
cannot return anymore.
These “inessential” states are of little interest, and we can discard them to focus
our attention on the classification of the remaining “essential” states. (This descrip-
tive definition of “inessential” and “essential” states can be formulated precisely in
terms of the transition probabilities pij^(n), i, j ∈ E, n ≥ 1; see Problem 1.)
2. To classify essential states or groups of such states, we need some definitions.

Definition 1. We say that state j is accessible from state i (notation: i → j) if there
is n ≥ 0 such that pij^(n) > 0 (with pij^(0) = 1 if i = j and 0 if i ≠ j).
States i and j communicate (notation: i ↔ j) if i → j and j → i, i.e., if they are
mutually accessible.

Lemma 1. The property of states to communicate (↔) is an equivalence relation
between the states of a Markov chain with matrix of transition probabilities P.

PROOF. By the definition of the equivalence relation (in this case “↔”), we must
verify that it is reflexive (i ↔ i), symmetric (if i ↔ j, then j ↔ i), and transitive (if
i ↔ j and j ↔ k, then i ↔ k).
The first two properties follow directly from the definition of communicating
states. Transitivity follows from the Kolmogorov–Chapman equation: If pij^(n) > 0
and pjk^(m) > 0, then

pik^(n+m) = Σ_{l∈E} pil^(n) plk^(m) ≥ pij^(n) pjk^(m) > 0,

i.e., i → k. In a similar way, k → i, hence i ↔ k.




We will gather the states i, j, k, . . . that communicate with each other (i ↔ j, j ↔
k, k ↔ i, . . . ) into the same class. Then any two such classes of states either co-
incide or are disjoint. Thus, the relation that two states may communicate induces
a partition of the set of (essential) states E into a finite or countable set of disjoint
classes E1 , E2 , . . . (E = E1 + E2 + . . . ).
These classes will be called indecomposable classes (of essential communicat-
ing) states. A Markov chain whose states form a single indecomposable class will
be said to be indecomposable.
As an illustration we consider the chain with state space E = {1, 2, 3, 4, 5} and
the matrix of transition probabilities
P = ⎛1/3  2/3   0    0    0 ⎞
    ⎜1/4  3/4   0    0    0 ⎟         ⎛P1  0 ⎞
    ⎜ 0    0    0    1    0 ⎟    =    ⎝ 0  P2⎠ .
    ⎜ 0    0   1/2   0   1/2⎟
    ⎝ 0    0    0    1    0 ⎠

The graph of this chain has the form

It is clear that this chain has two indecomposable classes, E1 = {1, 2}, E2 =
{3, 4, 5}, and the investigation of its properties reduces to the investigation of the
two separate chains with state spaces E1 and E2 and transition matrices P1 and P2 .
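The decomposition into the classes E1 = {1, 2} and E2 = {3, 4, 5} can also be recovered mechanically by computing which states are mutually accessible (i → j and j → i). A Python sketch (states renumbered 0–4):

```python
# Recover the communicating classes of the 5-state example from P.
P = [[1/3, 2/3, 0, 0, 0],
     [1/4, 3/4, 0, 0, 0],
     [0,   0,   0, 1, 0],
     [0,   0, 1/2, 0, 1/2],
     [0,   0,   0, 1, 0]]

def reachable(i):
    """States j with i -> j, i.e. p_ij^(n) > 0 for some n >= 0."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v in range(5):
            if P[u][v] > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

R = [reachable(i) for i in range(5)]
classes = {frozenset(j for j in range(5) if j in R[i] and i in R[j])
           for i in range(5)}
assert classes == {frozenset({0, 1}), frozenset({2, 3, 4})}
```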
Now let us consider any indecomposable class E, for example, the one sketched
in Fig. 37.

Fig. 37 Example of a Markov chain with period d = 2

Observe that in this case a return to each state is possible only after an even
number of steps, and a transition to an adjacent state after an odd number. The
transition matrix has a block structure,

P = ⎛ 0    0   1/2  1/2⎞
    ⎜ 0    0   1/2  1/2⎟
    ⎜1/2  1/2   0    0 ⎟ .
    ⎝1/2  1/2   0    0 ⎠
Therefore it is clear that the class E = {1, 2, 3, 4} separates into two subclasses
C0 = {1, 2} and C1 = {3, 4} with the following cyclic property: After one step
from C0 the particle necessarily enters C1 , and from C1 it returns to C0 .
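The cyclic behavior of this example is easy to verify mechanically; a Python sketch with the states renumbered 0–3 (so that C0 = {0, 1} and C1 = {2, 3}):

```python
# One step from C0 always lands in C1, and one step from C1 lands in C0.
P = [[0,   0,   0.5, 0.5],
     [0,   0,   0.5, 0.5],
     [0.5, 0.5, 0,   0],
     [0.5, 0.5, 0,   0]]

C0, C1 = {0, 1}, {2, 3}
for i in range(4):
    targets = {j for j in range(4) if P[i][j] > 0}
    assert targets == (C1 if i in C0 else C0)
```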
3. This example suggests that, in general, it is possible to give a classification of
indecomposable classes into cyclic subclasses.
To this end we will need some definitions and a fact from number theory.

Definition 2. Let ϕ = (ϕ1 , ϕ2 , . . . ) be a sequence of nonnegative numbers ϕn ≥ 0,


n ≥ 1. The period of the sequence ϕ (notation: d(ϕ)) is the number

d(ϕ) = GCD(Mϕ ) = GCD{ n ≥ 1 : ϕn > 0 },

where GCD(Mϕ ) is the greatest common divisor of the set Mϕ of indices n ≥ 1 for
which ϕn > 0; if ϕn = 0 for all n ≥ 1, then Mϕ = ∅, and GCD(Mϕ ) is set to be zero.

In other words, the period of a sequence ϕ is d(ϕ) if n is divisible by d(ϕ)
whenever ϕn > 0 (i.e., n equals d(ϕ)k for some k ≥ 1) and d(ϕ) is the greatest
number d with this property.
For example, the sequence ϕ = (ϕ1 , ϕ2 , . . . ) such that ϕ4k > 0 for k = 1, 2, . . .
and ϕn = 0 for n ≠ 4k has period d(ϕ) = 4 rather than 2, although ϕ2l > 0 for
l = 2, 4, 6, . . . .
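The period of Definition 2 is just a greatest common divisor and is easy to compute; a Python sketch applied to the sequence of the example above (only ϕ_{4k} > 0):

```python
# Period d(phi) = GCD{n >= 1 : phi_n > 0}; set to 0 for an empty support.
from math import gcd
from functools import reduce

def period(phi):
    """GCD of the 1-based indices n with phi[n-1] > 0; 0 if none."""
    support = [n for n, x in enumerate(phi, start=1) if x > 0]
    return reduce(gcd, support, 0)

phi = [1.0 if n % 4 == 0 else 0.0 for n in range(1, 41)]
assert period(phi) == 4          # d(phi) = 4, not 2
assert period([0.0] * 10) == 0   # empty support: period set to zero
```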

Definition 3. A sequence ϕ = (ϕ1 , ϕ2 , . . . ) is aperiodic if its period d(ϕ) = 1.

The following elementary result of number theory will be useful in the sequel for
the classification of states in terms of the cyclicity property.

Lemma 2. Let M be a set of nonnegative integers closed with respect to
addition and such that GCD(M) = 1.
Then there is an n0 such that M contains all numbers n ≥ n0 .
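The lemma can be illustrated numerically; in the sketch below M is generated by {3, 5} (an arbitrary choice of generators with GCD equal to 1), and indeed every n ≥ 8 belongs to M while 7 does not:

```python
# Build the additive closure of a set of generators up to a finite limit.
def closure(generators, limit):
    """Elements of the additive closure of `generators`, up to `limit`."""
    M = {0}
    for n in range(1, limit + 1):
        if any(n - g in M for g in generators):
            M.add(n)
    return M

M = closure({3, 5}, 100)
assert 7 not in M                            # the last excluded number
assert all(n in M for n in range(8, 101))    # here n0 = 8
```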

We will apply this lemma to the set M = Mϕ , taking as the sequence ϕ =
(ϕ1 , ϕ2 , . . . ) the sequence (pjj^(1), pjj^(2), . . . ) or the sequence (pjj^(d), pjj^(2d), . . . ), d ≥ 1,
where j is a state of the Markov chain with a matrix of transition probabilities
P = ∥pij ∥, and pjj^(n) is an element of the matrix P(n) , n ≥ 1, P(1) = P. (We will say
that a state j has period d(j) if d(j) is the period of the sequence (pjj^(1), pjj^(2), . . . ).)
Then we obtain the following result.

Theorem 1. Let a state j have period d = d(j).
If d = 1, then there is n0 = n0 (j) such that the transition probabilities pjj^(n) > 0
for all n ≥ n0 .
If d > 1, then there is n0 = n0 (j, d) such that pjj^(nd) > 0 for all n ≥ n0 .
If d ≥ 1 and pij^(m) > 0 for some i ∈ E and m ≥ 1, then there is n0 = n0 (j, d, m)
such that pij^(m+nd) > 0 for all n ≥ n0 .

Now we state a theorem showing that the periods of the states of an indecompos-
able class are of the same “type.”

Theorem 2. Let E∗ = {i, j, . . . } be an indecomposable class of (communicating)


states of set E.
Then all the states of this class are “of the same type” in the sense that they have
the same period (denoted by d(E∗ )) called the period of class E∗ .

PROOF. Let i, j ∈ E∗ . Then there are k and l such that pij^(k) > 0 and pji^(l) > 0. But,
by the Kolmogorov–Chapman equation, we have then

pii^(k+l) = Σ_{a∈E} pia^(k) pai^(l) ≥ pij^(k) pji^(l) > 0,

hence k + l must be divisible by d(i), the period of the state i ∈ E∗ .
Let d(j) be the period of the state j ∈ E∗ , and let n be such that pjj^(n) > 0. Then n
must be divisible by d(j), and since

pii^(n+k+l) ≥ pij^(k) pjj^(n) pji^(l) > 0,

we obtain that n + k + l is divisible by d(i). But k + l is divisible by d(i), hence n is
divisible by d(i), and since d(j) = GCD{ n : pjj^(n) > 0 }, we have d(i) ≤ d(j).
By symmetry, d(j) ≤ d(i), hence d(i) = d(j).


4. If a set E∗ ⊆ E forms an indecomposable class of (communicating) states and
d(E∗ ) = 1, then it is said to be an aperiodic class of states.
Now we consider the case d(E∗ ) > 1. The transitions within such a class may be
quite freakish (as in the preceding example of a Markov chain with period d(E∗ ) =
2, see Fig. 37). However, there is a cyclic character of the transitions from one group
of states to another.

Fig. 38 Motion among cyclic subclasses

Theorem 3. Let E∗ be an indecomposable class of states, E∗ ⊆ E, with period


d = d(E∗ ) > 1.
Then there are d groups of states C0 , C1 , . . . , Cd−1 , called cyclic subclasses
(E∗ = C0 + C1 + · · · + Cd−1 ), such that at the time instants n = p + kd, with
p = 0, 1, . . . , d − 1 and k = 0, 1, . . . , the “particle” is in the subclass Cp with a
transition at the next time to Cp+1 , then to Cp+2 , . . . , Cd−1 , then from Cd−1 to C0
and so on.

PROOF. Let us fix a state i0 ∈ E∗ and define the following subclasses:

C0 = {j ∈ E∗ : pi0 j^(n) > 0 for some n = kd, k = 0, 1, . . . },
C1 = {j ∈ E∗ : pi0 j^(n) > 0 for some n = kd + 1, k = 0, 1, . . . },
.......................................................
Cd−1 = {j ∈ E∗ : pi0 j^(n) > 0 for some n = kd + (d − 1), k = 0, 1, . . . }.

It is clear that E∗ = C0 + C1 + · · · + Cd−1 . Let us show that the “particle” moves


from one subclass to another following the rule described in the theorem (Fig. 38).
In fact, consider a state i ∈ Cp , and let the state j ∈ E∗ be such that pij > 0. We
will show that then j ∈ C(p+1) (mod d) .
Let n be such that pi0 i^(n) > 0. Then n can be written as n = p + kd with some
k = 0, 1, . . . . Hence n ≡ p (mod d) and n + 1 ≡ (p + 1) (mod d). Since
pi0 j^(n+1) ≥ pi0 i^(n) pij > 0, it follows that j ∈ C(p+1) (mod d) , which was to be
proved.


Let us observe that it now follows that the transition matrix P of an indecompos-
able chain has a block structure: the only nonzero blocks are those describing the
transitions from Cp to C(p+1) (mod d) , p = 0, 1, . . . , d − 1.

Suppose now that the wandering particle whose evolution is driven by matrix P
starts from a state in the subclass C0 . Then at each time n = p + kd this particle will
be (by the definition of subclasses C0 , C1 , . . . , Cd−1 ) in the set Cp .
Therefore with each set Cp of states we can associate a new Markov chain with
transition matrix ∥pij^(d)∥, where i, j ∈ Cp . This new chain will be indecomposable
and aperiodic.
Thus, taking into account the foregoing classification (into inessential and essen-
tial states, indecomposable classes and cyclic subclasses; see Fig. 39), we can draw
the following conclusion:
To investigate the limiting behavior of transition probabilities pij^(n),
n ≥ 1, i, j ∈ E, which determine the motion of the “Markov parti-
cle,” we can restrict our attention to the case where the phase space
E itself is a unique indecomposable and aperiodic class of states.
In this case, the Markov chain X = (Xn )n≥0 itself with such a phase space and
the matrix of transition probabilities P is called indecomposable and aperiodic.

Fig. 39 Classification of states of a Markov chain in terms of arithmetic properties of the probabilities pij^(n)

5. Problems
1. Give an accurate formulation, in terms of the transition probabilities pij^(n), i, j ∈ E,
n ≥ 1, of the descriptive definition of inessential and essential states stated at
the end of Subsection 1.
2. Let P be the matrix of transition probabilities of an indecomposable Markov
chain with finitely many states, and let P^2 = P. Explore the structure of P.
3. Let P be the matrix of transition probabilities of a finite Markov chain X =
(Xn )n≥0 . Let σ1 , σ2 , . . . be a sequence of independent identically distributed
nonnegative integer-valued random variables independent of X, and let τ0 = 0,
τn = σ1 + · · · + σn , n ≥ 1. Show that the sequence X̃ = (X̃n )n≥0 with X̃n = Xτn
is a Markov chain. Find the matrix P̃ of transition probabilities for this chain.
Show that if states i and j communicate for the chain X, they do so for X̃.
4. Consider a Markov chain with two states, E = {0, 1}, and the matrix of transi-
tion probabilities

P = ⎛ α   1−α ⎞
    ⎝1−β   β ⎠ ,    0 < α < 1, 0 < β < 1.

Describe the structure of the matrices P(n) , n ≥ 2.
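A sketch of what Problem 4 yields (the closed form below is a standard diagonalization computation, not stated in the text): P has eigenvalues 1 and λ = α + β − 1, so that p00^(n) = π0 + (1 − π0)λ^n with π0 = (1 − β)/(2 − α − β), and similarly for the other entries. A numerical check in Python with arbitrary α, β:

```python
# Verify the closed form p_00^(n) = pi_0 + (1 - pi_0) * lam**n against
# direct matrix powers of the two-state transition matrix.
alpha, beta = 0.7, 0.5
P = [[alpha, 1 - alpha], [1 - beta, beta]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

lam = alpha + beta - 1
pi0 = (1 - beta) / (2 - alpha - beta)

Pn = P
for n in range(1, 10):
    assert abs(Pn[0][0] - (pi0 + (1 - pi0) * lam ** n)) < 1e-12
    Pn = mat_mul(Pn, P)      # Pn becomes P^(n+1)
```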

5. Classification of States of Markov Chains in Terms of
Asymptotic Properties of Transition Probabilities

1. Let X = (Xn )n≥0 be a homogeneous Markov chain with countable state space
E = {1, 2, . . . } and transition probabilities pij = Pi {X1 = j}, i, j ∈ E.
Let
fii^(n) = Pi {Xn = i, Xk ≠ i, 1 ≤ k ≤ n − 1} (1)

and (for i ≠ j)

fij^(n) = Pi {Xn = j, Xk ≠ j, 1 ≤ k ≤ n − 1}. (2)

It is clear that fii^(n) is the probability of first return to state i at time n, while fij^(n) is
the probability of first arrival at state j at time n, provided that X0 = i.
If we set
σi (ω) = min{ n ≥ 1 : Xn (ω) = i } (3)
with σi (ω) = ∞ when the set in (3) is empty, then the probabilities fii^(n) and fij^(n) can
be represented as

fii^(n) = Pi {σi = n}, fij^(n) = Pi {σj = n}. (4)

For i, j ∈ E define the quantities

fij = Σ_{n=1}^∞ fij^(n). (5)

It is seen from (4) that


fij = Pi {σj < ∞}. (6)
In other words, fij is the probability that the “particle” leaving state i will ultimately
arrive at state j.
In the sequel, of special importance is the probability fii that the “particle” leav-
ing state i will ultimately return to this state. These probabilities are used in the
following definitions.
Definition 1. A state i ∈ E is recurrent if fii = 1.
Definition 2. A state i ∈ E is transient if fii < 1.
There are the following conditions for recurrence and transience.
Theorem 1. (a) The state i ∈ E is recurrent if and only if either of the following two
conditions is satisfied:

Pi {Xn = i i. o.} = 1 or Σ_n pii^(n) = ∞.

(b) The state i ∈ E is transient if and only if either of the following two conditions
is satisfied:

Pi {Xn = i i. o.} = 0 or Σ_n pii^(n) < ∞.

Therefore, according to this theorem,

fii = 1 ⇐⇒ Pi {Xn = i i. o.} = 1 ⇐⇒ Σ_n pii^(n) = ∞, (7)
fii < 1 ⇐⇒ Pi {Xn = i i. o.} = 0 ⇐⇒ Σ_n pii^(n) < ∞. (8)

Remark. Recall that, according to Table 2.1 in Sect. 1, Chap. 2, Vol. 1, the event
{Xn = i i. o.} is the set of outcomes ω for which Xn (ω) = i for infinitely many
indices n. If we use the notation An = {ω : Xn (ω) = i}, then {Xn = i i. o.} =
∩_{n=1}^∞ ∪_{k=n}^∞ Ak ; see the table mentioned earlier.
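For the simple random walk on the integers one has p00^(2n) = C(2n, n)(pq)^n, and the criterion (7)–(8) can be checked numerically. A Python sketch (the asymmetric value p = 0.7 is an arbitrary choice); terms are generated by the ratio p00^(2n)/p00^(2(n−1)) = (2n)(2n − 1)(pq)/n² to avoid overflow:

```python
# Partial sums of sum_n p_00^(n) for the simple random walk on Z.
def partial_sum(p, N):
    """sum_{n=1}^{N} p_00^(2n), computed via the stable term ratio."""
    q, a, s = 1 - p, 1.0, 0.0      # a = p_00^(2n), starting at n = 0
    for n in range(1, N + 1):
        a *= (2 * n) * (2 * n - 1) / (n * n) * p * q
        s += a
    return s

assert partial_sum(0.5, 40000) > 100                # diverges like sqrt(N)
assert abs(partial_sum(0.7, 4000) - 1.5) < 1e-9     # converges: transient
```

For p = q = 1/2 the partial sums grow without bound (state 0 is recurrent), while for p = 0.7 the full series equals 1/|p − q| − 1 = 1.5, so state 0 is transient; by formula (13) below, f00 = 1.5/(1 + 1.5) = 0.6 = 1 − |p − q|.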

PROOF. We can observe immediately that the implication

Σ_n pii^(n) < ∞ =⇒ Pi {Xn = i i. o.} = 0 (9)

follows from the Borel–Cantelli lemma since pii^(n) = Pi {Xn = i} (see statement (a)
of this lemma, Sect. 10, Chap. 2, Vol. 1).
Let us show that

fii = 1 ⇐⇒ Σ_n pii^(n) = ∞. (10)

The Markov property and homogeneity imply that for any collections (i1 , . . . , ik )
and (j1 , . . . , jn )

Pi {(X1 , . . . , Xk ) = (i1 , . . . , ik ), (Xk+1 , . . . , Xk+n ) = (j1 , . . . , jn )}
  = Pi {(X1 , . . . , Xk ) = (i1 , . . . , ik )} Pik {(X1 , . . . , Xn ) = (j1 , . . . , jn )}.

This implies at once that (cf. derivation of (38) in Sect. 12, Chap. 1, Vol. 1 and (25)
in Sect. 2 of this chapter)

pij^(n) = Pi {Xn = j} = Σ_{k=0}^{n−1} Pi {X1 ≠ j, . . . , Xn−k−1 ≠ j, Xn−k = j, Xn = j}
  = Σ_{k=0}^{n−1} Pi {X1 ≠ j, . . . , Xn−k−1 ≠ j, Xn−k = j} Pj {Xk = j}
  = Σ_{k=0}^{n−1} fij^(n−k) pjj^(k) = Σ_{k=1}^{n} fij^(k) pjj^(n−k).

Thus

pij^(n) = Σ_{k=1}^{n} fij^(k) pjj^(n−k). (11)
Letting j = i we find that (with pii^(0) = 1)

Σ_{n=1}^∞ pii^(n) = Σ_{n=1}^∞ Σ_{k=1}^{n} fii^(k) pii^(n−k) = Σ_{k=1}^∞ fii^(k) Σ_{n=k}^∞ pii^(n−k) = fii Σ_{n=0}^∞ pii^(n)
  = fii [1 + Σ_{n=1}^∞ pii^(n)]. (12)
Hence we see that

Σ_{n=1}^∞ pii^(n) < ∞ =⇒ fii = [Σ_{n=1}^∞ pii^(n)] / [1 + Σ_{n=1}^∞ pii^(n)]. (13)

Let now Σ_{n=1}^∞ pii^(n) = ∞. Then

Σ_{n=1}^{N} pii^(n) = Σ_{n=1}^{N} Σ_{k=1}^{n} fii^(k) pii^(n−k) = Σ_{k=1}^{N} fii^(k) Σ_{n=k}^{N} pii^(n−k) ≤ Σ_{k=1}^{N} fii^(k) Σ_{l=0}^{N} pii^(l),

hence

fii = Σ_{k=1}^∞ fii^(k) ≥ Σ_{k=1}^{N} fii^(k) ≥ [Σ_{n=1}^{N} pii^(n)] / [Σ_{l=0}^{N} pii^(l)] → 1, N → ∞.

Thus,

Σ_{n=1}^∞ pii^(n) = ∞ =⇒ fii = 1. (14)

The implications (13) and (14) imply the following equivalences:

Σ_{n=1}^∞ pii^(n) < ∞ ⇐⇒ fii < 1, (15)
Σ_{n=1}^∞ pii^(n) = ∞ ⇐⇒ fii = 1. (16)

To complete the proof, it remains to show that

fii < 1 ⇐⇒ Pi {Xn = i i. o.} = 0, (17)


fii = 1 ⇐⇒ Pi {Xn = i i. o.} = 1. (18)

These properties are easily comprehensible from an intuitive point of view. For
example, if fii = 1, then Pi {σi < ∞} = 1, i.e., the “particle” sooner or later
will return to the same state i from where it started its motion. But then, by the
strong Markov property, the “life of the particle” starts anew from this (random)
time. Continuing this reasoning, we obtain that the events {Xn = i} will occur for
infinitely many indices n, i.e., Pi {Xn = i i.o.} = 1.
Let us give a formal proof of (17) and (18). For a given state i ∈ E, consider the
probability that the number of returns to i is greater than or equal to m. We claim
that this probability is equal to (fii )m .
Indeed, for m = 1 this follows from the definition of fii . Suppose that our claim
is true for m − 1. We will show that the probability of interest is then equal to (fii )m .

By the strong Markov property (see (8) in Sect. 2) and since {σi = k} ∈ Fσi , we
find

Pi (the number of returns to i is at least m)
  = Σ_{k=1}^∞ Pi (σi = k and there are at least m − 1 returns to i after time k)
  = Σ_{k=1}^∞ Pi {σi = k} Pi (at least m − 1 of Xσi +1 , Xσi +2 , . . . are equal to i | σi = k)
  = Σ_{k=1}^∞ Pi {σi = k} Pi (at least m − 1 of X1 , X2 , . . . are equal to i)
  = Σ_{k=1}^∞ fii^(k) (fii )^{m−1} = fii (fii )^{m−1} = (fii )^m .

This implies that

Pi {Xn = i i. o.} = lim_{m→∞} (fii )^m = { 1, if fii = 1; 0, if fii < 1. (19)

Using the notation A = {An i.o.} (= lim sup An ), where An = {Xn = i}, we see
from (19) that Pi (A) obeys the “0 or 1 law,” i.e., Pi (A) can take only two values 0
or 1. (Note that this property does not follow directly from statements (a) and (b) of
the Borel–Cantelli lemma (Sect. 10, Chap. 2, Vol. 1) since the events An , n ≥ 1, are,
in general, dependent.)
Equation (19) and the property that Pi (A) can take only the values 0 and 1 imply
the required implications in (17) and (18).


2. The theorem just proved implies the following simple, but important, property of
transient states.

Theorem 2. If a state j is transient, then for any i ∈ E

Σ_{n=1}^∞ pij^(n) < ∞, (20)

hence, for any i ∈ E,

pij^(n) → 0, n → ∞. (21)

PROOF. We have from (11) (with pjj^(0) = 1)

Σ_{n=1}^∞ pij^(n) = Σ_{n=1}^∞ Σ_{k=1}^{n} fij^(k) pjj^(n−k) = Σ_{k=1}^∞ fij^(k) Σ_{n=0}^∞ pjj^(n) = fij Σ_{n=0}^∞ pjj^(n) ≤ Σ_{n=0}^∞ pjj^(n) < ∞,

where we have used that fij = Σ_{k=1}^∞ fij^(k) ≤ 1 (being the probability that the particle
leaving state i will ultimately arrive at state j).
Property (21) obviously follows from (20).


3. Now we proceed to recurrent states.
Every recurrent state i ∈ E can be classified according to whether the average
time of return to this state

μi = Σ_{n=1}^∞ n fii^(n) (= Ei σi ) (22)

is finite or infinite. (Recall that, according to (1), fii^(n) is the probability of the first
return in exactly n steps.)

Definition 3. Let us say that a recurrent state i ∈ E is positive if

μi = Σ_{n=1}^∞ n fii^(n) < ∞ (23)

and null if

μi = Σ_{n=1}^∞ n fii^(n) = ∞. (24)

Hence, according to this definition, the first return to a null (recurrent) state re-
quires (on average) infinite time. By contrast, the average time of first return to a
positive (recurrent) state is finite.
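For the two-state chain of Problem 4 in Sect. 4 the mean return time can be computed explicitly: f00^(1) = α and f00^(n) = (1 − α)β^(n−2)(1 − β) for n ≥ 2, and the sum μ0 = Σ n f00^(n) equals 1/π0 with π0 = (1 − β)/(2 − α − β). A Python sketch with arbitrary values α = 0.9, β = 0.6 (the closed forms here are this sketch's own computation, not formulas from the text):

```python
# Mean return time mu_0 = sum_n n f_00^(n) for the two-state chain
# P = [[a, 1-a], [1-b, b]]: first return at time 1 with probability a,
# at time n >= 2 with probability (1-a) * b**(n-2) * (1-b).
a, b = 0.9, 0.6

mu0 = a + sum(n * (1 - a) * b ** (n - 2) * (1 - b) for n in range(2, 2000))
pi0 = (1 - b) / (2 - a - b)
assert abs(mu0 - 1 / pi0) < 1e-9     # state 0 is positive recurrent
```

Here μ0 = 1.25, a finite value, so state 0 is positive recurrent; truncating the series at n = 2000 is harmless because the tail is geometrically small.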
4. The following figure illustrates the classification of states of a Markov chain in
terms of recurrence and transience, and positive and null recurrence (Fig. 40).

Fig. 40 Classification of states of a Markov chain in terms of asymptotic properties of the probabilities pii^(n)

5. Theorem 3. Let a state j ∈ E of a Markov chain be recurrent and aperiodic
(d(j) = 1).
Then for any i ∈ E

pij^(n) → fij /μj , n → ∞. (25)

If, moreover, states i and j communicate (i ↔ j), i.e., belong to the same inde-
composable class, then

pij^(n) → 1/μj , n → ∞. (26)
The proof given below will rely on Lemma 1, which is one of the key results of
“discrete renewal theory.” For another proof of Theorem 3, based on the concept of
coupling (Sect. 8, Chap. 3, Vol. 1), see, for example, [9, 35].

Lemma 1. (From “discrete renewal theory.”) Let ϕ = (ϕ1 , ϕ2 , . . . ) be an aperiodic


sequence (d(ϕ) = 1) of nonnegative numbers and u = (u0 , u1 , . . . ) a sequence
constructed by the following recurrence rule: u0 = 1 and for every n ≥ 1

un = ϕ1 un−1 + ϕ2 un−2 + · · · + ϕn u0 . (27)

Then
un → μ^−1

as n → ∞, where μ = Σ_{n=1}^∞ nϕn .

For the proof, see, for example, [25, XIII.10].
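The lemma can be seen in action for a concrete aperiodic sequence. With ϕn = 2^(−n) (so μ = Σ n 2^(−n) = 2) the recursion (27) in fact gives un = 1/2 = μ^(−1) already from n = 1, since the geometric distribution is memoryless. A Python sketch:

```python
# Iterate the renewal recursion (27): u_0 = 1,
# u_n = phi_1 u_{n-1} + ... + phi_n u_0, with phi_n = 2**(-n), mu = 2.
N = 60
phi = [2.0 ** (-n) for n in range(1, N + 1)]   # phi_1, ..., phi_N

u = [1.0]                                      # u_0 = 1
for n in range(1, N + 1):
    u.append(sum(phi[k - 1] * u[n - k] for k in range(1, n + 1)))

mu = sum(n * 2.0 ** (-n) for n in range(1, 200))
assert abs(mu - 2.0) < 1e-12
assert abs(u[-1] - 0.5) < 1e-6                 # u_n -> 1/mu = 1/2
```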

PROOF OF THEOREM 3. First, let us show for i = j that

pjj^(n) → 1/μj , n → ∞. (28)

To this end, we rewrite (11) (for i = j) as

pjj^(n) = fjj^(1) pjj^(n−1) + fjj^(2) pjj^(n−2) + · · · + fjj^(n) pjj^(0), (29)

where we set pjj^(0) = 1 and, obviously, fjj^(1) = pjj^(1). Letting

uk = pjj^(k), ϕk = fjj^(k), (30)

(29) becomes
un = ϕ1 un−1 + ϕ2 un−2 + · · · + ϕn u0 ,
which is exactly the recurrence formula (27).
The required result (28) will follow directly from Lemma 1 if we show that the
period df (j) of the sequence (fjj^(1), fjj^(2), . . . ) is equal to 1, provided that the period
of (pjj^(1), pjj^(2), . . . ) is 1.

This, in turn, follows from the following general result.



Lemma 2. For any j ∈ E

GCD{ n ≥ 1 : pjj^(n) > 0 } = GCD{ n ≥ 1 : fjj^(n) > 0 }, (31)

i.e., the periods df (j) and d(j) are the same.

PROOF. Let
M = { n : pjj^(n) > 0 } and Mf = { n : fjj^(n) > 0 }.

Since Mf ⊆ M, we have
GCD(M) ≤ GCD(Mf ),
i.e., d(j) ≤ df (j).
The reverse inequality follows from the probabilistic meaning of pjj^(n) and
fjj^(n), n ≥ 1.
If the “particle” leaving state j arrives at this state in n steps again (pjj^(n) > 0), this
means that it returned to j for the first time in k1 steps (fjj^(k1) > 0), then in k2 further
steps (fjj^(k2) > 0), . . . , and finally in kl further steps (fjj^(kl) > 0), where n = k1 + k2 + · · · + kl .
Each of the numbers k1 , k2 , . . . , kl is divisible by df (j); hence so is n. Thus df (j) is a
common divisor of the set M, and therefore df (j) ≤ GCD(M) = d(j), i.e., d(j) ≥ df (j).
Thus d(j) = df (j). This, by the way, means that instead of defining the period
d(j) of a state j by the formula d(j) = GCD{ n ≥ 1 : pjj^(n) > 0 }, we could also define
it by d(j) = GCD{ n ≥ 1 : fjj^(n) > 0 }.
The proof of Lemma 2 is completed.


Now we proceed to the proof of (25) for i ≠ j. Rewrite (11) in the form

pij^(n) = Σ_{k=1}^∞ fij^(k) pjj^(n−k), (32)

where we set pjj^(l) = 0 for l < 0.
Since pjj^(n) → 1/μj here and Σ_{k=1}^∞ fij^(k) ≤ 1, we have, by the dominated convergence
theorem (Theorem 3 in Sect. 6, Chap. 2, Vol. 1),

lim_n Σ_{k=1}^∞ fij^(k) pjj^(n−k) = Σ_{k=1}^∞ fij^(k) lim_n pjj^(n−k) = (1/μj ) Σ_{k=1}^∞ fij^(k) = fij /μj . (33)

Now (32) and (33) imply that

lim_n pij^(n) = fij /μj , (34)

i.e., (25) holds true.



Finally, we will show that under the additional assumption i ↔ j (i.e., i, j belong
to the same indecomposable class of communicating states), we have fij = 1. Then
(34) will imply property (26).
State j is recurrent by assumption. Therefore Pj {Xn = j i. o.} = 1 by statement
(a) of Theorem 1. Hence for any m

pji^(m) = Pj ({Xm = i} ∩ {Xn = j i. o.})
  ≤ Σ_{n>m} Pj {Xm = i, Xm+1 ≠ j, . . . , Xn−1 ≠ j, Xn = j}
  = Σ_{n>m} pji^(m) fij^(n−m) = pji^(m) fij , (35)

where the next-to-last equality follows from the generalized Markov property (see
(2) in Sect. 2).
Since E is a class of communicating states, there is m such that pji^(m) > 0. There-
fore (35) implies that fij = 1.
The proof of Theorem 3 is completed.
6. It is natural to state an analog of Theorem 3 for an arbitrary period d of the state j
of interest (d = d(j) ≥ 1).
Theorem 4. Let the state j ∈ E of a Markov chain be recurrent with period d =
d(j) ≥ 1, and let i be a state in E (possibly coinciding with j).
(a) Suppose that i and j are in the same indecomposable class C ⊆ E with
(cyclic) subclasses C0 , C1 , . . . , Cd−1 numbered so that j ∈ C0 , i ∈ Ca , where
a ∈ {0, 1, . . . , d − 1}, and the motion over them goes in cyclic order, C0 → C1 →
· · · → Ca → · · · → Cd−1 → C0 . Then

pij^(nd+a) → d/μj as n → ∞. (36)

(b) In the general case, when i and j may belong to different indecomposable
classes,

pij^(nd+a) → (d/μj ) Σ_{k=0}^∞ fij^(kd+a) as n → ∞ (37)

for any a = 0, 1, . . . , d − 1.
PROOF. (a) At first, let a = 0, i.e., i and j belong to the same indecomposable
class C and, moreover, to the same cyclic subclass C0 .
Consider the transition probabilities pij^(d), i, j ∈ C0 , and arrange from them a new
Markov chain (according to the constructions from Sect. 1).
For this new chain the state j will be recurrent and aperiodic, and the states i and j
remain communicating (i ↔ j). Therefore, by property (26) of Theorem 3,

pij^(nd) → 1 / [Σ_{k=1}^∞ k fjj^(kd)] = d / [Σ_{k=1}^∞ (kd) fjj^(kd)] = d/μj ,

where the last equality holds because fjj^(l) = 0 for all l not divisible by d and
μj = Σ_{l=1}^∞ l fjj^(l) by definition.
Assume now that (36) has been proved for a = 0, 1, . . . , r (≤ d − 2). By the
dominated convergence theorem (Theorem 3 in Sect. 6, Chap. 2, Vol. 1),

pij^(nd+r+1) = Σ_{k=1}^∞ pik pkj^(nd+r) → Σ_{k=1}^∞ pik (d/μj ) = d/μj .

Therefore (36) is true for a = r + 1 (≤ d − 1), hence it is established by induction
for all a = 0, 1, . . . , d − 1.
(b) For all i and j in E we have (see (11))

pij^(nd+a) = Σ_{k=1}^{nd+a} fij^(k) pjj^(nd+a−k), a = 0, 1, . . . , d − 1.

By assumption, the period of j is d. Hence pjj^(nd+a−k) = 0 unless k − a has the form
rd. Therefore

pij^(nd+a) = Σ_{r=0}^{n} fij^(rd+a) pjj^((n−r)d).

Using this equality and (36) and applying again the dominated convergence the-
orem, we arrive at the required relation (37).


7. As was pointed out at the end of Sect. 4, in the problem of classifying Markov
chains in terms of asymptotic properties of transition probabilities, we can restrict
ourselves to aperiodic indecomposable chains.
The results of Theorems 1–4 actually contain all that we need for the complete
classification of such chains.
The following lemma is one of the results saying that for an indecomposable
chain all states are of the same (recurrent or transient) type. (Compare with the
property that the states are “of the same type” in Theorem 2, Sect. 4.)

Lemma 3. Let E be an indecomposable class (of communicating states). Then all


its states are either recurrent or transient.

PROOF. Let the chain have at least one transient state, say, state i. By Theorem 1,
Σ_n pii^(n) < ∞.
Now let j be another state. Since E is an indecomposable class of communicating
states (i ↔ j), there are k and l such that pij^(k) > 0 and pji^(l) > 0. The obvious
inequality
pii^(n+k+l) ≥ pij^(k) pjj^(n) pji^(l)
implies now that

Σ_n pii^(n+k+l) ≥ pij^(k) pji^(l) Σ_n pjj^(n).

By assumption, Σ_n pii^(n) < ∞ and k, l satisfy pij^(k) pji^(l) > 0. Hence Σ_n pjj^(n) < ∞.
By statement (b) of Theorem 1, this implies that j is also a transient state. In other
words, if at least one state of an indecomposable chain is transient, then so are all
other states.
Now let i be a recurrent state. We will show that all the other states are then re-
current. Suppose that (along with the recurrent state i) there is at least one transient
state. Then, by what has been proved, all other states must be transient, which con-
tradicts the assumption that i is a recurrent state. Thus the presence of at least one
recurrent state implies that all other states (of an indecomposable chain) are also
recurrent.


This lemma justifies the commonly used terminology of saying about an inde-
composable chain (rather than about a single state) that it is recurrent or transient.

Theorem 5. Let a Markov chain consist of a single indecomposable class E of ape-


riodic states. Then only one of three possibilities may occur.
(i) The chain is transient. In this case

lim_n pij^(n) = 0

for any i, j ∈ E, with convergence to zero rather “fast” in the sense that

Σ_n pij^(n) < ∞.

(ii) The chain is recurrent and null. In this case, again,
$$\lim_n p_{ij}^{(n)} = 0$$
for any $i, j \in E$, but the convergence is "slow" in the sense that
$$\sum_n p_{ij}^{(n)} = \infty,$$
and the average time $\mu_j$ of first return from $j$ to $j$ is infinite.

(iii) The chain is recurrent and positive. In this case
$$\lim_n p_{ij}^{(n)} = \frac{1}{\mu_j} > 0$$
for all $i, j \in E$, where $\mu_j$, the average time of return from $j$ to $j$, is finite.
PROOF. Statement (i) has been proved in Theorems 1 (b) and 2. Statements (ii) and
(iii) follow directly from Theorems 1 (a) and 3.


Consider now the case of finite Markov chains, i.e., the case where the state set E consists of finitely many elements.
It turns out that in this case only the third of the three options (i), (ii), (iii) of Theorem 5 is possible.
Theorem 6. Let a finite Markov chain be indecomposable and aperiodic. Then this chain is recurrent and positive, and $\lim_n p_{ij}^{(n)} = \frac{1}{\mu_j} > 0$.
PROOF. Suppose that the chain is transient. If the state space consists of $r$ states ($E = \{1, 2, \ldots, r\}$), then
$$\lim_n \sum_{j=1}^{r} p_{ij}^{(n)} = \sum_{j=1}^{r} \lim_n p_{ij}^{(n)}. \qquad (38)$$
Obviously, the left-hand side is equal to 1. But the assumption that the chain is transient implies (by Theorem 1 (i)) that the right-hand side is zero.
Suppose now that the states of the chain are recurrent.
Since by Theorem 5 there remain only two options, (ii) and (iii), we must exclude (ii). But since $\lim_n p_{ij}^{(n)} = 0$ for all $i, j \in E$ in that case, we arrive at a contradiction using (38) in the same way as in the case of transient states.
Thus only (iii) is possible.
9. Problems
1. Consider an indecomposable chain with state space $\{0, 1, 2, \ldots\}$. This chain is transient if and only if the system of equations $u_j = \sum_i u_i p_{ij}$, $j = 0, 1, \ldots$, has a bounded solution such that $u_i \not\equiv c$, $i = 0, 1, \ldots$.
2. A sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent is that there exists a sequence $(u_0, u_1, \ldots)$ with $u_i \to \infty$, $i \to \infty$, such that $u_j \ge \sum_i u_i p_{ij}$ for all $j \neq 0$.
3. A necessary and sufficient condition for an indecomposable chain with states $0, 1, \ldots$ to be recurrent and positive is that the system of equations $u_j = \sum_i u_i p_{ij}$, $j = 0, 1, \ldots$, has a solution, not identically zero, such that $\sum_i |u_i| < \infty$.
4. Consider a Markov chain with states 0, 1, 2, . . . and transition probabilities $p_{00} = r_0$, $p_{01} = p_0 > 0$, and, for $i \ge 1$,
$$p_{ij} = \begin{cases} p_i > 0, & j = i + 1, \\ r_i \ge 0, & j = i, \\ q_i > 0, & j = i - 1, \\ 0 & \text{otherwise}. \end{cases}$$
Let $\rho_0 = 1$, $\rho_m = (q_1 \ldots q_m)/(p_1 \ldots p_m)$. Prove the following propositions:
$$\text{The chain is recurrent} \iff \sum \rho_m = \infty;$$
$$\text{the chain is transient} \iff \sum \rho_m < \infty;$$
$$\text{the chain is positive} \iff \sum \rho_m = \infty, \ \sum \frac{1}{p_m \rho_m} < \infty;$$
$$\text{the chain is null} \iff \sum \rho_m = \infty, \ \sum \frac{1}{p_m \rho_m} = \infty.$$
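The four classification criteria in Problem 4 are convenient to probe numerically for concrete birth-and-death chains. The sketch below is only a heuristic illustration, not a proof: the helper name, the number of terms, and the divergence threshold are ad hoc choices, and partial sums stand in for the infinite series.

```python
def classify_birth_death(p, q, terms=1500, big=1000.0):
    """Classify the birth-and-death chain of Problem 4 from partial sums of
    rho_m = (q_1 ... q_m)/(p_1 ... p_m) and of 1/(p_m rho_m).

    A partial sum below `big` is treated as convergent and one above it as
    divergent -- a crude numerical stand-in for the exact criteria.
    """
    rho = 1.0                # rho_0 = 1
    sum_rho = rho            # partial sum of rho_m
    sum_inv = 1.0 / p(0)     # partial sum of 1/(p_m rho_m)
    for m in range(1, terms + 1):
        rho *= q(m) / p(m)
        sum_rho += rho
        if 0.0 < rho < float("inf"):   # skip terms lost to over/underflow
            sum_inv += 1.0 / (p(m) * rho)
    if sum_rho < big:
        return "transient"             # sum rho_m converges
    if sum_inv < big:
        return "positive recurrent"    # sum rho_m diverges, sum 1/(p_m rho_m) converges
    return "null recurrent"            # both series diverge
```

For constant probabilities $p_i \equiv p$, $q_i \equiv q$ this reproduces the familiar trichotomy: transient for $p > q$, null recurrent for $p = q$, positive recurrent for $p < q$.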
5. Show that
$$f_{ik} \ge f_{ij} f_{jk}, \qquad \sup_n p_{ij}^{(n)} \le f_{ij} \le \sum_{n=1}^{\infty} p_{ij}^{(n)}.$$
6. Show that for any Markov chain with countable state space the limits of $p_{ij}^{(n)}$ always exist in the Cesàro sense:
$$\lim_n \frac{1}{n} \sum_{k=1}^{n} p_{ij}^{(k)} = \frac{f_{ij}}{\mu_j}.$$
7. Consider a Markov chain $\xi_0, \xi_1, \ldots$ with $\xi_{k+1} = (\xi_k - 1)^+ + \eta_{k+1}$, $k \ge 0$, where $\eta_1, \eta_2, \ldots$ is a sequence of independent identically distributed random variables with $P(\eta_k = j) = p_j$, $j = 0, 1, \ldots$. Write out the transition matrix and show that if $p_0 > 0$, $p_0 + p_1 < 1$, the chain is recurrent if and only if $\sum_k k p_k \le 1$.

6. Limiting, Stationary, and Ergodic Distributions for Countable Markov Chains

1. We begin with a general result clarifying the relationship between the limits $\Pi = (\pi_1, \pi_2, \ldots)$, where $\pi_j = \lim_n p_{ij}^{(n)}$, $j = 1, 2, \ldots$, and stationary distributions $Q = (q_1, q_2, \ldots)$.
Theorem 1. Consider a Markov chain with a countable state space $E = \{1, 2, \ldots\}$ and transition probabilities $p_{ij}$, $i, j \in E$, such that the limits
$$\pi_j = \lim_n p_{ij}^{(n)}, \quad j \in E,$$
exist and are independent of the initial states $i \in E$. Then:
(a) $\sum_{j=1}^{\infty} \pi_j \le 1$ and $\sum_{i=1}^{\infty} \pi_i p_{ij} = \pi_j$, $j \in E$;
(b) Either $\sum_{j=1}^{\infty} \pi_j = 0$ (hence all $\pi_j = 0$, $j \in E$) or $\sum_{j=1}^{\infty} \pi_j = 1$;
(c) If $\sum_{j=1}^{\infty} \pi_j = 0$, then the Markov chain has no stationary distributions; and if $\sum_{j=1}^{\infty} \pi_j = 1$, then the vector of limiting values $\Pi = (\pi_1, \pi_2, \ldots)$ is a stationary distribution for this chain, and the chain has no other stationary distribution.
PROOF. We have
$$\sum_{j=1}^{\infty} \pi_j = \sum_{j=1}^{\infty} \lim_n p_{ij}^{(n)} \le \liminf_n \sum_{j=1}^{\infty} p_{ij}^{(n)} = 1, \qquad (1)$$
and, for any $j \in E$, $k \in E$,
$$\sum_{i=1}^{\infty} \pi_i p_{ij} = \sum_{i=1}^{\infty} \lim_n p_{ki}^{(n)} p_{ij} \le \liminf_n \sum_{i=1}^{\infty} p_{ki}^{(n)} p_{ij} = \liminf_n p_{kj}^{(n+1)} = \pi_j. \qquad (2)$$
Remark. Note that the inequalities and lower limits appear here, of course, due to
Fatou’s lemma, which is applied to Lebesgue’s integral over a σ-finite (nonnegative)
measure rather than a probability measure as in Sect. 6, Chap. 2, Vol. 1.

Thus the vector $\Pi = (\pi_1, \pi_2, \ldots)$ satisfies
$$\sum_{j=1}^{\infty} \pi_j \le 1 \quad \text{and} \quad \sum_{i=1}^{\infty} \pi_i p_{ij} \le \pi_j, \quad j \in E. \qquad (3)$$
Let us show that the latter inequality is in fact an equality. Suppose that for some $j_0 \in E$
$$\sum_{i=1}^{\infty} \pi_i p_{ij_0} < \pi_{j_0}. \qquad (4)$$
Then
$$\sum_{j=1}^{\infty} \pi_j > \sum_{j=1}^{\infty} \sum_{i=1}^{\infty} \pi_i p_{ij} = \sum_{i=1}^{\infty} \pi_i \sum_{j=1}^{\infty} p_{ij} = \sum_{i=1}^{\infty} \pi_i.$$
The contradiction thus obtained shows that $\sum_{i=1}^{\infty} \pi_i p_{ij} = \pi_j$. Together with the inequality $\sum_{j=1}^{\infty} \pi_j \le 1$, this proves conclusion (a).
For the proof of (b), we iterate the equality $\sum_{i=1}^{\infty} \pi_i p_{ij} = \pi_j$ to obtain
$$\sum_{i=1}^{\infty} \pi_i p_{ij}^{(n)} = \pi_j$$
for any $n \ge 1$ and any $j \in E$. Hence, by the dominated convergence theorem (Theorem 3, Sect. 6, Chap. 2, Vol. 1),
$$\pi_j = \lim_n \sum_{i=1}^{\infty} \pi_i p_{ij}^{(n)} = \sum_{i=1}^{\infty} \pi_i \lim_n p_{ij}^{(n)} = \Big( \sum_{i=1}^{\infty} \pi_i \Big) \pi_j,$$
i.e.,
$$\pi_j \Big( 1 - \sum_{i=1}^{\infty} \pi_i \Big) = 0, \quad j \in E,$$
so that $\sum_{j=1}^{\infty} \pi_j \big( 1 - \sum_{i=1}^{\infty} \pi_i \big) = 0$. Thus $a(1 - a) = 0$ with $a = \sum_{i=1}^{\infty} \pi_i$, implying that either $a = 1$ or $a = 0$, which proves conclusion (b).
For the proof of (c), assume that $Q = (q_1, q_2, \ldots)$ is a stationary distribution. Then $\sum_{i=1}^{\infty} q_i p_{ij}^{(n)} = q_j$, and we obtain $\big( \sum_{i=1}^{\infty} q_i \big) \pi_j = q_j$, $j \in E$, by the dominated convergence theorem.
Therefore, if $Q$ is a stationary distribution, then $\sum_{i=1}^{\infty} q_i = 1$, and hence this stationary distribution must satisfy $q_j = \pi_j$ for all $j \in E$. Thus in the case where $\sum_{j=1}^{\infty} \pi_j = 0$, it is impossible to have $\sum_{i=1}^{\infty} q_i = 1$, so that there is no stationary distribution in this case.
According to (b), there remains the possibility that $\sum_{j=1}^{\infty} \pi_j = 1$. In this case, by (a), $\Pi = (\pi_1, \pi_2, \ldots)$ is itself a stationary distribution, and the foregoing proof implies that if $Q$ is also a stationary distribution, then it must coincide with $\Pi$, which proves the uniqueness of the stationary distribution when $\sum_{j=1}^{\infty} \pi_j = 1$.

2. Theorem 1 provides a sufficient condition for the existence of a unique stationary distribution. This condition requires that for all $j \in E$ the limiting values $\pi_j = \lim_n p_{ij}^{(n)}$ exist, are independent of $i \in E$, and are such that $\pi_j > 0$ for at least one $j \in E$.
At the same time, the more general problem of the existence of the limits $\lim_n p_{ij}^{(n)}$ was thoroughly explored in Sect. 5 in terms of the "intrinsic" properties of the chains, such as indecomposability, periodicity, recurrence and transience, and positive and null recurrence. Therefore it would be natural to formulate the conditions for the existence of the stationary distribution in terms of these intrinsic properties determined by the structure of the matrix of transition probabilities $p_{ij}$, $i, j \in E$. Note also that if conditions stated in these terms imply that all the limiting values are positive, $\pi_j > 0$, $j \in E$, then by definition (see property C in Sect. 3) the vector $\Pi = (\pi_1, \pi_2, \ldots)$ will be an ergodic limit distribution.
The answers to these questions are given in the following two theorems.

Theorem 2 ("Basic theorem on stationary distributions"). Consider a Markov chain with a countable state space E. A necessary and sufficient condition for the existence of a unique stationary distribution is that
(a) There exists a unique indecomposable subclass, and
(b) All the states are positive recurrent.

Theorem 3 ("Basic theorem on ergodic distributions"). Consider a Markov chain with a countable state space E. A necessary and sufficient condition for the existence of an ergodic distribution is that the chain is
(a) Indecomposable,
(b) Positive recurrent, and
(c) Aperiodic.

3. PROOF OF THEOREM 2. Necessity. Let the chain at hand have a unique stationary distribution, to be denoted by $\tilde{Q}$. We will show that in this case there is a unique positive recurrent subclass in the state space E.
Let N denote the conceivable number of such subclasses ($0 \le N \le \infty$).
Suppose N = 0, and let $j$ be a state in E. Since there are no positive recurrent classes, state $j$ may be either transient or null recurrent.
In the former case, the limits $\lim_n p_{ij}^{(n)}$ exist and are equal to zero for all $i \in E$ by Theorem 2 in Sect. 5.
In the latter case these limits also exist and are equal to zero, which follows from (37) in Sect. 5 and the fact that $\mu_j = \infty$, since state $j$ is null recurrent.
Thus, if N = 0, then the limits $\pi_j = \lim_n p_{ij}^{(n)}$ exist and are equal to zero for all $i, j \in E$. Therefore, by Theorem 1 (c), in this case there is no stationary distribution, so the case N = 0 is excluded by the assumption of the existence of a stationary distribution $\tilde{Q}$.
Suppose now that N = 1. Denote the only positive recurrent class by C. If the period of this class is $d(C) = 1$, then by (26) of Theorem 5, Sect. 5,
$$p_{ij}^{(n)} \to \mu_j^{-1}, \quad n \to \infty,$$
for all $i, j \in C$. If $j \notin C$, then this state is transient, and by property (21) of Theorem 2, Sect. 5,
$$p_{ij}^{(n)} \to 0, \quad n \to \infty,$$
for all $i \in E$.
Let
$$q_j = \begin{cases} \mu_j^{-1} \ ({>}\,0), & \text{if } j \in C, \\ 0, & \text{if } j \notin C. \end{cases} \qquad (5)$$
Then, since $C \neq \emptyset$, the collection $Q = (q_1, q_2, \ldots)$ is (by Theorem 1 (a)) a unique stationary distribution; therefore $\tilde{Q} = Q$.
Suppose now that the period $d(C) > 1$. Let $C_0, C_1, \ldots, C_{d-1}$ be the cyclic subclasses of the (positive recurrent) class C.
Every $C_k$, $k = 0, 1, \ldots, d - 1$, is a recurrent and aperiodic subclass with respect to the matrix of transition probabilities $p_{ij}^{(d)}$, $i, j \in C$. Hence, for $i, j \in C_k$,
$$p_{ij}^{(nd)} \to \frac{d}{\mu_j} > 0$$
by (36) from Sect. 5. Therefore, for each set $C_k$, the collection $\{d/\mu_j,\ j \in C_k\}$ is (by Theorem 1 (b)) a unique stationary distribution (with regard to the matrix $p_{ij}^{(d)}$, $i, j \in C$). This implies, in particular, that $\sum_{j \in C_k} \frac{d}{\mu_j} = 1$, i.e., $\sum_{j \in C_k} \frac{1}{\mu_j} = \frac{1}{d}$.
Let us set
$$q_j = \begin{cases} \mu_j^{-1}, & j \in C = C_0 + \cdots + C_{d-1}, \\ 0, & j \notin C, \end{cases} \qquad (6)$$
and show that the collection $Q = (q_1, q_2, \ldots)$ is a unique stationary distribution.
Indeed, if $i \in C$, then
$$p_{ii}^{(nd)} = \sum_{j \in C} p_{ij}^{(nd-1)} p_{ji}.$$
Then we find in the same way as in (1) that
$$\frac{d}{\mu_i} = \lim_n p_{ii}^{(nd)} \ge \sum_{j \in C} \liminf_n p_{ij}^{(nd-1)} p_{ji} = \sum_{j \in C} \frac{d}{\mu_j}\, p_{ji},$$
hence
$$\frac{1}{\mu_i} \ge \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}. \qquad (7)$$

But
$$\sum_{i \in C} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \sum_{i \in C_k} \frac{1}{\mu_i} = \sum_{k=0}^{d-1} \frac{1}{d} = 1. \qquad (8)$$
As in the proof of Theorem 1 (see (3) and (4)), we obtain from (7) and (8) that (7) holds in fact with an equality sign:
$$\frac{1}{\mu_i} = \sum_{j \in C} \frac{1}{\mu_j}\, p_{ji}. \qquad (9)$$

Since $q_i = \mu_i^{-1} > 0$, we see from (9) that the collection $Q = (q_1, q_2, \ldots)$ is a stationary distribution, which is unique by Theorem 1. Therefore $\tilde{Q} = Q$.
Finally, let $2 \le N < \infty$ or $N = \infty$. Denote the positive recurrent subclasses by $C_1, \ldots, C_N$ if $N < \infty$, and by $C_1, C_2, \ldots$ if $N = \infty$.
Let $Q^k = (q_1^k, q_2^k, \ldots)$ be a stationary distribution for a class $C_k$, given by the formula (compare with (5), (6))
$$q_j^k = \begin{cases} \mu_j^{-1} > 0, & j \in C_k, \\ 0, & j \notin C_k. \end{cases}$$
Then for any nonnegative numbers $a_1, a_2, \ldots$ with $\sum_{k=1}^{\infty} a_k = 1$ ($a_{N+1} = \cdots = 0$ if $N < \infty$), the collection $a_1 Q^1 + \cdots + a_N Q^N + \cdots$ is, obviously, a stationary distribution. Hence the assumption $2 \le N \le \infty$ leads us to the existence of a continuum of stationary distributions, which contradicts the assumption of its uniqueness.
Thus, the foregoing proof shows that only the case N = 1 is possible. In other words, the existence of a unique stationary distribution implies that the chain has only one indecomposable class, which consists of positive recurrent states.
Sufficiency. If the chain has an indecomposable subclass of positive recurrent states, i.e., the case N = 1 takes place, then the preceding arguments imply (by Theorem 1 (c)) the existence and uniqueness of the stationary distribution.
This completes the proof of Theorem 2.
4. PROOF OF THEOREM 3. Actually, all we need is contained in Theorem 2 and its proof.
Sufficiency. Using the notation of the proof of Theorem 2, we have by the conditions of the present theorem that N = 1, C = E, and d(E) = 1 (aperiodicity). Then the reasoning in the case N = 1 of the proof of Theorem 2 implies that $Q = (q_1, q_2, \ldots)$ with $q_j = \mu_j^{-1}$, $j \in E$, is a stationary and ergodic distribution, since all $\mu_j^{-1} > 0$ ($\mu_j < \infty$), $j \in E$.
Thus, the existence of an ergodic distribution $\Pi = (\pi_1, \pi_2, \ldots)$ is established ($\Pi = Q$).
Necessity. If there exists an ergodic distribution $\Pi = (\pi_1, \pi_2, \ldots)$, then by Theorem 1 there exists a unique stationary distribution $Q$ coinciding with $\Pi$.
It follows from Theorem 2 (and its proof) that the cases N = 0 and $2 \le N \le \infty$ cannot occur, so that N = 1, and there is only one indecomposable class C consisting of positive recurrent states. It remains to show that C = E and d(E) = 1.
Assume that $C \neq E$ and $d(C) = 1$. Then the same reasoning as for N = 1 in the proof of Theorem 2 shows that there is a state $j \notin C$ such that $p_{ij}^{(n)} \to 0$ for all $i \in E$. This, however, contradicts the property that $\pi_j = \lim_n p_{ij}^{(n)} > 0$ for all $i \in E$.
Therefore, if d(C) = 1, then C = E and d(E) = 1 (aperiodicity).
Finally, if C = E and $d(C) > 1$, the arguments in the proof of Theorem 2 (case N = 1) again imply that there is a stationary distribution $Q = (q_1, q_2, \ldots)$ with some $q_j = 0$, which is in contradiction with $Q = \Pi$, where $\Pi = (\pi_1, \pi_2, \ldots)$ is an ergodic distribution whose probabilities are positive by definition, $\pi_j > 0$, $j \in E$.
5. By the definition of the stationary (invariant) distribution $Q = (q_1, q_2, \ldots)$, these probabilities are subject to the conditions
$$q_j \ge 0, \quad j \in E = \{1, 2, \ldots\}, \qquad \sum_{j=1}^{\infty} q_j = 1, \qquad (10)$$
and satisfy the equations
$$q_j = \sum_{i=1}^{\infty} q_i p_{ij}, \quad j \in E. \qquad (11)$$
In other words, the stationary distribution $Q = (q_1, q_2, \ldots)$ is one of the solutions to the system of equations
$$x_j = \sum_{i=1}^{\infty} x_i p_{ij}, \quad j \in E, \qquad (12)$$
with the components of these solutions being nonnegative ($x_j \ge 0$, $j \in E$) and normalized ($\sum_{j=1}^{\infty} x_j = 1$).
Under the conditions of Theorem 5 there exists a stationary solution that at the same time is ergodic. Hence, by Theorem 1 (c), there is a unique solution to system (12) within the class of sequences $x = (x_1, x_2, \ldots)$ with $x_j \ge 0$, $j \in E$, and $\sum_{j=1}^{\infty} x_j = 1$.
But in fact we can make a stronger assertion. Since we assume that the conditions of Theorem 5 are fulfilled, there exists an ergodic distribution $\Pi = (\pi_1, \pi_2, \ldots)$.
Consider under this assumption the problem of the existence of a solution to (12) in the wider class of sequences $x = (x_1, x_2, \ldots)$ such that $x_j \in R$, $j \in E$, $\sum_{j=1}^{\infty} |x_j| < \infty$, and $\sum_{j=1}^{\infty} x_j = 1$. We will show that there is a unique solution in this class, given by the ergodic distribution $\Pi$.
Indeed, if $x = (x_1, x_2, \ldots)$ is a solution, then, using that $\sum_{j=1}^{\infty} |x_j| < \infty$, we obtain the following chain of equalities:
$$x_j = \sum_{i=1}^{\infty} x_i p_{ij} = \sum_{i=1}^{\infty} \Big( \sum_{k=1}^{\infty} x_k p_{ki} \Big) p_{ij} = \sum_{k=1}^{\infty} x_k \sum_{i=1}^{\infty} p_{ki} p_{ij} = \sum_{k=1}^{\infty} x_k p_{kj}^{(2)} = \cdots = \sum_{k=1}^{\infty} x_k p_{kj}^{(n)}$$
for any $n \ge 1$. Taking the limit as $n \to \infty$, we obtain (by the dominated convergence theorem) that $x_j = \big( \sum_{k=1}^{\infty} x_k \big) \pi_j$, where $\pi_j = \lim_n p_{kj}^{(n)}$ for any $k \in E$. By assumption, $\sum_{k=1}^{\infty} x_k = 1$. Hence $x_j = \pi_j$, $j \in E$, which was to be proved.
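For a finite chain satisfying these conditions, the unique solution of system (12) can be found numerically by iterating $x \mapsto xP$ (the power method). The sketch below is illustrative only: the 3-state matrix and the iteration count are arbitrary choices, not taken from the text.

```python
def apply_P(x, P):
    """One step of the distribution: x -> xP (multiplication on the right)."""
    n = len(x)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, n_iter=500):
    """Approximate the stationary distribution of a finite ergodic chain
    by iterating x -> xP, starting from the uniform distribution."""
    x = [1.0 / len(P)] * len(P)
    for _ in range(n_iter):
        x = apply_P(x, P)
    return x

# an illustrative ergodic chain: all entries positive, hence indecomposable
# and aperiodic, so a unique stationary distribution exists
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

pi = stationary(P)
# pi solves x = xP: the residual of system (12) should be essentially zero
residual = max(abs(a - b) for a, b in zip(pi, apply_P(pi, P)))
```

Starting the iteration from any other initial distribution gives the same limit, in accordance with the uniqueness statement above.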
6. Problems
1. Investigate the problem of stationary, limiting, and ergodic distributions for a Markov chain with the transition probability matrix
$$P = \begin{pmatrix} 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/4 & 1/2 & 1/4 & 0 \\ 0 & 1/2 & 1/2 & 0 \end{pmatrix}.$$
2. Let $P = \|p_{ij}\|$ be a finite doubly stochastic matrix (i.e., $\sum_{j=1}^{m} p_{ij} = 1$ for $i = 1, \ldots, m$ and $\sum_{i=1}^{m} p_{ij} = 1$ for $j = 1, \ldots, m$). Show that $Q = (1/m, \ldots, 1/m)$ is a stationary distribution of the corresponding Markov chain.
3. Let X be a Markov chain with two states, $E = \{0, 1\}$, and the transition probability matrix
$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ 1 - \beta & \beta \end{pmatrix}, \quad 0 < \alpha < 1, \ 0 < \beta < 1.$$
Explore the limiting, ergodic, and stationary distributions for this chain.
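The two-state chain of Problem 3 can be worked out in closed form and checked mechanically. In the sketch below, the stationary distribution is derived from the balance equation $q_0(1-\alpha) = q_1(1-\beta)$ (a standard computation, not quoted from the text), and the rows of $P^n$ are seen to converge to it; the values of $\alpha$ and $\beta$ are illustrative.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

alpha, beta = 0.3, 0.8  # illustrative values, 0 < alpha, beta < 1
P = [[alpha, 1 - alpha],
     [1 - beta, beta]]

# stationary distribution from q = qP and q0 + q1 = 1:
# the balance equation q0 (1 - alpha) = q1 (1 - beta) gives
denom = 2 - alpha - beta
q = [(1 - beta) / denom, (1 - alpha) / denom]

# the rows of P^n converge to q (the chain is ergodic since 0 < alpha, beta < 1)
Pn = P
for _ in range(200):
    Pn = mat_mul(Pn, P)
gap = max(abs(Pn[i][j] - q[j]) for i in range(2) for j in range(2))
```

Both rows of $P^n$ approach $q$, so the limiting distribution is independent of the initial state, as the theory requires.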

7. Limiting, Stationary, and Ergodic Distributions for Finite Markov Chains

1. According to Theorem 6 in Sect. 5, every indecomposable aperiodic Markov chain with a finite state space is positive recurrent. This conclusion allows us to state Theorem 3 from Sect. 6 in the following form. (Compare with questions A, B, C, and D in Sect. 3.)
Theorem 1. Consider an indecomposable aperiodic Markov chain $X = (X_n)_{n \ge 0}$ with finite state space $E = \{1, 2, \ldots, r\}$. Then:
(a) For all $j \in E$ there exist limits $\pi_j = \lim_n p_{ij}^{(n)}$ independent of the initial state $i \in E$.
(b) These limits $\Pi = (\pi_1, \pi_2, \ldots, \pi_r)$ form a probability distribution, i.e., $\pi_j \ge 0$, $j \in E$, and $\sum_{j=1}^{r} \pi_j = 1$.
(c) Moreover, these limits $\pi_j$ are equal to $\mu_j^{-1} > 0$ for all $j \in E$, where $\mu_j = \sum_{n=1}^{\infty} n f_{jj}^{(n)}$ is the mean time of return to state $j$ (i.e., $\mu_j = E_j \tau(j)$ with $\tau(j) = \min\{n \ge 1 \colon X_n = j\}$), so that $\Pi = (\pi_1, \pi_2, \ldots, \pi_r)$ is an ergodic distribution.
(d) The stationary distribution $Q = (q_1, q_2, \ldots, q_r)$ exists, is unique, and is equal to $\Pi = (\pi_1, \pi_2, \ldots, \pi_r)$.
2. In addition to Theorem 1, we state the following result clarifying the role of the properties of a chain being indecomposable and aperiodic.
Theorem 2. Consider a Markov chain with a finite state space $E = \{1, 2, \ldots, r\}$. The following statements are equivalent:
(a) The chain is indecomposable and aperiodic (d = 1).
(b) The chain is indecomposable, aperiodic (d = 1), and positive recurrent.
(c) The chain is ergodic.
(d) There is an $n_0$ such that for all $n \ge n_0$
$$\min_{i,j \in E} p_{ij}^{(n)} > 0.$$
PROOF. The implication (d) ⇒ (c) was proved in Theorem 1 of Sect. 12, Chap. 1 (Vol. 1). The converse implication (c) ⇒ (d) is obvious. The implication (a) ⇒ (b) follows from Theorem 6, Sect. 5, while (b) ⇒ (a) is obvious. Finally, the equivalence of (b) and (c) is contained in Theorem 5 from Sect. 6.
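Criterion (d) of Theorem 2 is easy to test mechanically by computing matrix powers. In the sketch below (illustrative matrices; the search cutoff is an arbitrary choice), the first chain becomes strictly positive at the second power, while the chain of period 2 never does.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def first_positive_power(P, n_max=50):
    """Smallest n <= n_max with min_{i,j} p_ij^(n) > 0, or None if no such n
    is found; by Theorem 2 (d), such an n_0 exists if and only if the finite
    chain is indecomposable and aperiodic."""
    Pn = P
    for n in range(1, n_max + 1):
        if min(min(row) for row in Pn) > 0:
            return n
        Pn = mat_mul(Pn, P)
    return None

P_ergodic = [[0.0, 1.0],
             [0.5, 0.5]]   # indecomposable, aperiodic (self-loop at state 1)
P_periodic = [[0.0, 1.0],
              [1.0, 0.0]]  # indecomposable but of period d = 2
```

Note that once some power of $P$ is strictly positive, all higher powers are as well, so returning the first such $n$ is enough.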
8. Simple Random Walk as a Markov Chain

1. A simple d-dimensional random walk is a homogeneous Markov chain $X = (X_n)_{n \ge 0}$ describing the motion of a random "particle" over the nodes of the lattice $Z^d = \{0, \pm 1, \pm 2, \ldots\}^d$, where at every step the particle either stays in the same state or passes to one of the adjacent states with certain probabilities.
EXAMPLE 1. Let d = 1 and the state space of the chain be $E = Z = \{0, \pm 1, \pm 2, \ldots\}$. Let the transition probabilities be
$$p_{ij} = \begin{cases} p, & j = i + 1, \\ q, & j = i - 1, \\ 0 & \text{otherwise}, \end{cases}$$
where $p + q = 1$.
The following graph demonstrates the possible transitions of this chain.
If p = 0 or 1, the motion is deterministic, and the particle moves to the left or the
right, respectively.
These deterministic cases are of little interest; all the states here are inessential.
Hence we will assume that 0 < p < 1.
Under this assumption the states form a single class of essential communicating
states. In other words, when 0 < p < 1, the chain is indecomposable (Sect. 4).
By the formula for the binomial distribution (Sect. 2, Chap. 1, Vol. 1),
$$p_{jj}^{(2n)} = C_{2n}^{n} (pq)^n = \frac{(2n)!}{(n!)^2}\, (pq)^n \qquad (1)$$
for any $j \in E$. By Stirling's formula (see (6) in Sect. 2, Chap. 1, Vol. 1; see also Problem 1),
$$n! \sim \sqrt{2 \pi n}\; n^n e^{-n}.$$
Therefore we find from (1) that
$$p_{jj}^{(2n)} \sim \frac{(4pq)^n}{\sqrt{\pi n}}, \qquad (2)$$

hence
$$\sum_{n=1}^{\infty} p_{jj}^{(2n)} = \infty \quad \text{if } p = q, \qquad (3)$$
$$\sum_{n=1}^{\infty} p_{jj}^{(2n)} < \infty \quad \text{if } p \neq q. \qquad (4)$$
These formulas, together with Theorem 1 in Sect. 5, yield the following result.
The simple one-dimensional random walk over the set $E = Z = \{0, \pm 1, \pm 2, \ldots\}$ is recurrent in the symmetric case $p = q = \frac{1}{2}$ and transient when $p \neq q$.
If $p = q = 1/2$, then, as was shown in Sect. 10, Chap. 1, Vol. 1, for large n
$$f_{jj}^{(2n)} \sim \frac{1}{2 \sqrt{\pi}\, n^{3/2}}. \qquad (5)$$
Therefore
$$\mu_j = \sum_{n=1}^{\infty} (2n)\, f_{jj}^{(2n)} = \infty, \quad j \in E. \qquad (6)$$
Therefore all the states in this case are null recurrent. Hence, by Theorem 5, Sect. 5, we obtain that for any $i$ and $j$, $p_{ij}^{(n)} \to 0$ as $n \to \infty$ for all $0 < p < 1$. This implies (Theorem 1, Sect. 6) that there are no limit distributions and no stationary or ergodic distributions.
EXAMPLE 2. Let d = 2. Consider the symmetric motion in the plane (corresponding to the case $p = q = 1/2$ in the previous example), when the particle can move one step in either direction (to the left or right, up or down) with probability 1/4 (Fig. 41).
Fig. 41 A walk in the plane

Assuming for definiteness that the particle was at the origin 0 = (0, 0) at the
initial time instant, we will investigate the problem of its return or nonreturn to this
zero state.
To this end, consider the paths in which a particle makes i steps to the right
and i steps to the left and j steps up and j steps down. If 2i + 2j = 2n, this means
that the particle starting from the origin returns to this state in 2n steps. It is also
clear that the particle cannot return to the origin after an odd number of steps.
This implies that the probabilities of transition from state 0 to the same state 0 are given by the following formulas:
$$p_{00}^{(2n+1)} = 0, \quad n = 0, 1, 2, \ldots,$$
and (by the formula for total probability)
$$p_{00}^{(2n)} = \sum_{(i,j) \colon i+j=n} \frac{(2n)!}{(i!)^2 (j!)^2} \Big( \frac{1}{4} \Big)^{2n}, \quad n = 1, 2, \ldots \qquad (7)$$
(see also Subsection 2, "Multinomial Distribution," in Sect. 2, Chap. 1, Vol. 1).
Multiplying the numerator and denominator in (7) by $(n!)^2$ yields
$$p_{00}^{(2n)} = \Big( \frac{1}{4} \Big)^{2n} C_{2n}^{n} \sum_{i=0}^{n} C_n^i C_n^{n-i} = \Big( \frac{1}{4} \Big)^{2n} (C_{2n}^{n})^2, \qquad (8)$$
where we have used the formula (Problem 4 in Sect. 2, Chap. 1, Vol. 1)
$$\sum_{i=0}^{n} C_n^i C_n^{n-i} = C_{2n}^{n}.$$
By Stirling's formula we obtain from (8) that $p_{00}^{(2n)} \sim \frac{1}{\pi n}$, hence
$$\sum_{n=0}^{\infty} p_{00}^{(2n)} = \infty. \qquad (9)$$
Of course, a similar assertion, by symmetry, holds not only for (0, 0) but also for
any state (i, j).
As in the case d = 1, we obtain from (9) and Theorem 1 of Sect. 5 the following
statement.
The simple two-dimensional symmetric random walk over the set $E = Z^2 = \{0, \pm 1, \pm 2, \ldots\}^2$ is recurrent.
EXAMPLE 3. It turns out that in the case $d \ge 3$, the behavior of the symmetric random walk over the states $E = Z^d = \{0, \pm 1, \pm 2, \ldots\}^d$ is quite different from the cases d = 1 and d = 2 considered above. That is:
The simple d-dimensional symmetric random walk over the set $E = Z^d = \{0, \pm 1, \pm 2, \ldots\}^d$ is transient for every $d \ge 3$.
The proof relies on the fact that the asymptotic behavior of the probabilities $p_{jj}^{(2n)}$ as $n \to \infty$ is
$$p_{jj}^{(2n)} \sim \frac{c(d)}{n^{d/2}} \qquad (10)$$
with a positive constant $c(d)$ depending only on the dimension d.
We will give the proof for d = 3, leaving the case d > 3 as a problem.
The symmetry of the random walk means that at every step the particle moves by one unit in one of the six coordinate directions with probability 1/6.
Let the particle start from the state 0 = (0, 0, 0). Then, as for d = 2, we find from the formulas for the multinomial distribution (Sect. 2, Chap. 1, Vol. 1) that
$$p_{00}^{(2n)} = \sum_{(i,j) \colon 0 \le i+j \le n} \frac{(2n)!}{(i!)^2 (j!)^2 ((n-i-j)!)^2} \Big( \frac{1}{6} \Big)^{2n}$$
$$= 2^{-2n} C_{2n}^{n} \sum_{(i,j) \colon 0 \le i+j \le n} \Big( \frac{n!}{i!\, j!\, (n-i-j)!} \Big)^2 \Big( \frac{1}{3} \Big)^{2n}$$
$$\le C_n\, 2^{-2n} C_{2n}^{n}\, 3^{-n} \sum_{(i,j) \colon 0 \le i+j \le n} \frac{n!}{i!\, j!\, (n-i-j)!} \Big( \frac{1}{3} \Big)^{n}$$
$$= C_n\, 2^{-2n} C_{2n}^{n}\, 3^{-n}, \qquad (11)$$
where
$$C_n = \max_{(i,j) \colon 0 \le i+j \le n} \frac{n!}{i!\, j!\, (n-i-j)!} \qquad (12)$$
and where we have used the obvious fact that
$$\sum_{(i,j) \colon 0 \le i+j \le n} \frac{n!}{i!\, j!\, (n-i-j)!} \Big( \frac{1}{3} \Big)^{n} = 1.$$
It will be established subsequently that
$$C_n \sim \frac{n!}{[(n/3)!]^3}. \qquad (13)$$
Then, by Stirling's formula, (13) implies that
$$C_n\, 2^{-2n} C_{2n}^{n}\, 3^{-n} \sim \frac{3 \sqrt{3}}{2 \pi^{3/2} n^{3/2}}. \qquad (14)$$
Hence (11) yields
$$\sum_{n=1}^{\infty} p_{00}^{(2n)} < \infty, \qquad (15)$$
and therefore, by Theorem 1, Sect. 5, the state 0 = (0, 0, 0) is transient. By symmetry, the same holds for any state in $E = Z^3$.
It remains to establish (13). Let
$$m_n(i, j) = \frac{n!}{i!\, j!\, (n-i-j)!},$$
and let $i_0 = i_0(n)$, $j_0 = j_0(n)$ be the values of $i$, $j$ for which
$$\max_{(i,j) \colon 0 \le i+j \le n} m_n(i, j) = m_n(i_0, j_0).$$
Taking the four points $(i_0 - 1, j_0)$, $(i_0 + 1, j_0)$, $(i_0, j_0 - 1)$, $(i_0, j_0 + 1)$ and using that the corresponding values $m_n(i_0 - 1, j_0)$, $m_n(i_0 + 1, j_0)$, $m_n(i_0, j_0 - 1)$, and $m_n(i_0, j_0 + 1)$ are less than or equal to $m_n(i_0, j_0)$, we obtain the inequalities
$$n - i_0 - 1 \le 2 j_0 \le n - i_0 + 1,$$
$$n - j_0 - 1 \le 2 i_0 \le n - j_0 + 1.$$
One can easily deduce from these inequalities that
$$i_0(n) \sim \frac{n}{3}, \qquad j_0(n) \sim \frac{n}{3},$$
which implies the required formula (13).

Summarizing these cases, we can state the following theorem, due to G. Pólya.
Theorem. The simple symmetric random walk over the set
$$E = Z^d = \{0, \pm 1, \pm 2, \ldots\}^d$$
is recurrent when d = 1 or d = 2 and transient when $d \ge 3$.
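Pólya's theorem can be illustrated (certainly not proved) by a Monte Carlo experiment: the fraction of walks that revisit the origin within a fixed horizon is close to 1 for d = 1 and markedly below 1 for d = 3. The horizon, trial count, and random seed below are arbitrary choices.

```python
import random

def returns_to_origin(d, steps, rng):
    """Run a simple symmetric walk on Z^d for `steps` steps and report
    whether it revisits the origin (each step moves one unit along a
    uniformly chosen coordinate direction, as in the examples above)."""
    pos = [0] * d
    for _ in range(steps):
        axis = rng.randrange(d)
        pos[axis] += rng.choice((-1, 1))
        if not any(pos):          # all coordinates are zero again
            return True
    return False

rng = random.Random(0)
trials, steps = 500, 500
est = {d: sum(returns_to_origin(d, steps, rng) for _ in range(trials)) / trials
       for d in (1, 2, 3)}
# expected picture: est[1] close to 1, est[3] well below 1 (the eventual
# return probability for d = 3 is known to be about 0.34), est[2] in between
```

For d = 2 the convergence of the return probability to 1 is only logarithmic in the horizon, so the finite-horizon estimate sits visibly between the other two.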

2. The previous examples dealt with a simple random walk on the entire lattice $Z^d$. In this subsection we will consider examples of simple random walks with state space E strictly smaller than $Z^d$. We will restrict ourselves to the case d = 1.

EXAMPLE 4. Consider a simple random walk with state space E = {0, 1, 2, . . . } and absorbing zero state 0. Its graph of transitions is as follows: from every state $i \ge 1$ the particle moves to $i + 1$ with probability $p$ and to $i - 1$ with probability $q$ ($p + q = 1$), while state 0 is absorbing.
State 0 is here the only positive recurrent state, and it forms the unique indecomposable subclass. (All the other states are transient.) By Theorem 2 in Sect. 6 there exists a unique stationary distribution $Q = (q_0, q_1, \ldots)$ with $q_0 = 1$ and $q_i = 0$, $i = 1, 2, \ldots$.
This walk provides an example where (for some $i$ and $j$) the limits $\lim_n p_{ij}^{(n)}$ exist but depend on the initial state, which means, in particular, that this random walk possesses no ergodic distribution.
It is clear that $p_{00}^{(n)} = 1$ and $p_{0j}^{(n)} = 0$ for $j = 1, 2, \ldots$, and an easy calculation shows that $p_{ij}^{(n)} \to 0$ for all $i, j = 1, 2, \ldots$.
Let us show that the limits $\alpha(i) = \lim_n p_{i0}^{(n)}$ exist for all $i = 1, 2, \ldots$ and are given by the formula
$$\alpha(i) = \begin{cases} (q/p)^i, & p > q, \\ 1, & p \le q. \end{cases} \qquad (16)$$
This formula demonstrates that when $p > q$ (trend to the right), the limiting probability $\lim_n p_{i0}^{(n)}$ of transition from state $i$ ($i = 1, 2, \ldots$) to state 0 depends on $i$, decreasing geometrically as $i$ grows.
For the proof of (16), notice that $p_{i0}^{(n)} = \sum_{k \le n} f_{i0}^{(k)}$, since the zero state is absorbing; hence the limit $\lim_n p_{i0}^{(n)}$ ($= \alpha(i)$) exists and equals $f_{i0}$, i.e., the probability of interest is the probability that the particle leaving state $i$ eventually reaches the zero state. By the same method as in Sect. 12, Chap. 1, Vol. 1 (see also Sect. 2, Chap. 7), we obtain recursive relations for these probabilities:
$$\alpha(i) = p\, \alpha(i+1) + q\, \alpha(i-1), \qquad (17)$$
with $\alpha(0) = 1$. A general solution to this equation is
$$\alpha(i) = a + b\, (q/p)^i, \qquad (18)$$
and the condition $\alpha(0) = 1$ provides a condition $a + b = 1$ on $a$ and $b$.
When $q > p$, we immediately obtain that $b = 0$, hence $\alpha(i) = 1$, because the $\alpha(i)$ are bounded. This result is easily understandable, since in the case $q > p$ a particle has a tendency to move toward the zero state.
In contrast, if $p > q$, we have the reverse situation: a particle has a tendency to move to the right, and it is natural to expect that
$$\alpha(i) \to 0, \quad i \to \infty, \qquad (19)$$
so that $a = 0$ and
$$\alpha(i) = (q/p)^i. \qquad (20)$$
Instead of establishing (19) first, we will prove this equality in another way.
Along with the absorbing barrier at point 0, consider one more absorbing barrier at point N. Denote by $\alpha_N(i)$ the probability that a particle leaving point $i$ reaches the zero state before getting to state N. The probabilities $\alpha_N(i)$ satisfy equations (17) with boundary conditions
$$\alpha_N(0) = 1, \qquad \alpha_N(N) = 0,$$
and, as was shown in Sect. 9, Chap. 1, Vol. 1,
$$\alpha_N(i) = \frac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, \quad 0 \le i \le N. \qquad (21)$$
Therefore $\lim_N \alpha_N(i) = (q/p)^i$, so that for the proof of (20) we must show that
$$\alpha(i) = \lim_N \alpha_N(i). \qquad (22)$$
This is easily seen intuitively. The proof can be carried out as follows.
Assume that the particle starts from a given state $i$. Then
$$\alpha(i) = P_i(A), \qquad (23)$$
where $A$ is the event that there is an $N$ such that the particle leaving state $i$ reaches the zero state before state N. If
$$A_N = \{\text{the particle reaches } 0 \text{ before } N\},$$
then $A = \bigcup_{N=i+1}^{\infty} A_N$. Clearly, $A_N \subseteq A_{N+1}$ and
$$P_i \Big( \bigcup_{N=i+1}^{\infty} A_N \Big) = \lim_{N \to \infty} P_i(A_N). \qquad (24)$$
But $\alpha_N(i) = P_i(A_N)$, so that (22) follows directly from (23) and (24).
Thus, when $p > q$, the limits $\lim_n p_{i0}^{(n)}$ depend on $i$. If $p \le q$, then $\lim_n p_{i0}^{(n)} = 1$ for any $i$ and $\lim_n p_{ij}^{(n)} = 0$, $j \ge 1$. Hence in this case there exists a limit distribution $\Pi = (\pi_0, \pi_1, \ldots)$ with $\pi_j = \lim_n p_{ij}^{(n)}$ independent of $i$, and $\Pi = (1, 0, 0, \ldots)$.
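The recursion (17) and the passage to the limit (22) can both be checked numerically from the explicit formula (21). A short sketch, with illustrative values of $p$ and $q$ and illustrative truncation levels:

```python
p, q = 0.7, 0.3  # rightward trend, p > q

def alpha_N(i, N):
    """Formula (21): probability of reaching 0 before N, starting from i."""
    r = q / p
    return (r ** i - r ** N) / (1 - r ** N)

# (21) satisfies recursion (17): alpha(i) = p alpha(i+1) + q alpha(i-1)
recursion_gap = max(
    abs(alpha_N(i, 50) - (p * alpha_N(i + 1, 50) + q * alpha_N(i - 1, 50)))
    for i in range(1, 49))

# as N grows, alpha_N(i) approaches the limit (q/p)^i of formula (20)
limit_gap = max(abs(alpha_N(i, 500) - (q / p) ** i) for i in range(1, 10))
```

Both gaps are at the level of floating-point rounding, in agreement with (17) and (20).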

EXAMPLE 5. Consider a simple random walk with state space E = {0, 1, . . . , N} and absorbing boundary states 0 and N.
In this case there are two indecomposable positive recurrent classes, {0} and {N}. All the other states 1, 2, . . . , N − 1 are transient. We can see from the proof of Theorem 2, Sect. 6, that there exists a continuum of stationary distributions $Q = (q_0, q_1, \ldots, q_N)$, all of which have the form $q_1 = \cdots = q_{N-1} = 0$ and $q_0 = a$, $q_N = b$ with $a \ge 0$, $b \ge 0$, and $a + b = 1$.
According to the results of Subsection 2 in Sect. 9, Chap. 1, Vol. 1,
$$\lim_n p_{i0}^{(n)} = \begin{cases} \dfrac{(q/p)^i - (q/p)^N}{1 - (q/p)^N}, & p \neq q, \\[6pt] 1 - i/N, & p = q = 1/2, \end{cases} \qquad (25)$$
$\lim_n p_{iN}^{(n)} = 1 - \lim_n p_{i0}^{(n)}$, and $\lim_n p_{ij}^{(n)} = 0$, $1 \le j \le N - 1$.
Let us emphasize that, as in the previous example, the limiting values $\lim_n p_{ij}^{(n)}$ of the transition probabilities depend on the initial state.

EXAMPLE 6. Consider a simple random walk with state space E = {0, 1, . . .} and a reflecting barrier at 0: from 0 the particle moves to 1 with probability 1, and from every $i \ge 1$ it moves to $i + 1$ with probability $p$ and to $i - 1$ with probability $q$.
The behavior of this chain essentially depends on $p$ and $q$.
If $p > q$, then a wandering particle has a trend to the right and the reflecting barrier enhances this trend, unlike the chain in Example 4, where a particle may become "stuck" in the zero state. All the states are transient: $p_{ij}^{(n)} \to 0$, $n \to \infty$, for all $i, j \in E$; there are no stationary or ergodic distributions.
If $p < q$, there is a leftward trend, and the chain is recurrent, and so is the chain for $p = q$.
Let us write down the system of equations (cf. (12) in Sect. 6) for the stationary distribution $Q = (q_0, q_1, \ldots)$:
$$q_0 = q_1 q, \quad q_1 = q_0 + q_2 q, \quad q_2 = q_1 p + q_3 q, \quad \ldots$$
Hence
$$q_1 = q (q_1 + q_2), \quad q_2 = q (q_2 + q_3), \quad \ldots$$
Therefore
$$q_j = \frac{p}{q}\, q_{j-1}, \quad j = 2, 3, \ldots.$$
If $p = q$, then $q_1 = q_2 = \cdots$, and hence there is no nonnegative solution to this system satisfying the conditions $\sum_{j=0}^{\infty} q_j = 1$ and $q_0 = q_1 q$. Therefore, when $p = q = 1/2$, there is no stationary distribution. All the states in this case are recurrent.
Finally, let $p < q$. The condition $\sum_{j=0}^{\infty} q_j = 1$ yields
$$q_1 \Big( q + 1 + \frac{p}{q} + \Big( \frac{p}{q} \Big)^2 + \cdots \Big) = 1.$$
Hence
$$q_1 = \frac{q - p}{2 q^2}, \qquad q_0 = q_1 q = \frac{q - p}{2 q},$$
and
$$q_j = \frac{q - p}{2 q^2} \Big( \frac{p}{q} \Big)^{j-1}, \quad j \ge 2.$$
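These stationary probabilities can be verified directly. The sketch below uses illustrative values p = 0.4, q = 0.6; the normalization constant $q_1 = (q - p)/(2q^2)$ is recomputed here from the condition $\sum_j q_j = 1$, and the full balance equations are checked on a truncated state space.

```python
p, q = 0.4, 0.6  # leftward trend p < q, so a stationary distribution exists

q1 = (q - p) / (2 * q * q)        # normalization from sum_j q_j = 1

def q_st(j):
    """Stationary probability of state j for the walk reflected at 0."""
    if j == 0:
        return q * q1                  # q_0 = q_1 q
    return q1 * (p / q) ** (j - 1)     # q_j = (p/q) q_{j-1} for j >= 2

# normalization (geometric tail, so a modest truncation suffices)
total = sum(q_st(j) for j in range(200))

# full balance: q_0 = q q_1, q_1 = q_0 + q q_2, q_j = p q_{j-1} + q q_{j+1}
residual = max(
    [abs(q_st(0) - q * q_st(1)),
     abs(q_st(1) - (q_st(0) + q * q_st(2)))]
    + [abs(q_st(j) - (p * q_st(j - 1) + q * q_st(j + 1))) for j in range(2, 50)])
```

Both the normalization and the balance equations hold to floating-point accuracy, confirming the formulas above.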
EXAMPLE 7. The state space of the simple random walk in this example is E = {0, 1, . . . , N}, with reflecting barriers 0 and N: 0 → 1 with probability 1; from each $i = 1, \ldots, N - 1$, $i \to i + 1$ with probability $p$ and $i \to i - 1$ with probability $q$ ($0 < p < 1$); and N → N − 1 with probability 1.
The states of this chain constitute one indecomposable class. They are positive recurrent with period d = 2. By Theorem 2 in Sect. 6, there is a unique stationary distribution $Q = (q_0, q_1, \ldots, q_N)$. On solving the system of equations $q_j = \sum_{i=0}^{N} q_i p_{ij}$ subject to the conditions $\sum_{i=0}^{N} q_i = 1$, $q_j \ge 0$, $j \in E$, we find
$$q_j = \frac{(p/q)^{j-1}}{1 + \sum_{i=1}^{N-1} (p/q)^{i-1}}, \quad 1 \le j \le N - 1, \qquad (26)$$
and $q_0 = q_1 q$, $q_N = q_{N-1} p$.
There is no ergodic distribution, which follows from Theorem 3, Sect. 6, and the fact that this chain has period d = 2. The lack of an ergodic distribution can also be seen directly. For example, let N = 2 (so that 0 → 1 with probability 1, 1 → 2 with probability $p$, 1 → 0 with probability $q$, 2 → 1 with probability 1, $0 < p < 1$).
Then we see that $p_{11}^{(2n)} = 1$, but $p_{11}^{(2n+1)} = 0$, so the limit $\lim_n p_{11}^{(n)}$ does not exist. At the same time, the stationary distribution $Q = (q_0, q_1, q_2)$ exists and, by (26), has the form
$$q_0 = \frac{1}{2}\, q, \qquad q_1 = \frac{1}{2}, \qquad q_2 = \frac{1}{2}\, p.$$
3. The material set out in the book shows that the simple random walk is a classical model that makes it possible to develop probabilistic ideology, elaborate probabilistic techniques, and discover many probabilistic laws. In a similar way, the study of the sums $X_n = \xi_1 + \cdots + \xi_n$, $n \ge 1$, of independent Bernoulli random variables $\xi_1, \xi_2, \ldots$ taking only two values, and hence giving rise to a simple random walk $X = (X_n)_{n \ge 1}$ (which is a Markov chain), led to the discovery of the law of large numbers (Sect. 5, Chap. 1, Vol. 1), the de Moivre–Laplace theorem (Sect. 6, Chap. 1, Vol. 1), the arcsine law (Sect. 10, Chap. 1, Vol. 1), and many other probabilistic regularities.
In this subsection we will consider two discrete diffusion models that provide a
good illustration of how a simple random walk can describe real physical processes.
A. Ehrenfest Model. As in Example 7, consider the simple random walk with
phase space E = {0, 1, . . . , N} and reflecting barriers at 0 and N.
The transition probabilities from these states are p01 = 1, pN,N−1 = 1. At other
states i = 1, . . . , N − 1 only transitions by one step to the right or to the left are
possible with probabilities

    pij = 1 − i/N for j = i + 1, and pij = i/N for j = i − 1.  (27)

In 1907, Paul and Tatiana Ehrenfest [23] proposed this Markov chain as the
model of statistical mechanics describing the motion of gas molecules from one
container (A or B) to the other (B or A) through the membrane between them.
It is assumed that the total number of molecules in the two containers is N, and
at each step one of them is randomly chosen (with probability 1/N) and placed into
the other container. The choice of the molecule at each step is made independently
of its prehistory.

Let Xn be the number of molecules in container A at time n. The random mechanism of the motion of molecules fulfills the Markov property (Problem 2):

    P(Xn+1 = j | X0 = i0, X1 = i1, . . . , Xn−1 = in−1, Xn = i) = P(Xn+1 = j | Xn = i)  (28)

and

    P(Xn+1 = j | Xn = i) = pij,  (29)

with pij defined by (27).
For this model there exists a stationary distribution Q = (q0 , q1 , . . . , qN ) given
by the following binomial formula (Problem 3):
    qj = C_N^j (1/2)^N,   j = 0, 1, . . . , N.  (30)
All the states of this Markov chain are recurrent (Problem 4).
It is of interest to note that the maximum of qj , j = 0, 1, . . . , N, is attained for,
say, even N, at the central value j = N/2, which corresponds to the most probable
“equilibrium” state, when the number of molecules in both containers is the same.
Clearly, this equilibrium, which is established in the course of time, is of a prob-
abilistic nature (specified by the stationary distribution Q).
The possibility of “stabilization” of the number of molecules in the containers
is quite understandable intuitively: the farther state i is from the central value, the
larger the probability (by (27)) that the molecule will move toward this value.
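A quick numerical check (mine; N = 10 is an arbitrary choice) confirms that the binomial distribution (30) is indeed stationary for the transition probabilities (27):

```python
import numpy as np
from math import comb

N = 10  # illustrative number of molecules
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i < N:
        P[i, i + 1] = 1 - i / N   # a molecule moves from container B to A
    if i > 0:
        P[i, i - 1] = i / N       # a molecule moves from container A to B

Q = np.array([comb(N, j) / 2**N for j in range(N + 1)])  # formula (30)
assert np.allclose(Q @ P, Q)      # Q is stationary
assert abs(Q.sum() - 1.0) < 1e-12
```

In fact the Ehrenfest chain satisfies the stronger detailed-balance relations qi p_{i,i+1} = q_{i+1} p_{i+1,i}, which is how (30) is usually verified by hand.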
B. D. Bernoulli–Laplace Model. This model, which is akin to the Ehrenfest
model, was proposed by Daniel Bernoulli in 1769 and analyzed by Laplace in 1812
in the context of describing the exchange of particles between two ideal liquids.
Specifically, there are two containers, A and B, containing 2N particles, of which
N particles are white and N particles are black.
The system is said to be in state i, where i ∈ E = {0, 1, . . . , N}, if there are
i white particles and N − i black particles in container A. The assumption of ideal
liquids means that in state i there are N − i white particles and i black particles in
container B, i.e., the number of particles in each container remains equal to N.
At each step n one particle in each container is randomly chosen (with proba-
bility 1/N), and these particles interchange their containers. The two choices are
independent, and each choice is independent of the choices in the previous steps.
Let Xn be the number of white particles in container A. Then the aforementioned
mechanism of particle interchange obeys the Markov property (28) with transition
probabilities pij in (29) given by the formula (Problem 5)


    pij = (i/N)^2 for j = i − 1,
    pij = (1 − i/N)^2 for j = i + 1,  (31)
    pij = 2 (i/N)(1 − i/N) for j = i,

and pij = 0 if |i − j| > 1, i = 0, 1, . . . , N.



As in the Ehrenfest model, all the states are recurrent. There exists a unique
stationary distribution Q = (q0 , q1 , . . . , qN ) determined by (Problem 5)

    qj = (C_N^j)^2 / C_{2N}^N,   j = 0, 1, . . . , N.  (32)
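As in the Ehrenfest case, stationarity of (32) under the transition probabilities (31) can be checked directly; the following sketch (mine, with an arbitrary N) does so:

```python
import numpy as np
from math import comb

N = 8  # illustrative
P = np.zeros((N + 1, N + 1))
for i in range(N + 1):
    if i > 0:
        P[i, i - 1] = (i / N) ** 2       # white chosen in A, black chosen in B
    if i < N:
        P[i, i + 1] = (1 - i / N) ** 2   # black chosen in A, white chosen in B
    P[i, i] = 2 * (i / N) * (1 - i / N)  # the swap leaves the white count in A unchanged

Q = np.array([comb(N, j) ** 2 / comb(2 * N, N) for j in range(N + 1)])  # (32)
assert np.allclose(Q @ P, Q)
```

The weights (C_N^j)^2 / C_{2N}^N are the hypergeometric probabilities of drawing j white particles when N particles are drawn from 2N, which is the equilibrium one would expect from a well-mixed system.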

4. At the beginning of this chapter we wrote that the issue of major interest here
is the asymptotic behavior (as n → ∞) of memoryless systems. In the previous
sections we considered Markov chains with countable state space E = {i, j, . . . } as
a specific class of these systems, and we studied in them the behavior of the transition probabilities pij^{(n)} as n → ∞. In particular, we investigated the asymptotic behavior
of a simple random walk in which transitions are possible only to adjacent states.
Of great interest is the study of similar problems for Markov chains with more
complicated state spaces. In this regard, see, for example, [14, 65].
5. The two models considered above (Ehrenfest and Bernoulli–Laplace) are said to
be discrete diffusion models.
We will give an explanation of this expression in terms of the asymptotic behavior of
a simple random walk in R. Let Sn = ξ1 + · · · + ξn , n ≥ 1, S0 = 0, where ξ1 , ξ2 , . . .
is a sequence of independent identically distributed random variables with E ξi = 0,
Var ξi = 1. Let X_0^n = 0 and

    X_t^n = S_{[nt]}/√n = (1/√n) Σ_{k=1}^{[nt]} ξk,   0 < t ≤ 1.

Clearly, the sequence (0, X_{1/n}^n, X_{2/n}^n, . . . , X_1^n) may be regarded as a simple random walk in times Δ, 2Δ, . . . , 1 with Δ = 1/n and jumps of order √Δ (ΔX_{kΔ}^n ≡ X_{kΔ}^n − X_{(k−1)Δ}^n = ξk √Δ).

As was pointed out in Remark 4, Sect. 8, Chap. 7, the finite-dimensional distribu-


tions of the random walk X n = (Xtn )0≤t≤1 weakly converge to those of the Wiener
process (Brownian motion) W = (Wt )0≤t≤1 . Moreover, we stated there that a func-
tional convergence also holds, i.e., the weak convergence of the distributions of X n
to the distribution of W (in the same sense as the convergence of empirical processes
to the Brownian bridge, see Sect. 13, Chap. 3, Vol. 1). The Wiener process is a typ-
ical (and the most important) example of a diffusion process; see [26, 21, 12]. This
explains why the processes like X n and those arising in the Ehrenfest and Bernoulli–
Laplace models are naturally called discrete diffusion models.
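The diffusion approximation can be illustrated by simulation. The sketch below is my own (Bernoulli ±1 steps, with arbitrary n and sample size): it checks that X_1^n = S_n/√n has approximately zero mean and unit variance, as the convergence to the Wiener process suggests.

```python
import math
import random
import statistics

random.seed(1)
n, reps = 400, 5000   # illustrative sizes
samples = [
    sum(random.choice((-1, 1)) for _ in range(n)) / math.sqrt(n)
    for _ in range(reps)
]
m = statistics.fmean(samples)
v = statistics.pvariance(samples)
assert abs(m) < 0.06          # E X_1^n = 0
assert abs(v - 1.0) < 0.08    # Var X_1^n = 1
```

A full check of functional convergence would compare whole paths (e.g., maxima or crossing times) with the corresponding Wiener-process functionals; the moment check above is only the simplest symptom of it.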
6. Problems.

1. Prove Stirling's formula (n! ∼ √(2π) n^{n+1/2} e^{−n}) using the following probabilistic arguments [5, Problem 27.18]. Let Sn = X1 + · · · + Xn, n ≥ 1, where X1, X2, . . . are independent identically distributed random variables distributed according to Poisson's law with parameter λ = 1. Prove successively that

    (a) E ((Sn − n)/√n)^− = e^{−n} Σ_{k=0}^{n} ((n − k)/√n) (n^k/k!) = n^{n+1/2} e^{−n}/n!;
    (b) Law[((Sn − n)/√n)^−] → Law[N^−], where N is a standard normal random variable;
    (c) E ((Sn − n)/√n)^− → E N^− = 1/√(2π);
    (d) n! ∼ √(2π) n^{n+1/2} e^{−n}.
2. Establish the Markov property (28).
3. Prove (30).
4. Prove that all the states in the Ehrenfest model are recurrent.
5. Verify that (31) and (32) hold true.
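For Problem 1, the identity in (a) and the limit in (c) can be illustrated numerically (this check is mine; n = 200 is an arbitrary choice). The expectation in (a) is computed directly from the Poisson(n) distribution of Sn, in log-space to avoid overflow:

```python
import math

n = 200  # illustrative; the approximation in (c) improves as n grows

# E ((S_n - n)/sqrt(n))^- for S_n ~ Poisson(n), computed term by term:
lhs = sum(
    (n - k) / math.sqrt(n) * math.exp(-n + k * math.log(n) - math.lgamma(k + 1))
    for k in range(n + 1)
)
rhs = math.exp((n + 0.5) * math.log(n) - n - math.lgamma(n + 1))  # n^{n+1/2} e^{-n} / n!
assert abs(lhs - rhs) < 1e-9                          # identity (a)
assert abs(lhs - 1 / math.sqrt(2 * math.pi)) < 0.01   # limit (c), n large
```

The two assertions together show exactly why (d) follows: the same quantity equals n^{n+1/2} e^{−n}/n! exactly and tends to 1/√(2π).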

9. Optimal Stopping Problems for Markov Chains

1. The subject of this section is closely related to Sect. 13, Chap. 7, which dealt
with the martingale approach to optimal stopping problems for arbitrary stochas-
tic sequences. In this section we focus on the case where stochastic sequences are
generated by functions of states of Markov chains, which enables us to present and
interpret the general results of Sect. 13, Chap. 7, in a simple and accessible way.
2. Let X = (Xn , Fn , Px ) be a homogeneous Markov chain with discrete time and
phase space (E, E ).
We will assume that the space (Ω, F ) on which the variables Xn = Xn (ω), n ≥ 0,
are defined is a coordinate space (as in Subsection 6, Sect. 1) and that the Xn (ω) are
specified coordinate-wise,
i.e., Xn (ω) = xn if ω = (x0 , x1 , . . . ) ∈ Ω. The σ-algebra
F is defined as σ( Fn ), where Fn = σ(x0 , . . . , xn ), n ≥ 0.

Remark. In the “general theory of optimal stopping rules” there is no need to re-
quire that Ω be a coordinate space. Nevertheless in the “general theory” one also
must assume that the space is sufficiently “rich.” (For details, see [69].)
In the present exposition, the assumption of coordinate space simplifies the pre-
sentation, in particular, regarding the generalized Markov property (Theorem 1 in
Sect. 2), which was defined in this very framework.

As before, P(x; B) will denote the transition function of our chain: P(x; B) = Px{X1 ∈ B}, x ∈ E, B ∈ E.
Let T be the one-step transition operator that acts on E-measurable functions f = f(x) satisfying Ex |f(X1)| < ∞, x ∈ E, in the following way:

    (Tf)(x) = Ex f(X1) = ∫_E f(y) P(x; dy).  (1)

(For notational simplicity, we will write Tf (x) instead of (Tf )(x). A similar conven-
tion will also be used in other cases.)

3. To state the optimal stopping problem for the Markov chain X, let g = g(x) be a
given E -measurable real-valued function such that Ex |g(Xn )| < ∞, x ∈ E, for all
n ≥ 0 (or for 0 ≤ n ≤ N if we are to take the “optimal decision” before a time N
specified a priori).
Let M_0^n be the class of Markov times τ = τ(ω) (with respect to the filtration (Fk)_{0≤k≤N}) taking values in the set {0, 1, . . . , n}.
The following theorem is a “Markov” version of Theorems 1 and 2 in Sect. 13,
Chap. 7.

Theorem 1. For any 0 ≤ n ≤ N and x ∈ E, define the “price”

    sn(x) = sup_{τ ∈ M_0^n} Ex g(Xτ),  (2)

where Ex is the expectation with respect to Px. Let

    τ_0^n = min{0 ≤ k ≤ n : s_{n−k}(Xk) = g(Xk)}  (3)

and

    Qg(x) = max(g(x), Tg(x)).  (4)

Then the following statements hold true:
(1) The Markov time τ_0^n is an optimal stopping time in the class M_0^n:

    Ex g(X_{τ_0^n}) = sn(x)  (5)

for all x ∈ E.
(2) The functions sn(x) are determined by the formula

    sn(x) = Q^n g(x),   x ∈ E,  (6)

where Q^0 g(x) = g(x) for n = 0.
(3) The functions sn(x), n ≤ N, satisfy the recurrence relations

    sn(x) = max(g(x), T s_{n−1}(x)),   x ∈ E,  1 ≤ n ≤ N  (7)

(with s0(x) = g(x)).

PROOF. Let us apply Theorems 1 and 2 of Sect. 13, Chap. 7, to the functions fn = g(Xn), 0 ≤ n ≤ N. To this end, fix an initial state x ∈ E and consider the functions V_n^N and v_n^N introduced therein. To highlight the dependence on the initial state, we will write V_n^N = V_n^N(x). Thus,

    V_n^N(x) = sup_{τ ∈ M_n^N} Ex g(Xτ),  (8)

where M_n^N is the class of Markov times (with respect to the filtration (Fk)_{k≤N}) taking values in the set {n, n + 1, . . . , N}.
In accordance with (6) of Sect. 13, Chap. 7, the functions v_n^N are defined recursively:

    v_N^N = g(XN),   v_n^N = max(g(Xn), Ex(v_{n+1}^N | Fn)).  (9)

By the generalized Markov property (Theorem 1 in Sect. 2),

    Ex(v_N^N | F_{N−1}) = Ex(g(XN) | F_{N−1}) = E_{X_{N−1}} g(X1)   (Px-a.s.),  (10)

where E_{X_{N−1}} g(X1) is to be understood as follows (Sect. 2): for the function ψ(x) = Ex g(X1), i.e., ψ(x) = (Tg)(x), we define E_{X_{N−1}} g(X1) ≡ ψ(X_{N−1}) = (Tg)(X_{N−1}).
Hence v_N^N = g(XN) and

    v_{N−1}^N = max(g(X_{N−1}), (Tg)(X_{N−1})) = (Qg)(X_{N−1}).  (11)

Proceeding in a similar manner, we find that

    v_n^N = (Q^{N−n} g)(Xn)  (12)

for all 0 ≤ n ≤ N − 1 and, in particular,

    v_0^N = (Q^N g)(X0) = (Q^N g)(x)   (Px-a.s.).

By (13) of Sect. 13, Chap. 7, we have v_0^N = V_0^N. Since V_0^N = V_0^N(x) = sN(x), we have sN(x) = (Q^N g)(x), which proves (6) for n = N (and similarly for any n < N). The recurrence formulas (7) follow from (6) and the definition of Q.
We show now that the stopping time defined by (3) is optimal (for n = N) in the class M_0^N (and similarly in the classes M_0^n for n < N).
By Theorem 1 of Sect. 13, Chap. 7, the optimal stopping time is

    τ_0^N = min{0 ≤ k ≤ N : v_k^N = g(Xk)}.

Now (12) and the fact established above that sn(x) = (Q^n g)(x) for any n ≥ 0 imply that

    v_k^N = (Q^{N−k} g)(Xk) = s_{N−k}(Xk).  (13)

Therefore

    τ_0^N = min{0 ≤ k ≤ N : s_{N−k}(Xk) = g(Xk)},  (14)

which proves the optimality of this stopping time in the class M_0^N.


4. Use the notation

    D_k^N = {x ∈ E : s_{N−k}(x) = g(x)},  (15)
    C_k^N = E \ D_k^N = {x ∈ E : s_{N−k}(x) > g(x)}.  (16)

Then we see from (14) that

    τ_0^N(ω) = min{0 ≤ k ≤ N : Xk(ω) ∈ D_k^N},  (17)



and, by analogy with the sets D_k^N and C_k^N (in Ω) introduced in Subsection 6, Sect. 13, Chap. 7, the sets

    D_0^N ⊆ D_1^N ⊆ · · · ⊆ D_N^N = E,  (18)
    C_0^N ⊇ C_1^N ⊇ · · · ⊇ C_N^N = ∅  (19)

can be called the stopping sets and continuation of observation sets (in E), respectively.
Let us point out the specific features of the stopping problems in the case of
Markov chains. Unlike the general case, the answer to the question of whether ob-
servations are to be stopped or continued is given in the Markov case in terms of
the states of the Markov chain itself (τ0N = min{0 ≤ k ≤ N : Xk ∈ DNk }), in
other words, depending on the position of the wandering particle. And the com-
plete solution of the optimal stopping problems (i.e., the description of the “price”
sN (x) and the optimal stopping time τ0N ) is obtained from the recurrence “dynamic
programming equations” (7) by finding successively the functions s0 (x) = g(x),
s1 (x), . . . , sN (x).
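The backward recursion (7) and the stopping sets (15) are straightforward to implement. In the following sketch the three-state chain and the payoff are invented purely for illustration; it computes s_0, . . . , s_N by value iteration s_n = max(g, T s_{n−1}):

```python
import numpy as np

# An invented three-state chain and payoff, purely for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.5, 0.5]])
g = np.array([0.0, 1.0, 3.0])
N = 5

s = [g.copy()]                                  # s_0 = g
for _ in range(N):
    s.append(np.maximum(g, P @ s[-1]))          # s_n = max(g, T s_{n-1})

# Stopping sets D_k^N = {x : s_{N-k}(x) = g(x)}, as in (15).
D = [np.isclose(s[N - k], g) for k in range(N + 1)]
assert D[N].all()                               # at the horizon one always stops
assert all((s[n + 1] >= s[n] - 1e-12).all() for n in range(N))  # prices grow with n
```

The observer following (17) stops at the first time k at which the current state falls into D_k^N; no other information about the past is needed.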
5. Consider now the optimal stopping problem assuming that τ ∈ M_0^∞, where M_0^∞ is the class of all finite Markov times. (The assumption τ ∈ M_0^N means that τ ≤ N, while the assumption τ ∈ M_0^∞ means only that τ = τ(ω) < ∞ for all ω ∈ Ω.) Thus we consider the price

    s(x) = sup_{τ ∈ M_0^∞} Ex g(Xτ).  (20)

To avoid any questions about the existence of the expectations Ex g(Xτ), we can assume, for example, that

    Ex sup_n g^−(Xn) < ∞,   x ∈ E.  (21)

Clearly, this assumption is satisfied whenever g = g(x) is bounded (|g(x)| ≤ C, x ∈ E). In particular, (21) holds if the state space E is finite.
The definitions of the prices sN (x) and s(x) imply that

sN (x) ≤ sN+1 (x) ≤ · · · ≤ s(x) (22)

for all x ∈ E. Of course, it is natural to expect that limN→∞ sN (x) equals s(x). If this
is the case, then, passing to the limit in (7), we find that s(x) satisfies the equation

s(x) = max(g(x), Ts(x)), x ∈ E. (23)

This equation implies that s(x), x ∈ E, fulfills the following “variational inequal-
ities”:

s(x) ≥ g(x), (24)


s(x) ≥ Ts(x). (25)

Inequality (24) says that s(x) is a majorant of g(x). Inequality (25), according to
the terminology of the general theory of Markov processes, means that s(x) is an
excessive or a superharmonic function.
Therefore, if we could establish that s(x) satisfies (23), then the price s(x) would
be an excessive majorant for g(x).
Note now that if a function v(x) is an excessive majorant for g(x), then, obviously,
the following variational inequalities hold:

v(x) ≥ max(g(x), Tv(x)), x ∈ E. (26)

It turns out, however, that if we assume additionally that v(x) is the least excessive
majorant then (26) becomes an equality, i.e., v(x) satisfies the equation

v(x) = max(g(x), Tv(x)), x ∈ E. (27)

Lemma 1. Any least excessive majorant v(x) of g(x) satisfies equation (27).

PROOF. The proof is fairly simple. Clearly, v(x) satisfies inequality (26). Let
v1 (x) = max(g(x), Tv(x)). Since v1 (x) ≥ g(x) and v1 (x) ≤ v(x), x ∈ E, we have

Tv1 (x) ≤ Tv(x) ≤ max(g(x), Tv(x)) = v1 (x).

Therefore v1 (x) is an excessive majorant for g(x). But v(x) is the least excessive
majorant. Hence v(x) ≤ v1 (x), i.e., v(x) ≤ max(g(x), Tv(x)). Together with (26)
this implies (27).
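Lemma 1 suggests computing the least excessive majorant by iterating the operator Qv = max(g, Tv) until it stabilizes. The sketch below (the chain and the payoff are invented for illustration) does exactly this and checks the fixed-point equation (27):

```python
import numpy as np

# Invented example: symmetric walk on {0, 1, 2, 3} with absorbing barriers.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
g = np.array([1.0, 0.0, 2.0, 0.5])

v = g.copy()
for _ in range(200):
    v = np.maximum(g, P @ v)                 # v <- Qv = max(g, Tv)

assert np.allclose(v, np.maximum(g, P @ v))  # fixed point: v = max(g, Tv)
assert (v >= g - 1e-12).all()                # v is a majorant of g
```

Starting the iteration at v = g produces the nondecreasing sequence Q^N g, whose limit, by Theorem 2 below, is the price s(x) itself.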


The preliminary discussion presented earlier, which was based on the assumption s(x) = lim_{N→∞} sN(x) and led to (23), as well as the statement of Lemma 1, suggests a characterization of the price s(x): it is likely to be the least excessive majorant of g(x).
Indeed, the following theorem is true.
 
Theorem 2. Suppose a function g = g(x) satisfies Ex sup_n g^−(Xn) < ∞, x ∈ E. Then the following statements are valid:
(a) The price s = s(x) is the least excessive majorant of g = g(x).
(b) The price s(x) is equal to lim_{N→∞} sN(x) = lim_{N→∞} Q^N g(x) and satisfies the Wald–Bellman dynamic programming equation

    s(x) = max(g(x), Ts(x)),   x ∈ E.

(c) If Ex sup_n |g(Xn)| < ∞, x ∈ E, then for any ε > 0 the stopping time

    τ_ε^* = min{n ≥ 0 : s(Xn) ≤ g(Xn) + ε}

is ε-optimal in the class M_0^∞, i.e.,

    s(x) − ε ≤ Ex g(X_{τ_ε^*}),   x ∈ E.

If Px{τ_0^* < ∞} = 1, x ∈ E, then the stopping time τ_0^* is optimal (0-optimal), i.e.,

    s(x) = Ex g(X_{τ_0^*}),   x ∈ E.  (28)

(d) If the set E is finite, then τ_0^* belongs to M_0^∞ and is optimal.
Remark. The stopping time τ_0^* = min{n ≥ 0 : s(Xn) = g(Xn)} may happen to be infinite for some x ∈ E with positive probability, Px{τ_0^* = ∞} > 0. (This can occur even in the case of a countable set of states, Problem 1.) In view of this we should agree about the meaning of Ex g(Xτ) if τ can equal +∞, since the value X∞ has not been defined.
The value of g(X∞) is often defined to be lim sup_n g(Xn) (Subsection 1, Sect. 13, Chap. 7, and [69]). Another option is to consider g(Xτ) I(τ < ∞) instead of g(Xτ). Then, denoting by M̄_0^∞ the class of all Markov times, possibly equal to +∞, the price

    s̄(x) = sup_{τ ∈ M̄_0^∞} Ex g(Xτ) I(τ < ∞)  (29)

is well defined, so that we can consider the optimal stopping problem in the class M̄_0^∞ also.
PROOF OF THEOREM 2. We will give the proof only for the case of a finite set E.
In this case the proof is rather simple and clarifies quite well the appearance of
excessive functions in optimal stopping problems. For the proof in the general case,
see [69, 22].
Proof of (a). Let us show that s(x) is excessive, i.e., s(x) ≥ Ts(x), x ∈ E.
It is obvious that for any state y ∈ E and any ε > 0 there is a finite (Py -a.s.)
Markov time τy ∈ M∞ 0 (depending in general on ε > 0) such that

Ey g(Xτy ) ≥ s(y) − ε. (30)

Using these times τy , y ∈ E, we will construct one more time τ̂ , which will deter-
mine the following strategy of the choice of the stopping time.
Let the particle be in the state x ∈ E at the initial time instant. The observation
process is surely not stopped at this time, and one observation is produced. Let at
n = 1 the particle occur in the state y ∈ E. Then the strategy determined by τ̂
consists, informally speaking, in treating the “life” of the particle as if it started
anew at this time subject further to the stopping rule governed by τy .
The formal definition of τ̂ is as follows.
Let y ∈ E. Consider the event {ω : τy(ω) = n}, n ≥ 0. Since τy is a Markov time, this event belongs to Fn. We assume that Ω is a coordinate space generated by sequences ω = (x0, x1, . . . ) with xi ∈ E, and Fn = σ(x0, . . . , xn). This implies that the set {ω : τy(ω) = n} can be written as {ω : (X0(ω), . . . , Xn(ω)) ∈ By(n)}, where By(n) is a set in E^{n+1} = E ⊗ · · · ⊗ E (n + 1 times). (See also Theorem 4 in Sect. 2, Chap. 2, Vol. 1.)
By definition, the Markov time τ̂ = τ̂(ω) equals n + 1 with n ≥ 0 on the set

    Ân = ⋃_{y∈E} {ω : X1(ω) = y, (X1(ω), . . . , X_{n+1}(ω)) ∈ By(n)}.

(The time τ̂ can be described heuristically as a rule for making an observation at time n = 0, whatever the state x, and using subsequently the Markov time τy if X1 = y.)
Since ⋃_{n≥0} Ân = Ω, τ̂ = τ̂(ω) is well defined for all ω ∈ Ω and is a Markov time (Problem 2).
Using this construction, the generalized Markov property, and (30), we find that, for any x ∈ E,

    Ex g(X_τ̂) = Σ_{n≥0} Σ_{y∈E} Σ_{z∈E} Px{X1 = y, (X1, . . . , X_{n+1}) ∈ By(n), X_{n+1} = z} g(z)
      = Σ_{n≥0} Σ_{y∈E} Σ_{z∈E} pxy Py{X0 = y, (X0, . . . , Xn) ∈ By(n), Xn = z} g(z)
      = Σ_{n≥0} Σ_{y∈E} Σ_{z∈E} pxy Py{(X0, . . . , Xn) ∈ By(n), Xn = z} g(z)
      = Σ_{y∈E} pxy Ey g(X_{τy}) ≥ Σ_{y∈E} pxy (s(y) − ε) = Ts(x) − ε.

Thus,

    s(x) ≥ Ex g(X_τ̂) ≥ Ts(x) − ε,   x ∈ E,

and, since ε > 0 is arbitrary,

    s(x) ≥ Ts(x),   x ∈ E,

which proves that s = s(x), x ∈ E, is an excessive function.


The property just obtained, that s(x) is excessive (superharmonic), immediately
provides the following important result.

Corollary 1. For any x ∈ E the process (sequence)

s = (s(Xn ))n≥0 (31)

is a supermartingale (with respect to the Px -probability).

Theorem 1 of Sect. 2, Chap. 7, applied to this supermartingale implies that for any stopping time τ ∈ M_0^∞ we have

    s(x) ≥ Ex s(Xτ),   x ∈ E,  (32)

and if σ and τ are Markov times in M_0^∞ such that σ ≤ τ (Px-a.s., x ∈ E), then

    Ex s(Xσ) ≥ Ex s(Xτ),   x ∈ E.  (33)

(Note that the conditions of Theorem 1, Sect. 2, Chap. 7, mentioned earlier are fulfilled in this case since the space E is finite.)
Now we deduce from (32) the following corollary.

Corollary 2. Let the function g = g(x), x ∈ E, in the optimal stopping problem (20)
be excessive (superharmonic). Then τ0∗ ≡ 0 is an optimal stopping time.

Proof of (b). Let us show that s(x) = limN sN (x), x ∈ E.


Since sN (x) ≤ sN+1 (x), the limit limN sN (x) exists. Denote it by s̄(x). Since E is
finite and the sN (x), N ≥ 0, satisfy the recurrence relations

sN (x) = max(g(x), TsN−1 (x)),

we can pass to the limit as N → ∞ in them to obtain that

s̄(x) = max(g(x), Ts̄(x)).

This implies that s̄(x) is an excessive majorant for g(x). But s(x) is the least exces-
sive majorant. Hence s(x) ≤ s̄(x). On the other hand, since sN (x) ≤ s(x) for any
N ≥ 0, we have s̄(x) ≤ s(x).
Therefore s̄(x) = s(x), which proves statement (b).
Proof of (c, d). Finally, we will show that the stopping time

τ0∗ = min{n ≥ 0 : s(Xn ) = g(Xn )}, (34)

i.e., the time


τ0∗ = min{n ≥ 0 : Xn ∈ D∗ } (35)
of the first entering the (stopping) set

D∗ = {x ∈ E : s(x) = g(x)} (36)

is (for finite E) optimal in the class M∞ 0 .


To this end, note that the set D∗ is not empty because it certainly contains those x̃
for which g(x̃) = maxx∈E g(x). In these states s(x̃) = g(x̃), and it is obvious that
the optimal strategy with regard to these states is to stop when getting into x̃. This is
exactly what the stopping time τ0∗ does.
To discuss τ0∗ from the point of view of optimality in the class M∞ 0 , we must first
of all establish that this stopping time belongs to this class, i.e., that

Px {τ0∗ < ∞} = 1, x ∈ E. (37)

This is true indeed under our assumption that the state space E is finite. (For an
infinite E this is, in general, not the case; see Problem 1).
For the proof of this, note that the event {τ_0^* = ∞} is the same as A = ⋂_{n≥0} {Xn ∉ D^*}. Thus, we are to show that Px(A) = 0 for all x ∈ E.
Obviously, this is the case if D∗ = E.
Let D^* ≠ E. Since E is finite, there is α > 0 such that g(y) ≤ s(y) − α for all y ∈ E \ D^*. Then, for any τ ∈ M_0^∞,

    Ex g(Xτ) = Σ_{n=0}^∞ Σ_{y∈E} Px{τ = n, Xn = y} g(y)
      = Σ_{n=0}^∞ Σ_{y∈D^*} Px{τ = n, Xn = y} g(y) + Σ_{n=0}^∞ Σ_{y∈E\D^*} Px{τ = n, Xn = y} g(y)
      ≤ Σ_{n=0}^∞ Σ_{y∈D^*} Px{τ = n, Xn = y} s(y) + Σ_{n=0}^∞ Σ_{y∈E\D^*} Px{τ = n, Xn = y} (s(y) − α)
      ≤ Ex s(Xτ) − α Px(A) ≤ s(x) − α Px(A),  (38)

where the last inequality follows because s(x) is excessive (superharmonic) and
satisfies Eq. (32).
Taking the supremum over all τ ∈ M∞ 0 on the left-hand side of (38), we obtain

s(x) ≤ s(x) − α Px (A), x ∈ E.

But |s(x)| < ∞ and α > 0. Therefore Px (A) = 0, x ∈ E, which proves the finiteness
of the stopping time τ0∗ .
We will show now that this stopping time is optimal in the class M∞ 0 . By the
definition of τ0∗ ,
s(Xτ0∗ ) = g(Xτ0∗ ). (39)
With this property in mind, consider the function γ(x) = Ex g(Xτ0∗ ) = Ex s(Xτ0∗ ).
In what follows, we show that γ(x) has the properties:
(i) γ(x) is excessive;
(ii) γ(x) majorizes g(x), i.e., γ(x) ≥ g(x), x ∈ E;
(iii) Obviously, γ(x) ≤ s(x).
Properties (i) and (ii) imply that γ(x) is an excessive majorant for g(x); since s(x) is the least excessive majorant for g(x), it follows that γ(x) ≥ s(x). Hence, by (iii), γ(x) = s(x), x ∈ E, which yields

    s(x) = Ex g(X_{τ_0^*}),   x ∈ E,

thereby proving the required optimality of τ_0^* within M_0^∞.
Let us prove (i). Use the notation τ̄ = min{n ≥ 1 : Xn ∈ D^*}. This is a Markov time, τ_0^* ≤ τ̄, τ̄ ∈ M_1^∞, and since s(x) is excessive, we have, by (33),

    Ex s(X_τ̄) ≤ Ex s(X_{τ_0^*}),   x ∈ E.  (40)

Next, using the generalized Markov property (see (2) in Theorem 1, Sect. 2) we
obtain
    Ex s(X_τ̄) = Σ_{n=1}^∞ Σ_{y∈D^*} Px{X1 ∉ D^*, . . . , X_{n−1} ∉ D^*, Xn = y} s(y)
      = Σ_{n=1}^∞ Σ_{y∈D^*} Σ_{z∈E} pxz Pz{X0 ∉ D^*, . . . , X_{n−2} ∉ D^*, X_{n−1} = y} s(y)
      = Σ_{z∈E} pxz Ez s(X_{τ_0^*}).  (41)

Hence, by (40), we find that

    Ex s(X_{τ_0^*}) ≥ Σ_{z∈E} pxz Ez s(X_{τ_0^*}),

i.e.,

    γ(x) ≥ Σ_{z∈E} pxz γ(z),   x ∈ E,

which shows that the function γ(x) is excessive.


It remains to show that γ(x) majorizes g(x). If x ∈ D^*, then τ_0^* = 0 and, obviously, γ(x) = Ex g(X_{τ_0^*}) = g(x).
Consider the set E \ D^*, and let E_0^* = {x ∈ E \ D^* : γ(x) < g(x)}. Suppose that E_0^* is nonempty, and let x_0^* be the point in the finite set E_0^* where the maximum of g(x) − γ(x) is attained:

    g(x_0^*) − γ(x_0^*) = max_{x∈E_0^*} (g(x) − γ(x)).

Define the function

    γ̃(x) = γ(x) + [g(x_0^*) − γ(x_0^*)],   x ∈ E.  (42)

Clearly, this function is excessive (being the sum of an excessive function and a constant) and

    γ̃(x) − g(x) = [g(x_0^*) − γ(x_0^*)] − [g(x) − γ(x)] ≥ 0

for all x ∈ E. Thus, γ̃(x) is an excessive majorant for g(x), and hence γ̃(x) ≥ s(x), since s(x) is the least excessive majorant for g(x).
This implies that

    γ̃(x_0^*) ≥ s(x_0^*).

But γ̃(x_0^*) = g(x_0^*) by (42); therefore g(x_0^*) ≥ s(x_0^*). Since s(x) ≥ g(x) for all x ∈ E, we obtain g(x_0^*) = s(x_0^*), i.e., x_0^* is in D^*, while x_0^* ∈ E \ D^* by assumption.
This contradiction shows that E_0^* = ∅, so γ(x) ≥ g(x) for all x ∈ E.


6. Let us give some examples.

EXAMPLE 1. Consider the simple random walk with two absorbing barriers described in Example 5, Sect. 8, assuming that p = q = 1/2 (symmetric random walk). If a function γ(x), x ∈ E = {0, 1, . . . , N}, is excessive for this random walk, then

    γ(x) ≥ (1/2) γ(x − 1) + (1/2) γ(x + 1)  (43)

for all x = 1, . . . , N − 1.
Suppose we are given a function g = g(x), x ∈ {0, 1, . . . , N}. Since states 0 and N are absorbing, the function s(x) must be sought among the functions γ(x) satisfying condition (43) and the boundary conditions γ(0) = g(0), γ(N) = g(N).
Condition (43) means that γ(x) is concave on the set {1, 2, . . . , N − 1}. Hence we can conclude that the price s(x) in the problem s(x) = sup_{τ∈M_0^∞} Ex g(Xτ) is the least concave majorant of g(x) subject to the boundary conditions s(0) = g(0), s(N) = g(N). A visual description of the rule for determining s(x) is as follows. Let us cover the values of g(x) from above by a stretched thread. In Fig. 42 this thread passes through the points (0, a), (1, b), (4, c), (6, d), where the points 0, 1, 4, 6 form the set D^* of stopping states. At these points we have s(x) = g(x). The values of s(x) at the other points x = 2, 3, 5 are determined by linear interpolation. In the general case, the least concave majorant s(x), x ∈ E, is constructed in a similar manner.

Fig. 42 The function g(x) (dashed line) and the price s(x), its least concave majorant, x = 0, 1, . . . , 6
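The "stretched thread" construction can be reproduced by value iteration: at the interior states the price is the maximum of g and the average of its two neighbors, while the absorbing barriers keep their values. The payoff below is an arbitrary illustration, not the data of Fig. 42:

```python
import numpy as np

# Illustrative payoff on E = {0, ..., 6}; states 0 and 6 are absorbing.
g = np.array([1.0, 2.0, 0.5, 0.3, 1.5, 0.2, 1.0])

s = g.copy()
for _ in range(500):
    t = s.copy()
    t[1:-1] = np.maximum(g[1:-1], 0.5 * (s[:-2] + s[2:]))  # interior states
    s = t                                                  # barriers stay fixed

assert (s >= g - 1e-9).all()                               # majorant of g
assert (s[1:-1] >= 0.5 * (s[:-2] + s[2:]) - 1e-9).all()    # midpoint concavity (43)
```

The stopping set D^* can then be read off as the states where s and g coincide, exactly as the thread picture suggests.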

EXAMPLE 2. Consider, as in Example 7, Sect. 8, a simple symmetric (p = q = 1/2)


random walk over the set E = {0, 1, . . . , N} with reflecting barriers at 0 and N.
This random walk is positive recurrent, which implies that the optimal rule in the
optimal stopping problem s(x) = supτ ∈M∞ 0
Ex g(Xτ ) has a very simple and natural
structure: wait until the particle reaches a point where g(x) attains its maximum and
then stop the observations.
EXAMPLE 3. Suppose that in a simple symmetric random walk over the set E =
{0, 1, . . . , N} state 0 is absorbing and N is reflecting. Let x0 be the state where g(x)
attains its maximum and that is closest to N if there are several maxima. Then the
optimal stopping rule is as follows: If x0 ≤ x ≤ N, then the walk stops when (with
Px -probability 1) the state x0 is achieved, while for x between 0 and x0 the decision
rule is the same as in Example 1 taking E = {0, 1, . . . , x0 } with absorbing barriers
0 and x0 .
7. Finally, consider the widely known “best choice problem” also known as “the
fiancée problem,” “the secretary problem,” and so on (see [66], [69, 22, 5]). We will
interpret it as “the fiancée problem.”

Suppose that a fiancée is going to choose the best of N candidates. It is assumed


that N is known and the candidates are ranked. Let, for definiteness, the best candi-
date be ranked number N, the second N − 1, and so on to 1.
The candidates are presented to the fiancée in random order, which is formalized
as follows. Let (a1 , a2 , . . . , aN ) be a random permutation taking on any of N! possi-
ble permutations of (1, 2, . . . , N) with probability 1/N!. Then the fiancée meets first
the candidate of rank a1 , then a2 , and so on till the last aN th.
The fiancée can choose one of the candidates according to the following rules.
She does not know the ranks of the candidates and can only compare the current
candidate with those she has seen before. Once rejected, a candidate cannot return
anymore (even though he might have been the best one).
Based on a consecutive assessment of candidates (keeping in mind the outcomes
of their pairwise comparison and the “quality” of rejected candidates), the fiancée
must choose a stopping time τ^* such that

    P{a_{τ^*} = N} = sup_τ P{aτ = N},  (44)

where τ runs over a class of stopping times M_1^N determined by the information accessible to the fiancée.
To describe the class MN1 more precisely, let us construct the rank sequence X =
(X1 , X2 , . . . ) depending on ω = (a1 , a2 , . . . , aN ), which will determine the action
of the fiancée.
That is, let X1 = 1, and let X2 be the order number of the candidate that
dominates all the preceding ones. If, for example, X2 = 3, this means that in
ω = (a1 , a2 , . . . , aN ) we have a1 > a2 , but a3 > a1 (> a2 ). If, say, X3 = 5,
then a3 > a4 , but a5 > a3 (> a4 ).
There might be at most N dominants (when (a1 , a2 , . . . , aN ) = (1, 2, . . . , N)). If
ω = (a1 , a2 , . . . , aN ) contains m dominants, we set Xm+1 = Xm+2 = · · · = N + 1.
The class MN1 of admissible stopping times will consist of those τ = τ (ω) for
which
{ω : τ (ω) = n} ∈ FnX ,
where FnX = σ(X1 , . . . , Xn ), 1 ≤ n ≤ N.
Consider the structure of the rank sequence X = (X1 , X2 , . . . ) in more detail. It
is not hard to see (Problem 3) that this sequence is a homogeneous Markov chain
(with phase space E = {1, 2, . . . , N + 1}). The transition probabilities of this chain
are given by the following formulas:
    pij = i / (j(j − 1)),   1 ≤ i < j ≤ N,  (45)
    p_{i,N+1} = i/N,   1 ≤ i ≤ N,  (46)
    p_{N+1,N+1} = 1.  (47)

It is seen from these formulas that the state N + 1 is absorbing and all the transitions
on set E are upward, i.e., the only possible transitions are i → j with j > i.

Remark. Formula (45) follows from the following simple argument, taking into account that the probability of every sequence ω = (a1, . . . , aN) is 1/N!.
For 1 ≤ i < j ≤ N the transition probability is equal to

    pij = P(X_{n+1} = j | Xn = i) = P{Xn = i, X_{n+1} = j} / P{Xn = i}.  (48)

The event {Xn = i, X_{n+1} = j} means that ai dominates among a1, . . . , a_{j−1} and aj dominates among a1, . . . , aj. The probability of this event is (j − 2)!/j! = 1/(j(j − 1)). In the same way, the event {Xn = i} means that ai dominates among a1, . . . , ai, and the probability of this event is (i − 1)!/i! = 1/i. These considerations and (48) imply (45).
For the proof of (46) it suffices to note that if Xn = i, then X_{n+1} = N + 1 implies that ai dominates both a_{i+1}, . . . , aN and a1, . . . , a_{i−1}. Formula (47) is obvious.

Suppose now that the fiancée adopted a stopping time τ (with respect to the system of σ-algebras (F_n^X)) and Xτ = i. Then the conditional probability that this stopping time is successful (i.e., aτ = N) is, according to (46), equal to Xτ/N = i/N. Therefore

    P{aτ = N} = E [Xτ/N],

and hence seeking the optimal stopping time τ^* (i.e., the stopping time for which P{a_{τ^*} = N} = sup_τ P{aτ = N}) reduces to the optimal stopping problem

    V^* = sup_τ E [Xτ/N],  (49)

where τ is a Markov time with respect to (F_n^X).


It is assumed in (49) that X1 = 1. In accordance with the general method of
solving optimal stopping problems for Markov chains, use the notation

    v(i) = sup_τ Ei g(Xτ),

where Ei is the expectation given that X1 = i, and

    g(i) = i/N,   i ≤ N,   g(N + 1) = 0.
As we know (Theorem 2), the function v(i), 1 ≤ i ≤ N + 1, is an excessive majorant
for g(i), 1 ≤ i ≤ N + 1:


N
i
v(i) ≥ Tv(i) = v(j), (50)
j=i+1
j(j − 1)
v(i) ≥ g(i), (51)

and moreover it is the least excessive majorant. The same Theorem 2 implies that
v(i), 1 ≤ i ≤ N + 1, satisfies the equation
9 Optimal Stopping Problems for Markov Chains 309

v(i) = max(g(i), Tv(i)), 1 ≤ i ≤ N + 1, (52)

and, as is easy to see, v(i) must fulfill

v(N + 1) = 0, v(N) = g(N) = 1.

Denote by D∗ the set of the states i ∈ E, where observations are stopped. By


Theorem 1, this set has the form

D∗ = {i ∈ E : v(i) = g(i)}.

Accordingly, the set where observations are continued is

C∗ = {i ∈ E : v(i) > g(i)}.

Therefore, if i ∈ D∗, then

$g(i) = v(i) \ge Tv(i) = \sum_{j=i+1}^{N} \frac{i}{j}\cdot\frac{1}{j-1}\, v(j) \ge \sum_{j=i+1}^{N} \frac{i}{j}\cdot\frac{1}{j-1}\, g(j) = \sum_{j=i+1}^{N} \frac{i}{j}\cdot\frac{1}{j-1}\cdot\frac{j}{N} = g(i) \sum_{j=i+1}^{N} \frac{1}{j-1}.$

Hence, for i ∈ D∗, we must have

$\sum_{j=i+1}^{N} \frac{1}{j-1} \le 1.$

Further, if this inequality is fulfilled and the values i + 1, ..., N belong to D∗, then

$Tv(i) = \sum_{j=i+1}^{N} \frac{i}{j}\cdot\frac{1}{j-1}\, g(j) = g(i) \sum_{j=i+1}^{N} \frac{1}{j-1} \le g(i),$

so that i also belongs to D∗.


The preceding arguments (together with N ∈ D∗ , since v(N) = g(N)) show that
the set D∗ has the form

D∗ = {i∗ , i∗ + 1, . . . , N, N + 1},

where i∗ = i∗ (N) is determined by the inequalities


1 1 1 1 1 1
+ ∗ + ··· + ≤1< ∗ + + ··· + , (53)
i∗ i +1 N−1 i − 1 i∗ N−1
which imply that for large N
N
i∗ (N) ∼ . (54)
e

Indeed, for any n ≥ 2 we have

$\log(n+1) - \log n < \frac{1}{n} < \log n - \log(n-1).$

Hence

$\log\frac{N}{n} < \frac{1}{n} + \cdots + \frac{1}{N-1} < \log\frac{N-1}{n-1}.$

Together with (53), this yields

$\log\frac{N}{i^*(N)} < 1 < \log\frac{N-1}{i^*(N)-2},$

which implies (54).
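The defining inequalities (53) are also easy to evaluate directly. The following sketch (our own illustration, with arbitrarily chosen values of N) computes i*(N) as the least i for which the sum 1/i + ... + 1/(N−1) does not exceed 1, and checks the asymptotics (54).

```python
# Compute the threshold i*(N) of (53): the least i with
# 1/i + 1/(i+1) + ... + 1/(N-1) <= 1, and check that i*(N) ~ N/e.
import math

def i_star(N):
    tail = 0.0  # accumulates 1/(N-1) + ... + 1/i
    for i in range(N - 1, 0, -1):
        tail += 1.0 / i
        if tail > 1.0:       # first i where the sum exceeds 1,
            return i + 1     # hence i* = i + 1 satisfies (53)
    return 1

print(i_star(10))                       # -> 4
print(i_star(10**6) * math.e / 10**6)   # close to 1, illustrating (54)
```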


Let us now find v = v(i) for i ∈ E = {1, 2, ..., N + 1}.
If i ∈ D∗ = {i∗, i∗ + 1, ..., N, N + 1}, then v(i) = g(i) = i/N. Let i = i∗ − 1. Then

$v(i^*-1) = Tv(i^*-1) = \sum_{j=i^*}^{N} \frac{i^*-1}{j(j-1)}\, g(j) = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big).$

Now let i = i∗ − 2. Then

$v(i^*-2) = Tv(i^*-2) = \frac{i^*-2}{(i^*-1)(i^*-2)}\, v(i^*-1) + \sum_{j=i^*}^{N} \frac{i^*-2}{j(j-1)}\, g(j)$
$= \frac{1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big) + \frac{i^*-2}{N} \sum_{j=i^*}^{N} \frac{1}{j-1} = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big).$

By induction we establish that

$v(i) = v^*(N) = \frac{i^*-1}{N}\Big(\frac{1}{i^*-1} + \cdots + \frac{1}{N-1}\Big)$   (55)

for 1 ≤ i < i∗. Therefore

$v(i) = \begin{cases} v^*(N), & 1 \le i < i^*(N),\\ g(i) = i/N, & i^*(N) \le i \le N, \end{cases}$   (56)

for i ∈ {1, 2, ..., N}.


By (53) we have

$\lim_{N\to\infty}\Big(\frac{1}{i^*(N)-1} + \cdots + \frac{1}{N-1}\Big) = 1,$   (57)

and hence (55) implies that

$\lim_{N\to\infty} v^*(N) = \lim_{N\to\infty} \frac{i^*(N)-1}{N} = \frac{1}{e} \approx 0.368.$   (58)
This result may appear somewhat surprising since it implies that for a large num-
ber N of candidates the fiancée can choose the best one with a fairly high probability,
V ∗ = supτ P{aτ = N} = v∗ (N) ≈ 0.368. The optimal stopping time for that is

τ ∗ = min{n : Xn ∈ D∗ },

where D∗ = {i∗ , i∗ + 1, . . . , N, N + 1}.


Thus the optimal strategy of the fiancée is to see i∗ − 1 candidates without making any choice (where i∗ = i∗(N) ∼ N/e as N → ∞) and then to choose the first one who is better than all those she has already seen.
When N = 10, a more detailed analysis shows (e.g., [22, Section 1, Chap. III])
that i∗ (10) = 4. In other words, in this case the fiancée should see the first three
candidates and then choose the first one who dominates over all the preceding ones.
The probability of choosing the best fiancé in this case (i.e., v∗ (10)) is approximately
0.399.
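Both numbers quoted for N = 10 can be confirmed numerically. The sketch below is our own check, not part of the text (the function names, trial count, and seed are arbitrary): it evaluates (55) directly and then estimates, by Monte Carlo, the success frequency of the rule "skip the first i* − 1 = 3 candidates, then take the first record."

```python
# Check of i*(10) = 4 and v*(10) ~ 0.399: compute v*(N) from (53) and (55),
# then simulate the threshold rule on random permutations.
import random

def threshold_and_value(N):
    """Return (i*, v*) defined by (53) and (55)."""
    tail = 0.0
    for i in range(N - 1, 0, -1):
        tail += 1.0 / i                   # now 1/i + ... + 1/(N-1)
        if tail > 1.0:                    # first time the sum exceeds 1:
            return i + 1, (i / N) * tail  # i* = i+1, v* = ((i*-1)/N)*tail
    return 1, 0.0

def simulate_rule(N, skip, trials, seed=1):
    """Fraction of permutations in which the threshold rule picks the best."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        perm = rng.sample(range(1, N + 1), N)
        best_seen = max(perm[:skip])
        chosen = next((x for x in perm[skip:] if x > best_seen), None)
        wins += (chosen == N)
    return wins / trials

i_star, v_star = threshold_and_value(10)
print(i_star, round(v_star, 3))                            # -> 4 0.399
print(simulate_rule(10, skip=i_star - 1, trials=200_000))  # ~ 0.399
```

The exact value is v*(10) = (3/10)(1/3 + 1/4 + ... + 1/9) ≈ 0.3987, and the simulated frequency agrees with it to within sampling error.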
8. Problems
1. Give an example showing that the optimal stopping time (in the class $M_0^\infty$) may not exist for Markov chains with a countable state space.
2. Check that the time τy introduced in the proof of Theorem 2 is a Markov time.
3. Show that the sequence X = (X1 , X2 , . . . ) in the fiancée problem forms a ho-
mogeneous Markov chain.
4. Let X = (X_n)_{n≥0} be a real-valued homogeneous Markov chain with transition function P = P(x; B), x ∈ R, B ∈ B(R). An R-valued function f = f(x), x ∈ R, is P-harmonic (or harmonic with respect to P) if

$E_x |f(X_1)| = \int_{R} |f(y)|\, P(x; dy) < \infty, \quad x \in R,$

and

$f(x) = \int_{R} f(y)\, P(x; dy), \quad x \in R.$   (59)

(If the equality sign in (59) is replaced by ≥, then f is superharmonic.) Prove that if f is superharmonic, then for any x ∈ R the sequence (f(X_n))_{n≥0} with X_0 = x is a supermartingale (with respect to P_x).
5. Show that the stopping time τ̄ involved in (38) belongs to the class $M_1^\infty$.
6. Similarly to Example 1 in Subsection 6, consider the optimal stopping problems

$s_N(x) = \sup_{\tau \in M_0^N} E_x\, g(X_\tau)$

and

$s(x) = \sup_{\tau \in M_0^\infty} E_x\, g(X_\tau)$

for the simple random walks treated in the examples in Sect. 8.


Development of Mathematical Theory
of Probability: Historical Review

In the history of probability theory, we can distinguish the following periods of its
development (cf. [34, 46]∗ ):
1. Prehistory,
2. First period (the seventeenth century–early eighteenth century),
3. Second period (the eighteenth century–early nineteenth century),
4. Third period (second half of the nineteenth century),
5. Fourth period (the twentieth century).
Prehistory. Intuitive notions of randomness and the beginning of reflection about
chances (in ritual practice, deciding controversies, fortune telling, and so on) ap-
peared far back in the past. In the prescientific ages, these phenomena were regarded
as incomprehensible for human intelligence and rational investigation. It was only
several centuries ago that their understanding and logically formalized study began.
Archeological findings tell us about ancient “randomization instruments”, which
were used for gambling games. They were made from the ankle bone (latin: astra-
galus) of hoofed animals and had four faces on which they could fall. Such dice were
definitely used for gambling during the First Dynasty in Egypt (around 3500 BC),
and then in ancient Greece and ancient Rome. It is known ([14]) that the Roman
Emperors Augustus (63 BC–14 AD) and Claudius (10 BC–54 AD) were passionate dicers.
In addition to gambling, which even at that time raised issues of favorable and
unfavorable outcomes, similar questions appeared in insurance and commerce. The
oldest forms of insurance were contracts for maritime transportation, which were
found in Babylonian records of the 4th to 3rd Millennium BC. Afterwards the prac-
tice of similar contracts was taken over by the Phoenicians and then came to Greece,
Rome, and India. Its traces can be found in early Roman legal codes and in legislation of the Byzantine Empire. In connection with life insurance, the Roman jurist Ulpian compiled (around 220 AD) the first mortality tables.

∗ The citations here refer to the list of References following this section.

© Springer Science+Business Media, LLC, part of Springer Nature 2019
A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95,
https://doi.org/10.1007/978-0-387-72208-5

In the time of the flourishing Italian city-states (Rome, Venice, Genoa, Pisa, Florence), the practice of insurance created the need for statistics and actuarial calculations. It is known that the first dated life insurance contract was concluded in Genoa in 1347.
The city-states gave rise to the Renaissance (fourteenth to early seventeenth cen-
turies), the period of social and cultural upheaval in Western Europe. In the Italian
Renaissance, there appeared the first discussions, mostly of a philosophical nature,
regarding the “probabilistic” arguments, attributed to Luca Pacioli (1447–1517),
Celio Calcagnini (1479–1541), and Niccòlo Fontana Tartaglia (1500–1557) (see
[46, 14]).
Apparently, one of the first people to mathematically analyse gambling chances
was Gerolamo Cardano (1501–1576), who was widely known for inventing the Car-
dan gear and solving the cubic equation (although this was apparently solved by
Tartaglia, whose solution Cardano published). His manuscript (written around 1525
but not published until 1663) “Liber de ludo aleae” (“Book on Games of Chance”)
was more than a kind of practical manual for gamblers. Cardano was first to state the
idea of combinations by which one could describe the set of all possible outcomes
(in throwing dice of various kinds and numbers). He observed also that for true
dice “the ratio of the number of favorable outcomes to the total number of possible
outcomes is in good agreement with gambling practice” ([14]).
1. First period (the seventeenth century–early eighteenth century). Many math-
ematicians and historians, such as Laplace [44] (see also [64]), related the beginning
of the “calculus of probabilities” with correspondence between Blaise Pascal (1623–
1662) and Pierre de Fermat (1601–1665). This correspondence arose from certain
questions that Antoine Gombaud (alias Chevalier de Méré, a writer and moralist, 1607–1684) asked Pascal.
One of the questions was how to divide the stake in an interrupted game. Namely, suppose two gamblers, A and B, agreed to play until one of them wins five games, but were interrupted by an external cause when A had won 4 games and B had won 3. A seemingly natural answer is to divide the stake in the proportion 2 : 1. Indeed, the game would certainly finish within two more games, of which A needs to win only one, while B has to win both; this apparently implies the proportion 2 : 1. On the other hand, A has won 4 games against 3 won by B, so that the proportion 4 : 3 also looks natural. In fact, the correct answer found by Pascal and Fermat was 3 : 1: B takes the stake only if he wins both remaining games, an event of probability 1/4.
Another question was: what is more likely, to have at least one 6 in 4 throws of a die, or to have at least one pair (6, 6) in 24 throws of two dice? In this problem, Pascal and Fermat also gave the correct answer: the former event is slightly more probable than the latter (1 − (5/6)^4 ≈ 0.518 against 1 − (35/36)^24 ≈ 0.491).
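The two probabilities are a one-line computation each; the following snippet (our own check, not part of the text) confirms them.

```python
# De Mere's comparison: at least one 6 in 4 throws of a die, versus
# at least one double six in 24 throws of a pair of dice.
p_one_six = 1 - (5 / 6) ** 4         # complement of "no 6 in 4 throws"
p_double_six = 1 - (35 / 36) ** 24   # complement of "no (6,6) in 24 throws"
print(round(p_one_six, 3), round(p_double_six, 3))  # -> 0.518 0.491
```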
In solving these problems, Pascal and Fermat (as well as Cardano) applied com-
binatorial arguments that became one of the basic tools in “calculus of probabil-
ities” for the calculation of various chances. Among these tools, Pascal’s triangle
also found its place (although it was known before).
In 1657, the book by Christianus Huygens (1629–1695) “De Ratiociniis in Ludo
Aleæ” (“On Reasoning in Games of Chance”) appeared, which is regarded as the

first systematic presentation of the “calculus of probabilities”. In this book, Huy-


gens formulates many basic notions and principles, states the rules of addition and
multiplication of probabilities, and discusses the concept of expectation. This book
became for a long time the main textbook in elementary probability theory.
A prominent figure of this formative stage of probability theory was Jacob
(James, Jacques) Bernoulli (1654–1705), who is credited with introducing the clas-
sical concept of the “probability of an event” as the ratio of the number of outcomes
favorable for this event to the total number of possible outcomes.
The main result of J. Bernoulli, with which his name is associated, is, of course,
the law of large numbers, which is fundamental for all applications of probability
theory.
This law, stated as a limit theorem, is dated from 1713 when Bernoulli’s trea-
tise “Ars Conjectandi” (“The Art of Guessing”) was published (posthumously)
with involvement of his nephew Nikolaus Bernoulli (see [3]). As was indicated by
A. A. Markov in his speech on the occasion of the 200th anniversary of the law of
large numbers (see [56]), J. Bernoulli wrote in his letters (of October 3, 1703, and
April 20, 1704) that this theorem was known to him “already twelve years ago”.
(The very term “law of large numbers” was proposed by Poisson in 1835.)
Another member of the Bernoulli family, Daniel Bernoulli (1700–1782), is known in probability theory in connection with the discussion regarding the so-called "St. Petersburg paradox", for which he proposed to use the notion of "moral expectation".
The first period of development of probability theory coincided in time with the
formation of mathematical natural science. This was the time when the concepts of
continuity, infinity, and infinitesimally small quantities prevailed. This was when Is-
saac Newton (1642–1727) and Gottfried Wilhelm Leibniz (1646–1716) developed
differential and integral calculus. As A. N. Kolmogorov [34] wrote, the problem of
that epoch was to “comprehend the extraordinary breadth and flexibility (and om-
nipotence, as appeared then) of the mathematical method of study of causality. The
idea of a differential equation as a law which determines uniquely the evolution of a system from its present state on took then an even more prominent position in the mathematical natural science than now. The probability theory is needed in the
mathematical natural science when this deterministic approach based on differen-
tial equations fails. But at that time there was no concrete numerical material for
application of probability theory.”
Nevertheless, it became clear that the description of real data by deterministic
models like differential equations was inevitably only a rough approximation. It
was also understood that, in the chaos of large masses of unrelated events, certain regularities may appear on average. This foreshadowed the fundamental natural-philosophic role of the probability theory, which was revealed by J. Bernoulli's law
of large numbers.
It should be noted that J. Bernoulli realized the importance of dealing with in-
finite sequences of repeated trials, which was a radically new idea in probabilistic
considerations restricted at that time to elementary arithmetic and combinatorial
tools. The statement of the question that led to the law of large numbers revealed

the difference between the notions of the probability of an event and the frequency
of its appearance in a finite number of trials, as well as the possibility of determina-
tion of this probability (with certain accuracy) from its frequency in large numbers
of trials.
2. Second period (the eighteenth century–early nineteenth century). This period
is associated, essentially, with the names of Pierre-Rémond de Montmort (1678–
1719), Abraham de Moivre (1667–1754), Thomas Bayes (1702–1761), Pierre Si-
mon de Laplace (1749–1827), Carl Friedrich Gauss (1777–1855), and Siméon De-
nis Poisson (1781–1840).
While the first period was largely of a philosophical nature, in the second one the
analytic methods were developed and perfected, computations became necessary in
various applications, and probabilistic and statistical approaches were introduced in
the theory of observation errors and shooting theory.
Both Montmort and de Moivre were greatly influenced by Bernoulli’s work in the
calculus of probability. In his book “Essai d’Analyse sur les Jeux de Hasard” (“Es-
say on the analysis of gambling”), Montmort pays major attention to the methods of
computations in diverse gambling games.
In the books “Doctrine of Chances” (1718) and “Miscellanea Analytica” (“An-
alytical Miscellany”, 1730), de Moivre carefully defines such concepts as indepen-
dence, expectation, and conditional probability.
De Moivre’s name is best known in connection with the normal approximation
for the binomial distribution. While J. Bernoulli’s law of large numbers showed that
the relative frequencies obey a certain regularity, namely, they converge to the corre-
sponding probabilities, the normal approximation discovered by de Moivre revealed
another universal regularity in the behavior of deviations from the mean value. This
de Moivre’s result and its subsequent generalizations played such a significant role
in the probability theory that the corresponding “integral limit theorem” became
known as the Central Limit Theorem. (This term was introduced by G. Pólya (1887–
1985) in 1920, see [60].)
The main figure of this period was, of course, Laplace (Pierre-Simon de Laplace,
1749–1827). His treatise “Théorie analytique des probabilités” (“Analytic Theory
of Probability”) published in 1812 was the main manual on the probability theory in
the nineteenth century. He also wrote several memoirs on foundations, philosophical
issues, and particular problems of the probability theory, in addition to his works on
astronomy and calculus. He made a significant contribution to the theory of errors.
The idea that the measurement errors are normally distributed as a result of the
summation of many independent elementary random errors is due to Laplace and
Gauss. Laplace not only restated de Moivre's integral limit theorem in a more general
form (the “de Moivre–Laplace theorem”), but also gave it a new analytic proof.
Following Bernoulli, Laplace maintained the equiprobability principle implying
the classical definition of probability (in the case of finitely many possible out-
comes).
However, already at that time there appeared “nonclassical” probability distri-
butions that did not conform to the classical concepts. So were, for example, the

normal and Poisson laws, which for a long time were considered merely as certain
approximations rather than probability distributions per se (in the modern sense of
the term).
Other problems, where “nonclassical” probabilities arose, were the ones related
to “geometric probabilities” (treated, e.g., by Newton 1665, see [55, p. 60]). An
example of such a problem is the “Buffon needle”. Moreover, unequal probabilities
arose from the Bayes formula (presented in “An Essay towards Solving a Problem in
the Doctrine of Chances” which was read to the Royal Society in 1763 after Bayes’
death). This formula gives the rule for recalculation of prior probabilities (assumed
equal by Bayes) into posterior ones given the occurrence of a certain event. This
formula gave rise to the statistical approach called nowadays “the Bayes approach”.
It can be seen from all that has been said that the framework of the “classical”
(finite) probability theory limited the possibilities of its development and applica-
tion, and the interpretation of the normal, Poisson, and other distributions merely
as limiting objects was giving rise to the feeling of incompleteness. During this pe-
riod, there were no abstract mathematical concepts in the probability theory and it
was regarded as nothing but a branch of applied mathematics. Moreover, its meth-
ods were confined to the needs of specific applications (such as gambling, theory of
observation errors, theory of shooting, insurance, demography, and so on).
3. Third period (second half of the nineteenth century). During the third pe-
riod, the general problems of probability theory developed primarily in St. Peters-
burg. The Russian mathematicians P. L. Chebyshev (1821–1894), A. A. Markov
(1856–1922), and A. M. Lyapunov (1857–1918) made an essential contribution to
the broadening and in-depth study of probability theory. It was due to them that
the limitation to “classical” probability was abandoned. Chebyshev clearly realized
the role of the notions of a random variable and expectation and demonstrated their
usability, which have now become a matter of course.
Bernoulli’s law of large numbers and the de Moivre–Laplace theorem dealt with
random variables taking only two values. Chebyshev extended the scope of these
theorems to much more general random variables. Already his first result estab-
lished the law of large numbers for sums of arbitrary independent random variables
bounded by a constant. (The next step was done by Markov who used in the proof
the “Chebyshev–Markov inequality”.)
After the law of large numbers, Chebyshev turned to establishing the de Moivre–
Laplace theorem for sums of independent random variables, for which he worked
out a new tool, the method of moments, which was later elaborated by Markov.
The next unexpected step in finding general conditions for the validity of the de
Moivre–Laplace theorem was done by Lyapunov, who used the method of charac-
teristic functions taking its origin from Laplace. He proved this theorem assuming
only that the random variables involved in the sum have moments of order 2 + δ, δ > 0 (rather than the moments of all orders required by the method of moments) and satisfy Lyapunov's condition.
Moreover, Markov introduced a principally new concept, namely, that of a se-
quence of dependent random variables possessing the memoryless property known
nowadays as a Markov chain, for which he rigorously proved the first “ergodic the-
orem”.

Thus, we can definitely state that the works of Chebyshev, Markov, and Lyapunov
(“Petersbourg school”) laid the foundation for all subsequent development of the
probability theory.
In Western Europe, interest in probability theory in the late nineteenth century was rapidly increasing due to its deep connections, discovered at that time, with pure mathematics, statistical physics, and flourishing mathematical statistics.
It became clear that the development of probability theory was restrained by its
classical framework (finitely many equiprobable outcomes) and its extension had to
be sought in the models of pure mathematics. (Recall that at that time set theory was only beginning to be developed and measure theory was on the threshold of its creation.)
At the same time, pure mathematics, particularly number theory, which is an area
apparently very remote from probability theory, began to use concepts and obtain
results of a probabilistic nature with the help of probabilistic intuition.
For example, Jules Henri Poincaré (1854–1912), in his paper [58] of 1890 deal-
ing with the three-body problem, stated a result on return of the motion of a dynam-
ical system described by a transformation T preserving the “volume”. This result
asserted that if A is the set of initial states ω, then for "typical" states ω ∈ A the trajectories T^n ω return to the set A infinitely often (in the modern language, the system returns for almost all [rather than for all] initial states of the system).
In considerations of that time, expressions like “random choice”, “typical case”,
“special case” are often used. In the handbook Calcul des Probabilités [59], 1896,
H. Poincaré asks the question about the probability that a randomly chosen point of
[0, 1] happens to be a rational number.
In 1888, the astronomer Johan August Hugo Guldén (1841–1896) published a
paper [24] that dealt (like Poincaré’s [58]) with planetary stability and which nowa-
days would fall within the domain of number theory.
Let ω ∈ [0, 1] be a number chosen “at random” and let ω = (a1 , a2 , . . . )
be its continued fraction representation, where an = an (ω) are integers. (For a
rational number ω there are only finitely many nonvanishing an ’s; the numbers
ω k = (a1 , a2 , . . . , ak , 0, 0, . . . ) formed from the representation ω = (a1 , a2 , . . . )
are in a sense “best possible” rational approximations of ω.) The question is how
the numbers an (ω) behave for large n in “typical” cases.
Guldén established (though nonrigorously) that the "probability" to have a_n = k in the representation ω = (a_1, a_2, ...) is "more or less" inversely proportional to k² for large k. Somewhat later, T. Brodén [9] and A. Wiman [69] showed, by dealing with geometric probabilities, that if the "random" choice of ω ∈ [0, 1] is determined by the uniform distribution of ω on [0, 1], then the probability that a_n(ω) = k tends to

$(\log 2)^{-1} \log\frac{1 + \frac{1}{k}}{1 + \frac{1}{k+1}}$

as n → ∞. This expression is inversely proportional to k² for large k, which is essentially what Guldén meant.
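The limit distribution above (known as the Gauss–Kuzmin law) can be inspected numerically; the sketch below is our own illustration, not from the text. It verifies that the probabilities p_k sum to 1 and that k(k+2)·p_k approaches the constant 1/log 2, so that p_k is indeed inversely proportional to k² for large k.

```python
# Gauss-Kuzmin probabilities p_k = (log 2)^(-1) log((1 + 1/k)/(1 + 1/(k+1))).
import math

def p(k):
    return math.log((1 + 1 / k) / (1 + 1 / (k + 1))) / math.log(2)

total = sum(p(k) for k in range(1, 10**5))  # the sum telescopes to ~1
print(round(total, 4))                      # -> 1.0
print(round(p(1000) * 1000 * 1002, 4))      # -> 1.4427, i.e. 1/log 2
```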
In the second half of the nineteenth century, the probabilistic concepts and ar-
guments found their way to the classical physics and statistical mechanics. Let us
mention, for example, the Maxwell distribution (James Clerk Maxwell, 1831–1879)

for molecular velocities, see [51], and Boltzmann’s temporal averages and ergodic
hypothesis (Ludwig Boltzmann, 1844–1906), see [6, 7].
With their names, the concept of a statistical ensemble is connected, which was
further elaborated by Josiah Willard Gibbs (1839–1903), see [23].
An important role in development of probability theory and understanding its
concepts and approaches was played by the discovery in 1827 by Robert Brown
(1773–1858) of the phenomenon now known as Brownian motion, which he de-
scribed in the paper “A Brief Account of Microscopical Observations . . .” published
in 1828 (see [11]). Another phenomenon of this kind was the radioactive decay dis-
covered in 1896 by Antoine Henri Becquerel (1852–1908), who studied the proper-
ties of uranium. In 1900, Louis Bachelier (1870–1946) used the Brownian motion
for mathematical description of stock value, see [2].
A qualitative explanation and quantitative description of the Brownian motion
was given later by Albert Einstein (1879–1955) [15] and Marian Smoluchowski
(1872–1917) [63]. The phenomenon of radioactivity was explained in the frame-
work of quantum mechanics, which was created in 1920s.
From all that has been said, it becomes apparent that the appearance of new
probabilistic models and the use of probabilistic methodology were far beyond the
scope of the “classical probability” and required new concepts that would enable one
to give a precise mathematical meaning to expressions such as “randomly chosen
point from the interval [0, 1]”, let alone the probabilistic description of Brownian
motion. From this perspective, very well-timed were measure theory and the notion
of the “Borel measure” introduced by Émile Borel (1871–1956) in 1898 [8], and
the theory of integration by Henri Lebesgue (1875–1941) exposed in his book [45]
of 1904. (Borel introduced the measure on the Euclidean space as a generalization
of the notion of length. The modern presentation of measure theory on abstract
measurable spaces follows Maurice Fréchet (1878–1973), see [22] of 1915. The
history of measure theory and integration can be found, e.g., in [25].)
It was immediately recognized that Borel’s measure theory along with
Lebesgue’s theory of integration form the conceptual basis that may justify many
probabilistic considerations and give a precise meaning to intuitive formulations
like the “random choice of a point from [0, 1]”. Soon afterwards (1905), Borel
himself produced an application of the measure-theoretic approach to the proba-
bility theory by proving the first limit theorem, viz. strong law of large numbers,
regarding certain properties of real numbers that hold “with probability one”.
This theorem, giving a certain idea of “how many” real numbers with exceptional
(in the sense to be specified) properties are there, consists of the following.
Let ω = 0.α_1 α_2 ... be the binary representation, with α_n = 0 or 1, of a real number ω ∈ [0, 1] (compare with the continued fraction representation ω = (a_1, a_2, ...) considered above). Let ν_n(ω) be the (relative) frequency of ones among the first n digits α_1, ..., α_n. Then the set of those numbers ω for which ν_n(ω) → 1/2 as n → ∞ ("normal" numbers according to Borel) has Borel measure 1, while the ("exceptional") numbers for which this convergence fails form a set of measure zero.
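Borel's statement is easy to illustrate empirically. In the sketch below (ours, not from the text), the binary digits of a "randomly chosen" ω are simulated by fair coin flips, and the frequency of ones settles near 1/2.

```python
# Frequency of ones among the first n binary digits of a "random" omega:
# by Borel's strong law it should approach 1/2 for almost every omega.
import random

rng = random.Random(0)
digits = [rng.randint(0, 1) for _ in range(100_000)]  # alpha_1, ..., alpha_n
nu = sum(digits) / len(digits)                        # nu_n(omega)
print(nu)  # close to 0.5
```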

This result (Borel’s law of large numbers) bears a superficial resemblance to


Bernoulli’s law of large numbers. However, there is a great formally mathematical
and conceptually philosophical difference between them. In fact, the law of large
numbers says that for any ε > 0 the probability of the event {ω : |νn (ω) − 12 | ≥ }
tends to 0 as n → ∞. But the strong law 1 of large numbers says more, 2 namely,
it states that the probability of the event ω : supm≥n |νm (ω) − 12 | ≥ ε tends to
0. Further, in the former case, the assertion concerns the probabilities related to fi-
nite sequences (α1 , α2 , . . . , αn ), n ≥ 1, and the limits of these probabilities. The
latter case, however, deals with infinite sequences (α1 , α2 , . . . αn , . . .) and probabil-
ities related to them.† (A detailed presentation of a wide variety of mathematical
and philosophical issues connected with application of probabilistic methods in the
number theory, as well as a comprehensive information about the development of
the modern probability theory, can be found in Jan von Plato’s “Creating Modern
Probability” [57].)
4. Fourth period (the twentieth century). The interplay between the probability
theory and pure mathematics, which became apparent by the end of the nineteenth
century, made David Hilbert (1862–1943) pose the problem of mathematization of
the probability theory in his lecture at the 2nd Mathematical Congress in Paris on
August 8, 1900. Among his renowned problems (the first of which pertained to the
continuum-hypothesis), the sixth was formulated as the one of axiomatization of
physical disciplines where mathematics plays a dominant role. Hilbert associated
with these disciplines the probability theory and mechanics, having pointed out the
necessity of the rigorous development of the method of mean values in physics and,
in particular, in kinetic theory of gases. (Hilbert pointed out that the axiomatization
of the probability theory was initiated by Georg Bohlmann (1869–1928), a privatdozent in Göttingen, who spoke on this matter at the Actuarial Congress in Paris, 1900, see [5, 62]. The probability introduced by Bohlmann was defined as a (finitely additive) function of events, but without a clear definition of the system of events, which he himself well recognized.)
The fourth period in the development of probability theory is the time of its
logical justification and becoming a mathematical discipline.
Soon after Gilbert’s lecture, several attempts of building a mathematical theory
of probability involving elements of set and measure theory were made. In 1904,
R. Lämmel [43], see also [62], used set theory to describe possible outcomes. How-
ever, the notion of probability (termed “content” and associated with volume, area,
length, etc.) remained on the intuitive level of the previous period.
In the thesis [10] (see also [62]), produced in 1907 under the guidance of Hilbert,
Ugo Broggi (1880–1965) exploited Borel’s and Lebesgue’s measure (using its pre-
sentation in Lebesgue’s book [45] of 1904), but the notion of the (finite-additive)

† Bernoulli’s LLN dealt with a problem quite different from the one solved by the SLLN, namely,
obtaining an approximation for the distribution of the sum of n independent variables. Of course,
it was only the first step in this direction; the next one was the de Moivre–Laplace theorem. This
problem does not concern infinite sequences of random variables and is of current interest for
probability theory and mathematical statistics. For the modern form of the LLN, see, e.g., the
degenerate convergence criterion in Loève's "Probability Theory" (Translator).

probability required (in the simplest cases) the concepts of "relative measures" and "relative frequencies" and (in the general case) some artificial limiting procedures.
Among the authors of subsequent work on logical justification of probability
theory, we mention first of all S. N. Bernstein (1880–1968) and Richard von Mises
(1883–1953).
Bernstein’s system of axioms ([4], 1917) was based on the notion of qualita-
tive comparison of events according to their greater or smaller likelihood. But the
numerical value of probability was defined as a subordinate notion.
Afterwards, a very similar approach based on subjective qualitative statements
(“system of knowledge of the subject”) was extensively developed by Bruno de
Finetti (1906–1985) in the late 1920s–early 1930s (see, e.g., [16–21]).
The ideas of de Finetti found support from some statisticians following the Bayes
approach, e.g., Leonard Savage (1917–1971), see [61], and were adopted in game
and decision theory, where subjectivity plays a significant role.
In 1919, Mises proposed ([52, 53]) the frequentist (or, in other terms, statistical or empirical) approach to the foundation of probability theory. His basic idea was that probabilistic concepts are applicable only to so-called "collectives", i.e., individual infinite ordered sequences possessing a certain property of "random" formation. The general scheme of Mises may be outlined as follows.
We have a space of outcomes of an “experiment” and assume that we can produce
infinitely many trials resulting in a sequence x = (x1 , x2 , . . .) of outcomes. Let
A be a subset of the set of outcomes and νn(A; x) = n^{−1} ∑_{i=1}^{n} I_A(x_i) the relative
frequency of occurrence of the “event” A in the first n trials.
The sequence x = (x1 , x2 , . . .) is said to be a “collective” if it satisfies the fol-
lowing two postulates (which Mises calls alternative conditions; see [52–54]):
(i) (The existence of the limiting frequencies for the sequence). For all “admissi-
ble” sets A, there exists the limit

lim_n νn(A; x) (= P(A; x)).

(ii) (The existence of the limiting frequencies for subsequences). For all subse-
quences x′ = (x′1, x′2, . . .) obtained from the sequence x = (x1, x2, . . .) by means of
a certain preconditioned system of (“admissible”) rules (termed by Mises “place-
selection functions”), the limits of frequencies lim_n νn(A; x′) must be the same as
for the initial sequence x = (x1, x2, . . .), i.e., must be equal to lim_n νn(A; x).
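Mises’ two postulates lend themselves to a numerical illustration. The sketch below (Python; the place-selection rule is an arbitrary illustrative choice, not taken from Mises) computes relative frequencies νn(A; x) for a simulated coin-tossing sequence and for a subsequence selected by a rule that looks only at past outcomes:

```python
import random

def nu(A, xs):
    """Relative frequency nu_n(A; x) of the event A among the outcomes xs."""
    return sum(1 for v in xs if v in A) / len(xs)

random.seed(0)
x = [random.randint(0, 1) for _ in range(100_000)]  # simulated trials
A = {1}

# Postulate (i): the frequencies nu_n(A; x) settle near a limit P(A; x).
print([nu(A, x[:n]) for n in (100, 10_000, 100_000)])

# Postulate (ii): a place-selection rule using only past outcomes
# (keep x_{i+1} iff x_i = 0) must leave the limiting frequency unchanged.
sub = [x[i + 1] for i in range(len(x) - 1) if x[i] == 0]
print(nu(A, sub))
```

Both printed frequencies hover near 1/2: the selection rule, being blind to the value it selects, cannot shift the limit.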
According to Mises, one can speak of the “probability of A” only in connection
with a certain “collective”, and this probability P(A; x) is defined (by (i)) as the limit
limn νn (A; x). It should be emphasized that if this limit does not exist (so that x by
definition is not a “collective”), this probability is not defined. The second postulate
was intended to set forth the concept of “randomness” in the formation of the “col-
lective” x = (x1 , x2 , . . .) (which is the cornerstone of the probabilistic reasoning
and must be in accordance with intuition). It had to express the idea of “irregu-
larity” of this sequence and “unpredictability” of its “future values” (xn , xn+1 , . . .)
from the “past” (x1 , x2 , . . . , xn−1 ) for any n ≥ 1. (In probability theory based on
Kolmogorov’s axioms presented in Sect. 1, Chap. 2, Vol. 1, such sequences are “typical” sequences of independent identically distributed random variables, see Subsection 4 of Sect. 5, Chap. 1).
The postulates used by Mises in the construction of “a mathematical theory of repetitive events” (as he wrote in [54]) caused much discussion and criticism, especially
in the 1930s. The objections concerned mainly the fact that in practice we deal only
with finite rather than infinite sequences. Therefore, in reality, it is impossible to de-
termine whether the limit limn νn (A; x) does exist and how sensitive this limit is to
taking it along a subsequence x′ instead of the sequence x. Also severely criticised
were Mises’ manner of defining “admissible” rules for selecting subsequences, as
well as the vagueness in specifying the set of those (“test”) rules that can be
considered in the alternative condition (ii).
If we consider a sequence x = (x1, x2, . . .) of zeroes and ones such that the
limit lim_n νn({1}; x) lies in (0, 1), then this sequence must contain infinitely many
zeroes as well as ones. Therefore, if arbitrary rules of forming subsequences are admitted,
then we can always take a subsequence x′ of x consisting, e.g., only of ones, for which
lim_n νn({1}; x′) = 1.
of taking subsequences do not exist.
The first step towards the proof that the class of collectives is not empty was taken
in 1937 by Abraham Wald (1902–1950), see [68]. In his construction, the rules of se-
lecting subsequences x′ = (x′1, x′2, . . .) from x = (x1, x2, . . .) were determined by a
countable collection of functions fi = fi(x1, . . . , xi), i = 1, 2, . . ., taking the two values 0
and 1, so that xi+1 is included into x′ if fi(x1, . . . , xi) = 1 and not included otherwise.
In 1940, Alonzo Church (1903–1995) proposed [12] another approach to forming
subsequences based on the idea that every rule must be “effectively computable” in
practice. This idea led Church to the concept of algorithmically computable func-
tions (i.e., computable by means of, say, Turing machine). (Let, for example, xj take
two values, ω1 = 0 and ω2 = 1. Associate with (x1, . . . , xn) the integer

λn = ∑_{k=1}^{n} i_k 2^{k−1},

where i_k is defined by x_k = ω_{i_k}. Let ϕ = ϕ(λ) be a {0, 1}-valued function defined
on the set of nonnegative integers. Then xn+1 is included into x′ if ϕ(λn) = 1 and
not included otherwise.)
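Church’s encoding admits a direct transcription. The sketch below (with an arbitrary illustrative choice of the function ϕ, not one from Church’s paper) selects a subsequence by maintaining the encoding λn of each prefix:

```python
def church_select(x, phi):
    """Select a subsequence of the binary chain x by Church's rule:
    x_{n+1} is included iff phi(lambda_n) = 1, where lambda_n encodes the
    prefix (x_1, ..., x_n) as sum_{k=1}^{n} i_k 2^(k-1) with i_k = x_k + 1
    (so that x_k = omega_{i_k} for omega_1 = 0, omega_2 = 1)."""
    selected = []
    lam = 0
    for n, xn in enumerate(x):       # n = 0, 1, ... corresponds to k = n + 1
        if n > 0 and phi(lam) == 1:  # decide about x_{n+1} from lambda_n
            selected.append(xn)
        lam += (xn + 1) << n         # add i_{n+1} * 2^n to the encoding
    return selected

# Illustrative rule: select after prefixes whose encoding is even.
phi = lambda lam: 1 if lam % 2 == 0 else 0
print(church_select([1, 0, 1, 1, 0], phi))
```

Any such ϕ is algorithmically computable in Church’s sense, since it operates on a single integer computed from the past outcomes only.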
For explanation and justification of his concept of a “collective” as a sequence
with the “randomness” property, Mises brought forward a heuristic argument that it
is impossible, for such sequences, to construct a “winning system of a game”.
These arguments were critically analyzed in a 1939 monograph [65] by Jean
Ville (1910–1988), where he put Mises’ reasoning into a rigorous mathematical
form. It is interesting to note that this is the paper where the term “martingale” (in
the mathematical sense) was first used.
The above description of various approaches to the axiomatics of probability
theory (e.g., Bernstein, de Finetti, Mises) shows that they were complicated and
overburdened with concepts stemming from the intention of their authors to make
probability theory closer to applications. As Kolmogorov pointed out in his Foun-
dations of the Theory of Probability [33], this could not lead to a simple system of
axioms.
The first publication by Kolmogorov demonstrating his interest in the logical jus-
tification of probability theory was his paper (regrettably, not widely known) Gen-
eral measure theory and calculus of probability [27], see also [37]. Both the title
of the paper and its content show that Kolmogorov envisaged the way of the logi-
cal justification of probability theory in the framework of measure and set theory.
As follows from the above exposition, this was not a novelty and was quite natu-
ral for the Moscow mathematical school, where set theory and the metric theory of
functions were prevailing directions of research.
Between this paper (1929) and the appearance of Foundations ([31], 1933)
Kolmogorov published one of his most renowned probabilistic papers, “On Analytic
Methods in Probability Theory” [28]. P. S. Aleksandrov and A. Ya. Khinchin wrote
[1] about this paper: “In the entire probability theory of the twentieth century it is
difficult to find a study so essential to the development of science.”
The fundamental importance of this paper lay not only in that it laid the basis for the
theory of Markov random processes, but also in that it demonstrated the close relations
of this theory, and of probability theory as a whole, with calculus (in particular, with
the theory of ordinary and partial differential equations) as well as with classical
mechanics and physics.
In connection with the problem of justification of mathematical probability the-
ory, note that Kolmogorov’s paper “On Analytic Methods” provided in a sense a
physical motivation to the necessity of the logical construction of the fundamen-
tals of random processes, which, apart from axiomatics, was one of the aims of his
Foundations.
Kolmogorov’s axiomatization of probability theory is based on the concept of the
probability space
(Ω, F, P),
where (Ω, F) is an (abstract) measurable space (of “elementary events” or out-
comes) and P is a nonnegative countably additive set function on F normalized
by P(Ω) = 1 (to be a “probability”, see Sect. 1, Chap. 2, Vol. 1). The random vari-
ables are defined as F-measurable functions ξ = ξ(ω); the expectation of ξ is the
Lebesgue integral of ξ(ω) with respect to P.
A novel concept was that of the conditional expectation E(ξ | G) with respect to
a σ-algebra G ⊆ F (see in this connection Kolmogorov’s preface to the 2nd edition
[32] of Foundations).
There is a theorem (on the existence of a process with specified finite-
dimensional distributions) in Foundations that Kolmogorov called basic, thereby
emphasizing its particular importance. The matter is as follows.
In Analytic Methods, the Markov processes were designed to describe the evolu-
tion of “stochastically determined systems”, and this description was given in terms
of differential properties of functions P(s, x, t, A) satisfying the “Kolmogorov–
Chapman equation”. These functions were called “transition probabilities” due to
their interpretation as the probabilities that the system, being in the state x at time s,
will be found in the set A of states at time t.
In a similar way, in the papers [29, 30, 17, 18] of that time, which dealt with
“homogeneous stochastic processes with independent increments”, these processes
were treated in terms of functions Pt (x) satisfying the equation

Ps+t(x) = ∫ Ps(x − y) dPt(y),

which is naturally derived from the interpretation of Pt (x) as the probability that the
increment of the process for the time t is no greater than x.
However, from the formally-logical point of view the existence of an object,
which could be called a “process” with transition probabilities P(s, x, t, A) or with
increments distributed according to Pt (x), remained an open question.
This was the question solved by the basic theorem stating that for any system of
consistent finite-dimensional probability distributions

Ft1 ,t2 ,...,tn (x1 , x2 , . . . , xn ), 0 ≤ t1 < t 2 < · · · < tn , xi ∈ R,

one can construct a probability space (Ω, F, P) and a system of random variables
X = (Xt )t≥0 , Xt = Xt (ω), such that

P{Xt1 ≤ x1, Xt2 ≤ x2, . . . , Xtn ≤ xn} = Ft1,t2,...,tn (x1, x2, . . . , xn).

Here, Ω is taken to be the space R[0,∞) of real-valued functions ω = {ωt }t≥0 ,
F is the σ-algebra generated by cylinder sets, and the measure P is defined as the
extension of the measure from the algebra of cylinder sets (on which this measure is
naturally defined by the finite-dimensional distributions) to the smallest σ-algebra
generated by this algebra. The random variables Xt(ω) are defined coordinate-wise:
if ω = {ωt }t≥0 , then Xt (ω) = ωt . (This construction explains why the notion of a
“random process” is often identified with the corresponding measure on R[0,∞) ).
There is a short section in the Foundations dealing with the applicability of probability theory.
Describing the scheme of conditions according to which this theory is applied
to the “real world of experiment”, Kolmogorov largely follows Mises, demonstrat-
ing thereby that Mises’ frequentist approach to interpretation and applicability of
probability theory was not alien to him.
This scheme of conditions is essentially as follows.
It is assumed that there is a complex of conditions which presumes the possibility
to run infinitely many repeated experiments. Let (x1 , . . . , xn ) be the outcomes of n
experiments taking their values in a set X and let A be a subset of X in which we are
interested. If xi ∈ A, we say that the event A occurred in the ith experiment. (Note
that no assumptions of probabilistic nature are made a priori, e.g., that the experi-
ments are carried out randomly and independently or anything about the “chances”
of A to occur, etc.)
Further, it is assumed that to the event A a certain number (to be denoted by
P(A)) is assigned such that we may be practically certain that the relative frequency
νn (A) of occurrence of A in n trials will differ very slightly from P(A) for large n.
Moreover, if P(A) is very small, we may be practically certain that A will not occur
in a single experiment.
In his Foundations, Kolmogorov does not discuss in detail the conditions for
applicability of probability theory to the “real world”, saying that we “disregard the
deep philosophical dissertations on the concept of probability in the experimental
world”. However, he points out in the Introduction to Chap. 1 that there are domains
of applicability of probability theory “which have no relation to the concepts of
random event and of probability in the precise meaning of these words”.
Thirty years later, Kolmogorov turned again (see [35, 36, 38, 39, 40, 41]) to the
issue of applicability of probability theory and proposed two approaches to resolve
it, which are based on the concepts of “approximative randomness” and “algorith-
mic complexity”. In this regard, he emphasized [39] that in contrast to Mises and
Church who operated with infinite sequences x1 , x2 , . . . his approaches to defin-
ing randomness are of strictly finite nature, i.e. they are related to finite sequences
x1 , x2 , . . . , xN (called subsequently chains according to [42]), which are so in real
problems.
The concept of “approximative randomness” is introduced as follows. Let
x1 , x2 , . . . , xN be a binary (xi = 0, 1) sequence of length N and let n ≤ N.
This chain is said to be (n, ε)-random with respect to a finite collection Φ of ad-
missible algorithms [13] if there exists a number p (= P({1})) such that for any
chain x′ = (x′1, x′2, . . . , x′m) with n ≤ m ≤ N obtained from x1, x2, . . . , xN by means of an
algorithm A ∈ Φ, the relative frequency νm({1}; x′) differs from p by no more than
ε. (The algorithms in Φ producing chains of length m < n are neglected.)
Kolmogorov shows in [35] that if for a given n and 0 < ε < 1 the number of
admissible algorithms is no greater than
(1/2) exp{2nε^2 (1 − ε)},
then for any 0 < p < 1 and any N ≥ n there is a chain (x1 , x2 , . . . , xN ), which is
(n, ε)-random (having the property of “approximative randomness”).
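The definition of (n, ε)-randomness can be checked mechanically for a given chain and a given finite collection Φ. In the sketch below the three selection algorithms are illustrative stand-ins, not Kolmogorov’s:

```python
import random

def is_n_eps_random(x, n, eps, p, algorithms):
    """Check (n, eps)-randomness of the binary chain x with respect to a
    finite collection of selection algorithms: every produced chain of
    length m with n <= m <= len(x) must have a frequency of ones within
    eps of p; shorter outputs are neglected."""
    for A in algorithms:
        y = A(x)
        if len(y) < n:
            continue  # algorithms producing chains of length < n are neglected
        if abs(sum(y) / len(y) - p) > eps:
            return False
    return True

algorithms = [
    lambda x: x,                 # the chain itself
    lambda x: x[::2],            # every second element
    lambda x: x[: len(x) // 2],  # the first half
]

random.seed(1)
x = [random.randint(0, 1) for _ in range(10_000)]
print(is_n_eps_random(x, n=1000, eps=0.05, p=0.5, algorithms=algorithms))
```

A chain of all ones fails the check for p = 1/2, while a typical simulated chain passes; enlarging Φ shrinks the set of chains that pass, in line with the remark that the class of admissible algorithms cannot be too large.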
This approach to the identification of “random” chains involves (as does Mises’)
a certain arbitrariness connected with indeterminacy in the description and selec-
tion of admissible algorithms. Obviously, this class of algorithms cannot be too large
because otherwise the set of “approximatively random” chains would be empty. On
the other hand, it is desirable that the admissible algorithms be sufficiently
simple (e.g., presented in tabular form).
In probability theory, the idea that typical random realizations have a very com-
plicated, irregular form has been established on the basis of various probabilistic
statements.
Therefore, if we want the algorithmic definition of randomness of a chain to be
as close as possible to the probabilistic conception of the structure of a random real-
ization, the algorithms in Φ must reject atypical chains of simple structure, selecting
as random those sufficiently complicated.
This consideration led Kolmogorov to the “second” approach to the concept of
randomness. The emphasis in this approach is placed on the “complexity” of the
chains rather than on the “simplicity” of the related algorithms. Kolmogorov introduces
a certain numerical characteristic of complexity, which is designed to show the de-
gree of “irregularity” in formation of these chains.
This characteristic is known as “algorithmic” (or “Kolmogorov’s”) complexity
KA (x) of an individual chain x with respect to the algorithm A, which can be heuris-
tically described as the shortest length of a binary chain at the input of the algorithm
A from which the algorithm recovers the chain x at the output.
The formal definitions are as follows.
Let Σ be a collection of all finite binary chains x = (x1 , x2 , . . . , xn ), let |x| (=
n) denote the length of a chain, and let Φ be a certain class of algorithms. The
complexity of a chain x ∈ Σ with respect to an algorithm A ∈ Φ is the number

KA (x) = min{|p| : A(p) = x},

i.e., the minimal length |p| of a binary chain p at the input of the algorithm A from
which x can be recovered at the output of A (A(p) = x).
In [36], Kolmogorov establishes that (for some important classes of algorithms
Φ) the following statement holds: there exists a universal algorithm U ∈ Φ such
that for any A ∈ Φ there is a constant C(A) satisfying

KU (x) ≤ KA (x) + C(A)

for any chain x ∈ Σ, and for any two universal algorithms U′ and U′′

|KU′(x) − KU′′(x)| ≤ C, x ∈ Σ,

where C does not depend on x ∈ Σ. (Kolmogorov points out in [36] that a similar
result was simultaneously obtained by R. Solomonoff.)
Taking into account the fact that KU (x) grows to infinity with |x| for “typical”
chains x, this result justifies the following definition: the complexity of a chain x ∈ Σ
with respect to a class Φ of algorithms is K(x) = KU (x), where U is a universal
algorithm in Φ.
The quantity K(x) is customarily referred to as algorithmic or Kolmogorov’s
complexity of an “object” x. Kolmogorov regarded this quantity as measuring the
amount of algorithmic information contained in the “finite object” x. He believed
that this concept is even more fundamental than the probabilistic notion of infor-
mation, which requires knowledge of a probability distribution on objects x for its
definition.
The quantity K(x) may be considered also as a measure of compression of a
“text” x. If the class Φ includes algorithms like simple enumeration of elements,
then (up to a constant factor) the complexity K(x) is no greater than the length
|x|. On the other hand, it is easy to show that the number of (binary) chains x of
complexity less than K is no greater than 2^K − 1, which is the number of possible
binary chains of length less than K (1 + 2 + · · · + 2^{K−1} = 2^K − 1) at the input.
Further, it can be shown by simple arguments (see, e.g., [66]) that there exist
chains x whose complexity is equal (up to a constant factor) to the length |x| and
that there are not many chains that admit high compression (the fraction of chains
of complexity n − a does not exceed 2^{−a}). These arguments naturally lead to the
following definition: “algorithmically random chains” (with respect to a class of
algorithms Φ) are those chains x whose algorithmic complexity K(x) is close to |x|.
In other words, the algorithmic approach regards as “random” the chains x of
maximal complexity (K(x) ∼ |x|).
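The counting behind the algorithmic definition of randomness is elementary and can be verified directly. The sketch below checks the two counting facts just stated:

```python
# At most 2^K - 1 binary programs have length < K, so at most 2^K - 1
# chains can have complexity < K, whatever the decoding algorithm:
K = 10
n_programs = sum(2**l for l in range(K))  # 1 + 2 + ... + 2^(K-1)
print(n_programs, 2**K - 1)

# Hence among the 2^n chains of length n, the fraction compressible to
# fewer than n - a bits is less than 2^(-a):
n, a = 16, 4
fraction = (2**(n - a) - 1) / 2**n
print(fraction < 2**-a)
```

This is a pure pigeonhole argument: no properties of the algorithms themselves are used, only the count of short descriptions.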
Kolmogorov’s concepts of complexity and algorithmic randomness gave rise to
a new direction called “Kolmogorov’s complexity,” which is applicable in diverse
fields of mathematics and its applications (see, e.g. [42, 47, 48, 49, 50, 26] for de-
tails).
With regard to probability theory, these new concepts initiated a field of research
aiming to determine for what algorithmically random chains and sequences proba-
bilistic laws (such as the law of large numbers or the law of the iterated logarithm,
see, e.g., [67]) are valid. Results of this kind provide the opportunity to apply the
methods and results of probability theory in the areas which, as was pointed out with
reference to [31] (or [32, 33]), “have no direct relation to the concepts of random
event and probability in the precise meaning of these words”.

References

[1] P. S. Aleksandrov and A. Ya. Khinchin. Andrey Nikolaevich Kolmogorov (on
his fiftieth birthday) (in Russian). Uspekhi Matem. Nauk, 8, 3 (1953), 177–
200.
[2] L. Bachelier. Théorie de la spéculation. Annales de l’École Normale
Supérieure, 17 (1900), 21–86.
[3] Translations from James Bernoulli, transl. by Bing Sung, Dept. Statist.,
Harvard Univ., Preprint No. 2 (1966); Chs. 1–4 also available on:
http://cerebro.xu.edu/math/Sources/JakobBernoulli/ars sung.pdf.
[4] S. N. Bernstein. Axiomatic Justification of Probability Theory. [Opyt ax-
iomaticheskogo obosnovaniya teorii veroyatnostey]. (In Russian.) Soob-
shcheniya Khar’kovskogo Matematicheskogo Obshchestva, Ser. 2, 15 (1917),
209–274.
[5] G. Bohlmann. Lebensversicherungsmathematik. Encyklopaedie der mathema-
tischen Wissenschaften. Bd. 1, Heft 2. Artikel ID4b. Teubner, Leipzig, 1903.
[6] L. Boltzmann. Wissenschaftliche Abhandlungen. Bd. 1–3. Barth, Leipzig,
1909.
[7] L. Boltzmann, J. Nabl. Kinetische Theorie der Materie. Encyklopaedie der
mathematischen Wissenschaften. Bd. V, Heft 4. Teubner, Leipzig, 1907.
493–557.
[8] É. Borel. Leçons sur la théorie des fonctions. Gauthier-Villars, Paris, 1898;
Éd. 2. Gauthier-Villars, Paris, 1914.
[9] T. Brodén. Wahrscheinlichkeitsbestimmungen bei der gewöhnlichen Ketten-
bruchentwicklung reeller Zahlen. Akad. Förh. Stockholm 57 (1900), 239–266.
[10] U. Broggi. Die Axiome der Wahrscheinlichkeitsrechnung. Dissertation.
Göttingen, 1907.
[11] R. Brown. A brief account of microscopical observations made in the months
of June, July, and August, 1827, on the particles contained in the pollen of
plants; and on the general existence of active molecules in organic and inor-
ganic bodies. Philosophical Magazine N.S. 4 (1828), 161–173.
[12] A. Church. On the concept of a random sequence. Bull. Amer. Math. Soc. 46,
2 (1940), 130–135.
[13] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algo-
rithms. 3rd ed. MIT Press, 2009.
[14] F. N. David. Games, Gods and Gambling. The Origin and History of Probabil-
ity and Statistical Ideas from the Earliest Times to the Newtonian Era. Griffin,
London, 1962.
[15] A. Einstein. Über die von der molekularkinetischen Theorie der Wärme
geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen.
Annalen der Physik, 17 (1905), 549–560.
[16] B. de Finetti. Sulle probabilità numerabili e geometriche. Istituto Lombardo.
Accademia di Scienze e Lettere. Rendiconti (2), 61 (1928), 817–824.
[17] B. de Finetti. Sulle funzioni a incremento aleatorio. Accademia Nazionale dei
Lincei. Rendiconti (6), 10 (1929), 163–168.
[18] B. de Finetti. Integrazione delle funzioni a incremento aleatorio. Accademia
Nazionale dei Lincei. Rendiconti (6), 10 (1929), 548–553.
[19] B. de Finetti. Probabilismo: saggio critico sulla teoria delle probabilità e sul
valore della scienza. Perrella, Napoli, 1931; Logos. 14 (1931), 163–219. En-
glish transl.: Erkenntnis. The International Journal of Analytic Philosophy 31
(1989), 169–223.
[20] B. de Finetti. Probability, Induction and Statistics. The Art of Guessing. Wiley,
New York etc., 1972.
[21] B. de Finetti. Teoria delle probabilità: sintesi introduttiva con appendice crit-
ica. Vol. 1, 2. Einaudi, Torino, 1970. English transl.: Theory of Probability: A
Critical Introductory Treatment. Vol. 1, 2. Wiley, New York etc., 1974, 1975.
[22] M. Fréchet. Sur l’intégrale d’une fonctionnelle étendue à un ensemble abstrait.
Bulletin de la Société Mathématique de France. 43 (1915), 248–265.
[23] J. W. Gibbs. Elementary Principles in Statistical Mechanics. Developed with
especial reference to the rational foundation of thermodynamics. Yale Univ.
Press, New Haven, 1902; Dover, New York, 1960.
[24] H. Gyldén. Quelques remarques relativement à la représentation de nombres
irrationnels au moyen des fractions continues. Comptes Rendus, Paris, 107
(1888), 1584–1587.
[25] T. Hawkins. Lebesgue’s Theory of Integration. Its Origin and Development.
Univ. Wisconsin Press, Madison, Wis. – London, 1970.
[26] W. Kirchherr, M. Li, P. Vitányi. The Miraculous Universal Distribution. Math-
ematical Intelligencer, 19, 4 (1997), 7–15.
[27] A. N. Kolmogorov. General Measure Theory and Calculus of Probabilities
(in Russian), in: Communist Academy. Section of natural and exact sciences.
Mathematical papers. Moscow, 1929, Vol. 1, 8–21.
[28] A. Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeits-
rechnung. Mathematische Annalen, 104 (1931), 415–458.
[29] A. Kolmogoroff. Sulla forma generale di un processo stochastico omogeneo.
(Un problema di Bruno de Finetti.) Atti della Accademia Nazionale dei Lincei,
15 (1932), 805–808.
[30] A. Kolmogoroff. Ancora sulla forma generale di un processo omogeneo. Atti
della Accademia Nazionale dei Lincei, 15 (1932), 866–869.
[31] A. Kolmogoroff. Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer,
Berlin, 1933; Springer, Berlin–New York, 1973.
[32] A. N. Kolmogorov. Foundations of the Theory of Probability (in Russian).
Moscow–Leningrad, ONTI, 1936; Moscow, Nauka, 1974 (2nd ed.); Moscow,
PHASIS Publishing House, 1998 (3rd ed.).
[33] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New
York, 1950; 2nd ed. Chelsea, New York, 1956.
[34] A. N. Kolmogorov. The Contribution of Russian Science to the Development
of Probability Theory. Uchen. Zap. Moskov. Univ. 1947, no. 91, 56ff. (in Rus-
sian).
[35] A. N. Kolmogorov. On Tables of Random Numbers, Sankhyā A, 25, 4 (1963),
369–376.
[36] A. N. Kolmogorov. Three Approaches to the Definition of the Notion of
“Amount of Information” (in Russian). Problems of Information Transmission,
1, 1 (1965), 3–11.
[37] A. N. Kolmogorov. Probability Theory and Mathematical Statistics (in Rus-
sian). Nauka, Moscow, 1986.
[38] A. N. Kolmogorov. Logical Basis for Information Theory and Probability The-
ory. IEEE Transactions on Information Theory, 14, 5 (1968), 662–664.
[39] A. N. Kolmogorov. On Logical Foundations of Probability Theory. Probability
Theory and Mathematical Statistics (Tbilisi, 1982). Springer-Verlag, Berlin
etc., 1983, 1–5 (Lecture Notes in Mathematics, Vol. 1021).
[40] A. N. Kolmogorov. Combinatorial Foundations of Information Theory and the
Calculus of Probabilities (in Russian). Uspekhi Matem. Nauk (1983), 27–36;
Russian Math. Surveys, 38, 4 (1983), 29–40.
[41] A. N. Kolmogorov. Information Theory and Theory of Algorithms (in Russian).
Nauka, Moscow, 1987.
[42] A. N. Kolmogorov and V. A. Uspensky. Algorithms and Randomness. Theory
Probab. Appl., 32, 3 (1988), 389–412.
[43] R. Lämmel. Untersuchungen über die Ermittlung der Wahrscheinlichkeiten.
Dissertation. Zürich, 1904. (See also [62].)
[44] P. S. Laplace, de. Essai philosophique sur les probabilités. Paris, 1814. English
transl.: A Philosophical Essay on Probabilities. Dover, New York, 1951.
[45] H. Lebesgue. Leçons sur l’intégration et la recherche des fonctions primitives.
Gauthier-Villars, Paris, 1904.
[46] D. E. Maistrov. Probability Theory: A Historical Sketch. Academic Press, New
York, 1974.
[47] P. Martin-Löf, On the concept of a random sequence. Theory Probab. Appl.,
11, 1 (1966), 413–425.
[48] P. Martin-Löf. The Definition of Random Sequences. Information and Control,
9, 6 (1966), 602–619.
[49] P. Martin-Löf. On the Notion of Randomness. Intuitionism and Proof Theory,
Proc. Conf. at Buffalo, NY, 1968, Ed. A. Kino et al.; North-Holland, Amster-
dam, 1970, 73–78.
[50] P. Martin-Löf. Complexity Oscillations in Infinite Binary Sequences.
Z. Wahrsch. verw. Gebiete, 19 (1971), 225–230.
[51] J. C. Maxwell. The Scientific Letters and Papers of James Clerk Maxwell.
Vol. 1: 1846–1862, Vol. 2: 1862–1873, Vol. 3: 1874–1879. Ed. P. M. Harman.
Cambridge, Cambridge Univ. Press, 1990, 1995, 2002.
[52] R. von Mises. Fundamentalsätze der Wahrscheinlichkeitsrechnung. Mathema-
tische Zeitschrift, 4 (1919), 1–97.
[53] R. von Mises. Grundlagen der Wahrscheinlichkeitsrechnung, Mathematische
Zeitschrift. 5 (1919), 52–99; 7 (1920), 323.
[54] R. von Mises. Mathematical Theory of Probability and Statistics. Academic
Press, New York–London, 1964.
[55] I. Newton. The Mathematical Works of Isaac Newton. D. T. Whiteside ed.,
vol. 1, Johnson, New York, 1967.
[56] On Probability Theory and Mathematical Statistics (correspondence between
A. A. Markov and A. A. Chuprov). [O teorii veroyatnostey i matematicheskoy
statistike (perepiska A. A. Markova i A. A. Chuprova)] (in Russian) Moscow,
Nauka, 1977.
[57] J. von Plato. Creating Modern Probability. Its Mathematics, Physics and Phi-
losophy in Historical Perspective. Cambridge Univ. Press, Cambridge, 1994.
[58] H. Poincaré. Sur le problème des trois corps et les équations de la dy-
namique. I, II. Acta Mathematica, 13 (1890), 1–270.
[59] H. Poincaré. Calcul des probabilités. G. Carré, Paris, 1896.
[60] G. Pólya. Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung
und das Momentenproblem. Mathematische Zeitschrift, (1920), 171–181.
[61] L. J. Savage. The Foundations of Statistics. Wiley, New York; Chapman–Hall,
London, 1954.
[62] I. Schneider, Ed. Die Entwicklung der Wahrscheinlichkeitstheorie von den
Anfängen bis 1933. Akademie-Verlag, Berlin, 1989.
[63] M. von Smoluchowski. Zur kinetischen Theorie der Brownschen Moleku-
larbewegung und der Suspensionen. Annalen der Physik, 21 (1906), 756–780.
[64] I. Todhunter. A History of the Mathematical Theory of Probability from the
Time of Pascal to that of Laplace. Chelsea, New York, 1949. 1st Edition:
Macmillan, Cambridge, 1865.
[65] J. A. Ville. Étude critique de la notion de collectif. Gauthier-Villars, Paris,
1939.
[66] P. Vitányi and M. Li. Two Decades of Applied Kolmogorov Complexity. Us-
pekhi Matem. Nauk, 43, 6 (1988), 129–166.
[67] V. G. Vovk. The Law of the Iterated Logarithm for Random Kolmogorov, or
Chaotic, Sequences. Theory Probab. Appl., 32, 3 (1988), 413–425.
[68] A. Wald. Die Widerspruchsfreiheit des Kollektivbegriffes der Wahrschein-
lichkeitsrechnung. Ergebnisse eines mathematischen Kolloquiums, 8 (1937),
38–72.
[69] A. Wiman. Über eine Wahrscheinlichkeitsaufgabe bei Kettenbruchent-
wicklungen. Akad. Förh. Stockholm, 57 (1900), 829–841.
Historical and Bibliographical Notes
(Chaps. 4–8)

Chapter 4

Section 1. Kolmogorov’s zero–one law appears in his book [50]. For the Hewitt–
Savage zero–one law see also Borovkov [10], Breiman [11], and Ash [2].
Sections 2–4. Here the fundamental results were obtained by Kolmogorov and
Khinchin (see [50] and references therein). See also Petrov [62], Stout [74], and
Durrett [20]. For probabilistic methods in number theory see Kubilius [51].
It is appropriate to recall here the historical background of the strong law of large
numbers and the law of the iterated logarithm for the Bernoulli scheme.
The first paper where the strong law of large numbers appeared was Borel’s paper
[7] on normal numbers in [0, 1). Using the notation of Example 3 in Sect. 3, let
Sn = ∑_{k=1}^{n} (I(ξk = 1) − 1/2).

Then Borel’s result stated that for almost all (Lebesgue) ω ∈ [0, 1) there exists
N = N(ω) such that
|Sn(ω)/n| ≤ log(n/2)/√(2n)
for all n ≥ N(ω). This implies, in particular, that Sn = o(n) almost surely.
The next step was done by Hausdorff [41], who established that Sn = o(n1/2+ε )
almost surely for any ε > 0. In 1914 Hardy and Littlewood [39] showed that
Sn = O((n log n)^{1/2}) almost surely. In 1922 Steinhaus [73] improved their result
by showing that
lim sup_n Sn/√(2n log n) ≤ 1
almost surely.

© Springer Science+Business Media, LLC, part of Springer Nature 2019


A. N. Shiryaev, Probability-2, Graduate Texts in Mathematics 95,
https://doi.org/10.1007/978-0-387-72208-5
In 1923 Khinchin [45] showed that Sn = O(√(n log log n)) almost surely. Finally,
in a year Khinchin obtained [46] the final result (the “law of the iterated logarithm”):
lim sup_n Sn/√((n/2) log log n) = 1

almost surely. (Note that in this case σ 2 = E[I(ξk = 1) − 1/2]2 = 1/4, which
explains the appearance of the factor n/2 rather than the usual 2n; cf. Theorem 4 in
Sect. 4.)
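Khinchin’s normalization can be illustrated by simulation (a Monte Carlo sketch, not a proof; the path length and the starting index n ≥ 10 are arbitrary choices):

```python
import math
import random

random.seed(2)

def running_max_ratio(N):
    """Running maximum of S_n / sqrt((n/2) log log n) along one simulated
    Bernoulli path, with S_n = sum_k (I(xi_k = 1) - 1/2)."""
    S, best = 0.0, 0.0
    for n in range(1, N + 1):
        S += random.randint(0, 1) - 0.5  # I(xi_k = 1) - 1/2
        if n >= 10:  # log log n needs n > e
            best = max(best, S / math.sqrt((n / 2) * math.log(math.log(n))))
    return best

# By Khinchin's law of the iterated logarithm the lim sup equals 1 a.s.;
# along a single long path the running maximum of the ratio is of order 1
# (convergence is very slow, so one should not expect a value near 1 exactly).
print(running_max_ratio(200_000))
```

Replacing the normalization √((n/2) log log n) by the classical √(2n log log n) (appropriate for σ² = 1) would shrink the ratio by a factor of 2, which is precisely the role of σ² = 1/4 noted above.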
As was mentioned in Sect. 4, the next step in establishing the law of the iterated
logarithm for a broad class of independent random variables was taken in 1929 by
Kolmogorov [48].
Section 5. Regarding these questions, see Petrov [62], Borovkov [8–10], and
Dacunha-Castelle and Duflo [16].

Chapter 5

Sections 1–3. Our exposition of the theory of (strict sense) stationary random pro-
cesses is based on Breiman [11], Sinai [72], and Lamperti [52]. The simple proof of
the maximal ergodic theorem was given by Garsia [28].

Chapter 6

Section 1. The books by Rozanov [67] and Gihman and Skorohod [30, 31] are de-
voted to the theory of (wide sense) stationary random processes. Example 6 was
frequently presented in Kolmogorov’s lectures.
Section 2. For orthogonal stochastic measures and stochastic integrals see also Doob
[18], Gihman and Skorohod [31], Rozanov [67], and Ash and Gardner [3].
Section 3. The spectral representation (2) was obtained by Cramér and Loève (e.g.,
[56]). Also see Doob [18], Rozanov [67], and Ash and Gardner [3].
Section 4. There is a detailed exposition of problems of statistical estimation of the
covariance function and spectral density in Hannan [37, 38].
Sections 5–6. See also Rozanov [67], Lamperti [52], and Gihman and Skorohod
[30, 31].
Section 7. The presentation follows Liptser and Shiryaev [54].

Chapter 7

Section 1. Most of the fundamental results of the theory of martingales were ob-
tained by Doob [18]. Theorem 1 is taken from Meyer [57]. Also see Meyer [58],
Liptser and Shiryaev [54], Gihman and Skorohod [31], and Jacod and Shiryaev [43].
Section 2. Theorem 1 is often called the theorem “on transformation under a system of
optional stopping” (Doob [18]). For identities (13) and (14) and Wald’s fundamental
identity see Wald [76].
Section 3. The right inequality in (25) was established by Khinchin [45] (1923) in
the course of proving the law of the iterated logarithm. To explain what led Khinchin
to obtain this inequality, let us recall the line of the proof of the strong law of
large numbers by Borel and Hausdorff (see also the earlier comment to Sects. 2–
4, Chap. 4).
Let ξ1 , ξ2 , . . . be a sequence of independent identically distributed random vari-
ables with P{ξ1 = 1} = P{ξ1 = −1} = 1/2 (Bernoulli scheme), and let
Sn = ξ1 + · · · + ξn .
Borel’s proof that Sn = o(n) almost surely was essentially as follows. Since
$$ \mathsf{P}\Bigl\{\Bigl|\frac{S_n}{n}\Bigr| \ge \delta\Bigr\} \le \frac{\mathsf{E}\,S_n^4}{n^4\delta^4} \le \frac{3n^2}{n^4\delta^4} = \frac{3}{n^2\delta^4} \quad \text{for any } \delta > 0, $$
we have
$$ \mathsf{P}\Bigl\{\sup_{k\ge n}\Bigl|\frac{S_k}{k}\Bigr| \ge \delta\Bigr\} \le \sum_{k\ge n}\mathsf{P}\Bigl\{\Bigl|\frac{S_k}{k}\Bigr| \ge \delta\Bigr\} \le \frac{3}{\delta^4}\sum_{k\ge n}\frac{1}{k^2} \to 0 $$
as n → ∞; therefore Sn/n → 0 almost surely by the Borel–Cantelli lemma (Chap. 2,
Sect. 10).
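The fourth-moment bound used in Borel's argument can be checked exactly against the binomial distribution (a minimal numerical sketch, not part of the book; the function name is ours):

```python
from math import comb

def fourth_moment(n):
    """Exact E[S_n^4] for S_n = xi_1 + ... + xi_n, xi_k = +/-1 equiprobable.

    With B ~ Binomial(n, 1/2) counting the +1 steps, S_n = 2B - n.
    """
    return sum(comb(n, b) * (2 * b - n) ** 4 for b in range(n + 1)) / 2 ** n

for n in range(1, 60):
    m4 = fourth_moment(n)
    # Expanding E S_n^4 termwise gives n*E[xi^4] + 3n(n-1)*(E[xi^2])^2 = 3n^2 - 2n.
    assert abs(m4 - (3 * n ** 2 - 2 * n)) < 1e-9
    assert m4 <= 3 * n ** 2  # the bound E S_n^4 <= 3n^2 used in the proof
print("E S_n^4 = 3n^2 - 2n <= 3n^2 verified for n = 1..59")
```

The exact value 3n² − 2n shows the constant 3 in the displayed estimate cannot be improved.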
Hausdorff’s proof that Sn = o(n^{1/2+ε}) almost surely for any ε > 0 proceeded in
a similar way: since E Sn^{2r} = O(n^r) for any integer r > 1/(2ε), we have
$$ \mathsf{P}\Bigl\{\sup_{k\ge n}\Bigl|\frac{S_k}{k^{1/2+\varepsilon}}\Bigr| \ge \delta\Bigr\} \le \sum_{k\ge n}\mathsf{P}\Bigl\{\Bigl|\frac{S_k}{k^{1/2+\varepsilon}}\Bigr| \ge \delta\Bigr\} \le \frac{1}{\delta^{2r}}\sum_{k\ge n}\mathsf{E}\Bigl|\frac{S_k}{k^{1/2+\varepsilon}}\Bigr|^{2r} \le \frac{c}{\delta^{2r}}\sum_{k\ge n}\frac{k^r}{k^{r+2\varepsilon r}} \to 0 $$
as n → ∞, where c is a positive constant. This implies (again by the Borel–Cantelli
lemma) that
$$ \frac{S_n}{n^{1/2+\varepsilon}} \to 0 $$
almost surely.
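Hausdorff's moment bound E Sn^{2r} = O(n^r) can likewise be checked exactly for small r; the constant (2r − 1)!! used below is the Gaussian-comparison constant that also appears in Khinchin's inequality (an illustrative sketch, not from the original text):

```python
from math import comb

def even_moment(n, r):
    """Exact E[S_n^(2r)] for the symmetric +/-1 walk, via S_n = 2B - n, B ~ Bin(n, 1/2)."""
    return sum(comb(n, b) * (2 * b - n) ** (2 * r) for b in range(n + 1)) / 2 ** n

def double_factorial(k):
    out = 1
    while k > 1:
        out *= k
        k -= 2
    return out

# E S_n^(2r) <= (2r - 1)!! * n^r, i.e. E S_n^(2r) = O(n^r) for each fixed r,
# which is all that Hausdorff's argument needs.
for r in (1, 2, 3, 4):
    for n in range(1, 40):
        assert even_moment(n, r) <= double_factorial(2 * r - 1) * n ** r
print("E S_n^(2r) <= (2r-1)!! n^r verified for r = 1..4, n = 1..39")
```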
The foregoing considerations show that the key element of the proofs was obtaining
a “good” bound for the probabilities P{|Sn| ≥ t(n)}, where t(n) = n in
Borel’s proof and t(n) = n^{1/2+ε} in Hausdorff’s (while Hardy and Littlewood needed
t(n) = (n log n)^{1/2}).
Analogously, Khinchin needed inequalities (25) (in fact, the right one) to obtain
a “good” bound for the probabilities P{|Sn | ≥ t(n)}.
Regarding the derivation of Khinchin’s inequalities (both right and left) for any
p > 0 and optimality of the constants Ap and Bp in (25) see the survey paper by
Peškir and Shiryaev [61].
Khinchin derives from the right inequality in (25) for p = 2m that for any t > 0
$$ \mathsf{P}\{|X_n| > t\} \le t^{-2m}\,\mathsf{E}|X_n|^{2m} \le \frac{(2m)!}{2^m m!}\, t^{-2m} [X]_n^{2m}. $$
By Stirling’s formula
$$ \frac{(2m)!}{2^m m!} \le D\Bigl(\frac{2}{e}\Bigr)^m m^m, $$
where D = √2. Therefore, setting m = [t²/(2[X]_n²)] (the integer part), we obtain
$$ \mathsf{P}\{|X_n| > t\} \le D\Bigl(\frac{2m[X]_n^2}{e t^2}\Bigr)^m \le D\,e^{-m} \le D\exp\Bigl(1 - \frac{t^2}{2[X]_n^2}\Bigr) = D\,e\exp\Bigl(-\frac{t^2}{2[X]_n^2}\Bigr) = c\exp\Bigl(-\frac{t^2}{2[X]_n^2}\Bigr) $$
with c = De = √2·e.
Applied to the Bernoulli case, where [X]_n² = n, this inequality implies the bound
$$ \mathsf{P}\{|S_n| > t\} \le c\,e^{-t^2/(2n)}, $$
which was used by Khinchin for the proof that Sn = O(√(n log log n)) almost surely.
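The resulting exponential tail bound (assuming the constant c = √2·e read off from the derivation, with [X]_n² = n in the Bernoulli case) can be compared with exact binomial tail probabilities (a quick numerical sketch, not from the book):

```python
from math import comb, exp, sqrt

def tail(n, t):
    """Exact P{|S_n| > t} for the symmetric +/-1 walk (S_n = 2B - n, B ~ Bin(n, 1/2))."""
    return sum(comb(n, b) for b in range(n + 1) if abs(2 * b - n) > t) / 2 ** n

c = sqrt(2) * exp(1)  # the constant c = D e with D = sqrt(2)
for n in (5, 10, 25, 50):
    for t in range(0, n + 1):
        assert tail(n, t) <= c * exp(-t * t / (2 * n))
print("P{|S_n| > t} <= sqrt(2) e exp(-t^2/(2n)) verified")
```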
Chow and Teicher [13] contains an illuminating study of the inequalities pre-
sented here. Theorem 2 is due to Lenglart [53].
Section 4. See Doob [18].
Section 5. Here we follow Kabanov, Liptser, and Shiryaev [44], Engelbert and
Shiryaev [24], and Neveu [59]. Theorem 4 and the examples were given by Liptser.
Section 6. This approach to problems of absolute continuity and singularity, and the
results given here, can be found in Kabanov, Liptser, and Shiryaev [44].
Section 7. Theorems 1 and 2 were given by Novikov [60]. Lemma 1 is a discrete
analog of Girsanov’s lemma (see [54]).
Section 8. See also Liptser and Shiryaev [55] and Jacod and Shiryaev [43], which
discuss limit theorems for random processes of a rather general nature (e.g., martin-
gales, semimartingales).
Section 9. The presentation follows Shiryaev [70, 71]. For the development of the
approach given here to the generalization of Ito’s formula see [27].

Section 10. Martingale methods in insurance are treated in [29]. The proofs pre-
sented here are close to those in [70].
Sections 11–12. For a more detailed exposition of the topics related to applications of
martingale methods in financial mathematics and engineering see [71].
Section 13. The basic monographs on the theory and problems of optimal stopping
rules are Dynkin and Yushkevich [22], Robbins, Chow, and Siegmund [66], and
Shiryaev [69].

Chapter 8

Sections 1–2. For the definitions and basic properties of Markov chains see also
Dynkin and Yushkevich [22], Dynkin [21], Ventzel [75], Doob [18], Gihman and
Skorohod [31], Breiman [11], Chung [14, 15], and Revuz [65].
Sections 3–7. For problems related to limiting, ergodic, and stationary distributions
for Markov chains see Kolmogorov’s paper [49] and the books by Feller [25, 26],
Borovkov [10, 9], Ash [1], Chung [15], Revuz [65], and Dynkin and Yushkevich
[22].
Section 8. The simple random walk is a textbook example of the simplest Markov
chain for which many regularities were discovered (e.g., the properties of recur-
rence, transience, and ergodicity). These issues are treated in many of the books
cited earlier; see, for example, [1, 10, 15, 65].
Section 9. The interest in optimal stopping was due to the development of statisti-
cal sequential analysis (Wald [76], De Groot [17], Zacks [77], Shiryaev [69]). The
theory of optimal stopping rules is treated in Dynkin and Yushkevich [22], Shiryaev
[69], and Billingsley [5]. The martingale approach to the optimal stopping problems
is presented in Robbins, Chow, and Siegmund [66].
DEVELOPMENT OF MATHEMATICAL THEORY OF PROBABILITY: HISTORICAL
REVIEW. This historical review was written by the author as a supplement to the
third edition of Kolmogorov’s Foundations of the Theory of Probability [50].
References

[1] R. B. Ash. Basic Probability Theory. Wiley, New York, 1970.


[2] R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972.
[3] R. B. Ash and M. F. Gardner. Topics in Stochastic Processes. Academic Press,
New York, 1975.
[4] P. Billingsley. Convergence of Probability Measures. Wiley, New York, 1968.
[5] P. Billingsley. Probability and Measure. 3rd ed. New York, Wiley, 1995.
[6] G. D. Birkhoff. Proof of the ergodic theorem. Proc. Nat. Acad. Sci. USA, 17
(1931), 650–660.
[7] É. Borel. Les probabilités dénombrables et leurs applications arithmétiques.
Rendiconti del Circolo Matematico di Palermo. 27, (1909), 247–271.
[8] A. A. Borovkov. Mathematical Statistics [Matematicheskaya Statistika]
(in Russian). Nauka, Moscow, 1984.
[9] A. A. Borovkov. Ergodicity and Stability of Random Processes [Ergodichnost’
i ustoichivost’ Sluchaı̆nykh Processov] (in Russian). Moscow, URSS, 1999.
[10] A. A. Borovkov. Wahrscheinlichkeitstheorie: eine Einführung, 1st edition
Birkhäuser, Basel–Stuttgart, 1976; Theory of Probability (in Russian), 3rd edi-
tion [Teoriya veroyatnosteı̆]. Moscow, URSS, 1999.
[11] L. Breiman. Probability. Addison-Wesley, Reading, MA, 1968.
[12] A. V. Bulinsky and A. N. Shiryayev. Theory of Random Processes [Teoriya
Sluchaı̆nykh Processov] (in Russian). Fizmatlit, Moscow, 2005.
[13] Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchange-
ability, Martingales. Springer-Verlag, New York, 1978.
[14] K. L. Chung. Markov Chains with Stationary Transition Probabilities.
Springer-Verlag, New York, 1967.
[15] K. L. Chung. Elementary Probability Theory with Stochastic Processes. 3rd
ed. Springer-Verlag, Berlin, 1979.
[16] D. Dacunha-Castelle and M. Duflo. Probabilités et statistiques. 1. Problèmes
à temps fixe. 2. Problèmes à temps mobile. Masson, Paris, 1982; Probability
and Statistics. Springer-Verlag, New York, 1986 (English translation).
[17] M. H. De Groot. Optimal Statistical Decisions. McGraw-Hill, New York,
1970.

[18] J. L. Doob. Stochastic Processes. Wiley, New York, 1953.


[19] J. L. Doob. What is a martingale? American Mathematical Monthly 78 (1971),
451–463.
[20] R. Durrett. Probability: Theory and Examples. Wadsworth & Brooks/Cole,
Pacific Grove, CA, 1991.
[21] E. B. Dynkin. Markov Processes, Vol. 1, 2, Academic Press, New York;
Springer-Verlag, Berlin, 1965.
[22] E. B. Dynkin and A. A. Yushkevich. Markov Processes: Theorems and Prob-
lems, Plenum, New York, 1969.
[23] P. Ehrenfest and T. Ehrenfest. Über zwei bekannte Einwände gegen das Boltz-
mannsche H-Theorem. Physikalische Zeitschrift, 8 (1907), 311–314.
[24] H. J. Engelbert and A. N. Shiryaev. On the sets of convergence of generalized
submartingales. Stochastics 2 (1979), 155–166.
[25] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 1,
3rd ed. Wiley, New York, 1968.
[26] W. Feller. An Introduction to Probability Theory and Its Applications, vol. 2,
3rd ed. Wiley, New York, 1971.
[27] H. Föllmer, Ph. Protter, and A. N. Shiryaev. Quadratic covariation and an ex-
tension of Itô’s formula. Bernoulli, 1, 1/2 (1995), 149–170.
[28] A. Garsia. A simple proof of E. Hopf’s maximal ergodic theorem. J. Math.
Mech. 14 (1965), 381–382.
[29] H. U. Gerber, Life Insurance Mathematics, Springer, Zürich, 1997.
[30] I. I. Gihman [Gikhman] and A. V. Skorohod [Skorokhod]. Introduction to the
Theory of Random Processes, 1st ed. Saunders, Philadelphia, 1969; 2nd ed.
[Vvedenie v teoriyu sluchaĭnyh protsessov]. Nauka, Moscow, 1977.
[31] I. I. Gihman and A. V. Skorohod. Theory of Stochastic Processes, 3 vols.
Springer-Verlag, New York–Berlin, 1974–1979.
[32] B. V. Gnedenko and A. Ya. Khinchin. An Elementary Introduction to the The-
ory of Probability. Freeman, San Francisco, 1961; 9th ed. [Elementarnoe vve-
denie v teoriyu veroyatnosteǐ]. Nauka, Moscow, 1982.
[33] B. V. Gnedenko and A. N. Kolmogorov. Limit Distributions for Sums of Inde-
pendent Random Variables, revised edition. Addison-Wesley, Reading, MA,
1968.
[34] P. E. Greenwood and A. N. Shiryaev. Contiguity and the Statistical Invariance
Principle. Gordon and Breach, New York, 1985.
[35] G. R. Grimmett and D. R. Stirzaker. Probability and Random Processes. 3rd
ed. Oxford University Press, Oxford, 2001.
[36] J. D. Hamilton. Time Series Analysis. Princeton University Press, Princeton,
NJ, 1994.
[37] E. J. Hannan. Time Series Analysis. Methuen, London, 1960.
[38] E. J. Hannan. Multiple Time Series. Wiley, New York, 1970.
[39] G. H. Hardy and J. E. Littlewood. Some problems of Diophantine approxima-
tion. Acta Mathematica. 37 (1914), 155–239.
[40] P. Hartman and A. Wintner. On the law of iterated logarithm. Amer. J. Math.
63, 1 (1941), 169–176.

[41] F. Hausdorff. Grundzüge der Mengenlehre. Veit, Leipzig, 1914.


[42] M. Hazewinkel, editor. Encyclopaedia of Mathematics, Vols. 1–10 + Supple-
ment I–III. Kluwer, 1987–2002. [Engl. transl. (extended) of: I. M. Vinogradov,
editor. Matematicheskaya Entsiklopediya, in 5 Vols.], Moscow, Soviet Entsik-
lopediya, 1977–1985.
[43] J. Jacod and A. N. Shiryaev, Limit Theorems for Stochastic Processes, 2nd ed.
Springer-Verlag, Berlin, 2003.
[44] Yu. M. Kabanov, R. Sh. Liptser, and A. N. Shiryaev. On the question of the
absolute continuity and singularity of probability measures. Math. USSR-Sb.
33 (1977), 203–221.
[45] A. Khintchine. Über dyadische Brüche. Mathematische Zeitschrift. 18 (1923),
109–116.
[46] A. Khintchine. Über einen Satz der Wahrscheinlichkeitsrechnung. Funda-
menta Mathematicae. 6 (1924), 9–20.
[47] A. Khintchine. Zu Birkhoffs Lösung des Ergodenproblems. Mathematische
Annalen, 107 (1932), 485–488.
[48] A. Kolmogoroff. Über das Gesetz des iterierten Logarithmus. Mathematische
Annalen. 101 (1929), 126–135.
[49] A. N. Kolmogorov. Markov Chains with finitely many states (in Russian). Bull.
Moscow State Univ. [Bulleten’ MGU], 1, 3 (1937), 1–16.
[50] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea,
New York, 1956; 2nd ed. [Osnovnye poniatiya Teorii Veroyatnosteı̆]. Nauka,
Moscow, 1974.
[51] J. Kubilius. Probabilistic Methods in the Theory of Numbers. American Math-
ematical Society, Providence, RI, 1964.
[52] J. Lamperti. Stochastic Processes. Springer-Verlag, New York, 1977.
[53] E. Lenglart. Relation de domination entre deux processus. Ann. Inst. H.
Poincaré. Sect. B (N.S.), 13 (1977), 171–179.
[54] R. S. Liptser and A. N. Shiryaev. Statistics of Random Processes. Springer-
Verlag, New York, 1977.
[55] R. Sh. Liptser and A. N. Shiryaev. Theory of Martingales. Kluwer, Dordrecht,
Boston, 1989.
[56] M. Loève. Probability Theory. Springer-Verlag, New York, 1977–78.
[57] P.-A. Meyer, Martingales and stochastic integrals. I. Lecture Notes in Mathe-
matics, no. 284. Springer-Verlag, Berlin, 1972.
[58] P.-A. Meyer. Probability and Potentials. Blaisdell, Waltham, MA, 1966.
[59] J. Neveu. Discrete Parameter Martingales. North-Holland, Amsterdam, 1975.
[60] A. A. Novikov. On estimates and the asymptotic behavior of the probability of
nonintersection of moving boundaries by sums of independent random vari-
ables. Math. USSR-Izv. 17 (1980), 129–145.
[61] G. Peškir and A. N. Shiryaev, The Khintchine inequalities and martingale ex-
panding sphere of their action, Russian Math. Surveys, 50, 5 (1995), 849–904.
[62] V. V. Petrov. Sums of Independent Random Variables. Springer-Verlag, Berlin,
1975.
[63] H. Poincaré. Calcul des probabilitiés, 2nd ed. Gauthier Villars, Paris, 1912.

[64] I. I. Privalov. Randeigenschaften analytischer Funktionen. Deutscher Verlag
der Wissenschaften, 1956.
[65] D. Revuz. Markov Chains, 2nd ed. North-Holland, Amsterdam, 1984.
[66] H. Robbins, Y. S. Chow, and D. Siegmund. Great Expectations: The Theory of
Optimal Stopping. Houghton Mifflin, Boston, 1971.
[67] Yu. A. Rozanov. Stationary Random Processes. Holden-Day, San Francisco,
1967.
[68] A. N. Shiryaev. Random Processes [Sluchainye processy] (in Russian).
Moscow State University Press, 1972.
[69] A. N. Shiryayev. Optimal Stopping Rules. Applications of Mathematics, Vol. 8.
Springer-Verlag, New York-Heidelberg, 1978.
[70] A. N. Shiryayev. Probability, 2nd ed. Springer-Verlag, Berlin, 1995.
[71] A. N. Shiryaev. Essentials of Stochastic Finance: Facts, Models, Theory.
World Scientific, Singapore, 1999.
[72] Ya. G. Sinai. Introduction to Ergodic Theory. Princeton University Press,
Princeton, NJ, 1976.
[73] H. Steinhaus. Les probabilités dénombrables et leur rapport à la théorie de la
mesure. Fundamenta Mathematicae. 4 (1923), 286–310.
[74] W. F. Stout. Almost Sure Convergence. Academic Press, New York, 1974.
[75] A. D. Ventsel. A Course in the Theory of Stochastic Processes. McGraw-Hill,
New York, 1981.
[76] A. Wald. Sequential Analysis. Wiley, New York, 1947.
[77] S. Zacks. The Theory of Statistical Inference. Wiley, New York, 1971.
Index

Symbols
B(K0, N; p), 227
C+, 156
C(fN; P), 222
CN, 227
R(n), 48
Xnπ, 208
Z(λ), 56
Z(Δ), 56
[X, Y]n, 117
[X]n, 117
#, 200
M(P), 210
NA, 210
Z, 47
Cov(ξ, η), 48
⟨M⟩, 116
⟨X, Y⟩, 116
(f, g), 58
PN, 227
MNn, 229
En(λ), 144
C(f; P), 224
ρ(n), 48
θk ξ, 33
{Xn →}, 156

A
Absolutely continuous
  change of measure, 178, 182
  probability measures, 165
  spectral density, 51, 55
Absolute moment, 15
Absorbing
  barrier, 290, 305, 307
  state, 167, 289, 306
Almost
  invariant, 37
  periodic, 49
Amplitude, 50
Arbitrage
  opportunity, 209
  absence of, 207
Arithmetic properties, 265
Asymptotically uniformly infinitesimal, 196
Asymptotic negligibility condition, 185
Asymptotic properties, 265
Average time of return, 270

B
Balance equation, 53
Bank account, 207
Bartlett’s estimator, 76
Bernoullian shifts, 44
Binary expansion, 18
Birkhoff G. D., 34
Borel, É.
  normal numbers, 19, 45
  zero–one law, 3
Brownian motion, 184
(B, S)-market
  complete, 214
  CRR (Cox, Ross, Rubinstein) model, 213, 225

C
Cesàro summation, 277
Compensator, 115
Contingent claim, 214
  replicable, 214
Convergence of
  submartingales and martingales
    general theorems, 148
Convex hull, 306
Covariance/correlation function, 48
Covariance function
  estimation, 71
  spectral representation, 54
Covariation
  quadratic, 117, 197
Cramér–Lundberg model, 203
Cramér
  condition, 28
  transform, 28
Cramér–Wold method, 191
Curvilinear boundary, 178

D
Decomposition
  canonical, 186
  Doob, 115, 135, 158
  Krickeberg, 146
  Lebesgue, 166
  of martingale, 146
  of random sequence, 79
Degenerate
  random variable, 3, 6
Deterministic, 315
Dichotomy
  Hájek–Feldman, 173
  Kakutani, 168
Diffusion model, discrete
  Bernoulli–Laplace, 294
  Ehrenfest, 293
Distribution
  stationary (invariant), 258
Donsker–Prohorov invariance principle, 185
Doob, J. L.
  maximal inequalities, 132
  theorem
    on convergence of (sub)martingales, 148
    on decomposition, 115
    on the number of intersections, 142
    on random time change, 119
Dynamic programming, 300

E
Equation
  balance, 53
  dynamic programming, 299
  Wald–Bellman, 234, 300
Ergodic
  distribution, 277, 283
  sequence of r.v.’s, 43
  theorem, 37, 39, 44
    maximal, 40
    mean-square, 69
  theory, 33
  transformation, 37
Ergodicity, 37
Essential supremum, 229
Estimation, 71, 203
Estimator
  of covariance function
    unbiased, consistent, 73
  least-squares, 161
  of mean value
    unbiased, consistent, 72
  nonlinear, 84
  optimal linear, 53, 81, 85, 90, 94
  of spectral density
    asymptotically unbiased, 75
    Bartlett’s, 76
    Parzen’s, 76
    Zhurbenko’s, 77
  of spectral function, 74, 77
  strongly consistent, 162
Event
  symmetric, 4
  tail, 2, 152, 168
Excessive
  function, 300
  majorant, 300
    least, 300
Extension of a measure, 57, 151
Extrapolation, 81, 85, 105

F
Fair game, 114
Fejér kernel, 75
Fiancée (secretary) problem, 306
Filter
  frequency characteristic, 66
  impulse response, 66
  Kalman–Bucy, 95, 99
  linear, 66
  physically realizable, 66, 83
  spectral characteristic, 66
  transfer function, 66
Filtration
  flow of σ-algebras, 107, 237
Filtering, 92
Financial (B, S)-market, 208
  arbitrage-free, 210
Formula
  discrete differentiation, 209
  Itô, 197
  Szegő–Kolmogorov, 95
Forward contract, 220
Function
  excessive/superharmonic, 300
  harmonic, 311
  superharmonic, 311
  upper/lower, 22
Fundamental theorem
  of arbitrage theory, 207
    first, 210
    second, 214

G
Game, fair/favorable/unfavorable, 113
Gaussian sequence, 64, 103, 104, 173
Generalized
  distribution function, 59
  Markov property, 249
  martingale/submartingale, 109, 155, 164

H
Hájek–Feldman dichotomy, 173
Hardy class H2, 83
Harmonics, 50
Hedging, 220
  perfect, 222
Hydrology, 53

I
Inequality
  Burkholder, 137
  Davis, 139
  Dvoretzky, 147
  Etemadi, 12
  Hájek–Rényi, 147
  Khinchin, 137
  Kolmogorov, 7, 135
    one-sided analog, 11
  Lévy, 26
  Marcinkiewicz and Zygmund, 137
  for martingales, 132
  Ottaviani, 147
  Prohorov, 27
  for probabilities of large deviations, 143
  variational, 232, 299
Innovation sequence, 79
Insurance
  probability of ruin, 202
Interest rate
  bank, 207
  market, 208
Interpolation, 90
Invariant (almost invariant)
  random variable, 37
  set, 37
Investment portfolio, 208
  self-financing, 208, 209
  value of, 208
Isometry correspondence, 62
Itô
  formula, 197
  stochastic integral, 201

K
Kakutani dichotomy, 168
Kolmogorov, A.N.
  inequality, 7, 135
    one-sided analog, 11
  interpolation, 91
  Kolmogorov–Chapman equation, 247, 254, 257, 260
  law of the iterated logarithm, 23
  regular stationary sequence, 84
  strong law of large numbers, 13, 16, 21
  Szegő–Kolmogorov formula, 95
  three-series theorem, 9, 163
  transformation, 45
  zero–one law, 3, 6, 169

L
Laplace–Stieltjes transform, 204
Large deviations, 27, 143
Law of large numbers
  strong, 12
    application to Monte Carlo method, 19
    application to number theory, 18
    for martingales, 140, 160
    Kolmogorov, 13, 16, 21
    rate of convergence, 29
    for a renewal process, 19
  weak, 12
Law of the iterated logarithm, 22
  Hartman and Wintner, 23
  upper/lower function, 22
Lemma
  Borel–Cantelli, 2, 10, 16, 24, 71, 267
  Borel–Cantelli–Lévy, 159
  fundamental
    of discrete renewal theory, 271
  Kronecker, 14
  Toeplitz, 14

M
Market, complete, 214
Markov, A.A.
  chain, 119, 237
    accessible states, 260
    aperiodic, 264, 279
    aperiodic class of states, 263
    classification of states, 259, 265
Markov, A.A. (cont.)
    communicating states, 260
    cyclic subclass of states, 263
    ergodic, 258
    ergodic distributions, 256, 277, 279, 283
    essential/inessential states, 260
    family of, 246
    homogeneous, 243
    indecomposable, 260, 279
    indecomposable class of states, 262
    indecomposable subclass, 279
    initial distribution, 246
    Kolmogorov–Chapman equation, 247
    limiting distributions, 256, 277, 283
    null/positive state, 270
    optimal stopping, 296
    period of a class, 262
    period of a state, 262
    positive recurrent, 279
    recurrent state, 266
    recurrent/transient, 275
    shift operator, 249
    stationary distributions, 256, 277, 279
    transient state, 266
    transition probabilities, 246
  kernel, 242
  property
    generalized, 249, 273, 296
    strict sense, 238
    strong, 251
    wide sense, 238
  time, 109, 205
Martingale, 108
  bounded, 216
  compensator, 115
  convergence of, 148
  Doob decomposition, 115
  in gambling, 114
  generalized, 109
  inequalities for, 132
  large deviations, 143
  Lévy, 108
  local, 110
  mutual characteristic, 116
  nonnegative, 149
  oscillations of, 142
  property preservation of, 119
  quadratic characteristic of, 116
  quadratic variation/covariation, 117, 197
  random time change, 119
  reversed, 31, 118
  sets of convergence, 156
  square-integrable, 116, 143, 157, 180, 188
  S-representation, 216
  strong law of large numbers, 140
  super(sub)martingale, 108
  transform, 111
  uniformly integrable, 121
Martingale difference, 115, 148, 183
  square-integrable, 185, 196
Measures
  absolutely continuous, 165
    locally, 165
    sufficient conditions, 168
  equivalent, 165
  Esscher, 212
  Gauss, 46
  martingale, 210
  orthogonal stochastic, 59
  singular (orthogonal), 165
  stationary (invariant), 56, 256
  stochastic, 56
    elementary, 56
    extension of, 57
    finitely additive, 56
    orthogonal (with orthogonal values), 57
Morphism, 34

N
Noise, 93, 102
  white, 50, 67, 76

O
Optimal stopping, 228, 296
  price, 297
Option, 220
  American type, 221, 224
  buyer’s (call), 221
  call–put parity, 227
  contract, 220
  European type, 221
  fair price of, 222
  seller’s (put), 221

P
Period
  of an indecomposable class, 262
  of a sequence, 262
  of a state, 262
Periodogram, 75
Poincaré
  recurrence theorem, 35
Poisson process, 203
Probability
  of first arrival/return, 266
  space
    coordinate, 249
  space, filtered, 237
R
Random process
  Brownian motion, 184
  with orthogonal increments, 60
Random sequence
  conditionally Gaussian, 97
  of independent random variables, 1
  innovation, 79
  partially observed, 92
  stationary
    almost periodic, 49
    autoregression, 52
    autoregression and moving average, 53
    decomposition, 79
    deterministic, 79
    ergodic, 43
    moving average, 51, 67, 79
    purely (completely) nondeterministic, 79
    regular/singular, 78
    spectral decomposition, 66, 94
    spectral representation, 61
    strict sense, 33
    white noise, 50
    wide sense, 47, 61
Random variable
  independent of future, 109, 205
Random walk
  absorbing state, 289
  Pólya’s theorem, 289
  reflecting barrier, 292
  simple, 284
Recalculation of conditional expectations, 170
Renewal process
  strong law of large numbers, 19
Renewal theory, 129
Robbins–Monro procedure, 156
Ruin
  probability of, 203
  time of, 202

S
Series of random variables, 6
Set
  continuation of observation, 233, 299
  invariant
    w.r.t. a sequence of r.v.’s, 43
  stopping, 233, 299
σ-algebra, tail (terminal, asymptotic), 2
Signal, detection of, 93
Slowly varying, 179
Space
  Borel, 36, 242
  functional, 184
  Hilbert, 48, 58, 62, 86, 91
  measurable, with filtration, 164
  phase (state), 35, 238, 260
    countable, 119, 258, 265, 277, 284, 311
    finite, 275, 283, 291, 303
  of random variables, 48
  of sequences, 44, 169
Spectral
  density, 51
    estimation of, 74
    rational, 69, 71, 87
  function, 51, 72
    estimation of, 74
  measure/function, 55
  representation, 47, 61
    of covariance function, 47, 54, 61
    of sequence, 61
  window, 76
Spectrum, 50, 51, 84
Stationary distribution/measure, 257
Statistical estimation, 71
Stochastic
  exponential, 144
  integral, 58
  matrix, 283
  measure, 56
  sequence, 107
    canonical decomposition, 186
    dominated, 135
    increasing, 107
    partially observed, 96
    predictable, 107
    reversed, 118, 198
Stock price, 209
Stopping time, 109
  optimal, 234
Structure function, 57
Submartingale, 108
  compensator of, 115
  convergence of, 148
  generalized, 109
  inequalities for, 132
  local, 110
  nonpositive, 149
  sets of convergence, 156
  stopped, 110
  uniformly integrable, 150
Sums of random variables
  dependent, 183
  independent, 1
Superhedging, 222
  upper price, 224
Supermartingale, 108
  majorant, 232
    least, 232
T
Theorem
  Birkhoff and Khinchin, 39
  Cantelli, 12
  central limit
    for dependent random variables, 183
    functional, 185
  Chernoff, 31
  Doob
    on convergence of submartingales, 148
    on maximal inequalities, 132
    on random time change, 119
    submartingale decomposition, 115
  ergodic, 44
    mean-square, 69
  Girsanov, discrete version, 179
  Herglotz, 54, 74
  Kolmogorov
    on interpolation, 91
    regular stationary sequence, 84
    strong law of large numbers, 13, 16
    three-series, 9
    zero–one law, 3, 169
  Kolmogorov and Khinchin
    convergence of series, 6
    two-series, 9
  Lévy, 150
  law of the iterated logarithm, 23
  Liouville, 35
  maximal ergodic, 40
  Pólya, on random walk, 289
  Poincaré, on recurrence, 35
  of renewal theory, 129
Transformation
  Bernoulli, 45
  ergodic, 37
  Esscher, 212
    conditional, 214
  Kolmogorov, 45
  measurable, 34
  measure-preserving, 34
  mixing, 38
  metrically transitive, 37
Transition
  function, 242
  matrix, 259–264
    algebraic properties, 259
  operator, one-step, 296
  probabilities, 237, 256
Trapezoidal rule, 199
Two-pointed conditional distribution, 217

W
Wald’s identity, 124
  fundamental, 126
Water level, 53
White noise, 50, 67, 76
Wiener process, 184
Wold’s expansion, 78, 81

Z
Zero–one law, 1
  Borel, 3
  Hewitt–Savage, 5
  Kolmogorov, 3, 152, 169