Parimal Parag's Lecture Notes On Random Processes

Lecture-01: Sample and Event Space

1 Functions and cardinality


Notation: We denote the set of the first N positive integers by [ N ] ≜ {1, 2, . . . , N }, the set of positive integers (natural numbers) by N ≜ {1, 2, . . . }, the set of integers by Z ≜ {. . . , −1, 0, 1, . . . }, the set of non-negative integers by Z+ ≜ {0, 1, . . . }, the set of rational numbers by Q, the set of reals by R, and the set of non-negative reals by R+ . The power set of a set A is the collection of all subsets of A, denoted by P ( A) ≜ { B : B ⊆ A }.
Definition 1.1 (Function). For sets A, B, we denote a function from set B to set A by f : B → A, where
(b, f (b)) ∈ B × A and for each b ∈ B there is only one value f (b) ∈ A. That is,

{(b, f (b)) : b ∈ B} ⊆ B × A.
The sets B and A are called the domain and co-domain of the function f , and the set f ( B) = { f (b) ∈ A : b ∈ B} is called the range of f .

Example 1.2. Let B = {1, 2, 3} and A = { a, b}, then {(1, a), (2, a), (3, b)} corresponds to a function f : B → A
such that f (1) = f (2) = a, f (3) = b. We will denote this function by the ordered tuple ( aab).

Notation: The collection of all A-valued functions with domain B is denoted by A^B .

Example 1.3. Let B = {1, 2, 3} and A = { a, b}, then the collection A B is defined by the set of ordered tuples

{( aaa), ( aab), ( aba), ( abb), (baa), (bab), (bba), (bbb)} .

Definition 1.4 (Inverse Map). For a function f ∈ A^B , we define the set inverse map f −1 (C ) ≜ {b ∈ B : f (b) ∈ C } for all subsets C ⊆ A.
Remark 1. This is a slight abuse of notation, since f −1 : P ( A) → P ( B) is a map from sets to sets.

Example 1.5. Let B = {1, 2, 3} and A = { a, b} and f be denoted by the ordered tuple ( aba), then f −1 ({ a}) =
{1, 3} and f −1 ({b}) = {2}.

Definition 1.6 (Injective, surjective, bijective). A function f ∈ A B is


injective: if for any distinct b ̸= c ∈ B, we have f (b) ̸= f (c),
surjective: if f ( B) = A, and
bijective: if it is injective and surjective.

Example 1.7. injective: Let B = {1, 2, 3} and A = { a, b, c, d}. Then ( abc) is an injective function.

surjective: Let B = {1, 2, 3, 4} and A = { a, b, c}. Then ( abca) is a surjective function.


bijective: Let B = {1, 2, 3} and A = { a, b, c}. Then ( abc) is a bijective function.

Definition 1.8 (Cardinality). We denote the cardinality of a set A by | A|. If there is a bijection between two
sets, they have the same cardinality. Any set which is bijective to the set [ N ] has cardinality N.

Example 1.9. The cardinality of A = { a, b, c} is | A| = 3, since there is a bijection between B = {1, 2, 3} and
A = { a, b, c}.

Definition 1.10 (Countable). Any set which is bijective to a subset of the natural numbers N is called a countable set. Any set which has finite cardinality is called a countably finite set. Any set which is bijective to the set of natural numbers N is called a countably infinite set.

Exercise 1.11. Show the following are true.

1. |A^B| = |A|^{|B|} .

2. A^{[ N ]} is the set of all A-valued N-length sequences.

3. A^N is the set of all A-valued countably infinite sequences indexed by the set of natural numbers N.

4. The sets N, Z+ , Z, Q have the same cardinality.
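The first claim can be checked computationally for small sets. The following Python sketch enumerates all A-valued functions on a finite domain B, represented as ordered tuples as in Example 1.3 (the specific sets A and B are illustrative):

```python
from itertools import product

# Enumerate all A-valued functions on domain B, each represented as an
# ordered tuple (f(1), f(2), f(3)), and check |A^B| = |A|^|B|.
A = ("a", "b")
B = (1, 2, 3)

functions = list(product(A, repeat=len(B)))  # each tuple is one f in A^B
assert len(functions) == len(A) ** len(B)    # 2^3 = 8 functions
```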

2 Sample space
Consider an experiment where the outcomes are random and unpredictable.
Definition 2.1 (Sample space). The set of all possible outcomes of a random experiment is called sample
space and denoted by Ω.

Example 2.2 (Single coin toss). Consider a single coin toss, where the outputs can be heads or tails, denoted
by H and T respectively. The set of all possible outcomes is Ω = { H, T }.

Example 2.3 (Finite coin tosses). Consider N tosses of a single coin, where the possible output of each
coin toss belongs to the set { H, T } as before. In this case, the sample space is Ω = { H, T }[ N ] , and a single
outcome is a sequence ω = (ω1 , . . . , ω N ) where ωi ∈ { H, T } for all i ∈ [ N ].
Example 2.4 (Countably infinite coin tosses). Consider an infinite sequence of coin tosses. The set of all
possible outcomes is Ω = { H, T }N . A single outcome is a sequence ω = (ωi ∈ { H, T } : i ∈ N) ∈ Ω.
Example 2.5 (Point on non-negative real line). The set of all possible outcomes for a single point on the non-negative real line is Ω = R+ .
Example 2.6 (Countable points on non-negative real line). The set of all possible outcomes is Ω = R+^N , where a single outcome is a sequence ω = (ωi ∈ R+ : i ∈ N) ∈ Ω.

3 Event space
Definition 3.1 (Event space). A collection of subsets of sample space Ω is called the event space if it is
a σ-algebra over subsets of Ω, and denoted by F. In other words, the collection F satisfies the following
properties.
1. Event space includes the certain event Ω. That is, Ω ∈ F.
2. Event space is closed under complements. That is, if A ∈ F, then Ac ∈ F.
3. Event space is closed under countable unions. That is, if Ai ∈ F for all i ∈ N, then ∪i∈N Ai ∈ F.
The elements of the event space F are called events.

Example 3.2 (Coarsest event space). For any sample space Ω, the trivial event space is F = {∅, Ω}.
Example 3.3 (Single coin toss). Recall that the sample space for a single coin toss is Ω = { H, T }. We define
an event space F ≜ {∅, { H } , { T } , { H, T }}. Can you verify that F is a σ-algebra?
Example 3.4 (Finest event space for finite coin tosses). Recall that the sample space for N coin tosses is Ω = { H, T }[ N ] . An event space for this sample space is F ≜ P (Ω) = { A : A ⊆ Ω}. Can you verify that F is a σ-algebra?

Remark 2. We observe that F ⊆ P (Ω). However, not all subsets of the sample space Ω necessarily belong to the event space; that is, the inclusion F ⊂ P (Ω) may be strict. To show strictness, it suffices to construct a set A ⊂ Ω that is not in F.

3.1 Properties of event space


Proposition 3.5 (Properties of event space). Event space contains the impossible event, is closed under finite
unions, and closed under countable intersections.
Proof. From the inclusion of the certain event and the closure under complements of σ-algebras, it follows that the impossible event ∅ ∈ F. That is, since Ω ∈ F and ∅ = Ωc , we have ∅ ∈ F.
We need to show that if A, B ∈ F, then the union A ∪ B ∈ F. This follows by taking A1 = A, A2 = B, and Ai = ∅ for all i ⩾ 3, and applying closure under countable unions.
We need to show that for any sequence of events such that Ai ∈ F for all i ∈ N, we have ∩i∈N Ai ∈ F.
To show this, we first notice that Aic ∈ F for all i ∈ N from the closure under complements. Further, we
have ∪i∈N Aic ∈ F from the closure under countable unions, and then the result follows from taking the
complement and the closure under complements.

3.2 Event space generated by a family of sets


Definition 3.6 (Event space generated by a family of sets). Consider a sample space Ω and a family F ⊂
P (Ω) of subsets of Ω. Then the event space generated by F is the smallest event space containing each
element of F, and is denoted by σ ( F ).

Example 3.7 (Event space generated by a single event). For any sample space Ω and a subset A ⊆ Ω, the
smallest event space generated by this event A is F = {∅, A, Ac , Ω}.
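For a finite sample space, the generated event space can be computed by brute force: close the family under complements and pairwise unions until nothing new appears. The sketch below (the sample space and event A are illustrative) recovers the four-element event space of Example 3.7:

```python
# Sketch: compute σ(F) for a finite sample space by closing F under
# complements and pairwise unions until a fixed point is reached. On a
# finite Ω, an algebra is automatically a σ-algebra.
def generated_sigma_algebra(omega, family):
    sets = {frozenset(), frozenset(omega)} | {frozenset(s) for s in family}
    while True:
        new = {frozenset(omega) - s for s in sets}      # complements
        new |= {s | t for s in sets for t in sets}      # pairwise unions
        if new <= sets:                                 # nothing new: done
            return sets
        sets |= new

omega = {1, 2, 3, 4}
A = {1, 2}
sigma = generated_sigma_algebra(omega, [A])
# σ({A}) = {∅, A, A^c, Ω}, matching Example 3.7.
assert sigma == {frozenset(), frozenset(A), frozenset({3, 4}), frozenset(omega)}
```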
Example 3.8 (Countably infinite coin tosses). Recall that the sample space for a countably infinite number
of coin tosses is Ω = { H, T }N . We will construct an event space F on this outcome space, which would not
be the power set of the sample space. By definition, we must have the certain and the impossible event in
an event space. We consider the event space F generated by the events

An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]} , for each n ∈ N. (1)

That is, An is the event of getting at least one head in the first n tosses. We see that ( An ∈ F : n ∈ N) is a sequence of increasing events. From the closure under countable unions, we have ∪n∈N An = Ω \ {( T, T, . . . )} ∈ F.

Exercise 3.9. Consider a countably infinite sequence of coin tosses, with the sample space Ω = { H, T }N and the event space F generated by the events ( An : n ∈ N) defined in Eq. (1). Let Bn be the event of observing the first head on the nth toss; show that Bn ∈ F for all n ∈ N.

Definition 3.10 (Borel event space). For sample space R, a Borel event space is generated by the subsets
Bx ≜ (−∞, x ] ⊆ R for each x ∈ R, and denoted by B(R) = σ ({ Bx : x ∈ R}).

Exercise 3.11. Show that the events { x } , ( x, y), [ x, y] belong to the Borel event space B(R) for all
x, y ∈ R.

Lecture-02: Probability Function

1 Indicator Functions
Definition 1.1 (Indicator functions). Consider a random experiment over an outcome space Ω and an
event space F. The indicator of an event B ∈ F is the function 1B : Ω → {0, 1} defined as

1B (ω ) = 1 if ω ∈ B, and 1B (ω ) = 0 otherwise.

Example 1.2. Consider a roll of dice that has an outcome space Ω = [6] and event space F = P (Ω). For an
event O ≜ {1, 3, 5} that represents odd outcomes for dice roll, we observe that 1O (1) = 1 and 1O (2) = 0.
Example 1.3. Consider N trials of a random experiment over outcome space Ω and the event space F. Let
ωn ∈ Ω denote the outcome of the experiment of the nth trial. For each event B ∈ F, we define an indicator
function

1B (ωn ) = 1 if ωn ∈ B, and 1B (ωn ) = 0 otherwise.

For any event B ∈ F, the number of times the event B occurs in N trials is denoted by N ( B) = ∑_{n=1}^{N} 1B (ωn ). We denote the relative frequency of an event B in N trials by N ( B)/N.

Definition 1.4 (Disjoint events). Let (Ω, F ) be a pair of sample and event space, and A ∈ F^N a sequence of events. The sequence is mutually disjoint if An ∩ Am = ∅ for all m ̸= n ∈ N. If, in addition, ∪n∈N An = Ω, then A is a partition of the sample space Ω.

Exercise 1.5. Consider the sample space Ω = { H, T }N and event space F = σ ( An : n ∈ N) where
An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]}. Construct a sequence of non-trivial disjoint events.
Exercise 1.6. Let (Ω, F ) be a pair of sample and event space. For any sequence of events A ∈ FN ,
show that
1. 1∩n∈N An (ω ) = ∏n∈N 1 An (ω ),
2. 1∪n∈N An (ω ) = ∑n∈N 1 An (ω ) if the sequence A is mutually disjoint.

Example 1.7. We observe the following properties of the relative frequency.

1. For all events B ∈ F, we have 0 ⩽ N ( B)/N ⩽ 1. This follows from the fact that 0 ⩽ N ( B) ⩽ N for any event B ∈ F.

2. Let A ∈ F^N be a sequence of mutually disjoint events, then N (∪i∈N Ai )/N = ∑i∈N N ( Ai )/N. This follows from the fact that for a mutually disjoint event sequence A, we have 1∪i∈N Ai (ωn ) = ∑i∈N 1 Ai (ωn ).

3. For the certain event Ω, we have N (Ω)/N = 1. This follows from the fact that N (Ω) = N.

Since the relative frequency is non-negative and bounded, it may converge to a real number as N grows very large, and the limit lim N →∞ N ( B)/N may exist.
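These properties can be observed in a simulation. The following Python sketch (the die, seed, and trial count are illustrative choices) computes relative frequencies for a fair six-sided die:

```python
import random

# Sketch: relative frequencies N(B)/N for a fair die, illustrating the
# three properties above and the apparent convergence of N(B)/N.
random.seed(1)
N = 100_000
outcomes = [random.randint(1, 6) for _ in range(N)]

def rel_freq(B):
    return sum(1 for w in outcomes if w in B) / N

odd, even = {1, 3, 5}, {2, 4, 6}
assert 0 <= rel_freq(odd) <= 1                       # bounded in [0, 1]
# additivity over the disjoint events odd and even:
assert abs(rel_freq(odd) + rel_freq(even) - rel_freq({1, 2, 3, 4, 5, 6})) < 1e-12
assert rel_freq({1, 2, 3, 4, 5, 6}) == 1.0           # certain event
assert abs(rel_freq(odd) - 0.5) < 0.01               # near the limit 1/2
```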

2 Probability axioms
Inspired by the relative frequency, we list the following axioms for a probability function P : F → [0, 1].
Axiom 2.1 (Axioms of probability). We define a probability measure on sample space Ω and event space
F by a function P : F → [0, 1] which satisfies the following axioms.
Non-negativity: For all events B ∈ F, we have P( B) ⩾ 0.
σ-additivity: For an infinite sequence A ∈ FN of mutually disjoint events, we have P(∪i∈N Ai ) = ∑i∈N P( Ai ).
Certainty: P(Ω) = 1.
Definition 2.2 (Probability space). A sample space Ω, an event space F ⊆ P (Ω), and a probability measure
P : F → [0, 1], together define a probability space (Ω, F, P).

3 Properties of Probability
Theorem 3.1. For any probability space (Ω, F, P), we have the following properties of probability measure.
impossibility: P(∅) = 0.
finite additivity: For mutually disjoint events A ∈ F^n , we have P(∪_{i=1}^{n} Ai ) = ∑_{i=1}^{n} P( Ai ).
monotonicity: If events A, B ∈ F such that A ⊆ B, then P( A) ⩽ P( B).
inclusion-exclusion: For any events A, B ∈ F, we have P( A ∪ B) = P( A) + P( B) − P( A ∩ B).
continuity: For a sequence of events A ∈ F^N such that limn An exists, we have P(limn→∞ An ) = limn→∞ P( An ).
Proof. We consider the probability space (Ω, F, P).
1. We take disjoint events E ∈ FN where E1 = Ω and Ei = ∅ for i ⩾ 2. It follows that ∪i∈N Ei = Ω and E is
a collection of mutually disjoint events. From the countable additivity axiom of probability, it follows
that
P(Ω) = P(Ω) + ∑_{i⩾2} P( Ei ).

Since P( Ei ) ⩾ 0, it implies that P(∅) = 0.


2. We see that finite additivity follows from the countable additivity. We consider disjoint events A1 , . . . , An ,
and take Ai = ∅ for all i > n. It follows that the sequence of sets A ∈ FN is mutually disjoint, and
since P(∅) = 0, it follows that
P(∪_{i=1}^{n} Ai ) = P(∪i∈N Ai ) = ∑_{i=1}^{n} P( Ai ) + ∑_{i>n} P(∅) = ∑_{i=1}^{n} P( Ai ).

3. For events A, B ∈ F such that A ⊆ B, we can take disjoint events E1 = A and E2 = B \ A. From closure
under complements and intersection, it follows that E2 ∈ F. From non-negativity of probability, we
have P( E2 ) ⩾ 0. Finally, the result follows from finite additivity of disjoint events

P( B) = P( E1 ∪ E2 ) = P( E1 ) + P( E2 ) ⩾ P( A).

4. For any two events A, B ∈ F, we can write the following events as disjoint unions

A = ( A \ B ) ∪ ( A ∩ B ), B = ( B \ A ) ∪ ( A ∩ B ), A ∪ B = ( A \ B ) ∪ ( A ∩ B ) ∪ ( B \ A ).

The result follows from the finite additivity of probability of disjoint events.

5. To show the continuity of probability in events, we first need to understand the limits of events. We
show the continuity of probability in the next section.

Example 3.2. Consider a single coin toss with the sample space Ω = { H, T } and an event space F = P (Ω). The probability measure P : F → [0, 1] is defined as

P(∅) = 0, P({ H }) = p, P({ T }) = 1 − p, P({ H, T }) = 1.

Can you verify that P is a probability function?

Example 3.3. Consider N coin tosses with the sample space Ω = { H, T }[ N ] , an event space F = P (Ω). For
each outcome ω ∈ Ω, we define the number of heads as NH (ω ) ≜ ∑n∈[ N ] 1{ H } (ωn ) and the number of tails
as NT (ω ) ≜ N − NH (ω ). The probability measure P : F → [0, 1] is defined for each event A ∈ F as

P( A) ≜ ∑_{ω∈ A} p^{NH (ω )} (1 − p)^{NT (ω )} .

Can you verify that P is a probability function?
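A sketch of this verification for small N (the values of N and p are illustrative), checking certainty, additivity on a partition, and the marginal probability of a head on the first toss:

```python
from itertools import product

# Sketch: verify that P(A) = Σ_{ω∈A} p^{N_H(ω)} (1-p)^{N_T(ω)} defines a
# probability measure on Ω = {H, T}^N (Example 3.3).
N, p = 4, 0.3
omega = list(product("HT", repeat=N))

def prob(event):
    """P(A) for an event A ⊆ Ω, summing the per-outcome weights."""
    return sum(p ** w.count("H") * (1 - p) ** w.count("T") for w in event)

assert abs(prob(omega) - 1.0) < 1e-12          # certainty: P(Ω) = 1
A = [w for w in omega if w[0] == "H"]          # head on the first toss
B = [w for w in omega if w[0] == "T"]          # tail on the first toss
assert abs(prob(A) + prob(B) - 1.0) < 1e-12    # additivity on a partition
assert abs(prob(A) - p) < 1e-12                # matches P(head) = p
```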

4 Limits of Sets
Definition 4.1 (Limits of monotonic sets). For a sequence of non-decreasing sets ( An : n ∈ N), we can define the limit as

limn→∞ An ≜ ∪n∈N An .

Similarly, for a sequence of non-increasing sets ( An : n ∈ N), we can define the limit as

limn→∞ An ≜ ∩n∈N An .

Example 4.2 (Monotone sets). Consider a monotonically increasing sequence a ∈ R^N defined as an ≜ −1/n for all n ∈ N, which converges to the limit 0. We consider monotone sequences of sets A, B ∈ B(R)^N defined as An ≜ [−2, −1/n] and Bn ≜ [−2, 1/n] for all n ∈ N, which are monotonically increasing and decreasing respectively. We can verify the following limits

limn An = ∪n∈N An = [−2, 0), limn Bn = ∩n∈N Bn = [−2, 0].

Definition 4.3 (Limits of sets). For a sequence of sets ( An : n ∈ N), we can define the limit superior and
limit inferior of this sequence of sets as

lim supn→∞ An ≜ ∩n∈N ∪m⩾n Am = limn→∞ ∪m⩾n Am , lim infn→∞ An ≜ ∪n∈N ∩m⩾n Am = limn→∞ ∩m⩾n Am .

Lemma 4.4. For a sequence of sets ( An : n ∈ N), we have lim infn→∞ An ⊆ lim supn→∞ An .

Proof. For each n ∈ N, define En ≜ ∪m⩾n Am and Fn ≜ ∩m⩾n Am . Fix n ∈ N. For k ⩽ n, we have Fk ⊆ Fn ⊆ An ⊆ En , and for k ⩾ n, we have Fk ⊆ Ak ⊆ En . Therefore, ∪k∈N Fk ⊆ En for each n ∈ N, and hence lim infn An = ∪k∈N Fk ⊆ ∩n∈N En = lim supn An .
Definition 4.5. If the limit superior and limit inferior of any sequence of sets ( An : n ∈ N) are equal, then
the sequence of sets has a limit A∞ , which is defined as
A∞ ≜ limn→∞ An = lim supn→∞ An = lim infn→∞ An .

Example 4.6 (Sequence of sets with different limits). We consider the sequence of sets ( An = [−2, (−1)^n + 1/n] : n ∈ N). It follows that Fn = ∩m⩾n Am = [−2, −1] and

En = ∪m⩾n Am = [−2, 1 + 1/(n + 1)] for n odd, and En = [−2, 1 + 1/n] for n even.

We can verify the following limits

lim infn An = ∪n∈N Fn = [−2, −1], lim supn An = ∩n∈N En = [−2, 1].
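The definitions of limit superior and limit inferior can be checked computationally for sequences of finite sets, where the tail unions and intersections stabilize. The alternating sequence below (an illustrative choice) has a limit inferior strictly contained in the limit superior, as in Example 4.6:

```python
# Sketch: limsup/liminf of a sequence of sets over a finite universe,
# mirroring limsup = ∩_n ∪_{m≥n} A_m and liminf = ∪_n ∩_{m≥n} A_m.
def tail_union(A, n):
    out = set()
    for m in range(n, len(A)):
        out |= A[m]
    return out

def tail_intersection(A, n):
    out = set(A[n])
    for m in range(n, len(A)):
        out &= A[m]
    return out

# A_n alternates between {0} and {0, 1}; the sequence has no limit.
A = [{0} if n % 2 == 0 else {0, 1} for n in range(100)]
limsup = set.intersection(*(tail_union(A, n) for n in range(50)))
liminf = set.union(*(tail_intersection(A, n) for n in range(50)))
assert liminf == {0} and limsup == {0, 1}   # liminf ⊆ limsup (Lemma 4.4)
```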

4.1 Proof of continuity of probability


Continuity for increasing sets. Let ( An ∈ F : n ∈ N) be a non-decreasing sequence of events, then limn→∞ An =
∪n∈N An . This implies that ( P( An ) : n ∈ N) is a non-negative non-decreasing bounded sequence, and
hence has a limit. It remains to show that limn→∞ P( An ) = P(∪n∈N An ). To this end, we observe that
( A1 , A2 \ A1 , . . . , An \ An−1 ) is a partition of the event An , and P( Ai \ Ai−1 ) = P( Ai ) − P( Ai−1 ) for each
i ∈ N. From finite additivity of P for mutually disjoint events, we can write for each n ∈ N
P( An ) = P( A1 ) + ∑_{i=1}^{n−1} P( Ai+1 \ Ai ) = P( A1 ) + ∑_{i=1}^{n−1} ( P( Ai+1 ) − P( Ai )).

From σ-additivity of P for a sequence of mutually disjoint events, we can write for limn→∞ An = ∪n∈N An ,

P(∪n∈N An ) = P( A1 ) + ∑_{i∈N} ( P( Ai+1 ) − P( Ai )) = P( A1 ) + limn→∞ ∑_{i=1}^{n−1} ( P( Ai+1 ) − P( Ai )) = limn→∞ P( An ).

Continuity for decreasing sets. Similarly, for a non-increasing sequence of sets ( Bn ∈ F : n ∈ N), we can
find the non-decreasing sequence of sets ( Bnc ∈ F : n ∈ N). By the first part, we have
P(limn→∞ Bn ) = P(∩n∈N Bn ) = 1 − P(∪n∈N Bnc ) = 1 − P(limn→∞ Bnc ) = 1 − limn→∞ P( Bnc ) = limn→∞ P( Bn ).
Continuity for general sequence of sets. We can similarly prove the general result for a sequence of
sets ( An ∈ F : n ∈ N) such that the limits limn An exists. We can define non-increasing sequences of sets
( En = ∪m⩾n Am ∈ F : n ∈ N) and non-decreasing sequences of sets ( Fn = ∩m⩾n Am ∈ F : n ∈ N). From the
continuity of probability for the monotonic sets, we have
P(lim supn An ) = P(∩n∈N En ) = limn→∞ P( En ), P(lim infn An ) = P(∪n∈N Fn ) = limn→∞ P( Fn ).

From the definition of two sequences of sets, we obtain


P( En ) ⩾ supm⩾n P( Am ), P( Fn ) ⩽ infm⩾n P( Am ).

Therefore taking limsup and liminf, we obtain


lim supn∈N P( En ) ⩾ infn∈N supm⩾n P( Am ) ⩾ supn∈N infm⩾n P( Am ) ⩾ lim infn∈N P( Fn ).

Since P(limn An ) = limn P( En ) = limn P( Fn ) exists, the result follows.

Lecture-03: Independence

1 Law of Total Probability

Exercise 1.1 (Countably infinite coin tosses). Consider a sequence of coin tosses, such that the sample space is Ω = { H, T }N . For the set of outcomes En ≜ {ω ∈ Ω : ωn = H }, we consider the event space F ≜ σ({ En : n ∈ N}). Let Fn be the event space generated by the first n coin tosses, i.e. Fn ≜ σ ({ Ei : i ∈ [n]}). Let An be the set of outcomes corresponding to at least one head in the first n outcomes, An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]} = ∪_{i=1}^{n} Ei ∈ F, and Bn be the set of outcomes corresponding to the first head at the nth outcome, Bn ≜ {ω ∈ Ω : ω1 = · · · = ωn−1 = T, ωn = H } = ∩_{i=1}^{n−1} Eic ∩ En ∈ F.
1. Show that F = σ ({Fn : n ∈ N}).

2. Show that σ ({ An : n ∈ N}) ⊆ F and σ ({ Bn : n ∈ N}) ⊆ F.

Theorem 1.2 (Law of total probability). For a probability space (Ω, F, P), consider a sequence of events B ∈ FN
that partitions the sample space Ω, i.e. Bm ∩ Bn = ∅ for all m ̸= n, and ∪n∈N Bn = Ω. Then, for any event A ∈ F,
we have
P( A) = ∑n∈N P( A ∩ Bn ).

Proof. We can expand any event A ∈ F in terms of any partition B of the sample space Ω as

A = A ∩ Ω = A ∩ (∪n∈N Bn ) = ∪n∈N ( A ∩ Bn ).

From the mutual disjointness of the events B ∈ FN , it follows that the sequence ( A ∩ Bn ∈ F : n ∈ N) is
mutually disjoint. The result follows from the countable additivity of probability of disjoint events.
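A sketch of the law of total probability on a fair die, with the partition into odd and even outcomes (an illustrative choice), using exact rational arithmetic:

```python
from fractions import Fraction as F

# Sketch: P(A) = Σ_n P(A ∩ B_n) on a fair die Ω = {1,...,6}, with the
# partition B_1 = odd outcomes, B_2 = even outcomes.
P = {w: F(1, 6) for w in range(1, 7)}          # uniform measure on Ω

def prob(event):
    return sum(P[w] for w in event)

A = {1, 2, 3}                                   # outcomes at most three
partition = [{1, 3, 5}, {2, 4, 6}]              # disjoint, covers Ω
assert prob(A) == sum(prob(A & B) for B in partition)
```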

Example 1.3 (Countably infinite coin tosses). Consider the sample space Ω = { H, T }N and event space F
generated by the sequence E ∈ F^N defined in Exercise 1.1. We observe that any event A ∈ Fn can be written as a union of cylinder sets determined by the first n tosses,

A = ∪_{ω∈ A} ∩_{i=1}^{n} Ci (ω ), where Ci (ω ) = Ei if ωi = H and Ci (ω ) = Eic if ωi = T.

2 Independence
Definition 2.1 (Independence of events). For a probability space (Ω, F, P), a family of events A ∈ F I is said
to be independent, if for any finite set F ⊆ I, we have

P(∩i∈ F Ai ) = ∏i∈ F P( Ai ).

Remark 1. The certain event Ω and the impossible event ∅ are always independent of every event A ∈ F.

Example 2.2 (Two coin tosses). Consider two coin tosses, such that the sample space is Ω = { HH, HT, TH, TT } and the event space is F = P (Ω). It suffices to define a probability function P : F → [0, 1] on the sample space. We define one such probability function P, such that

P({ HH }) = P({ HT }) = P({ TH }) = P({ TT }) = 1/4.

Let events E1 ≜ { HH, HT } and E2 ≜ { HH, TH } correspond to getting a head on the first or the second toss respectively.
From the defined probability function, the probability of getting a tail on the first or the second toss is 1/2, identical to the probability of getting a head on the first or the second toss. That is, P( E1 ) = P( E2 ) = 1/2, and the intersecting event E1 ∩ E2 = { HH } has probability P( E1 ∩ E2 ) = 1/4. Thus, for events E1 , E2 ∈ F, we have

P( E1 ∩ E2 ) = P( E1 ) P( E2 ).

That is, events E1 and E2 are independent.
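A computational check of this example, using exact rational arithmetic so the product comparison is exact:

```python
from fractions import Fraction as F

# Sketch: independence of "head on first toss" and "head on second toss"
# for two fair coin tosses (Example 2.2).
P = {w: F(1, 4) for w in ("HH", "HT", "TH", "TT")}

def prob(event):
    return sum(P[w] for w in event)

E1 = {"HH", "HT"}                   # head on the first toss
E2 = {"HH", "TH"}                   # head on the second toss
assert prob(E1 & E2) == prob(E1) * prob(E2)   # 1/4 = 1/2 * 1/2
```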

Example 2.3 (Countably infinite coin tosses). Consider the outcome space Ω = { H, T }N and the event space F generated by the sequence E defined in Exercise 1.1. We define a probability function P : F → [0, 1] by P(∩i∈ F Ei ) = p^{| F|} for any finite subset F ⊆ N. By definition, E ∈ F^N is a sequence of independent events. Consider A, B ∈ F^N , where An ≜ ∪_{i=1}^{n} Ei and Bn ≜ ∩_{i=1}^{n−1} Eic ∩ En ∈ F for all n ∈ N. It follows that P( An ) = 1 − (1 − p)^n and P( Bn ) = p(1 − p)^{n−1} for n ∈ N.
For any ω ∈ Ω, we can define the number of heads in the first n trials by k n (ω ) ≜ ∑_{i=1}^{n} 1{ H } (ωi ) = ∑_{i=1}^{n} 1{ω ∈ Ei }. For any general event A ∈ Fn = σ({ Ei : i ∈ [n]}), we can write

P( A) = ∑_{ω∈ A} ∏_{i=1}^{n} [ p 1{ωi = H } + (1 − p) 1{ωi = T } ] = ∑_{ω∈ A} p^{k n (ω )} (1 − p)^{n−k n (ω )} .

Example 2.4 (Counter example). Consider a probability space (Ω, F, P) and the events A1 , A2 , A3 ∈ F. The
condition P( A1 ∩ A2 ∩ A3 ) = P( A1 ) P( A2 ) P( A3 ) is not sufficient to guarantee independence of the three
events. In particular, we see that if

P ( A1 ∩ A2 ∩ A3 ) = P ( A1 ) P ( A2 ) P ( A3 ), P( A1 ∩ A2 ∩ A3c ) ̸= P( A1 ) P( A2 ) P( A3c ),

then P( A1 ∩ A2 ) = P( A1 ∩ A2 ∩ A3 ) + P( A1 ∩ A2 ∩ A3c ) ̸= P( A1 ) P( A2 ).

Definition 2.5. A family of collections of events (Ai ⊆ F : i ∈ I ) is called independent, if for any finite set
F ⊆ I and Ai ∈ Ai for all i ∈ F, we have

P(∩i∈ F Ai ) = ∏i∈ F P( Ai ).

3 Conditional Probability
Consider N trials of a random experiment over an outcome space Ω and an event space F. Let ωn ∈ Ω
denote the outcome of the experiment of the nth trial. Consider two events A, B ∈ F and denote the number

of times event A and event B occurs by N ( A) and N ( B) respectively. We denote the number of times both
events A and B occurred by N ( A ∩ B). Then, we can write these numbers in terms of indicator functions as

N ( A) = ∑_{n=1}^{N} 1{ωn ∈ A} , N ( B) = ∑_{n=1}^{N} 1{ωn ∈ B} , N ( A ∩ B) = ∑_{n=1}^{N} 1{ωn ∈ A ∩ B} .

We denote the relative frequencies of events A, B, A ∩ B in N trials by N ( A)/N, N ( B)/N, N ( A ∩ B)/N respectively. We can find the relative frequency of the event A on the trials where B occurred as

( N ( A ∩ B)/N ) / ( N ( B)/N ) = N ( A ∩ B)/N ( B).

Inspired by the relative frequency, we define the conditional probability function conditioned on events.
Definition 3.1. Fix an event B ∈ F such that P( B) > 0. We can define the conditional probability P(·| B) : F → [0, 1] of any event A ∈ F conditioned on the event B as

P( A| B) = P( A ∩ B)/P( B).

Lemma 3.2 (Conditional probability). For any event B ∈ F such that P( B) > 0, the conditional probability
P(·| B) : F → [0, 1] is a probability measure on space (Ω, F ).
Proof. We will show that the conditional probability satisfies all three axioms of a probability measure.
Non-negativity: For all events A ∈ F, we have P( A| B) ⩾ 0 since P( A ∩ B) ⩾ 0.
σ-additivity: For an infinite sequence of mutually disjoint events ( Ai ∈ F : i ∈ N) such that Ai ∩ A j = ∅
for all i ̸= j, we have P(∪i∈N Ai | B) = ∑i∈N P( Ai | B). This follows from disjointness of the sequence
( A i ∩ B ∈ F : i ∈ N).
Certainty: Since Ω ∩ B = B, we have P(Ω| B) = 1.
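A sketch checking the three axioms for P(·| B) on a fair die, with B the event of an even outcome (an illustrative choice):

```python
from fractions import Fraction as F

# Sketch: P(.|B) is itself a probability measure (Lemma 3.2), checked on a
# fair die with conditioning event B = {2, 4, 6}.
P = {w: F(1, 6) for w in range(1, 7)}
Omega = set(P)
B = {2, 4, 6}

def cond(A, B):
    """P(A | B) = P(A ∩ B) / P(B)."""
    pB = sum(P[w] for w in B)
    return sum(P[w] for w in A & B) / pB

assert cond(Omega, B) == 1                      # certainty
A1, A2 = {2}, {4, 6}                            # disjoint events
assert cond(A1 | A2, B) == cond(A1, B) + cond(A2, B)  # additivity
assert cond({2}, B) == F(1, 3)                  # non-negative values
```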

Remark 2. For two independent events A, B ∈ F such that P( A ∩ B) > 0, we have P( A| B) = P( A) and
P( B| A) = P( B). If either P( A) = 0 or P( B) = 0, then P( A ∩ B) = 0.
Remark 3. For any partition B of the sample space Ω, if P( Bn ) > 0 for all n ∈ N, then from the law of total
probability and the definition of conditional probability, we have

P( A) = ∑n∈N P( A| Bn ) P( Bn ).

4 Conditional Independence
Definition 4.1 (Conditional independence of events). For a probability space (Ω, F, P), a family of events
A ∈ F I is said to be conditionally independent given an event C ∈ F such that P(C ) > 0, if for any finite set
F ⊆ I, we have
P(∩i∈ F Ai |C ) = ∏ P( Ai |C ).
i∈ F

Remark 4. Let C ∈ F be an event such that P(C ) > 0. Two events A, B ∈ F are said to be conditionally
independent given event C, if
P ( A ∩ B | C ) = P ( A | C ) P ( B | C ).
If the event C = Ω, it implies that A, B are independent events.
Remark 5. Two events may be independent, but not conditionally independent and vice versa.

Example 4.2. Consider two independent events A, B ∈ F such that P( A ∩ B) > 0 and P( A ∪ B) < 1. Then
the events A and B are not conditionally independent given A ∪ B. To see this, we observe that

P( A ∩ B| A ∪ B) = P(( A ∩ B) ∩ ( A ∪ B))/P( A ∪ B) = P( A ∩ B)/P( A ∪ B) = P( A) P( B)/P( A ∪ B) = P( A| A ∪ B) P( B).

We further observe that P( B| A ∪ B) = P( B)/P( A ∪ B) ̸= P( B), and hence P( A ∩ B| A ∪ B) ̸= P( A| A ∪ B) P( B| A ∪ B).

Example 4.3. Consider two non-independent events A, B ∈ F such that P( A) > 0. Then the events A and B
are conditionally independent given A. To see this, we observe that

P( A ∩ B| A) = P( A ∩ B)/P( A) = P( B| A) = P( B| A) P( A| A),

since P( A| A) = 1.

Lecture-04: Random Variable

1 Random Variable
Definition 1.1 (Random variable). Consider a probability space (Ω, F, P). A random variable X : Ω → R
is a real-valued function from the sample space to real numbers, such that for each x ∈ R the event

A X ( x ) ≜ {ω ∈ Ω : X (ω ) ⩽ x } = { X ⩽ x } = X −1 (−∞, x ] = X −1 ( Bx ) ∈ F.

We say that the random variable X is F-measurable.

Remark 1. Recall that the set A X ( x ) is always a subset of sample space Ω for any mapping X : Ω → R, and
A X ( x ) ∈ F is an event when X is a random variable.

Example 1.2 (Constant function). Consider a mapping X : Ω → {c} ⊆ R defined on an arbitrary probability space (Ω, F, P), such that X (ω ) = c for all outcomes ω ∈ Ω. We observe that

A X ( x ) = X −1 ( Bx ) = ∅ for x < c, and A X ( x ) = Ω for x ⩾ c.

That is, A X ( x ) ∈ F for every event space F, and hence X is a random variable, measurable with respect to every event space.
Example 1.3 (Indicator function). For an arbitrary probability space (Ω, F, P) and an event A ∈ F, consider the indicator function 1 A : Ω → {0, 1}. Let x ∈ R and Bx = (−∞, x ]; then it follows that

A X ( x ) = 1 A^{−1} ( Bx ) = Ω for x ⩾ 1, Ac for x ∈ [0, 1), and ∅ for x < 0.

That is, A X ( x ) ∈ F for all x ∈ R, and hence the indicator function 1 A is a random variable.

Remark 2. Since any outcome ω ∈ Ω is random, so is the real value X (ω ).


Remark 3. Probability is defined only for events and not for random variables. The events of interest for
random variables are the lower level sets A X ( x ) = {ω : X (ω ) ⩽ x } = X −1 ( Bx ) for any real x.
Remark 4. Consider a probability space (Ω, F, P) and a random variable X : Ω → R that is G measurable for
G ⊆ F. If G ⊆ H, then X is also H measurable.

1.1 Distribution function for a random variable


Definition 1.4. For an F measurable random variable X : Ω → R defined on the probability space (Ω, F, P),
we can associate a distribution function (CDF) FX : R → [0, 1] such that for all x ∈ R,

FX ( x ) ≜ P( A X ( x )) = P({ X ⩽ x }) = P ◦ X −1 (−∞, x ] = P ◦ X −1 ( Bx ).

Example 1.5 (Constant random variable). Let X : Ω → {c} ⊆ R be a constant random variable defined
on the probability space (Ω, F, P). The distribution function is a right-continuous step function at c with
step-value unity. That is, FX ( x ) = 1[c,∞) ( x ). We observe that P({ X = c}) = 1.

Example 1.6 (Indicator random variable). For an indicator random variable 1 A : Ω → {0, 1} defined
on a probability space (Ω, F, P) and an event A ∈ F, we have

1,
 x ⩾ 1,
FX ( x ) = 1 − P( A), x ∈ [0, 1),

0, x < 0.
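The three pieces of this distribution function can be checked empirically. The sketch below (the event, die, and sample size are illustrative) compares the empirical CDF of an indicator random variable with the formula above:

```python
import random

# Sketch: the CDF of an indicator random variable 1_A (Example 1.6),
# checked empirically for a fair-die event A = {5, 6}, so P(A) = 1/3.
random.seed(0)
p_A = 1 / 3
samples = [1 if random.randint(1, 6) >= 5 else 0 for _ in range(200_000)]

def ecdf(x):
    return sum(s <= x for s in samples) / len(samples)

assert ecdf(-0.5) == 0.0                        # F_X(x) = 0 for x < 0
assert abs(ecdf(0.5) - (1 - p_A)) < 0.01        # F_X(x) = 1 - P(A) on [0, 1)
assert ecdf(1.5) == 1.0                         # F_X(x) = 1 for x >= 1
```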

Lemma 1.7 (Properties of distribution function). The distribution function FX for any random variable X satisfies the following properties.

1. The distribution function is monotonically non-decreasing in x ∈ R.


2. The distribution function is right-continuous at all points x ∈ R.
3. The upper limit is limx→∞ FX ( x ) = 1 and the lower limit is limx→−∞ FX ( x ) = 0.
Proof. Let X be a random variable defined on the probability space (Ω, F, P).

1. Let x1 , x2 ∈ R such that x1 ⩽ x2 . Then for any ω ∈ A x1 , we have X (ω ) ⩽ x1 ⩽ x2 , and it follows that
ω ∈ A x2 . This implies that A x1 ⊆ A x2 . The result follows from the monotonicity of the probability.
2. For any x0 ∈ R, consider a monotonically decreasing sequence ( xn ∈ R : n ∈ N) such that limn xn = x0 . It follows that the sequence of events ( A xn = X −1 (−∞, xn ] ∈ F : n ∈ N) is monotonically decreasing, and hence limn A xn = ∩n∈N A xn = A x0 . The right-continuity then follows from the continuity of probability, since

FX ( x0 ) = P( A x0 ) = P(limn A xn ) = limn P( A xn ) = lim_{xn ↓ x0} FX ( xn ).

3. Consider a monotonically increasing sequence x ∈ RN such that limn xn = ∞, then ( A xn ∈ F : n ∈ N)


is a monotonically increasing sequence of sets and limn A xn = ∪n∈N A xn = Ω. From the continuity of
probability, it follows that

lim_{xn →∞} FX ( xn ) = limn P( A xn ) = P(limn A xn ) = P(Ω) = 1.

Similarly, we can take a monotonically decreasing sequence x ∈ RN such that limn xn = −∞, then
( A xn ∈ F : n ∈ N) is a monotonically decreasing sequence of sets and limn A xn = ∩n∈N A xn = ∅. From
the continuity of probability, it follows that limxn →−∞ FX ( xn ) = 0.

Remark 5. If two reals x1 < x2 , then FX ( x1 ) ⩽ FX ( x2 ), with equality if and only if P({ x1 < X ⩽ x2 }) = 0. This follows from the fact that A x2 = A x1 ∪ X −1 ( x1 , x2 ].

1.2 Event space generated by a random variable


Definition 1.8 (Event space generated by a random variable). Let X : Ω → R be an F measurable random variable defined on the probability space (Ω, F, P). The smallest event space generated by the events A X ( x ) = X −1 ( Bx ) = X −1 (−∞, x ] for x ∈ R is called the event space generated by the random variable X, and denoted by σ ( X ) ≜ σ ({ A X ( x ) : x ∈ R}).

Remark 6. The  event space generated by a random variable is the collection of the inverse of Borel sets,
i.e. σ ( X ) = X −1 ( B) : B ∈ B(R) . This follows from the fact that A X ( x ) = X −1 ( Bx ) and the inverse map
respects countable set operations such as unions, complements, and intersections. That is, if B ∈ B(R) =
σ ({ Bx : x ∈ R}), then X −1 ( B) ∈ σ ({ A X ( x ) : x ∈ R}). Similarly, if A ∈ σ( X ) = σ ({ A X ( x ) : x ∈ R}), then
A = X −1 ( B) for some B ∈ σ ({ Bx : x ∈ R}).

Example 1.9 (Constant random variable). Let X : Ω → {c} ⊆ R be a constant random variable defined
on the probability space (Ω, F, P). Then the smallest event space generated by this random variable is
σ ( X ) = {∅, Ω}.
Example 1.10 (Indicator random variable). Let 1 A be an indicator random variable defined on the probability space (Ω, F, P) for an event A ∈ F; then the smallest event space generated by this random variable is σ (1 A ) = σ ({∅, Ac , Ω}) = {∅, A, Ac , Ω}.

1.3 Discrete random variables


Definition 1.11 (Discrete random variables). If a random variable X : Ω → X ⊆ R takes countably many values on the real line, then it is called a discrete random variable. That is, the range of the random variable X is countable, and the random variable is completely specified by the probability mass function

PX ( x ) = P({ X = x }), for all x ∈ X.

Example 1.12 (Bernoulli random variable). For a probability space (Ω, F, P), a Bernoulli random
variable is a mapping X : Ω → {0, 1} with PX (1) = p. We observe that a Bernoulli random variable is the
indicator of the event A ≜ X −1 {1}, with P( A) = p. Therefore, the distribution function FX is given by

FX = (1 − p)1[0,1) + 1[1,∞) .
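As a quick numerical sanity check, the empirical distribution of simulated Bernoulli samples can be compared against this FX ; the sketch below uses numpy, with p = 0.3 as a hypothetical parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                    # hypothetical success probability
X = rng.random(100_000) < p                # Bernoulli(p) samples (booleans count as 0/1)

def F_X(x):
    """Distribution function F_X = (1 - p) 1_[0,1) + 1_[1,oo)."""
    if x < 0:
        return 0.0
    return 1.0 if x >= 1 else 1 - p

# The empirical CDF matches F_X at a few test points.
for x in [-0.5, 0.0, 0.5, 1.0, 2.0]:
    assert abs(np.mean(X <= x) - F_X(x)) < 0.01
```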

Lemma 1.13. Any discrete random variable is a linear combination of indicator functions over a partition of the sample
space.
Proof. For a discrete random variable X : Ω → X ⊆ R on a probability space (Ω, F, P), the range X is countable,
and we can define events Ex ≜ {ω ∈ Ω : X (ω ) = x } ∈ F for each x ∈ X. Then the mutually disjoint
sequence of events ( Ex ∈ F : x ∈ X) partitions the sample space Ω. We can write
X (ω ) = ∑x∈X x1Ex (ω ).

Definition 1.14. Any discrete random variable X : Ω → X ⊆ R defined over a probability space (Ω, F, P),
with finite range is called a simple random variable.

Example 1.15 (Simple random variables). Let X be a simple random variable, then X = ∑x∈X x1 AX ( x) ,
where ( A X ( x ) = X −1 { x } ∈ F : x ∈ X) is a finite partition of the sample space Ω. Without loss of generality,
we can denote X = { x1 , . . . , xn } where x1 ⩽ . . . ⩽ xn . Then,

X −1 (−∞, x ] = Ω for x ⩾ xn ; X −1 (−∞, x ] = ∪ij=1 A X ( x j ) for x ∈ [ xi , xi+1 ), i ∈ [n − 1]; and X −1 (−∞, x ] = ∅ for x < x1 .

Then the smallest event space generated by the simple random variable X is σ ( X ) = {∪x∈S A X ( x ) : S ⊆ X}.
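The claim that σ ( X ) consists of all unions of the atoms A X ( x ) can be checked by brute-force enumeration on a small sample space; the sketch below uses a hypothetical six-point Ω and a three-valued simple X, and counts 2|X| = 8 events.

```python
from itertools import combinations

# Hypothetical simple random variable on a six-point sample space, taking 3 values.
Omega = range(6)
X = {0: 1.0, 1: 1.0, 2: 2.0, 3: 2.0, 4: 3.0, 5: 3.0}     # omega -> X(omega)
values = sorted(set(X.values()))

# Atoms A_X(x) = X^{-1}{x} form a finite partition of Omega.
atom = {v: frozenset(w for w in Omega if X[w] == v) for v in values}

# sigma(X) = { union of A_X(x) over x in S : S a subset of the range }.
sigma_X = set()
for r in range(len(values) + 1):
    for S in combinations(values, r):
        sigma_X.add(frozenset().union(*(atom[v] for v in S)))

assert len(sigma_X) == 2 ** len(values)    # 8 events, including the empty set and Omega
assert frozenset(Omega) in sigma_X
```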

1.4 Continuous random variables
Definition 1.16. For a continuous random variable X, there exists a density function f X : R → [0, ∞) such
that
FX ( x ) = ∫u⩽ x f X (u) du.

Example 1.17 (Gaussian random variable). For a probability space (Ω, F, P), a Gaussian random variable
is a continuous random variable X : Ω → R defined by its density function

f X ( x ) = (1/√(2πσ2 )) exp(−( x − µ)2 /(2σ2 )), x ∈ R,

with mean µ ∈ R and variance σ2 > 0.

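A short numerical check that this f X integrates to one and that FX (µ) = 1/2 by symmetry; µ = 1, σ = 2 are hypothetical parameters, and the integral of Definition 1.16 is approximated by a Riemann sum.

```python
import numpy as np

mu, sigma = 1.0, 2.0                       # hypothetical mean and standard deviation

def f_X(x):
    """Gaussian density with mean mu and variance sigma^2."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Approximate F_X(x) = int_{u <= x} f_X(u) du by a Riemann sum on a fine grid.
u = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
F = np.cumsum(f_X(u)) * (u[1] - u[0])

assert abs(F[-1] - 1.0) < 1e-3                        # total probability mass is 1
assert abs(F[np.searchsorted(u, mu)] - 0.5) < 1e-3    # F_X(mu) = 1/2 by symmetry
```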
Lecture-05: Random Vectors

1 Random vectors
Definition 1.1 (Projection). For a vector x ∈ Rn and i ∈ [n], the projection πi : Rn → R onto the i-th
component is defined by πi ( x ) = xi .
Definition 1.2. The Borel sigma algebra over space Rn is defined as the smallest sigma algebra generated
by the family (πi−1 ( Bx ) : x ∈ R, i ∈ [n]) and is denoted by B(Rn ). The elements of the Borel sigma algebra
are called Borel sets.
Remark 1. By definition of B(Rn ), projection πi : Rn → R is a Borel measurable function for all i ∈ [n]. For
a subset A ⊆ R and projection πi : Rn → R, we can write
πi−1 ( A) = { x ∈ Rn : xi ∈ A} = R × · · · × A × · · · × R.
Thus, for any A ∈ B(R), we have πi−1 ( A) ∈ B(Rn ).
Definition 1.3 (Random vectors). Consider a probability space (Ω, F, P) and a finite n ∈ N. A random
vector X : Ω → Rn is an F-measurable mapping from the sample space to an n-length real-valued vector.
That is, for any x ∈ Rn , we have
A X ( x ) ≜ {ω ∈ Ω : X1 (ω ) ⩽ x1 , . . . , Xn (ω ) ⩽ xn } = ∩in=1 Xi−1 (−∞, xi ] ∈ F.

Example 1.4 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F. We define a mapping X : Ω → {0, 1}n by Xi (ω ) ≜ 1 Ai (ω ) for all outcomes ω ∈ Ω. Let
x ∈ Rn , then we can write A X ( x ) = ∩in=1 1−1 Ai (−∞, xi ]. Recall that

1−1 Ai (−∞, xi ] = Ω for xi ⩾ 1, Aci for xi ∈ [0, 1), and ∅ for xi < 0.

It follows that the inverse image A X ( x ) lies in F, and hence X is an F-measurable random vector.

Theorem 1.5. Consider a probability space (Ω, F, P), and a finite n ∈ N. A mapping X : Ω → Rn is a random
vector if and only if Xi ≜ πi ◦ X : Ω → R are random variables for all i ∈ [n].
Proof. We will first show that if X : Ω → Rn is a random vector, then πi ◦ X is a random variable for any
i ∈ [n]. For any i ∈ [n] and xi ∈ R, we take x = (∞, . . . , xi , . . . , ∞). This implies that πi−1 (−∞, xi ] = R × · · · × (−∞, xi ] × · · · ×
R ∈ B(Rn ). Further, defining A Xi ( xi ) ≜ Xi−1 (−∞, xi ], we observe from the definition of random vectors that
A X ( x ) = ∩nj=1 X j−1 (−∞, x j ] = Xi−1 (−∞, xi ] = A Xi ( xi ) ∈ F. (1)

We will next show that if Xi : Ω → R is a random variable for all i ∈ [n], then X ≜ ( X1 , . . . , Xn ) : Ω → Rn
is a random vector. For any x ∈ Rn , we have A Xi ( xi ) = Xi−1 (−∞, xi ] ∈ F for all i ∈ [n], from the definition
of random variables. From the closure of event set under countable intersections, we have
A X ( x ) = ∩in=1 A Xi ( xi ) ∈ F. (2)

1.1 Distribution of random vectors
Definition 1.6. Consider a probability space (Ω, F, P) and a finite n ∈ N. The joint distribution function
of a random vector X : Ω → Rn is defined as the mapping FX : Rn → [0, 1] such that

FX ( x ) ≜ P( A X ( x )) = P(∩in=1 A Xi ( xi )).

Example 1.7 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F, that define the random vector X ≜ (1 A1 , . . . , 1 An ). For any x ∈ Rn , we can define index
sets I0 ( x ) ≜ {i ∈ [n] : xi < 0} and I1 ( x ) ≜ {i ∈ [n] : xi ∈ [0, 1)}, and write the joint distribution function for
this random vector X as

FX ( x ) = 1 if I0 ( x ) ∪ I1 ( x ) = ∅; FX ( x ) = P(∩i∈ I1 ( x) Aci ) if I0 ( x ) = ∅, I1 ( x ) ̸= ∅; and FX ( x ) = 0 if I0 ( x ) ̸= ∅.

Definition 1.8. For a random vector X : Ω → Rn defined on the probability space (Ω, F, P) and i ∈ [n], the
distribution of the ith random variable Xi ≜ πi ◦ X : Ω → R is called the ith marginal distribution, and is
denoted by FXi : R → [0, 1].

Example 1.9 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F, that define the random vector X ≜ (1 A1 , . . . , 1 An ). The ith marginal distribution is given
by
FXi ( x ) = (1 − P( Ai ))1[0,1) ( x ) + 1[1,∞) ( x ).

Corollary 1.10 (Marginal distribution). Consider a random vector X : Ω → Rn defined on a probability space
(Ω, F, P) with the joint distribution FX : Rn → [0, 1]. The ith marginal distribution can be obtained from the
joint distribution of X as
FXi ( xi ) = limx j →∞, for all j̸=i FX ( x ).

Proof. For any i ∈ [n] and xi ∈ R, we have Xi−1 (−∞, xi ] = A X ( x ) for x = (∞, . . . , xi , . . . , ∞) from (1).
Lemma 1.11 (Properties of the joint distribution function). Consider a random vector X : Ω → Rn defined
on the probability space (Ω, F, P). The associated joint distribution function FX : Rn → [0, 1] satisfies the following
properties.
(i) For x, y ∈ Rn such that xi ⩽ yi for each i ∈ [n], we have FX ( x ) ⩽ FX (y).
(ii) The function FX ( x ) is right continuous at all points x ∈ Rn .
(iii) The lower limit is limxi →−∞ FX ( x ) = 0, and the upper limit is limxi →∞,i∈[n] FX ( x ) = 1.

Proof. Consider a random vector X : Ω → Rn defined on the probability space (Ω, F, P) and any x ∈ Rn .
(i) We can verify that A X ( x ) = ∩in=1 A Xi ( xi ) ⊆ ∩in=1 A Xi (yi ) = A X (y). The result follows from the monotonicity
of the probability measure.
(ii) The proof is similar to the proof for a single random variable.
(iii) The event A X ( x ) = ∅ when xi = −∞ for some i ∈ [n] and A X ( x ) = Ω when xi = ∞ for all i ∈ [n], hence
the result follows.

Example 1.12 (Probability of rectangular events). Consider a probability space (Ω, F, P) and a random
vector X : Ω → R2 . Consider the points x ⩽ y ∈ R2 and the events

B1 ≜ { x1 < X1 ⩽ y1 } = A X1 (y1 ) \ A X1 ( x1 ) ∈ F, B2 ≜ { x2 < X2 ⩽ y2 } = A X2 (y2 ) \ A X2 ( x2 ) ∈ F.

The marginal probabilities are given by

P( B1 ) = P( A X1 (y1 )) − P( A X1 ( x1 )) = FX1 (y1 ) − FX1 ( x1 ),


P( B2 ) = P( A X2 (y2 )) − P( A X2 ( x2 )) = FX2 (y2 ) − FX2 ( x2 ).

Writing x = ( x1 , x2 ) and y = (y1 , y2 ), we observe that the corner points of the rectangular event B1 ∩ B2 are
x, (y1 , x2 ), y, and ( x1 , y2 ). Therefore, we can write this event as

B1 ∩ B2 = ( A X (y) \ A X ( x1 , y2 )) \ ( A X (y1 , x2 ) \ A X ( x )) ∈ F.

Hence, we can write the probability of this rectangular event as

P( B1 ∩ B2 ) = ( FX (y) − FX ( x1 , y2 )) − ( FX (y1 , x2 ) − FX ( x )).
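This inclusion-exclusion identity can be verified on a concrete joint distribution; the sketch below assumes two independent Uniform(0, 1) components (a hypothetical choice), so that FX ( x1 , x2 ) = x1 x2 on the unit square, and cross-checks the rectangle probability against Monte Carlo.

```python
import numpy as np

def F_X(x1, x2):
    """Joint CDF of two independent Uniform(0,1) components."""
    return np.clip(x1, 0, 1) * np.clip(x2, 0, 1)

x1, x2, y1, y2 = 0.2, 0.3, 0.7, 0.9
rect = (F_X(y1, y2) - F_X(x1, y2)) - (F_X(y1, x2) - F_X(x1, x2))
assert abs(rect - (y1 - x1) * (y2 - x2)) < 1e-9   # equals the rectangle's area here

# Monte Carlo estimate of P(x1 < X1 <= y1, x2 < X2 <= y2).
rng = np.random.default_rng(1)
U = rng.random((200_000, 2))
hits = np.mean((U[:, 0] > x1) & (U[:, 0] <= y1) & (U[:, 1] > x2) & (U[:, 1] <= y2))
assert abs(hits - rect) < 0.01
```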

1.2 Event space generated by random vectors


Definition 1.13. Consider a probability space (Ω, F, P) and a finite n ∈ N. The event space generated by a
random vector X : Ω → Rn is the smallest σ-algebra generated by the collection of events ( A X ( x ) : x ∈ Rn )
and denoted by σ( X ) ≜ σ ( A X ( x ) : x ∈ Rn ).
Theorem 1.14. Consider a probability space (Ω, F, P), a finite n ∈ N, a random vector X : Ω → Rn , and its projec-
tions Xi ≜ πi ◦ X for all i ∈ [n]. Then, σ ( X ) = σ ( X1 , . . . , Xn ).
Proof. Recall that σ ( X ) is generated by the family ( A X ( x ) : x ∈ Rn ) and σ ( X1 , . . . , Xn ) is generated by the
family ( A Xi ( xi ) : xi ∈ R, i ∈ [n]). We first show that A Xi ( xi ) = A X ( x ) for x = (∞, . . . , xi , . . . , ∞), and hence
σ ( X1 , . . . , Xn ) ⊆ σ ( X ). We then show that A X ( x ) = ∩in=1 A Xi ( xi ), and hence σ ( X ) ⊆ σ ( X1 , . . . , Xn ).

Example 1.15 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F, that define the random vector X ≜ (1 A1 , . . . , 1 An ). Then σ ( X ) = σ ( Ai : i ∈ [n]).

1.3 Independence of random variables


Definition 1.16. A family of collections of events (Ai ⊆ F : i ∈ I ) is called independent, if for any finite set
F ⊆ I and Ai ∈ Ai for all i ∈ F, we have

P(∩i∈ F Ai ) = ∏i∈ F P( Ai ).

Definition 1.17 (Independent and identically distributed). A random vector X : Ω → Rn defined on the
probability space (Ω, F, P) is called independent if
FX ( x ) = ∏in=1 FXi ( xi ), for all x ∈ Rn .

The random vector X is called identically distributed if each of its components have the identical marginal
distribution, i.e.
FXi = FX1 , for all i ∈ [n].

Remark 2. Independence of a random vector implies that events ( A Xi ( xi ) : i ∈ [n]) are independent for any
x ∈ Rn . Defining families Ai ≜ ( A Xi ( x ) : x ∈ R) for all i ∈ [n], we observe that the families (A1 , . . . , An ) are
mutually independent.
Remark 3. In general, if two collections of events are mutually independent, then the event spaces generated
by them are independent. This can be proved using Dynkin’s π-λ theorem.
Theorem 1.18. For an independent random vector X : Ω → Rn defined on a probability space (Ω, F, P), the event
spaces generated by its components (σ ( Xi ) : i ∈ [n]) are independent.
Proof. For each i ∈ [n], we define a family of events Ai ≜ ( Xi−1 (−∞, x ] : x ∈ R). From the definition
of independence of random vectors, the families (Ai ⊆ F : i ∈ [n]) are mutually independent. Since σ (Ai ) =
σ( Xi ), the result follows from the previous remark.
Definition 1.19 (Independent random vectors). Two random vectors X : Ω → Rn and Y : Ω → Rm defined
on the same probability space (Ω, F, P) are independent, if the collection of events ( A X ( x ) : x ∈ Rn ) and
( AY (y) : y ∈ Rm ) are independent, where A X ( x ) ≜ ∩in=1 Xi−1 (−∞, xi ] and AY (y) ≜ ∩im=1 Yi−1 (−∞, yi ].

Example 1.20 (Independent random vectors). Consider the set of vectors {(0, 0, 1), (1, 0, 0)} ⊆ R3 and two
independent coin tosses, such that Ω = { H, T }2 , F = P (Ω), and P(ω ) = pk2 (ω ) (1 − p)2−k2 (ω ) ,
where k2 (ω ) = ∑2i=1 1{ωi = H } . We define random vectors

X = (0, 0, 1)1{ω1 = H } + (1, 0, 0)1{ω1 =T } , Y = (0, 0, 1)1{ω2 = H } + (1, 0, 0)1{ω2 =T } .

We can verify that X, Y : Ω → R3 are mutually independent random vectors, though we can also check that
X1 , X3 are dependent random variables and so are Y1 , Y3 .

1.4 Discrete random vectors


Definition 1.21 (Discrete random vectors). If a random vector X : Ω → X1 × · · · × Xn ⊆ Rn takes countable
values in Rn , then it is called a discrete random vector. That is, the range of random vector X is countable,
and the random vector is completely specified by the probability mass function
PX ( x ) = P(∩in=1 { Xi = xi }) for all x ∈ X1 × · · · × Xn .
Remark 4. For an independent discrete random vector X : Ω → Rn , we have PX ( x ) = ∏in=1 PXi ( xi ) for each
x ∈ Rn .

Example 1.22 (Multiple coin tosses). Consider a probability space (Ω, F, P) with Ω = { H, T }n , F = 2Ω ,
and P(ω ) = 1/2n for all ω ∈ Ω.
Consider the random vector X : Ω → Rn such that Xi (ω ) = 1{ωi = H } for each i ∈ [n]. Observe that X is a
bijection from the sample space to the set {0, 1}n . In particular, X is a discrete random vector.
For any x ∈ Rn , we can write N ( x ) = ∑in=1 1[0,1) ( xi ), the number of components of x lying in [0, 1). Then, we can write the joint distribution as

FX ( x ) = 1 if xi ⩾ 1 for all i ∈ [n]; FX ( x ) = 1/2N ( x) if xi ⩾ 0 for all i ∈ [n]; and FX ( x ) = 0 if xi < 0 for some i ∈ [n].

We can derive the marginal distribution for the i-th component as

FXi ( xi ) = 1 for xi ⩾ 1; FXi ( xi ) = 1/2 for xi ∈ [0, 1); and FXi ( xi ) = 0 for xi < 0.

Since FX ( x ) = ∏in=1 FXi ( xi ), it follows that X is an i.i.d. random vector.
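The factorization of the joint distribution into the marginals can be confirmed by exact enumeration of the 2n equally likely outcomes; a sketch for n = 3 fair coins:

```python
import itertools
import numpy as np

n = 3
outcomes = list(itertools.product([0, 1], repeat=n))   # each outcome has mass 2^-n

def F_X(x):
    """Joint CDF P(X_1 <= x_1, ..., X_n <= x_n) by exact enumeration."""
    return sum(2.0 ** -n for w in outcomes if all(w[i] <= x[i] for i in range(n)))

def F_Xi(xi):
    """Marginal CDF of a single fair-coin indicator."""
    return 1.0 if xi >= 1 else (0.5 if xi >= 0 else 0.0)

# The joint CDF factorizes into the marginals at every test point.
for x in itertools.product([-0.5, 0.5, 1.5], repeat=n):
    assert abs(F_X(x) - np.prod([F_Xi(xi) for xi in x])) < 1e-12
```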

1.5 Continuous random vectors


Definition 1.23 (Joint density function). For a jointly continuous random vector X : Ω → Rn with joint
distribution function FX : Rn → [0, 1], there exists a joint density function f X : Rn → [0, ∞) such that

f X ( x ) = ∂n FX ( x )/(∂x1 · · · ∂xn ), and FX ( x ) = ∫u1 ⩽ x1 · · · ∫un ⩽ xn f X (u1 , . . . , un ) du1 · · · dun .

Remark 5. For an independent continuous random vector X : Ω → Rn , we have f X ( x ) = ∏in=1 f Xi ( xi ) for all
x ∈ Rn .

Example 1.24 (Gaussian random vectors). For a probability space (Ω, F, P), a Gaussian random vector is a
continuous random vector X : Ω → Rn defined by its density function

f X ( x ) = (1/√((2π )n det(Σ))) exp(−(1/2)( x − µ)T Σ−1 ( x − µ)) for all x ∈ Rn ,

where µ ∈ Rn is the mean vector and Σ ∈ Rn×n is a positive definite covariance matrix. The components of
the Gaussian random vector are Gaussian random variables, which are independent when Σ is a diagonal
matrix, and in addition identically distributed when Σ = σ2 I and all components of µ are equal.
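For a diagonal Σ the joint density should factor into a product of one-dimensional Gaussian densities; a numerical spot check, with hypothetical parameters µ = (1, −2) and Σ = diag(4, 9):

```python
import numpy as np

mu = np.array([1.0, -2.0])                 # hypothetical mean vector
Sigma = np.diag([4.0, 9.0])                # hypothetical diagonal covariance

def f_X(x):
    """Bivariate Gaussian density with mean mu and covariance Sigma."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / np.sqrt(
        (2 * np.pi) ** 2 * np.linalg.det(Sigma))

def f_1d(x, m, var):
    """One-dimensional Gaussian density with mean m and variance var."""
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# With diagonal Sigma, the joint density is the product of the marginal densities.
for x in [np.array([0.0, 0.0]), np.array([2.0, -3.0]), mu]:
    assert abs(f_X(x) - f_1d(x[0], 1.0, 4.0) * f_1d(x[1], -2.0, 9.0)) < 1e-12
```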

Lecture-06: Transformation of random vectors

1 Functions of random variables


Definition 1.1. The Borel σ-algebra on Rn is denoted by B(Rn ) and is generated by the collection
(πi−1 (−∞, x ] : x ∈ R, i ∈ [n]). A function g : Rn → Rm is called a Borel measurable function if g−1 ( Bm ) ∈
B(Rn ) for any Bm ∈ B(Rm ).
Proposition 1.2. Consider a random variable X : Ω → R defined on the probability space (Ω, F, P). Suppose g :
R → R is a function such that g−1 (−∞, x ] ∈ B(R) for all x ∈ R, then g( X ) is a random variable.

Proof. We represent g( X ) by a map Y : Ω → R such that Y (ω ) ≜ ( g ◦ X )(ω ) for all outcomes ω ∈ Ω. We


further check that for any half open set Bx = (−∞, x ], we have Y −1 ( Bx ) = ( X −1 ◦ g−1 )( Bx ). Since g−1 ( Bx ) ∈
B(R), it follows that Y −1 ( Bx ) ∈ F by the definition of random variables.

Example 1.3 (Monotone function of random variables). Let g : R → R be a monotonically increasing


function, then g−1 (−∞, x ] ∈ B(R) for all x ∈ R. Consider a random variable X : Ω → R defined on the
probability space (Ω, F, P), then Y ≜ g( X ) is a random variable with distribution function
FY (y) = P({ g( X ) ⩽ y}) = P({ X ⩽ g−1 (y)}) = FX ( g−1 (y)).

Here, g−1 (y) denotes the functional inverse, and not the inverse image we have typically been using. We can
identify g−1 (y) with the inverse image g−1 {y}, though this inverse image has at most a single element since g
is monotonically increasing.

Example 1.4. Consider a positive random variable X : Ω → R+ defined on a probability space (Ω, F, P). Let
g : R+ → R+ be such that g( x ) = e−θx for all x ∈ R+ and some θ > 0. Then, g is monotonically decreasing
and x = g−1 (y) = − 1θ ln y. This implies that g−1 (−∞, y] = [− 1θ ln y, ∞) ∩ R+ ∈ B(R+ ) for all y ∈ (0, 1], while
g−1 (−∞, y] equals ∅ for y ⩽ 0 and R+ for y > 1, in each case a Borel set. Thus g
is a measurable function, and Y = g( X ) is a random variable.
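For this decreasing g, the distribution of Y works out to FY (y) = 1 − FX (− 1θ ln y) for y ∈ (0, 1); taking X exponential with rate λ gives FY (y) = yλ/θ on (0, 1], which the simulation below checks (λ = 1.5 and θ = 3 are hypothetical choices).

```python
import numpy as np

rng = np.random.default_rng(2)
lam, theta = 1.5, 3.0                      # hypothetical rate and decay parameters
X = rng.exponential(1 / lam, 100_000)      # X >= 0 with F_X(t) = 1 - exp(-lam * t)
Y = np.exp(-theta * X)                     # monotone decreasing transform g(X)

# For decreasing g: F_Y(y) = P(X >= -(1/theta) ln y) = 1 - F_X(-(1/theta) ln y),
# which here works out to F_Y(y) = y ** (lam / theta) on (0, 1].
for y in [0.1, 0.5, 0.9]:
    assert abs(np.mean(Y <= y) - y ** (lam / theta)) < 0.01
```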

Proposition 1.5 (Independence of function of random variables). Let g : R → R and h : R → R be functions


such that g−1 (−∞, x ] and h−1 (−∞, x ] are Borel sets for all x ∈ R. Consider independent random variables X and Y
defined on the probability space (Ω, F, P), then g( X ) and h(Y ) are independent random variables.

Proof. For any u, v ∈ R, we can define inverse images A g (u) ≜ g−1 (−∞, u] and Ah (v) ≜ h−1 (−∞, v]. Since g, h
are Borel measurable, we have A g (u), Ah (v) ∈ B(R). We can write the following outcome set equality for
the joint event

{ g( X ) ⩽ u} ∩ {h(Y ) ⩽ v} = { X ∈ g−1 (−∞, u]} ∩ {Y ∈ h−1 (−∞, v]} = X −1 ( A g (u)) ∩ Y −1 ( Ah (v)) ∈ F.

Since X and Y are independent random variables, it follows that X −1 ( A g (u)) and Y −1 ( Ah (v)) are indepen-
dent events, and the result follows.

2 Function of random vectors
Proposition 2.1. Consider a random vector X : Ω → Rn defined on the probability space (Ω, F, P), and a Borel
measurable function g : Rn → Rm such that A g (y) ≜ ∩mj=1 { x ∈ Rn : g j ( x ) ⩽ y j } ∈ B(Rn ) for all y ∈ Rm . Then,
g( X ) : Ω → Rm is a random vector. The joint distribution function FY : Rm → [0, 1] for the vector Y ≜ g( X ) is given
by
FY (y) = P( X −1 ( A g (y))), for all y ∈ Rm .

Example 2.2 (Sum of random variables). Consider a random vector X : Ω → Rn defined on a probability space
(Ω, F, P). Define an addition function + : Rn → R such that +( x ) = ∑in=1 xi for any x ∈ Rn . We can verify
that + is a Borel measurable function and hence Y = +( X ) = ∑in=1 Xi is a random variable. When n = 2 and
X is a continuous random vector with density f X : R2 → R+ , we can write

FY (y) = P({Y ⩽ y}) = P({ X1 + X2 ⩽ y}) = ∫x1 ∈R ∫x2 ⩽y− x1 f X ( x1 , x2 ) dx2 dx1 .

By applying a change of variable ( x1 , t) = ( x1 , x1 + x2 ) and changing the order of integration, we see that

FY (y) = ∫t⩽y ∫x1 ∈R f X ( x1 , t − x1 ) dx1 dt.

When Y is a continuous random variable, we can write

f Y (y) = dFY (y)/dy = ∫x∈R f X ( x, y − x ) dx.

When X : Ω → R2 is an independent random vector, then f X ( x ) = f X1 ( x1 ) f X2 ( x2 ) for all x ∈ R2 . Therefore, the
density of the sum X1 + X2 is given by

f Y (y) = ∫x∈R f X1 ( x ) f X2 (y − x ) dx = ( f X1 ∗ f X2 )(y),

where ∗ : RR × RR → RR is the convolution operator.
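The convolution formula can be tested against a case with a known closed form: for two independent Uniform(0, 1) variables, ( f X1 ∗ f X2 )(y) is the triangular density on [0, 2], which a simulated histogram should reproduce. A sketch using numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.random(200_000) + rng.random(200_000)   # X1 + X2 for independent Uniform(0,1)

def f_sum(y):
    """Triangular density (f_X1 * f_X2)(y) obtained from the convolution integral."""
    if 0 <= y <= 1:
        return y
    if 1 < y <= 2:
        return 2 - y
    return 0.0

# A normalized histogram of the simulated sum should match the convolution.
hist, edges = np.histogram(S, bins=40, range=(0, 2), density=True)
centers = (edges[:-1] + edges[1:]) / 2
assert max(abs(hist[i] - f_sum(c)) for i, c in enumerate(centers)) < 0.05
```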

Theorem 2.3. Consider a continuous random vector X : Ω → Rm defined on the probability space (Ω, F, P) with density
f X : Rm → R+ and an injective and smooth Borel measurable function g : Rm → Rm , such that Y = g( X ) is a
continuous random vector. Then the density of the random vector Y is given by

f Y (y) = f X ( x )/ | J (y)| ,

where x = g−1 (y) and J (y) = ( Jij (y) ≜ ∂y j /∂xi : i, j ∈ [m]) is the Jacobian matrix.

Proof. For an injective map g : Rm → Rm we have { x } = g−1 {y} for any y ∈ g(Rm ). Further, since g is
smooth, we have dy = J (y)dx + o (|dx |), and thus

|dy| = | J (y)| |dx | + o (|dx |). (1)

Defining the set dB(y) ≜ {w ∈ Rm : y j ⩽ w j ⩽ y j + dy j }, we observe that for any continuous random vector
Y : Ω → Rm , we have

P ◦ Y −1 (dB(y)) = f Y (y) |dy| = P ◦ X −1 (dB( x )) = f X ( x ) |dx | . (2)

We get the result by combining (1) and (2).
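A Monte Carlo spot check of the theorem for a linear map g( x ) = Ax, where J = A and f Y (y) = f X ( A−1 y)/ |det A|; the matrix A below is a hypothetical invertible choice and X is standard Gaussian in R2 .

```python
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[2.0, 1.0], [0.0, 3.0]])     # hypothetical invertible matrix, g(x) = A x

X = rng.standard_normal((500_000, 2))      # standard Gaussian in R^2
Y = X @ A.T                                # Y = g(X), applied row by row

def f_X(x):
    """Standard bivariate Gaussian density."""
    return np.exp(-0.5 * np.sum(x ** 2)) / (2 * np.pi)

def f_Y(y):
    """Change-of-variables density: f_Y(y) = f_X(A^{-1} y) / |det A|."""
    return f_X(np.linalg.solve(A, y)) / abs(np.linalg.det(A))

# Monte Carlo check: P(Y in a small box around y0) / area ~ f_Y(y0).
y0, h = np.array([1.0, 0.5]), 0.2
inside = np.all(np.abs(Y - y0) <= h / 2, axis=1)
assert abs(np.mean(inside) / h ** 2 - f_Y(y0)) < 0.01
```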

Example 2.4 (Sum of random variables). Suppose that X : Ω → R2 is a continuous random vector and
Y1 = X1 + X2 . Let us compute f Y1 (y1 ) using the above theorem. Let us define a random vector Y : Ω → R2
such that Y = ( X1 + X2 , X2 ), so that | J (y)| = 1. This implies f Y (y) = f X ( x ) = f X (y1 − y2 , y2 ). Thus, we may
compute the marginal density of Y1 as

f Y1 (y1 ) = ∫y2 ∈R f Y (y1 , y2 ) dy2 = ∫y2 ∈R f X (y1 − y2 , y2 ) dy2 .

If X is an independent random vector, then

f Y1 (y1 ) = ∫y2 ∈R f X1 (y1 − y2 ) f X2 (y2 ) dy2 = ( f X1 ∗ f X2 )(y1 ),

where ∗ represents convolution.

Lecture-07: Random Processes

1 Introduction
Remark 1. For an arbitrary index set T, and a real-valued function x ∈ RT , the projection operator πt : RT →
R maps x ∈ RT to πt ( x ) = xt .
Definition 1.1 (Random process). Let (Ω, F, P) be a probability space. For an arbitrary index set T and
state space X ⊆ R, a map X : Ω → XT is called a random process if the projections Xt : Ω → X defined by
ω 7→ Xt (ω ) ≜ (πt ◦ X )(ω ) are random variables on the given probability space.
Definition 1.2. For each outcome ω ∈ Ω, we have a function X (ω ) : T → X called the sample path or the
sample function of the process X.
Remark 2. A random process X defined on probability space (Ω, F, P) with index set T and state space
X ⊆ R, can be thought of as
(a) a map X : Ω × T → X,
(b) a map X : T → XΩ , i.e. a collection of random variables Xt : Ω → X for each time t ∈ T,
(c) a map X : Ω → XT , i.e. a collection of sample functions X (ω ) : T → X for each random outcome ω ∈ Ω.

1.1 Classification
State space X can be countable or uncountable, corresponding to discrete or continuous valued process. If
the index set T ⊆ R is countable, the stochastic process is called discrete-time stochastic process or random
sequence. When the index set T is uncountable, it is called continuous-time stochastic process. The index
set T doesn’t have to be time, if the index set is space, and then the stochastic process is spatial process.
When T = Rn × [0, ∞), stochastic process X is a spatio-temporal process.

Example 1.3. We list some examples of each such stochastic process.


(i) Discrete random sequence: brand switching, discrete-time queues, number of people at a bank each day.
(ii) Continuous random sequence: stock prices, currency exchange rates, waiting time in queue of the nth
arrival, workload at arrivals in time-sharing computer systems.
(iii) Discrete random process: counting processes, population sampled at birth-death instants, number of
people in queues.
(iv) Continuous random process: water level in a dam, waiting time till service in a queue, location of a
mobile node in a network.

1.2 Measurability
For random process X : Ω → XT defined on the probability space (Ω, F, P), the projections Xt ≜ πt ◦ X are
F-measurable random variables. Therefore, the set of outcomes A Xt ( x ) ≜ Xt−1 (−∞, x ] ∈ F for all t ∈ T and
x ∈ R.

Definition 1.4. A random map X : Ω → XT is called F-measurable and hence a random process, if the set
of outcomes A Xt ( x ) = Xt−1 (−∞, x ] ∈ F for all t ∈ T and x ∈ R.

Definition 1.5. The event space generated by a random process X : Ω → XT defined on a probability space
(Ω, F, P) is given by
σ ( X ) ≜ σ ( A Xt ( x ) : t ∈ T, x ∈ R).

Definition 1.6. For a random process X : Ω → XT defined on the probability space (Ω, F, P), we define the
projection of X onto components S ⊆ T as the random vector XS : Ω → XS , where XS ≜ ( Xs : s ∈ S).

Remark 3. Recall that πt−1 (−∞, x ] = ×s∈T (−∞, xs ] where xs = x for s = t and xs = ∞ for all s ̸= t. The
F-measurability of process X implies that for any countable set S ⊆ T, we have A XS ( xS ) ≜ ∩s∈S A Xs ( xs ) ∈ F
for xS ∈ XS .
Remark 4. We can define A X ( x ) ≜ ∩t∈T A Xt ( xt ) for any x ∈ RT . However, A X ( x ) is guaranteed to be an
event only when S ≜ {t ∈ T : πt ( x ) < ∞} is a countable set. In this case,

A X ( x ) = ∩t∈T A Xt ( xt ) = ∩s∈S A Xs ( xs ) = A XS ( xS ) ∈ F.

Example 1.7 (Bernoulli sequence). Consider a sample space Ω = { H, T }N . We define a mapping X : Ω →
{0, 1}N such that Xn (ω ) = 1{ H } (ωn ) = 1{ωn = H } . The map X is an F-measurable random sequence if each
Xn : Ω → {0, 1} is an F-measurable random variable on the probability space (Ω, F, P). Therefore,
the event space F must contain the event space generated by the sequence of events E ∈ FN defined by En ≜
{ω ∈ Ω : Xn (ω ) = 1} = {ω ∈ Ω : ωn = H } ∈ F for all n ∈ N. That is,

σ( X ) = σ( E) = σ ({ En : n ∈ N}).

1.3 Distribution
Definition 1.8. For a random process X : Ω → XT defined on the probability space (Ω, F, P), we define a
finite dimensional distribution FXS : RS → [0, 1] for a finite S ⊆ T by

FXS ( xS ) ≜ P( A XS ( xS )), x S ∈ RS .

Example 1.9. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N , the event
space F ≜ σ ( E) where En = {ω ∈ Ω : ωn = H } for n ∈ N, and the probability measure P : F → [0, 1] defined
by
P(∩i∈ F Ei ) = p| F| , for all finite F ⊆ N.
Let X : Ω → {0, 1}N defined as Xn (ω ) = 1En (ω ) for all outcomes ω ∈ Ω and n ∈ N. For this random
sequence, we can obtain the finite dimensional distribution FXS : RS → [0, 1] for any finite S ⊆ T and x ∈ RS
in terms of I0 ( x ) ≜ {i ∈ S : xi < 0} and I1 ( x ) ≜ {i ∈ S : xi ∈ [0, 1)}, as

FXS ( x ) = 1 if I0 ( x ) ∪ I1 ( x ) = ∅; FXS ( x ) = (1 − p)| I1 ( x)| if I0 ( x ) = ∅, I1 ( x ) ̸= ∅; and FXS ( x ) = 0 if I0 ( x ) ̸= ∅. (1)

To define a measure on a random process, we can either put a measure on subsets of sample paths
( X (ω ) ∈ RT : ω ∈ Ω), or equip the collection of random variables ( Xt ∈ RΩ : t ∈ T ) with a joint measure.

Either way, we are interested in identifying the joint distribution FX : RT → [0, 1]. To this end, for any x ∈ RT ,
we need to know

FX ( x ) ≜ P(∩t∈T {ω ∈ Ω : Xt (ω ) ⩽ xt }) = P(∩t∈T Xt−1 (−∞, xt ]) = P( A X ( x )).

First of all, we don’t know whether A X ( x ) is an event when T is uncountable. Though, we can verify
that A X ( x ) ∈ F for x ∈ RT such that {t ∈ T : xt < ∞} is countable. Second, even for a simple independent
process with countably infinite T, any function of the above form would be zero if xt is finite for all t ∈ T.
Therefore, we only look at the values of FX ( x ) for x ∈ RT where {t ∈ T : xt < ∞} is finite. That is, for any
finite set S ⊆ T, we focus on the events A XS ( xS ) and their probabilities. However, these are precisely the finite
dimensional distributions. The set of all finite dimensional distributions of the stochastic process X : Ω → XT
characterizes its distribution completely.

Example 1.10. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N and the
event space F ≜ σ ( E) where En = {ω ∈ Ω : ωn = H } for all n ∈ N. Let X : Ω → {0, 1}N be defined as Xn (ω ) =
1En (ω ) for all outcomes ω ∈ Ω and n ∈ N. Suppose that for this random sequence we are given the finite
dimensional distributions FXS : RS → [0, 1] for all finite S ⊆ T and x ∈ RS in terms of the sets I0 ( x ) ≜ {i ∈ S : xi < 0} and
I1 ( x ) ≜ {i ∈ S : xi ∈ [0, 1)}, as defined in Eq. (1). Then, we can recover the probability measure P : F → [0, 1] as
P(∩i∈ F Ei ) = p| F| , for all finite F ⊆ N.

1.4 Independence
Definition 1.11. A random process is independent if the collection of event spaces (σ ( Xt ) : t ∈ T ) is independent.
That is, for all finite S ⊆ T and xS ∈ RS , we have

FXS ( xS ) = P(∩s∈S { Xs ⩽ xs }) = ∏s∈S P({ Xs ⩽ xs }) = ∏s∈S FXs ( xs ).

That is, independence of a random process is equivalent to factorization of any finite dimensional distribu-
tion function into product of individual marginal distribution functions.

Example 1.12. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N , the event
space F ≜ σ ( E) where En = {ω ∈ Ω : ωn = H } for all n ∈ N, and the probability measure P : F → [0, 1]
defined by
P(∩i∈ F Ei ) = p| F| , for all finite F ⊆ N.
Then, we observe that the random sequence X : Ω → {0, 1}N defined by Xn (ω ) ≜ 1En (ω ) for all outcomes
ω ∈ Ω and n ∈ N, is independent.

Definition 1.13. Two stochastic processes X : Ω → XT1 , Y : Ω → YT2 are independent, if the corresponding
event spaces σ( X ), σ(Y ) are independent. That is, for any x ∈ RS1 , y ∈ RS2 for finite S1 ⊆ T1 , S2 ⊆ T2 , the
events AS1 ( x ) ≜ ∩s∈S1 Xs−1 (−∞, xs ] and BS2 (y) ≜ ∩s∈S2 Ys−1 (−∞, ys ] are independent. That is, the joint finite
dimensional distribution of X and Y factorizes, and

P( AS1 ( x ) ∩ BS2 (y)) = P( AS1 ( x )) P( BS2 (y)) = FXS1 ( x ) FYS2 (y), x ∈ RS1 , y ∈ RS2 .

Lecture-08: Expectation

1 Expectation

Example 1.1. Consider a probability space (Ω, F, P). We consider N trials of a random experiment, and
define a random vector X : Ω → X N such that Xi ≜ πi ◦ X : Ω → X is a discrete random variable associated
with the trial i ∈ [ N ]. If the marginal distributions of random variables ( X1 , . . . , X N ) are identical with the
common probability mass function PX1 : X → [0, 1], then the empirical mean of random variable X1 can be
written as

m̂ ≜ (1/N ) ∑iN=1 Xi .

For a random variable X1 : Ω → X, we can define events EX1 ( x ) ≜ X1−1 { x } for each value x ∈ X. The
probability mass function PX1 : X → [0, 1] for the discrete random variable X1 can be estimated for each
x ∈ X as the empirical probability mass function

P̂X1 ( x ) ≜ (1/N ) ∑iN=1 1EXi ( x) .

Recall that a simple random variable X1 can be written as X1 = ∑x∈X x1EX1 ( x) , where EX1 ≜ ( EX1 ( x ) ∈
F : x ∈ X) is a finite partition of the sample space Ω and PX1 ( x ) = P( EX1 ( x )). That is, we can write the
empirical mean in terms of the empirical PMF as

m̂ = (1/N ) ∑iN=1 ∑x∈X x1EXi ( x) (ω ) = ∑x∈X x P̂X1 ( x ).
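The identity m̂ = ∑x∈X x P̂X1 ( x ) holds exactly for every realization, not just in the limit; a short simulation with a hypothetical three-point PMF:

```python
import numpy as np

rng = np.random.default_rng(5)
support = np.array([0.0, 1.0, 2.0])
pmf = np.array([0.5, 0.3, 0.2])            # hypothetical PMF of X1

X = rng.choice(support, size=50_000, p=pmf)              # N i.i.d. trials

m_hat = X.mean()                                         # empirical mean
pmf_hat = np.array([np.mean(X == x) for x in support])   # empirical PMF

# The empirical mean equals the mean computed from the empirical PMF, exactly.
assert abs(m_hat - support @ pmf_hat) < 1e-9
# Both are close to the true mean sum_x x P_X1(x) = 0.7.
assert abs(m_hat - support @ pmf) < 0.02
```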

This example motivates the following definition of mean for simple random variables.
Definition 1.2 (Expectation of simple random variable). The mean or expectation of a simple random
variable X : Ω → X ⊆ R defined on a probability space (Ω, F, P), is denoted by E[ X ] and defined as

E[ X ] ≜ ∑x∈X xPX ( x ).

Remark 1. For an indicator random variable 1 A , we have E1 A = P( A). That is, the expectation of an indi-
cator function is the probability of the indicated set.
Remark 2. Since a simple random variable can be written as X = ∑x∈X x1EX ( x) where EX ( x ) ≜ X −1 { x } for
all x ∈ X, we can write the expectation of a simple random variable as an integral

E[ X ] = ∫Ω X (ω ) P(dω ) = ∫Ω ∑x∈X x1EX ( x) (ω ) P(dω ) = ∑x∈X x ∫Ω 1EX ( x) (ω ) P(dω ) = ∑x∈X x E[1EX ( x) ] = ∑x∈X xPX ( x ).

Theorem 1.3. Consider a non-negative random variable X : Ω → R+ defined on a probability space (Ω, F, P). There
exists a sequence of non-decreasing non-negative simple random variables Y : Ω → RN+ such that for all ω ∈ Ω

Yn (ω ) ⩽ Yn+1 (ω ) for all n ∈ N, and limn Yn (ω ) = X (ω ).

Then E[Yn ] is defined for each n ∈ N, the sequence (E[Yn ] ∈ R+ : n ∈ N) is non-decreasing, and the limit limn E[Yn ] ∈
R+ ∪ {∞} exists. This limit is independent of the choice of the sequence and depends only on the probability space.

Proof. For each n ∈ N and k ∈ {0, . . . , 22n − 1}, we define half-open sets Bn,k ≜ (k2−n , (k + 1)2−n ]. Then,
the collection of sets Bn ≜ ( Bn,k : k ∈ {0, . . . , 22n − 1}) partitions the set (0, 2n ] for each n ∈ N. Further, we
observe that ∪n∈N (0, 2n ] = R+ and that Bn+1,2k ∪ Bn+1,2k+1 = Bn,k for all n ∈ N and k.
For a non-negative random variable X : Ω → R+ , we define events AXn,k ≜ X −1 ( Bn,k ) ∈ F, and a sequence
of simple non-negative random variables Y : Ω → RN+ in the following fashion:

Yn (ω ) ≜ ∑k 1 AXn,k (ω ) inf { X (ω ′ ) : ω ′ ∈ AXn,k } = ∑k k2−n 1 AXn,k (ω ),

where the sums run over k ∈ {0, . . . , 22n − 1}.
We observe that Yn is a quantized version of X, and its value is the left end-point k2−n when X ∈ Bn,k for
each k ∈ {0, . . . , 22n − 1}. Since ∪k AXn,k = X −1 (0, 2n ], it follows that we cover the positive real line as
n grows larger and the step size grows smaller. Thus, the limiting random variable can take all possible
non-negative real values. We observe that

Yn+1 (ω ) = ∑k∈{0,...,22(n+1) −1} 1 AXn+1,k (ω ) inf { X (ω ′ ) : ω ′ ∈ AXn+1,k }

⩾ ∑k∈{0,...,22n −1} (1 AXn+1,2k (ω ) inf { X (ω ′ ) : ω ′ ∈ AXn+1,2k } + 1 AXn+1,2k+1 (ω ) inf { X (ω ′ ) : ω ′ ∈ AXn+1,2k+1 })

⩾ ∑k∈{0,...,22n −1} 1 AXn,k (ω ) inf { X (ω ′ ) : ω ′ ∈ AXn,k } = Yn (ω ).

We see that Yn (ω ) ⩽ Yn+1 (ω ) ⩽ X (ω ) and limn Yn (ω ) = X (ω ) for all ω ∈ Ω.


Since Yn : Ω → R+ is a simple random variable for all n ∈ N, the expectation E[Yn ] is defined for all n,
and can be written as

E[Yn ] = ∑k∈{0,...,22n −1} k2−n [ FX ((k + 1)2−n ) − FX (k2−n )].

We observe that this expectation is completely specified by the distribution function FX , and we can write
the limit

limn E[Yn ] = limn ∑k∈{0,...,22n −1} k2−n [ FX ((k + 1)2−n ) − FX (k2−n )] = ∫R+ x dFX ( x ).

Definition 1.4 (Expectation of a non-negative random variable). For a non-negative random variable X :
Ω → R defined on the probability space (Ω, F, P), consider the sequence of non-decreasing simple random
variables Y : Ω → RN+ such that limn Yn = X. The expectation of the non-negative random variable X is
defined as
E[ X ] ≜ lim E[Yn ].
n

Remark 3. From the definition, it follows that E[ X ] = ∫R+ x dFX ( x ).
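The dyadic approximation in the proof of Theorem 1.3 can be reproduced numerically: quantizing samples of a non-negative random variable to the grid k2−n (truncating beyond 2n ) yields non-decreasing means converging to E[ X ]. The sketch below uses a hypothetical Exponential(1) variable; np.floor agrees with the left end-point rule except on a null set of dyadic points.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.exponential(1.0, 200_000)          # non-negative samples, E[X] = 1

def Y_n(x, n):
    """Dyadic quantization: value k 2^-n on (k 2^-n, (k+1) 2^-n], zero beyond 2^n.
    np.floor matches the left end-point rule except on a null set of dyadic points."""
    return np.where(x <= 2.0 ** n, np.floor(x * 2.0 ** n) * 2.0 ** -n, 0.0)

means = [Y_n(X, n).mean() for n in range(1, 8)]
assert all(a <= b + 1e-12 for a, b in zip(means, means[1:]))   # E[Y_n] non-decreasing
assert abs(means[-1] - X.mean()) < 0.01                        # E[Y_n] -> E[X]
```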

Definition 1.5 (Expectation of a real random variable). For a real-valued random variable X defined on a
probability space (Ω, F, P), we can define the following functions

X+ ≜ max { X, 0} , X− ≜ max {0, − X } .

We can verify that X+ , X− are non-negative random variables and hence their expectations are well defined.
We observe that X (ω ) = X+ (ω ) − X− (ω ) for each ω ∈ Ω. If at least one of the E[ X+ ] and E[ X− ] is finite,
then the expectation of the random variable X is defined as

E[ X ] ≜ E[ X+ ] − E[ X− ].

Theorem 1.6 (Expectation as an integral with respect to the distribution function). For a random variable
X : Ω → R defined on the probability space (Ω, F, P), the expectation is given by
E[X] = ∫_{x∈R} x dFX (x).

Proof. It suffices to show this for a non-negative random variable X, and the result follows from the definition of the expectation of a non-negative random variable as the limit of expectations of the approximating simple functions.

2 Properties of Expectations
Theorem 2.1 (Properties). Let X : Ω → R be a random variable defined on the probability space (Ω, F, P).
(i) Linearity: Let a, b ∈ R and X, Y be random variables defined on the probability space (Ω, F, P). If EX, EY,
and aEX + bEY are well defined, then E( aX + bY ) is well defined and
E( aX + bY ) = aEX + bEY.

(ii) Monotonicity: If P { X ⩾ Y } = 1 and E[Y ] is well defined with E[Y ] > −∞, then E[ X ] is well defined and
E [ X ] ⩾ E [Y ] .
(iii) Functions of random variables: Let g : R → R be a Borel measurable function, then g(X) is a random variable with E[g(X)] = ∫_{x∈R} g(x) dFX (x).
(iv) Continuous random variables: Let fX : R → [0, ∞) be the density function, then EX = ∫_{x∈R} x fX (x) dx.
(v) Discrete random variables: Let PX : X → [0, 1] be the probability mass function, then EX = ∑_{x∈X} xPX (x).
(vi) Integration by parts: The expectation EX = ∫_{x⩾0} (1 − FX (x)) dx − ∫_{x<0} FX (x) dx is well defined when at least one of the two terms on the right hand side is finite.
Proof. It suffices to show properties (i ) − (iii ) for simple random variables.
(i) Let X = ∑_{x∈X} x1_{EX(x)} and Y = ∑_{y∈Y} y1_{EY(y)} be simple random variables; then (EX(x) ∩ EY(y) ∈ F : (x, y) ∈ X × Y) partitions the sample space Ω. Hence, we can write aX + bY = ∑_{(x,y)∈X×Y} (ax + by)1_{EX(x)∩EY(y)}, and from the linearity of finite sums it follows that

E[aX + bY] = ∑_{(x,y)∈X×Y} (ax + by) P{EX(x) ∩ EY(y)} = a ∑_{x∈X} x ∑_{y∈Y} P{EX(x) ∩ EY(y)} + b ∑_{y∈Y} y ∑_{x∈X} P{EX(x) ∩ EY(y)} = a ∑_{x∈X} xP(EX(x)) + b ∑_{y∈Y} yP(EY(y)) = aEX + bEY.

(ii) From the fact that X − Y ⩾ 0 almost surely and linearity of expectation, it suffices to show that EX ⩾ 0
for non-negative random variable X. It can easily be shown for simple non-negative random variables,
and follows for general non-negative random variables by taking limits.
(iii) It suffices to show this holds true for simple random variables X : Ω → X ⊂ R. Since g : R → R is Borel
measurable, Y ≜ g( X ) : Ω → Y ≜ g(X) is a random variable. It follows that, we can write the following
disjoint union X = ∪y∈Y ∪ x∈ g−1 {y} { x }. Further, for each y ∈ Y, we have

EY (y) = {ω ∈ Ω : ( g ◦ X )(ω ) = y} = X −1 ◦ g−1 {y} = ∪ x∈ g−1 {y} EX ( x ).

Since the events EX(x) are disjoint for all x ∈ X, we get P(EY(y)) = ∑_{x∈g^{−1}{y}} P(EX(x)). Using the above two facts, we can write the expectation

E[Y] = ∑_{y∈Y} yP(EY(y)) = ∑_{y∈g(X)} ∑_{x∈g^{−1}{y}} g(x)P(EX(x)) = ∑_{x∈X} g(x)P(EX(x)).

(iv) For continuous random variables, we have dFX ( x ) = f X ( x )dx for all x ∈ R.
(v) For discrete random variables X : Ω → X, we have dFX ( x ) = PX ( x ) for all x ∈ X and zero otherwise.
(vi) We can write EX = ∫_{x<0} x dFX (x) − ∫_{x⩾0} x d(1 − FX)(x). We apply integration by parts to the first term on the right, to get

∫_{x<0} x dFX (x) = xFX (x)|_{−∞}^{0} − ∫_{x<0} FX (x) dx = −∫_{x<0} FX (x) dx.

Similarly, we apply integration by parts to the second term on the right, to get

−∫_{x⩾0} x d(1 − FX)(x) = −x(1 − FX (x))|_{0}^{∞} + ∫_{x⩾0} (1 − FX (x)) dx = ∫_{x⩾0} (1 − FX (x)) dx.

Here the boundary terms vanish whenever the corresponding integrals are finite.
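For a non-negative random variable, property (vi) reduces to EX = ∫_{x⩾0} (1 − FX(x)) dx. A quick numerical check for X ~ Exponential(1), where 1 − FX(x) = e^{−x} and EX = 1 (variable names are mine):

```python
import math
import random

# For non-negative X, property (vi) gives EX = integral of (1 - F_X(x)) over x >= 0.
# Check for X ~ Exponential(1): 1 - F_X(x) = exp(-x) and EX = 1.
dx = 0.001
tail_integral = sum(math.exp(-k * dx) for k in range(int(50 / dx))) * dx

random.seed(0)
n = 200_000
empirical_mean = sum(random.expovariate(1.0) for _ in range(n)) / n
# tail_integral and empirical_mean both approximate EX = 1
```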

Lecture-09: Moments and Lp spaces

1 Moments
Let X : Ω → R be a random variable defined on the probability space (Ω, F, P) with the distribution function
FX : R → [0, 1].

Example 1.1 (Absolute value function). For the function g = |·| : R → R+, we can write the inverse image of half open sets (−∞, x] for any x ∈ R as Ag(x) = g^{−1}(−∞, x]. It follows that Ag(x) = ∅ ∈ B(R) for x < 0 and Ag(x) = [−x, x] ∈ B(R) for x ∈ R+. Since g^{−1}(−∞, x] ∈ B(R), it follows that |·| : R → R+ is a Borel measurable function.

Lemma 1.2. If E | X | is finite, then EX exists and is finite.


Proof. The function |·| : R → R is a Borel measurable function and hence | X | is a random variable. Further
| X | ⩾ 0, and hence the expectation E | X | always exists. If E | X | is finite, it means EX+ and EX− are both
finite, and hence EX = EX+ − EX− is finite as well.
Corollary 1.3. Let g : R → R be a Borel measurable function. If E | g( X )| is finite, then Eg( X ) exists and is finite.

Exercise 1.4 (Polynomial function). For any k ∈ N, we define functions gk : R → R such that gk :
x 7→ x k . Show that gk is Borel measurable for all k ∈ N.

Definition 1.5 (Moments). We define the kth moment of the random variable X as mk ≜ Egk ( X ) = EX k .
First moment EX is called the mean of the random variable.
Remark 1. If E | X |k is finite, then mk exists and is finite.
Remark 2. If P{|X| ⩽ 1} = 1, then P{|X|^k ⩽ 1} = 1. Therefore, by the monotonicity of expectations E|X|^k ⩽ 1, and the moments mk exist and are finite for all k ∈ N.

2 Lp spaces
Definition 2.1. A vector space V over a field F is a set with (a) a vector addition + : V × V → V that satisfies associativity and commutativity, and has an identity element and an inverse element for each vector v ∈ V, and (b) a scalar multiplication · : F × V → V that satisfies compatibility with field multiplication, distributivity over vector and field addition, and has an identity element.

Remark 3. We can verify that the set of random variables is a vector space over reals, and we denote it
by V. To this end, we can verify the associativity and commutativity of addition of random variables.
For any random variable X, the constant function 0 is the identity random variable, and − X is the additive
inverse for vector additions. We can verify the compatibility of scalar multiplication with random variables,
existence of unit scalar 1, and distributivity of scalar multiplications with respect to field addition and vector
addition.

Definition 2.2. For a vector space V over field F, a norm on the vector space is a map f : V → R+ that
satisfies
homogeneity: f ( av) = | a| f (v) for all a ∈ F and v ∈ V,
sub-additivity: f (v + w) ⩽ f (v) + f (w) for all v, w ∈ V, and

point-separating: f (v) ⩾ 0 for all v ∈ V, with f (v) = 0 only if v = 0.


Definition 2.3. For a probability space (Ω, F, P) and p ⩾ 1, we define the set of random variables with finite absolute pth moment as the vector space Lp ≜ {X ∈ V : (E|X|^p)^{1/p} < ∞}.

Definition 2.4. We define a function ∥·∥p : Lp → R+ by ∥X∥p ≜ (E|X|^p)^{1/p} for any X ∈ Lp and real p ⩾ 1.
Remark 4. For p = 1, the map ∥·∥p is a norm. Therefore, L1 is a normed vector space consisting of random variables with bounded absolute mean.
Remark 5. We will show that ∥ X ∥∞ = sup {| X (ω )| : ω ∈ Ω}, and hence L∞ is a normed vector space of
bounded random variables.
Remark 6. We will also show that ∥∥ p is a norm for all p ∈ (1, ∞), and hence L p is a normed vector space
of random variables with bounded ∥∥ p norm. In particular, the L2 space consists of random variables with
bounded second moment.
Remark 7. If E | X | N is finite for some N ∈ N, then E | X |k is finite for all k ∈ [ N ]. This follows from the
linearity and monotonicity of expectations, and the fact that

| X |k = | X |k 1{|X |⩽1} + | X |k 1{|X |>1} ⩽ 1 + | X | N .

This implies that L N ⊆ Lk for all k ∈ [ N ]. We will show that Lq ⊆ L p for any real numbers 1 ⩽ p ⩽ q.

3 Central Moments

Example 3.1 (Shifted polynomial functions). For any k ∈ N, we define functions hk : R → R such that
hk : x 7→ ( x − m1 )k . Then, hk = gk ( x − m1 ) = gk ◦ f where f : R → R is defined as f ( x ) = x − m1 for all x ∈ R.
Since gk and f are measurable, so is hk .

Definition 3.2 (Central moments). Let X : Ω → R be a random variable defined on the probability space
(Ω, F, P) with finite first moment m1 . We define the kth central moment of the random variable X as
σk ≜ Ehk ( X ) = E( X − m1 )k . The second central moment σ2 = E( X − m1 )2 is called the variance of the
random variable and denoted by σ2 .
Lemma 3.3. The first central moment σ1 = E(X − m1) = 0, and the variance σ2 = E(X − m1)² of a random variable X is always non-negative, with equality iff X is almost surely constant. That is, m2 ⩾ m1², with equality iff X is almost surely constant.

Proof. Recall that h1, h2 are Borel measurable functions, and hence h1(X) = X − m1 and h2(X) = (X − m1)² are random variables. From the linearity of expectations, it follows that σ1 = Eh1(X) = EX − m1 = 0. Since (X − m1)² ⩾ 0 almost surely, it follows from the monotonicity of expectation that 0 ⩽ E(X − m1)². From the linearity of expectation and the expansion of (X − m1)², we get σ2 = EX² − 2m1EX + m1² = m2 − m1² ⩾ 0.
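The identity σ2 = m2 − m1² is easy to verify on a small example; below it is checked for a fair die roll, X uniform on {1, . . . , 6}:

```python
# Check sigma_2 = m2 - m1^2 for X uniform on {1, ..., 6} (a fair die roll)
outcomes = list(range(1, 7))
m1 = sum(outcomes) / 6                        # mean, 3.5
m2 = sum(x * x for x in outcomes) / 6         # second moment, 91/6
variance = sum((x - m1) ** 2 for x in outcomes) / 6
# variance equals m2 - m1^2 = 35/12
```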

Remark 8. If second moment is finite, then the first moment is finite. That is, L2 ⊆ L1 .

4 Inequalities
Theorem 4.1 (Markov’s inequality). Let X : Ω → R be a random variable defined on the probability space
(Ω, F, P). Then, for any monotonically non-decreasing function f : R → R+ , we have
E[ f ( X )]
P { X ⩾ ϵ} ⩽ .
f (ϵ)
Proof. We can verify that any monotonically non-decreasing function f : R → R+ is Borel measurable.
Hence, f ( X ) is a random variable for any random variable X. Therefore,
f ( X ) = f ( X )1{ X ⩾ϵ } + f ( X )1{ X < ϵ } ⩾ f ( ϵ )1{ X ⩾ϵ } .
The result follows from the monotonicity of expectations.
Corollary 4.2 (Markov). Let X be a non-negative random variable, then P{X ⩾ ϵ} ⩽ EX/ϵ for all ϵ > 0.
Corollary 4.3 (Chebyshev). Let X be a random variable with finite mean m1 and variance σ², then

P{|X − m1| ⩾ ϵ} ⩽ σ²/ϵ², for all ϵ > 0.
Proof. Apply the Markov’s inequality for random variable Y = | X − m1 | ⩾ 0 and increasing function f ( x ) =
x2 for x ⩾ 0.
Corollary 4.4 (Chernoff). Let X be a random variable with finite E[eθX ] for some θ > 0, then
P { X ⩾ ϵ} ⩽ e−θϵ E[eθX ], for all ϵ > 0.
Proof. Apply the Markov’s inequality for random variable X and increasing function f ( x ) = eθx > 0 for
θ > 0.
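The three bounds can be compared numerically. The sketch below (the choices ϵ = 10 and θ = 0.9 are mine) uses X ~ Exponential(1), for which EX = Var X = 1 and E[e^{θX}] = 1/(1 − θ) for θ < 1:

```python
import math
import random

random.seed(1)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]  # EX = Var X = 1

eps = 10.0
empirical = sum(1 for x in samples if x >= eps) / n    # true value exp(-10) ~ 4.5e-5

markov = 1.0 / eps                                     # EX / eps
chebyshev = 1.0 / (eps - 1.0) ** 2                     # Var X / (eps - m1)^2, since {X >= eps} is inside {|X - 1| >= eps - 1}
theta = 0.9                                            # E[e^{theta X}] = 1 / (1 - theta) for theta < 1
chernoff = math.exp(-theta * eps) / (1.0 - theta)
# empirical <= chernoff <= chebyshev <= markov for this choice of eps
```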
Definition 4.5 (Convex function). A real-valued function f : R → R is convex if for all x, y ∈ R and θ ∈ [0, 1],
we have
f (θx + (1 − θ )y) ⩽ θ f ( x ) + (1 − θ ) f (y).
Theorem 4.6 (Jensen’s inequality). For any convex function f : R → R and random variable X, we have
f (EX ) ⩽ E f ( X ).
Proof. It suffices to show this for simple random variables X : Ω → X. We show this by induction on the cardinality of the alphabet X. The inequality is trivially true for |X| = 1. We assume that the inductive hypothesis is true for |X| = n.
Let X ∈ X, where |X| = n + 1. We can denote X = {x1, . . . , xn+1} with pi ≜ P{X = xi} for all i ∈ [n + 1]. We observe that (pj/(1 − p1) : j ⩾ 2) is a probability mass function for some random variable Y ∈ Y = {x2, . . . , xn+1} with cardinality n. Hence, by the inductive hypothesis, we have

f(∑_{i=2}^{n+1} (pi/(1 − p1)) xi) = f(EY) ⩽ E f(Y) = ∑_{i=2}^{n+1} (pi/(1 − p1)) f(xi).

Applying the convexity of f to θ = p1, x = x1, y = ∑_{i=2}^{n+1} (pi/(1 − p1)) xi, we get

f(EX) = f(∑_{i=1}^{n+1} pi xi) = f( p1 x1 + (1 − p1) ∑_{i=2}^{n+1} (pi/(1 − p1)) xi ) ⩽ p1 f(x1) + (1 − p1) f( ∑_{i=2}^{n+1} (pi/(1 − p1)) xi ).

From the inductive step, it follows that the RHS is upper bounded by p1 f(x1) + ∑_{i=2}^{n+1} pi f(xi) = E f(X), and the result follows.
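A Monte Carlo illustration of Jensen's inequality for the convex function f(x) = e^x and X standard Gaussian, for which f(EX) = 1 while E f(X) = e^{1/2} ≈ 1.65 (sample size is my choice):

```python
import math
import random

random.seed(2)
n = 50_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]

f = math.exp                              # a convex function
f_of_mean = f(sum(xs) / n)                # f(EX), about e^0 = 1
mean_of_f = sum(f(x) for x in xs) / n     # E f(X), about e^{1/2} ~ 1.649
```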
Theorem 4.7. For any real numbers 1 ⩽ p ⩽ q < ∞, we have Lq ⊆ L p .
Proof. Let 1 ⩽ p ⩽ q < ∞ and consider a convex function g : R+ → R+ defined by g( x ) ≜ x q/p for all x ∈ R+ .
It follows that g(| X | p ) = | X |q and hence from the Jensen’s inequality, we get
∥ X ∥qp = g(E | X | p ) ⩽ Eg(| X | p ) = ∥ X ∥qq .

Lecture-10: Correlation

1 Correlation
Let X : Ω → R and Y : Ω → R be random variables defined on the same probability space (Ω, F, P).

Exercise 1.1. Show that the function g : R² → R defined by g : (x, y) ↦ xy is a Borel measurable function.

Definition 1.2 (Correlation). For two random variables X, Y defined on the same probability space, the cor-
relation between these two random variables is defined as E[ XY ]. If E[ XY ] = E[ X ]E[Y ], then the random
variables X, Y are called uncorrelated.
Lemma 1.3. If X, Y are independent random variables, then they are uncorrelated.
Proof. It suffices to show this for simple independent random variables X, Y. We can write X = ∑_{x∈X} x1_{AX(x)} and Y = ∑_{y∈Y} y1_{AY(y)}. Therefore,

E[XY] = ∑_{(x,y)∈X×Y} xyP{AX(x) ∩ AY(y)} = ∑_{x∈X} xP(AX(x)) ∑_{y∈Y} yP(AY(y)) = E[X]E[Y],

where the second equality uses the independence of X and Y.

Alternative proof. If X, Y are independent random variables, then the joint distribution FX,Y(x, y) = FX(x)FY(y) for all (x, y) ∈ R². Therefore,

E[XY] = ∫_{(x,y)∈R²} xy dFX,Y(x, y) = ∫_{x∈R} x dFX(x) ∫_{y∈R} y dFY(y) = E[X]E[Y].

Example 1.4 (Uncorrelated dependent random variables). Let X : Ω → R be a continuous random variable
with even density function f X : R → R+ , and g : R → R+ be another even function that is increasing for
y ∈ R+ . Then g is Borel measurable function and Y = g( X ) is a random variable. Further, we can verify
that X, Y are uncorrelated and dependent random variables.
To show dependence of X and Y, we take positive x, y such that FX(x) < 1 and x > x_y, where {x_y} = g^{−1}{y} ∩ R+. Then, we can write the set

AY(y) = Y^{−1}(−∞, y] = X^{−1}[−x_y, x_y].

Hence, we can write the joint distribution at (x, y) as

FX,Y(x, y) = P{X ⩽ x, Y ⩽ y} = P(AX(x) ∩ AY(y)) = P(AY(y)) = FY(y) ≠ FX(x)FY(y).

To show that X, Y are uncorrelated, note that since X has an even density function, we have fX(x) = fX(−x) for all x ∈ R. Therefore, substituting x = −u,

E[Xg(X)1{X<0}] = ∫_{x<0} xg(x)fX(x)dx = ∫_{u>0} (−u)g(−u)fX(−u)du = −E[Xg(X)1{X>0}].

The last equality follows from the fact that g and f X are even. Therefore, we have
E[ Xg( X )] = E[ Xg( X )1{ X <0} ] + E[ Xg( X )1{ X >0} ] = −E[ Xg( X )1{ X >0} ] + E[ Xg( X )1{ X >0} ] = 0.
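This construction is easy to simulate; below X is standard Gaussian (even density) and g(x) = x², so that E[XY] = E[X³] = 0 although Y is a deterministic function of X (sample size is my choice):

```python
import random

random.seed(3)
n = 200_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # even density f_X
ys = [x * x for x in xs]                         # Y = g(X), g(x) = x^2 even, increasing on R+

mean_x = sum(xs) / n
mean_y = sum(ys) / n
exy = sum(x * y for x, y in zip(xs, ys)) / n     # E[XY] = E[X^3] = 0 for standard Gaussian X
# E[XY] ~ E[X]E[Y]: uncorrelated, yet Y is a function of X
```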

Theorem 1.5 (AM greater than GM). For any two random variables X, Y, the correlation is upper bounded by the average of the second moments, with equality iff X = Y almost surely. That is,

E[XY] ⩽ (EX² + EY²)/2.

Proof. This follows from the linearity and monotonicity of expectations and the fact that (X − Y)² ⩾ 0, with equality iff X = Y almost surely.
Theorem 1.6 (Cauchy-Schwarz inequality). For any two random variables X, Y, the correlation of the absolute values of X and Y is upper bounded by the square root of the product of second moments, with equality iff X = αY almost surely for some constant α ∈ R. That is,

E|XY| ⩽ √(EX² EY²).

Proof. For two random variables X and Y, we can define normalized random variables W ≜ |X|/√(EX²) and Z ≜ |Y|/√(EY²), and apply the AM-GM inequality E[WZ] ⩽ (EW² + EZ²)/2 = 1 to get the result.

2 Covariance
Definition 2.1 (Covariance). For two random variables X, Y defined on the same probability space, the
covariance between these two random variables is defined as cov( X, Y ) ≜ E( X − EX )(Y − EY ).
Lemma 2.2. If the random variables X, Y are uncorrelated, then the covariance is zero.
Proof. We can write the covariance of uncorrelated random variables X, Y as
cov( X, Y ) = E( X − EX )(Y − EY ) = EXY − (EX )(EY ) = 0.

Lemma 2.3. Let X : Ω → Rn be an uncorrelated random vector and a = (a1, . . . , an) ∈ Rn, then

Var( ∑_{i=1}^n ai Xi ) = ∑_{i=1}^n ai² Var(Xi).

Proof. From the linearity of expectation, we can write the variance of the linear combination as

E( ∑_{i=1}^n ai (Xi − EXi) )² = ∑_{i=1}^n ai² Var(Xi) + ∑_{i≠j} ai aj cov(Xi, Xj),

and the cross terms vanish, since uncorrelated components have cov(Xi, Xj) = 0 for i ≠ j.
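A numerical check of this identity, with independent (hence uncorrelated) components X1 ~ N(0, 1), X2 ~ N(0, 4) and coefficients a = (3, −2) (all choices mine), so that Var(a1X1 + a2X2) = 9 · 1 + 4 · 4 = 25:

```python
import random

random.seed(4)
n = 100_000
x1 = [random.gauss(0, 1) for _ in range(n)]   # Var X1 = 1
x2 = [random.gauss(0, 2) for _ in range(n)]   # Var X2 = 4, independent of X1
a1, a2 = 3.0, -2.0

combo = [a1 * u + a2 * v for u, v in zip(x1, x2)]
mean = sum(combo) / n
var_empirical = sum((c - mean) ** 2 for c in combo) / n
var_formula = a1 ** 2 * 1.0 + a2 ** 2 * 4.0   # 9 + 16 = 25
```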

Definition 2.4 (Correlation coefficient). The ratio of the covariance of two random variables X, Y and the square root of the product of their variances is called the correlation coefficient and denoted by

ρX,Y ≜ cov(X, Y)/√(Var(X) Var(Y)).
Theorem 2.5 (Correlation coefficient). For any two random variables X, Y, the absolute value of the correlation coefficient is less than or equal to unity, with equality iff X = αY + β almost surely for constants α = ±√(Var(X)/Var(Y)) and β = EX − αEY.

Proof. For two random variables X and Y, we can define normalized random variables W ≜ (X − EX)/√Var(X) and Z ≜ (Y − EY)/√Var(Y). Applying the AM-GM inequality to random variables W, Z, we get

cov(X, Y) ⩽ √(Var(X) Var(Y)).

Recall that equality is achieved iff W = Z almost surely, or equivalently iff X = αY + β almost surely. Taking U = −Y, we see that −cov(X, Y) ⩽ √(Var(X) Var(Y)), and hence the result follows.

3 Lp spaces
Definition 3.1. A pair (p, q) ∈ R² where p, q ⩾ 1 and 1/p + 1/q = 1 is called a conjugate pair, and the spaces Lp and Lq are called dual spaces.

Example 3.2. The dual of L1 space is L∞ . The space L2 is dual of itself, and called a Hilbert space.

Theorem 3.3 (Hölder’s inequality). Consider a conjugate pair ( p, q) and random variables X ∈ L p , Y ∈ Lq . Then,
E | XY | ⩽ ∥ X ∥ p ∥Y ∥q .
Proof. Consider a random variable Z : Ω → {p ln v, q ln w} with probability mass function (1/p, 1/q), for positive reals v, w. It follows from Jensen's inequality applied to the convex function f(x) = e^x and the random variable Z, that

vw = f(EZ) ⩽ E f(Z) = v^p/p + w^q/q.

It follows that for any non-negative random variables V, W, we have VW ⩽ V^p/p + W^q/q. Taking expectations on both sides, we get from the monotonicity of expectation that EVW ⩽ EV^p/p + EW^q/q. Taking V ≜ |X|/∥X∥p and W ≜ |Y|/∥Y∥q, we get the result.
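Hölder's inequality also holds exactly for the empirical distribution of any sample, which makes it easy to check numerically; below p = 3 and q = 3/2 form a conjugate pair (the distribution choices are mine):

```python
import random

random.seed(5)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [random.expovariate(1.0) for _ in range(n)]

p, q = 3.0, 1.5                                   # conjugate pair: 1/3 + 2/3 = 1
lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n
norm_x = (sum(abs(x) ** p for x in xs) / n) ** (1 / p)
norm_y = (sum(abs(y) ** q for y in ys) / n) ** (1 / q)
# Hoelder: E|XY| <= ||X||_p ||Y||_q
```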
Definition 3.4. For a pair of random variables (X, Y) ∈ Lp × Lq for a conjugate pair (p, q), we can define an inner product ⟨·, ·⟩ : Lp × Lq → R by

⟨X, Y⟩ ≜ EXY.
Remark 1. For X ∈ L p and Y ∈ Lq , the expectation E | XY | is finite from Hölder’s inequality. Therefore, the
inner product ⟨ X, Y ⟩ = E[ XY ] is well defined and finite.
Remark 2. This inner product is well defined for the conjugate pair (1, ∞).
Theorem 3.5 (Minkowski’s inequality). For 1 ⩽ p < ∞, let X, Y ∈ L p be two random variables defined on a
probability space (Ω, F, P). Then,
∥ X + Y ∥ p ⩽ ∥ X ∥ p + ∥Y ∥ p ,
with inequality iff X = αY for some α ⩾ 0 or Y = 0.
Proof. Since addition is a Borel measurable function, X + Y is a random variable. We first show that X + Y ∈ Lp when X, Y ∈ Lp. To this end, we observe that g : R+ → R+ defined by g(x) = x^p for all x ∈ R+ is a convex function for p ⩾ 1. From the convexity of g, we have

|X/2 + Y/2|^p ⩽ (|X|/2 + |Y|/2)^p = g(|X|/2 + |Y|/2) ⩽ (1/2)g(|X|) + (1/2)g(|Y|) = (1/2)|X|^p + (1/2)|Y|^p.

This implies that |X + Y|^p ⩽ 2^{p−1}(|X|^p + |Y|^p).
The inequality holds trivially if ∥X + Y∥p = 0. Therefore, we assume that ∥X + Y∥p > 0, without any loss of generality. Using the definition of ∥·∥p, the triangle inequality, and the linearity of expectation, we get

∥X + Y∥p^p = E[|X + Y| |X + Y|^{p−1}] ⩽ E[(|X| + |Y|)|X + Y|^{p−1}] = E[|X| |X + Y|^{p−1}] + E[|Y| |X + Y|^{p−1}].

From Hölder's inequality applied with the conjugate pair (p, q) to the two products on the RHS, we get

∥X + Y∥p^p ⩽ (∥X∥p + ∥Y∥p) ∥ |X + Y|^{p−1} ∥q.

Recall that q = p/(p − 1). Therefore, ∥ |X + Y|^{p−1} ∥q = (E|X + Y|^p)^{1−1/p}, and the result follows.

Remark 3. We have shown that the map ∥·∥p is a norm by proving Minkowski's inequality. Therefore, Lp is a normed vector space. We can define the distance between two random variables X1, X2 ∈ Lp by the norm ∥X1 − X2∥p.

Lecture-11: Generating functions

1 Generating functions
Suppose that X : Ω → R is a random variable on the probability space (Ω, F, P) with distribution function FX : R → [0, 1].

1.1 Characteristic function


Example 1.1. Let j ≜ √−1; then we can show that hu : R → C defined by hu(x) ≜ e^{jux} = cos(ux) + j sin(ux) is also Borel measurable for all u ∈ R. Thus, hu(X) : Ω → C is a complex-valued random variable on this probability space.

Definition 1.2. For a random variable X : Ω → R defined on the probability space (Ω, F, P), the character-
istic function Φ X : R → C is defined by Φ X (u) ≜ Ee juX for all u ∈ R and j2 = −1.
Remark 1. The characteristic function ΦX(u) is always finite, since |ΦX(u)| = |Ee^{juX}| ⩽ E|e^{juX}| = 1.
Remark 2. For a discrete random variable X : Ω → X with PMF PX : X → [0, 1], the characteristic function
Φ X (u) = ∑ x∈X e jux PX ( x ).
Remark 3. For a continuous random variable X : Ω → R with density function fX : R → R+, the characteristic function ΦX(u) = ∫_{−∞}^{∞} e^{jux} fX(x) dx.

Example 1.3 (Gaussian random variable). For a Gaussian random variable X : Ω → R with mean µ and variance σ², the characteristic function ΦX is

ΦX(u) = (1/√(2πσ²)) ∫_{x∈R} e^{jux} e^{−(x−µ)²/(2σ²)} dx = exp(−u²σ²/2 + juµ).

We observe that |ΦX(u)| = e^{−u²σ²/2} has Gaussian decay with zero mean and variance 1/σ².
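The closed form can be compared against a Monte Carlo estimate of Ee^{juX} (the parameters µ = 1, σ = 2 and u = 0.7 are my choices):

```python
import cmath
import random

random.seed(6)
mu, sigma = 1.0, 2.0
n = 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

u = 0.7
phi_empirical = sum(cmath.exp(1j * u * x) for x in samples) / n
phi_closed = cmath.exp(-u ** 2 * sigma ** 2 / 2 + 1j * u * mu)
# Monte Carlo estimate of E e^{juX} matches exp(-u^2 sigma^2 / 2 + j u mu)
```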

Theorem 1.4. If E|X|^N is finite for some integer N ∈ N, then Φ_X^{(k)}(u) is a finite and continuous function of u ∈ R for all k ∈ [N]. Further, Φ_X^{(k)}(0) = j^k E[X^k] for all k ∈ [N].

Proof. Since E|X|^N is finite, so is E|X|^k for all k ∈ [N]. Therefore, E[X^k] exists and is finite. Exchanging the derivative and the integration (justified by dominated convergence, since the kth derivative of e^{juX} with respect to u is bounded in modulus by |X|^k), and evaluating the derivative at u = 0, we get

Φ_X^{(k)}(0) = E[ (d^k/du^k) e^{juX} |_{u=0} ] = j^k E[X^k].

Theorem 1.5. Two random variables have the same distribution iff they have the same characteristic function.
Proof. It is easy to see the necessity and the sufficiency is difficult.

1.2 Moment generating function

Example 1.6. A function gt : R → R+ defined by gt ( x ) ≜ etx is monotone and hence Borel measurable for
all t ∈ R. Therefore, gt ( X ) : Ω → R+ is a positive random variable on this probability space.

Definition 1.7. For a random variable X : Ω → R defined on the probability space (Ω, F, P), the moment
generating function MX : R → R+ is defined by MX (t) ≜ EetX for all t ∈ R where MX (t) is finite.
Remark 4. Characteristic functions always exist, but are complex-valued in general. Sometimes it is easier to work with moment generating functions, when they exist.
Lemma 1.8. For a random variable X, if the MGF MX(t) is finite for some t ∈ R, then MX(t) = 1 + ∑_{n∈N} (t^n/n!) E[X^n].

Proof. From the Taylor series expansion of e^θ around θ = 0, we get e^θ = 1 + ∑_{n∈N} θ^n/n!. Therefore, taking θ = tX, we can write e^{tX} = 1 + ∑_{n∈N} (t^n/n!) X^n. Taking expectations on both sides, the result follows from the linearity of expectation, when both sides have finite expectation.

Example 1.9 (Gaussian random variable). For a Gaussian random variable X : Ω → R with mean µ and variance σ², the moment generating function MX is MX(t) = exp(t²σ²/2 + tµ).
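A small numerical check of the Gaussian MGF, together with the fact that M′_X(0) = E[X], estimated by a central finite difference of the empirical MGF (the parameters are my choices):

```python
import math
import random

random.seed(7)
mu, sigma = 0.5, 1.0
n = 200_000
samples = [random.gauss(mu, sigma) for _ in range(n)]

def mgf(t):
    # empirical moment generating function E[e^{tX}]
    return sum(math.exp(t * x) for x in samples) / n

t = 0.5
closed_form = math.exp(t * t * sigma * sigma / 2 + t * mu)  # exp(t^2 sigma^2 / 2 + t mu)

h = 1e-3
mean_from_mgf = (mgf(h) - mgf(-h)) / (2 * h)  # central difference: M'_X(0) = E[X]
```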

1.3 Probability generating function


For a non-negative integer-valued random variable X : Ω → X ⊆ Z+ , it is often more convenient to work
with the z-transform of the probability mass function, called the probability generating function.
Definition 1.10. For a discrete non-negative integer-valued random variable X : Ω → X ⊆ Z+ with probability mass function PX : X → [0, 1], the probability generating function ΨX : C → C is defined by

ΨX(z) ≜ E[z^X] = ∑_{x∈X} z^x PX(x), z ∈ C, |z| ⩽ 1.

Lemma 1.11. For a non-negative simple random variable X : Ω → X, we have |ΨX(z)| ⩽ 1 for all |z| ⩽ 1.

Proof. Let z ∈ C with |z| ⩽ 1, and let PX : X → [0, 1] be the probability mass function of the non-negative simple random variable X. Since any realization x ∈ X of the random variable X is non-negative, we can write

|ΨX(z)| = |∑_{x∈X} z^x PX(x)| ⩽ ∑_{x∈X} |z|^x PX(x) ⩽ ∑_{x∈X} PX(x) = 1.

Theorem 1.12. For a non-negative simple random variable X : Ω → X with finite kth moment EX^k, the kth derivative of the probability generating function evaluated at z = 1 is the kth order factorial moment of X. That is,

Ψ_X^{(k)}(1) = E[ ∏_{i=0}^{k−1} (X − i) ] = E[X(X − 1)(X − 2) . . . (X − k + 1)].

Proof. It follows from the interchange of derivative and expectation.


Remark 5. Moments can be recovered from the kth order factorial moments. For example,

E[X] = Ψ′_X(1), E[X²] = Ψ_X^{(2)}(1) + Ψ′_X(1).
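For a Poisson(λ) random variable the PGF is Ψ(z) = e^{λ(z−1)}, so the kth factorial moment is λ^k. The sketch below verifies Ψ′(1) = λ and Ψ^{(2)}(1) = λ² by direct summation of the PMF (the truncation level is my choice):

```python
import math

# Poisson(lam) has PGF Psi(z) = exp(lam (z - 1)), so the kth factorial moment is lam^k
lam = 2.5
K = 60  # truncation; the Poisson(2.5) tail beyond 60 is negligible

pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(K)]
first_factorial = sum(x * p for x, p in enumerate(pmf))             # E[X] = lam
second_factorial = sum(x * (x - 1) * p for x, p in enumerate(pmf))  # E[X(X-1)] = lam^2
second_moment = second_factorial + first_factorial                  # E[X^2] = lam^2 + lam
```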

Theorem 1.13. Two non-negative integer-valued random variables have the same probability distribution iff their z-transforms are equal.

Proof. The necessity is clear. For sufficiency, interchanging the derivative and the summation (by the dominated convergence theorem), we see that Ψ_X^{(k)}(z) = ∑_{x⩾k} x(x − 1) · · · (x − k + 1) z^{x−k} PX(x). Evaluating at z = 0, only the x = k term survives, and hence Ψ_X^{(k)}(0) = k!PX(k). The PMF, and hence the distribution, is therefore recovered from the z-transform.

2 Gaussian Random Vectors


Definition 2.1. For a random vector X : Ω → Rn defined on a probability space (Ω, F, P), we can define the
characteristic function Φ X : Rn → C by Φ X (u) ≜ Ee j⟨u,X ⟩ where u ∈ Rn .
Remark 6. If X : Ω → Rn is an independent random vector, then Φ X (u) = ∏in=1 Φ Xi (ui ) for all u ∈ Rn .
Definition 2.2. For a probability space (Ω, F, P), a Gaussian random vector is a continuous random vector X : Ω → Rn defined by its density function

fX(x) ≜ (1/√((2π)^n det(Σ))) exp( −½ (x − µ)^T Σ^{−1} (x − µ) ) for all x ∈ Rn,

where the vector µ ∈ Rn and the matrix Σ ∈ Rn×n is positive definite.


Remark 7. For a Gaussian random vector with vector µ = (µ1, . . . , µ1) for some real scalar µ1 and matrix Σ = σ²I for some positive σ² ∈ R+, we can write its density as

fX(x) = (1/(2πσ²)^{n/2}) exp( −½ ∑_{i=1}^n (xi − µ1)²/σ² ) for all x ∈ Rn.

It follows that X is an i.i.d. random vector with each component being a Gaussian random variable with mean µ1 and variance σ². The characteristic function ΦX of an i.i.d. Gaussian random vector X : Ω → Rn parametrized by (µ1, σ²) is given by

ΦX(u) = ∏_{i=1}^n ΦXi(ui) = exp( −(σ²/2) ∑_{i=1}^n ui² + jµ1 ∑_{i=1}^n ui ).

Lemma 2.3. For an i.i.d. zero mean unit variance Gaussian vector Z : Ω → Rn , vector α ∈ Rn , and scalar µ ∈ R,
the affine combination Y ≜ µ + ⟨α, Z ⟩ is a Gaussian random variable.
Proof. From the linearity of expectation and the fact that Z is a zero mean vector, we get EY = µ. Further, from the linearity of expectation and the fact that E[ZZ^T] = I, we get

σ² ≜ Var(Y) = E(Y − µ)² = ∑_{i=1}^n ∑_{k=1}^n αi αk E[Zi Zk] = ⟨α, α⟩ = ∥α∥₂² = ∑_{i=1}^n αi².

To show that Y is Gaussian, it suffices to show that ΦY(u) = exp(−u²σ²/2 + juµ) for any u ∈ R. Recall that Z is an independent random vector with individual components being identically zero mean unit variance Gaussian. Therefore, ΦZi(u) = exp(−u²/2), and we can compute the characteristic function of Y as

ΦY(u) = Ee^{juY} = e^{juµ} E ∏_{i=1}^n e^{juαi Zi} = e^{juµ} ∏_{i=1}^n ΦZi(uαi) = exp(−u²σ²/2 + juµ).

Theorem 2.4. A random vector X : Ω → Rn is Gaussian with parameters (µ, Σ) iff Z ≜ Σ^{−1/2}(X − µ) is an i.i.d. zero mean unit variance Gaussian random vector.
Proof. Let X = µ + Σ^{1/2}Z for an i.i.d. zero mean unit variance Gaussian random vector Z : Ω → Rn; then we will show that X is a Gaussian random vector by the transformation of random vector densities. Since the (i, j)th component of the Jacobian matrix J(x) is given by Jij(x) = ∂xj/∂zi = Σ^{1/2}_{i,j} for all i, j ∈ [n], we can write the Jacobian matrix J(x) = Σ^{1/2}. Since the density of Z is fZ(z) = (1/√((2π)^n)) exp(−½ z^T z), from the transformation of random vectors we get

fX(x) = fZ(Σ^{−1/2}(x − µ))/det(Σ^{1/2}) = (1/((2π)^{n/2} det(Σ)^{1/2})) exp( −½ (x − µ)^T Σ^{−1} (x − µ) ), x ∈ Rn.

Conversely, we can show that if X is a Gaussian random vector, then Z = Σ^{−1/2}(X − µ) is an i.i.d. zero mean unit variance Gaussian random vector by the transformation of random vectors.

Remark 8. A random vector X : Ω → Rn with mean µ ∈ Rn and covariance matrix Σ ∈ Rn×n is Gaussian iff X can be written as X = µ + Σ^{1/2}Z, for an i.i.d. Gaussian random vector Z : Ω → Rn with mean 0 and variance 1. It follows that EX = µ and Σ = E(X − µ)(X − µ)^T.

Remark 9. We observe that the components of the Gaussian random vector X = µ + AZ for A = Σ^{1/2} are Gaussian random variables with mean µi and variance ∑_{k=1}^n A²_{i,k} = (AA^T)_{i,i} = Σ_{i,i}, since each component Xi = µi + ∑_{k=1}^n A_{i,k}Zk is an affine combination of zero mean unit variance i.i.d. random variables.
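This gives the standard recipe for sampling a Gaussian vector: draw i.i.d. standard Gaussians Z and set X = µ + AZ for any A with AA^T = Σ. Below A is the Cholesky factor of a 2 × 2 covariance matrix (all the numbers are my choices):

```python
import random

random.seed(8)
mu = [1.0, -2.0]
Sigma = [[4.0, 1.2],
         [1.2, 1.0]]

# Any A with A A^T = Sigma works; here the 2x2 Cholesky factor
a11 = Sigma[0][0] ** 0.5
a21 = Sigma[1][0] / a11
a22 = (Sigma[1][1] - a21 ** 2) ** 0.5

n = 200_000
xs = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)  # i.i.d. standard Gaussians
    xs.append((mu[0] + a11 * z1, mu[1] + a21 * z1 + a22 * z2))

m1 = sum(x for x, _ in xs) / n
m2 = sum(y for _, y in xs) / n
cov12 = sum((x - m1) * (y - m2) for x, y in xs) / n  # should approach Sigma[0][1] = 1.2
```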
Remark 10. For any u ∈ Rn, we compute the characteristic function ΦX from the distribution of Z as

ΦX(u) = Ee^{j⟨u,X⟩} = E exp( j⟨u, µ⟩ + j⟨A^T u, Z⟩ ) = exp(j⟨u, µ⟩) ΦZ(A^T u) = exp( j⟨u, µ⟩ − ½ u^T Σ u ).
Lemma 2.5. If the components of the Gaussian random vector are uncorrelated, then they are independent.
Proof. If a Gaussian vector is uncorrelated, then the covariance matrix Σ is diagonal. It follows that we can
write f X ( x ) = ∏in=1 f Xi ( xi ) for all x ∈ Rn .

Lecture-12: Conditional Expectation

1 Conditional expectation given a non trivial event


Consider a probability space (Ω, F, P) and an event B ∈ F such that P(B) > 0. Then, the conditional probability of any event A ∈ F given the event B is defined as

P(A | B) = P(A ∩ B)/P(B).

Consider a random variable X : Ω → R defined on a probability space (Ω, F, P), with distribution function
FX : R → [0, 1], and a non trivial event B ∈ F such that P( B) > 0.

Definition 1.1. The conditional distribution of X given event B is denoted by FX|B : R → [0, 1], where FX|B(x) is defined as the probability of the event AX(x) ≜ X^{−1}(−∞, x] conditioned on the event B for all x ∈ R. That is,

FX|B(x) ≜ P(AX(x) | B) = P(AX(x) ∩ B)/P(B) for all x ∈ R.

Remark 1. The conditional distribution FX | B is a distribution function. This follows from the fact that (i)
FX | B ⩾ 0, (ii) FX | B is right continuous, (iii) limx↓−∞ FX | B ( x ) = 0 and limx↑∞ FX | B ( x ) = 1.
Remark 2. For a discrete random variable X : Ω → X, the conditional probability mass function of X given a non-trivial event B is given by PX|B(x) = P(X^{−1}{x} ∩ B)/P(B) for all x ∈ X.
Remark 3. For a continuous random variable X : Ω → R, the conditional density of X given a non-trivial event B is given by fX|B(x) = dFX|B(x)/dx for all x ∈ R.

Example 1.2 (Conditional distribution). Consider the probability space (Ω, F, P) corresponding to a random experiment where a fair die is rolled once. For this case, the outcome space Ω = [6], the event space F = P([6]), and the probability measure P(ω) = 1/6 for all ω ∈ Ω.
We define a random variable X : Ω → R such that X(ω) = ω for all ω ∈ Ω, and an event B ≜ {ω ∈ Ω : X(ω) ⩽ 3} = [3] ∈ F. We note that P(B) = 0.5 and the conditional PMF of X given B is

PX|B(x) = (1/3)1{x=1} + (1/3)1{x=2} + (1/3)1{x=3}.

Definition 1.3. The conditional expectation of X given event B is given as E[X | B] ≜ ∫_{x∈R} x dFX|B(x).

Remark 4. For a discrete random variable X : Ω → X, the conditional expectation of X given a non trivial
event B is given by E[ X | B] = ∑ x∈X xPX | B ( x ).

Example 1.4 (Conditional expectation). For the random variable X and event B defined in Example 1.2,
the conditional expectation E[ X | B] = 2.

Remark 5. Consider two random variables X, Y defined on this probability space; then for y ∈ R such that FY(y) > 0, we can define events AX(x) ≜ X^{−1}(−∞, x] and AY(y) ≜ Y^{−1}(−∞, y], such that

P({X ⩽ x} | {Y ⩽ y}) = FX,Y(x, y)/FY(y).

The key observation is that {Y ⩽ y} is a non-trivial event. How do we define conditional expectation based on events such as {Y = y}? When the random variable Y is continuous, this event has zero probability measure.

2 Conditional expectation given an event space


Consider random variables X : Ω → R and Y : Ω → R defined on the same probability space (Ω, F, P) such that E|X| < ∞, and a smaller event space G ⊂ F. For each non-trivial event G ∈ G, we know how to define the conditional distribution FX|G and E[X | G]. For any trivial event N ∈ G, these are undefined.
Definition 2.1. The conditional expectation of random variable X given event space G is a random variable E[X | G] : Ω → R defined on the same probability space, such that
(i) Z ≜ E[X | G] is G measurable,
(ii) for all G ∈ G, we have E[X1G] = E[Z1G],
(iii) E|Z| < ∞.
Lemma 2.2. The conditional expectation of X given G is an a.s. unique random variable.
Proof. Consider two random variables Z1 = E[X | G] and Z2 = E[X | G]. Then from the definition, Z1, Z2 are G measurable random variables, and Z1 − Z2 is also G measurable. Therefore, Gn ≜ {Z1 − Z2 > 1/n} ∈ G and E[(Z1 − Z2)1Gn] = 0 by definition, which forces P(Gn) = 0 for each n ∈ N. It follows from the continuity of probability that P(limn Gn) = 0. Similarly, defining Fn ≜ {Z2 − Z1 > 1/n}, we can show that P(limn Fn) = 0.

Example 2.3 (Conditional expectation as averaging). Consider a random variable X : Ω → R defined on a probability space (Ω, F, P) with E | X | < ∞, and the coarsest event space G = {∅, Ω} ⊆ F and the finest event space F. We observe that E[ X |G] = EX a.s. uniquely, since (i) EX is a constant and hence G-measurable, (ii) E[EX 1∅ ] = E[ X 1∅ ] = 0 and E[EX 1Ω ] = E[ X 1Ω ] = EX, and (iii) E |EX | = |EX | ⩽ E | X | < ∞ from Jensen’s inequality.
We also observe that E[ X |F ] = X a.s. uniquely, since (i) X is F measurable random variable, (ii)
E[ X 1G ] = E[ X 1G ] for all events G ∈ F, and (iii) E | X | < ∞.

Lemma 2.4. The mean of conditional expectation of random variable X given event space G is EX.
Proof. Since Ω ∈ G from the definition of an event space, and from the definition of conditional expectation, we get
E[E[ X | G]] = E[E[ X | G]1Ω ] = E[ X 1Ω ] = EX.

Definition 2.5. The conditional expectation of X given Y is a random variable E[ X | Y ] ≜ E[ X | σ (Y )] defined on the same probability space.

Example 2.6 (Conditioning on simple random variables). For a simple random variable Y : Ω → Y ⊆ R
defined on the probability space (Ω, F, P), we define fundamental events Ey ≜ Y −1 {y} ∈ F for all y ∈ Y.
Then the sequence of events E ≜ ( Ey ∈ F : y ∈ Y) partitions the sample space, and we can write the event
space generated by random variable Y as σ (Y ) = {∪y∈ I Ey : I ⊆ Y}.
For a random variable X : Ω → R defined on the same probability space, the random variable Z ≜
E[ X | Y ] is σ (Y )-measurable. Therefore, E[ X | Y ] = ∑y∈Y αy 1Ey for some α ∈ RY . We verify that Z : Ω → R
is a σ (Y )-measurable random variable, since σ ( Z ) ⊆ σ (Y ). We can also check that E | Z | < ∞. Further, we
have E[ Z1Ey ] = E[ X 1Ey ] for any y ∈ Y, which implies that αy = E[ X 1Ey ] / PY (y) for any y ∈ Y. Notice that

E[ X | Ey ] = ∫x∈R x dFX | Ey ( x ) = (1/PY (y)) ∫x∈R x dP( A X ( x ) ∩ Ey ) = (1/PY (y)) E[ X 1Ey ] = αy .

Remark 6. There are three main takeaways from this definition. For a random variable Y, the event space
generated by Y is σ (Y ).
1. The conditional expectation E[ X |Y ] = E[ X |σ (Y )] and is σ(Y )-measurable. That is, E[ X |Y ] is a Borel
measurable function of Y. In particular, when Y is discrete, this implies that E[ X |Y ] is a simple random
variable that takes value E[ X | Ey ] when ω ∈ Ey , and the probability of this event is PY (y). When Y is
continuous, E[ X |Y ] is a Borel measurable function of the continuous random variable Y.

2. Expectation is averaging. Conditional expectation is averaging over event spaces. We can observe
that the coarsest averaging is E[ X | {∅, Ω}] = EX and the finest averaging is E[ X |σ ( X )] = X. Further,
E[ X |σ (Y )] is averaging of X over events generated by Y. If we take any event A ∈ σ (Y ) generated by
Y, then the conditional expectation of X given Y is fine enough to find the averaging of X when this
event occurs. That is, E[ X 1 A ] = E[E[ X |Y ]1 A ].

3. If X ∈ L1 , then the conditional expectation E[ X |Y ] ∈ L1 .

3 Conditional distribution given an event space


Definition 3.1. The conditional probability of an event A ∈ F given event space G is defined as P( A | G) ≜
E[1 A | G].
Remark 7. From the definition of conditional expectation, it follows that P( A | G) : Ω → [0, 1] is a G measur-
able random variable, such that E[1G P( A|G)] = P( A ∩ G ) for all G ∈ G, and is uniquely defined up to sets
of probability zero.

Example 3.2. For the trivial sigma algebra G = {∅, Ω}, the conditional probability is the constant function
P( A | {∅, Ω}) = P( A).
Example 3.3. If A ∈ G, then P( A | G) = 1 A .

Definition 3.4. The conditional distribution of random variable X given sub event space G is defined as
FX |G ( x ) ≜ P( A X ( x ) | G) for all x ∈ R.

Remark 8. Recall that FX |G ( x ) : Ω → [0, 1] is a random variable for each x ∈ R. Further, we observe that FX |G
is monotone nondecreasing in x ∈ R, right continuous at all x ∈ R, and has limits limx↓−∞ FX |G ( x ) = 0 and
limx↑∞ FX |G ( x ) = 1. It follows that FX |G : Ω → [0, 1]R is a random distribution.
Theorem 3.5. Let g : R → R be a Borel measurable function and G be a sub-event space. Then, the conditional expectation E[ g( X ) | G] = ∫x∈R g( x ) dFX |G ( x ).

Proof. It suffices to show this for simple random variables X : Ω → X. Since g is Borel measurable, then
g( X ) is a random variable. We will show that E[ g( X ) | G] = ∑ x∈X g( x ) PX |G ( x ) by showing that it satis-
fies three properties of conditional expectation. For part (i), we observe that from the definition of condi-
tional probability PX |G ( x ) is a G-measurable random variable for all x ∈ X, and so is the linear combination

∑ x∈X g( x ) PX |G ( x ). For part (ii), we let G ∈ G. Then, it follows from the linearity of expectation and the
definition of conditional probability, that

E[∑ x∈X g( x ) PX |G ( x )1G ] = ∑ x∈X g( x )E[ PX |G ( x )1G ] = ∑ x∈X g( x )E[1Ex ∩G ] = E[ g( X )1G ].

For part (iii), it follows from the triangle inequality, the linearity of expectation, and the definition of conditional probability that E |∑ x∈X g( x ) PX |G ( x )| ⩽ ∑ x∈X | g( x )| EPX |G ( x ) = ∑ x∈X | g( x )| PX ( x ) = E | g( X )| < ∞.

Remark 9. The conditional characteristic function is given by Φ X |G (u) = E[e juX | G] = ∫x∈R e jux dFX |G ( x ).

Definition 3.6. The conditional distribution of random variable X given random variable Y is defined as
FX |Y ( x ) ≜ P( A X ( x ) | σ (Y )) for all x ∈ R.

Example 3.7 (Conditional distribution given simple random variables). Consider a random variable X :
Ω → R and a simple random variable Y : Ω → Y defined on the same probability space. Since random
variables FX |Y ( x ) = E[1 AX ( x) | Y ] are σ (Y )-measurable, they can be written as FX |Y ( x ) = ∑y∈Y β x,y 1Ey
for some β x ∈ RY and Ey = Y −1 {y} for all y ∈ Y. Further, we have E[ FX |Y ( x )1Ey ] = E[1 AX ( x) 1Ey ] for
any y ∈ Y, which implies that β x,y = P( A X ( x ) ∩ Ey ) / PY (y) = FX |Ey ( x ) for any y ∈ Y. It follows that FX |Y
is a σ(Y )-measurable simple random variable.

Example 3.8 (Conditional expectation). Consider a random experiment of a fair die being thrown and a
random variable X : Ω → R taking the value of the outcome of the experiment. That is, for outcome space
Ω = [6] and event space F = P (Ω), we have X (ω ) = ω with PX ( x ) = 1/6 for x ∈ [6]. Define another random
variable Y = 1{ X⩽3} . Then the conditional expectation of X given Y is a random variable given by

E[ X |Y ] = E[ X | {Y = 1}]1{Y =1} + E[ X | {Y = 0}]1{Y =0} = 21Y −1 (1) + 51Y −1 (0) .

Since P {Y = 1} = P {Y = 0} = 0.5, it follows that E[E[ X |Y ]] = E[ X ] = 3.5.
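The two conditional averages and the tower identity E[E[ X |Y ]] = EX can be checked exactly; `E_given` is an ad hoc helper that averages X over the outcomes of a non-trivial event of the fair die:

```python
from fractions import Fraction

# Fair die: Omega = [6], X(w) = w, Y = 1_{X <= 3}
outcomes = list(range(1, 7))

def E_given(event):
    # E[X | event] for a non-trivial event, given as the list of its outcomes
    # (uniform probability on outcomes, so conditioning is plain averaging)
    return sum(Fraction(w) for w in event) / len(event)

E1 = E_given([w for w in outcomes if w <= 3])   # E[X | {Y = 1}]
E0 = E_given([w for w in outcomes if w > 3])    # E[X | {Y = 0}]
EE = Fraction(1, 2) * E1 + Fraction(1, 2) * E0  # E[E[X | Y]]
```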

Example 3.9 (Conditional distribution). Consider the zero-mean Gaussian random variable N : Ω → R
with variance σ2 , and another independent random variable Y ∈ {−1, 1} with PMF (1 − p, p) for some
p ∈ [0, 1]. Let X = Y + N, then the conditional distribution of X given simple random variable Y is

FX |Y = FX |Y −1 (−1) 1Y −1 (−1) + FX |Y −1 (1) 1Y −1 (1) ,

where FX |Y −1 (µ) is the Gaussian distribution function with mean µ and variance σ2 , that is, FX |Y −1 (µ) ( x ) = ∫−∞ x (2πσ2 )−1/2 e−(t−µ)2 /(2σ2 ) dt.
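The claim can be checked by simulation: conditioned on {Y = 1}, the empirical distribution of X = Y + N should match the Gaussian distribution function with mean 1 and variance σ2 . A Monte Carlo sketch with illustrative parameters σ = 1 and p = 0.3:

```python
import math
import random

random.seed(0)
sigma, p, n = 1.0, 0.3, 200_000  # illustrative parameters, not from the notes

def gauss_cdf(x, mu, sigma):
    # Distribution function of a Gaussian with mean mu and variance sigma^2
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Sample Y and N independently and set X = Y + N
ys = [(1 if random.random() < p else -1) for _ in range(n)]
data = [(y, y + random.gauss(0.0, sigma)) for y in ys]

# Compare the empirical conditional CDF of X given Y = 1 with gauss_cdf(., 1, sigma)
x_given_1 = [x for (y, x) in data if y == 1]
frac_1 = len(x_given_1) / n  # should be close to p
max_err = max(
    abs(sum(1 for v in x_given_1 if v <= x0) / len(x_given_1)
        - gauss_cdf(x0, 1.0, sigma))
    for x0 in (-1.0, 0.0, 0.5, 1.0, 2.0)
)
```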

Definition 3.10. When X, Y are both continuous random variables, there exists a joint density f X,Y ( x, y) for
all ( x, y) ∈ R2 . For each y ∈ R such that f Y (y) > 0, we can define a function f X |Y : R2 → R+ such that

f X,Y ( x, y)
f X |Y ( x, y) ≜ , for all x ∈ R.
f Y (y)

Exercise 3.11. For continuous random variables X, Y, show that the function x 7→ f X |Y ( x, y) is a density of a continuous random variable, for each y ∈ R with f Y (y) > 0.

Lecture-13: Conditional expectation

1 Conditional expectation conditioned on a random vector


Let X : Ω → Rn be a random vector defined on a probability space (Ω, F, P). Recall that the projection
πi : Rn → R of n-dimensional vector X on its ith component gives Xi = πi ( X ), and it is a Borel measurable
function. It follows that Xi = πi ◦ X is a random variable for each i ∈ [n]. Further the event space generated
by random vectors X is given by
σ ( X ) = σ ( X1 , . . . , X n ) .
Remark 1. Let ( Ai ∈ F : i ∈ [n]) be an n-length sequence of events, then X = (1 Ai : i ∈ [n]) is a random vector,
and the smallest event space generated by this random vector is σ ( X ) = σ ( A1 , . . . , An ).
Definition 1.1. Consider a random variable X : Ω → R such that E | X | < ∞ and a random vector Y : Ω → Rn
for some n ∈ N, defined on the same probability space (Ω, F, P). The conditional expectation of random
variable X given the random vector Y is defined as E[ X | Y ] ≜ E[ X | σ (Y )].
Lemma 1.2. For a random sequence X : Ω → RN defined on the same probability space (Ω, F, P), we have
σ ( X1 , . . . , Xn ) ⊆ σ ( X1 , . . . , Xn+1 ), n ∈ N.

Proof. For any x ∈ Rn+1 , any generating event for collection σ ( X1 , . . . , Xn ) is of the form ∩in=1 Xi−1 (−∞, xi ] =
∩in=1 Xi−1 (−∞, xi ] ∩ Xn−+11 (R), a generating event for collection σ( X1 , . . . , Xn+1 ).

2 Properties of Conditional Expectation


Proposition 2.1. Let X, Y be random variables on the probability space (Ω, F, P) such that E | X | , E |Y | < ∞. Let G
and H be event spaces such that G, H ⊂ F. Then the following statements are true.
1. Identity: If X is G-measurable and E | X | < ∞, then X = E[ X | G] a.s. In particular, c = E[c | G], for any
constant c ∈ R.
2. Linearity: E[(αX + βY ) | G] = αE[ X | G] + βE[Y | G] a.s.
3. Monotonicity: If X ⩽ Y a.s., then E[ X | G] ⩽ E[Y | G] a.s.
4. Conditional Jensen’s inequality: If ψ : R → R is convex and E |ψ( X )| < ∞, then ψ(E[ X | G]) ⩽ E[ψ( X ) |
G] a.s.
5. Pulling out what’s known: If Y is G-measurable and E | XY | < ∞, then E[ XY | G] = YE[ X | G] a.s.
6. Tower property: If H ⊆ G, then E[E[ X | G] | H] = E[ X | H] a.s.
7. Irrelevance of independent information: If H is independent of σ (G, σ ( X )) then E[ X |σ (G, H)] = E[ X |
G] a.s. In particular, if X is independent of H, then E[ X | H] = E[ X ] a.s.

8. L2 -projection: If E | X |2 < ∞, then ζ ∗ ≜ E[ X | G] minimizes E[( X − ζ )2 ] over all G-measurable random variables ζ such that E |ζ |2 < ∞.
Proof. In all the properties below, we have to show that conditional expectation given an event space equals
another random variable almost surely. To this end, we will show that the right hand side satisfies the three
properties of the conditional expectation random variable, and hence is the conditional expectation from
almost sure uniqueness.

1. Identity: It follows from the definition that X satisfies all three conditions for conditional expectation.
The event space generated by any constant function is the trivial event space {∅, Ω} ⊆ G for any event
space. Hence, E[c | G] = c.

2. Linearity: Let Z ≜ E[(αX + βY ) | G] and Z1 ≜ E[ X | G] and Z2 ≜ E[Y | G]. We have to show that
Z = αZ1 + βZ2 almost surely.
(i) From the definition of conditional expectation, we have E[ X | G], E[Y | G] are G-measurable. In
addition, the linear map g : R2 → R defined by g( x ) ≜ αx1 + βx2 is Borel measurable. Therefore,
the linear combination αZ1 + βZ2 is G-measurable.
(ii) For any G ∈ G, from the linearity of expectation and definition of conditional expectation, we
have
E[(αZ1 + βZ2 )1G ] = αE[E[ X | G]1G ] + βE[E[Y | G]1G ] = E[(αX + βY )1G ].
(iii) From the definition of conditional expectation, we have E | Z1 | , E | Z2 | are finite. Therefore, we
have
E |αZ1 + βZ2 | ⩽ |α| E | Z1 | + | β| E | Z2 | < ∞.

This implies that αZ1 + βZ2 satisfies three properties of conditional expectation E[(αX + βY ) | G].
From the almost sure uniqueness of conditional expectation, we have Z = αZ1 + βZ2 almost surely.

3. Monotonicity: Let ϵ > 0 and define Aϵ ≜ {E[ X | G] − E[Y | G] > ϵ} ∈ G. Then from the definition of
conditional expectation, we have

0 ⩽ ϵP( Aϵ ) = E[ϵ1 Aϵ ] ⩽ E[(E[ X | G] − E[Y | G])1 Aϵ ] = E[( X − Y )1 Aϵ ] ⩽ 0.

Thus, we obtain that P( Aϵ ) = 0 for all ϵ > 0. Taking limit ϵ ↓ 0, we get

0 = limϵ↓0 P( Aϵ ) = P(limϵ↓0 Aϵ ) = P( A0 ).

4. Conditional Jensen’s inequality: We will use the fact that a convex function can always be repre-
sented by the supremum of a family of affine functions. Accordingly, we will assume for a convex
function ψ : R → R, we have linear functions ϕi : R → R and constants ci ∈ R for all i ∈ I such that
ψ = supi∈ I (ϕi + ci ).
For each i ∈ I, we have ϕi (E[ X | G]) + ci = E[ϕi ( X ) | G] + ci ⩽ E[ψ( X ) | G] from the linearity and
monotonicity of conditional expectation. It follows that

ψ(E[ X | G]) = supi∈ I (ϕi (E[ X | G]) + ci ) ⩽ E[ψ( X ) | G].

5. Pulling out what’s known: From the almost sure uniqueness of conditional expectation, it suffices to
show that YE[ X | G] satisfies following three properties of the conditional expectation E[ XY | G]:
(i) the random variable YE[ X | G] is G-measurable,
(ii) E[ XY 1G ] = E[YE[ X | G]1G ] for any event G ∈ G, and
(iii) E |YE[ X | G]| is finite.
Part (i) is true since Y is given to be G-measurable, E[ X | G] is G-measurable by the definition of
conditional expectation, and product function is Borel measurable.
It suffices to show part (ii) and (iii) for simple G-measurable random variables Y = ∑y∈Y y1Ey where
Ey = Y −1 {y} ∈ G. For any G ∈ G, we have G ∩ Ey ∈ G for all y ∈ Y and ∪y∈Y ( G ∩ Ey ) = G.
Part (ii) follow from the linearity of expectation and definition of conditional expectation E[ X | G],
such that

E[YE[ X | G]1G ] = ∑y∈Y yE[1G∩Ey E[ X | G]] = ∑y∈Y yE[ X 1G∩Ey ] = E[ X ∑y∈Y y1G∩Ey ] = E[ XY 1G ].

Part (iii) follows from the fact that E | XY | is finite and the conditional Jensen’s inequality applied to
convex function |·| : R → R+ to get |E[ X | G]| ⩽ E[| X | | G]. Therefore,

E[|Y | |E[ X | G]|] = ∑y∈Y |y| E[|E[ X | G]| 1Ey ] ⩽ ∑y∈Y |y| E[| X | 1Ey ] = E | XY | .

6. Tower property: From the almost sure uniqueness of conditional expectation, it suffices to show that
(i) E[ X | H] is H-measurable,
(ii) E[E[ X | H]1 H ] = E[E[ X | G]1 H ] for all H ∈ H, and
(iii) E[|E[ X | H]|] is finite.
Part (i) follows from the definition of conditional expectation, which implies that E[ X | H] is H mea-
surable.
Part (ii) follows from the fact that H ⊆ G, and hence any H ∈ H belongs to G. Therefore, from the
definition of conditional expectation, we have

E[E[ X | G]1 H ] = E[ X 1 H ] = E[E[ X | H ]1 H ].

Part (iii) follows from the conditional Jensen’s inequality applied to convex function |·| : R → R+ to
get |E[ X | H]| ⩽ E[| X | | H], and hence E |E[ X | H]| ⩽ E | X | < ∞.
7. Irrelevance of independent information: From the almost sure uniqueness of conditional expecta-
tion, it suffices to show that
(i) E[ X | G] is σ (G, H)-measurable,
(ii) E[E[ X | G]1G ] = E[ X 1G ] for all G ∈ σ (G, H), and
(iii) E |E[ X | G]| is finite.
Part (i) follows from the definition of conditional expectation and the definition of σ (G, H). Since
E[ X | G] is G-measurable, it is σ (G, H) measurable.
Part (ii) follows from the fact that it suffices to show for events A = G ∩ H ∈ σ (G, H) where G ∈ G and
H ∈ H. In this case,

E[E[ X | G]1 G ∩ H ] = E[E[ X | G]1 G 1 H ] = E[E[ X | G]1 G ] E [1 H ] = E[ X 1 G ]E[1 H ] = E[ X 1 G ∩ H ].

Part (iii) follows from the conditional Jensen’s inequality applied to convex function |·| : R → R+ to
get |E[ X | G]| ⩽ E[| X | | G]. This implies that E |E[ X | G]| ⩽ E | X | < ∞.

8. L2 -projection: We define L2 (G) ≜ {ζ a G-measurable random variable : Eζ 2 < ∞}. From the conditional Jensen’s inequality applied to convex function (·)2 : R → R+ , we get that (E[ X | G])2 ⩽ E[ X 2 | G] a.s. Since X ∈ L2 , it follows that X 2 ∈ L1 and hence E[ X | G] ∈ L2 . It follows that ζ ∗ ≜ E[ X | G] ∈ L2 (G)
from the definition of conditional expectation.
We first show that X − ζ ∗ is uncorrelated with all ζ ∈ L2 (G). Towards this end, we let ζ ∈ L2 (G) and
observe that
E[( X − ζ ∗ )ζ ] = E[ζX ] − E[ζE[ X | G]] = E[ζX ] − E[E[ζX | G]] = 0.
The above equality follows from the linearity of expectation, the G-measurability of ζ, and the def-
inition of conditional expectation. Since ζ ∗ ∈ L2 (G), we have (ζ − ζ ∗ ) ∈ L2 (G). Therefore, E[( X −
ζ ∗ )(ζ − ζ ∗ )] = 0.
For any ζ ∈ L2 (G), we can write from the linearity of expectation

E( X − ζ )2 = E( X − ζ ∗ )2 + E(ζ − ζ ∗ )2 − 2E( X − ζ ∗ )(ζ − ζ ∗ ) ⩾ E( X − ζ ∗ )2 .
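The minimizing property of ζ ∗ can be verified exhaustively on a small finite probability space. In this sketch the four-point space, the choices of X and Y, and the grid of candidate functions of Y are all illustrative assumptions:

```python
import itertools
from fractions import Fraction

# Small uniform probability space; X and Y are illustrative choices
P = {w: Fraction(1, 4) for w in range(4)}
X = {0: Fraction(0), 1: Fraction(1), 2: Fraction(4), 3: Fraction(9)}
Y = {0: 0, 1: 0, 2: 1, 3: 1}  # candidates zeta are the functions of Y

# zeta* = E[X | Y]: average X over each event {Y = y}
zeta_star = {}
for y in set(Y.values()):
    Ey = [w for w in P if Y[w] == y]
    zeta_star[y] = sum(X[w] * P[w] for w in Ey) / sum(P[w] for w in Ey)

def mse(g):
    # E[(X - g(Y))^2] for a function g of Y, given as a dict over values of Y
    return sum((X[w] - g[Y[w]]) ** 2 * P[w] for w in P)

best = mse(zeta_star)
# zeta* is at least as good as every function of Y with values on a grid
grid = [Fraction(k, 2) for k in range(-4, 21)]
dominates = all(
    mse(dict(zip((0, 1), vals))) >= best
    for vals in itertools.product(grid, repeat=2)
)
```

Equality is attained at ζ = ζ ∗ itself, which lies on the grid, so `dominates` confirms a minimum rather than a strict gap.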

Lecture-14: Almost sure convergence

1 Point-wise convergence
Consider a random sequence X : Ω → RN defined on a probability space (Ω, F, P), then each Xn ≜ πn ◦ X :
Ω → R is a random variable. There are many possible definitions for convergence of a sequence of random
variables. One idea is to consider X (ω ) ∈ RN as a real valued sequence for each outcome ω, and consider
the limn Xn (ω ) for each outcome ω.
Definition 1.1. A random sequence X : Ω → RN defined on a probability space (Ω, F, P) converges point-
wise to a random variable X∞ : Ω → R, if for all outcomes ω ∈ Ω, we have
limn Xn (ω ) = X∞ (ω ).

Remark 1. This is a very strong convergence. Intuitively, what happens on an event of probability zero is
not important. We will strive for a weaker notion of convergence, where the sequence of random variable
converge point-wise on a set of outcomes with probability one.

2 Almost sure statements


Definition 2.1. A statement holds almost surely (a.s.) if there exists an event called the exception set N ∈ F
with P( N ) = 0 such that the statement holds for all ω ∉ N.

Example 2.2 (Almost sure equality). Two random variables X, Y defined on the probability space
(Ω, F, P) are said to be equal a.s. if the following exception set

N ≜ {ω ∈ Ω : X (ω ) ̸= Y (ω )} ∈ F,

has probability measure P( N ) = 0. Then Y is called a version of X, and we can define an equivalence
class of a.s. equal random variables.
Example 2.3 (Almost sure monotonicity). Two random variables X, Y defined on the probability space
(Ω, F, P) are said to be X ⩽ Y a.s. if the exception set N ≜ {ω ∈ Ω : X (ω ) > Y (ω )} ∈ F has probability
measure P( N ) = 0.

3 Almost sure convergence


Definition 3.1 (Almost sure convergence). A random sequence X : Ω → RN defined on the probability
space (Ω, F, P) converges almost surely, if the following exception set
 
N ≜ {ω ∈ Ω : lim infn Xn (ω ) < lim supn Xn (ω ) or lim supn Xn (ω ) = ∞} ∈ F,

has zero probability. Let X∞ be the point-wise limit of the sequence of random variables X : Ω → RN on
the set N c , then we say that the sequence X converges almost surely to X∞ , and denote it as
limn Xn = X∞ a.s.

Example 3.2 (Convergence almost surely but not everywhere). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([ a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each n ∈ N, we define the scaled
indicator random variable Xn : Ω → {0, 1} such that

Xn (ω ) ≜ n 1[0, 1/n] (ω ).

Let N = {0}, then for any ω ∉ N, there exists m = ⌈1/ω ⌉ ∈ N, such that for all n > m, we have Xn (ω ) = 0.
That is, limn Xn = 0 a.s. since λ( N ) = 0. However, Xn (0) = n for all n ∈ N.
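This example can be checked numerically; the sampled outcomes 0.7, 0.3, 0.01 are arbitrary illustrative choices:

```python
import math

# X_n(w) = n * 1_{[0, 1/n]}(w) on the probability space ([0,1], Borel, Lebesgue)
def X(n, w):
    return n if 0 <= w <= 1.0 / n else 0

# Off the Lebesgue-null set N = {0}, the sequence is eventually 0:
# for w > 0 and n > ceil(1/w), we have 1/n < w and hence X_n(w) = 0
tails_vanish = all(
    all(X(n, w) == 0 for n in range(math.ceil(1 / w) + 1, math.ceil(1 / w) + 100))
    for w in (0.7, 0.3, 0.01)
)
# At the exceptional outcome w = 0, X_n(0) = n blows up
vals_at_zero = [X(n, 0.0) for n in range(1, 10)]
```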

4 Convergence in probability
Definition 4.1 (convergence in probability). A random sequence X : Ω → RN defined on the probability
space (Ω, F, P) converges in probability to a random variable X∞ : Ω → R, if limn P( An (ϵ)) = 0 for any
ϵ > 0, where
An (ϵ) ≜ {ω ∈ Ω : | Xn (ω ) − X∞ (ω )| > ϵ} ∈ F.
Remark 2. limn Xn = X∞ a.s. means that for almost all outcomes ω, the difference Xn (ω ) − X∞ (ω ) gets
small and stays small.
Remark 3. limn Xn = X∞ i.p. is a weaker convergence than a.s. convergence, and merely requires that the
probability of the difference Xn (ω ) − X∞ (ω ) being non-trivial becomes small.

Example 4.2 (Convergence in probability but not almost surely). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([ a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each k ∈ N, we consider the se-
quence Sk = ∑ik=1 i, and define integer intervals Ik ≜ {Sk−1 + 1, . . . , Sk }. Clearly, the intervals ( Ik : k ∈ N)
partition the natural numbers, and each n ∈ N lies in some Ikn , such that n = Skn −1 + in for in ∈ [k n ].
Therefore, for each n ∈ N, we define indicator random variable Xn : Ω → {0, 1} such that

Xn (ω ) = 1[(in −1)/kn , in /kn ] (ω ).

For any ω ∈ [0, 1], we have Xn (ω ) = 1 for infinitely many values of n, since there exist infinitely many (i, k) pairs such that (i − 1)/k ⩽ ω ⩽ i/k, and hence lim supn Xn (ω ) = 1 and limn Xn (ω ) ̸= 0. However,
limn Xn (ω ) = 0 in probability, since

limn λ { Xn ̸= 0} = limn 1/k n = 0.
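This "moving bump" construction can be coded directly. Here `block_index` is a hypothetical helper recovering (k n , in ) from n through the relation n = Skn −1 + in :

```python
# Encode n = S_{k-1} + i with S_k = 1 + 2 + ... + k, so that X_n is the indicator
# of the i-th subinterval [(i-1)/k, i/k] of the k-th block
def block_index(n):
    k, S = 1, 0
    while S + k < n:
        S += k
        k += 1
    return k, n - S  # (k_n, i_n)

def X(n, w):
    k, i = block_index(n)
    return 1 if (i - 1) / k <= w <= i / k else 0

# lambda{X_n != 0} = 1/k_n -> 0: convergence in probability to 0
measures = [1.0 / block_index(n)[0] for n in (1, 10, 100, 1000)]
# But each block sweeps all of [0,1], so any fixed w is hit in every block;
# the first 10 blocks occupy n = 1, ..., 55, so w = 0.4 is hit at least 10 times
hits = sum(X(n, 0.4) for n in range(1, 56))
```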

5 Infinitely often and all but finitely many


Lemma 5.1 (infinitely often and all but finitely many). Let A ∈ FN be a sequence of events.
(a) For some subsequence (k n : n ∈ N) depending on ω, we have

lim supn An = {ω ∈ Ω : ω ∈ Akn for all n ∈ N} = {ω ∈ Ω : ∑n∈N 1 An (ω ) = ∞} = { An infinitely often} .

(b) For a finite n0 (ω ) ∈ N depending on ω, we have

lim infn An = {ω ∈ Ω : ω ∈ An for all n ⩾ n0 (ω )} = {ω ∈ Ω : ∑n∈N 1 Acn (ω ) < ∞} = { An for all but finitely many n} .

Proof. Let A ∈ FN be a sequence of events.
(a) Let ω ∈ lim supn An = ∩n∈N ∪k⩾n Ak , then ω ∈ ∪k⩾n Ak for all n ∈ N. Therefore, for each n ∈ N, there
exists k n ∈ N such that ω ∈ Akn , and hence

∑ j∈N 1 A j (ω ) ⩾ ∑n∈N 1 Akn (ω ) = ∞.

Conversely, if ∑ j∈N 1 A j (ω ) = ∞, then for each n ∈ N there exists a k n ∈ N such that ω ∈ Akn and hence
ω ∈ ∪k⩾n Ak for all n ∈ N.
(b) Let ω ∈ lim infn An = ∪n∈N ∩k⩾n Ak , then there exists n0 (ω ) such that ω ∈ An for all n ⩾ n0 (ω ). Con-
versely, if ∑ j∈N 1 Ac (ω ) < ∞, then there exists n0 (ω ) such that ω ∈ An for all n ⩾ n0 (ω ).
j

Theorem 5.2 (Convergence a.s. implies in probability). If a sequence of random variables X : Ω → RN defined
on a probability space (Ω, F, P) converges a.s. to a random variable X∞ : Ω → R, then it converges in probability to
the same random variable.
Proof. Let limn Xn = X∞ a.s. and ϵ > 0. We define events An ≜ {ω ∈ Ω : | Xn (ω ) − X∞ (ω )| > ϵ} for each
n ∈ N. We will show that limn P( An ) = 0. To this end, let N be the exception set such that
 
N ≜ {ω ∈ Ω : lim infn Xn (ω ) < lim supn Xn (ω ) or lim supn Xn (ω ) = ∞} .

For ω ∉ N, there exists an n0 (ω ) such that | Xn (ω ) − X∞ (ω )| ⩽ ϵ for all n ⩾ n0 . That is, ω ∈ Acn for all n ⩾ n0 (ω ), and hence N c ⊆ lim infn Acn . It follows that 1 = P(lim infn Acn ). Since lim infn Acn = (lim supn An )c , we get 0 = P(lim supn An ) = limn P(∪k⩾n Ak ) ⩾ limn P( An ) ⩾ 0.

6 Borel-Cantelli Lemma
Proposition 6.1 (Borel-Cantelli Lemma). Let A ∈ FN be a sequence of events such that ∑n∈N P( An ) < ∞, then
P { An i.o.} = 0.
Proof. We can write the probability of infinitely often occurrence of An , by the continuity and sub-additivity
of probability as
P(lim supn An ) = limn P(∪k⩾n Ak ) ⩽ limn ∑k⩾n P( Ak ) = 0.

The last equality follows from the fact that ∑n∈N P( An ) < ∞.
Proposition 6.2 (Borel zero-one law). Let A ∈ FN be a sequence of independent events, then
P { An i.o.} = 0 iff ∑n P( An ) < ∞, and P { An i.o.} = 1 iff ∑n P( An ) = ∞.

Proof. Let A ∈ FN be a sequence of independent events.


(a) From Borel-Cantelli Lemma, if ∑n P( An ) < ∞ then P { An i.o.} = 0.
(b) Conversely, suppose ∑n P( An ) = ∞, then ∑k⩾n P( Ak ) = ∞ for all n ∈ N. From the definition of lim sup
and lim inf, continuity of probability, and independence of sequence of events A ∈ FN , we get
P { An i.o.} = 1 − P(lim infn Acn ) = 1 − limn limm P(∩m k=n Ack ) = 1 − limn limm ∏m k=n (1 − P( Ak )).

Since 1 − x ⩽ e−x for all x ∈ R, from the above equation, the continuity of the exponential function, and the hypothesis, we get

1 ⩾ P { An i.o.} ⩾ 1 − limn limm e−∑m k=n P( Ak ) = 1 − limn exp(− ∑k⩾n P( Ak )) = 1.

Example 6.3 (Convergence in probability can imply almost sure convergence). Consider a random
Bernoulli sequence X : Ω → {0, 1}N defined on the probability space (Ω, F, P) such that P { Xn = 1} = pn
for all n ∈ N. Note that the sequence of random variables is not assumed to be independent, and
definitely not identical. If limn pn = 0, then we see that limn Xn = 0 in probability.
In addition, if ∑n∈N pn < ∞, then limn Xn = 0 a.s. To see this, we define event An ≜ { Xn = 1} ∈ F
for each n ∈ N. Then, applying the Borel-Cantelli Lemma to sequence of events A ∈ FN , we get

1 = P((lim supn An )c ) = P(lim infn Acn ).

That is, limn Xn = 0 for ω ∈ lim infn Acn , implying almost sure convergence.
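A simulation illustrates both conclusions for the summable choice pn = 1/n2 (an illustrative instance; the Xn are sampled independently here only for convenience, even though the Borel-Cantelli argument does not need independence):

```python
import random

random.seed(1)
N, paths = 5000, 300  # truncation length and number of sample paths (arbitrary)

def sample_path():
    # Bernoulli(p_n) with p_n = 1/n^2, a summable sequence
    return [1 if random.random() < 1.0 / n ** 2 else 0 for n in range(1, N + 1)]

runs = [sample_path() for _ in range(paths)]
# Borel-Cantelli: only finitely many ones per path; in fact the expected total
# number of ones is sum_n 1/n^2 < pi^2/6 < 1.65
avg_ones = sum(sum(r) for r in runs) / paths
# Ones beyond n = 100 are rare: P(some one past 100) <= sum_{n>100} 1/n^2 < 0.01
late_frac = sum(1 for r in runs if any(r[100:])) / paths
```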

Theorem 6.4. If a random sequence X : Ω → RN converges to a random variable X∞ : Ω → R in probability, then there exists a subsequence (nk : k ∈ N) ⊆ N such that ( Xnk : k ∈ N) converges almost surely to X∞ .
Proof. Letting n1 = 1, we define the following subsequence and event recursively for each j ∈ N,
n j ≜ inf { N > n j−1 : P {| Xr − X∞ | > 2− j } < 2− j for all r ⩾ N }, A j ≜ {| Xn j+1 − X∞ | > 2− j } .

From the construction, we have limk nk = ∞, and P( A j ) < 2− j for each j ∈ N. Therefore, ∑k∈N P( Ak ) < ∞,
and hence by the Borel-Cantelli Lemma, we have P(lim supk Ak ) = 0. Let N = lim supk Ak be the exception
set such that for any outcome ω ∉ N, for all but finitely many j ∈ N

| Xn j+1 (ω ) − X∞ (ω )| ⩽ 2− j .

That is, for all ω ∉ N, we have lim j Xn j (ω ) = X∞ (ω ).
Theorem 6.5. A random sequence X : Ω → RN converges to a random variable X∞ in probability iff each subsequence ( Xnk : k ∈ N) contains a further subsequence ( Xnk j : j ∈ N) that converges almost surely to X∞ .

A Limits of sequences
Definition A.1. For any real valued sequence a ∈ RN , we can define

lim supn an ≜ infn supk⩾n ak , lim infn an ≜ supn infk⩾n ak .

Remark 4. We define en ≜ supk⩾n ak and f n ≜ infk⩾n ak , and observe that ( f n : n ∈ N) is non-decreasing, (en : n ∈ N) is non-increasing, and f n ⩽ en for all n ∈ N. For m ⩽ n we have f m ⩽ f n ⩽ en , and for m > n we have f m ⩽ em ⩽ en . It follows that supm f m ⩽ en for all n ∈ N, and hence lim infn an = supn f n ⩽ infn en = lim supn an .

Definition A.2. A sequence a ∈ RN is said to converge if lim supn an = lim infn an , and the limit is defined as a∞ ≜ limn an = lim supn an = lim infn an .
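A quick numerical illustration with the divergent sequence an = (−1)n + 1/n, approximating en and f n over a long finite tail (the truncation length is arbitrary):

```python
# a_n = (-1)^n + 1/n: lim sup_n a_n = 1 and lim inf_n a_n = -1, so (a_n) diverges
def a(n):
    return (-1) ** n + 1.0 / n

N = 10_000
tail = [a(k) for k in range(N, 2 * N)]
e_N = max(tail)  # approximates e_N = sup_{k >= N} a_k, close to 1
f_N = min(tail)  # approximates f_N = inf_{k >= N} a_k, close to -1
```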

Theorem A.3. A sequence a ∈ RN converges to a∞ ∈ R if for all ϵ > 0 there exists an integer N ∈ N such that for
all n ⩾ N, we have | an − a∞ | < ϵ.
Proof. Let ϵ > 0 and find the integer Nϵ ∈ N such that an ∈ ( a∞ − ϵ, a∞ + ϵ) for all n ⩾ Nϵ . It follows that
a∞ − ϵ ⩽ f n ⩽ en ⩽ a∞ + ϵ for all n ⩾ Nϵ , and hence a∞ − ϵ ⩽ lim infn an ⩽ lim supn an ⩽ a∞ + ϵ. Since ϵ was
arbitrary, it follows that limn an = a∞ .

Proposition A.4. For any sequence a ∈ RN+ , the following statements are true.
(i) If ∑n∈N an < ∞ then limn→∞ ∑k⩾n ak = 0.
(ii) If ∑n∈N an = ∞ then ∑k⩾n ak = ∞ for all n ∈ N.
Proof. We observe that (∑k<n ak : n ∈ N) is a non-decreasing sequence, and hence limn→∞ ∑k<n ak = supn ∑k<n ak =
∑ n ∈N a n .
(i) It follows that ∑k⩾n ak = ∑n∈N an − ∑k<n ak is a non-increasing sequence with limit 0.
(ii) We can write ∑n∈N an = ∑k<n ak + ∑k⩾n ak . Since the first term is finite for all n ∈ N, it follows that the
second term must be infinite for all n ∈ N.

Lecture-15: L p convergence

1 L p convergence
Definition 1.1 (Convergence in L p ). Let p ⩾ 1, then we say that a random sequence X : Ω → RN defined
on a probability space (Ω, F, P) converges in L p to a random variable X∞ : Ω → R, if

limn ∥ Xn − X∞ ∥ p = 0.

The convergence in L p is denoted by limn Xn = X∞ in L p .
Remark 1. For p ∈ [1, ∞), the convergence of a random sequence X : Ω → RN in L p to a random variable
X∞ : Ω → R is equivalent to
limn E | Xn − X∞ | p = 0.

Proposition 1.2 (Convergence in L p implies convergence in probability). Consider p ∈ [1, ∞) and a sequence of random
variables X : Ω → RN defined on a probability space (Ω, F, P) such that limn Xn = X∞ in L p , then limn Xn = X∞
in probability.
Proof. Let ϵ > 0, then from Markov’s inequality applied to the random variable | Xn − X∞ | p , we have

P {| Xn − X∞ | > ϵ} ⩽ E | Xn − X∞ | p / ϵ p .

Example 1.3 (Convergence almost surely doesn’t imply convergence in L p ). Consider the probability
space ([0, 1], B([0, 1]), λ) such that λ([ a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. We define the scaled indicator
random variable Xn : Ω → {0, 1} such that

Xn (ω ) = 2n 1[0, 1/n] (ω ).

We define N = {0}, and for any ω ∉ N, we can find m ≜ ⌈1/ω ⌉, such that for all n > m, we have Xn (ω ) = 0. Since λ( N ) = 0, it implies that limn Xn = 0 a.s. However, we see that E | Xn | p = 2np /n → ∞.
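The moments E | Xn | p = 2np /n can be tabulated exactly; `pth_moment` is an ad hoc helper:

```python
from fractions import Fraction

# E|X_n|^p = (2^n)^p * lambda([0, 1/n]) = 2^{np} / n for X_n = 2^n 1_{[0, 1/n]}
def pth_moment(n, p):
    return Fraction(2 ** (n * p), n)

moments = [pth_moment(n, 1) for n in range(1, 11)]
# X_n -> 0 a.s., yet the first moments are non-decreasing and unbounded
```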

Remark 2. Convergence almost surely implies convergence in probability. Therefore, above example also
serves as a counterexample to the fact that convergence in probability doesn’t imply convergence in L p .
Theorem 1.4 (L2 weak law of large numbers). Consider a sequence of uncorrelated random variables X : Ω →
RN defined on a probability space (Ω, F, P) such that EXn = µ and Var( Xn ) = σ2 for all n ∈ N. Defining the sum
Sn ≜ ∑in=1 Xi and the n-empirical mean X̄n ≜ Sn /n, we have limn X̄n = µ in L2 and in probability.
Proof. From the uncorrelatedness of random sequence X, and linearity of expectation, we get

Var( X̄n ) = E( X̄n − µ)2 = (1/n2 )E(Sn − nµ)2 = σ2 /n.
It follows that limn X̄n = µ in L2 . Since the convergence in L p implies convergence in probability, the result
holds.

Theorem 1.5 (L1 weak law of large numbers). Consider an i.i.d. random sequence X : Ω → RN defined on a
probability space (Ω, F, P) such that E | X1 | < ∞ and EX1 = µ. Defining the sum Sn ≜ ∑in=1 Xi and the n-empirical mean X̄n ≜ Sn /n, we have limn X̄n = µ in probability.

Example 1.6 (Convergence in L p doesn’t imply almost surely). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([ a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each k ∈ N, we consider the se-
quence Sk = ∑ik=1 i, and define integer intervals Ik ≜ {Sk−1 + 1, . . . , Sk }. Clearly, the intervals ( Ik : k ∈ N)
partition the natural numbers, and each n ∈ N lies in some Ikn , such that n = Skn −1 + in for in ∈ [k n ].
Therefore, for each n ∈ N, we define indicator random variable Xn : Ω → {0, 1} such that

Xn (ω ) ≜ 1[(in −1)/kn , in /kn ] (ω ).

For any ω ∈ [0, 1], we have Xn (ω ) = 1 for infinitely many values of n, since there exist infinitely many (i, k) pairs such that (i − 1)/k ⩽ ω ⩽ i/k, and hence lim supn Xn (ω ) = 1 and limn Xn (ω ) ̸= 0. However,
limn Xn (ω ) = 0 in L p , since
E | Xn | p = λ { Xn ̸= 0} = 1/k n .

2 L1 convergence theorems
Theorem 2.1 (Monotone Convergence Theorem). Consider a non-decreasing non-negative random sequence
X : Ω → RN+ defined on a probability space (Ω, F, P), such that Xn ∈ L1 for all n ∈ N. Let X∞ (ω ) = supn Xn (ω ) for all ω ∈ Ω, then EX∞ = supn EXn .


Proof. From the monotonicity of sequence X and the monotonicity of expectation, we have supn EXn ⩽
EX∞ . Let α ∈ (0, 1) and Y : Ω → R+ a non-negative simple random variable such that Y ⩽ X∞ . We define

En ≜ {ω ∈ Ω : Xn (ω ) ⩾ αY } ∈ F.
From the monotonicity of sequence X, the sequence of events E ∈ FN are monotonically non-decreasing
such that ∪n∈N En = Ω. It follows that
αE[Y 1En ] ⩽ E[ Xn 1En ] ⩽ EXn .
We will use the fact that limn E[Y 1En ] = E[Y ], then αEY ⩽ supn EXn . Taking supremum over all α ∈ (0, 1)
and all simple functions Y ⩽ X∞ , we get EX∞ ⩽ supn EXn .

Theorem 2.2 (Fatou’s Lemma). Consider a non-negative random sequence X : Ω → RN + defined on a probability
space (Ω, F, P). Let X∞ (ω ) ≜ lim infn Xn (ω ) for all ω ∈ Ω, then EX∞ ⩽ lim infn EXn .
Proof. We define Yn ≜ infk⩾n Xk for all n ∈ N. It follows that Y : Ω → RN
+ is a non-negative non-decreasing
sequence of random variables, and X∞ = supn Yn = limn Yn . Applying monotone convergence theorem to
random sequence Y, we get EX∞ = supn EYn . The result follows from the monotonicity of expectation, and
the fact that Yn ⩽ Xk for all k ⩾ n, to get EYn ⩽ infk⩾n EXk .

Theorem 2.3 (Dominated Convergence Theorem). Let X : Ω → RN be a random sequence defined on a proba-
bility space (Ω, F, P). If limn Xn = X∞ a.s. and there exists a Y : Ω → R+ such that Y ∈ L1 and | Xn | ⩽ Y a.s., then
EX∞ = limn EXn .
Proof. From the hypothesis, we have Y + Xn ⩾ 0 a.s. and Y − Xn ⩾ 0 a.s. Therefore, from Fatou’s Lemma
and linearity of expectation, we have
EY + EX∞ ⩽ lim inf E(Y + Xn ) = EY + lim inf EXn , EY − EX∞ ⩽ lim inf E(Y − Xn ) = EY − lim sup EXn .
n n n n

Therefore, we have lim supn EXn ⩽ EX∞ ⩽ lim infn EXn , and the result follows.

3 Convergence theorems for conditional means
Proposition 3.1. Let X : Ω → RN be a random sequence on the probability space (Ω, F, P) such that E | Xn | < ∞
for all n ∈ N. Let G and H be event spaces such that G, H ⊂ F. Then the following theorems hold.
1. Conditional monotone convergence theorem: If 0 ⩽ Xn ⩽ Xn+1 a.s., for all n ∈ N and Xn → X∞ a.s. for
X∞ ∈ L1 , then E[ Xn | G] ↑ E[ X∞ | G] a.s.
2. Conditional Fatou’s lemma: If Xn ⩾ 0 a.s., for all n ∈ N, and lim infn Xn ∈ L1 , then E[lim infn Xn | G] ⩽
lim infn E[ Xn | G] a.s.
3. Conditional dominated convergence theorem: If | Xn | ⩽ Z for all n ∈ N and some Z ∈ L1 , and if Xn → X∞ ,
a.s., then E[ Xn | G] → E[ X∞ | G] a.s. and in L1 .
Proof. Let X : Ω → RN be a random sequence on the probability space (Ω, F, P) such that Xn ∈ L1 for all n ∈ N.
1. Conditional monotone convergence theorem: By monotonicity, we have E[ Xn | G] ↑ Y a.s., where Y : Ω → R+ is G-measurable. The monotone convergence theorem implies that, for each G ∈ G,
E[Y 1G ] = limn E[1G E[ Xn | G]] = limn E[1G Xn ] = E[1G X∞ ].
Hence Y satisfies the defining property of conditional expectation, that is, Y = E[ X∞ | G] a.s.
2. Conditional Fatou's lemma: Defining Yn ≜ infk⩾n Xk , we get Yn ↑ Y∞ = lim infk Xk . By monotonicity,
E[Yn | G] ⩽ infk⩾n E[ Xk | G] a.s.
The conditional monotone convergence theorem implies that
E[Y∞ | G] = limn E[Yn | G] ⩽ lim infn E[ Xn | G] a.s.
3. Conditional dominated convergence theorem: By the conditional Fatou's lemma, we have
E[ Z + X∞ | G] ⩽ lim infn E[ Z + Xn | G] a.s., E[ Z − X∞ | G] ⩽ lim infn E[ Z − Xn | G] a.s.,
and the a.s. statement follows.
4 Uniform integrability
Definition 4.1 (uniform integrability). A family ( Xt ∈ L1 : t ∈ T ) of random variables indexed by T is uniformly integrable if
lima→∞ supt∈ T E[| Xt | 1{| Xt |> a} ] = 0.
Example 4.2 (Single element family). If | T | = 1, then the family is uniformly integrable, since X1 ∈ L1 and lima E[| X1 | 1{| X1 |> a} ] = 0. This is due to the fact that ( Xn ≜ | X1 | 1{| X1 |⩽n} : n ∈ N) is a non-decreasing sequence of random variables with limn Xn = | X1 |. From the monotone convergence theorem, we get limn E Xn = E limn Xn . Therefore,
lima E[| X1 | 1{| X1 |> a} ] = E | X1 | − lima E[| X1 | 1{| X1 |⩽ a} ] = 0.
Proposition 4.3. Let X ∈ L p and ( An : n ∈ N) ⊂ F be a sequence of events such that limn P( An ) = 0, then
limn ∥| X | 1 An ∥ p = 0.
Example 4.4 (Dominated family). If there exists Y ∈ L1 such that supt∈ T | Xt | ⩽ |Y |, then the family of random variables ( Xt : t ∈ T ) is uniformly integrable. This is due to the fact that
supt∈ T E[| Xt | 1{| Xt |> a} ] ⩽ E[|Y | 1{|Y |> a} ].
Example 4.5 (Finite family). If | T | < ∞, then the family of random variables ( Xt : t ∈ T ) is uniformly integrable. This is due to the fact that supt∈ T | Xt | ⩽ ∑t∈ T | Xt | ∈ L1 .
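Definition 4.1 can be checked numerically on the standard counterexample. The construction below is ours, not from the notes: the family Xn = n·1{U ⩽ 1/n}, with U uniform on [0, 1], has E| Xn | = 1 for every n, yet fails uniform integrability because the tail expectation stays at 1 no matter how large a is.

```python
# Each X_n takes the value n with probability 1/n and 0 otherwise, so
# E[|X_n| 1{|X_n| > a}] = n * (1/n) = 1 whenever n > a, and 0 otherwise.
def tail_expectation(n, a):
    return n * (1.0 / n) if n > a else 0.0

for a in (10, 100, 1000):
    sup_tail = max(tail_expectation(n, a) for n in range(1, 10 * a))
    assert sup_tail == 1.0  # sup_n does not vanish as a -> infinity: not u.i.
```

By contrast, for a family dominated as in Example 4.4, the same supremum would vanish as a grows.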
Theorem 4.6 (Convergence in probability with uniform integrability implies convergence in L p ). Consider a sequence of random variables ( Xn : n ∈ N) ⊂ L p for p ⩾ 1. Then the following are equivalent.
(a) The sequence ( Xn : n ∈ N) converges in L p , i.e. limn E | Xn − X | p = 0 for some X ∈ L p .
(b) The sequence ( Xn : n ∈ N) is Cauchy in L p , i.e. limm,n→∞ E | Xn − Xm | p = 0.
(c) limn Xn = X in probability and the sequence (| Xn | p : n ∈ N) is uniformly integrable.
Proof. For a random sequence ( Xn : n ∈ N) in L p , we will show that ( a) =⇒ (b) =⇒ (c) =⇒ ( a).
( a) =⇒ (b) : We assume the sequence ( Xn : n ∈ N) converges in L p to a random variable X. Then, from Minkowski's inequality, we can write
(E | Xn − Xm | p )1/p ⩽ (E | Xn − X | p )1/p + (E | Xm − X | p )1/p ,
and both terms on the right vanish as m, n → ∞.
(b) =⇒ (c) : We assume that the sequence ( Xn : n ∈ N) is Cauchy in L p , i.e. limm,n→∞ E | Xn − Xm | p = 0. Let ϵ > 0, then there exists Nϵ such that for all n, m ⩾ Nϵ
E | Xn − Xm | p ⩽ (ϵ/2) p .
Let Aa ≜ {ω ∈ Ω : | Xn (ω )| > a}. Then, using the triangle inequality and the fact that 1 Aa ⩽ 1, from the linearity and monotonicity of expectation, we can write for n ⩾ Nϵ
(E[| Xn | p 1{| Xn |> a} ])1/p ⩽ (E[| X Nϵ | p 1 Aa ])1/p + (E[| Xn − X Nϵ | p ])1/p ⩽ (E[| X Nϵ | p 1 Aa ])1/p + ϵ/2.
Therefore, we can write supn (E[| Xn | p 1{| Xn |> a} ])1/p ⩽ supm⩽ Nϵ (E[| Xm | p 1 Aa ])1/p + ϵ/2. Since (| Xm | p : m ⩽ Nϵ ) is a finite family of random variables in L1 , it is uniformly integrable. Therefore, there exists aϵ ∈ R+ such that supm⩽ Nϵ (E[| Xm | p 1 Aa ])1/p < ϵ/2 for all a ⩾ aϵ , and hence supn (E[| Xn | p 1{| Xn |> a} ])1/p ⩽ ϵ for all a ⩾ aϵ . Since the choice of ϵ was arbitrary, it follows that
lima→∞ supn (E[| Xn | p 1{| Xn |> a} ])1/p = 0.
The convergence in probability follows from the Markov inequality, i.e.
P {| Xn − Xm | p > ϵ} ⩽ (1/ϵ)E | Xn − Xm | p ,
so the sequence is Cauchy in probability, and hence converges in probability to some random variable X.
(c) =⇒ ( a) : Since the sequence ( Xn : n ∈ N) converges in probability to a random variable X, there exists a subsequence (nk : k ∈ N) ⊂ N such that limk Xnk = X a.s. Since (| Xn | p : n ∈ N) is a uniformly integrable family, by Fatou's Lemma
E | X | p ⩽ lim infk E | Xnk | p ⩽ supn E | Xn | p < ∞.
Therefore, X ∈ L p , and we define An (ϵ) ≜ {| Xn − X | > ϵ} for any ϵ > 0. From Minkowski's inequality, we get
∥ Xn − X ∥ p ⩽ ∥( Xn − X )1 Acn (ϵ) ∥ p + ∥ Xn 1 An (ϵ) ∥ p + ∥ X 1 An (ϵ) ∥ p .
We can check that ∥( Xn − X )1 Acn (ϵ) ∥ p ⩽ ϵ. Further, since limn Xn = X in probability, we have limn P( An (ϵ)) = 0, and since the family (| Xn | p : n ∈ N) is uniformly integrable and X ∈ L p , it follows that limn ∥ Xn 1 An (ϵ) ∥ p = limn ∥ X 1 An (ϵ) ∥ p = 0.
Lecture-16: Weak convergence
1 Convergence in distribution
Definition 1.1 (convergence in distribution). A random sequence X : Ω → RN defined on a probability
space (Ω, F, P) converges in distribution to a random variable X∞ : Ω′ → R defined on a probability space
(Ω′ , F ′ , P′ ) if limn FXn ( x ) = FX∞ ( x ) at all continuity points x of FX∞ . Convergence in distribution is denoted
by limn Xn = X∞ in distribution.
Proposition 1.2. Consider a random sequence X : Ω → RN defined on a probability space (Ω, F, P) and a random
variable X∞ : Ω′ → R defined on another probability space (Ω′ , F ′ , P′ ). Then the following statements are equivalent.
(a) limn Xn = X∞ in distribution.
(b) limn E[ g( Xn )] = E[ g( X∞ )] for any bounded continuous Borel measurable function g : R → R.
(c) Characteristic functions converge point-wise, i.e. limn Φ Xn (u) = Φ X∞ (u) for each u ∈ R.
Proof. Let X : Ω → RN be a sequence of random variables and let X∞ : Ω′ → R be a random variable. We
will show that ( a) =⇒ (b) =⇒ (c) =⇒ ( a).
( a) =⇒ (b): Applying the bounded convergence theorem to any bounded continuous Borel measurable function g : R → R, we have limn ∫x∈R g( x ) dFXn ( x ) = ∫x∈R g( x ) dFX∞ ( x ).
(b) =⇒ (c): Taking g( x ) = e jux , which is bounded and continuous for each u ∈ R, we get the result.
(c) =⇒ ( a): The proof of this part is technical and is omitted.
Example 1.3 (Convergence in distribution but not in probability). Consider a sequence of non-degenerate continuous i.i.d. random variables X : Ω → RN and an independent random variable Y : Ω′ → R, all with the common distribution FY . Then FXn = FY for all n ∈ N, and hence limn Xn = Y in distribution. If the common distribution FY is zero mean Gaussian with variance σ2 , then Xn − Y is zero mean Gaussian with variance 2σ2 . Therefore, for ϵ < σ√π and all n ∈ N
P {| Xn − Y | ⩽ ϵ} = (1/√(4πσ2 )) ∫z∈[−ϵ,ϵ] e−z2 /4σ2 dz ⩽ ϵ/(σ√π ) < 1.
It follows that P {| Xn − Y | > ϵ} ⩾ 1 − ϵ/(σ√π ) for all n ∈ N, and hence limn Xn ̸= Y in probability.
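This example can be seen by simulation. The Monte Carlo sketch below is our addition (the seed and sample sizes are arbitrary): with Xn and Y independent standard normals, Xn − Y ~ N(0, 2), so the deviation probability stays bounded away from zero for every n.

```python
import random

random.seed(0)

def deviation_prob(eps, trials=20000):
    # Estimate P{|X_n - Y| > eps} with X_n and Y independent N(0, 1)
    hits = sum(abs(random.gauss(0, 1) - random.gauss(0, 1)) > eps
               for _ in range(trials))
    return hits / trials

# For eps = 0.5, |X_n - Y| ~ |N(0, 2)| exceeds eps with probability ~0.72,
# independently of n: no convergence in probability.
assert deviation_prob(0.5) > 0.6
```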
Lemma 1.4 (Convergence in probability implies convergence in distribution). Consider a sequence X : Ω → RN of random variables and a random variable X∞ : Ω → R defined on a probability space (Ω, F, P), such that limn Xn = X∞ in probability. Then limn Xn = X∞ in distribution.
Proof. We will show that at all continuity points x of FX∞ , we have limn→∞ FXn ( x ) = FX∞ ( x ). Fix ϵ > 0. Since x is a continuity point of the non-decreasing function FX∞ , choose δ > 0 such that FX∞ ( x + δ) − FX∞ ( x − δ) < ϵ. Therefore, it suffices to show that
FX∞ ( x − δ) ⩽ lim infn→∞ FXn ( x ) ⩽ lim supn→∞ FXn ( x ) ⩽ FX∞ ( x + δ).
For the chosen δ > 0, we consider the event An (δ) ≜ {ω ∈ Ω : | Xn (ω ) − X∞ (ω )| > δ} = { Xn ∉ [ X∞ − δ, X∞ + δ]} ∈ F, and define the events A Xn ( x ) ≜ { Xn ⩽ x } and A X∞ ( x ) ≜ { X∞ ⩽ x }. Then, we can write
A Xn ( x ) ∩ A X∞ ( x + δ) ⊆ A X∞ ( x + δ), A Xn ( x ) ∩ AcX∞ ( x + δ) ⊆ An (δ),
A X∞ ( x − δ) ∩ A Xn ( x ) ⊆ A Xn ( x ), A X∞ ( x − δ) ∩ AcXn ( x ) ⊆ An (δ).
From the above set relations, the law of total probability, and the union bound, we have
FX∞ ( x − δ) − P( An (δ)) ⩽ FXn ( x ) ⩽ FX∞ ( x + δ) + P( An (δ)).
From the convergence in probability, we have limn P( An (δ)) = 0, and the result follows.
Theorem 1.5 (Central Limit Theorem). Consider an i.i.d. random sequence X : Ω → RN defined on a probability space (Ω, F, P), with EXn = µ and Var( Xn ) = σ2 for all n ∈ N. We define the n-sum as Sn ≜ ∑in=1 Xi and consider a standard normal random variable Y : Ω → R with density function f Y (y) = (1/√(2π ))e−y2 /2 for all y ∈ R. Then,
limn (Sn − nµ)/(σ√n ) = Y in distribution.
Proof. The classical proof uses characteristic functions. Let Zi ≜ ( Xi − µ)/σ for all i ∈ N, then the shifted and scaled n-sum is given by (Sn − nµ)/(σ√n ) = (1/√n ) ∑in=1 Zi . We use the third equivalence in Proposition 1.2 to show that the characteristic function of the shifted and scaled n-sum converges to the characteristic function of the standard normal. We define the characteristic functions
Φn (u) ≜ E exp( ju(Sn − nµ)/(σ√n )), Φ Zi (u) ≜ E exp( juZi ), ΦY (u) ≜ E exp( juY ).
We can compute the characteristic function of the standard normal as
ΦY (u) = (1/√(2π )) ∫y∈R e−u2 /2 exp(−(y − ju)2 /2) dy = e−u2 /2 .
Since the random sequence Z : Ω → RN is a zero mean i.i.d. sequence, the first two derivatives satisfy Φ′Z1 (0) = jEZ1 = 0 and Φ′′Z1 (0) = j2 EZ12 = −1. Using the Taylor expansion of the characteristic function Φ Z1 , we have
Φn (u) = ∏in=1 E exp( ju( Xi − µ)/(σ√n )) = (Φ Z1 (u/√n ))n = (1 − u2 /(2n) + o(u2 /n))n .
For any u ∈ R, taking the limit n → ∞, we get limn Φn (u) = e−u2 /2 = ΦY (u), and the result follows.
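The theorem is easy to check by simulation. The sketch below is ours, not part of the notes; Bernoulli(1/2) steps, the seed, and the tolerances are arbitrary choices. It standardizes Sn and compares the empirical law with the standard normal.

```python
import random
import statistics

random.seed(1)
n, trials = 400, 5000
mu, sigma = 0.5, 0.5  # mean and standard deviation of a Bernoulli(1/2) step

zs = []
for _ in range(trials):
    s = sum(random.random() < 0.5 for _ in range(n))  # S_n
    zs.append((s - n * mu) / (sigma * n ** 0.5))      # standardized n-sum

assert abs(statistics.mean(zs)) < 0.06          # near E Y = 0
assert abs(statistics.pstdev(zs) - 1.0) < 0.06  # near sd of Y, which is 1
# Phi(1) ~ 0.8413 for the standard normal (discreteness of S_n adds a
# small bias, hence the loose tolerance)
assert abs(sum(z <= 1.0 for z in zs) / trials - 0.8413) < 0.04
```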
2 Strong law of large numbers
Definition 2.1. For a random sequence X : Ω → RN defined on a probability space (Ω, F, P) with bounded mean E | Xn | < ∞ for all n ∈ N, we define the n-sum Sn ≜ ∑in=1 Xi and the empirical n-mean Sn /n for each n ∈ N. For each n ∈ N and ϵ > 0, we define the event
En ≜ {ω ∈ Ω : |Sn (ω ) − ESn | > nϵ} ∈ F.
Theorem 2.2 (L4 strong law of large numbers). Let X : Ω → RN be a sequence of independent random variables defined on a probability space (Ω, F, P) with bounded mean EXn for each n ∈ N and uniformly bounded fourth central moment supn∈N E( Xn − EXn )4 ⩽ B < ∞. Then, the empirical n-mean Sn /n converges to limn ESn /n almost surely.
Proof. Recall that E(Sn − ESn )4 = E(∑in=1 ( Xi − EXi ))4 = ∑in=1 E( Xi − EXi )4 + 3 ∑in=1 ∑ j̸=i E( Xi − EXi )2 E( X j − EX j )2 . Recall that when the fourth moment is bounded, then so is the second moment; hence supi∈N E( Xi − EXi )2 ⩽ C for some C ∈ R+ . Therefore, from Markov's inequality, we have
P( En ) ⩽ E(Sn − ESn )4 /(n4 ϵ4 ) ⩽ (nB + 3n(n − 1)C2 )/(n4 ϵ4 ).
It follows that ∑n∈N P( En ) < ∞, and hence by the Borel–Cantelli Lemma, we have
P { Enc for all but finitely many n} = 1.
Since the choice of ϵ was arbitrary, the result follows.
Theorem 2.3 (L2 strong law of large numbers). Let X : Ω → RN be a sequence of pair-wise uncorrelated random variables defined on a probability space (Ω, F, P) with bounded mean EXn for all n ∈ N and uniformly bounded variance supn∈N Var( Xn ) ⩽ B < ∞. Then, the empirical n-mean Sn /n converges to limn ESn /n almost surely.
Proof. For each n ∈ N, we define the events Fn ≜ En2 and
Gn ≜ {ω ∈ Ω : maxn2 ⩽k<(n+1)2 |Sk − Sn2 − E(Sk − Sn2 )| > n2 ϵ} = ∪n2 ⩽k<(n+1)2 {|∑ki=n2 +1 ( Xi − EXi )| > n2 ϵ}.
From Markov's inequality and the union bound, we have
P( Fn ) ⩽ ∑i⩽n2 Var( Xi )/(n4 ϵ2 ) ⩽ B/(n2 ϵ2 ), P( Gn ) ⩽ ∑n2 ⩽k<(n+1)2 (k − n2 + 1) B/(n4 ϵ2 ) ⩽ (2n + 1)2 B/(n4 ϵ2 ).
Therefore, ∑n∈N P( Fn ) < ∞ and ∑n∈N P( Gn ) < ∞, and hence by the Borel–Cantelli Lemma, we have
limn (Sn2 − ESn2 )/n2 = limn maxn2 ⩽k<(n+1)2 |Sk − Sn2 − E(Sk − Sn2 )|/n2 = 0 a.s.
The result follows from the fact that for any k ∈ N, there exists n ∈ N such that k ∈ {n2 , . . . , (n + 1)2 − 1}, and hence
|Sk − ESk |/k ⩽ |Sn2 − ESn2 |/n2 + |Sk − Sn2 − E(Sk − Sn2 )|/n2 .
Theorem 2.4 (L1 strong law of large numbers). Let X : Ω → RN be a random sequence defined on a probability space (Ω, F, P) such that supn∈N E | Xn | ⩽ B < ∞. Then, the empirical n-mean Sn /n converges to limn ESn /n almost surely.
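The strong laws above can be illustrated on a single sample path. The simulation below is our addition (i.i.d. Bernoulli(0.3) steps, an arbitrary seed and horizon): the empirical mean Sn /n approaches EX1 = 0.3 along one realization.

```python
import random

random.seed(2)
p, horizon = 0.3, 200_000

s = 0
gaps = []  # |S_n/n - E X_1| recorded along the path
for n in range(1, horizon + 1):
    s += random.random() < p          # S_n for i.i.d. Bernoulli(p) steps
    if n % 50_000 == 0:
        gaps.append(abs(s / n - p))

assert gaps[-1] < 0.01  # the empirical mean has settled near 0.3
```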
Lecture-17: Tractable Random Processes
1 Examples of Tractable Stochastic Processes
Recall that a random process X : Ω → XT defined on the probability space (Ω, F, P) with index set T and
state space X ⊆ R, is completely characterized by its finite dimensional distributions FXS : RS → [0, 1] for all
finite S ⊆ T, where
FXS ( xS ) ≜ P( A XS ( xS )) = P(∩s∈S Xs−1 (−∞, xs ]), xS ∈ RS .
Simpler characterizations of a stochastic process X are in terms of its moments. That is, the first moment
such as mean, and the second moment such as correlations and covariance functions.
m X (t) ≜ EXt , R X (t, s) ≜ EXt Xs , CX (t, s) ≜ E( Xt − m X (t))( Xs − m X (s)).
In general, it is very difficult to characterize a stochastic process completely in terms of its finite dimensional distributions. However, we have listed a few analytically tractable examples below, where we can completely characterize the stochastic process.
1.1 Independent and identically distributed (i.i.d. ) processes
Definition 1.1 (i.i.d. process). A random process X : Ω → XT is an independent and identically distributed
(i.i.d. ) random process with the common distribution F : R → [0, 1], if for any finite S ⊆ T and a real vector
xS ∈ RS we can write the finite dimensional distribution for this process as

FXS ( xS ) = P (∩s∈S { Xs (ω ) ⩽ xs }) = ∏s∈S F ( xs ).
Remark 1. It’s easy to verify that the first and the second moments are independent of time indices. That is,
if 0 ∈ T then Xt = X0 in distribution, and we have

m X = EX0 , R X (t, s) = (EX02 )1{t=s} + m2X 1{t̸=s} , CX (t, s) = Var( X0 )1{t=s} .
1.2 Stationary processes
Definition 1.2 (Stationary process). We consider the index set T ⊆ R that is closed under addition and
subtraction. A stochastic process X : Ω → XT is stationary if all finite dimensional distributions are shift
invariant. That is, for any finite S ⊆ T and t ∈ T, we have

FXS ( xS ) = P(∩s∈S { Xs (ω ) ⩽ xs }) = P(∩s∈S { Xs+t (ω ) ⩽ xs }) = FXt+S ( xS ).

Remark 2. That is, for any finite n ∈ N and t ∈ R, the random vectors ( Xs1 , . . . , Xsn ) and ( Xs1 +t , . . . , Xsn +t ) have the identical joint distribution for all s1 ⩽ . . . ⩽ sn .

Lemma 1.3. Any i.i.d. process with index set T ⊆ R is stationary.
Proof. Let X : Ω → XT be an i.i.d. random process, where T ⊆ R. Then, for any finite index subset S ⊆ T, t ∈ T and xS ∈ RS , we can write
FXS ( xS ) = P(∩s∈S { Xs ⩽ xs }) = ∏s∈S P ◦ Xs−1 (−∞, xs ] = ∏s∈S P ◦ ( Xt+s )−1 (−∞, xs ] = P(∩s∈S { Xt+s ⩽ xs }) = FXt+S ( xS ).
The first equality follows from the definition, the second from the independence of the process X, and the third from the identical distribution of the process X. In particular, we have shown that the process X is also stationary.

Remark 3. For a stationary stochastic process, all moments are shift invariant, whenever they exist.
Definition 1.4. A second order stochastic process X has finite auto-correlation R X (t, t) < ∞ for all t ∈ T.
Remark 4. This implies R X (t1 , t2 ) < ∞ by Cauchy-Schwartz inequality, and hence the mean, auto-correlation,
and the auto-covariance functions are well defined and finite.
Remark 5. For a stationary process X, we have Xt = X0 and ( Xt , Xs ) = ( Xt−s , X0 ) in distribution. Therefore,
for a second order stationary process X, we have

m X = EX0 , R X (t, s) = R X (t − s, 0) = EXt−s X0 , CX (t, s) = CX (t − s, 0) = R X (t − s, 0) − m2X .

Definition 1.5. A random process X is wide sense stationary if
1. m X (t) = m X (t + s) for all s, t ∈ T, and
2. R X (t, s) = R X (t + u, s + u) for all s, t, u ∈ T.
Remark 6. It follows that a second order stationary stochastic process X is wide sense stationary. A second
order wide sense stationary process is not necessarily stationary. We can similarly define joint stationarity
and joint wide sense stationarity for two stochastic processes X and Y.

Example 1.6 (Gaussian process). Let X : Ω → RR be a continuous-time Gaussian process, defined by its finite dimensional distributions. In particular, for any finite S ⊂ R, column vector xS ∈ RS , mean vector µS ≜ EXS , and covariance matrix CS ≜ E( XS − µS )( XS − µS )T , the finite-dimensional density is given by
f XS ( xS ) = (1/((2π )|S|/2 det(CS )1/2 )) exp(−(1/2)( xS − µS )T CS−1 ( xS − µS )).
Theorem 1.7. A wide sense stationary Gaussian process is stationary.

Proof. For Gaussian random processes, first and the second moment suffice to get any finite dimensional
distribution. Let X be a wide sense stationary Gaussian process and let S ⊆ R be finite. From the wide
sense stationarity of X, we have EXS = µ1S and

E( Xs − µ)( Xu − µ) = Cs−u , for all s, u ∈ S.

This means that CS = Ct+S , and the result follows.
1.3 Random Walk
Definition 1.8. Let X : Ω → XN be an i.i.d. random sequence defined on the probability space (Ω, F, P) and
the state space X = Rd . A random sequence S : Ω → XZ+ is called a random walk with step-size sequence
X, if S0 ≜ 0 and Sn ≜ ∑in=1 Xi for n ∈ N.

Remark 7. We can think of Sn as the random location of a particle after n steps, where the particle starts
from origin and takes steps of size Xi at the ith step. From the i.i.d. nature of step-size sequence, we observe
that ESn = nEX1 and CS (n, m) = (n ∧ m) Var[ X1 ].
Remark 8. For the process S : Ω → XN it suffices to look at finite dimensional distributions for the finite sets [n] ⊆ N for all n ∈ N. If the i.i.d. step-size sequence X has a common density function, then from the transformation of random vectors, we can find the finite dimensional density
f S1 ,...,Sn (s1 , s2 , . . . , sn ) = (1/| J (s)|) f X1 ,...,Xn (s1 , s2 − s1 , . . . , sn − sn−1 ) = f X1 (s1 ) ∏in=2 f X1 (si − si−1 ).
Recall that Jij (s) = ∂si /∂x j = 1{ j⩽i} , and hence J (s) is a lower triangular matrix with all non-zero entries being unity, so | J (s)| = 1.
Theorem 1.9. The stochastic process S : Ω → ZZ+ + has stationary and independent increments.
Proof. One needs to show that for m1 ⩽ · · · ⩽ mn in Z+ , the increments (Sm2 − Sm1 , . . . , Smn − Smn−1 ) are independent and distributed identically to (Sk+m2 − Sk+m1 , . . . , Sk+mn − Sk+mn−1 ) for any k ∈ Z+ . For any j ∈ [n − 1], we define the event space E j ≜ σ ( Xm j +1 , . . . , Xm j+1 ) and observe that the jth increment Sm j+1 − Sm j is measurable with respect to the jth event space E j . Since X is an independent process, it follows that (E1 , . . . , En−1 ) are independent event spaces, and hence so are the increments. Stationarity of the increments follows from the fact that the process X is i.i.d. and the jth increment Sk+m j+1 − Sk+m j is a sum of m j+1 − m j i.i.d. random variables, and hence has an identical distribution for all k ∈ Z+ .
Corollary 1.10. Let p ∈ N, n ∈ N p and k ∈ Z+p such that n1 ⩽ . . . ⩽ n p and k1 ⩽ . . . ⩽ k p . Then, we can write the joint mass function
PSn1 ,...,Sn p (k1 , . . . , k p ) = P(∩i∈[ p] {Sni = k i }) = ∏ip=1 PSni −ni−1 (k i − k i−1 ),
with the convention n0 = 0 and k0 = 0.
Proof. The result follows from the stationary and independent increment property of the random walk S.
Remark 9. For a one-dimensional random walk S : Ω → ZN + with i.i.d. step size sequence X : Ω → {0, 1}N such that P { X1 = 1} = p, the distribution of the random walk at the nth step Sn is Binomial(n, p). That is,
P {Sn = k } = (n choose k ) pk (1 − p)n−k , k ∈ {0, . . . , n} .
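The Binomial law in the remark can be verified exactly by propagating the walk's mass function one step at a time. The code is ours, not from the notes; the recursion P{Sn = k} = p P{Sn−1 = k − 1} + (1 − p) P{Sn−1 = k} follows from conditioning on the nth step.

```python
from math import comb

def walk_pmf(n, p):
    # Mass function of S_n for i.i.d. Bernoulli(p) steps, built step by step
    pmf = [1.0]  # S_0 = 0 with probability 1
    for _ in range(n):
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += (1 - p) * q      # step X = 0
            nxt[k + 1] += p * q        # step X = 1
        pmf = nxt
    return pmf

n, p = 10, 0.3
pmf = walk_pmf(n, p)
for k in range(n + 1):
    # agrees with the Binomial(n, p) mass function of the remark
    assert abs(pmf[k] - comb(n, k) * p**k * (1 - p)**(n - k)) < 1e-12
```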
1.4 Lévy processes
A right continuous with left limits stochastic process X : Ω → RT for index set T ⊆ R+ with X0 = 0 almost
surely, is a Lévy process if the following conditions hold.
(L1) The increments are independent. For any instants 0 ⩽ t1 < t2 < · · · < tn < ∞, the random variables
Xt2 − Xt1 , Xt3 − Xt2 , . . . , Xtn − Xtn−1 are independent.
(L2) The increments are stationary. For any instants 0 ⩽ t1 < t2 < · · · < tn < ∞ and time-difference s > 0,
the random vectors ( Xt2 − Xt1 , Xt3 − Xt2 , . . . , Xtn − Xtn−1 ) and ( Xs+t2 − Xs+t1 , Xs+t3 − Xs+t2 , . . . , Xs+tn −
Xs+tn−1 ) are equal in distribution.
(L3) Continuous in probability. For any ϵ > 0 and t ⩾ 0 it holds that limh→0 P({| Xt+h − Xt | > ϵ}) = 0.

Example 1.11. Two examples of Lévy processes are the Poisson process and the Wiener process. The distribution of the Poisson process at time t is Poisson with mean λt, and the distribution of the Wiener process at time t is zero mean Gaussian with variance t.
Example 1.12. A random walk S : Ω → XZ+ with i.i.d. step-size sequence X : Ω → XN , is non-stationary with
stationary and independent increments. To see non-stationarity, we observe that the mean mS (n) = nEX1
depends on the step of the random walk. We have already seen the increment process of random walks.

1.5 Markov processes
Definition 1.13. A stochastic process X is Markov if conditioned on the present state, future is independent
of the past. We denote the history of the process until time t as Ft = σ( Xs , s ⩽ t). That is, for any ordered
index set T containing any two indices u > t, we have

P({ Xu ⩽ xu } | Ft ) = P({ Xu ⩽ xu } | σ ( Xt )).

The range of the process is called the state space.

Remark 10. We next re-write the Markov property more explicitly for the process X. For all x, y ∈ X, finite
set S ⊆ T such that max S < t < u, and HS = ∩s∈S { Xs ⩽ xs } ∈ Ft , we have

P({ Xu ⩽ y} | HS ∩ { Xt ⩽ x }) = P({ Xu ⩽ y} | { Xt ⩽ x }).

Remark 11. When the state space X is countable, we can write HS = ∩s∈S { Xs = xs } and the Markov property can be written as
P({ Xu = y} | HS ∩ { Xt = x }) = P({ Xu = y} | { Xt = x }).
Remark 12. In addition, when the index set is countable, i.e. T = Z+ , then we can take past as S =
{0, . . . , n − 1}, present as instant n, and the future as n + 1. Then, the Markov property can be written
as
P({ Xn+1 = y} | Hn−1 ∩ { Xn = x }) = P({ Xn+1 = y} | { Xn = x }),
for all n ∈ Z+ , x, y ∈ X.
We will study this process in detail in coming lectures.

Example 1.14. A random walk S : Ω → XZ+ with i.i.d. step-size sequence X : Ω → XN , is a homogeneous
Markov sequence. For any n ∈ Z+ and x, y, s1 , . . . , sn−1 ∈ X, we can write the conditional probability

P({Sn+1 = y} | {Sn = x, Sn−1 = sn−1 , . . . , S1 = s1 }) = P({Sn+1 − Sn = y − x }) = P({Sn+1 = y} | {Sn = x }).
Lemma 1.15. The stochastic process S : Ω → ZZ+ + is homogeneously Markov.
Proof. Since the process has stationary and independent increments, we have

P({Sn+m = k} | {S1 = k1 , S2 = k2 , . . . , Sn = k n }) = P({Sn+m − Sn = k − k n }) = P({Sn+m = k } | {Sn = k n }).
Lecture-18: Stopping Times
1 Stopping times
Let (Ω, F, P) be a probability space. Consider a random process X : Ω → XT defined on this probability
space with state space X ⊆ R and ordered index set T ⊆ R considered as time.
Definition 1.1. A collection of event spaces denoted G• ≜ (Gt ⊆ F : t ∈ T ) is called a filtration if Gs ⊆ Gt for all s ⩽ t.
Remark 1. For the random process X : Ω → XT , we can find the event space generated by all random
variables until time t as Ft ≜ σ ( Xs , s ⩽ t). The collection of event spaces F• ≜ (Ft : t ∈ T ) is a filtration.

Definition 1.2. The natural filtration associated with a random process X : Ω → XT is given by F• ≜
(Ft : t ∈ T ) where Ft ≜ σ( Xs , s ⩽ t).
Remark 2. For a random sequence X : Ω → XN , the natural filtration is a sequence F• = (Fn : n ∈ N) of
event spaces Fn ≜ σ( X1 , . . . , Xn ) for all n ∈ N.
Remark 3. If the random sequence X is independent, then the random sequence ( Xn+ j : j ∈ N) is inde-
pendent of the event space σ( X1 , . . . , Xn ).

Example 1.3. For a random walk S with step size sequence X, the natural filtration of the random walk
is identical to that of the step size sequence. That is, σ(S1 , . . . , Sn ) = σ ( X1 , . . . , Xn ) for all n ∈ N. This
follows from the fact that for all n ∈ N, we can write S j = ∑ij=1 Xi and X j = S j − S j−1 for all j ∈ [n].
That is, there is a bijection between ( X1 , . . . , Xn ) and (S1 , . . . , Sn ).

Definition 1.4. A random variable τ : Ω → T is called a stopping time with respect to a filtration F• if
(a) the event τ −1 (−∞, t] ∈ Ft for all t ∈ T, and
(b) the random variable τ is finite almost surely, i.e. P {τ < ∞} = 1.
Remark 4. Intuitively, if we observe the process X sequentially, then the event {τ ⩽ t} can be completely determined by the observations ( Xs , s ⩽ t) until time t. The intuition behind a stopping time is that its realization is determined by the past and present events, but not by future events. That is, given the history of the process until time t, we can tell whether the stopping time is less than or equal to t or not. In particular, E[1{τ⩽t} | Ft ] = 1{τ⩽t} is either one or zero.

Definition 1.5 (First hitting time). For a process X : Ω → XT and any Borel measurable set A ∈ B(X), we
define the first hitting time τXA : Ω → T ∪ {∞} for the process X to hit states in A, as τXA ≜ inf {t ∈ T : Xt ∈ A} .

Example 1.6. We observe that the event {τXA ⩽ t} = { Xs ∈ A for some s ⩽ t} ∈ Ft for all t ∈ T. It follows that, if τXA is finite almost surely, then τXA is a stopping time with respect to the filtration F• .

Proposition 1.7. For a random sequence X : Ω → XN , an almost surely finite discrete random variable τ : Ω → N ∪ {∞} is a stopping time with respect to the random sequence X iff the event {τ = n} ∈ σ ( X1 , . . . , Xn ) for all n ∈ N.
Proof. From Definition 1.4, we have {τ = n} = {τ ⩽ n} \ {τ ⩽ n − 1} ∈ σ ( X1 , . . . , Xn ). Conversely, from
the theorem hypothesis, it follows that {τ ⩽ n} = ∪nm=1 {τ = m} ∈ σ ( X1 , . . . , Xn ).

Example 1.8. Consider a random sequence X : Ω → XN , with the natural filtration F• , and a measurable set A ∈ B(X). If the first hitting time τXA : Ω → N ∪ {∞} for the sequence X to hit set A is almost surely finite, then τXA is a stopping time. This follows from the fact that {τXA = n} = (∩k<n { Xk ∉ A}) ∩ { Xn ∈ A} ∈ Fn for each n ∈ N.

Definition 1.9. Consider a random process X : Ω → XR+ with discrete state space X ⊆ R. For each state y ∈ X, we define τX{y},0 ≜ 0 and inductively define the kth hitting time of state y after time t = 0, as
τX{y},k ≜ inf{t > τX{y},k−1 : Xt = y}, k ∈ N.
Remark 5. We observe that {τX{y},k ⩽ t} ∈ Ft for all times t ∈ R+ . Hence if τX{y},k is almost surely finite, then it is a stopping time for the process X.
Definition 1.10. For a discrete valued random sequence X : Ω → XN , the number of visits to a state
y ∈ X in first n time steps is defined as Ny (n) ≜ ∑nk=1 1{ Xk =y} for all n ∈ N.
Remark 6. We observe that Ny : Ω → ZZ+ + is a random walk with the Bernoulli step size sequence (1{ Xk =y} : k ∈ N). Further, τX{y},k = τNy{k} = inf{n ∈ N : Ny (n) = k }. We also observe that { Ny (n) ⩽ k } = {τy{k+1} > n} and { Ny (n) = k } = {τy{k} ⩽ n < τy{k+1} }.
Remark 7. We observe that the number of visits to state y in the first n steps of X is also given by
Ny (n) = sup{k ∈ Z+ : τX{y},k ⩽ n} = inf{k ∈ N : τX{y},k > n} − 1 = ∑k∈N 1{τX{y},k ⩽n} .
This implies that Ny (n) + 1 is the first hitting time of the set of states {n + 1, n + 2, . . . } for the increasing random sequence (τX{y},k : k ∈ N).

Lemma 1.11 (Wald’s Lemma). Consider a random walk S : Ω → RZ+ with i.i.d. step-sizes X : Ω → RN
having finite E | X1 |. Let τ be a finite mean stopping time with respect to this random walk. Then,

E [ S τ ] = E [ X1 ] E [ τ ] .

Proof. Recall that the event spaces generated by the random walk and the step-sizes are identical. From the independence of the step sizes, it follows that Xn is independent of σ ( X1 , . . . , Xn−1 ). Since τ is a stopping time with respect to the random walk S, we observe that {τ ⩾ n} = {τ > n − 1} ∈ σ ( X1 , . . . , Xn−1 ), and hence the random variable Xn and the indicator 1{τ⩾n} are independent, with E[ Xn 1{τ⩾n} ] = EX1 E1{τ⩾n} . Therefore,
E[Sτ ] = E[∑τn=1 Xn ] = E[∑n∈N Xn 1{τ⩾n} ] = ∑n∈N EX1 E[1{τ⩾n} ] = EX1 E[∑n∈N 1{τ⩾n} ] = E[ X1 ]E[τ ].
We exchanged limit and expectation in the above step, which is not always allowed; here it is justified by an application of the dominated convergence theorem.
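Wald's Lemma is easy to check by Monte Carlo. The sketch below is our addition; the steps uniform on {1, 2} and the level 20 are arbitrary choices. With τ the first time the walk reaches the level (a stopping time, since {τ ⩽ n} depends only on the first n steps), the estimates of ESτ and EX1 · Eτ agree.

```python
import random

random.seed(3)
trials, level = 20000, 20
mean_step = 1.5  # E X_1 for steps uniform on {1, 2}

s_tau_total, tau_total = 0, 0
for _ in range(trials):
    s = n = 0
    while s < level:               # tau = inf{n : S_n >= level}
        s += random.choice((1, 2))
        n += 1
    s_tau_total += s               # accumulate S_tau
    tau_total += n                 # accumulate tau

# Wald: E S_tau = E X_1 * E tau
assert abs(s_tau_total / trials - mean_step * tau_total / trials) < 0.05
```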
Corollary 1.12. Consider the stopping time τS{i} ≜ min {n ∈ N : Sn = i } for an integer random walk S : Ω → ZZ+ + with i.i.d. step size sequence X : Ω → ZN . Then, the mean of the stopping time is EτS{i} = i/EX1 .
Proof. This follows from Wald's Lemma and the fact that SτS{i} = i.
1.1 Properties of stopping time
Lemma 1.13. Let τ1 , τ2 be two stopping times with respect to a filtration F• . Then the following hold true.
i min {τ1 , τ2 } is a stopping time.
ii If T is separable, then τ1 + τ2 is a stopping time.
Proof. Let τ1 , τ2 be stopping times with respect to a filtration F• = (Ft : t ∈ T ).
i The result follows since the event {min {τ1 , τ2 } > t} = {τ1 > t} ∩ {τ2 > t} ∈ Ft .
ii A topological space is called separable if it contains a countable dense set. Since R+ is separable and ordered, we assume T = R+ without any loss of generality. It suffices to show that the event {τ1 + τ2 ⩽ t} ∈ Ft for T = R+ . To this end, we observe that
{τ1 + τ2 ⩽ t} = ∪s∈Q+ : s⩽t {τ1 ⩽ t − s, τ2 ⩽ s} ∈ Ft .
1.2 Strong independence property and applications
Theorem 1.14 (Strong independence property). Let X : Ω → XR+ be an independent random process with natural filtration F• , and τ : Ω → R+ a stopping time with respect to F• . Then ( Xτ+s : s ∈ R+ ) is independent of the history ( Xs : s ⩽ τ ).
Definition 1.15. We can define the kth return time to state y for the random process X : Ω → XR+ as the interval between two successive visits to state y, that is, for all k ∈ N
HX{y},k ≜ τX{y},k − τX{y},k−1 = inf{s ∈ R+ : XτX{y},k−1 +s = y}.
Remark 8. We observe that τX{y},k = ∑kj=1 HX{y},j for all k ∈ N. That is, the hitting time sequence (τX{y},k : k ∈ N) is a random walk with the step-size sequence being the return times ( HX{y},j : j ∈ N). Therefore, if HX{y},j is almost surely finite for all j ∈ [k], then the finite sum τX{y},k is almost surely finite for all k ∈ N.
Remark 9. From the bijection between hitting and return times, we observe that σ (τX{y},k : k ∈ [n]) = σ ( HX{y},j : j ∈ [n]). Recall that τX{y},Ny (n)+1 > n by definition. If the return times are independent, and i.i.d. from the second return time onwards, then it follows from Wald's Lemma that
EHX{y},1 + E[ Ny (n)]EHX{y},2 ⩾ n + 1.
Example 1.16. We also observe that the kth hitting time of {1} by a Bernoulli step size sequence X : Ω → {0, 1}N is the first hitting time of {k } by the random walk S : Ω → ZZ+ + . That is,
τX{1},k = τS{k} = inf {n ∈ N : Sn = k } = τS{k−1} + inf{n ∈ N : SτS{k−1} +n − SτS{k−1} = 1}.
Lemma 1.17. For an i.i.d. Bernoulli random sequence X : Ω → {0, 1}N with EX1 ∈ (0, 1), the kth hitting time of state 1 is a stopping time, and τX{1},k = ∑ik=1 Yi , where Y : Ω → NN is an i.i.d. random sequence distributed identically to τX{1},1 .
Proof. When the Bernoulli step size sequence X is i.i.d. with EX1 = p ∈ (0, 1), we get that P{τS{1} = n} = (1 − p)n−1 p for all n ∈ N. It follows that
P{τS{1} < ∞} = P(∪n∈N {τS{1} = n}) = ∑n∈N P{τS{1} = n} = 1.
Hence, the random time τS{1} is finite almost surely. We will show that τS{k} is finite almost surely for all k ∈ N by induction. By the induction hypothesis, τS{k−1} is finite almost surely. Then SτS{k−1} +n − SτS{k−1} = ∑nj=1 XτS{k−1} +j is the sum of n i.i.d. Bernoulli random variables independent of τS{k−1} by the strong independence property, and hence has distribution identical to Sn . This implies that τS{k} = τS{k−1} + τS{1} , where τS{1} has the identical distribution to τX{1},1 and is independent of τS{k−1} .
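A simulation sketch of the lemma (ours; p = 1/4 and k = 3 are arbitrary choices): for i.i.d. Bernoulli(p) steps, the first hitting time of state 1 is Geometric(p), so the kth hitting time, a sum of k i.i.d. geometric return times, has mean k/p.

```python
import random

random.seed(4)
p, k, trials = 0.25, 3, 20000

total = 0
for _ in range(trials):
    hits = n = 0
    while hits < k:                 # run until the k-th visit to state 1
        n += 1
        hits += random.random() < p
    total += n                      # tau^{1,k}: time of the k-th hit

assert abs(total / trials - k / p) < 0.3  # E tau^{1,k} = k/p = 12
```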
Lecture-19: Discrete Time Markov Chains
1 Markov processes
We have seen that i.i.d. sequences are the easiest discrete time random processes. However, they don't capture correlation well.
Definition 1.1. A stochastic process X : Ω → XT with state space X and ordered index set T is Markov if
conditioned on the present state Xt , future σ ( Xu , u > t) is independent of the past σ ( Xs , s < t). We denote
the history of the process until time t as Ft ≜ σ ( Xs , s ⩽ t). That is, for any Borel measurable set B ∈ B(X)
and any two indices u > t, we have
P({ Xu ∈ B} | Ft ) = P({ Xu ∈ B} | σ( Xt )).
Remark 1. We next re-write the Markov property more explicitly for the process X : Ω → RR . For all x, y ∈ X,
finite set S ⊆ R such that max S < t < u, and HS ( xS ) ≜ ∩s∈S { Xs ⩽ xs } ∈ Ft , we have
P({ Xu ⩽ y} | HS ( xS ) ∩ { Xt ⩽ x }) = P({ Xu ⩽ y} | { Xt ⩽ x }).

1.1 Discrete time Markov chains


Definition 1.2. For a state space X ⊆ R and the random sequence X : Ω → X^{Z+}, we define the history until time n ∈ Z+ as F_n ≜ σ(X_0, X_1, . . . , X_n).
Remark 2. Recall that the event space F_n is generated by the historical events A_X(x) ≜ ∩_{i=0}^n {X_i ⩽ x_i}, where x ∈ R^{n+1}.
Remark 3. When the state space X is countable, the event space F_n is generated by the historical events H_n(x) ≜ ∩_{i=0}^n {X_i = x_i}, where x ∈ X^{n+1}. That is, F_n = σ(H_n(x) : x ∈ X^{n+1}).
Definition 1.3. For a countable set X, a discrete-valued random sequence X : Ω → X^{Z+} is called a discrete time Markov chain (DTMC) if for all n ∈ Z+, all states x, y ∈ X, and any historical event H_{n−1} = ∩_{m=0}^{n−1} {X_m = x_m} ∈ F_{n−1} for (x_0, . . . , x_{n−1}) ∈ X^n, the process X satisfies the Markov property

P({X_{n+1} = y} | H_{n−1} ∩ {X_n = x}) = P({X_{n+1} = y} | {X_n = x}).

Remark 4. The above definition is equivalent to P({X_{n+1} ⩽ x} | F_n) = P({X_{n+1} ⩽ x} | σ(X_n)) for a discrete time discrete state space Markov chain X, since F_n = σ(H_n(x) : x ∈ X^{n+1}) and σ(X_n) = σ({X_n = x} : x ∈ X).

Example 1.4 (Random Walk). A random walk S : Ω → X^N with an independent step-size sequence X : Ω → X^N is Markov for a countable state space X that is closed under addition. Given a historical event H_{n−1}(s) ≜ ∩_{k=1}^{n−1} {S_k = s_k} and the current state {S_n = s_n}, we can write the conditional probability

P({S_{n+1} = s_{n+1}} | H_{n−1}(s) ∩ {S_n = s_n}) = P({X_{n+1} = s_{n+1} − s_n} | H_{n−1}(s) ∩ {S_n = s_n})
= P{X_{n+1} = s_{n+1} − s_n} = P({S_{n+1} = s_{n+1}} | {S_n = s_n}).

The equality in the second line follows from the independence of the step-size sequence; in particular, from the independence of X_{n+1} from the collection σ(S_0, X_1, . . . , X_n) = σ(S_0, S_1, . . . , S_n).

1.2 Transition probability matrix
Definition 1.5. We denote the set of all probability mass functions over a countable state space X by M(X) ≜ {ν ∈ [0,1]^X : ∑_{x∈X} ν_x = 1}.


Definition 1.6. The transition probability matrix at time n is denoted by P(n) ∈ [0,1]^{X×X}, such that its (x, y)th entry p_xy(n) ≜ P({X_{n+1} = y} | {X_n = x}) is the transition probability of a discrete time Markov chain X from a state x ∈ X at time n to a state y ∈ X at time n + 1.

Remark 5. We observe that each row Px (n) ≜ ( p xy (n) : y ∈ X) ∈ M(X) is the conditional distribution of Xn+1
given the event { Xn = x }.

Definition 1.7. A matrix A ∈ R_+^{X×X} with non-negative entries is called sub-stochastic if the row-sum ∑_{y∈X} a_xy ⩽ 1 for all rows x ∈ X. If this property holds with equality for all rows, then A is called a stochastic matrix. If the matrices A and A^T are both stochastic, then A is called doubly stochastic.
Remark 6. We make the following observations for stochastic matrices.
i Every probability transition matrix P(n) is a stochastic matrix.
ii All the entries of a sub-stochastic matrix lie in [0, 1].
iii Each row A_x ≜ (a_xy : y ∈ X) of a stochastic matrix A ∈ R_+^{X×X} belongs to M(X).
iv Every finite stochastic matrix has a right eigenvector with unit eigenvalue. This can be observed by taking 1^T = (1 . . . 1) to be the all-one vector of length |X|. Then we see that A1 = 1, since

(A1)_x = ∑_{y∈X} a_xy · 1 = ∑_{y∈X} a_xy = 1, for each x ∈ X.

v Every finite doubly stochastic matrix has both a left and a right eigenvector with unit eigenvalue. This follows from the fact that the finite stochastic matrices A and A^T have the common right eigenvector 1, and hence A has the left eigenvector 1^T.
vi For a probability transition matrix P(n), we have ∑_{y∈X} f(y) p_xy(n) = E[f(X_{n+1}) | {X_n = x}].
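The observations above are easy to verify mechanically; a small Python sketch with an arbitrary 3-state matrix (the numerical entries are illustrative assumptions, not from the notes):

```python
# Checks on a hypothetical 3-state transition matrix: each row is a
# distribution in M(X), so rows sum to 1, which is exactly the
# right-eigenvector identity A·1 = 1 of items i-iv.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
assert all(abs(sum(row) - 1.0) < 1e-9 for row in P)  # stochastic rows

# A doubly stochastic matrix: A and A^T are both stochastic, so the
# column sums are also 1 and 1^T is a left eigenvector (item v).
D = [[0.2, 0.8],
     [0.8, 0.2]]
col_sums = [D[0][y] + D[1][y] for y in range(2)]
print(col_sums)  # both sums equal 1 up to float rounding
```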

1.3 Homogeneous Markov chains


In general, not much can be said about Markov chains with index dependent transition probabilities. We
consider the simpler case where the transition probabilities p xy (n) = p xy are independent of the index.
Definition 1.8. A discrete time Markov chain whose probability transition matrix P(n) = P is independent of the index n is called time homogeneous.

Example 1.9 (Integer random walk). For a one-dimensional integer valued random walk S : Ω → ZN with
i.i.d. unit step size sequence X : Ω → {−1, 1}N such that P { X1 = 1} = p, the transition operator P ∈ [0, 1]Z×Z
is given by the entries p xy = p1{y= x+1} + (1 − p)1{y= x−1} for all x, y ∈ Z.

Example 1.10 (Sequence of experiments). Consider a random sequence of experiment outcomes X : Ω →


{0, 1}Z+ , such that P({ Xn+1 = 0} | { Xn = 0}) = 1 − q and P({ Xn+1 = 1} | { Xn = 1}) = 1 − p for all n ∈ Z+ .
Then, we can write the probability transition matrix as
 
1−q q
P= .
p 1− p

Definition 1.11. Consider a time homogeneous Markov chain X : Ω → X^{Z+} with countable state space X and transition matrix P. We respectively denote the conditional probability of events and the conditional expectation of random variables, conditioned on the initial state {X_0 = x}, by

P_x(A) ≜ P(A | {X_0 = x}), E_x[Y] ≜ E[Y | {X_0 = x}].

Proposition 1.12. Conditioned on the initial state, any finite dimensional distribution of a homogeneous Markov chain is stationary. That is, for any finite n, m ∈ Z+ and states x_0, . . . , x_n ∈ X, we have

P(∩_{i=1}^n {X_i = x_i} | {X_0 = x_0}) = P(∩_{i=1}^n {X_{m+i} = x_i} | {X_m = x_0}) = ∏_{i=1}^n p_{x_{i−1} x_i}.

Proof. Consider a homogeneous Markov chain X : Ω → X^{Z+} with natural filtration F_• such that F_n ≜ σ(X_0, . . . , X_n) for each n ∈ Z+. Using the property of conditional probabilities and the Markovity of X, we can write the conditional probability of the sample path (X_1, . . . , X_n) given the event {X_0 = x_0} as

P(∩_{i=1}^n {X_i = x_i} | {X_0 = x_0}) = ∏_{i=1}^n P({X_i = x_i} | ∩_{j=0}^{i−1} {X_j = x_j}) = ∏_{i=1}^n P({X_i = x_i} | {X_{i−1} = x_{i−1}}).

Similarly, we can write the conditional probability of the sample path (X_{m+1}, . . . , X_{m+n}) given the event {X_m = x_0} as

P(∩_{i=1}^n {X_{m+i} = x_i} | {X_m = x_0}) = ∏_{i=1}^n P({X_{m+i} = x_i} | {X_{m+i−1} = x_{i−1}}).

From the time-homogeneity of the transition probabilities of the Markov chain X, it follows that both conditional probabilities are identical and equal to ∏_{i=1}^n p_{x_{i−1} x_i}.
Corollary 1.13. The n-step transition probabilities are stationary for any homogeneous Markov chain. That is, for any states x_0, x_n ∈ X and n, m ∈ N, we have P({X_{n+m} = x_n} | {X_m = x_0}) = P({X_n = x_n} | {X_0 = x_0}).

Proof. It follows from summing over all possible paths (X_0, . . . , X_n) and (X_m, . . . , X_{m+n}). In particular, we can partition the events {X_n = x_n} and {X_{m+n} = x_n} in terms of unions over disjoint paths

{X_n = x_n} = ∪_{x∈X^{n−1}} ∩_{i=1}^n {X_i = x_i}, {X_{m+n} = x_n} = ∪_{x∈X^{n−1}} ∩_{i=1}^n {X_{m+i} = x_i}.

The result follows from the countable additivity of conditional probability for the disjoint events of taking distinct paths, and the fact that the probability of taking the same path is identical for both sums.

1.4 Transition graph


A time homogeneous Markov chain X : Ω → XN with a probability transition matrix P, is sometimes repre-
sented by a directed weighted graph G = (X, E, w), where the set of nodes in the graph G is the state space
X, and the set of directed edges is the set of possible one-step transitions indicated by the initial and the
final state, as
E ≜ {[x, y⟩ ∈ X × X : p_xy > 0} .

In addition, this graph has a weight w_e = p_xy on each edge e = [x, y⟩ ∈ E.

Example 1.14 (Integer random walk). The time homogeneous Markov chain in Example 1.9 can be repre-
sented by an infinite state weighted graph G = (Z, E, w), where the edge set is

E = {(n, n + 1) : n ∈ Z} ∪ {(n, n − 1) : n ∈ Z} .

We have plotted the sub-graph of the entire transition graph for states {−1, 0, 1} in Figure 1.

Example 1.15 (Sequence of experiments). The time homogeneous Markov chain in Example 1.10 can be
represented by the following two-state weighted transition graph G = ({0, 1} , E, w), plotted in Figure 2.

Figure 1: Sub-graph of the entire transition graph for an integer random walk with i.i.d. step-sizes in {−1, 1} with probability p for the positive step.

Figure 2: Markov chain for the sequence of experiments with two outcomes.
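The edge set E and weights w of a transition graph can be read off directly from the transition matrix; a Python sketch for the two-state chain of Example 1.10 (the values p = 0.3, q = 0.6 are illustrative assumptions):

```python
# Reading the weighted transition graph G = (X, E, w) off a transition
# matrix, here the two-state chain of Example 1.10.
p, q = 0.3, 0.6
P = {(0, 0): 1 - q, (0, 1): q, (1, 0): p, (1, 1): 1 - p}

E = sorted((x, y) for (x, y), weight in P.items() if weight > 0)  # edges [x, y>
w = {e: P[e] for e in E}                                          # weights w_e = p_xy
print(E)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```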

1.5 Random mapping theorem


We saw some examples of Markov processes where X_n = X_{n−1} + Z_n, and Z : Ω → Λ^N is an i.i.d. sequence independent of the initial state X_0. We will show that any discrete time Markov chain is of this form, where the sum is replaced by an arbitrary function.
Theorem 1.16 (Random mapping theorem). For any DTMC X : Ω → X^{Z+}, there exist an i.i.d. sequence Z : Ω → Λ^N and functions f_n : X × Λ → X such that X_n = f_n(X_{n−1}, Z_n) for all n ∈ N.
Remark 7. A random mapping representation of a transition matrix P(n) on state space X is a function
f n : X × Λ → X, along with a random variable Zn : Ω → Λ, satisfying for all x, y ∈ X,

P { f n ( x, Zn ) = y} = p xy (n).

Proof. It suffices to show that every transition matrix P(n) has a random mapping representation. Then, for the mappings f_n and the i.i.d. sequence Z : Ω → Λ^N, we would have X_n = f_n(X_{n−1}, Z_n) for all n ∈ N.
Let Λ ≜ [0, 1], and choose the i.i.d. uniform sequence Z : Ω → Λ^N. Since X is countable, it can be ordered, and we let X = N without any loss of generality. We set F_{x,y}(n) ≜ ∑_{w⩽y} p_xw(n) and define the function f_n : X × Λ → X for all pairs (x, z) ∈ X × Λ by

f_n(x, z) ≜ ∑_{y∈N} y 1{F_{x,y−1}(n) < z ⩽ F_{x,y}(n)} = inf{y ∈ X : z ⩽ F_{x,y}(n)}.

The random variable f_n(x, Z_n) is discrete and takes the value y ∈ X iff the uniform random variable Z_n lies in the interval (F_{x,y−1}(n), F_{x,y}(n)]. That is, {f_n(x, Z_n) = y} = {Z_n ∈ (F_{x,y−1}(n), F_{x,y}(n)]} for all y ∈ X. It follows that

P{f_n(x, Z_n) = y} = P{F_{x,y−1}(n) < Z_n ⩽ F_{x,y}(n)} = F_{x,y}(n) − F_{x,y−1}(n) = p_xy(n).
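The inverse-CDF construction in this proof is also a practical simulation recipe; a Python sketch (the 3-state matrix is an illustrative assumption) checks empirically that f(x, Z) with Z uniform reproduces the row P_x:

```python
import random

# Sketch of the random mapping representation: with Λ = [0,1] and Z_n i.i.d.
# Uniform[0,1], the map f(x, z) = inf{y : z <= F_{x,y}} has P{f(x, Z) = y} = p_xy.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]

def f(x, z):
    """Inverse-CDF map: smallest y with z <= F_{x,y} = sum_{w <= y} p_xw."""
    cum = 0.0
    for y, pxy in enumerate(P[x]):
        cum += pxy
        if z <= cum:
            return y
    return len(P[x]) - 1  # guard against floating-point round-off

rng = random.Random(7)
trials = 100000
counts = [0, 0, 0]
for _ in range(trials):
    counts[f(0, rng.random())] += 1
print([c / trials for c in counts])  # approximately the row P_0 = [0.5, 0.3, 0.2]
```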

Lecture-20: Strong Markov Property

1 n-step transition
Definition 1.1. For a time homogeneous Markov chain X : Ω → X^{Z+}, we can define the n-step transition probability matrix P^{(n)}, with its (x, y) entry being the n-step transition probability for X_{m+n} to be in state y given the event {X_m = x}. That is, p_xy^{(n)} ≜ P({X_{n+m} = y} | {X_m = x}) for all x, y ∈ X and m, n ∈ Z+.
Remark 1. That is, the row P_x^{(n)} = (p_xy^{(n)} : y ∈ X) ∈ M(X) is the conditional distribution of X_n given the initial state {X_0 = x}.
Theorem 1.2. The n-step transition probabilities for a homogeneous Markov chain form a semi-group. That is, for
all positive integers m, n ∈ Z+
P(m+n) = P(m) P(n) .

Proof. The events {{X_m = z} : z ∈ X} partition the sample space Ω, and hence we can express the event {X_{m+n} = y} as the following disjoint union

{X_{m+n} = y} = ∪_{z∈X} {X_{m+n} = y, X_m = z}.

It follows from the Markov property and the law of total probability that for any states x, y and positive integers m, n,

p_xy^{(m+n)} = ∑_{z∈X} P_x({X_{n+m} = y, X_m = z}) = ∑_{z∈X} P({X_{n+m} = y} | {X_m = z, X_0 = x}) P_x({X_m = z})
= ∑_{z∈X} P({X_{n+m} = y} | {X_m = z}) P_x({X_m = z}) = ∑_{z∈X} p_xz^{(m)} p_zy^{(n)} = (P^{(m)} P^{(n)})_{xy}.

Since the choice of states x, y ∈ X was arbitrary, the result follows.

Corollary 1.3. The n-step transition probability matrix is given by P(n) = Pn for any positive integer n.

Proof. In particular, we have P(n+1) = P(n) P(1) = P(1) P(n) . Since P(1) = P, we have P(n) = Pn by induction.
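The semigroup property can be confirmed numerically; a Python sketch with a hypothetical 2-state matrix (pure-Python matrix helpers, no external libraries):

```python
# Numerical check of the semigroup property P^(m+n) = P^(m) P^(n)
# (Theorem 1.2, Corollary 1.3) for a hypothetical 2-state chain.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(A, n):
    # repeated multiplication starting from the identity matrix
    R = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        R = mat_mul(R, A)
    return R

P = [[0.9, 0.1],
     [0.4, 0.6]]
lhs = mat_pow(P, 5)                          # P^(5)
rhs = mat_mul(mat_pow(P, 2), mat_pow(P, 3))  # P^(2) P^(3)
ok = all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
print(ok)  # True
```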

Definition 1.4. For a time homogeneous Markov chain X : Ω → XZ+ we denote the probability mass func-
tion of Markov chain at step n by πn ∈ M(X).
Lemma 1.5 (Chapman Kolmogorov). The right multiplication of a probability vector with the transition matrix
P transforms the probability distribution of current state to probability distribution of the next state. That is,

πn+1 = πn P, for all n ∈ N.

Proof. To see this, we fix y ∈ X, and from the law of total probability and the definition of conditional probability, we observe that

π_{n+1}(y) = P{X_{n+1} = y} = ∑_{x∈X} P{X_{n+1} = y, X_n = x} = ∑_{x∈X} P{X_n = x} p_xy = (π_n P)_y.
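Iterating the recursion π_{n+1} = π_n P is a direct computation; a Python sketch for the two-state chain of Example 1.10 (the values p = 0.2, q = 0.5 are illustrative assumptions) shows the iterates settling at a fixed point of π = πP:

```python
# Iterating pi_{n+1} = pi_n P (Lemma 1.5) for the two-state chain of
# Example 1.10; the iterates approach the fixed point of pi = pi P,
# which for this chain is (p/(p+q), q/(p+q)) = (2/7, 5/7).
p, q = 0.2, 0.5
P = [[1 - q, q],
     [p, 1 - p]]

pi = [1.0, 0.0]  # start deterministically in state 0
for _ in range(50):
    pi = [pi[0] * P[0][y] + pi[1] * P[1][y] for y in range(2)]
print(pi)  # close to (0.2857..., 0.7142...)
```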

2 Strong Markov property (SMP)
We are interested in generalizing the Markov property to random times. For a DTMC X : Ω → X^{Z+} and a random variable τ : Ω → N, we are interested in knowing whether for any historical event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n} and any states x, y ∈ X, we have

P({X_{τ+1} = y} | H_{τ−1} ∩ {X_τ = x}) = p_xy.

Example 2.1 (Two-state DTMC). Consider the two-state Markov chain X ∈ {0,1}^{Z+} such that P_0{X_1 = 1} = q and P_1{X_1 = 0} = p for p, q ∈ [0, 1]. Let τ : Ω → N be a random variable defined as

τ ≜ sup {n ∈ N : X_i = 0 for all i ⩽ n} .

That is, {τ = n} = {X_1 = 0, . . . , X_n = 0, X_{n+1} = 1}. Hence, for the historical event H_{τ−1} = {X_1 = · · · = X_{τ−1} = 0}, the conditional probability P({X_{τ+1} = 1} | H_{τ−1} ∩ {X_τ = 0}) = 1, and not q. Note that τ is not a stopping time, since the event {τ = n} depends on the future value X_{n+1}.

Definition 2.2. Let τ : Ω → N be a stopping time with respect to a random sequence X : Ω → X^{Z+}. Then, for all states x, y ∈ X and the event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n}, the process X satisfies the strong Markov property if

P({X_{τ+1} = y} | {X_τ = x} ∩ H_{τ−1}) = P({X_{τ+1} = y} | {X_τ = x}).
Lemma 2.3. Homogeneous Markov chains satisfy the strong Markov property.
Proof. Let X : Ω → X^{Z+} be a homogeneous DTMC with transition matrix P, and τ : Ω → N an associated stopping time. We take any historical event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n} and states x, y ∈ X. From the definition of conditional probability, the law of total probability, and the Markovity of the process X, we have

P({X_{τ+1} = y} | H_{τ−1} ∩ {X_τ = x}) = ∑_{n∈Z+} P({X_{n+1} = y, X_n = x} ∩ H_{n−1} ∩ {τ = n}) / P({X_τ = x} ∩ H_{τ−1})
= ∑_{n∈Z+} P({X_{n+1} = y} | {X_n = x} ∩ H_{n−1} ∩ {τ = n}) P({τ = n} | {X_τ = x} ∩ H_{τ−1})
= p_xy ∑_{n∈Z+} P({τ = n} | {X_τ = x} ∩ H_{τ−1}) = p_xy.

The last equality uses the fact that the event {τ = n} is completely determined by (X_0, . . . , X_n), since τ is a stopping time.
Remark 2. Consider a homogeneous DTMC X : Ω → X^{Z+} and the first instant τ_k ≜ τ_X^{{y},k} for the process X to hit a state y ∈ X for the kth time. Recall that τ_0 ≜ 0 and the recurrence time H_k ≜ τ_k − τ_{k−1} = inf{n ∈ N : X_{τ_{k−1}+n} = y} for all k ∈ N. We define a process Y : Ω → X^{Z+} where Y_m ≜ X_{τ_k+m} for all m ∈ Z+. If τ_k is almost surely finite, then it is a stopping time with respect to the process X. Using the strong Markov property of the DTMC X, we will show that Y is a stochastic replica of X with X_0 = y.

3 Hitting and Recurrence Times


We will consider a time-homogeneous discrete time Markov chain X : Ω → X^{Z+} on a countable state space X with transition probability matrix P : X × X → [0, 1] and initial state X_0 = x ∈ X. We denote the natural filtration generated by the process X as F_•, where F_n ≜ σ(X_0, . . . , X_n) for all n ∈ Z+.
Remark 3. Starting from state x, the mean number of visits to state y in n steps is E_x N_y(n) = ∑_{k=1}^n p_xy^{(k)}. From the monotone convergence theorem, we also get that E_x N_y(∞) = ∑_{k∈N} p_xy^{(k)}.

Remark 4. If τ_{k−1} is almost surely finite, then τ_{k−1} is a stopping time for the process X. From the strong Markov
property of homogeneous DTMC X applied to stopping time τk−1 , it follows that the future σ( Xτk−1 + j : j ∈
N) is independent of the past σ ( X0 , . . . , Xτk−1 ) given the present σ ( Xτk−1 ). Since Xτk−1 = y for k ⩾ 2 deter-
ministically, it follows that σ ( Xτk−1 ) is a trivial event space and the future σ ( Xτk−1 + j : j ∈ N) is independent
of the random past σ ( X0 , . . . , Xτk−1 ). We further observe that the distribution of σ ( Xτk−1 + j : j ∈ N) is iden-
tical to distribution of X given X0 = y. Thus, the process ( Xτk−1 + j : j ∈ N) is distributed identically for all
k ⩾ 2.
Remark 5. We observe that the recurrence time satisfies { Hk = n} ∈ σ ( Xτk−1 + j : j ∈ [n]) for all n ∈ N, and
hence the recurrence time Hk is independent of the random past σ ( X0 , . . . , Xτk−1 ). Recursively applying this
fact, we can conclude that ( H1 , . . . , Hk ) are independent random variables. Further, since ( Xτk−1 + j : j ∈ N)
is distributed identically for all k ⩾ 2, it follows that ( Hk : k ⩾ 2) are distributed identically.
Lemma 3.1. If H1 and H2 are almost surely finite, then the random sequence ( Hk : k ⩾ 2) is i.i.d. .
Proof. From the above two remarks, it suffices to show that each term of the random sequence τ : Ω → N^N is almost surely finite. We show this by induction. Since τ_1 = H_1 is almost surely finite, τ_1 is a stopping time. Since τ_2 = τ_1 + H_2 is almost surely finite, τ_2 is a stopping time. By the inductive hypothesis, τ_{k−1} is almost surely finite, and hence H_k is independent of (H_1, . . . , H_{k−1}), identically distributed to H_2, and almost surely finite. It follows that τ_k = τ_{k−1} + H_k is almost surely finite, and the result follows.
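The independence and identical distribution of the return times can be probed by simulation; a Python sketch for a symmetric two-state chain (all names and parameter choices are illustrative assumptions) estimates the mean recurrence time of state 0, which for this chain is µ_00 = 2:

```python
import random

# Simulation sketch for Lemma 3.1: record the successive return times
# H_1, H_2, ... of a symmetric two-state chain to state 0.  Each row of
# the transition matrix is (0.5, 0.5), so the invariant probability of
# state 0 is 1/2 and the mean recurrence time is mu_00 = 2.
rng = random.Random(3)

def step(x):
    return 0 if rng.random() < 0.5 else 1  # same transition law from both states

x, last_visit, returns = 0, 0, []
for n in range(1, 200001):
    x = step(x)
    if x == 0:
        returns.append(n - last_visit)
        last_visit = n

mean_H = sum(returns) / len(returns)
print(mean_H)  # close to mu_00 = 2
```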

Lecture-21: Recurrent and transient states

1 Recurrence and Transience


We will consider a random sequence X : Ω → X^{Z+} with initial state X_0 = x ∈ X and the kth hitting times to state y for all k ∈ N, denoted by τ_k ≜ τ_X^{{y},k} and inductively defined as τ_k ≜ inf{n > τ_{k−1} : X_n = y}, where τ_0 ≜ 0. We define the inter-return time sequence H : Ω → N^N as H_k ≜ H_X^{{y},k} = τ_k − τ_{k−1} for all k ∈ N.
Definition 1.1. For a random sequence X : Ω → X^{Z+} with initial state X_0 = x,
(i) the probability of hitting state y eventually is denoted by f_xy ≜ P_x{τ_1 < ∞}, and
(ii) the probability of the first visit to state y at time n ∈ N is denoted by f_xy^{(n)} ≜ P_x{τ_1 = n}.
Remark 1. We can write the finiteness of the hitting time τ_1 as the disjoint union {τ_1 < ∞} = ∪_{n∈N} {τ_1 = n}. Therefore, f_xy = ∑_{n∈N} f_xy^{(n)}.
Remark 2. If f xy = Px {τ1 < ∞} = 1 for all initial states x ∈ X, then τ1 is almost surely finite and hence a
stopping time.
Definition 1.2. From the initial state x, the distribution
(i) of the first hitting time to state y is called the first passage time distribution, denoted by ((f_xy^{(n)} : n ∈ N), 1 − f_xy), and
(ii) of the first return time to state x is called the first recurrence time distribution, denoted by ((f_xx^{(n)} : n ∈ N), 1 − f_xx).
Definition 1.3. A state y ∈ X is called recurrent if f yy = 1, and is called transient if f yy < 1.
Definition 1.4. For any state y ∈ X, the mean recurrence time is denoted by µyy ≜ Ey τ1 .
Remark 3. The mean recurrence time for any transient state is infinite. For any recurrent state y ∈ X, we write τ_1 = τ_1 1{τ_1 < ∞} = ∑_{n∈N} n 1{τ_1 = n} almost surely, and the mean recurrence time is given by µ_yy = ∑_{n∈N} n f_yy^{(n)}.

Definition 1.5. For a recurrent state y ∈ X,


(i) if the mean recurrence time is finite, then the state y is called positive recurrent, and
(ii) if the mean recurrence time is infinite, then the state y is called null recurrent.
Proposition 1.6. For a homogeneous discrete time Markov chain X : Ω → X^{Z+}, we have

P_x{N_y(∞) = k} = 1 − f_xy for k = 0, and P_x{N_y(∞) = k} = f_xy f_yy^{k−1} (1 − f_yy) for k ∈ N.

Proof. We can write the event of zero visits to state y as {N_y(∞) = 0} = {τ_1 = ∞}. Further, we can write the event of k visits to state y as

{N_y(∞) = k} = {τ_k < ∞} ∩ {τ_{k+1} = ∞} = ∩_{j=1}^k {H_j < ∞} ∩ {H_{k+1} = ∞}, k ∈ N.

Recall that H : Ω → N^N is an independent random sequence with the subsequence (H_k : k ⩾ 2) identically distributed, with P_x{H_j = n} = P_y{τ_1 = n} for all j ⩾ 2. Therefore, we get

P_x{N_y(∞) = k} = P_x{H_1 < ∞} ∏_{j=2}^k P_x{H_j < ∞} P_x{H_{k+1} = ∞} = f_xy f_yy^{k−1} (1 − f_yy).

Corollary 1.7. For a homogeneous Markov chain X, we have P_x{N_y(∞) < ∞} = 1{f_yy < 1} + (1 − f_xy) 1{f_yy = 1}.

Proof. We can write the event {N_y(∞) < ∞} as the disjoint union of the events {N_y(∞) = k}, to get the result.

Remark 4. For a time homogeneous Markov chain X : Ω → X^{Z+}, we have
(i) P_x{N_y(∞) = ∞} = f_xy 1{f_yy = 1}, and
(ii) P_y{N_y(∞) = ∞} = 1{f_yy = 1}.


Corollary 1.8. The mean number of visits to state y, starting from a state x, is E_x N_y(∞) = (f_xy/(1 − f_yy)) 1{f_yy < 1} + ∞ 1{f_xy > 0, f_yy = 1}.

Remark 5. For any state y ∈ X, we have E_y N_y(∞) = (f_yy/(1 − f_yy)) 1{f_yy < 1} + ∞ 1{f_yy = 1}. That is, the mean number of visits to the initial state y is finite iff the state y is transient.

Remark 6. In particular, this corollary implies the following consequences.
i A transient state is visited a finite number of times almost surely. This follows from Corollary 1.7, since P_x{N_y(∞) < ∞} = 1 for all transient states y ∈ X and any initial state x ∈ X.
ii A recurrent state is visited infinitely often almost surely. This also follows from Corollary 1.7, since P_y{N_y(∞) < ∞} = 0 for all recurrent states y ∈ X.
iii In a finite state Markov chain, not all states can be transient.

Proof. To see this, we assume that for a finite state space X, all states y ∈ X are transient. Then, we know that N_y(∞) is finite almost surely for all states y ∈ X. It follows that, for any initial state x ∈ X,

0 ⩽ P_x{∑_{y∈X} N_y(∞) = ∞} = P_x(∪_{y∈X} {N_y(∞) = ∞}) ⩽ ∑_{y∈X} P_x{N_y(∞) = ∞} = 0.

It follows that ∑_{y∈X} N_y(∞) is finite almost surely for the finite state space X. However, we know that ∑_{y∈X} N_y(∞) = ∑_{k∈N} ∑_{y∈X} 1{X_k = y} = ∞. This leads to a contradiction.

Proposition 1.9. For a homogeneous DTMC X : Ω → X^{Z+}, a state y ∈ X is recurrent iff ∑_{k∈N} p_yy^{(k)} = ∞, and transient iff ∑_{k∈N} p_yy^{(k)} < ∞.

Proof. Recall from Remark 5 that the mean number of visits to the initial state y is E_y N_y(∞) = ∑_{k∈N} p_yy^{(k)}, which is finite iff the state y is transient and infinite iff the state y is recurrent.
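Proposition 1.9 can be illustrated on the integer random walk of Example 1.9, where p_00^{(2k)} = C(2k, k) p^k (1 − p)^k. A Python sketch (computing each term from the previous one via the ratio C(2k,k)/C(2k−2,k−1) = 4 − 2/k to avoid floating-point underflow) shows the partial sums diverging for p = 1/2 and converging for p = 0.6; for p ≠ 1/2, the generating function ∑_{k⩾0} C(2k,k) x^k = (1 − 4x)^{−1/2} gives the finite limit (1 − 4p(1−p))^{−1/2} − 1, which is 4 at p = 0.6.

```python
# Partial sums of p_00^{(2k)} = C(2k, k) (p(1-p))^k for the integer random
# walk of Example 1.9 (the truncation level K is an illustrative choice).
def partial_sum(p, K):
    total, term = 0.0, 1.0  # term = C(2k,k) (p(1-p))^k, starting at k = 0
    for k in range(1, K + 1):
        term *= (4.0 - 2.0 / k) * p * (1 - p)  # consecutive-term ratio
        total += term
    return total

print(partial_sum(0.5, 1000))  # keeps growing in K: state 0 is recurrent
print(partial_sum(0.6, 1000))  # converges to 4: state 0 is transient
```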
Corollary 1.10. For a transient state y ∈ X, the following limits hold: lim_{n→∞} p_xy^{(n)} = 0 and lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = 0.

Proof. For a transient state y ∈ X and any state x ∈ X, we have E_x N_y(∞) = ∑_{n∈N} p_xy^{(n)} < ∞. Since the series sum is finite, the terms of the sequence vanish in the limit, lim_{n→∞} p_xy^{(n)} = 0. Further, we can write ∑_{k=1}^n p_xy^{(k)} ⩽ E_x N_y(∞) ⩽ M for some M ∈ N, and hence lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = 0.

Lemma 1.11. For any state y ∈ X, let H : Ω → N^N be the sequence of almost surely finite inter-visit times to state y, and N_y(n) = ∑_{k=1}^n 1{X_k = y} be the number of visits to state y in the first n steps. Then, N_y(n) + 1 is a finite mean stopping time with respect to the sequence H.

Proof. We first observe that N_y(n) + 1 ⩽ n + 1, and hence it has a finite mean for each n ∈ N. Further, we observe that the event {N_y(n) + 1 = k} can be completely determined by observing H_1, . . . , H_k, since

{N_y(n) + 1 = k} = {∑_{j=1}^{k−1} H_j ⩽ n < ∑_{j=1}^k H_j} ∈ σ(H_1, . . . , H_k).

Theorem 1.12. Let x, y ∈ X be such that f_xy = 1 and y is recurrent. Then, lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = 1/µ_yy.

Proof. Let y ∈ X be recurrent. The proof consists of three parts. In the first two parts, we will show that, starting from the state y, the limiting empirical average of the mean number of visits to state y is lim_{n→∞} (1/n) E_y N_y(n) = 1/µ_yy. In the third part, we will show that for any starting state x ∈ X such that f_xy = 1, the limiting empirical average of the mean number of visits to state y is also lim_{n→∞} (1/n) E_x N_y(n) = 1/µ_yy.

Lower bound: We observe that N_y(n) + 1 is a stopping time with respect to the inter-visit times H from Lemma 1.11. Further, we have ∑_{j=1}^{N_y(n)+1} H_j > n. Applying Wald's Lemma to the random sum ∑_{j=1}^{N_y(n)+1} H_j, we get E_y(N_y(n) + 1) µ_yy > n. Taking limits, we obtain lim inf_{n∈N} (1/n) ∑_{k=1}^n p_yy^{(k)} ⩾ 1/µ_yy.

Upper bound: Let X_0 = y and consider a fixed positive integer M ∈ N. Then H is i.i.d., and we define the truncated recurrence times H̄ : Ω → [M]^N by H̄_j ≜ M ∧ H_j for all j ∈ N. It follows that the sequence H̄ is i.i.d. and H̄_j ⩽ H_j for all j ∈ N. We define the mean of the truncated recurrence times as µ̄_yy ≜ E_y H̄_1. From the monotonicity of truncation, we get µ̄_yy ⩽ µ_yy.
We define the random variable τ̄_k ≜ ∑_{j=1}^k H̄_j for all k ∈ N, so that τ̄_k ⩽ τ_k for all k ∈ N. We can define the associated counting process that counts the number of truncated recurrences in the first n steps as N̄_y(n) ≜ ∑_{k∈N} 1{τ̄_k ⩽ n} for all n ∈ N. We conclude that N̄_y(n) + 1 is a stopping time with respect to the i.i.d. process H̄, and N̄_y(n) ⩾ N_y(n) sample path wise. Further, we have

∑_{j=1}^{N̄_y(n)+1} H̄_j = τ̄_{N̄_y(n)+1} = τ̄_{N̄_y(n)} + H̄_{N̄_y(n)+1} ⩽ n + M.

Applying Wald's Lemma to the stopping time N̄_y(n) + 1 with respect to the i.i.d. sequence H̄, and using N̄_y(n) ⩾ N_y(n) and the monotonicity of expectation, we get

E_y(N_y(n) + 1) µ̄_yy ⩽ E_y(N̄_y(n) + 1) µ̄_yy ⩽ n + M.

Taking limits, we obtain lim sup_{n∈N} (1/n) ∑_{k=1}^n p_yy^{(k)} ⩽ 1/µ̄_yy. Letting M grow arbitrarily large, we obtain the upper bound.
Starting from x: Further, we observe that p_xy^{(k)} = ∑_{s=0}^{k−1} f_xy^{(k−s)} p_yy^{(s)}. Since 1 = f_xy = ∑_{k∈N} f_xy^{(k)}, we have

∑_{k=1}^n p_xy^{(k)} = ∑_{k=1}^n ∑_{s=0}^{k−1} f_xy^{(k−s)} p_yy^{(s)} = ∑_{s=0}^{n−1} p_yy^{(s)} ∑_{k−s=1}^{n−s} f_xy^{(k−s)} = ∑_{s=0}^{n−1} p_yy^{(s)} − ∑_{s=0}^{n−1} p_yy^{(s)} ∑_{k>n−s} f_xy^{(k)}.

Since the series ∑_{k∈N} f_xy^{(k)} converges, we get lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = lim_{n→∞} (1/n) ∑_{k=1}^n p_yy^{(k)}.

Lecture-22: Communicating classes

1 Communicating classes
Definition 1.1. For states x, y ∈ X, the state y is said to be accessible from the state x if p_xy^{(n)} > 0 for some n ∈ Z+, denoted by x → y. If two states x, y ∈ X are accessible from each other, they are said to communicate with each other, denoted by x ↔ y.
Proposition 1.2. Communication is an equivalence relation.
Proof. A relation on the state space X is a subset of the product X × X; communication is a relation on the state space X, as it relates two states x, y ∈ X. To show equivalence, we have to show reflexivity, symmetry, and transitivity of the relation.
Reflexivity: Since P^0 = I, it follows that p_xx^{(0)} = 1 > 0 for each state x ∈ X, and hence x ↔ x.
Symmetry: If x ↔ y, then we know that x → y and y → x, and hence y ↔ x.
Transitivity: Suppose x ↔ y and y ↔ z. Let m, n ∈ Z+ be such that p_xy^{(m)} > 0 and p_yz^{(n)} > 0. Then, by the Chapman-Kolmogorov equation, we have p_xz^{(m+n)} = ∑_{w∈X} p_xw^{(m)} p_wz^{(n)} ⩾ p_xy^{(m)} p_yz^{(n)} > 0. This implies x → z, and using similar arguments one can show that z → x; the transitivity follows.

Remark 1. Since the communication is an equivalence relation, it partitions state space X into equivalence
classes.
Definition 1.3. Each equivalence class of communication relation is called a communicating class. A prop-
erty of states is said to be a class property if for each communicating class C , either all states in C have the
property, or none do.

Definition 1.4. A set of states that communicate is called a communicating class.
(i) A communicating class C is called closed if no edges leave the class. That is, p_xy = 0 for all pairs (x, y) ∈ C × C^c.
(ii) An open communicating class is one that is not closed, i.e. there exists an edge that leaves the class. That is, there exists a pair (x, y) ∈ C × C^c such that p_xy > 0.

1.1 Irreducibility and periodicity


Definition 1.5. A Markov chain with a single communicating class is called an irreducible Markov chain. That is, for any two states x, y ∈ X, there exists an integer n ∈ Z+ such that p_xy^{(n)} > 0. In other words, any state y can be reached from any state x using transitions of positive probability.

Definition 1.6. Let T(x) ≜ {n ∈ N : p_xx^{(n)} > 0} be the set of times when the chain can possibly return to the initial state x. The period of any state x ∈ X is defined as d(x) ≜ gcd T(x) = gcd{n ∈ N : p_xx^{(n)} > 0}. We define d(x) = ∞ if p_xx^{(n)} = 0 for all n ∈ N. A state x ∈ X is called aperiodic if the period d(x) is 1.
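The period d(x) = gcd T(x) can be approximated from the support of the transition matrix by collecting return times up to a finite horizon; a Python sketch (the 3-cycle chain and the horizon are illustrative assumptions, and a returned value of 0 corresponds to no observed return, i.e. d(x) = ∞ within the horizon):

```python
from math import gcd

# Approximating the period from the support of P.  The 3-cycle chain
# 0 -> 1 -> 2 -> 0 below has period 3 at every state.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

def period(P, x, horizon=50):
    """gcd of the return times to x seen within the horizon (0 if none)."""
    m = len(P)
    support = [[P[i][j] > 0 for j in range(m)] for i in range(m)]
    cur = [j == x for j in range(m)]  # states reachable in exactly n steps
    d = 0
    for n in range(1, horizon + 1):
        cur = [any(cur[i] and support[i][j] for i in range(m)) for j in range(m)]
        if cur[x]:
            d = gcd(d, n)
    return d

print(period(P, 0))  # 3
```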
Proposition 1.7. If x ↔ y, then d( x ) = d(y). That is, periodicity is a class property.

Proof. Consider two states x ̸= y ∈ X and integers m, n ∈ N such that p_xy^{(m)} p_yx^{(n)} > 0. Suppose s ∈ T(x), that is p_xx^{(s)} > 0. Then

p_yy^{(n+m)} ⩾ p_yx^{(n)} p_xy^{(m)} > 0, p_yy^{(n+s+m)} ⩾ p_yx^{(n)} p_xx^{(s)} p_xy^{(m)} > 0.

Hence d(y) | n + m and d(y) | n + s + m, and therefore d(y) | s for any s ∈ T(x). In particular, this implies d(y) | d(x). By symmetric arguments, we get d(x) | d(y). Hence d(x) = d(y).
Definition 1.8. For an irreducible chain, the period of the chain is defined to be the period which is common
to all states. An irreducible Markov chain is called aperiodic if the single communicating class is aperiodic.
Proposition 1.9. If the transition matrix P is aperiodic and irreducible, then there is an integer r_0 such that p_xy^{(r)} > 0 for all x, y ∈ X and r ⩾ r_0.

1.2 Transient and recurrent states


Proposition 1.10. Transience and recurrence are class properties.
Proof. We start by proving that recurrence is a class property. Let x be a recurrent state and let x ↔ y. Hence, there exist some m, n > 0 such that p_xy^{(m)} > 0 and p_yx^{(n)} > 0. As a consequence of recurrence, ∑_{s∈N} p_xx^{(s)} = ∞, and hence ∑_{s∈N} p_yy^{(m+n+s)} ⩾ ∑_{s∈N} p_yx^{(n)} p_xx^{(s)} p_xy^{(m)} = ∞, so y is recurrent. If x were transient instead, then ∑_{s∈N} p_xx^{(s)} < ∞, and we conclude that y is also transient from the observation ∑_{s∈N} p_yy^{(s)} ⩽ (∑_{s∈N} p_xx^{(m+n+s)})/(p_yx^{(n)} p_xy^{(m)}) < ∞.

Corollary 1.11. If y is a recurrent state and there exists a state x such that y → x, then x → y and f_xy = 1.

Proof. Let y ∈ X be a recurrent state, and consider a state x ∈ X such that y → x. We will show that f_xy = 1, and hence f_xy^{(n)} > 0 for some n ∈ N, so that x → y. To this end, we observe that since y → x, there exists an integer n ∈ N such that the probability of hitting state x for the first time, starting from state y, in n steps is positive. That is,

f_yx^{(n)} ≜ P_y{X_n = x, X_{n−1} ̸= x, . . . , X_1 ̸= x} = P_y{τ_X^{{x},1} = n} > 0.

From the strong Markov property, we have

1 − f_yy = P_y{τ_X^{{y},1} = ∞} ⩾ P_y{τ_X^{{y},1} = ∞, τ_X^{{x},1} = n} = P_x{τ_X^{{y},1} = ∞} P_y{τ_X^{{x},1} = n} = f_yx^{(n)} (1 − f_xy).

Since the state y is recurrent, 1 − f_yy = 0, and it follows that f_xy = 1, and hence x → y.
Corollary 1.12. Let x, y ∈ X be in the same communicating class, with the state y recurrent. Then, lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = 1/µ_yy. Furthermore, if the state y is aperiodic, then lim_{n→∞} p_xy^{(n)} = 1/µ_yy.

Proof. Since y is recurrent and y → x, it follows that f_xy = 1 from the previous corollary. From Theorem 1.12 of the previous lecture, it follows that lim_{n→∞} (1/n) ∑_{k=1}^n p_xy^{(k)} = 1/µ_yy.
Let the period of the state y be d. Then, we know that there exists a positive integer r_0 such that p_yy^{(nd)} > 0 for all n ⩾ r_0.
Theorem 1.13. The states in a communicating class are all of one of the following types: all transient, all null recurrent, or all positive recurrent.

Proof. It suffices to show that if x, y belong to the same communicating class and y is null recurrent, then x is null recurrent as well. We take r, s ∈ N such that p_yx^{(r)} p_xy^{(s)} > 0. It follows that p_yy^{(r+ℓ+s)} ⩾ p_yx^{(r)} p_xx^{(ℓ)} p_xy^{(s)} for all ℓ ∈ N. Hence, for any n > r + s, we have

(1/n) ∑_{k=1}^n p_yy^{(k)} ⩾ (1/n) ∑_{k=r+s+1}^n p_yy^{(k)} ⩾ ((n−r−s)/n) ((1/(n−r−s)) ∑_{ℓ=1}^{n−r−s} p_xx^{(ℓ)}) p_yx^{(r)} p_xy^{(s)}.

Since y is null recurrent, the LHS goes to zero as n increases, which implies lim_{n→∞} (1/n) ∑_{ℓ=1}^n p_xx^{(ℓ)} = 0. Hence, x is null recurrent as well.

Lecture-23: Invariant Distribution

1 Invariant Distribution
Let X : Ω → XZ+ be a time-homogeneous Markov chain with transition probability matrix P : X × X → [0, 1].
Definition 1.1. A probability distribution π ∈ M(X) is said to be an invariant distribution for the Markov chain X if it satisfies the global balance equation π = πP.

Definition 1.2. When the initial distribution of a Markov chain is ν ∈ M(X), the corresponding probability measure is denoted by P_ν : F → [0, 1], defined by

P_ν(A) ≜ ∑_{x∈X} ν(x) P_x(A) for all events A ∈ F.

Definition 1.3. For a Markov chain X : Ω → XZ+ , we denote the distribution of random variable Xn : Ω → X
by νn ∈ M(X) for all n ∈ Z+ . That is, νn ( x ) ≜ Pν0 { Xn = x } for all x ∈ X.
Remark 1. We observe that νn ( x ) = ∑z∈X ν0 (z)( Pn )zx for all x ∈ X.

Remark 2. Facts about the invariant distribution π.


i The global balance equation π = πP is a matrix equation, that is we have a collection of |X| equa-
tions πy = ∑ x∈X π x p xy for each y ∈ X.
ii The invariant distribution π is a left eigenvector of the stochastic matrix P for its largest eigenvalue 1. The all-ones vector is the corresponding right eigenvector of P for the eigenvalue 1.
iii From the Chapman-Kolmogorov equation for initial probability vector π, we have π = πPn for
n ∈ N. That is, if ν0 = π, then νn = π for all n ∈ Z+ .
iv The resulting process with initial distribution π is stationary, and hence has shift-invariant finite-dimensional distributions. For example, for any k, n ∈ Z+ and x_0, . . . , x_n ∈ X, we have

Pπ { X0 = x0 , . . . , Xn = xn } = Pπ { Xk = x0 , . . . , Xk+n = xn } = π x0 p x0 x1 . . . p xn−1 xn .

v For an irreducible Markov chain, if π_x > 0 for some x ∈ X, then the entire invariant vector π is positive. To see this, we show that π_y > 0 for all states y ∈ X. Let y ∈ X; then from the irreducibility of the Markov chain, there exists an m ∈ Z+ such that p_xy^{(m)} > 0. Further, π = πP^m, and hence π_y ⩾ π_x p_xy^{(m)} > 0.
vi Any scaled version of π satisfies the global balance equation. Therefore, for any invariant vector α ∈ R_+^X of a positive recurrent transition matrix P, the sum ∥α∥_1 = ∑_{x∈X} α_x must be finite. We can normalize α to get an invariant probability measure π = α/∥α∥_1.
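The facts above can be checked numerically. A minimal Python sketch, assuming a small hypothetical 3-state transition matrix P (an illustrative example, not taken from the notes): iterating ν_{n+1} = ν_n P converges to π, which then satisfies the global balance equation componentwise.

```python
# Illustrative 3-state stochastic matrix (rows sum to one); hypothetical values.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

def step(nu, P):
    """One application of the global balance map nu -> nu P."""
    n = len(P)
    return [sum(nu[x] * P[x][y] for x in range(n)) for y in range(n)]

# Iterating nu_{n+1} = nu_n P from an arbitrary initial distribution converges
# to the invariant distribution pi for this irreducible aperiodic chain.
nu = [1.0, 0.0, 0.0]
for _ in range(1000):
    nu = step(nu, P)
pi = nu

# pi satisfies pi = pi P up to floating-point error.
residual = max(abs(p - q) for p, q in zip(pi, step(pi, P)))
```

The same iteration realizes fact (iii): starting from ν_0 = π, every ν_n equals π.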

Theorem 1.4. An irreducible Markov chain with transition probability matrix P : X × X → [0, 1] is positive recurrent iff there exists a unique invariant probability measure π ∈ M(X) that satisfies the global balance equation π = πP and π_x = 1/µ_xx > 0 for all x ∈ X.
Proof. Consider an irreducible Markov chain X : Ω → XZ+ with transition probability matrix P. We will
first show that positive recurrence of X implies the existence of a positive invariant distribution π and its
uniqueness. Then we will show that the existence of a unique positive invariant distribution π implies
positive recurrence of X.
Implication: For the Markov chain X, let the initial state be X_0 = x. Recall that the number of visits to state y ∈ X in the first n steps of the Markov chain X is denoted by N_y(n) = ∑_{k=1}^{n} 1_{{X_k = y}}. It follows that ∑_{y∈X} N_y(n) = n for each n ∈ N. Let H_x ≜ τ_X^{{x},1} be the first recurrence time to state x ∈ X; then we have N_x(H_x) = 1 and ∑_{y∈X} N_y(H_x) = H_x.
Existence: We define a vector v ∈ R^X by v_y ≜ E_x[N_y(H_x)] for each y ∈ X. We observe that v_y ⩾ 0 for each state y ∈ X. In particular, v_x = 1. Since X is positive recurrent, we get that ∑_{y∈X} v_y = E_x H_x = µ_xx < ∞. We will show that the vector v satisfies the global balance equation v = vP. To see this, we first define λ_xy^{(n)} ≜ P_x{X_n = y, n ⩽ H_x} for all n ∈ N and states x, y ∈ X. We observe that λ_xy^{(1)} = p_xy for each y ∈ X. Next, we observe from the monotone convergence theorem that

v_y = E_x N_y(H_x) = E_x ∑_{n∈N} 1_{{X_n = y, n ⩽ H_x}} = ∑_{n∈N} P_x{X_n = y, n ⩽ H_x} = ∑_{n∈N} λ_xy^{(n)}.  (1)
For n ⩾ 2, partitioning the event {X_n = y, n ⩽ H_x} by the events ({X_{n−1} = z} : z ∈ X \ {x}), the countable additivity of conditional probability for disjoint events, and the definition of conditional probability, we get

λ_xy^{(n)} = ∑_{z≠x} P({X_n = y} | {X_{n−1} = z, n ⩽ H_x, X_0 = x}) P_x{X_{n−1} = z, n ⩽ H_x}.
Recall that since H_x is a stopping time adapted to the natural filtration F• of the Markov chain X, we have {n ⩽ H_x, X_0 = x} = {X_0 = x} ∩ {H_x ⩽ n − 1}^c ∈ F_{n−1}. Together with the Markov property of X and the fact that {X_{n−1} = z, n ⩽ H_x} = {X_{n−1} = z, n − 1 ⩽ H_x}, we obtain

λ_xy^{(n)} = ∑_{z≠x} P({X_n = y} | {X_{n−1} = z}) P_x{X_{n−1} = z, n − 1 ⩽ H_x} = ∑_{z≠x} λ_xz^{(n−1)} p_zy.  (2)
Substituting the expression for λ_xy^{(n)} in (2) into the expression v_y = ∑_{n∈N} λ_xy^{(n)} in (1) and using the fact that v_x = 1, we obtain

v_y = p_xy + ∑_{n⩾2} ∑_{z≠x} λ_xz^{(n−1)} p_zy = v_x p_xy + ∑_{z≠x} v_z p_zy = ∑_{z∈X} v_z p_zy.
Since v has a finite sum, it follows that π ≜ v/∑_{x∈X} v_x is an invariant distribution for the Markov chain X with the transition matrix P. In addition, we have π_x = v_x/∑_{y∈X} v_y = 1/µ_xx > 0.
Uniqueness: Next, we show that this is the unique invariant measure, independent of the initial state x, and hence π_y = 1/µ_yy > 0 for all y ∈ X. For uniqueness, we observe from the Chapman-Kolmogorov equations and the invariance of π that π = (1/n) π(P + P² + ··· + P^n). Hence, π_y = ∑_{x∈X} π_x (1/n) ∑_{k=1}^{n} p_xy^{(k)} for all states y ∈ X. Taking the limit n → ∞ on both sides, and exchanging limit and summation on the right hand side using the bounded convergence theorem for the summable series π, we get π_y = (1/µ_yy) ∑_{x∈X} π_x = 1/µ_yy > 0 for all states y ∈ X.

Converse: Let π be the unique positive invariant distribution of the Markov chain X, such that π_y = 1/µ_yy > 0 for all states y ∈ X. Then µ_yy = 1/π_y < ∞ for every state y ∈ X, and hence the Markov chain X is positive recurrent.
Corollary 1.5. An irreducible Markov chain on a finite state space X has a unique positive stationary distribution π.
Definition 1.6. An irreducible, aperiodic, positive recurrent Markov chain is called ergodic.

Remark 3. Additional remarks about the stationary distribution π.
i For a Markov chain with multiple positive recurrent communicating classes C_1, . . . , C_m, one can find the positive equilibrium distribution for each class, and extend it to the entire state space X, denoting it by π_k for class k ∈ [m]. It is easy to check that any convex combination π = ∑_{k=1}^{m} α_k π_k satisfies the global balance equation π = πP, where α_k ⩾ 0 for each k ∈ [m] and ∑_{k=1}^{m} α_k = 1. Hence, a Markov chain with multiple positive recurrent classes has a convex set of invariant probability measures, with the individual invariant distributions π_k for each positive recurrent class k ∈ [m] being the extreme points.
ii Let µ(0) = e_x, that is, let the initial state of the positive recurrent Markov chain be X_0 = x. Then, we know that

π_y = 1/µ_yy = lim_{n→∞} (1/n) ∑_{k=1}^{n} p_xy^{(k)} = lim_{n→∞} (1/n) E_x N_y(n).

That is, π_y is the limiting average number of visits to state y ∈ X.
iii If a positive recurrent Markov chain is aperiodic, then the limiting probability of being in a state y is its invariant probability, that is, π_y = lim_{n→∞} p_xy^{(n)}.
Theorem 1.7. For an ergodic Markov chain X with invariant distribution π, and nth step distribution µ(n), we
have limn→∞ µ(n) = π in the total variation distance.
Proof. Consider independent time homogeneous Markov chains X : Ω → XZ+ and Y : Ω → XZ+ each with
transition matrix P. The initial state of Markov chain X is assumed to be X0 = x, whereas the Markov chain
Y is assumed to have an initial distribution π. It follows that Y is a stationary process, while X is not. In
particular,
µ_y(n) = P_x{X_n = y} = p_xy^{(n)},  P_π{Y_n = y} = π_y.
Let τ = inf{n ∈ Z+ : Xn = Yn } be the first time that two Markov chains meet, called the coupling time.
Finiteness: First, we show that the coupling time is almost surely finite. To this end, we define a new Markov chain on the state space X × X with transition probability matrix Q such that q((x, w), (y, z)) = p_xy p_wz for each pair of states (x, w), (y, z) ∈ X × X. The n-step transition probabilities for this coupled Markov chain are given by

q^{(n)}((x, w), (y, z)) ≜ p_xy^{(n)} p_wz^{(n)}.
Ergodicity: Since the Markov chain X with transition probability matrix P is irreducible and aperiodic, for each x, y, w, z ∈ X there exists an n_0 ∈ Z+ such that q^{(n)}((x, w), (y, z)) = p_xy^{(n)} p_wz^{(n)} > 0 for all n ⩾ n_0, from a previous Lemma on aperiodicity. Hence, the irreducibility and aperiodicity of this new product Markov chain follows.
Invariant: It is easy to check that θ(x, w) = π_x π_w is the invariant distribution for this product Markov chain, since θ(x, w) > 0 for each (x, w) ∈ X × X, ∑_{x,w∈X} θ(x, w) = 1, and for each (y, z) ∈ X × X, we have

∑_{x,w∈X} θ(x, w) q((x, w), (y, z)) = ∑_{x∈X} π_x p_xy ∑_{w∈X} π_w p_wz = π_y π_z = θ(y, z).

Recurrence: This implies that the product Markov chain is positive recurrent, and each state (x, x) ∈ X × X is reachable with unit probability from any initial state (y, w) ∈ X × X. In particular, the coupling time is almost surely finite.
Coupled process: Second, we show that from the coupling time onwards, the evolution of the two Markov chains is identical in distribution. That is, for each y ∈ X and n ∈ Z+,

P_{X_τ}{X_n = y, n ⩾ τ} = P_{Y_τ}{Y_n = y, n ⩾ τ}.
This follows from the strong Markov property for the joint process where τ is stopping time for the
joint process (( Xn , Yn ) : n ∈ Z+ ) such that Xτ = Yτ , and both marginals have the identical transition
matrix.
Limit: For any y ∈ X, we can write the difference as

|p_xy^{(n)} − π_y| = |P_x{X_n = y, n < τ} − P_π{Y_n = y, n < τ}| ⩽ 2 P_{δ_x,π}(τ > n).
Since the coupling time is almost surely finite for each initial state x, y ∈ X, we have ∑n∈N Pδx ,π {τ = n} =
1 and the tail-sum Pδx ,π {τ > n} goes to zero as n grows large, and the result follows.
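The convergence in Theorem 1.7 can be observed numerically. A sketch in Python, assuming a hypothetical ergodic 3-state matrix (not from the notes): the total variation distance between µ(n) and π is non-increasing in n and tends to zero.

```python
# Illustrative ergodic 3-state chain; hypothetical values.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

def step(nu):
    n = len(P)
    return [sum(nu[x] * P[x][y] for x in range(n)) for y in range(n)]

# Invariant distribution via long-run iteration of nu -> nu P.
pi = [1/3, 1/3, 1/3]
for _ in range(2000):
    pi = step(pi)

def tv(mu, nu):
    """Total variation distance 0.5 * sum_y |mu_y - nu_y|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

# mu(n) for the chain started at X_0 = 0, i.e. mu(0) = e_0.
mu = [1.0, 0.0, 0.0]
dists = []
for n in range(20):
    dists.append(tv(mu, pi))
    mu = step(mu)
```

The monotone decrease reflects the fact that one step of the chain cannot increase the total variation distance to stationarity.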

2 Computing invariant distribution
When the state space X is finite, one can find a left eigenvector of the probability transition matrix P for the largest eigenvalue 1. This is the invariant distribution that satisfies the global balance equation π = πP.
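For a two-state chain the global balance equation can be solved in closed form, which gives a quick check of the eigenvector characterization. A sketch with illustrative values of p and q (hypothetical, not from the notes):

```python
# Two-state chain P = [[1-p, p], [q, 1-q]]: balance pi_0 * p = pi_1 * q and
# pi_0 + pi_1 = 1 give pi = (q/(p+q), p/(p+q)) in closed form.
p, q = 0.3, 0.1
pi = [q / (p + q), p / (p + q)]

P = [[1 - p, p], [q, 1 - q]]
# Left-multiplying pi by P should return pi, i.e. pi is the left eigenvector
# of P for eigenvalue 1.
balance = [sum(pi[x] * P[x][y] for x in range(2)) for y in range(2)]
```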
Definition 2.1. Consider a time homogeneous positive recurrent Markov chain X : Ω → XZ+ with prob-
ability transition matrix P and invariant distribution π ∈ M(X). For any two disjoint sets A, B ⊆ X, the
probability flux from set of nodes A to set of nodes B is defined as Φ( A, B) = ∑ x∈ A ∑y∈ B π x p xy .

Remark 4. The probability flux from a single node x to a single node y is denoted by Φ(x, y) = π_x p_xy.
Definition 2.2. For a time homogeneous Markov chain X : Ω → X^{Z+} with probability transition matrix P represented as the weighted transition graph G = (X, E, w), a cut is defined as a partition (X_1, X_2) of the nodes.
Lemma 2.3. Probability flux balances across cuts.

Proof. A cut for the state space X is given by a partition (X_1, X_2). We show that Φ(X_1, X_2) = Φ(X_2, X_1). To see this, we observe that by exchanging sums from the monotone convergence theorem, using the global balance equation π_x = ∑_{y∈X} π_y p_yx, and exchanging x and y as running variables, we can write the probability flux Φ(X_1, X_2) as

∑_{y∈X_1} π_y ∑_{x∉X_1} p_yx = ∑_{y∈X_1} π_y (1 − ∑_{x∈X_1} p_yx) = ∑_{y∈X_1} π_y − ∑_{x∈X_1} ∑_{y∈X_1} π_y p_yx = ∑_{y∈X_1} π_y − ∑_{x∈X_1} (π_x − ∑_{y∉X_1} π_y p_yx) = ∑_{x∈X_1} ∑_{y∉X_1} π_y p_yx = Φ(X_2, X_1).

Corollary 2.4. For any state y ∈ X, we have π_y (1 − p_yy) = π_y ∑_{x≠y} p_yx = ∑_{x≠y} π_x p_xy.
Proof. It follows from probability flux balancing across the cut ({y} , {y}c ).
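A quick numerical check of the flux balance across a cut, using a hypothetical 3-state chain (illustrative values, not from the notes):

```python
# Illustrative 3-state chain; compute pi by iterating nu -> nu P.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]

pi = [1/3, 1/3, 1/3]
for _ in range(2000):
    pi = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]

def flux(A, B):
    """Probability flux Phi(A, B) = sum_{x in A} sum_{y in B} pi_x p_xy."""
    return sum(pi[x] * P[x][y] for x in A for y in B)

X1, X2 = [0], [1, 2]          # the cut ({0}, {1, 2})
lhs, rhs = flux(X1, X2), flux(X2, X1)
```

With X1 a singleton {y}, the equality lhs = rhs is exactly the statement of Corollary 2.4.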

Lecture-24: Poisson Point Processes

1 Simple point processes
Consider the d-dimensional Euclidean space R^d. The collection of Borel measurable subsets B(R^d) of this Euclidean space is generated by the sets B(x) ≜ {y ∈ R^d : y_i ⩽ x_i for all i ∈ [d]} for x ∈ R^d.
Definition 1.1. A simple point process is a random countable collection of distinct points S : Ω → XN , such
that the distance ∥Sn ∥ → ∞ as n → ∞.
Remark 1. Since S is a simple point process, each point Sn is unique. Therefore, we can identify S as a
random set of points in X and S ∩ A is the random set of points in A.
Remark 2. For any simple point process S, we have P({S_n = S_m for some n ≠ m}) = 0, and |S ∩ A| is finite almost surely for any bounded set A ∈ B(X).

Example 1.2 (Simple point process on the half-line). We can simplify this definition for d = 1. When X = R_+, one can order the points of the process S : Ω → R_+^N to get an ordered process S̃ : Ω → R_+^N, such that S̃_n = S_{(n)} is the nth order statistic of S. That is, S_{(0)} ≜ 0 and S_{(n)} ≜ inf{S_k > S_{(n−1)} : k ∈ N}, such that S_{(1)} < S_{(2)} < ··· < S_{(n)} < ···, and lim_{n→∞} S_{(n)} = ∞. We will call this an arrival process.
Definition 1.3. Corresponding to a point process S : Ω → X^N, we denote the number of points in a set A ∈ B(X) by

N(A) ≜ |S ∩ A| = ∑_{n∈N} 1_A(S_n), where we have N(∅) = 0.

The resulting process N : Ω → Z_+^{B(X)} is called a counting process for the point process S : Ω → X^N.
Remark 3. Let A ∈ B(X)^k be a bounded partition of B ∈ B(X). From the disjointness of (A_1, . . . , A_k), we have

N(B) = ∑_{n∈N} 1_{∪_{i=1}^{k} A_i}(S_n) = ∑_{n∈N} ∑_{i=1}^{k} 1_{A_i}(S_n) = ∑_{i=1}^{k} ∑_{n∈N} 1_{A_i}(S_n) = ∑_{i=1}^{k} N(A_i).
Definition 1.4. A counting process is simple if the underlying point process is simple.
Remark 4. For a simple counting process N, we have N ({ x }) ⩽ 1 almost surely for all x ∈ X.
Remark 5. Let N : Ω → Z_+^{B(X)} be the counting process for the point process S : Ω → X^N.
i Note that the point process S and the counting process N carry the same information.
ii The distribution of point process S is completely characterized by the finite dimensional distributions
of random vectors ( N ( A1 ), . . . , N ( Ak )) for any bounded sets A1 , . . . , Ak ∈ B(X) and finite k ∈ N.

Example 1.5 (Simple point process on the half-line). Since the Borel measurable sets B(R_+) are generated by the half-open intervals {(0, t] : t ∈ R_+}, we denote the counting process by N : Ω → Z_+^{R_+}, where N_t ≜ N(0, t] = ∑_{n∈N} 1_{{S_n ∈ (0,t]}} is the number of points in the half-open interval (0, t]. For s < t, the number of points in the interval (s, t] is N(s, t] = N(0, t] − N(0, s] = N_t − N_s.

Theorem 1.6 (Rényi). Distribution of a simple point process S : Ω → XN on a locally compact second countable
space X is completely determined by void probabilities ( P { N ( A) = 0} : A ∈ B(X)).
Proof. It suffices to show that the finite dimensional distributions of S on locally compact sets are character-
ized by void probabilities.

Step 1: We will show this by induction on the number of sets k. Let A_1, . . . , A_k, B ∈ B(X) be locally compact; we will show that u_k ≜ P(∩_{i=1}^{k} {N(A_i) > 0} ∩ {N(B) = 0}) can be computed from void probabilities. For k = 1, we have

P{N(A_1) > 0, N(B) = 0} = P{N(B) = 0} − P{N(B ∪ A_1) = 0}.
The induction step can be proved by the recursive relation

u_k = P(∩_{i=1}^{k−1} {N(A_i) > 0} ∩ {N(B) = 0}) − P(∩_{i=1}^{k−1} {N(A_i) > 0} ∩ {N(A_k ∪ B) = 0}).

Step 2: For any locally compact set B ∈ B(X), there exists a sequence of nested partitions B_n ≜ (B_{n,j} : j ∈ [J_n]) that eventually separates the points in S ∩ B as n → ∞. We define the number of subsets of the partition (B_{n,j} : j ∈ [J_n]) that contain at least one point of S ∩ B as H_n(B) ≜ ∑_{j=1}^{J_n} 1_{{N(B_{n,j}) > 0}}, where H_n(B) ↑ N(B) almost surely.
Step 3: We next show that for all locally compact sets B1 , . . . , Bk ∈ B(X) and j1 , . . . , jk ∈ N, the probability
P(∩_{i=1}^{k} {H_n(B_i) = j_i}) can be expressed in terms of void probabilities. We observe that

P(∩_{i=1}^{k} {H_n(B_i) = j_i}) = ∑_{T_1,...,T_k⊆[J_n] : |T_1|=j_1,...,|T_k|=j_k} P(∩_{i=1}^{k} ∩_{j∈T_i} {N(B_{n,j}^{i}) > 0} ∩ {N(∪_{j∉∪_{i=1}^{k} T_i} B_{n,j}^{i}) = 0}).

This can be expressed in terms of void probabilities by Step 1.

Step 4: For a simple point process, we have the almost sure limit lim_n ∩_{i=1}^{k} {H_n(B_i) = j_i} = ∩_{i=1}^{k} {N(B_i) = j_i}. The result follows from the continuity of probability.

Remark 6. Recall that |A| = ∫_{x∈A} dx is the volume of the set A ∈ B(R^d).

Definition 1.7. The intensity measure Λ : B(X) → R_+ is defined for each bounded set A ∈ B(X) as its scaled volume in terms of the intensity density λ : R^d → R_+, as

Λ(A) ≜ ∫_{x∈A} λ(x) dx.

If the intensity density is λ(x) = λ for all x ∈ R^d, then Λ(A) = λ|A|. In particular, for a partition A_1, . . . , A_k of a set B, we have Λ(B) = ∑_{i=1}^{k} Λ(A_i).

2 Poisson point process
Definition 2.1. A non-negative integer valued random variable N : Ω → Z_+ is called Poisson if for some constant λ > 0, we have

P{N = n} = e^{−λ} λ^n/n! for all n ∈ Z_+.
Remark 7. It is easy to check that EN = Var[N] = λ. Furthermore, the moment generating function M_N(t) = E e^{tN} = e^{λ(e^t − 1)} exists for all t ∈ R.
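These identities can be verified directly from the pmf. A small Python sketch for λ = 2.5 (an arbitrary illustrative value), building the pmf iteratively and truncating the series at n = 200, where the tail is negligible:

```python
from math import exp

lam = 2.5
# Build the Poisson pmf iteratively: P{N = n+1} = P{N = n} * lam / (n+1).
pmf = []
p = exp(-lam)
for n in range(200):
    pmf.append(p)
    p *= lam / (n + 1)

mean = sum(n * p for n, p in enumerate(pmf))
var = sum(n**2 * p for n, p in enumerate(pmf)) - mean**2

# The mgf E e^{tN} evaluated at t = 0.1 matches e^{lam (e^t - 1)}.
t = 0.1
mgf = sum(exp(t * n) * p for n, p in enumerate(pmf))
```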
Corollary 2.2. A simple counting process N : Ω → Z_+^{B(X)} has Poisson marginal distributions with intensity measure Λ : B(X) → R_+ if and only if the void probabilities are exponential with the same intensity measure Λ.

Proof. It is clear that if the marginal distribution of the counting process N is Poisson with intensity measure
Λ, then the void probability P { N ( A) = 0} = e−Λ( A) is exponential for any bounded set A ∈ B(X).
Conversely, we assume that the void probabilities are exponential with intensity measure Λ. It follows from the additivity of the intensity measure that for any finite collection of bounded disjoint sets B_1, . . . , B_k ∈ B(X), we have

P(∩_{i=1}^{k} {N(B_i) = 0}) = P{N(∪_{i=1}^{k} B_i) = 0} = e^{−Λ(∪_{i=1}^{k} B_i)} = ∏_{i=1}^{k} e^{−Λ(B_i)} = ∏_{i=1}^{k} P{N(B_i) = 0}.
That is, the Bernoulli random vector (1_{{N(B_i)=0}} : i ∈ [k]) has independent components for any finite k ∈ N and bounded disjoint B(X)-measurable sets B_1, . . . , B_k. Next we consider a set B ∈ B(X) and a partition B_n ≜ (B_{n,j} : j ∈ [J_n]) of B such that Λ(B_{n,j}) = Λ(B)/J_n for all j ∈ [J_n]. It follows that H_n(B) ≜ ∑_{j=1}^{J_n} 1_{{N(B_{n,j}) > 0}} is the sum of J_n i.i.d. Bernoulli random variables with success probability p_n ≜ 1 − e^{−Λ(B)/J_n}, and hence has a Binomial distribution with parameters (J_n, p_n). Therefore,

e−Λ( B) Jn
  m
Jn ! 
P { Hn ( B) = m} = (eΛ( B)/Jn pn )m = e−Λ( B) eΛ( B)/Jn − 1 .
m! m ( Jn − m)!

Recall that Hn ( B) ↑ N ( B) as n → ∞ in the proof of Rényi’s Theorem, and limn→∞ Jn = ∞ and limn∈N Bn,j =
0. Thus, limn→∞ ( J J−n !m)! (eΛ( B)/Jn − 1)m = Λ( B)m . Taking limit n → ∞ on both sides of the above equation,
n
we get the result.
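The Binomial-to-Poisson limit above can be observed numerically. A sketch, assuming the illustrative values Λ(B) = 2 and m = 3: the Binomial(J_n, p_n) probability of {H_n(B) = m} approaches the Poisson probability e^{−Λ(B)} Λ(B)^m/m! as J_n grows.

```python
from math import exp, comb

Lam, m = 2.0, 3
poisson = exp(-Lam) * Lam**m / 6   # e^{-Lam} Lam^3 / 3!

def binom_prob(J):
    """Binomial(J, p) probability of m successes with p = 1 - e^{-Lam/J}."""
    p = 1 - exp(-Lam / J)
    return comb(J, m) * p**m * (1 - p)**(J - m)

# Finer partitions (larger J_n) give a better Poisson approximation.
approx = [binom_prob(J) for J in (10, 100, 10000)]
err = abs(approx[-1] - poisson)
```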
Definition 2.3. A counting process N : Ω → Z_+^{B(X)} has the complete independence property if for any finite collection of disjoint and bounded sets A_1, . . . , A_k ∈ B(X), the vector (N(A_1), . . . , N(A_k)) : Ω → Z_+^k has independent components. That is,

P(∩_{i=1}^{k} {N(A_i) = n_i}) = ∏_{i=1}^{k} P{N(A_i) = n_i},  n ∈ Z_+^k.

Definition 2.4. A simple point process S : Ω → X^N is a Poisson point process if the associated counting process N : Ω → Z_+^{B(X)} has the complete independence property and the marginal distributions are Poisson.

Definition 2.5. The intensity measure Λ : B(X) → R_+ of a Poisson process S is defined by Λ(A) ≜ EN(A) for all bounded A ∈ B(X).
Remark 8. Recall that for any partition A ∈ B(X)^k of a bounded set B ∈ B(X), we have N(B) = ∑_{i=1}^{k} N(A_i), and therefore it follows from the linearity of expectation that Λ(B) = EN(B) = ∑_{i=1}^{k} EN(A_i) = ∑_{i=1}^{k} Λ(A_i). Thus, Λ is a valid intensity measure.
Remark 9. For a Poisson process with intensity measure Λ, it follows from the definition that for any finite k ∈ Z+ and bounded mutually disjoint sets A_1, . . . , A_k ∈ B(X), we have

P(∩_{i=1}^{k} {N(A_i) = n_i}) = ∏_{i=1}^{k} e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!,  n ∈ Z_+^k.
Definition 2.6. If the intensity measure Λ of a Poisson process S satisfies Λ( A) = λ | A| for all bounded
A ∈ B(X), then we call S a homogeneous Poisson point process and λ is its intensity.

3 Equivalent characterizations
Theorem 3.1 (Equivalences). The following are equivalent for a simple counting process N : Ω → Z_+^{B(X)}:
i Process N is Poisson with locally finite intensity measure Λ.

ii For each bounded A ∈ B(X), we have P { N ( A) = 0} = e−Λ( A) .
iii For each bounded A ∈ B(X), the number of points N(A) is Poisson with parameter Λ(A).
iv Process N has the complete independence property, and EN(A) = Λ(A) for all bounded sets A ∈ B(X).
Proof. We will show that i =⇒ ii =⇒ iii =⇒ iv =⇒ i .
i =⇒ ii It follows from the definition of Poisson point processes and the definition of Poisson random variables.
ii =⇒ iii From Corollary 2.2, we know that if the void probabilities are exponential, then the marginal distributions are Poisson.
iii =⇒ iv We will show this in two steps.
Mean: Since the distribution of random variable N ( A) is Poisson, it has mean EN ( A) = Λ( A).
CIP: Consider a partition A ∈ B(X)^k of a bounded set B ∈ B(X); then Λ(B) = Λ(A_1) + ··· + Λ(A_k). Consider all partitions n ∈ Z_+^k of a non-negative integer m ∈ Z_+, to write

P{N(B) = m} = ∑_{n_1+···+n_k=m} P{N(A_1) = n_1, . . . , N(A_k) = n_k}.
Using the definition of the Poisson distribution, we can write the LHS of the above equation as

P{N(B) = m} = e^{−Λ(B)} Λ(B)^m/m! = ∏_{i=1}^{k} e^{−Λ(A_i)} · (∑_{i=1}^{k} Λ(A_i))^m/m!.

Since (a_1 + ··· + a_k)^m = ∑_{n_1+···+n_k=m} (m!/(n_1! ··· n_k!)) ∏_{i=1}^{k} a_i^{n_i}, we get

P{N(B) = m} = (1/m!) ∑_{n_1+···+n_k=m} (m!/(n_1! ··· n_k!)) ∏_{i=1}^{k} e^{−Λ(A_i)} Λ(A_i)^{n_i} = ∑_{n_1+···+n_k=m} ∏_{i=1}^{k} e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!.

Equating each term in the summation, we get P{N(A_1) = n_1, . . . , N(A_k) = n_k} = ∏_{i=1}^{k} P{N(A_i) = n_i}.


iv =⇒ i From Corollary 2.2, if the void probabilities are exponential with intensity measure Λ, then the marginal distributions are Poisson with the same intensity measure. We define f : B(X) → (−∞, 0] by f(A) ≜ ln P{N(A) = 0} for all bounded A ∈ B(X). Then, we observe that for any partition (A_1, . . . , A_k) of A, the complete independence property gives f(∪_{i=1}^{k} A_i) = ln P{N(A) = 0} = ln ∏_{i=1}^{k} P{N(A_i) = 0} = ∑_{i=1}^{k} f(A_i). It follows that −f : B(X) → R_+ is an intensity measure, and P{N(A) = 0} = e^{f(A)}. Since EN(A) = −f(A) = Λ(A), the result follows.

Corollary 3.2 (Poisson process on the half-line). A random process N : Ω → Z_+^{R_+} indexed by time t ∈ R_+ is the counting process associated with a one-dimensional Poisson process S : Ω → R_+^N having intensity measure Λ iff
(a) Starting with N0 = 0, the process Nt takes a non-negative integer value for all t ∈ R+ ;
(b) the increment Ns − Nt is surely nonnegative for any s ⩾ t;
(c) the increments Nt1 , Nt2 − Nt1 , . . . , Ntn − Ntn−1 are independent for any 0 < t1 < t2 < · · · < tn−1 < tn ;
(d) the increment Ns − Nt is distributed as Poisson random variable with parameter Λ(t, s] for s ⩾ t.
The Poisson process is homogeneous with intensity λ iff, in addition to conditions (a), (b), (c), the distribution of the increment N_{t+s} − N_t depends on the value s ∈ R_+ but is independent of t ∈ R_+. That is, the increments are stationary.
Proof. We have already seen that the definition of Poisson processes implies all four conditions. Conditions (a) and (b) imply that N is a simple counting process on the half-line, condition (c) is the complete independence property of the point process, and condition (d) provides the intensity measure. The result follows from the equivalence iv in Theorem 3.1.

Lecture-25: Poisson processes: Conditional distribution

1 Joint conditional distribution of points in a finite window
Let X = R^d be a d-dimensional Euclidean space, and S : Ω → X^N be a Poisson point process with intensity measure Λ : B(X) → R_+ and associated counting process N : Ω → Z_+^{B(X)}.
Proposition 1.1. Let k ∈ N be any positive integer. Consider a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+, a finite partition A ∈ B(X)^k of a bounded set B ∈ B(X), and a vector n ∈ Z_+^k that partitions a non-negative integer m ∈ Z_+. Then,

P({N(A_1) = n_1, . . . , N(A_k) = n_k} | {N(B) = m}) = (m!/(n_1! ··· n_k!)) ∏_{i=1}^{k} (Λ(A_i)/Λ(B))^{n_i}.  (1)
Proof. From the definition of conditional probability and the fact that ∩_{i=1}^{k} {N(A_i) = n_i} ⊆ {N(B) = m}, we can write the conditional probability on the LHS as the ratio

P{N(A_1) = n_1, . . . , N(A_k) = n_k, N(B) = m}/P{N(B) = m} = P{N(A_1) = n_1, . . . , N(A_k) = n_k}/P{N(B) = m}.
From the complete independence property and the Poisson marginals of the joint distribution of (N(A_1), . . . , N(A_k)) for the partition A ∈ B(X)^k, and the fact that the intensity measure adds over disjoint sets, i.e. Λ(B) = ∑_{i=1}^{k} Λ(A_i), we can rewrite the RHS of the above equation as

P{N(A_1) = n_1, . . . , N(A_k) = n_k}/P{N(B) = m} = (∏_{i=1}^{k} e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!)/(e^{−Λ(B)} Λ(B)^m/m!) = (m!/(n_1! ··· n_k!)) ∏_{i=1}^{k} (Λ(A_i)/Λ(B))^{n_i}.
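This conditional multinomial law is easy to probe by simulation. A Monte Carlo sketch, assuming a homogeneous process on B = (0, 1] with cells A_1 = (0, 0.3] and A_2 = (0.3, 1] (illustrative choices): given m = 4 points in B, their locations are i.i.d. uniform, so the count in A_1 is Binomial(4, 0.3), the k = 2 case of the multinomial formula.

```python
import random
from math import comb

random.seed(1)
m, p1 = 4, 0.3          # m points in B; p1 = Lambda(A_1)/Lambda(B)
trials = 20000
hits = 0
for _ in range(trials):
    # Conditioned on N(B) = m, locations are i.i.d. uniform on (0, 1].
    n1 = sum(random.random() <= p1 for _ in range(m))
    hits += (n1 == 2)

empirical = hits / trials
exact = comb(4, 2) * p1**2 * (1 - p1)**2   # multinomial with k = 2 cells
```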

Remark 1. Consider a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+ and counting process N : Ω → Z_+^{B(X)}. Let A ∈ B(X)^k be a partition of a bounded set B ∈ B(X).

i Defining p_i ≜ Λ(A_i)/Λ(B), we see that (p_1, . . . , p_k) ∈ M([k]) is a probability distribution. We also observe that

p_i = P({N(A_i) = 1} | {N(B) = 1}) = P({|S ∩ A_i| = 1} | {|S ∩ B| = 1}).

When N ( B) = 1, we can call the point of S in B as S1 without any loss of generality. That is, if we call
{S1 } = S ∩ B, then we have
pi = P({S1 ∈ Ai } {S1 ∈ B}).
Similarly, when N(B) = n_i, we call the points of S in B as S_1, . . . , S_{n_i} and denote S ∩ B = {S_1, . . . , S_{n_i}}. For this case, we observe

P({N(A_i) = n_i} | {N(B) = n_i}) = P({|S ∩ A_i| = n_i} | {|S ∩ B| = n_i}) = p_i^{n_i} = P(∩_{j=1}^{n_i} {S_j ∈ A_i} | {{S_1, . . . , S_{n_i}} = S ∩ B}) = ∏_{j=1}^{n_i} P({S_j ∈ A_i} | {S_j ∈ B}).

ii We can rewrite Equation (1) as a multinomial distribution, where

P({N(A_1) = n_1, . . . , N(A_k) = n_k} | {N(B) = m}) = (m!/(n_1! ··· n_k!)) p_1^{n_1} ··· p_k^{n_k}.

iii For any finite set F ⊆ N of size m ∈ Z_+ and n ∈ Z_+^k a partition of m, we define P_k(F, n) to be the collection of all k-partitions E ∈ P(N)^k of F such that |E_i| = n_i for i ∈ [k]. Then, the multinomial coefficient counts the number of partitions of m points into sets with n_1, . . . , n_k points. That is,

m!/(n_1! ··· n_k!) = |P_k([m], n)|.

iv Recall that the event {N(A_i) = n_i} = {|S ∩ A_i| = n_i}. Hence, we can write

P(∩_{i=1}^{k} {|S ∩ A_i| = n_i} | {|S ∩ B| = m}) = (m!/(n_1! ··· n_k!)) p_1^{n_1} ··· p_k^{n_k} = ∑_{E∈P_k(S∩B,n)} ∏_{i=1}^{k} ∏_{S_j∈E_i} P({S_j ∈ A_i} | {S_j ∈ B}).

v When N(B) = m, we denote S ∩ B by F = {S_1, . . . , S_m} without any loss of generality. We further observe that when N(A_i) = n_i for all i ∈ [k], then (S ∩ A_1, . . . , S ∩ A_k) ∈ P_k(S ∩ B, n). Therefore, we can rewrite the event

∩_{i=1}^{k} {N(A_i) = n_i} = ∩_{i=1}^{k} {|S ∩ A_i| = n_i} = ∪_{E∈P_k(S∩B,n)} (∩_{i=1}^{k} {S ∩ A_i = E_i}).

That is, we can write the conditional probability conditioned on S ∩ B = F as

P(∩_{i=1}^{k} {N(A_i) = n_i} | {N(B) = m}) = ∑_{E∈P_k(F,n)} P(∩_{i=1}^{k} {S ∩ A_i = E_i} | {S ∩ B = F}) = ∑_{E∈P_k(F,n)} P(∩_{i=1}^{k} ∩_{S_j∈E_i} {S_j ∈ A_i} | {S ∩ B = F}).

vi Equating the RHS of the above equation term-wise, we obtain that, conditioned on each of these points falling inside the window B, the conditional probability of each point falling in the partition cell A_i is independent of all other points and given by p_i. That is, we have

P(∩_{i=1}^{k} ∩_{S_j∈E_i} {S_j ∈ A_i} | {S ∩ B = F}) = ∏_{i=1}^{k} ∏_{S_j∈E_i} P({S_j ∈ A_i} | {S_j ∈ B}) = ∏_{i=1}^{k} p_i^{n_i} = ∏_{i=1}^{k} (Λ(A_i)/Λ(B))^{n_i}.

It means that given m points in the window B, the locations of these points are independently and identically distributed in B according to the distribution Λ(·)/Λ(B).

vii If the Poisson process is homogeneous, the distribution is uniform over the window B.
viii For a Poisson process with intensity measure Λ and any bounded set A ∈ B(X), the number of points N(A) in the set A is a Poisson random variable with parameter Λ(A). Given the number of points N(A), the locations of all the points in S ∩ A are i.i.d. with density λ(x)/Λ(A) for all x ∈ A.
Remark 2 (Simulating a homogeneous Poisson point process). Suppose we are interested in simulating a two dimensional homogeneous Poisson point process with density λ in the unit square A = [0, 1] × [0, 1]. Then, we first generate the random variable N(A) : Ω → Z_+ that takes value n with probability e^{−λ} λ^n/n!. Next, for each of the N(A) = n points, we generate the location (X_i, Y_i) ∈ [0, 1]² uniformly at random. That is, X : Ω → [0, 1]^n and Y : Ω → [0, 1]^n are independent i.i.d. uniform sequences.
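The recipe above can be sketched directly in Python, assuming an illustrative intensity λ = 5 and using inverse-transform sampling for the Poisson count:

```python
import random
from math import exp

random.seed(0)
lam = 5.0

def sample_poisson(lam):
    """Inverse-transform sampling from the Poisson pmf."""
    u, n = random.random(), 0
    pmf = exp(-lam)     # P{N = 0}
    cdf = pmf
    while u > cdf:
        n += 1
        pmf *= lam / n  # P{N = n} from P{N = n-1}
        cdf += pmf
    return n

# Step 1: draw the number of points N(A); step 2: place them i.i.d. uniformly.
n = sample_poisson(lam)
points = [(random.random(), random.random()) for _ in range(n)]
```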

Corollary 1.2. For a homogeneous Poisson point process on the half-line with ordered set of points S̃ = (S_{(n)} ∈ R_+ : n ∈ N), we can write the conditional density of the ordered points (S_{(1)}, . . . , S_{(k)}) given the event {N_t = k} as that of the order statistics of i.i.d. uniformly distributed random variables. Specifically, we have

f_{S_{(1)},...,S_{(k)} | N_t=k}(t_1, . . . , t_k) = k! 1_{{0<t_1⩽···⩽t_k⩽t}}/t^k.

Proof. Given {N_t = k}, we can denote the points of the Poisson process in (0, t] by S_1, . . . , S_k. From the above remark, we know that S_1, . . . , S_k are i.i.d. uniform in (0, t], conditioned on the number of points N_t = k. Hence, we can write

F_{S_1,...,S_k | N_t=k}(t_1, . . . , t_k) = P(∩_{i=1}^{k} {S_i ∈ (0, t_i]} | {N_t = k}) = ∏_{i=1}^{k} P({S_i ∈ (0, t_i]} | {S_i ∈ (0, t]}) = ∏_{i=1}^{k} (t_i/t) 1_{{0<t_i⩽t}}.

Therefore, for 0 < t_1 < ··· < t_k < t and (h_1, . . . , h_k) sufficiently small, we have

P(∩_{i=1}^{k} {S_i ∈ (t_i, t_i + h_i]} | {N_t = k}) = ∏_{i=1}^{k} (h_i/t).

Since (S_1, . . . , S_k) are conditionally i.i.d. given {N_t = k}, it follows that for any permutation σ : [k] → [k], the conditional joint distribution of (S_{σ(1)}, . . . , S_{σ(k)}) is identical to that of (S_1, . . . , S_k). Further, we observe that the order statistics of (S_{σ(1)}, . . . , S_{σ(k)}) are identical to those of (S_1, . . . , S_k). Therefore, we can write the following equality of events:

∩_{i=1}^{k} {S_{(i)} ∈ (t_i, t_i + h_i]} = ∪_{σ:[k]→[k] permutation} ∩_{i=1}^{k} {S_{σ(i)} ∈ (t_i, t_i + h_i]}.

The result follows since the number of permutations σ : [k] → [k] is k!.
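The conditional uniformity in Corollary 1.2 is easy to test by simulation. A Monte Carlo sketch for a rate-1 homogeneous Poisson process on (0, 1] (illustrative parameters): conditioned on {N_t = k}, the first arrival S_(1) behaves like the minimum of k i.i.d. uniforms, so E[S_(1) | N_t = k] = t/(k + 1).

```python
import random

random.seed(2)
t, k = 1.0, 3
samples = []
while len(samples) < 4000:
    # Simulate rate-1 arrivals on (0, t] via exponential interarrival times.
    arrivals, s = [], random.expovariate(1.0)
    while s <= t:
        arrivals.append(s)
        s += random.expovariate(1.0)
    if len(arrivals) == k:           # condition on the event {N_t = k}
        samples.append(arrivals[0])  # record the first arrival S_(1)

mean_first = sum(samples) / len(samples)   # should approach t/(k+1) = 0.25
```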

Lecture-26: Properties of Poisson point processes

1 Laplace functional
Let X = R^d be the d-dimensional Euclidean space. Recall that all the random points are unique for a simple point process S : Ω → X^N, and hence S can also be considered as a set of countably many points in X. Let N : Ω → Z_+^{B(X)} be the counting process associated with the simple point process S.
Remark 1. We observe that dN(x) = 0 for all x ∉ S and dN(x) = δ_x 1_{{x∈S}}. Hence, for any Borel measurable function f : X → R and bounded A ∈ B(X), we have ∫_{x∈A} f(x) dN(x) = ∑_{x∈S∩A} f(x).
Definition 1.1. The Laplace functional L_S : R_+^X → R_+ of a point process S : Ω → X^N and associated counting process N : Ω → Z_+^{B(X)} is defined for all non-negative Borel measurable functions f : X → R_+ as

L_S(f) ≜ E exp(−∫_{R^d} f(x) dN(x)).
Remark 2. For a simple function f = ∑_{i=1}^{k} t_i 1_{A_i}, we can write the Laplace functional as a function of the vector (t_1, t_2, . . . , t_k),

L_S(f) = E exp(−∑_{i=1}^{k} t_i ∫_{A_i} dN(x)) = E exp(−∑_{i=1}^{k} t_i N(A_i)).

We observe that this is the joint Laplace transform of the random vector (N(A_1), . . . , N(A_k)). This way, one can compute all finite dimensional distributions of the counting process N.
Proposition 1.2. The Laplace functional of a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+ evaluated at any non-negative Borel measurable function f : X → R_+ is

L_S(f) = exp(−∫_X (1 − e^{−f(x)}) dΛ(x)).
Proof. For a bounded Borel measurable set A ∈ B(X), consider the truncated function g = f 1_A. Then,

L_S(g) = E exp(−∫_X g(x) dN(x)) = E exp(−∫_A f(x) dN(x)).
Clearly dN(x) = δ_x 1_{{x∈S}}, and hence we can write L_S(g) = E exp(−∑_{x∈S∩A} f(x)). We know that the probability of N(A) = |S ∩ A| = n points in the set A is given by

P{N(A) = n} = e^{−Λ(A)} Λ(A)^n/n!.
Given there are n points in the set A, the locations of the n points are independent with density

f_{S_1,...,S_n | N(A)=n}(x_1, . . . , x_n) = ∏_{i=1}^{n} (dΛ(x_i)/Λ(A)) 1_{{x_i∈A}}.
Hence, we can write the Laplace functional as

L_S(g) = e^{−Λ(A)} ∑_{n∈Z_+} (Λ(A)^n/n!) ∏_{i=1}^{n} ∫_A e^{−f(x_i)} dΛ(x_i)/Λ(A) = exp(−∫_X (1 − e^{−g(x)}) dΛ(x)).

The result follows from taking an increasing sequence of sets A_k ↑ X and the monotone convergence theorem.
Result follows from taking increasing sequences of sets Ak ↑ X and monotone convergence theorem.

1.1 Superposition of point processes
Definition 1.3. Let S_k : Ω → X^N be a simple point process with intensity measure Λ_k : B(X) → R_+ and counting process N_k : Ω → Z_+^{B(X)}, for each k ∈ N. The superposition of the point processes (S_k : k ∈ N) is defined as the point process S ≜ ∪_k S_k.
Remark 3. The counting process associated with the superposition point process S : Ω → X^N is given by N : Ω → Z_+^{B(X)} defined by N ≜ ∑_k N_k, and the intensity measure of the point process S is given by Λ : B(X) → R_+ defined by Λ = ∑_k Λ_k, from the monotone convergence theorem.
Remark 4. The superposition process S is simple iff ∑k Nk is locally finite.
Theorem 1.4. The superposition of independent Poisson point processes (Sk : k ∈ N) with intensities (Λk : k ∈
N) is a Poisson point process with intensity measure ∑k Λk if and only if the latter is a locally finite measure.
Proof. Consider the superposition S = ∪_k S_k of independent Poisson point processes S_k ⊆ X with intensity measures Λ_k. We will prove only the sufficiency part of this theorem. Assume that ∑_k Λ_k is a locally finite measure. Then N(A) = ∑_k N_k(A) is finite for all bounded sets A ∈ B(X) by the local finiteness assumption. In particular, we have dN(x) = ∑_k dN_k(x) for all x ∈ X. From the monotone convergence theorem and the independence of the counting processes, we have for any non-negative Borel measurable function f : X → R+,

L_S(f) = E exp(−∫_X f(x) ∑_k dN_k(x)) = ∏_k L_{S_k}(f) = exp(−∫_X (1 − e^{−f(x)}) ∑_k dΛ_k(x)).
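A quick simulation illustrates the sufficiency direction: merging two independent homogeneous Poisson processes on (0, 1] with assumed rates λ1 = 1.5 and λ2 = 2.5 should give counts whose mean and variance both equal λ1 + λ2 = 4, as a Poisson count must.

```python
import random

random.seed(1)

def poisson_count(rate, T=1.0):
    # N((0, T]) for a homogeneous Poisson process, via Exp(rate) inter-arrivals
    n, t = 0, random.expovariate(rate)
    while t <= T:
        n += 1
        t += random.expovariate(rate)
    return n

lam1, lam2, trials = 1.5, 2.5, 20000
# superposed count = sum of the two independent counts
counts = [poisson_count(lam1) + poisson_count(lam2) for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
# a Poisson(lam1 + lam2) count has mean = variance = 4.0
```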

1.2 Thinning of point processes


Definition 1.5. Consider a probability retention function p : X → [0, 1] and an independent Bernoulli point retention process Y : Ω → {0, 1}^X such that EY(x) = p(x) for all x ∈ X. The thinning of a point process S : Ω → X^N with the probability retention function p : X → [0, 1] is a point process S^(p) : Ω → X^N defined by

S^(p) ≜ (S_n ∈ S : Y(S_n) = 1),

where Y(S_n) is an independent indicator for the retention of each point S_n and E[Y(S_n) | S_n] = p(S_n).
Theorem 1.6. The thinning of a Poisson point process S : Ω → X^N of intensity measure Λ : B(X) → R+ with the retention probability function p : X → [0, 1] yields a Poisson point process S^(p) : Ω → X^N of intensity measure Λ^(p) : B(X) → R+ defined for all bounded A ∈ B(X) as Λ^(p)(A) ≜ ∫_A p(x) dΛ(x).

Proof. Let A ∈ B(X) be a bounded Borel measurable set, and let f : X → R+ be a non-negative Borel measurable function. Let N^(p) be the counting process associated with the thinned point process S^(p). For any bounded set A ∈ B(X), we have N^(p)(A) = ∑_{x∈S∩A} Y(x); that is, dN^(p)(x) = δ_x Y(x) 1{x∈S}. Therefore, for the non-negative function g(x) = f(x)1{x∈A}, we can write ∫_X g(x) dN^(p)(x) = ∫_A f(x) dN^(p)(x) = ∑_{x∈S∩A} f(x)Y(x). We can write the Laplace functional of the thinned point process S^(p) for the non-negative function g as

L_{S^(p)}(g) = E E[exp(−∫_A f(x) dN^(p)(x)) | N(A)] = ∑_{n∈Z+} P{N(A) = n} ∏_{i=1}^n E[e^{−f(S_i)Y(S_i)} | {S_i ∈ A}].

The first equality follows from the definition of the Laplace functional and nested expectations. The second equality follows from the fact that, given N(A) = n, the n points of a Poisson point process in A are i.i.d. Since Y is a Bernoulli process independent of the underlying process S with E[Y(S_i)] = p(S_i), we get

E[e^{−f(S_i)Y(S_i)} | {S_i ∈ S∩A}] = E[e^{−f(S_i)} p(S_i) + (1 − p(S_i)) | {S_i ∈ S∩A}].
Since each point x ∈ S∩A of the Poisson point process S is distributed according to dΛ(x)/Λ(A), we get

L_{S^(p)}(g) = e^{−Λ(A)} ∑_{n∈Z+} (1/n!) (∫_A (p(x)e^{−f(x)} + 1 − p(x)) dΛ(x))^n = exp(−∫_X (1 − e^{−g(x)}) p(x) dΛ(x)).

Result follows from taking increasing sequences of sets Ak ↑ X and monotone convergence theorem.
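The intensity of the thinned process can be checked by simulation. The sketch below uses an assumed retention function p(x) = x/2 on the window (0, 1] and parent intensity λ = 4, so the mean thinned count should be Λ^(p)((0,1]) = ∫_0^1 (x/2)·4 dx = 1.

```python
import random

random.seed(2)
lam, T = 4.0, 1.0
p = lambda x: 0.5 * x      # assumed retention probability function

def thinned_count():
    # generate the parent Poisson process from Exp(lam) gaps, then keep
    # the point at location t independently with probability p(t)
    n, t = 0, random.expovariate(lam)
    while t <= T:
        if random.random() < p(t):
            n += 1
        t += random.expovariate(lam)
    return n

trials = 30000
mean = sum(thinned_count() for _ in range(trials)) / trials
# expected thinned intensity of the window: ∫_0^1 p(x) λ dx = 4 * (1/4) = 1.0
```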

Lecture-27: Poisson process on the half-line

1 Simple point processes on the half-line


A stochastic process N : Ω → Z+^{R+} defined on the half-line is a counting process if
1. N_0 = 0, and
2. for each ω ∈ Ω, the sample path N(ω) : R+ → Z+ is a non-decreasing, integer-valued, and right-continuous function of time t ∈ R+.
Each discontinuity of the sample path of the counting process can be thought of as a jump of the process, as shown in Figure 1. A simple counting process has unit jump size almost surely. General point processes in higher dimensions do not have an inter-arrival time interpretation.

Figure 1: Sample path of a simple counting process N(t), jumping by one at each arrival instant S̃_1, S̃_2, S̃_3, S̃_4.

Definition 1.1. The points of discontinuity are also called the arrival instants of the counting process N.
The nth arrival instant is a random variable denoted S̃n : Ω → R+ , defined inductively as

S̃0 ≜ 0, S̃n ≜ inf {t ⩾ 0 : Nt ⩾ n} , n ∈ N.

Definition 1.2. The inter arrival time between (n − 1)th and nth arrival is denoted by Xn and written as
Xn ≜ S̃n − S̃n−1 .
Remark 1. For a simple point process, we have P { Xn = 0} = P { Xn ⩽ 0} = 0.
Lemma 1.3. A simple counting process N : Ω → Z+^{R+} and its arrival process S̃ : Ω → R+^N are inverse processes, i.e.

{S̃_n ⩽ t} = {N_t ⩾ n}.

Proof. Let ω ∈ {S̃_n ⩽ t}, then N_{S̃_n} = n by definition. Since N is a non-decreasing process, we have N_t ⩾ N_{S̃_n} = n. Conversely, let ω ∈ {N_t ⩾ n}, then it follows from the definition that S̃_n ⩽ t.
Corollary 1.4. For arrival instants S̃ : Ω → R+^N associated with a counting process N : Ω → Z+^{R+}, we have

{S̃_n ⩽ t, S̃_{n+1} > t} = {N_t = n} for all n ∈ Z+ and t ∈ R+.




Proof. It is easy to see that {S̃_{n+1} > t} = {S̃_{n+1} ⩽ t}^c = {N_t ⩾ n+1}^c = {N_t < n+1}. Hence,

{N_t = n} = {N_t ⩾ n, N_t < n+1} = {S̃_n ⩽ t, S̃_{n+1} > t}.

Lemma 1.5. Let F_n(x) be the distribution function for S̃_n, then P_n(t) ≜ P{N_t = n} = F_n(t) − F_{n+1}(t).
Proof. It suffices to observe that the following is a union of disjoint events,

{S̃_n ⩽ t} = {S̃_n ⩽ t, S̃_{n+1} > t} ∪ {S̃_n ⩽ t, S̃_{n+1} ⩽ t}.
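Specializing to the Poisson case developed in the next section, both sides of Lemma 1.5 have closed forms (the Erlang distribution F_n and the Poisson mass P_n), so the identity P_n(t) = F_n(t) − F_{n+1}(t) can be checked numerically; the parameter values below are arbitrary.

```python
import math

lam, t = 2.0, 1.5          # arbitrary rate and time

def F(n):
    # Erlang distribution function: P{S_n <= t} = 1 - e^{-lam*t} sum_{k<n} (lam*t)^k / k!
    return 1.0 - math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k)
                                          for k in range(n))

def P(n):
    # Poisson probability mass: P{N_t = n} = e^{-lam*t} (lam*t)^n / n!
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

# verify P_n(t) = F_n(t) - F_{n+1}(t) for the first few n
checks = [abs(P(n) - (F(n) - F(n + 1))) < 1e-12 for n in range(10)]
```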

2 IID exponential inter-arrival times characterization


Proposition 2.1. The counting process N : Ω → Z+^{R+} associated with a simple Poisson point process S : Ω → R+^N is Markov.
Proof. We define the event space F_t ≜ σ(N_s : s ⩽ t) as the history of the process until time t ∈ R+. Then, from the independent increment property of Poisson processes, we have for any historical event H_s ∈ F_s

P({N_t = n} | H_s ∩ {N_s = k}) = P({N_t − N_s = n − k} | H_s ∩ {N_s = k}) = P({N_t = n} | {N_s = k}).

The transition probability matrix is P(s, t) with its (k, n)th entry given by e^{−Λ(s,t]} Λ(s,t]^{n−k}/(n−k)!.

Remark 2. A Markov process X : Ω → X^R is time homogeneous if the transition matrix P(s, t) = P(t − s) for all t ⩾ s. Thus the counting process for a homogeneous Poisson point process is a time-homogeneous Markov process, as the transition probability matrix P(s, t) = P(t − s) with its (k, n)th entry given by e^{−λ(t−s)} (λ(t−s))^{n−k}/(n−k)!.

Theorem 2.2. The counting process N : Ω → Z+^{R+} associated with a simple Poisson point process S : Ω → R+^N is strongly Markov.
Proposition 2.3. A simple counting process N : Ω → Z+^{R+} is associated with a homogeneous Poisson process with a constant intensity density λ iff the inter-arrival time sequence X : Ω → R+^N is an i.i.d. random sequence with an exponential distribution of rate λ.
Proof. Let N be the counting process associated with a homogeneous Poisson point process on the half-line with constant intensity density λ. From equivalence iii in Theorem ??, we obtain for any t ∈ R+,

P{X_1 > t} = P{N_t = 0} = e^{−λt}.

It suffices to show that the inter-arrival time sequence X : Ω → R+^N is i.i.d. We can show that N is a Markov process with the strong Markov property. Since the sequence of ordered points S̃ : Ω → R+^N is a sequence of stopping times for the counting process, it follows from the strong Markov property of this process that (N_{S̃_n+t} − N_{S̃_n} : t ⩾ 0) is independent of σ(N_s : s ⩽ S̃_n), and hence of S̃_n and N_{S̃_n}. Further, we see that

X_{n+1} = inf{t > 0 : N_{S̃_n+t} − N_{S̃_n} = 1}.

It follows that X : Ω → R+^N is an independent sequence. For a homogeneous Poisson point process, we have N_{S̃_n+t} − N_{S̃_n} = N_t in distribution, and hence X_{n+1} has the same distribution as X_1 for each n ∈ N.
For the given i.i.d. inter-arrival time sequence X : Ω → R+^N distributed exponentially with rate λ, we define the nth arrival instant S̃_n ≜ ∑_{i=1}^n X_i for each n ∈ N, and the number of arrivals in time duration (0, t] as N_t ≜ ∑_{n∈N} 1{S̃_n ⩽ t} for all t ∈ R+. It follows that N is path-wise non-decreasing, integer-valued, right-continuous, and simple since P{X_1 ⩽ 0} = 0. Therefore, N is a simple counting process such that

P{N_t = 0} = P{X_1 > t} = e^{−λt}.

It follows that the void probabilities are exponential, and hence the random variable N_t is Poisson with parameter λt for all t ∈ R+. Hence, N is a counting process associated with a homogeneous Poisson process with the constant intensity density λ from equivalence ii in Theorem ??.
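The converse construction is directly simulable: building N_t from i.i.d. Exp(λ) inter-arrival times should reproduce the Poisson(λt) probability mass function. The sketch below uses arbitrary parameters λ = 1, t = 2.

```python
import math
import random

random.seed(3)
lam, T, trials = 1.0, 2.0, 50000

def count():
    # N_T built from i.i.d. Exp(lam) inter-arrival times, as in the converse proof
    n, t = 0, random.expovariate(lam)
    while t <= T:
        n += 1
        t += random.expovariate(lam)
    return n

freq = {}
for _ in range(trials):
    n = count()
    freq[n] = freq.get(n, 0) + 1

# compare empirical frequencies against the Poisson(lam*T) pmf e^{-lam*T}(lam*T)^n / n!
errs = [abs(freq.get(n, 0) / trials
            - math.exp(-lam * T) * (lam * T) ** n / math.factorial(n))
        for n in range(4)]
```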
For many proofs regarding Poisson processes, we partition the sample space with the disjoint events
{ Nt = n} for n ∈ Z+ . We need the following lemma that enables us to do that.
Lemma 2.4. For any finite time t > 0, the number of points on the interval (0, t] from a Poisson process is finite almost surely.
Proof. By the strong law of large numbers, we have lim_{n→∞} S̃_n/n = E[X_1] = 1/λ almost surely. Fix t > 0 and define the sample space subset M = {ω ∈ Ω : N(ω, t) = ∞}. For any ω ∈ M, we have S̃_n(ω) ⩽ t for all n ∈ N. This implies lim sup_n S̃_n(ω)/n = 0, and hence ω ∉ {lim_n S̃_n/n = 1/λ}. Hence, the probability measure of the set M is zero.

2.1 Distribution functions


Lemma 2.5. The following are true for the nth arrival instant S̃_n of the Poisson arrival process S̃ : Ω → R+^N with constant intensity density λ.
(a) The moment generating function is M_{S̃_n}(θ) = E[e^{θS̃_n}] = (λ^n/(λ−θ)^n) 1{θ<λ} + ∞·1{θ⩾λ}.
(b) The distribution function is F_n(t) ≜ P{S̃_n ⩽ t} = 1 − e^{−λt} ∑_{k=0}^{n−1} (λt)^k/k!.
(c) The density function is Gamma distributed with parameters n and λ. That is, f_n(s) = (λ(λs)^{n−1}/(n−1)!) e^{−λs}.
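Item (b) can be checked against a direct Monte Carlo estimate: S̃_n is a sum of n i.i.d. Exp(λ) inter-arrival times, so the empirical frequency of {S̃_n ⩽ t} should match the closed-form F_n(t). The values of λ, n, t below are arbitrary.

```python
import math
import random

random.seed(4)
lam, n, t, trials = 2.0, 3, 1.0, 40000

# Monte Carlo: S_n is a sum of n i.i.d. Exp(lam) inter-arrival times
hits = sum(sum(random.expovariate(lam) for _ in range(n)) <= t
           for _ in range(trials))
empirical = hits / trials

# closed form: F_n(t) = 1 - e^{-lam*t} sum_{k=0}^{n-1} (lam*t)^k / k!
F_n = 1.0 - math.exp(-lam * t) * sum((lam * t) ** k / math.factorial(k)
                                     for k in range(n))
```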
Corollary 2.6. Consider the counting process N : Ω → Z+^{R+} associated with the Poisson arrival process S̃ : Ω → R+^N having constant intensity density λ. The following are true.
(a) The relation between the distribution of the nth arrival instant and the probability mass function for the counting process is given by F_n(t) = ∑_{j⩾n} P_j(t).
(b) For each t ∈ R+, the probability mass function P_{N_t} ∈ M(Z+) for the discrete random variable N_t : Ω → Z+ is given by P_n(t) ≜ P_{N_t}(n) = P{N_t = n} = e^{−λt} (λt)^n/n!.
(c) The relation between the distribution of the nth arrival instant and the mean of the counting process is given by ∑_{n∈N} F_n(t) = EN_t.
(d) For each t ∈ R+, the mean E[N_t] = λt, explaining the rate parameter λ for the Poisson process.
Proof. We observe the inverse relationship {S̃_n ⩽ t} = {N_t ⩾ n} for all n ∈ Z+ and t ∈ R+.
(a) The result follows by taking probabilities on both sides of the inverse relationship, to get F_n(t) = P{S̃_n ⩽ t} = P{N_t ⩾ n} = ∑_{j⩾n} P{N_t = j} = ∑_{j⩾n} P_j(t).
(b) The result follows from the explicit form of the distribution of S̃_n and recognizing that P_n(t) = F_n(t) − F_{n+1}(t).
(c) The result follows from the observation ∑_{n∈N} F_n(t) = E ∑_{n∈N} 1{N_t ⩾ n} = ∑_{n∈N} P{N_t ⩾ n} = EN_t.
(d) The result follows by summing the distribution functions of the nth arrival instants, to get EN_t = ∑_{n∈N} F_n(t) = e^{−λt} ∑_{n∈N} ∑_{k⩾n} (λt)^k/k! = λt e^{−λt} ∑_{k∈N} (λt)^{k−1}/(k−1)! = λt.
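The chain of identities in (d) can be verified deterministically: truncating the series ∑_{n∈N} F_n(t) (the tail terms P{N_t ⩾ n} decay super-exponentially in n) recovers λt to floating-point accuracy for arbitrary parameters.

```python
import math

lam, t = 1.7, 2.3          # arbitrary parameters
x = lam * t

def F(n):
    # Erlang distribution function F_n(t) = P{N_t >= n}
    return 1.0 - math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(n))

# truncate the series sum_{n>=1} F_n(t); terms beyond n ~ 100 are negligible here
total = sum(F(n) for n in range(1, 100))
# total should equal E[N_t] = lam * t
```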

Remark 3. A Poisson process is not a stationary process. That is, the finite dimensional distributions are not
shift invariant. This is clear from looking at the first moment ENt = λt, which is linearly increasing in time.

Lecture-28: Compound Poisson Processes

1 Compound Poisson process


Definition 1.1. A compound Poisson process is a real-valued right-continuous process Z : Ω → R^{R+} with the following properties.
i Finitely many jumps: for all ω ∈ Ω, the sample path t 7→ Z_t(ω) has finitely many jumps in finite intervals,
ii Independent increments: for all t, s ⩾ 0, the increment Z_{t+s} − Z_t is independent of the past F_t ≜ σ(Z_u : u ⩽ t),
iii Stationary increments: for all t, s ⩾ 0, the distribution of Z_{t+s} − Z_t depends only on s and not on t.
Definition 1.2. For each ω ∈ Ω and n ∈ N, we can define the time and size of the nth jump as

S̃_0(ω) = 0, S̃_n(ω) = inf{t > S̃_{n−1} : Z_t(ω) ̸= Z_{S̃_{n−1}}(ω)},
Y_0(ω) = 0, Y_n(ω) = Z_{S̃_n}(ω) − Z_{S̃_{n−1}}(ω).

Remark 1. Recall that Fs = σ ( Zu , u ∈ (0, s]) is the collection of historical events until time s associated
with the process Z. If S̃n is almost surely finite for all n ∈ N, then the sequence of jump times S̃ : Ω → RN+
is a sequence of stopping times with respect to the natural filtration F• of the process Z.
Remark 2. Let N : Ω → Z+^{R+} be the simple counting process associated with the number of jumps of the compound Poisson process Z in (0, t], defined by N_t ≜ ∑_{n∈N} 1{S̃_n ⩽ t} for all t ∈ R+. Then S̃_n and Y_n are respectively the arrival instant and the size of the nth jump, and we can write Z_t = ∑_{i=1}^{N_t} Y_i.
Proposition 1.3. A stochastic process Z : Ω → R^{R+} is a compound Poisson process iff its jump times form a Poisson process and the jump sizes form an i.i.d. random sequence independent of the jump times.
Proof. We will prove it in two steps.
Implication: Let Z be a compound Poisson process with the jump instant sequence S̃ : Ω → R+^N and the jump size sequence Y : Ω → R^N. We will show that the counting process N : Ω → Z+^{R+} is simple and has stationary and independent increments, and that the jump size sequence Y is i.i.d.
Independence of jumps and increments: From the definition of the jump instant sequence S̃, it follows that the counting process N is adapted to the natural filtration F• of the compound Poisson process Z. Since Z_{t+s} − Z_t = ∑_{i=N_t+1}^{N_{t+s}} Y_i, and compound Poisson processes have independent increments, it follows that the increment (N_{t+s} − N_t : s ⩾ 0) and (Y_{N_t+j} : j ∈ N) are independent of the past F_t.
Stationarity: Let us assume that the jump sizes are positive; then we have

S̃_n = inf{t > S̃_{n−1} : Z_t > Z_{S̃_{n−1}}}, and {N_{t+s} − N_t = 0} = {Z_{t+s} − Z_t = 0}.

From the stationarity of the increments, it follows that the probability P{N_{t+s} − N_t = 0} is independent of t and equal to e^{−λs} for some λ ∈ R+. It follows that the counting process N : Ω → Z+^{R+} has stationary increments, and the associated jump sequence S̃ is homogeneous Poisson with intensity density λ.
Strong Markovity: The compound Poisson process has the Markov property from the stationary and independent increment properties. Further, since each sample path t 7→ Z_t is right continuous, the process satisfies the strong Markov property at each almost surely finite stopping time.

Inter jump times i.i.d.: We will inductively show that S̃ : Ω → R+^N is a sequence of stopping times, and hence that the inter-jump time sequence X : Ω → R+^N defined by X_n ≜ S̃_n − S̃_{n−1} for each n ∈ N is an i.i.d. sequence. From the exponential distribution of S̃_1, it follows that S̃_1 is almost surely finite and hence a stopping time. From the stationarity of increments of the compound Poisson process and the strong Markov property at the stopping time S̃_1, the increment S̃_2 − S̃_1 is independent of F_{S̃_1} and identical in distribution to S̃_1. The result follows inductively.
Jump size i.i.d. : From strong Markov property of Z, the jump size Yn is independent of the past
FS̃n−1 = σ ( Zu : u ⩽ S̃n−1 ) and from stationarity it is identically distributed to Y1 for each
n ∈ N. It follows that the jump size sequence Y is i.i.d. and independent of jump instant
sequence S̃.
Superposition: Similar arguments apply for negative jump sizes. For real-valued jump sizes, we can form two independent Poisson processes from the negative and positive jumps, and the superposition of these two processes is Poisson.
Converse: Let X : Ω → R+^N be an i.i.d. inter-jump time sequence distributed exponentially with rate λ, and let Y : Ω → R^N be an i.i.d. jump size sequence independent of X. We can define the jump instant sequence S̃ : Ω → R+^N as S̃_n ≜ ∑_{i=1}^n X_i for each n ∈ N, the counting process for the number of jumps N : Ω → Z+^{R+} as N_t ≜ ∑_{n∈N} 1{S̃_n ⩽ t} for each t ∈ R+, and the compound process Z : Ω → R^{R+} as Z_t ≜ ∑_{n=1}^{N_t} Y_n for each t ∈ R+.

Finitely many jumps: Since Nt is finite for any finite t, it follows that the compound Poisson pro-
cess Z has finitely many jumps in finite intervals.
Independence of increments: For any finite n ∈ N and disjoint finite intervals I_i for i ∈ [n], we can write Z(I_i) = ∑_{k=1}^{N(I_i)} Y_{ik}, where Y_{ik} denotes the kth jump size in the interval I_i. Since the independent sequence (N(I_i) : i ∈ [n]) and Y : Ω → R^N are also mutually independent, it follows that the Z(I_i) are independent.
Stationarity of increments: Further, the stationarity of the increments of the compound process is inferred from the distribution of Z(I_i), which is

P{Z(I_i) ⩽ x} = ∑_{m∈Z+} P{Z(I_i) ⩽ x, N(I_i) = m} = ∑_{m∈Z+} P{∑_{k=1}^m Y_{ik} ⩽ x} P{N(I_i) = m}.

Example 1.4. Examples of compound Poisson processes.

• The arrival of customers in a store is a Poisson process N. Each customer i spends an i.i.d. amount X_i independent of the arrival process. Define

Y_0 = 0, Y_n = ∑_{i=1}^n X_i, n ∈ N.

Now define Z_t ≜ Y_{N_t} as the amount spent by the customers arriving until time t ∈ R+. Then Z : Ω → R^{R+} is a compound Poisson process.

• Let the times between successive failures of a machine be independent and exponentially distributed, with an i.i.d. random cost of repair at each failure. Then the total cost of repair up to time t is a compound Poisson process.
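Wald's identity E[Z_t] = E[N_t] E[Y_1] = λt E[Y_1] gives a quick check on the store example. The simulation below uses assumed parameters: arrival rate λ = 2 and exponentially distributed spends with mean 3.

```python
import random

random.seed(5)
lam, T, mean_spend = 2.0, 1.0, 3.0   # assumed parameters for the illustration

def Z_T():
    # total spend of customers arriving in (0, T]: Z_T = sum_{i<=N_T} X_i
    z, t = 0.0, random.expovariate(lam)
    while t <= T:
        z += random.expovariate(1.0 / mean_spend)   # i.i.d. spend X_i
        t += random.expovariate(lam)
    return z

trials = 40000
mean = sum(Z_T() for _ in range(trials)) / trials
# Wald's identity: E[Z_T] = E[N_T] E[X_1] = lam * T * mean_spend = 6.0
```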
