Parimal Parag's Lecture Notes On Random Processes
{(b, f (b)) : b ∈ B} ⊆ B × A.
The sets B and A are called the domain and co-domain of the function f, and the set f(B) = {f(b) ∈ A : b ∈ B}
is called the range of the function f.
Example 1.2. Let B = {1, 2, 3} and A = {a, b}, then {(1, a), (2, a), (3, b)} corresponds to a function f : B → A
such that f(1) = f(2) = a, f(3) = b. We will denote this function by the ordered tuple (aab).
Notation: The collection of all A-valued functions with the domain B is denoted by A B .
Example 1.3. Let B = {1, 2, 3} and A = {a, b}, then the collection A^B is given by the set of ordered tuples {(aaa), (aab), (aba), (abb), (baa), (bab), (bba), (bbb)}.
Definition 1.4 (Inverse Map). For a function f ∈ A^B, we define the set inverse map f −1(C) = {b ∈ B : f(b) ∈ C}
for all subsets C ⊆ A.
Remark 1. This is a slight abuse of notation, since f −1 : P ( A) → P ( B) is a map from sets to sets.
Example 1.5. Let B = {1, 2, 3} and A = { a, b} and f be denoted by the ordered tuple ( aba), then f −1 ({ a}) =
{1, 3} and f −1 ({b}) = {2}.
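The set inverse map is easy to mirror in code. A small Python sketch (the function name `preimage` and the dict representation are mine, not from the notes), using the function (aba) of Example 1.5:

```python
def preimage(f, C):
    """Set inverse map: f^{-1}(C) = {b in the domain of f : f(b) in C}."""
    return {b for b, fb in f.items() if fb in C}

# The function denoted by the ordered tuple (aba) in Example 1.5,
# represented as a dict from B = {1, 2, 3} to A = {a, b}.
f = {1: "a", 2: "b", 3: "a"}
print(preimage(f, {"a"}))  # {1, 3}
print(preimage(f, {"b"}))  # {2}
```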
Example 1.7 (Injective function). Let B = {1, 2, 3} and A = {a, b, c, d}. Then the tuple (abc) denotes an injective function f : B → A.
Definition 1.8 (Cardinality). We denote the cardinality of a set A by | A|. If there is a bijection between two
sets, they have the same cardinality. Any set which is bijective to the set [ N ] has cardinality N.
Example 1.9. The cardinality of A = { a, b, c} is | A| = 3, since there is a bijection between B = {1, 2, 3} and
A = { a, b, c}.
Definition 1.10 (Countable). Any set which is bijective to a subset of natural numbers N is called a count-
able set. Any set which has a finite cardinality is called a countably finite set. Any set which is bijective to
the set of natural numbers N is called a countably infinite set.
1. |A^B| = |A|^{|B|}.
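This counting identity can be checked by enumeration; a quick Python sketch (my illustration) lists all A-valued functions on B as ordered tuples:

```python
from itertools import product

A, B = ("a", "b"), (1, 2, 3)
# Each tuple produced below assigns a value in A to every element of B,
# i.e. it is one function in A^B written in the ordered-tuple notation of Example 1.2.
functions = list(product(A, repeat=len(B)))
print(len(functions))               # 8
print(len(A) ** len(B))             # 8 = |A|^{|B|} = 2^3
```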
2 Sample space
Consider an experiment where the outcomes are random and unpredictable.
Definition 2.1 (Sample space). The set of all possible outcomes of a random experiment is called sample
space and denoted by Ω.
Example 2.2 (Single coin toss). Consider a single coin toss, where the outputs can be heads or tails, denoted
by H and T respectively. The set of all possible outcomes is Ω = { H, T }.
Example 2.3 (Finite coin tosses). Consider N tosses of a single coin, where the possible output of each
coin toss belongs to the set {H, T} as before. In this case, the sample space is Ω = {H, T}^[N], and a single
outcome is a sequence ω = (ω1, . . . , ωN) where ωi ∈ {H, T} for all i ∈ [N].
Example 2.4 (Countably infinite coin tosses). Consider an infinite sequence of coin tosses. The set of all
possible outcomes is Ω = {H, T}^N. A single outcome is a sequence ω = (ωi ∈ {H, T} : i ∈ N) ∈ Ω.
Example 2.5 (Point on non-negative real line). The set of all possible outcomes for a single point on non-
negative real line is Ω = R+ .
Example 2.6 (Countable points on non-negative real line). The set of all possible outcomes is Ω = R+^N,
where a single outcome is a sequence ω = (ωi ∈ R+ : i ∈ N) ∈ Ω.
3 Event space
Definition 3.1 (Event space). A collection of subsets of sample space Ω is called the event space if it is
a σ-algebra over subsets of Ω, and denoted by F. In other words, the collection F satisfies the following
properties.
1. Event space includes the certain event Ω. That is, Ω ∈ F.
2. Event space is closed under complements. That is, if A ∈ F, then Ac ∈ F.
3. Event space is closed under countable unions. That is, if Ai ∈ F for all i ∈ N, then ∪i∈N Ai ∈ F.
The elements of the event space F are called events.
Example 3.2 (Coarsest event space). For any sample space Ω, the trivial event space is F = {∅, Ω}.
Example 3.3 (Single coin toss). Recall that the sample space for a single coin toss is Ω = { H, T }. We define
an event space F ≜ {∅, { H } , { T } , { H, T }}. Can you verify that F is a σ-algebra?
Example 3.4 (Finest event space for finite coin tosses). Recall that the sample space for N coin tosses is
Ω = {H, T}^[N]. An event space for this sample space is F ≜ P(Ω) = {A : A ⊆ Ω}. Can you verify that F is a
σ-algebra?
Remark 2. We observe that F ⊆ P(Ω). However, not every subset of the sample space Ω necessarily belongs to
the event space; that is, we may have F ⊊ P(Ω). To show this, it suffices to construct a set A ⊆ Ω that is not in F.
Example 3.7 (Event space generated by a single event). For any sample space Ω and a subset A ⊆ Ω, the
smallest event space generated by this event A is F = {∅, A, Ac , Ω}.
Example 3.8 (Countably infinite coin tosses). Recall that the sample space for a countably infinite number
of coin tosses is Ω = {H, T}^N. We will construct an event space F on this outcome space, which would not
be the power set of the sample space. By definition, we must have the certain and the impossible event in
an event space. We consider the event space F generated by the events
An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]}, n ∈ N. (1)
That is, An is the event of getting at least one head in the first n tosses. We see that (An ∈ F : n ∈ N) is a sequence
of increasing events. From the closure under countable unions, we have ∪n∈N An = Ω \ {(T, T, . . . )} ∈ F.
Exercise 3.9. Consider a countably infinite sequence of coin tosses, with the sample space Ω =
{H, T}^N and the event space F generated by the events (An : n ∈ N) defined in Eq. (1). Let Bn
be the event of observing the first head on the nth toss, then show that Bn ∈ F for all n ∈ N.
Definition 3.10 (Borel event space). For sample space R, a Borel event space is generated by the subsets
Bx ≜ (−∞, x ] ⊆ R for each x ∈ R, and denoted by B(R) = σ ({ Bx : x ∈ R}).
Exercise 3.11. Show that the events { x } , ( x, y), [ x, y] belong to the Borel event space B(R) for all
x, y ∈ R.
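For Exercise 3.11, one possible route (my sketch, not the notes' solution) builds each set from the generators Bx using only σ-algebra operations, namely complements, countable intersections, and countable unions:

```latex
(x, y] = B_y \setminus B_x \quad (x < y), \qquad
\{x\} = \bigcap_{n \in \mathbb{N}} \left( x - \tfrac{1}{n},\, x \right]
      = \bigcap_{n \in \mathbb{N}} \left( B_x \setminus B_{x - 1/n} \right), \qquad
(x, y) = (x, y] \setminus \{y\}, \qquad
[x, y] = \{x\} \cup (x, y].
```

Each right-hand side uses only sets already shown to lie in B(R), so every set on the left belongs to B(R).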
Lecture-02: Probability Function
1 Indicator Functions
Definition 1.1 (Indicator functions). Consider a random experiment over an outcome space Ω and an
event space F. The indicator of an event B ∈ F is denoted by 1B, where 1B(ω) = 1 if ω ∈ B, and 1B(ω) = 0 if ω ∉ B.
Example 1.2. Consider a roll of a die with outcome space Ω = [6] and event space F = P(Ω). For the
event O ≜ {1, 3, 5} that represents odd outcomes of the die roll, we observe that 1O(1) = 1 and 1O(2) = 0.
Example 1.3. Consider N trials of a random experiment over outcome space Ω and the event space F. Let
ωn ∈ Ω denote the outcome of the experiment of the nth trial. For each event B ∈ F, we define the indicator
1B(ωn) = 1 if ωn ∈ B, and 1B(ωn) = 0 if ωn ∉ B.
For any event B ∈ F, the number of times the event B occurs in N trials is denoted by N(B) = ∑n∈[N] 1B(ωn).
We denote the relative frequency of the event B in N trials by N(B)/N.
Definition 1.4 (Disjoint events). Let (Ω, F) be a pair of sample and event space, and A ∈ F^N a sequence of
events. The sequence is mutually disjoint if An ∩ Am = ∅ for all m ̸= n ∈ N. If, in addition, ∪n∈N An = Ω, then A is
a partition of the sample space Ω.
Exercise 1.5. Consider the sample space Ω = {H, T}^N and event space F = σ({An : n ∈ N}), where
An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]}. Construct a sequence of non-trivial disjoint events.
Exercise 1.6. Let (Ω, F ) be a pair of sample and event space. For any sequence of events A ∈ FN ,
show that
1. 1∩n∈N An(ω) = ∏n∈N 1An(ω),
2. 1∪n∈N An(ω) = ∑n∈N 1An(ω) if the sequence A is mutually disjoint.
3. For the certain event Ω, we have N(Ω)/N = 1. This follows from the fact that N(Ω) = N.
Since the relative frequency is non-negative and bounded, it may converge to a real number as N grows
large. That is, the limit limN→∞ N(B)/N may exist.
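This stabilization of relative frequencies can be watched numerically. A Python simulation sketch (the bias p and the seed are my choices, not from the notes) estimates N(B)/N for the event B of observing a head:

```python
import random

random.seed(1)               # fixed seed for reproducibility
p = 0.3                      # assumed bias of the coin (probability of heads)
N = 100_000
count = 0                    # N(B): number of occurrences of the event B = {heads}
for n in range(1, N + 1):
    count += random.random() < p     # adds the indicator 1_B(omega_n)
rel_freq = count / N
print(rel_freq)              # settles near p as N grows
```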
2 Probability axioms
Inspired by the relative frequency, we list the following axioms for a probability function P : F → [0, 1].
Axiom 2.1 (Axioms of probability). We define a probability measure on sample space Ω and event space
F by a function P : F → [0, 1] which satisfies the following axioms.
Non-negativity: For all events B ∈ F, we have P( B) ⩾ 0.
σ-additivity: For an infinite sequence A ∈ FN of mutually disjoint events, we have P(∪i∈N Ai ) = ∑i∈N P( Ai ).
Certainty: P(Ω) = 1.
Definition 2.2 (Probability space). A sample space Ω, an event space F ⊆ P (Ω), and a probability measure
P : F → [0, 1], together define a probability space (Ω, F, P).
3 Properties of Probability
Theorem 3.1. For any probability space (Ω, F, P), we have the following properties of probability measure.
impossibility: P(∅) = 0.
finite additivity: For mutually disjoint events A ∈ F^n, we have P(∪i∈[n] Ai) = ∑i∈[n] P(Ai).
monotonicity: If events A, B ∈ F such that A ⊆ B, then P( A) ⩽ P( B).
inclusion-exclusion: For any events A, B ∈ F, we have P( A ∪ B) = P( A) + P( B) − P( A ∩ B).
continuity: For a sequence of events A ∈ F^N such that limn An exists, we have P(limn→∞ An) = limn→∞ P(An).
Proof. We consider the probability space (Ω, F, P).
1. We take disjoint events E ∈ F^N where E1 = Ω and Ei = ∅ for i ⩾ 2. It follows that ∪i∈N Ei = Ω and E is
a collection of mutually disjoint events. From the countable additivity axiom of probability, it follows
that
P(Ω) = P(∪i∈N Ei) = P(Ω) + ∑i⩾2 P(∅).
Since P(Ω) = 1 is finite, we obtain ∑i⩾2 P(∅) = 0, and hence P(∅) = 0 from the non-negativity of probability.
2. Finite additivity follows from σ-additivity by extending the finite sequence with Ai = ∅ for all i > n, since P(∅) = 0.
3. For events A, B ∈ F such that A ⊆ B, we can take disjoint events E1 = A and E2 = B \ A. From closure
under complements and intersection, it follows that E2 ∈ F. From non-negativity of probability, we
have P( E2 ) ⩾ 0. Finally, the result follows from finite additivity of disjoint events
P( B) = P( E1 ∪ E2 ) = P( E1 ) + P( E2 ) ⩾ P( A).
4. For any two events A, B ∈ F, we can write the following events as disjoint unions
A = ( A \ B ) ∪ ( A ∩ B ), B = ( B \ A ) ∪ ( A ∩ B ), A ∪ B = ( A \ B ) ∪ ( A ∩ B ) ∪ ( B \ A ).
The result follows from the finite additivity of probability of disjoint events.
5. To show the continuity of probability in events, we first need to understand the limits of events. We
show the continuity of probability in the next section.
Example 3.2. Consider a single coin toss with the sample space Ω = {H, T} and event space F = P(Ω).
The probability measure P : F → [0, 1] is defined as P(∅) = 0, P({H}) = p, P({T}) = 1 − p, and P(Ω) = 1, for some p ∈ [0, 1].
Example 3.3. Consider N coin tosses with the sample space Ω = {H, T}^[N] and an event space F = P(Ω). For
each outcome ω ∈ Ω, we define the number of heads as NH(ω) ≜ ∑n∈[N] 1{H}(ωn) and the number of tails
as NT(ω) ≜ N − NH(ω). For a bias p ∈ [0, 1], the probability measure P : F → [0, 1] is defined for each event A ∈ F as
P(A) ≜ ∑ω∈A p^{NH(ω)} (1 − p)^{NT(ω)}.
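For small N, this measure can be verified by brute force. A Python sketch (the names and the chosen N, p are mine) enumerates Ω = {H, T}^[N] and checks the axioms numerically:

```python
from itertools import product

N, p = 4, 0.3                                # assumed number of tosses and bias
Omega = list(product("HT", repeat=N))        # sample space {H, T}^[N]

def prob(A):
    """P(A) as the sum over omega in A of p^{N_H(omega)} (1 - p)^{N_T(omega)}."""
    return sum(p ** w.count("H") * (1 - p) ** w.count("T") for w in A)

print(abs(prob(Omega) - 1.0) < 1e-12)        # certainty axiom: P(Omega) = 1
first_head = [w for w in Omega if w[0] == "H"]
print(abs(prob(first_head) - p) < 1e-12)     # a head on the first toss has probability p
```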
4 Limits of Sets
Definition 4.1 (Limits of monotonic sets). For a sequence of non-decreasing sets (An : n ∈ N), we can
define the limit as
limn→∞ An ≜ ∪n∈N An.
Similarly, for a sequence of non-increasing sets (An : n ∈ N), we can define the limit as
limn→∞ An ≜ ∩n∈N An.
Definition 4.3 (Limits of sets). For a sequence of sets (An : n ∈ N), we can define the limit superior and
limit inferior of this sequence of sets as
lim supn→∞ An ≜ ∩n∈N ∪m⩾n Am = limn→∞ ∪m⩾n Am, lim infn→∞ An ≜ ∪n∈N ∩m⩾n Am = limn→∞ ∩m⩾n Am.
Lemma 4.4. For a sequence of sets ( An : n ∈ N), we have lim infn→∞ An ⊆ lim supn→∞ An .
Proof. For each n ∈ N, we define En ≜ ∪m⩾n Am and Fn ≜ ∩m⩾n Am. Fix n ∈ N. For every k ⩽ n, we have
Fk ⊆ Fn ⊆ En, and for every m ⩾ n, we have Fm ⊆ Am ⊆ En. Therefore, ∪k∈N Fk ⊆ En for each n ∈ N, and hence
lim infn→∞ An = ∪k∈N Fk ⊆ ∩n∈N En = lim supn→∞ An.
Definition 4.5. If the limit superior and limit inferior of any sequence of sets ( An : n ∈ N) are equal, then
the sequence of sets has a limit A∞ , which is defined as
A∞ ≜ limn→∞ An = lim supn→∞ An = lim infn→∞ An.
Example 4.6 (Sequence of sets with different limits). We consider the sequence of sets (An = [−2, (−1)^n + 1/n] :
n ∈ N). It follows that Fn = ∩m⩾n Am = [−2, −1], and
En = ∪m⩾n Am = [−2, 1 + 1/(n + 1)] for n odd, and En = [−2, 1 + 1/n] for n even.
Therefore,
lim infn An = ∪n∈N Fn = [−2, −1], lim supn An = ∩n∈N En = [−2, 1].
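The endpoints in this example can be checked numerically. A Python sketch (my illustration; the truncation M stands in for the infinite union and intersection) verifies the right endpoints of En and the unattained infimum −1 of the right endpoints of the Am:

```python
from fractions import Fraction

def c(m):
    """Right endpoint of A_m = [-2, (-1)^m + 1/m]."""
    return Fraction((-1) ** m) + Fraction(1, m)

M = 10_000                      # truncation standing in for m -> infinity
for n in range(1, 6):
    sup_c = max(c(m) for m in range(n, M))      # right endpoint of E_n
    expected = 1 + Fraction(1, n + 1) if n % 2 == 1 else 1 + Fraction(1, n)
    assert sup_c == expected    # E_n = [-2, 1 + 1/(n+1)] (n odd) or [-2, 1 + 1/n] (n even)
# The infimum of the right endpoints is -1, approached but never attained:
assert min(c(m) for m in range(1, M)) > -1
```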
Continuity for decreasing sets. Similarly, for a non-increasing sequence of sets ( Bn ∈ F : n ∈ N), we can
find the non-decreasing sequence of sets ( Bnc ∈ F : n ∈ N). By the first part, we have
P(limn→∞ Bn) = P(∩n∈N Bn) = 1 − P(∪n∈N Bnc) = 1 − P(limn→∞ Bnc) = 1 − limn→∞ P(Bnc) = limn→∞ P(Bn).
Continuity for a general sequence of sets. We can similarly prove the general result for a sequence of
sets (An ∈ F : n ∈ N) such that the limit limn An exists. We can define the non-increasing sequence of sets
(En = ∪m⩾n Am ∈ F : n ∈ N) and the non-decreasing sequence of sets (Fn = ∩m⩾n Am ∈ F : n ∈ N). From the
continuity of probability for monotonic sets, we have
P(lim supn An) = P(∩n∈N En) = limn→∞ P(En), P(lim infn An) = P(∪n∈N Fn) = limn→∞ P(Fn).
Since the limit limn An exists, we have lim supn An = lim infn An = limn An. Further, since Fn ⊆ An ⊆ En,
monotonicity gives P(Fn) ⩽ P(An) ⩽ P(En), and hence limn→∞ P(An) = P(limn→∞ An).
Lecture-03: Independence
Exercise 1.1 (Countably infinite coin tosses). Consider a sequence of coin tosses, such that the
sample space is Ω = {H, T}^N. For the set of outcomes En ≜ {ω ∈ Ω : ωn = H}, we consider the event
space F ≜ σ({En : n ∈ N}). Let Fn be the event space generated by the first n coin
tosses, i.e. Fn ≜ σ({Ei : i ∈ [n]}). Let An be the set of outcomes corresponding to at least one head
in the first n outcomes, An ≜ {ω ∈ Ω : ωi = H for some i ∈ [n]} = ∪i∈[n] Ei ∈ F, and Bn be the set of out-
comes corresponding to the first head at the nth outcome, Bn ≜ {ω ∈ Ω : ω1 = · · · = ωn−1 = T, ωn = H} =
∩i∈[n−1] Eic ∩ En ∈ F.
1. Show that F = σ ({Fn : n ∈ N}).
Theorem 1.2 (Law of total probability). For a probability space (Ω, F, P), consider a sequence of events B ∈ FN
that partitions the sample space Ω, i.e. Bm ∩ Bn = ∅ for all m ̸= n, and ∪n∈N Bn = Ω. Then, for any event A ∈ F,
we have
P(A) = ∑n∈N P(A ∩ Bn).
Proof. We can expand any event A ∈ F in terms of any partition B of the sample space Ω as
A = A ∩ Ω = A ∩ (∪n∈N Bn ) = ∪n∈N ( A ∩ Bn ).
From the mutual disjointness of the events B ∈ FN , it follows that the sequence ( A ∩ Bn ∈ F : n ∈ N) is
mutually disjoint. The result follows from the countable additivity of probability of disjoint events.
Example 1.3 (Countably infinite coin tosses). Consider the sample space Ω = {H, T}^N and event space F
generated by the sequence E ∈ F^N defined in Exercise 1.1. We observe that any event A ∈ Fn can be written as
the union of atoms
A = ∪ω∈A ∩i∈[n] Ci(ω), where Ci(ω) ≜ Ei if ω ∈ Ei, and Ci(ω) ≜ Eic if ω ∉ Ei.
2 Independence
Definition 2.1 (Independence of events). For a probability space (Ω, F, P), a family of events A ∈ F I is said
to be independent, if for any finite set F ⊆ I, we have
P(∩i∈F Ai) = ∏i∈F P(Ai).
Remark 1. The certain event Ω and the impossible event ∅ are independent of every event A ∈ F.
Example 2.2 (Two coin tosses). Consider two coin tosses, such that the sample space is Ω =
{ HH, HT, TH, TT }, and the event space is F = P (Ω). It suffices to define a probability function P : F → [0, 1]
on the sample space. We define one such probability function P, such that
P({HH}) = P({HT}) = P({TH}) = P({TT}) = 1/4.
Let event E1 ≜ { HH, HT } and E2 ≜ { HH, TH } correspond to getting a head on the first or the second toss
respectively.
From the defined probability function, the probability of getting a tail on the first or the
second toss is 1/2, identical to the probability of getting a head on the first or the second toss. That is,
P(E1) = P(E2) = 1/2, and the intersecting event E1 ∩ E2 = {HH} has probability P(E1 ∩ E2) = 1/4. Hence,
for events E1, E2 ∈ F, we have
P(E1 ∩ E2) = P(E1)P(E2).
That is, events E1 and E2 are independent.
Example 2.3 (Countably infinite coin tosses). Consider the outcome space Ω = {H, T}^N and event space
F generated by the sequence E defined in Exercise 1.1. We define a probability function P : F → [0, 1] by
P(∩i∈S Ei) = p^{|S|} for any finite subset S ⊆ N. By definition, E ∈ F^N is a sequence of independent events.
Consider A, B ∈ F^N, where An ≜ ∪i∈[n] Ei and Bn ≜ ∩i∈[n−1] Eic ∩ En ∈ F for all n ∈ N. It follows that P(An) =
1 − (1 − p)^n and P(Bn) = p(1 − p)^{n−1} for n ∈ N.
For any ω ∈ Ω, we can define the number of heads in the first n trials by kn(ω) ≜ ∑i∈[n] 1{H}(ωi) =
∑i∈[n] 1Ei(ω). For any general event A ∈ Fn = σ({Ei : i ∈ [n]}), we can write
P(A) = ∑ω∈A ∏i∈[n] P(Ei)^{1Ei(ω)} P(Eic)^{1−1Ei(ω)} = ∑ω∈A p^{kn(ω)} (1 − p)^{n−kn(ω)}.
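The probabilities P(An) = 1 − (1 − p)^n and P(Bn) = p(1 − p)^{n−1} derived above can be confirmed by enumerating {H, T}^n under the product measure; a Python sketch (the names and the chosen n, p are mine):

```python
from itertools import product

n, p = 5, 0.3                                  # assumed length and bias
outcomes = list(product("HT", repeat=n))
P = {w: p ** w.count("H") * (1 - p) ** w.count("T") for w in outcomes}

# A_n: at least one head in the first n tosses; B_n: first head on the nth toss.
P_An = sum(P[w] for w in outcomes if "H" in w)
P_Bn = sum(P[w] for w in outcomes if w[: n - 1] == ("T",) * (n - 1) and w[n - 1] == "H")

print(abs(P_An - (1 - (1 - p) ** n)) < 1e-12)        # P(A_n) = 1 - (1-p)^n
print(abs(P_Bn - p * (1 - p) ** (n - 1)) < 1e-12)    # P(B_n) = p (1-p)^{n-1}
```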
Example 2.4 (Counterexample). Consider a probability space (Ω, F, P) and the events A1, A2, A3 ∈ F. The
condition P( A1 ∩ A2 ∩ A3 ) = P( A1 ) P( A2 ) P( A3 ) is not sufficient to guarantee independence of the three
events. In particular, we see that if
P ( A1 ∩ A2 ∩ A3 ) = P ( A1 ) P ( A2 ) P ( A3 ), P( A1 ∩ A2 ∩ A3c ) ̸= P( A1 ) P( A2 ) P( A3c ),
then P( A1 ∩ A2 ) = P( A1 ∩ A2 ∩ A3 ) + P( A1 ∩ A2 ∩ A3c ) ̸= P( A1 ) P( A2 ).
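A concrete instance of this counterexample (my construction, not from the notes) takes two fair dice with A1 = {first die ⩽ 3}, A2 = {first die ∈ {3, 4, 5}}, and A3 = {sum equals 9}: the triple product identity holds, yet A1 and A2 are not independent. A Python check:

```python
from itertools import product
from fractions import Fraction

Omega = list(product(range(1, 7), repeat=2))         # two fair dice
def P(A):
    return Fraction(len(A), len(Omega))              # uniform probability

A1 = {w for w in Omega if w[0] <= 3}
A2 = {w for w in Omega if w[0] in {3, 4, 5}}
A3 = {w for w in Omega if sum(w) == 9}

assert P(A1 & A2 & A3) == P(A1) * P(A2) * P(A3)      # triple product identity holds
assert P(A1 & A2) != P(A1) * P(A2)                   # yet A1, A2 are not independent
```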
Definition 2.5. A family of collections of events (Ai ⊆ F : i ∈ I ) is called independent, if for any finite set
F ⊆ I and Ai ∈ Ai for all i ∈ F, we have
P(∩i∈F Ai) = ∏i∈F P(Ai).
3 Conditional Probability
Consider N trials of a random experiment over an outcome space Ω and an event space F. Let ωn ∈ Ω
denote the outcome of the experiment of the nth trial. Consider two events A, B ∈ F and denote the number
of times event A and event B occurs by N ( A) and N ( B) respectively. We denote the number of times both
events A and B occurred by N ( A ∩ B). Then, we can write these numbers in terms of indicator functions as
N(A) = ∑n∈[N] 1{ωn ∈ A}, N(B) = ∑n∈[N] 1{ωn ∈ B}, N(A ∩ B) = ∑n∈[N] 1{ωn ∈ A ∩ B}.
We denote the relative frequencies of events A, B, A ∩ B in N trials by N(A)/N, N(B)/N, N(A ∩ B)/N respectively. We can
find the relative frequency of the event A on the trials where B occurred as
(N(A ∩ B)/N) / (N(B)/N) = N(A ∩ B)/N(B).
Inspired by the relative frequency, we define the conditional probability function conditioned on events.
Definition 3.1. Fix an event B ∈ F such that P(B) > 0. We can define the conditional probability P(·|B) :
F → [0, 1] of any event A ∈ F conditioned on the event B as
P(A|B) = P(A ∩ B)/P(B).
Lemma 3.2 (Conditional probability). For any event B ∈ F such that P( B) > 0, the conditional probability
P(·| B) : F → [0, 1] is a probability measure on space (Ω, F ).
Proof. We will show that the conditional probability satisfies all three axioms of a probability measure.
Non-negativity: For all events A ∈ F, we have P( A| B) ⩾ 0 since P( A ∩ B) ⩾ 0.
σ-additivity: For an infinite sequence of mutually disjoint events (Ai ∈ F : i ∈ N) such that Ai ∩ Aj = ∅
for all i ̸= j, we have P(∪i∈N Ai|B) = ∑i∈N P(Ai|B). This follows from the disjointness of the sequence
(Ai ∩ B ∈ F : i ∈ N), the σ-additivity of P, and the identity ∪i∈N (Ai ∩ B) = (∪i∈N Ai) ∩ B.
Certainty: Since Ω ∩ B = B, we have P(Ω| B) = 1.
Remark 2. For two independent events A, B ∈ F such that P( A ∩ B) > 0, we have P( A| B) = P( A) and
P( B| A) = P( B). If either P( A) = 0 or P( B) = 0, then P( A ∩ B) = 0.
Remark 3. For any partition B of the sample space Ω, if P( Bn ) > 0 for all n ∈ N, then from the law of total
probability and the definition of conditional probability, we have
P(A) = ∑n∈N P(A|Bn)P(Bn).
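This identity is easy to sanity-check on a finite example. A Python sketch (a fair die with the odd/even partition; the construction is mine, not from the notes) verifies P(A) = ∑n P(A|Bn)P(Bn):

```python
from fractions import Fraction

Omega = set(range(1, 7))                     # one fair die
def P(A):
    return Fraction(len(A & Omega), len(Omega))

A = {5, 6}                                   # event: roll at least 5
partition = [{1, 3, 5}, {2, 4, 6}]           # odd and even outcomes partition Omega

def cond(A, B):
    """P(A|B) = P(A ∩ B) / P(B), defined here since P(B) > 0."""
    return P(A & B) / P(B)

total = sum(cond(A, Bn) * P(Bn) for Bn in partition)
assert total == P(A)                         # law of total probability
```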
4 Conditional Independence
Definition 4.1 (Conditional independence of events). For a probability space (Ω, F, P), a family of events
A ∈ F I is said to be conditionally independent given an event C ∈ F such that P(C ) > 0, if for any finite set
F ⊆ I, we have
P(∩i∈ F Ai |C ) = ∏ P( Ai |C ).
i∈ F
Remark 4. Let C ∈ F be an event such that P(C ) > 0. Two events A, B ∈ F are said to be conditionally
independent given event C, if
P ( A ∩ B | C ) = P ( A | C ) P ( B | C ).
If the event C = Ω, it implies that A, B are independent events.
Remark 5. Two events may be independent, but not conditionally independent and vice versa.
Example 4.2. Consider two independent events A, B ∈ F such that P( A ∩ B) > 0 and P( A ∪ B) < 1. Then
the events A and B are not conditionally independent given A ∪ B. To see this, we observe that
P(A ∩ B | A ∪ B) = P((A ∩ B) ∩ (A ∪ B)) / P(A ∪ B) = P(A ∩ B) / P(A ∪ B) = P(A)P(B) / P(A ∪ B) = P(A | A ∪ B) P(B).
We further observe that P(B | A ∪ B) = P(B) / P(A ∪ B) ̸= P(B), and hence P(A ∩ B | A ∪ B) ̸= P(A | A ∪ B) P(B | A ∪ B).
Example 4.3. Consider two non-independent events A, B ∈ F such that P( A) > 0. Then the events A and B
are conditionally independent given A. To see this, we observe that
P(A ∩ B | A) = P(A ∩ B) / P(A) = P(B | A) = P(B | A) P(A | A), since P(A | A) = 1.
Lecture-04: Random Variable
1 Random Variable
Definition 1.1 (Random variable). Consider a probability space (Ω, F, P). A random variable X : Ω → R
is a real-valued function from the sample space to real numbers, such that for each x ∈ R the event
A X ( x ) ≜ {ω ∈ Ω : X (ω ) ⩽ x } = { X ⩽ x } = X −1 (−∞, x ] = X −1 ( Bx ) ∈ F.
Remark 1. Recall that the set A X ( x ) is always a subset of sample space Ω for any mapping X : Ω → R, and
A X ( x ) ∈ F is an event when X is a random variable.
Example 1.2 (Constant function). Consider a mapping X : Ω → {c} ⊆ R defined on an arbitrary probability
space (Ω, F, P), such that X(ω) = c for all outcomes ω ∈ Ω. We observe that
AX(x) = X−1(Bx) = ∅ for x < c, and AX(x) = Ω for x ⩾ c.
That is, AX(x) ∈ F for every event space F, and hence X is a random variable measurable with respect to
every event space.
Example 1.3 (Indicator function). For an arbitrary probability space (Ω, F, P) and an event A ∈ F,
consider the indicator function 1A : Ω → {0, 1}. Let x ∈ R and Bx = (−∞, x], then it follows that
AX(x) = 1A−1(Bx) = Ω for x ⩾ 1, AX(x) = Ac for x ∈ [0, 1), and AX(x) = ∅ for x < 0.
That is, A X ( x ) ∈ F for all x ∈ R, and hence the indicator function 1 A is a random variable.
Definition 1.4 (Distribution function). The distribution function of a random variable X is the mapping FX : R → [0, 1] defined as FX(x) ≜ P(AX(x)) = P({X ⩽ x}) = P ◦ X−1(−∞, x] = P ◦ X−1(Bx).
Example 1.5 (Constant random variable). Let X : Ω → {c} ⊆ R be a constant random variable defined
on the probability space (Ω, F, P). The distribution function is a right-continuous step function at c with
step-value unity. That is, FX ( x ) = 1[c,∞) ( x ). We observe that P({ X = c}) = 1.
Example 1.6 (Indicator random variable). For an indicator random variable 1 A : Ω → {0, 1} defined
on a probability space (Ω, F, P) and an event A ∈ F, we have
FX(x) = 1 for x ⩾ 1, FX(x) = 1 − P(A) for x ∈ [0, 1), and FX(x) = 0 for x < 0.
Lemma 1.7 (Properties of distribution function). The distribution function FX for any random variable X sat-
isfies the following properties.
1. Let x1, x2 ∈ R such that x1 ⩽ x2. Then for any ω ∈ AX(x1), we have X(ω) ⩽ x1 ⩽ x2, and it follows that
ω ∈ AX(x2). This implies that AX(x1) ⊆ AX(x2). The result follows from the monotonicity of the probability.
2. For any x0 ∈ R, consider a monotonically decreasing sequence x ∈ R^N such that limn xn = x0. It
follows that the sequence of events (Axn = X−1(−∞, xn] ∈ F : n ∈ N) is monotonically decreasing,
and hence limn∈N Axn = ∩n∈N Axn = Ax0. The right-continuity then follows from the continuity of
probability, since limn→∞ FX(xn) = limn→∞ P(Axn) = P(Ax0) = FX(x0).
Similarly, we can take a monotonically decreasing sequence x ∈ RN such that limn xn = −∞, then
( A xn ∈ F : n ∈ N) is a monotonically decreasing sequence of sets and limn A xn = ∩n∈N A xn = ∅. From
the continuity of probability, it follows that limxn →−∞ FX ( xn ) = 0.
Remark 5. If two reals x1 < x2, then FX(x1) ⩽ FX(x2), with equality if and only if P({x1 < X ⩽ x2}) = 0. This
follows from the fact that Ax2 = Ax1 ∪ X−1(x1, x2].
Remark 6. The event space generated by a random variable is the collection of the inverses of Borel sets,
i.e. σ(X) = {X−1(B) : B ∈ B(R)}. This follows from the fact that AX(x) = X−1(Bx) and the inverse map
respects countable set operations such as unions, complements, and intersections. That is, if B ∈ B(R) =
σ({Bx : x ∈ R}), then X−1(B) ∈ σ({AX(x) : x ∈ R}). Similarly, if A ∈ σ(X) = σ({AX(x) : x ∈ R}), then
A = X−1(B) for some B ∈ σ({Bx : x ∈ R}).
Example 1.9 (Constant random variable). Let X : Ω → {c} ⊆ R be a constant random variable defined
on the probability space (Ω, F, P). Then the smallest event space generated by this random variable is
σ ( X ) = {∅, Ω}.
Example 1.10 (Indicator random variable). Let 1A be an indicator random variable defined on the
probability space (Ω, F, P) for an event A ∈ F, then the smallest event space generated by this random
variable is σ(1A) = σ({∅, Ac, Ω}) = {∅, A, Ac, Ω}.
Example 1.12 (Bernoulli random variable). For the probability space (Ω, F, P), the Bernoulli random
variable is a mapping X : Ω → {0, 1} with P({X = 1}) = p. We observe that the Bernoulli random variable is an
indicator for the event A ≜ X−1{1}, with P(A) = p. Therefore, the distribution function FX is given by
FX = (1 − p)1[0,1) + 1[1,∞).
Lemma 1.13. Any discrete random variable is a linear combination of indicator function over a partition of the sample
space.
Proof. For a discrete random variable X : Ω → X ⊂ R on a probability space (Ω, F, P), the range X is count-
able, and we can define events Ex ≜ {ω ∈ Ω : X (ω ) = x } ∈ F for each x ∈ X. Then the mutually disjoint
sequence of events ( Ex ∈ F : x ∈ X) partitions the sample space Ω. We can write
X(ω) = ∑x∈X x 1Ex(ω).
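This decomposition can be sketched directly in code. A Python illustration (the random variable X below is my toy example on Ω = [6], not from the notes):

```python
Omega = [1, 2, 3, 4, 5, 6]                       # a die roll
X = {1: 0.0, 2: 0.0, 3: 1.0, 4: 1.0, 5: 2.5, 6: 2.5}   # a toy discrete random variable

values = sorted(set(X.values()))                 # the (countable) range of X
E = {x: {w for w in Omega if X[w] == x} for x in values}   # events E_x = {X = x}

def indicator(A, w):
    return 1.0 if w in A else 0.0

# The events (E_x) partition Omega, and X(omega) = sum over x of x * 1_{E_x}(omega).
assert set().union(*E.values()) == set(Omega)
for w in Omega:
    assert X[w] == sum(x * indicator(E[x], w) for x in values)
```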
Definition 1.14. Any discrete random variable X : Ω → X ⊆ R defined over a probability space (Ω, F, P),
with finite range is called a simple random variable.
Example 1.15 (Simple random variables). Let X be a simple random variable, then X = ∑x∈X x 1AX(x),
where (AX(x) = X−1{x} ∈ F : x ∈ X) is a finite partition of the sample space Ω. Without loss of generality,
we can denote X = {x1, . . . , xn} where x1 ⩽ · · · ⩽ xn. Then
X−1(−∞, x] = ∅ for x < x1, X−1(−∞, x] = ∪j∈[i] AX(xj) for x ∈ [xi, xi+1) and i ∈ [n − 1], and X−1(−∞, x] = Ω for x ⩾ xn.
Then the smallest event space generated by the simple random variable X is {∪x∈S AX(x) : S ⊆ X}.
1.4 Continuous random variables
Definition 1.16. For a continuous random variable X, there exists a density function fX : R → [0, ∞) such
that
FX(x) = ∫_{−∞}^{x} fX(u) du.
Example 1.17 (Gaussian random variable). For a probability space (Ω, F, P), a Gaussian random variable
is a continuous random variable X : Ω → R defined by its density function
fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), x ∈ R,
with parameters µ ∈ R and σ > 0.
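That fX is a valid density can be checked numerically. A Python sketch (the parameter values are my choices) approximates ∫ fX(u) du by a Riemann sum:

```python
import math

mu, sigma = 1.0, 2.0        # assumed mean and standard deviation

def f(x):
    """Gaussian density with parameters mu, sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Riemann-sum approximation of the integral of f over [-40, 40];
# the tails outside this interval are negligible for these parameters.
h = 0.001
total = h * sum(f(-40 + k * h) for k in range(int(80 / h)))
print(abs(total - 1.0) < 1e-6)    # the density integrates to one
```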
Lecture-05: Random Vectors
1 Random vectors
Definition 1.1 (Projection). For a vector x ∈ Rn and i ∈ [n], the projection πi : Rn → R onto the i-th
component is defined by πi(x) = xi.
Definition 1.2. The Borel sigma algebra over space Rn is defined as the smallest sigma algebra generated
by the family (πi−1 ( Bx ) : x ∈ R, i ∈ [n]) and is denoted by B(Rn ). The elements of the Borel sigma algebra
are called Borel sets.
Remark 1. By definition of B(Rn ), projection πi : Rn → R is a Borel measurable function for all i ∈ [n]. For
a subset A ⊆ R and projection πi : Rn → R, we can write
πi−1 ( A) = { x ∈ Rn : xi ∈ A} = R × · · · × A × · · · × R.
Thus, for any A ∈ B(R), we have πi−1 ( A) ∈ B(Rn ).
Definition 1.3 (Random vectors). Consider a probability space (Ω, F, P) and a finite n ∈ N. A random
vector X : Ω → Rn is an F-measurable mapping from the sample space to an n-length real-valued vector.
That is, for any x ∈ Rn , we have
A X ( x ) ≜ {ω ∈ Ω : X1 (ω ) ⩽ x1 , . . . , Xn (ω ) ⩽ xn } = ∩in=1 Xi−1 (−∞, xi ] ∈ F.
Example 1.4 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F. We define a mapping X : Ω → {0, 1}n by Xi (ω ) ≜ 1 Ai (ω ) for all outcomes ω ∈ Ω. Let
x ∈ Rn, then we can write AX(x) = ∩i∈[n] 1Ai−1(−∞, xi]. Recall that
1Ai−1(−∞, xi] = Ω for xi ⩾ 1, 1Ai−1(−∞, xi] = Aic for xi ∈ [0, 1), and 1Ai−1(−∞, xi] = ∅ for xi < 0.
It follows that the inverse image A X ( x ) lies in F, and hence X is an F-measurable random vector.
Theorem 1.5. Consider a probability space (Ω, F, P), and a finite n ∈ N. A mapping X : Ω → Rn is a random
vector if and only if Xi ≜ πi ◦ X : Ω → R are random variables for all i ∈ [n].
Proof. We will first show that X : Ω → Rn implies that πi ◦ X is a random variable for any i ∈ [n]. For any
i ∈ [n] and xi ∈ R, we take x = (∞, . . . , xi , . . . , ∞). This implies that πi−1 (−∞, xi ] = R × · · · × (−∞, xi ] × · · · ×
R ∈ B(Rn ). Further, defining A Xi ( xi ) ≜ Xi−1 (−∞, xi ], we observe from the definition of random vectors that
A X ( x ) = ∩nj=1 X j−1 (−∞, x j ] = Xi−1 (−∞, xi ] = A Xi ( xi ) ∈ F. (1)
We will next show that if Xi : Ω → R is a random variable for all i ∈ [n], then X ≜ ( X1 , . . . , Xn ) : Ω → Rn
is a random vector. For any x ∈ Rn , we have A Xi ( xi ) = Xi−1 (−∞, xi ] ∈ F for all i ∈ [n], from the definition
of random variables. From the closure of event set under countable intersections, we have
A X ( x ) = ∩in=1 A Xi ( xi ) ∈ F. (2)
1.1 Distribution of random vectors
Definition 1.6. Consider a probability space (Ω, F, P) and a finite n ∈ N. The joint distribution function
of a random vector X : Ω → Rn is defined as the mapping FX : Rn → [0, 1] such that
FX ( x ) ≜ P( A X ( x )) = P(∩in=1 A Xi ( xi )).
Example 1.7 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F, that define the random vector X ≜ (1 A1 , . . . , 1 An ). For any x ∈ Rn , we can define index
sets I0 ( x ) ≜ {i ∈ [n] : xi < 0} and I1 ( x ) ≜ {i ∈ [n] : xi ∈ [0, 1)}, and write the joint distribution function for
this random vector X as
FX(x) = 1, if I0(x) ∪ I1(x) = ∅; FX(x) = P(∩i∈I1(x) Aic), if I0(x) = ∅ and I1(x) ̸= ∅; and FX(x) = 0, if I0(x) ̸= ∅.
Definition 1.8. For a random vector X : Ω → Rn defined on the probability space (Ω, F, P) and i ∈ [n], the
distribution of the ith random variable Xi ≜ πi ◦ X : Ω → R is called the ith marginal distribution, and
denoted by FXi : R → [0, 1].
Example 1.9 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1 , . . . , An ∈ F, that define the random vector X ≜ (1 A1 , . . . , 1 An ). The ith marginal distribution is given
by
FXi ( x ) = (1 − P( Ai ))1[0,1) ( x ) + 1[1,∞) ( x ).
Corollary 1.10 (Marginal distribution). Consider a random vector X : Ω → Rn defined on a probability space
(Ω, F, P) with the joint distribution FX : Rn → [0, 1]. The ith marginal distribution can be obtained from the
joint distribution of X as
FXi(xi) = limxj→∞, for all j̸=i FX(x).
Proof. For any i ∈ [n] and xi ∈ R, we have Xi−1 (−∞, xi ] = A X ( x ) for x = (∞, . . . , xi , . . . , ∞) from (1).
Lemma 1.11 (Properties of the joint distribution function). Consider a random vector X : Ω → Rn defined
on the probability space (Ω, F, P). The associated joint distribution function FX : Rn → [0, 1] satisfies the following
properties.
(i) For x, y ∈ Rn such that xi ⩽ yi for each i ∈ [n], we have FX ( x ) ⩽ FX (y).
(ii) The function FX ( x ) is right continuous at all points x ∈ Rn .
(iii) The lower limit is limxi →−∞ FX ( x ) = 0, and the upper limit is limxi →∞,i∈[n] FX ( x ) = 1.
Proof. Consider a random vector X : Ω → Rn defined on the probability space (Ω, F, P) and any x ∈ Rn .
(i) We can verify that AX(x) = ∩i∈[n] AXi(xi) ⊆ ∩i∈[n] AXi(yi) = AX(y). The result follows from the mono-
tonicity of the probability measure.
(ii) The proof is similar to the proof for a single random variable.
(iii) The event AX(x) = ∅ when xi = −∞ for some i ∈ [n], and AX(x) = Ω when xi = ∞ for all i ∈ [n]; hence
the results follow.
Example 1.12 (Probability of rectangular events). Consider a probability space (Ω, F, P) and a random
vector X : Ω → R2. Consider the points x ⩽ y ∈ R2 and the events B1 ≜ X1−1(x1, y1] ∈ F and B2 ≜ X2−1(x2, y2] ∈ F.
Writing x = (x1, x2) and y = (y1, y2), we observe that the corner points of the rectangle (x1, y1] × (x2, y2] are
x, (y1, x2), y, (x1, y2). Therefore, we can write this event as
B1 ∩ B2 = (AX(y) \ AX(x1, y2)) \ (AX(y1, x2) \ AX(x)) ∈ F.
Example 1.15 (Tuple of indicators). Consider a probability space (Ω, F, P), a finite n ∈ N, and events
A1, . . . , An ∈ F, that define the random vector X ≜ (1A1, . . . , 1An). Then σ(X) = σ({Ai : i ∈ [n]}).
Recall that the events A1, . . . , An ∈ F are independent if for any finite set F ⊆ [n], we have
P(∩i∈F Ai) = ∏i∈F P(Ai).
Definition 1.17 (Independent and identically distributed). A random vector X : Ω → Rn defined on the
probability space (Ω, F, P) is called independent if
FX(x) = ∏i∈[n] FXi(xi), for all x ∈ Rn.
The random vector X is called identically distributed if each of its components have the identical marginal
distribution, i.e.
FXi = FX1 , for all i ∈ [n].
Remark 2. Independence of a random vector implies that events ( A Xi ( xi ) : i ∈ [n]) are independent for any
x ∈ Rn . Defining families Ai ≜ ( A Xi ( x ) : x ∈ R) for all i ∈ [n], we observe that the families (A1 , . . . , An ) are
mutually independent.
Remark 3. In general, if two collections of events are mutually independent, then the event spaces generated by them are independent. This can be proved using Dynkin's π-λ theorem.
Theorem 1.18. For an independent random vector X : Ω → Rn defined on a probability space (Ω, F, P), the event
spaces generated by its components (σ ( Xi ) : i ∈ [n]) are independent.
Proof. For each i ∈ [n], we define a family of events A_i ≜ (X_i^{-1}(−∞, x] : x ∈ R). From the definition of independence of random vectors, the families (A_i ⊆ F : i ∈ [n]) are mutually independent. Since σ(A_i) = σ(X_i), the result follows from the previous remark.
Definition 1.19 (Independent random vectors). Two random vectors X : Ω → R^n and Y : Ω → R^m defined
on the same probability space (Ω, F, P) are independent, if the collection of events ( A X ( x ) : x ∈ Rn ) and
(A_Y(y) : y ∈ R^m) are independent, where A_X(x) ≜ ∩_{i=1}^n X_i^{-1}(−∞, x_i] and A_Y(y) ≜ ∩_{i=1}^m Y_i^{-1}(−∞, y_i].
Example 1.20 (Independent random vectors). Consider a set of vectors X = {(0, 0, 1), (1, 0, 0)} ⊆ R^3. Consider two independent coin tosses, such that Ω = {H, T}^2, F = P(Ω), and P(ω) = p^{k_2(ω)} (1 − p)^{2−k_2(ω)}, where k_2(ω) = ∑_{i=1}^2 1_{{ω_i = H}}. We define random vectors
We can verify that X, Y : Ω → R3 are mutually independent random vectors, though we can also check that
X1 , X3 are dependent random variables and so are Y1 , Y3 .
Example 1.22 (Multiple coin tosses). Consider a probability space (Ω, F, P) with Ω = {H, T}^n, F = 2^Ω, and P(ω) = 1/2^n for all ω ∈ Ω.
Consider the random vector X : Ω → R^n such that X_i(ω) = 1_{{ω_i = H}} for each i ∈ [n]. Observe that X is a bijection from the sample space to the set {0, 1}^n. In particular, X is a discrete random vector.
For any x ∈ [0, 1]^n, we can write N(x) = ∑_{i=1}^n 1_{[0,1)}(x_i). Further, we can write the joint distribution as
F_X(x) =
 1, x_i ⩾ 1 for all i ∈ [n],
 1/2^{N(x)}, x_i ∈ [0, 1] for all i ∈ [n],
 0, x_i < 0 for some i ∈ [n].
Therefore, it follows that X is an i.i.d. random vector.
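The piecewise formula above can be checked against brute-force enumeration of the 2^n equally likely outcomes; the sketch below assumes a small n so that enumeration is feasible:

```python
from itertools import product

def joint_cdf(x, n):
    """F_X(x) = P(X_i <= x_i for all i) by enumerating the 2^n equally likely outcomes."""
    count = sum(1 for omega in product([0, 1], repeat=n)
                if all(omega[i] <= x[i] for i in range(n)))
    return count / 2**n

def closed_form(x, n):
    """The piecewise formula: 0 if some x_i < 0, else 2^{-N(x)}."""
    if any(xi < 0 for xi in x):
        return 0.0
    N = sum(1 for xi in x if 0 <= xi < 1)  # N(x) = number of coordinates in [0, 1)
    return 2.0 ** (-N)

x = (0.5, 1.2, 0.0)
print(joint_cdf(x, 3), closed_form(x, 3))  # both equal 0.25
```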
Remark 5. For an independent continuous random vector X : Ω → R^n, we have f_X(x) = ∏_{i=1}^n f_{X_i}(x_i) for all x ∈ R^n.
Example 1.24 (Gaussian random vectors). For a probability space (Ω, F, P), a Gaussian random vector is a continuous random vector X : Ω → R^n defined by its density function
f_X(x) = (1/√((2π)^n det(Σ))) exp(−(1/2)(x − µ)^T Σ^{-1}(x − µ)) for all x ∈ R^n,
where the mean vector µ ∈ R^n and the covariance matrix Σ ∈ R^{n×n} is positive definite. The components of the Gaussian random vector are Gaussian random variables, which are independent when Σ is a diagonal matrix and are identically distributed when Σ = σ^2 I.
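For a diagonal Σ the joint Gaussian density factorizes into the product of the one-dimensional marginals. A minimal numerical sketch (the means and variances below are arbitrary illustrative values):

```python
import math

def gaussian_density(x, mu, var):
    """Gaussian density on R^n with mean mu and diagonal covariance diag(var)."""
    n = len(x)
    quad = sum((x[i] - mu[i]) ** 2 / var[i] for i in range(n))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** n * math.prod(var))

def marginal(t, m, v):
    """One-dimensional Gaussian density with mean m and variance v."""
    return math.exp(-0.5 * (t - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

x, mu, var = (0.3, -1.0), (0.0, 0.5), (2.0, 1.5)   # arbitrary illustrative values
lhs = gaussian_density(x, mu, var)
rhs = marginal(x[0], mu[0], var[0]) * marginal(x[1], mu[1], var[1])
print(abs(lhs - rhs) < 1e-12)  # True: the joint density factorizes
```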
Lecture-06: Transformation of random vectors
Here, g^{-1}(y) is the functional inverse, and not the inverse image we have typically been using. We can think of g^{-1}(y) as g^{-1}{y}, though this inverse image has at most a single element since g is monotonically increasing.
Example 1.4. Consider a positive random variable X : Ω → R+ defined on a probability space (Ω, F, P). Let
g : R_+ → R_+ be such that g(x) = e^{−θx} for all x ∈ R_+ and some θ > 0. Then, g is monotonically decreasing in x and x = g^{-1}(y) = −(1/θ) ln y. This implies that g^{-1}(−∞, y] = [−(1/θ) ln y, ∞) ∈ B(R_+) for all y ∈ R_+. Thus g
is a measurable function, and Y = g( X ) is a random variable.
Proof. For any u, v ∈ R, we can define inverse images A_g(u) ≜ g^{-1}(−∞, u] and A_h(v) ≜ h^{-1}(−∞, v]. Since g, h
are Borel measurable, we have A g (u), Ah (v) ∈ B(R). We can write the following outcome set equality for
the joint event
{g(X) ⩽ u} ∩ {h(Y) ⩽ v} = {X ∈ g^{-1}(−∞, u]} ∩ {Y ∈ h^{-1}(−∞, v]} = X^{-1}(A_g(u)) ∩ Y^{-1}(A_h(v)) ∈ F.
Since X and Y are independent random variables, it follows that X −1 ( A g (u)) and Y −1 ( Ah (v)) are indepen-
dent events, and the result follows.
2 Function of random vectors
Proposition 2.1. Consider a random vector X : Ω → R^n defined on the probability space (Ω, F, P), and a Borel measurable function g : R^n → R^m such that A_g(y) ≜ ∩_{j=1}^m {x ∈ R^n : g_j(x) ⩽ y_j} ∈ B(R^n) for all y ∈ R^m. Then, g(X) : Ω → R^m is a random vector. The joint distribution function F_Y : R^m → [0, 1] for the vector Y ≜ g(X) is given by
F_Y(y) = P(X^{-1}(A_g(y))), for all y ∈ R^m.
Example 2.2 (Sum of random variables). Consider a random vector X : Ω → R^n defined on a probability space (Ω, F, P). Define an addition function + : R^n → R such that +(x) = ∑_{i=1}^n x_i for any x ∈ R^n. We can verify that + is a Borel measurable function, and hence Y = +(X) = ∑_{i=1}^n X_i is a random variable. When n = 2 and
X is a continuous random vector with density f X : R2 → R+ , we can write
F_Y(y) = P({Y ⩽ y}) = P({X_1 + X_2 ⩽ y}) = ∫_{x_1 ∈ R} ∫_{x_2 ⩽ y − x_1} f_X(x_1, x_2) dx_2 dx_1.
By applying a change of variable ( x1 , t) = ( x1 , x1 + x2 ) and changing the order of integration, we see that
F_Y(y) = ∫_{t ⩽ y} dt ∫_{x_1 ∈ R} dx_1 f_X(x_1, t − x_1).
f_Y(y) = dF_Y(y)/dy = ∫_{x ∈ R} f_X(x, y − x) dx.
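The density of a sum can be checked numerically against this integral. A minimal sketch, assuming (an illustrative choice) two independent Uniform(0, 1) variables, which approximates the integral by a Riemann sum and recovers the triangular density of their sum:

```python
def f_uniform(t):
    """Density of Uniform(0, 1)."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def f_sum(y, steps=100_000):
    """f_Y(y) = integral of f_{X_1}(x) f_{X_2}(y - x) dx, by a Riemann sum."""
    h = 1.0 / steps
    return h * sum(f_uniform(i * h) * f_uniform(y - i * h) for i in range(steps))

# The sum of two independent Uniform(0, 1) variables has the triangular density
print(round(f_sum(0.5), 3), round(f_sum(1.0), 3))  # 0.5 1.0
```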
Theorem 2.3. Consider a continuous random vector X : Ω → R^m defined on the probability space (Ω, F, P) with density f_X : R^m → R_+ and an injective and smooth Borel measurable function g : R^m → R^m, such that Y = g(X) is a continuous random vector. Then the density of the random vector Y is given by
f_Y(y) = f_X(x)/|J(y)|,
where x = g^{-1}(y) and J(y) = (J_{ij}(y) ≜ ∂y_j/∂x_i : i, j ∈ [m]) is the Jacobian matrix.
Proof. For an injective map g : R^m → R^m, we have {x} = g^{-1}{y} for any y ∈ g(R^m). Further, since g is smooth, we have dy = J(y) dx + o(|dx|), and thus |dy| = |J(y)| |dx|.
Defining the set dB(y) ≜ {w ∈ R^m : y_j ⩽ w_j ⩽ y_j + dy_j}, we observe that for any continuous random vector Y : Ω → R^m, we have
P ◦ Y^{-1}(dB(y)) = f_Y(y) |dy| = P ◦ X^{-1}(dB(x)) = f_X(x) |dx|. (2)
Example 2.4 (Sum of random variables). Suppose that X : Ω → R2 is a continuous random vector and
Y1 = X1 + X2 . Let us compute f Y1 (y1 ) using the above theorem. Let us define a random vector Y : Ω → R2
such that Y = ( X1 + X2 , X2 ) so that | J (y)| = 1. This implies, f Y (y) = f X ( x ). Thus, we may compute the
marginal density of Y1 as,
f_{Y_1}(y_1) = ∫_{−∞}^{∞} f_X(x) 1_{{x_2 = y_2, x_1 + x_2 = y_1}} dy_2 = ∫_{−∞}^{∞} f_X(y_1 − y_2, y_2) dy_2.
Lecture-07: Random Processes
1 Introduction
Remark 1. For an arbitrary index set T, and a real-valued function x ∈ RT , the projection operator πt : RT →
R maps x ∈ RT to πt ( x ) = xt .
Definition 1.1 (Random process). Let (Ω, F, P) be a probability space. For an arbitrary index set T and
state space X ⊆ R, a map X : Ω → XT is called a random process if the projections Xt : Ω → X defined by
ω 7→ Xt (ω ) ≜ (πt ◦ X )(ω ) are random variables on the given probability space.
Definition 1.2. For each outcome ω ∈ Ω, we have a function X(ω) : T → X called the sample path or the
sample function of the process X.
Remark 2. A random process X defined on probability space (Ω, F, P) with index set T and state space
X ⊆ R, can be thought of as
(a) a map X : Ω × T → X,
(b) a map X : T → XΩ , i.e. a collection of random variables Xt : Ω → X for each time t ∈ T,
(c) a map X : Ω → XT , i.e. a collection of sample functions X (ω ) : T → X for each random outcome ω ∈ Ω.
1.1 Classification
State space X can be countable or uncountable, corresponding to discrete or continuous valued process. If
the index set T ⊆ R is countable, the stochastic process is called discrete-time stochastic process or random
sequence. When the index set T is uncountable, it is called continuous-time stochastic process. The index
set T doesn’t have to be time, if the index set is space, and then the stochastic process is spatial process.
When T = Rn × [0, ∞), stochastic process X is a spatio-temporal process.
iii Discrete random process: counting processes, population sampled at birth-death instants, number of
people in queues.
iv Continuous random process: water level in a dam, waiting time till service in a queue, location of a
mobile node in a network.
1.2 Measurability
For random process X : Ω → XT defined on the probability space (Ω, F, P), the projections Xt ≜ πt ◦ X are
F-measurable random variables. Therefore, the set of outcomes A Xt ( x ) ≜ Xt−1 (−∞, x ] ∈ F for all t ∈ T and
x ∈ R.
Definition 1.4. A random map X : Ω → XT is called F-measurable and hence a random process, if the set
of outcomes A Xt ( x ) = Xt−1 (−∞, x ] ∈ F for all t ∈ T and x ∈ R.
Definition 1.5. The event space generated by a random process X : Ω → XT defined on a probability space
(Ω, F, P) is given by
σ ( X ) ≜ σ ( A Xt ( x ) : t ∈ T, x ∈ R).
Definition 1.6. For a random process X : Ω → XT defined on the probability space (Ω, F, P), we define the
projection of X onto components S ⊆ T as the random vector XS : Ω → XS , where XS ≜ ( Xs : s ∈ S).
Remark 3. Recall that π_t^{-1}(−∞, x] = ×_{s∈T} (−∞, x_s], where x_s = x for s = t and x_s = ∞ for all s ≠ t. The F-measurability of the process X implies that for any countable set S ⊆ T, we have A_{X_S}(x_S) ≜ ∩_{s∈S} A_{X_s}(x_s) ∈ F for x_S ∈ X^S.
Remark 4. We can define A X ( x ) ≜ ∩t∈T A Xt ( xt ) for any x ∈ RT . However, A X ( x ) is guaranteed to be an
event only when S ≜ {t ∈ T : πt ( x ) < ∞} is a countable set. In this case,
A X ( x ) = ∩t∈T A Xt ( xt ) = ∩s∈S A Xs ( xs ) = A XS ( xS ) ∈ F.
σ( X ) = σ( E) = σ ({ En : n ∈ N}).
1.3 Distribution
Definition 1.8. For a random process X : Ω → XT defined on the probability space (Ω, F, P), we define a
finite dimensional distribution FXS : RS → [0, 1] for a finite S ⊆ T by
F_{X_S}(x_S) ≜ P(A_{X_S}(x_S)), for x_S ∈ R^S.
Example 1.9. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N , the event
space F ≜ σ ( E) where En = {ω ∈ Ω : ωn = H } for n ∈ N, and the probability measure P : F → [0, 1] defined
by
P(∩_{i∈F} E_i) = p^{|F|}, for all finite F ⊆ N.
Let X : Ω → {0, 1}^N be defined as X_n(ω) = 1_{E_n}(ω) for all outcomes ω ∈ Ω and n ∈ N. For this random
sequence, we can obtain the finite dimensional distribution FXS : RS → [0, 1] for any finite S ⊆ T and x ∈ RS
in terms of I0 ( x ) ≜ {i ∈ S : xi < 0} and I1 ( x ) ≜ {i ∈ S : xi ∈ [0, 1)}, as
F_{X_S}(x) =
 1, I_0(x) ∪ I_1(x) = ∅,
 (1 − p)^{|I_1(x)|}, I_0(x) = ∅, I_1(x) ≠ ∅, (1)
 0, I_0(x) ≠ ∅.
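Eq. (1) can be verified by brute force for a small index set. The sketch below (the values of p and the evaluation point are illustrative) compares the closed form against enumeration of the Bernoulli outcomes:

```python
from itertools import product

p = 0.3                       # illustrative success probability
x = (0.5, 1.0, 0.2)           # evaluation point for a finite index set of size 3

# Brute force: P(X_i <= x_i for all i) when the bits are i.i.d. Bernoulli(p)
brute = 0.0
for bits in product([0, 1], repeat=len(x)):
    prob = 1.0
    for b in bits:
        prob *= p if b == 1 else 1 - p
    if all(b <= xi for b, xi in zip(bits, x)):
        brute += prob

# Closed form from Eq. (1): 0 if I0 nonempty, else (1 - p)^{|I1|}
I0 = [i for i, xi in enumerate(x) if xi < 0]
I1 = [i for i, xi in enumerate(x) if 0 <= xi < 1]
closed = 0.0 if I0 else (1 - p) ** len(I1)
print(abs(brute - closed) < 1e-12)  # True
```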
To define a measure on a random process, we can either put a measure on subsets of sample paths
( X (ω ) ∈ RT : ω ∈ Ω), or equip the collection of random variables ( Xt ∈ RΩ : t ∈ T ) with a joint measure.
Either way, we are interested in identifying the joint distribution F : RT → [0, 1]. To this end, for any x ∈ RT ,
we need to know
F_X(x) ≜ P(∩_{t∈T} {ω ∈ Ω : X_t(ω) ⩽ x_t}) = P(∩_{t∈T} X_t^{-1}(−∞, x_t]) = P(A_X(x)).
First of all, we don’t know whether A X ( x ) is an event when T is uncountable. Though, we can verify
that A X ( x ) ∈ F for x ∈ RT such that {t ∈ T : xt < ∞} is countable. Second, even for a simple independent
process with countably infinite T, any function of the above form would be zero if xt is finite for all t ∈ T.
Therefore, we only look at the values of FX ( x ) for x ∈ RT where {t ∈ T : xt < ∞} is finite. That is, for any
finite set S ⊆ T, we focus on the events AS ( xS ) and their probabilities. However, these are precisely the finite
dimensional distributions. Set of all finite dimensional distributions of the stochastic process X : Ω → XT
characterizes its distribution completely.
Example 1.10. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N and the
event space F ≜ σ(E) where E_n = {ω ∈ Ω : ω_n = H} for all n ∈ N. Let X : Ω → {0, 1}^N be defined as X_n(ω) = 1_{E_n}(ω) for all outcomes ω ∈ Ω and n ∈ N. For this random sequence, suppose we are given the finite dimensional distributions F_{X_S} : R^S → [0, 1] for any finite S ⊆ T and x ∈ R^S in terms of the sets I_0(x) ≜ {i ∈ S : x_i < 0} and I_1(x) ≜ {i ∈ S : x_i ∈ [0, 1)}, as defined in Eq. (1). Then, we can find that the probability measure P : F → [0, 1] is given by
P(∩_{i∈F} E_i) = p^{|F|}, for all finite F ⊆ N.
1.4 Independence
Definition 1.11. A random process is independent if the collection of event spaces (σ(X_t) : t ∈ T) is independent. That is, for any finite S ⊆ T and all x_S ∈ R^S, we have
F_{X_S}(x_S) = ∏_{s∈S} F_{X_s}(x_s).
That is, independence of a random process is equivalent to factorization of any finite dimensional distribu-
tion function into product of individual marginal distribution functions.
Example 1.12. Consider a probability space (Ω, F, P) defined by the sample space Ω = { H, T }N , the event
space F ≜ σ ( E) where En = {ω ∈ Ω : ωn = H } for all n ∈ N, and the probability measure P : F → [0, 1]
defined by
P(∩_{i∈F} E_i) = p^{|F|}, for all finite F ⊆ N.
Then, we observe that the random sequence X : Ω → {0, 1}N defined by Xn (ω ) ≜ 1En (ω ) for all outcomes
ω ∈ Ω and n ∈ N, is independent.
Definition 1.13. Two stochastic processes X : Ω → X^{T_1}, Y : Ω → Y^{T_2} are independent if the corresponding event spaces σ(X), σ(Y) are independent. That is, for any x ∈ R^{S_1}, y ∈ R^{S_2} for finite S_1 ⊆ T_1, S_2 ⊆ T_2, the events A_{S_1}(x) ≜ ∩_{s∈S_1} X_s^{-1}(−∞, x_s] and B_{S_2}(y) ≜ ∩_{s∈S_2} Y_s^{-1}(−∞, y_s] are independent. That is, the joint finite dimensional distribution of X and Y factorizes.
Lecture-08: Expectation
1 Expectation
Example 1.1. Consider a probability space (Ω, F, P). We consider N trials of a random experiment, and
define a random vector X : Ω → X N such that Xi ≜ πi ◦ X : Ω → X is a discrete random variable associated
with the trial i ∈ [ N ]. If the marginal distributions of random variables ( X1 , . . . , X N ) are identical with the
common probability mass function PX1 : X → [0, 1], then the empirical mean of random variable X1 can be
written as
m̂ ≜ (1/N) ∑_{i=1}^N X_i.
For a random variable X1 : Ω → X, we can define events EX1 ( x ) ≜ X1−1 { x } for each value x ∈ X. The
probability mass function PX1 : X → [0, 1] for the discrete random variable X1 can be estimated for each
x ∈ X as the empirical probability mass function
P̂_{X_1}(x) ≜ (1/N) ∑_{i=1}^N 1_{E_{X_i}(x)}.
Recall that a simple random variable X_1 can be written as X_1 = ∑_{x∈X} x 1_{E_{X_1}(x)}, where E_{X_1} ≜ (E_{X_1}(x) ∈ F : x ∈ X) is a finite partition of the sample space Ω and P_{X_1}(x) = P(E_{X_1}(x)). That is, we can write the empirical mean in terms of the empirical PMF as
m̂ = (1/N) ∑_{i=1}^N ∑_{x∈X} x 1_{E_{X_i}(x)}(ω) = ∑_{x∈X} x P̂_{X_1}(x).
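The agreement between the empirical mean computed from samples and the empirical mean computed through the empirical PMF can be illustrated by simulation; the three-point distribution below is an arbitrary illustrative choice:

```python
import random

random.seed(1)
support, pmf = [0, 1, 2], {0: 0.2, 1: 0.5, 2: 0.3}   # illustrative distribution

N = 100_000
samples = random.choices(support, weights=[pmf[v] for v in support], k=N)

# Empirical PMF and empirical mean, as in the example above
pmf_hat = {v: samples.count(v) / N for v in support}
m_hat = sum(v * pmf_hat[v] for v in support)          # equals sum(samples) / N

true_mean = sum(v * pmf[v] for v in support)          # E[X_1] = 1.1
print(abs(m_hat - true_mean) < 0.02)  # True with overwhelming probability
```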
This example motivates the following definition of mean for simple random variables.
Definition 1.2 (Expectation of simple random variable). The mean or expectation of a simple random
variable X : Ω → X ⊆ R defined on a probability space (Ω, F, P), is denoted by E[ X ] and defined as
E[X] ≜ ∑_{x∈X} x P_X(x).
Remark 1. For an indicator random variable 1 A , we have E1 A = P( A). That is, the expectation of an indi-
cator function is the probability of the indicated set.
Remark 2. Since a simple random variable can be written as X = ∑_{x∈X} x 1_{E_X(x)}, where E_X(x) ≜ X^{-1}{x} for all x ∈ X, we can write the expectation of a simple random variable as an integral
E[X] = ∫_Ω X(ω) P(dω) = ∫_Ω ∑_{x∈X} x 1_{E_X(x)}(ω) P(dω) = ∑_{x∈X} x ∫_Ω 1_{E_X(x)}(ω) P(dω) = ∑_{x∈X} x E[1_{E_X(x)}] = ∑_{x∈X} x P_X(x).
Theorem 1.3. Consider a non-negative random variable X : Ω → R_+ defined on a probability space (Ω, F, P). There exists a sequence of non-decreasing non-negative simple random variables Y : Ω → R_+^N such that lim_n Y_n(ω) = X(ω) for all ω ∈ Ω.
Then E[Yn ] is defined for each n ∈ N, the sequence (E[Yn ] ∈ R+ : n ∈ N) is non-decreasing, and the limit limn E[Yn ] ∈
R+ ∪ {∞} exists. This limit is independent of the choice of the sequence and depends only on the probability space.
Proof. For each n ∈ N and k ∈ {0, . . . , 2^{2n} − 1}, we define half-open sets B_{n,k} ≜ (k2^{−n}, (k + 1)2^{−n}]. Then, the collection of sets B_n ≜ (B_{n,k} : k ∈ {0, . . . , 2^{2n} − 1}) partitions the set (0, 2^n] for each n ∈ N. Further, we observe that ∪_{n∈N} (0, 2^n] = R_+ and that B_{n+1,2k} ∪ B_{n+1,2k+1} = B_{n,k} for all n ∈ N and k.
For a non-negative random variable X : Ω → R_+, we define events A^X_{n,k} ≜ X^{-1}(B_{n,k}) ∈ F, and a sequence of simple non-negative random variables Y : Ω → R_+^N in the following fashion:
Y_n(ω) ≜ ∑_{k=0}^{2^{2n}−1} 1_{A^X_{n,k}}(ω) inf_{ω' ∈ A^X_{n,k}} X(ω') = ∑_{k=0}^{2^{2n}−1} 1_{A^X_{n,k}}(ω) inf_{X(ω') ∈ B_{n,k}} X(ω') = ∑_{k=0}^{2^{2n}−1} k2^{−n} 1_{A^X_{n,k}}(ω).
We observe that Y_n is a quantized version of X, and its value is the left end-point k2^{−n} when X ∈ B_{n,k} for each k ∈ {0, . . . , 2^{2n} − 1}. Since ∪_{k=0}^{2^{2n}−1} A^X_{n,k} = X^{-1}(0, 2^n], it follows that we cover the positive real line as n grows larger and the step size grows smaller. Thus, the limiting random variable can take all possible non-negative real values. We observe that
Y_{n+1}(ω) = ∑_{k=0}^{2^{2(n+1)}−1} 1_{A^X_{n+1,k}}(ω) inf_{X(ω') ∈ B_{n+1,k}} X(ω')
⩾ ∑_{k=0}^{2^{2n}−1} (1_{A^X_{n+1,2k}}(ω) inf_{X(ω') ∈ B_{n+1,2k}} X(ω') + 1_{A^X_{n+1,2k+1}}(ω) inf_{X(ω') ∈ B_{n+1,2k+1}} X(ω'))
⩾ ∑_{k=0}^{2^{2n}−1} 1_{A^X_{n,k}}(ω) inf_{X(ω') ∈ B_{n,k}} X(ω') = Y_n(ω).
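The dyadic construction can be sketched pointwise: at stage n, the approximation of x is the left end-point of the half-open dyadic cell containing x, truncated at 2^n. A minimal sketch of the construction above, checking monotonicity in n and convergence:

```python
import math

def Y_n(x, n):
    """Stage-n approximation: value k/2^n on the dyadic cell (k/2^n, (k+1)/2^n],
    for k = 0, ..., 2^(2n) - 1, and zero outside (0, 2^n]."""
    if x <= 0 or x > 2 ** n:
        return 0.0
    k = math.ceil(x * 2 ** n) - 1   # index of the half-open cell containing x
    return k / 2 ** n

for x in (0.3, 1.0, 2.7, 5.5):
    vals = [Y_n(x, n) for n in range(1, 8)]
    assert all(a <= b for a, b in zip(vals, vals[1:]))      # non-decreasing in n
    assert vals[-1] <= x <= vals[-1] + 2 ** -7              # within one step of x
print("monotone convergence verified")
```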
Definition 1.4 (Expectation of a non-negative random variable). For a non-negative random variable X : Ω → R_+ defined on the probability space (Ω, F, P), consider the sequence of non-decreasing simple random variables Y : Ω → R_+^N such that lim_n Y_n = X. The expectation of the non-negative random variable X is defined as
E[X] ≜ lim_n E[Y_n].
Definition 1.5 (Expectation of a real random variable). For a real-valued random variable X defined on a probability space (Ω, F, P), we can define the functions X_+ ≜ max(X, 0) and X_− ≜ max(−X, 0).
We can verify that X+ , X− are non-negative random variables and hence their expectations are well defined.
We observe that X (ω ) = X+ (ω ) − X− (ω ) for each ω ∈ Ω. If at least one of the E[ X+ ] and E[ X− ] is finite,
then the expectation of the random variable X is defined as
E[ X ] ≜ E[ X+ ] − E[ X− ].
Theorem 1.6 (Expectation as an integral with respect to the distribution function). For a random variable
X : Ω → R defined on the probability space (Ω, F, P), the expectation is given by
E[X] = ∫_R x dF_X(x).
Proof. It suffices to show this for a non-negative random variable X, and the result follows from the defini-
tion of expectation of a non-negative random variable as the limit of expectation of approximating simple
functions.
2 Properties of Expectations
Theorem 2.1 (Properties). Let X : Ω → R be a random variable defined on the probability space (Ω, F, P).
(i) Linearity: Let a, b ∈ R and X, Y be random variables defined on the probability space (Ω, F, P). If EX, EY,
and aEX + bEY are well defined, then E( aX + bY ) is well defined and
E( aX + bY ) = aEX + bEY.
(ii) Monotonicity: If P { X ⩾ Y } = 1 and E[Y ] is well defined with E[Y ] > −∞, then E[ X ] is well defined and
E [ X ] ⩾ E [Y ] .
(iii) Functions of random variables: Let g : R → R be a Borel measurable function, then g(X) is a random variable with E[g(X)] = ∫_{x∈R} g(x) dF_X(x).
(iv) Continuous random variables: Let f_X : R → [0, ∞) be the density function, then EX = ∫_{x∈R} x f_X(x) dx.
(v) Discrete random variables: Let P_X : X → [0, 1] be the probability mass function, then EX = ∑_{x∈X} x P_X(x).
(vi) Integration by parts: The expectation EX = ∫_{x⩾0} (1 − F_X(x)) dx − ∫_{x<0} F_X(x) dx is well defined when at least one of the two parts on the right hand side is finite.
Proof. It suffices to show properties (i ) − (iii ) for simple random variables.
(i) Let X = ∑_{x∈X} x 1_{E_X(x)} and Y = ∑_{y∈Y} y 1_{E_Y(y)} be simple random variables, then (E_X(x) ∩ E_Y(y) ∈ F : (x, y) ∈ X × Y) partitions the sample space Ω. Hence, we can write aX + bY = ∑_{(x,y)∈X×Y} (ax + by) 1_{E_X(x) ∩ E_Y(y)}, and from the linearity of sums it follows that
E[aX + bY] = ∑_{(x,y)∈X×Y} (ax + by) P(E_X(x) ∩ E_Y(y)) = a ∑_{x∈X} x P(E_X(x)) + b ∑_{y∈Y} y P(E_Y(y)) = aEX + bEY.
(ii) From the fact that X − Y ⩾ 0 almost surely and linearity of expectation, it suffices to show that EX ⩾ 0
for non-negative random variable X. It can easily be shown for simple non-negative random variables,
and follows for general non-negative random variables by taking limits.
(iii) It suffices to show this holds true for simple random variables X : Ω → X ⊂ R. Since g : R → R is Borel measurable, Y ≜ g(X) : Ω → Y ≜ g(X) is a random variable. It follows that we can write the disjoint union X = ∪_{y∈Y} ∪_{x∈g^{-1}{y}} {x}. Further, for each y ∈ Y, we have E_Y(y) = ∪_{x∈g^{-1}{y}} E_X(x).
Since the events E_X(x) are disjoint for all x ∈ X, we get P(E_Y(y)) = ∑_{x∈g^{-1}{y}} P(E_X(x)). Using the above two facts, we can write the expectation
E[Y] = ∑_{y∈Y} y P(E_Y(y)) = ∑_{y∈g(X)} ∑_{x∈g^{-1}(y)} g(x) P(E_X(x)) = ∑_{x∈X} g(x) P(E_X(x)).
(iv) For continuous random variables, we have dFX ( x ) = f X ( x )dx for all x ∈ R.
(v) For discrete random variables X : Ω → X, we have dFX ( x ) = PX ( x ) for all x ∈ X and zero otherwise.
(vi) We can write EX = ∫_{x<0} x dF_X(x) − ∫_{x⩾0} x d(1 − F_X)(x). We apply integration by parts to the first term.
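Property (vi) can be checked numerically: for a non-negative random variable the expectation equals the integral of the tail 1 − F_X. The sketch below uses an Exponential(2) distribution as an illustrative example, for which EX = 1/λ:

```python
import math

lam = 2.0
F = lambda t: 1 - math.exp(-lam * t)          # Exponential(lam) distribution function

# Riemann sum of the tail integral  EX = integral over x >= 0 of (1 - F_X(x)) dx
h, upper = 1e-4, 20.0
tail_integral = h * sum(1 - F(i * h) for i in range(int(upper / h)))
print(abs(tail_integral - 1 / lam) < 1e-3)  # True: EX = 1/lam for Exponential(lam)
```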
Lecture-09: Moments and L p spaces
1 Moments
Let X : Ω → R be a random variable defined on the probability space (Ω, F, P) with the distribution function
FX : R → [0, 1].
Example 1.1 (Absolute value function). For the function g = |·| : R → R_+, we can write the inverse image of half-open sets (−∞, x] for any x ∈ R as A_g(x) = g^{-1}(−∞, x]. It follows that A_g(x) = ∅ ∈ B(R) for x < 0 and A_g(x) = [−x, x] ∈ B(R) for x ∈ R_+. Since g^{-1}(−∞, x] ∈ B(R), it follows that |·| : R → R_+ is a Borel measurable function.
Exercise 1.4 (Polynomial function). For any k ∈ N, we define functions gk : R → R such that gk :
x 7→ x k . Show that gk is Borel measurable for all k ∈ N.
Definition 1.5 (Moments). We define the kth moment of the random variable X as mk ≜ Egk ( X ) = EX k .
First moment EX is called the mean of the random variable.
Remark 1. If E | X |k is finite, then mk exists and is finite.
Remark 2. If P{|X| ⩽ 1} = 1, then P{|X|^k ⩽ 1} = 1. Therefore, by the monotonicity of expectations, E|X|^k ⩽ 1, and the moments m_k exist and are finite for all k ∈ N.
2 L p spaces
Definition 2.1. A vector space V over a field F is a set V with (a) a vector addition + : V × V → V that satisfies associativity and commutativity, and has an identity element and an inverse element for each vector v ∈ V, and (b) a scalar multiplication · : F × V → V that satisfies compatibility with field multiplication, distributivity with respect to vector and field addition, and has an identity element.
Remark 3. We can verify that the set of random variables is a vector space over reals, and we denote it
by V. To this end, we can verify the associativity and commutativity of addition of random variables.
For any random variable X, the constant function 0 is the identity random variable, and − X is the additive
inverse for vector additions. We can verify the compatibility of scalar multiplication with random variables,
existence of unit scalar 1, and distributivity of scalar multiplications with respect to field addition and vector
addition.
Definition 2.2. For a vector space V over a field F, a norm on the vector space is a map f : V → R_+ that satisfies
homogeneity: f(av) = |a| f(v) for all a ∈ F and v ∈ V,
sub-additivity: f(v + w) ⩽ f(v) + f(w) for all v, w ∈ V, and
positive definiteness: f(v) = 0 if and only if v = 0.
Definition 2.4. We define a function ∥·∥_p : L_p → R_+ by ∥X∥_p ≜ (E|X|^p)^{1/p} for any X ∈ L_p and real p ⩾ 1.
Remark 4. For p = 1, the map ∥·∥_p is a norm. Therefore, L_1 is a normed vector space consisting of random variables with bounded absolute mean.
Remark 5. We will show that ∥X∥_∞ = sup{|X(ω)| : ω ∈ Ω}, and hence L_∞ is a normed vector space of bounded random variables.
Remark 6. We will also show that ∥·∥_p is a norm for all p ∈ (1, ∞), and hence L_p is a normed vector space of random variables with bounded ∥·∥_p norm. In particular, the L_2 space consists of random variables with bounded second moment.
Remark 7. If E|X|^N is finite for some N ∈ N, then E|X|^k is finite for all k ∈ [N]. This follows from the linearity and monotonicity of expectations, and the fact that |X|^k ⩽ 1 + |X|^N for all k ∈ [N].
This implies that L_N ⊆ L_k for all k ∈ [N]. We will show that L_q ⊆ L_p for any real numbers 1 ⩽ p ⩽ q.
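The inclusion L_q ⊆ L_p is reflected in the monotonicity of p ↦ ∥X∥_p. The sketch below computes the norms exactly for a small discrete random variable (the values and probabilities are an illustrative choice):

```python
# A small discrete random variable (illustrative choice of values and probabilities)
values, probs = [1.0, 2.0, 4.0], [0.5, 0.3, 0.2]

def lp_norm(p):
    """||X||_p = (E|X|^p)^(1/p) for this discrete random variable."""
    return sum(pr * abs(v) ** p for v, pr in zip(values, probs)) ** (1.0 / p)

norms = [lp_norm(p) for p in (1, 1.5, 2, 3, 5)]
assert all(a <= b for a, b in zip(norms, norms[1:]))  # p -> ||X||_p is non-decreasing
print([round(n, 3) for n in norms])
```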
3 Central Moments
Example 3.1 (Shifted polynomial functions). For any k ∈ N, we define functions h_k : R → R such that h_k : x ↦ (x − m_1)^k. Then, h_k(x) = g_k(x − m_1) = (g_k ◦ f)(x), where f : R → R is defined as f(x) = x − m_1 for all x ∈ R. Since g_k and f are measurable, so is h_k.
Definition 3.2 (Central moments). Let X : Ω → R be a random variable defined on the probability space (Ω, F, P) with finite first moment m_1. We define the kth central moment of the random variable X as σ_k ≜ Eh_k(X) = E(X − m_1)^k. The second central moment σ_2 = E(X − m_1)^2 is called the variance of the random variable and is denoted by σ^2.
Lemma 3.3. The first central moment σ_1 = E(X − m_1) = 0, and the variance σ_2 = E(X − m_1)^2 of a random variable X is always non-negative, with equality iff X is almost surely constant. That is, m_2 ⩾ m_1^2 with equality iff X is almost surely constant.
Proof. Recall that h_1, h_2 are Borel measurable functions, and hence h_1(X) = X − m_1 and h_2(X) = (X − m_1)^2 are random variables. From the linearity of expectations, it follows that σ_1 = Eh_1(X) = EX − m_1 = 0. Since (X − m_1)^2 ⩾ 0 almost surely, it follows from the monotonicity of expectation that 0 ⩽ E(X − m_1)^2. From the linearity of expectation and the expansion of (X − m_1)^2, we get σ_2 = EX^2 − 2m_1 EX + m_1^2 = m_2 − m_1^2 ⩾ 0.
Remark 8. If second moment is finite, then the first moment is finite. That is, L2 ⊆ L1 .
4 Inequalities
Theorem 4.1 (Markov’s inequality). Let X : Ω → R be a random variable defined on the probability space
(Ω, F, P). Then, for any monotonically non-decreasing function f : R → R+ , we have
E[ f ( X )]
P { X ⩾ ϵ} ⩽ .
f (ϵ)
Proof. We can verify that any monotonically non-decreasing function f : R → R+ is Borel measurable.
Hence, f ( X ) is a random variable for any random variable X. Therefore,
f ( X ) = f ( X )1{ X ⩾ϵ } + f ( X )1{ X < ϵ } ⩾ f ( ϵ )1{ X ⩾ϵ } .
The result follows from the monotonicity of expectations.
Corollary 4.2 (Markov). Let X be a non-negative random variable, then P{X ⩾ ϵ} ⩽ EX/ϵ for all ϵ > 0.
Corollary 4.3 (Chebyshev). Let X be a random variable with finite mean m_1 and variance σ_2, then
P{|X − m_1| ⩾ ϵ} ⩽ σ_2/ϵ^2, for all ϵ > 0.
Proof. Apply the Markov’s inequality for random variable Y = | X − m1 | ⩾ 0 and increasing function f ( x ) =
x2 for x ⩾ 0.
Corollary 4.4 (Chernoff). Let X be a random variable with finite E[eθX ] for some θ > 0, then
P { X ⩾ ϵ} ⩽ e−θϵ E[eθX ], for all ϵ > 0.
Proof. Apply the Markov’s inequality for random variable X and increasing function f ( x ) = eθx > 0 for
θ > 0.
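The three bounds can be compared against the exact tail of an Exponential(1) random variable, for which E[X] = 1, Var(X) = 1, P(X ⩾ ϵ) = e^{−ϵ}, and E[e^{X/2}] = 2. The sketch below uses illustrative choices of ϵ and θ:

```python
import math

# X ~ Exponential(1): E[X] = 1, Var(X) = 1, P(X >= eps) = e^{-eps}, E[e^{X/2}] = 2
for eps in (2.0, 3.0, 5.0):
    tail = math.exp(-eps)
    markov = 1.0 / eps                        # P(X >= eps) <= E[X] / eps
    chebyshev = 1.0 / (eps - 1.0) ** 2        # via P(|X - 1| >= eps - 1), valid for eps > 1
    chernoff = 2.0 * math.exp(-0.5 * eps)     # theta = 1/2: e^{-theta eps} E[e^{theta X}]
    assert tail <= markov and tail <= chebyshev and tail <= chernoff
print("all three bounds dominate the exact tail")
```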
Definition 4.5 (Convex function). A real-valued function f : R → R is convex if for all x, y ∈ R and θ ∈ [0, 1],
we have
f (θx + (1 − θ )y) ⩽ θ f ( x ) + (1 − θ ) f (y).
Theorem 4.6 (Jensen’s inequality). For any convex function f : R → R and random variable X, we have
f (EX ) ⩽ E f ( X ).
Proof. It suffices to show this for simple random variables X : Ω → X. We show this by induction on the cardinality of the alphabet X. The inequality is trivially true for |X| = 1. We assume that the inductive hypothesis is true for |X| = n.
Let X : Ω → X, where |X| = n + 1. We can denote X = {x_1, . . . , x_{n+1}} with p_i ≜ P{X = x_i} for all i ∈ [n + 1]. We observe that (p_j/(1 − p_1) : j ⩾ 2) is a probability mass function for some random variable Y ∈ Y = {x_2, . . . , x_{n+1}} with cardinality n. Hence, by the inductive hypothesis, we have
f(∑_{i=2}^{n+1} (p_i/(1 − p_1)) x_i) = f(EY) ⩽ E f(Y) = ∑_{i=2}^{n+1} (p_i/(1 − p_1)) f(x_i).
Applying the convexity of f to θ = p_1, x = x_1, y = ∑_{i=2}^{n+1} (p_i/(1 − p_1)) x_i, we get
f(EX) = f(∑_{i=1}^{n+1} p_i x_i) = f(p_1 x_1 + (1 − p_1) ∑_{i=2}^{n+1} (p_i/(1 − p_1)) x_i) ⩽ p_1 f(x_1) + (1 − p_1) f(∑_{i=2}^{n+1} (p_i/(1 − p_1)) x_i).
From the inductive step, it follows that the RHS is upper bounded by E f ( X ), and the result follows.
Theorem 4.7. For any real numbers 1 ⩽ p ⩽ q < ∞, we have Lq ⊆ L p .
Proof. Let 1 ⩽ p ⩽ q < ∞ and consider the convex function g : R_+ → R_+ defined by g(x) ≜ x^{q/p} for all x ∈ R_+. It follows that g(|X|^p) = |X|^q, and hence from Jensen's inequality, we get
∥X∥_p^q = g(E|X|^p) ⩽ E g(|X|^p) = ∥X∥_q^q.
Lecture-10: Correlation
1 Correlation
Let X : Ω → R and Y : Ω → R be random variables defined on the same probability space (Ω, F, P).
Definition 1.2 (Correlation). For two random variables X, Y defined on the same probability space, the cor-
relation between these two random variables is defined as E[ XY ]. If E[ XY ] = E[ X ]E[Y ], then the random
variables X, Y are called uncorrelated.
Lemma 1.3. If X, Y are independent random variables, then they are uncorrelated.
Proof. It suffices to show this for simple independent random variables X, Y. We can write X = ∑_{x∈X} x 1_{A_X(x)} and Y = ∑_{y∈Y} y 1_{A_Y(y)}. Therefore,
E[XY] = ∑_{(x,y)∈X×Y} xy P(A_X(x) ∩ A_Y(y)) = ∑_{(x,y)∈X×Y} xy P(A_X(x)) P(A_Y(y)) = E[X]E[Y].
Proof. If X, Y are independent random variables, then the joint distribution FX,Y ( x, y) = FX ( x ) FY (y) for all
( x, y) ∈ R2 . Therefore,
E[XY] = ∫_{(x,y)∈R^2} xy dF_{X,Y}(x, y) = ∫_{x∈R} x dF_X(x) ∫_{y∈R} y dF_Y(y) = E[X]E[Y].
Example 1.4 (Uncorrelated dependent random variables). Let X : Ω → R be a continuous random variable
with even density function f X : R → R+ , and g : R → R+ be another even function that is increasing for
y ∈ R+ . Then g is Borel measurable function and Y = g( X ) is a random variable. Further, we can verify
that X, Y are uncorrelated and dependent random variables.
To show the dependence of X and Y, we take positive x, y such that F_X(x) < 1 and x > x_y, where x_y = g^{-1}(y) ∩ R_+. Then, we can write the set
A_Y(y) = Y^{-1}(−∞, y] = X^{-1}[−x_y, x_y].
Hence, we can write the joint distribution at ( x, y) as
FX,Y ( x, y) = P { X ⩽ x, Y ⩽ y} = P( A X ( x ) ∩ AY (y)) = P( AY (y)) = FY (y) ̸= FX ( x ) FY (y).
Since X has an even density function, we have f_X(x) = f_X(−x) for all x ∈ R. Therefore, we have
E[Xg(X)1_{{X<0}}] = ∫_{x<0} x g(x) f_X(x) dx = ∫_{u>0} (−u) g(−u) f_X(−u) du = −E[Xg(X)1_{{X>0}}].
The last equality follows from the fact that g and f X are even. Therefore, we have
E[ Xg( X )] = E[ Xg( X )1{ X <0} ] + E[ Xg( X )1{ X >0} ] = −E[ Xg( X )1{ X >0} ] + E[ Xg( X )1{ X >0} ] = 0.
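A discrete analogue of this example (not the continuous construction above, but the same phenomenon): X uniform on {−1, 0, 1} and Y = X^2 are uncorrelated yet dependent. The sketch below verifies both claims exactly with rational arithmetic:

```python
from fractions import Fraction

# Discrete analogue: X uniform on {-1, 0, 1} and Y = X^2
xs, p = [-1, 0, 1], Fraction(1, 3)

E_X = sum(p * x for x in xs)              # 0
E_Y = sum(p * x ** 2 for x in xs)         # 2/3
E_XY = sum(p * x * x ** 2 for x in xs)    # E[X^3] = 0
assert E_XY == E_X * E_Y                  # uncorrelated

# Dependent: P(X = 1, Y = 0) = 0 since Y = X^2, but P(X = 1) P(Y = 0) = 1/9
joint = Fraction(0)
marginal_product = p * p
assert joint != marginal_product
print("uncorrelated but dependent")
```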
Theorem 1.5 (AM greater than GM). For any two random variables X, Y, the correlation is upper bounded by the average of the second moments, with equality iff X = Y almost surely. That is,
E[XY] ⩽ (EX^2 + EY^2)/2.
Proof. This follows from the linearity and monotonicity of expectations and the fact that ( X − Y )2 ⩾ 0 with
equality iff X = Y.
Theorem 1.6 (Cauchy-Schwarz inequality). For any two random variables X, Y, the correlation of the absolute values of X and Y is upper bounded by the square root of the product of second moments, with equality iff X = αY for some constant α ∈ R. That is,
E|XY| ⩽ √(EX^2 EY^2).
Proof. For two random variables X and Y, we can define normalized random variables W ≜ |X|/√(EX^2) and Z ≜ |Y|/√(EY^2), to get the result.
2 Covariance
Definition 2.1 (Covariance). For two random variables X, Y defined on the same probability space, the
covariance between these two random variables is defined as cov( X, Y ) ≜ E( X − EX )(Y − EY ).
Lemma 2.2. If the random variables X, Y are uncorrelated, then the covariance is zero.
Proof. We can write the covariance of uncorrelated random variables X, Y as
cov( X, Y ) = E( X − EX )(Y − EY ) = EXY − (EX )(EY ) = 0.
Proof. From the linearity of expectation, we can write the variance of the linear combination as
E(∑_{i=1}^n a_i(X_i − EX_i))^2 = ∑_{i=1}^n a_i^2 Var X_i + ∑_{i≠j} a_i a_j cov(X_i, X_j).
Definition 2.4 (Correlation coefficient). The ratio of the covariance of two random variables X, Y and the square root of the product of their variances is called the correlation coefficient, denoted by
ρ_{X,Y} ≜ cov(X, Y)/√(Var(X) Var(Y)).
Theorem 2.5 (Correlation coefficient). For any two random variables X, Y, the absolute value of the correlation coefficient is less than or equal to unity, with equality iff X = αY + β almost surely for constants α = √(Var(X)/Var(Y)) and β = EX − αEY.
Proof. For two random variables X and Y, we can define normalized random variables W ≜ (X − EX)/√(Var(X)) and Z ≜ (Y − EY)/√(Var(Y)). Applying the AM-GM inequality to the random variables W, Z, we get
|cov(X, Y)| ⩽ √(Var(X) Var(Y)).
3 L p spaces
Definition 3.1. A pair (p, q) ∈ R^2 where p, q ⩾ 1 and 1/p + 1/q = 1 is called a conjugate pair, and the spaces L_p and L_q are called dual spaces.
Example 3.2. The dual of the L¹ space is L^∞. The space L² is its own dual, and is called a Hilbert space.
Theorem 3.3 (Hölder’s inequality). Consider a conjugate pair ( p, q) and random variables X ∈ L p , Y ∈ Lq . Then,
E | XY | ⩽ ∥ X ∥ p ∥Y ∥q .
Proof. Consider a random variable Z : Ω → {p ln v, q ln w} with probability mass function (1/p, 1/q). It follows from Jensen's inequality applied to the convex function f(x) = e^x and the random variable Z, that

vw = f(EZ) ⩽ E f(Z) = v^p/p + w^q/q.

It follows that for any non-negative random variables V, W, we have VW ⩽ V^p/p + W^q/q. Taking expectations on both sides, we get from the monotonicity of expectation that E[VW] ⩽ E[V^p]/p + E[W^q]/q. Taking V ≜ |X|/∥X∥_p and W ≜ |Y|/∥Y∥_q, we get the result.
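Hölder's inequality also holds under the empirical measure of any finite sample, so a direct numerical check is possible. A minimal sketch, assuming the conjugate pair (p, q) = (3, 3/2) and two arbitrary test distributions (all names and sample sizes are illustrative):

```python
import random

def lp_norm(xs, p):
    """Empirical L^p norm (E|X|^p)^(1/p) under the sample's empirical measure."""
    return (sum(abs(x) ** p for x in xs) / len(xs)) ** (1.0 / p)

random.seed(1)
n = 50_000
xs = [random.gauss(0, 1) for _ in range(n)]       # X ~ N(0, 1)
ys = [random.expovariate(1.0) for _ in range(n)]  # Y ~ Exp(1)

p, q = 3.0, 1.5
assert abs(1 / p + 1 / q - 1) < 1e-12             # (p, q) is a conjugate pair

lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n  # E|XY|
rhs = lp_norm(xs, p) * lp_norm(ys, q)              # ||X||_p ||Y||_q
assert lhs <= rhs                                  # Hölder's inequality
```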
Definition 3.4. For a pair of random variables (X, Y) ∈ L^p × L^q for a conjugate pair (p, q), we can define the inner product ⟨·, ·⟩ : L^p × L^q → R by

⟨X, Y⟩ ≜ E[XY].
Remark 1. For X ∈ L p and Y ∈ Lq , the expectation E | XY | is finite from Hölder’s inequality. Therefore, the
inner product ⟨ X, Y ⟩ = E[ XY ] is well defined and finite.
Remark 2. This inner product is well defined for the conjugate pair (1, ∞).
Theorem 3.5 (Minkowski's inequality). For 1 ⩽ p < ∞, let X, Y ∈ L^p be two random variables defined on a probability space (Ω, F, P). Then

∥X + Y∥_p ⩽ ∥X∥_p + ∥Y∥_p,

with equality iff X = αY for some α ⩾ 0 or Y = 0.
Proof. Since addition is a Borel measurable function, X + Y is a random variable. We first show that X + Y ∈ L^p when X, Y ∈ L^p. To this end, we observe that g : R₊ → R₊ defined by g(x) = x^p for all x ∈ R₊ is a convex function for p ⩾ 1. From the convexity of g, we have
|X/2 + Y/2|^p ⩽ (|X|/2 + |Y|/2)^p = g(|X|/2 + |Y|/2) ⩽ g(|X|)/2 + g(|Y|)/2 = |X|^p/2 + |Y|^p/2.

This implies that |X + Y|^p ⩽ 2^{p−1}(|X|^p + |Y|^p).
The inequality holds trivially if ∥X + Y∥_p = 0. Therefore, we assume ∥X + Y∥_p > 0 without any loss of generality. Using the definition of ∥·∥_p, the triangle inequality, Hölder's inequality, and the linearity of expectation, we get

∥X + Y∥_p^p ⩽ (∥X∥_p + ∥Y∥_p) ∥ |X + Y|^{p−1} ∥_q.

Recall that q = p/(p − 1). Therefore, ∥ |X + Y|^{p−1} ∥_q = (E|X + Y|^p)^{1−1/p}, and the result follows.
Remark 3. We have shown that the map ∥·∥_p is a norm by proving Minkowski's inequality. Therefore, L^p
is a normed vector space. We can define distance between two random variables X1 , X2 ∈ L p by the norm
∥ X1 − X2 ∥ p .
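Because the empirical distribution of a finite sample is a probability measure in its own right, Minkowski's inequality must hold exactly for sample-based L^p norms. A small sketch (the distributions and exponents are arbitrary illustrative choices):

```python
import random

def lp_norm(xs, p):
    """Empirical L^p norm (E|X|^p)^(1/p)."""
    return (sum(abs(x) ** p for x in xs) / len(xs)) ** (1.0 / p)

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(10_000)]
ys = [random.uniform(-2, 2) for _ in range(10_000)]
sums = [x + y for x, y in zip(xs, ys)]

for p in (1, 1.5, 2, 4):
    # Minkowski: ||X + Y||_p <= ||X||_p + ||Y||_p
    assert lp_norm(sums, p) <= lp_norm(xs, p) + lp_norm(ys, p)
```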
Lecture-11: Generating functions
1 Generating functions
Suppose that X : Ω → R is a continuous random variable on the probability space (Ω, F, P) with distribution
function FX : R → [0, 1].
Example 1.1. Let j ≜ √−1; then we can show that h_u : R → C defined by h_u(x) ≜ e^{jux} = cos(ux) + j sin(ux)
is also Borel measurable for all u ∈ R. Thus, hu ( X ) : Ω → C is a complex valued random variable on this
probability space.
Definition 1.2. For a random variable X : Ω → R defined on the probability space (Ω, F, P), the character-
istic function Φ X : R → C is defined by Φ X (u) ≜ Ee juX for all u ∈ R and j2 = −1.
Remark 1. The characteristic function Φ_X(u) is always finite, since |Φ_X(u)| = |E e^{juX}| ⩽ E|e^{juX}| = 1.
Remark 2. For a discrete random variable X : Ω → X with PMF P_X : X → [0, 1], the characteristic function Φ_X(u) = ∑_{x∈X} e^{jux} P_X(x).
Remark 3. For a continuous random variable X : Ω → R with density function f_X : R → R₊, the characteristic function Φ_X(u) = ∫_{−∞}^{∞} e^{jux} f_X(x) dx.
Example 1.3 (Gaussian random variable). For a Gaussian random variable X : Ω → R with mean µ and variance σ², the characteristic function Φ_X is

Φ_X(u) = (1/√(2πσ²)) ∫_{x∈R} e^{jux} e^{−(x−µ)²/(2σ²)} dx = exp(−u²σ²/2 + juµ).

We observe that |Φ_X(u)| = e^{−u²σ²/2} has Gaussian decay in u, proportional to a Gaussian density with zero mean and variance 1/σ².
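The closed form for the Gaussian characteristic function can be verified by Monte Carlo, estimating E e^{juX} from samples and comparing against exp(−u²σ²/2 + juµ). A sketch with illustrative parameters (µ, σ) = (1, 2); the tolerance reflects the Monte Carlo error, not an exact bound:

```python
import cmath
import random

def empirical_cf(samples, u):
    """Monte Carlo estimate of Phi_X(u) = E e^{juX}."""
    return sum(cmath.exp(1j * u * x) for x in samples) / len(samples)

mu, sigma = 1.0, 2.0
random.seed(3)
samples = [random.gauss(mu, sigma) for _ in range(200_000)]

for u in (0.0, 0.5, 1.0):
    exact = cmath.exp(-u * u * sigma * sigma / 2 + 1j * u * mu)
    # |e^{juX}| <= 1, so the standard error is at most 1/sqrt(200000) ~ 0.0023
    assert abs(empirical_cf(samples, u) - exact) < 0.02
```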
Theorem 1.4. If E|X|^N is finite for some integer N ∈ N, then the derivative Φ_X^{(k)}(u) is a finite and continuous function of u ∈ R for all k ∈ [N]. Further, Φ_X^{(k)}(0) = j^k E[X^k] for all k ∈ [N].

Proof. Since E|X|^N is finite, so is E|X|^k for all k ∈ [N]. Therefore, E[X^k] exists and is finite. Exchanging the derivative and the integration (which can be done since e^{jux} is a bounded function with all derivatives), and evaluating the derivative at u = 0, we get Φ_X^{(k)}(0) = E[d^k e^{juX}/du^k |_{u=0}] = j^k E[X^k].
Theorem 1.5. Two random variables have the same distribution iff they have the same characteristic function.
Proof. Necessity is immediate; the proof of sufficiency is more involved, and we omit it here.
1.2 Moment generating function
Example 1.6. A function gt : R → R+ defined by gt ( x ) ≜ etx is monotone and hence Borel measurable for
all t ∈ R. Therefore, gt ( X ) : Ω → R+ is a positive random variable on this probability space.
Definition 1.7. For a random variable X : Ω → R defined on the probability space (Ω, F, P), the moment
generating function MX : R → R+ is defined by MX (t) ≜ EetX for all t ∈ R where MX (t) is finite.
Remark 4. Characteristic functions always exist; however, they are complex-valued in general. Sometimes it is easier to work with moment generating functions, when they exist.
Lemma 1.8. For a random variable X, if the MGF M_X(t) is finite for some t ∈ R, then M_X(t) = 1 + ∑_{n∈N} (tⁿ/n!) E[Xⁿ].

Proof. From the Taylor series expansion of e^θ around θ = 0, we get e^θ = 1 + ∑_{n∈N} θⁿ/n!. Therefore, taking θ = tX, we can write e^{tX} = 1 + ∑_{n∈N} (tⁿ/n!) Xⁿ. Taking expectations on both sides, the result follows from the linearity of expectation, when both sides have finite expectation.
Example 1.9 (Gaussian random variable). For a Gaussian random variable X : Ω → R with mean µ and variance σ², the moment generating function M_X is M_X(t) = exp(t²σ²/2 + tµ).
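Lemma 1.8 can be checked numerically for the standard Gaussian, whose odd moments vanish and whose even moments are E[X^{2k}] = (2k − 1)!!; the truncated moment series should agree with M_X(t) = e^{t²/2}. A sketch (truncating at 40 terms is an arbitrary choice that is more than enough for the tested values of t):

```python
import math

def double_factorial(n):
    """n!! = n (n-2) (n-4) ... ; equals 1 for n <= 0."""
    return 1 if n <= 0 else n * double_factorial(n - 2)

def gaussian_moment(n):
    """E[X^n] for X ~ N(0, 1): zero for odd n, (n-1)!! for even n."""
    return 0 if n % 2 else double_factorial(n - 1)

def mgf_series(t, terms=40):
    """Truncation of M_X(t) = 1 + sum_n t^n/n! E[X^n] from Lemma 1.8."""
    return 1 + sum(t ** n / math.factorial(n) * gaussian_moment(n)
                   for n in range(1, terms))

for t in (0.3, 1.0, 2.0):
    assert abs(mgf_series(t) - math.exp(t * t / 2)) < 1e-9
```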
Definition 1.10 (Probability generating function). For a non-negative integer-valued random variable X : Ω → X with PMF P_X, the probability generating function (or z-transform) Ψ_X is defined by

Ψ_X(z) ≜ E[z^X] = ∑_{x∈X} z^x P_X(x), z ∈ C, |z| ⩽ 1.
Lemma 1.11. For a non-negative simple random variable X : Ω → X, we have |Ψ X (z)| ⩽ 1 for all |z| ⩽ 1.
Proof. Let z ∈ C with |z| ⩽ 1, and let P_X : X → [0, 1] be the probability mass function of the non-negative simple random variable X. Since any realization x ∈ X of the random variable X is non-negative, we can write |Ψ_X(z)| ⩽ ∑_{x∈X} |z|^x P_X(x) ⩽ ∑_{x∈X} P_X(x) = 1.
Theorem 1.12. For a non-negative simple random variable X : Ω → X with finite kth moment E[X^k], the k-th derivative of the probability generating function evaluated at z = 1 is the k-th order factorial moment of X. That is,

Ψ_X^{(k)}(1) = E[∏_{i=0}^{k−1} (X − i)] = E[X(X − 1)(X − 2)⋯(X − k + 1)].
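Theorem 1.12 can be verified mechanically for any finite PMF, since Ψ_X(z) is then a polynomial whose derivatives can be written down term by term. A sketch with an arbitrary illustrative PMF on {0, 1, 2, 3}; note that `math.perm(x, k)` computes the falling factorial x(x−1)⋯(x−k+1), returning 0 when k > x:

```python
import math

# An arbitrary PMF of a non-negative simple random variable on {0, 1, 2, 3}
pmf = {0: 0.1, 1: 0.4, 2: 0.3, 3: 0.2}

def pgf_derivative_at_1(pmf, k):
    """k-th derivative of Psi_X(z) = sum_x z^x P_X(x), evaluated at z = 1.

    d^k/dz^k z^x = x(x-1)...(x-k+1) z^(x-k), which at z = 1 leaves the
    falling factorial as the coefficient.
    """
    return sum(math.perm(x, k) * p for x, p in pmf.items())

def factorial_moment(pmf, k):
    """E[X(X-1)...(X-k+1)] computed directly from the PMF."""
    return sum(math.perm(x, k) * p for x, p in pmf.items())

for k in (1, 2, 3):
    assert abs(pgf_derivative_at_1(pmf, k) - factorial_moment(pmf, k)) < 1e-12
```

For this finite-support case the two computations coincide term by term, which is exactly the content of the theorem.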
Theorem 1.13. Two non-negative integer-valued random variables have the same probability distribution iff their
z-transforms are equal.
Proof. The necessity is clear. For sufficiency, interchanging the derivative and the summation (by the dominated convergence theorem), we see that Ψ_X^{(k)}(z) = ∑_{x⩾k} x(x−1)⋯(x−k+1) z^{x−k} P_X(x), and hence Ψ_X^{(k)}(0) = k! P_X(k). That is, the z-transform determines the PMF, and hence the distribution.
mean µ1 and variance σ². The characteristic function Φ_X of an i.i.d. Gaussian random vector X : Ω → Rⁿ parametrized by (µ1, σ²) is given by Φ_X(u) = ∏_{i=1}^n Φ_{X_i}(u_i) = exp(−(σ²/2) ∑_{i=1}^n u_i² + jµ ∑_{i=1}^n u_i).
Lemma 2.3. For an i.i.d. zero mean unit variance Gaussian vector Z : Ω → Rn , vector α ∈ Rn , and scalar µ ∈ R,
the affine combination Y ≜ µ + ⟨α, Z ⟩ is a Gaussian random variable.
Proof. From the linearity of expectation and the fact that Z is a zero mean vector, we get EY = µ. Further,
from the linearity of expectation and the fact that E[ ZZ T ] = I, we get
σ² ≜ Var(Y) = E(Y − µ)² = ∑_{i=1}^n ∑_{k=1}^n α_i α_k E[Z_i Z_k] = ⟨α, α⟩ = ∥α∥₂² = ∑_{i=1}^n α_i².

To show that Y is Gaussian, it suffices to show that Φ_Y(u) = exp(−u²σ²/2 + juµ) for any u ∈ R. Recall that Z is an independent random vector with individual components being identically zero mean unit variance Gaussian. Therefore, Φ_{Z_i}(u) = exp(−u²/2), and we can compute the characteristic function of Y as

Φ_Y(u) = E e^{juY} = e^{juµ} E ∏_{i=1}^n e^{juα_i Z_i} = e^{juµ} ∏_{i=1}^n Φ_{Z_i}(uα_i) = exp(−u²σ²/2 + juµ).
Theorem 2.4. A random vector X : Ω → Rⁿ is Gaussian with parameters (µ, Σ) iff Z ≜ Σ^{−1/2}(X − µ) is an i.i.d. zero mean unit variance Gaussian random vector.

Proof. Let X = µ + Σ^{1/2} Z for an i.i.d. zero mean unit variance Gaussian random vector Z : Ω → Rⁿ; we will show that X is a Gaussian random vector by the transformation of random vector densities. Since the (i, j)th component of the Jacobian matrix J(x) is given by J_{ij}(x) = ∂x_j/∂z_i = Σ_{i,j}^{1/2} for all i, j ∈ [n], we can write the Jacobian matrix J(x) = Σ^{1/2}. Since the density of Z is f_Z(z) = (2π)^{−n/2} exp(−z^T z/2), from the transformation of random vectors, we get

f_X(x) = f_Z(Σ^{−1/2}(x − µ))/det(Σ^{1/2}) = (2π)^{−n/2} det(Σ)^{−1/2} exp(−(x − µ)^T Σ^{−1} (x − µ)/2), x ∈ Rⁿ.

Conversely, we can show that if X is a Gaussian random vector, then Z = Σ^{−1/2}(X − µ) is an i.i.d. zero mean unit variance Gaussian random vector by the transformation of random vectors.
Remark 8. A random vector X : Ω → Rⁿ with mean µ ∈ Rⁿ and covariance matrix Σ ∈ R^{n×n} is Gaussian iff X can be written as X = µ + Σ^{1/2} Z, for an i.i.d. Gaussian random vector Z : Ω → Rⁿ with mean 0 and variance 1. It follows that EX = µ and Σ = E(X − µ)(X − µ)^T.
Remark 9. We observe that the components of the Gaussian random vector X = µ + AZ for A = Σ^{1/2} are Gaussian random variables with mean µ_i and variance ∑_{k=1}^n A_{i,k}² = (AA^T)_{i,i} = Σ_{i,i}, since each component X_i = µ_i + ∑_{k=1}^n A_{i,k} Z_k is an affine combination of zero mean unit variance i.i.d. random variables.
Remark 10. For any u ∈ Rⁿ, we compute the characteristic function Φ_X from the distribution of Z as

Φ_X(u) = E e^{j⟨u,X⟩} = E exp(j⟨u, µ⟩ + j⟨A^T u, Z⟩) = exp(j⟨u, µ⟩) Φ_Z(A^T u) = exp(j⟨u, µ⟩ − u^T Σ u / 2).
Lemma 2.5. If the components of the Gaussian random vector are uncorrelated, then they are independent.
Proof. If a Gaussian vector is uncorrelated, then the covariance matrix Σ is diagonal. It follows that we can
write f X ( x ) = ∏in=1 f Xi ( xi ) for all x ∈ Rn .
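Remarks 8–10 suggest a direct simulation: sample X = µ + AZ with AA^T = Σ and check the empirical mean and covariance. The sketch below uses the Cholesky factor of a 2×2 covariance matrix rather than the symmetric square root Σ^{1/2}; by Remark 9, any A with AA^T = Σ yields the same distribution. All numerical values (µ, Σ, sample size, tolerances) are illustrative.

```python
import random

def cholesky2x2(s11, s12, s22):
    """Lower-triangular A with A A^T = Sigma for a 2x2 covariance matrix."""
    a11 = s11 ** 0.5
    a21 = s12 / a11
    a22 = (s22 - a21 * a21) ** 0.5
    return a11, a21, a22

mu = (1.0, -2.0)
s11, s12, s22 = 4.0, 1.2, 2.0                 # Sigma = [[4, 1.2], [1.2, 2]]
a11, a21, a22 = cholesky2x2(s11, s12, s22)

random.seed(4)
n = 200_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)   # i.i.d. N(0, 1) vector Z
    xs.append(mu[0] + a11 * z1)                       # X = mu + A Z, componentwise
    ys.append(mu[1] + a21 * z1 + a22 * z2)

mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
assert abs(mx - mu[0]) < 0.05 and abs(my - mu[1]) < 0.05   # EX = mu
assert abs(cov - s12) < 0.1                                # E(X-mu)(Y-mu) = Sigma_12
```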
Lecture-12: Conditional Expectation
Recall that for events A, B ∈ F with P(B) > 0, the conditional probability of A given B is

P(A | B) = P(A ∩ B)/P(B).
Consider a random variable X : Ω → R defined on a probability space (Ω, F, P), with distribution function
FX : R → [0, 1], and a non trivial event B ∈ F such that P( B) > 0.
Definition 1.1. The conditional distribution of X given event B is denoted by F_{X|B} : R → [0, 1], where F_{X|B}(x) is defined as the probability of the event A_X(x) ≜ X^{−1}(−∞, x] conditioned on the event B for all x ∈ R. That is,

F_{X|B}(x) ≜ P(A_X(x) | B) = P(A_X(x) ∩ B)/P(B) for all x ∈ R.
Remark 1. The conditional distribution FX | B is a distribution function. This follows from the fact that (i)
FX | B ⩾ 0, (ii) FX | B is right continuous, (iii) limx↓−∞ FX | B ( x ) = 0 and limx↑∞ FX | B ( x ) = 1.
Remark 2. For a discrete random variable X : Ω → X, the conditional probability mass function of X given a non-trivial event B is given by P_{X|B}(x) = P(X^{−1}{x} ∩ B)/P(B) for all x ∈ X.
Remark 3. For a continuous random variable X : Ω → R, the conditional density of X given a non-trivial event B is given by f_{X|B}(x) = dF_{X|B}(x)/dx for all x ∈ R.
Example 1.2 (Conditional distribution). Consider the probability space (Ω, F, P) corresponding to a ran-
dom experiment where a fair die is rolled once. For this case, the outcome space Ω = [6], the event space
F = P([6]), and the probability measure P(ω) = 1/6 for all ω ∈ Ω.
We define a random variable X : Ω → R such that X (ω ) = ω for all ω ∈ Ω, and an event B ≜
{ω ∈ Ω : X (ω ) ⩽ 3} = [3] ∈ F. We note that P( B) = 0.5 and the conditional PMF of X given B is
P_{X|B}(x) = (1/3) 1_{{x=1}} + (1/3) 1_{{x=2}} + (1/3) 1_{{x=3}}.
Remark 4. For a discrete random variable X : Ω → X, the conditional expectation of X given a non trivial
event B is given by E[ X | B] = ∑ x∈X xPX | B ( x ).
Example 1.4 (Conditional expectation). For the random variable X and event B defined in Example 1.2,
the conditional expectation E[ X | B] = 2.
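Examples 1.2 and 1.4 can be reproduced in exact rational arithmetic; the sketch below computes the conditional PMF P_{X|B} and the conditional expectation E[X | B] directly from the definitions:

```python
from fractions import Fraction

# Fair die: P(omega) = 1/6 on Omega = {1, ..., 6}; condition on B = {X <= 3}.
omega = range(1, 7)
p = {w: Fraction(1, 6) for w in omega}

B = [w for w in omega if w <= 3]
pB = sum(p[w] for w in B)
assert pB == Fraction(1, 2)                    # P(B) = 0.5

# Conditional PMF P_{X|B}(x) = P(X^{-1}{x} ∩ B) / P(B)
p_cond = {x: (p[x] / pB if x in B else Fraction(0)) for x in omega}
assert all(p_cond[x] == Fraction(1, 3) for x in B)

# Conditional expectation E[X | B] = sum_x x P_{X|B}(x)
assert sum(x * p_cond[x] for x in omega) == 2
```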
Remark 5. Consider two random variables X, Y defined on this probability space, then for y ∈ R such that
FY (y) > 0, we can define events A X ( x ) ≜ X −1 (−∞, x ] and AY (y) = Y −1 (−∞, y], such that
P({X ⩽ x} | {Y ⩽ y}) = F_{X,Y}(x, y)/F_Y(y).
The key observation is that {Y ⩽ y} is a non-trivial event. How do we define conditional expectation based
on events such as {Y = y}? When random variable Y is continuous, this event has zero probability measure.
Lemma 2.4. The expectation of the conditional expectation of a random variable X given an event space G is EX.

Proof. Since Ω ∈ G by the definition of an event space, we get from the definition of conditional expectation that

E[E[X | G]] = E[E[X | G] 1_Ω] = E[X 1_Ω] = EX.
Example 2.6 (Conditioning on simple random variables). For a simple random variable Y : Ω → Y ⊆ R
defined on the probability space (Ω, F, P), we define fundamental events Ey ≜ Y −1 {y} ∈ F for all y ∈ Y.
Then the sequence of events E ≜ (E_y ∈ F : y ∈ Y) partitions the sample space, and we can write the event space generated by the random variable Y as σ(Y) = {∪_{y∈I} E_y : I ⊆ Y}.
For a random variable X : Ω → R defined on the same probability space, the random variable Z ≜ E[X | Y] is σ(Y) measurable. Therefore, E[X | Y] = ∑_{y∈Y} α_y 1_{E_y} for some α ∈ R^Y. We verify that Z : Ω → R is a σ(Y) measurable random variable, since σ(Z) ⊆ σ(Y). We can also check that E|Z| < ∞. Further, we have E[Z 1_{E_y}] = E[X 1_{E_y}] for any y ∈ Y, which implies that α_y = E[X 1_{E_y}]/P_Y(y) for any y ∈ Y. Notice that

E[X | E_y] = ∫_{x∈R} x dF_{X|E_y}(x) = (1/P_Y(y)) ∫_{x∈R} x dP(A_X(x) ∩ E_y) = (1/P_Y(y)) E[X 1_{E_y}] = α_y.
Remark 6. There are three main takeaways from this definition. For a random variable Y, the event space
generated by Y is σ (Y ).
1. The conditional expectation E[ X |Y ] = E[ X |σ (Y )] and is σ(Y ) measurable. That is, E[ X |Y ] is a Borel
measurable function of Y. In particular when Y is discrete, this implies that E[ X |Y ] is a simple random
variable that takes value E[X | E_y] when ω ∈ E_y, and the probability of this event is P_Y(y). When Y is continuous, E[X | Y] is a Borel measurable function of the continuous random variable Y.
2. Expectation is averaging. Conditional expectation is averaging over event spaces. We can observe
that the coarsest averaging is E[ X | {∅, Ω}] = EX and the finest averaging is E[ X |σ ( X )] = X. Further,
E[ X |σ (Y )] is averaging of X over events generated by Y. If we take any event A ∈ σ (Y ) generated by
Y, then the conditional expectation of X given Y is fine enough to find the averaging of X when this
event occurs. That is, E[ X 1 A ] = E[E[ X |Y ]1 A ].
Example 3.2. For the trivial sigma algebra G = {∅, Ω}, the conditional probability is the constant function
P( A | {∅, Ω}) = P( A).
Example 3.3. If A ∈ G, then P( A | G) = 1 A .
Definition 3.4. The conditional distribution of random variable X given sub event space G is defined as
FX |G ( x ) ≜ P( A X ( x ) | G) for all x ∈ R.
Remark 8. Recall that F_{X|G}(x) : Ω → [0, 1] is a random variable for each x ∈ R. Further, we observe that F_{X|G}
is monotone nondecreasing in x ∈ R, right continuous at all x ∈ R, and has limits limx↓−∞ FX |G ( x ) = 0 and
limx↑∞ FX |G ( x ) = 1. It follows that FX |G : Ω → [0, 1]R is a random distribution.
Theorem 3.5. Let g : R → R be a Borel measurable function and G be a sub-event space. Then, the conditional expectation E[g(X) | G] = ∫_{x∈R} g(x) dF_{X|G}(x).
Proof. It suffices to show this for simple random variables X : Ω → X. Since g is Borel measurable, then
g( X ) is a random variable. We will show that E[ g( X ) | G] = ∑ x∈X g( x ) PX |G ( x ) by showing that it satis-
fies three properties of conditional expectation. For part (i), we observe that from the definition of condi-
tional probability PX |G ( x ) is a G-measurable random variable for all x ∈ X, and so is the linear combination
∑_{x∈X} g(x) P_{X|G}(x). For part (ii), we let G ∈ G. Then, it follows from the linearity of expectation and the definition of conditional probability, that

E[∑_{x∈X} g(x) P_{X|G}(x) 1_G] = ∑_{x∈X} g(x) E[1_{X^{−1}{x}} 1_G] = E[g(X) 1_G].

For part (iii), it follows from the triangle inequality, the linearity of expectation, and the definition of conditional probability that E|∑_{x∈X} g(x) P_{X|G}(x)| ⩽ ∑_{x∈X} |g(x)| E[P_{X|G}(x)] = ∑_{x∈X} |g(x)| P_X(x) = E|g(X)| < ∞.
Remark 9. The conditional characteristic function is given by Φ_{X|G}(u) = E[e^{juX} | G] = ∫_{x∈R} e^{jux} dF_{X|G}(x).
Definition 3.6. The conditional distribution of random variable X given random variable Y is defined as
F_{X|Y}(x) ≜ P(A_X(x) | σ(Y)) for all x ∈ R.
Example 3.7 (Conditional distribution given simple random variables). Consider a random variable X : Ω → R and a simple random variable Y : Ω → Y defined on the same probability space. Since the random variables F_{X|Y}(x) = E[1_{A_X(x)} | Y] are σ(Y) measurable, they can be written as F_{X|Y}(x) = ∑_{y∈Y} β_{x,y} 1_{E_y} for some β_x ∈ R^Y and E_y = Y^{−1}{y} for all y ∈ Y. Further, we have E[F_{X|Y}(x) 1_{E_y}] = E[1_{A_X(x)} 1_{E_y}] for any y ∈ Y, which implies that β_{x,y} = P(A_X(x) ∩ E_y)/P_Y(y) = F_{X|E_y}(x) for any y ∈ Y. It follows that F_{X|Y}(x) is a σ(Y) measurable simple random variable.
Example 3.8 (Conditional expectation). Consider a random experiment of a fair die being thrown and a
random variable X : Ω → R taking the value of the outcome of the experiment. That is, for outcome space
Ω = [6] and event space F = P (Ω), we have X (ω ) = ω with PX ( x ) = 1/6 for x ∈ [6]. Define another random
variable Y = 1_{{X⩽3}}. Then the conditional expectation of X given Y is the random variable E[X | Y] = 2 · 1_{{Y=1}} + 5 · 1_{{Y=0}}.
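The defining property of conditional expectation, E[X 1_A] = E[E[X | Y] 1_A] for every A ∈ σ(Y), can be checked exhaustively for this example, since σ(Y) has only four events. A sketch in exact arithmetic; the helper `cond_exp_given_Y` (a name introduced here for illustration) averages X over the block of the partition containing ω, which is what E[X | Y] does for a simple Y:

```python
from fractions import Fraction

# Fair die X(omega) = omega on Omega = {1, ..., 6}; Y = 1 if X <= 3 else 0.
omega = range(1, 7)
p = Fraction(1, 6)

def cond_exp_given_Y(w):
    """E[X | Y](w): the average of X over the partition block containing w."""
    block = [v for v in omega if (v <= 3) == (w <= 3)]
    return Fraction(sum(block), len(block))

# sigma(Y) has four events: empty set, {Y=1}, {Y=0}, and Omega.
events = ([], [w for w in omega if w <= 3], [w for w in omega if w > 3], list(omega))
for A in events:
    lhs = sum(w * p for w in A)                    # E[X 1_A]
    rhs = sum(cond_exp_given_Y(w) * p for w in A)  # E[E[X|Y] 1_A]
    assert lhs == rhs
```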
Example 3.9 (Conditional distribution). Consider the zero-mean Gaussian random variable N : Ω → R
with variance σ2 , and another independent random variable Y ∈ {−1, 1} with PMF (1 − p, p) for some
p ∈ [0, 1]. Let X = Y + N; then the conditional distribution of X given the simple random variable Y is F_{X|Y} = F_{X|Y^{−1}{1}} 1_{{Y=1}} + F_{X|Y^{−1}{−1}} 1_{{Y=−1}}, where F_{X|Y^{−1}{µ}}(x) = (1/√(2πσ²)) ∫_{−∞}^x e^{−(t−µ)²/(2σ²)} dt.
Definition 3.10. When X, Y are both continuous random variables, there exists a joint density f_{X,Y}(x, y) for all (x, y) ∈ R². For each y ∈ R such that f_Y(y) > 0, we can define a function f_{X|Y} : R² → R₊ such that

f_{X|Y}(x, y) ≜ f_{X,Y}(x, y)/f_Y(y), for all x ∈ R.
Exercise 3.11. For continuous random variables X, Y, show that the function x ↦ f_{X|Y}(x, y) is a density of a continuous random variable for each y ∈ R with f_Y(y) > 0.
Lecture-13: Conditional expectation
Proof. For any x ∈ R^{n+1}, any generating event for the collection σ(X₁, …, X_n) is of the form ∩_{i=1}^n X_i^{−1}(−∞, x_i] = ∩_{i=1}^n X_i^{−1}(−∞, x_i] ∩ X_{n+1}^{−1}(R), a generating event for the collection σ(X₁, …, X_{n+1}).
1. Identity: It follows from the definition that X satisfies all three conditions for conditional expectation.
The event space generated by any constant function is the trivial event space {∅, Ω} ⊆ G for any event
space. Hence, E[c | G] = c.
2. Linearity: Let Z ≜ E[(αX + βY ) | G] and Z1 ≜ E[ X | G] and Z2 ≜ E[Y | G]. We have to show that
Z = αZ1 + βZ2 almost surely.
(i) From the definition of conditional expectation, we have E[ X | G], E[Y | G] are G-measurable. In
addition, the linear map g : R2 → R defined by g( x ) ≜ αx1 + βx2 is Borel measurable. Therefore,
the linear combination αZ1 + βZ2 is G-measurable.
(ii) For any G ∈ G, from the linearity of expectation and definition of conditional expectation, we
have
E[(αZ1 + βZ2 )1G ] = αE[E[ X | G]1G ] + βE[E[Y | G]1G ] = E[(αX + βY )1G ].
(iii) From the definition of conditional expectation, we have E | Z1 | , E | Z2 | are finite. Therefore, we
have
E |αZ1 + βZ2 | ⩽ |α| E | Z1 | + | β| E | Z2 | < ∞.
This implies that αZ1 + βZ2 satisfies three properties of conditional expectation E[(αX + βY ) | G].
From the almost sure uniqueness of conditional expectation, we have Z = αZ1 + βZ2 almost surely.
3. Monotonicity: Suppose X ⩽ Y almost surely. Let ϵ > 0 and define A_ϵ ≜ {E[X | G] − E[Y | G] > ϵ} ∈ G. Then from the definition of conditional expectation, we have P(A_ϵ) = 0 for each ϵ > 0, and hence

0 = lim_{ϵ↓0} P(A_ϵ) = P(lim_{ϵ↓0} A_ϵ) = P(A₀).
4. Conditional Jensen’s inequality: We will use the fact that a convex function can always be repre-
sented by the supremum of a family of affine functions. Accordingly, we will assume for a convex
function ψ : R → R, we have linear functions ϕi : R → R and constants ci ∈ R for all i ∈ I such that
ψ = supi∈ I (ϕi + ci ).
For each i ∈ I, we have ϕ_i(E[X | G]) + c_i = E[ϕ_i(X) | G] + c_i ⩽ E[ψ(X) | G] from the linearity and monotonicity of conditional expectation. Taking the supremum over i ∈ I, it follows that ψ(E[X | G]) ⩽ E[ψ(X) | G] almost surely.
5. Pulling out what’s known: From the almost sure uniqueness of conditional expectation, it suffices to
show that YE[ X | G] satisfies following three properties of the conditional expectation E[ XY | G]:
(i) the random variable YE[ X | G] is G-measurable,
(ii) E[ XY 1G ] = E[YE[ X | G]1G ] for any event G ∈ G, and
(iii) E |YE[ X | G]| is finite.
Part (i) is true since Y is given to be G-measurable, E[ X | G] is G-measurable by the definition of
conditional expectation, and product function is Borel measurable.
It suffices to show part (ii) and (iii) for simple G-measurable random variables Y = ∑y∈Y y1Ey where
Ey = Y −1 {y} ∈ G. For any G ∈ G, we have G ∩ Ey ∈ G for all y ∈ Y and ∪y∈Y ( G ∩ Ey ) = G.
Part (ii) follows from the linearity of expectation and the definition of conditional expectation E[X | G], such that E[XY 1_G] = ∑_{y∈Y} y E[X 1_{G∩E_y}] = ∑_{y∈Y} y E[E[X | G] 1_{G∩E_y}] = E[YE[X | G] 1_G].
Part (iii) follows from the fact that E | XY | is finite and the conditional Jensen’s inequality applied to
convex function |·| : R → R+ to get |E[ X | G]| ⩽ E[| X | | G]. Therefore,
E[|Y| |E[X | G]|] = ∑_{y∈Y} |y| E[|E[X | G]| 1_{E_y}] ⩽ ∑_{y∈Y} |y| E[|X| 1_{E_y}] = E|XY|.
6. Tower property: From the almost sure uniqueness of conditional expectation, it suffices to show that
(i) E[ X | H] is H-measurable,
(ii) E[E[ X | H]1 H ] = E[E[ X | G]1 H ] for all H ∈ H, and
(iii) E[|E[ X | H]|] is finite.
Part (i) follows from the definition of conditional expectation, which implies that E[ X | H] is H mea-
surable.
Part (ii) follows from the fact that H ⊆ G, and hence any H ∈ H belongs to G. Therefore, from the definition of conditional expectation, we have E[E[X | H] 1_H] = E[X 1_H] = E[E[X | G] 1_H].
Part (iii) follows from the conditional Jensen’s inequality applied to convex function |·| : R → R+ to
get |E[ X | H]| ⩽ E[| X | | H], and hence E |E[ X | H]| ⩽ E | X | < ∞.
7. Irrelevance of independent information: From the almost sure uniqueness of conditional expecta-
tion, it suffices to show that
(i) E[ X | G] is σ (G, H)-measurable,
(ii) E[E[ X | G]1G ] = E[ X 1G ] for all G ∈ σ (G, H), and
(iii) E |E[ X | G]| is finite.
Part (i) follows from the definition of conditional expectation and the definition of σ (G, H). Since
E[ X | G] is G-measurable, it is σ (G, H) measurable.
Part (ii) follows from the fact that it suffices to show it for events A = G ∩ H ∈ σ(G, H), where G ∈ G and H ∈ H. In this case, using the independence of H from σ(σ(X), G), we have E[E[X | G] 1_{G∩H}] = E[E[X | G] 1_G] P(H) = E[X 1_G] P(H) = E[X 1_{G∩H}].
Part (iii) follows from the conditional Jensen’s inequality applied to convex function |·| : R → R+ to
get |E[ X | G]| ⩽ E[| X | | G]. This implies that E |E[ X | G]| ⩽ E | X | < ∞.
8. L²-projection: We define L²(G) ≜ {ζ : Ω → R : ζ is G measurable and Eζ² < ∞}. From the conditional Jensen's inequality applied to the convex function (·)² : R → R₊, we get that (E[X | G])² ⩽ E[X² | G] a.s. Since X ∈ L², it follows that X² ∈ L¹ and hence E[X | G] ∈ L². It follows that ζ* ≜ E[X | G] ∈ L²(G) from the definition of conditional expectation.
We first show that X − ζ ∗ is uncorrelated with all ζ ∈ L2 (G). Towards this end, we let ζ ∈ L2 (G) and
observe that
E[( X − ζ ∗ )ζ ] = E[ζX ] − E[ζE[ X | G]] = E[ζX ] − E[E[ζX | G]] = 0.
The above equality follows from the linearity of expectation, the G-measurability of ζ, and the def-
inition of conditional expectation. Since ζ ∗ ∈ L2 (G), we have (ζ − ζ ∗ ) ∈ L2 (G). Therefore, E[( X −
ζ ∗ )(ζ − ζ ∗ )] = 0.
For any ζ ∈ L²(G), we can write from the linearity of expectation

E(X − ζ)² = E(X − ζ*)² + 2E[(X − ζ*)(ζ* − ζ)] + E(ζ* − ζ)² = E(X − ζ*)² + E(ζ* − ζ)² ⩾ E(X − ζ*)².

That is, ζ* = E[X | G] minimizes E(X − ζ)² over ζ ∈ L²(G).
Lecture-14: Almost sure convergence
1 Point-wise convergence
Consider a random sequence X : Ω → RN defined on a probability space (Ω, F, P), then each Xn ≜ πn ◦ X :
Ω → R is a random variable. There are many possible definitions for convergence of a sequence of random
variables. One idea is to consider X (ω ) ∈ RN as a real valued sequence for each outcome ω, and consider
the limit lim_n X_n(ω) for each outcome ω.
Definition 1.1. A random sequence X : Ω → RN defined on a probability space (Ω, F, P) converges point-
wise to a random variable X∞ : Ω → R, if for all outcomes ω ∈ Ω, we have
lim Xn (ω ) = X∞ (ω ).
n
Remark 1. This is a very strong notion of convergence. Intuitively, what happens on an event of probability zero is not important. We will settle for a weaker notion of convergence, where the sequence of random variables converges point-wise on a set of outcomes with probability one.
Example 2.2 (Almost sure equality). Two random variables X, Y defined on the probability space
(Ω, F, P) are said to be equal a.s. if the following exception set
N ≜ {ω ∈ Ω : X (ω ) ̸= Y (ω )} ∈ F,
has probability measure P( N ) = 0. Then Y is called a version of X, and we can define an equivalence
class of a.s. equal random variables.
Example 2.3 (Almost sure monotonicity). Two random variables X, Y defined on the probability space (Ω, F, P) are said to satisfy X ⩽ Y a.s. if the exception set N ≜ {ω ∈ Ω : X(ω) > Y(ω)} ∈ F has probability measure P(N) = 0.
Definition 3.1 (Almost sure convergence). Suppose the exception set N ∈ F of outcomes where the point-wise limit of (X_n(ω) : n ∈ N) does not exist has zero probability. Let X_∞ be the point-wise limit of the sequence of random variables X : Ω → R^N on the set N^c; then we say that the sequence X converges almost surely to X_∞, and denote it as

lim_n X_n = X_∞ a.s.
Example 3.2 (Convergence almost surely but not everywhere). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([ a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each n ∈ N, we define the scaled
indicator random variable Xn : Ω → {0, 1} such that
X_n(ω) ≜ n 1_{[0,1/n]}(ω).
Let N = {0}; then for any ω ∉ N, there exists m = ⌈1/ω⌉ ∈ N, such that for all n > m, we have X_n(ω) = 0.
That is, limn Xn = 0 a.s. since λ( N ) = 0. However, Xn (0) = n for all n ∈ N.
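The example can be checked computationally: for each ω > 0 the sequence X_n(ω) = n 1_{[0,1/n]}(ω) is eventually zero, while at the single null outcome ω = 0 it diverges. A sketch (the tested values of ω and the window of 100 indices are arbitrary choices):

```python
def X(n, omega):
    """X_n(omega) = n * 1_{[0, 1/n]}(omega) on the unit interval."""
    return n if 0 <= omega <= 1 / n else 0

# For omega > 0, X_n(omega) = 0 for all n past the threshold ceil(1/omega):
for omega in (0.5, 0.1, 0.013):
    m = int(1 / omega) + 1
    assert all(X(n, omega) == 0 for n in range(m + 1, m + 101))

# At the exception point omega = 0 (a Lebesgue-null set), X_n(0) = n diverges:
assert X(10**6, 0.0) == 10**6
```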
4 Convergence in probability
Definition 4.1 (convergence in probability). A random sequence X : Ω → RN defined on the probability
space (Ω, F, P) converges in probability to a random variable X∞ : Ω → R, if limn P( An (ϵ)) = 0 for any
ϵ > 0, where
An (ϵ) ≜ {ω ∈ Ω : | Xn (ω ) − X∞ (ω )| > ϵ} ∈ F.
Remark 2. limn Xn = X∞ a.s. means that for almost all outcomes ω, the difference Xn (ω ) − X∞ (ω ) gets
small and stays small.
Remark 3. limn Xn = X∞ i.p. is a weaker convergence than a.s. convergence, and merely requires that the
probability of the difference Xn (ω ) − X∞ (ω ) being non-trivial becomes small.
Example 4.2 (Convergence in probability but not almost surely). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each k ∈ N, we consider the sequence S_k = ∑_{i=1}^k i, and define integer intervals I_k ≜ {S_{k−1} + 1, …, S_k}. Clearly, the intervals (I_k : k ∈ N) partition the natural numbers, and each n ∈ N lies in some I_{k_n}, such that n = S_{k_n−1} + i_n for i_n ∈ [k_n]. Therefore, for each n ∈ N, we define the indicator random variable X_n : Ω → {0, 1} such that

X_n(ω) = 1_{[(i_n−1)/k_n, i_n/k_n]}(ω).
For any ω ∈ [0, 1], we have X_n(ω) = 1 for infinitely many values of n, since there exist infinitely many (i, k) pairs such that (i − 1)/k ⩽ ω ⩽ i/k; hence lim sup_n X_n(ω) = 1 and lim_n X_n(ω) ≠ 0. However, lim_n X_n = 0 in probability, since

lim_n λ{X_n ≠ 0} = lim_n 1/k_n = 0.
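This "moving indicator" sequence can be simulated directly. The sketch below recovers the block index k_n and position i_n of each n, checks that a fixed ω is hit in every block (hence infinitely often), and that the measure of {X_n ≠ 0} is 1/k_n → 0. The cut-off values (5000 indices, threshold 50, the choice ω = 0.7) are arbitrary.

```python
def block_index(n):
    """Return (k, i): n is the i-th index (1-based) of block I_k = {S_{k-1}+1, ..., S_k}."""
    k, s = 1, 0
    while s + k < n:
        s += k
        k += 1
    return k, n - s

def X(n, omega):
    """Indicator X_n(omega) = 1 on [(i_n - 1)/k_n, i_n/k_n]."""
    k, i = block_index(n)
    return 1 if (i - 1) / k <= omega <= i / k else 0

omega = 0.7
hits = [n for n in range(1, 5000) if X(n, omega) == 1]
assert len(hits) > 50          # omega is covered once per block: hits never stop

k_last, _ = block_index(4999)
assert 1 / k_last < 0.02       # lambda{X_n != 0} = 1/k_n -> 0
```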
Proposition 5.1. Let A ∈ F^N be a sequence of events. Then (a) ω ∈ lim sup_n A_n iff ∑_{j∈N} 1_{A_j}(ω) = ∞, and (b) ω ∈ lim inf_n A_n iff ∑_{j∈N} 1_{A_j^c}(ω) < ∞.

Proof. Let A ∈ F^N be a sequence of events.
(a) Let ω ∈ lim sup_n A_n = ∩_{n∈N} ∪_{k⩾n} A_k; then ω ∈ ∪_{k⩾n} A_k for all n ∈ N. Therefore, for each n ∈ N, there exists k_n ⩾ n such that ω ∈ A_{k_n}, and hence

∑_{j∈N} 1_{A_j}(ω) ⩾ ∑_{n∈N} 1_{A_{k_n}}(ω) = ∞.

Conversely, if ∑_{j∈N} 1_{A_j}(ω) = ∞, then for each n ∈ N there exists k_n ⩾ n such that ω ∈ A_{k_n}, and hence ω ∈ ∪_{k⩾n} A_k for all n ∈ N.
(b) Let ω ∈ lim inf_n A_n = ∪_{n∈N} ∩_{k⩾n} A_k; then there exists n₀(ω) such that ω ∈ A_n for all n ⩾ n₀(ω). Conversely, if ∑_{j∈N} 1_{A_j^c}(ω) < ∞, then there exists n₀(ω) such that ω ∈ A_n for all n ⩾ n₀(ω).
Theorem 5.2 (Convergence a.s. implies in probability). If a sequence of random variables X : Ω → RN defined
on a probability space (Ω, F, P) converges a.s. to a random variable X∞ : Ω → R, then it converges in probability to
the same random variable.
Proof. Let limn Xn = X∞ a.s. and ϵ > 0. We define events An ≜ {ω ∈ Ω : | Xn (ω ) − X∞ (ω )| > ϵ} for each
n ∈ N. We will show that limn P( An ) = 0. To this end, let N be the exception set such that
N ≜ {ω ∈ Ω : lim inf_n X_n(ω) < lim sup_n X_n(ω) or lim sup_n X_n(ω) = ∞}.
For ω ∉ N, there exists an n₀(ω) such that |X_n(ω) − X_∞(ω)| ⩽ ϵ for all n ⩾ n₀(ω). That is, ω ∈ A_n^c for all n ⩾ n₀(ω), and hence N^c ⊆ lim inf_n A_n^c. It follows that 1 = P(lim inf_n A_n^c). Since lim inf_n A_n^c = (lim sup_n A_n)^c, we get 0 = P(lim sup_n A_n) = lim_n P(∪_{k⩾n} A_k) ⩾ lim_n P(A_n) ⩾ 0.
6 Borel-Cantelli Lemma
Proposition 6.1 (Borel-Cantelli Lemma). Let A ∈ FN be a sequence of events such that ∑n∈N P( An ) < ∞, then
P { An i.o.} = 0.
Proof. We can write the probability of infinitely often occurrence of An , by the continuity and sub-additivity
of probability as
P(lim sup_n A_n) = lim_n P(∪_{k⩾n} A_k) ⩽ lim_n ∑_{k⩾n} P(A_k) = 0.
The last equality follows from the fact that ∑n∈N P( An ) < ∞.
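The Borel-Cantelli Lemma predicts that for independent events with P(A_n) = 1/n², only finitely many A_n occur almost surely, since ∑_n 1/n² < ∞. A simulation sketch (the trial count, horizon, and the loose threshold 50 are illustrative; the expected number of occurrences is about ∑_n 1/n² ≈ 1.64):

```python
import random

random.seed(5)
for trial in range(20):
    # Independent events A_n with P(A_n) = 1/n^2, n = 1, ..., 9999.
    occurrences = [n for n in range(1, 10_000) if random.random() < 1 / n**2]
    # Borel-Cantelli: P(A_n infinitely often) = 0, so the count stays small.
    assert len(occurrences) < 50
```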
Proposition 6.2 (Borel zero-one law). Let A ∈ FN be a sequence of independent events, then
P{A_n i.o.} = 0, if ∑_{n∈N} P(A_n) < ∞; and P{A_n i.o.} = 1, if ∑_{n∈N} P(A_n) = ∞.
Since 1 − x ⩽ e^{−x} for all x ∈ R, from the above equation, the continuity of the exponential function, and the hypothesis, we get

1 ⩾ P{A_n i.o.} ⩾ 1 − lim_n lim_m e^{−∑_{k=n}^m P(A_k)} = 1 − lim_n exp(−∑_{k⩾n} P(A_k)) = 1.
Example 6.3 (Convergence in probability can imply almost sure convergence). Consider a random
Bernoulli sequence X : Ω → {0, 1}N defined on the probability space (Ω, F, P) such that P { Xn = 1} = pn
for all n ∈ N. Note that the sequence of random variables is not assumed to be independent, and is not identically distributed when the p_n differ. If lim_n p_n = 0, then we see that lim_n X_n = 0 in probability.
In addition, if ∑n∈N pn < ∞, then limn Xn = 0 a.s. To see this, we define event An ≜ { Xn = 1} ∈ F
for each n ∈ N. Then, applying the Borel-Cantelli Lemma to the sequence of events A ∈ F^N, we get P(lim sup_n A_n) = 0. That is, lim_n X_n(ω) = 0 for ω ∈ lim inf_n A_n^c, implying almost sure convergence.
From the construction, we have lim_k n_k = ∞ and P(A_j) < 2^{−j} for each j ∈ N. Therefore, ∑_{k∈N} P(A_k) < ∞, and hence by the Borel-Cantelli Lemma, we have P(lim sup_k A_k) = 0. Let N = lim sup_k A_k be the exception set; then for any outcome ω ∉ N, for all but finitely many j ∈ N,

|X_{n_j}(ω) − X_∞(ω)| ⩽ 2^{−j}.
A Limits of sequences
Definition A.1. For any real valued sequence a ∈ R^N, we can define lim sup_n a_n ≜ inf_n sup_{k⩾n} a_k and lim inf_n a_n ≜ sup_n inf_{k⩾n} a_k.
Remark 4. We define e_n ≜ sup_{k⩾n} a_k and f_n ≜ inf_{k⩾n} a_k, and observe that f_n ⩽ a_k ⩽ e_n for all k ⩾ n. Since (f_n : n ∈ N) is non-decreasing and (e_n : n ∈ N) is non-increasing with f_n ⩽ e_n for each n, it follows that sup_n f_n ⩽ inf_n e_n, and hence lim inf_n a_n = sup_n f_n ⩽ inf_n e_n = lim sup_n a_n.
Definition A.2. A sequence a ∈ R^N is said to converge if lim sup_n a_n = lim inf_n a_n, and the limit is defined as a_∞ ≜ lim_n a_n = lim sup_n a_n = lim inf_n a_n.
Theorem A.3. A sequence a ∈ RN converges to a∞ ∈ R if for all ϵ > 0 there exists an integer N ∈ N such that for
all n ⩾ N, we have | an − a∞ | < ϵ.
Proof. Let ϵ > 0 and find the integer Nϵ ∈ N such that an ∈ ( a∞ − ϵ, a∞ + ϵ) for all n ⩾ Nϵ . It follows that
a∞ − ϵ ⩽ f n ⩽ en ⩽ a∞ + ϵ for all n ⩾ Nϵ , and hence a∞ − ϵ ⩽ lim infn an ⩽ lim supn an ⩽ a∞ + ϵ. Since ϵ was
arbitrary, it follows that limn an = a∞ .
Proposition A.4. For any sequence a ∈ RN+ , the following statements are true.
(i) If ∑n∈N an < ∞ then limn→∞ ∑k⩾n ak = 0.
(ii) If ∑_{n∈N} a_n = ∞ then ∑_{k⩾n} a_k = ∞ for all n ∈ N.
Proof. We observe that (∑_{k<n} a_k : n ∈ N) is a non-decreasing sequence, and hence lim_{n→∞} ∑_{k<n} a_k = sup_n ∑_{k<n} a_k = ∑_{n∈N} a_n.
(i) It follows that ∑k⩾n ak = ∑n∈N an − ∑k<n ak is a non-increasing sequence with limit 0.
(ii) We can write ∑n∈N an = ∑k<n ak + ∑k⩾n ak . Since the first term is finite for all n ∈ N, it follows that the
second term must be infinite for all n ∈ N.
Lecture-15: L p convergence
1 L p convergence
Definition 1.1 (Convergence in L p ). Let p ⩾ 1, then we say that a random sequence X : Ω → RN defined
on a probability space (Ω, F, P) converges in L p to a random variable X∞ : Ω → R, if
lim_n ∥X_n − X_∞∥_p = 0.
Proposition 1.2 (Convergences L p implies in probability). Consider p ∈ [1, ∞) and a sequence of random
variables X : Ω → RN defined on a probability space (Ω, F, P) such that limn Xn = X∞ in L p , then limn Xn = X∞
in probability.
Proof. Let ϵ > 0, then from Markov's inequality applied to the random variable |X_n − X_∞|^p, we have
P{|X_n − X_∞| > ϵ} ⩽ E|X_n − X_∞|^p / ϵ^p.
The right-hand side vanishes as n → ∞ from the L p convergence, and the result follows.
Example 1.3 (Convergence almost surely doesn't imply convergence in L p). Consider the probability
space ([0, 1], B([0, 1]), λ) such that λ([a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. We define the scaled indicator
random variable X_n : Ω → {0, 2^n} such that
X_n(ω) = 2^n 1_{[0, 1/n]}(ω).
We define N = {0}, and for any ω ∉ N, we can find m ≜ ⌈1/ω⌉, such that for all n > m, we have X_n(ω) = 0.
Since λ(N) = 0, it implies that lim_n X_n = 0 a.s. However, we see that E|X_n|^p = 2^{np}/n → ∞.
Remark 2. Convergence almost surely implies convergence in probability. Therefore, the above example also
shows that convergence in probability doesn't imply convergence in L p.
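Example 1.3 can also be checked numerically. The following sketch is illustrative and not part of the notes; the fixed outcome ω = 0.37 and the index ranges are arbitrary choices, and E|X_n| = 2^n/n is evaluated exactly from the definition.

```python
# Sample paths of X_n(w) = 2^n 1[0, 1/n](w) on ([0,1], B([0,1]), Lebesgue):
# each fixed outcome w > 0 gives an eventually-zero path (almost sure limit 0),
# while the L^1 norms E|X_n| = 2^n / n diverge.

def X(n, w):
    return 2.0**n if w <= 1.0 / n else 0.0

w = 0.37                                  # an arbitrary fixed outcome in (0, 1]
path = [X(n, w) for n in range(1, 50)]
assert all(x == 0.0 for x in path[2:])    # X_n(w) = 0 once n >= ceil(1/w) = 3

def mean(n):                              # exact value of E|X_n| = 2^n / n
    return 2.0**n / n

assert mean(30) > mean(10) > mean(5)      # the L^1 norms blow up
```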
Theorem 1.4 (L2 weak law of large numbers). Consider a sequence of uncorrelated random variables X : Ω →
R^N defined on a probability space (Ω, F, P) such that EX_n = µ and Var(X_n) = σ² for all n ∈ N. Defining the sum
S_n ≜ ∑_{i=1}^n X_i and the n-empirical mean X̄_n ≜ S_n/n, we have lim_n X̄_n = µ in L2 and in probability.
Proof. From the uncorrelatedness of the random sequence X and linearity of expectation, we get
Var(X̄_n) = E(X̄_n − µ)² = (1/n²) E(S_n − nµ)² = σ²/n.
It follows that lim_n X̄_n = µ in L2. Since convergence in L p implies convergence in probability, the result
holds.
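A quick Monte Carlo check of the σ²/n decay is sketched below (illustrative only; the uniform step distribution, the seed, and the sample sizes are arbitrary choices, with µ = 1/2 and σ² = 1/12):

```python
import random

random.seed(0)

def empirical_mean(n):
    # X̄_n for n i.i.d. Uniform[0,1] samples (mean 1/2, variance 1/12)
    return sum(random.random() for _ in range(n)) / n

def var_of_mean(n, trials=2000):
    # sample variance of X̄_n over many independent experiments
    means = [empirical_mean(n) for _ in range(trials)]
    m = sum(means) / trials
    return sum((x - m) ** 2 for x in means) / trials

v10, v100 = var_of_mean(10), var_of_mean(100)
assert abs(v10 - 1 / 120) < 0.005   # Var(X̄_10) ≈ σ²/10 = (1/12)/10
assert v100 < v10                   # the variance shrinks as n grows
```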
Theorem 1.5 (L1 weak law of large numbers). Consider an i.i.d. random sequence X : Ω → R^N defined on a
probability space (Ω, F, P) such that E|X_1| < ∞ and EX_1 = µ. Defining the sum S_n ≜ ∑_{i=1}^n X_i and the n-empirical
mean X̄_n ≜ S_n/n, we have lim_n X̄_n = µ in probability.
Example 1.6 (Convergence in L p doesn't imply almost surely). Consider the probability space
([0, 1], B([0, 1]), λ) such that λ([a, b]) = b − a for all 0 ⩽ a ⩽ b ⩽ 1. For each k ∈ N, we consider the se-
quence S_k = ∑_{i=1}^k i, and define integer intervals I_k ≜ {S_{k−1} + 1, . . . , S_k}. Clearly, the intervals (I_k : k ∈ N)
partition the natural numbers, and each n ∈ N lies in some I_{k_n}, such that n = S_{k_n−1} + i_n for i_n ∈ [k_n].
Therefore, for each n ∈ N, we define the indicator random variable X_n : Ω → {0, 1} such that
X_n(ω) ≜ 1_{[(i_n−1)/k_n, i_n/k_n]}(ω).
For any ω ∈ [0, 1], we have X_n(ω) = 1 for infinitely many values of n, since there exist infinitely many (i, k)
pairs such that (i−1)/k ⩽ ω ⩽ i/k, and hence lim sup_n X_n(ω) = 1 and lim_n X_n(ω) ≠ 0. However,
lim_n X_n = 0 in L p, since
E|X_n|^p = λ{X_n ≠ 0} = 1/k_n → 0.
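The sliding-block construction can be sketched in code (illustrative, not from the notes; the outcome ω = 0.6 and the index ranges are arbitrary choices). Block k consists of k indicators covering [0, 1], so every outcome is hit once per block while each indicator has measure 1/k:

```python
def block_and_offset(n):
    """Return (k_n, i_n) with n = S_{k_n - 1} + i_n, where S_k = k(k+1)/2."""
    k, s = 1, 0
    while s + k < n:
        s += k
        k += 1
    return k, n - s

def X(n, w):
    # X_n is the indicator of the interval [(i_n - 1)/k_n, i_n/k_n]
    k, i = block_and_offset(n)
    return 1.0 if (i - 1) / k <= w <= i / k else 0.0

w = 0.6
hits = [n for n in range(1, 56) if X(n, w) == 1.0]   # n = 55 ends block 10
assert len(hits) >= 10            # w is covered at least once in each block
k55, _ = block_and_offset(55)
assert k55 == 10                  # so E|X_55|^p = 1/k_55 = 1/10
```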
2 L1 convergence theorems
Theorem 2.1 (Monotone Convergence Theorem). Consider a non-decreasing non-negative random sequence
X : Ω → R_+^N defined on a probability space (Ω, F, P), such that X_n ∈ L1 for all n ∈ N. Let X_∞(ω) = sup_n X_n(ω)
for all ω ∈ Ω; then EX_∞ = sup_n EX_n = lim_n EX_n.
Proof. From the monotonicity of expectation, EX_n ⩽ EX_∞ for all n ∈ N, and hence sup_n EX_n ⩽ EX_∞. For the
converse, fix α ∈ (0, 1) and a simple random variable Y ⩽ X_∞, and define the events
E_n ≜ {ω ∈ Ω : X_n(ω) ⩾ αY(ω)} ∈ F.
From the monotonicity of sequence X, the sequence of events E ∈ FN are monotonically non-decreasing
such that ∪n∈N En = Ω. It follows that
αE[Y 1En ] ⩽ E[ Xn 1En ] ⩽ EXn .
We will use the fact that limn E[Y 1En ] = E[Y ], then αEY ⩽ supn EXn . Taking supremum over all α ∈ (0, 1)
and all simple functions Y ⩽ X∞ , we get EX∞ ⩽ supn EXn .
Theorem 2.2 (Fatou’s Lemma). Consider a non-negative random sequence X : Ω → RN + defined on a probability
space (Ω, F, P). Let X∞ (ω ) ≜ lim infn Xn (ω ) for all ω ∈ Ω, then EX∞ ⩽ lim infn EXn .
Proof. We define Y_n ≜ inf_{k⩾n} X_k for all n ∈ N. It follows that Y : Ω → R_+^N is a non-negative non-decreasing
sequence of random variables, and X_∞ = sup_n Y_n = lim_n Y_n. Applying the monotone convergence theorem to
the random sequence Y, we get EX_∞ = sup_n EY_n. The result follows from the monotonicity of expectation, and
the fact that Y_n ⩽ X_k for all k ⩾ n, to get EY_n ⩽ inf_{k⩾n} EX_k.
Theorem 2.3 (Dominated Convergence Theorem). Let X : Ω → RN be a random sequence defined on a proba-
bility space (Ω, F, P). If limn Xn = X∞ a.s. and there exists a Y : Ω → R+ such that Y ∈ L1 and | Xn | ⩽ Y a.s., then
EX∞ = limn EXn .
Proof. From the hypothesis, we have Y + X_n ⩾ 0 a.s. and Y − X_n ⩾ 0 a.s. Therefore, from Fatou's Lemma
and linearity of expectation, we have
EY + EX_∞ ⩽ lim inf_n E(Y + X_n) = EY + lim inf_n EX_n,  EY − EX_∞ ⩽ lim inf_n E(Y − X_n) = EY − lim sup_n EX_n.
Therefore, we have lim sup_n EX_n ⩽ EX_∞ ⩽ lim inf_n EX_n, and the result follows.
3 Convergence theorems for conditional means
Proposition 3.1. Let X : Ω → RN be a random sequence on the probability space (Ω, F, P) such that E | Xn | < ∞
for all n ∈ N. Let G and H be event spaces such that G, H ⊂ F. Then the following theorems hold.
1. Conditional monotone convergence theorem: If 0 ⩽ Xn ⩽ Xn+1 a.s., for all n ∈ N and Xn → X∞ a.s. for
X∞ ∈ L1 , then E[ Xn | G] ↑ E[ X∞ | G] a.s.
2. Conditional Fatou’s lemma: If Xn ⩾ 0 a.s., for all n ∈ N, and lim infn Xn ∈ L1 , then E[lim infn Xn | G] ⩽
lim infn E[ Xn | G] a.s.
4 Uniform integrability
Example 4.2 (Single element family). If |T| = 1, then the family is uniformly integrable, since X_1 ∈ L1
and lim_{a→∞} E[|X_1| 1_{|X_1|>a}] = 0. This is due to the fact that (Y_n ≜ |X_1| 1_{|X_1|⩽n} : n ∈ N) is a non-decreasing
sequence of random variables with lim_n Y_n = |X_1|. From the monotone convergence theorem, we get lim_n EY_n =
E lim_n Y_n = E|X_1|, and hence lim_n E[|X_1| 1_{|X_1|>n}] = E|X_1| − lim_n EY_n = 0.
Proposition 4.3. Let X ∈ L p and ( An : n ∈ N) ⊂ F be a sequence of events such that limn P( An ) = 0, then
lim_n ∥|X| 1_{A_n}∥_p = 0.
Example 4.4 (Dominated family). If there exists Y ∈ L1 such that sup_{t∈T} |X_t| ⩽ |Y|, then the family of
random variables (X_t : t ∈ T) is uniformly integrable. This is due to the fact that E[|X_t| 1_{|X_t|>a}] ⩽ E[|Y| 1_{|Y|>a}] for all t ∈ T, and the right-hand side vanishes as a → ∞ from the single element case.
Example 4.5 (Finite family). If T is finite and X_t ∈ L1 for all t ∈ T, then the family of random variables (X_t : t ∈ T) is uniformly integrable.
This is due to the fact that sup_{t∈T} |X_t| ⩽ ∑_{t∈T} |X_t| ∈ L1.
Theorem 4.6 (Convergence in probability with uniform integrability implies convergence in L p ). Con-
sider a sequence of random variables ( Xn : n ∈ N) ⊂ L p for p ⩾ 1. Then the following are equivalent.
(a) The sequence ( Xn : n ∈ N) converges in L p , i.e. limn E | Xn − X | p = 0.
(b) The sequence ( Xn : n ∈ N) is Cauchy in L p , i.e. limm,n→∞ E | Xn − Xm | p = 0.
(c) The sequence (X_n : n ∈ N) converges in probability to some random variable X, and the family (|X_n|^p : n ∈ N) is uniformly integrable.
Proof. (b) =⇒ (c): From Markov's inequality, we have
P{|X_n − X_m|^p > ϵ} ⩽ (1/ϵ) E|X_n − X_m|^p,
so an L p Cauchy sequence is Cauchy in probability, and hence convergent in probability.
(c) =⇒ (a): Since the sequence (X_n : n ∈ N) is convergent in probability to a random variable X, there
exists a subsequence (n_k : k ∈ N) ⊆ N such that lim_k X_{n_k} = X a.s. Since (|X_n|^p : n ∈ N) is a
uniformly integrable family, by Fatou's Lemma
E|X|^p ⩽ lim inf_k E|X_{n_k}|^p ⩽ sup_n E|X_n|^p < ∞.
Therefore, X ∈ L p, and we define A_n(ϵ) ≜ {|X_n − X| > ϵ} for any ϵ > 0. From Minkowski's inequality,
we get
∥X_n − X∥_p ⩽ ∥(X_n − X) 1_{|X_n−X|^p⩽ϵ}∥_p + ∥X_n 1_{A_n(ϵ)}∥_p + ∥X 1_{A_n(ϵ)}∥_p.
Lecture-16: Weak convergence
1 Convergence in distribution
Definition 1.1 (convergence in distribution). A random sequence X : Ω → RN defined on a probability
space (Ω, F, P) converges in distribution to a random variable X∞ : Ω′ → R defined on a probability space
(Ω′ , F ′ , P′ ) if limn FXn ( x ) = FX∞ ( x ) at all continuity points x of FX∞ . Convergence in distribution is denoted
by limn Xn = X∞ in distribution.
Proposition 1.2. Consider a random sequence X : Ω → R^N defined on a probability space (Ω, F, P) and a random
variable X∞ : Ω′ → R defined on another probability space (Ω′ , F ′ , P′ ). Then the following statements are equivalent.
(a) limn Xn = X∞ in distribution.
(b) limn E[ g( Xn )] = E[ g( X∞ )] for any bounded continuous Borel measurable function g : R → R.
(c) Characteristic functions converge point-wise, i.e. limn Φ Xn (u) = Φ X∞ (u) for each u ∈ R.
Proof. Let X : Ω → RN be a sequence of random variables and let X∞ : Ω′ → R be a random variable. We
will show that ( a) =⇒ (b) =⇒ (c) =⇒ ( a).
(a) =⇒ (b): Applying the bounded convergence theorem to any bounded continuous Borel measurable
function g : R → R, we have lim_n ∫_{x∈R} g(x) dF_{X_n}(x) = ∫_{x∈R} g(x) lim_n dF_{X_n}(x).
Example 1.3 (Convergence in distribution but not in probability). Consider a sequence of non-degenerate
continuous i.i.d. random variables X : Ω → R^N and an independent random variable Y : Ω′ → R, all with the
common distribution F_Y. Then F_{X_n} = F_Y for all n ∈ N, and hence lim_n X_n = Y in distribution. If the common
distribution F_Y is zero mean Gaussian with variance σ², then X_n − Y is zero mean Gaussian with variance
2σ². Therefore, for ϵ < σ√π and all n ∈ N,
P{|X_n − Y| ⩽ ϵ} = (1/√(4πσ²)) ∫_{z∈[−ϵ,ϵ]} e^{−z²/4σ²} dz ⩽ ϵ/(σ√π) < 1.
Hence, the sequence X does not converge to Y in probability.
Proposition 1.4. If lim_n X_n = X_∞ in probability, then lim_n X_n = X_∞ in distribution.
Proof. Let x ∈ R be a continuity point of F_{X_∞}, and choose δ > 0. We consider the event A_n(δ) ≜
{ω ∈ Ω : |X_n(ω) − X_∞(ω)| > δ} = {X_n ∉ [X_∞ − δ, X_∞ + δ]} ∈ F, and define events A_{X_n}(x) ≜ {X_n ⩽ x}
and A_{X_∞}(x) ≜ {X_∞ ⩽ x}. Then, we can write
A_{X_n}(x) ∩ A_{X_∞}(x + δ) ⊆ A_{X_∞}(x + δ),  A_{X_n}(x) ∩ A_{X_∞}(x + δ)^c ⊆ A_n(δ),
A_{X_∞}(x − δ) ∩ A_{X_n}(x) ⊆ A_{X_n}(x),  A_{X_∞}(x − δ) ∩ A_{X_n}(x)^c ⊆ A_n(δ).
From the above set relations, law of total probability, and union bound, we have
F_{X_∞}(x − δ) − P(A_n(δ)) ⩽ F_{X_n}(x) ⩽ F_{X_∞}(x + δ) + P(A_n(δ)).
From the convergence in probability, we have lim_n P(A_n(δ)) = 0, and the result follows by taking δ ↓ 0 at the continuity point x.
Theorem 1.5 (Central Limit Theorem). Consider an i.i.d. random sequence X : Ω → R^N defined on a probability
space (Ω, F, P), with EX_n = µ and Var(X_n) = σ² for all n ∈ N. We define the n-sum as S_n ≜ ∑_{i=1}^n X_i and consider
a standard normal random variable Y : Ω → R with density function f_Y(y) = (1/√(2π)) e^{−y²/2} for all y ∈ R. Then,
lim_n (S_n − nµ)/(σ√n) = Y in distribution.
Proof. The classical proof uses characteristic functions. Let Z_i ≜ (X_i − µ)/σ for all i ∈ N, then the shifted
and scaled n-sum is given by (S_n − nµ)/(σ√n) = (1/√n) ∑_{i=1}^n Z_i. We use the third equivalence in Proposition 1.2 to show
that the characteristic function of (S_n − nµ)/(σ√n) converges to the characteristic function of the standard normal. We define
the characteristic functions
Φ_n(u) ≜ E exp(ju (S_n − nµ)/(σ√n)),  Φ_{Z_i}(u) ≜ E exp(juZ_i),  Φ_Y(u) ≜ E exp(juY).
Completing the square in the Gaussian integral, we get
Φ_Y(u) = (1/√(2π)) ∫_{y∈R} e^{−(y−ju)²/2} exp(−u²/2) dy = e^{−u²/2}.
Since the random sequence Z : Ω → R^N is a zero mean, unit variance i.i.d. sequence, it follows that Φ_{Z_1}^{(1)}(0) = jEZ_1 = 0
and Φ_{Z_1}^{(2)}(0) = j²EZ_1² = −1. Using the Taylor expansion of the characteristic function Φ_{Z_1}, we have
Φ_n(u) = ∏_{i=1}^n E exp(ju (X_i − µ)/(σ√n)) = (Φ_{Z_1}(u/√n))^n = (1 − u²/(2n) + o(u²/n))^n.
Taking the limit, lim_n Φ_n(u) = e^{−u²/2} = Φ_Y(u) for all u ∈ R, and the result follows.
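The theorem can be illustrated with a small simulation (a sketch; the uniform step distribution, n = 400, the seed, and the number of samples are arbitrary choices):

```python
import random, math

random.seed(2)
n, mu, sigma = 400, 0.5, math.sqrt(1 / 12)   # X_i ~ Uniform[0,1]

def standardized_sum():
    # (S_n - nμ) / (σ√n), which should be approximately N(0, 1)
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * math.sqrt(n))

samples = [standardized_sum() for _ in range(5000)]
p0 = sum(z <= 0 for z in samples) / len(samples)
p1 = sum(z <= 1 for z in samples) / len(samples)
assert abs(p0 - 0.5) < 0.03       # Φ(0) = 0.5
assert abs(p1 - 0.8413) < 0.03    # Φ(1) ≈ 0.8413
```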
Theorem 2.2 (L4 strong law of large numbers). Let X : Ω → R^N be a sequence of independent random variables
defined on a probability space (Ω, F, P) with bounded mean EX_n for each n ∈ N and uniformly bounded fourth central
moment sup_{n∈N} E(X_n − EX_n)^4 ⩽ B < ∞. Then, the empirical n-mean S_n/n converges to lim_n ES_n/n almost surely.
Proof. Recall that E(S_n − ES_n)^4 = E(∑_{i=1}^n (X_i − EX_i))^4 = ∑_{i=1}^n E(X_i − EX_i)^4 + 3 ∑_{i=1}^n ∑_{j≠i} E(X_i − EX_i)² E(X_j −
EX_j)². Recall that when the fourth moment is bounded, then so is the second moment. Hence, sup_{i∈N} E(X_i −
EX_i)² ⩽ C for some C ∈ R_+. Therefore, from Markov's inequality, we have
P{|S_n − ES_n| > nϵ} ⩽ E(S_n − ES_n)^4/(nϵ)^4 ⩽ (nB + 3n(n − 1)C²)/(n^4 ϵ^4),
which is summable in n, and the result follows from the Borel–Cantelli Lemma.
Theorem 2.3 (L2 strong law of large numbers). Let X : Ω → RN be a sequence of pair-wise uncorrelated random
variables defined on a probability space (Ω, F, P) with bounded mean EXn for all n ∈ N and uniformly bounded
variance sup_{n∈N} Var(X_n) ⩽ B < ∞. Then, the empirical n-mean S_n/n converges to lim_n ES_n/n almost surely.
Therefore, ∑_{n∈N} P(F_n) < ∞ and ∑_{n∈N} P(G_n) < ∞, and hence by the Borel–Cantelli Lemma, both families of events occur only finitely often almost surely.
The result follows from the fact that for any k ∈ N, there exists n ∈ N such that k ∈ {n², . . . , (n + 1)² − 1},
and hence
|S_k − ES_k|/k ⩽ |S_{n²} − ES_{n²}|/n² + |S_k − S_{n²} − E(S_k − S_{n²})|/n².
Theorem 2.4 (L1 strong law of large numbers). Let X : Ω → R^N be a random sequence defined on a probability
space (Ω, F, P) such that sup_{n∈N} E|X_n| ⩽ B < ∞. Then, the empirical n-mean S_n/n converges to lim_n ES_n/n almost surely.
Lecture-17: Tractable Random Processes
In general, it is very difficult to characterize a stochastic process completely in terms of its finite dimensional
distribution. However, we have listed few analytically tractable examples below, where we can completely
characterize the stochastic process.
F_{X_S}(x_S) = P(∩_{s∈S} {X_s(ω) ⩽ x_s}) = ∏_{s∈S} F(x_s).
Remark 1. It’s easy to verify that the first and the second moments are independent of time indices. That is,
if 0 ∈ T then Xt = X0 in distribution, and we have
Remark 2. That is, for any finite n ∈ N and t ∈ R, the random vectors (X_{s_1}, . . . , X_{s_n}) and (X_{s_1+t}, . . . , X_{s_n+t})
have the identical joint distribution for all s_1 ⩽ . . . ⩽ s_n.
Proof. Let X : Ω → XT be an i.i.d. random process, where T ⊆ R. Then, for any finite index subset S ⊆ T, t ∈ T
and xS ∈ RS , we can write
First equality follows from the definition, the second from the independence of process X, the third from the
identical distribution for the process X. In particular, we have shown that process X is also stationary.
Remark 3. For a stationary stochastic process, all moments, when they exist, are shift invariant.
Definition 1.4. A stochastic process X is second order if it has finite auto-correlation R_X(t, t) < ∞ for all t ∈ T.
Remark 4. This implies R X (t1 , t2 ) < ∞ by Cauchy-Schwartz inequality, and hence the mean, auto-correlation,
and the auto-covariance functions are well defined and finite.
Remark 5. For a stationary process X, we have Xt = X0 and ( Xt , Xs ) = ( Xt−s , X0 ) in distribution. Therefore,
for a second order stationary process X, we have
Example 1.6 (Gaussian process). Let X : Ω → R^R be a continuous-time Gaussian process, defined by its
finite dimensional distributions. In particular, for any finite S ⊂ R, column vector x_S ∈ R^S, mean vector
µ_S ≜ EX_S, and the covariance matrix C_S ≜ E(X_S − µ_S)(X_S − µ_S)^T, the finite-dimensional density is given by
f_{X_S}(x_S) = (1/((2π)^{|S|/2} √det(C_S))) exp(−(1/2)(x_S − µ_S)^T C_S^{−1} (x_S − µ_S)).
Proof. For Gaussian random processes, first and the second moment suffice to get any finite dimensional
distribution. Let X be a wide sense stationary Gaussian process and let S ⊆ R be finite. From the wide
sense stationarity of X, we have EXS = µ1S and
Remark 7. We can think of Sn as the random location of a particle after n steps, where the particle starts
from origin and takes steps of size Xi at the ith step. From the i.i.d. nature of step-size sequence, we observe
that ESn = nEX1 and CS (n, m) = (n ∧ m) Var[ X1 ].
Remark 8. For the process S : Ω → X^N it suffices to look at finite dimensional distributions for finite sets
[n] ⊆ N for all n ∈ N. If the i.i.d. step-size sequence X has a common density function, then from the
transformation of random vectors, we can find the finite dimensional density
f_{S_1,...,S_n}(s_1, s_2, . . . , s_n) = (1/|J(s)|) f_{X_1,...,X_n}(s_1, s_2 − s_1, . . . , s_n − s_{n−1}) = f_{X_1}(s_1) ∏_{i=2}^n f_{X_1}(s_i − s_{i−1}).
Proof. The result follows from stationary and independent increment property of the random walk S.
Remark 9. For a one-dimensional random walk S : Ω → Z_+^N with i.i.d. step size sequence X : Ω → {0, 1}^N
such that P{X_1 = 1} = p, the distribution of the random walk at the nth step S_n is Binomial(n, p). That is,
P{S_n = k} = \binom{n}{k} p^k (1 − p)^{n−k},  k ∈ {0, . . . , n}.
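Remark 9 can be verified empirically (a sketch; n = 10, p = 0.3, the seed, and the number of trials are arbitrary choices):

```python
import random
from math import comb

random.seed(3)
n, p, trials = 10, 0.3, 50000
counts = [0] * (n + 1)
for _ in range(trials):
    s = sum(random.random() < p for _ in range(n))  # S_n after n Bernoulli steps
    counts[s] += 1

# Empirical law of S_n against the Binomial(n, p) pmf
for k in range(n + 1):
    pmf = comb(n, k) * p**k * (1 - p) ** (n - k)
    assert abs(counts[k] / trials - pmf) < 0.01
```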
Example 1.11. Two examples of Lévy processes are Poisson process and Wiener process. The distribution
of Poisson process at time t is Poisson with rate λt and the distribution of Wiener process at time t is zero
mean Gaussian with variance t.
Example 1.12. A random walk S : Ω → XZ+ with i.i.d. step-size sequence X : Ω → XN , is non-stationary with
stationary and independent increments. To see non-stationarity, we observe that the mean mS (n) = nEX1
depends on the step of the random walk. We have already seen the increment process of random walks.
1.5 Markov processes
Definition 1.13. A stochastic process X is Markov if conditioned on the present state, future is independent
of the past. We denote the history of the process until time t as Ft = σ( Xs , s ⩽ t). That is, for any ordered
index set T containing any two indices u > t, we have
Remark 10. We next re-write the Markov property more explicitly for the process X. For all x, y ∈ X, finite
set S ⊆ T such that max S < t < u, and HS = ∩s∈S { Xs ⩽ xs } ∈ Ft , we have
Remark 11. When the state space X is countable, we can write HS = ∩s∈S { Xs = xs } and the Markov property
can be written as
P({X_u = y} | H_S ∩ {X_t = x}) = P({X_u = y} | {X_t = x}).
Remark 12. In addition, when the index set is countable, i.e. T = Z+ , then we can take past as S =
{0, . . . , n − 1}, present as instant n, and the future as n + 1. Then, the Markov property can be written
as
P({ Xn+1 = y} | Hn−1 ∩ { Xn = x }) = P({ Xn+1 = y} | { Xn = x }),
for all n ∈ Z+ , x, y ∈ X.
We will study this process in detail in coming lectures.
Example 1.14. A random walk S : Ω → XZ+ with i.i.d. step-size sequence X : Ω → XN , is a homogeneous
Markov sequence. For any n ∈ Z+ and x, y, s1 , . . . , sn−1 ∈ X, we can write the conditional probability
Lecture-18: Stopping Times
1 Stopping times
Let (Ω, F, P) be a probability space. Consider a random process X : Ω → XT defined on this probability
space with state space X ⊆ R and ordered index set T ⊆ R considered as time.
Definition 1.2. The natural filtration associated with a random process X : Ω → XT is given by F• ≜
(Ft : t ∈ T ) where Ft ≜ σ( Xs , s ⩽ t).
Remark 2. For a random sequence X : Ω → XN , the natural filtration is a sequence F• = (Fn : n ∈ N) of
event spaces Fn ≜ σ( X1 , . . . , Xn ) for all n ∈ N.
Remark 3. If the random sequence X is independent, then the random sequence ( Xn+ j : j ∈ N) is inde-
pendent of the event space σ( X1 , . . . , Xn ).
Example 1.3. For a random walk S with step size sequence X, the natural filtration of the random walk
is identical to that of the step size sequence. That is, σ(S_1, . . . , S_n) = σ(X_1, . . . , X_n) for all n ∈ N. This
follows from the fact that for all n ∈ N, we can write S_j = ∑_{i=1}^j X_i and X_j = S_j − S_{j−1} for all j ∈ [n].
That is, there is a bijection between (X_1, . . . , X_n) and (S_1, . . . , S_n).
Definition 1.4. A random variable τ : Ω → T is called a stopping time with respect to a filtration F• if
(a) the event τ −1 (−∞, t] ∈ Ft for all t ∈ T, and
(b) the random variable τ is finite almost surely, i.e. P {τ < ∞} = 1.
Remark 4. Intuitively, if we observe the process X sequentially, then the event {τ ⩽ t} can be completely
determined by the observation ( Xs , s ⩽ t) until time t. The intuition behind a stopping time is that its
realization is determined by the past and present events but not by future events. That is, given the
history of the process until time t, we can tell whether the stopping time is less than or equal to t or not.
In particular, E[1_{τ⩽t} | F_t] = 1_{τ⩽t} is either one or zero.
Definition 1.5 (First hitting time). For a process X : Ω → XT and any Borel measurable set A ∈ B(X), we
define the first hitting time τXA : Ω → T ∪ {∞} for the process X to hit states in A, as τXA ≜ inf {t ∈ T : Xt ∈ A} .
Example 1.6. We observe that the event {τ_X^A ⩽ t} = {X_s ∈ A for some s ⩽ t} ∈ F_t for all t ∈ T. It follows
that, if τ_X^A is finite almost surely, then τ_X^A is a stopping time with respect to the filtration F_•.
Proposition 1.7. For a random sequence X : Ω → XN , an almost sure finite discrete random variable τ : Ω →
N ∪ {∞} is a stopping time with respect to this random sequence X iff the event {τ = n} ∈ σ ( X1 , . . . , Xn ) for
all n ∈ N.
Proof. From Definition 1.4, we have {τ = n} = {τ ⩽ n} \ {τ ⩽ n − 1} ∈ σ ( X1 , . . . , Xn ). Conversely, from
the theorem hypothesis, it follows that {τ ⩽ n} = ∪nm=1 {τ = m} ∈ σ ( X1 , . . . , Xn ).
Example 1.8. Consider a random sequence X : Ω → X^N, with the natural filtration F_•, and a measurable
set A ∈ B(X). If the first hitting time τ_X^A : Ω → N ∪ {∞} for the sequence X to hit set A is almost
surely finite, then τ_X^A is a stopping time. This follows from the fact that {τ_X^A = n} = ∩_{k=1}^{n−1} {X_k ∉ A} ∩
{X_n ∈ A} ∈ F_n for each n ∈ N.
Definition 1.9. Consider a random process X : Ω → X^{R_+} with discrete state space X ⊆ R. For each state
y ∈ X, we define τ_X^{{y},0} ≜ 0 and inductively define the kth hitting time to a state y after time t = 0, as
τ_X^{{y},k} ≜ inf{t > τ_X^{{y},k−1} : X_t = y},  k ∈ N.
Remark 5. We observe that {τ_X^{{y},k} ⩽ t} ∈ F_t for all times t ∈ R_+. Hence if τ_X^{{y},k} is almost surely finite,
then it is a stopping time for the process X.
Definition 1.10. For a discrete valued random sequence X : Ω → X^N, the number of visits to a state
y ∈ X in the first n time steps is defined as N_y(n) ≜ ∑_{k=1}^n 1_{X_k=y} for all n ∈ N.
Remark 6. We observe that N_y : Ω → Z_+^N is a random walk with the Bernoulli step size sequence
(1_{X_k=y} : k ∈ N). Further, τ_X^{{y},k} = τ_{N_y}^{{k}} = inf{n ∈ N : N_y(n) = k}. We also observe that {N_y(n) ⩽ k} =
{τ_X^{{y},k+1} > n} and {N_y(n) = k} = {τ_X^{{y},k} ⩽ n < τ_X^{{y},k+1}}.
Remark 7. We observe that the number of visits to state y in the first n steps of X is also given by
N_y(n) = sup{k ∈ Z_+ : τ_X^{{y},k} ⩽ n} = inf{k ∈ N : τ_X^{{y},k} > n} − 1 = ∑_{k∈N} 1_{τ_X^{{y},k} ⩽ n}.
This implies that N_y(n) + 1 is the first hitting time to the set of states {n + 1, n + 2, . . .} for the increasing
random sequence (τ_X^{{y},k} : k ∈ N).
Lemma 1.11 (Wald’s Lemma). Consider a random walk S : Ω → RZ+ with i.i.d. step-sizes X : Ω → RN
having finite E | X1 |. Let τ be a finite mean stopping time with respect to this random walk. Then,
E [ S τ ] = E [ X1 ] E [ τ ] .
Proof. Recall that the event spaces generated by the random walk and the step-sizes are identical. From the
independence of step sizes, it follows that X_n is independent of σ(X_0, X_1, . . . , X_{n−1}). Since τ is a stopping
time with respect to the random walk S, we observe that {τ ⩾ n} = {τ ⩽ n − 1}^c ∈ σ(X_0, X_1, . . . , X_{n−1}), and
hence it follows that the random variable X_n and the indicator 1_{τ⩾n} are independent and E[X_n 1_{τ⩾n}] =
EX_1 E1_{τ⩾n}. Therefore,
E[S_τ] = E[∑_{n=1}^τ X_n] = E[∑_{n∈N} X_n 1_{τ⩾n}] = ∑_{n∈N} EX_1 E1_{τ⩾n} = EX_1 E[∑_{n∈N} 1_{τ⩾n}] = E[X_1] E[τ].
We exchanged limit and expectation in the above step, which is not always allowed. We were able to
do it by the application of the dominated convergence theorem.
Corollary 1.12. Consider the stopping time τ_S^{{i}} ≜ min{n ∈ N : S_n = i} for an integer random walk S : Ω →
Z^{Z_+} with i.i.d. step size sequence X : Ω → Z^N. Then, the mean of the stopping time Eτ_S^{{i}} = i/EX_1.
Proof. This follows from Wald's Lemma and the fact that S_{τ_S^{{i}}} = i.
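Wald's Lemma and Corollary 1.12 can be checked by simulation (a sketch; the Bernoulli(p) steps with p = 0.5, the target i = 3, the seed, and the number of trials are arbitrary choices), where Eτ should equal i/EX_1 = i/p:

```python
import random

random.seed(4)
p, i, trials = 0.5, 3, 20000

def hitting_time():
    # τ = first n with S_n = i, for a Bernoulli(p) random walk S
    s, n = 0, 0
    while s < i:
        n += 1
        s += random.random() < p
    return n

avg = sum(hitting_time() for _ in range(trials)) / trials
assert abs(avg - i / p) < 0.2     # Eτ = i / EX_1 = 3 / 0.5 = 6
```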
Proof. Let τ_1, τ_2 be stopping times with respect to a filtration F_• = (F_t : t ∈ T).
i The result follows since the event {min{τ_1, τ_2} > t} = {τ_1 > t} ∩ {τ_2 > t} ∈ F_t.
ii A topological space is called separable if it contains a countable dense set. Since R_+ is separable
and ordered, we assume T = R_+ without any loss of generality. It suffices to show that the event
{τ_1 + τ_2 ⩽ t} ∈ F_t for T = R_+. To this end, we observe that
{τ_1 + τ_2 ⩽ t} = ∪_{s∈Q_+ : s⩽t} {τ_1 ⩽ t − s, τ_2 ⩽ s} ∈ F_t.
Example 1.16. We also observe that the kth hitting time to {1} by a Bernoulli step size sequence X :
Ω → {0, 1}^N is the first hitting time to {k} by the random walk S : Ω → Z_+^{Z_+}. That is,
τ_X^{{1},k} = τ_S^{{k}} = inf{n ∈ N : S_n = k} = τ_S^{{k−1}} + inf{n ∈ N : S_{τ_S^{{k−1}}+n} − S_{τ_S^{{k−1}}} = 1}.
Lemma 1.17. For an i.i.d. Bernoulli random sequence X : Ω → {0, 1}^N with EX_1 ∈ (0, 1), the kth hitting time
to state 1 is a stopping time, and τ_X^{{1},k} = ∑_{i=1}^k Y_i where Y : Ω → N^N is an i.i.d. random sequence distributed
identically to τ_X^{{1},1}.
Proof. When the Bernoulli step size sequence X is i.i.d. with EX_1 = p ∈ (0, 1), we get that P{τ_S^{{1}} = n} =
(1 − p)^{n−1} p for all n ∈ N. It follows that
P{τ_S^{{1}} < ∞} = P(∪_{n∈N} {τ_S^{{1}} = n}) = ∑_{n∈N} P{τ_S^{{1}} = n} = 1.
Hence, the random time τ_S^{{1}} is finite almost surely. We will show that τ_S^{{k}} is finite almost surely
for all k ∈ N by induction. By the induction hypothesis, τ_S^{{k−1}} is finite almost surely. Then S_{τ_S^{{k−1}}+n} −
S_{τ_S^{{k−1}}} = ∑_{j=1}^n X_{τ_S^{{k−1}}+j} is the sum of n i.i.d. Bernoulli random variables independent of τ_S^{{k−1}} by the strong
independence property, and hence has distribution identical to S_n. This implies that τ_S^{{k}} =
τ_S^{{k−1}} + τ̃_S^{{1}}, where τ̃_S^{{1}} has distribution identical to τ_X^{{1},1} and is independent of τ_S^{{k−1}}.
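The geometric structure in the proof can be sketched numerically (illustrative; p = 0.4, the seed, and the sample sizes are arbitrary choices):

```python
import random

random.seed(5)
p, trials = 0.4, 50000

def first_hitting_time():
    # first n with X_n = 1 for i.i.d. Bernoulli(p) steps
    n = 1
    while random.random() >= p:
        n += 1
    return n

times = [first_hitting_time() for _ in range(trials)]
for n in range(1, 6):
    geom = (1 - p) ** (n - 1) * p             # P{τ = n} = (1-p)^(n-1) p
    assert abs(times.count(n) / trials - geom) < 0.01

# The kth hitting time is a sum of k i.i.d. copies, so its mean is k/p.
k = 3
avg = sum(sum(first_hitting_time() for _ in range(k)) for _ in range(5000)) / 5000
assert abs(avg - k / p) < 0.3
```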
Lecture-19: Discrete Time Markov Chains
1 Markov processes
We have seen that i.i.d. sequences are the simplest discrete time random processes. However, they don't
capture correlation across time.
Definition 1.1. A stochastic process X : Ω → XT with state space X and ordered index set T is Markov if
conditioned on the present state Xt , future σ ( Xu , u > t) is independent of the past σ ( Xs , s < t). We denote
the history of the process until time t as Ft ≜ σ ( Xs , s ⩽ t). That is, for any Borel measurable set B ∈ B(X)
any two indices u > t, we have
P({ Xu ∈ B} | Ft ) = P({ Xu ∈ B} | σ( Xt )).
Remark 1. We next re-write the Markov property more explicitly for the process X : Ω → RR . For all x, y ∈ X,
finite set S ⊆ R such that max S < t < u, and HS ( xS ) ≜ ∩s∈S { Xs ⩽ xs } ∈ Ft , we have
P({ Xu ⩽ y} | HS ( xS ) ∩ { Xt ⩽ x }) = P({ Xu ⩽ y} | { Xt ⩽ x }).
Remark 4. The above definition is equivalent to P({X_{n+1} ⩽ x} | F_n) = P({X_{n+1} ⩽ x} | σ(X_n)) for a discrete
time discrete state space Markov chain X, since F_n = σ(H_n(x) : x ∈ X^n) and σ(X_n) = σ({X_n = x}, x ∈ X).
Example 1.4 (Random Walk). A random walk S : Ω → X^N with independent step-size sequence X : Ω →
X^N, is Markov for a countable state space X that is closed under addition. Given a historical event
H_{n−1}(s) ≜ ∩_{k=1}^{n−1} {S_k = s_k} and the current state {S_n = s_n}, we can write the conditional probability
P({S_{n+1} = s_{n+1}} | H_{n−1}(s) ∩ {S_n = s_n}) = P({X_{n+1} = s_{n+1} − s_n} | H_{n−1}(s) ∩ {S_n = s_n})
= P({S_{n+1} = s_{n+1}} | {S_n = s_n}) = P{X_{n+1} = s_{n+1} − s_n}.
The equality in the second line follows from the independence of the step-size sequence. In particular, from
the independence of X_{n+1} from the collection σ(S_0, X_1, . . . , X_n) = σ(S_0, S_1, . . . , S_n).
1.2 Transition probability matrix
Definition 1.5. We denote the set of all probability mass functions over a countable state space X by
M(X) ≜ {ν ∈ [0, 1]^X : ∑_{x∈X} ν_x = 1}.
Definition 1.6. The transition probability matrix at time n is denoted by P(n) ∈ [0, 1]^{X×X}, such that its
(x, y)th entry is p_xy(n) ≜ P({X_{n+1} = y} | {X_n = x}), that is, the transition probability of a
discrete time Markov chain X from a state x ∈ X at time n to a state y ∈ X at time n + 1.
Remark 5. We observe that each row Px (n) ≜ ( p xy (n) : y ∈ X) ∈ M(X) is the conditional distribution of Xn+1
given the event { Xn = x }.
v Every finite doubly stochastic matrix has a left and a right eigenvector with unit eigenvalue. This follows
from the fact that for a finite doubly stochastic matrix A, both A and A^T are stochastic and have the common
right eigenvector 1. It follows that A also has the left eigenvector 1^T.
vi For a probability transition matrix P(n), we have ∑y∈X f (y) p xy (n) = E[ f ( Xn+1 ) | { Xn = x }].
Example 1.9 (Integer random walk). For a one-dimensional integer valued random walk S : Ω → ZN with
i.i.d. unit step size sequence X : Ω → {−1, 1}N such that P { X1 = 1} = p, the transition operator P ∈ [0, 1]Z×Z
is given by the entries p xy = p1{y= x+1} + (1 − p)1{y= x−1} for all x, y ∈ Z.
Definition 1.11. Consider a time homogeneous Markov chain X : Ω → XZ+ with countable state space X
and transition matrix P. We would respectively denote the conditional probability of events and conditional
expectation of random variables, conditioned on the initial state { X0 = x }, by
P_x(A) ≜ P(A | {X_0 = x}),  E_x[Y] ≜ E[Y | {X_0 = x}].
Proposition 1.12. Conditioned on the initial state, any finite dimensional distribution of a homogeneous Markov
chain is stationary. That is, for any finite n, m ∈ Z+ and states x0 , . . . , xn ∈ X, we have
P(∩_{i=1}^n {X_i = x_i} | {X_0 = x_0}) = P(∩_{i=1}^n {X_{m+i} = x_i} | {X_m = x_0}) = ∏_{i=1}^n p_{x_{i−1} x_i}.
Proof. Consider a homogeneous Markov chain X : Ω → XZ+ with natural filtration F• such that Fn ≜
σ( X0 , . . . , Xn ) for each n ∈ N. Using the property of conditional probabilities and Markovity of X, we
can write the conditional probability of sample path ( X1 , . . . , Xn ) given the event { X0 = x0 } as
P(∩_{i=1}^n {X_i = x_i} | {X_0 = x_0}) = ∏_{i=1}^n P({X_i = x_i} | ∩_{j=0}^{i−1} {X_j = x_j}) = ∏_{i=1}^n P({X_i = x_i} | {X_{i−1} = x_{i−1}}).
Using the property of conditional probabilities and Markovity of X, we can write the conditional probability
of sample path ( Xm+1 , . . . , Xm+n ) given the event { Xm = x0 } as
P(∩_{i=1}^n {X_{m+i} = x_i} | {X_m = x_0}) = ∏_{i=1}^n P({X_{m+i} = x_i} | {X_{m+i−1} = x_{i−1}}).
From time-homogeneity of transition probabilities of Markov chain X, it follows that both the transition
probabilities are identical and equal to ∏in=1 p xi−1 ,xi .
Corollary 1.13. The n-step transition probabilities are stationary for any homogeneous Markov chain. That is, for
any states x0 , xn ∈ X and n, m ∈ N, we have P({ Xn+m = xn } | { Xm = x0 }) = P({ Xn = xn } | { X0 = x0 }).
Proof. It follows from summing over all possible paths (X_0, . . . , X_n) and (X_m, . . . , X_{m+n}). In particular, we
can partition the events {X_n = x_n} and {X_{m+n} = x_n} in terms of unions over disjoint paths
{X_n = x_n} = ∪_{x∈X^{n−1}} ∩_{i=1}^n {X_i = x_i},  {X_{m+n} = x_n} = ∪_{x∈X^{n−1}} ∩_{i=1}^n {X_{m+i} = x_i}.
The result follows from the countable additivity of conditional probability for disjoint events of taking
distinct paths, and the fact that probability of taking same path is identical for both sums.
Example 1.14 (Integer random walk). The time homogeneous Markov chain in Example 1.9 can be repre-
sented by an infinite state weighted graph G = (Z, E, w), where the edge set is
E = {(n, n + 1) : n ∈ Z} ∪ {(n, n − 1) : n ∈ Z} .
We have plotted the sub-graph of the entire transition graph for states {−1, 0, 1} in Figure 1.
Example 1.15 (Sequence of experiments). The time homogeneous Markov chain in Example 1.10 can be
represented by the following two-state weighted transition graph G = ({0, 1} , E, w), plotted in Figure 2.
Figure 1: Sub-graph of the entire transition graph for an integer random walk with i.i.d. step-sizes in {−1, 1}
with probability p for the positive step.
Figure 2: Markov chain for the sequence of experiments with two outcomes.
P { f n ( x, Zn ) = y} = p xy (n).
Proof. It suffices to show that every transition matrix P(n) has a random mapping representation. Then, for
the mapping f n and the i.i.d. sequence Z : Ω → ΛN , we would have Xn = f n ( Xn−1 , Zn ) for all n ∈ N.
Let Λ ≜ [0, 1], and we choose the i.i.d. uniform sequence Z : Ω → ΛN . Since X is countable, it can be
ordered. We let X = N without any loss of generality. We set F_{xy}(n) ≜ ∑_{w⩽y} p_{xw}(n) and define the function
f_n : X × Λ → X for all pairs (x, z) ∈ X × Λ by
f_n(x, z) ≜ min{y ∈ X : F_{x,y}(n) ⩾ z}.
Then f_n(x, Z_n) is a discrete random variable taking the value y ∈ X iff the uniform random variable Z_n lies in
the interval (F_{x,y−1}(n), F_{x,y}(n)]. That is, the event {f_n(x, Z_n) = y} = {Z_n ∈ (F_{x,y−1}(n), F_{x,y}(n)]} for all y ∈ X.
It follows that
P{f_n(x, Z_n) = y} = P{F_{x,y−1}(n) < Z_n ⩽ F_{x,y}(n)} = F_{x,y}(n) − F_{x,y−1}(n) = p_{xy}(n).
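The construction can be sketched for a toy two-state chain (illustrative; the state space {0, 1} and the matrix entries are arbitrary choices):

```python
import random

random.seed(6)
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}   # toy transition probabilities p_xy

def f(x, z):
    """Random mapping: the smallest y with cumulative row sum F_{x,y} >= z."""
    cum = 0.0
    for y, pxy in enumerate(P[x]):
        cum += pxy
        if z <= cum:
            return y
    return len(P[x]) - 1

trials = 50000
freq = sum(f(0, random.random()) == 1 for _ in range(trials)) / trials
assert abs(freq - 0.3) < 0.01        # P{f(0, Z) = 1} reproduces p_01 = 0.3

# Iterating X_n = f(X_{n-1}, Z_n) generates a path of the chain.
x, path = 0, []
for _ in range(10):
    x = f(x, random.random())
    path.append(x)
assert all(s in (0, 1) for s in path)
```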
Lecture-20: Strong Markov Property
1 n-step transition
Definition 1.1. For a time homogeneous Markov chain X : Ω → X^{Z_+}, we can define the n-step transition
probability matrix P^{(n)}, with its (x, y) entry being the n-step transition probability for X_{m+n} to be in state
y given the event {X_m = x}. That is, p_xy^{(n)} ≜ P({X_{n+m} = y} | {X_m = x}) for all x, y ∈ X and m, n ∈ Z_+.
Remark 1. That is, the row P_x^{(n)} = (p_xy^{(n)} : y ∈ X) ∈ M(X) is the conditional distribution of X_n given the initial
state {X_0 = x}.
Theorem 1.2. The n-step transition probabilities for a homogeneous Markov chain form a semi-group. That is, for
all positive integers m, n ∈ Z+
P(m+n) = P(m) P(n) .
Proof. The events {{X_m = z} : z ∈ X} partition the sample space Ω, and hence we can express the event
{X_{m+n} = y} as the disjoint union ∪_{z∈X} ({X_m = z} ∩ {X_{m+n} = y}).
It follows from the Markov property and the law of total probability that for any states x, y and positive integers
m, n, we have p_xy^{(m+n)} = ∑_{z∈X} p_xz^{(m)} p_zy^{(n)}.
Corollary 1.3. The n-step transition probability matrix is given by P^{(n)} = P^n for any positive integer n.
Proof. In particular, we have P^{(n+1)} = P^{(n)} P^{(1)} = P^{(1)} P^{(n)}. Since P^{(1)} = P, we have P^{(n)} = P^n by induction.
Definition 1.4. For a time homogeneous Markov chain X : Ω → XZ+ we denote the probability mass func-
tion of Markov chain at step n by πn ∈ M(X).
Lemma 1.5 (Chapman Kolmogorov). The right multiplication of a probability vector by the transition matrix P transforms the probability distribution of the current state into the probability distribution of the next state. That is, π_n = π_{n−1} P for all n ∈ N.
Proof. To see this, we fix y ∈ X, and from the law of total probability and the definition of conditional probability, we observe that
π_n(y) = ∑_{x∈X} P({X_n = y} | {X_{n−1} = x}) P{X_{n−1} = x} = ∑_{x∈X} π_{n−1}(x) p_{xy}.
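The semigroup property and the Chapman-Kolmogorov lemma above can be checked numerically. A minimal sketch with a hypothetical two-state transition matrix (all numerical values are illustrative):

```python
def mat_mul(A, B):
    """Multiply matrices given as lists of rows."""
    n, m, r = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(r)] for i in range(n)]

# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]

# n-step transition matrix P^(4) = P^4 by repeated multiplication (Corollary 1.3).
P4 = P
for _ in range(3):
    P4 = mat_mul(P4, P)

# Semigroup property: P^(2+2) = P^(2) P^(2).
P2 = mat_mul(P, P)
P22 = mat_mul(P2, P2)

# Lemma 1.5: the distribution evolves as pi_n = pi_{n-1} P, so pi_4 = pi_0 P^4.
pi0 = [0.3, 0.7]
pi4 = mat_mul([pi0], P4)[0]
pi_step = pi0
for _ in range(4):
    pi_step = mat_mul([pi_step], P)[0]
```

Both routes to the step-4 distribution agree, and P^{(4)} matches P^{(2)}P^{(2)} entrywise.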
2 Strong Markov property (SMP)
We are interested in generalizing the Markov property to random times. For a DTMC X : Ω → X^{Z_+} and a random variable τ : Ω → N, we are interested in knowing whether for any historical event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n} and any states x, y ∈ X, we have
P({X_{τ+1} = y} | H_{τ−1} ∩ {X_τ = x}) = p_{xy}.
Example 2.1 (Two-state DTMC). Consider the two-state Markov chain X ∈ {0, 1}^{Z_+} such that P_0{X_1 = 1} = q and P_1{X_1 = 0} = p for p, q ∈ [0, 1]. Let τ : Ω → N be a random variable defined as
Definition 2.2. Let τ : Ω → N be a stopping time with respect to a random sequence X : Ω → X^{Z_+}. Then for all states x, y ∈ X and the event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n}, the process X satisfies the strong Markov property if
P({X_{τ+1} = y} | {X_τ = x} ∩ H_{τ−1}) = P({X_{τ+1} = y} | {X_τ = x}).
Lemma 2.3. Homogeneous Markov chains satisfy the strong Markov property.
Proof. Let X : Ω → X^{Z_+} be a homogeneous DTMC with transition matrix P, and τ : Ω → N be an associated stopping time. We take any historical event H_{τ−1} = ∩_{n=0}^{τ−1} {X_n = x_n} and states x, y ∈ X. From the definition of conditional probability, the law of total probability, and the Markovity of the process X, we have
P({X_{τ+1} = y} | {X_τ = x} ∩ H_{τ−1}) = ∑_{n∈Z_+} P({X_{n+1} = y} | {τ = n} ∩ {X_n = x} ∩ H_{n−1}) P({τ = n} | {X_τ = x} ∩ H_{τ−1}) = p_{xy} ∑_{n∈Z_+} P({τ = n} | {X_τ = x} ∩ H_{τ−1}) = p_{xy}.
This equality follows from the fact that the event {τ = n} is completely determined by (X_0, . . . , X_n).
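The strong Markov property can also be observed empirically. The sketch below is illustrative (the two-state chain and all probabilities are hypothetical): τ is the first hitting time of state 1, and the empirical distribution of X_{τ+1} should match the row p_{1·} regardless of the pre-τ history.

```python
import random

# Hypothetical two-state chain; P[x][1] is the probability of moving to state 1.
P = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}

def step(x):
    """One transition of the chain from state x."""
    return 1 if random.random() < P[x][1] else 0

random.seed(2)
trials = 100_000
count_after_tau = {0: 0, 1: 0}
for _ in range(trials):
    x = 0
    while x != 1:            # run until tau = first hitting time of state 1
        x = step(x)
    count_after_tau[step(1)] += 1   # take one more step from X_tau = 1

p10_hat = count_after_tau[0] / trials   # should approximate p_{10} = 0.3
```

The history before τ (which varies across trials) has no visible effect on the post-τ transition frequencies, as the lemma asserts.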
Remark 2. Consider a homogeneous DTMC X : Ω → X^{Z_+} and the first instant τ_k ≜ τ_X^{{y},k} for the process X to hit a state y ∈ X for the kth time. Recall that τ_0 ≜ 0 and the recurrence time H_k ≜ τ_k − τ_{k−1} = inf{n ∈ N : X_{τ_{k−1}+n} = y} for all k ∈ N. We define a process Y : Ω → X^{Z_+} where Y_m ≜ X_{τ_k+m} for all m ∈ Z_+. If τ_k is almost surely finite, then it is a stopping time with respect to the process X. Using the strong Markov property of the DTMC X, we will show that Y is a stochastic replica of X with X_0 = y.
Remark 4. If τ_{k−1} is almost surely finite, then τ_{k−1} is a stopping time for the process X. From the strong Markov
property of homogeneous DTMC X applied to stopping time τk−1 , it follows that the future σ( Xτk−1 + j : j ∈
N) is independent of the past σ ( X0 , . . . , Xτk−1 ) given the present σ ( Xτk−1 ). Since Xτk−1 = y for k ⩾ 2 deter-
ministically, it follows that σ ( Xτk−1 ) is a trivial event space and the future σ ( Xτk−1 + j : j ∈ N) is independent
of the random past σ ( X0 , . . . , Xτk−1 ). We further observe that the distribution of σ ( Xτk−1 + j : j ∈ N) is iden-
tical to distribution of X given X0 = y. Thus, the process ( Xτk−1 + j : j ∈ N) is distributed identically for all
k ⩾ 2.
Remark 5. We observe that the recurrence time satisfies { Hk = n} ∈ σ ( Xτk−1 + j : j ∈ [n]) for all n ∈ N, and
hence the recurrence time Hk is independent of the random past σ ( X0 , . . . , Xτk−1 ). Recursively applying this
fact, we can conclude that ( H1 , . . . , Hk ) are independent random variables. Further, since ( Xτk−1 + j : j ∈ N)
is distributed identically for all k ⩾ 2, it follows that ( Hk : k ⩾ 2) are distributed identically.
Lemma 3.1. If H_1 and H_2 are almost surely finite, then the random sequence (H_k : k ⩾ 2) is i.i.d.
Proof. From the above two remarks, it suffices to show that each term of the random sequence τ : Ω → N^N is almost surely finite. We will show this by induction. Since τ_1 = H_1 is almost surely finite, it follows that τ_1 is a stopping time. Since τ_2 = τ_1 + H_2 is almost surely finite, it follows that τ_2 is a stopping time. By the inductive hypothesis τ_{k−1} is almost surely finite, and hence H_k is independent of (H_1, . . . , H_{k−1}), identically distributed to H_2, and almost surely finite. It follows that τ_k = τ_{k−1} + H_k is almost surely finite, and the result follows.
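A quick empirical illustration of the lemma, under assumed parameters for a two-state chain as in Example 2.1: every inter-visit gap to state 1 begins at state 1, so the gaps play the role of (H_k : k ⩾ 2), and their empirical mean should estimate the mean recurrence time μ_{11} = (p + q)/q.

```python
import random

# Hypothetical parameters: P_1{X_1 = 0} = p, P_0{X_1 = 1} = q.
p, q = 0.3, 0.5
random.seed(3)

def step(x):
    """One transition of the two-state chain."""
    if x == 0:
        return 1 if random.random() < q else 0
    return 0 if random.random() < p else 1

x, last_visit, gaps = 0, None, []
for n in range(1, 400_001):
    x = step(x)
    if x == 1:
        if last_visit is not None:
            gaps.append(n - last_visit)   # an inter-visit time H_k to state 1
        last_visit = n

mean_H = sum(gaps) / len(gaps)
# stationary probability of state 1 is q/(p+q), so mu_11 = (p+q)/q = 1.6 here
```

The empirical mean of the gaps concentrates around the assumed μ_{11} = 1.6, consistent with the i.i.d. structure of (H_k : k ⩾ 2).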
Lecture-21: Recurrent and transient states
Proof. We can write the event of zero visits to state y as {N_y(∞) = 0} = {τ_1 = ∞}. Further, we can write
Corollary 1.7. For a homogeneous Markov chain X, we have P_x{N_y(∞) < ∞} = 1_{f_yy < 1} + (1 − f_xy) 1_{f_yy = 1}.
Proof. We can write the event {N_y(∞) < ∞} as a disjoint union of the events {N_y(∞) = k} to get the result.
Remark 4. For a time homogeneous Markov chain X : Ω → X^{Z_+}, we have
(i) P_x{N_y(∞) = ∞} = f_xy 1_{f_yy = 1}, and
(ii) P_y{N_y(∞) = ∞} = 1_{f_yy = 1}.
Corollary 1.8. The mean number of visits to state y, starting from a state x, is E_x N_y(∞) = (f_xy/(1 − f_yy)) 1_{f_yy < 1} + ∞ 1_{f_xy > 0, f_yy = 1}.
ii A recurrent state is visited infinitely often almost surely. This also follows from Corollary 1.7, since P_y{N_y(∞) < ∞} = 0 for all recurrent states y ∈ X.
iii In a finite state Markov chain, not all states can be transient.
Proof. To see this, we assume that for a finite state space X, all states y ∈ X are transient. Then, we know that N_y(∞) is finite almost surely for all states y ∈ X. It follows that, for any initial state x ∈ X,
0 ⩽ P_x{∑_{y∈X} N_y(∞) = ∞} = P_x(∪_{y∈X} {N_y(∞) = ∞}) ⩽ ∑_{y∈X} P_x{N_y(∞) = ∞} = 0.
It follows that ∑_{y∈X} N_y(∞) is also finite almost surely for the finite state space X. However, we know that ∑_{y∈X} N_y(∞) = ∑_{k∈N} ∑_{y∈X} 1_{X_k = y} = ∞. This is a contradiction.
Proposition 1.9. For a homogeneous DTMC X : Ω → X^{Z_+}, a state y ∈ X is recurrent iff ∑_{k∈N} p_{yy}^{(k)} = ∞, and transient iff ∑_{k∈N} p_{yy}^{(k)} < ∞.
Proof. Recall that the mean number of visits to a state y ∈ X is E_y N_y(∞) = ∑_{k∈N} p_{yy}^{(k)}, which is finite iff the state is transient and infinite iff the state is recurrent.
Corollary 1.10. For a transient state y ∈ X, the following limits hold: lim_{n→∞} p_{xy}^{(n)} = 0 and lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = 0.
Proof. For a transient state y ∈ X and any state x ∈ X, we have E_x N_y(∞) = ∑_{n∈N} p_{xy}^{(n)} < ∞. Since the series sum is finite, the terms of the sequence vanish in the limit, i.e. lim_{n→∞} p_{xy}^{(n)} = 0. Further, we can write ∑_{k=1}^n p_{xy}^{(k)} ⩽ E_x N_y(∞) ⩽ M for some M ∈ N, and hence lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = 0.
Lemma 1.11. For any state y ∈ X, let H : Ω → N^N be the sequence of almost surely finite inter-visit times to state y, and N_y(n) = ∑_{k=1}^n 1_{X_k = y} be the number of visits to state y in the first n steps. Then, N_y(n) + 1 is a finite mean stopping time with respect to the sequence H.
Proof. We first observe that N_y(n) + 1 ⩽ n + 1, and hence has a finite mean for each n ∈ N. Further, we observe that {N_y(n) + 1 = k} can be completely determined by observing H_1, . . . , H_k, since
{N_y(n) + 1 = k} = {∑_{j=1}^{k−1} H_j ⩽ n < ∑_{j=1}^{k} H_j} ∈ σ(H_1, . . . , H_k).
Theorem 1.12. Let x, y ∈ X be such that f_xy = 1 and y is recurrent. Then, lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = 1/μ_yy.
Proof. Let y ∈ X be recurrent. The proof consists of three parts. In the first two parts, we will show that, starting from the state y, the limiting empirical average of the mean number of visits to state y is lim_{n→∞} (1/n) E_y N_y(n) = 1/μ_yy. In the third part, we will show that for any starting state x ∈ X such that f_xy = 1, the limiting empirical average of the mean number of visits to state y is lim_{n→∞} (1/n) E_x N_y(n) = 1/μ_yy.
Lower bound: We observe that N_y(n) + 1 is a stopping time with respect to the inter-visit times H from Lemma 1.11. Further, we have ∑_{j=1}^{N_y(n)+1} H_j > n. Applying Wald's Lemma to the random sum ∑_{j=1}^{N_y(n)+1} H_j, we get E_y(N_y(n) + 1) μ_yy > n. Taking limits, we obtain lim inf_{n∈N} (1/n) ∑_{k=1}^n p_{yy}^{(k)} ⩾ 1/μ_yy.
Upper bound: Let X_0 = y and consider a fixed positive integer M ∈ N. Then H is i.i.d., and we define the truncated recurrence times H̄ : Ω → [M]^N by H̄_j ≜ M ∧ H_j for all j ∈ N. It follows that the sequence H̄ is i.i.d. and H̄_j ⩽ H_j for all j ∈ N. We define the mean of the truncated recurrence times as μ̄_yy ≜ E_y H̄_1. From the monotonicity of truncation, we get μ̄_yy ⩽ μ_yy.
We define the random variable τ̄k ≜ ∑kj=1 H̄j for all k ∈ N, and τ̄k ⩽ τk for all k ∈ N. We can define
the associated counting process that counts the number of truncated recurrences in first n steps as
N̄y (n) ≜ ∑k∈N 1{τ̄k ⩽n} for all n ∈ N. We conclude that N̄y (n) + 1 is a stopping time with respect to
i.i.d. process H̄, and N̄y (n) ⩾ Ny (n) sample path wise. Further, we have
∑_{j=1}^{N̄_y(n)+1} H̄_j = τ̄_{N̄_y(n)+1} = τ̄_{N̄_y(n)} + H̄_{N̄_y(n)+1} ⩽ n + M.
Applying Wald's Lemma to the stopping time N_y(n) + 1 with respect to the i.i.d. sequence H and to the stopping time N̄_y(n) + 1 with respect to the i.i.d. sequence H̄, and using the monotonicity of expectation, we get E_y(N_y(n) + 1) μ̄_yy ⩽ E_y(N̄_y(n) + 1) μ̄_yy ⩽ n + M. Hence lim sup_{n∈N} (1/n) ∑_{k=1}^n p_{yy}^{(k)} ⩽ 1/μ̄_yy, and taking M → ∞ gives the matching upper bound 1/μ_yy by the monotone convergence theorem.
Third part: Writing the first-passage decomposition of the n-step transition probabilities, we have
∑_{k=1}^n p_{xy}^{(k)} = ∑_{k=1}^n ∑_{s=0}^{k−1} f_{xy}^{(s)} p_{yy}^{(k−s)} = ∑_{s=0}^{n−1} f_{xy}^{(s)} ∑_{k−s=1}^{n−s} p_{yy}^{(k−s)} = ∑_{s=0}^{n−1} f_{xy}^{(s)} ∑_{k=1}^{n} p_{yy}^{(k)} − ∑_{s=0}^{n−1} f_{xy}^{(s)} ∑_{n−s<k⩽n} p_{yy}^{(k)}.
Since the series ∑_{k∈N} f_{xy}^{(k)} converges (to f_xy = 1), we get lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = lim_{n→∞} (1/n) ∑_{k=1}^n p_{yy}^{(k)}.
Lecture-22: Communicating classes
1 Communicating classes
Definition 1.1. For states x, y ∈ X, the state y is said to be accessible from state x if p_{xy}^{(n)} > 0 for some n ∈ Z_+, denoted by x → y. If two states x, y ∈ X are accessible from each other, they are said to communicate with each other, denoted by x ↔ y.
Proposition 1.2. Communication is an equivalence relation.
Proof. A relation on the state space X is a subset of the product X × X. Communication is a relation on the state space X, as it relates two states x, y ∈ X. To show equivalence, we have to show reflexivity, symmetry, and transitivity of the relation.
Reflexivity: Since P^0 = I, it follows that p_{xx}^{(0)} = 1 > 0 for each state x ∈ X, and hence x ↔ x.
Symmetry: If x ↔ y, then we know that x → y and y → x, and hence y ↔ x.
Transitivity: Suppose x ↔ y and y ↔ z. Let m, n ∈ Z_+ be such that p_{xy}^{(m)} > 0 and p_{yz}^{(n)} > 0. Then by the Chapman-Kolmogorov equation, we have p_{xz}^{(m+n)} = ∑_{w∈X} p_{xw}^{(m)} p_{wz}^{(n)} ⩾ p_{xy}^{(m)} p_{yz}^{(n)} > 0. This implies x → z, and by similar arguments one can show that z → x, and the transitivity follows.
Remark 1. Since the communication is an equivalence relation, it partitions state space X into equivalence
classes.
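Communicating classes can be computed mechanically as the equivalence classes of mutual reachability in the transition graph. A minimal sketch on a hypothetical three-state chain (states and probabilities are illustrative, not from the notes):

```python
from collections import deque

# Hypothetical chain: {a, b} communicate; c reaches b but nothing returns to c.
P = {
    'a': {'a': 0.5, 'b': 0.5},
    'b': {'a': 0.5, 'b': 0.5},
    'c': {'b': 1.0},
}

def reachable(x):
    """States accessible from x (x itself included, since p_xx^(0) = 1 > 0)."""
    seen, queue = {x}, deque([x])
    while queue:
        u = queue.popleft()
        for v, p in P[u].items():
            if p > 0 and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

reach = {x: reachable(x) for x in P}
# x and y communicate iff each is reachable from the other.
classes = {frozenset(y for y in P if y in reach[x] and x in reach[y]) for x in P}
```

For this toy chain the classes are {a, b} (closed) and {c} (open), matching the partition guaranteed by the equivalence relation.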
Definition 1.3. Each equivalence class of communication relation is called a communicating class. A prop-
erty of states is said to be a class property if for each communicating class C , either all states in C have the
property, or none do.
Definition 1.4. A set of states that communicate with each other is called a communicating class.
(i) A communicating class C is called closed if no edges leave the class. That is, for all edges (x, y) ∈ C × C^c, we have p_xy = 0.
(ii) An open communicating class is one that is not closed, i.e. there exists an edge that leaves the class. That is, there exists an edge (x, y) ∈ C × C^c such that p_xy > 0.
Proof. Consider two states x ≠ y ∈ X and integers m, n ∈ N such that p_{xy}^{(m)} p_{yx}^{(n)} > 0. Suppose s ∈ T(x), that is, p_{xx}^{(s)} > 0. Then
p_{yy}^{(n+m)} ⩾ p_{yx}^{(n)} p_{xy}^{(m)} > 0,  p_{yy}^{(n+s+m)} ⩾ p_{yx}^{(n)} p_{xx}^{(s)} p_{xy}^{(m)} > 0.
Hence d(y) | n + m and d(y) | n + s + m, and therefore d(y) | s for any s ∈ T(x). In particular, this implies that d(y) | d(x). By symmetric arguments, we get d(x) | d(y). Hence d(x) = d(y).
Definition 1.8. For an irreducible chain, the period of the chain is defined to be the period which is common
to all states. An irreducible Markov chain is called aperiodic if the single communicating class is aperiodic.
Proposition 1.9. If the transition matrix P is aperiodic and irreducible, then there is an integer r_0 such that p_{xy}^{(r)} > 0 for all x, y ∈ X and r ⩾ r_0.
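The period d(x) = gcd{n ∈ N : p_{xx}^{(n)} > 0} can be computed from boolean powers of the transition matrix. A small sketch under assumed example chains (a deterministic 3-cycle and a lazy two-state chain, both hypothetical):

```python
from math import gcd

def bool_mul(A, B):
    """Boolean matrix product: entry (i, j) is True iff some path i -> k -> j exists."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def period(P, x, horizon=30):
    """gcd of return times {n <= horizon : p_xx^(n) > 0}; horizon must see a return."""
    boolP = [[v > 0 for v in row] for row in P]
    power, d = boolP, 0
    for n in range(1, horizon + 1):
        if power[x][x]:
            d = gcd(d, n)
        power = bool_mul(power, boolP)
    return d

P_cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]   # deterministic 3-cycle: period 3
P_lazy = [[0.5, 0.5], [0.5, 0.5]]             # self-loops present: aperiodic

d_cycle = period(P_cycle, 0)
d_lazy = period(P_lazy, 0)
```

Only the support of P matters for the period, which is why boolean arithmetic suffices.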
Corollary 1.11. If y is a recurrent state and there exists a state x such that y → x, then x → y and f xy = 1.
Proof. Let y ∈ X be a recurrent state, and consider a state x ∈ X such that y → x. We will show that f_xy = 1, and hence f_{xy}^{(n)} > 0 for some n ∈ Z_+ and x → y. To this end, we observe that since y → x, there exists an integer n ∈ Z_+ such that the probability of hitting state x for the first time in n steps, starting from state y, is positive. That is,
f_{yx}^{(n)} ≜ P_y{X_n = x, X_{n−1} ≠ x, . . . , X_1 ≠ x} = P_y{τ_X^{{x},1} = n} > 0.
From the strong Markov property, we have
1 − f_yy = P_y{τ_X^{{y},1} = ∞} ⩾ P_y{τ_X^{{y},1} = ∞, τ_X^{{x},1} = n} = P_x{τ_X^{{y},1} = ∞} P_y{τ_X^{{x},1} = n} = f_{yx}^{(n)}(1 − f_xy).
Since state y is recurrent, 1 − f_yy = 0 while f_{yx}^{(n)} > 0, so it follows that f_xy = 1 and hence x → y.
Corollary 1.12. Let x, y ∈ X be in the same communicating class with the state y recurrent. Then, lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = 1/μ_yy. Furthermore, if the state y is aperiodic, then lim_{n→∞} p_{xy}^{(n)} = 1/μ_yy.
Proof. Since y is recurrent and y → x, it follows that f_xy = 1 from the previous corollary. From Theorem 1.7 in the previous lecture, it follows that lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = 1/μ_yy.
Let the period of the state y be d. Then we know that there exists a positive integer r_0 such that for all n ⩾ r_0, we have p_{yy}^{(nd)} > 0.
Theorem 1.13. The states in a communicating class are of one of the following types: all transient, all null recurrent, or all positive recurrent.
Proof. It suffices to show that if x, y belong to the same communicating class and y is null recurrent, then x is null recurrent as well. We take r, s ∈ N such that p_{yx}^{(r)} p_{xy}^{(s)} > 0. It follows that p_{yy}^{(r+ℓ+s)} ⩾ p_{yx}^{(r)} p_{xx}^{(ℓ)} p_{xy}^{(s)} for all ℓ ∈ N. Hence, for any n > r + s, we have
(1/n) ∑_{k=1}^n p_{yy}^{(k)} ⩾ (1/n) ∑_{k=r+s+1}^n p_{yy}^{(k)} ⩾ ((n − r − s)/n) (1/(n − r − s)) ∑_{ℓ=1}^{n−r−s} p_{xx}^{(ℓ)} p_{yx}^{(r)} p_{xy}^{(s)}.
Since y is null recurrent, the LHS goes to zero as n increases, which implies lim_{n→∞} (1/n) ∑_{ℓ=1}^n p_{xx}^{(ℓ)} = 0. Hence, x is null recurrent as well.
Lecture-23: Invariant Distribution
1 Invariant Distribution
Let X : Ω → XZ+ be a time-homogeneous Markov chain with transition probability matrix P : X × X → [0, 1].
Definition 1.1. A probability distribution π ∈ M(X) is said to be invariant distribution for the Markov
chain X if it satisfies the global balance equation π = πP.
Definition 1.2. When the initial distribution of a Markov chain is ν ∈ M(X), the corresponding probability measure is denoted by P_ν : F → [0, 1], defined by P_ν(A) ≜ ∑_{x∈X} ν(x) P(A | {X_0 = x}) for all events A ∈ F.
Definition 1.3. For a Markov chain X : Ω → XZ+ , we denote the distribution of random variable Xn : Ω → X
by νn ∈ M(X) for all n ∈ Z+ . That is, νn ( x ) ≜ Pν0 { Xn = x } for all x ∈ X.
Remark 1. We observe that νn ( x ) = ∑z∈X ν0 (z)( Pn )zx for all x ∈ X.
P_π{X_0 = x_0, . . . , X_n = x_n} = P_π{X_k = x_0, . . . , X_{k+n} = x_n} = π_{x_0} p_{x_0 x_1} · · · p_{x_{n−1} x_n}.
v For an irreducible Markov chain, if π_x > 0 for some x ∈ X, then the entire invariant vector π is positive. To see this, we show that π_y > 0 for all states y ∈ X. Let y ∈ X; then from the irreducibility of the Markov chain, there exists an m ∈ Z_+ such that p_{xy}^{(m)} > 0. Further, π = πP^m and hence π_y ⩾ π_x p_{xy}^{(m)} > 0.
vi Any scaled version of π satisfies the global balance equation. Therefore, for any invariant vector α ∈ R_+^X of a positive recurrent transition matrix P, the sum ∥α∥_1 = ∑_{x∈X} α_x must be finite, and we can normalize α to get an invariant probability measure π = α/∥α∥_1.
Theorem 1.4. An irreducible Markov chain with transition probability matrix P : X × X → [0, 1] is positive recurrent iff there exists a unique invariant probability measure π ∈ M(X) that satisfies the global balance equation π = πP with π_x = 1/μ_xx > 0 for all x ∈ X.
Proof. Consider an irreducible Markov chain X : Ω → XZ+ with transition probability matrix P. We will
first show that positive recurrence of X implies the existence of a positive invariant distribution π and its
uniqueness. Then we will show that the existence of a unique positive invariant distribution π implies
positive recurrence of X.
Implication: For the Markov chain X, let the initial state be X_0 = x. Recall that the number of visits to state y ∈ X in the first n steps of the Markov chain X is denoted by N_y(n) = ∑_{k=1}^n 1_{X_k = y}. It follows that ∑_{y∈X} N_y(n) = n for each n ∈ N. Let H_x ≜ τ_X^{{x},1} be the first recurrence time to state x ∈ X; then we have N_x(H_x) = 1 and ∑_{y∈X} N_y(H_x) = H_x. We define
v_y ≜ E_x N_y(H_x) = E_x ∑_{n∈N} 1_{X_n = y, n ⩽ H_x} = ∑_{n∈N} P_x{X_n = y, n ⩽ H_x} = ∑_{n∈N} λ_{xy}^{(n)}.  (1)
Conditioning on the previous state, we can write
λ_{xy}^{(n)} = ∑_{z≠x} P({X_n = y} | {X_{n−1} = z, n ⩽ H_x, X_0 = x}) P_x{X_{n−1} = z, n ⩽ H_x}.  (2)
Substituting the expression for λ_{xy}^{(n)} in (2) into the expression for v_y = ∑_{n∈N} λ_{xy}^{(n)} in (1), and using the fact that v_x = 1, we obtain
v_y = p_xy + ∑_{n⩾2} ∑_{z≠x} λ_{xz}^{(n−1)} p_zy = v_x p_xy + ∑_{z≠x} v_z p_zy = ∑_{z∈X} v_z p_zy.
Since v has a finite sum, it follows that π ≜ v/∑_{x∈X} v_x is an invariant distribution for the Markov chain X with the transition matrix P. In addition, we have π_x = v_x/∑_{y∈X} v_y = 1/μ_xx > 0.
Uniqueness: Next, we show that this is the unique invariant measure, independent of the initial state x, and hence π_y = 1/μ_yy > 0 for all y ∈ X. For uniqueness, we observe from the Chapman-Kolmogorov equations and the invariance of π that π = (1/n) π(P + P^2 + · · · + P^n). Hence, π_y = ∑_{x∈X} π_x (1/n) ∑_{k=1}^n p_{xy}^{(k)} for all states y ∈ X. Taking the limit n → ∞ on both sides, and exchanging limit and summation on the right hand side using the bounded convergence theorem for the summable series π, we get π_y = (1/μ_yy) ∑_{x∈X} π_x = 1/μ_yy > 0 for all states y ∈ X.
Converse: Let π be the unique positive invariant distribution of the Markov chain X, such that π_y = 1/μ_yy > 0 for all states y ∈ X. Then μ_yy < ∞ for all y ∈ X, and hence the Markov chain X is positive recurrent.
Corollary 1.5. An irreducible Markov chain on a finite state space X has a unique positive stationary distribution π.
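For a small irreducible chain, the invariant distribution can be found by repeatedly applying the global balance map π ↦ πP. A minimal sketch with a hypothetical two-state matrix; for this chain the exact answer is π = (2/3, 1/3):

```python
# Hypothetical irreducible, aperiodic two-state chain.
P = [[0.9, 0.1],
     [0.2, 0.8]]

# Power iteration on the global balance equation pi = pi P: starting from any
# distribution, pi_n = pi_0 P^n converges to the unique invariant distribution.
pi = [0.5, 0.5]
for _ in range(500):
    pi = [sum(pi[x] * P[x][y] for x in range(len(P))) for y in range(len(P))]

# Detailed check against the closed form: pi_0 * p_01 = pi_1 * p_10 forces
# pi = (2/3, 1/3), and pi_x = 1/mu_xx by Theorem 1.4.
```

Convergence is geometric here (the second eigenvalue is 0.7), so 500 iterations reach machine precision.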
Definition 1.6. An irreducible, aperiodic, positive recurrent Markov chain is called ergodic.
i For a Markov chain with multiple positive recurrent communicating classes C_1, . . . , C_m, one can find the positive equilibrium distribution for each class and extend it to the entire state space X, denoting it by π_k for class k ∈ [m]. It is easy to check that any convex combination π = ∑_{k=1}^m α_k π_k satisfies the global balance equation π = πP, where α_k ⩾ 0 for each k ∈ [m] and ∑_{k=1}^m α_k = 1. Hence, a Markov chain with multiple positive recurrent classes has a convex set of invariant probability measures, with the individual invariant distributions π_k for each positive recurrent class k ∈ [m] being the extreme points.
ii Let μ(0) = e_x, that is, let the initial state of the positive recurrent Markov chain be X_0 = x. Then, we know that
π_y = 1/μ_yy = lim_{n→∞} (1/n) ∑_{k=1}^n p_{xy}^{(k)} = lim_{n→∞} (1/n) E_x N_y(n).
Theorem 1.7. For an ergodic Markov chain X with invariant distribution π, and nth step distribution µ(n), we
have limn→∞ µ(n) = π in the total variation distance.
Proof. Consider independent time homogeneous Markov chains X : Ω → XZ+ and Y : Ω → XZ+ each with
transition matrix P. The initial state of Markov chain X is assumed to be X0 = x, whereas the Markov chain
Y is assumed to have an initial distribution π. It follows that Y is a stationary process, while X is not. In
particular,
μ_y(n) = P_x{X_n = y} = p_{xy}^{(n)},  P_π{Y_n = y} = π_y.
Let τ = inf{n ∈ Z+ : Xn = Yn } be the first time that two Markov chains meet, called the coupling time.
Finiteness: First, we show that the coupling time is almost surely finite. To this end, we define a new Markov chain on the state space X × X with transition probability matrix Q such that q((x, w), (y, z)) = p_xy p_wz for each pair of states (x, w), (y, z) ∈ X × X. The n-step transition probabilities for this coupled Markov chain are given by
q^{(n)}((x, w), (y, z)) ≜ p_{xy}^{(n)} p_{wz}^{(n)}.
Ergodicity: Since the Markov chain X with transition probability matrix P is irreducible and aperiodic, for each x, y, w, z ∈ X there exists an n_0 ∈ Z_+ such that q^{(n)}((x, w), (y, z)) = p_{xy}^{(n)} p_{wz}^{(n)} > 0 for all n ⩾ n_0, from a previous Lemma on aperiodicity. Hence, the irreducibility and aperiodicity of this product Markov chain follow.
Invariant: It is easy to check that θ ( x, w) = π x πw is the invariant distribution for this product Markov
chain, since θ ( x, w) > 0 for each ( x, w) ∈ X × X, ∑ x,w∈X θ ( x, w) = 1, and for each (y, z) ∈ X × X,
we have
∑_{x,w∈X} θ(x, w) q((x, w), (y, z)) = ∑_{x∈X} π_x p_xy ∑_{w∈X} π_w p_wz = π_y π_z = θ(y, z).
Recurrence: This implies that the product Markov chain is positive recurrent, and each state ( x, x ) ∈
X × X is reachable with unit probability from any initial state (y, w) ∈ X × X.
3
Coupled process: Second, we show that from the coupling time onwards, the evolution of the two Markov chains is identical in distribution. That is, for each y ∈ X and n ∈ Z_+,
P_{δ_x,π}{X_n = y, τ ⩽ n} = P_{δ_x,π}{Y_n = y, τ ⩽ n}.
This follows from the strong Markov property for the joint process, where τ is a stopping time for the joint process ((X_n, Y_n) : n ∈ Z_+) such that X_τ = Y_τ, and both marginals have the identical transition matrix.
Limit: For any y ∈ X, we can write the difference as
|p_{xy}^{(n)} − π_y| = |P_x{X_n = y, n < τ} − P_π{Y_n = y, n < τ}| ⩽ 2 P_{δ_x,π}(τ > n).
Since the coupling time is almost surely finite for each initial state x ∈ X, we have ∑_{n∈N} P_{δ_x,π}{τ = n} = 1, the tail-sum P_{δ_x,π}{τ > n} goes to zero as n grows large, and the result follows.
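The convergence in Theorem 1.7 is easy to visualize numerically: for an ergodic chain, the total variation distance between μ(n) = μ(0)P^n and π decays geometrically. A sketch with an illustrative two-state chain whose invariant distribution is (2/3, 1/3):

```python
# Hypothetical ergodic two-state chain with invariant distribution (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2/3, 1/3]

def tv(mu, nu):
    """Total variation distance between two distributions on a finite space."""
    return 0.5 * sum(abs(a - b) for a, b in zip(mu, nu))

mu = [1.0, 0.0]          # mu(0) = e_x: start deterministically in state 0
dists = []
for _ in range(50):
    dists.append(tv(mu, pi))
    mu = [sum(mu[x] * P[x][y] for x in range(2)) for y in range(2)]
```

For this chain the distance equals (1/3)·0.7^n, so it shrinks monotonically toward zero, matching the coupling bound 2P(τ > n).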
Remark 4. The probability flux from a single node x to single node y is denoted by Φ( x, y) = π x p xy .
Definition 2.2. For a time homogeneous Markov chain X : Ω → X^{Z_+} with probability transition matrix P represented as the weighted transition graph G = (X, E, w), a cut is defined as a partition (X_1, X_2) of the nodes.
Lemma 2.3. Probability flux balances across cuts.
Proof. A cut for the state space X is given by a partition (X_1, X_2). We show that Φ(X_1, X_2) = Φ(X_2, X_1). To see this, we observe that by exchanging sums using the monotone convergence theorem and swapping x and y as running variables, we can write the probability flux Φ(X_1, X_2) as
Corollary 2.4. For any states y ∈ X, we have πy (1 − pyy ) = πy ∑ x̸=y pyx = ∑ x̸=y π x p xy .
Proof. It follows from probability flux balancing across the cut ({y} , {y}c ).
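Flux balance across a cut can be verified numerically: compute π by power iteration, then compare the flux leaving X_1 with the flux entering it. The three-state chain below is illustrative:

```python
# Hypothetical irreducible three-state chain (rows sum to one).
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.2, 0.4, 0.4]]

# Invariant distribution via power iteration on pi = pi P.
pi = [1/3, 1/3, 1/3]
for _ in range(1000):
    pi = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]

# Cut (X1, X2) and the fluxes Phi(X1, X2) = sum pi_x p_xy across the cut.
X1, X2 = {0}, {1, 2}
flux_out = sum(pi[x] * P[x][y] for x in X1 for y in X2)
flux_in = sum(pi[x] * P[x][y] for x in X2 for y in X1)
```

For the singleton cut ({0}, {0}^c) this reduces to Corollary 2.4: π_0(1 − p_00) = ∑_{x≠0} π_x p_x0.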
Lecture-24: Poisson Point Processes
Definition 1.1. A simple point process is a random countable collection of distinct points S : Ω → XN , such
that the distance ∥Sn ∥ → ∞ as n → ∞.
Remark 1. Since S is a simple point process, each point Sn is unique. Therefore, we can identify S as a
random set of points in X and S ∩ A is the random set of points in A.
Remark 2. For any simple point process S, we have P({Sn = Sm for any n ̸= m}) = 0 and |S ∩ A| is finite
almost surely for any bounded set A ∈ B(X).
Example 1.2 (Simple point process on the half-line). We can simplify this definition for d = 1. When X = R_+, one can order the points of the process S : Ω → R_+^N to get an ordered process S̃ : Ω → R_+^N such that S̃_n = S_{(n)} is the nth order statistic of S. That is, S_{(0)} ≜ 0 and S_{(n)} ≜ inf{S_k > S_{(n−1)} : k ∈ N}, so that S_{(1)} < S_{(2)} < · · · < S_{(n)} < · · · and lim_{n→∞} S_{(n)} = ∞. We will call this an arrival process.
Definition 1.3. Corresponding to a point process S : Ω → X^N, we denote the number of points in a set A ∈ B(X) by
N(A) ≜ |S ∩ A| = ∑_{n∈N} 1_A(S_n), where N(∅) = 0.
The resulting process N : Ω → Z_+^{B(X)} is called a counting process for the point process S : Ω → X^N.
Remark 3. Let A ∈ B(X)^k be a bounded partition of B ∈ B(X). From the disjointness of (A_1, . . . , A_k), we have
N(B) = ∑_{n∈N} 1_{∪_{i=1}^k A_i}(S_n) = ∑_{n∈N} ∑_{i=1}^k 1_{A_i}(S_n) = ∑_{i=1}^k ∑_{n∈N} 1_{A_i}(S_n) = ∑_{i=1}^k N(A_i).
Definition 1.4. A counting process is simple if the underlying point process is simple.
Remark 4. For a simple counting process N, we have N ({ x }) ⩽ 1 almost surely for all x ∈ X.
Remark 5. Let N : Ω → Z_+^{B(X)} be the counting process for the point process S : Ω → X^N.
i Note that the point process S and the counting process N carry the same information.
ii The distribution of point process S is completely characterized by the finite dimensional distributions
of random vectors ( N ( A1 ), . . . , N ( Ak )) for any bounded sets A1 , . . . , Ak ∈ B(X) and finite k ∈ N.
Example 1.5 (Simple point process on the half-line). Since the Borel measurable sets B(R_+) are generated by the half-open intervals {(0, t] : t ∈ R_+}, we denote the counting process by N : Ω → Z_+^{R_+}, where N_t ≜ N(0, t] = ∑_{n∈N} 1_{S_n ∈ (0,t]} is the number of points in the half-open interval (0, t]. For s < t, the number of points in the interval (s, t] is N(s, t] = N(0, t] − N(0, s] = N_t − N_s.
Theorem 1.6 (Rényi). Distribution of a simple point process S : Ω → XN on a locally compact second countable
space X is completely determined by void probabilities ( P { N ( A) = 0} : A ∈ B(X)).
Proof. It suffices to show that the finite dimensional distributions of S on locally compact sets are character-
ized by void probabilities.
Step 1: We will show this by induction on the number of sets k. Let A_1, . . . , A_k, B ∈ B(X) be locally compact; we will show that u_k ≜ P(∩_{i=1}^k {N(A_i) > 0} ∩ {N(B) = 0}) can be computed from void probabilities. For k = 1, we have
P{N(A_1) > 0, N(B) = 0} = P{N(B) = 0} − P{N(B ∪ A_1) = 0}.
Step 2: For any locally compact set B ∈ B(X), there exists a sequence of nested partitions B_n ≜ (B_{n,j} : j ∈ [J_n]) that eventually separates the points in S ∩ B as n → ∞. We define the number of subsets of the partition (B_{n,j} : j ∈ [J_n]) that contain at least one point of S ∩ B as H_n(B) ≜ ∑_{j=1}^{J_n} 1_{N(B_{n,j}) > 0}, where H_n(B) ↑ N(B) almost surely.
Step 3: We next show that for all locally compact sets B_1, . . . , B_k ∈ B(X) and j_1, . . . , j_k ∈ N, the probability P(∩_{i=1}^k {H_n(B_i) = j_i}) can be expressed in terms of void probabilities. We observe that
P(∩_{i=1}^k {H_n(B_i) = j_i}) = ∑_{T_1,...,T_k ⊆ [J_n] : |T_1| = j_1, ..., |T_k| = j_k} P(∩_{i=1}^k ∩_{j∈T_i} {N(B_{n,j}^i) > 0} ∩ {N(∪_{j∉∪_{i=1}^k T_i} B_{n,j}^i) = 0}).
Step 4: For a simple point process, we have the almost sure limit lim_n ∩_{i=1}^k {H_n(B_i) = j_i} = ∩_{i=1}^k {N(B_i) = j_i}. The result follows from the continuity of probability.
Definition 1.7. The intensity measure Λ : B(X) → R_+ is defined for each bounded set A ∈ B(X) as its scaled volume in terms of the intensity density λ : R^d → R_+, as
Λ(A) ≜ ∫_{x∈A} λ(x) dx.
If the intensity density is constant, λ(x) = λ for all x ∈ R^d, then Λ(A) = λ|A|. In particular, for a partition A_1, . . . , A_k of a set B, we have Λ(B) = ∑_{i=1}^k Λ(A_i).
Proof. It is clear that if the marginal distribution of the counting process N is Poisson with intensity measure Λ, then the void probability P{N(A) = 0} = e^{−Λ(A)} is exponential in the intensity measure for any bounded set A ∈ B(X).
Conversely, we assume that the void probabilities are exponential in the intensity measure Λ. It follows from the additivity of the intensity measure that for any finite collection of bounded, disjoint sets B_1, . . . , B_k ∈ B(X), we have
P(∩_{i=1}^k {N(B_i) = 0}) = P{N(∪_{i=1}^k B_i) = 0} = e^{−Λ(∪_{i=1}^k B_i)} = ∏_{i=1}^k e^{−Λ(B_i)} = ∏_{i=1}^k P{N(B_i) = 0}.
That is, the Bernoulli random vector (1_{N(B_i) = 0} : i ∈ [k]) has independent components for any finite k ∈ N and bounded disjoint B(X)-measurable sets B_1, . . . , B_k. Next we consider a set B ∈ B(X) and a partition B_n ≜ (B_{n,j} : j ∈ [J_n]) of B such that Λ(B_{n,j}) = Λ(B)/J_n for all j ∈ [J_n]. It follows that H_n(B) ≜ ∑_{j=1}^{J_n} 1_{N(B_{n,j}) > 0} is the sum of J_n i.i.d. Bernoulli random variables with success probability p_n ≜ 1 − e^{−Λ(B)/J_n}, and hence has a Binomial distribution with parameters (J_n, p_n). Therefore,
P{H_n(B) = m} = e^{−Λ(B)} (J_n!/(m!(J_n − m)!)) (e^{Λ(B)/J_n} p_n)^m = e^{−Λ(B)} (J_n!/(m!(J_n − m)!)) (e^{Λ(B)/J_n} − 1)^m.
Recall that H_n(B) ↑ N(B) as n → ∞ from the proof of Rényi's Theorem, with lim_{n→∞} J_n = ∞ and lim_{n→∞} Λ(B_{n,j}) = 0. Thus, lim_{n→∞} (J_n!/(J_n − m)!) (e^{Λ(B)/J_n} − 1)^m = Λ(B)^m. Taking the limit n → ∞ on both sides of the above equation, we get P{N(B) = m} = e^{−Λ(B)} Λ(B)^m/m!, which is the result.
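The Binomial-to-Poisson limit used in the proof can be checked numerically. For Λ(B) = 2 and m = 3 (values chosen purely for illustration), the Binomial(J_n, 1 − e^{−Λ(B)/J_n}) probability of m approaches the Poisson(Λ(B)) probability as J_n grows:

```python
from math import comb, exp, factorial

Lam, m = 2.0, 3   # illustrative intensity Lam = Lam(B) and count m

def binom_pmf(J):
    """P{H_n(B) = m} for a partition of B into J equal-intensity pieces."""
    p = 1 - exp(-Lam / J)                    # success probability p_n
    return comb(J, m) * p**m * (1 - p)**(J - m)

poisson_pmf = exp(-Lam) * Lam**m / factorial(m)   # target Poisson probability
approx = binom_pmf(10_000)                        # fine partition, J_n large
```

As J_n → ∞ the Binomial pmf converges to the Poisson pmf, exactly as the limit of J_n!/(J_n − m)! (e^{Λ(B)/J_n} − 1)^m = Λ(B)^m dictates.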
Definition 2.3. A counting process N : Ω → Z_+^{B(X)} has the complete independence property if, for any finite collection of disjoint and bounded sets A_1, . . . , A_k ∈ B(X), the vector (N(A_1), . . . , N(A_k)) : Ω → Z_+^k has independent components. That is,
P(∩_{i=1}^k {N(A_i) = n_i}) = ∏_{i=1}^k P{N(A_i) = n_i},  n ∈ Z_+^k.
Definition 2.4. A simple point process S : Ω → X^N is a Poisson point process if the associated counting process N : Ω → Z_+^{B(X)} has the complete independence property and the marginal distributions are Poisson.
Remark 8. Recall that for any partition A ∈ B(X)k of a bounded set B ∈ B(X), we have N ( B) = ∑ik=1 N ( Ai )
and therefore it follows from the linearity of expectations that Λ( B) = EN ( B) = ∑ik=1 EN ( Ai ) = ∑ik=1 Λ( Ai ).
Thus, this is a valid intensity measure.
Remark 9. For a Poisson process with intensity measure Λ, it follows from the definition that for any finite
k ∈ Z+ , and bounded mutually disjoint sets A1 , . . . , Ak ∈ B(X), we have
P(∩_{i=1}^k {N(A_i) = n_i}) = ∏_{i=1}^k e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!,  n ∈ Z_+^k.
Definition 2.6. If the intensity measure Λ of a Poisson process S satisfies Λ( A) = λ | A| for all bounded
A ∈ B(X), then we call S a homogeneous Poisson point process and λ is its intensity.
3 Equivalent characterizations
Theorem 3.1 (Equivalences). The following are equivalent for a simple counting process N : Ω → Z_+^{B(X)}.
i Process N is Poisson with locally finite intensity measure Λ.
3
ii For each bounded A ∈ B(X), we have P { N ( A) = 0} = e−Λ( A) .
iii For each bounded A ∈ B(X), the number of points N(A) is Poisson with parameter Λ(A).
iv Process N has the completely independence property, and EN ( A) = Λ( A) for all bounded sets A ∈ B(X).
Proof. We will show that i =⇒ ii =⇒ iii =⇒ iv =⇒ i .
i =⇒ ii It follows from the definition of Poisson point processes and definition of Poisson random vari-
ables.
ii =⇒ iii From Corollary 2.2, we know that if void probabilities are exponential, then the marginal distri-
butions are Poisson.
iii =⇒ iv We will show this in two steps.
Mean: Since the distribution of random variable N ( A) is Poisson, it has mean EN ( A) = Λ( A).
CIP: Consider a partition A ∈ B(X)^k of a bounded set B ∈ B(X); then Λ(B) = Λ(A_1) + · · · + Λ(A_k). Considering all partitions n ∈ Z_+^k of a non-negative integer m ∈ Z_+, we write
P{N(B) = m} = ∑_{n_1+···+n_k=m} P{N(A_1) = n_1, . . . , N(A_k) = n_k}.
Using the definition of the Poisson distribution, we can write the LHS of the above equation as
P{N(B) = m} = e^{−Λ(B)} Λ(B)^m/m! = ∏_{i=1}^k e^{−Λ(A_i)} (∑_{i=1}^k Λ(A_i))^m/m!.
Since (a_1 + · · · + a_k)^m = ∑_{n_1+···+n_k=m} (m!/(n_1! · · · n_k!)) ∏_{i=1}^k a_i^{n_i}, we get
P{N(B) = m} = (1/m!) ∑_{n_1+···+n_k=m} (m!/(n_1! · · · n_k!)) ∏_{i=1}^k e^{−Λ(A_i)} Λ(A_i)^{n_i} = ∑_{n_1+···+n_k=m} ∏_{i=1}^k e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!.
Corollary 3.2 (Poisson process on the half-line). A random process N : Ω → Z_+^{R_+} indexed by time t ∈ R_+ is the counting process associated with a one-dimensional Poisson process S : Ω → R_+^N having intensity measure Λ iff
(a) starting with N_0 = 0, the process N_t takes a non-negative integer value for all t ∈ R_+;
(b) the increment N_s − N_t is surely nonnegative for any s ⩾ t;
(c) the increments N_{t_1}, N_{t_2} − N_{t_1}, . . . , N_{t_n} − N_{t_{n−1}} are independent for any 0 < t_1 < t_2 < · · · < t_{n−1} < t_n;
(d) the increment N_s − N_t is distributed as a Poisson random variable with parameter Λ(t, s] for s ⩾ t.
The Poisson process is homogeneous with intensity λ iff, in addition to conditions (a), (b), (c), the distribution of the increment N_{t+s} − N_t depends on the value s ∈ R_+ but is independent of t ∈ R_+. That is, the increments are stationary.
Proof. We have already seen that the definition of Poisson processes implies all four conditions. Conditions (a) and (b) imply that N is a simple counting process on the half-line, condition (c) is the complete independence property of the point process, and condition (d) provides the intensity measure. The result follows from the equivalence iv in Theorem 3.1.
Lecture-25: Poisson processes: Conditional distribution
Proposition 1.1. Let k ∈ N be any positive integer. Consider a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+, a finite partition A ∈ B(X)^k of a bounded set B ∈ B(X), and a vector n ∈ Z_+^k that partitions a non-negative integer m ∈ Z_+. Then,
P({N(A_1) = n_1, . . . , N(A_k) = n_k} | {N(B) = m}) = (m!/(n_1! · · · n_k!)) ∏_{i=1}^k (Λ(A_i)/Λ(B))^{n_i}.  (1)
Proof. From the definition of conditional probability and the fact that ∩_{i=1}^k {N(A_i) = n_i} ⊆ {N(B) = m}, we can write the conditional probability on the LHS as the ratio

    P{N(A_1) = n_1, …, N(A_k) = n_k, N(B) = m} / P{N(B) = m} = P{N(A_1) = n_1, …, N(A_k) = n_k} / P{N(B) = m}.

From the complete independence property and Poisson marginals for the joint distribution of (N(A_1), …, N(A_k)) for the partition A ∈ B(X)^k, and the fact that the intensity measure sums over disjoint sets, i.e. Λ(B) = ∑_{i=1}^k Λ(A_i), we can rewrite the RHS of the above equation as

    P{N(A_1) = n_1, …, N(A_k) = n_k} / P{N(B) = m} = (∏_{i=1}^k e^{−Λ(A_i)} Λ(A_i)^{n_i}/n_i!) / (e^{−Λ(B)} Λ(B)^m/m!) = (m choose n_1,…,n_k) ∏_{i=1}^k (Λ(A_i)/Λ(B))^{n_i}.
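The conditional law of Proposition 1.1 lends itself to a numeric sanity check. The sketch below (an illustration of ours, not part of the notes; helper names such as `conditional_counts` are made up) simulates the count of a homogeneous Poisson process on B = [0, 1), conditions on {N(B) = m} by rejection, and compares the empirical distribution of the cell counts with the multinomial formula.

```python
import math
import random

def poisson_sample(lam, rng):
    """Inverse-transform sample of a Poisson(lam) random variable."""
    u, k = rng.random(), 0
    p = math.exp(-lam)   # P{N = 0}
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

def multinomial_pmf(m, ns, ps):
    """P({N(A_1)=n_1, ..., N(A_k)=n_k} | {N(B)=m}) from Proposition 1.1."""
    coef = math.factorial(m)
    for n in ns:
        coef //= math.factorial(n)
    prob = float(coef)
    for n, p in zip(ns, ps):
        prob *= p ** n
    return prob

def conditional_counts(lam, m, cells, trials, rng):
    """Empirical pmf of cell counts given N(B) = m, by rejection sampling."""
    freq = {}
    accepted = 0
    while accepted < trials:
        if poisson_sample(lam, rng) != m:
            continue  # condition on the event {N(B) = m}
        accepted += 1
        # Given the count, the points are i.i.d. uniform on B = [0, 1).
        pts = [rng.random() for _ in range(m)]
        counts = tuple(sum(1 for x in pts if lo <= x < hi) for lo, hi in cells)
        freq[counts] = freq.get(counts, 0) + 1
    return {c: f / trials for c, f in freq.items()}

rng = random.Random(1)
cells = [(0.0, 0.3), (0.3, 1.0)]                 # partition of B = [0, 1)
emp = conditional_counts(lam=2.0, m=3, cells=cells, trials=20000, rng=rng)
exact = multinomial_pmf(3, (1, 2), (0.3, 0.7))   # P{N(A1)=1, N(A2)=2 | N(B)=3}
```

Here p_i = Λ(A_i)/Λ(B) reduces to the cell lengths 0.3 and 0.7, and `exact` evaluates to 3 · 0.3 · 0.7² = 0.441.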
Remark 1. Consider a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+ and counting process N : Ω → Z_+^{B(X)}. Let A ∈ B(X)^k be a partition of a bounded set B ∈ B(X).

i Defining p_i ≜ Λ(A_i)/Λ(B), we see that (p_1, …, p_k) ∈ M([k]) is a probability distribution. When N(B) = 1, we can call the point of S in B as S_1 without any loss of generality. That is, if we write {S_1} = S ∩ B, then we have

    p_i = P({S_1 ∈ A_i} | {S_1 ∈ B}).

Similarly, when N(B) = n_i, we call the points of S in B as S_1, …, S_{n_i} and denote S ∩ B = {S_1, …, S_{n_i}}. For this case, we observe

    P({N(A_i) = n_i} | {N(B) = n_i}) = P({|S ∩ A_i| = n_i} | {|S ∩ B| = n_i}) = p_i^{n_i} = P(∩_{j=1}^{n_i} {S_j ∈ A_i} | {S_1, …, S_{n_i}} = S ∩ B) = ∏_{j=1}^{n_i} P(S_j ∈ A_i | S_j ∈ B).
ii We can rewrite Equation (1) as a multinomial distribution, where

    P({N(A_1) = n_1, …, N(A_k) = n_k} | {N(B) = m}) = (m choose n_1,…,n_k) p_1^{n_1} ⋯ p_k^{n_k}.
iii For any finite set F ⊆ N of size m ∈ Z_+ and n ∈ Z_+^k a partition of m, we define P_k(F, n) to be the collection of all k-partitions E ∈ P(N)^k of F such that |E_i| = n_i for i ∈ [k]. Then, the multinomial coefficient accounts for the number of partitions of m points into sets with n_1, …, n_k points. That is,

    (m choose n_1,…,n_k) = |P_k([m], n)|.

Further, summing over which points go to which cell, we can write

    P({N(A_1) = n_1, …, N(A_k) = n_k} | {S ∩ B = F}) = ∑_{E ∈ P_k(F,n)} P(∩_{i=1}^k ∩_{S_j ∈ E_i} {S_j ∈ A_i} | {S ∩ B = F}).
vi Equating the RHS of the above equation term-wise, we obtain that conditioned on each of these points falling inside the window B, the conditional probability of each point falling in the partition A_i is independent of all other points and given by p_i. That is, we have

    P(∩_{i=1}^k ∩_{S_j ∈ E_i} {S_j ∈ A_i} | {S ∩ B = F}) = ∏_{i=1}^k ∏_{S_j ∈ E_i} P(S_j ∈ A_i | S_j ∈ B) = ∏_{i=1}^k (Λ(A_i)/Λ(B))^{n_i} = ∏_{i=1}^k p_i^{n_i}.

It means that given m points in the window B, the locations of these points are independently and identically distributed in B according to the distribution Λ(·)/Λ(B).
vii If the Poisson process is homogeneous, the distribution is uniform over the window B.
viii For a Poisson process with intensity measure Λ having density λ and any bounded set A ∈ B(X), the number of points N(A) in the set A is a Poisson random variable with parameter Λ(A). Given the number of points N(A), the locations of all the points in S ∩ A are i.i.d. with density λ(x)/Λ(A) for all x ∈ A.
Remark 2 (Simulating a homogeneous Poisson point process). Suppose we are interested in simulating a two-dimensional homogeneous Poisson point process with density λ on the unit square A = [0, 1] × [0, 1]. We first generate the random variable N(A) : Ω → Z_+ that takes value n with probability e^{−λ} λ^n/n!. Next, for each of the N(A) = n points, we generate the location (X_i, Y_i) ∈ A uniformly at random. That is, X : Ω → [0, 1]^n and Y : Ω → [0, 1]^n are independent i.i.d. uniform sequences.
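The two-step recipe in the remark can be sketched as follows (a minimal illustration of ours; the helper name `simulate_ppp_unit_square` is made up): first draw N(A) ~ Poisson(λ), then drop that many i.i.d. uniform points in the unit square.

```python
import math
import random

def simulate_ppp_unit_square(lam, rng):
    """Homogeneous Poisson point process with density lam on [0,1] x [0,1]."""
    # Step 1: N(A) ~ Poisson(lam), sampled by inverse transform.
    u, n = rng.random(), 0
    p = math.exp(-lam)   # P{N(A) = 0}
    cdf = p
    while u > cdf:
        n += 1
        p *= lam / n
        cdf += p
    # Step 2: given N(A) = n, the locations are i.i.d. uniform on the square.
    return [(rng.random(), rng.random()) for _ in range(n)]

rng = random.Random(0)
lam = 5.0
counts = [len(simulate_ppp_unit_square(lam, rng)) for _ in range(20000)]
mean_count = sum(counts) / len(counts)   # should be close to lam
```

The empirical mean of the point count over many realizations should be close to λ, since E N(A) = λ · area(A) = λ here.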
Corollary 1.2. For a homogeneous Poisson point process on the half-line with ordered set of points S̃ = (S_{(n)} ∈ R_+ : n ∈ N), we can write the conditional density of the ordered points (S_{(1)}, …, S_{(k)}) given the event {N_t = k} as that of the order statistics of i.i.d. uniformly distributed random variables. Specifically, we have

    f_{S_{(1)},…,S_{(k)} | N_t=k}(t_1, …, t_k) = (k!/t^k) 1_{0 < t_1 < ⋯ < t_k ⩽ t}.

Proof. Given {N_t = k}, we can denote the points of the Poisson process in (0, t] by S_1, …, S_k. From the above remark, we know that S_1, …, S_k are i.i.d. uniform in (0, t], conditioned on the number of points N_t = k. Hence, we can write

    F_{S_1,…,S_k | N_t=k}(t_1, …, t_k) = P(∩_{i=1}^k {S_i ∈ (0, t_i]} | {N_t = k}) = ∏_{i=1}^k P({S_i ∈ (0, t_i]} | {S_i ∈ (0, t]}) = ∏_{i=1}^k (t_i/t) 1_{0 < t_i ⩽ t}.

Therefore, for 0 < t_1 < ⋯ < t_k ⩽ t and (h_1, …, h_k) sufficiently small, we have

    P(∩_{i=1}^k {S_i ∈ (t_i, t_i + h_i]} | {N_t = k}) = ∏_{i=1}^k (h_i/t).

Since (S_1, …, S_k) are conditionally i.i.d. given {N_t = k}, for any permutation σ : [k] → [k] the conditional joint distribution of (S_{σ(1)}, …, S_{σ(k)}) is identical to that of (S_1, …, S_k). Further, we observe that the order statistics of (S_{σ(1)}, …, S_{σ(k)}) are identical to those of (S_1, …, S_k). Therefore, we can write the following equality for the events

    ∩_{i=1}^k {S_{(i)} ∈ (t_i, t_i + h_i]} = ∪_{σ:[k]→[k] permutation} ∩_{i=1}^k {S_{σ(i)} ∈ (t_i, t_i + h_i]},

where the union is over k! disjoint events of equal conditional probability, and the result follows.
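Corollary 1.2 suggests a simulation check (ours, not from the notes): conditioned on N_t = k, the arrival instants of a homogeneous Poisson process should look like sorted i.i.d. uniforms on (0, t], so the ith order statistic should have mean t·i/(k + 1).

```python
import random

def arrivals_given_count(rate, t, k, rng):
    """One draw of the ordered arrival instants on (0, t] conditioned on
    N_t = k, generated by rejection over exponential inter-arrival times."""
    while True:
        arrivals, s = [], 0.0
        while True:
            s += rng.expovariate(rate)
            if s > t:
                break
            arrivals.append(s)
        if len(arrivals) == k:
            return arrivals   # already sorted: S_(1) < ... < S_(k)

rng = random.Random(3)
rate, t, k, trials = 2.0, 1.0, 2, 20000
sums = [0.0] * k
for _ in range(trials):
    for i, s in enumerate(arrivals_given_count(rate, t, k, rng)):
        sums[i] += s
means = [s / trials for s in sums]
# Order statistics of k i.i.d. Uniform(0, t]: E[S_(i)] = t * i / (k + 1).
expected = [t * (i + 1) / (k + 1) for i in range(k)]
```

With t = 1 and k = 2 the theoretical means are 1/3 and 2/3, independent of the rate λ, which is exactly the content of the corollary.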
Lecture-26: Properties of Poisson point processes
1 Laplace functional
Let X = R^d be the d-dimensional Euclidean space. Recall that all the random points are unique for a simple point process S : Ω → X^N, and hence S can also be considered as a countable set of points in X. Let N : Ω → Z_+^{B(X)} be the counting process associated with the simple point process S.

Remark 1. We observe that dN(x) = 0 for all x ∉ S and dN(x) = δ_x 1_{x ∈ S}. Hence, for any Borel measurable function f : X → R and bounded A ∈ B(X), we have ∫_{x∈A} f(x) dN(x) = ∑_{x ∈ S∩A} f(x).
Remark 2. For a simple function f = ∑_{i=1}^k t_i 1_{A_i}, we can write the Laplace functional as a function of the vector (t_1, t_2, …, t_k),

    L_S(f) = E exp(−∑_{i=1}^k t_i ∫_{A_i} dN(x)) = E exp(−∑_{i=1}^k t_i N(A_i)).

We observe that this is a joint Laplace transform of the random vector (N(A_1), …, N(A_k)). This way, one can compute all finite dimensional distributions of the counting process N.
Proposition 1.2. The Laplace functional of a Poisson point process S : Ω → X^N with intensity measure Λ : B(X) → R_+, evaluated at any non-negative Borel measurable function f : X → R_+, is

    L_S(f) = exp(−∫_X (1 − e^{−f(x)}) dΛ(x)).
Proof. For a bounded Borel measurable set A ∈ B(X), consider the truncated function g = f 1_A. Then,

    L_S(g) = E exp(−∫_X g(x) dN(x)) = E exp(−∫_A f(x) dN(x)).

Clearly dN(x) = δ_x 1_{x ∈ S}, and hence we can write L_S(g) = E exp(−∑_{x ∈ S∩A} f(x)). We know that the probability of N(A) = |S ∩ A| = n points in set A is given by

    P{N(A) = n} = e^{−Λ(A)} Λ(A)^n/n!.

Given there are n points in set A, the locations of the n points are i.i.d. with joint density

    f_{S_1,…,S_n | N(A)=n}(x_1, …, x_n) = ∏_{i=1}^n (dΛ(x_i)/Λ(A)) 1_{x_i ∈ A}.

Averaging over the number of points, we get

    L_S(g) = e^{−Λ(A)} ∑_{n∈Z_+} (1/n!) (∫_A e^{−f(x)} dΛ(x))^n = exp(−∫_A (1 − e^{−f(x)}) dΛ(x)).

The result follows from taking an increasing sequence of sets A_k ↑ X and the monotone convergence theorem.
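For the simple function f = c·1_A, the proposition reduces to the Poisson moment generating function: L_S(f) = E[e^{−c N(A)}] = exp(−Λ(A)(1 − e^{−c})). A quick numeric sanity check of ours compares the two sides by summing the Poisson pmf directly.

```python
import math

def laplace_lhs(c, mass, terms=60):
    """E[exp(-c * N(A))] with N(A) ~ Poisson(mass), by direct summation."""
    term = math.exp(-mass)   # P{N(A) = 0}
    total = term             # n = 0 contributes e^{-c*0} = 1
    for n in range(1, terms):
        term *= mass / n     # Poisson pmf recursion: p(n) = p(n-1) * mass / n
        total += term * math.exp(-c * n)
    return total

def laplace_rhs(c, mass):
    """exp(-int_A (1 - e^{-f}) dLambda) with f = c on A, Lambda(A) = mass."""
    return math.exp(-mass * (1.0 - math.exp(-c)))

lhs = laplace_lhs(c=0.7, mass=3.5)
rhs = laplace_rhs(c=0.7, mass=3.5)
```

The agreement is exact up to the truncation of the Poisson tail, which is negligible at 60 terms for a mean of 3.5.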
1.1 Superposition of point processes
Definition 1.3. Let S_k : Ω → X^N be a simple point process with intensity measure Λ_k : B(X) → R_+ and counting process N_k : Ω → Z_+^{B(X)}, for each k ∈ N. The superposition of the point processes (S_k : k ∈ N) is defined as the point process S ≜ ∪_k S_k.

Remark 3. The counting process associated with the superposition point process S : Ω → X^N is given by N : Ω → Z_+^{B(X)} defined by N ≜ ∑_k N_k, and the intensity measure of the point process S is given by Λ : B(X) → R_+ defined by Λ = ∑_k Λ_k, from the monotone convergence theorem.
Remark 4. The superposition process S is simple iff ∑k Nk is locally finite.
Theorem 1.4. The superposition of independent Poisson point processes (Sk : k ∈ N) with intensities (Λk : k ∈
N) is a Poisson point process with intensity measure ∑k Λk if and only if the latter is a locally finite measure.
Proof. Consider the superposition S = ∪_k S_k of independent Poisson point processes S_k ∈ X with intensity measures Λ_k. We will prove just the sufficiency part of this theorem. We assume that ∑_k Λ_k is a locally finite measure. It is clear that N(A) = ∑_k N_k(A) is finite by the locally finite assumption, for all bounded sets A ∈ B(X). In particular, we have dN(x) = ∑_k dN_k(x) for all x ∈ X. From the monotone convergence theorem and the independence of the counting processes, we have for a non-negative Borel measurable function f : X → R_+,

    L_S(f) = E exp(−∫_X f(x) ∑_k dN_k(x)) = ∏_k L_{S_k}(f) = exp(−∫_X (1 − e^{−f(x)}) ∑_k dΛ_k(x)).

This is the Laplace functional of a Poisson point process with intensity measure ∑_k Λ_k.
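A simulation sketch of ours for Theorem 1.4 on the interval [0, 1]: merging the points of two independent homogeneous Poisson processes with rates λ1 and λ2 should give counts with mean and variance both equal to λ1 + λ2, as for a Poisson variable.

```python
import random

def homogeneous_ppp(lam, rng):
    """Points of a rate-lam homogeneous Poisson process on [0, 1],
    generated via exponential inter-arrival times."""
    pts, s = [], rng.expovariate(lam)
    while s <= 1.0:
        pts.append(s)
        s += rng.expovariate(lam)
    return pts

rng = random.Random(4)
lam1, lam2, trials = 2.0, 3.0, 20000
counts = []
for _ in range(trials):
    # Superposition: union (here, merge-and-sort) of the two point sets.
    merged = sorted(homogeneous_ppp(lam1, rng) + homogeneous_ppp(lam2, rng))
    counts.append(len(merged))
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
```

Both `mean` and `var` should be close to λ1 + λ2 = 5, consistent with the merged process being Poisson with intensity λ1 + λ2.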
1.2 Thinning of point processes

Let S : Ω → X^N be a Poisson point process with intensity measure Λ, and let Y be a Bernoulli process independent of S, with P{Y(x) = 1} = p(x) for a measurable retention function p : X → [0, 1]. The thinned point process S^(p) ≜ {x ∈ S : Y(x) = 1} retains each point x of S independently with probability p(x). We show below that S^(p) is a Poisson point process with intensity measure p dΛ.

Proof. Let A ∈ B(X) be a bounded Borel measurable set, and let f : X → R_+ be a non-negative function. Let N^(p) be the counting process associated with the thinned point process S^(p). Hence, for any bounded set A ∈ B(X), we have N^(p)(A) = ∑_{x ∈ S∩A} Y(x). That is, dN^(p)(x) = δ_x Y(x) 1_{x ∈ S}. Therefore, for any non-negative function g(x) = f(x) 1_{x ∈ A}, we can write ∫_{x∈X} g(x) dN^(p)(x) = ∫_{x∈A} f(x) dN^(p)(x) = ∑_{x ∈ S∩A} f(x) Y(x). We can write the Laplace functional of the thinned point process S^(p) for the non-negative function g(x) = f(x) 1_{x ∈ A} as

    L_{S^(p)}(g) = E E[exp(−∫_A f(x) dN^(p)(x)) | N(A)] = ∑_{n∈Z_+} P{N(A) = n} ∏_{i=1}^n E[e^{−f(S_i)Y(S_i)} | {S_i ∈ A}].

The first equality follows from the definition of the Laplace functional and taking nested expectations. The second equality follows from the fact that, given the count N(A), the locations of the points of a Poisson point process in A are i.i.d. Since Y is a Bernoulli process independent of the underlying process S with E[Y(S_i)] = p(S_i), we get

    E[e^{−f(S_i)Y(S_i)} | {S_i ∈ S ∩ A}] = E[e^{−f(S_i)} p(S_i) + (1 − p(S_i)) | {S_i ∈ S ∩ A}].

From the conditional distribution dΛ(x)/Λ(A) for a point x ∈ S ∩ A of the Poisson point process S, we get

    L_{S^(p)}(g) = e^{−Λ(A)} ∑_{n∈Z_+} (1/n!) (∫_A (p(x) e^{−f(x)} + (1 − p(x))) dΛ(x))^n = exp(−∫_X (1 − e^{−g(x)}) p(x) dΛ(x)).

The result follows from taking an increasing sequence of sets A_k ↑ X and the monotone convergence theorem.
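A quick simulation of ours for the thinning computation above: retaining each point of a rate-λ homogeneous process on [0, 1] independently with constant probability p should leave Poisson counts with mean pλ.

```python
import random

def thinned_ppp(lam, p, rng):
    """Thin a rate-lam homogeneous Poisson process on [0, 1]: each point is
    kept independently with probability p (constant retention function)."""
    pts, s = [], rng.expovariate(lam)
    while s <= 1.0:
        if rng.random() < p:   # Bernoulli mark Y(x) with E[Y(x)] = p
            pts.append(s)
        s += rng.expovariate(lam)
    return pts

rng = random.Random(5)
lam, p, trials = 6.0, 0.25, 20000
mean_kept = sum(len(thinned_ppp(lam, p, rng)) for _ in range(trials)) / trials
```

The empirical mean of the retained count should be close to pλ = 1.5, matching the thinned intensity measure p dΛ.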
Lecture-27: Poisson process on the half-line
(Figure: a sample path of the counting process N(t) against time t, with unit jumps at the arrival instants S̃_1, S̃_2, S̃_3, S̃_4.)
Definition 1.1. The points of discontinuity are also called the arrival instants of the counting process N. The nth arrival instant is a random variable S̃_n : Ω → R_+, defined inductively by S̃_0 ≜ 0 and S̃_n ≜ inf{t > S̃_{n−1} : N_t ⩾ n} for each n ∈ N.

Definition 1.2. The inter-arrival time between the (n−1)th and the nth arrival is denoted by X_n and written as X_n ≜ S̃_n − S̃_{n−1}.
Remark 1. For a simple point process, we have P { Xn = 0} = P { Xn ⩽ 0} = 0.
Lemma 1.3. A simple counting process N : Ω → Z_+^{R_+} and its arrival process S̃ : Ω → R_+^N are inverse processes, i.e. {S̃_n ⩽ t} = {N_t ⩾ n}.
Proof. Let ω ∈ {S̃_n ⩽ t}; then N_{S̃_n} = n by definition. Since N is a non-decreasing process, we have N_t ⩾ N_{S̃_n} = n. Conversely, let ω ∈ {N_t ⩾ n}; then it follows from the definition that S̃_n ⩽ t.
Corollary 1.4. For arrival instants S̃ : Ω → R_+^N associated with a counting process N : Ω → Z_+^{R_+}, we have {N_t = n} = {S̃_n ⩽ t, S̃_{n+1} > t}.

Proof. It is easy to see that {S̃_{n+1} > t} = {S̃_{n+1} ⩽ t}^c = {N_t ⩾ n+1}^c = {N_t < n+1}. Hence,

    {N_t = n} = {N_t ⩾ n, N_t < n+1} = {S̃_n ⩽ t, S̃_{n+1} > t}.
Lemma 1.5. Let F_n(x) be the distribution function for S̃_n; then P_n(t) ≜ P{N_t = n} = F_n(t) − F_{n+1}(t).

Proof. It suffices to observe that the following is a union of disjoint events,

    {S̃_n ⩽ t} = {S̃_n ⩽ t, S̃_{n+1} > t} ∪ {S̃_n ⩽ t, S̃_{n+1} ⩽ t} = {N_t = n} ∪ {S̃_{n+1} ⩽ t},

so that F_n(t) = P{N_t = n} + F_{n+1}(t).
The transition probability matrix is P(s, t), with its (k, n)th entry given by e^{−Λ(s,t]} (Λ(s,t])^{n−k}/(n−k)!.

Remark 2. A Markov process X : Ω → X^R is time homogeneous if the transition matrix P(s, t) = P(t − s) for all t ⩾ s. Thus, the counting process for a homogeneous Poisson point process is a time-homogeneous Markov process, as the transition probability matrix P(s, t) = P(t − s), with its (k, n)th entry given by e^{−λ(t−s)} (λ(t−s))^{n−k}/(n−k)!.
Theorem 2.2. The counting process N : Ω → Z_+^{R_+} associated with a simple Poisson point process S : Ω → R_+^N is strongly Markov.
Proposition 2.3. A simple counting process N : Ω → Z_+^{R_+} is associated with a homogeneous Poisson process with a constant intensity density λ, iff the inter-arrival time sequence X : Ω → R_+^N is a sequence of i.i.d. random variables with an exponential distribution of rate λ.
Proof. Let N be a counting process associated with a homogeneous Poisson point process on the half-line with constant intensity density λ. From equivalence iii in Theorem ??, we obtain for any t ∈ R_+,

    P{X_1 > t} = P{N_t = 0} = e^{−λt}.

It suffices to show that the inter-arrival time sequence X : Ω → R_+^N is i.i.d. We can show that N is a Markov process with the strong Markov property. Since the sequence of ordered points S̃ : Ω → R_+^N is a sequence of stopping times for the counting process, it follows from the strong Markov property of this process that (N_{S̃_n+t} − N_{S̃_n} : t ⩾ 0) is independent of σ(N_s : s ⩽ S̃_n), and hence of S̃_n and N_{S̃_n}. Further, we see that

    X_{n+1} = inf{t > 0 : N_{S̃_n+t} − N_{S̃_n} = 1}.

It follows that X : Ω → R_+^N is an independent sequence. For a homogeneous Poisson point process, we have N_{S̃_n+t} − N_{S̃_n} = N_t in distribution, and hence X_{n+1} has the same distribution as X_1 for each n ∈ N.
For the given i.i.d. inter-arrival time sequence X : Ω → R_+^N distributed exponentially with rate λ, we define the nth arrival instant S̃_n ≜ ∑_{i=1}^n X_i for each n ∈ N, and the number of arrivals in the time duration (0, t] as N_t ≜ ∑_{n∈N} 1_{S̃_n ⩽ t} for all t ∈ R_+. It follows that N_t is path-wise non-decreasing, integer-valued, right-continuous, and simple since P{X_1 ⩽ 0} = 0. Therefore, N is a simple counting process such that

    P{N_t = 0} = P{X_1 > t} = e^{−λt}.

It follows that the void probabilities are exponential, and hence the random variable N_t is Poisson with parameter λt for all t ∈ R_+. Hence, N is a counting process associated with a homogeneous Poisson process with the constant intensity density λ, from equivalence ii in Theorem ??.
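The converse construction in Proposition 2.3 is exactly how one simulates a homogeneous Poisson process on the half-line. A sketch of ours: S̃_n = X_1 + ⋯ + X_n with X_i i.i.d. Exponential(λ), and N_t counts the arrivals in (0, t].

```python
import random

def counting_process(lam, t, rng):
    """N_t for a homogeneous Poisson process built from i.i.d.
    Exponential(lam) inter-arrival times."""
    n, s = 0, rng.expovariate(lam)   # s is the running arrival instant S~_n
    while s <= t:
        n += 1
        s += rng.expovariate(lam)    # S~_{n+1} = S~_n + X_{n+1}
    return n

rng = random.Random(6)
lam, t, trials = 1.5, 2.0, 20000
samples = [counting_process(lam, t, rng) for _ in range(trials)]
mean_nt = sum(samples) / trials      # should be close to lam * t = 3.0
p0 = samples.count(0) / trials       # void probability, close to exp(-lam*t)
```

The empirical mean should be close to λt and the fraction of realizations with no arrivals close to the void probability e^{−λt} ≈ 0.0498.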
For many proofs regarding Poisson processes, we partition the sample space with the disjoint events
{ Nt = n} for n ∈ Z+ . We need the following lemma that enables us to do that.
Lemma 2.4. For any finite time t > 0, the number of points on the interval (0, t] from a Poisson process is finite
almost surely.
Proof. By the strong law of large numbers, we have lim_{n→∞} S̃_n/n = E[X_1] = 1/λ almost surely. Fix t > 0, and define the sample space subset M = {ω ∈ Ω : N(ω, t) = ∞}. For any ω ∈ M, we have S̃_n(ω) ⩽ t for all n ∈ N. This implies lim sup_n S̃_n(ω)/n = 0, and hence ω ∉ {lim_n S̃_n/n = 1/λ}. Hence, the probability measure of the set M is zero.
(b) The distribution function is F_n(t) ≜ P{S̃_n ⩽ t} = 1 − e^{−λt} ∑_{k=0}^{n−1} (λt)^k/k!.

(c) The density function is Gamma distributed with parameters n and λ. That is, f_n(s) = λ(λs)^{n−1} e^{−λs}/(n−1)!.
Corollary 2.6. Consider the counting process N : Ω → Z_+^{R_+} associated with the Poisson arrival process S̃ : Ω → R_+^N having constant intensity density λ. The following are true.

(a) The relation between the distribution of the nth arrival instant and the probability mass function of the counting process is given by F_n(t) = ∑_{j⩾n} P_j(t).

(b) For each t ∈ R_+, the probability mass function P_{N_t} ∈ M(Z_+) for the discrete random variable N_t : Ω → Z_+ is given by P_n(t) ≜ P_{N_t}(n) = P{N_t = n} = e^{−λt} (λt)^n/n!.

(c) The relation between the distribution of the nth arrival instant and the mean of the counting process is given by ∑_{n∈N} F_n(t) = E N_t.

(d) For each t ∈ R_+, the mean E[N_t] = λt, explaining the rate parameter λ for the Poisson process.
Proof. We observe the inverse relationship {S̃_n ⩽ t} = {N_t ⩾ n} for all n ∈ Z_+ and t ∈ R_+.

(a) The result follows by taking the probability on both sides of the inverse relationship, to get F_n(t) = P{S̃_n ⩽ t} = P{N_t ⩾ n} = ∑_{j⩾n} P{N_t = j} = ∑_{j⩾n} P_j(t).

(b) The result follows from the explicit form for the distribution of S̃_n and recognizing that P_n(t) = F_n(t) − F_{n+1}(t).

(c) The result follows from the observation ∑_{n∈N} F_n(t) = E ∑_{n∈N} 1_{N_t ⩾ n} = ∑_{n∈N} P{N_t ⩾ n} = E N_t.

(d) The result follows by summing the distribution functions of the nth arrivals, to get E N_t = ∑_{n∈N} F_n(t) = e^{−λt} ∑_{n∈N} ∑_{k⩾n} (λt)^k/k! = λt e^{−λt} ∑_{k∈N} (λt)^{k−1}/(k−1)! = λt.
Remark 3. A Poisson process is not a stationary process. That is, the finite dimensional distributions are not
shift invariant. This is clear from looking at the first moment ENt = λt, which is linearly increasing in time.
Lecture-28: Compound Poisson Processes
Remark 1. Recall that F_s = σ(Z_u : u ∈ (0, s]) is the collection of historical events until time s associated with the process Z. If S̃_n is almost surely finite for all n ∈ N, then the sequence of jump times S̃ : Ω → R_+^N is a sequence of stopping times with respect to the natural filtration F_• of the process Z.

Remark 2. Let N : Ω → Z_+^{R_+} be the simple counting process associated with the number of jumps of the compound Poisson process Z in (0, t], defined by N_t ≜ ∑_{n∈N} 1_{S̃_n ⩽ t} for all t ∈ R_+. Then, S̃_n and Y_n are respectively the arrival instant and the size of the nth jump, and we can write Z_t = ∑_{i=1}^{N_t} Y_i.
Proposition 1.3. A stochastic process Z : Ω → R^{R_+} is a compound Poisson process iff its jump times form a Poisson process and the jump sizes form an i.i.d. random sequence independent of the jump times.
Proof. We will prove it in two steps.

Implication: Let Z be a compound Poisson process with the jump instant sequence S̃ : Ω → R_+^N and the jump size sequence Y : Ω → R^N. We will show that the counting process N : Ω → Z_+^{R_+} is simple and has stationary and independent increments, and that the jump size sequence Y is i.i.d.

Independence of jumps and increments: From the definition of the jump instant sequence S̃, it follows that the counting process N is adapted to the natural filtration F_• of the compound Poisson process Z. Since Z_{t+s} − Z_t = ∑_{i=N_t+1}^{N_{t+s}} Y_i, and compound Poisson processes have independent increments, it follows that the increments (N_{t+s} − N_t : s ⩾ 0) and (Y_{N_t+j} : j ∈ N) are independent of the past F_t.

Stationarity: Let us assume that the step sizes are positive; then we have

    S̃_n = inf{t > S̃_{n−1} : Z_t > Z_{S̃_{n−1}}}, and {N_{t+s} − N_t = 0} = {Z_{t+s} − Z_t = 0}.

From the stationarity of the increments it follows that the probability P{N_{t+s} − N_t = 0} is independent of t and equal to e^{−λs} for some λ ∈ R_+. It follows that the counting process N : Ω → Z_+^{R_+} has stationary increments, and the associated jump sequence S̃ is homogeneous Poisson with intensity density λ.

Strong Markovity: The compound Poisson process has the Markov property from the stationary and independent increment property. Further, since each sample path t ↦ Z_t is right-continuous, the process satisfies the strong Markov property at each almost surely finite stopping time.
Inter-jump times i.i.d.: We will inductively show that S̃ : Ω → R_+^N is a sequence of stopping times, and hence that the inter-jump time sequence X : Ω → R_+^N defined by X_n ≜ S̃_n − S̃_{n−1} for each n ∈ N is an i.i.d. sequence. From the exponential distribution of S̃_1, it follows that it is almost surely finite and hence a stopping time. From the stationarity of the increments of the compound Poisson process and the strong Markov property at the stopping time S̃_1, the inter-jump time S̃_2 − S̃_1 is independent of F_{S̃_1} and identical in distribution to S̃_1. The result follows inductively.

Jump sizes i.i.d.: From the strong Markov property of Z, the jump size Y_n is independent of the past F_{S̃_{n−1}} = σ(Z_u : u ⩽ S̃_{n−1}), and from stationarity it is identically distributed to Y_1 for each n ∈ N. It follows that the jump size sequence Y is i.i.d. and independent of the jump instant sequence S̃.

Superposition: Similar arguments can be used for negative jump sizes. For real jump sizes, we can form two independent Poisson processes with negative and positive jumps, and the superposition of these two processes is Poisson.
Converse: Let X : Ω → R_+^N be an i.i.d. inter-jump sequence distributed exponentially with rate λ, and let Y : Ω → R^N be an i.i.d. jump size sequence independent of X. We can define the jump instant sequence S̃ : Ω → R_+^N as S̃_n ≜ ∑_{i=1}^n X_i for each n ∈ N, the counting process for the number of jumps N : Ω → Z_+^{R_+} as N_t ≜ ∑_{n∈N} 1_{S̃_n ⩽ t} for each t ∈ R_+, and the compound process Z : Ω → R^{R_+} as Z_t ≜ ∑_{n=1}^{N_t} Y_n for each t ∈ R_+.

Finitely many jumps: Since N_t is finite for any finite t, it follows that the compound Poisson process Z has finitely many jumps in finite intervals.

Independence of increments: For any finite n ∈ N and disjoint finite intervals I_i for i ∈ [n], we can write Z(I_i) = ∑_{k=1}^{N(I_i)} Y_i^k, where Y_i^k denotes the kth jump size in the interval I_i. Since the independent sequence (N(I_i) : i ∈ [n]) and Y : Ω → R^N are also mutually independent, it follows that the increments Z(I_i) are independent.

Stationarity of increments: Further, the stationarity of the increments of the compound process is inferred from the distribution of Z(I_i), which is

    P{Z(I_i) ⩽ x} = ∑_{m∈Z_+} P{Z(I_i) ⩽ x, N(I_i) = m} = ∑_{m∈Z_+} P{∑_{k=1}^m Y_i^k ⩽ x} P{N(I_i) = m}.
Now define Z_t ≜ Y_{N_t} as the amount spent by the customers arriving until time t ∈ R_+. Then Z : Ω → R^{R_+} is a compound Poisson process.
• Let the times between successive failures of a machine be independent and exponentially distributed. The cost of repair at each failure is an i.i.d. random amount. Then the total cost of repair up to a certain time t is a compound Poisson process.
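Both examples fit the converse construction of Proposition 1.3. A simulation sketch of ours: with Exponential(λ) inter-jump times and i.i.d. jump sizes Y independent of them, the mean of Z_t should be E[N_t] · E[Y_1] = λt E[Y_1], by Wald's identity.

```python
import random

def compound_poisson(lam, t, jump, rng):
    """Z_t = sum of i.i.d. jump sizes over the Poisson jump times in (0, t]."""
    z, s = 0.0, rng.expovariate(lam)
    while s <= t:
        z += jump(rng)            # i.i.d. jump size, independent of jump times
        s += rng.expovariate(lam)
    return z

rng = random.Random(7)
lam, t, trials = 2.0, 1.0, 20000
jump = lambda r: r.uniform(0.0, 4.0)   # jump sizes with E[Y_1] = 2.0
mean_z = sum(compound_poisson(lam, t, jump, rng) for _ in range(trials)) / trials
# Wald's identity: E[Z_t] = lam * t * E[Y_1] = 2 * 1 * 2 = 4.
```

The empirical mean of Z_t over many realizations should be close to 4, with Monte Carlo error shrinking as the number of trials grows.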