Course Notes
Semester 2, 2023-2024
Prahlad Vaidyanathan
Contents

Two Experiments
I. Probability Spaces
   1. Examples of Random Phenomena
   2. Probability Spaces
   3. Properties of Probabilities
   4. Conditional Probability
   5. Independence
...
   2. Distribution of Sums
   3. Conditional Densities
   4. Properties of Multivariate Distributions
VIII. Moment Generating Functions and Characteristic Functions
   1. Moment Generating Function
   2. Characteristic Functions
   3. Inversion Formulas and the Continuity Theorem
   4. The Weak Law of Large Numbers and the Central Limit Theorem
   5. Review and Important Formulae
Two Experiments
Remark 0.1 (The Monty Hall Problem). Imagine you are on a game show. There are
three doors, one with a prize behind it.
[Figure: three doors, labelled A, B, C]
You’re allowed to pick any door, so you choose the first one at random, door A.
Before opening Door A, the rules of the game require the host to open one of the other
doors and let you switch your choice if you want. Because the host doesn’t want to give
away the game, they always open an empty door. In your case, the host opens door C:
no prize, as expected. “Do you want to switch to door B?” the host asks.
Remark 0.2. Now the same problem but with 100 doors.
[Figure: 100 doors, numbered 1 to 100]
You again pick Door 1. The host now opens all doors except Door 72. Now should you
switch to Door 72?
Here, you have to ask yourself: Why did they not open Door 72? Almost certainly
because that’s where the prize is hidden! Maybe you got really lucky and picked right
with the first door at the beginning. But it’s way more likely you didn’t, and now Door
72 must be the right one!
Solution: In the Monty Hall Problem, it is always correct to switch doors, as doing so increases your chance of success. Indeed, in the first problem, switching doubles your chance of winning!
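This claim is easy to check empirically. Below is a minimal Monte Carlo sketch (the function names are ours, not from the notes) comparing the two strategies, for 3 doors and for 100 doors:

```python
import random

def play(switch, n_doors, rng):
    """One round: pick a door; the host opens all other doors but one, never the prize."""
    prize = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    # Staying wins iff the first pick was right. Switching wins iff it was
    # wrong, because the single unopened door must then hide the prize.
    return (pick != prize) if switch else (pick == prize)

def estimate(switch, n_doors=3, trials=100_000, seed=0):
    rng = random.Random(seed)
    wins = sum(play(switch, n_doors, rng) for _ in range(trials))
    return wins / trials

print(estimate(switch=False))             # stays near 1/3
print(estimate(switch=True))              # stays near 2/3
print(estimate(switch=True, n_doors=100)) # stays near 99/100
```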
Remark 0.3 (The Gambler’s Fallacy). In a game of Roulette, a wheel is spun in one
direction and a ball placed on it is spun in the other direction. The ball eventually stops
and lands in one of 37 slots on the edge of the wheel. The slot is coloured either red or
black (The zero slot is coloured green).
When you reach the table, you are told that the ball has landed in a black slot 26 times
in a row. So is the next one likely to be red?
Answer 1: Yes! Surely a red is long overdue after so many blacks in a row.
Answer 2: No, the next one is likely to be black. The game must be rigged to only give blacks!
Answer 3: No, the next one is equally likely to be black or red. Each roll of the wheel is a purely random event, similar to a coin flip. The 26 blacks so far are merely a coincidence.
Solution: Indeed, Answer 3 is correct (See Example I.5.4).
(End of Day 1)
I. Probability Spaces
1. Examples of Random Phenomena
Definition 1.1. Probability Theory is the study of random phenomena.
(ii) Examples:
(a) Addition of two numbers is deterministic.
(b) If a ball is dropped from a height d metres in a vacuum, the time taken to
reach the bottom is deterministic.
(c) Flipping a coin is a random phenomenon.
(d) The outcome of an election is random.
(e) Other random phenomena: Card games, DRS system in cricket, effects of
climate change (in practice), Expansion of a species in a region, etc.
(iii) A random phenomenon cannot be predicted exactly, but many such phenomena
show statistical regularity. i.e. If you repeat the experiment many many times, the
relative frequency with which a single outcome occurs may be predicted.
(iv) Example: If you toss a coin, the outcome is either H or T . If you toss the coin n
times, the numbers
    N_n(H) := (Number of H)/n   and   N_n(T) := (Number of T)/n
are called the relative frequencies. We know from experience that
    N_n(H) → 1/2   as n → ∞.
The number 1/2 is thus called the probability that H occurs.
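This statistical regularity is easy to watch emerge in a simulation (a sketch; the helper name is ours):

```python
import random

def relative_frequency_of_heads(n, seed=0):
    """Toss a fair coin n times and return N_n(H)."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n

# As n grows, N_n(H) settles near 1/2:
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```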
Example 1.2. Suppose a 20-sided die is rolled, and the number on top is noted.
(i) This is a random phenomenon, so we repeat the experiment n times and write the relative frequency of seeing a k (for 1 ≤ k ≤ 20) as
    N_n(k) := (Number of times k is rolled)/n.
Then, in one particular run of n = 10 rolls, we might observe
    N_10(1) = 2/10
    N_10(2) = 0
    ...
    N_10(17) = 1/10
    N_10(18) = 2/10
    ...
For each 1 ≤ k ≤ 20, let p_k = 1/20. Then we expect that
    lim_{n→∞} N_n(k) = p_k.
Observe that
    0 ≤ p_k ≤ 1, and
    p_1 + p_2 + . . . + p_20 = 1.
(ii) Suppose we now want more information from this experiment. For instance, we
may want to know how many even numbers have been rolled. Once again, we may
take the relative frequency
    N_n(even) := (Number of times an even number is rolled)/n.
And we may consider
    p_even := lim_{n→∞} N_n(even).
(iii) Suppose the faces are coloured red (1,2,3,4,8,9,10), blue (5,6,7,11,18,19,20) and
yellow (the rest). We may now want to know pyellow . Again, it is clear that
    p_yellow = Σ_{k is yellow} p_k.
(iv) In order to make all these experiments more systematic, we introduce the idea of a set
Ω := {w1 , w2 , . . . , w20 }.
where each wk corresponds to the number k ∈ {1, 2, . . . , 20}. We now assign
wk 7→ pk .
To each subset A ⊂ Ω we may then associate a probability P(A). This set A might be {the set of even numbers} or {the numbers painted yellow} or anything else. The function
    A ↦ P(A)
captures all of the quantities above at once. Therefore, this approach lets us ask (and answer) more questions from the same experiments.
Example 1.3. Consider a pointer that is free to spin about the centre of a circle of
radius 1. If the pointer is spun, it comes to rest at an angle θ (in radians) from the X
axis. This is the outcome of the experiment.
(i) Again, this is a random phenomenon. For each t ∈ [0, 2π), θ can take the value t,
so we may set
Ω := [0, 2π)
(ii) Given a point, say t = 1, what is the probability that a random spin will come to rest at t? Here, the idea of relative frequency is problematic because there are infinitely many choices. Instead, we may ask different questions.
(iii) What is the probability that a random spin will come to rest in such a way that θ
lies in the upper half-circle? Answer: 1/2 of course. So if
A = {t ∈ [0, 2π) : 0 ≤ t ≤ π} ⇒ P (A) = 1/2
Similarly,
A = {t ∈ [0, 2π) : 0 ≤ t ≤ π/2} ⇒ P (A) = 1/4.
More generally, if s ∈ [0, 2π) and
    A_s = {t ∈ [0, 2π) : 0 ≤ t ≤ s}, then P(A_s) = s/(2π).
The same is true for any sector whose angle at the origin is s. Hence, if
    A_[r,s] = {t ∈ [0, 2π) : r ≤ t ≤ s}, then P(A_[r,s]) = (s − r)/(2π).
(iv) Given a subset A ⊂ Ω, can we measure P (A)? Answer: Not always, but it is
possible for many subsets.
(v) If s ∈ [0, 2π) is fixed and A = {s}, what is P (A)?
Solution: Consider
    A_n = {t : s − 1/n ≤ t ≤ s} ⇒ P(A_n) = 1/(2nπ).
Now, {s} ⊂ A_n, so by a theorem we will prove later, it will follow that
    0 ≤ P({s}) ≤ P(A_n) = 1/(2nπ).
This is true for each n ∈ N so P ({s}) = 0. Hence, the probability of hitting any
one angle is zero!
2. Probability Spaces
Given a random phenomenon, we first need a set Ω consisting of all possible outcomes.
For some subsets A ⊂ Ω, we must assign a probability
A 7→ P (A).
Such a subset is called an event. We say that an event A occurs if a given outcome w belongs to A. The function P must satisfy natural conditions, such as
    P(∅) = 0 and P(Ω) = 1.
To make this precise, we first require that the collection A of events form a σ-algebra:
(i) ∅ ∈ A.
(ii) If A ∈ A, then A^c ∈ A.
(iii) If A_1, A_2, . . . ∈ A, then ⋃_{n=1}^∞ A_n ∈ A.
Example 2.2.
(iii) If Ω = [a, b], there is a ‘nice’ σ-algebra L that contains all intervals in Ω. This is
called the Lebesgue σ-algebra.
A probability measure is then a function P : A → [0, ∞) such that:
(i) P(Ω) = 1.
(ii) If A_1, A_2, . . . ∈ A are mutually disjoint, then P(⊔_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).
The triple (Ω, A, P) is called a probability space.
Example 2.4.
(i) Let Ω be a finite set and A = P(Ω). Then we may define
    P(A) := |A|/|Ω|,
where |·| denotes the cardinality. This probability space is called a symmetric probability space.
(ii) Let Ω = {H, T } and A = P(Ω) = {∅, {H}, {T }, Ω}. Define P : A → [0, ∞) by
    P(∅) = 0, P({H}) = 1/2, P({T}) = 1/2, P(Ω) = 1.
This is an example of a symmetric probability space.
(End of Day 2)
(iii) In the previous example, we may also define
    P(∅) = 0, P({H}) = 1/3, P({T}) = 2/3, P(Ω) = 1.
This is a probability space, but is not symmetric.
(iv) Let Ω = [a, b] and A = L. There is a function µ : L → [0, ∞) such that
µ((c, d)) = µ((c, d]) = µ([c, d)) = |d − c|
for all a ≤ c < d ≤ b. The function P : L → [0, ∞) given by
P (A) := µ(A)/(b − a)
is a probability measure on Ω. This probability space is called the uniform probability space.
(v) If Ω ⊂ Rd is a ‘nice’ set, we may define a similar σ-algebra L that contains all
rectangles of the form
    ∏_{i=1}^d (a_i, b_i).
This is also called the Lebesgue σ-algebra, and it also carries a Lebesgue measure µ so that
    µ(∏_{i=1}^d (a_i, b_i)) = ∏_{i=1}^d (b_i − a_i).
Remark 2.5.
(i) If Ω is a finite set, we will typically take A = P(Ω).
(ii) If Ω is infinite, it will almost always occur as a subset of Rd for d ∈ {1, 2, 3}. In
that case, we will usually use some modification of the Lebesgue measure as above.
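For a finite Ω with A = P(Ω), probabilities of events can be computed mechanically from the point masses. A minimal sketch (the helper name `make_P` is ours):

```python
from fractions import Fraction

def make_P(point_mass):
    """Given a map w -> p_w with masses summing to 1, return P acting on events."""
    assert sum(point_mass.values()) == 1
    def P(event):
        return sum(point_mass[w] for w in event)
    return P

# The symmetric probability space on a six-sided die:
omega = range(1, 7)
P = make_P({w: Fraction(1, 6) for w in omega})
evens = {2, 4, 6}
print(P(evens))  # 1/2
```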
3. Properties of Probabilities
Lemma 3.1. Suppose (Ω, A, P) is a probability space. Then:
(i) For any A, B ∈ A,
    P(B) = P(B ∩ A) + P(B ∩ A^c).
(ii) For any A ∈ A, P(A^c) = 1 − P(A).
(iii) P(∅) = 0.
(iv) If A ⊂ B, then
    P(A) ≤ P(B).
(v) If B ⊂ A, then
    P(A) − P(B) = P(A \ B),
where A \ B = A ∩ B^c.
(vi) If A_1, A_2, . . . ∈ A, then
    P(⋂_{n=1}^∞ A_n^c) = 1 − P(⋃_{n=1}^∞ A_n).
Proof.
(i) Note that
    B = (A ∩ B) ∪ (A^c ∩ B)
and these two sets are disjoint. Therefore,
    P(B) = P(A ∩ B) + P(A^c ∩ B).
(iv) If A ⊂ B, then A ∩ B = A, so by part (i),
    P(B) = P(A ∩ B) + P(B ∩ A^c) = P(A) + P(B ∩ A^c) ≥ P(A),
because P(B ∩ A^c) ≥ 0, as desired.
(vi) This follows from part (ii) and the fact that
    (⋃_{n=1}^∞ A_n)^c = ⋂_{n=1}^∞ A_n^c
by De Morgan's laws.
Remark 3.2. Consider part (vi) above: each A_n represents an event, so if B := ⋃_{n=1}^∞ A_n, then P(B) is the probability that at least one of these events occurs. Hence,
    P(⋂_{n=1}^∞ A_n^c) = 1 − P(B) = probability that none of these events occur.
Example 3.3. Suppose three perfectly balanced and identical coins are tossed. Find
the probability that at least one of them lands heads.
Solution: Each coin C1 , C2 , C3 lands with two possibilities H, T . So there are 8 possible
outcomes of this experiment. In other words, |Ω| = 8. Let A1 be the event that C1 lands
heads. Let A2 and A3 be defined analogously. We are then looking for
P (A1 ∪ A2 ∪ A3 ).
However, D := Ac1 ∩ Ac2 ∩ Ac3 is the event that none of them lands heads. In other words,
each of them lands tails. So, |D| = 1 and
    P(D) = 1/8.
Thus, P(A_1 ∪ A_2 ∪ A_3) = 1 − P(D) = 7/8.
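The complement trick above is easy to confirm by brute-force enumeration of Ω (a sketch):

```python
from itertools import product

omega = list(product("HT", repeat=3))        # all 8 outcomes of 3 tosses
at_least_one_H = [w for w in omega if "H" in w]
none_H = [w for w in omega if "H" not in w]  # the event D

# P(A1 ∪ A2 ∪ A3) = 1 − P(D) in the symmetric space:
p = len(at_least_one_H) / len(omega)
print(p)  # 0.875
```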
(End of Day 3)
From now onwards, (Ω, A, P ) will denote a fixed probability space. We also write ⊔ to
denote disjoint union (implicitly implying that the sets in question are mutually disjoint).
Lemma 3.4. Let A, B ∈ A and A_1, A_2, . . . , A_n ∈ A. Then:
(i) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(ii) P(⋃_{i=1}^n A_i) ≤ Σ_{i=1}^n P(A_i).
Proof.
(i) By Lemma 3.1(i),
    P(A) = P(A ∩ B) + P(A ∩ B^c)
    ⇒ P(A) − P(A ∩ B) = P(A ∩ B^c)
    ⇒ P(A) + P(B) − P(A ∩ B) = P(B) + P(A ∩ B^c) = P(A ∪ B),
since A ∪ B = B ⊔ (A ∩ B^c).
(ii) (a) For two sets, part (i) gives P(A_1 ∪ A_2) ≤ P(A_1) + P(A_2), since P(A_1 ∩ A_2) ≥ 0.
(b) Now suppose the result is true for k = (n − 1), and consider n sets as above. Write B_1 := A_1 ∪ A_2 ∪ . . . ∪ A_{n−1} and B_2 := A_n. Then,
    P(⋃_{i=1}^n A_i) = P(B_1 ∪ B_2)
        ≤ P(B_1) + P(B_2)
        = P(⋃_{i=1}^{n−1} A_i) + P(A_n)
        ≤ Σ_{i=1}^{n−1} P(A_i) + P(A_n).
Theorem 3.5. Let {A_1, A_2, . . .} be a countable collection of events.
(i) If A_1 ⊂ A_2 ⊂ . . . and A = ⋃_{n=1}^∞ A_n, then P(A) = lim_{n→∞} P(A_n).
(ii) If A_1 ⊃ A_2 ⊃ . . . and A = ⋂_{n=1}^∞ A_n, then P(A) = lim_{n→∞} P(A_n).
Proof.
(i) Let B_1 := A_1 and for n ≥ 2, set B_n := A_n ∩ A_{n−1}^c. Then, {B_1, B_2, . . .} are mutually disjoint, with ⋃_{k=1}^n B_k = A_n and ⋃_{n=1}^∞ B_n = A. Hence,
    P(A) = Σ_{n=1}^∞ P(B_n) = lim_{n→∞} Σ_{k=1}^n P(B_k) = lim_{n→∞} P(A_n).
(ii) Let B_n := A_n^c; then B_1 ⊂ B_2 ⊂ . . .. So if B := ⋃_{n=1}^∞ B_n, then by part (i),
    P(B) = lim_{n→∞} P(B_n).
However,
    B^c = ⋂_{n=1}^∞ B_n^c = ⋂_{n=1}^∞ A_n = A.
Hence,
    P(A) = 1 − P(B) = lim_{n→∞} (1 − P(B_n)) = lim_{n→∞} P(A_n).
4. Conditional Probability
Example 4.1. Suppose a box has r red balls labelled 1, 2, . . . , r, and b black balls
labelled 1, 2, . . . , b. Assume that the probability of choosing any one ball is
    1/(b + r)
(in other words, this is a uniform probability). Suppose that a ball is drawn from the
box, and that it is known to be red. What is the probability that it is labelled 1?
Solution: The probability space is Ω = {r1, r2, . . . , rr, b1, b2, . . . , bb}. Consider two
events:
A := chosen ball is red = {r1, r2, . . . , rr}
B := chosen ball is number 1 = {r1, b1}
We are not looking for P (A ∩ B). Indeed, we are given that A has occurred, and we
wish to find the probability that B will occur.
Notice that
    lim_{n→∞} N_n(A ∩ B)/N_n(A) = P(A ∩ B)/P(A).
Hence the definition: the conditional probability of B given A is P(B|A) := P(A ∩ B)/P(A). In our example,
    P(B|A) = P(A ∩ B)/P(A) = |A ∩ B|/|A| = 1/r,
whereas the unconditional probability is
    P(B) = 2/(b + r).
(End of Day 4)
Example 4.5. Suppose that the population of a city is 40% male and 60% female.
Also, 50% of the males smoke and 30% of the females smoke. Find the probability that
a smoker is male.
Solution: Here, Ω is the set of all people in the city. Consider the events M := {the person is male}, F := {the person is female}, and S := {the person smokes}. We are given
    P(M) = 0.4
    P(F) = 0.6
    P(S|M) = 0.5
    P(S|F) = 0.3,
and we wish to find
    P(M|S) = P(M ∩ S)/P(S).
Note that
    P(M ∩ S) = P(S|M)P(M) = 0.5 × 0.4 = 0.2
    P(S ∩ F) = P(S|F)P(F) = 0.3 × 0.6 = 0.18
    P(S) = P(S ∩ M) + P(S ∩ F) = 0.2 + 0.18 = 0.38
    ⇒ P(M|S) = 0.2/0.38 ≈ 0.526.
Observe that
    P(M|S) = P(M)P(S|M) / (P(M)P(S|M) + P(F)P(S|F)).
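This arithmetic is mechanical enough to script; a sketch (function name ours) that computes the posterior over a partition from priors and conditional probabilities:

```python
def posterior(prior, likelihood):
    """P(A_i | B) for a partition {A_i}: prior[i] = P(A_i), likelihood[i] = P(B | A_i)."""
    joint = {k: prior[k] * likelihood[k] for k in prior}
    total = sum(joint.values())     # P(B), by total probability
    return {k: v / total for k, v in joint.items()}

# The smoker example: partition {M, F}, conditioning event S.
post = posterior({"M": 0.4, "F": 0.6}, {"M": 0.5, "F": 0.3})
print(post["M"])  # P(M|S) = 0.2/0.38 ≈ 0.526
```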
Theorem 4.6 (Bayes' Rule). Let A_1, A_2, . . . , A_n be n mutually disjoint events with Ω = ⊔_{i=1}^n A_i. Let B be an event with P(B) > 0. Then,
    P(A_i|B) = P(A_i)P(B|A_i) / Σ_{k=1}^n P(A_k)P(B|A_k).
Proof. We have
    B = B ∩ (⊔_{k=1}^n A_k) = ⊔_{k=1}^n (A_k ∩ B).
Hence,
    P(B) = Σ_{k=1}^n P(A_k ∩ B) = Σ_{k=1}^n P(A_k)P(B|A_k).
Moreover,
    P(A_i ∩ B) = P(A_i)P(B|A_i).
Taking a ratio gives us Bayes' rule.
Example 4.7. Consider the Monty Hall Problem from Remark 0.1: Imagine you are on a game show. There are three doors, one with a prize behind it.
[Figure: three doors, labelled A, B, C]
You’re allowed to pick any door, so you choose the first one at random, door A.
Before opening Door A, the rules of the game require the host (Monty Hall) to open one
of the other doors and let you switch your choice if you want. Because the host doesn’t
want to give away the game, they always open an empty door. In your case, the host
opens door C: no prize, as expected. “Do you want to switch to door B?” the host asks.
Solution: Once you have picked Door A, there are four possible outcomes, which we label by the prize's location and the door the host then opens:
    Ab (prize behind A, host opens B): P({Ab}) = 1/3 × 1/2 = 1/6
    Ac (prize behind A, host opens C): P({Ac}) = 1/3 × 1/2 = 1/6
    Bc (prize behind B, host opens C): P({Bc}) = 1/3
    Cb (prize behind C, host opens B): P({Cb}) = 1/3
Define
    A := the event that the prize is behind Door A = {Ab, Ac}
    B := the event that the prize is behind Door B = {Bc}
    C := the event that the prize is behind Door C = {Cb}
    D := the event that the host opens Door C = {Ac, Bc}.
Then,
    P(A|D) = P(A ∩ D)/P(D) = P({Ac}) / (P({Ac}) + P({Bc})) = (1/6) / (1/6 + 1/3) = 1/3
and
    P(B|D) = P({Bc}) / (1/6 + 1/3) = 2/3.
Hence, it is correct to switch to Door B!
5. Independence
Definition 5.1. Two events A and B are said to be independent if
P (A ∩ B) = P (A)P (B).
Remark 5.2.
(ii) We say that events {A_1, A_2, . . . , A_n} are mutually independent if for any 1 ≤ i_1 < i_2 < . . . < i_k ≤ n, we have
    P(A_{i_1} ∩ A_{i_2} ∩ . . . ∩ A_{i_k}) = P(A_{i_1})P(A_{i_2}) · · · P(A_{i_k}).
(iii) Consider Ω = {1, 2, 3, 4} with each point having probability 1/4. Let A = {1, 2}, B = {1, 3}, and C = {1, 4}. Then,
    P(A ∩ B) = P({1}) = 1/4 = P(A)P(B)
    P(B ∩ C) = P(B)P(C)
    P(A ∩ C) = P(A)P(C), but
    P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
So the events {A, B, C} are pairwise independent but not mutually independent.
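This example is small enough to verify exhaustively. A sketch, assuming the standard choice of events A = {1, 2}, B = {1, 3}, C = {1, 4} (the notes' own sets were garbled in extraction):

```python
from fractions import Fraction
from itertools import combinations

omega = {1, 2, 3, 4}
P = lambda E: Fraction(len(E), len(omega))   # uniform probability
A, B, C = {1, 2}, {1, 3}, {1, 4}

# Pairwise independence: check every pair.
pairwise = all(P(X & Y) == P(X) * P(Y) for X, Y in combinations([A, B, C], 2))
# Mutual independence additionally needs the triple intersection to factor.
mutual = pairwise and P(A & B & C) == P(A) * P(B) * P(C)
print(pairwise, mutual)  # True False
```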
Example 5.3. Consider an experiment where a coin is tossed n times with the condition
that
    P({H}) = p and P({T}) = 1 − p
for some fixed 0 ≤ p ≤ 1. Construct the corresponding probability space.
Solution:
(i) The experiment has 2^n possible outcomes, so we may take
    Ω := {(x_1, x_2, . . . , x_n) : x_i ∈ {0, 1}},
where 1 denotes H and 0 denotes T. We must assign a probability to each singleton
    {x},
where x has k copies of 1 and (n − k) copies of 0. Assume without loss of generality that
    x = (1, 1, . . . , 1, 0, 0, . . . , 0),
with k ones followed by (n − k) zeros.
(ii) Let A_i be the event that the i-th toss yields a H. Then, the events {A_1, A_2, . . . , A_n} are mutually independent and
    P(A_i) = p
for all 1 ≤ i ≤ n. Hence,
    P({x}) = P(A_1 ∩ . . . ∩ A_k ∩ A_{k+1}^c ∩ . . . ∩ A_n^c) = p^k (1 − p)^{n−k}.
(iii) Let B_k be the event that a H occurs precisely k times; then B_k is made up of all \binom{n}{k} vectors with exactly k entries equal to 1. Hence,
    P(B_k) = \binom{n}{k} p^k (1 − p)^{n−k}.
(iv) The events {B_0, B_1, . . . , B_n} are all mutually disjoint and their union is Ω. Therefore, we get
    1 = P(Ω) = Σ_{k=0}^n P(B_k) = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k}.
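The identity above is just the binomial theorem applied to (p + (1 − p))^n, and it is easy to confirm numerically (a sketch; the helper name is ours):

```python
from math import comb

def binomial_pmf(n, p):
    """P(B_k) = C(n, k) p^k (1-p)^(n-k) for k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

pmf = binomial_pmf(10, 0.3)
total = sum(pmf)
print(total)  # numerically 1
```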
Example 5.4. Consider the Gambler's Fallacy from Remark 0.3: In a game of Roulette,
a wheel is spun in one direction and a ball placed on it is spun in the other direction.
The ball eventually stops and lands in one of 37 slots on the edge of the wheel. The slot
is coloured either red or black (The zero slot is coloured green).
When you reach the table, you are told that the ball has landed in a black slot 26 times
in a row. So is the next one likely to be red?
Answer 1: Yes! Surely a red is long overdue after so many blacks in a row.
Answer 2: No, the next one is likely to be black. The game must be rigged to only give blacks!
Answer 3: No, the next one is equally likely to be black or red. Each roll of the wheel is a purely random event, similar to a coin flip. The 26 blacks so far are merely a coincidence.
Solution: Let Ai be the event that the ith ball lands in a black slot. Then, the Ai are
mutually independent and
    P(A_i) = 18/37.
Hence,
    P(A_1 ∩ A_2 ∩ . . . ∩ A_27) = (18/37)^27 ≈ 3.5 × 10^{−9}.
However, this is not the probability we are interested in! We are interested in
    P(A_27 | ⋂_{i=1}^{26} A_i) = P(A_27) = 18/37 ≈ 0.486.
In other words, Answer 3 is correct.
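Independence of spins can also be seen in simulation: among spins that immediately follow a streak of blacks, black still occurs with frequency about 18/37. A sketch (names ours; 26-black streaks are far too rare to condition on directly, so we use shorter streaks):

```python
import random

def next_after_streak(streak_len, trials=200_000, seed=1):
    """Frequency of black on spins immediately following streak_len consecutive blacks."""
    rng = random.Random(seed)
    run, hits, total = 0, 0, 0
    for _ in range(trials):
        black = rng.randrange(37) < 18    # 18 of the 37 slots are black
        if run >= streak_len:             # we are just after a long black streak
            total += 1
            hits += black
        run = run + 1 if black else 0     # extend or reset the streak
    return hits / total

print(next_after_streak(5))  # stays near 18/37 ≈ 0.486
```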
II. Combinatorial Analysis
Remark 0.1. Consider the following types of counting problems:
(i) A multiple choice form has 20 questions; each question has 3 choices. In how many
possible ways can the exam be completed?
(ii) Consider a horse race with 10 horses. How many ways are there to gamble on the placings (1st, 2nd, 3rd)?
(iii) How many different throws are possible with 3 dice? (A ‘throw’ is an unordered
set of the form {1, 4, 5}).
(iv) Veena has a collection of 20 CDs, she wants to take 3 of them to work. How many
possibilities does she have?
Each of these counting problems can be cast as a special case of a single question:
Consider an Urn (a bowl) with n different labelled balls in it. Suppose that k balls are
to be chosen from this urn and the numbers noted. This can be done in a few different
ways:
Balls can be drawn one-by-one or all k can be drawn together. In the first case,
the order in which the balls are chosen can be noted. In the second case, this is
not possible.
Once a ball is drawn, it may be put back into the urn before the second ball
is drawn (or it may not be put back). This is called drawing with or without
replacement.
(End of Day 6)
[Figure: an urn containing six labelled balls]
In each case, the k chosen balls are called a sample. We will now write
(1, 2, 3)
for an ordered sample, and
{1, 2, 3}
for an unordered sample. There are now four different experiments, which we enumerate.
The examples given above fall into the following categories:
(i) Drawing 20 balls from an urn containing 3 balls, noting the order, with replace-
ment.
(ii) Drawing 3 balls from an urn containing 10 balls, noting the order, without replace-
ment.
(iii) Drawing 3 balls from an urn containing 6 balls, without noting the order, with
replacement.
(iv) Drawing 3 balls from an urn containing 20 balls, without noting the order, without
replacement.
2. Ordered Sample without Replacement
(Permutations)
Here, each outcome is an arrangement of the numbers in a given order. Such an ar-
rangement is called a permutation.
For example, with n = 4 and k = 3, the outcomes include
    (1, 2, 3)
    (1, 2, 4)
    ...
In general, the number of such samples is
    ^nP_k := n × (n − 1) × . . . × (n − k + 1) = n!/(n − k)!.
3. Unordered Sample without Replacement
(Combinations)
Here, each outcome is a subset of size k. For example, with n = 4 and k = 3, the possible outcomes are
    {1, 2, 3}
    {1, 2, 4}
    {1, 3, 4}
    {2, 3, 4}.
The number of such samples is
    \binom{n}{k} := n!/(k! (n − k)!).
4. Unordered Sample with Replacement
Definition 4.1. A multiset is a set of the form
    M = {(a, m(a)) : a ∈ A},
where m(a) ∈ N denotes the multiplicity of the element a ∈ A. For instance, we write
    M = {1 · a, 3 · b, 6 · c}
for the multiset containing one copy of a, three copies of b, and six copies of c.
Example 4.2. If n = 4 and k = 3, the possible outcomes are all multisets of the form
    {d_1 · 1, d_2 · 2, d_3 · 3, d_4 · 4},
where d_1 + d_2 + d_3 + d_4 = 3.
Theorem 4.3. The total number of distinct k-samples from an n-element set such that repetition is allowed and ordering does not matter is
    \binom{n+k−1}{k} = \binom{n+k−1}{n−1}.
Proof. The number we are after is the same as the number of distinct solutions to the
equation
d1 + d2 + . . . + dn = k
where d_i ∈ {0, 1, 2, . . .}. Given such a tuple (d_1, d_2, . . . , d_n), we associate to it a list of the form
    +, +, . . . , + (d_1 times), −, +, +, . . . , + (d_2 times), −, . . . , −, +, +, . . . , + (d_n times),
which has precisely (n − 1) minus signs. We can think of this problem as having (n + k − 1) positions to fill, of which (n − 1) must be '−' signs and the remaining are '+' signs. There are a total of
    \binom{n+k−1}{n−1}
such solutions.
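The count in the proof can be checked by brute force for small n and k (a sketch; the function name is ours):

```python
from itertools import product
from math import comb

def count_solutions(n, k):
    """Number of tuples (d_1, ..., d_n) with d_i >= 0 summing to k, by enumeration."""
    return sum(1 for d in product(range(k + 1), repeat=n) if sum(d) == k)

print(count_solutions(4, 3), comb(4 + 3 - 1, 4 - 1))  # both 20
```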
(End of Day 7)
In our examples at the beginning of the section, we have:
(i) A multiple choice form has 20 questions; each question has 3 choices. In how many
possible ways can the exam be completed?
Solution: 320 .
(ii) Consider a horse race with 10 horses. How many ways are there to gamble on the
placings (1st, 2nd, 3rd).
Solution: ^10P_3 = 10!/7! = 720.
(iii) How many different throws are possible with 3 dice?
Solution: \binom{6+3−1}{3} = \binom{8}{3} = 56.
(iv) Veena has a collection of 20 CDs, she wants to take 3 of them to work. How many
possibilities does she have?
Solution: \binom{20}{3} = 1140.
5. Examples
Example 5.1 (The Birthday Problem). Assume that people’s birthdays occur with
equal probability on each of the 365 days of the year. Given a group of n people, find
the probability that no two people in the group have the same birthday.
Solution: Here, the k-th person has a birthday b_k ∈ {1, 2, . . . , 365} =: S. Hence, the ordered list of all birthdays is given by a tuple
    (b_1, b_2, . . . , b_n).
Since birthdays can repeat, the set Ω consists of all such ordered n-samples with replacement, so
    |Ω| = 365^n.
Let A be the event that no two b_k are equal; then |A| is the number of ordered n-samples of S without replacement. Hence,
    |A| = ^365P_n = 365 × 364 × . . . × (365 − n + 1),
so
    P(A) = |A|/|Ω| = (365 × 364 × . . . × (365 − n + 1))/365^n.
We know that if n ≥ 366, then two people must share a birthday. However, this tells us that even if n = 56,
    P(A) ≈ 0.01,
so it is very likely that two people in the group will share a birthday.
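The exact probability is easy to tabulate directly from the formula (a sketch; the function name is ours):

```python
def p_all_distinct(n, days=365):
    """P(no two of n people share a birthday) = (days * (days-1) * ...) / days^n."""
    p = 1.0
    for i in range(n):
        p *= (days - i) / days
    return p

for n in (23, 56, 366):
    print(n, p_all_distinct(n))
```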
Example 5.2. A deck of playing cards consists of 52 cards - 13 cards in four suits (clubs,
spades, hearts, diamonds). The thirteen cards are 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A where
J is Jack, Q is Queen, K is King and A is Ace. Many games are played with these cards;
one example is poker, where each player is dealt five cards at random.
(i) How many poker hands are there?
(ii) The face value of a card is its number. So, 2 of Spades and 2 of clubs have the
same face value. A hand of poker is said to be four of a kind if it has four cards
of a single face value (and one card is forced to have another face value). What is
the probability of a four of a kind?
Solution: Here, Ω consists of all \binom{52}{5} poker hands. If A is the event that a hand is a four of a kind, we wish to find
    P(A) = |A|/|Ω|.
To find |A|, there are 13 choices for the face value making up the four, and 48
remaining choices for the fifth card. Therefore,
    P(A) = (13 × 48)/\binom{52}{5} ≈ 0.00024.
(iii) What is the probability that a poker hand has exactly three clubs?
Solution: Again, we wish to find |A| where A is the event that a hand has exactly
three clubs. There are 13 clubs and we wish to choose 3 of them. The remaining
two cards are chosen from the remaining 39 = 52 − 13 cards, so we have
    |A| = \binom{13}{3} × \binom{39}{2},
and
    P(A) = |A|/\binom{52}{5} ≈ 0.0815.
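Both counts are quick to check against the formulas with `math.comb` (a sketch):

```python
from math import comb

hands = comb(52, 5)                                    # all 5-card poker hands
p_four_of_a_kind = 13 * 48 / hands                     # 13 face values x 48 fifth cards
p_three_clubs = comb(13, 3) * comb(39, 2) / hands      # 3 of 13 clubs, 2 of 39 non-clubs

print(hands, p_four_of_a_kind, p_three_clubs)
```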
III. Discrete Random Variables
1. Definitions
Example 1.1. Consider an experiment of tossing a coin 3 times, where p := P ({H})
and (1 − p) := P({T}). For each toss, if it lands H, you gain ₹1, and if it lands T, you lose ₹1. We wish to know the possible values for the amount of money you would earn, which we denote by X:
w X(w) P ({w})
(H,H,H) 3 p3
(H,H,T) 1 p2 (1 − p)
(H,T,H) 1 p2 (1 − p)
(H,T,T) −1 p(1 − p)2
(T,H,H) 1 p2 (1 − p)
(T,H,T) −1 p(1 − p)2
(T,T,H) −1 p(1 − p)2
(T,T,T) −3 (1 − p)3
Here, we may wish to know the probability that we earn exactly ₹1 or at least ₹1. These events may be described using X as
    A := {w ∈ Ω : X(w) = 1} and B := {w ∈ Ω : X(w) ≥ 1}.
Note that
    A = {(H, H, T), (H, T, H), (T, H, H)} ⇒ P(A) = 3p^2(1 − p)
    B = {(H, H, T), (H, T, H), (T, H, H), (H, H, H)} ⇒ P(B) = p^3 + 3p^2(1 − p).
(ii) For each x ∈ R, the set {w ∈ Ω : X(w) = x} is an event (it belongs to A).
(End of Day 8)
Remark 1.3.
f (x) := P (X = x)
is called the discrete density function of X (also called probability density function or
probability mass function). For clarity, this is sometimes denoted fX as well.
Example 1.5. Consider the random variable in Example 1.1 with p = 0.4. Then, for instance, f(3) = P(X = 3) = (0.4)^3 = 0.064, and f(x) = 0 whenever x ∉ {−3, −1, 1, 3}.
Example 1.6. Consider an experiment with two outcomes,
    Ω = {0, 1},
where 0 denotes failure and 1 denotes success. We may take A = P(Ω). Fix p ∈ [0, 1] and define P : A → R by P({1}) := p and P({0}) := 1 − p.
Let X : Ω → R be the function X(0) = 0 And X(1) = 1. Then X is a random variable
(by part (ii) of Remark 1.3) and
P (X = 1) = p
P (X = 0) = (1 − p)
P (X = x) = 0 if x ∈
/ {0, 1}.
Thus, the density function is given by
    f(x) = p : if x = 1
           1 − p : if x = 0
           0 : otherwise.
This is called the Bernoulli density function or Bernoulli distribution (we will discuss
distribution functions later) with parameter p. We will write
X ∼ Bern(p)
Example 1.7. Now consider n independent repetitions of this experiment, so that Ω = {(x_1, x_2, . . . , x_n) : x_i ∈ {0, 1}}. Moreover, we may take A = P(Ω), and P : A → [0, ∞) may be defined on singleton sets as
    P({(x_1, x_2, . . . , x_n)}) = p^k (1 − p)^{n−k},
where k denotes the number of times 1 occurs in the tuple. Then, (Ω, A, P) is a probability space.
Define X : Ω → R by
    X(x_1, x_2, . . . , x_n) = k,
where k is as above. Then, X is a random variable (by part (ii) of Remark 1.3) and takes values in {0, 1, . . . , n}. Now, for each 0 ≤ k ≤ n,
    |{(x_1, x_2, . . . , x_n) : 1 occurs exactly k times}| = \binom{n}{k}
by Section II.3. Hence,
    P(X = k) = P({(x_1, x_2, . . . , x_n) : 1 occurs exactly k times}) = \binom{n}{k} p^k (1 − p)^{n−k}.
Hence, the corresponding density function is
    f(x) = \binom{n}{x} p^x (1 − p)^{n−x} : if x ∈ {0, 1, . . . , n}
           0 : otherwise.
This is called the Binomial density with parameters n and p. Again, we write
X ∼ B(n, p)
when X is a random variable with this density function. Note that
Bern(p) = B(1, p).
Example 1.8 (Simplest form of Mendelian inheritance). Suppose that eye colour (brown
or blue) is determined by a pair of genes. Suppose that the allele for brown eyes (B) is
dominant, while the allele for blue eyes (b) is recessive. So an individual can have one
of three possible pairs: purely dominant (BB), hybrid (Bb), or purely recessive (bb). In
the first two cases, they will have brown eyes, and if they have bb then they will have
blue eyes.
Suppose that two hybrid parents have four children. What is the probability that exactly
3 out of 4 children will have brown eyes?
Solution:
(i) For a single child, there are three possible pairs from the set S = {BB, Bb, bb}.
Since the parents are hybrid, the probabilities are
    P({BB}) = 1/4, P({Bb}) = 1/2, and P({bb}) = 1/4.
Now,
    A := {brown eyes} = {BB, Bb} ⇒ P(A) = 3/4
    B := {blue eyes} = {bb} ⇒ P(B) = 1/4.
Since we only care about the colour, we have an experiment whose 'success' probability is 3/4 and 'failure' probability is 1/4.
(ii) Let X denote the number of children (out of 4) with brown eyes. Then
    X ∼ B(4, 3/4).
Hence,
    P(X = 3) = \binom{4}{3} (3/4)^3 (1/4) = 27/64.
(End of Day 9)
Example 1.9. Consider a population of N balls, of which r are red and b = (N − r)
are black. Suppose a random sample of size n ≤ N is chosen. Thus,
    |Ω| = \binom{N}{n}.
Let X denote the number of red balls drawn. Then, X takes values 0, 1, . . . , n, and
    P(X = k) = \binom{r}{k} \binom{N−r}{n−k} / \binom{N}{n}.
Lemma 1.12. Let X be a discrete random variable and f : R → R denote its density function. We denote the range of X by S := {x_1, x_2, . . .}, as it is either finite or countably infinite. Then:
(i) f(x) ≥ 0 for all x ∈ R.
(ii) {x ∈ R : f(x) ≠ 0} is finite or countably infinite.
(iii) Σ_{i=1}^∞ f(x_i) = 1.
Proof.
(ii) By definition, f(x) = P(X = x) ≠ 0 is only possible if x is in the range of X, so
    {x ∈ R : f(x) ≠ 0} ⊂ S.
(iii) Consider the set {x1 , x2 , . . .}, then consider the events
Ai = {w ∈ Ω : X(w) = xi }
Hence,
∞
X ∞
X
1 = P (Ω) = P (Ai ) = f (xi ).
i=1 i=1
Remark 1.13. Conversely, given a function f with these properties, we may construct a probability space with
    Ω = {x ∈ R : f(x) ≠ 0} = {x_1, x_2, . . .},
    P({x_i}) = f(x_i),
and a random variable X : Ω → R given by
    X(x_i) = x_i,
whose density function is exactly f.
Definition 1.14. A function f : R → R is called a discrete density function if it satisfies
the three conditions of Lemma 1.12.
Remark 1.15.
(i) The conclusion of Lemma 1.12 and Remark 1.13 is that a discrete random variable
X may be used to construct a discrete density function f , and conversely, every
discrete density function f may be used to construct a discrete random variable
X in such a way that fX = f .
(ii) Also, if we want to describe an experiment with finite (or countably infinite) out-
comes {x1 , x2 , . . .}, we may simply be given a discrete density function f : R → R
whose possible values are {x1 , x2 , . . .}. Then, we may take
Ω := {x1 , x2 , . . .}
A := P(Ω)
P ({xi }) := f (xi ).
This defines a probability space (Ω, A, P ). Moreover, the random variable X :
Ω → R given by inclusion
X(xi ) = xi
is naturally associated to f . Thus, the probability space is somewhat unnecessary
if you already have a density function.
(End of Day 10)
Example 1.16 (Uniform Density). Consider the experiment of picking a point at ran-
dom from a finite set S = {a1 , a2 , . . . , an } ⊂ R in a way that each point has equal
probability. Here, there is a natural density function
    f(x) = 1/n : if x ∈ S
           0 : otherwise.
This density function describes an experiment and a random variable, called the Uniform density, which we denote by U(n).
Example 1.17 (Geometric Density). Consider an experiment with two outcomes as in
Example 1.6. We repeat this experiment independently until the first success, and let X be the number of trials up to and including the first success. Then
    P(X = 1) = p
    P(X = 2) = (1 − p)p,
and so on. Therefore, if f : R → R denotes the corresponding density function, then
    f(x) = p(1 − p)^{x−1} : if x ∈ N
           0 : otherwise.
This is called the Geometric Density with parameter p. We write
X ∼ Geom(p).
Example 1.18 (Poisson Density). Let λ > 0 be fixed, and define
    f(x) = e^{−λ} λ^x/x! : if x ∈ {0, 1, 2, . . .}
           0 : otherwise.
Conditions (i) and (ii) of Lemma 1.12 clearly hold. (iii) Finally,
    Σ_x f(x) = e^{−λ} Σ_{x=0}^∞ λ^x/x! = e^{−λ} e^λ = 1.
By Lemma 1.12, f is a density function, called the Poisson density with parameter λ.
We write
X ∼ Po(λ)
Poisson random variables are important in applications (such as understanding any
system with a ‘queue’). We will discuss these later.
2. Distribution Functions
Definition 2.1. Let X be a discrete random variable with density function f(x) := P(X = x). The function F : R → R given by
    F(t) := P(X ≤ t) = Σ_{x ≤ t} f(x)
is called the distribution function of X (sometimes denoted F_X).
(i) If s ≤ t, then {w ∈ Ω : X(w) ≤ s} ⊂ {w ∈ Ω : X(w) ≤ t}. Hence,
F (s) ≤ F (t)
so F is non-decreasing.
(ii) Suppose the density function f is non-zero at finitely many points {x_1, x_2, . . . , x_k}, and we list them in increasing order x_1 < x_2 < . . . < x_k. Then,
    F(t) = 0 : if t < x_1
           f(x_1) : if x_1 ≤ t < x_2
           f(x_1) + f(x_2) : if x_2 ≤ t < x_3
           ...
           f(x_1) + f(x_2) + . . . + f(x_{k−1}) : if x_{k−1} ≤ t < x_k
           1 : if x_k ≤ t.
Hence, on each interval [x_i, x_{i+1}), F is constant, and there is a jump of f(x_j) at the point x_j.
(iii) If (a, b] is an interval in R, then
{w ∈ Ω : X(w) ≤ b} = {w ∈ Ω : X(w) ≤ a} ⊔ {w ∈ Ω : a < X(w) ≤ b}
Hence, by part (v) of Lemma I.3.1,
P ({w ∈ Ω : a < X(w) ≤ b}) = P (a < X ≤ b) = F (b) − F (a).
Example 2.4. Let X be the number rolled on a 20-sided die. What is the probability that the number rolled is at least 10?
Solution: Here, F(t) = [t]/20 for 0 ≤ t < 20. Hence,
    P(X ≤ 9) = F(9) = 9/20 ⇒ P(X ≥ 10) = 1 − F(9) = 11/20.
(End of Day 11)
Example 2.5. Suppose X ∼ Geom(p) (see Example 1.17), so its density function is
given by
    f(x) = p(1 − p)^{x−1} : if x ∈ {1, 2, . . .}
           0 : otherwise.
(i) We compute the distribution function for X: If t < 1, then F(t) = 0, and if t ≥ 1, then
    F(t) = Σ_{x=1}^{[t]} p(1 − p)^{x−1} = p · (1 − (1 − p)^{[t]})/(1 − (1 − p)) = 1 − (1 − p)^{[t]}.
(ii) Think of X as the number of times a machine is run until its first success, and for n ∈ N let A_n := {w ∈ Ω : X(w) > n}. Then, P(A_n) = P(X > n) = (1 − p)^n is the probability that the machine fails at least n times before the first success. Note that A_n ⊂ A_{n−1} in general. Now fix n, m ∈ N and
consider the conditional probability
    P(X > n + m | X > n) = P(A_{n+m} | A_n) = P(A_{n+m} ∩ A_n)/P(A_n)
        = P(A_{n+m})/P(A_n)
        = P(X > n + m)/P(X > n)
        = (1 − p)^{n+m}/(1 − p)^n
        = (1 − p)^m
        = P(X > m).
This has a practical implication: If you know that the machine has failed n times,
then the probability that it fails another m times, is the same as the unconditional
probability that it fails m times. In other words, the machine has no memory.
The property
P (X > n + m|X > n) = P (X > m)
is called the memoryless property of the Geometric density.
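The memoryless property follows directly from the formula P(X > n) = (1 − p)^n, and can be checked numerically (a sketch; the helper name is ours):

```python
def p_greater(n, p):
    """P(X > n) for X ~ Geom(p): all of the first n trials must fail."""
    return (1 - p) ** n

p = 0.3
n, m = 4, 6
lhs = p_greater(n + m, p) / p_greater(n, p)   # P(X > n+m | X > n)
rhs = p_greater(m, p)                         # P(X > m)
print(lhs, rhs)  # equal up to rounding
```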
3. Discrete Random Vectors
Example 3.1. For a single experiment, one may be interested in many different random variables at the same time. For instance, if an urn has n labelled balls, both black and red, and you select k of them, we may be interested in, say, both the number of red balls drawn and the largest label among them. In some cases, we may wish to compare random variables, which leads to the next definition.
Definition 3.2. Let (Ω, A, P) be a probability space, and let X_1, X_2, . . . , X_r be r discrete random variables on it. Define Y : Ω → R^r by
    Y(w) := (X_1(w), X_2(w), . . . , X_r(w)).
Then Y is called an r-dimensional (discrete) random vector.
Example 3.4. Suppose two six-sided dice are rolled.
(ii) Let X be the number on the first die, and Z be the larger of the two numbers. Then,
    P(X = 1, Z = 1) = 1/36 but P(X = 2, Z = 1) = 0.
Hence, the joint density function f : R^2 → R is given by the following table (values are multiplied by 36):
z,x 1 2 3 4 5 6
1 1 0 0 0 0 0
2 1 2 0 0 0 0
3 1 1 3 0 0 0
4 1 1 1 4 0 0
5 1 1 1 1 5 0
6 1 1 1 1 1 6
In other words,
f(x, z) = { 1/36  : if 1 ≤ x ≤ 6, x < z ≤ 6
          { x/36  : if 1 ≤ x ≤ 6, z = x
          { 0     : otherwise
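The table above can be reproduced by brute-force enumeration of the 36 equally likely outcomes; a quick Python check:

```python
from fractions import Fraction
from collections import Counter

# Count outcomes (first die, max of the two dice) over all 36 rolls
counts = Counter()
for d1 in range(1, 7):
    for d2 in range(1, 7):
        counts[(d1, max(d1, d2))] += 1

f = {xz: Fraction(c, 36) for xz, c in counts.items()}
assert f[(1, 1)] == Fraction(1, 36)
assert (2, 1) not in f                # P(X = 2, Z = 1) = 0
assert f[(4, 4)] == Fraction(4, 36)   # diagonal entry: z = x gives x/36
assert f[(2, 5)] == Fraction(1, 36)   # off-diagonal: x < z gives 1/36
```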
Remark 3.5.
(i) The density function defined as above has the following properties:
(a) f (x) ≥ 0 for all x ∈ Rr .
(b) S := {x ∈ Rr : f (x) ̸= 0} is finite or countably infinite.
(c) Σ_{x∈S} f(x) = 1.
(ii) Any function f : R^r → R satisfying these three conditions is called a discrete r-dimensional density function.
It is also called the joint density function of the variables (X1 , X2 , . . . , Xr ).
(iii) As in Remark 1.13, any such function is the density function of some r-dimensional
random vector.
(iv) The 1-dimensional density function associated to the random variable Xi is then
called the marginal density function and is denoted by fXi .
(v) Given two random variables X, Y , we write
Lemma 3.6. Let X and Y be two random variables with joint density function f : R2 →
R. If fX and fY denote the marginal density functions of X and Y respectively, then
fX(x) = Σ_{y∈R} f(x, y)
fY(y) = Σ_{x∈R} f(x, y).
Proof. Suppose X takes values {x1 , x2 , . . .}, then for any y ∈ R
P(Y = y) = P({w ∈ Ω : Y(w) = y})
         = P(⊔_{i=1}^∞ {w ∈ Ω : X(w) = x_i and Y(w) = y})
         = Σ_{i=1}^∞ P(X = x_i, Y = y).
Hence, if f is the joint density function of X and Y , then the marginal density of Y is
given by
fY(y) = Σ_{i=1}^∞ f(x_i, y)
Similarly, if Y takes values {y1 , y2 , . . .}, then the marginal density of X is given by
fX(x) = Σ_{j=1}^∞ f(x, y_j)
Example 3.7. Consider part (ii) of Example 3.4, where X is the number on the first die
while Z is the maximum of the two numbers. Then, we can represent the probabilities
(multiplied by 36) as a table:
z,x 1 2 3 4 5 6
1 1 0 0 0 0 0
2 1 2 0 0 0 0
3 1 1 3 0 0 0
4 1 1 1 4 0 0
5 1 1 1 1 5 0
6 1 1 1 1 1 6
For Z, the marginal density is given by
fZ(z) = { 1/36   : if z = 1
        { 3/36   : if z = 2
        { 5/36   : if z = 3
        { 7/36   : if z = 4
        { 9/36   : if z = 5
        { 11/36  : if z = 6
        { 0      : otherwise
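Equivalently, fZ(z) = (2z − 1)/36 for z ∈ {1, . . . , 6}. This marginal can be verified by counting maxima directly:

```python
from fractions import Fraction
from collections import Counter

# Distribution of the maximum of two fair dice, by enumeration
maxima = Counter(max(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7))
fZ = {z: Fraction(c, 36) for z, c in maxima.items()}

for z in range(1, 7):
    assert fZ[z] == Fraction(2 * z - 1, 36)  # 1/36, 3/36, ..., 11/36 as in the table
assert sum(fZ.values()) == 1
```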
(i) Two random variables X and Y are said to be independent if for any x, y ∈ R
P (X = x, Y = y) = P (X = x)P (Y = y).
Example 4.2. Suppose two six-sided dice are rolled. Let X denote the number on the
first die, Y the number on the second die, and Z the maximum of the two numbers.
(ii) However,
P(X = 1, Z = 2) = 1/36  while  P(X = 1) = 1/6  and  P(Z = 2) = 3/36.
Hence, X and Z are not independent.
Remark 4.3.
(i) If X is a random variable with density function f , and if A ⊂ R then
P(X ∈ A) = P({w ∈ Ω : X(w) ∈ A}) = Σ_{x∈A} P(X = x) = Σ_{x∈A} f(x)
(ii) If X and Y are two random variables with joint density function fX,Y and if
A, B ⊂ R are two sets, then
Lemma 4.4. If X, Y are independent random variables and A and B are two subsets
of R, then
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B).
Proof. Suppose fX,Y denotes the joint density and fX and fY denote the marginal den-
sities. Note that
P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} f_{X,Y}(x, y)
                = Σ_{x∈A} Σ_{y∈B} fX(x) fY(y)
                = (Σ_{x∈A} fX(x)) (Σ_{y∈B} fY(y))
                = P(X ∈ A) P(Y ∈ B).
Remark 4.5.
(i) Let X, Y be two random variables on a probability space (Ω, A, P ). Then, we may
define a number of new random variables:
(a) Define Z : Ω → R by
Z(w) := X(w) + Y (w).
Predictably, we write (X + Y ) for this random variable.
(b) Define Z : Ω → R by
In general, given a ‘nice’ function g : R2 → R, we may define Z : Ω → R by
Hence,
f_{X+Y}(z) = Σ_x fX(x) fY(z − x).
This is called the convolution of the two functions fX and fY . In particular, if X and Y
both take only non-negative integer values, then X + Y only takes non-negative integer
values and for z ≥ 0,
f_{X+Y}(z) = Σ_{x=0}^z fX(x) fY(z − x).
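The discrete convolution formula is easy to implement directly. A sketch in Python, illustrated on the sum of two fair dice (the dice example is our own choice here):

```python
from fractions import Fraction

def convolve(fX, fY):
    # f_{X+Y}(z) = sum over x of fX(x) * fY(z - x)
    out = {}
    for x, px in fX.items():
        for y, py in fY.items():
            out[x + y] = out.get(x + y, 0) + px * py
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}
fS = convolve(die, die)          # density of the sum of two independent dice
assert fS[2] == Fraction(1, 36)
assert fS[7] == Fraction(6, 36)  # 7 is the most likely total
assert sum(fS.values()) == 1
```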
The next definition only makes sense for integer valued random variables.
Definition 5.2. Let X be an integer valued discrete random variable. The probability generating function
of X is ΦX : [−1, 1] → R given by
ΦX(t) := Σ_{x=0}^∞ fX(x) t^x
Remark 5.3.
(i) Note that the series converges absolutely because Σ_{x=0}^∞ fX(x) = 1. Therefore, ΦX
is differentiable and
Φ′X(t) = Σ_{x=1}^∞ x fX(x) t^(x−1)
In particular,
ΦX(0) = P(X = 0)
Φ′X(0) = fX(1) = P(X = 1)
Φ′′X(0) = 2 fX(2) = 2 P(X = 2)
P (X = 1) = P (Y = 1).
Hence,
ΦX(t) = Σ_{x=0}^n C(n, x) (pt)^x (1 − p)^(n−x)
      = (tp + 1 − p)^n
Hence,
ΦX(t) = Σ_{x=0}^∞ e^(−λ) (λ^x/x!) t^x
      = e^(−λ) e^(λt)
      = e^(λ(t−1))
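The closed form e^(λ(t−1)) can be compared against a truncated version of the defining series; a quick numerical check in Python (λ and t are arbitrary test values):

```python
import math

lam, t = 2.0, 0.7
# Partial sum of Phi_X(t) = sum over x of e^{-lam} * lam^x / x! * t^x
series = sum(math.exp(-lam) * lam**x / math.factorial(x) * t**x for x in range(60))
closed_form = math.exp(lam * (t - 1))
assert abs(series - closed_form) < 1e-12
```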
Lemma 5.6. If X1 , X2 , . . . , Xr are mutually independent, integer-valued random vari-
ables, then
ΦX1 +X2 +...+Xr (t) = ΦX1 (t)ΦX2 (t) . . . ΦXr (t)
for all t ∈ [−1, 1].
Proof. Assume r = 2 and X = X1 and Y = X2 . Fix t ∈ [−1, 1]. Then
ΦX+Y(t) = Σ_{z=0}^∞ f_{X+Y}(z) t^z
        = Σ_{z=0}^∞ Σ_{x=0}^z fX(x) fY(z − x) t^z
        = Σ_{x=0}^∞ fX(x) t^x Σ_{z=x}^∞ fY(z − x) t^(z−x)
        = (Σ_{x=0}^∞ fX(x) t^x) (Σ_{y=0}^∞ fY(y) t^y)
        = ΦX(t) ΦY(t).
Proof.
(i) If Xi ∼ B(ni , p), then
ΦXi (t) = (pt + 1 − p)ni .
By Lemma 5.6,
ΦX1 +X2 +...+Xr (t) = (pt + 1 − p)n1 +n2 +...+nr .
By part (iii) of Remark 5.3,
X1 + X2 + . . . + Xr ∼ B(n1 + n2 + . . . + nr , p).
(ii) Similar (check!)
IV. Expectation of Discrete Random
Variables
1. Definition of Expectation
Definition 1.1. Let X be a discrete random variable on a probability space (Ω, A, P )
with range {x1 , x2 , . . .} and density function f . Suppose that
Σ_{j=1}^∞ |x_j| f(x_j) < ∞
EX = 0P (X = 0) + 1P (X = 1) = P (X = 1) = p.
Therefore,
EX = Σ_{x=0}^n x C(n, x) p^x (1 − p)^(n−x)
Therefore,
EX = n Σ_{x=1}^n C(n − 1, x − 1) p^x (1 − p)^(n−x)
   = n Σ_{i=0}^{n−1} C(n − 1, i) p^(i+1) (1 − p)^(n−1−i)
   = np (p + 1 − p)^(n−1)
   = np.
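The identity EX = np can be checked exactly for a specific n and p with rational arithmetic (the values below are arbitrary):

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)
# EX = sum of x * C(n, x) p^x (1 - p)^(n - x)
EX = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))
assert EX == n * p  # = np, as derived above
```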
Example 1.4. Suppose X ∼ U (n). In other words, the density function of X is
f(x) = { 1/n  : x ∈ {1, 2, . . . , n}
       { 0    : otherwise
Then,
EX = Σ_{i=1}^n i/n = (n + 1)/2.
Example 1.5. Suppose X ∼ Geom(p) for some 0 < p < 1, then
f(x) = { p(1 − p)^(x−1)  : if x ∈ {1, 2, . . .}
       { 0               : otherwise.
We verify that
Σ_{i=1}^∞ i f(i) < ∞
To see this, observe that
Σ_{i=1}^∞ i f(i) = Σ_{i=1}^∞ i p(1 − p)^(i−1)
                = p Σ_{i=1}^∞ i (1 − p)^(i−1)
                = −p (d/dx Σ_{i=1}^∞ (1 − x)^i) |_{x=p}
                = −p (d/dx (1/x − 1)) |_{x=p}
                = −p (−1/p²)
                = 1/p.
Hence the series converges absolutely and EX = 1/p.
Example 1.6. There is an example of a random variable (density function) whose
expectation does not exist because the series is not absolutely convergent. See [HPS,
Example 4, Section 4.1].
2. Properties of Expectation
Lemma 2.1. Let X = (X1 , X2 , . . . , Xr ) be an r-dimensional random vector with joint
density function fX . Let φ : Rr → R be a function and define
Z := φ(X).
In that case,
EZ = Σ_{x∈R^r} φ(x) fX(x).
Theorem 2.2. Let X and Y be two discrete random variables with finite expectation. Then,
E(X + Y) = EX + EY.
Proof. Assume for simplicity that X and Y both have finite ranges, say X has range
{x1 , x2 , . . . , xn } and Y has range {y1 , y2 , . . . , ym }.
(ii) Clearly, cX has range {cx1 , cx2 , . . . , cxn } and
P (cX = cxi ) = P (X = xi ).
Therefore,
E(cX) = Σ_{i=1}^n c x_i f(x_i) = c E(X).
i=1
(iii) Clearly, X + Y has range {xi + yj }. Let f denote the joint density function and
fX and fY denote the marginal density functions. Then,
fX(x_i) = Σ_{j=1}^m f(x_i, y_j)  and  fY(y_j) = Σ_{i=1}^n f(x_i, y_j).
Therefore,
E(X + Y) = Σ_{i=1}^n Σ_{j=1}^m (x_i + y_j) f(x_i, y_j)
         = Σ_{i=1}^n Σ_{j=1}^m x_i f(x_i, y_j) + Σ_{i=1}^n Σ_{j=1}^m y_j f(x_i, y_j)
         = Σ_{i=1}^n x_i fX(x_i) + Σ_{j=1}^m y_j fY(y_j)
         = EX + EY.
Example 2.3. Let X1, X2, . . . , Xn be independent Bernoulli random variables with
common parameter p. If
X := X1 + X2 + . . . + Xn
then by Theorem III.5.7, X ∼ B(n, p). Moreover, by Theorem 2.2, EX = np,
as in Example 1.3.
Example 2.4. Suppose there are N balls in an urn of which r are red and b = (N − r)
are black. If a random sample of size n ≤ N is chosen and X denotes the number of red
balls, then
X ∼ Hyp(N, r, n)
with density function
f(x) = { C(r, x) C(N − r, n − x) / C(N, n)  : if 0 ≤ x ≤ r and 0 ≤ n − x ≤ N − r
       { 0                                  : otherwise
Proof. The joint density for X and Y is given by f (x, y) = fX (x)fY (y). Thus by
Lemma 2.1,
E(XY) = Σ_{x,y} x y f(x, y)
      = Σ_{x,y} x fX(x) y fY(y)
      = (Σ_x x fX(x)) (Σ_y y fY(y)) = E(X) E(Y).
3. Moments
Definition 3.1. Let X be a discrete random variable and r ≥ 0.
(i) We say that X has a moment of order r if X r has finite expectation. In that case,
we define the rth moment of X to be
E(X r ).
(ii) If µ = E(X) is the mean, then the rth central moment is defined as
E(X − µ)^r
(iii) If X has a moment of order r ≥ 0, then it has moments of order k for each
1 ≤ k ≤ r.
(iv) The standard deviation of X is a measure of how much X varies from the mean.
µ = cP (X = c) = c.
Hence, P (X = µ) = 1 so X = µ.
(vi) Suppose we wish to minimize the function g : R → R given by
g(a) := E(X − a)²,
then we write
g(a) = E(X²) − 2a E(X) + a²
Differentiating with respect to a gives g′(a) = −2E(X) + 2a so the extremum
occurs at a = E(X). Moreover,
g′′(a) = 2 > 0
so this is a minimum. Hence, Var X is the minimum value that g takes.
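This variance-minimizing property of the mean can be checked concretely, e.g. for a fair die (our own example), where EX = 7/2 and Var X = 35/12:

```python
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}          # fair die
g = lambda a: sum((x - a) ** 2 * p for x, p in f.items())

EX = sum(x * p for x, p in f.items())                 # 7/2
assert g(EX) == Fraction(35, 12)                      # g(EX) = Var X
# g(a) = Var X + (a - EX)^2, so any other a does strictly worse
for d in (Fraction(-1), Fraction(1, 2), Fraction(2)):
    assert g(EX + d) == Fraction(35, 12) + d ** 2
```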
(vii) Suppose X is integer valued with probability generating function ΦX : [−1, 1] → R,
and suppose that there is a t0 > 1 so that
Σ_{x=0}^∞ fX(x) t0^x < ∞
(This need not be the case, but it is if X is finitely valued). Then, we differentiate
with respect to t to get
Φ′X(t) = Σ_{x=1}^∞ x fX(x) t^(x−1).
4. Variance of a Sum
Remark 4.1. If X and Y are two discrete random variables, then
Var(X + Y ) = E[(X + Y ) − E(X + Y )]2
= E[(X − EX) + (Y − EY )]2
= E[X − EX]2 + E[Y − EY ]2 + 2E[(X − EX)(Y − EY )]
Definition 4.2. The covariance of X and Y is
Cov(X, Y ) := E[(X − EX)(Y − EY )]
= E[XY − (EX)Y − X(EY ) + (EX)(EY )]
= E[XY ] − (EX)(EY ) − (EX)(EY ) + (EX)(EY )
= E[XY ] − (EX)(EY ).
Remark 4.3.
(i) Hence,
Var(X + Y ) = VarX + VarY + 2Cov(X, Y ).
(ii) By Theorem 2.5, Cov(X, Y ) = 0 whenever X and Y are independent. In that case,
Var(X + Y ) = VarX + VarY.
Lemma 4.4.
(i) If X1 , X2 , . . . , Xn are discrete random variables with finite variance, then
Var(X1 + X2 + . . . + Xn) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i=1}^n Σ_{j=i+1}^n Cov(Xi, Xj).
Example 4.5. Let X1 , X2 , . . . , Xn be independent and such that Xi ∼ Bern(p) for all
1 ≤ i ≤ n. If
X := X1 + X2 . . . + Xn
then X ∼ B(n, p) by Theorem III.5.7. By Lemma 4.4,
Var(X) = Σ_{i=1}^n Var(Xi).
Now, E(Xi ) = p from Example 1.2. Since Xi = Xi2 , we have E(Xi2 ) = p as well. Hence,
Var(Xi ) = p − p2 = p(1 − p).
Hence, Var(X) = np(1 − p).
Example 4.6. Suppose there are N balls in an urn of which r are red and b = (N − r)
are black (as in Example 2.4). If a random sample of n ≤ N is chosen and X denotes
the number of red balls, then
X ∼ Hyp(N, r, n)
Consider n random variables X1 , X2 , . . . , Xn where Xi (w) = 1 if and only if the ith ball
drawn is red. Then,
EXi = P(Xi = 1) = r/N.
Note that the Xi are not independent and X = X1 + X2 + . . . + Xn . Therefore,
EX = nr/N
and
Var(X) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i=1}^n Σ_{j=i+1}^n Cov(Xi, Xj).
Now consider
E(Xi Xj) = P(Xi = 1, Xj = 1) = (r/N) · (r − 1)/(N − 1)
Cov(Xi, Xj) = E(Xi Xj) − E(Xi) E(Xj)
            = r(r − 1)/(N(N − 1)) − r²/N²
Hence,
Var(X) = n (r/N) (1 − r/N) (1 − (n − 1)/(N − 1))
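This variance formula can be verified exactly against a direct computation from the hypergeometric density (the values of N, r, n below are arbitrary):

```python
from fractions import Fraction
from math import comb

N, r, n = 10, 4, 3
# Hypergeometric density f(x) = C(r, x) C(N - r, n - x) / C(N, n)
f = {x: Fraction(comb(r, x) * comb(N - r, n - x), comb(N, n)) for x in range(n + 1)}

EX = sum(x * p for x, p in f.items())
var = sum(x * x * p for x, p in f.items()) - EX ** 2
assert EX == Fraction(n * r, N)
assert var == n * Fraction(r, N) * (1 - Fraction(r, N)) * (1 - Fraction(n - 1, N - 1))
```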
5. Correlation Coefficient
Definition 5.1. Suppose X and Y have finite non-zero variance. The correlation coefficient
of X and Y is given by
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y))
X and Y are said to be uncorrelated if ρ(X, Y ) = 0.
Remark 5.2. If X and Y are independent, then by Theorem 2.5, X and Y are uncor-
related.
Proof.
(i) If P(Y = 0) = 1, then P(XY = 0) = 1 so E(XY) = 0. Then, the
inequality holds trivially and indeed, equality holds because E(Y²) = 0.
Moreover,
E(X 2 )E(Y 2 ) = E(a2 Y 2 )E(Y 2 ) = a2 [E(Y 2 )]2 .
Hence equality holds.
(iii) Now assume that P (Y = 0) < 1 so that E(Y 2 ) > 0. Now note that for any λ ∈ R,
Now,
g′(λ) = 2λ E(Y²) − 2 E(XY)  and  g′′(λ) = 2 E(Y²) > 0.
Hence g attains its minimum at a = E(XY)/E(Y²). Now
0 ≤ g(a) = E(X²) − [E(XY)]²/E(Y²).
g(a) = E(X − aY )2 = 0.
−1 ≤ ρ(X, Y ) ≤ 1
In other words,
Cov(X, Y )2 ≤ Var(X)Var(Y ).
Hence, |ρ(X, Y )| ≤ 1. The case of equality can also be verified similarly (Check!).
6. Chebyshev’s Inequality
Remark 6.1. Suppose X is a nonnegative random variable with finite expectation, and
let t > 0. Define
Y(w) = { 0  : if X(w) < t
       { t  : if X(w) ≥ t
Then, Y is a discrete random variable with
Hence,
E(Y) = 0 · P(Y = 0) + t · P(Y = t) = t P(X ≥ t).
Moreover, X ≥ Y so by Theorem 2.2, EX ≥ EY . Hence,
P(X ≥ t) ≤ EX/t.
Theorem 6.2 (Chebyshev’s Inequality). Let X be a random variable with mean µ and
variance σ 2 . Then, for any t > 0,
P(|X − µ| ≥ t) ≤ σ²/t².
Proof. Consider the random variable Z := (X − µ)2 and apply the previous remark.
Then,
P(Z ≥ t²) ≤ EZ/t² = E(X − µ)²/t² = σ²/t².
Now observe that Z ≥ t2 ⇔ |X − µ| ≥ t.
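Chebyshev's inequality can be checked against an exact computation for a simple distribution, say a fair die (our own example):

```python
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in f.items())                   # 7/2
var = sum((x - mu) ** 2 * p for x, p in f.items())      # 35/12

t = Fraction(2)
lhs = sum(p for x, p in f.items() if abs(x - mu) >= t)  # P(|X - mu| >= 2) = 1/3
assert lhs <= var / t ** 2                              # 1/3 <= 35/48
```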
(ii) Define
Yn := (X1 + X2 + . . . + Xn)/n.
We would expect that Yn would be close to µ as n → ∞. Since the Xi are
independent,
E(Yn) = nµ/n = µ
Var(Yn) = (1/n²) Σ_{i=1}^n Var(Xi) = σ²/n
(iii) For a fixed δ > 0, by Chebyshev’s inequality
P(|Yn − µ| ≥ δ) ≤ σ²/(nδ²)
⇒ lim_{n→∞} P(|Yn − µ| ≥ δ) = 0
⇒ lim_{n→∞} P(|Yn − µ| < δ) = 1.
Sn := X1 + X2 + . . . + Xn
By Theorem 6.4,
P(|Sn/n − µ| ≥ δ) ≤ p(1 − p)/(nδ²)
Now, the function p ↦ p(1 − p) has its maximum value on [0, 1] at p = 1/2, so
p(1 − p) ≤ 1/4
for all 0 < p < 1. Hence,
P(|Sn/n − µ| ≥ δ) ≤ 1/(4nδ²)
Hence, given ϵ > 0, if we choose n ∈ N so that
1/(4nδ²) < ϵ
then we can ensure that after n trials,
P(|Sn/n − µ| ≥ δ) < ϵ.
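The choice of n can be sketched in code; for example, with δ = 0.05 and ϵ = 0.01 (arbitrary values), the bound 1/(4nδ²) < ϵ dictates:

```python
from fractions import Fraction

delta, eps = Fraction(1, 20), Fraction(1, 100)   # delta = 0.05, eps = 0.01 (arbitrary)
bound = 1 / (4 * eps * delta ** 2)               # = 10000
n = int(bound) + 1                               # smallest n with 1/(4 n delta^2) < eps

assert 1 / (4 * n * delta ** 2) < eps
assert 1 / (4 * (n - 1) * delta ** 2) >= eps     # n - 1 trials would not suffice
```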
V. Continuous Random Variables
1. Random Variables and their distribution functions
Definition 1.1. Let (Ω, A, P ) be a probability space. A random variable is a function
X:Ω→R
(ii) Suppose Ω = [a, b] is a closed interval in R, then there is a σ-algebra A called the
Lebesgue σ-algebra that contains all open, closed and half-open intervals. On this
σ-algebra, there is a probability measure P such that
P([x, y]) = |y − x|/(b − a) = P([x, y)) = P((x, y)).
The triple (Ω, A, P) is a probability space, and many random variables are defined
on such spaces.
(iii) Consider Example I.1.3: A pointer that is free to spin about the centre of a circle
of radius 1. If the pointer is spun, it comes to rest at an angle (in radians) from
the X axis. Define
Ω = [0, 2π)
Recall from Example I.2.2 that there is a σ-algebra A on Ω, called the Lebesgue σ-
algebra that contains all open (and half-open) intervals. There is also a probability
measure P on A such that
P([a, b]) = (b − a)/2π.
This is the probability space (Ω, A, P ). Let X be the angle that the pointer comes
to rest at. Then, for any x ∈ R,
{w ∈ Ω : X(w) ≤ x} = [0, x] ∩ Ω ∈ A.
(iv) Consider a dart board of radius 1. A dart is thrown at the board, and we measure
the distance of the dart from the origin, and denote it by X. Then,
Ω := {(a, b) ∈ R2 : a2 + b2 ≤ 1}.
and A is the Lebesgue σ-algebra which contains all open (and half-open) discs,
and P be the Lebesgue measure (as above). Then, for any x ∈ R,
{w ∈ Ω : X(w) ≤ x} = {(a, b) ∈ Ω : a2 + b2 ≤ x2 } ∈ A.
P(X ≤ x) = (1/π) · Area({(a, b) ∈ Ω : a² + b² ≤ x²}) = πx²/π = x²
Therefore,
F(x) = { 0   : x < 0
       { x²  : 0 ≤ x ≤ 1
       { 1   : x > 1.
(ii) If s ≤ t, then
F (s) ≤ F (t)
so F is non-decreasing.
(iv) Moreover, consider
F(−∞) = lim_{n→−∞} F(n) = lim_{n→−∞} P(X ≤ n) = 0.
Similarly,
F (+∞) = 1.
Example 1.7.
(i) Discrete random variables are not continuous. For a continuous random variable,
the notion of density function (as in Definition III.1.4) does not make sense.
(ii) In Example 1.2, both the spinning pointer and the dart board define continuous
random variables.
Definition 1.8. A distribution function is a function F : R → R satisfying the following
properties:
(i) 0 ≤ F (x) ≤ 1 for all x ∈ R.
Example 2.2.
(i) Given real numbers a < b, define f : R → R by
f(t) = { 0          : t < a
       { 1/(b − a)  : a ≤ t ≤ b
       { 0          : t > b
This is called the uniform density function on the interval [a, b].
Therefore, f is a density function and is called the exponential density with pa-
rameter λ.
(iii) Define f : R → R by
f(t) = 1/(π(1 + t²)).
Then, for any M > 0,
∫_{−M}^M f(t) dt = (1/π) arctan(t) |_{−M}^M = (1/π)(arctan(M) − arctan(−M)).
Hence,
∫_{−∞}^∞ f(t) dt = lim_{M→∞} (1/π)(arctan(M) − arctan(−M)) = (1/π)(π/2 − (−π/2)) = 1.
Thus, f is a density function, called the Cauchy density.
Definition 2.3. Let f : R → R be a density function as above. Define F : R → R by
F(x) = ∫_{−∞}^x f(t) dt.
P (Y ≤ y) = 0.
Now if y ≥ 0, then
G(y) := P(Y ≤ y) = P(−√y ≤ X ≤ √y) = F(√y) − F(−√y)
⇒ g(y) = G′(y) = (1/(2√y)) f(√y) + (1/(2√y)) f(−√y)
So the density function of Y is
g(y) = { 0                                    : if y < 0
       { (1/(2√y)) f(√y) + (1/(2√y)) f(−√y)   : if y ≥ 0
Example 2.6. If X is uniformly distributed on (0, 1) and λ > 0 is fixed, find the density
function of Y := −λ−1 log(1 − X).
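Example 2.6 is the basis of the inverse transform method for sampling: if U is uniform on (0, 1), then −λ⁻¹ log(1 − U) has the exponential density with parameter λ. A quick Monte Carlo sanity check (the sample size, seed, and λ are our own illustrative choices):

```python
import math
import random

random.seed(0)
lam = 2.0
# Y := -log(1 - U)/lam with U ~ U(0, 1); the claim is Y ~ Exp(lam)
sample = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

# Compare the empirical tail P(Y > 1) with the exact value e^{-lam}
empirical = sum(y > 1.0 for y in sample) / len(sample)
assert abs(empirical - math.exp(-lam)) < 0.01
```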
3. Normal, Exponential and Gamma Densities
a. Normal Densities
Example 3.1. Let g : R → R be given by
g(x) = e^(−x²/2).
If c := ∫_{−∞}^∞ g(x) dx, then one can show (using polar coordinates) that
c² = 2π.
Therefore, if
f(t) := (1/√(2π)) e^(−t²/2)
then f is a density function called the standard normal density. We write X ∼ n(0, 1).
Remark 3.2.
Therefore,
Φ(−x) = 1 − Φ(x).
This is useful because we only need to determine values of Φ(x) when x ≥ 0.
(v) The other values of Φ are given at the end of the textbook. For instance, the table
tells us that
Φ(3) = 0.9987 and therefore Φ(−3) = 0.0013.
Therefore, the majority of the area under the curve lies between [−3, 3]. For a number
z ∈ R, the value Φ(z) is often called the z-score.
(vi) Suppose X ∼ n(0, 1) and µ, σ ∈ R with σ > 0, then define
Y := µ + σX.
Then, Y is also a continuous random variable, and its distribution function is given
by
G(y) = P(Y ≤ y) = P(X ≤ (y − µ)/σ) = Φ((y − µ)/σ)
So the density function of Y is g : R → R given by
g(y) = (1/(σ√(2π))) e^(−(y−µ)²/2σ²)
This is a normal density with mean µ and variance σ 2 , and is denoted Y ∼ n(µ, σ 2 ).
Note that
P(a ≤ Y ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ)
This can be used to make calculations for all normal densities using the table.
(vii) For instance, if X ∼ n(1, 4), then
P(0 ≤ X ≤ 3) = Φ((3 − 1)/2) − Φ((0 − 1)/2)
             = Φ(1) − (1 − Φ(1/2))
             = Φ(1) + Φ(0.5) − 1
             = 0.5328 (from the table)
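Values of Φ need not come from a table: Φ(z) = (1 + erf(z/√2))/2, so the computation above can be reproduced in a few lines of Python:

```python
import math

def Phi(z):
    # Standard normal distribution function via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# X ~ n(1, 4), so mu = 1 and sigma = 2
p = Phi((3 - 1) / 2) - Phi((0 - 1) / 2)
assert abs(p - 0.5328) < 5e-4   # matches the table value
```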
b. Exponential Densities
Recall that the density function for an exponential random variable with parameter λ > 0
is given by
f(x) = { λe^(−λx)  : x ≥ 0
       { 0         : x < 0
and the distribution function is given by
F(x) = { 1 − e^(−λx)  : x ≥ 0
       { 0            : x < 0
Remark 3.3. Note that for any a ≥ 0, we have P(X > a) = e^(−λa).
Hence, if b ≥ 0, then
P(X > a) P(X > b) = P(X > a + b)
Equivalently,
P (X > a + b|X > a) = P (X > b).
This is the continuous analog of the memoryless property of the geometric random
variable.
c. Gamma Densities
Definition 3.4. For α > 0, define
Γ(α) := ∫_0^∞ x^(α−1) e^(−x) dx
This integral is finite but does not have a closed form expression. This is called the
Gamma function.
Definition 3.5. For a given α > 0 and λ > 0, define the gamma density by
g(x) := { (λ^α/Γ(α)) x^(α−1) e^(−λx)  : x ≥ 0
        { 0                           : x < 0
Remark 3.6.
(i) Note that g(x) ≥ 0 and by the definition of the Gamma function,
∫_{−∞}^∞ g(x) dx = (λ^α/Γ(α)) ∫_0^∞ x^(α−1) e^(−λx) dx = 1.
A short calculation shows that
g(y) = { (1/(σ√(2πy))) e^(−y/2σ²)  : y > 0
       { 0                         : y ≤ 0
Γ(α + 1) = αΓ(α).
VI. Jointly Distributed Random
Variables
1. Properties of Bivariate Distributions
Definition 1.1. Let X and Y be two random variables on a probability space (Ω, A, P ).
The joint distribution function is defined as F : R2 → R by
F (x, y) := P (X ≤ x and Y ≤ y)
Remark 1.2.
(i) We can directly use the joint distribution function to compute certain probabili-
ties:
P (a < X ≤ b, Y ≤ d) = F (b, d) − F (a, d)
P (a < X ≤ b, c < Y ≤ d) = F (b, d) − F (b, c) − F (a, d) + F (a, c)
Definition 1.3.
(i) The marginal distribution functions are defined as
FX (x) = P (X ≤ x) and FY (y) := P (Y ≤ y).
then f is called the joint density function. Again, as before, for continuous random
variables, it is not true that
f (x, y) = P (X = x, Y = y).
Remark 1.4.
(i) Observe that
P(a < X ≤ b, c < Y ≤ d) = ∫_a^b ∫_c^d f(u, v) dv du
(ii) Taking A = R², we get
∫_{−∞}^∞ ∫_{−∞}^∞ f(u, v) dv du = 1.
Hence, fX is the density function for X and is called the marginal density. Similarly,
fY(y) = ∫_{−∞}^∞ f(u, y) du
∂²F/∂x∂y = f(x, y)
Remark 1.6.
(i) More generally, for any two sets A, B ⊂ R, we have
(iv) In particular, if f1 and f2 are two (one dimensional) density functions as in Defi-
nition V.2.1, then
f (x, y) := f1 (x)f2 (y)
defines a joint density function.
Example 1.7. If X ∼ n(0, 1) and Y ∼ n(0, 1) are independent, then the joint density
function is given by
f(x, y) = (1/√(2π)) e^(−x²/2) · (1/√(2π)) e^(−y²/2) = (1/2π) e^(−(x²+y²)/2)
fX(x) = c e^(−3x²/8) ∫_{−∞}^∞ e^(−u²/2) du = c e^(−3x²/8) √(2π).
Hence,
f(x, y) = (√3/4π) e^(−(x²−xy+y²)/2)
is a density function. Note that
X ∼ n(0, 4/3).
Similarly, one can show that Y ∼ n(0, 4/3). Since
2. Distribution of Sums
Remark 2.1. Suppose Z = X +Y is the sum of two random variables with joint density
function f . Then, for any z ∈ R, if
Az = {(x, y) : x + y ≤ z}
and the same for Y as well. The density function for (X + Y) is given by
f_{X+Y}(z) = ∫_{−∞}^∞ fX(x) fY(z − x) dx
f_{X+Y}(z) = 0
if z < 0 and if z ≥ 0, then
f_{X+Y}(z) = ∫_0^z fX(x) fY(z − x) dx
           = λ² ∫_0^z e^(−λx) e^(−λ(z−x)) dx
           = λ² z e^(−λz)
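So X + Y has the Γ(2, λ) density. The convolution integral can be sanity-checked numerically (λ, z, and the grid size are arbitrary choices):

```python
import math

lam, z = 1.5, 2.0
N = 10_000
h = z / N
# Midpoint Riemann sum of the convolution integral on [0, z]
integral = sum(
    lam * math.exp(-lam * (i + 0.5) * h) * lam * math.exp(-lam * (z - (i + 0.5) * h)) * h
    for i in range(N)
)
assert abs(integral - lam ** 2 * z * math.exp(-lam * z)) < 1e-9
```

Note that the integrand is constant in x (the exponents sum to −λz), which is why the answer picks up the factor z.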
Theorem 2.3. Suppose X ∼ n(µ1 , σ12 ) and Y ∼ n(µ2 , σ22 ) are both independent random
variables, then
X + Y ∼ n(µ1 + µ2 , σ12 + σ22 ).
Proof. We will assume that µ1 = µ2 = 0 and σ1 = σ2 = 1. The general proof is similar
but more technical. Then,
fX(x) = (1/√(2π)) e^(−x²/2)  and  fY(y) = (1/√(2π)) e^(−y²/2)
By the convolution formula,
f_{X+Y}(z) = (1/2π) ∫_{−∞}^∞ e^(−(x² + (z−x)²)/2) dx
Observe that
x² + (z − x)² = 2x² + z² − 2zx = (√2 x − z/√2)² + z²/2
Hence, if u := √2 x − z/√2, then du = √2 dx so
f_{X+Y}(z) = (e^(−z²/4)/2π) ∫_{−∞}^∞ e^(−u²/2) (1/√2) du
           = (1/(2√2 π)) e^(−z²/4) √(2π)
           = (1/(√(2π) √2)) e^(−z²/4)
Hence,
(X + Y ) ∼ n(0, 2).
Theorem 2.4. Suppose X ∼ Γ(α1 , λ) and Y ∼ Γ(α2 , λ) are independent random vari-
ables, then X + Y ∼ Γ(α1 + α2 , λ).
Proof. Omitted. See [HPS, Section 6.2].
3. Conditional Densities
Remark 3.1. Suppose X and Y are discrete random variables with joint density func-
tion f and x, y ∈ R. If P (X = x) > 0, then
P(Y = y | X = x) = P(Y = y, X = x)/P(X = x) = f(x, y)/fX(x)
The conditional density function may thus be defined as
fY|X(y|x) = { f(x, y)/fX(x)  : if 0 < fX(x) < ∞
            { 0              : otherwise.
For each such x ∈ R, this is a density function in the sense of Lemma III.1.12.
Definition 3.2. Let X and Y be two random variables with joint density function f .
Then, the conditional density function is defined as
fY|X(y|x) = { f(x, y)/fX(x)  : if 0 < fX(x) < ∞
            { 0              : otherwise.
Remark 3.3.
(i) Note that for any a ≤ b and x ∈ R
P(a < Y ≤ b | X = x) = ∫_a^b fY|X(y|x) dy.
fY (y) = fY |X (y|x)
for any x, y ∈ R.
(iv) You should think of the conditional density as the vertical cross-section of the joint
density function surface at the point X = x, which is further scaled by fX (x) so
that its area is 1.
Example 3.4. Consider X, Y with joint density function
f(x, y) = (√3/4π) e^(−(x²−xy+y²)/2)
4π
We saw in Example 1.8 that X and Y are dependent normal variables with X ∼ n(0, 4/3)
and Y ∼ n(0, 4/3). Note that for any x ∈ R,
fX(x) = (√3/(2√(2π))) e^(−3x²/8)
Hence, the conditional density is given by
fY|X(y|x) = [(√3/4π) e^(−(x²−xy+y²)/2)] / [(√3/(2√(2π))) e^(−3x²/8)]
          = (1/√(2π)) e^(−(y−x/2)²/2)
Hence, for each x ∈ R, the conditional density of Y given X = x is the normal density
n(x/2, 1).
fX|Y(x|y) = fX(x) fY|X(y|x) / ∫_{−∞}^∞ fX(x) fY|X(y|x) dx
by definition. Moreover,
fY(y) = ∫_{−∞}^∞ f(x, y) dx = ∫_{−∞}^∞ fX(x) fY|X(y|x) dx.
4. Properties of Multivariate Distributions
Definition 4.1. Suppose X1 , X2 , . . . , Xn are n random variables.
F (x1 , x2 , . . . , xn ) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn )
Remark 4.2.
f(x1, x2, . . . , xn) = ∂ⁿF/∂x1 ∂x2 . . . ∂xn (x1, x2, . . . , xn)
(iii) In particular,
∫_{R^n} f(u1, u2, . . . , un) dun dun−1 . . . du1 = 1.
(v) If f : Rn → R is the joint density function, then the marginal density functions
are defined as
fX1(x1) = ∫_{−∞}^∞ . . . ∫_{−∞}^∞ f(x1, u2, . . . , un) dun dun−1 . . . du2.
(vi) The variables X1 , X2 , . . . , Xn are said to be independent if
P(a1 < X1 ≤ b1, . . . , an < Xn ≤ bn) = Π_{i=1}^n P(ai < Xi ≤ bi).
Proof. For n = 2, this was proved in Theorem 2.3. For n > 2, we use induction. Assume
that the result is true for (n − 1) variables. Let Y := X2 + X3 + . . . + Xn . By induction
hypothesis,
Y ∼ n(ν, λ2 )
where ν = Σ_{i=2}^n µi and λ² = Σ_{i=2}^n σi². Moreover, by Theorem 4.4, Y and X1 are
independent. Therefore, by the n = 2 case,
X1 + Y ∼ n(µ, σ 2 )
as desired.
VII. Expectations and the Central
Limit Theorem
1. Expectations of Continuous Random Variables
Definition 1.1. Let X be a continuous random variable with density function f . We
say that X has finite expectation if
∫_{−∞}^∞ |x| f(x) dx < ∞.
Example 1.4. If X ∼ Exp(λ), then X ∼ Γ(1, λ), so
EX = 1/λ
by the previous example.
Z = φ(X1 , X2 , . . . , Xn )
(i) If X^m has finite expectation, then the mth moment of X is defined as
EX^m = ∫_{−∞}^∞ x^m f(x) dx.
(iv) The standard deviation of X is σ := √Var(X).
Remark 2.2.
(i) As in the discrete case, σ = 0 if and only if X is constant.
Example 2.3. If X ∼ Γ(α, λ), then we find the moments and variance of X.
Solution: For m ≥ 1,
E(X^m) = (λ^α/Γ(α)) ∫_0^∞ x^(m+α−1) e^(−λx) dx
       = (λ^α/Γ(α)) · Γ(α + m)/λ^(α+m)
       = α(α + 1) . . . (α + m − 1)/λ^m
By the earlier calculation, E(X) = α/λ, so the variance is given by
σ² = E(X²) − E(X)² = α(α + 1)/λ² − α²/λ² = α/λ².
Remark 2.5. Suppose X is a random variable with density function f that is symmetric,
i.e.
f (−x) = f (x)
Then, X and −X have the same density function. In particular, if m ∈ N is odd, then
E(X^m) = 0 (provided the mth moment exists).
Example 2.6. Suppose X ∼ n(µ, σ 2 ), then we calculate the mean and variance of X.
Y ∼ Γ(1/2, 1/2σ²).
Therefore,
Var(X) = EY = (1/2)/(1/2σ²) = σ².
Remark 2.8.
(i) As before,
(ii) If X and Y are independent, then f(x, y) = fX(x) fY(y) for all (x, y) ∈ R². Therefore,
E(XY) = ∫_{−∞}^∞ ∫_{−∞}^∞ x y fX(x) fY(y) dy dx
      = ∫_{−∞}^∞ x fX(x) (∫_{−∞}^∞ y fY(y) dy) dx
      = µY µX
Hence, Cov(X, Y ) = 0.
Example 2.9. Consider the function f : R2 → R given by
f(x, y) = (√3/4π) e^(−(x²−xy+y²)/2)
Then, by Example VI.1.8, X ∼ n(0, 4/3) and Y ∼ n(0, 4/3). We now calculate
Cov(X, Y ).
Now,
∫_{−∞}^∞ y e^(−(y−x/2)²/2) dy = ∫_{−∞}^∞ (u + x/2) e^(−u²/2) du
                              = ∫_{−∞}^∞ u e^(−u²/2) du + (x/2) ∫_{−∞}^∞ e^(−u²/2) du
                              = (x/2) √(2π).
Therefore,
E(XY) = (√(6π)/8π) ∫_{−∞}^∞ x² e^(−3x²/8) dx
E(Z²) = (1/2)(1/2 + 1)/(3/8)² = (3/4)/(9/64) = (3 × 64)/(4 × 9) = 16/3.
We will always assume that they have finite mean µ and variance σ 2 .
Remark 3.2. Define
Sn := X1 + X2 + . . . + Xn
Then,
E(Sn ) = nµ and Var(Sn ) = nσ 2 .
Therefore, if
Sn* := (Sn − nµ)/(σ√n)
then Sn* has mean 0 and variance 1.
(iii) If X1 ∼ Bern(p) for some 0 < p < 1, then Sn ∼ B(n, p). Hence,
P(Sn = k) = C(n, k) p^k (1 − p)^(n−k)
Here, µ = p and σ² = p(1 − p), so
Sn* = (Sn − np)/√(np(1 − p))
De Moivre (ca 1700) and Laplace (ca 1800) proved that in this case,
P (Sn∗ ≤ x) ≈ Φ(x)
where Φ is the standard normal distribution. This was generalized to its current
form by Lindeberg in 1922.
Theorem 3.4. Let X1 , X2 , . . . , Xn be i.i.d random variables with mean µ and finite
non-zero variance σ². If Sn = X1 + X2 + . . . + Xn, then for any x ∈ R,
lim_{n→∞} P((Sn − nµ)/(σ√n) ≤ x) = Φ(x)
a. Normal Approximation
Remark 3.5. For n large, the Central Limit Theorem says that
P((Sn − nµ)/(σ√n) ≤ x) ≈ Φ(x)
Rewriting this, we get
P(Sn ≤ x) ≈ Φ((x − nµ)/(σ√n))
This is called a normal approximation formula.
Example 3.6. Suppose the length of life of a light bulb is exponentially distributed with
mean µ = 10 days. As soon as one bulb burns out, another is installed in its place. Find the
probability that more than 50 light bulbs will be needed in a year.
Solution: Let Xn denote the length of life of the nth bulb. We assume that X1 , X2 , . . .
are all independent with Xi ∼ Exp(λ) where λ = 1/10 (so that µ = 10 and σ 2 = 100).
Then, if
Sn := X1 + X2 + . . . + Xn
then we wish to calculate P(S50 < 365). Now, by the normal approximation formula,
P(S50 < 365) ≈ Φ((365 − 50µ)/(σ√50))
            = Φ((365 − 500)/(10√50))
            = Φ(−1.91) = 0.028.
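The same computation in code, using Φ(z) = (1 + erf(z/√2))/2 instead of the table:

```python
import math

def Phi(z):
    # Standard normal distribution function via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma, n = 10.0, 10.0, 50
p = Phi((365 - n * mu) / (sigma * math.sqrt(n)))
assert abs(p - 0.028) < 2e-3   # P(S_50 < 365), as above
```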
Example 3.7. An instructor has 60 exams to grade in sequence. The time required to
grade each exam is independent and identically distributed with mean 15 minutes and
standard deviation 2 minutes. Approximate the probability that she will grade all the
exams
(i) in the first 14 hours?
(ii) in the first 16 hours?
Solution: Let Xi be the time taken to grade the ith exam, then we are interested in
60
X
S60 = Xi .
i=1
(i) For part (i), we are interested in
P(S60 ≤ 840) ≈ Φ((840 − (60 × 15))/(2√60))
            = Φ(−30/√60)
            = Φ(−3.87)
Hence, P(S60 ≤ 840) ≈ 0,
because Φ(3.9) ≈ 1.
Note: In practice, X1 will be much larger than X60 because it gets easier to grade as
you go along. Therefore, X1 is most certainly not a normal distribution!
b. Additional Material
Here is some additional material to help understand the Central Limit Theorem:
(i) Youtube video: https://www.youtube.com/watch?v=jvoxEYmQHNM
(ii) Geogebra applet: https://www.geogebra.org/m/xqvcg8sm
(iii) Geogebra applet: https://www.geogebra.org/m/n4SujFmy
c. Random Sampling
Remark 3.8. Consider a probability space (Ω, A, P ) and a random variable X : Ω → R.
We think of values x := X(w) taken by X as outcomes of an experiment. Let F be the
distribution function of X. In practice, we often do not know what F is.
(i) To understand F better, a statistician would obtain n independent observations on
X, i.e., she would obtain n values x1, x2, . . . , xn assumed by X. Each xi is regarded
as a value assumed by a random variable Xi, 1 ≤ i ≤ n, where X1, X2, . . . , Xn are
independent RVs with common distribution F. The observed values (x1, x2, . . . , xn)
are then taken to be values of (X1, X2, . . . , Xn).
(ii) The set {X1, X2, . . . , Xn} is called a sample of size n taken from a population with
distribution F.
(iv) A simple random sample is one choice of tuple (x1 , x2 , . . . , xn ) which is chosen
without replacement. Note: In this case, the variables X1 , X2 , . . . , Xn are not
independent, but for large populations and small sample size, it is not very different
from sampling with replacement (in which case they are independent).
(v) In practice, one often observes not the values {x1 , x2 , . . . , xn } but a single value
φ(x1 , x2 , . . . , xn ).
Example 3.9. Suppose X ∼ Bern(p) for some 0 < p < 1, which is possibly unknown.
If five independent observations are {0, 1, 1, 0, 0}, then this is a realization of the sample
{X1, X2, . . . , X5}. The sample mean is
x̄ = 2/5
which is the value assumed by the RV X̄, and the sample variance is
s² = (Σ_{i=1}^5 xi² − 5x̄²)/4 = 0.3
which is the value assumed by the RV S².
Theorem 3.10. For sufficiently large n, the distribution of the sample mean X̄ approx-
imates a normal distribution (regardless of the original distribution F).
Remark 3.11. Fix c > 0 and consider
P(|Sn/n − µ| ≥ c) = P(Sn ≤ nµ − nc) + P(Sn ≥ nµ + nc)
                  ≈ Φ(−nc/(σ√n)) + 1 − Φ(nc/(σ√n))
                  = 2(1 − Φ(c√n/σ))
Hence, if δ := c√n/σ, then
P(|Sn/n − µ| ≥ c) ≈ 2(1 − Φ(δ))
where
δ = c√n/σ = 0.1√n/2 = √n/20
In other words,
Φ(√n/20) = 1 − 0.025 = 0.975
From the Standard Normal table, we see that
√n/20 = 1.96 ⇒ n = (1.96 × 20)² ≈ 1536.64
d. Understanding the CLT
Remark 3.13. Suppose you have 10,000 people, each weighing between 30 and 100 kilo-
grams.
If you had all their weights, you could take an average to find the population mean.
(i) If you don’t have all that data (or you cannot process so much), you could find
the population mean by the following steps:
Choose a group of 30 and take the average weight. This is the sample mean.
Repeat this process 50 times. Each time you choose a group of 30 people and
take an average, you get a different sample mean.
Take the average of all these 50 values. This number will be close to the
population mean.
(ii) You may also tabulate all the 50 sample means and draw a histogram. This
histogram will resemble a bell curve.
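Both observations are easy to reproduce in a simulation; a sketch in Python (the uniform weight distribution, seed, and group sizes are our own illustrative choices):

```python
import random
import statistics

random.seed(1)
population = [random.uniform(30, 100) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# 50 groups of 30 people, one sample mean per group
sample_means = [statistics.mean(random.sample(population, 30)) for _ in range(50)]

# (i) the average of the sample means is close to the population mean
assert abs(statistics.mean(sample_means) - pop_mean) < 3.0
# (ii) a histogram of sample_means would cluster around pop_mean in a bell shape
```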
For part (i), you do not need the CLT. The Weak Law of Large Numbers assures you of
it already. For part (ii), however, you do need the CLT.
(i) It does not say anything about any individual’s weight. That is still a random
phenomenon.
(ii) It does not answer questions like "What fraction of people weigh more than 60 kg?",
i.e., what is P(X > 60), where X is the random variable that assigns each person
their weight? To answer this, we need the probability distribution for X, which we
do not have!
(iii) The number 30 chosen above may not be good enough, depending on your popu-
lation. The vague term ‘close to’ used above is also not precise unless you know
something about the distribution of X.
VIII. Moment Generating Functions and Characteristic Functions
1. Moment Generating Function
Remark 1.1. If X is a random variable and t ∈ R, then Y := e^{tX} is a random variable given by
Y(w) := e^{tX(w)} = Σ_{n=0}^∞ t^n X(w)^n / n!.
Note that this series always converges, so Y is well-defined. It may or may not have finite expectation, though.
Definition 1.2. The moment generating function of a random variable X is defined as
M_X(t) := E(e^{tX}),
which is defined for all t ∈ R such that e^{tX} has finite expectation.
Example 1.3. If X ∼ Bern(p), then
M_X(t) = E(e^{tX}) = e^{t·0} P(X = 0) + e^{t·1} P(X = 1) = (1 − p) + e^t p.
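As a quick numerical check of Example 1.3 (the values p = 0.3 and t = 0.7 below are arbitrary):

```python
import math

def mgf_bernoulli(p: float, t: float) -> float:
    """Closed form: M_X(t) = (1 - p) + p e^t for X ~ Bern(p)."""
    return (1 - p) + p * math.exp(t)

def mgf_direct(p: float, t: float) -> float:
    """E(e^{tX}) summed directly over the two outcomes 0 and 1."""
    return math.exp(t * 0) * (1 - p) + math.exp(t * 1) * p

print(mgf_bernoulli(0.3, 0.7), mgf_direct(0.3, 0.7))
```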
Remark 1.4. Suppose X is a random variable taking only non-negative integer values. Then the probability generating function (see Definition III.5.2) is given by
Φ_X(t) = Σ_{n=0}^∞ t^n P(X = n).
Hence,
M_X(t) = E(e^{tX}) = Σ_{n=0}^∞ e^{tn} P(X = n) = Φ_X(e^t).
Example 1.5. If X ∼ B(n, p), then we know that the probability generating function is given by
Φ_X(t) = (pt + 1 − p)^n.
Hence,
M_X(t) = (pe^t + 1 − p)^n.
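The identity M_X(t) = Φ_X(e^t) of Remark 1.4 can be verified numerically for the binomial distribution; the parameters below are arbitrary test values:

```python
import math

def pgf_binomial(n: int, p: float, t: float) -> float:
    """Phi_X(t) = (p t + 1 - p)^n for X ~ B(n, p)."""
    return (p * t + 1 - p) ** n

def mgf_binomial(n: int, p: float, t: float) -> float:
    """Direct sum: E(e^{tX}) = sum_k e^{tk} C(n, k) p^k (1-p)^{n-k}."""
    return sum(
        math.exp(t * k) * math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )

n, p, t = 10, 0.4, 0.5
print(mgf_binomial(n, p, t), pgf_binomial(n, p, math.exp(t)))
```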
Example 1.6. If X ∼ Po(λ), then we had computed in Example III.5.5 that
Φ_X(t) = e^{λ(t−1)}.
Therefore,
M_X(t) = e^{λ(e^t − 1)}.
Example 1.7. Suppose X ∼ n(µ, σ^2). Then

M_X(t) = E(e^{tX}) = (1/(σ√(2π))) ∫_{−∞}^∞ e^{tx} e^{−(x−µ)^2/2σ^2} dx
  = (1/(σ√(2π))) ∫_{−∞}^∞ e^{t(y+µ)} e^{−y^2/2σ^2} dy      (substituting y = x − µ)
  = (e^{µt}/(σ√(2π))) ∫_{−∞}^∞ e^{ty − y^2/2σ^2} dy.

Now note that
ty − y^2/2σ^2 = −(y − σ^2 t)^2/2σ^2 + σ^2 t^2/2.
Hence,
M_X(t) = (e^{µt} e^{σ^2 t^2/2}/(σ√(2π))) ∫_{−∞}^∞ e^{−(y − σ^2 t)^2/2σ^2} dy = e^{µt + σ^2 t^2/2}.
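The closed form e^{µt + σ^2 t^2/2} can be sanity-checked against a direct numerical integration of E(e^{tX}); the values of µ, σ and t below are arbitrary:

```python
import math

def normal_mgf_numeric(mu: float, sigma: float, t: float) -> float:
    """E(e^{tX}) for X ~ n(mu, sigma^2), by trapezoidal integration."""
    lo, hi, steps = mu - 12 * sigma, mu + 12 * sigma, 20_000
    h = (hi - lo) / steps

    def integrand(x: float) -> float:
        return math.exp(t * x - (x - mu) ** 2 / (2 * sigma**2))

    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + i * h) for i in range(1, steps))
    return total * h / (sigma * math.sqrt(2 * math.pi))

mu, sigma, t = 1.0, 2.0, 0.5
closed_form = math.exp(mu * t + sigma**2 * t**2 / 2)
print(normal_mgf_numeric(mu, sigma, t), closed_form)
```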
(iii) If X_1, X_2, . . . , X_n are i.i.d. and S_n = X_1 + X_2 + . . . + X_n, then
M_{S_n}(t) = (M_{X_1}(t))^n.
Proof.
(i) Follows from the definition.
Theorem 1.10. Suppose X is a random variable such that M_X(t) is finite for all t ∈ (−δ, δ) for some δ > 0. Then, for each n ∈ N,
E(X^n) = (d^n/dt^n) M_X(t)|_{t=0}.
Proof. Consider the series
M_X(t) = E(e^{tX}) = E(Σ_{n=0}^∞ t^n X^n / n!) = Σ_{n=0}^∞ (t^n/n!) E(X^n) = 1 + tE(X) + (t^2/2!)E(X^2) + . . .
Differentiating term by term and plugging in t = 0 gives the desired result. All this
works because the series converges absolutely by assumption.
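Theorem 1.10 can be illustrated numerically: central finite differences of M_X at t = 0 recover the first two moments. A sketch for X ∼ Bern(p), where E(X) = E(X^2) = p (p = 0.3 is arbitrary):

```python
import math

p = 0.3  # arbitrary Bernoulli parameter

def M(t: float) -> float:
    """MGF of Bern(p): M_X(t) = (1 - p) + p e^t."""
    return (1 - p) + p * math.exp(t)

h = 1e-5
first = (M(h) - M(-h)) / (2 * h)            # central difference ~ M'(0) = E(X)
second = (M(h) - 2 * M(0) + M(-h)) / h**2   # second difference ~ M''(0) = E(X^2)

print(first, second)
```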
Example 1.11. If X ∼ n(0, σ^2), then
M_X(t) = e^{σ^2 t^2/2} = Σ_{n=0}^∞ σ^{2n} t^{2n} / (2^n n!).
Hence,
E(X^m) = 0
whenever m is odd. Moreover, if m = 2n is even,
E(X^m) = σ^{2n} (2n)! / (2^n n!).
In particular, E(X^2) = σ^2 (as we already know from Example VII.2.6).
2. Characteristic functions
Remark 2.1. Recall Euler's formula: e^{it} = cos(t) + i sin(t). Thus,
cos(t) = (e^{it} + e^{−it})/2, and sin(t) = (e^{it} − e^{−it})/2i.
(iv) If h(t) = f(t) + ig(t) is a complex valued function, then its derivative and integral are defined componentwise:
h′(t) = f′(t) + ig′(t)  and  ∫_a^b h(t) dt = ∫_a^b f(t) dt + i ∫_a^b g(t) dt.
(v) In particular, for c ∈ C,
(d/dt) e^{ct} = c e^{ct}
and
∫_a^b e^{ct} dt = (e^{bc} − e^{ac})/c.
(vi) Let (Ω, A, P ) be a probability space. A complex valued function Z : Ω → C can
be expressed in the form
Z(w) = X(w) + iY (w)
where X and Y are real-valued. We say that Z is a (complex) random variable if
both X and Y are random variables.
(vii) We say that Z has finite expectation if X and Y both have finite expectation, which is equivalent to requiring that
E(|Z|) < ∞.
In that case, we define
EZ := E(X) + iE(Y).
Definition 2.2. The characteristic function of a random variable X is defined as
φ_X(t) := E(e^{itX})
for all t ∈ R.
Remark 2.3.
(i) Formally,
φ_X(t) = M_X(it).
However, φ_X is defined even when M_X may not be. Indeed, |e^{itX(w)}| = 1 for all w ∈ Ω, so that
|φ_X(t)| ≤ E(1) = 1
for all t ∈ R.
(ii) If X has density function f_X : R → R, then
φ_X(t) = ∫_{−∞}^∞ e^{itx} f_X(x) dx.
(iii) Moreover,
φ_X(0) = E(e^{i·0·X}) = 1.
(iv) If c ∈ R, then
φ_{cX}(t) = E(e^{itcX}) = φ_X(ct)
for all t ∈ R.
For instance, if X ∼ B(n, p), then M_X(t) = (pe^t + 1 − p)^n by Example 1.5. Hence,
φ_X(t) = (pe^{it} + 1 − p)^n.
Similarly, if X is uniformly distributed on [a, b], then f_X(x) = 1/(b − a) for x ∈ [a, b]. Hence,
φ_X(t) = ∫_{−∞}^∞ e^{itx} f_X(x) dx = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{ibt} − e^{iat})/(it(b − a)).
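This closed form can be checked against a numerical evaluation of E(e^{itX}) using Python's complex arithmetic; a, b and t below are arbitrary:

```python
import cmath

def cf_uniform_closed(a: float, b: float, t: float) -> complex:
    """phi_X(t) = (e^{ibt} - e^{iat}) / (i t (b - a)) for t != 0."""
    return (cmath.exp(1j * b * t) - cmath.exp(1j * a * t)) / (1j * t * (b - a))

def cf_uniform_numeric(a: float, b: float, t: float, steps: int = 10_000) -> complex:
    """E(e^{itX}) = (1/(b-a)) * integral_a^b e^{itx} dx, trapezoidal rule."""
    h = (b - a) / steps
    f = lambda x: cmath.exp(1j * t * x)
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h / (b - a)

a, b, t = -1.0, 2.0, 1.5
print(cf_uniform_closed(a, b, t), cf_uniform_numeric(a, b, t))
```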
Example 2.7. If X ∼ n(µ, σ^2), then we had seen that
M_X(t) = e^{µt + σ^2 t^2/2}.
Hence,
φ_X(t) = e^{iµt − σ^2 t^2/2}.
Example 2.8. If X ∼ Exp(λ), then
f_X(x) = λe^{−λx} for x ≥ 0, and f_X(x) = 0 for x < 0.
Hence,
φ_X(t) = λ ∫_0^∞ e^{itx − λx} dx = λ e^{(it−λ)x}/(it − λ) |_0^∞ = λ/(λ − it).
Note that, unlike M_X(t) in Example 1.8, this is defined everywhere on R.
Theorem 2.9. If X_1, X_2, . . . , X_n are i.i.d. random variables and S_n = X_1 + X_2 + . . . + X_n, then
φ_{S_n}(t) = (φ_{X_1}(t))^n.
Proof. Exactly as in Lemma 1.9.
Remark 2.10. Note that
φ_X(t) = Σ_{n=0}^∞ (i^n E(X^n)/n!) t^n.
This series converges absolutely, so by differentiating, we get
φ′_X(0) = iE(X).
More generally,
φ_X^{(n)}(0) = i^n E(X^n).
This can be used to compute the higher moments of X exactly as in Theorem 1.10.
3. Inversion Formulas and the Continuity Theorem
Remark 3.1.
(i) Recall that if X is a non-negative integer valued discrete random variable, then we defined the probability generating function as
Φ_X(t) = Σ_{n=0}^∞ t^n f_X(n).
(ii) Now suppose X is any integer-valued random variable. Then the characteristic function is
φ_X(t) = Σ_{n=−∞}^∞ e^{itn} f_X(n).
Lemma 3.2. For integers j and k,
(1/2π) ∫_{−π}^π e^{i(j−k)t} dt = 1 if j = k, and 0 if j ≠ k.
Proof. If j = k, then
e^{i(j−k)t} = 1
for all t ∈ [−π, π]. So,
(1/2π) ∫_{−π}^π e^{i(j−k)t} dt = 1.
If j ≠ k, then
∫_{−π}^π e^{i(j−k)t} dt = e^{i(j−k)t}/(i(j − k)) |_{−π}^π
  = (e^{i(j−k)π} − e^{−i(j−k)π})/(i(j − k))
  = 2i sin((j − k)π)/(i(j − k))
  = 0.
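The orthogonality relation of Lemma 3.2 is easy to verify numerically (the integers j and k below are arbitrary):

```python
import cmath
import math

def inner(j: int, k: int, steps: int = 20_000) -> complex:
    """(1/2pi) * integral_{-pi}^{pi} e^{i(j-k)t} dt, by the trapezoidal rule."""
    h = 2 * math.pi / steps
    f = lambda t: cmath.exp(1j * (j - k) * t)
    total = 0.5 * (f(-math.pi) + f(math.pi))
    total += sum(f(-math.pi + i * h) for i in range(1, steps))
    return total * h / (2 * math.pi)

print(inner(3, 3), inner(3, 5))
```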
Hence, if X is an integer-valued random variable, then for every k ∈ Z,
f_X(k) = (1/2π) ∫_{−π}^π e^{−ikt} φ_X(t) dt.
This is called the inversion formula.
Proof. By definition,
φ_X(t) = Σ_{n=−∞}^∞ e^{itn} f_X(n).
Therefore,
(1/2π) ∫_{−π}^π φ_X(t) e^{−ikt} dt = (1/2π) ∫_{−π}^π e^{−ikt} Σ_{n=−∞}^∞ e^{itn} f_X(n) dt
  = Σ_{n=−∞}^∞ f_X(n) (1/2π) ∫_{−π}^π e^{it(n−k)} dt
  = f_X(k)
by Lemma 3.2.
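The inversion formula can be tested numerically, for instance by recovering the mass function of X ∼ Po(λ) from φ_X(t) = e^{λ(e^{it} − 1)} (λ = 2 is an arbitrary choice):

```python
import cmath
import math

lam = 2.0  # arbitrary Poisson parameter

def cf_poisson(t: float) -> complex:
    """phi_X(t) = e^{lam (e^{it} - 1)} for X ~ Po(lam)."""
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

def invert(k: int, steps: int = 20_000) -> float:
    """f_X(k) = (1/2pi) * integral_{-pi}^{pi} e^{-ikt} phi_X(t) dt."""
    h = 2 * math.pi / steps
    f = lambda t: cf_poisson(t) * cmath.exp(-1j * k * t)
    total = 0.5 * (f(-math.pi) + f(math.pi))
    total += sum(f(-math.pi + i * h) for i in range(1, steps))
    return (total * h / (2 * math.pi)).real

for k in range(4):
    print(k, invert(k), math.exp(-lam) * lam**k / math.factorial(k))
```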
The following is an analogue for continuous random variables.
Theorem 3.5. Suppose X has a characteristic function φ_X which is integrable on R. Then X has the density function
f_X(x) = (1/2π) ∫_{−∞}^∞ e^{−itx} φ_X(t) dt.
For instance, suppose φ_X(t) = e^{−σ^2 t^2/2}. Then
f_X(x) = (1/2π) ∫_{−∞}^∞ e^{−itx} e^{−σ^2 t^2/2} dt.
Note that
ixt + σ^2 t^2/2 = (σt + ix/σ)^2/2 + x^2/2σ^2.
Hence,
f_X(x) = (e^{−x^2/2σ^2}/2π) ∫_{−∞}^∞ e^{−(σt + ix/σ)^2/2} dt
  = (e^{−x^2/2σ^2}/(2πσ)) ∫_{−∞}^∞ e^{−y^2/2} dy
  = (1/(σ√(2π))) e^{−x^2/2σ^2}.
Hence, X ∼ n(0, σ^2).
Remark 3.7. Suppose X and Y are two random variables such that
φX (t) = φY (t)
for all t ∈ R, and suppose we know that φX is integrable. Then, by Theorem 3.5, we
have
fX (x) = fY (x)
for all x ∈ R. Hence, X ∼ Y . The next theorem tells us that this result is true even if
we do not assume that φX is integrable.
Theorem 3.8 (Uniqueness Theorem). If X and Y are two random variables with the
same characteristic functions, then X and Y have the same distribution function.
(End of Day 28)
Example 3.9. Suppose X ∼ n(µ_1, σ_1^2) and Y ∼ n(µ_2, σ_2^2) are two independent normal random variables. Then,
φ_X(t) = e^{iµ_1 t − σ_1^2 t^2/2} and φ_Y(t) = e^{iµ_2 t − σ_2^2 t^2/2}.
Since they are independent, we know that
φ_{X+Y}(t) = φ_X(t)φ_Y(t) = e^{i(µ_1 + µ_2)t − (σ_1^2 + σ_2^2)t^2/2}.
By the uniqueness theorem, it follows that
X + Y ∼ n(µ_1 + µ_2, σ_1^2 + σ_2^2).
This was the conclusion of Theorem VI.2.3.
Theorem 3.10 (Continuity Theorem). Let (X_n) be a sequence of random variables and X be a random variable such that
lim_{n→∞} φ_{X_n}(t) = φ_X(t)
for all t ∈ R. Then, at every point x ∈ R at which F_X is continuous,
lim_{n→∞} F_{X_n}(x) = F_X(x).
Proof. Assume that all the characteristic functions are (uniformly) integrable. Then, by a convergence theorem from analysis, we can say that, for each x ∈ R,
lim_{n→∞} ∫_{−∞}^∞ e^{−itx} φ_{X_n}(t) dt = ∫_{−∞}^∞ e^{−itx} φ_X(t) dt.
Now if the characteristic functions are not integrable, one must apply this theorem to the function φ_X(t)e^{−c^2 t^2/2} (see the book for the details).
4. The Weak Law of Large Numbers and the Central Limit Theorem
Remark 4.1.
(i) Recall that φ_X(0) = 1 for any random variable X.
(ii) Since φ_X is continuous and φ_X(0) = 1, the function log(φ_X(t)) is well-defined for t in a neighbourhood of 0.
(iii) Suppose X has finite mean µ. Then,
φ′_X(0) = iµ
⇒ lim_{t→0} log(φ_X(t))/t = lim_{t→0} (log(φ_X(t)) − log(φ_X(0)))/(t − 0)
  = (d/dt) log(φ_X(t))|_{t=0}
  = φ′_X(0)/φ_X(0)
  = iµ.
Hence,
lim_{t→0} (log(φ_X(t)) − iµt)/t = 0.
(iv) Now suppose X has finite variance σ^2. Then,
φ″_X(0) = −E(X^2) = −(µ^2 + σ^2).
Now by L'Hôpital's rule,
lim_{t→0} (log(φ_X(t)) − iµt)/t^2 = lim_{t→0} (φ′_X(t)/φ_X(t) − iµ)/(2t)
  = lim_{t→0} (φ′_X(t) − iµφ_X(t))/(2tφ_X(t))
  = lim_{t→0} (φ′_X(t) − iµφ_X(t))/(2t)      (since φ_X(t) → 1 as t → 0)
  = lim_{t→0} (φ″_X(t) − iµφ′_X(t))/2
  = (φ″_X(0) − iµφ′_X(0))/2
  = (−(µ^2 + σ^2) + µ^2)/2
  = −σ^2/2.
(v) Let Z : Ω → R denote the random variable
Z(w) = 0 for all w ∈ Ω.
Then, P(Z = 0) = 1, so
φ_Z(t) = e^{i·0} P(Z = 0) = 1.
Moreover, the distribution function of Z is
F_Z(x) = 0 if x < 0, and 1 if x ≥ 0.
Theorem 4.2 (Weak Law of Large Numbers). Let X_1, X_2, . . . be a sequence of i.i.d. random variables having finite mean µ. Set
S_n := X_1 + X_2 + . . . + X_n.
Then, for every ϵ > 0,
lim_{n→∞} P(|S_n/n − µ| > ϵ) = 0.
Proof. Set Z_n := S_n/n − µ. Taking logarithms and using part (iii) of Remark 4.1, one shows that, for each t ∈ R,
lim_{n→∞} φ_{Z_n}(t) = 1 = φ_Z(t)
by part (v) of Remark 4.1. Therefore, by the Continuity Theorem (Theorem 3.10),
lim_{n→∞} F_{Z_n}(x) = F_Z(x)
for every x ≠ 0 (the points at which F_Z is continuous).
Hence, for any ϵ > 0, we have
lim_{n→∞} P(Z_n < −ϵ) = lim_{n→∞} F_{Z_n}(−ϵ) = F_Z(−ϵ) = 0
and
lim_{n→∞} P(Z_n > ϵ) = 1 − lim_{n→∞} P(Z_n ≤ ϵ) = 1 − lim_{n→∞} F_{Z_n}(ϵ) = 1 − F_Z(ϵ) = 0.
Hence,
lim_{n→∞} P(|Z_n| > ϵ) = 0.
The same argument yields the Central Limit Theorem. Suppose, in addition, that the X_i have finite variance σ^2 > 0, and set S_n* := (S_n − nµ)/(σ√n). Taking logarithms and using part (iv) of Remark 4.1, one obtains
lim_{n→∞} ( −inµt/(σ√n) + n log(φ_{X_1}(t/(σ√n))) ) = −t^2/2
even when t = 0. Hence, we conclude that for any t ∈ R,
lim_{n→∞} φ_{S_n*}(t) = e^{−t^2/2}.
This is the characteristic function of Z ∼ n(0, 1). So by the Continuity Theorem (Theorem 3.10), we have that for any x ∈ R,
lim_{n→∞} F_{S_n*}(x) = Φ(x).
Example 4.4. A fair coin is tossed 100 times. Using a normal approximation, estimate the probability of getting more than 60 heads.
Solution: Here, X ∼ Bern(1/2) represents a single coin toss, so we are interested in S_100 = Σ_{i=1}^{100} X_i, where the X_i are i.i.d. random variables with X_i ∼ Bern(1/2). We wish to determine
P(S_100 > 60) = 1 − P(S_100 ≤ 60).
We know that
E(S_100) = 100E(X) = 50
Var(S_100) = 100Var(X) = 100(1/2)(1 − 1/2) = 25.
Hence,
P(S_100 > 60) ≈ 1 − Φ((60 − 50)/√25) = 1 − Φ(2) ≈ 0.0228.
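Since n = 100 is small enough to sum the binomial probabilities exactly, the normal approximation can be compared with the exact tail:

```python
import math

def Phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Exact tail P(S_100 > 60) for S_100 ~ B(100, 1/2).
exact = sum(math.comb(100, k) for k in range(61, 101)) / 2**100

# Normal approximation: standardize with mean 50 and sd sqrt(25) = 5.
approx = 1 - Phi((60 - 50) / 5)

print(round(exact, 4), round(approx, 4))
```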
Example 4.5. A drug is supposed to be 75% effective. It is tested on 100 people. Using
a normal approximation, estimate the probability that at least 70 people will be cured.
We wish to determine
P(S_100 ≥ 70).
We know that
E(S_100) = 100E(X) = 75
Var(S_100) = 100Var(X) = 100 × 0.75 × 0.25 = 18.75.
Hence,
P(S_100 ≥ 70) ≈ 1 − Φ((70 − 75)/√18.75) = Φ(5/√18.75) ≈ Φ(1.15) ≈ 0.875.
Continuous:
P((X, Y) ∈ B) = ∫∫_B f(x, y) dy dx.
Marginal of X: f_X(x) = ∫_{−∞}^∞ f(x, y) dy.
X and Y independent ⇔ f(x, y) = f_X(x)f_Y(y).
(vii) Expectation:
Discrete RV: E(X) = Σ_x x P(X = x).
Continuous RV: E(X) = ∫_{−∞}^∞ x f(x) dx.
Function of RV: E(φ(X)) = ∫_{−∞}^∞ φ(x) f(x) dx.
Linearity: E(aX + bY) = aE(X) + bE(Y).
If X and Y are independent, then E(XY) = E(X)E(Y).
Markov Inequality: If X is non-negative, then
P(X ≥ t) ≤ E(X)/t.
Chebyshev's Inequality:
P(|X − µ| ≥ t) ≤ σ^2/t^2.
(End of Day 30)
(xii) Weak Law of Large Numbers:
lim_{n→∞} P(|S_n/n − µ| > ϵ) = 0.
Bibliography
[HPS] Hoel, Port, Stone, Introduction to Probability Theory, Houghton Mifflin Co.
(1971)