
MTH 202: Probability and Statistics

Semester 2, 2023-2024

Prahlad Vaidyanathan
Contents

Two Experiments

I. Probability Spaces
   1. Examples of Random Phenomena
   2. Probability Spaces
   3. Properties of Probabilities
   4. Conditional Probability
   5. Independence

II. Combinatorial Analysis
   1. Ordered Sample with Replacement
   2. Ordered Sample without Replacement (Permutations)
   3. Unordered Sample without Replacement (Combinations)
   4. Unordered Sample with Replacement
   5. Examples

III. Discrete Random Variables
   1. Definitions
   2. Computations with Densities
   3. Discrete Random Vectors
   4. Independent Random Variables
   5. Sums of Independent Random Variables

IV. Expectation of Discrete Random Variables
   1. Definition of Expectation
   2. Properties of Expectation
   3. Moments
   4. Variance of a Sum
   5. Correlation Coefficient
   6. Chebyshev’s Inequality

V. Continuous Random Variables
   1. Random Variables and their Distribution Functions
   2. Densities of Continuous Random Variables
   3. Normal, Exponential and Gamma Densities

VI. Jointly Distributed Random Variables
   1. Properties of Bivariate Distributions
   2. Distribution of Sums
   3. Conditional Densities
   4. Properties of Multivariate Distributions

VII. Expectations and the Central Limit Theorem
   1. Expectations of Continuous Random Variables
   2. Moments of Continuous Random Variables
   3. The Central Limit Theorem

VIII. Moment Generating Functions and Characteristic Functions
   1. Moment Generating Function
   2. Characteristic Functions
   3. Inversion Formulas and the Continuity Theorem
   4. The Weak Law of Large Numbers and the Central Limit Theorem
   5. Review and Important Formulae

Two Experiments
Remark 0.1 (The Monty Hall Problem). Imagine you are on a game show. There are
three doors, one with a prize behind it.

A B C

You’re allowed to pick any door, so you choose the first one at random, door A.

Before opening Door A, the rules of the game require the host to open one of the other
doors and let you switch your choice if you want. Because the host doesn’t want to give
away the game, they always open an empty door. In your case, the host opens door C:
no prize, as expected. “Do you want to switch to door B?” the host asks.

Remark 0.2. Now the same problem but with 100 doors.

1 2 3 ... 100

You again pick Door 1. The host now opens all doors except Door 72. Now should you
switch to Door 72?

Here, you have to ask yourself: Why did they not open Door 72? Almost certainly
because that’s where the prize is hidden! Maybe you got really lucky and picked right
with the first door at the beginning. But it’s way more likely you didn’t, and now Door
72 must be the right one!

Solution: In the Monty Hall Problem, it is always correct to switch doors, since doing
so increases your chance of success. Indeed, in the first problem, switching doubles your
chance of success!

We will see why shortly (See Example I.4.7).

Remark 0.3 (The Gambler’s Fallacy). In a game of Roulette, a wheel is spun in one
direction and a ball placed on it is spun in the other direction. The ball eventually stops
and lands in one of 37 slots on the edge of the wheel. The slot is coloured either red or
black (The zero slot is coloured green).

When you reach the table, you are told that the ball has landed in a black slot 26 times
in a row. So is the next one likely to be a red?

• Answer 1: Yes, it is likely to be red. 27 blacks in a row is an extremely unlikely
outcome.

• Answer 2: No, the next one is likely to be black. The game must be rigged to only
give blacks!

• Answer 3: No, the next is equally likely to be black or red. Each spin of the wheel
is a purely random event, similar to a coin flip. The 26 blacks so far are merely a
coincidence.

Solution: Indeed, Answer 3 is correct (See Example I.5.4).

(End of Day 1)

I. Probability Spaces
1. Examples of Random Phenomena
Definition 1.1. Probability Theory is the study of random phenomena.

(i) A phenomenon is an experiment, with some fixed inputs. A deterministic phe-


nomenon is one where the outcome of the experiment can be exactly predicted
from the inputs. A random phenomenon is one that is not deterministic.

(ii) Examples:
(a) Addition of two numbers is deterministic.
(b) If a ball is dropped from a height d metres in a vacuum, the time taken to
reach the bottom is deterministic.
(c) Flipping a coin is a random phenomenon.
(d) The outcome of an election is random.
(e) Other random phenomena: Card games, DRS system in cricket, effects of
climate change (in practice), Expansion of a species in a region, etc.

(iii) A random phenomenon cannot be predicted exactly, but many such phenomena
show statistical regularity. i.e. If you repeat the experiment many many times, the
relative frequency with which a single outcome occurs may be predicted.

(iv) Example: If you toss a coin, the outcome is either H or T . If you toss the coin n
times, the numbers
Nn(H) := (Number of H)/n  and  Nn(T) := (Number of T)/n

are called the relative frequencies. We know from experience that

Nn(H) → 1/2  as n → ∞.

The number 1/2 is thus called the probability that H occurs.

Example 1.2. A twenty-sided die (a D20, an icosahedron) is rolled repeatedly. Each


time it is rolled, the number is noted down (this is the outcome of the experiment).

(i) This is a random phenomenon, so we repeat the experiment n times and write
the relative frequency of seeing a k (for 1 ≤ k ≤ 20) as

Nn(k) := (Number of times we see a k)/n.

For instance, if n = 10, we may get

1, 5, 7, 18, 11, 12, 14, 17, 1, 18.

Then,

N10(1) = 2/10,  N10(2) = 0,  ...,  N10(17) = 1/10,  N10(18) = 2/10,  ...

For each 1 ≤ k ≤ 20, let pk = 1/20. Then we expect that

lim_{n→∞} Nn(k) = pk.

Observe that

• 0 ≤ pk ≤ 1.
• p1 + p2 + ... + p20 = 1.
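The convergence Nn(k) → pk can be checked empirically. The following is a minimal
Python sketch (the sample size, seed, and function name are our own choices, not part
of the notes):

```python
import random

def relative_frequencies(n, sides=20, seed=0):
    """Roll a fair die n times and return Nn(k) for each face k."""
    rng = random.Random(seed)
    counts = [0] * sides
    for _ in range(n):
        counts[rng.randrange(sides)] += 1
    return [c / n for c in counts]

freqs = relative_frequencies(100_000)
# Each Nn(k) should be close to pk = 1/20 = 0.05 for large n,
# and the relative frequencies always sum to 1.
print(max(abs(f - 0.05) for f in freqs))
```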

(ii) Suppose we now want more information from this experiment. For instance, we
may want to know how many even numbers have been rolled. Once again, we may
take the relative frequency

Nn(even) := (Number of times an even number is rolled)/n.

And we may consider

peven := lim_{n→∞} Nn(even).

However, it is clear that

Nn(even) = Nn(2) + Nn(4) + ... + Nn(20),  and therefore

peven = p2 + p4 + ... + p20.

(iii) Suppose the faces are coloured red (1,2,3,4,8,9,10), blue (5,6,7,11,18,19,20) and
yellow (the rest). We may now want to know pyellow. Again, it is clear that

pyellow = Σ_{k yellow} pk.

(iv) In order to make all these experiments more systematic, we introduce the set

Ω := {w1, w2, ..., w20},

where each wk corresponds to the number k ∈ {1, 2, ..., 20}. We now assign

wk ↦ pk.

Then, for any subset A ⊂ Ω, we define

P(A) := Σ_{k∈A} pk.

This set A might be {the set of even numbers} or {the numbers painted yellow}
or anything else. This function

P : A → P (A)

is now defined for all subsets of Ω. Therefore, if

A = {the set of even numbers}


B = {the numbers painted yellow}, then
A ∩ B = {even numbers painted yellow}.

We may then ask: What is P (A ∩ B)? Similarly,

B c = {the numbers not painted yellow}

and we can ask: What is P (B c )?

Therefore, this approach lets us ask (and answer) more questions from the same
experiments.

(v) Observe that

• P(∅) = 0 and P(Ω) = 1.
• If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

Example 1.3. Consider a pointer that is free to spin about the centre of a circle of
radius 1. If the pointer is spun, it comes to rest at an angle θ (in radians) from the X
axis. This is the outcome of the experiment.

(i) Again, this is a random phenomenon. For each t ∈ [0, 2π), θ can take the value t,
so we may set
Ω := [0, 2π)

(ii) Given a point, say t = 1, what is the probability that a random spin will come
to rest at t? Here, the idea of relative frequency is problematic because there are
infinitely many choices. Instead we may ask different questions.
(iii) What is the probability that a random spin will come to rest in such a way that θ
lies in the upper half-circle? Answer: 1/2 of course. So if
A = {t ∈ [0, 2π) : 0 ≤ t ≤ π} ⇒ P (A) = 1/2
Similarly,
A = {t ∈ [0, 2π) : 0 ≤ t ≤ π/2} ⇒ P (A) = 1/4.
More generally, if s ∈ [0, 2π) and

As = {t ∈ [0, 2π) : 0 ≤ t ≤ s}  ⇒  P(As) = s/(2π).

The same is true for any sector whose angle at the origin is s. Hence, if

A[r,s] = {t ∈ [0, 2π) : r ≤ t ≤ s}  ⇒  P(A[r,s]) = (s − r)/(2π).
(iv) Given a subset A ⊂ Ω, can we measure P (A)? Answer: Not always, but it is
possible for many subsets.
(v) If s ∈ [0, 2π) is fixed and A = {s}, what is P (A)?

Solution: Consider

An = {t : s − 1/n ≤ t ≤ s}  ⇒  P(An) = 1/(2nπ).

Now, {s} ⊂ An, so by a theorem we will prove later, it will follow that

0 ≤ P({s}) ≤ P(An) = 1/(2nπ).

This is true for each n ∈ N, so P({s}) = 0. Hence, the probability of hitting any
one angle is zero!
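The squeeze 0 ≤ P({s}) ≤ P(An) can also be seen numerically; a trivial sketch (the
choice of n values is arbitrary):

```python
from math import pi

# P(An) = (1/n) / (2*pi): the length of An divided by the circumference 2*pi.
for n in (1, 10, 1000):
    print(n, 1 / (2 * n * pi))
# The upper bound shrinks to 0, which forces P({s}) = 0.
```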

2. Probability Spaces
Given a random phenomenon, we first need a set Ω consisting of all possible outcomes.
For some subsets A ⊂ Ω, we must assign a probability
A 7→ P (A).

Such a subset is called an event. We say that an event A occurs if a given outcome w
belongs to A. The function P must satisfy three conditions:

• P(∅) = 0, P(Ω) = 1.

• 0 ≤ P(A) ≤ 1 for all events A.

• If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

It may not be possible to say that

P(A) = Σ_{w∈A} P({w})

for each subset A ⊂ Ω.

Definition 2.1. Let Ω be a set. A collection A of subsets of Ω is called a σ-field (or


σ-algebra) if the following conditions hold:

(i) ∅ ∈ A.

(ii) If A ∈ A, then Ac = Ω \ A ∈ A. (Hence, Ω ∈ A).

(iii) If {A1, A2, ...} is a countable collection in A, then

⋃_{n=1}^∞ An ∈ A.

Example 2.2.

(i) If Ω is any set, A := {∅, Ω} is a σ-algebra.

(ii) If Ω is any set, A = P(Ω), the power set, is a σ-algebra.

(iii) If Ω = [a, b], there is a ‘nice’ σ-algebra L that contains all intervals in Ω. This is
called the Lebesgue σ-algebra.

Definition 2.3. Let Ω be a set and A be a σ-algebra on Ω. A probability measure on


(Ω, A) is a function
P : A → [0, ∞)
satisfying the following conditions:

(i) P (Ω) = 1.

(ii) If {A1, A2, ...} ⊂ A are mutually disjoint, then

P(⋃_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).

The triple (Ω, A, P ) is called a probability space.
Example 2.4.
(i) Let Ω be a finite set and A = P(Ω). Then we may define

P(A) := |A|/|Ω|,

where |·| denotes cardinality. This probability space is called a symmetric probability space.

(ii) Let Ω = {H, T} and A = P(Ω) = {∅, {H}, {T}, Ω}. Define P : A → [0, ∞) by

P(∅) = 0,  P({H}) = 1/2,  P({T}) = 1/2,  P(Ω) = 1.

This is an example of a symmetric probability space.
(End of Day 2)
(iii) In the previous example, we may also define

P(∅) = 0,  P({H}) = 1/3,  P({T}) = 2/3,  P(Ω) = 1.

This is a probability space, but it is not symmetric.
(iv) Let Ω = [a, b] and A = L. There is a function µ : L → [0, ∞) such that

µ((c, d)) = µ((c, d]) = µ([c, d)) = |d − c|

for all a ≤ c < d ≤ b. The function P : L → [0, ∞) given by

P(A) := µ(A)/(b − a)

is a probability measure on Ω. This probability space is called the uniform probability space.

(v) If Ω ⊂ R^d is a ‘nice’ set, we may define a similar σ-algebra L that contains all
rectangles of the form

∏_{i=1}^d (ai, bi).

This is also called the Lebesgue σ-algebra, and it also carries a Lebesgue measure µ so that

µ(∏_{i=1}^d (ai, bi)) = ∏_{i=1}^d (bi − ai).

Again, we may define P : L → [0, ∞) by

P(A) = µ(A)/µ(Ω).

This defines a probability measure, and the space is also called a uniform probability space.

Remark 2.5.
(i) If Ω is a finite set, we will typically take A = P(Ω).

(ii) If Ω is infinite, it will almost always occur as a subset of Rd for d ∈ {1, 2, 3}. In
that case, we will usually use some modification of the Lebesgue measure as above.

3. Properties of Probabilities
Lemma 3.1. Suppose (Ω, A, P ) is a probability space. Then
(i) For any A, B ∈ A,
P (B) = P (B ∩ A) + P (B ∩ Ac ).

(ii) For any A ∈ A,


P (Ac ) = 1 − P (A).

(iii) P (∅) = 0.

(iv) If A, B ∈ A are such that A ⊂ B, then

P (A) ≤ P (B).

(v) If B ⊂ A, then
P (A) − P (B) = P (A \ B)
where A \ B = A ∩ B c .

(vi) If {A1, A2, ...} is any sequence of sets, then

P(⋃_{n=1}^∞ An) = 1 − P(⋂_{n=1}^∞ Acn).

Proof.
(i) Note that
B = (A ∩ B) ∪ (Ac ∩ B)
and these two sets are disjoint. Therefore,

P (B) = P (A ∩ B) + P (Ac ∩ B).

(ii) Since A ∪ Ac = Ω and A ∩ Ac = ∅, we have

P (A) + P (Ac ) = P (Ω) = 1.

(iii) Since ∅c = Ω, we have


P (∅) + 1 = 1 ⇒ P (∅) = 0.

(iv) If A ⊂ B, then A ∩ B = A so by part (i),

P (B) = P (A) + P (B ∩ Ac ) ≥ P (A)

because P (B ∩ Ac ) ≥ 0.

(v) If C := A \ B, then C ∩ B = ∅ and C ⊔ B = A. Hence,

P (A) = P (B) + P (C)

as desired.

(vi) This follows from part (ii) and the fact that

(⋃_{n=1}^∞ An)c = ⋂_{n=1}^∞ Acn

by De Morgan’s laws.

Remark 3.2. Consider part (vi) above: each An represents an event, so if
B := ⋃_{n=1}^∞ An, then

P(B) = probability that at least one of these events occurs.

Hence,

P(⋂_{n=1}^∞ Acn) = probability that none of these events occurs.

Example 3.3. Suppose three perfectly balanced and identical coins are tossed. Find
the probability that at least one of them lands heads.

Solution: Each coin C1 , C2 , C3 lands with two possibilities H, T . So there are 8 possible
outcomes of this experiment. In other words, |Ω| = 8. Let A1 be the event that C1 lands
heads. Let A2 and A3 be defined analogously. We are then looking for

P (A1 ∪ A2 ∪ A3 ).

However, D := Ac1 ∩ Ac2 ∩ Ac3 is the event that none of them lands heads. In other words,
each of them lands tails. So, |D| = 1 and

P(D) = 1/8.

Thus, P(A1 ∪ A2 ∪ A3) = 1 − P(D) = 7/8.
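Since Ω here is small, the answer can be verified by brute-force enumeration (a Python
sketch; the variable names are our own):

```python
from itertools import product

# The 8 equally likely outcomes of tossing three fair coins.
outcomes = list(product("HT", repeat=3))
at_least_one_head = [w for w in outcomes if "H" in w]
p = len(at_least_one_head) / len(outcomes)
print(p)  # 7/8 = 0.875
```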

(End of Day 3)

From now onwards, (Ω, A, P ) will denote a fixed probability space. We also write ⊔ to
denote disjoint union (implicitly implying that the sets in question are mutually disjoint).

Lemma 3.4.

(i) If A, B are two events (not necessarily mutually disjoint), then

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

(ii) Let A1, A2, ..., An be a finite collection of events (not necessarily mutually
disjoint). Then,

P(⋃_{i=1}^n Ai) ≤ Σ_{i=1}^n P(Ai).

Proof.

(i) Write A = (A ∩ B) ∪ (A ∩ B c ), then

P (A) = P (A ∩ B) + P (A ∩ B c )
⇒ P (A) − P (A ∩ B) = P (A ∩ B c )
⇒ P (A) + P (B) − P (A ∩ B) = P (B) + P (A ∩ B c ).

However, A ∪ B = B ⊔ (A ∩ B c ), so the result follows.

(ii) We prove this by induction on n.


(a) Suppose n = 2, then by part (i),

P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 ∩ A2 ) ≤ P (A1 ) + P (A2 ).

(b) Now suppose the result is true for (n − 1) sets, and consider n sets as above.
Write B1 := A1 ∪ A2 ∪ ... ∪ An−1 and B2 := An. Then,

P(⋃_{i=1}^n Ai) = P(B1 ∪ B2)
               ≤ P(B1) + P(B2)
               = P(⋃_{i=1}^{n−1} Ai) + P(An)
               ≤ Σ_{i=1}^{n−1} P(Ai) + P(An),

where the last inequality follows from the induction hypothesis. Hence the result.

Theorem 3.5. Let {A1, A2, ...} be a countable collection of events.

(i) If A1 ⊂ A2 ⊂ ... and A = ⋃_{n=1}^∞ An, then

P(A) = lim_{n→∞} P(An).

(ii) If A1 ⊃ A2 ⊃ ... and A = ⋂_{n=1}^∞ An, then

P(A) = lim_{n→∞} P(An).

Proof.

(i) Let B1 := A1 and, for n ≥ 2, set Bn := An ∩ Acn−1. Then {B1, B2, ...} are
mutually disjoint, so

P(⋃_{n=1}^∞ Bn) = Σ_{n=1}^∞ P(Bn).

However, for any fixed n ∈ N,

An = ⊔_{i=1}^n Bi  ⇒  P(An) = Σ_{i=1}^n P(Bi).

This last term is the nth partial sum of the series. Therefore, since A = ⋃_{n=1}^∞ Bn,

P(A) = Σ_{n=1}^∞ P(Bn) = lim_{n→∞} Σ_{i=1}^n P(Bi) = lim_{n→∞} P(An).

(ii) Let Bn := Acn; then B1 ⊂ B2 ⊂ .... So if B := ⋃_{n=1}^∞ Bn, then by part (i),

P(B) = lim_{n→∞} P(Bn).

However,

Bc = ⋂_{n=1}^∞ Bnc = ⋂_{n=1}^∞ An = A.

Hence,

P(A) = P(Bc) = 1 − P(B) = 1 − lim_{n→∞} P(Bn) = 1 − lim_{n→∞} (1 − P(An)) = lim_{n→∞} P(An).

4. Conditional Probability
Example 4.1. Suppose a box has r red balls labelled 1, 2, ..., r, and b black balls
labelled 1, 2, ..., b. Assume that the probability of choosing any one ball is

1/(b + r).

(In other words, this is a uniform probability.) Suppose that a ball is drawn from the
box, and that it is known to be red. What is the probability that it is labelled 1?

Solution: The probability space is Ω = {r1, r2, . . . , rr, b1, b2, . . . , bb}. Consider two
events:
A := chosen ball is red = {r1, r2, . . . , rr}
B := chosen ball is number 1 = {r1, b1}

Note that we are not looking for P(A ∩ B): we are given that A has occurred, and we
wish to find the probability that B occurs given this information.

Definition 4.2. Suppose P(A) > 0. The conditional probability of B given A is

P(B|A) = P(A ∩ B)/P(A).

If P(A) = 0, the conditional probability of B given A is undefined.
Remark 4.3. Explanation of Definition 4.2 using relative frequency: Suppose n trials
of an experiment are conducted, and let

Nn(A) := the number of times A occurs,
Nn(B) := the number of times B occurs,
Nn(A ∩ B) := the number of times A and B both occur.

Then, we expect that

P(A) = lim_{n→∞} Nn(A)/n,
P(B) = lim_{n→∞} Nn(B)/n,
P(A ∩ B) = lim_{n→∞} Nn(A ∩ B)/n.

We are only interested in those trials where A occurs. Out of these Nn(A) trials, we
wish to count the number where B also occurs. In other words, we care about the ratio

Nn(A ∩ B)/Nn(A).

Notice that

lim_{n→∞} Nn(A ∩ B)/Nn(A) = lim_{n→∞} [Nn(A ∩ B)/n]/[Nn(A)/n] = P(A ∩ B)/P(A).

Hence the definition.

Example 4.4. In Example 4.1, we have

P(B|A) = P(A ∩ B)/P(A) = |A ∩ B|/|A| = 1/r.

Note that the unconditional probability of B is

P(B) = 2/(b + r).

(End of Day 4)

Note that, by definition,


P (A ∩ B) = P (B)P (A|B)
We use this formula repeatedly.

Example 4.5. Suppose that the population of a city is 40% male and 60% female.
Also, 50% of the males smoke and 30% of the females smoke. Find the probability that
a smoker is male.

Solution: Here, Ω is the set of all people in the city. Consider the following events:

M := {the person is male}


F := {the person is female}
S := {the person smokes}

The given information is that

P (M ) := 0.4
P (F ) := 0.6
P (S|M ) := 0.5
P (S|F ) := 0.3

Our question asks us to find

P(M|S) = P(M ∩ S)/P(S).

Note that

P(M ∩ S) = P(S|M)P(M) = 0.5 × 0.4 = 0.2,
P(S) = P(S ∩ M) + P(S ∩ F),
P(S ∩ F) = P(S|F)P(F) = 0.3 × 0.6 = 0.18,
⇒ P(S) = 0.2 + 0.18 = 0.38,
⇒ P(M|S) = 0.2/0.38 ≈ 0.526.

Observe that

P(M|S) = P(M)P(S|M)/[P(M)P(S|M) + P(F)P(S|F)].
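This two-event computation can be packaged as a small function (a hypothetical helper,
not from the notes; it simply mirrors the formula above):

```python
def bayes_two(p_m, p_s_given_m, p_f, p_s_given_f):
    """P(M|S) = P(M)P(S|M) / [P(M)P(S|M) + P(F)P(S|F)]."""
    num = p_m * p_s_given_m
    return num / (num + p_f * p_s_given_f)

# The smoker example: P(M) = 0.4, P(S|M) = 0.5, P(F) = 0.6, P(S|F) = 0.3.
p = bayes_two(0.4, 0.5, 0.6, 0.3)
print(p)  # 0.2 / 0.38 ≈ 0.526
```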
Theorem 4.6 (Bayes’ Rule). Let A1, A2, ..., An be mutually disjoint events with
Ω = ⊔_{i=1}^n Ai. Let B be an event with P(B) > 0. Then,

P(Ai|B) = P(Ai)P(B|Ai) / Σ_{k=1}^n P(Ak)P(B|Ak).

Proof. We have

B = B ∩ (⊔_{k=1}^n Ak) = ⊔_{k=1}^n (Ak ∩ B).

Hence,

P(B) = Σ_{k=1}^n P(Ak ∩ B) = Σ_{k=1}^n P(Ak)P(B|Ak).

Moreover,

P(Ai ∩ B) = P(Ai)P(B|Ai).

Taking the ratio of these gives Bayes’ rule.
Example 4.7. Consider the Monty Hall Problem from Remark 0.1: Imagine you are
on a game show. There are three doors, one with a prize behind it.

A B C

You’re allowed to pick any door, so you choose the first one at random, door A.

Before opening Door A, the rules of the game require the host (Monty Hall) to open one
of the other doors and let you switch your choice if you want. Because the host doesn’t
want to give away the game, they always open an empty door. In your case, the host
opens door C: no prize, as expected. “Do you want to switch to door B?” the host asks.

Solution: Once you have picked Door A, there are four possible outcomes:

Ab := the prize is behind Door A and Monty opened Door B,
Ac := the prize is behind Door A and Monty opened Door C,
Bc := the prize is behind Door B and Monty opened Door C,
Cb := the prize is behind Door C and Monty opened Door B.

Define
A := the event that the prize is behind Door A = {Ab, Ac}
B := the event that the prize is behind Door B = {Bc}
C := the event that the prize is behind Door C = {Cb}

Then, it is clear that

P(A) = P(B) = P(C) = 1/3.

Moreover, when the prize is behind Door A, Monty may open either of Doors B and C,
so

P({Ab}) = P({Ac}) = P(A)/2 = 1/6.

Since Monty opened Door C, the event D := {Ac, Bc} has occurred. In order to determine
whether to switch doors or not, we wish to compute two conditional probabilities:

P(A|D) = P(A ∩ D)/P(D)
       = P({Ac})/[P({Ac}) + P({Bc})]
       = (1/6)/(1/6 + 1/3)
       = 1/3,

P(B|D) = P({Bc})/[P({Ac}) + P({Bc})]
       = (1/3)/(1/6 + 1/3)
       = 2/3.

Hence, it is correct to switch to Door B!
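The 1/3 vs 2/3 split can also be confirmed by simulation. A minimal sketch (the trial
count and seed are arbitrary; note that the host never opens the chosen door or the
prize door):

```python
import random

def monty_hall(trials, switch, seed=0):
    """Estimate the win probability of the switch/stay strategies."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(3)
        choice = rng.randrange(3)
        # The host opens a door that is neither the choice nor the prize.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(monty_hall(100_000, switch=True))   # ≈ 2/3
print(monty_hall(100_000, switch=False))  # ≈ 1/3
```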

5. Independence
Definition 5.1. Two events A and B are said to be independent if

P (A ∩ B) = P (A)P (B).

Remark 5.2.

(i) Note that if A and B are independent, then

P (A|B) = P (A) and P (B|A) = P (B).

Hence, the two events do not affect each other.


(End of Day 5)

(ii) We say that events {A1, A2, ..., An} are mutually independent if for any
1 ≤ i1 < i2 < ... < ik ≤ n, we have

P(Ai1 ∩ Ai2 ∩ ... ∩ Aik) = P(Ai1)P(Ai2) ... P(Aik).

(iii) Consider Ω = {1, 2, 3, 4} with each point having probability 1/4. Let

A = {1, 2},  B = {1, 3}  and  C = {1, 4}.

Then,

P(A ∩ B) = P({1}) = 1/4 = P(A)P(B),
P(B ∩ C) = P(B)P(C),
P(A ∩ C) = P(A)P(C), but
P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).

So the events {A, B, C} are pairwise independent but not mutually independent.
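Since the probabilities here are exact fractions, the claim can be checked by direct
enumeration (a sketch using Python's `fractions` module):

```python
from fractions import Fraction
from itertools import combinations

omega = {1, 2, 3, 4}

def P(E):
    """Uniform probability on omega: P(E) = |E| / |omega|."""
    return Fraction(len(E), len(omega))

A, B, C = {1, 2}, {1, 3}, {1, 4}

# Pairwise independence holds for each of the three pairs...
for X, Y in combinations([A, B, C], 2):
    assert P(X & Y) == P(X) * P(Y)

# ...but mutual independence fails for the triple:
print(P(A & B & C), P(A) * P(B) * P(C))  # 1/4 vs 1/8
```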

If A and B are independent, then A and Bc are also independent, because

P(A)P(Bc) = P(A)[1 − P(B)]
          = P(A) − P(A)P(B)
          = P(A) − P(A ∩ B)
          = P(A \ (A ∩ B))    (by part (v) of Lemma 3.1)
          = P(A ∩ Bc).

Example 5.3. Consider an experiment where a coin is tossed n times with the condition
that
P ({H}) = p and P ({T }) = 1 − p.
for some fixed 0 ≤ p ≤ 1. Construct the corresponding probability space.

Solution:

(i) The experiment has 2^n possible outcomes, so we may take

Ω = {(x1, x2, ..., xn) : xi ∈ {0, 1} for all 1 ≤ i ≤ n}.

Here, if xi = 1, we think of it as ‘Heads’, and if xi = 0 we think of it as ‘Tails’. The
σ-algebra is A = P(Ω). We now wish to assign P(·) to sets of the form {x}, where x
has k copies of 1 and (n − k) copies of 0. Assume without loss of generality that

x = (1, 1, ..., 1, 0, 0, ..., 0),

with k ones followed by (n − k) zeros.

(ii) Let Ai be the event that the ith toss yields an H. Then the events {A1, A2, ..., An}
are mutually independent, and

P(Ai) = p

for all 1 ≤ i ≤ n. Hence,

P({x}) = P(A1 ∩ A2 ∩ ... ∩ Ak ∩ Ack+1 ∩ Ack+2 ∩ ... ∩ Acn)
       = P(A1)P(A2) ... P(Ak)P(Ack+1) ... P(Acn)
       = p^k (1 − p)^{n−k}.

(iii) Let Bk be the event that an H occurs precisely k times; then Bk is made up of all
vectors of this type, of which there are C(n, k). Hence,

P(Bk) = C(n, k) p^k (1 − p)^{n−k}.

(iv) The events {B0, B1, ..., Bn} are mutually disjoint and their union is Ω. Therefore,
we get

1 = P(Ω) = Σ_{k=0}^n P(Bk) = Σ_{k=0}^n C(n, k) p^k (1 − p)^{n−k}.
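The identity in (iv) is just the binomial theorem; it can be checked numerically (a
sketch; the values n = 10 and p = 0.3 are arbitrary):

```python
from math import comb

def binomial_pmf(n, p):
    """P(Bk) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

pmf = binomial_pmf(10, 0.3)
print(sum(pmf))  # 1, up to floating-point rounding
```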

Example 5.4. Consider the Gambler’s Fallacy from Remark .0.3: In a game of Roulette,
a wheel is spun in one direction and a ball placed on it is spun in the other direction.
The ball eventually stops and lands in one of 37 slots on the edge of the wheel. The slot
is coloured either red or black (The zero slot is coloured green).

When you reach the table, you are told that the ball has landed in a black slot 26 times
in a row. So is the next one likely to be a red?

• Answer 1: Yes, it is likely to be red. 27 blacks in a row is an extremely unlikely
outcome.

• Answer 2: No, the next one is likely to be black. The game must be rigged to only
give blacks!

• Answer 3: No, the next is equally likely to be black or red. Each spin of the wheel
is a purely random event, similar to a coin flip. The 26 blacks so far are merely a
coincidence.

Solution: Let Ai be the event that the ith ball lands in a black slot. Then the Ai are
mutually independent, and

P(Ai) = 18/37.

Hence,

P(A1 ∩ A2 ∩ ... ∩ A27) = (18/37)^27 ≈ 3.5 × 10^{−9}.

However, this is not the probability we are interested in! We are interested in

P(A27 | ⋂_{i=1}^{26} Ai) = P(A27) = 18/37 ≈ 0.486.

In other words, Answer 3 is correct.
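The two probabilities being compared can be computed directly (a quick sketch):

```python
p_black = 18 / 37

# Probability of 27 blacks in a row, computed before any spin is observed:
print(p_black ** 27)   # ≈ 3.5e-09

# Probability of black on the 27th spin, given the first 26 (independence):
print(p_black)         # ≈ 0.486
```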

II. Combinatorial Analysis
Remark 0.1. Consider the following types of counting problems:

(i) A multiple choice form has 20 questions; each question has 3 choices. In how many
possible ways can the exam be completed?

(ii) Consider a horse race with 10 horses. How many ways are there to gamble on the
placings (1st, 2nd, 3rd)?

(iii) How many different throws are possible with 3 dice? (A ‘throw’ is an unordered
set of the form {1, 4, 5}).

(iv) Veena has a collection of 20 CDs, she wants to take 3 of them to work. How many
possibilities does she have?

Each of these counting problems can be cast as a special case of a single question:
Consider an Urn (a bowl) with n different labelled balls in it. Suppose that k balls are
to be chosen from this urn and the numbers noted. This can be done in a few different
ways:

• Balls can be drawn one-by-one, or all k can be drawn together. In the first case,
the order in which the balls are chosen can be noted. In the second case, this is
not possible.

• Once a ball is drawn, it may be put back into the urn before the next ball is
drawn (or it may not be put back). This is called drawing with or without
replacement.

(End of Day 6)

Figure II.1.: An urn with 6 balls, labelled 1 to 6

In each case, the k chosen balls are called a sample. We will now write
(1, 2, 3)
for an ordered sample, and
{1, 2, 3}
for an unordered sample. There are now four different experiments, which we enumerate.
The examples given above fall into the following categories:
(i) Drawing 20 balls from an urn containing 3 balls, noting the order, with replace-
ment.
(ii) Drawing 3 balls from an urn containing 10 balls, noting the order, without replace-
ment.
(iii) Drawing 3 balls from an urn containing 6 balls, without noting the order, with
replacement.
(iv) Drawing 3 balls from an urn containing 20 balls, without noting the order, without
replacement.

1. Ordered Sample with Replacement

If there are n balls and a sample of size k is chosen, then we have

n^k

possible outcomes. In fact, we may take

Ω := {(x1, x2, ..., xk) : xi ∈ {1, 2, ..., n} for all 1 ≤ i ≤ k}.

Example 1.1. If n = 4 and k = 3, the possible outcomes may be given as


(1, 1, 1)
(1, 1, 2)
(1, 1, 3)
(1, 1, 4)
(1, 2, 1)
..
.
(4, 4, 1)
(4, 4, 2)
(4, 4, 3)
(4, 4, 4)
There are 4 × 4 × 4 = 64 possible outcomes. Here, Ω would consist of all these 64 points.

2. Ordered Sample without Replacement
(Permutations)
Here, each outcome is an arrangement of the numbers in a given order. Such an ar-
rangement is called a permutation.

Example 2.1. For instance, if n = 4 and k = 3, the possible outcomes are:

(1, 2, 3)
(1, 2, 4)
..
.

Note that (1, 1, 1) is no longer a valid option.

The number of permutations of size k from {1, 2, ..., n} is

nPk := n × (n − 1) × ... × (n − k + 1) = n!/(n − k)!.

3. Unordered Sample without Replacement


(Combinations)
Here, each outcome is an unordered subset of {1, 2, ..., n}. Such a set is called a
combination.

Example 3.1. If n = 4 and k = 3, the possible outcomes are:

{1, 2, 3}
{1, 2, 4}
{1, 3, 4}
{2, 3, 4}

The number of combinations of size k from {1, 2, ..., n} is

nCk := nPk/k! = n!/(k!(n − k)!).

4. Unordered Sample with Replacement


Here, each outcome is a multiset; a ‘subset’ of {1, 2, . . . , n} in which repetitions are
allowed.

Definition 4.1. A multiset is a set of the form

M = {(a, m(a)) : a ∈ A},

where A is a set and m : A → N0 is a function recording the multiplicity of each
element. For instance, if A = {a, b, c} with m(a) = 1, m(b) = 3 and m(c) = 6, we
denote the multiset by

M = {1 · a, 3 · b, 6 · c}.

Example 4.2. If n = 4 and k = 3, the possible outcomes are all multisets of the form

{(1, d1 ), (2, d2 ), (3, d3 ), (4, d4 )}

where d1 + d2 + d3 + d4 = 3.
Theorem 4.3. The total number of distinct k-samples from an n-element set, where
repetition is allowed and order does not matter, is

C(n + k − 1, k) = C(n + k − 1, n − 1).

Proof. The number we are after is the same as the number of distinct solutions to the
equation

d1 + d2 + ... + dn = k,

where di ∈ {0, 1, 2, ...}. Given such a tuple (d1, d2, ..., dn), we associate to it a list of
the form

+ ... +  −  + ... +  −  ...  −  + ... +
(d1 plus signs, then d2 plus signs, ..., then dn plus signs)

which has precisely (n − 1) minus signs. We can think of this problem as having
(n + k − 1) positions to fill, of which (n − 1) must be ‘−’ signs and the remaining must
be ‘+’ signs. There are a total of

C(n + k − 1, n − 1)

such lists, and hence that many solutions.
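Theorem 4.3 can be checked against brute-force enumeration for small n and k (a
sketch using `itertools`):

```python
from itertools import combinations_with_replacement
from math import comb

n, k = 4, 3
# All unordered k-samples with repetition from {1, ..., n}.
samples = list(combinations_with_replacement(range(1, n + 1), k))
print(len(samples), comb(n + k - 1, k))  # both equal 20
```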

(End of Day 7)

We summarize this section in a table. For a sample of size k to be chosen from an
n-element set, we have:

Ordered Sample with Replacement          n^k
Ordered Sample without Replacement       nPk
Unordered Sample with Replacement        C(n + k − 1, k)
Unordered Sample without Replacement     nCk
26
In our examples at the beginning of the chapter, we have:

(i) A multiple choice form has 20 questions; each question has 3 choices. In how many
possible ways can the exam be completed?

Solution: 3^20.

(ii) Consider a horse race with 10 horses. How many ways are there to gamble on the
placings (1st, 2nd, 3rd)?

Solution: 10P3 = 10!/7! = 720.

(iii) How many different throws are possible with 3 dice?

Solution: C(6 + 3 − 1, 3) = C(8, 3) = 56.

(iv) Veena has a collection of 20 CDs, she wants to take 3 of them to work. How many
possibilities does she have?

Solution: C(20, 3) = 1140.

5. Examples
Example 5.1 (The Birthday Problem). Assume that people’s birthdays occur with
equal probability on each of the 365 days of the year. Given a group of n people, find
the probability that no two people in the group have the same birthday.

Solution: Here, the kth person has a birthday bk ∈ {1, 2, ..., 365} =: S. Hence, the
ordered list of all n birthdays is the tuple

(b1, b2, ..., bn).

Since birthdays can repeat, the set Ω consists of all such ordered n-samples with
replacement, so

|Ω| = 365^n.

Let A be the event that no two bk are equal; then A consists of all ordered n-samples
of S without replacement. Hence,

|A| = 365Pn = 365 × 364 × ... × (365 − n + 1).

Hence,

P(A) = |A|/|Ω| = [365 × 364 × ... × (365 − n + 1)]/365^n
     = (1 − 1/365)(1 − 2/365) ... (1 − (n − 1)/365).

We know that if n ≥ 366, then two people must share a birthday. However, this
formula tells us that even if n = 56,

P(A) ≈ 0.01,

so it is very likely that two people in the group will share a birthday.
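The product formula is easy to evaluate for any n (a sketch; the function name is our
own):

```python
def p_all_distinct(n):
    """P(no two of n people share a birthday), via the product formula."""
    p = 1.0
    for i in range(n):
        p *= (365 - i) / 365
    return p

print(p_all_distinct(23))  # ≈ 0.49: already below 1/2 with just 23 people
print(p_all_distinct(56))  # ≈ 0.01, as claimed above
```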

Example 5.2. A deck of playing cards consists of 52 cards - 13 cards in four suits (clubs,
spades, hearts, diamonds). The thirteen cards are 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A where
J is Jack, Q is Queen, K is King and A is Ace. Many games are played with these cards;
one example is poker, where each player is dealt five cards at random.
(i) How many poker hands are there?

Solution: A hand is an unordered sample of 5 cards chosen without replacement, so the
number of hands is

$\binom{52}{5}$.

(ii) The face value of a card is its number. So, 2 of Spades and 2 of clubs have the
same face value. A hand of poker is said to be four of a kind if it has four cards
of a single face value (and one card is forced to have another face value). What is
the probability of a four of a kind?

Solution: Here, $\Omega$ consists of the $\binom{52}{5}$ poker hands. If $A$ is the event that a hand is a
four of a kind, we wish to find

$P(A) = \frac{|A|}{|\Omega|}$.

To find $|A|$: there are 13 choices for the face value making up the four, and 48
remaining choices for the fifth card. Therefore,

$P(A) = \frac{13 \times 48}{\binom{52}{5}}$.
(iii) What is the probability that a poker hand has exactly three clubs?

Solution: Again, we wish to find |A| where A is the event that a hand has exactly
three clubs. There are 13 clubs and we wish to choose 3 of them. The remaining
two cards are chosen from the remaining 39 = 52 − 13 cards, so we have

$|A| = \binom{13}{3} \times \binom{39}{2}$

and

$P(A) = \frac{|A|}{\binom{52}{5}}$.
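Both probabilities can be checked with `math.comb`; a quick sketch:

```python
from math import comb

hands = comb(52, 5)  # 2,598,960 poker hands

# Four of a kind: 13 face values for the quadruple, 48 choices of fifth card
p_four = 13 * 48 / hands

# Exactly three clubs: 3 of the 13 clubs and 2 of the other 39 cards
p_clubs = comb(13, 3) * comb(39, 2) / hands

print(hands, p_four, p_clubs)
```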

III. Discrete Random Variables
1. Definitions
Example 1.1. Consider an experiment of tossing a coin 3 times, where p := P ({H})
and (1 − p) = P ({T }). For each toss, if it lands H, you gain ₹1 and if it lands T , you
lose ₹1. We wish to know the possible values for the amount of money you would earn,
which we denote by X:
w X(w) P ({w})
(H,H,H) 3 p3
(H,H,T) 1 p2 (1 − p)
(H,T,H) 1 p2 (1 − p)
(H,T,T) −1 p(1 − p)2
(T,H,H) 1 p2 (1 − p)
(T,H,T) −1 p(1 − p)2
(T,T,H) −1 p(1 − p)2
(T,T,T) −3 (1 − p)3
Here, we may wish to know the probability that we earn exactly ₹1 or at least ₹1. These
events may be described using X as

A := {w ∈ Ω : X(w) = 1}, and


B := {w ∈ Ω : X(w) ≥ 1}.

Note that
$A = \{(H,H,T), (H,T,H), (T,H,H)\} \Rightarrow P(A) = 3p^2(1-p)$
$B = \{(H,H,H), (H,H,T), (H,T,H), (T,H,H)\} \Rightarrow P(B) = p^3 + 3p^2(1-p)$.

We write these events as


{X = 1} and {X ≥ 1}
respectively, and we think of X as a measurement.
Definition 1.2. Let (Ω, A, P ) be a probability space. A discrete random variable is a
function
X:Ω→R
such that
(i) The range of X contains either finite or countably many elements.

(ii) For each x ∈ R, the set {w ∈ Ω : X(w) = x} is an event (it belongs to A).

(End of Day 8)

Remark 1.3.

(i) For a random variable X, we will often write

P (X = x) := P ({w ∈ Ω : X(w) = x}).

(ii) If Ω is a finite set and A = P(Ω), then if X : Ω → R is any function,


(a) The range of X is necessarily finite.
(b) For any x ∈ R,
{w ∈ Ω : X(w) = x} ∈ A

Hence, in this case, any function X : Ω → R is a discrete random variable.

Definition 1.4. The function f : R → R defined by

f (x) := P (X = x)

is called the discrete density function of X (also called probability density function or
probability mass function). For clarity, this is sometimes denoted fX as well.

Example 1.5. Consider the random variable in Example 1.1 with p = 0.4. Then,

f (−3) = P (X = −3) = P ({T T T }) = (1 − p)3 = 0.216


f (−1) = P (X = −1) = P ({T T H, T HT, HT T }) = 3p(1 − p)2 = 0.432
f (1) = P (X = 1) = P ({HHT, HT H, T HH}) = 3p2 (1 − p) = 0.288
f (3) = P (X = 3) = P ({HHH}) = p3 = 0.064
$f(x) = 0$ if $x \notin \{3, 1, -1, -3\}$.

Example 1.6 (Bernoulli Density). Suppose an experiment is performed whose outcome


is classified either as a success or as a failure. We may take

Ω = {0, 1}

where 0 denotes failure and 1 denotes success. We may take A = P(Ω). Fix p ∈ [0, 1]
and define P : A → R by

P ({1}) = p and P ({0}) = (1 − p).

Then (Ω, A, P ) is a probability space.

Let X : Ω → R be the function X(0) = 0 and X(1) = 1. Then X is a random variable
(by part (ii) of Remark 1.3) and

P (X = 1) = p
P (X = 0) = (1 − p)
$P(X = x) = 0$ if $x \notin \{0, 1\}$.
Thus, the density function is given by

$f(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \\ 0 & \text{otherwise.} \end{cases}$

This is called the Bernoulli density function or Bernoulli distribution (we will discuss
distribution functions later) with parameter p. We will write

X ∼ Bern(p)

if X is a random variable with probability density function as above.


Example 1.7 (Binomial Density). Consider a success/failure experiment as above with
probability of success being p (p ∈ [0, 1] is fixed). Fix n ∈ N and suppose the experiment
is repeated n times (as in Example 1.1). Here, we may take

Ω = {(x1 , x2 , . . . , xn ) : xi ∈ {0, 1}}

Moreover, we may take A = P(Ω) and P : A → [0, ∞) may be defined on singleton sets
as
P ({(x1 , x2 , . . . , xn )}) = pk (1 − p)n−k
where k denotes the number of times 1 occurs in the tuple. Then, (Ω, A, P ) is a proba-
bility space.

Let X : Ω → R denote the random variable

X(x1 , x2 , . . . , xn ) = k

where k is as above. Then, X is a random variable (by part (ii) of Remark 1.3) and
takes values in {0, 1, . . . , n}. Now, for each 0 ≤ k ≤ n,
 
$|\{(x_1, x_2, \ldots, x_n) : 1 \text{ occurs exactly } k \text{ times}\}| = \binom{n}{k}$

by section 3. Hence,

$P(X = k) = P(\{(x_1, x_2, \ldots, x_n) : 1 \text{ occurs exactly } k \text{ times}\}) = \binom{n}{k} p^k (1-p)^{n-k}$.

Hence, the corresponding density function is

$f(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & \text{if } x \in \{0, 1, \ldots, n\} \\ 0 & \text{otherwise.} \end{cases}$
This is called the Binomial density with parameters n and p. Again, we write
X ∼ B(n, p)
when X is a random variable with this density function. Note that
Bern(p) = B(1, p).
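The density is straightforward to transcribe; a sketch (the helper name is my own) that also confirms it sums to 1 and that B(1, p) reduces to the Bernoulli density:

```python
from math import comb

def binom_pmf(n: int, p: float, x: int) -> float:
    """Density of B(n, p) at x."""
    if 0 <= x <= n:
        return comb(n, x) * p**x * (1 - p)**(n - x)
    return 0.0

n, p = 3, 0.4
total = sum(binom_pmf(n, p, x) for x in range(n + 1))
bern = [binom_pmf(1, p, x) for x in (0, 1)]
print(total)  # 1 (up to rounding)
print(bern)   # [0.6, 0.4]: the Bernoulli density, since Bern(p) = B(1, p)
```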
Example 1.8 (Simplest form of Mendelian inheritance). Suppose that eye colour (brown
or blue) is determined by a pair of genes. Suppose that the allele for brown eyes (B) is
dominant, while the allele for blue eyes (b) is recessive. So an individual can have one
of three possible pairs: purely dominant (BB), hybrid (Bb), or purely recessive (bb). In
the first two cases, they will have brown eyes, and if they have bb then they will have
blue eyes.

Suppose that two hybrid parents have four children. What is the probability that exactly
3 out of 4 children will have brown eyes?

Solution:
(i) For a single child, there are three possible pairs from the set S = {BB, Bb, bb}.
Since the parents are hybrid, the probabilities are
$P(\{BB\}) = \frac{1}{4}, \quad P(\{Bb\}) = \frac{1}{2} \quad \text{and} \quad P(\{bb\}) = \frac{1}{4}$.

Now,

$A := \{\text{brown eyes}\} = \{BB, Bb\} \Rightarrow P(A) = \frac{3}{4}, \qquad B := \{\text{blue eyes}\} = \{bb\} \Rightarrow P(B) = \frac{1}{4}$.
Since we only care about the colour, we have an experiment whose ‘success’ prob-
ability is 3/4 and ‘failure’ probability is 1/4.

(ii) In our problem, there are four children, so

X ∼ B(4, 3/4).

Hence,

$P(X = 3) = \binom{4}{3}\left(\frac{3}{4}\right)^3\left(\frac{1}{4}\right) = \frac{27}{64}$.

(End of Day 9)
Example 1.9. Consider a population of N balls, of which r are red and b = (N − r)
are black. Suppose a random sample of size n ≤ N is chosen. Thus,

$|\Omega| = \binom{N}{n}$.

Let X denote the number of red balls drawn. Then, X takes values 0, 1, . . . , n, and

$P(X = k) = \frac{\binom{r}{k}\binom{N-r}{n-k}}{\binom{N}{n}}$

whenever 0 ≤ k ≤ r and 0 ≤ n − k ≤ N − r, and 0 otherwise. Therefore,

$f(x) = \begin{cases} \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} & \text{if } 0 \le x \le r, \ 0 \le n-x \le N-r \\ 0 & \text{otherwise} \end{cases}$
This is the Hypergeometric density and is denoted
Hyp(N, r, n)
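A direct transcription of this density (parameter names follow the text; the specific numbers below are an arbitrary illustration):

```python
from math import comb

def hyp_pmf(N: int, r: int, n: int, k: int) -> float:
    """Density of Hyp(N, r, n) at k: k red balls in a sample of n."""
    if 0 <= k <= r and 0 <= n - k <= N - r:
        return comb(r, k) * comb(N - r, n - k) / comb(N, n)
    return 0.0

# Illustration: 10 balls, 4 red, sample of 3
probs = [hyp_pmf(10, 4, 3, k) for k in range(0, 4)]
print(probs, sum(probs))
```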
Example 1.10 (Constant Random Variable). Given any probability space (Ω, A, P )
and any constant c ∈ R, we may define the constant function
X : Ω → R given by X(w) = c for all w ∈ Ω.
Then, the corresponding density function is
$f(x) = \begin{cases} 1 & \text{if } x = c \\ 0 & \text{if } x \ne c \end{cases}$

Example 1.11 (Indicator Random Variable - Bernoulli revisited). Let (Ω, A, P ) be a


probability space and A ∈ A be a fixed event. Define X : Ω → R by
$X(w) = \begin{cases} 1 & \text{if } w \in A \\ 0 & \text{if } w \notin A. \end{cases}$

Then, X is a discrete random variable with density function given by


f (1) = P (X = 1) = P (A), and f (0) = P (X = 0) = 1 − P (A).
Therefore, if p := P (A), then

$f(x) = \begin{cases} p & \text{if } x = 1 \\ 1-p & \text{if } x = 0 \\ 0 & \text{otherwise.} \end{cases}$

Hence, X ∼ Bern(p) from Example 1.6.

Lemma 1.12. Let X be a discrete random variable and f : R → R denote its density
function. We denote the range of X by by S := {x1 , x2 , . . .} as it is either finite or
countably infinite. Then,

(i) f (x) ≥ 0 for all x ∈ R.

(ii) The set {x ∈ R : f (x) ̸= 0} is either finite or countably infinite.


(iii) $\sum_{i=1}^{\infty} f(x_i) = 1$.

Proof.

(i) If x ∈ R, then f (x) = P (X = x) ≥ 0.

(ii) If f (x) ̸= 0, then P (X = x) ̸= 0, so x must be in the range of X. Hence,

{x ∈ R : f (x) ̸= 0} ⊂ S.

(iii) Consider the set {x1 , x2 , . . .}, then consider the events

Ai = {w ∈ Ω : X(w) = xi }

If $i \ne j$, then $A_i \cap A_j = \emptyset$. Moreover, if $w \in \Omega$, then $X(w) = x_j$ for some $j \in \mathbb{N}$.
Hence, $w \in A_j$. Thus,

$\Omega = \bigsqcup_{i=1}^{\infty} A_i$

Hence,

$1 = P(\Omega) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{\infty} f(x_i)$.

Remark 1.13 (Density function determines the Random Variable). Suppose f : R → R


is a function that satisfies the conditions of Lemma 1.12. Then, define

Ω = {x ∈ R : f (x) ̸= 0} = {x1 , x2 , . . .}

Let A = P(Ω) and define P : A → [0, ∞) by

P ({xi }) = f (xi ).

Then, (Ω, A, P ) is a probability space. Define X : Ω → R by

X(xi ) = xi .

Then, X is a random variable and f is its density function.

Definition 1.14. A function f : R → R is called a discrete density function if it satisfies
the three conditions of Lemma 1.12.
Remark 1.15.
(i) The conclusion of Lemma 1.12 and Remark 1.13 is that a discrete random variable
X may be used to construct a discrete density function f , and conversely, every
discrete density function f may be used to construct a discrete random variable
X in such a way that fX = f .
(ii) Also, if we want to describe an experiment with finite (or countably infinite) out-
comes {x1 , x2 , . . .}, we may simply be given a discrete density function f : R → R
whose possible values are {x1 , x2 , . . .}. Then, we may take
Ω := {x1 , x2 , . . .}
A := P(Ω)
P ({xi }) := f (xi ).
This defines a probability space (Ω, A, P ). Moreover, the random variable X :
Ω → R given by inclusion
X(xi ) = xi
is naturally associated to f . Thus, the probability space is somewhat unnecessary
if you already have a density function.
(End of Day 10)
Example 1.16 (Uniform Density). Consider the experiment of picking a point at ran-
dom from a finite set S = {a1 , a2 , . . . , an } ⊂ R in a way that each point has equal
probability. Here, there is a natural density function
$f(x) = \begin{cases} \frac{1}{n} & \text{if } x \in S \\ 0 & \text{otherwise.} \end{cases}$
This density function describes an experiment and a random variable, called the Uniform density
which we denote by U (n).
Example 1.17 (Geometric Density). Consider an experiment with two outcomes as in
Example 1.6. We repeat this experiment independently, and let X be the random
variable that counts the number of trials needed to obtain the first success. Then
P (X = 1) = p
P (X = 2) = (1 − p)p
and so on. Therefore, if f : R → R denotes the corresponding density function, then
$f(x) = \begin{cases} p(1-p)^{x-1} & \text{if } x \in \mathbb{N} \\ 0 & \text{otherwise.} \end{cases}$
This is called the Geometric Density with parameter p. We write
X ∼ Geom(p).

Example 1.18 (Poisson Density). Let λ > 0 be fixed.
$f(x) = \begin{cases} e^{-\lambda}\frac{\lambda^x}{x!} & \text{if } x \in \{0, 1, 2, \ldots\} \\ 0 & \text{otherwise.} \end{cases}$

Note that f satisfies the following conditions:

(i) f (x) ≥ 0 for all x ∈ R.

(ii) The set {x ∈ R : f (x) ̸= 0} is countable.

(iii) Finally,

$\sum_{x=0}^{\infty} f(x) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1$.

By Lemma 1.12, f is a density function, called the Poisson density with parameter λ.
We write
X ∼ Po(λ)
Poisson random variables are important in applications (such as understanding any
system with a ‘queue’). We will discuss these later.
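The computation in (iii) can be checked numerically by summing the density (a sketch; the helper name and the value of λ are my own):

```python
from math import exp, factorial

def poisson_pmf(lam: float, x: int) -> float:
    """Density of Po(lam) at a non-negative integer x."""
    return exp(-lam) * lam**x / factorial(x)

lam = 2.5
partial = sum(poisson_pmf(lam, x) for x in range(50))
print(partial)  # ≈ 1, confirming that the density sums to 1
```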

2. Computations with Densities


Remark 2.1. Let X be a discrete random variable on a probability space (Ω, A, P ).
The density function of X is given by the formula

f (x) := P (X = x)

In other words, it calculates the probability of the event B = {w ∈ Ω : X(w) = x}.


However, we may be interested in other similar events. For instance, suppose X takes
values in {0, 1, 2, . . .}, then we may be interested in
$P(\{w \in \Omega : X(w) \le 5\}) = \sum_{i=0}^{5} P(X = i) = \sum_{i=0}^{5} f(i)$.

In general, it is such events that we are often interested in.

Definition 2.2. The distribution function of a discrete random variable X is given by


$F(t) = P(X \le t) = \sum_{x \le t} f(x)$

Remark 2.3. Some immediate observations:

(i) If s ≤ t, then {w ∈ Ω : X(w) ≤ s} ⊂ {w ∈ Ω : X(w) ≤ t}. Hence,
F (s) ≤ F (t)
so F is non-decreasing.
(ii) Suppose the density function f is nonzero at finitely many points {x₁, x₂, . . . , x_k},
listed in increasing order x₁ < x₂ < . . . < x_k. Then,

$F(t) = \begin{cases} 0 & \text{if } t < x_1 \\ f(x_1) & \text{if } x_1 \le t < x_2 \\ f(x_1) + f(x_2) & \text{if } x_2 \le t < x_3 \\ \vdots & \vdots \\ f(x_1) + f(x_2) + \ldots + f(x_{k-1}) & \text{if } x_{k-1} \le t < x_k \\ 1 & \text{if } x_k \le t. \end{cases}$

Hence, F is constant on each interval $[x_i, x_{i+1})$, and there is a jump of $f(x_j)$ at
the point $x_j$.
(iii) If (a, b] is an interval in R, then
{w ∈ Ω : X(w) ≤ b} = {w ∈ Ω : X(w) ≤ a} ⊔ {w ∈ Ω : a < X(w) ≤ b}
Hence, by part (v) of Lemma I.3.1,
P ({w ∈ Ω : a < X(w) ≤ b}) = P (a < X ≤ b) = F (b) − F (a).

Example 2.4. Let X be the number rolled on a 20-sided dice. What is the probability
that the number rolled is at least 10?

Solution: Let Ω = {1, 2, . . . , 20} and X : Ω → R be uniformly distributed on Ω. i.e.,


the density function is given by
$f(x) = \begin{cases} \frac{1}{20} & \text{if } x \in \{1, 2, \ldots, 20\} \\ 0 & \text{otherwise.} \end{cases}$

The distribution function F is given by


$F(t) = \sum_{x=1}^{[t]} \frac{1}{20} = \frac{[t]}{20}$ for $0 \le t \le 20$.

Hence,

$P(X \le 9) = F(9) = \frac{9}{20} \Rightarrow P(X \ge 10) = \frac{11}{20}$.

(End of Day 11)

Example 2.5. Suppose X ∼ Geom(p) (see Example 1.17), so its density function is
given by

$f(x) = \begin{cases} p(1-p)^{x-1} & \text{if } x \in \{1, 2, \ldots\} \\ 0 & \text{otherwise} \end{cases}$

(i) We compute the distribution function for X: If t < 1, then F(t) = 0, and if t ≥ 1,
then

$F(t) = \sum_{x=1}^{[t]} p(1-p)^{x-1} = p \cdot \frac{1 - (1-p)^{[t]}}{1 - (1-p)} = 1 - (1-p)^{[t]}$.

(ii) We compute P (X > n) when n ∈ N: Observe that

$P(X > n) = 1 - P(X \le n) = 1 - F(n) = (1-p)^n$

(iii) For each n ∈ N, consider

An = {w ∈ Ω : X(w) > n}.

Then, P (An ) = P (X > n) is the probability that one fails at least n times
before the first success. Note that An ⊂ An−1 in general. Now fix n, m ∈ N and
consider the conditional probability

$P(X > n+m \mid X > n) = P(A_{n+m} \mid A_n) = \frac{P(A_{n+m} \cap A_n)}{P(A_n)} = \frac{P(A_{n+m})}{P(A_n)} = \frac{P(X > n+m)}{P(X > n)} = \frac{(1-p)^{n+m}}{(1-p)^n} = (1-p)^m = P(X > m)$

This has a practical implication: If you know that the machine has failed n times,
then the probability that it fails another m times, is the same as the unconditional
probability that it fails m times. In other words, the machine has no memory.
The property
P (X > n + m|X > n) = P (X > m)
is called the memoryless property of the Geometric density.
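The memoryless identity can be checked directly from the tail formula $P(X > n) = (1-p)^n$ (a sketch; the parameter values are arbitrary):

```python
p = 0.3  # success probability, chosen arbitrarily for illustration

def tail(n: int) -> float:
    """P(X > n) for X ~ Geom(p): the first n trials all fail."""
    return (1 - p) ** n

n, m = 4, 6
cond = tail(n + m) / tail(n)  # P(X > n + m | X > n)
print(cond, tail(m))          # the two agree: memorylessness
```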

3. Discrete Random Vectors
Example 3.1. For a single experiment, one may be interested in many different random
variables at the same time. For instance, suppose an urn has n labelled balls, some black
and some red, and you select k of them. We may be interested in

X := {the number of red balls chosen}


Y := {the minimum number (label) on the balls chosen}
Z := {the maximum number (label) on the balls chosen}.

In some cases, we may wish to compare random variables, which leads to the next
definition.
Definition 3.2. Let (Ω, A, P ) be a probability space, and let X1 , X2 , . . . , Xr be r dis-
crete random variables on it. Define Y : Ω → Rr by

Y (w) := (X1 (w), X2 (w), . . . , Xr (w))

Such a function is called an r-dimensional random vector.


Definition 3.3. Given such a random vector, the associated density function f : Rʳ → R
is given by

$f(x) = P(\{w \in \Omega : X_1(w) = x_1, X_2(w) = x_2, \ldots, X_r(w) = x_r\}) = P(X_1 = x_1, X_2 = x_2, \ldots, X_r = x_r) = P(Y = x)$

Example 3.4. Suppose two six-sided dice are thrown simultaneously.


(i) Let X denote the number on the first die, and Y the number on the second die.
Then,
Ω = {(i, j) : 1 ≤ i, j ≤ 6}
Now,
$P(X = 1, Y = 1) = \frac{|\{(1,1)\}|}{|\Omega|} = \frac{1}{36}$.

Thus, if f : R² → R denotes the joint density function, then

$f(x, y) = \begin{cases} \frac{1}{36} & \text{if } 1 \le x \le 6 \text{ and } 1 \le y \le 6 \\ 0 & \text{otherwise.} \end{cases}$

(ii) Let X be the number on the first die, and Z be the larger of the two numbers.
Then,

$P(X = 1, Z = 1) = \frac{1}{36} \quad \text{but} \quad P(X = 2, Z = 1) = 0$.
Hence, the joint density function f : R2 → R is given the following table (values
are multiplied by 36):

z,x 1 2 3 4 5 6
1 1 0 0 0 0 0
2 1 2 0 0 0 0
3 1 1 3 0 0 0
4 1 1 1 4 0 0
5 1 1 1 1 5 0
6 1 1 1 1 1 6

In other words,

$f(x, z) = \begin{cases} \frac{1}{36} & \text{if } 1 \le x < z \le 6 \\ \frac{x}{36} & \text{if } 1 \le x \le 6, \ z = x \\ 0 & \text{otherwise} \end{cases}$

(End of Day 12)

Remark 3.5.
(i) The density function defined as above has the following properties:
(a) f (x) ≥ 0 for all x ∈ Rr .
(b) S := {x ∈ Rʳ : f(x) ≠ 0} is finite or countably infinite.
(c) $\sum_{x \in S} f(x) = 1$.

(ii) Any function f : Rʳ → R satisfying these three conditions is called a discrete
r-dimensional density function. It is also called the joint density function of the
variables (X₁, X₂, . . . , X_r).

(iii) As in Remark 1.13, any such function is the density function of some r-dimensional
random vector.

(iv) The 1-dimensional density function associated to the random variable Xᵢ is then
called the marginal density function and is denoted by $f_{X_i}$.
(v) Given two random variables X, Y , we write

{X = x, Y = y} := {w ∈ Ω : X(w) = x and Y (w) = y}

Lemma 3.6. Let X and Y be two random variables with joint density function f : R2 →
R. If fX and fY denote the marginal density functions of X and Y respectively, then
$f_X(x) = \sum_{y} f(x, y), \qquad f_Y(y) = \sum_{x} f(x, y)$.

In both cases, the sum is over a finite or countable set.

Proof. Suppose X takes values {x₁, x₂, . . .}; then for any y ∈ R,

$P(Y = y) = P(\{w \in \Omega : Y(w) = y\}) = P\left(\bigsqcup_{i=1}^{\infty} \{w \in \Omega : X(w) = x_i \text{ and } Y(w) = y\}\right) = \sum_{i=1}^{\infty} P(X = x_i, Y = y)$.

Hence, if f is the joint density function of X and Y, then the marginal density of Y is
given by

$f_Y(y) = \sum_{i=1}^{\infty} f(x_i, y)$

Similarly, if Y takes values {y₁, y₂, . . .}, then the marginal density of X is given by

$f_X(x) = \sum_{j=1}^{\infty} f(x, y_j)$

Example 3.7. Consider part (ii) Example 3.4 where X is the number on the first die
while Z is the maximum of the two numbers. Then, we can represent the probabilities
(multiplied by 36) as a table:

z,x 1 2 3 4 5 6
1 1 0 0 0 0 0
2 1 2 0 0 0 0
3 1 1 3 0 0 0
4 1 1 1 4 0 0
5 1 1 1 1 5 0
6 1 1 1 1 1 6

The marginal density of X is given by

$f_X(1) = \sum_{z=1}^{6} f(1, z) = \frac{6}{36} = \frac{1}{6}, \qquad f_X(2) = \sum_{z=1}^{6} f(2, z) = \frac{6}{36} = \frac{1}{6}$

and so on. Therefore,

$f_X(x) = \begin{cases} \frac{1}{6} & \text{if } 1 \le x \le 6 \\ 0 & \text{otherwise.} \end{cases}$

For Z, the marginal density is given by

$f_Z(z) = \begin{cases} \frac{1}{36} & \text{if } z = 1 \\ \frac{3}{36} & \text{if } z = 2 \\ \frac{5}{36} & \text{if } z = 3 \\ \frac{7}{36} & \text{if } z = 4 \\ \frac{9}{36} & \text{if } z = 5 \\ \frac{11}{36} & \text{if } z = 6 \\ 0 & \text{otherwise} \end{cases}$
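Summing the rows and columns of the table above can be automated; a sketch reproducing both marginals (variable names are my own):

```python
# Joint density of (X, Z) from the table above, scaled by 36:
# f(x, z) = 1 if x < z, x if z == x, and 0 if z < x, for 1 <= x, z <= 6
f36 = {(x, z): (1 if x < z else x if z == x else 0)
       for x in range(1, 7) for z in range(1, 7)}

fX = {x: sum(f36[(x, z)] for z in range(1, 7)) / 36 for x in range(1, 7)}
fZ = {z: sum(f36[(x, z)] for x in range(1, 7)) / 36 for z in range(1, 7)}
print(fX)  # 1/6 for every x
print(fZ)  # (2z - 1)/36 for each z
```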

4. Independent Random Variables


Definition 4.1.

(i) Two random variables X and Y are said to be independent if for any x, y ∈ R

P (X = x, Y = y) = P (X = x)P (Y = y).

In other words, the joint density function is given by

f (x, y) = fX (x)fY (y).

(ii) A collection X1 , X2 , . . . , Xr of random variables are said to be mutually independent


if the joint density function is given by

f (x1 , x2 , . . . , xr ) = fX1 (x1 )fX2 (x2 ) . . . fXr (xr ).

Example 4.2. Suppose two six-sided dice are rolled. Let X denote the number on the
first die, Y the number on the second die, and Z the maximum of the two numbers.

(i) For any 1 ≤ x, y ≤ 6,

$P(X = x, Y = y) = \frac{1}{36} = P(X = x)P(Y = y)$.

This is also true if x ∉ {1, 2, . . . , 6} or y ∉ {1, 2, . . . , 6} (both sides are then 0), so
X and Y are independent.

(ii) However,

$P(X = 1, Z = 2) = \frac{1}{36} \quad \text{while} \quad P(X = 1) = \frac{1}{6} \text{ and } P(Z = 2) = \frac{3}{36}$,

and $\frac{1}{6} \cdot \frac{3}{36} = \frac{1}{72} \ne \frac{1}{36}$. Hence, X and Z are not independent.

Remark 4.3.

(i) If X is a random variable with density function f, and if A ⊂ R, then

$P(X \in A) = P(\{w \in \Omega : X(w) \in A\}) = \sum_{x \in A} P(X = x) = \sum_{x \in A} f(x)$

(ii) If X and Y are two random variables with joint density function $f_{X,Y}$ and if
A, B ⊂ R are two sets, then

$P(X \in A, Y \in B) = P(\{w \in \Omega : X(w) \in A \text{ and } Y(w) \in B\}) = \sum_{x \in A} \sum_{y \in B} f_{X,Y}(x, y)$

Lemma 4.4. If X, Y are independent random variables and A and B are two subsets
of R, then
P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B).

Proof. Suppose $f_{X,Y}$ denotes the joint density and $f_X$ and $f_Y$ denote the marginal
densities. Note that

$P(X \in A, Y \in B) = \sum_{x \in A} \sum_{y \in B} f_{X,Y}(x, y) = \sum_{x \in A} \sum_{y \in B} f_X(x) f_Y(y) = \left(\sum_{x \in A} f_X(x)\right)\left(\sum_{y \in B} f_Y(y)\right) = P(X \in A) P(Y \in B)$.

Remark 4.5.

(i) Let X, Y be two random variables on a probability space (Ω, A, P ). Then, we may
define a number of new random variables:
(a) Define Z : Ω → R by
Z(w) := X(w) + Y (w).
Predictably, we write (X + Y ) for this random variable.
(b) Define Z : Ω → R by

Z(w) := min{X(w), Y (w)}.

Again, we write min{X, Y } for this random variable.


(c) Similarly, we define max{X, Y } as well.

In general, given a ‘nice’ function g : R2 → R, we may define Z : Ω → R by

Z(w) := g(X(w), Y (w)).

This is a discrete random variable, and is denoted g(X, Y ).

(ii) More generally, given r random variables X1 , X2 , . . . , Xr and a ‘nice’ function


g : Rr → R, we may define Z : Ω → R by

Z(w) := g(X1 (w), X2 (w), . . . , Xr (w)).

This is a discrete random variable and is written as g(X1 , X2 , . . . , Xr ).

(End of Day 13)

5. Sums of Independent Random Variables


Remark 5.1. Suppose X and Y are independent random variables and Z = X + Y.
Let {x₁, x₂, . . .} be the range of X; then for any z ∈ R,

$\{w \in \Omega : Z(w) = z\} = \bigsqcup_{i=1}^{\infty} \{w \in \Omega : X(w) = x_i \text{ and } Y(w) = z - x_i\}$

Since X and Y are independent,

$P(Z = z) = \sum_{i=1}^{\infty} P(X = x_i) P(Y = z - x_i) = \sum_{i=1}^{\infty} f_X(x_i) f_Y(z - x_i)$.

Hence,

$f_{X+Y}(z) = \sum_{x} f_X(x) f_Y(z - x)$.

This is called the convolution of the two functions $f_X$ and $f_Y$. In particular, if X and Y
both take only non-negative integer values, then X + Y only takes non-negative integer
values and for z ≥ 0,

$f_{X+Y}(z) = \sum_{x=0}^{z} f_X(x) f_Y(z - x)$.

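The convolution formula is easy to implement; a small sketch (the function name is my own) applied to two independent fair dice:

```python
def convolve(fX: dict, fY: dict) -> dict:
    """Density of X + Y for independent integer-valued X and Y."""
    fZ = {}
    for x, px in fX.items():
        for y, py in fY.items():
            fZ[x + y] = fZ.get(x + y, 0.0) + px * py
    return fZ

die = {k: 1 / 6 for k in range(1, 7)}
two_dice = convolve(die, die)
print(two_dice[7])  # ≈ 6/36, the most likely total
print(two_dice[2])  # ≈ 1/36
```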
The next definition only makes sense for integer-valued random variables.

Definition 5.2. Let X be an integer valued discrete random variable. The probability
generating function of X is Φ_X : [−1, 1] → R given by

$\Phi_X(t) := \sum_{x=0}^{\infty} f_X(x) t^x$

Remark 5.3.

(i) Note that the series converges absolutely because $\sum_{x=0}^{\infty} f_X(x) = 1$. Therefore, Φ_X
is differentiable and

$\Phi_X'(t) = \sum_{x=1}^{\infty} x f_X(x) t^{x-1}$

This can be done repeatedly.

(ii) Observe that

$\Phi_X(0) = P(X = 0), \quad \Phi_X'(0) = f_X(1) = P(X = 1), \quad \Phi_X''(0) = 2f_X(2) = 2P(X = 2)$

and so on. In general,

$P(X = k) = \frac{\Phi_X^{(k)}(0)}{k!}$.
(iii) If X and Y are two random variables as above such that ΦX = ΦY as functions,
then
P (X = 0) = ΦX (0) = ΦY (0) = P (Y = 0).
Differentiating once, we see that

P (X = 1) = P (Y = 1).

And more generally, P (X = k) = P (Y = k) for all k ∈ {0, 1, 2, . . .}. Hence, X


and Y are equivalent random variables. In particular, X and Y have the same
distribution function.

Example 5.4. If X ∼ B(n, p), then

$f_X(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & \text{if } x \in \{0, 1, \ldots, n\} \\ 0 & \text{otherwise.} \end{cases}$

Hence,

$\Phi_X(t) = \sum_{x=0}^{n} \binom{n}{x} (pt)^x (1-p)^{n-x} = (pt + 1 - p)^n$

Example 5.5. If X ∼ Po(λ) for some λ > 0, then

$f_X(x) = \begin{cases} e^{-\lambda}\frac{\lambda^x}{x!} & \text{if } x \in \{0, 1, 2, \ldots\} \\ 0 & \text{otherwise} \end{cases}$

Hence,

$\Phi_X(t) = \sum_{x=0}^{\infty} e^{-\lambda} \frac{\lambda^x}{x!} t^x = e^{-\lambda} e^{\lambda t} = e^{\lambda(t-1)}$
Lemma 5.6. If X1 , X2 , . . . , Xr are mutually independent, integer-valued random vari-
ables, then
ΦX1 +X2 +...+Xr (t) = ΦX1 (t)ΦX2 (t) . . . ΦXr (t)
for all t ∈ [−1, 1].
Proof. Assume r = 2 with X = X₁ and Y = X₂, and fix t ∈ [−1, 1]. Then

$\Phi_{X+Y}(t) = \sum_{z=0}^{\infty} f_{X+Y}(z) t^z = \sum_{z=0}^{\infty} \sum_{x=0}^{z} f_X(x) f_Y(z-x) t^z = \sum_{x=0}^{\infty} f_X(x) t^x \sum_{z=x}^{\infty} f_Y(z-x) t^{z-x} = \left(\sum_{x=0}^{\infty} f_X(x) t^x\right)\left(\sum_{y=0}^{\infty} f_Y(y) t^y\right) = \Phi_X(t) \Phi_Y(t)$.

The general case follows by induction on r.

Theorem 5.7. Let X1 , X2 , . . . , Xr be independent random variables.


(i) If Xi ∼ B(ni , p) for some n1 , n2 , . . . , nr ∈ N and 0 ≤ p ≤ 1, then
X1 + X2 + . . . + Xr ∼ B(n1 + n2 + . . . + nr , p).

(ii) If Xi ∼ Po(λi ) for some λ1 , λ2 , . . . , λr > 0, then


X1 + X2 + . . . + Xr ∼ Po(λ1 + λ2 + . . . + λr ).

Proof.
(i) If $X_i \sim B(n_i, p)$, then

$\Phi_{X_i}(t) = (pt + 1 - p)^{n_i}$.

By Lemma 5.6,

$\Phi_{X_1 + X_2 + \ldots + X_r}(t) = (pt + 1 - p)^{n_1 + n_2 + \ldots + n_r}$.

By part (iii) of Remark 5.3,

$X_1 + X_2 + \ldots + X_r \sim B(n_1 + n_2 + \ldots + n_r, p)$.

(ii) Similar (check!)

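Part (i) can also be checked numerically, by convolving two binomial densities and comparing term by term (a sketch; the helper names are my own):

```python
from math import comb

def binom_pmf(n: int, p: float, x: int) -> float:
    return comb(n, x) * p**x * (1 - p)**(n - x) if 0 <= x <= n else 0.0

def convolve(f: dict, g: dict) -> dict:
    """Density of the sum of two independent integer random variables."""
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] = h.get(x + y, 0.0) + px * py
    return h

p = 0.3
f2 = {x: binom_pmf(2, p, x) for x in range(3)}
f3 = {x: binom_pmf(3, p, x) for x in range(4)}
f5 = convolve(f2, f3)

# Term-by-term agreement with B(5, p), as the theorem asserts
ok = all(abs(f5[z] - binom_pmf(5, p, z)) < 1e-12 for z in range(6))
print(ok)  # True
```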
(End of Day 14)

IV. Expectation of Discrete Random Variables
1. Definition of Expectation
Definition 1.1. Let X be a discrete random variable on a probability space (Ω, A, P )
with range {x1 , x2 , . . .} and density function f . Suppose that

$\sum_{j=1}^{\infty} |x_j| f(x_j) < \infty$

Then, the expectation or mean of X is defined as

$EX := \sum_{j=1}^{\infty} x_j f(x_j)$.

Note that if $\sum_{j=1}^{\infty} |x_j| f(x_j) = +\infty$, then the expectation is not defined.
We think of this as a weighted average of all the values X takes.
Example 1.2. Suppose Ω = {0, 1} and X ∼ Bern(p) for some 0 ≤ p ≤ 1. Then, X
takes two values {0, 1} and

EX = 0P (X = 0) + 1P (X = 1) = P (X = 1) = p.

Example 1.3. Suppose Ω = {(x₁, x₂, . . . , x_n) : xᵢ ∈ {0, 1}} and X : Ω → R satisfies
X ∼ B(n, p) for some 0 ≤ p ≤ 1. Then the density function is given by
B(n, p) for some 0 ≤ p ≤ 1. Then the density function is given by
$f(x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x} & \text{if } x \in \{0, 1, \ldots, n\} \\ 0 & \text{otherwise.} \end{cases}$

Therefore,

$EX = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x}$

We use the fact that for all j ≥ 1,

$j \binom{n}{j} = n \binom{n-1}{j-1}$.

Therefore,

$EX = n \sum_{x=1}^{n} \binom{n-1}{x-1} p^x (1-p)^{n-x} = n \sum_{i=0}^{n-1} \binom{n-1}{i} p^{i+1} (1-p)^{n-1-i} = np(p + 1 - p)^{n-1} = np$.
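The identity EX = np is easy to confirm directly from the density (a quick sketch; the parameter values are arbitrary):

```python
from math import comb

n, p = 8, 0.35
# Weighted average of x over the B(n, p) density
mean = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))
print(mean)  # ≈ n*p = 2.8
```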
Example 1.4. Suppose X ∼ U(n). In other words, the density function of X is

$f(x) = \begin{cases} \frac{1}{n} & \text{if } x \in \{1, 2, \ldots, n\} \\ 0 & \text{otherwise} \end{cases}$

Then,

$EX = \sum_{i=1}^{n} \frac{i}{n} = \frac{n+1}{2}$.
Example 1.5. Suppose X ∼ Geom(p) for some 0 < p < 1, then

$f(x) = \begin{cases} p(1-p)^{x-1} & \text{if } x \in \{1, 2, \ldots\} \\ 0 & \text{otherwise.} \end{cases}$

We verify that

$\sum_{i=1}^{\infty} i f(i) < \infty$

To see this, observe that

$\sum_{i=1}^{\infty} i f(i) = \sum_{i=1}^{\infty} i p (1-p)^{i-1} = p \sum_{i=1}^{\infty} i (1-p)^{i-1} = -p \frac{d}{dx}\left(\sum_{i=1}^{\infty} (1-x)^i\right)\Bigg|_{x=p} = -p \frac{d}{dx}\left(\frac{1}{x} - 1\right)\Bigg|_{x=p} = -p\left(\frac{-1}{p^2}\right) = \frac{1}{p}$.

Hence the series converges absolutely and $EX = \frac{1}{p}$.
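A numerical sanity check of this expectation by truncating the series (a sketch; the cutoff 10,000 is arbitrary but more than enough for p = 0.25):

```python
p = 0.25
# Partial sum of sum_{i>=1} i * p * (1-p)^(i-1); the tail is negligible here
mean = sum(i * p * (1 - p) ** (i - 1) for i in range(1, 10_000))
print(mean)  # ≈ 1/p = 4
```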

Example 1.6. There is an example of a random variable (density function) whose
expectation does not exist because the series not absolutely convergent. See [HPS,
Example 4, Section 4.1].

(End of Day 15)

2. Properties of Expectation
Lemma 2.1. Let X = (X1 , X2 , . . . , Xr ) be an r-dimensional random vector with joint
density function fX . Let φ : Rr → R be a function and define

Z := φ(X).

Then, Z has finite expectation if

$\sum_{x \in \mathbb{R}^r} |\varphi(x)| f_X(x) < \infty$.

In that case,

$EZ = \sum_{x \in \mathbb{R}^r} \varphi(x) f_X(x)$.

Proof. Omitted. See [HPS, Section 4.1, Theorem 1].

Theorem 2.2. Let X and Y be two discrete random variables with finite expectation.

(i) If c ∈ R and P (X = c) = 1, then EX = c.

(ii) If c ∈ R, then cX has finite expectation and E(cX) = cEX.

(iii) X + Y has finite expectation and

E(X + Y ) = EX + EY.

(iv) If P (X ≥ Y ) = 1, then EX ≥ EY . Moreover, if P (X = Y ) = 1, then EX = EY .

(v) |EX| ≤ E(|X|) where |X| is the random variable w 7→ |X(w)|.

Proof. Assume for simplicity that X and Y both have finite ranges, say X has range
{x1 , x2 , . . . , xn } and Y has range {y1 , y2 , . . . , ym }.

(i) If P(X = c) = 1, then the density function of X is given by

$f(x) = \begin{cases} 1 & \text{if } x = c \\ 0 & \text{otherwise.} \end{cases}$

Therefore, $EX = \sum_{x} x f(x) = c f(c) = c$.

(ii) Clearly, cX has range {cx1 , cx2 , . . . , cxn } and

P (cX = cxi ) = P (X = xi ).

Therefore, the density function for cX is

$g(x) = \begin{cases} f(x_i) & \text{if } x = cx_i \\ 0 & \text{otherwise} \end{cases}$

Therefore,

$E(cX) = \sum_{i=1}^{n} c x_i f(x_i) = c E(X)$.

(iii) Clearly, X + Y has range {xᵢ + yⱼ}. Let f denote the joint density function and
$f_X$ and $f_Y$ denote the marginal density functions. Then,

$f_X(x_i) = \sum_{j=1}^{m} f(x_i, y_j) \quad \text{and} \quad f_Y(y_j) = \sum_{i=1}^{n} f(x_i, y_j)$.

Therefore,

$E(X + Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} (x_i + y_j) f(x_i, y_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_i f(x_i, y_j) + \sum_{i=1}^{n} \sum_{j=1}^{m} y_j f(x_i, y_j) = \sum_{i=1}^{n} x_i f_X(x_i) + \sum_{j=1}^{m} y_j f_Y(y_j) = EX + EY$.

(iv) Omitted. See the book.

(v) Note that −|X| ≤ X ≤ |X| so by part (iv),

−E(|X|) ≤ E(X) ≤ E(|X|).

Hence, |E(X)| ≤ E(|X|).

(End of Day 16)

Example 2.3. Let X₁, X₂, . . . , X_n be independent Bernoulli random variables with
common parameter p. If
X := X1 + X2 + . . . + Xn
then by Theorem III.5.7, X ∼ B(n, p). Moreover,

EX = EX1 + EX2 + . . . + EXn = np.

as in Example 1.3.

Example 2.4. Suppose there are N balls in an urn, of which r are red and b = (N − r)
are black. If a random sample of size n ≤ N is chosen and X denotes the number of red
balls, then

X ∼ Hyp(N, r, n)

with density function

$f(x) = \begin{cases} \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} & \text{if } 0 \le x \le r, \ 0 \le n-x \le N-r \\ 0 & \text{otherwise} \end{cases}$

To calculate EX, we may consider n random variables X₁, X₂, . . . , X_n where Xᵢ(w) = 1
if and only if the iᵗʰ ball drawn is red. Then,

$EX_i = P(X_i = 1) = \frac{r}{N}$.

Note that the Xᵢ are not independent. However, X = X₁ + X₂ + . . . + X_n and therefore

$EX = \frac{nr}{N}$.
Theorem 2.5. If X and Y are independent, then E(XY ) = E(X)E(Y ).

Proof. The joint density for X and Y is given by f(x, y) = f_X(x) f_Y(y). Thus by
Lemma 2.1,

$E(XY) = \sum_{x,y} xy f(x, y) = \sum_{x,y} x f_X(x) \, y f_Y(y) = \left(\sum_{x} x f_X(x)\right)\left(\sum_{y} y f_Y(y)\right) = E(X) E(Y)$.

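As a concrete check of Theorem 2.5, take two independent fair dice (a sketch; variable names are my own):

```python
# Two independent fair dice: E(XY) should factor as E(X) E(Y)
die = {k: 1 / 6 for k in range(1, 7)}
EX = sum(x * f for x, f in die.items())  # 3.5
EXY = sum(x * y * die[x] * die[y] for x in die for y in die)
print(EXY, EX * EX)  # both ≈ 12.25
```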
3. Moments
Definition 3.1. Let X be a discrete random variable and r ≥ 0.
(i) We say that X has a moment of order r if X r has finite expectation. In that case,
we define the rth moment of X to be

E(X r ).

(ii) If µ = E(X) is the mean, then the rᵗʰ central moment is defined as

$E(X - \mu)^r$

(iii) The variance of X is

$\text{Var}X := E(X-\mu)^2 = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - (EX)^2$.

(iv) The standard deviation of X is $\sigma := \sqrt{\text{Var}X}$.
Remark 3.2.
(i) If µ is the mean of X, then E(X − µ) = E(X) − µE(1) = 0. The higher central
moments can be non-zero, though.

(ii) By Lemma 2.1,

$E(X^r) = \sum_{x} x^r f(x)$.

(iii) If X has a moment of order r ≥ 0, then it has moments of order k for each
1 ≤ k ≤ r.

(iv) The standard deviation of X is a measure of how much X varies from the mean.

(v) Var(X) = 0 if and only if X is constant.


Proof. If X = c, then P(X = c) = 1, so

$\mu = c P(X = c) = c$.

Moreover, VarX = E(X²) − (EX)² = c² − c² = 0.

Conversely, if VarX = 0, then E(X − µ)² = 0. In other words, if f denotes the
density function of X, then

$\sum_{x} (x - \mu)^2 f(x) = 0 \Rightarrow f(x) = 0 \text{ unless } x = \mu$.

Hence, P(X = µ) = 1, so X = µ.

(vi) Suppose we wish to minimize the function g : R → R given by

$g(a) := E(X - a)^2$;

then we write

$g(a) = E(X^2) - 2a E(X) + a^2$

Differentiating with respect to a gives g′(a) = −2E(X) + 2a, so the extremum
occurs at a = E(X). Moreover, g′′(a) = 2 > 0, so this is a minimum. Hence, VarX
is the minimum value that g takes.
(vii) Suppose X is integer valued with probability generating function Φ_X : [−1, 1] → R,
and suppose that there is a t₀ > 1 so that

$\sum_{x=0}^{\infty} f_X(x) t_0^x < \infty$

(This need not be the case, but it is if X is finitely valued). Then, we differentiate
with respect to t to get

$\Phi_X'(t) = \sum_{x=1}^{\infty} x f_X(x) t^{x-1}$.

Hence, $\Phi_X'(1) = E(X)$. Similarly,

$\Phi_X''(t) = \sum_{x=2}^{\infty} x(x-1) f_X(x) t^{x-2}$

so that $\Phi_X''(1) = E(X(X-1)) = E(X^2 - X) = E(X^2) - E(X)$. Hence,

$\text{Var}X = \Phi_X''(1) + \Phi_X'(1) - (\Phi_X'(1))^2$.

Such formulae are also applicable for moments of higher order.
Example 3.3. Suppose X ∼ Po(λ); then the probability generating function of X is
given by

$\Phi_X(t) := e^{\lambda(t-1)}$.

Hence, $\Phi_X'(t) = \lambda e^{\lambda(t-1)}$ and $\Phi_X''(t) = \lambda^2 e^{\lambda(t-1)}$. At t = 1, we get

$\Phi_X'(1) = \lambda \quad \text{and} \quad \Phi_X''(1) = \lambda^2$.

Hence, $EX = \lambda$ and $\text{Var}X = \lambda^2 + \lambda - \lambda^2 = \lambda$.
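Both identities can be checked by summing the Poisson density directly (a sketch; λ = 3 is an arbitrary choice):

```python
from math import exp, factorial

lam = 3.0
pmf = [exp(-lam) * lam**x / factorial(x) for x in range(100)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum(x * x * f for x, f in enumerate(pmf)) - mean**2
print(mean, var)  # both ≈ lam
```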

(End of Day 17)

4. Variance of a Sum
Remark 4.1. If X and Y are two discrete random variables, then

$\text{Var}(X+Y) = E[(X+Y) - E(X+Y)]^2 = E[(X - EX) + (Y - EY)]^2 = E[X - EX]^2 + E[Y - EY]^2 + 2E[(X - EX)(Y - EY)]$

Definition 4.2. The covariance of X and Y is

$\text{Cov}(X, Y) := E[(X - EX)(Y - EY)] = E[XY - (EX)Y - X(EY) + (EX)(EY)] = E[XY] - (EX)(EY) - (EX)(EY) + (EX)(EY) = E[XY] - (EX)(EY)$.
Remark 4.3.
(i) Hence,
Var(X + Y ) = VarX + VarY + 2Cov(X, Y ).

(ii) By Theorem 2.5, Cov(X, Y ) = 0 whenever X and Y are independent. In that case,
Var(X + Y ) = VarX + VarY.

Lemma 4.4.

(i) If X₁, X₂, . . . , X_n are discrete random variables with finite variance, then

$\text{Var}(X_1 + X_2 + \ldots + X_n) = \sum_{i=1}^{n} \text{Var}(X_i) + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} \text{Cov}(X_i, X_j)$.

(ii) If X₁, X₂, . . . , X_n are mutually independent with common variance σ², then

$\text{Var}(X_1 + X_2 + \ldots + X_n) = n\sigma^2$.

Example 4.5. Let X1 , X2 , . . . , Xn be independent and such that Xi ∼ Bern(p) for all
1 ≤ i ≤ n. If
X := X1 + X2 . . . + Xn
then X ∼ B(n, p) by Theorem III.5.7. By Lemma 4.4,

$\text{Var}(X) = \sum_{i=1}^{n} \text{Var}(X_i)$.

Now, E(Xᵢ) = p from Example 1.2. Since Xᵢ = Xᵢ², we have E(Xᵢ²) = p as well. Hence,

$\text{Var}(X_i) = p - p^2 = p(1-p)$.

Hence, Var(X) = np(1 − p).
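The value np(1 − p) can be confirmed from the density itself (a quick sketch; the parameter values are arbitrary):

```python
from math import comb

n, p = 10, 0.4
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum(x * x * f for x, f in enumerate(pmf)) - mean**2
print(var)  # ≈ n*p*(1-p) = 2.4
```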

Example 4.6. Suppose there are N balls in an urn of which r are red and b = (N − r)
are black (as in Example 2.4). If a random sample of n ≤ N is chosen and X denotes
the number of red balls, then
X ∼ Hyp(N, r, n)
Consider n random variables X1 , X2 , . . . , Xn where Xi (w) = 1 if and only if the ith ball
drawn is red. Then,
EXi = P(Xi = 1) = r/N.
Note that the Xi are not independent and X = X1 + X2 + . . . + Xn . Therefore,
EX = nr/N

and

Var(X) = ∑_{i=1}^n Var(Xi) + 2 ∑_{i=1}^n ∑_{j=i+1}^n Cov(Xi, Xj).

Now consider
E(Xi Xj) = P(Xi = 1, Xj = 1) = (r/N) · ((r − 1)/(N − 1))

so that

Cov(Xi, Xj) = E(Xi Xj) − E(Xi)E(Xj) = r(r − 1)/(N(N − 1)) − r²/N².
Hence,

Var(X) = n (r/N) (1 − r/N) (1 − (n − 1)/(N − 1)).
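This variance formula can be checked against the hypergeometric density directly. A small sketch (with hypothetical values N = 10, r = 4, n = 3) computes the mean and variance from the pmf and compares with nr/N and the formula above.

```python
from math import comb

N, r, n = 10, 4, 3  # hypothetical urn: 10 balls, 4 red, sample of size 3

# hypergeometric pmf: P(X = k) = C(r, k) C(N - r, n - k) / C(N, n)
pmf = {k: comb(r, k) * comb(N - r, n - k) / comb(N, n) for k in range(n + 1)}

mean = sum(k * p for k, p in pmf.items())              # should equal nr/N = 1.2
var = sum(k * k * p for k, p in pmf.items()) - mean**2
formula = n * (r / N) * (1 - r / N) * (1 - (n - 1) / (N - 1))
```

For these values both var and formula come out to 0.56.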

5. Correlation Coefficient
Definition 5.1. Suppose X and Y have finite non-zero variance. The correlation coefficient
of X and Y is given by
ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).
X and Y are said to be uncorrelated if ρ(X, Y ) = 0.

Remark 5.2. If X and Y are independent, then by Theorem 2.5, X and Y are uncor-
related.

Theorem 5.3 (Schwarz Inequality). If X and Y have finite variance, then

[E(XY )]2 ≤ E(X 2 )E(Y 2 )

Moreover, equality holds if and only if Y ∼ 0 or X ∼ aY for some a ∈ R.

Proof.

(i) If Y ∼ 0 then P (Y = 0) = 1. Then, P (XY = 0) = 1 so E(XY ) = 0. Then, the
inequality holds trivially and indeed, equality holds because E(Y 2 ) = 0.

(ii) If X ∼ aY for some a ∈ R, then P (X = aY ) = 1. Then,

E(XY ) = E(aY 2 ) = aE(Y 2 ).

Moreover,
E(X 2 )E(Y 2 ) = E(a2 Y 2 )E(Y 2 ) = a2 [E(Y 2 )]2 .
Hence equality holds.

(iii) Now assume that P (Y = 0) < 1 so that E(Y 2 ) > 0. Now note that for any λ ∈ R,

0 ≤ E(X − λY )2 = λ2 E(Y 2 ) − 2λE(XY ) + E(X 2 ) =: g(λ).

Now,
g ′ (λ) = 2λE(Y 2 ) − 2E(XY ) and g ′′ (λ) = 2E(Y 2 ) > 0.
Hence g attains its minimum at a = E(XY)/E(Y²). Now

0 ≤ g(a) = E(X²) − [E(XY)]²/E(Y²).

This proves the inequality. Moreover, equality holds if and only if

g(a) = E(X − aY )2 = 0.

This happens if and only if X ∼ aY .

(End of Day 18)

Corollary 5.4. For any two X and Y as above,

−1 ≤ ρ(X, Y ) ≤ 1

and |ρ(X, Y )| = 1 if and only if X ∼ aY for some a ∈ R.

Proof. By Theorem 5.3 applied to (X − EX) and (Y − EY ), we have

(E[(X − EX)(Y − EY)])² ≤ E(X − EX)² E(Y − EY)².

In other words,
Cov(X, Y )2 ≤ Var(X)Var(Y ).
Hence, |ρ(X, Y )| ≤ 1. The case of equality can also be verified similarly (Check!).

6. Chebyshev’s Inequality
Remark 6.1. Suppose X is a nonnegative random variable with finite expectation, and
let t > 0. Define

Y(w) = 0 if X(w) < t, and Y(w) = t if X(w) ≥ t.
Then, Y is a discrete random variable with

P(Y = 0) = P(X < t) and P(Y = t) = P(X ≥ t).

Hence,
E(Y) = 0 · P(Y = 0) + t · P(Y = t) = t P(X ≥ t).
Moreover, X ≥ Y so by Theorem 2.2, EX ≥ EY . Hence,
P(X ≥ t) ≤ EX/t.

(This bound is often called Markov's inequality.)
Theorem 6.2 (Chebyshev’s Inequality). Let X be a random variable with mean µ and
variance σ 2 . Then, for any t > 0,

P(|X − µ| ≥ t) ≤ σ²/t².
Proof. Consider the random variable Z := (X − µ)2 and apply the previous remark.
Then,
P(Z ≥ t²) ≤ EZ/t² = E(X − µ)²/t² = σ²/t².
Now observe that Z ≥ t2 ⇔ |X − µ| ≥ t.

Remark 6.3. Let X1, X2, . . . , Xn be independent random variables with a common
distribution with finite mean µ and variance σ².

(i) We think of these Xi as n independent measurements of the same quantity; the
collection {X1, X2, . . . , Xn} is called a random sample of size n from its common
distribution.

(ii) Define
Yn := (X1 + X2 + . . . + Xn)/n.
We would expect that Yn would be close to µ as n → ∞. Since the Xi are
independent,

E(Yn) = nµ/n = µ

and

Var(Yn) = (1/n²) ∑_{i=1}^n Var(Xi) = σ²/n.

(iii) For a fixed δ > 0, by Chebyshev’s inequality

P(|Yn − µ| ≥ δ) ≤ σ²/(nδ²)
⇒ lim_{n→∞} P(|Yn − µ| ≥ δ) = 0
⇒ lim_{n→∞} P(|Yn − µ| < δ) = 1.

Theorem 6.4 (Weak Law of Large Numbers). Let X1 , X2 , . . . be independent random


variables having a common distribution with finite mean µ. Set

Sn := X1 + X2 + . . . + Xn

Then, for any δ > 0,

lim_{n→∞} P(|Sn/n − µ| ≥ δ) = 0.
Example 6.5. Suppose X1 , X2 , . . . , Xn are independent random variables with Xi ∼
Bern(p) for some 0 ≤ p ≤ 1. Then,

µ = p and σ 2 = p(1 − p).

By Chebyshev's inequality (as in Remark 6.3),

P(|Sn/n − µ| ≥ δ) ≤ p(1 − p)/(nδ²).
Now, the function p 7→ p(1 − p) has its maximum value on [0, 1] at p = 1/2, so
p(1 − p) ≤ 1/4

for all 0 < p < 1. Hence,

P(|Sn/n − µ| ≥ δ) ≤ 1/(4nδ²).
Hence, given ϵ > 0, if we choose n ∈ N so that

1/(4nδ²) < ϵ
then we can ensure that after n trials,
 
P(|Sn/n − µ| ≥ δ) < ϵ.
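A quick simulation sketch (with arbitrary illustration values p = 0.3, n = 1000, δ = 0.05) shows the observed frequency of the event {|Sn/n − µ| ≥ δ} staying below the 1/(4nδ²) bound.

```python
import random

random.seed(0)
p, n, delta, trials = 0.3, 1000, 0.05, 2000  # arbitrary illustration values

exceed = 0
for _ in range(trials):
    s = sum(random.random() < p for _ in range(n))  # S_n ~ B(n, p)
    if abs(s / n - p) >= delta:
        exceed += 1

empirical = exceed / trials
bound = 1 / (4 * n * delta**2)  # about 0.1; the true probability here is far smaller
```

The Chebyshev bound is very conservative: for these values the empirical frequency is essentially 0 while the bound is 0.1.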

(End of Day 19)

V. Continuous Random Variables
1. Random Variables and their distribution functions
Definition 1.1. Let (Ω, A, P ) be a probability space. A random variable is a function

X:Ω→R

such that, for each x ∈ R, the set {w ∈ Ω : X(w) ≤ x} belongs to A.


Example 1.2.
(i) Note that, for a random variable, the range of X need not be countable. Typically,
measurements such as time, distance, weight, etc. are random variables. Every
discrete random variable is also a random variable.

(ii) Suppose Ω = [a, b] is a closed interval in R, then there is a σ-algebra A called the
Lebesgue σ-algebra that contains all open, closed and half-open intervals. On this
σ-algebra, there is a probability measure P such that
P([x, y]) = |y − x|/(b − a) = P([x, y)) = P((x, y)).

The triple (Ω, A, P ) is a probability space, and many random variables are defined
on such spaces.

(iii) Consider Example I.1.3: A pointer that is free to spin about the centre of a circle
of radius 1. If the pointer is spun, it comes to rest at an angle (in radians) from
the X axis. Define
Ω = [0, 2π)
Recall from Example I.2.2 that there is a σ-algebra A on Ω, called the Lebesgue σ-
algebra that contains all open (and half-open) intervals. There is also a probability
measure P on A such that
P([a, b]) = (b − a)/(2π).
This is the probability space (Ω, A, P ). Let X be the angle that the pointer comes
to rest at. Then, for any x ∈ R,

{w ∈ Ω : X(w) ≤ x} = [0, x] ∩ Ω ∈ A.

Therefore, X is a random variable.

(iv) Consider a dart board of radius 1. A dart is thrown at the board, and we measure
the distance of the dart from the origin, and denote it by X. Then,

Ω := {(a, b) ∈ R2 : a2 + b2 ≤ 1}.

and A is the Lebesgue σ-algebra which contains all open (and half-open) discs,
and P be the Lebesgue measure (as above). Then, for any x ∈ R,

{w ∈ Ω : X(w) ≤ x} = {(a, b) ∈ Ω : a2 + b2 ≤ x2 } ∈ A.

Therefore, X is a random variable.


Definition 1.3. The distribution function of a random variable X is the function F :
R → R given by
F (x) := P (X ≤ x).
Example 1.4.
(i) In the spinning pointer example,

0
 :x<0
x
F (x) = 2π
: 0 ≤ x < 2π

1 : x ≥ 2π.

(ii) In the dart board example, if 0 ≤ x ≤ 1, we have

P(X ≤ x) = (1/π) · Area({(a, b) ∈ Ω : a² + b² ≤ x²}) = πx²/π = x².
Therefore,

F(x) =
  0    : x < 0
  x²   : 0 ≤ x ≤ 1
  1    : x > 1.
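The dart-board distribution F(x) = x² can also be seen empirically. The sketch below (with arbitrary sample size and test point) draws points uniformly from the unit disc by rejection sampling and estimates P(X ≤ x) at x = 0.7.

```python
import math
import random

random.seed(1)

def distance_of_uniform_dart():
    # rejection sampling: uniform point in the unit disc, return distance from origin
    while True:
        a, b = random.uniform(-1, 1), random.uniform(-1, 1)
        if a * a + b * b <= 1:
            return math.sqrt(a * a + b * b)

samples = [distance_of_uniform_dart() for _ in range(20000)]
x = 0.7
F_hat = sum(d <= x for d in samples) / len(samples)  # should be near x**2 = 0.49
```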

Remark 1.5. This is a repeat of Remark III.2.3.


(i) For any x ∈ R, 0 ≤ F (x) ≤ 1.

(ii) If s ≤ t, then
F (s) ≤ F (t)
so F is non-decreasing.

(iii) If (a, b] is an interval in R, then

P (a < X ≤ b) = F (b) − F (a).

(End of Day 20)

(iv) Moreover, consider
F(−∞) := lim_{n→−∞} F(n) = lim_{n→−∞} P(X ≤ n).

Now, An := {w ∈ Ω : X(w) ≤ n} form a decreasing sequence of sets (as n → −∞)
with ⋂_{n≤−1} An = ∅. Therefore, by Theorem I.3.5,

F(−∞) = lim_{n→−∞} P(An) = P(∅) = 0.

Similarly,
F (+∞) = 1.

(v) Fix x ∈ R, and consider the right limit


F(x+) := lim_{h→0, h>0} F(x + h) = lim_{n→∞} F(x + 1/n)

(because F is non-decreasing and bounded, this limit exists). Now, set Bn := {w ∈


Ω : X(w) ≤ x + 1/n}, then (Bn ) is a decreasing sequence of sets with

B := ⋂_{n=1}^∞ Bn = {w ∈ Ω : X(w) ≤ x}

Again, by Theorem I.3.5,


F(x+) = lim_{n→∞} P(X ≤ x + 1/n) = P(X ≤ x) = F(x).

(vi) Now fix x ∈ R and consider the left limit


F(x−) := lim_{h→0, h>0} F(x − h) = lim_{n→∞} F(x − 1/n).

Again, if Cn := {w ∈ Ω : X(w) ≤ x − 1/n}, then (Cn ) is an increasing family of


sets with

C := ⋃_{n=1}^∞ Cn = {w ∈ Ω : X(w) < x}
Hence,
F(x−) = lim_{n→∞} P(Cn) = P(C) = P(X < x).

Note that, in general, F (x−) ̸= F (x).


(vii) Observe that for any x ∈ R,
F (x+) − F (x−) = P (X ≤ x) − P (X < x) = P (X = x).

Corollary 1.6. The distribution function F is continuous if and only if


P (X = x) = 0
for all x ∈ R. In that case, X is called a continuous random variable.

Example 1.7.
(i) Discrete random variables are not continuous. For a continuous random variable,
the notion of density function (as in Definition III.1.4) does not make sense.

(ii) In Example 1.2, both the spinning pointer and the dart board define continuous
random variables.
Definition 1.8. A distribution function is a function F : R → R satisfying the following
properties:
(i) 0 ≤ F (x) ≤ 1 for all x ∈ R.

(ii) F is a non-decreasing function

(iii) F (−∞) = 0 and F (+∞) = 1.

(iv) F (x+) = F (x) for all x ∈ R.

2. Densities of Continuous Random Variables


Definition 2.1. A density function is a non-negative function f : R → R such that
∫_{−∞}^∞ f(x) dx = 1.

Example 2.2.
(i) Given real numbers a < b, define f : R → R by

0
 :t<a
1
f (t) = (b−a) : a ≤ t ≤ b

0 :t>b

This is called the uniform density function on the interval [a, b].

(ii) Fix λ > 0 and consider f : R → R given by


f(t) =
  λe^{−λt}  : t > 0
  0         : t ≤ 0.

Then, f is non-negative and for any M > 0,


∫_{−∞}^M f(t) dt = −e^{−λt}|_0^M = 1 − e^{−λM} → 1 as M → ∞.

Therefore, f is a density function and is called the exponential density with pa-
rameter λ.

(iii) Define f : R → R by
f(t) = 1/(π(1 + t²)).
Then, for any M > 0,
∫_{−M}^M f(t) dt = (1/π) arctan(t)|_{−M}^M = (1/π)(arctan(M) − arctan(−M)).
Hence,
∫_{−∞}^∞ f(t) dt = lim_{M→∞} (1/π)(arctan(M) − arctan(−M)) = (1/π)(π/2 − (−π/2)) = 1.
Thus, f is a density function, called the Cauchy density.
Definition 2.3. Let f : R → R be a density function as above. Define F : R → R by
F(x) = ∫_{−∞}^x f(t) dt.

Then, F is a distribution function as per Definition 1.8. Moreover, if f is continuous at


a point x ∈ R, then
F ′ (x) = f (x).
Most (but not all) distribution functions arise this way.
(End of Day 21)
Remark 2.4. If X is a continuous random variable with distribution function F and
density function f , then
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = ∫_a^b f(t) dt = F(b) − F(a).

This quantity is represented by the area under the curve of f .


Example 2.5. If X is a continuous random variable with distribution function F and
density function f , find the density function for Y := X 2 .

Solution: Note that Y takes values in R+ , so for any y < 0,

P (Y ≤ y) = 0.

Now if y ≥ 0, then
G(y) := P(Y ≤ y) = P(−√y ≤ X ≤ √y) = F(√y) − F(−√y)
⇒ g(y) = G′(y) = (1/(2√y)) f(√y) + (1/(2√y)) f(−√y).

So the density function of Y is
g(y) =
  0                                    : if y ≤ 0
  (1/(2√y)) f(√y) + (1/(2√y)) f(−√y)   : if y > 0.

Example 2.6. If X is uniformly distributed on (0, 1) and λ > 0 is fixed, find the density
function of Y := −λ^{−1} log(1 − X).

Solution: Note that the density function of X is


f(t) =
  0  : if t < 0 or t > 1
  1  : if 0 ≤ t ≤ 1.

So the distribution function of X is



F(x) =
  0  : if x < 0
  x  : if 0 ≤ x ≤ 1
  1  : if x > 1

If G denotes the distribution function of Y , then G(y) = 0 if y < 0. Now if y ≥ 0, then

G(y) = P(Y ≤ y) = P(log(1 − X) ≥ −λy)
     = P(1 − X ≥ e^{−λy})
     = P(X ≤ 1 − e^{−λy})
     = F(1 − e^{−λy})
     = 1 − e^{−λy}.

So the density function of Y is


g(y) =
  0          : y < 0
  λe^{−λy}   : y ≥ 0

This is the exponential density with parameter λ.
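Example 2.6 is exactly the inverse-transform method for simulating exponential random variables. A minimal sketch (λ = 1.5 is an arbitrary choice):

```python
import math
import random

random.seed(2)
lam = 1.5  # arbitrary rate parameter

# If X ~ U(0, 1), then Y = -log(1 - X)/lam has the exponential density with parameter lam.
samples = [-math.log(1 - random.random()) / lam for _ in range(50000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1/lam
```

The sample mean converges to 1/λ, the mean of the exponential distribution (computed later in Example VII.1.4).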

3. Normal, Exponential and Gamma Densities
a. Normal Densities
Example 3.1. Let g : R → R be given by
g(x) = e^{−x²/2}.

If c := ∫_{−∞}^∞ g(x) dx, then one can show (using polar coordinates) that

c2 = 2π.

Therefore, if
f(t) := (1/√(2π)) e^{−t²/2}
then f is a density function called the standard normal density. We write X ∼ n(0, 1).

Remark 3.2.

(i) Note that f(t) = f(−t), so the density is symmetric and X is a symmetric random variable.

(ii) Let Φ denote the corresponding distribution function, then


Φ(x) = (1/√(2π)) ∫_{−∞}^x e^{−t²/2} dt.
This integral does not have a closed-form expression.

(iii) However, we may observe some properties such as


Φ(x) + Φ(−x) = (1/√(2π)) (∫_{−∞}^x e^{−t²/2} dt + ∫_{−∞}^{−x} e^{−t²/2} dt)
             = (1/√(2π)) (∫_{−∞}^x e^{−t²/2} dt + ∫_x^∞ e^{−t²/2} dt)
             = (1/√(2π)) ∫_{−∞}^∞ e^{−t²/2} dt
             = 1.

Therefore,
Φ(−x) = 1 − Φ(x).
This is useful because we only need to determine values of Φ(x) when x ≥ 0.

(iv) From the above calculation, we know that


Φ(0) = 1/2.

(v) The other values of Φ are given at the end of the textbook. For instance, the table
tells us that
Φ(3) = 0.9987 and therefore Φ(−3) = 0.0013.
Therefore, the majority of the area under the curve lies between [−3, 3]. A standardized
value z ∈ R is often called a z-score, and the table records Φ(z) for such values.
(vi) Suppose X ∼ n(0, 1) and µ, σ ∈ R with σ > 0, then define
Y := µ + σX.
Then, Y is also a continuous random variable, and its distribution function is given
by
 
G(y) = P(Y ≤ y) = P(X ≤ (y − µ)/σ) = Φ((y − µ)/σ).
So the density function of Y is g : R → R given by
g(y) = (1/(σ√(2π))) e^{−(y−µ)²/2σ²}
This is a normal density with mean µ and variance σ 2 , and is denoted Y ∼ n(µ, σ 2 ).
Note that

P(a ≤ Y ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ).
This can be used to make calculations for all normal densities using the table.
(vii) For instance, if X ∼ n(1, 4), then
   
P(0 ≤ X ≤ 3) = Φ((3 − 1)/2) − Φ((0 − 1)/2)
             = Φ(1) − (1 − Φ(1/2))
             = Φ(1) + Φ(0.5) − 1
             = 0.5328 (from the table)
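Instead of the table, Φ can be evaluated with the error function, since Φ(x) = (1 + erf(x/√2))/2. The sketch below redoes the computation in (vii) for X ∼ n(1, 4).

```python
import math

def Phi(x):
    # standard normal distribution function via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu, sigma = 1, 2  # X ~ n(1, 4)
p = Phi((3 - mu) / sigma) - Phi((0 - mu) / sigma)  # P(0 <= X <= 3), about 0.5328
```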

b. Exponential Densities
Recall, the density function for an exponential random variable with parameter λ > 0
is given by

f(x) =
  λe^{−λx}  : x ≥ 0
  0         : x < 0
and the distribution function is given by
F(x) =
  1 − e^{−λx}  : x ≥ 0
  0            : x < 0

Remark 3.3. Note that for any a ≥ 0, we have

P (X > a) = 1 − P (X ≤ a) = 1 − F (a) = e−λa

Hence, if b ≥ 0, then
P (X > a)P (X > b) = P (X > a + b)
Equivalently,
P (X > a + b|X > a) = P (X > b).
This is the continuous analog of the memoryless property of the geometric random
variable.
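The identity P(X > a + b) = P(X > a)P(X > b) follows directly from the survival function e^{−λt}; a one-line numerical check (with arbitrary λ, a, b):

```python
import math

lam, a, b = 0.7, 1.2, 2.5            # arbitrary illustration values
surv = lambda t: math.exp(-lam * t)  # P(X > t) for X ~ Exp(lam)

lhs = surv(a + b)
rhs = surv(a) * surv(b)  # equals lhs, up to rounding
```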

c. Gamma Densities
Definition 3.4. For α > 0, define
Γ(α) := ∫_0^∞ x^{α−1} e^{−x} dx

This integral is finite but does not have a closed form expression. This is called the
Gamma function.

Definition 3.5. For a given α > 0 and λ > 0, define the gamma density by
g(x) :=
  (λ^α/Γ(α)) x^{α−1} e^{−λx}  : x ≥ 0
  0                           : x < 0

This is denoted by Γ(x; α, λ).

Remark 3.6.

(i) Note that g(x) ≥ 0 and by the definition of the Gamma function,
∫_{−∞}^∞ g(x) dx = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−λx} dx = 1.

(by a change of variable t := λx). Hence, g is a density function.

(ii) Suppose X ∼ n(0, σ 2 ) and Y := X 2 , then the density function of X is


f(x) = (1/(σ√(2π))) e^{−x²/2σ²}
By Example 2.5, the density function of Y is given by
g(y) =
  (1/(2√y)) (f(√y) + f(−√y))  : y > 0
  0                           : y ≤ 0.

A short calculation shows that
g(y) =
  (1/(σ√(2πy))) e^{−y/2σ²}  : y > 0
  0                         : y ≤ 0

In other words, Y is a Gamma random variable with parameters α = 1/2 and


λ = 1/2σ 2 .

(iii) Some values of Γ can be calculated. For instance,


Γ(1) = ∫_0^∞ e^{−x} dx = 1.

(iv) One can prove that for any α > 0,

Γ(α + 1) = αΓ(α).

Therefore, it follows that Γ(n) = (n − 1)! for all n ∈ N.
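Both properties can be checked with the standard library's Gamma function:

```python
import math

# Gamma(alpha + 1) = alpha * Gamma(alpha), checked at an arbitrary alpha
alpha = 2.7
lhs = math.gamma(alpha + 1)
rhs = alpha * math.gamma(alpha)

# Gamma(n) = (n - 1)! for positive integers n
fact_ok = all(abs(math.gamma(n) - math.factorial(n - 1)) < 1e-6
              for n in range(1, 10))
```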

(End of Day 22)

VI. Jointly Distributed Random
Variables
1. Properties of Bivariate Distributions
Definition 1.1. Let X and Y be two random variables on a probability space (Ω, A, P ).
The joint distribution function is defined as F : R2 → R by
F (x, y) := P (X ≤ x and Y ≤ y)
Remark 1.2.
(i) We can directly use the joint distribution function to compute certain probabili-
ties:
P (a < X ≤ b, Y ≤ d) = F (b, d) − F (a, d)
P (a < X ≤ b, c < Y ≤ d) = F (b, d) − F (b, c) − F (a, d) + F (a, c)

Definition 1.3.
(i) The marginal distribution functions are defined as
FX (x) = P (X ≤ x) and FY (y) := P (Y ≤ y).

(ii) If there is a function f : R2 → R such that


F(x, y) = ∫_{−∞}^x (∫_{−∞}^y f(u, v) dv) du

then f is called the joint density function. Again, as before, for continuous random
variables, it is not true that
f (x, y) = P (X = x, Y = y).

Remark 1.4.
(i) Observe that
P(a < X ≤ b, c < Y ≤ d) = ∫_a^b (∫_c^d f(u, v) dv) du

More generally, for any subset A ⊂ R2 that is ‘nice’, we have


P((X, Y) ∈ A) = ∫∫_A f(u, v) dv du

(ii) Taking A = R², we get

∫_{−∞}^∞ ∫_{−∞}^∞ f(u, v) dv du = 1.

(iii) Given a joint density function f : R2 → R, we define


fX(x) = ∫_{−∞}^∞ f(x, v) dv

Then fX is a density function (as in Definition V.2.1) and


FX(x) = ∫_{−∞}^x fX(t) dt.

Hence, fX is the density function for X and is called the marginal density. Similarly,

fY(y) = ∫_{−∞}^∞ f(u, y) du

is a density function for Y .

(iv) Under certain conditions (f should be twice continuously differentiable), we have

∂²F/∂x∂y = f(x, y)

Definition 1.5. X and Y are said to be independent if whenever a ≤ b and c ≤ d, we


have
P (a < X ≤ b, c < Y ≤ d) = P (a < X ≤ b)P (c < Y ≤ d).
Equivalently, they are independent if and only if the joint distribution function is a
product of marginal distribution functions:

F (x, y) = FX (x)FY (y)

Remark 1.6.
(i) More generally, for any two sets A, B ⊂ R, we have

P ((X, Y ) ∈ A × B) = P (X ∈ A)P (Y ∈ B).

(ii) Moreover, if f : R2 → R is a joint density function, then X and Y are independent


if and only if
f (x, y) = fX (x)fY (y).

(iii) Note that an abstract joint density function is a non-negative function f : R2 → R


such that

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1.

(iv) In particular, if f1 and f2 are two (one dimensional) density functions as in Defi-
nition V.2.1, then
f (x, y) := f1 (x)f2 (y)
defines a joint density function.

Example 1.7. If X ∼ n(0, 1) and Y ∼ n(0, 1) are independent, then the joint density
function is given by
f(x, y) = (1/√(2π)) e^{−x²/2} · (1/√(2π)) e^{−y²/2} = (1/(2π)) e^{−(x²+y²)/2}

This is called the standard bivariate normal density.

Example 1.8. Suppose f : R2 → R is given by


f(x, y) = c e^{−(x²−xy+y²)/2}

We need to find c so that it is a density function: Write

−(x2 − xy + y 2 ) = −(y − x/2)2 − 3x2 /4

So the marginal is given by


fX(x) = c e^{−3x²/8} ∫_{−∞}^∞ e^{−(y−x/2)²/2} dy

A change of variable: u := (y − x/2) gives

fX(x) = c e^{−3x²/8} ∫_{−∞}^∞ e^{−u²/2} du = c √(2π) e^{−3x²/8}.

Hence, X ∼ n(0, σ 2 ) where



c √(2π) = 1/(σ√(2π)) ⇒ c = √3/(4π)

Hence,

f(x, y) = (√3/(4π)) e^{−(x²−xy+y²)/2}

is a density function. Note that
X ∼ n(0, 4/3).
Similarly, one can show that Y ∼ n(0, 4/3). Since

f (x, y) ̸= fX (x)fY (y)

it follows that X and Y are not independent.

2. Distribution of Sums
Remark 2.1. Suppose Z = X +Y is the sum of two random variables with joint density
function f . Then, for any z ∈ R, if

Az = {(x, y) : x + y ≤ z}

Then this is the half-plane under the line y = z − x. Hence,


FZ(z) = P(Az) = ∫∫_{Az} f(x, y) dy dx
      = ∫_{−∞}^∞ ∫_{−∞}^{z−x} f(x, y) dy dx
      = ∫_{−∞}^∞ ∫_{−∞}^z f(x, v − x) dv dx
      = ∫_{−∞}^z ∫_{−∞}^∞ f(x, v − x) dx dv

Therefore, the density function of Z is


fX+Y(z) = ∫_{−∞}^∞ f(x, z − x) dx

Moreover, if X and Y are independent, then


fX+Y(z) = ∫_{−∞}^∞ fX(x) fY(z − x) dx

This is called the convolution of the two functions fX and fY .


Example 2.2. Suppose X and Y are independent random variables each having an
exponential distribution with parameter λ. Find the distribution function of (X + Y ).

Solution: The density of X is given by


fX(x) =
  λe^{−λx}  : if x ≥ 0
  0         : if x < 0.

and the same for Y as well. The density function for (X + Y ) is given by
fX+Y(z) = ∫_{−∞}^∞ fX(x) fY(z − x) dx

Since fX (x) = 0 if x < 0 and fY (z − x) = 0 if x > z, it follows that

fX+Y (z) = 0

if z < 0 and if z ≥ 0, then
fX+Y(z) = ∫_0^z fX(x) fY(z − x) dx
        = λ² ∫_0^z e^{−λx} e^{−λ(z−x)} dx
        = λ² z e^{−λz}

Hence, X + Y ∼ Γ(2, λ).
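We can corroborate X + Y ∼ Γ(2, λ) by simulation: a Γ(2, λ) random variable has mean 2/λ, so the sample mean of summed exponential pairs should come out near that value. A sketch with an arbitrary choice λ = 2:

```python
import random

random.seed(3)
lam = 2.0  # arbitrary rate

# sum of two independent Exp(lam) draws; Gamma(2, lam) has mean 2/lam = 1.0
z = [random.expovariate(lam) + random.expovariate(lam) for _ in range(50000)]
z_mean = sum(z) / len(z)
```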

Theorem 2.3. Suppose X ∼ n(µ1 , σ12 ) and Y ∼ n(µ2 , σ22 ) are both independent random
variables, then
X + Y ∼ n(µ1 + µ2 , σ12 + σ22 ).
Proof. We will assume that µ1 = µ2 = 0 and σ1 = σ2 = 1. The general proof is similar
but more technical. Then,
fX(x) = (1/√(2π)) e^{−x²/2} and fY(y) = (1/√(2π)) e^{−y²/2}
By the convolution formula,
fX+Y(z) = (1/(2π)) ∫_{−∞}^∞ e^{−(x²+(z−x)²)/2} dx

Observe that
x² + (z − x)² = 2x² + z² − 2zx = (√2 x − z/√2)² + z²/2
√ √ √
Hence, if u := √2 x − z/√2, then du = √2 dx, so
fX+Y(z) = (e^{−z²/4}/(2π)) ∫_{−∞}^∞ e^{−u²/2} (1/√2) du
        = (e^{−z²/4}/(2π)) · √(2π)/√2
        = (1/(√2 √(2π))) e^{−z²/4}.
Hence,
(X + Y ) ∼ n(0, 2).

Theorem 2.4. Suppose X ∼ Γ(α1 , λ) and Y ∼ Γ(α2 , λ) are independent random vari-
ables, then X + Y ∼ Γ(α1 + α2 , λ).
Proof. Omitted. See [HPS, Section 6.2].

3. Conditional Densities
Remark 3.1. Suppose X and Y are discrete random variables with joint density func-
tion f and x, y ∈ R. If P (X = x) > 0, then
P(Y = y | X = x) = P(Y = y, X = x)/P(X = x) = f(x, y)/fX(x)
The conditional density function may thus be defined as
fY|X(y|x) =
  f(x, y)/fX(x)  : if 0 < fX(x) < ∞
  0              : otherwise.

Then, fY |X is a non-negative function and if 0 < fX (x) < ∞, then


∑_{y∈R} fY|X(y|x) = ∑_{y∈R} f(x, y)/fX(x) = fX(x)/fX(x) = 1.

For each such x ∈ R, this is a density function in the sense of Lemma III.1.12.
Definition 3.2. Let X and Y be two random variables with joint density function f .
Then, the conditional density function is defined as
fY|X(y|x) =
  f(x, y)/fX(x)  : if 0 < fX(x) < ∞
  0              : otherwise.

Remark 3.3.
(i) Note that for any a ≤ b and x ∈ R
P(a < Y ≤ b | X = x) = ∫_a^b fY|X(y|x) dy.

This can also be done using calculus (see the textbook).

(ii) Observe that for any x, y ∈ R,

f (x, y) = fX (x)fY |X (y|x)

(iii) In particular, X and Y are independent if and only if

fY (y) = fY |X (y|x)

for any x, y ∈ R.

(iv) You should think of the conditional density as the vertical cross-section of the joint
density surface at the point X = x, rescaled (divided by fX(x)) so that its total
area is 1.

Example 3.4. Consider X, Y with joint density function

f(x, y) = (√3/(4π)) e^{−(x²−xy+y²)/2}

We saw in Example 1.8 that X and Y are dependent normal variables with X ∼ n(0, 4/3)
and Y ∼ n(0, 4/3). Note that for any x ∈ R,

fX(x) = (√3/(2√(2π))) e^{−3x²/8}.
Hence, the conditional density is given by
fY|X(y|x) = [(√3/(4π)) e^{−(x²−xy+y²)/2}] / [(√3/(2√(2π))) e^{−3x²/8}]
          = (1/√(2π)) e^{−(y−x/2)²/2}
Hence, for each x ∈ R, the conditional density of Y given X = x is the normal density
n(x/2, 1).

Theorem 3.5 (Bayes’ Rule). For any x, y ∈ R, we have

fX|Y(x|y) = fX(x) fY|X(y|x) / ∫_{−∞}^∞ fX(x) fY|X(y|x) dx

Proof. We know that


fX|Y(x|y) = f(x, y)/fY(y)
if 0 < fY (y) < ∞. Now note that

f (x, y) = fX (x)fY |X (y|x)

by definition. Moreover,
fY(y) = ∫_{−∞}^∞ f(x, y) dx = ∫_{−∞}^∞ fX(x) fY|X(y|x) dx.

(End of Day 23)

4. Properties of Multivariate Distributions
Definition 4.1. Suppose X1 , X2 , . . . , Xn are n random variables.

(i) The joint distribution function is given by F : Rn → R defined as

F (x1 , x2 , . . . , xn ) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn )

(ii) A function f : Rn → R is called a joint density function for X1 , . . . , Xn if it is


non-negative and
F(x1, x2, . . . , xn) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} · · · ∫_{−∞}^{xn} f(u1, u2, . . . , un) dun dun−1 . . . du1.

Remark 4.2.

(i) Under certain mild conditions, we have

f(x1, x2, . . . , xn) = ∂ⁿF/(∂x1 ∂x2 . . . ∂xn) (x1, x2, . . . , xn)

(ii) For any ‘nice’ subset A ⊂ Rn , we have


P((X1, X2, . . . , Xn) ∈ A) = ∫_A f(u1, u2, . . . , un) dun dun−1 . . . du1.

(iii) In particular,

∫_{Rⁿ} f(u1, u2, . . . , un) dun dun−1 . . . du1 = 1,

and for any n-cell (‘rectangle’ in n dimensions), we have


P(a1 < X1 ≤ b1, . . . , an < Xn ≤ bn) = ∫_{a1}^{b1} · · · ∫_{an}^{bn} f(u1, u2, . . . , un) dun dun−1 . . . du1.

(iv) The marginal distribution function for Xj is defined FXj : R → R defined as

FXj (x) = P (Xj ≤ x).

(v) If f : Rn → R is the joint density function, then the marginal density functions
are defined as
fX1(x1) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f(x1, u2, . . . , un) dun dun−1 . . . du2.

and others are defined similarly.

(vi) The variables X1 , X2 , . . . , Xn are said to be independent if
P(a1 < X1 ≤ b1, . . . , an < Xn ≤ bn) = ∏_{i=1}^n P(ai < Xi ≤ bi).

Equivalently, this happens if and only if


F(x1, x2, . . . , xn) = ∏_{i=1}^n FXi(xi),

and equivalently, if and only if


f(x1, x2, . . . , xn) = ∏_{i=1}^n fXi(xi).

Example 4.3. Suppose X1 , X2 , . . . , Xn are independent random variables, each of which


is exponentially distributed with parameter λ. Then, the joint density function is given by

f(x1, x2, . . . , xn) =
  λⁿ e^{−λ(x1+x2+...+xn)}  : if xi > 0 for all 1 ≤ i ≤ n
  0                        : otherwise.

Proof. Exercise. Try it for n = 2 first!

Theorem 4.4. Suppose X1 , X2 , . . . , Xn are independent random variables. Suppose

Y = φ(X1 , X2 , . . . , Xm ) and Z = ψ(Xm+1 , Xm+2 , . . . , Xn )

for two functions φ and ψ. Then, Y and Z are independent.

Corollary 4.5. Suppose X1 , X2 , . . . , Xn are independent random variables with Xi ∼


n(µi , σi2 ). Then,
X1 + X2 + . . . + Xn ∼ n(µ, σ 2 )
where µ = ∑_{i=1}^n µi and σ² = ∑_{i=1}^n σi².

Proof. For n = 2, this was proved in Theorem 2.3. For n > 2, we use induction. Assume
that the result is true for (n − 1) variables. Let Y := X2 + X3 + . . . + Xn . By induction
hypothesis,
Y ∼ n(ν, λ²)

where ν = ∑_{i=2}^n µi and λ² = ∑_{i=2}^n σi². Moreover, by Theorem 4.4, Y and X1 are
independent. Therefore, by the n = 2 case,

X1 + Y ∼ n(µ, σ 2 )

as desired.

VII. Expectations and the Central
Limit Theorem
1. Expectations of Continuous Random Variables
Definition 1.1. Let X be a continuous random variable with density function f . We
say that X has finite expectation if
∫_{−∞}^∞ |x| f(x) dx < ∞.

If this happens, we define the expectation of X to be


EX := ∫_{−∞}^∞ x f(x) dx.

This is also called the mean of X.


Example 1.2. If X ∼ U (a, b), then
f(x) =
  1/(b − a)  : if a ≤ x ≤ b
  0          : otherwise.
Hence,
EX = ∫_a^b x/(b − a) dx = x²/(2(b − a)) |_a^b = (b + a)/2.
Example 1.3. If X ∼ Γ(α, λ), then
f(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx} for x ≥ 0 (and f(x) = 0 for x < 0).
Hence,
EX = (λ^α/Γ(α)) ∫_0^∞ x^α e^{−λx} dx
   = (λ^α/Γ(α)) · Γ(α + 1)/λ^{α+1}
   = α/λ
because Γ(α + 1) = αΓ(α).
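The value EX = α/λ can be confirmed by integrating x·f(x) numerically, e.g. with a simple Riemann sum (α = 3, λ = 2 are arbitrary choices):

```python
import math

alpha, lam = 3.0, 2.0  # arbitrary Gamma parameters
dx = 0.001

def gamma_density(x):
    # density of Gamma(alpha, lam) for x > 0
    return lam**alpha / math.gamma(alpha) * x**(alpha - 1) * math.exp(-lam * x)

# Riemann sum of x * f(x) on (0, 40]; the tail beyond 40 is negligible here
ex = sum(x * gamma_density(x) * dx for x in (i * dx for i in range(1, 40001)))
# should be close to alpha / lam = 1.5
```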

Example 1.4. If X ∼ Exp(λ), then X ∼ Γ(1, λ), so
EX = 1/λ
by the previous example.

Example 1.5. If X is a Cauchy distribution, then


f(x) = 1/(π(1 + x²))
so

(1/π) ∫_{−∞}^∞ |x|/(1 + x²) dx = (2/π) ∫_0^∞ x/(1 + x²) dx
                               = (2/π) lim_{M→∞} ∫_0^M x/(1 + x²) dx
                               = (1/π) lim_{M→∞} ln(1 + x²)|_0^M
                               = (1/π) lim_{M→∞} ln(1 + M²) = +∞.
So X does not have finite expectation, so EX is undefined.

Theorem 1.6. Let X1 , X2 , . . . , Xn be random variables with joint density f : Rn → R.


Let Z be the random variable defined as

Z = φ(X1 , X2 , . . . , Xn )

Then, Z has finite expectation if


∫_{−∞}^∞ · · · ∫_{−∞}^∞ |φ(x1, x2, . . . , xn)| f(x1, x2, . . . , xn) dx1 dx2 . . . dxn < ∞

and in that case,


EZ = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ φ(x1, x2, . . . , xn) f(x1, x2, . . . , xn) dx1 dx2 . . . dxn

(End of Day 24)

2. Moments of Continuous Random Variables


Definition 2.1. Let X be a random variable with density function f and mean µ, and
let m ∈ N.

(i) If X m has finite expectation, then the mth moment of X is defined as
E(X^m) = ∫_{−∞}^∞ x^m f(x) dx.

(ii) The mth central moment of X is defined as


E(X − µ)^m = ∫_{−∞}^∞ (x − µ)^m f(x) dx.

(iii) The variance of X is


Var(X) = E(X − µ)² = ∫_{−∞}^∞ (x − µ)² f(x) dx.

(iv) The standard deviation of X is σ := √(Var(X)).
Remark 2.2.
(i) As in the discrete case, σ = 0 if and only if X is constant.

(ii) As in the discrete case,

Var(X) = E(X 2 ) − E(X)2 .

Example 2.3. If X ∼ Γ(α, λ), then we find the moments and variance of X.

Solution: For m ≥ 1,
E(X^m) = (λ^α/Γ(α)) ∫_0^∞ x^{m+α−1} e^{−λx} dx
       = (λ^α/Γ(α)) · Γ(α + m)/λ^{α+m}
       = α(α + 1) . . . (α + m − 1)/λ^m
By the earlier calculation, E(X) = α/λ, so the variance is given by

σ² = E(X²) − E(X)² = α(α + 1)/λ² − α²/λ² = α/λ².

Example 2.4. If X ∼ Exp(λ) = Γ(1, λ), then


µ = 1/λ and σ² = 1/λ².

Remark 2.5. Suppose X is a random variable with density function f that is symmetric,
i.e.
f (−x) = f (x)
Then, X and −X have the same density function. In particular, if m ∈ N is odd, then

E(X m ) = E((−X)m ) = −E(X m ) ⇒ E(X m ) = 0.

Example 2.6. Suppose X ∼ n(µ, σ 2 ), then we calculate the mean and variance of X.

Solution: Note that (X − µ) ∼ n(0, σ 2 ). This is a symmetric density function, so in


particular,
E(X − µ) = 0 ⇒ E(X) = µ.
Now if Y = (X − µ)2 , then by Remark V.3.6,

Y ∼ Γ(1/2, 1/2σ 2 ).

Therefore,
Var(X) = EY = (1/2)/(1/(2σ²)) = σ².

Definition 2.7. Suppose X and Y have joint density function f : R2 → R, means µX


and µY and finite second moments. Then the covariance of X and Y is defined as
Cov(X, Y) = E[(X − µX)(Y − µY)] = ∫_{−∞}^∞ ∫_{−∞}^∞ (x − µX)(y − µY) f(x, y) dx dy

Remark 2.8.

(i) As before,

Cov(X, Y ) = E(XY − µX Y − µY X + µX µY ) = E(XY ) − E(X)E(Y ).

(ii) If X and Y are independent, then f(x, y) = fX(x)fY(y) for all (x, y) ∈ R². Therefore,

E(XY) = ∫_{−∞}^∞ ∫_{−∞}^∞ xy fX(x) fY(y) dy dx
      = ∫_{−∞}^∞ x fX(x) (∫_{−∞}^∞ y fY(y) dy) dx
      = µY µX

Hence, Cov(X, Y ) = 0.

Example 2.9. Consider the function f : R2 → R given by

f(x, y) = (√3/(4π)) e^{−[x²−xy+y²]/2}

Then, by Example VI.1.8, X ∼ n(0, 4/3) and Y ∼ n(0, 4/3). We now calculate
Cov(X, Y ).

Solution: We know from Example 2.6 that µX = µY = 0. Now,


E(XY) = (√3/(4π)) ∫_{−∞}^∞ ∫_{−∞}^∞ xy e^{−[x²−xy+y²]/2} dy dx
      = (√3/(4π)) ∫_{−∞}^∞ ∫_{−∞}^∞ xy e^{−3x²/8} e^{−(y−x/2)²/2} dy dx
      = (√3/(4π)) ∫_{−∞}^∞ x e^{−3x²/8} (∫_{−∞}^∞ y e^{−(y−x/2)²/2} dy) dx

Now,
∫_{−∞}^∞ y e^{−(y−x/2)²/2} dy = ∫_{−∞}^∞ (u + x/2) e^{−u²/2} du
                              = ∫_{−∞}^∞ u e^{−u²/2} du + (x/2) ∫_{−∞}^∞ e^{−u²/2} du
                              = (x/2)√(2π).
Therefore,

E(XY) = (√(6π)/(8π)) ∫_{−∞}^∞ x² e^{−3x²/8} dx.

Now if Z ∼ n(0, 4/3), then E(Z²) = Var(Z) = 4/3, and since the density of Z is
(√3/(2√(2π))) e^{−3x²/8}, we get

∫_{−∞}^∞ x² e^{−3x²/8} dx = (2√(2π)/√3) E(Z²) = 8√(2π)/(3√3).

Hence, a short calculation shows that

E(XY) = (√(6π)/(8π)) · 8√(2π)/(3√3) = 2/3 = Cov(X, Y).

In particular, the correlation coefficient is ρ(X, Y) = (2/3)/(4/3) = 1/2.
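As a sanity check, one can simulate from this joint density by drawing X ∼ n(0, 4/3) and then Y given X = x from n(x/2, 1) (the conditional density found in Example 3.4), and estimating the covariance empirically:

```python
import random

random.seed(5)
n = 100000
pairs = []
for _ in range(n):
    x = random.gauss(0, (4 / 3) ** 0.5)  # X ~ n(0, 4/3)
    y = random.gauss(x / 2, 1)           # Y | X = x ~ n(x/2, 1), as in Example 3.4
    pairs.append((x, y))

mx = sum(x for x, _ in pairs) / n
my = sum(y for _, y in pairs) / n
cov_hat = sum((x - mx) * (y - my) for x, y in pairs) / n  # estimate of Cov(X, Y)
```

With a large sample the estimate settles near 2/3, and the correlation coefficient is then ρ(X, Y) = (2/3)/(4/3) = 1/2.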

3. The Central Limit Theorem


Definition 3.1. X1 , X2 , . . . , Xn are said to be i.i.d. if they are independent and identi-
cally distributed.

We will always assume that they have finite mean µ and variance σ 2 .
Remark 3.2. Define
Sn := X1 + X2 + . . . + Xn
Then,
E(Sn ) = nµ and Var(Sn ) = nσ 2 .
Therefore, if
Sn* := (Sn − nµ)/(σ√n)
then Sn∗ has mean 0 and variance 1.

Question: What is the distribution function of this random variable Sn∗ ?


Example 3.3.
(i) If X1 ∼ n(µ, σ 2 ), then Sn ∼ n(nµ, nσ 2 ). Therefore,

Sn∗ ∼ n(0, 1).

(ii) If X1 ∼ Po(λ), then µ = σ 2 = λ, so


Sn* = (Sn − nλ)/√(nλ).
(iii) If X1 ∼ Bern(p) for some 0 < p < 1, then Sn ∼ B(n, p). Hence,
 
P(Sn = k) = (n choose k) p^k (1 − p)^{n−k}.

Here, µ = p and σ² = p(1 − p), so

Sn* = (Sn − np)/√(np(1 − p)).
De Moivre (ca 1700) and Laplace (ca 1800) proved that in this case,

P (Sn∗ ≤ x) ≈ Φ(x)

where Φ is the standard normal distribution. This was generalized to its current
form by Lindeberg in 1922.

(End of Day 25)

Theorem 3.4. Let X1 , X2 , . . . , Xn be i.i.d random variables with mean µ and finite
non-zero variance σ 2 . If Sn = X1 + X2 + . . . + Xn , then for any x ∈ R,
 
lim_{n→∞} P((Sn − nµ)/(σ√n) ≤ x) = Φ(x).

a. Normal Approximation
Remark 3.5. For n large, the Central Limit Theorem says that
 
P((Sn − nµ)/(σ√n) ≤ x) ≈ Φ(x).

Rewriting this, we get

P(Sn ≤ x) ≈ Φ((x − nµ)/(σ√n)).
This is called a normal approximation formula.
Example 3.6. Suppose the length of life of a light bulb is exponentially distributed with
mean µ = 10 days. As soon as one bulb burns out, another is installed in its place. Find
the probability that more than 50 light bulbs will be needed in a year.

Solution: Let Xn denote the length of life of the nth bulb. We assume that X1 , X2 , . . .
are all independent with Xi ∼ Exp(λ) where λ = 1/10 (so that µ = 10 and σ 2 = 100).
Then, if
Sn := X1 + X2 + . . . + Xn
then we wish to calculate P (S50 < 365). Now, by the normal approximation formula,
 
P(S50 < 365) ≈ Φ((365 − 50µ)/(σ√50))
            = Φ((365 − 500)/(10√50))
            = Φ(−1.91) = 0.028.

Therefore, it is very unlikely we will need more than 50 lightbulbs.
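A simulation sketch of this example: S50 is a sum of 50 independent Exp(1/10) lifetimes, so we can estimate P(S50 < 365) directly. The simulated value typically comes out somewhat below 0.028, since S50 (a gamma variable) is skewed and the normal approximation is not exact.

```python
import random

random.seed(6)
trials = 40000
hits = 0
for _ in range(trials):
    s50 = sum(random.expovariate(1 / 10) for _ in range(50))  # lifetimes, mean 10 days
    if s50 < 365:
        hits += 1

p_hat = hits / trials  # normal approximation above gives about 0.028
```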

Example 3.7. An instructor has 60 exams to grade in sequence. The time required to
grade each exam is independent and identically distributed with mean 15 minutes and
standard deviation 2 minutes. Approximate the probability that she will grade all the
exams
(i) in the first 14 hours?
(ii) in the first 16 hours?

Solution: Let Xi be the time taken to grade the ith exam, then we are interested in
S60 = ∑_{i=1}^{60} Xi.

(i) For part (i), we are interested in
 
P(S60 ≤ 840) ≈ Φ((840 − (60 × 15))/(2√60))
            = Φ(−30/√60)
            = Φ(−3.87)

Note that Φ is an increasing function, and Φ(−3.87) ≈ 0. Hence,

P(S60 ≤ 840) ≈ 0.

Hence, she will almost certainly not finish in 14 hours.

(ii) However, for part (ii), we are interested in

P(S60 ≤ 960) ≈ Φ((960 − 900)/(2√60)) = Φ(+3.87) ≈ 1

because Φ(3.87) ≈ 1.

Hence, she will almost certainly be done in 16 hours.

Note: In practice, X1 will tend to be much larger than X60 because it gets easier to
grade as you go along. Therefore, the Xi are most certainly not identically distributed!

b. Additional Material
Here is some additional material to help understand the Central Limit Theorem:
(i) YouTube video: https://www.youtube.com/watch?v=jvoxEYmQHNM
(ii) Geogebra applet: https://www.geogebra.org/m/xqvcg8sm
(iii) Geogebra applet: https://www.geogebra.org/m/n4SujFmy

c. Random Sampling
Remark 3.8. Consider a probability space (Ω, A, P) and a random variable X : Ω → R. We think of the values x := X(ω) taken by X as outcomes of an experiment. Let F be the distribution function of X. In practice, we often do not know what F is.

(i) To understand F better, a statistician would obtain n independent observations on X, i.e., she would obtain n values x1, x2, …, xn assumed by X. Each xi is regarded as a value assumed by a random variable Xi, 1 ≤ i ≤ n, where X1, X2, …, Xn are independent RVs with common distribution F. The observed values (x1, x2, …, xn) are then taken to be values of (X1, X2, …, Xn).

(ii) The set {X1, X2, …, Xn} is called a sample of size n taken from a population distribution F.

(iii) The set {x1, x2, …, xn} is called a realization of the sample.

(iv) A simple random sample is one choice of tuple (x1, x2, …, xn) which is chosen without replacement. Note: In this case, the variables X1, X2, …, Xn are not independent, but for large populations and small sample sizes, it is not very different from sampling with replacement (in which case they are independent).

(v) In practice, one often observes not the values {x1, x2, …, xn} but a single value φ(x1, x2, …, xn). The random variable Z := φ(X1, X2, …, Xn) is called a (sample) statistic, provided it does not depend on any unknown parameters.

(vi) For example,

    X̄ := (X1 + X2 + … + Xn)/n

is called the sample mean, and

    S² := Σ_{i=1}^n (Xi − X̄)²/(n − 1) = ( Σ_{i=1}^n Xi² − nX̄² )/(n − 1)

is called the sample variance. S is called the sample standard deviation.

Example 3.9. Suppose X ∼ Bern(p) for some 0 < p < 1, which is possibly unknown. If five independent observations are {0, 1, 1, 0, 0}, then this is a realization of the sample {X1, X2, …, X5}. The sample mean is

    x̄ = 2/5,

which is the value assumed by the RV X̄, and the sample variance is

    s² = ( Σ_{i=1}^5 xi² − 5x̄² )/4 = (2 − 5(0.16))/4 = 0.3,

which is the value assumed by the RV S².
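As a sketch (assuming Python; not part of the notes), the sample mean and sample variance of this realization can be computed directly. The standard library's `statistics.variance` uses the same n − 1 denominator.

```python
import statistics

realization = [0, 1, 1, 0, 0]          # the five Bernoulli observations
n = len(realization)

sample_mean = sum(realization) / n     # x̄ = 2/5
# sample variance with the n − 1 denominator, as in Remark 3.8(vi)
sample_var = sum((x - sample_mean) ** 2 for x in realization) / (n - 1)
```

Here `sample_mean` is 0.4 and `sample_var` is 0.3, matching the hand computation.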

We may now rephrase the Central Limit theorem as follows.

Theorem 3.10. For sufficiently large n, the distribution of the sample mean X approx-
imates a normal distribution (regardless of the original distribution F ).
Remark 3.11. Fix c > 0 and consider

    P( |Sn/n − µ| ≥ c ) = P(Sn ≤ nµ − nc) + P(Sn ≥ nµ + nc)
                        ≈ Φ( −nc/(σ√n) ) + 1 − Φ( nc/(σ√n) )
                        = 2( 1 − Φ( c√n/σ ) )

Hence, if δ := c√n/σ, then

    P( |Sn/n − µ| ≥ c ) ≈ 2(1 − Φ(δ))

This is also called a normal approximation formula.

(End of Day 26)

Example 3.12. An astronomer measures the distance d to a star in light years. He believes that each measurement is independent and identically distributed with common mean d and variance 4 (so σ = 2 light years). How many measurements does he need to make to be 95% sure that his estimated distance (the sample mean) is accurate to within ±0.1 light years?

Solution: We wish to find n to ensure that

    P( |Sn/n − d| ≥ 0.1 ) < 0.05

We use the normal approximation formula to find n so that

    2(1 − Φ(δ)) = 0.05,   where   δ = c√n/σ = 0.1√n/2 = √n/20.

In other words,

    Φ(√n/20) = 1 − 0.025 = 0.975

From the Standard Normal table, we see that

    √n/20 = 1.96 ⇒ n = (39.2)² ≈ 1536.64.

Hence, he should make 1537 observations.

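The required n can be computed mechanically. The sketch below (an illustration, not part of the notes) solves c√n/σ ≥ 1.96 for n and rounds up.

```python
import math

z = 1.96      # Φ(z) = 0.975, from the standard normal table
sigma = 2.0   # standard deviation of one measurement (variance 4)
c = 0.1       # desired accuracy in light years

# c·√n/σ ≥ z  ⇔  n ≥ (z·σ/c)²
n_required = math.ceil((z * sigma / c) ** 2)
```

With these values, `n_required` is 1537.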
d. Understanding the CLT
Remark 3.13. Suppose you have 10,000 people, each weighing between 30-100 kilograms. If you had all their weights, you could take an average to find the population mean.

(i) If you don't have all that data (or you cannot process so much), you could estimate the population mean by the following steps:
    • Choose a group of 30 and take the average weight. This is the sample mean.
    • Repeat this process 50 times. Each time you choose a group of 30 people and take an average, you get a different sample mean.
    • Take the average of all these 50 values. This number will be close to the population mean.

(ii) You may also tabulate all the 50 sample means and draw a histogram. This histogram will resemble a bell curve.
    • The mean of this bell curve is close to the population mean.
    • The standard deviation of this bell curve, multiplied by √30, is close to the population standard deviation.

For part (i), you do not need the CLT. The Weak Law of Large Numbers assures you of it already. For part (ii), however, you do need the CLT.
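The repeated-sampling procedure above can be sketched in a few lines. The population of weights here is hypothetical (uniform on 30–100 kg), chosen only so the example runs; it is not part of the notes.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical population: 10,000 weights between 30 and 100 kg
population = [random.uniform(30, 100) for _ in range(10_000)]
pop_mean = statistics.mean(population)
pop_sd = statistics.pstdev(population)

# 50 sample means, each from a simple random sample of size 30
sample_means = [statistics.mean(random.sample(population, 30)) for _ in range(50)]

grand_mean = statistics.mean(sample_means)                    # close to pop_mean
sd_estimate = statistics.stdev(sample_means) * math.sqrt(30)  # close to pop_sd
```

Drawing a histogram of `sample_means` shows the bell shape promised by the CLT.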

Remark 3.14. What the CLT does not say:

(i) It does not say anything about any individual's weight. That is still a random phenomenon.

(ii) It does not answer questions like "What fraction of people weigh more than 60 kg?", i.e., what is P(X > 60), where X is the random variable that assigns each person their weight? To answer this, we need the probability distribution of X, which we do not have!

(iii) The number 30 chosen above may not be good enough, depending on your population. The vague term 'close to' used above is also not precise unless you know something about the distribution of X.
VIII. Moment Generating Functions and Characteristic Functions

1. Moment Generating Function
Remark 1.1. If X is a random variable and t ∈ R, then Y := e^{tX} is a random variable given by

    Y(ω) := e^{tX(ω)} = Σ_{n=0}^∞ tⁿ X(ω)ⁿ/n!.

Note that this series always converges, so Y is well-defined. It may or may not have finite expectation though.
Definition 1.2. The moment generating function of a random variable X is defined as

    MX(t) := E(e^{tX}),

which is defined for all t ∈ R such that e^{tX} has finite expectation.
Example 1.3. If X ∼ Bern(p), then

    MX(t) = E(e^{tX}) = e^{t·0} P(X = 0) + e^{t·1} P(X = 1) = (1 − p) + p e^t
Remark 1.4. Suppose X is a random variable taking only non-negative integer values; then the probability generating function (see Definition III.5.2) is given by

    ΦX(t) = Σ_{n=0}^∞ tⁿ P(X = n)

Hence,

    MX(t) = E(e^{tX}) = Σ_{n=0}^∞ e^{tn} P(X = n) = ΦX(e^t).

Example 1.5. If X ∼ B(n, p), then we know that the probability generating function is given by

    ΦX(t) = (pt + 1 − p)ⁿ

Hence,

    MX(t) = (p e^t + 1 − p)ⁿ.
Example 1.6. If X ∼ Po(λ), then we had computed in Example III.5.5 that

    ΦX(t) = e^{λ(t−1)}

Therefore,

    MX(t) = e^{λ(e^t − 1)}.
Example 1.7. Suppose X ∼ n(µ, σ²); then

    MX(t) = E(e^{tX}) = (1/(σ√(2π))) ∫_{−∞}^∞ e^{tx} e^{−(x−µ)²/2σ²} dx
          = (1/(σ√(2π))) ∫_{−∞}^∞ e^{t(y+µ)} e^{−y²/2σ²} dy      (substituting y = x − µ)
          = (e^{µt}/(σ√(2π))) ∫_{−∞}^∞ e^{ty − y²/2σ²} dy

Now note that

    ty − y²/2σ² = −(y − σ²t)²/2σ² + σ²t²/2

Hence,

    MX(t) = (e^{µt} e^{σ²t²/2}/(σ√(2π))) ∫_{−∞}^∞ e^{−(y−σ²t)²/2σ²} dy = e^{µt + σ²t²/2}
Example 1.8. If X ∼ Γ(α, λ), then

    MX(t) = (λ^α/Γ(α)) ∫_0^∞ e^{tx} x^{α−1} e^{−λx} dx
          = (λ^α/Γ(α)) ∫_0^∞ x^{α−1} e^{−(λ−t)x} dx
          = (λ^α/Γ(α)) · Γ(α)/(λ − t)^α

provided −∞ < t < λ. If t ≥ λ, then this integral diverges. Hence,

    MX(t) = ( λ/(λ − t) )^α,   −∞ < t < λ.
Lemma 1.9.
(i) If c ∈ R is a scalar and X is a random variable, then

    McX(t) = MX(ct).

(ii) If X and Y are independent, then

    MX+Y(t) = MX(t) MY(t).

(iii) If X1, X2, …, Xn are i.i.d. and Sn = X1 + X2 + … + Xn, then

    MSn(t) = (MX1(t))ⁿ.

Proof.
(i) Follows from the definition.

(ii) Here, e^{tX} and e^{tY} are also independent, so

    MX+Y(t) = E(e^{t(X+Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = MX(t) MY(t).

(iii) Follows from part (ii) by induction.

Theorem 1.10. Suppose X is a random variable such that MX(t) is finite for all t ∈ (−δ, δ) for some δ > 0. Then for each n ∈ N,

    E(Xⁿ) = (dⁿ/dtⁿ) MX(t) |_{t=0}

Proof. Consider the series

    MX(t) = E(e^{tX}) = E( Σ_{n=0}^∞ tⁿXⁿ/n! ) = Σ_{n=0}^∞ (tⁿ/n!) E(Xⁿ) = 1 + tE(X) + (t²/2!)E(X²) + …

Differentiating term by term and plugging in t = 0 gives the desired result. All this works because the series converges absolutely by assumption.
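Theorem 1.10 can be checked numerically with finite differences. The sketch below (an illustration with Bern(0.3), not part of the notes) approximates M′X(0) and M″X(0); both should equal p, since E(X) = E(X²) = p for a Bernoulli variable.

```python
import math

p = 0.3

def M(t):
    # MGF of Bern(p): (1 − p) + p·e^t, from Example 1.3
    return (1 - p) + p * math.exp(t)

h = 1e-4
first_moment = (M(h) - M(-h)) / (2 * h)              # central difference ≈ M′(0) = E(X)
second_moment = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # second difference ≈ M″(0) = E(X²)
```

Both approximations come out very close to 0.3, as the theorem predicts.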
Example 1.11. If X ∼ n(0, σ²), then

    MX(t) = e^{σ²t²/2} = Σ_{n=0}^∞ σ^{2n} t^{2n}/(2ⁿ n!)

Hence,

    E(X^m) = 0

whenever m is odd. Moreover, if m = 2n is even, comparing coefficients of t^{2n} gives

    E(X^{2n}) = σ^{2n} (2n)!/(2ⁿ n!).

In particular, E(X²) = σ² (as we already know from Example VII.2.6).
2. Characteristic functions
Remark 2.1.
(i) A complex number is one of the form z = x + iy, where x, y ∈ R and i² = −1. Write C for the set of all complex numbers. For z ∈ C, define

    |z| = (x² + y²)^{1/2},

and the distance between two complex numbers z1 and z2 is |z1 − z2|.

(ii) For z ∈ C, define

    e^z = Σ_{n=0}^∞ zⁿ/n!.

This series converges absolutely for each z ∈ C, so defines a function exp : C → C. Note that

    e^{z1+z2} = e^{z1} e^{z2}

Moreover, for any t ∈ R,

    e^{it} = cos(t) + i sin(t).

Then, |e^{it}| = 1 for all t ∈ R.

(iii) Since cos(−t) = cos(t) and sin(−t) = −sin(t), we have

    e^{−it} = cos(t) − i sin(t).

Thus,

    cos(t) = (e^{it} + e^{−it})/2,   and   sin(t) = (e^{it} − e^{−it})/2i
(iv) If h(t) = f(t) + i g(t) is a complex valued function, then

    h′(t) = f′(t) + i g′(t)

provided f and g are differentiable. Similarly,

    ∫ h(t) dt = ∫ f(t) dt + i ∫ g(t) dt

provided the integrals exist.

(v) In particular, for c ∈ C,

    (d/dt) e^{ct} = c e^{ct}   and   ∫_a^b e^{ct} dt = (e^{bc} − e^{ac})/c.
(vi) Let (Ω, A, P) be a probability space. A complex valued function Z : Ω → C can be expressed in the form

    Z(ω) = X(ω) + iY(ω)

where X and Y are real-valued. We say that Z is a (complex) random variable if both X and Y are random variables.

(vii) We say that Z has finite expectation if X and Y both have finite expectation, which is equivalent to requiring that E(|Z|) < ∞. In that case, we define the expectation of Z as

    EZ := E(X) + iE(Y).

(viii) Note that expectation is linear in this case as well:

    E(a1 Z1 + a2 Z2) = a1 E(Z1) + a2 E(Z2)

whenever a1, a2 ∈ C and Z1, Z2 are complex random variables with finite expectation.

(ix) Given a real-valued random variable X, consider a new random variable Y : Ω → C given by

    Y(ω) := e^{itX(ω)} = Σ_{n=0}^∞ (itX(ω))ⁿ/n!

Definition 2.2. The characteristic function of X is defined as φX : R → C by

    φX(t) := E(e^{itX})

(End of Day 27)

Remark 2.3.
(i) Whenever MX(t) is defined, we have

    φX(t) = MX(it).

However, φX is defined even when MX may not be. Indeed, |e^{itX(ω)}| = 1 for all ω ∈ Ω, so that

    |φX(t)| ≤ E(1) = 1

for all t ∈ R.
(ii) If X has density function fX : R → R, then

    φX(t) = ∫_{−∞}^∞ e^{itx} fX(x) dx

This is a sum if X is a discrete random variable.

(iii) Moreover,

    φX(0) = E(e^{it·0}) = 1

(iv) If c ∈ R, then

    φcX(t) = E(e^{itcX}) = φX(ct).

(v) If X and Y are independent random variables, then

    φX+Y(t) = E(e^{it(X+Y)}) = E(e^{itX} e^{itY}) = E(e^{itX}) E(e^{itY}) = φX(t) φY(t)

(vi) If X = a is a constant random variable, then

    φX(t) = E(e^{ita}) = e^{ita}

for all t ∈ R.

(vii) Hence, if X is a random variable and a, b ∈ R, then

    φa+bX(t) = e^{ita} φX(bt)

for all t ∈ R.

Example 2.4. If X ∼ B(n, p), then

    MX(t) = (p e^t + 1 − p)ⁿ

Hence,

    φX(t) = (p e^{it} + 1 − p)ⁿ

Example 2.5. Similarly, if X ∼ Po(λ), then

    φX(t) = MX(it) = e^{λ(e^{it} − 1)}

Example 2.6. If X ∼ U(a, b), then

    fX(x) = 1/(b − a) if a < x < b, and 0 otherwise.
Hence,

    φX(t) = ∫_{−∞}^∞ e^{itx} fX(x) dx = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{ibt} − e^{iat})/(it(b − a))

Example 2.7. If X ∼ n(µ, σ²), then we had seen that

    MX(t) = e^{µt + σ²t²/2}.

Hence,

    φX(t) = e^{itµ − σ²t²/2}.
Example 2.8. If X ∼ Exp(λ), then

    fX(x) = λe^{−λx} if x ≥ 0, and 0 if x < 0.

Hence,

    φX(t) = λ ∫_0^∞ e^{itx − λx} dx = (λ/(it − λ)) e^{(it−λ)x} |_0^∞ = λ/(λ − it)

Note that unlike MX(t) in Example 1.8, this is defined for every t ∈ R.
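The closed form λ/(λ − it) can be sanity-checked against a truncated Riemann sum of the defining integral. This is a rough numerical sketch (not part of the notes), using only the standard library's complex exponential.

```python
import cmath

lam, t = 1.0, 0.7
closed_form = lam / (lam - 1j * t)

# Midpoint Riemann sum of ∫₀^∞ e^{itx} · λe^{−λx} dx, truncated at x = 30
dx = 1e-3
numeric = sum(
    cmath.exp((1j * t - lam) * ((k + 0.5) * dx)) * lam * dx
    for k in range(int(30 / dx))
)
```

The truncation at x = 30 costs only about e^{−30}, so the two values agree to several decimal places.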
Theorem 2.9. If X1, X2, …, Xn are i.i.d. random variables and Sn = X1 + X2 + … + Xn, then

    φSn(t) = (φX1(t))ⁿ

Proof. Exactly as in Lemma 1.9.

Remark 2.10. Note that, when X has moments of all orders and the series below converges absolutely,

    φX(t) = Σ_{n=0}^∞ (iⁿ E(Xⁿ)/n!) tⁿ

By differentiating term by term, we get

    φ′X(0) = iE(X)

More generally,

    φX^{(n)}(0) = iⁿ E(Xⁿ)

This can be used to compute the higher moments of X exactly as in Theorem 1.10.
3. Inversion Formulas and the Continuity Theorem
Remark 3.1.
(i) Recall that if X is a non-negative integer valued discrete random variable, then we defined the probability generating function as

    ΦX(t) = Σ_{n=0}^∞ tⁿ fX(n).

This can then be used to calculate the individual probabilities fX(n) = P(X = n) by repeated differentiation (see Remark III.5.3).

(ii) Now suppose X is any integer-valued random variable; then the characteristic function is

    φX(t) = Σ_{n=−∞}^∞ e^{itn} fX(n)

We wish to use it to recover the values fX(n).


Lemma 3.2. For any integers j, k ∈ Z,

    (1/2π) ∫_{−π}^π e^{i(j−k)t} dt = 1 if j = k, and 0 if j ≠ k.

Proof. If j = k, then e^{i(j−k)t} = 1 for all t ∈ [−π, π], so

    (1/2π) ∫_{−π}^π e^{i(j−k)t} dt = 1.

If j ≠ k, then

    ∫_{−π}^π e^{i(j−k)t} dt = e^{i(j−k)t}/(i(j−k)) |_{−π}^π
                            = (e^{i(j−k)π} − e^{−i(j−k)π})/(i(j−k))
                            = 2i sin((j−k)π)/(i(j−k))
                            = 0

because sin(mπ) = 0 for all m ∈ Z. Hence the result.

Theorem 3.3. Let X be an integer-valued random variable. Then, for each k ∈ Z,

    fX(k) = (1/2π) ∫_{−π}^π φX(t) e^{−ikt} dt
This is called the inversion formula.

Proof. By definition,

    φX(t) = Σ_{n=−∞}^∞ e^{itn} fX(n).

Therefore,

    (1/2π) ∫_{−π}^π φX(t) e^{−ikt} dt = (1/2π) ∫_{−π}^π e^{−ikt} Σ_{n=−∞}^∞ e^{itn} fX(n) dt
                                      = Σ_{n=−∞}^∞ fX(n) · (1/2π) ∫_{−π}^π e^{it(n−k)} dt
                                      = fX(k)

by Lemma 3.2.
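The inversion formula can be verified numerically. The sketch below (assuming Python; not part of the notes) recovers P(X = 3) for X ∼ Po(2) from φX(t) = e^{λ(e^{it} − 1)} by a midpoint rule on [−π, π].

```python
import cmath
import math

lam, k = 2.0, 3

def phi(t):
    # characteristic function of Po(λ), from Example 2.5
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

# Midpoint rule for (1/2π) ∫_{−π}^{π} φ(t) e^{−ikt} dt
m = 4000
dt = 2 * math.pi / m
total = 0 + 0j
for j in range(m):
    t = -math.pi + (j + 0.5) * dt
    total += phi(t) * cmath.exp(-1j * k * t) * dt
recovered = total.real / (2 * math.pi)

exact = math.exp(-lam) * lam ** k / math.factorial(k)   # P(X = 3) = e^{−2}·2³/3!
```

Because the integrand is smooth and periodic, the midpoint rule converges extremely fast here, and `recovered` matches `exact` to many decimal places.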
The following is an analogue for continuous random variables.

Definition 3.4. We say that a function g : R → R is integrable if

    ∫_{−∞}^∞ |g(t)| dt < ∞.

Theorem 3.5 (Inversion Formula). Suppose that X is a continuous random variable such that φX is integrable. Then for each x ∈ R, we have

    fX(x) = (1/2π) ∫_{−∞}^∞ e^{−ixt} φX(t) dt.

Example 3.6. Suppose X is a random variable such that

    φX(t) = e^{−σ²t²/2}

This function is integrable, so the density function is given by

    fX(x) = (1/2π) ∫_{−∞}^∞ e^{−ixt} e^{−σ²t²/2} dt = (1/2π) ∫_{−∞}^∞ e^{−ixt − σ²t²/2} dt

Note that

    ixt + σ²t²/2 = (σt + ix/σ)²/2 + x²/2σ²

Hence,

    fX(x) = (e^{−x²/2σ²}/2π) ∫_{−∞}^∞ e^{−(σt + ix/σ)²/2} dt
          = (e^{−x²/2σ²}/(2πσ)) ∫_{−∞}^∞ e^{−y²/2} dy      (substituting y = σt + ix/σ)
          = (1/(σ√(2π))) e^{−x²/2σ²}

Hence, X ∼ n(0, σ²).
Remark 3.7. Suppose X and Y are two random variables such that

    φX(t) = φY(t)

for all t ∈ R, and suppose we know that φX is integrable. Then, by Theorem 3.5, we have

    fX(x) = fY(x)

for all x ∈ R. Hence, X ∼ Y. The next theorem tells us that this result is true even if we do not assume that φX is integrable.

Theorem 3.8 (Uniqueness Theorem). If X and Y are two random variables with the same characteristic function, then X and Y have the same distribution function.

(End of Day 28)

Example 3.9. Suppose X ∼ n(µ1, σ1²) and Y ∼ n(µ2, σ2²) are two independent normal random variables. Then,

    φX(t) = e^{iµ1t − σ1²t²/2}   and   φY(t) = e^{iµ2t − σ2²t²/2}.

Since they are independent, we know that

    φX+Y(t) = φX(t) φY(t) = e^{i(µ1+µ2)t − (σ1²+σ2²)t²/2}.

By the uniqueness theorem, it follows that

    X + Y ∼ n(µ1 + µ2, σ1² + σ2²).

This was the conclusion of Theorem VI.2.3.
Theorem 3.10 (Continuity Theorem). Let (Xn) be a sequence of random variables and X be a random variable such that

    lim_{n→∞} φXn(t) = φX(t)

for each t ∈ R. Then,

    lim_{n→∞} FXn(x) = FX(x)

for each x ∈ R at which FX is continuous.

Proof. Assume that all the characteristic functions are (uniformly) integrable. Then, by a convergence theorem from analysis, we can say that, for each x ∈ R,

    lim_{n→∞} ∫_{−∞}^∞ e^{−itx} φXn(t) dt = ∫_{−∞}^∞ e^{−itx} φX(t) dt.

By the inversion formula, we conclude that

    lim_{n→∞} fXn(x) = fX(x).

Again, by the same theorem, we see that

    lim_{n→∞} FXn(x) = lim_{n→∞} ∫_{−∞}^x fXn(y) dy = ∫_{−∞}^x fX(y) dy = FX(x).

If the characteristic functions are not integrable, one must instead apply this argument to the functions φXn(t) e^{−c²t²/2} (see [HPS] for the details).

4. The Weak Law of Large Numbers and the Central Limit Theorem

Remark 4.1.
(i) Let z ∈ C be a complex number such that |z − 1| < 1; then we can define

    log(z) = (z − 1) − (z − 1)²/2 + (z − 1)³/3 − … = Σ_{n=1}^∞ (−1)^{n−1} (z − 1)ⁿ/n

Then, one sees that
    • e^{log(z)} = z for all z ∈ C with |z − 1| < 1.
    • log(1) = 0.
    • If h : (a, b) → C is a complex valued function on (a, b) such that |h(t) − 1| < 1 for all t ∈ (a, b), then

        (d/dt) log(h(t)) = h′(t)/h(t)

(ii) Now let X be a random variable with characteristic function φX. Then φX(0) = 1 and φX is continuous. Hence, there is an interval (−δ, δ) on which log(φX(t)) is well-defined. Fix one such interval for now.

(iii) Suppose X has finite mean µ. Then φ′X(0) = iµ, so

    lim_{t→0} log(φX(t))/t = lim_{t→0} (log(φX(t)) − log(φX(0)))/(t − 0)
                           = (d/dt) log(φX(t)) |_{t=0}
                           = φ′X(0)/φX(0)
                           = iµ

Hence,

    lim_{t→0} (log(φX(t)) − iµt)/t = 0.

(iv) Now suppose X also has finite variance σ². Then φ″X(0) = −E(X²) = −(µ² + σ²). By L'Hôpital's rule (applied twice),

    lim_{t→0} (log(φX(t)) − iµt)/t² = lim_{t→0} (φ′X(t)/φX(t) − iµ)/2t
                                    = lim_{t→0} (φ′X(t) − iµφX(t))/(2t φX(t))
                                    = lim_{t→0} (φ′X(t) − iµφX(t))/2t      (since φX(t) → 1)
                                    = lim_{t→0} (φ″X(t) − iµφ′X(t))/2
                                    = (φ″X(0) − iµφ′X(0))/2
                                    = (−(µ² + σ²) − iµ(iµ))/2
                                    = (−µ² − σ² + µ²)/2
                                    = −σ²/2.

(v) Let Z : Ω → R denote the random variable Z(ω) = 0 for all ω ∈ Ω. Then P(Z = 0) = 1, so

    φZ(t) = e^{i·0·t} P(Z = 0) = 1.

Moreover, the distribution function of Z is

    FZ(x) = 0 if x < 0, and 1 if x ≥ 0.
Theorem 4.2 (Weak Law of Large Numbers). Let X1, X2, … be a sequence of i.i.d. random variables having finite mean µ. Set

    Sn := X1 + X2 + … + Xn

Then, for any ϵ > 0,

    lim_{n→∞} P( |Sn/n − µ| > ϵ ) = 0.

Proof. By Theorem 2.9,

    φSn(t) = (φX1(t))ⁿ

Hence, if Zn := Sn/n − µ, then by Remark 2.3,

    φZn(t) = e^{−iµt} (φX1(t/n))ⁿ

Fix t ∈ R and choose n ∈ N large enough that t/n is close enough to zero so that

    |φX1(t/n) − 1| = |φX1(t/n) − φX1(0)| < 1.

Then log(φX1(t/n)) is well-defined, and

    φZn(t) = exp[ n( log(φX1(t/n)) − iµ(t/n) ) ]

Now note that

    lim_{n→∞} n( log(φX1(t/n)) − iµ(t/n) ) = t · lim_{n→∞} ( log(φX1(t/n)) − iµ(t/n) )/(t/n) = 0

by part (iii) of Remark 4.1. Hence,

    lim_{n→∞} φZn(t) = exp(0) = 1.

This is true for any t ∈ R, so if Z denotes the random variable Z(ω) = 0 for all ω ∈ Ω, then

    lim_{n→∞} φZn(t) = φZ(t)

by part (v) of Remark 4.1. Therefore, by the Continuity Theorem (Theorem 3.10), at every continuity point x ≠ 0 of FZ,

    lim_{n→∞} FZn(x) = FZ(x).

Hence, for any ϵ > 0, we have

    lim_{n→∞} P(Zn < −ϵ) = lim_{n→∞} FZn(−ϵ) = FZ(−ϵ) = 0

and

    lim_{n→∞} P(Zn > ϵ) = 1 − lim_{n→∞} P(Zn ≤ ϵ) = 1 − lim_{n→∞} FZn(ϵ) = 1 − FZ(ϵ) = 0

Hence,

    lim_{n→∞} P(|Zn| > ϵ) = 0.

This proves the theorem.

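The conclusion of the theorem is easy to watch numerically. The sketch below (fair-die rolls, µ = 3.5; an illustration, not part of the notes) estimates P(|Sn/n − µ| > ϵ) for a small and a large n.

```python
import random

random.seed(2)
MU, EPS = 3.5, 0.2   # a fair six-sided die has mean 3.5

def tail_prob(n, trials=2000):
    # Monte Carlo estimate of P(|Sn/n − µ| > ε) for n die rolls
    bad = 0
    for _ in range(trials):
        sample_mean = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(sample_mean - MU) > EPS:
            bad += 1
    return bad / trials

p_small, p_large = tail_prob(10), tail_prob(1000)
```

With n = 10 the sample mean misses µ by more than 0.2 most of the time; with n = 1000 it almost never does, exactly as the Weak Law predicts.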
(End of Day 29)

Theorem 4.3 (Central Limit Theorem). Let X1, X2, … be a sequence of i.i.d. random variables with finite mean µ and finite non-zero variance σ². Then for any x ∈ R,

    lim_{n→∞} P( (Sn − nµ)/(σ√n) ≤ x ) = Φ(x)

Proof. Define

    Sn* = (Sn − nµ)/(σ√n)

and fix t ∈ R. By Remark 2.3,

    φSn*(t) = e^{−inµt/(σ√n)} φSn(t/(σ√n))
            = e^{−inµt/(σ√n)} (φX1(t/(σ√n)))ⁿ
            = exp[ −inµt/(σ√n) + n log(φX1(t/(σ√n))) ]

Consider, for t ≠ 0,

    lim_{n→∞} [ −inµt/(σ√n) + n log(φX1(t/(σ√n))) ]
        = (t²/σ²) lim_{n→∞} ( log(φX1(t/(σ√n))) − iµ(t/(σ√n)) )/(t/(σ√n))²
        = (t²/σ²)(−σ²/2)
        = −t²/2

by part (iv) of Remark 4.1; the same limit holds trivially when t = 0. Hence, we conclude that for any t ∈ R,

    lim_{n→∞} φSn*(t) = e^{−t²/2}

This is the characteristic function of Z ∼ n(0, 1). So by the Continuity Theorem (Theorem 3.10), we have that for any x ∈ R,

    lim_{n→∞} FSn*(x) = Φ(x).

Hence the result.


Example 4.4. Suppose a fair coin is tossed 100 times. Use a normal approximation to estimate the probability that there will be more than 60 heads.

Solution: Here, X ∼ Bern(1/2) represents a single coin toss, so we are interested in S100 = Σ_{i=1}^{100} Xi, where the Xi are i.i.d. random variables with Xi ∼ Bern(1/2). We wish to determine

    P(S100 > 60) = 1 − P(S100 ≤ 60)

We know that

    E(S100) = 100 E(X) = 50
    Var(S100) = 100 Var(X) = 100(1/2)(1 − 1/2) = 25.

So we use the normal approximation formula to estimate

    P(S100 > 60) ≈ 1 − Φ( (60 − 50)/√25 ) = 1 − Φ(2) = 0.0228.

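A quick numerical check of this example (a sketch, not part of the notes) compares the normal approximation with a direct simulation of 100 coin tosses:

```python
import math
import random

def std_normal_cdf(x):
    # Φ(x) via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Normal approximation: 1 − Φ((60 − 50)/√25)
approx = 1 - std_normal_cdf((60 - 50) / math.sqrt(25))

# Monte Carlo: toss 100 fair coins, count runs with more than 60 heads
random.seed(3)
trials = 20_000
count = sum(
    sum(random.randint(0, 1) for _ in range(100)) > 60
    for _ in range(trials)
)
estimate = count / trials
```

The simulated frequency is slightly below 0.0228 because the binomial probability P(S100 > 60) is discrete; a continuity correction would narrow the gap.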
Example 4.5. A drug is supposed to be 75% effective. It is tested on 100 people. Using a normal approximation, estimate the probability that at least 70 people will be cured.

Solution: Again, if X is the random variable such that X(ω) = 1 if ω is cured and X(ω) = 0 if not, then X ∼ Bern(0.75) by hypothesis. We let X1, X2, …, X100 denote the 100 experiments with Xi ∼ X, and set

    S100 = Σ_{i=1}^{100} Xi

We wish to determine P(S100 ≥ 70). We know that

    E(S100) = 100 E(X) = 75
    Var(S100) = 100 Var(X) = 100 × 0.75 × 0.25 = 18.75.

Hence, by the normal approximation formula,

    P(S100 ≥ 70) ≈ 1 − Φ( (70 − 75)/√18.75 ) = 1 − Φ(−1.15) ≈ 0.87.

Hence, there is an 87% chance that at least 70 people will be cured.

5. Review and Important Formulae

(i) Probability Measure:
    • (Sum Rule) If A1, A2, … are mutually disjoint, then P(⊔_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
    • P(Aᶜ) = 1 − P(A).
    • P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
    • P(B|A) = P(A ∩ B)/P(A).
    • Bayes' Rule: P(Aj|B) = P(Aj)P(B|Aj) / Σ_{i=1}^n P(Ai)P(B|Ai).
    • Independent Events: P(A ∩ B) = P(A)P(B).

(ii) Density function:
    • Discrete RV: fX(x) = P(X = x).
    • Continuous RV: fX(x) = F′X(x).

(iii) Distribution function: FX(x) = P(X ≤ x).

(iv) Probability of events related to a RV:
    • Discrete RV: P(X ∈ B) = Σ_{x∈B} P(X = x) = Σ_{x∈B} fX(x).
    • Continuous RV: P(X ∈ B) = ∫_B fX(x) dx.
    • Discrete RV: P(X ≤ x) = Σ_{t≤x} P(X = t).
    • Continuous RV: P(X ≤ x) = ∫_{−∞}^x fX(t) dt.

(v) Jointly distributed RVs:
    • P((X, Y) ∈ B) = ∬_B f(x, y) dy dx.
    • Marginal of X: fX(x) = ∫_{−∞}^∞ f(x, y) dy.
    • X and Y independent ⇔ f(x, y) = fX(x) fY(y).

(vi) Important Distributions:

    • Discrete:

      Name          Density fX(x)               Values           E(X)    Var(X)
      Bern(p)       pˣ(1−p)^{1−x}               {0, 1}           p       p(1−p)
      B(n, p)       C(n,x) pˣ(1−p)^{n−x}        {0, 1, …, n}     np      np(1−p)
      Po(λ)         e^{−λ} λˣ/x!                {0, 1, …}        λ       λ
      Geom(p)       p(1−p)^{x−1}                {1, 2, …}        1/p     (1−p)/p²
      Hyp(n, r, N)  C(r,x)C(N−r,n−x)/C(N,n)     {0, 1, …, n}     nr/N    n(r/N)(1 − r/N)(N−n)/(N−1)

    • Continuous:

      Name          Density fX(x)                      Values    E(X)       Var(X)
      U(a, b)       1/(b−a)                            (a, b)    (a+b)/2    (b−a)²/12
      Exp(λ)        λe^{−λx}                           [0, ∞)    1/λ        1/λ²
      Gamma(α, λ)   (λ^α/Γ(α)) x^{α−1} e^{−λx}         [0, ∞)    α/λ        α/λ²
      n(µ, σ²)      (1/(σ√(2π))) e^{−(x−µ)²/2σ²}       R         µ          σ²

(vii) Expectation:
    • Discrete RV: E(X) = Σ_x x P(X = x).
    • Continuous RV: E(X) = ∫_{−∞}^∞ x f(x) dx.
    • Function of RV: E(φ(X)) = ∫_{−∞}^∞ φ(x) f(x) dx.
    • Linearity: E(aX + bY) = aE(X) + bE(Y).
    • If X and Y are independent, then E(XY) = E(X)E(Y).
    • Markov Inequality: If X is non-negative, then P(X ≥ t) ≤ E(X)/t.
    • Chebyshev's Inequality: P(|X − µ| ≥ t) ≤ σ²/t².
(End of Day 30)

(viii) Variance and Covariance:
    • Var(X) = E(X²) − E(X)².
    • Var(a + bX) = b² Var(X).
    • Cov(X, Y) = E(XY) − E(X)E(Y).
    • If X and Y are independent, then Cov(X, Y) = 0.
    • Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

(ix) Probability Generating Functions: if the RV takes values in {0, 1, 2, …}.
    • ΦX(t) = Σ_{x=0}^∞ fX(x) tˣ (for |t| ≤ 1).
    • fX(n) = ΦX^{(n)}(0)/n!.
    • E(X) = Φ′X(1).
    • Var(X) = Φ″X(1) + Φ′X(1) − (Φ′X(1))².

(x) Moment Generating Functions: defined for t where e^{tX} has finite expectation.
    • MX(t) = E(e^{tX}).
    • E(Xⁿ) = MX^{(n)}(0).
    • If X and Y are independent, then MX+Y(t) = MX(t)MY(t).

(xi) Characteristic Function: defined for all t ∈ R.
    • φX(t) = E(e^{itX}).
    • φa+bX(t) = e^{ita} φX(bt).
    • If X and Y are independent, then φX+Y(t) = φX(t)φY(t).
    • Inversion Formula: If φX is integrable, then fX(x) = (1/2π) ∫_{−∞}^∞ e^{−ixt} φX(t) dt.
    • Uniqueness Theorem: If φX = φY, then X ∼ Y.
    • Continuity Theorem: If φXn(t) → φX(t) for all t ∈ R, then FXn(x) → FX(x) for all x ∈ R.

      Distribution   PGF ΦX(t)                              MGF MX(t)                    Char Fn φX(t)
      Bern(p)        tp + 1 − p                             ΦX(e^t)                      MX(it)
      B(n, p)        (tp + 1 − p)ⁿ                          ″                            ″
      Po(λ)          e^{λ(t−1)}                             ″                            ″
      Geom(p)        pt/(1 − t(1−p)) (for |t| < 1/(1−p))    ″                            ″
      U(a, b)        N/A                                    (e^{bt} − e^{at})/(t(b−a))   MX(it)
      Γ(α, λ)        N/A                                    (λ/(λ−t))^α (for t < λ)      ″ (for all t)
      n(µ, σ²)       N/A                                    e^{tµ + σ²t²/2}              ″
(xii) Weak Law of Large Numbers:

    lim_{n→∞} P( |Sn/n − µ| > ϵ ) = 0.

(xiii) Central Limit Theorem:

    lim_{n→∞} P( (Sn − nµ)/(σ√n) ≤ x ) = Φ(x)

where Φ is the distribution function for the standard normal.

(xiv) Normal Approximation Formulae:

    P(Sn ≤ x) ≈ Φ( (x − nµ)/(σ√n) )

    P( |Sn/n − µ| ≥ c ) ≈ 2( 1 − Φ(c√n/σ) )

(You will be given the standard normal table on an exam.)

(End of Day 31)

Bibliography

[HPS] Hoel, Port, Stone, Introduction to Probability Theory, Houghton Mifflin Co. (1971)

[Kroese] D. P. Kroese, A Short Introduction to Probability, https://people.smp.uq.edu.au/DirkKroese/asitp.pdf

[Rohatgi-Saleh] Rohatgi, Saleh, An Introduction to Probability and Statistics (2nd Edition), Wiley (2001)

[Ross] Ross, A First Course in Probability (Ninth Edition), Pearson (2014)