
Lecture notes for Prelims Probability

Matthias Winkel

Oxford, Michaelmas Term 2023


[email protected] or [email protected]
Version of 26 September 2023
Background
Probability theory is one of the fastest growing areas of mathematics. Probabilistic arguments
are used in a tremendous range of applications from number theory to genetics, from physics to
finance. Probability is a core part of computer science and a key tool in analysis. And of course
it underpins statistics. It is a subject that impinges on our daily lives: we come across it when
we go to the doctor or buy a lottery ticket, but we’re also using probability when we listen to the
radio or use a mobile phone, or when we enhance digital images and when our immune system
fights a cold. Whether you knew it or not, from the moment you were conceived, probability
played an important role in your life.
We all have some idea of what probability is: maybe we think of it as an approximation to
long run frequencies in a sequence of repeated trials, or perhaps as a measure of degree of belief
warranted by some evidence. Each of these interpretations is valuable in certain situations. For
example, the probability that I get a head if I flip a coin is sensibly interpreted as the proportion of
heads I get if I flip that same coin many times. But there are some situations where it simply does
not make sense to think of repeating the experiment many times. For example, the probability
that ‘UK interest rates will be more than 6% next March’ or the probability that ‘I’ll be involved
in a car accident in the next twelve months’ cannot be determined by repeating the experiment
many times and looking for a long run frequency.
The philosophical issue of interpretation is not one that we’ll resolve in this course. What
we will do is set up the abstract framework necessary to deal with complicated probabilistic
questions.
These notes are essentially those written by James Martin, based on previous versions by
Christina Goldschmidt, Alison Etheridge, Neil Laws and Jonathan Marchini. I’m very glad to
receive any comments or corrections at [email protected] or [email protected].
The synopsis and reading list from the course handbook are reproduced on the next page for
your convenience. The suggested texts are an excellent source of further examples.
I hope you enjoy the course!

Overview
An understanding of random phenomena is becoming increasingly important in today’s world
within social and political sciences, finance, life sciences and many other fields. The aim of this
introduction to probability is to develop the concept of chance in a mathematical framework.
Random variables are introduced, with examples involving most of the common distributions.

Learning Outcomes
Students should have a knowledge and understanding of basic probability concepts, including
conditional probability. They should know what is meant by a random variable, and have met
the common distributions and their probability mass functions. They should understand the
concepts of expectation and variance of a random variable. A key concept is that of independence
which will be introduced for events and random variables.

Synopsis
Sample space, events, probability measure. Permutations and combinations, sampling with or
without replacement. Conditional probability, partitions of the sample space, law of total prob-
ability, Bayes’ Theorem. Independence.
Discrete random variables, probability mass functions, examples: Bernoulli, binomial, Pois-
son, geometric. Expectation, expectation of a function of a discrete random variable, variance.
Joint distributions of several discrete random variables. Marginal and conditional distributions.
Independence. Conditional expectation, law of total probability for expectations. Expectations of
functions of more than one discrete random variable, covariance, variance of a sum of dependent
discrete random variables.
Solution of first and second order linear difference equations. Random walks (finite state space
only).
Probability generating functions, use in calculating expectations. Examples including random
sums and branching processes.
Continuous random variables, cumulative distribution functions, probability density functions,
examples: uniform, exponential, gamma, normal. Expectation, expectation of a function of a
continuous random variable, variance. Distribution of a function of a single continuous random
variable. Joint probability density functions of several continuous random variables (rectangular
regions only). Marginal distributions. Independence. Expectations of functions of jointly con-
tinuous random variables, covariance, variance of a sum of dependent jointly continuous random
variables.
Random sample, sums of independent random variables. Markov’s inequality, Chebyshev’s
inequality, Weak Law of Large Numbers.

Reading List
1. G. R. Grimmett and D. J. A. Welsh, Probability: An Introduction, 2nd edition, Oxford
University Press, 2014, Chapters 1–5, 6.1–6.3, 7.1–7.3, 7.5 (Markov’s inequality), 8.1–8.2,
10.4.

2. J. Pitman, Probability, Springer-Verlag, 1993.

3. S. Ross, A First Course In Probability, Prentice-Hall, 1994.

4. D. Stirzaker, Elementary Probability, Cambridge University Press, 1994, Chapters 1–4, 5.1–
5.6, 6.1–6.3, 7.1, 7.2, 7.4, 8.1, 8.3, 8.5 (excluding the joint generating function).

Chapter 1

Events and probability

1.1 Introduction
We will think of performing an experiment which has a set of possible outcomes Ω. We call Ω
the sample space. For example,

(a) tossing a coin: Ω = {H, T };


(b) throwing two dice: Ω = {(i, j) : 1 ≤ i, j ≤ 6}.

An event is a subset of Ω. An event A ⊆ Ω occurs if, when the experiment is performed, the
outcome ω ∈ Ω satisfies ω ∈ A. You should think of events as things you can decide have or have
not happened by looking at the outcome of your experiment. For example,

(a) coming up heads: A = {H};


(b) getting a total of 4: A = {(1, 3), (2, 2), (3, 1)}.

The complement of A is A^c := Ω \ A and means “A does not occur”. For events A and B,
A ∪ B means “at least one of A and B occurs”;
A ∩ B means “both A and B occur”;
A \ B means “A occurs but B does not”.
If A ∩ B = ∅ we say that A and B are disjoint – they cannot both occur.
We will assign a probability P (A) to each event A. Later on we will discuss general rules (or
“axioms”) governing how these probabilities ought to behave. For now, let’s consider a simple
and special case, where Ω is a finite set, and the probability assigned to a subset A of Ω is
proportional to the size of A; that is,
    P(A) = |A| / |Ω|.
In that case for our examples above, we get:
(a) for a fair coin, P (A) = 1/2;
(b) for the two dice, P (A) = 1/12.
Example (b) illustrates the need for counting in the situation where we have a finite
number of possible outcomes to our experiment, all equally likely. The sample space Ω has 36
elements (6 ways of choosing i and 6 ways of choosing j). Since A = {(1, 3), (2, 2), (3, 1)} contains
3 sample points, and all sample points are equally likely, we get P (A) = 3/36 = 1/12.
We want to be able to tackle much more complicated counting problems.

1.2 Counting
Many of you will have seen before the basic ideas involving permutations and combinations. If
you haven’t, or if you find them confusing, then you can find more details in the first chapter of
Introduction to Probability by Ross.

Arranging distinguishable objects


Suppose that we have n distinguishable objects (e.g. the numbers 1, 2, . . . , n). How many ways
to order them (permutations) are there? If we have three objects a, b, c then the answer is 6:
abc, acb, bac, bca, cab and cba.
In general, there are n choices for the first object in our ordering. Then, whatever the first
object was, we have n − 1 choices for the second object. Carrying on, we have n − m + 1 choices
for the mth object and, finally, a single choice for the nth. So there are

n(n − 1) · · · 2 · 1 = n!

different orderings.
Since n! increases extremely fast, it is sometimes useful to know Stirling’s formula:
    n! ∼ √(2π) n^{n+1/2} e^{−n},

where f (n) ∼ g(n) means f (n)/g(n) → 1 as n → ∞. This is astonishingly accurate even for quite
small n. For example, the error is of the order of 1% when n = 10.
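
To get a feel for how accurate this is, here is a small Python sketch (not part of the original notes) comparing n! with Stirling's approximation directly:

```python
import math

# Compare n! with Stirling's approximation sqrt(2*pi) * n^(n + 1/2) * exp(-n).
for n in [1, 5, 10, 20]:
    exact = math.factorial(n)
    approx = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    print(n, exact, round(approx, 1), f"relative error {(exact - approx) / exact:.2%}")
```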

Arrangements when not all objects are distinguishable


What happens if not all the objects are distinguishable? For example, how many different ar-
rangements are there of a, a, a, b, c, d?
If we had a1 , a2 , a3 , b, c, d, there would be 6! arrangements. Each arrangement (e.g. b, a2 , d, a3 ,
a1 , c) is one of 3! which differ only in the ordering of a1 , a2 , a3 . So the 6! arrangements fall into
groups of size 3! which are indistinguishable when we put a1 = a2 = a3 . We want the number of
groups which is just 6!/3!.
We can immediately generalise this. For example, to count the arrangements of a, a, a, b, b, d,
play the same game. We know how many arrangements there are if the b’s are distinguishable,
but then all such arrangements fall into pairs which differ only in the ordering of b1 , b2 , and we
see that the number of arrangements is 6!/(3! 2!).
Lemma 1.1. The number of arrangements of the n objects

    α1, . . . , α1 (m1 times), α2, . . . , α2 (m2 times), . . . , αk, . . . , αk (mk times),

where αi appears mi times and m1 + · · · + mk = n, is

    n! / (m1! m2! · · · mk!).    (1.1)
Example 1.2. The number of arrangements of the letters of STATISTICS is 10!/(3! 3! 2!).
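
This is easy to check by brute force; the following Python sketch (not part of the notes) counts the distinct permutations directly:

```python
from itertools import permutations
from math import factorial

# Brute-force count of distinct arrangements of STATISTICS (10! candidates,
# but the set only ever holds the distinct ones).
distinct = len(set(permutations("STATISTICS")))
formula = factorial(10) // (factorial(3) * factorial(3) * factorial(2))
print(distinct, formula)  # both 50400
```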

If there are just two types of object then, since m1 + m2 = n, the expression (1.1) is just a
binomial coefficient: C(n, m1) = n!/(m1!(n − m1)!) = C(n, m2).
Note: we will always write C(n, m) for the binomial coefficient n!/(m!(n − m)!) (“n choose m”).

Recall the binomial theorem,
    (x + y)^n = Σ_{m=0}^{n} C(n, m) x^m y^{n−m}.

You can see where the binomial coefficient comes from because writing
    (x + y)^n = (x + y)(x + y) · · · (x + y)
and multiplying out, each term involves one pick from each bracket. The coefficient of x^m y^{n−m}
is the number of sequences of picks that give x exactly m times and y exactly n − m times and
that’s the number of ways of choosing the m “slots” for the x’s.
The expression (1.1) is called a multinomial coefficient because it is the coefficient of
a1^{m1} · · · ak^{mk} in the expansion of

    (a1 + · · · + ak)^n

where m1 + · · · + mk = n. We sometimes write C(n; m1, m2, . . . , mk) for the multinomial
coefficient.
Instead of thinking in terms of arrangements, we can think of our binomial coefficient in terms
of choices. For example, if I have to choose a committee of size k from n people, there are C(n, k)
ways to do it. To see how this ties in, stand the n people in a line. For each arrangement of k
1’s and n − k 0’s I can create a different committee by picking the ith person for the committee
if the ith term in the arrangement is a 1.
Example 1.3. Pick a team of m players from a squad of n, all possible teams being equally likely.
Set
    Ω = {(i1, i2, . . . , in) : ik ∈ {0, 1} and Σ_{k=1}^{n} ik = m},

where ik = 1 if player k is picked, and ik = 0 otherwise.
Let A = {player 1 is in the team}. Then

    P(A) = (#teams that include player 1) / (#possible teams) = C(n − 1, m − 1) / C(n, m) = m/n.
Many counting problems can be solved by finding a bijection (that is, a one-to-one correspon-
dence) between the objects we want to enumerate and other objects that we already know how
to enumerate.
Example 1.4. How many distinct non-negative integer-valued solutions of the equation
x1 + x2 + · · · + xm = n
are there?
Solution. Consider a sequence of n ★’s and m − 1 |’s. There is a bijection between such sequences
and non-negative integer-valued solutions to the equation. For example, if m = 4 and n = 3, the
sequence

    ★ ★ | | ★ |

corresponds to x1 = 2, x2 = 0, x3 = 1, x4 = 0.
There are C(n + m − 1, n) sequences of n ★’s and m − 1 |’s and, hence, the same number of
solutions to the equation.
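
A brute-force check of this count, as a Python sketch (not part of the notes; the values m = 4, n = 3 match the example):

```python
from itertools import product
from math import comb

m, n = 4, 3
# Count non-negative integer solutions of x1 + ... + xm = n directly.
count = sum(1 for xs in product(range(n + 1), repeat=m) if sum(xs) == n)
print(count, comb(n + m - 1, n))  # both 20
```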

It is often possible to perform quite complex counting arguments by manipulating binomial
coefficients. Conversely, sometimes one wants to prove relationships between binomial coefficients
and this can most easily be done by a counting argument. Here is one famous example:
Lemma 1.5 (Vandermonde’s identity). For k, m, n ≥ 0,
    C(m + n, k) = Σ_{j=0}^{k} C(m, j) C(n, k − j),    (1.2)

where we use the convention C(m, j) = 0 for j > m.

Proof. Suppose we choose a committee consisting of k people from a group of m men and n
women. There are C(m + n, k) ways of doing this, which is the left-hand side of (1.2).
Now the number of men in the committee is some j ∈ {0, 1, . . . , k}, and then it contains k − j
women. The number of ways of choosing the j men is C(m, j), and for each such choice there are
C(n, k − j) choices for the women who make up the rest of the committee. So there are
C(m, j) C(n, k − j) committees with exactly j men, and summing over j we get that the total
number of committees is given by the right-hand side of (1.2).
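
The identity is also easy to test numerically; here is a small Python sketch (not from the notes) using math.comb, which already returns 0 when j > m, matching the convention above:

```python
from math import comb

# Check Vandermonde's identity (1.2) for a few (m, n, k) triples.
for m, n, k in [(3, 4, 2), (5, 5, 6), (2, 7, 4)]:
    lhs = comb(m + n, k)
    rhs = sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))
    assert lhs == rhs, (m, n, k)
print("Vandermonde's identity verified on the test cases")
```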

An aside on sizes of sets


In this course, we will often deal with finite collections of objects, as in our counting examples.
We will also want to be able to deal with infinite sets, and we will want to distinguish between
those that are countable and those that are uncountable.
An infinite set S is called countable (or countably infinite) if there is a bijection between S
and the natural numbers N. That is, we can write S as a list: S = {x1 , x2 , x3 , . . .} = {xi : i ∈ N}.
Otherwise S is called uncountable. The natural numbers are themselves countable (take xi = i),
as are the rational numbers, but the real numbers, for example, are not. (Those of you doing
Analysis I will see much more about this there.)

1.3 The axiomatic approach


Definition 1.6. A probability space is a triple (Ω, F, P) where
1. Ω is a set, called the sample space,
2. F is a collection of subsets of Ω, called events, satisfying axioms F1 –F3 below,
3. P is a probability measure, which is a function P : F → R satisfying axioms P1 –P3 below.¹
Our axioms are as follows:
The axioms of probability
F is a collection of subsets of Ω, with:
F1 : Ω ∈ F.
F2 : If A ∈ F, then also A^c ∈ F.
F3 : If {Ai , i ∈ I} is a finite or countably infinite collection of members of F, then ∪_{i∈I} Ai ∈ F.
P is a function from F to R, with:
P1 : For all A ∈ F, P(A) ≥ 0.
P2 : P(Ω) = 1.
P3 : If {Ai , i ∈ I} is a finite or countably infinite collection of members of F, and Ai ∩ Aj = ∅
for i ≠ j, then P(∪_{i∈I} Ai) = Σ_{i∈I} P(Ai).

¹ P : F → R means that to each element A of F, we associate a real number which we call P(A). P is a function
or mapping from F to R. Compare to a situation you are more familiar with: if f(x) = x^2 then we say that f is a
function from R to R (or f : R → R for short).

Note that in particular (for I = {1, 2} and A1 = A and A2 = B):

P3 (special case): If A, B ∈ F with A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).

(You may wonder whether this special case is enough to imply the full statement of P3 . It’s easy
to show, for example by induction, that this special case of the statement for two events implies
the statement for all finite collections of events. It turns out that the statement about countably
infinite collections is genuinely stronger – but this is rather a subtle point involving intricacies of
set theory! You may also wonder what Σ_{i∈I} means when I is countable. Again, there are some
subtleties, but the relevant cases for us are where I is either finite, and Σ_{i∈I} is the natural finite
sum, or I ⊆ N is countably infinite, and Σ_{i∈I} P(Ai) = Σ_{i=0}^{∞} P(Ai) 1_I(i) can be given rigorous
meaning as the limit of an infinite series, where 1_I(i) = 1 if i ∈ I and 1_I(i) = 0 if i ∉ I.)
In the examples above, Ω was finite. In general Ω may be finite, countably infinite, or
uncountably infinite. If Ω is finite or countable, as it usually will be for the first half of this
course, then we normally take F to be the set of all subsets of Ω (the power set of Ω). (You
should check that, in this case, F1 –F3 are satisfied.) If Ω is uncountable, however, the set of
all subsets turns out to be too large: it ends up containing sets to which we cannot consistently
assign probabilities. This is an issue which some of you will see discussed properly in next year’s
Part A Integration course; for the moment, you shouldn’t worry about it, just make a mental
note that there is something to be resolved here.

Example 1.7. Consider a countably infinite set Ω = {ω1 , ω2 , . . .} and an arbitrary collection
(p1 , p2 , . . .) of non-negative numbers with sum Σ_{i=1}^{∞} pi = 1. Put

    P(A) = Σ_{i : ωi ∈ A} pi.

Then P satisfies P1 –P3 . The numbers (p1 , p2 , . . .) are called a probability function.

We can derive some useful consequences of the axioms.

Theorem 1.8. Suppose that (Ω, F, P) is a probability space and that A, B ∈ F. Then

1. P(A^c) = 1 − P(A);

2. If A ⊆ B then P(A) ≤ P(B).

Proof. 1. Since A ∪ A^c = Ω and A ∩ A^c = ∅, by P3 , P(Ω) = P(A) + P(A^c). By P2 , P(Ω) = 1
and so P(A) + P(A^c) = 1, which entails the required result.
2. Since A ⊆ B, we have B = A ∪ (B ∩ A^c). Since B ∩ A^c ⊆ A^c , it must be disjoint from A. So
by P3 , P(B) = P(A) + P(B ∩ A^c). Since by P1 , P(B ∩ A^c) ≥ 0, we thus have P(B) ≥ P(A).

Some other useful consequences are on the problem sheet.

1.4 Conditional probability


We have seen how to formalise the notion of probability. So for each event, which we thought of
as an observable outcome of an experiment, we have a probability (a likelihood, if you prefer).
But of course our assessment of likelihoods changes as we acquire more information and our next
task is to formalise that idea. First, to get a feel for what I mean, let’s look at a simple example.

Example 1.9. Suppose that in a single roll of a fair die we know that the outcome is an even
number. What is the probability that it is in fact a six?

Solution. Let B = {result is even} = {2, 4, 6} and C = {result is a six} = {6}. Then P(B) = 1/2
and P(C) = 1/6, but if I know that B has happened, then P(C|B) (read “the probability of C given
B”) is 1/3 because given that B happened, we know the outcome was one of {2, 4, 6} and since the
die is fair, in the absence of any other information, we assume each of these is equally likely.
Now let A = {result is divisible by 3} = {3, 6}. If we know that B happened, then the only
way that A can also happen is if the outcome is in A ∩ B, in this case if the outcome is {6}, and
so P(A|B) = 1/3 again, which is P(A ∩ B)/P(B).

Definition 1.10. Let (Ω, F, P) be a probability space. If A, B ∈ F and P(B) > 0 then the
conditional probability of A given B is

    P(A|B) = P(A ∩ B) / P(B).

(If P(B) = 0, then P(A|B) is not defined.)

We should check that this new notion fits with our idea of probability. The next theorem says
that it does.

Theorem 1.11. Let (Ω, F, P) be a probability space and let B ∈ F satisfy P(B) > 0. Define a
new function Q : F → R by Q(A) = P(A|B). Then (Ω, F, Q) is also a probability space.

Proof. Because we’re using the same F, we need only check axioms P1 –P3 .
P1 . For any A ∈ F,

    Q(A) = P(A ∩ B) / P(B) ≥ 0.

P2 . By definition,

    Q(Ω) = P(Ω ∩ B) / P(B) = P(B) / P(B) = 1.

P3 . For disjoint events Ai , i ∈ I,

    Q(∪_{i∈I} Ai) = P((∪_{i∈I} Ai) ∩ B) / P(B)
                  = P(∪_{i∈I} (Ai ∩ B)) / P(B)
                  = Σ_{i∈I} P(Ai ∩ B) / P(B)    (because Ai ∩ B, i ∈ I, are disjoint)
                  = Σ_{i∈I} Q(Ai).

From the definition of conditional probability, we get a very useful multiplication rule:

P (A ∩ B) = P (A|B) P (B) . (1.3)

This generalises to

P (A1 ∩ A2 ∩ A3 ∩ . . . ∩ An ) = P (A1 ) P (A2 |A1 ) P (A3 |A1 ∩ A2 ) . . . P (An |A1 ∩ A2 ∩ . . . ∩ An−1 )    (1.4)

(you can prove this by induction).

Example 1.12. An urn contains 8 red balls and 4 white balls. We draw 3 balls at random
without replacement. Let Ri = {the ith ball is red} for 1 ≤ i ≤ 3. Then
    P(R1 ∩ R2 ∩ R3) = P(R1) P(R2|R1) P(R3|R1 ∩ R2) = (8/12) · (7/11) · (6/10) = 14/55.
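
A Monte Carlo check of this value, as a Python sketch (not part of the notes; random.sample draws without replacement, as in the example):

```python
import random

urn = ["red"] * 8 + ["white"] * 4
trials = 100_000
# Estimate P(R1 ∩ R2 ∩ R3) by repeated sampling without replacement.
hits = sum(all(b == "red" for b in random.sample(urn, 3)) for _ in range(trials))
print(hits / trials, 14 / 55)  # estimate should be close to 0.2545...
```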

Example 1.13. A bag contains 26 tickets, one with each letter of the alphabet. If six tickets
are drawn at random from the bag (without replacement), what is the chance that they can be
rearranged to spell CALVIN ?

Solution. Write Ai for the event that the ith ticket drawn is from the set {C, A, L, V, I, N }. By
(1.4),
    P(A1 ∩ . . . ∩ A6) = (6/26) · (5/25) · (4/24) · (3/23) · (2/22) · (1/21).

Example 1.14. A bitstream when transmitted has


    P(0 sent) = 4/7,    P(1 sent) = 3/7.

Owing to noise,

    P(1 received | 0 sent) = 1/8,    P(0 received | 1 sent) = 1/6.
What is P(0 sent | 0 received)?

Solution. Using the definition of conditional probability,

    P(0 sent | 0 received) = P(0 sent and 0 received) / P(0 received).

Now
P(0 received) = P(0 sent and 0 received) + P(1 sent and 0 received).
Now we use (1.3) to get

    P(0 sent and 0 received) = P(0 received | 0 sent) P(0 sent)
                             = (1 − P(1 received | 0 sent)) P(0 sent)
                             = (1 − 1/8) · (4/7) = 1/2.

Similarly,

    P(1 sent and 0 received) = P(0 received | 1 sent) P(1 sent)
                             = (1/6) · (3/7) = 1/14.

Putting these together gives

    P(0 received) = 1/2 + 1/14 = 8/14

and

    P(0 sent | 0 received) = (1/2) / (8/14) = 7/8.
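
The whole calculation can be reproduced exactly with rational arithmetic; here is a small Python sketch (not part of the notes):

```python
from fractions import Fraction

p0, p1 = Fraction(4, 7), Fraction(3, 7)   # P(0 sent), P(1 sent)
r0_given_0 = 1 - Fraction(1, 8)           # P(0 received | 0 sent)
r0_given_1 = Fraction(1, 6)               # P(0 received | 1 sent)
# Law of total probability, then the definition of conditional probability.
p_r0 = r0_given_0 * p0 + r0_given_1 * p1
print(p_r0, r0_given_0 * p0 / p_r0)       # 4/7 (= 8/14) and 7/8
```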

1.5 The law of total probability and Bayes’ theorem
Definition 1.15. A family of events {B1 , B2 , . . .} is a partition of Ω if
1. Ω = ∪_{i≥1} Bi (so that at least one Bi must happen), and

2. Bi ∩ Bj = ∅ whenever i ≠ j (so that no two can happen together).


Theorem 1.16 (The law of total probability). Suppose {B1 , B2 , . . .} is a partition of Ω by sets
from F, such that P (Bi ) > 0 for all i ≥ 1. Then for any A ∈ F,
    P(A) = Σ_{i≥1} P(A|Bi) P(Bi).

This result is sometimes also called the partition theorem. We used it in our bitstream example
to calculate the probability that 0 was received.

Proof. We have

    P(A) = P(A ∩ (∪_{i≥1} Bi)),    since ∪_{i≥1} Bi = Ω
         = P(∪_{i≥1} (A ∩ Bi))
         = Σ_{i≥1} P(A ∩ Bi)    by axiom P3 , since A ∩ Bi , i ≥ 1, are disjoint
         = Σ_{i≥1} P(A|Bi) P(Bi).

Note that if P(Bi) = 0 for some i, then the expression in Theorem 1.16 wouldn’t make sense,
since P(A|Bi) is undefined. (Although we could agree a convention by which P(A|B)P(B) means
0 whenever P(B) = 0; then we can make sense of the expression in Theorem 1.16 even if some of
the Bi have zero probability.) In any case, we can still write P(A) = Σ_i P(A ∩ Bi).
Example 1.17. Crossword setter I composes m clues; setter II composes n clues. Alice’s prob-
ability of solving a clue is α if the clue was composed by setter I and β if the clue was composed
by setter II.
Alice chooses a clue at random. What is the probability she solves it?
Solution. Let

A = {Alice solves the clue}


B1 = {clue composed by setter I},
B2 = {clue composed by setter II}.

Then
    P(B1) = m/(m + n),    P(B2) = n/(m + n),    P(A|B1) = α,    P(A|B2) = β.

By the law of total probability,

    P(A) = P(A|B1) P(B1) + P(A|B2) P(B2) = αm/(m + n) + βn/(m + n) = (αm + βn)/(m + n).

In our solution to Example 1.14, we combined the law of total probability with the definition
of conditional probability. In general, this technique has a name:

Theorem 1.18 (Bayes’ Theorem). Suppose that {B1 , B2 , . . .} is a partition of Ω by sets from F
such that P (Bi ) > 0 for all i ≥ 1. Then for any A ∈ F such that P (A) > 0,

    P(Bk|A) = P(A|Bk) P(Bk) / Σ_{i≥1} P(A|Bi) P(Bi).

Proof. We have
    P(Bk|A) = P(Bk ∩ A) / P(A)
            = P(A|Bk) P(Bk) / P(A).

Now substitute for P(A) using the law of total probability.

In Example 1.14, we calculated P(0 sent | 0 received) by taking the partition to be B1 = {0 sent}
and B2 = {1 sent}, and A to be the event {0 received}.

Example 1.19. Recall Alice, from Example 1.17. Suppose that she chooses a clue at random
and solves it. What is the probability that the clue was composed by setter I?

Solution. Using Bayes’ theorem,

    P(B1|A) = P(A|B1) P(B1) / (P(A|B1) P(B1) + P(A|B2) P(B2))
            = (αm/(m + n)) / (αm/(m + n) + βn/(m + n))
            = αm / (αm + βn).


When doing calculations, often it can be convenient to work with Bayes’ Theorem in “odds”
form. If we have an event B with probability P(B), then the ratio P(B)/P(B^c) is sometimes
called the odds of B.
By applying Bayes’ Theorem with the partition {B, B^c}, and comparing the expressions it
gives for P(B|A) and for P(B^c|A), we can obtain

    P(B|A)/P(B^c|A) = (P(A|B)/P(A|B^c)) · (P(B)/P(B^c)).

The second fraction on the right-hand side is the odds of B. We could call the left-hand side the
conditional odds of B given A. The first fraction on the right-hand side is often called a Bayes
factor.

Example 1.20. A particular medical condition has prevalence 1/1000 in the population. There
exists a test for the condition which has false negative rate 0 (i.e. any person with the condition
will test positive) and false positive rate 0.01 (i.e. a typical person without the condition tests
positive with probability 0.01).
A member of the population selected at random takes a test, and tests positive. What is the
probability that they have the condition?

Solution. Write B for the event that the individual has the condition, and A for the event that
the test outcome is positive. We are asked to calculate P(B|A).
We have P(B) = 1/1000, P(A|B) = 1 and P(A|B^c) = 0.01.
Using the odds form of Bayes’ Theorem given above, we could write

    P(B|A)/P(B^c|A) = (P(A|B)/P(A|B^c)) · (P(B)/P(B^c))
                    = (1/0.01) · ((1/1000)/(999/1000))
                    = 100/999 ≈ 0.1001.

Solving the equation p/(1 − p) = 100/999, we obtain P(B|A) = 100/1099 ≈ 0.091. Even though
the test is “99% accurate”, nonetheless the conditional probability of having the condition given a
positive test is still less than 10%. The chance of a false positive considerably outweighs the chance
of a true positive, since the prevalence in the population is very low. □
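
The odds-form calculation is easy to reproduce; a Python sketch (not part of the notes):

```python
# Odds-form Bayes for the screening example.
prior_odds = (1 / 1000) / (999 / 1000)      # P(B) / P(B^c)
bayes_factor = 1 / 0.01                     # P(A|B) / P(A|B^c)
posterior_odds = bayes_factor * prior_odds  # = 100/999
p = posterior_odds / (1 + posterior_odds)   # convert odds to a probability
print(posterior_odds, p)                    # about 0.1001 and 0.091
```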

Example 1.21 (Simpson’s paradox). Consider the following table showing a comparison of the
outcomes of two types of surgery for the removal of kidney stones (from Charig et al, 1986):

                                  Number    Success rate
   Treatment A (open surgery)        350    273/350 = 0.78
   Treatment B (nephrolithotomy)     350    289/350 = 0.83
On the basis of this comparison, it looks like Treatment B has performed slightly better than
Treatment A. A closer analysis of the data divides the patients into two groups, according to the
sizes of the stones:
                   Type I (stone < 2cm)          Type II (stone > 2cm)
                   Number   Success rate         Number   Success rate
   Treatment A       87     81/87 = 0.93           263    192/263 = 0.73
   Treatment B      270     234/270 = 0.87          80    55/80 = 0.69
Now Treatment A appears to beat Treatment B both in patients of Type I, and in patients of Type
II. Our initial analysis seems to have been misleading because of a “confounding variable”, the
severity of the case. Looking at the second table, we can see that patients of Type II are harder to
treat; Treatment A was more often given to these harder cases, and Treatment B to easier cases.
This made Treatment B appear to perform better overall.
This is Simpson’s paradox; in conditional probability language, it consists of the fact that for
events E, F , G, we can have

P(E|F ∩ G) > P(E|F c ∩ G)


P(E|F ∩ Gc ) > P(E|F c ∩ Gc )

and yet

P(E|F ) < P(E|F c ).

(Exercise: identify corresponding events E, F and G in the example above.)

1.6 Independence
Of course, knowing that B has happened doesn’t always influence the chances of A.

Definition 1.22. 1. Events A and B are independent if P(A ∩ B) = P(A)P(B).

2. More generally, a family of events A = {Ai : i ∈ I} is independent if


!
\ Y
P Ai = P(Ai )
i∈J i∈J

for all finite subsets J of I.

3. A family A of events is pairwise independent if P(Ai ∩ Aj) = P(Ai) P(Aj) whenever i ≠ j.

WARNING: PAIRWISE INDEPENDENT DOES NOT IMPLY INDEPENDENT.

See the problem sheet for an example of this.


Suppose that A and B are independent. Then if P (B) > 0, we have P (A|B) = P (A), and if
P (A) > 0, we have P (B|A) = P (B). In other words, knowledge of the occurrence of B does not
influence the probability of A, and vice versa.

Example 1.23. Suppose we have two fair dice. Let

A = {first die shows 4}, B = {total score is 6} and C = {total score is 7}.

Then
    P(A ∩ B) = P({(4, 2)}) = 1/36

but

    P(A) P(B) = (1/6) · (5/36) ≠ 1/36.
So A and B are not independent. However, A and C are independent (you should check this).
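
One way to do that check is to enumerate the 36 equally likely outcomes; a Python sketch (not part of the notes):

```python
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))
A = {w for w in outcomes if w[0] == 4}           # first die shows 4
C = {w for w in outcomes if w[0] + w[1] == 7}    # total score is 7
p = lambda E: len(E) / len(outcomes)
print(p(A & C), p(A) * p(C))  # both 1/36, so A and C are independent
```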

Theorem 1.24. Suppose that A and B are independent. Then

(a) A and B^c are independent;

(b) A^c and B^c are independent.

Proof. (a) We have A = (A ∩ B) ∪ (A ∩ B^c), where A ∩ B and A ∩ B^c are disjoint, so using the
independence of A and B,

    P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A) − P(A) P(B) = P(A)(1 − P(B)) = P(A) P(B^c).

(b) Apply part (a) to the events B^c and A.

More generally, if {Ai , i ∈ I} is any family of independent events, then also the family {Ai^c , i ∈
I} is independent. Proof: exercise! (We need to show that the product formula in Definition 1.22
holds for all finite subsets {Ai^c , i ∈ J}. One approach is to use the inclusion-exclusion formula
from Problem Sheet 1; various induction arguments are also possible.)

1.7 Some useful rules for calculating probabilities
If you’re faced with a probability calculation you don’t know how to do, here are some things to
try.

• AND: Try using the multiplication rule:

P (A ∩ B) = P (A|B) P (B) = P (B|A) P (A)

or its generalisation:

P (A1 ∩ A2 ∩ . . . ∩ An ) = P (A1 ) P (A2 |A1 ) . . . P (An |A1 ∩ A2 ∩ . . . ∩ An−1 )

(as long as all of the conditional probabilities are defined).

• OR: If the events are disjoint, use

P (A1 ∪ A2 ∪ . . . ∪ An ) = P (A1 ) + P (A2 ) + · · · + P (An ) .

Otherwise, try taking complements:

P (A1 ∪ A2 ∪ . . . ∪ An ) = 1 − P ((A1 ∪ A2 ∪ . . . ∪ An )^c) = 1 − P (A1^c ∩ A2^c ∩ . . . ∩ An^c)

(“the probability at least one of the events occurs is 1 minus the probability that none of
them occur”). If that’s no use, try using the inclusion-exclusion formula (see the problem
sheet):
    P(A1 ∪ A2 ∪ . . . ∪ An) = Σ_{i=1}^{n} P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ . . . ∩ An).

• If you can’t calculate the probability of your event directly, try splitting it up according to
some partition of Ω and using the law of total probability.

Useful check: any probability that you calculate should be in the interval [0, 1]!
If not, something has gone wrong....

Chapter 2

Discrete random variables

Observing whether an event occurs corresponds to answering a yes-no question about a system
that we’re observing, or an experiment that we’re performing. But often we may want to answer
more general questions. In particular, we may want to ask questions such as “how many?” , “how
big?”, “for how long?”, whose answer is a real number.
Random variables are essentially real-valued measurements of this kind. We start by consid-
ering discrete random variables, which take a finite or countable set of values (for example integer
values). Here’s a formal definition.
Definition 2.1. A discrete random variable X on a probability space (Ω, F, P) is a function
X : Ω → R such that
(a) {ω ∈ Ω : X(ω) = x} ∈ F for each x ∈ R,

(b) ImX := X(Ω) = {X(ω) : ω ∈ Ω} is a finite or countable subset of R.


We often abbreviate “random variable” to “r.v.”.
This looks very abstract, so give yourself a moment to try to understand what it means.
• (a) says that {ω ∈ Ω : X(ω) = x} is an event to which we can assign a probability. We usu-
ally abbreviate this event to {X = x} and write P (X = x) to mean P ({ω ∈ Ω : X(ω) = x}).
If these abbreviations confuse you at first, put in the ω’s to make it clearer what is meant.

• (b) says that X can only take countably many values. Often ImX will be a subset of N.

• If Ω is countable, (b) holds automatically because we can think of ImX as being indexed
by Ω, and so ImX must itself be countable. If we also take F to be the set of
all subsets of Ω then (a) is also immediate.

• Later in the course, we will deal with continuous random variables, which take uncountably
many values; we have to be a bit more careful about what the correct analogue of (a) is;
we will end up requiring that sets of the form {X ≤ x} = {ω ∈ Ω : X(ω) ≤ x} are events to
which we can assign probabilities.
Example 2.2. Roll two dice and take Ω = {(i, j) : 1 ≤ i, j ≤ 6}. Examples of random variables
include X and Y where

X(i, j) = max{i, j}, the maximum of the two scores


Y (i, j) = i + j, the total score.

Definition 2.3. The probability mass function (p.m.f.) of X is the function pX : R → [0, 1]
defined by
pX (x) = P(X = x).

If x ∉ ImX (that is, X(ω) never equals x) then pX (x) = P ({ω : X(ω) = x}) = P (∅) = 0. Also

    Σ_{x∈ImX} pX(x) = Σ_{x∈ImX} P({ω : X(ω) = x})
                    = P(∪_{x∈ImX} {ω : X(ω) = x})    since the events are disjoint
                    = P(Ω)    since every ω ∈ Ω gets mapped somewhere in ImX
                    = 1.

Example 2.4. Fix an event A ∈ F and let X : Ω → R be the function given by


    X(ω) = 1 if ω ∈ A,  and  X(ω) = 0 otherwise.

Then X is a random variable with probability mass function

    pX(0) = P(X = 0) = P(A^c) = 1 − P(A),    pX(1) = P(X = 1) = P(A)

and pX(x) = 0 for all x ≠ 0, 1. We will usually write X = 1A and call this the indicator function
of the event A.

2.1 Some classical distributions


Before introducing concepts related to discrete random variables, we introduce a stock of examples
to try these concepts out on. All are classical and ubiquitous in probabilistic modelling. They
also have beautiful mathematical structure, some of which we’ll uncover over the course of the
term.

1. The Bernoulli distribution. X has the Bernoulli distribution with parameter p (where
0 ≤ p ≤ 1) if
P(X = 0) = 1 − p, P(X = 1) = p.
We often write q = 1 − p. (Of course since (1 − p) + p = 1, we must have P (X = x) = 0 for
all other values of x.) We write X ∼ Ber(p).
We showed in Example 2.4 that the indicator function 1A of an event A is an example of a
Bernoulli random variable with parameter p = P (A), constructed on an explicit probability
space.
The Bernoulli distribution is used to model, for example, the outcome of the flip of a coin
with “1” representing heads and “0” representing tails. It is also a basic building block for
other classical distributions.

2. The binomial distribution. X has a binomial distribution with parameters n and p


(where n is a positive integer and p ∈ [0, 1]) if
 
    P(X = k) = C(n, k) p^k (1 − p)^{n−k},    k = 0, 1, . . . , n.

We write X ∼ Bin(n, p).


X models the number of successes obtained in a sequence of n independent trials, where each
trial has probability p of success and 1 − p of failure. To see this, note that the probability
of any particular sequence of length n of successes or failures, containing exactly k successes,
is p^k (1 − p)^{n−k}, and there are exactly C(n, k) such sequences.

3. The geometric distribution. X has a geometric distribution with parameter p if

    P(X = k) = p(1 − p)^{k−1},    k = 1, 2, . . . .

Notice that now X takes values in a countably infinite set – the whole of the positive
integers. We write X ∼ Geom(p).
Again we can interpret the distribution in terms of a sequence of independent trials, each
with probability p of success. Now X models the number of independent trials needed until
we see the first success.
NOTE: there is an alternative and also common definition for the geometric distribution
as the distribution of the number of failures, Y , before the first success. If X is defined as
above, then Y corresponds to X − 1 and so

    P(Y = k) = p(1 − p)^k,    k = 0, 1, . . . .

If in doubt, state which one you are using.

4. The Poisson distribution. X has the Poisson distribution with parameter λ ≥ 0 if

    P(X = k) = λ^k e^{−λ} / k!,    k = 0, 1, . . . .
We write X ∼ Po(λ).
This distribution arises in many applications. It frequently provides a good model for the
number of events which occur in some situation where there are a large number of possible
events, each of which has a very small probability. For example, the number of decaying
particles detected by a Geiger counter near a lump of radioactive material, or the number of
calls arriving at a call centre in a given time period. Formally speaking it can be obtained
as a limit of the binomial distribution Bin(n, λ/n) as n becomes large (see problem sheet
3).
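
The limit is easy to see numerically; a Python sketch (not part of the notes) comparing the first few Bin(n, λ/n) probabilities with Po(λ), taking λ = 2 as an arbitrary illustration:

```python
from math import comb, exp, factorial

lam = 2.0
poisson = [exp(-lam) * lam**k / factorial(k) for k in range(5)]
for n in [10, 100, 1000]:
    p = lam / n
    # Bin(n, lam/n) probabilities for k = 0, ..., 4.
    binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(5)]
    print(n, [f"{b:.4f}" for b in binom])
print("Po(2):", [f"{q:.4f}" for q in poisson])
```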

Exercise 2.5. Check that each of these really does define a probability mass function. That is:

• pX(x) ≥ 0 for all x,

• Σ_x pX(x) = 1.

You may find it useful to refer to the reminders about series which you can find in the Appendix
at the end of these notes.

The distributions mentioned above are important, but naturally they do not constitute an
exhaustive list! In fact, given any function pX which satisfies the two properties above, we can
write down a probability space and a random variable X defined on it whose probability mass
function is pX. Most directly, we could take Ω = {x ∈ R : pX(x) ≠ 0}, take F to consist of all subsets
of Ω, define

    P({ω}) = pX(ω) for each ω ∈ Ω

and, more generally,

    P(S) = Σ_{ω∈S} pX(ω) for each S ⊆ Ω,

and then take X to be the identity function i.e. X(ω) = ω. However, this is often not the
most natural probability space to take. For example, suppose that pX is the mass function of
a Binomial(3, 1/2) distribution (which can model the number of heads obtained in a sequence
of three fair coin tosses). We could proceed as just outlined. But we could also take Ω =

{(i, j, k) : i, j, k ∈ {0, 1}}, with a 0 representing a tail and a 1 representing a head, so that an
element of Ω tells us exactly what the three coin tosses were. Then take F to be the power set
of Ω,
P ({(i, j, k)}) = 2^{−3} for all i, j, k ∈ {0, 1},
so that every sequence of coin tosses is equally likely, and finally set X((i, j, k)) = i + j + k. In
both cases, X has the same distribution, but the probability spaces are quite different.
Up to this point, we have often been quite explicit in describing our sample space Ω. Once
we get to more complicated examples, it can quickly become impractical to specify Ω explicitly.
Although the concept of a probability space (Ω, F, P) underlies everything, in practice it will be
rare that we think about Ω itself – instead we will talk directly about events and their probabilities,
and random variables and their distributions (and we can do that without assuming any particular
structure for Ω).

2.2 Expectation
We can encapsulate certain information about the distribution of a random variable through what
a statistician might call “summary statistics”. The most fundamental of these is the expectation,
which tells us the “average value” of the random variable.

Definition 2.6. The expectation (or expected value or mean) of X is


    E[X] = Σ_{x∈ImX} x P(X = x)    (2.1)

provided that Σ_{x∈ImX} |x| P(X = x) < ∞. If Σ_{x∈ImX} |x| P(X = x) is infinite, we say that the
expectation does not exist.
The reason we insist that Σ_{x∈ImX} |x| P(X = x) is finite, that is, that the sum on the right-hand
side of equation (2.1) is absolutely convergent, is that we need the expectation to take the same
value regardless of the order in which we sum the terms. See Section A.1 for a discussion of
absolute convergence.
(The problems with different orders of summation giving different answers concern cases when
there are both positive and negative terms in the sum. If X is positive, i.e. ImX ⊆ R+, and if
Σ_{x∈ImX} x P(X = x) diverges, then there is no issue with the order of summation. In this case,
we sometimes write E[X] = ∞ and say that the expectation is infinite.)
The expectation of X is the ‘average’ value which X takes – if we were able to take many
independent copies of the experiment that X describes, and take the average of the outcomes,
then we should expect that average to be close to E[X]. We will come back to this idea at the
end of the course when we look at the Law of Large Numbers.

Example 2.7. 1. Suppose that X is the number obtained when we roll a fair die. Then

    E[X] = 1 · P(X = 1) + 2 · P(X = 2) + · · · + 6 · P(X = 6)
         = 1 · (1/6) + 2 · (1/6) + · · · + 6 · (1/6) = 3.5.
Of course, you’ll never throw 3.5 on a single roll of a die, but if you throw a lot of times
you expect the average number thrown to be close to 3.5.

2. Suppose A ∈ F is an event and 1A is its indicator function. Then


    E[1A] = 0 · P(A^c) + 1 · P(A) = P(A).

3. Suppose that P(X = n) = 6/(π^2 n^2), n ≥ 1. Then

    Σ_{n=1}^{∞} n P(X = n) = (6/π^2) Σ_{n=1}^{∞} 1/n = ∞

and so the expectation does not exist (or we may say E [X] = ∞).

4. Let X ∼ Po(λ). Then

    E[X] = Σ_{k=0}^{∞} k · e^{−λ} λ^k / k!
         = e^{−λ} Σ_{k=1}^{∞} λ^k / (k − 1)!
         = λ e^{−λ} Σ_{k=1}^{∞} λ^{k−1} / (k − 1)!
         = λ e^{−λ} e^{λ}
         = λ.

You will find some more examples on the problem sheet.


Let h : R → R. Then if X is a discrete random variable, Y = h(X) is also a discrete random
variable.

Theorem 2.8. If h : R → R, then


    E[h(X)] = Σ_{x∈ImX} h(x) P(X = x)

provided that Σ_{x∈ImX} |h(x)| P(X = x) < ∞.

Proof. Let A = {y : y = h(x) for some x ∈ ImX}. Then, starting from the right-hand side,
X X X
h(x)P (X = x) = h(x)P (X = x)
x∈ImX y∈A x∈ImX: h(x)=y
X X
= yP (X = x)
y∈A x∈ImX: h(x)=y
X X
= y P (X = x)
y∈A x∈ImX: h(x)=y
X
= yP (h(X) = y)
y∈A

= E [h(X)] .

Example 2.9. Take h(x) = x^k. Then E[X^k] is called the kth moment of X, when it exists.

Let us now prove some properties of the expectation which will be useful to us later on.

Theorem 2.10. Let X be a discrete random variable such that E [X] exists.

(a) If X is non-negative then E [X] ≥ 0.

(b) If a, b ∈ R then E [aX + b] = aE [X] + b.

Proof. (a) We have ImX ⊆ [0, ∞) and so

    E[X] = Σ_{x∈ImX} x P(X = x)

is a sum whose terms are all non-negative and so must itself be non-negative.
(b) Exercise.

The expectation tells us important information about the average value of a random variable,
but there is a lot of information that it doesn’t provide. If you are investing in the stock market,
the expected rate of return of some stock is not the only quantity you will be interested in;
you would also like to get some idea of the riskiness of the investment. The typical size of
the fluctuations of a random variable around its expectation is described by another summary
statistic, the variance.

Definition 2.11. For a discrete random variable X, the variance of X is defined by

var (X) = E[(X − E[X])^2]

provided that this quantity exists.

(This is E[f(X)] where f is given by f(x) = (x − E[X])^2 – remember that E[X] is just a
number.)
Note that, since (X − E[X])^2 is a non-negative random variable, by part (a) of Theorem 2.10,
var (X) ≥ 0. The variance is a measure of how much the distribution of X is spread out about
its mean: the more the distribution is spread out, the larger the variance. If X is, in fact,
deterministic (i.e. P (X = a) = 1 for some a ∈ R) then E [X] = a also and so var (X) = 0: only
randomness gives rise to variance.
Writing µ = E[X] and expanding the square we see that

    var (X) = E[(X − µ)^2]
            = Σ_{x∈ImX} (x^2 − 2µx + µ^2) pX(x)
            = Σ_{x∈ImX} x^2 pX(x) − 2µ Σ_{x∈ImX} x pX(x) + µ^2 Σ_{x∈ImX} pX(x)
            = E[X^2] − 2µ E[X] + µ^2
            = E[X^2] − (E[X])^2.

This is often an easier expression to work with.


pThose of you who have done statistics at school will have seen the standard deviation, which
is var (X). In probability, we usually work with the variance instead because it has natural
mathematical properties.

Theorem 2.12. Suppose that X is a discrete random variable whose variance exists. Then if a
and b are (finite) fixed real numbers, then the variance of the discrete random variable Y = aX +b
is given by
var (Y ) = var (aX + b) = a2 var (X) .

The proof is an exercise, but notice that of course b doesn’t come into it because it simply
shifts the whole distribution – and hence the mean – by b, whereas variance measures relative to
the mean.
In view of Theorem 2.12, why do you think statisticians often prefer to use the standard
deviation rather than variance as a measure of spread?

2.3 Conditional distributions
Back in Section 1.4 we talked about conditional probability P(A|B). In the same way, for a
discrete random variable X we can define its conditional distribution, given the event B. This is
what it sounds like: the mass function obtained by conditioning on the outcome B.

Definition 2.13. Suppose that B is an event such that P (B) > 0. Then the conditional proba-
bility mass function of X given B is given by

    P(X = x|B) = P({X = x} ∩ B) / P(B),

for x ∈ R. The conditional expectation of X given B is


    E[X|B] = Σ_x x P(X = x|B),

whenever the sum converges absolutely. We write pX|B (x) = P(X = x|B).

Theorem 2.14 (Partition theorem for expectations). If {B1 , B2 , . . .} is a partition of Ω such
that P(Bi) > 0 for all i ≥ 1, then

    E[X] = Σ_{i≥1} E[X|Bi] P(Bi),

whenever E [X] exists.

Proof.

    E[X] = Σ_x x P(X = x)
         = Σ_x x ( Σ_i P(X = x|Bi) P(Bi) )    by the law of total probability
         = Σ_x Σ_i x P(X = x|Bi) P(Bi)
         = Σ_i P(Bi) ( Σ_x x P(X = x|Bi) )
         = Σ_i E[X|Bi] P(Bi).

Notice that again we could include cases where P (Bi ) = 0 for some i, if we agree to interpret
E [X|Bi ] P (Bi ) to be 0 in that situation.

Example 2.15. Let X be the number of rolls of a fair die required to get the first 6. (So X is
geometrically distributed with parameter 1/6.) Find E[X] and var (X).

Solution. Let B1 be the event that the first roll of the die gives a 6, so that B1^c is the event that
it does not. Then

    E[X] = E[X|B1] P(B1) + E[X|B1^c] P(B1^c)
         = (1/6) · 1 + (5/6) · E[1 + X]    (successive rolls are independent)
         = 1/6 + (5/6)(1 + E[X]).

Rearrange to get E[X] = 6 (as our intuition would have us guess). Similarly,

    E[X^2] = E[X^2|B1] P(B1) + E[X^2|B1^c] P(B1^c)
           = 1/6 + (5/6) E[(1 + X)^2]
           = 1/6 + (5/6)(1 + 2E[X] + E[X^2]).
Compare this solution to a direct calculation using the probability mass function:

    E[X] = Σ_{k=1}^{∞} k p q^{k−1},    E[X^2] = Σ_{k=1}^{∞} k^2 p q^{k−1},

with p = 1/6 and q = 5/6. □
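
A simulation makes a reassuring check of both answers; a Python sketch (not part of the notes):

```python
import random

def rolls_to_six():
    # Number of rolls of a fair die up to and including the first 6.
    n = 0
    while True:
        n += 1
        if random.randint(1, 6) == 6:
            return n

trials = 100_000
xs = [rolls_to_six() for _ in range(trials)]
mean = sum(xs) / trials
var = sum((x - mean) ** 2 for x in xs) / trials
print(mean, var)  # should be close to 6 and 30
```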

We’ll see a powerful approach to moment calculations in Chapter 4, but first we must find a
way to deal with more than one random variable at a time.

2.4 Joint distributions


Suppose that we want to consider two discrete random variables, X and Y , defined on the same
probability space. In the same way as a single random variable was characterised in terms of its
probability mass function, pX (x) for x ∈ R, so now we must specify pX,Y (x, y) = P(X = x, Y = y).
It’s not enough to specify P (X = x) and P (Y = y) because the events {X = x} and {Y = y}
might not be independent (think of the case Y = X^2, for example).
Definition 2.16. Given two discrete random variables X and Y their joint probability mass
function is defined as

pX,Y (x, y) = P ({X = x} ∩ {Y = y}) , x, y ∈ R.

We usually write the right-hand side simply as P(X = x, Y = y). We have pX,Y(x, y) ≥ 0 for all
x, y ∈ R and Σ_x Σ_y pX,Y(x, y) = 1.
If we are given the joint probability mass function of X and Y , we can recover the probability
mass functions of either of X or Y individually by summing over the possible values of the other:
    pX(x) = Σ_y pX,Y(x, y),    pY(y) = Σ_x pX,Y(x, y).

In this context, these distributions of X and Y alone are often called marginal distributions. The
name comes from the way they appear in the margins when we write the joint probability mass
function as a table:
Example 2.17. Suppose that X and Y take only the values 0 or 1 and their joint mass function
pX,Y is given by
             X = 0    X = 1
    Y = 0     1/3      1/2
    Y = 1     1/12     1/12

Observe that Σ_{x,y} pX,Y(x, y) = 1 (always a good check when modelling).
The marginals are found by summing the rows and columns:

             X = 0    X = 1    pY(y)
    Y = 0     1/3      1/2      5/6
    Y = 1     1/12     1/12     1/6
    pX(x)     5/12     7/12

Notice that P(X = 1) = 7/12, P(Y = 1) = 1/6 and P(X = 1, Y = 1) = 1/12 ≠ (7/12) × (1/6), so
{X = 1} and {Y = 1} are not independent events.

Whenever pX (x) > 0 for some x ∈ R, we can also write down the conditional probability mass
function of Y given that X = x:

    pY|X=x(y) = P(Y = y|X = x) = pX,Y(x, y) / pX(x)    for y ∈ R.

The conditional expectation of Y given that X = x is then


    E[Y|X = x] = Σ_y y pY|X=x(y),

whenever the sum converges absolutely.

Example 2.18. For the joint distribution in Example 2.17, we have


    pY|X=0(0) = 4/5,    pY|X=0(1) = 1/5

and

    E[Y|X = 0] = 1/5.
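
The table manipulations above are mechanical and easy to reproduce; a Python sketch (not part of the notes) using exact fractions:

```python
from fractions import Fraction as F

# Joint mass function of Example 2.17 as a dictionary {(x, y): probability}.
joint = {(0, 0): F(1, 3), (1, 0): F(1, 2), (0, 1): F(1, 12), (1, 1): F(1, 12)}

pX = {x: sum(q for (a, b), q in joint.items() if a == x) for x in (0, 1)}
pY = {y: sum(q for (a, b), q in joint.items() if b == y) for y in (0, 1)}
cond = {y: joint[(0, y)] / pX[0] for y in (0, 1)}   # p_{Y|X=0}
print(pX, pY, cond, sum(y * cond[y] for y in (0, 1)))
# marginals 5/12, 7/12 and 5/6, 1/6; conditional 4/5, 1/5; E[Y|X=0] = 1/5
```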
Definition 2.19. Discrete random variables X and Y are independent if

P(X = x, Y = y) = P(X = x)P(Y = y) for all x, y ∈ R.

In other words, X and Y are independent if and only if the events {X = x} and {Y = y} are
independent for all choices of x and y. We can also write this as

pX,Y (x, y) = pX (x)pY (y) for all x, y ∈ R.

Example 2.20 (Part of an old exam question). A coin when flipped shows heads with probability
p and tails with probability q = 1 − p. It is flipped repeatedly. Assume that the outcomes of
different flips are independent. Let U be the length of the initial run and V the length of the second run.
Find P(U = m, V = n), P(U = m), P(V = m). Are U and V independent?

Solution. We condition on the outcome of the first flip and use the law of total probability.

    P(U = m, V = n) = P(U = m, V = n | 1st flip H) P(1st flip H)
                        + P(U = m, V = n | 1st flip T) P(1st flip T)
                    = p · p^{m−1} q^n p + q · q^{m−1} p^n q
                    = p^{m+1} q^n + q^{m+1} p^n.

    P(U = m) = Σ_{n=1}^{∞} (p^{m+1} q^n + q^{m+1} p^n) = p^{m+1} · q/(1 − q) + q^{m+1} · p/(1 − p)
             = p^m q + q^m p.

    P(V = n) = Σ_{m=1}^{∞} (p^{m+1} q^n + q^{m+1} p^n) = q^n · p^2/(1 − p) + p^n · q^2/(1 − q)
             = p^2 q^{n−1} + q^2 p^{n−1}.

We have P(U = m, V = n) ≠ f(m) g(n) unless p = q = 1/2. So U, V are not independent unless
p = 1/2. To see why, suppose that p < 1/2; then knowing that U is small, say, tells you that the first
run is more likely to be a run of H’s and so V is likely to be longer. Conversely, knowing that U
is big will tell us that V is likely to be small. U and V are negatively correlated. □
In the same way as we defined expectation for a single discrete random variable, so in the
bivariate case we can define expectation of any function of the random variables X and Y . Let
h : R^2 → R. Then h(X, Y) is itself a random variable, and we can show as in Theorem 2.8 that
    E[h(X, Y)] = Σ_x Σ_y h(x, y) P(X = x, Y = y)
               = Σ_x Σ_y h(x, y) pX,Y(x, y),    (2.2)

provided the sum converges absolutely.


Theorem 2.21. Suppose X and Y are discrete random variables and a, b ∈ R are constants.
Then
E[aX + bY ] = aE[X] + bE[Y ]
provided that both E [X] and E [Y ] exist.
Proof. Setting h(x, y) = ax + by, we have

    E[aX + bY] = E[h(X, Y)]
               = Σ_x Σ_y (ax + by) pX,Y(x, y)
               = a Σ_x Σ_y x pX,Y(x, y) + b Σ_x Σ_y y pX,Y(x, y)
               = a Σ_x x ( Σ_y pX,Y(x, y) ) + b Σ_y y ( Σ_x pX,Y(x, y) )
               = a Σ_x x pX(x) + b Σ_y y pY(y)
               = a E[X] + b E[Y].
Theorem 2.21 tells us that expectation is linear. This is a very important property. We can
easily extend by induction to get E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ] for any finite
collection of random variables X1 , . . . , Xn . Note that we don’t need to make any assumption
about independence of the random variables.

Example 2.22. Your spaghetti bowl contains n strands of spaghetti. You repeatedly choose 2
ends at random, and join them together. What is the average number of loops in the bowl, once
no ends remain?
Solution. We start with 2n ends, and the number decreases by 2 at each step. When we have
k ends, the probability of forming a loop is 1/(k − 1). Before the ith step, we have 2(n − i + 1)
ends, so we form a loop with probability 1/[2(n − i) + 1].
Let Xi be the indicator function of the event that we form a loop at the ith step. Then
E[Xi ] = 1/[2(n − i) + 1]. Let M be the total number of loops formed. Then M = X1 + · · · + Xn ,
so using linearity of expectation,

    E[M] = E[X1] + E[X2] + · · · + E[Xn−1] + E[Xn]
         = 1/(2n − 1) + 1/(2n − 3) + · · · + 1/3 + 1.
(If n is large, this expectation is close to (log n)/2.)
Note that the probability mass function of M is not easy to obtain. So finding the expectation
of M directly from the definition at (2.1) would have been very much less straightforward.
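
By contrast, a direct simulation of the joining process is straightforward, and gives a check on the expectation; a Python sketch (not part of the notes):

```python
import random

def loops_once(n):
    # partner[e] is the end at the other end of the chain containing end e;
    # initially ends 2i and 2i+1 are the two ends of strand i.
    partner = {}
    for i in range(n):
        partner[2 * i], partner[2 * i + 1] = 2 * i + 1, 2 * i
    loops = 0
    while partner:
        a, b = random.sample(list(partner), 2)   # join two random free ends
        if partner[a] == b:
            loops += 1                           # closed a chain into a loop
        else:
            pa, pb = partner[a], partner[b]
            partner[pa], partner[pb] = pb, pa    # merged two chains
        del partner[a], partner[b]
    return loops

n, trials = 10, 20_000
print(sum(loops_once(n) for _ in range(trials)) / trials,
      sum(1 / (2 * k - 1) for k in range(1, n + 1)))  # both about 2.13
```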

Theorem 2.23. If X and Y are independent discrete random variables whose expectations exist,
then
E[XY ] = E[X]E[Y ].
Proof. We have
    E[XY] = Σ_x Σ_y x y P(X = x, Y = y)
          = Σ_x Σ_y x y P(X = x) P(Y = y)    (by independence)
          = ( Σ_x x P(X = x) ) ( Σ_y y P(Y = y) )
          = E[X] E[Y].

Exercise 2.24. Show that var (X + Y ) = var (X) + var (Y ) when X and Y are independent.
What happens when X and Y are not independent? It’s useful to define the covariance,

cov (X, Y ) = E [(X − E [X])(Y − E [Y ])] .

Notice that cov (X, X) = var (X).


Exercise 2.25. Check that cov (X, Y ) = E [XY ] − E [X] E [Y ] and that

var (X + Y ) = var (X) + var (Y ) + 2cov (X, Y ) .

Notice that this means that if X and Y are independent, their covariance is 0. In general, the
covariance can be either positive or negative valued.

WARNING: cov (X, Y ) = 0 DOES NOT IMPLY THAT X AND Y ARE INDEPENDENT.

See the problem sheet for an example.


Definition 2.26. We can define multivariate probability mass functions analogously:

pX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = P(X1 = x1 , X2 = x2 , . . . , Xn = xn ),

for x1 , x2 , . . . , xn ∈ R, and so on.

By analogy with the way we defined independence for a sequence of events, we can define
independence for a family of random variables.

Definition 2.27. A family {Xi : i ∈ I} of discrete random variables are independent if for all
finite sets J ⊆ I and all collections {Ai : i ∈ J} of subsets of R,
    P(∩_{i∈J} {Xi ∈ Ai}) = ∏_{i∈J} P(Xi ∈ Ai).

Suppose that X1 , X2 , . . . are independent random variables which all have the same distribution.
Then we say that X1 , X2 , . . . are independent and identically distributed (i.i.d.).

Chapter 3

Difference equations and random walks

3.1 Difference equations


Our next topic is not probability theory, but rather a tool that you will need both to answer some
probability questions in the next chapter and in all sorts of other areas of mathematics.
Here is a famous probability problem by way of motivation.
Example 3.1 (Gambler’s ruin). A gambler repeatedly plays a game in which he wins £1 with
probability p and loses £1 with probability q = 1 − p (independently at each play). He will leave
the casino if he loses all his money or if his fortune reaches £M .
What is the probability that he leaves with nothing if his initial fortune is £n?
If the initial fortune is £n, call the probability un and condition on the outcome of the first
play to see that
un = P(bankruptcy | win 1st game)P(win 1st game)
+ P(bankruptcy | lose 1st game)P(lose 1st game).
If the gambler wins the first game, by independence of different plays it’s just like starting over
from an initial fortune of £(n + 1); similarly, if he loses the first game, it’s just like starting over
from an initial fortune of £(n − 1). This implies that
    u_n = p u_{n+1} + q u_{n−1},    (3.1)
which is valid for 1 ≤ n ≤ M − 1. We have the boundary conditions u0 = 1, uM = 0.
This is an example of a second-order recurrence relation; it is equations of this sort that we
will now learn how to solve.
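
Before doing so, note that the recurrence plus its boundary conditions already pins down the answer numerically. A Python sketch (not from the notes; the choices p = 0.45, M = 10 and starting fortune 3 are purely illustrative) solves (3.1) by repeated sweeps and checks the result against direct simulation:

```python
import random

p, M = 0.45, 10
q = 1 - p

# Solve u_n = p*u_{n+1} + q*u_{n-1} with u_0 = 1, u_M = 0 by repeated sweeps.
u = [1.0] + [0.5] * (M - 1) + [0.0]
for _ in range(10_000):
    for n in range(1, M):
        u[n] = p * u[n + 1] + q * u[n - 1]

def ruined(n):
    # Play until the fortune hits 0 (ruin) or M (leave a winner).
    while 0 < n < M:
        n += 1 if random.random() < p else -1
    return n == 0

est = sum(ruined(3) for _ in range(50_000)) / 50_000
print(u[3], est)  # recurrence solution and simulation should agree
```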
Definition 3.2. A kth order linear recurrence relation (or difference equation) has the form
    Σ_{j=0}^{k} a_j u_{n+j} = f(n)    (3.2)

with a_0 ≠ 0 and a_k ≠ 0, where a_0, . . . , a_k are known constants independent of n. A solution to
such a difference equation is a sequence (u_n)_{n≥0} satisfying (3.2) for all n ≥ 0.
You may want to recall what you may know about solving linear ordinary differential equations
like
    a d²y/dx² + b dy/dx + c y = f(x)
for the function y, since what we do here will be completely analogous.

The next theorem says that we can split the problem of finding a solution to our difference
equations into two parts.

Theorem 3.3. The general solution (un )n≥0 (i.e. if no boundary conditions are specified) of
    Σ_{j=0}^{k} a_j u_{n+j} = f(n)

can be written as un = vn + wn where (vn )n≥0 is a particular solution to the equation and (wn )n≥0
is any solution of the associated homogeneous equation
    Σ_{j=0}^{k} a_j w_{n+j} = 0.

Proof. Recall that a_k ≠ 0 so we can write the equation as

    u_{n+k} = (1/a_k) ( f(n) − Σ_{j=0}^{k−1} a_j u_{n+j} ).

To see that there always are solutions, we choose u0 = · · · = uk−1 = 0 and note that this equation
determines un+k inductively for all n ≥ 0 so that the equation holds. Now fix a particular solution
(vn )n≥0 and consider any solution (un )n≥0 . Then (un − vn )n≥0 satisfies
    Σ_{j=0}^{k} a_j (u_{n+j} − v_{n+j}) = 0.

So (un ) and (vn ) indeed differ by a solution wn = un − vn to the homogeneous equation, as
claimed. Vice versa, for every particular solution (vn ) to the equation and every solution (wn ) to
the homogeneous equation, un = vn + wn also solves the equation.

3.2 First order linear difference equations


We will develop the necessary methods via a series of worked examples.

Example 3.4. Solve


    u_{n+1} = a u_n + b

where u0 = 3 and the constants a ≠ 0 and b are given.

Solution. The homogeneous equation is w_{n+1} = a w_n. “Putting it into itself”, we get

    w_n = a w_{n−1} = . . . = a^n w_0 = A a^n

for some constant A.


How about a particular solution? As in differential equations, guess that a constant solution might
work, so try v_n = C. This gives C = aC + b, so provided that a ≠ 1, C = b/(1 − a) and we have the
general solution

    u_n = A a^n + b/(1 − a).
Setting n = 0 allows us to determine A so that the boundary condition u0 = 3 holds:
b b
3=A+ and so A = 3 − .
1−a 1−a

Hence,
\[ u_n = \left( 3 - \frac{b}{1-a} \right) a^n + \frac{b}{1-a} = 3a^n + \frac{b(1-a^n)}{1-a}. \]
What happens if $a = 1$? An applied-maths-type approach would set $a = 1 + \epsilon$ and try to see
what happens as $\epsilon \to 0$:
\[ u_n = u_0 (1+\epsilon)^n + \frac{b(1 - (1+\epsilon)^n)}{1 - (1+\epsilon)}
     = u_0 + b \, \frac{1 - (1 + n\epsilon)}{-\epsilon} + O(\epsilon)
     = u_0 + bn + O(\epsilon) \to u_0 + bn \text{ as } \epsilon \to 0. \]

An alternative approach is to mimic what you did for differential equations and “try the next most
complex thing”. We have un+1 = un + b and the homogeneous equation has solution wn = A (a
constant). For a particular solution try vn = Cn (note that there is no point in adding a constant
term because the constant solves the homogeneous equation and so it makes no contribution to
the right-hand side when we substitute).
Then C(n + 1) = Cn + b gives C = b and we obtain once again the general solution

un = A + bn.

Setting $n = 0$ yields $A = 3$ and so $u_n = 3 + bn$. □

Example 3.5. Solve
\[ u_{n+1} = a u_n + bn. \]
Solution. As above, the homogeneous equation has solution $w_n = A a^n$. For a particular solution,
try $v_n = Cn + D$. Substituting, we obtain
\[ C(n+1) + D = a(Cn + D) + bn. \]
Equating coefficients of n and the constant terms gives
\[ C = aC + b, \qquad C + D = aD, \]
so again provided $a \neq 1$ we can solve to obtain $C = \frac{b}{1-a}$ and $D = -\frac{C}{1-a} = -\frac{b}{(1-a)^2}$. Thus for $a \neq 1$,
\[ u_n = A a^n + \frac{bn}{1-a} - \frac{b}{(1-a)^2}. \]
To find A, we need a boundary condition (e.g. the value of $u_0$). □

Exercise 3.6. Solve the equation for $a = 1$. Hint: try $v_n = Cn + Dn^2$.
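These closed forms are easy to sanity-check against direct iteration of the recurrence. The following minimal sketch (in Python, with illustrative values of a and b that are not from the notes) compares the solution of Example 3.4 with the iterated sequence.

# Sanity check for Example 3.4: the closed form
#   u_n = (3 - b/(1-a)) a^n + b/(1-a)
# should agree with direct iteration of u_{n+1} = a u_n + b, u_0 = 3.
a, b = 0.5, 2.0   # illustrative values; any a different from 0 and 1 works
u = 3.0           # u_0 = 3
for n in range(10):
    closed_form = (3 - b / (1 - a)) * a**n + b / (1 - a)
    assert abs(u - closed_form) < 1e-9
    u = a * u + b  # advance the recurrence
print("closed form agrees with iteration for n = 0, ..., 9")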

3.3 Second order linear difference equations


Consider
\[ u_{n+1} + a u_n + b u_{n-1} = f(n). \]
The general solution will depend on two constants. For the first order case, the homogeneous
equation had a solution of the form $w_n = A\lambda^n$, so we try the same here. Substituting $w_n = A\lambda^n$ in
\[ w_{n+1} + a w_n + b w_{n-1} = 0 \]
gives
\[ A\lambda^{n+1} + aA\lambda^n + bA\lambda^{n-1} = 0. \]
For a non-trivial solution we can divide by $A\lambda^{n-1}$ and see that $\lambda$ must solve the quadratic equation
\[ \lambda^2 + a\lambda + b = 0. \]
This is called the auxiliary equation. (So just as when you solve 2nd order ordinary differential
equations you obtain a quadratic equation by considering solutions of the form $e^{\lambda t}$, so here we
obtain a quadratic in $\lambda$ by considering solutions of the form $\lambda^n$.)
If the auxiliary equation has distinct roots $\lambda_1$ and $\lambda_2$, then the general solution to the homogeneous equation is
\[ w_n = A_1 \lambda_1^n + A_2 \lambda_2^n. \]
If $\lambda_1 = \lambda_2 = \lambda$, try the next most complicated thing (or mimic what you do for ordinary differential
equations) to get
\[ w_n = (A + Bn)\lambda^n. \]
Exercise 3.7. Check that this solution works.
How about particular solutions? The same tricks as for the first order case apply. We
can start by trying something of the same form as f, and if that fails then try the next most
complicated thing. You can save yourself work by not including components that you already
know solve the homogeneous equation.
Example 3.8. Solve
un+1 + 2un − 3un−1 = 1.
Solution. The auxiliary equation is just

λ2 + 2λ − 3 = 0

which has roots λ1 = −3, λ2 = 1, so

wn = A(−3)n + B.

For a particular solution, we’d like to try a constant, but that won’t work because we know that
it solves the homogeneous equation (it’s a special case of wn ). So try the next most complicated
thing, which is vn = Cn. Substituting, we obtain

C(n + 1) + 2Cn − 3C(n − 1) = 1,

which gives $C = \frac{1}{4}$. The general solution is then
\[ u_n = A(-3)^n + B + \frac{n}{4}. \]
If the boundary conditions had been specified, you could now find A and B by substitution. (Note
that it takes one boundary condition to specify the solution to a first order difference equation
and two to specify the solution to a 2nd order difference equation. Usually these will be the
values of $u_0$ and $u_1$ but notice that in the gambler's ruin problem we are given $u_0$ and $u_M$.) □

Example 3.9. Solve


un+1 − 2un + un−1 = 1.
Solution. The auxiliary equation λ2 − 2λ + 1 = 0 has repeated root λ = 1, so the homogeneous
equation has general solution
wn = An + B.
For a particular solution, try the next most complicated thing, so vn = Cn2 . (Once again there is
no point in adding a Dn+E term to this as that solves the homogeneous equation, so substituting

30
it on the left cannot contribute anything to the 1 that we are trying to obtain on the right of the
equation.) Substituting, we obtain

C(n + 1)2 − 2Cn2 + C(n − 1)2 = 1,

which gives $C = \frac{1}{2}$. So the general solution is
\[ u_n = An + B + \frac{1}{2} n^2. \qquad \square \]
Example 3.10 (The Fibonacci numbers). The Fibonacci numbers 1, 1, 2, 3, 5, 8, 13, . . . are defined
by the second-order linear difference equation

fn+2 = fn+1 + fn , n ≥ 0, (3.3)

with initial conditions $f_0 = f_1 = 1$.
This is homogeneous, with auxiliary equation $\lambda^2 - \lambda - 1 = 0$. The roots are $\lambda = \frac{1 \pm \sqrt{5}}{2}$, and so
the general solution of (3.3) is given by
\[ f_n = A \left( \frac{1+\sqrt{5}}{2} \right)^n + B \left( \frac{1-\sqrt{5}}{2} \right)^n. \]
Putting in the initial conditions yields the simultaneous equations
\[ 1 = A + B, \qquad 1 = A \, \frac{1+\sqrt{5}}{2} + B \, \frac{1-\sqrt{5}}{2}, \]
which have solution $A = \frac{\sqrt{5}+1}{2\sqrt{5}}$, $B = \frac{\sqrt{5}-1}{2\sqrt{5}}$.
This yields the remarkable result that for $n \geq 0$,
\[ f_n = \frac{\sqrt{5}+1}{2\sqrt{5}} \left( \frac{1+\sqrt{5}}{2} \right)^n + \frac{\sqrt{5}-1}{2\sqrt{5}} \left( \frac{1-\sqrt{5}}{2} \right)^n
      = \frac{1}{\sqrt{5}} \left( \frac{1+\sqrt{5}}{2} \right)^{n+1} - \frac{1}{\sqrt{5}} \left( \frac{1-\sqrt{5}}{2} \right)^{n+1}. \]
Notice that, despite the fact that $\sqrt{5}$ is irrational, this gives an integer for every $n \geq 0$!
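This closed form can be checked directly against the recurrence; here is a minimal sketch in Python (the call to round guards against floating-point error):

from math import sqrt

# Check the closed form against f_{n+2} = f_{n+1} + f_n, f_0 = f_1 = 1.
phi, psi = (1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2

def closed_form(n):
    return (phi ** (n + 1) - psi ** (n + 1)) / sqrt(5)

f = [1, 1]
for n in range(2, 30):
    f.append(f[-1] + f[-2])
assert all(round(closed_form(n)) == f[n] for n in range(30))
print(f[:8])  # [1, 1, 2, 3, 5, 8, 13, 21]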

Example 3.11. Consider the second-order linear difference equation
\[ u_{n+2} - 2u_{n+1} + 4u_n = 0, \quad n \geq 0, \tag{3.4} \]
with initial conditions $u_0 = u_1 = 1$. The auxiliary equation is $\lambda^2 - 2\lambda + 4 = 0$, which has roots
$\lambda = 1 \pm i\sqrt{3}$. So the general solution to (3.4) is
\[ u_n = A(1 + i\sqrt{3})^n + B(1 - i\sqrt{3})^n. \]
Using the initial conditions, we get $A = B = \frac{1}{2}$, and so
\[ u_n = \frac{1}{2}(1 + i\sqrt{3})^n + \frac{1}{2}(1 - i\sqrt{3})^n, \quad n \geq 0. \]
This is, in fact, real for every $n \geq 0$. In order to see this, recall that $1 + i\sqrt{3} = 2e^{i\pi/3}$ and
$1 - i\sqrt{3} = 2e^{-i\pi/3}$. So
\[ u_n = \frac{1}{2}\left(2e^{i\pi/3}\right)^n + \frac{1}{2}\left(2e^{-i\pi/3}\right)^n = 2^n \, \frac{e^{in\pi/3} + e^{-in\pi/3}}{2} = 2^n \cos\left(\frac{n\pi}{3}\right). \]

3.4 Random walks
We return to the gambler’s ruin problem of Example 3.1. The gambler’s fluctuating wealth is an
example of a more general class of random processes called random walks (sometimes the more
evocative phrase drunkard’s walk is used). Imagine a particle moving around a network. At each
step, it can move to one of the other nodes of the network: there are rules determining where the
particle can move to at the next time step from that position and with what probability it moves
to each of the possible new positions. The important point is that these rules only depend on
the current position, not on the earlier positions that the particle has visited. Random walks can
be used to model various real-world situations. For example, the path traced by a molecule as it
moves in a liquid or a gas; the path of an animal searching for food; or the price of a particular
stock every Monday morning. There are various examples on the problem sheets and later in the
course.
Let’s return to the setting of Example 3.1 and solve the recurrence relation we obtained there.
Recall that $u_n = P(\text{bankruptcy})$ if the gambler's initial fortune is £n, and that (rearranging (3.1)),
\[ p u_{n+1} - u_n + q u_{n-1} = 0, \quad 1 \leq n \leq M-1, \tag{3.5} \]
(where $q = 1 - p$), with $u_0 = 1$, $u_M = 0$. This is a homogeneous second-order difference equation.
The auxiliary equation is
\[ p\lambda^2 - \lambda + q = 0, \]
which factorises as
\[ (p\lambda - q)(\lambda - 1) = 0. \]
So $\lambda = \frac{q}{p}$ or $\lambda = 1$. If $p \neq \frac{1}{2}$ then
\[ u_n = A + B \left( \frac{q}{p} \right)^n \]
for some constants A and B which we can find using the boundary conditions:
\[ u_0 = 1 = A + B \quad \text{and} \quad u_M = 0 = A + B \left( \frac{q}{p} \right)^M. \]
These give
\[ A = - \frac{\left(\frac{1-p}{p}\right)^M}{1 - \left(\frac{1-p}{p}\right)^M}, \qquad B = \frac{1}{1 - \left(\frac{1-p}{p}\right)^M} \]
and so
\[ u_n = \frac{\left(\frac{1-p}{p}\right)^n - \left(\frac{1-p}{p}\right)^M}{1 - \left(\frac{1-p}{p}\right)^M}. \]

Exercise 3.12. Check that in the case $p = \frac{1}{2}$ we get
\[ u_n = 1 - \frac{n}{M}, \quad 0 \leq n \leq M. \]
Figure 3.1 shows a simulation of paths in the gambler’s ruin model.
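Paths like those in Figure 3.1 can be generated with a few lines of code. Here is a minimal sketch in Python (the plotting is omitted; each run just reports where and when the walk was absorbed):

import random

# Simulate gambler's-ruin paths: start at n, step +1 with probability p,
# -1 otherwise, and stop on hitting 0 or M (or after max_steps steps).
def ruin_path(n, M, p, max_steps=80):
    path = [n]
    while 0 < path[-1] < M and len(path) <= max_steps:
        path.append(path[-1] + (1 if random.random() < p else -1))
    return path

random.seed(1)
for _ in range(10):
    path = ruin_path(10, 20, 0.4)
    print(f"absorbed/stopped at {path[-1]} after {len(path) - 1} steps")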

Example 3.13. What is the expected number of plays in the gambler’s ruin model before the
gambler’s fortune hits either 0 or M ?

[Figure: simulated random walk paths; position (0 to 20) plotted against time (0 to 80).]

Figure 3.1: 10 simulated paths in the gambler's ruin model, with M = 20, n = 10 and p = 0.4.
We see some get absorbed at 0, one at 20, and two which have not yet reached either boundary
at time 80.

Solution. Just as we used the partition theorem to get a recurrence for the probability of
bankruptcy, we can use the partition theorem for expectations to get a recurrence for the ex-
pected length of the process.
Let the initial fortune be £n, and let X be the number of steps until the walk reaches one of
the barriers at 0 or M . Write en for the expectation of X. Then
en = pE [X|first step is to n + 1] + qE [X|first step is to n − 1] .
Let’s think carefully about the conditional expectations on the right-hand side. If the first step is
to n + 1, then we have already spent one step to get there, and thereafter the number of steps to
reach the boundary is just the time to reach the boundary in a walk starting from n + 1. Hence
we get
E [X|first step is to n + 1] = 1 + en+1 ,
and similarly
E [X|first step is to n − 1] = 1 + en−1 .
So we obtain the recurrence
en = p(1 + en+1 ) + q(1 + en−1 )

which rearranges to give
\[ p e_{n+1} - e_n + q e_{n-1} = -1. \tag{3.6} \]
Our boundary conditions are $e_0 = e_M = 0$. Note that (3.6) has exactly the same form as (3.5),
except that the equation is no longer homogeneous: we have the constant −1 on the right-hand
side instead of 0.
Take the case $p \neq q$. As above, we have the general solution to the homogeneous equation
\[ w_n = A + B \left( \frac{q}{p} \right)^n. \]
For a particular solution to (3.6), try $v_n = Cn$ (note that there's no point trying a constant
since we already know that any constant solves the homogeneous equation). This yields
\[ pC(n+1) - Cn + qC(n-1) = -1 \]
and so $C = -1/(p-q)$. Putting everything together, we get
\[ e_n = A + B \left( \frac{q}{p} \right)^n - \frac{n}{p-q}. \]
Using the boundary conditions, we get
\[ e_0 = 0 = A + B, \qquad e_M = 0 = A + B \left( \frac{q}{p} \right)^M - \frac{M}{p-q}. \]
Solving for A and B, we finally obtain
\[ e_n = \frac{M}{p-q} \cdot \frac{1 - (q/p)^n}{1 - (q/p)^M} - \frac{n}{p-q} \]
for $0 \leq n \leq M$. □

Exercise 3.14. Find en for p = q = 1/2 (the expression is rather simpler in that case!).
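Both formulas are easy to check by Monte Carlo simulation. The following sketch (in Python, with the same parameters as Figure 3.1) estimates the ruin probability and the expected duration and compares them with the expressions derived above:

import random

# Monte Carlo check of u_n (ruin probability) and e_n (expected duration)
# for p = 0.4, M = 20, n = 10.
p, q, M, n = 0.4, 0.6, 20, 10
r = q / p
u_exact = (r**n - r**M) / (1 - r**M)
e_exact = (M / (p - q)) * (1 - r**n) / (1 - r**M) - n / (p - q)

random.seed(0)
trials, ruins, steps = 100_000, 0, 0
for _ in range(trials):
    x, t = n, 0
    while 0 < x < M:
        x += 1 if random.random() < p else -1
        t += 1
    ruins += (x == 0)
    steps += t
print(u_exact, ruins / trials)  # the two values should be close
print(e_exact, steps / trials)  # likewise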
Finally, consider what happens if we remove the upper barrier at M , and instead have a
random walk on the infinite set {0, 1, 2, . . . }, starting from some site n > 0. Does the walk ever
reach the site 0, or does it stay strictly positive for ever? Let's look at the probability of the
event that it hits 0. A natural idea is to let $M \to \infty$ in the finite problem. Write $u_n^{(M)}$ for the
probability of hitting 0 before M, which we calculated above. Then we have
\[ \lim_{M \to \infty} u_n^{(M)}
   = \begin{cases} \lim_{M \to \infty} \dfrac{(q/p)^n - (q/p)^M}{1 - (q/p)^M} & \text{if } p \neq q \\[2ex]
                   \lim_{M \to \infty} \left( 1 - \dfrac{n}{M} \right) & \text{if } p = q = 1/2 \end{cases}
   = \begin{cases} (q/p)^n & \text{if } p > q \\ 1 & \text{if } p \leq q. \end{cases} \]

It turns out that this limit as M → ∞ really does give the appropriate probability that the
random walk on {0, 1, 2, . . . } hits 0. In particular, the walk has positive probability to stay away
from 0 for ever if and only if p > q. There are various ways to prove this; the idea below is not
complicated, but is nonetheless somewhat subtle.
Theorem 3.15. (Non-examinable) Consider a random walk on the integers Z, started from some
n > 0, which at each step increases by 1 with probability p, and decreases by 1 with probability
$q = 1 - p$. Then the probability $u_n$ that the walk ever hits 0 is given by
\[ u_n = \begin{cases} (q/p)^n & \text{if } p > q, \\ 1 & \text{if } p \leq q. \end{cases} \]

Proof. In Proposition A.8 in the Appendix, we prove a useful result about increasing sequences
of events. A sequence of events $A_k$, $k \geq 1$ is called increasing if $A_1 \subseteq A_2 \subseteq A_3 \subseteq \dots$. Then
Proposition A.8 says that for such a sequence of events,
\[ P\left( \bigcup_{k=1}^{\infty} A_k \right) = \lim_{k \to \infty} P(A_k). \]

(This can be regarded as a sort of continuity result for the probability function P.)
To apply this result to the random walk started from n, consider the event H that the random
walk reaches 0, and for each M , consider the event AM that the random walk reaches 0 before
M . If the walk ever reaches 0, then there must be some M such that the walk reaches 0 before
M, so that $A_M$ occurs. Conversely, if any event $A_M$ occurs, then clearly the event H also occurs.
Hence we have $H = \bigcup_M A_M$.
Then indeed we have
\[ u_n = P(H) = P\left( \bigcup_{M=1}^{\infty} A_M \right) = \lim_{M \to \infty} P(A_M) = \lim_{M \to \infty} u_n^{(M)}, \]
as desired.

Chapter 4

Probability generating functions

We’re now going to turn to an extremely powerful tool, not just in calculations but also in proving
more abstract results about discrete random variables.
From now on we consider non-negative integer-valued random variables i.e. X takes values in
{0, 1, 2, . . .}.

Definition 4.1. Let X be a non-negative integer-valued random variable. Let
\[ S := \left\{ s \in \mathbb{R} : \sum_{k=0}^{\infty} |s|^k \, P(X = k) < \infty \right\}. \]
Then the probability generating function (p.g.f.) of X is $G_X : S \to \mathbb{R}$ defined by
\[ G_X(s) = E[s^X] = \sum_{k=0}^{\infty} s^k \, P(X = k). \]

Let us agree to save space by setting

pk = pX (k) = P(X = k).

Notice that because $\sum_{k=0}^{\infty} p_k = 1$, $G_X(s)$ is certainly defined for $|s| \leq 1$ (i.e. $[-1,1] \subseteq S$), and
GX (1) = 1. Notice also that GX (s) is just a real-valued function. The parameter s is the argument
of the function and has nothing to do with X. It plays the same role as x if I write sin x, for
example.1
Why are generating functions so useful? Because they encode all of the information about
the distribution of X in a single function. It will turn out that we can get at this information by
using the tools of calculus.

Theorem 4.2 (Uniqueness theorem). The distribution of X is uniquely determined by its prob-
ability generating function, GX .
¹The probability generating function is an example of a power series, that is a function of the form $f(x) = \sum_{n=0}^{\infty} c_n x^n$. It may be that this sum diverges for some values of x; the radius of convergence is the value r such
that the sum converges if |x| < r and diverges if |x| > r. For a probability generating function, we can see that the
radius of convergence must be at least 1. For the purposes of this course, you are safe to assume that the derivative
of f is well-defined for |x| < r and is given by
\[ f'(x) = \sum_{n=1}^{\infty} n c_n x^{n-1}, \]
i.e. what you would get differentiating term-by-term. Those of you who are doing Analysis I & II will learn more
about power series there.

Proof. First note that $G_X(0) = p_0$. Now, for |s| < 1, we can differentiate $G_X(s)$ term-by-term
to get
\[ G_X'(s) = p_1 + 2p_2 s + 3p_3 s^2 + \cdots. \]
Setting s = 0, we see that $G_X'(0) = p_1$. Similarly, by differentiating repeatedly, we see that
\[ \frac{d^k}{ds^k} G_X(s) \Big|_{s=0} = k! \, p_k. \]
So we can recover $p_0, p_1, \dots$ from $G_X$.

Probability generating functions for common distributions.

1. Bernoulli distribution. X ∼ Ber(p). Then
\[ G_X(s) = \sum_k p_k s^k = q s^0 + p s^1 = q + ps \]
for all $s \in \mathbb{R}$.

2. Binomial distribution. X ∼ Bin(n, p). Then
\[ G_X(s) = \sum_{k=0}^{n} s^k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=0}^{n} \binom{n}{k} (ps)^k (1-p)^{n-k} = (1 - p + ps)^n, \]
by the binomial theorem. This is valid for all $s \in \mathbb{R}$.

3. Poisson distribution. X ∼ Po(λ). Then
\[ G_X(s) = \sum_{k=0}^{\infty} s^k \, \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(s\lambda)^k}{k!} = e^{\lambda(s-1)} \]
for all $s \in \mathbb{R}$.

4. Geometric distribution with parameter p. Exercise on the problem sheet: check that
\[ G_X(s) = \frac{ps}{1 - (1-p)s}, \]
provided that $|s| < \frac{1}{1-p}$.

Theorem 4.3. If X and Y are independent, then

GX+Y (s) = GX (s)GY (s).

Proof. We have
\[ G_{X+Y}(s) = E\left[s^{X+Y}\right] = E\left[s^X s^Y\right]. \]
Since X and Y are independent, $s^X$ and $s^Y$ are independent (see a question on the problem
sheet). So then by Theorem 2.23, this is equal to
\[ E\left[s^X\right] E\left[s^Y\right] = G_X(s) G_Y(s). \]

This can be very useful for proving distributional relationships.

Theorem 4.4. Suppose that X1 , X2 , . . . , Xn are independent Ber(p) random variables and let
Y = X1 + · · · + Xn . Then Y ∼ Bin(n, p).

37
Proof. We have

\[ G_Y(s) = E[s^Y] = E[s^{X_1 + \cdots + X_n}] = E[s^{X_1} \cdots s^{X_n}] = E[s^{X_1}] \cdots E[s^{X_n}] = (1 - p + ps)^n. \]

As Y has the same p.g.f. as a Bin(n, p) random variable, the Uniqueness theorem yields that
Y ∼ Bin(n, p).

The interpretation of this is that Xi tells us whether the ith of a sequence of independent
coin flips is heads or tails (where heads has probability p). Then Y counts the number of heads
in n independent coin flips and so must be distributed as Bin(n, p).
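Theorem 4.3 also has a concrete computational meaning: multiplying p.g.f.s corresponds to convolving probability mass functions. A minimal sketch in Python (using numpy, which is not otherwise assumed in these notes) recovers the Bin(n, p) p.m.f. by convolving n copies of the Ber(p) p.m.f.:

import numpy as np
from math import comb

# Repeatedly multiply by the p.g.f. q + ps, i.e. convolve with (q, p).
n, p = 5, 0.3
ber = np.array([1 - p, p])
pmf = np.array([1.0])            # p.m.f. of the constant 0
for _ in range(n):
    pmf = np.convolve(pmf, ber)

binom = np.array([comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)])
print(np.allclose(pmf, binom))   # True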

Theorem 4.5. Suppose that $X_1, X_2, \dots, X_n$ are independent random variables such that $X_i \sim \text{Po}(\lambda_i)$. Then
\[ \sum_{i=1}^{n} X_i \sim \text{Po}\left( \sum_{i=1}^{n} \lambda_i \right). \]
In particular, if $\lambda_i = \lambda$ for all $1 \leq i \leq n$ then $\sum_{i=1}^{n} X_i \sim \text{Po}(n\lambda)$.

Proof. Recall that $E\left[s^{X_i}\right] = e^{\lambda_i(s-1)}$. By independence,
\[ E\left[s^{X_1 + X_2 + \dots + X_n}\right] = \prod_{i=1}^{n} E\left[s^{X_i}\right] = \prod_{i=1}^{n} e^{\lambda_i(s-1)} = \exp\left( (s-1) \sum_{i=1}^{n} \lambda_i \right). \]
Since this is the p.g.f. of the $\text{Po}\left(\sum_{i=1}^{n} \lambda_i\right)$ distribution and probability generating functions
uniquely determine distributions, the result follows.

4.1 Calculating expectations using probability generating functions

We've already seen that differentiating $G_X(s)$ and setting s = 0 gives us a way to get at the
probability mass function of X. Derivatives at other points can also be useful. We have
\[ G_X'(s) = \frac{d}{ds} E[s^X] = \frac{d}{ds} \sum_{k=0}^{\infty} s^k P(X = k) = \sum_{k=0}^{\infty} \frac{d}{ds} s^k P(X = k) = \sum_{k=0}^{\infty} k s^{k-1} P(X = k) = E[X s^{X-1}]. \]
So
\[ G_X'(1) = E[X] \]
(as long as E[X] exists). Differentiating again, we then similarly get
\[ G_X''(1) = E[X(X-1)] = E[X^2] - E[X], \]
and so, in particular,
\[ \text{var}(X) = G_X''(1) + G_X'(1) - (G_X'(1))^2. \]
In general,
\[ \frac{d^k}{ds^k} G_X(s) \Big|_{s=1} = E[X(X-1) \cdots (X-k+1)]. \]

Example 4.6. Let $Y = X_1 + X_2 + X_3$, where $X_1$, $X_2$ and $X_3$ are independent random variables
each having probability generating function
\[ G(s) = \frac{1}{6} + \frac{s}{3} + \frac{s^2}{2}. \]

1. Find the mean and variance of X1 .

2. What is the p.g.f. of Y ? What is P(Y = 3)?

3. What is the p.g.f. of 3X1 ? Why is it not the same as the p.g.f. of Y ? What is P(3X1 = 3)?

Solution. 1. Differentiating the probability generating function,
\[ G'(s) = \frac{1}{3} + s, \qquad G''(s) = 1, \]
and so $E[X_1] = G'(1) = \frac{4}{3}$ and
\[ \text{var}(X_1) = G''(1) + G'(1) - (G'(1))^2 = 1 + \frac{4}{3} - \frac{16}{9} = \frac{5}{9}. \]

2. Just as in our derivation of the probability generating function for the binomial distribution,
\[ G_Y(s) = E[s^{X_1 + X_2 + X_3}] = E[s^{X_1}] E[s^{X_2}] E[s^{X_3}] \]
and so
\[ G_Y(s) = \left( \frac{1}{6} + \frac{s}{3} + \frac{s^2}{2} \right)^3 = \frac{1}{216} \left( 1 + 6s + 21s^2 + 44s^3 + 63s^4 + 54s^5 + 27s^6 \right). \]
$P(Y = 3)$ is the coefficient of $s^3$ in $G_Y(s)$, that is $\frac{11}{54}$. (As an exercise, calculate $P(Y = 3)$
directly.)

3. We have
\[ G_{3X_1}(s) = E[s^{(3X_1)}] = E[(s^3)^{X_1}] = G_{X_1}(s^3) = \frac{1}{6} + \frac{s^3}{3} + \frac{s^6}{2}. \]
This is different from $G_Y(s)$ because $3X_1$ and Y have different distributions – knowing $X_1$
does not tell you Y, but it does tell you $3X_1$. Finally, $P(3X_1 = 3) = P(X_1 = 1) = \frac{1}{3}$. □
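The expansion of $G_Y(s)$ in part 2 can be verified mechanically by convolving the coefficient vector of G(s) with itself twice. A minimal sketch in Python, using exact fractions:

from fractions import Fraction

def mul(a, b):  # multiply two polynomials given as coefficient lists
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

g = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]   # G(s)
gy = mul(mul(g, g), g)                                 # G_Y(s) = G(s)^3
print(gy)     # [1/216, 1/36, 7/72, 11/54, 7/24, 1/4, 1/8]
print(gy[3])  # 11/54 = P(Y = 3)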

Of course, for each fixed s ∈ R, sX is itself a discrete random variable. So we can use the law
of total probability when calculating its expectation.

Example 4.7. Suppose that there are n red balls, n white balls and 1 blue ball in an urn. A ball
is selected at random and then replaced. Let X be the number of red balls selected before a blue
ball is chosen. Find

(a) the probability generating function of X,

(b) E [X],

(c) var (X).

Solution. (a) We will use the law of total probability for expectations. Let R be the event that
the first ball is red, W be the event that the first ball is white and B be the event that the first
ball is blue. Then
\[ G_X(s) = E\left[s^X\right] = E\left[s^X | R\right] P(R) + E\left[s^X | W\right] P(W) + E\left[s^X | B\right] P(B). \]

Of course, the value of X is affected by the first ball which is picked. If the first ball is blue then
we know that X = 0. If the first ball is white, we learn nothing about the value of X. If the first

ball is red then effectively we start over again counting numbers of red balls, but we add 1 for
the red ball we have already seen. This yields
\[ G_X(s) = E\left[s^{1+X}\right] P(R) + E\left[s^X\right] P(W) + E\left[s^0\right] P(B)
         = \frac{n}{2n+1} s G_X(s) + \frac{n}{2n+1} G_X(s) + \frac{1}{2n+1} \]
and so
\[ G_X(s) = \frac{1}{n + 1 - ns} = \frac{1/(n+1)}{1 - (1 - 1/(n+1))s}. \]
(b) Differentiating, we get
\[ G_X'(s) = \frac{n}{(n + 1 - ns)^2} \]
and so
\[ E[X] = G_X'(1) = n. \]
(c) Recall that
\[ \text{var}(X) = G_X''(1) + G_X'(1) - (G_X'(1))^2. \]
Differentiating the p.g.f. again we get
\[ G_X''(s) = \frac{2n^2}{(n + 1 - ns)^3} \]
and so $G_X''(1) = 2n^2$. Hence,
\[ \text{var}(X) = 2n^2 + n - n^2 = n(n+1). \]
If we were just asked for E[X] it would be easier to calculate
\[ E[X] = E[X|R] P(R) + E[X|W] P(W) + E[X|B] P(B)
        = \frac{n}{2n+1}(1 + E[X]) + \frac{n}{2n+1} E[X] + 0 \cdot \frac{1}{2n+1}, \]
which solves to give E[X] = n. In order to calculate var(X), however, we need both E[X] and $E[X^2]$ and so it's easier just to
find $G_X(s)$ and differentiate it. □


Theorem 4.8. Let $X_1, X_2, \dots$ be i.i.d. non-negative integer-valued random variables with p.g.f.
$G_X(s)$. Let N be another non-negative integer-valued random variable, independent of $X_1, X_2, \dots$
and with p.g.f. $G_N(s)$. Then the p.g.f. of $\sum_{i=1}^{N} X_i$ is $G_N(G_X(s))$.

Notice that the sum $\sum_{i=1}^{N} X_i$ has a random number of terms. We interpret it as 0 if N = 0.
Proof. We partition according to the value of N: we have
\begin{align*}
E\left[s^{X_1 + \cdots + X_N}\right]
&= \sum_{n=0}^{\infty} E\left[s^{X_1 + \cdots + X_N} \,\big|\, N = n\right] P(N = n) && \text{by the law of total probability} \\
&= \sum_{n=0}^{\infty} E\left[s^{X_1 + \cdots + X_n} \,\big|\, N = n\right] P(N = n) \\
&= \sum_{n=0}^{\infty} E\left[s^{X_1 + \cdots + X_n}\right] P(N = n) && \text{by the independence of } N \text{ and } \{X_1, X_2, \dots\} \\
&= \sum_{n=0}^{\infty} E\left[s^{X_1}\right] \cdots E\left[s^{X_n}\right] P(N = n) && \text{since } X_1, X_2, \dots \text{ are independent} \\
&= \sum_{n=0}^{\infty} (G_X(s))^n P(N = n) \\
&= G_N(G_X(s)).
\end{align*}
Corollary 4.9. Suppose that $X_1, X_2, \dots$ are independent and identically distributed Ber(p) random variables and that N ∼ Po(λ), independently of $X_1, X_2, \dots$. Then $\sum_{i=1}^{N} X_i \sim \text{Po}(\lambda p)$.

(Notice that we saw this result in disguise via a totally different method in a problem sheet
question.)

Proof. We have $G_X(s) = 1 - p + ps$ and $G_N(s) = \exp(\lambda(s-1))$ and so by Theorem 4.8,
\[ E\left[s^{\sum_{i=1}^{N} X_i}\right] = G_N(G_X(s)) = \exp(\lambda(1 - p + ps - 1)) = \exp(\lambda p (s-1)). \]

Since this is the p.g.f. of Po(λp) and p.g.f.’s uniquely determine distributions, the result follows.

Example 4.10. In a short fixed time period, a photomultiplier detects 0, 1 or 2 photons with
probabilities $\frac{1}{2}$, $\frac{1}{3}$ and $\frac{1}{6}$ respectively. The photons detected by the photomultiplier cause it to give
off a charge of 2, 3, 4 or 5 electrons (with equal probability) independently for every one photon
originally detected. What is the probability generating function of the number of electrons given
off in the time period? What is the probability that exactly five electrons are given off in that
period?

Solution. Let N be the number of photons detected. Then the probability generating function
of N is
\[ G_N(s) = \frac{1}{2} + \frac{1}{3} s + \frac{1}{6} s^2. \]
Let $X_i$ be the number of electrons given off by the ith photon detected. Then $Y = X_1 + \cdots + X_N$
is the total number given off in the period (remember that N here is random). Now $G_X(s) = \frac{1}{4}(s^2 + s^3 + s^4 + s^5)$ and so, by Theorem 4.8,
\begin{align*}
G_Y(s) &= G_N(G_X(s)) = \frac{1}{2} + \frac{1}{3} G_X(s) + \frac{1}{6} (G_X(s))^2 \\
&= \frac{1}{2} + \frac{1}{12} s^2 + \frac{1}{12} s^3 + \frac{1}{12} s^4 + \frac{1}{12} s^5 + \frac{1}{96} (s^4 + 2s^5 + 3s^6 + 4s^7 + 3s^8 + 2s^9 + s^{10}).
\end{align*}
The probability that five electrons are given off is the coefficient of $s^5$, that is $\frac{5}{48}$. □
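The same polynomial arithmetic verifies this answer. A minimal sketch in Python, composing $G_N$ with $G_X$ exactly:

from fractions import Fraction

def mul(a, b):  # polynomial multiplication on coefficient lists
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def add(a, b):  # polynomial addition, padding the shorter list
    m = max(len(a), len(b))
    a = a + [Fraction(0)] * (m - len(a))
    b = b + [Fraction(0)] * (m - len(b))
    return [x + y for x, y in zip(a, b)]

gx = [Fraction(0)] * 2 + [Fraction(1, 4)] * 4        # (s^2+s^3+s^4+s^5)/4
gy = add([Fraction(1, 2)],
         add([c / 3 for c in gx], [c / 6 for c in mul(gx, gx)]))
print(gy[5])  # 5/48, the probability of exactly five electrons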

4.2 Branching processes


A really nice illustration of the power of probability generating functions is in the study of
branching processes.
Suppose we have a population (say of bacteria). Each individual in the population lives a unit
time and, just before dying, gives birth to a random number of children in the next generation.
This number of children has probability mass function p(i), i ≥ 0, called the offspring distribution.
Different individuals reproduce independently in the same manner. Here is a possible family tree
of such a population:

[Figure: an example family tree of such a population.]
We start at the top of the tree, with a single individual in generation 0. Then there are 3
individuals in generations 1 and 2, 5 individuals in generation 3, a single individual in generation
4 and no individuals in subsequent generations.
Let $X_n$ be the size of the population in generation n, so that $X_0 = 1$. Let $C_i^{(n)}$ be the number
of children of the ith individual in generation $n \geq 0$, so that we may write
\[ X_{n+1} = C_1^{(n)} + C_2^{(n)} + \cdots + C_{X_n}^{(n)}. \]
(We interpret this sum as 0 if $X_n = 0$.) Note that $C_1^{(n)}, C_2^{(n)}, \dots$ are independent and identically
distributed. Let $G(s) = \sum_{i=0}^{\infty} p(i) s^i$ and let $G_n(s) = E\left[s^{X_n}\right]$.
Theorem 4.11. For $n \geq 0$,
\[ G_{n+1}(s) = G_n(G(s)) = \underbrace{G(G(\dots G(s) \dots))}_{n+1 \text{ times}} = G(G_n(s)). \]

Proof. Since $X_0 = 1$, we have $G_0(s) = s$. Also, we get $X_1 = C_1^{(0)}$ which has p.m.f. p(i), $i \geq 0$.
So $G_1(s) = E\left[s^{X_1}\right] = G(s)$. Since
\[ X_{n+1} = \sum_{i=1}^{X_n} C_i^{(n)}, \]
by Theorem 4.8 we get
\[ G_{n+1}(s) = E\left[s^{X_{n+1}}\right] = E\left[s^{\sum_{i=1}^{X_n} C_i^{(n)}}\right] = G_n(G(s)). \]
Hence, by induction, for $n \geq 1$,
\[ G_{n+1}(s) = \underbrace{G(G(\dots G(s) \dots))}_{n+1 \text{ times}} = G(G_n(s)). \]

Corollary 4.12. Suppose that the mean number of children of a single individual is µ, i.e.
$\sum_{i=1}^{\infty} i \, p(i) = \mu$. Then
\[ E[X_n] = \mu^n. \]

Proof. We have $E[X_n] = G_n'(1)$. By the chain rule,
\[ G_n'(s) = \frac{d}{ds} G(G_{n-1}(s)) = G_{n-1}'(s) \, G'(G_{n-1}(s)). \]
Plugging in s = 1, we get $E[X_n] = E[X_{n-1}] G'(1) = E[X_{n-1}] \mu$ and, inductively, $E[X_n] = \mu^n$.

In particular, notice that we get exponential growth on average if µ > 1 and exponential
decrease if µ < 1. This raises an interesting question: can the population die out? If p(0) = 0
then every individual has at least one child and so the population clearly grows forever. If
p(0) > 0, on the other hand, then the population dies out with positive probability because

P (population dies out) = P (∪∞


n=1 {Xn = 0}) ≥ P (X1 = 0) = p(0) > 0.

(Notice that this holds even in the cases where E [Xn ] grows as n → ∞ !)
Example 4.13. Suppose that $p(i) = (1/2)^{i+1}$, $i \geq 0$, so that each individual has a geometric
number of offspring. What is the distribution of $X_n$?

Solution. First calculate
\[ G(s) = \sum_{k=0}^{\infty} s^k \left( \frac{1}{2} \right)^{k+1} = \frac{1}{2-s}. \]
By plugging this into itself a couple of times, we get
\[ G_2(s) = \frac{2-s}{3-2s}, \qquad G_3(s) = \frac{3-2s}{4-3s}. \]
A natural guess is that $G_n(s) = \frac{n - (n-1)s}{(n+1) - ns}$, which is, in fact, the case, as can be proved by induction.
If we want the probability mass function of $X_n$, we need to expand this quantity out in powers
of s. We have
\[ \frac{1}{(n+1) - ns} = \frac{1}{n+1} \cdot \frac{1}{1 - ns/(n+1)} = \sum_{k=0}^{\infty} \frac{n^k s^k}{(n+1)^{k+1}}. \]
Multiplying by $n - (n-1)s$, we get
\[ G_n(s) = \sum_{k=0}^{\infty} \frac{n^{k+1} s^k}{(n+1)^{k+1}} - \sum_{k=1}^{\infty} \frac{n^{k-1}(n-1) s^k}{(n+1)^k} = \frac{n}{n+1} + \sum_{k=1}^{\infty} \frac{n^{k-1} s^k}{(n+1)^{k+1}}. \]
We can read off the coefficients now to see that
\[ P(X_n = k) = \begin{cases} \frac{n}{n+1} & \text{if } k = 0 \\ \frac{n^{k-1}}{(n+1)^{k+1}} & \text{if } k \geq 1. \end{cases} \]
Notice that $P(X_n = 0) \to 1$ as $n \to \infty$, which indicates that the population dies out eventually
in this case. □

4.2.1 Extinction probability (non-examinable)


Let's return to the general case for the moment and let q = P(population dies out). We can call
q the extinction probability of the branching process. We can find an equation satisfied by q by
conditioning on the number of children of the first individual:
\[ q = \sum_{k=0}^{\infty} P(\text{population dies out} \,|\, X_1 = k) \, P(X_1 = k) = \sum_{k=0}^{\infty} P(\text{population dies out} \,|\, X_1 = k) \, p(k). \]
Now remember that each of the k individuals in the first generation behaves exactly like the parent.
In particular, we can think of each of them starting its own family, which is an independent copy
of the original family. Moreover, the whole population dies out if and only if all of these sub-
populations die out. If we had k families, this occurs with probability $q^k$. So
\[ q = \sum_{k=0}^{\infty} q^k p(k) = G(q). \tag{4.1} \]

The equation q = G(q) doesn’t quite enable us to determine q: notice that 1 is always a solution,
but it’s not necessarily the only solution in [0, 1].
Using Proposition A.8 about increasing sequences of events (see Appendix), we have
\[ q = P\left( \bigcup_n \{X_n = 0\} \right) = \lim_{n \to \infty} P(X_n = 0) = \lim_{n \to \infty} G_n(0). \tag{4.2} \]

Theorem 4.14. The extinction probability q is the smallest non-negative solution of
\[ x = G(x). \tag{4.3} \]

Proof. From (4.1) we know that q solves (4.3). Suppose some $r \geq 0$ also solves (4.3). We
claim that in that case, $G_n(0) \leq r$ for all $n \geq 0$. In that case we are done, since then also
$q = \lim_{n \to \infty} G_n(0) \leq r$, and so indeed q is smaller than any other solution of (4.3).
We use induction to prove the claim that $G_n(0) \leq r$ for all n. For the base case n = 0, we
have $G_0(0) = 0 \leq r$ as required.
For the induction step, suppose that $G_{n-1}(0) \leq r$. Now notice that the generating function
$G(s) = \sum_{k=0}^{\infty} p(k) s^k$ is a non-decreasing function for $s \geq 0$. Hence
\[ G_n(0) = G(G_{n-1}(0)) \leq G(r) = r, \]
as required. This completes the proof.
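Theorem 4.14 together with (4.2) suggests a practical way to compute q: iterate G starting from 0, since $G_n(0)$ increases to q. A minimal sketch in Python, for an illustrative offspring distribution not taken from the notes (p(0) = 1/4, p(2) = 3/4, so µ = 3/2 > 1 and the smallest fixed point of G(s) = 1/4 + 3s²/4 is q = 1/3):

# Fixed-point iteration s -> G(s) started at 0 converges to the
# extinction probability q (the smallest non-negative root of s = G(s)).
def G(s):
    return 0.25 + 0.75 * s * s

s = 0.0
for _ in range(100):
    s = G(s)
print(s)  # approximately 1/3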

It turns out that the question of whether the branching process inevitably dies out is determined by the mean number of children of a single individual. To avoid a trivial case, we assume
in the next result that p(1) ≠ 1. (If p(1) = 1 then $X_n = 1$ with probability 1, for all n.) Then we
find that there is a positive probability of survival of the process for ever if and only if µ > 1.

Theorem 4.15. Assume p(1) ≠ 1. Then q = 1 if µ ≤ 1, and q < 1 if µ > 1.

Proof. Note first that there’s a quick argument for the case where µ is strictly less than 1. Note
that as Xn takes non-negative integer values,

\[ P(X_n > 0) \leq E[X_n] \]
(since $P(X_n > 0) = \sum_{k \geq 1} P(X_n = k) \leq \sum_{k \geq 1} k \, P(X_n = k) = E[X_n]$).
But from Corollary 4.12, we have E [Xn ] = µn . Hence P (Xn > 0) → 0 as n → ∞, and so
from (4.2), we get q = 1.
Now we give a more general argument that also covers the cases µ = 1 and µ > 1. First,
observe that the gradient $G'(s) = \sum_{k=1}^{\infty} k \, p(k) s^{k-1}$ is non-decreasing for $s \geq 0$ (and, indeed,
strictly increasing unless $p_0 + p_1 = 1$). That is, G is convex.
It is instructive to consider the consequences of convexity on the graph of y = G(x), on the
interval x ∈ [0, 1], and particularly near x = 1. The graph passes through the points (0, p0 ) and
(1, 1), and at (1, 1) its slope is µ = G0 (1).
We have the following two cases:

(1) Suppose µ > 1. Since the gradient of the curve y = G(x) is more than 1 at x = 1, and
the curve starts on the non-negative y-axis at x = 0, it must cross the line y = x at some
x ∈ [0, 1). Hence indeed the smallest non-negative fixed point q of G is less than 1. This is
illustrated on the left side of Figure 4.1.

[Figure: two plots of y = G(x) against y = x on [0, 1], each curve starting at (0, p_0) and ending at (1, 1).]

Figure 4.1: On the left, the case µ > 1; on the right, the case µ ≤ 1.

(2) Suppose µ ≤ 1. The gradient at 1 is at most 1, and in fact the gradient is strictly less than 1
for all x ∈ [0, 1). (We excluded the case p1 = 1 for which the gradient is 1 everywhere.) Now
the function y = G(x) must stay above the line y = x throughout [0, 1). So the smallest
non-negative fixed point q of G is 1. This is illustrated on the right side of Figure 4.1.

Chapter 5

Continuous random variables

5.1 Random variables and cumulative distribution functions


Recall that we defined a discrete random variable on a probability space (Ω, F, P) to be a function
X : Ω → R such that X can only take countably many values (and such that we can assign a
probability to the event {X = x}, i.e. such that {X = x} ∈ F). There is, however, a more
general notion. The essential idea is that a random variable can be any (sufficiently nice) function
X : Ω → R, which represents some sort of observable quantity in our random experiment.
Why do we need more general random variables?
• Some outcomes are essentially continuous. In particular, many physical quantities are
most naturally modelled as taking uncountably many possible values, for example, lengths,
weights and speeds.

• Even for discrete quantities, it is often useful to think instead in terms of continuous ap-
proximations. For example, suppose you wish to consider the number of working adults
who regularly contribute to charity. You might model this number as X ∈ {0, 1, . . . , n},
where n is the total number of working adults in the UK. We could, in theory, model this as
a Bin(n, p) random variable where p = P(adult contributes). But n is measured in millions.
So instead model Y ≈ X/n as a continuous random variable taking values in [0, 1] and giving
the proportion of adults who contribute.
To give a concrete example of a random variable which is not discrete, imagine you have a
board game spinner. You spin the arrow and it lands pointing at an angle somewhere between
0 and 2π in such a way that every angle is equally likely; we want to model this angle as a
random variable X. How can we describe its distribution? We can’t assign a positive probability
to each angle – our probabilities wouldn’t sum to 1. To get around this, we don’t define the
probability of individual sample points, but only of certain natural events. For example, by
symmetry we expect that P (X ≤ π) = 1/2. More generally, we expect the probability that X lies
in an interval $[a,b] \subseteq [0, 2\pi)$ to be proportional to the length of that interval: $P(X \in [a,b]) = \frac{b-a}{2\pi}$,
$0 \leq a < b < 2\pi$.
Definition 5.1. A random variable X defined on a probability space (Ω, F, P) is a function
X : Ω → R such that {ω : X(ω) ≤ x} ∈ F for each x ∈ R.
Let’s just check that this includes our earlier definition. If X is a discrete random variable
then
\[ \{\omega : X(\omega) \leq x\} = \bigcup_{y \leq x,\, y \in \text{Im}X} \{\omega : X(\omega) = y\}. \]

Since ImX is countable, this is a countable union of events in F and, therefore, itself belongs to
F.

Of course, {ω : X(ω) ≤ x} ∈ F means precisely that we can assign a probability to this event.
The collection of these probabilities as x varies in R will play a central part in what follows.

Definition 5.2. The cumulative distribution function (c.d.f.) of a random variable X is the
function FX : R → [0, 1] defined by

FX (x) = P (X ≤ x) .

Example 5.3. Let X be the number of heads obtained in three tosses of a fair coin. Then
$P(X = 0) = \frac{1}{8}$, $P(X = 1) = P(X = 2) = \frac{3}{8}$ and $P(X = 3) = \frac{1}{8}$. So
\[ F_X(x) = \begin{cases}
0 & \text{if } x < 0 \\
\frac{1}{8} & \text{if } 0 \leq x < 1 \\
\frac{1}{8} + \frac{3}{8} = \frac{1}{2} & \text{if } 1 \leq x < 2 \\
\frac{1}{8} + \frac{3}{8} + \frac{3}{8} = \frac{7}{8} & \text{if } 2 \leq x < 3 \\
1 & \text{if } x \geq 3.
\end{cases} \]

[Figure: the graph of F_X, a step function with jumps at x = 0, 1, 2, 3.]

Example 5.4. Let X be the angle of the board game spinner. Then
\[ F_X(x) = \begin{cases}
0 & \text{if } x < 0, \\
\frac{x}{2\pi} & \text{if } 0 \leq x < 2\pi, \\
1 & \text{if } x \geq 2\pi.
\end{cases} \]

We can immediately write down some properties of the c.d.f. FX corresponding to a general
random variable X.

Theorem 5.5. 1. FX is non-decreasing.

2. P (a < X ≤ b) = FX (b) − FX (a) for a < b.

3. As x → −∞, FX (x) → 0.

4. As x → ∞, FX (x) → 1.

Proof. 1. If a < b then {ω : X(ω) ≤ a} ⊆ {ω : X(ω) ≤ b} and so

FX (a) = P (X ≤ a) ≤ P (X ≤ b) = FX (b).

2. Since {X ≤ a} is a subset of {X ≤ b},

P (a < X ≤ b) = P ({X ≤ b} \ {X ≤ a}) = P (X ≤ b) − P (X ≤ a) = FX (b) − FX (a).

3 & 4. (sketch) Intuitively, we want to put “FX (−∞) = P (X ≤ −∞)” and then, since X can’t
possibly be −∞ (or less!), the only sensible interpretation we could give the right-hand side would
be 0. Likewise, we would like to put “FX (∞) = P (X ≤ ∞)” and, since X cannot be larger than
∞, the only sensible interpretation we could give the right-hand side would be 1. The problem
is that ∞ and −∞ aren’t real numbers, but FX is a function on R. The only sensible way to
deal with this problem is by taking limits and to do this carefully involves using the countable
additivity axiom P3 in a somewhat intricate way.

Conversely, any function F satisfying conditions 1, 3 and 4 of Theorem 5.5 plus right-continuity
is the cumulative distribution function of some random variable defined on some probability space,
although we will not prove this fact.
As you can see from the coin-tossing example, FX need not be a smooth function. Indeed,
for a discrete random variable, FX is always a step function. However, in the rest of this chapter,
we’re going to concentrate on the case where FX is very smooth in that it has a derivative (except
possibly at a collection of isolated points).

Definition 5.6. A continuous random variable X is a random variable whose c.d.f. satisfies
\[ F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(u) \, du, \]
where $f_X : \mathbb{R} \to \mathbb{R}$ is a function such that

(a) $f_X(u) \geq 0$ for all $u \in \mathbb{R}$;

(b) $\int_{-\infty}^{\infty} f_X(u) \, du = 1$.

$f_X$ is called the probability density function (p.d.f.) of X or, sometimes, just its density.

Remark 5.7. The definition of a continuous random variable leaves implicit which functions fX
might possibly serve as a probability density function. Part of this is a more fundamental question
concerning which functions we are allowed to integrate (and for some of you, that will be resolved
in the Analysis III course in Trinity Term and in Part A Integration). For the purposes of this
course, you may assume that fX is a function which has at most countably many jumps and is
smooth everywhere else. Indeed, in almost all of the examples we will consider, fX will have 0, 1
or 2 jumps.

Remark 5.8. The Fundamental Theorem of Calculus (which some of you will see proved in
Analysis III) tells us that $F_X$ of the form given in the definition is differentiable with
\[ \frac{dF_X(x)}{dx} = f_X(x) \]
at any point x where $f_X$ is continuous.

Example 5.9. Suppose that X has c.d.f.
\[ F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-x} & \text{if } x \geq 0. \end{cases} \]
Consider
\[ f(x) = \begin{cases} 0 & \text{for } x < 0 \\ e^{-x} & \text{for } x \geq 0. \end{cases} \]
Then
\[ \int_{-\infty}^{x} f(u) \, du = \begin{cases} 0 & \text{if } x < 0 \\ \int_0^x e^{-u} \, du = 1 - e^{-x} & \text{if } x \geq 0, \end{cases} \]
and so X is a continuous random variable with density $f_X(x) = f(x)$. Notice that $f_X(0) = 1$ and
so $f_X$ has a jump at x = 0. On the other hand, $F_X$ is continuous at 0, but it isn't differentiable there.
To see this, if we approach 0 from the right, $F_X$ has gradient tending to 1; if we approach 0 from
the left, $F_X$ has gradient 0 and, since these don't agree, there isn't a well-defined derivative. On
the other hand, everywhere apart from 0 we do have $F_X'(x) = f_X(x)$.

Example 5.10. Suppose that a continuous random variable X has p.d.f.
\[ f_X(x) = \begin{cases} c x^2 (1-x) & \text{for } x \in [0,1] \\ 0 & \text{otherwise.} \end{cases} \]
Find the constant c and an expression for the c.d.f.

Solution. To find the constant c, note that we must have
\[ 1 = \int_{-\infty}^{\infty} f_X(x) \, dx = \int_0^1 c x^2 (1-x) \, dx = c \left[ \frac{x^3}{3} - \frac{x^4}{4} \right]_0^1 = \frac{c}{12}. \]
It follows that c = 12. To find the c.d.f., we simply integrate:
\[ F_X(x) = \int_{-\infty}^{x} f_X(u) \, du = \begin{cases} 0 & \text{for } x < 0 \\ \int_0^x 12 u^2 (1-u) \, du & \text{for } 0 \leq x < 1 \\ 1 & \text{for } x \geq 1. \end{cases} \]
Since
\[ \int_0^x 12 u^2 (1-u) \, du = 12 \left( \frac{x^3}{3} - \frac{x^4}{4} \right), \]
we get
\[ F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 4x^3 - 3x^4 & \text{for } 0 \leq x < 1 \\ 1 & \text{for } x \geq 1. \end{cases} \]
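A quick numerical check of this calculation, as a minimal sketch in Python (a simple left Riemann sum, so the tolerance is deliberately loose):

# Check that c = 12 normalises the density and that F_X(x) = 4x^3 - 3x^4.
N = 100_000
dx = 1.0 / N
total = sum(12 * (k * dx) ** 2 * (1 - k * dx) * dx for k in range(N))
assert abs(total - 1.0) < 1e-3
x = 0.3
F = sum(12 * (k * dx) ** 2 * (1 - k * dx) * dx for k in range(int(x * N)))
print(F, 4 * x**3 - 3 * x**4)  # both approximately 0.0837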

Example 5.11. The duration in minutes of mobile phone calls made by students is modelled by
a random variable, X, with p.d.f.
\[ f_X(x) = \begin{cases} \frac{1}{6} e^{-x/6} & \text{if } x \geq 0 \\ 0 & \text{otherwise.} \end{cases} \]
What is the probability that a call lasts

(i) between 3 and 6 minutes?

(ii) more than 6 minutes?

Solution. (i)
\[ P(3 < X \leq 6) = \int_3^6 f_X(x) \, dx = \int_3^6 \frac{1}{6} e^{-x/6} \, dx = e^{-\frac{1}{2}} - e^{-1}. \]
(ii)
\[ P(X > 6) = \int_6^{\infty} f_X(x) \, dx = \int_6^{\infty} \frac{1}{6} e^{-x/6} \, dx = e^{-1}. \qquad \square \]
We often use the p.d.f. of a continuous random variable analogously to the way we used the
p.m.f. of a discrete random variable. There are several similarities between the two:

Probability density function (continuous)          Probability mass function (discrete)
$f_X(x) \geq 0$ for all $x \in \mathbb{R}$         $p_X(x) \geq 0$ for all $x \in \mathbb{R}$
$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$           $\sum_{x \in \text{Im}X} p_X(x) = 1$
$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$           $F_X(x) = \sum_{u \leq x,\, u \in \text{Im}X} p_X(u)$

However, the analogy can be misleading. For example, there’s nothing to prevent fX (x)
exceeding 1.

WARNING: fX (x) IS NOT A PROBABILITY.

Suppose that $\epsilon > 0$ is small. Then, by Taylor's theorem,
\[ P(x < X \leq x + \epsilon) = F_X(x + \epsilon) - F_X(x) \approx f_X(x) \epsilon. \]
So $f_X(x)\epsilon$ is approximately the probability that X falls between x and $x + \epsilon$ (or, indeed, between
$x - \epsilon$ and x). What happens as $\epsilon \to 0$?

Theorem 5.12. If X is a continuous random variable with p.d.f. $f_X$ then
\[ P(X = x) = 0 \quad \text{for all } x \in \mathbb{R} \]
and
\[ P(a \leq X \leq b) = \int_a^b f_X(x) \, dx \quad \text{for all } a \leq b. \]

Proof. (Non-examinable.) We argue by contradiction. Suppose that for some $x \in \mathbb{R}$ we have
$P(X = x) > 0$. Let $p = P(X = x)$. Then for all $n \geq 1$, $P(x - 1/n < X \leq x) \geq p$. We have
$P(x - 1/n < X \leq x) = F_X(x) - F_X(x - 1/n)$ and so $F_X(x) - F_X(x - 1/n) \geq p$ for all $n \geq 1$. But
$F_X$ is continuous at x and so
\[ \lim_{n \to \infty} (F_X(x) - F_X(x - 1/n)) = 0. \]
This gives a contradiction. So we must have $P(X = x) = 0$.
Finally, $P(a \leq X \leq b) = P(X = a) + P(a < X \leq b)$ and so, since $P(X = a) = 0$, we get
\[ P(a \leq X \leq b) = \int_a^b f_X(x) \, dx. \]

So for a continuous r.v. X, the probability of getting any fixed value x is 0! Why doesn’t this
break our theory of probability? We have
[
{ω : X(ω) ≤ x} = {ω : X(ω) = y}
y≤x

and the right-hand side is an uncountable union of disjoint events of probability 0. If the union
were countable, this would entail that the left-hand side had probability 0 also, which wouldn’t
make much sense. But because the union is uncountable, we cannot expect to “sum up” these
zeros in order to get the probability of the left-hand side. The right way to resolve this problem
is using a probability density function.

Remark 5.13. There do exist random variables which are neither discrete nor continuous. To
give a slightly artificial example, suppose that we flip a fair coin. If it comes up heads, sample U
uniformly from [0, 1] and set X to be the value obtained; if it comes up tails, let X = 1/2. Then
X can take uncountably many values but does not have a density. Indeed, as you can check,
\[ P(X \leq x) = \begin{cases} \frac{x}{2} & \text{if } 0 \leq x < 1/2 \\ \frac{x+1}{2} & \text{if } 1/2 \leq x \leq 1, \end{cases} \]

and there does not exist a function fX which integrates to give this.
The theory is particularly nice in the discrete and continuous cases because we can work
with probability mass functions and probability density functions respectively. But the cumulative
distribution function is a more general concept which makes sense for all random variables.

5.2 Some classical distributions


As we did for discrete distributions, we introduce a stock of examples of continuous distributions
which will come up time and again in this course.

1. The uniform distribution. X has the uniform distribution on an interval [a, b] if it has
p.d.f.
\[ f_X(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b, \\ 0 & \text{otherwise.} \end{cases} \]
We write X ∼ U[a, b].

2. The exponential distribution. X has the exponential distribution with parameter λ > 0
if it has p.d.f.
\[ f_X(x) = \lambda e^{-\lambda x}, \quad x \geq 0. \]
We write X ∼ Exp(λ). The exponential distribution is often used to model lifetimes or
the time elapsing between unpredictable events (such as telephone calls, arrivals of buses,
earthquakes, emissions of radioactive particles, etc).

3. The gamma distribution. X has the gamma distribution with parameters α > 0 and
λ > 0 if it has p.d.f.
\[ f_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}, \quad x \geq 0. \]
Here, Γ(α) is the so-called gamma function, which is defined by
\[ \Gamma(\alpha) = \int_0^{\infty} u^{\alpha - 1} e^{-u} \, du \]
for α > 0. For most values of α this integral does not have a closed form. However, for a
strictly positive integer n, we have Γ(n) = (n − 1)!. (See the Wikipedia "Gamma function"
page for lots more information about this fascinating function!)
If X has the above p.d.f. we write X ∼ Gamma(α, λ). The gamma distribution is a
generalisation of the exponential distribution and possesses many nice properties. The Chi-
squared distribution with d degrees of freedom, $\chi^2_d$, which you may have seen at 'A' Level, is
the same as Gamma(d/2, 1/2) for d ∈ N.

4. The normal (or Gaussian) distribution. X has the normal distribution with parameters µ ∈ R and $\sigma^2 > 0$ if it has p.d.f.
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \quad x \in \mathbb{R}. \]

We write X ∼ N(µ, σ 2 ). The standard normal distribution is N(0, 1). The normal distri-
bution is used to model all sorts of characteristics of large populations and samples. Its
fundamental importance across Probability and Statistics is a consequence of the Central
Limit Theorem, which you will use in Prelims Statistics and see proved in Part A Proba-
bility.

Exercise 5.14. For the uniform and exponential distributions:

• Check that for each of these $f_X$ really is a p.d.f. (i.e. that it is non-negative and integrates
to 1).

• Calculate the corresponding c.d.f.'s.

Example 5.15. Show that
\[ I := \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx = 1. \]

Solution. We first change variables in the integral. Set $z = (x - \mu)/\sigma$. Then
\[ I = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) dz. \]
It follows that
\[ I^2 = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) dx \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right) dy
      = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi} \exp\left( -\frac{x^2 + y^2}{2} \right) dx \, dy. \]
Now convert to polar co-ordinates: let r and θ be such that $x = r\cos\theta$ and $y = r\sin\theta$. Then the
Jacobian is |J| = r and so we get
\[ \int_0^{2\pi} \int_0^{\infty} \frac{1}{2\pi} \, r \exp\left( -\frac{r^2}{2} \right) dr \, d\theta = \left[ -e^{-r^2/2} \right]_0^{\infty} = 1. \]
Since I is clearly non-negative (it's the integral of a non-negative function), we must have I = 1. □

The c.d.f. of the standard normal distribution,
\[ F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du, \]
cannot be written in a closed form, but can be found by numerical integration to an arbitrary
degree of accuracy. This very important function is usually called Φ and if you did some Statistics
at ‘A’ Level you will certainly have come across tables of its values.

5.3 Expectation
Recall that for a discrete r.v. we defined
\[ E[X] = \sum_{x \in \text{Im}X} x \, p_X(x) \tag{5.1} \]
whenever the sum is absolutely convergent and, more generally, for any function $h : \mathbb{R} \to \mathbb{R}$, we
proved that
\[ E[h(X)] = \sum_{x \in \text{Im}X} h(x) \, p_X(x) \tag{5.2} \]

whenever this sum is absolutely convergent. We want to make an analogous definition for con-
tinuous random variables. Suppose X has a smooth p.d.f. fX . Then for any x and small δ > 0,

P (x ≤ X ≤ x + δ) ≈ fX (x)δ

and, in particular,
P (nδ ≤ X ≤ (n + 1)δ) ≈ fX (nδ)δ.
So for the expectation, we want something like
\[ \sum_{n=-\infty}^{\infty} (n\delta) f_X(n\delta) \delta. \]

We now want to take δ → 0; intuitively, we should obtain an integral.

Definition 5.16. Let X be a continuous random variable with probability density function $f_X$.
The expectation or mean of X is defined to be
\[ E[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx \tag{5.3} \]
whenever $\int_{-\infty}^{\infty} |x| f_X(x) \, dx < \infty$. Otherwise, we say that the mean is undefined (or, as in the
discrete case, if only the positive tail diverges, we might say that $E[X] = \infty$).

Theorem 5.17. Let X be a continuous random variable with probability density function $f_X$,
and let h be a function from $\mathbb{R}$ to $\mathbb{R}$. Then
\[ E[h(X)] = \int_{-\infty}^{\infty} h(x) f_X(x) \, dx \tag{5.4} \]
(whenever $\int_{-\infty}^{\infty} |h(x)| f_X(x) \, dx < \infty$).

Notice that (5.4) is analogous to (5.2) in the same way that (5.3) is analogous to (5.1). Proving
Theorem 5.17 in full generality, for any function h, is rather technical. Here we just give an idea
of one approach to the proof for a particular class of functions.

Proof of Theorem 5.17 (outline of idea, non-examinable). First we claim that if X is a
non-negative continuous random variable, then $E[X] = \int_0^{\infty} P(X > x) \, dx$. To show this, we can
write the expectation as a double integral and change the order of integration:
\[ E[X] = \int_{x=0}^{\infty} x f_X(x) \, dx
        = \int_{x=0}^{\infty} \int_{y=0}^{x} f_X(x) \, dy \, dx
        = \int_{y=0}^{\infty} \int_{x=y}^{\infty} f_X(x) \, dx \, dy
        = \int_{y=0}^{\infty} P(X > y) \, dy, \]
giving the claim as required.
So now suppose h is such that h(X) is a non-negative continuous random variable. Then
\[ E[h(X)] = \int_{y=0}^{\infty} P(h(X) > y) \, dy
           = \int_{y=0}^{\infty} \int_{x : h(x) > y} f_X(x) \, dx \, dy
           = \int_{x=0}^{\infty} f_X(x) \int_{y : y < h(x)} dy \, dx
           = \int_{x=0}^{\infty} f_X(x) h(x) \, dx, \]
giving the desired formula in this case.

As in the case of discrete random variables, we define the variance of X to be
\[ \text{var}(X) = E\left[ (X - E[X])^2 \right] \]
whenever the right-hand side is defined. For simplicity of notation, write µ = E[X]. Then we
have
\begin{align*}
\text{var}(X) &= \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x) \, dx
             = \int_{-\infty}^{\infty} (x^2 - 2x\mu + \mu^2) f_X(x) \, dx \\
            &= \int_{-\infty}^{\infty} x^2 f_X(x) \, dx - 2\mu \int_{-\infty}^{\infty} x f_X(x) \, dx + \mu^2 \int_{-\infty}^{\infty} f_X(x) \, dx
             = E[X^2] - \mu^2,
\end{align*}
since $\int_{-\infty}^{\infty} x f_X(x) \, dx = \mu$ and $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$. So we recover the expression
\[ \text{var}(X) = E[X^2] - (E[X])^2. \]

Just as in the discrete case, expectation has a linearity property.


Theorem 5.18. Suppose X is a continuous random variable with p.d.f. $f_X$. Then for $a, b \in \mathbb{R}$,
$E[aX + b] = aE[X] + b$ and $\text{var}(aX + b) = a^2 \text{var}(X)$.

Proof. By Theorem 5.17,
\[ E[aX + b] = \int_{-\infty}^{\infty} (ax + b) f_X(x) \, dx = a \int_{-\infty}^{\infty} x f_X(x) \, dx + b \int_{-\infty}^{\infty} f_X(x) \, dx = aE[X] + b, \]
as required, since the density integrates to 1. Moreover,
\[ \text{var}(aX + b) = E\left[ (aX + b - aE[X] - b)^2 \right] = E\left[ a^2 (X - E[X])^2 \right] = a^2 E\left[ (X - E[X])^2 \right] = a^2 \text{var}(X). \]

Example 5.19. Suppose X ∼ N(µ, σ²). Then

• X has the same distribution as µ + σZ, where Z ∼ N(0, 1),

• X has c.d.f. $F_X(x) = \Phi((x - \mu)/\sigma)$, where Φ is the standard normal c.d.f.,

• E[X] = µ,

• var(X) = σ².

Solution. First suppose that µ = 0 and σ² = 1. Then the first two assertions are trivial and
\[ E[X] = \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi}} e^{-x^2/2} \, dx, \]
which must equal 0 since the integrand is an odd function. Since the mean is 0,
\[ \text{var}(X) = E[X^2] = \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi}} e^{-x^2/2} \, dx = \int_{-\infty}^{\infty} x \cdot \frac{x e^{-x^2/2}}{\sqrt{2\pi}} \, dx. \]
Integrating by parts (and taking limits of the bounds to ±∞), we get that this equals
\[ \left[ -x \cdot \frac{e^{-x^2/2}}{\sqrt{2\pi}} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = 1. \]
So var(X) = 1.
Suppose now that Z ∼ N(0, 1). Then
\[ P(\mu + \sigma Z \leq x) = P(Z \leq (x - \mu)/\sigma) = \Phi((x - \mu)/\sigma). \]
Let $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the standard normal density. Differentiating $P(\mu + \sigma Z \leq x)$ in x, we get
\[ \frac{1}{\sigma} \phi((x - \mu)/\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right). \]
So $\mu + \sigma Z \sim N(\mu, \sigma^2)$. Finally,
\[ E[X] = E[\mu + \sigma Z] = \mu + \sigma E[Z] = \mu \]
and
\[ \text{var}(X) = \text{var}(\mu + \sigma Z) = \sigma^2 \text{var}(Z) = \sigma^2. \qquad \square \]
Exercise 5.20. Show that if X ∼ U[a, b] and Y ∼ Exp(λ) then
\[ E[X] = \frac{a+b}{2}, \quad \text{var}(X) = \frac{(b-a)^2}{12}, \quad E[Y] = \frac{1}{\lambda}, \quad \text{var}(Y) = \frac{1}{\lambda^2}. \]
Notice, in particular, that the parameter of the Exponential distribution is the reciprocal of its
mean.
Example 5.21. Suppose that X ∼ Gamma(2, 2), so that it has p.d.f.
\[ f_X(x) = \begin{cases} 4x e^{-2x} & \text{for } x \geq 0, \\ 0 & \text{otherwise.} \end{cases} \]
Find E[X] and $E\left[\frac{1}{X}\right]$.

Solution. We have
\[ E[X] = \int_{-\infty}^{\infty} x \cdot 4x e^{-2x} \, dx = \int_{-\infty}^{\infty} \frac{2^3}{2!} x^{3-1} e^{-2x} \, dx \]
and, since Γ(3) = 2!, we recognise the integrand as the density of a Gamma(3, 2) random variable.
So it must integrate to 1 and we get E[X] = 1.
On the other hand,
\[ E\left[\frac{1}{X}\right] = \int_{-\infty}^{\infty} \frac{1}{x} \cdot 4x e^{-2x} \, dx = 2 \int_{-\infty}^{\infty} 2 e^{-2x} \, dx \]
and again we recognise the integrand as the density of an Exp(2) random variable which must
integrate to 1. So we get $E\left[\frac{1}{X}\right] = 2$.

WARNING: IN GENERAL, $E\left[\frac{1}{X}\right] \neq \frac{1}{E[X]}$.

5.4 Examples of functions of continuous random variables
Example 5.22. Imagine a forest. Suppose that R is the distance from a tree to the nearest
neighbouring tree. Suppose that R has p.d.f.
\[ f_R(r) = \begin{cases} r e^{-r^2/2} & \text{for } r \geq 0, \\ 0 & \text{otherwise.} \end{cases} \]
Find the distribution of the area of the tree-free circle around the original tree.

Solution. Let A be the area of the tree-free circle; then A = πR². We begin by finding the c.d.f.
of R and then use it to find the c.d.f. of A. $F_R(r)$ is clearly 0 for r < 0. For $r \geq 0$,
\[ F_R(r) = P(R \leq r) = \int_0^r s e^{-s^2/2} \, ds = \left[ -e^{-s^2/2} \right]_0^r = 1 - e^{-r^2/2}. \]
Hence, using the fact that R can't take negative values,
\[ F_A(a) = P(A \leq a) = P(\pi R^2 \leq a) = P\left( R \leq \sqrt{\frac{a}{\pi}} \right) = F_R\left( \sqrt{\frac{a}{\pi}} \right) = 1 - e^{-a/(2\pi)} \]
for $a \geq 0$. Of course, $F_A(a) = 0$ for a < 0. Differentiating for $a \geq 0$, we get
\[ f_A(a) = \frac{1}{2\pi} e^{-a/(2\pi)}. \]
So, recognising the p.d.f., we see that A is distributed exponentially with parameter 1/(2π). □
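This example can be checked by simulation. Since $F_R(r) = 1 - e^{-r^2/2}$, inverse-transform sampling gives $R = \sqrt{-2\log U}$ for U uniform on (0, 1), so A = πR² = −2π log U. A minimal sketch in Python compares the sample mean of A with 2π, the mean of the Exp(1/(2π)) distribution:

import random, math

random.seed(0)
# A = pi * R^2 = -2*pi*log(U) by the inverse transform above.
samples = [-2 * math.pi * math.log(random.random()) for _ in range(100_000)]
print(sum(samples) / len(samples), 2 * math.pi)  # both roughly 6.28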

Remark 5.23. The distribution of R in Example 5.22 is called the Rayleigh distribution. One
way in which this distribution occurs is as follows. Pick a point in R2 such that the x and y
co-ordinates are independent N(0, 1) random variables. Then the Euclidean distance of that point
from the origin (0, 0) has the Rayleigh distribution (see Part A Probability for a proof of this fact;
there is a connection to Example 5.15).

We can generalise the idea in Example 5.22 to prove the following theorem.

Theorem 5.24. Suppose that X is a continuous random variable with density $f_X$ and that
$h : \mathbb{R} \to \mathbb{R}$ is a differentiable function with $\frac{dh(x)}{dx} > 0$ for all x, so that h is strictly increasing.
Then Y = h(X) is a continuous random variable with p.d.f.
\[ f_Y(y) = f_X(h^{-1}(y)) \, \frac{d}{dy} h^{-1}(y), \]
where $h^{-1}$ is the inverse function of h.

Proof. Since h is strictly increasing, $h(X) \leq y$ if and only if $X \leq h^{-1}(y)$. So the c.d.f. of Y is
\[ F_Y(y) = P(h(X) \leq y) = P(X \leq h^{-1}(y)) = F_X(h^{-1}(y)). \]
Differentiating with respect to y using the chain rule, we get
\[ f_Y(y) = f_X(h^{-1}(y)) \, \frac{d}{dy} h^{-1}(y). \]
There is a similar result in the case where h is strictly decreasing. In any case, you may find
it easier to remember the proof than the statement of the theorem!
What if the function h is not one-to-one? It’s best to treat these on a case-by-case basis and
think them through carefully. Here’s an example.

Example 5.25. Suppose that a point is chosen uniformly from the perimeter of the unit circle.
What is the distribution of its x-co-ordinate?

Solution. Represent the chosen point by its angle, Θ. So then Θ has a uniform distribution on
[0, 2π), with p.d.f.
\[ f_{\Theta}(\theta) = \begin{cases} \frac{1}{2\pi} & \text{for } 0 \leq \theta < 2\pi \\ 0 & \text{otherwise.} \end{cases} \]
Moreover, the x-co-ordinate is X = cos Θ, which takes values in [−1, 1]. We again work via
c.d.f.'s:
\[ F_{\Theta}(\theta) = \begin{cases} 0 & \text{for } \theta < 0 \\ \frac{\theta}{2\pi} & \text{for } 0 \leq \theta < 2\pi \\ 1 & \text{for } \theta \geq 2\pi. \end{cases} \]

Notice that there are two angles in [0, 2π) corresponding to each x-co-ordinate in (−1, 1).

Then $F_X(x) = 0$ for $x \leq -1$, $F_X(x) = 1$ for $x \geq 1$ and, for $x \in (-1, 1)$, we can express the c.d.f.
in terms of arccos : [−1, 1] → [0, π] as
\begin{align*}
F_X(x) &= P(\cos\Theta \leq x)
        = P(\arccos x \leq \Theta \leq 2\pi - \arccos x) \\
       &= F_{\Theta}(2\pi - \arccos x) - F_{\Theta}(\arccos x)
        = 1 - \frac{\arccos x}{2\pi} - \frac{\arccos x}{2\pi}
        = 1 - \frac{1}{\pi} \arccos x.
\end{align*}
This completely determines the distribution of X, but we might also be interested in the p.d.f.
Differentiating $F_X$, we get
\[ \frac{dF_X(x)}{dx} = \begin{cases} \frac{1}{\pi\sqrt{1-x^2}} & \text{for } -1 < x < 1 \\ 0 & \text{for } x < -1 \text{ or } x > 1 \\ \text{undefined} & \text{for } x = -1 \text{ or } x = 1. \end{cases} \]
So we can take
\[ f_X(x) = \begin{cases} \frac{1}{\pi\sqrt{1-x^2}} & \text{for } -1 < x < 1 \\ 0 & \text{for } x \leq -1 \text{ or } x \geq 1 \end{cases} \]
and get $F_X(x) = \int_{-\infty}^{x} f_X(u) \, du$.
Notice that $f_X(x) \to \infty$ as $x \to 1$ or $x \to -1$, even though $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$. □

5.5 Joint distributions
We will often want to think of different random variables defined on the same probability space.
In the discrete case, we studied pairs of random variables via their joint probability mass function.
For a pair of arbitrary random variables, we use instead the joint cumulative distribution function,
FX,Y : R2 → [0, 1], given by
FX,Y (x, y) = P (X ≤ x, Y ≤ y) .
It's again possible to show that this function is non-decreasing in each of its arguments, and that
\[ \lim_{x \to -\infty} \lim_{y \to -\infty} F_{X,Y}(x, y) = 0 \]
and
\[ \lim_{x \to \infty} \lim_{y \to \infty} F_{X,Y}(x, y) = 1. \]

Definition 5.26. Let X and Y be random variables such that
\[ F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v) \, du \, dv \]
for some function $f_{X,Y} : \mathbb{R}^2 \to \mathbb{R}$ such that

(a) $f_{X,Y}(u, v) \geq 0$ for all $u, v \in \mathbb{R}$;

(b) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(u, v) \, du \, dv = 1$.

Then X and Y are jointly continuous and $f_{X,Y}$ is their joint density function.

If $f_{X,Y}$ is sufficiently smooth at (x, y), we get
\[ f_{X,Y}(x, y) = \frac{\partial^2}{\partial x \, \partial y} F_{X,Y}(x, y). \]
For a single continuous random variable X, it turns out that the probability that it lies in
some nice set $A \subseteq \mathbb{R}$ (see Part A Integration to see what we mean by "nice", but note that any
set you can think of or write down will be!) can be obtained by integrating its density over A:
\[ P(X \in A) = \int_A f_X(x) \, dx. \]
Likewise, for nice sets $B \subseteq \mathbb{R}^2$ we obtain the probability that the pair (X, Y) lies in B by
integrating the joint density over the set B:
\[ P((X, Y) \in B) = \iint_{(x,y) \in B} f_{X,Y}(x, y) \, dx \, dy. \]

We will show here that this works for rectangular regions B.

Theorem 5.27. For a pair of jointly continuous random variables X and Y, we have
\[ P(a < X \leq b, \, c < Y \leq d) = \int_c^d \int_a^b f_{X,Y}(x, y) \, dx \, dy, \]
for a < b and c < d.

Proof. We have
\begin{align*}
P(a < X \leq b, \, c < Y \leq d)
&= P(X \leq b, Y \leq d) - P(X \leq a, Y \leq d) + P(X \leq a, Y \leq c) - P(X \leq b, Y \leq c) \\
&= F_{X,Y}(b, d) - F_{X,Y}(a, d) + F_{X,Y}(a, c) - F_{X,Y}(b, c) \\
&= \int_c^d \int_a^b f_{X,Y}(x, y) \, dx \, dy.
\end{align*}

Theorem 5.28. Suppose X and Y are jointly continuous with joint density $f_{X,Y}$. Then X is a
continuous random variable with density
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy, \]
and similarly Y is a continuous random variable with density
\[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx. \]
In this context the one-dimensional densities $f_X$ and $f_Y$ are called the marginal densities of
the joint distribution with joint density $f_{X,Y}$, just as in the discrete case at Definition 2.16.

Proof. If $f_X$ is defined by $f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy$, then we have
\[ \int_{-\infty}^{x} f_X(u) \, du = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(u, y) \, dy \, du = P(X \leq x), \]
so indeed X has density $f_X$ (and the case of $f_Y$ is identical).

The definitions and results above generalise straightforwardly to the case of n random vari-
ables, X1 , X2 , . . . , Xn .

Example 5.29. Let
\[ f_{X,Y}(x, y) = \begin{cases} \frac{1}{2}(x + y) & \text{for } 0 \leq x \leq 1, \ 1 \leq y \leq 2, \\ 0 & \text{otherwise.} \end{cases} \]
Check that $f_{X,Y}(x, y)$ is a joint density. What is $P\left(X \leq \frac{1}{2}, Y \geq \frac{3}{2}\right)$? What are the marginal
densities? What is $P\left(X \geq \frac{1}{2}\right)$?

Solution. Clearly, $f_{X,Y}(x, y) \geq 0$ for all $x, y \in \mathbb{R}$. We have
\begin{align*}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dx \, dy
&= \int_1^2 \int_0^1 \frac{1}{2}(x + y) \, dx \, dy
 = \int_1^2 \left[ \frac{1}{4}x^2 + \frac{1}{2}xy \right]_0^1 dy \\
&= \int_1^2 \left( \frac{1}{4} + \frac{1}{2}y \right) dy
 = \left[ \frac{1}{4}y + \frac{1}{4}y^2 \right]_1^2
 = 1.
\end{align*}
We have
\begin{align*}
P\left( X \leq \frac{1}{2}, Y \geq \frac{3}{2} \right)
&= \int_{3/2}^2 \int_0^{1/2} \frac{1}{2}(x + y) \, dx \, dy
 = \int_{3/2}^2 \left[ \frac{1}{4}x^2 + \frac{1}{2}xy \right]_0^{1/2} dy \\
&= \int_{3/2}^2 \left( \frac{1}{16} + \frac{1}{4}y \right) dy
 = \left[ \frac{1}{16}y + \frac{1}{8}y^2 \right]_{3/2}^2
 = \frac{1}{4}.
\end{align*}
Integrating out y we get
\[ f_X(x) = \int_1^2 \frac{1}{2}(x + y) \, dy = \frac{1}{2}x + \frac{3}{4} \]
for $x \in [0, 1]$, and integrating out x we get
\[ f_Y(y) = \int_0^1 \frac{1}{2}(x + y) \, dx = \frac{1}{4} + \frac{1}{2}y \]
for $y \in [1, 2]$. Using the marginal density of X,
\[ P\left( X \geq \frac{1}{2} \right) = \int_{1/2}^1 \left( \frac{1}{2}x + \frac{3}{4} \right) dx = \frac{9}{16}. \qquad \square \]
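These numbers are easy to confirm numerically, e.g. with a midpoint rule over the relevant rectangle. A minimal sketch in Python for $P(X \leq 1/2, Y \geq 3/2)$:

# Midpoint-rule approximation of the double integral of (x + y)/2
# over [0, 1/2] x [3/2, 2]; the exact answer is 1/4.
N = 400
hx = hy = 0.5 / N
total = 0.0
for i in range(N):
    x = (i + 0.5) * hx
    for j in range(N):
        y = 1.5 + (j + 0.5) * hy
        total += 0.5 * (x + y) * hx * hy
print(total)  # approximately 0.25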

Definition 5.30. Jointly continuous random variables X and Y with joint density fX,Y are
independent if
fX,Y (x, y) = fX (x)fY (y)
for all x, y ∈ R. Likewise, jointly continuous random variables X1 , X2 , . . . , Xn with joint density
fX1 ,X2 ,...,Xn are independent if

fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) . . . fXn (xn )

for all x1 , x2 , . . . , xn ∈ R.

Note that if X and Y are independent then it follows easily that

FX,Y (x, y) = FX (x)FY (y)

for all x, y ∈ R.

Example 5.31. Consider the set-up of Example 5.29. Since
\[ \frac{1}{2}(x + y) \neq \left( \frac{1}{2}x + \frac{3}{4} \right) \left( \frac{1}{4} + \frac{1}{2}y \right), \]
X and Y are not independent.

5.5.1 Expectation
We can write the expectation of a function h of a pair (X, Y ) of jointly continuous random
variables in a natural way.

Theorem 5.32.
\[ E[h(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) f_{X,Y}(x, y) \, dx \, dy. \]

As in the case of Theorem 5.17, a general proof of this result is rather technical, and we don’t
cover it here. However, note again that there is a very direct analogy with the discrete case which
we saw in equation (2.2).
In particular, the covariance of X and Y is

cov (X, Y ) = E [(X − E [X])(Y − E [Y ])] = E [XY ] − E [X] E [Y ]

(exercise: check the second equality).

Exercise 5.33. Check that


E [aX + bY ] = aE [X] + bE [Y ]
and
var (X + Y ) = var (X) + var (Y ) + 2cov (X, Y ) .
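A quick numerical sanity check of these identities (an illustrative sketch, not from the notes, assuming numpy; the particular correlated pair below is an arbitrary choice):

```python
# Numerically compare var(X+Y) with var(X) + var(Y) + 2 cov(X,Y)
# (a sketch, assuming numpy; any pair with finite variances would do).
import numpy as np

rng = np.random.default_rng(1)
z1, z2 = rng.standard_normal(10**6), rng.standard_normal(10**6)
x = z1
y = 0.8 * z1 + z2                          # deliberately correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y)[0, 1]
print(lhs, rhs)                            # agree up to sampling error
```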

Remark 5.34. We have now shown that the rules for calculating expectations (and derived
quantities such as variances and covariances) of continuous random variables are exactly the
same as for discrete random variables. This isn’t a coincidence! We can make a more general
definition of expectation which covers both cases (and more besides) but in order to do so we need
a more general theory of integration, which some of you will see in the Part A Integration course.

Example 5.35. Let −1 < ρ < 1. The standard bivariate normal distribution has joint density
 
\[
f_{X,Y}(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left( x^2 - 2\rho xy + y^2 \right) \right)
\]

for x, y ∈ R. What are the marginal distributions of X and Y ? Find the covariance of X and Y .

Solution. We have
\begin{align*}
f_X(x) &= \int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left( x^2 - 2\rho xy + y^2 \right) \right) dy\\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left[ (y - \rho x)^2 + x^2(1-\rho^2) \right] \right) dy\\
&= \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\left( -\frac{(y-\rho x)^2}{2(1-\rho^2)} \right) dy.
\end{align*}
But the integrand is now the density of a normal random variable with mean $\rho x$ and variance $1-\rho^2$. So it integrates to 1 and we are left with
\[
f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.
\]
So $X \sim N(0,1)$ and, by symmetry, the same is true for $Y$. Notice that $X$ and $Y$ are only independent if $\rho = 0$.

Since $X$ and $Y$ both have mean 0, we only need to calculate $E[XY]$. We can use a similar trick:
\begin{align*}
E[XY] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{xy}{2\pi\sqrt{1-\rho^2}} \exp\left( -\frac{1}{2(1-\rho^2)} \left( x^2 - 2\rho xy + y^2 \right) \right) dy\,dx\\
&= \int_{-\infty}^{\infty} \frac{x}{\sqrt{2\pi}} e^{-x^2/2} \int_{-\infty}^{\infty} \frac{y}{\sqrt{2\pi(1-\rho^2)}} \exp\left( -\frac{(y-\rho x)^2}{2(1-\rho^2)} \right) dy\,dx.
\end{align*}
The inner integral now gives us the mean of a $N(\rho x, 1-\rho^2)$ random variable, which is $\rho x$. So we get
\[
\operatorname{cov}(X,Y) = \int_{-\infty}^{\infty} \frac{\rho x^2}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \rho E\left[X^2\right] = \rho,
\]
since $E\left[X^2\right] = 1$.

This yields the interesting conclusion that standard bivariate normal random variables X and
Y are independent if and only if their covariance is 0. This is a nice property of normal random
variables which is not true for general random variables, as we have already observed in the
discrete case.
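One standard way to simulate from this density (a sketch, assuming numpy) is to set $X = Z_1$ and $Y = \rho Z_1 + \sqrt{1-\rho^2}\, Z_2$ for independent standard normals $Z_1, Z_2$; the empirical marginals and covariance should then match the calculations above.

```python
# Simulating the standard bivariate normal via a linear combination of
# independent N(0,1) variables (a sketch, assuming numpy; rho = 0.6 is
# an arbitrary illustrative choice).
import numpy as np

rng = np.random.default_rng(2)
rho = 0.6
z1, z2 = rng.standard_normal(10**6), rng.standard_normal(10**6)
x = z1
y = rho * z1 + np.sqrt(1 - rho**2) * z2

print(np.cov(x, y)[0, 1])       # close to rho = 0.6
print(np.var(x), np.var(y))     # both close to 1: the marginals are N(0,1)
```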

Chapter 6

Random samples and the weak law of large numbers

One of the reasons that we are interested in sequences of i.i.d. random variables is that we can
view them as repeated samples from some underlying distribution.
Definition 6.1. Let X1 , X2 , . . . , Xn denote i.i.d. random variables. Then these random variables
are said to constitute a random sample of size n from the distribution.
Statistics often involves random samples where the underlying distribution (the “parent dis-
tribution”) is unknown. A realisation of such a random sample is used to make inferences about
the parent distribution. Suppose, for example, we want to know about the mean of the parent
distribution. An important estimator is the sample mean.
Definition 6.2. The sample mean is defined to be $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$.
This is a key random variable which itself has an expectation and a variance. Recall that for
random variables X and Y (discrete or continuous),
var (X + Y ) = var (X) + var (Y ) + 2cov (X, Y ) .
We can extend this (by induction) to n random variables as follows:
\[
\operatorname{var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n \operatorname{var}(X_i) + \sum_{i \ne j} \operatorname{cov}(X_i, X_j) = \sum_{i=1}^n \operatorname{var}(X_i) + 2 \sum_{i < j} \operatorname{cov}(X_i, X_j).
\]

Theorem 6.3. Suppose that $X_1, X_2, \ldots, X_n$ form a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Then the expectation and variance of the sample mean are
\[
E\left[\bar{X}_n\right] = \mu \quad \text{and} \quad \operatorname{var}\left(\bar{X}_n\right) = \frac{\sigma^2}{n}.
\]
Proof. We have $E[X_i] = \mu$ and $\operatorname{var}(X_i) = \sigma^2$ for $1 \le i \le n$. So by linearity of expectation and the variance rules recalled above,
\begin{align*}
E\left[\bar{X}_n\right] &= E\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n E[X_i] = \mu,\\
\operatorname{var}\left(\bar{X}_n\right) &= \operatorname{var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \operatorname{var}\left( \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \sum_{i=1}^n \operatorname{var}(X_i) = \frac{\sigma^2}{n},
\end{align*}
since independence implies that $\operatorname{cov}(X_i, X_j) = 0$ for all $i \ne j$.

Example 6.4. Let $X_1, \ldots, X_n$ be a random sample from a Bernoulli distribution with parameter $p$. Then $E[X_i] = p$ and $\operatorname{var}(X_i) = p(1-p)$ for all $1 \le i \le n$. Hence, $E\left[\bar{X}_n\right] = p$ and $\operatorname{var}\left(\bar{X}_n\right) = p(1-p)/n$.
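A short simulation (a sketch, assuming numpy; the values of $p$ and $n$ below are arbitrary choices) illustrating Theorem 6.3 in this Bernoulli case:

```python
# Empirical mean and variance of the sample mean for Bernoulli(p) samples
# (a sketch, assuming numpy). Theory: E = p, var = p(1-p)/n.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 100, 10**5
samples = rng.binomial(1, p, size=(reps, n))    # reps independent samples of size n
means = samples.mean(axis=1)                    # one sample mean per row

print(means.mean())     # close to p = 0.3
print(means.var())      # close to p(1-p)/n = 0.0021
```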

In order for $\bar{X}_n$ to be a good estimator of the mean, we would like to know that for large sample sizes $n$, $\bar{X}_n$ is not too far away from $\mu$, i.e. that $|\bar{X}_n - \mu|$ is small. The result which tells us that this is true is called the law of large numbers and is of fundamental importance in probability. Before we state it, let's step away from the sample mean and consider a more basic situation.
Suppose that A is an event with probability P (A) and write p = P (A). Let X be the indicator
function of the event A, i.e. the random variable defined by
\[
X(\omega) = \mathbf{1}_A(\omega) =
\begin{cases}
1 & \text{if } \omega \in A,\\
0 & \text{if } \omega \notin A.
\end{cases}
\]

Then X ∼ Ber(p) and E [X] = p. Suppose now that we perform our experiment repeatedly
and let Xi be the indicator of the event that A occurs on the ith trial. Our intuitive notion of
probability leads us to believe that if the number n of trials is large then the proportion of the
time that A occurs should be close to $p$, i.e. that
\[
\left| \frac{1}{n} \sum_{i=1}^n X_i - p \right|
\]

should be small. So proving that the sample mean is close to the true mean in this situation will
also provide some justification for the way we have set up our mathematical theory of probability.

Theorem 6.5 (Weak law of large numbers). Suppose that $X_1, X_2, \ldots$ are independent and identically distributed random variables with mean $\mu$. Then for any fixed $\epsilon > 0$,
\[
P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right| > \epsilon \right) \to 0
\]
as $n \to \infty$.

(Equivalently, we could have put
\[
P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right| \le \epsilon \right) \to 1
\]
as $n \to \infty$.)
In other words, the probability that the sample mean deviates from the true mean by more than some small quantity $\epsilon$ tends to 0 as $n \to \infty$. Notice that the result only depends on the underlying distribution through its mean.
We will give a proof of the weak law under an additional assumption that the variance of the
distribution is finite. To do that, we’ll first prove a couple of very useful inequalities.

Theorem 6.6 (Markov's inequality). Suppose that $Y$ is a non-negative random variable whose expectation exists. Then
\[
P(Y \ge t) \le \frac{E[Y]}{t}
\]
for all $t > 0$.

Proof. Let $A = \{Y \ge t\}$. We may assume that $P(A) \in (0,1)$, since otherwise the result is trivially true. Then by the law of total probability for expectations,
\[
E[Y] = E[Y \mid A]\, P(A) + E[Y \mid A^c]\, P(A^c) \ge E[Y \mid A]\, P(A),
\]
since $P(A^c) > 0$ and $E[Y \mid A^c] \ge 0$. Now, we certainly have $E[Y \mid A] = E[Y \mid Y \ge t] \ge t$. So, rearranging, we get
\[
P(Y \ge t) \le \frac{E[Y]}{t}
\]
as we wanted.

Theorem 6.7 (Chebyshev's inequality). Suppose that $Z$ is a random variable with a finite variance. Then for any $t > 0$,
\[
P(|Z - E[Z]| \ge t) \le \frac{\operatorname{var}(Z)}{t^2}.
\]
Proof. Note that $P(|Z - E[Z]| \ge t) = P\left( (Z - E[Z])^2 \ge t^2 \right)$ and then apply Markov's inequality to the non-negative random variable $Y = (Z - E[Z])^2$.
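For intuition, here is a small numerical comparison (plain Python, not from the notes; the distributions and the value $t = 3$ are arbitrary illustrative choices) of true tail probabilities against these bounds; the bounds are valid but often far from tight.

```python
# Comparing true tail probabilities with the Markov and Chebyshev bounds
# (a sketch; uses only the standard library).
import math

t = 3.0
# Markov for Y ~ Exp(1): E[Y] = 1, so P(Y >= t) <= 1/t.
print(math.exp(-t), 1 / t)                     # true tail ~0.0498 vs bound ~0.333

# Chebyshev for Z ~ N(0,1): E[Z] = 0, var(Z) = 1, so P(|Z| >= t) <= 1/t^2.
print(math.erfc(t / math.sqrt(2)), 1 / t**2)   # true tail ~0.0027 vs bound ~0.111
```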

Proof of Theorem 6.5 (under the assumption of finite variance). Suppose the common distribution of the random variables $X_i$ has mean $\mu$ and variance $\sigma^2$. Set
\[
Z = \frac{1}{n} \sum_{i=1}^n X_i.
\]
As we saw in Theorem 6.3,
\[
E[Z] = \mu \quad \text{and} \quad \operatorname{var}(Z) = \operatorname{var}\left( \frac{1}{n} \sum_{i=1}^n X_i \right) = \frac{\sigma^2}{n}.
\]
So by Chebyshev's inequality,
\[
P\left( \left| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right| > \epsilon \right) \le \frac{\sigma^2}{n\epsilon^2}.
\]
Since $\epsilon > 0$ is fixed, the right-hand side tends to 0 as $n \to \infty$.
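A sketch of the weak law in action (assuming numpy; the choice of $U[0,1]$ samples, whose mean is $1/2$, and of $\epsilon = 0.05$ is arbitrary): the proportion of sample means more than $\epsilon$ away from $\mu$ shrinks as $n$ grows.

```python
# Empirical illustration of the weak law of large numbers
# (a sketch, assuming numpy).
import numpy as np

rng = np.random.default_rng(4)
mu, eps, reps = 0.5, 0.05, 2000
for n in [10, 100, 1000, 10000]:
    means = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)  # U[0,1] has mean 1/2
    print(n, np.mean(np.abs(means - mu) > eps))  # proportion of deviant means shrinks
```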

Appendix

A.1 Useful ideas from Analysis


Here are brief details of some ideas about sets, sequences, and series, that it will be useful to
make reference to. Those doing the Analysis I course in Maths this term will see all of this in
much greater detail!

Countability
A set S is countable if either it’s finite, or its elements can be written as a list: S = {x1 , x2 , x3 , . . . }.
Put another way, S is countable if there is a bijection from a subset of N to S. The set N itself
is countable; so is the set of rational numbers Q, for example. The set of real numbers R is not
countable.

Limits
Even if you haven't seen a definition, you probably have an idea of what it means for a sequence to converge to a limit. Formally, we say that a sequence of real numbers $(a_1, a_2, a_3, \ldots)$ converges to a limit $L \in \mathbb{R}$ if the following holds: for all $\epsilon > 0$, there exists $N \in \mathbb{N}$ such that $|a_n - L| \le \epsilon$ whenever $n \ge N$.
Then we may write "$L = \lim_{n\to\infty} a_n$", or "$a_n \to L$ as $n \to \infty$".

Infinite sums
Finite sums are easy. If we have a sequence $(a_1, a_2, a_3, \ldots)$, then for any $n \in \mathbb{N}$ we can define
\[
s_n = \sum_{k=1}^n a_k = a_1 + a_2 + \cdots + a_n.
\]
What do we mean by the infinite sum $\sum_{k=1}^{\infty} a_k$? An infinite sum is really a sort of limit. If the limit $L = \lim_{n\to\infty} s_n$ exists, then we say that the series $\sum_{k=1}^{\infty} a_k$ converges, and that its sum is $L$. If the sequence $(s_n, n \in \mathbb{N})$ does not have a limit, then we say that the series $\sum_{k=1}^{\infty} a_k$ diverges.

An important idea for our purposes will be absolute convergence of a series. We say that the series $\sum_{k=1}^{\infty} a_k$ converges absolutely if the series $\sum_{k=1}^{\infty} |a_k|$ converges. If a series converges absolutely, then it also converges.

One reason why absolute convergence is important is that it guarantees that the value of a sum doesn't depend on the order of the terms. In the definition of expectation of a discrete random variable, for example, we may have an infinite sum and no reason to take the terms in any particular order. Formally, suppose $f$ is a bijection from $\mathbb{N}$ to $\mathbb{N}$, and define $b_k = a_{f(k)}$. If the series $\sum_{k=1}^{\infty} a_k$ converges absolutely, then so does the series $\sum_{k=1}^{\infty} b_k$, and the sums $\sum_{k=1}^{\infty} a_k$ and $\sum_{k=1}^{\infty} b_k$ are equal.

An example of a series that converges but does not converge absolutely is the series $1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \cdots$, whose sum is $\ln 2$.

If we reorder the terms as $1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \frac{1}{9} + \frac{1}{11} - \frac{1}{6} + \cdots$, then the sum instead becomes $\frac{3}{2} \ln 2$.
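A quick demonstration of this phenomenon (plain Python, not from the notes), summing the rearranged series in groups of three terms:

```python
# Partial sums of the rearranged alternating harmonic series approach
# (3/2) ln 2, not ln 2 (a sketch; uses only the standard library).
import math

total, pos, neg = 0.0, 1, 2    # next odd (positive) and even (negative) denominators
for _ in range(10**5):         # pattern: two positive terms, then one negative
    total += 1.0 / pos + 1.0 / (pos + 2) - 1.0 / neg
    pos += 4
    neg += 2

print(total, 1.5 * math.log(2))    # both close to 1.0397...
```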

Power series
A (real) power series is a function of the form
\[
f(x) = \sum_{k=0}^{\infty} c_k x^k
\]
where the coefficients $c_k$, $k \ge 0$, are real constants. For any such series, there exists a radius of convergence $R \in [0, \infty) \cup \{\infty\}$, such that $\sum_{k=0}^{\infty} c_k x^k$ converges absolutely for $|x| < R$, and does not converge for $|x| > R$.
In this course we will meet a particular class of power series called probability generating functions, with the property that the coefficients $c_k$ are non-negative and sum to 1. In that case, $R$ is at least 1.
Power series behave well when differentiated! A power series $f(x) = \sum_{k=0}^{\infty} c_k x^k$ with radius of convergence $R$ is differentiable on the interval $(-R, R)$, and its derivative is also a power series with radius of convergence $R$, given by
\[
f'(x) = \sum_{k=0}^{\infty} (k+1) c_{k+1} x^k.
\]
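As a small illustration (plain Python; the Poisson(2) example and the 50-term truncation are choices made here, not part of the notes), one can check the term-by-term derivative formula on a probability generating function: $G(1)$ should be 1, and $G'(1)$ should equal the mean.

```python
# Term-by-term differentiation of the Poisson(2) pgf at s = 1
# (a sketch; truncating at 50 terms is an approximation, but the
# neglected tail is tiny).
import math

lam = 2.0
c = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(50)]

G_at_1 = sum(c)                                        # ~ 1 (coefficients sum to 1)
dG_at_1 = sum((k + 1) * c[k + 1] for k in range(49))   # ~ the mean, lambda = 2
print(G_at_1, dG_at_1)
```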

Series identities
Here is a reminder of some useful identities:
Geometric series: if $a \in \mathbb{R}$ and $0 \le r < 1$ then
\[
\sum_{k=0}^{n-1} a r^k = \frac{a(1 - r^n)}{1 - r}
\quad \text{and} \quad
\sum_{k=0}^{\infty} a r^k = \frac{a}{1 - r}.
\]
Exponential function: for $\lambda \in \mathbb{R}$,
\[
\sum_{n=0}^{\infty} \frac{\lambda^n}{n!} = e^{\lambda}.
\]
Binomial theorem: for $x, y \in \mathbb{R}$ and $n \ge 0$,
\[
(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.
\]
Differentiation and integration give us variants of these. For example, for $0 < r < 1$,
\[
\sum_{k=1}^{\infty} k r^{k-1} = \frac{d}{dr} \left( \sum_{k=0}^{\infty} r^k \right)
\quad \text{and} \quad
\sum_{k=1}^{\infty} \frac{r^k}{k} = \int_0^r \left( \sum_{k=0}^{\infty} t^k \right) dt.
\]
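A quick numerical check of the first of these variants (plain Python, not from the notes; $r = 0.3$ is an arbitrary choice): differentiating the geometric series gives $\sum_{k \ge 1} k r^{k-1} = 1/(1-r)^2$.

```python
# Verify the differentiated geometric series numerically (a sketch).
r = 0.3
lhs = sum(k * r**(k - 1) for k in range(1, 200))   # truncated series; tail is negligible
rhs = 1 / (1 - r)**2                               # d/dr of 1/(1-r)
print(lhs, rhs)                                    # both ~ 2.0408...
```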

A.2 Increasing sequences of events
We mentioned the following result in the later part of the course. A sequence of events $A_n$, $n \ge 1$, is called increasing if $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$.

Proposition A.8. If $A_n$, $n \ge 1$, is an increasing sequence of events, then
\[
P\left( \bigcup_{n=1}^{\infty} A_n \right) = \lim_{n\to\infty} P(A_n).
\]

Proof. The proof uses countable additivity. Using the fact that the sequence is increasing, we can write $A_n$ as a disjoint union
\[
A_n = A_1 \cup (A_2 \setminus A_1) \cup (A_3 \setminus A_2) \cup \cdots \cup (A_n \setminus A_{n-1}),
\]
and similarly, we can write $\bigcup_{n=1}^{\infty} A_n$ as a disjoint union
\[
\bigcup_{k=1}^{\infty} A_k = A_1 \cup \bigcup_{k=2}^{\infty} (A_k \setminus A_{k-1}).
\]
Then applying the countable additivity axiom twice, we have
\begin{align*}
P\left( \bigcup_{k=1}^{\infty} A_k \right) &= P(A_1) + \sum_{k=2}^{\infty} P(A_k \setminus A_{k-1})\\
&= \lim_{n\to\infty} \left[ P(A_1) + \sum_{k=2}^{n} P(A_k \setminus A_{k-1}) \right]
\end{align*}
(since by definition of an infinite sum, $\sum_{k=2}^{\infty} b_k = \lim_{n\to\infty} \sum_{k=2}^{n} b_k$). By finite additivity applied to the first disjoint decomposition above, the expression in square brackets is exactly $P(A_n)$, so the limit is $\lim_{n\to\infty} P(A_n)$.
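As a concrete instance (plain Python, not from the notes; the Geometric(1/2) choice is ours), take $X \sim \mathrm{Geom}(1/2)$ and the increasing events $A_n = \{X \le n\}$; then $P(A_n) = 1 - (1/2)^n$, which increases to $P\left(\bigcup_n A_n\right) = P(X < \infty) = 1$.

```python
# P(A_n) for A_n = {X <= n}, X ~ Geom(1/2), increases to the probability
# of the union, which is 1 (a sketch).
for n in [1, 2, 5, 10, 20, 50]:
    print(n, 1 - 0.5**n)   # increases towards the limit 1
```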

Common discrete distributions

Uniform $U\{1, 2, \ldots, n\}$, $n \in \mathbb{N}$:
pmf $P(X = k) = \frac{1}{n}$ for $1 \le k \le n$; mean $\frac{n+1}{2}$; variance $\frac{n^2 - 1}{12}$; pgf $G_X(s) = \frac{s - s^{n+1}}{n(1-s)}$.

Bernoulli $\mathrm{Ber}(p)$, $p \in [0,1]$:
pmf $P(X = 1) = p$, $P(X = 0) = 1 - p$; mean $p$; variance $p(1-p)$; pgf $G_X(s) = 1 - p + ps$.

Binomial $\mathrm{Bin}(n, p)$, $n \in \mathbb{N}$, $p \in [0,1]$:
pmf $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for $k = 0, 1, \ldots, n$; mean $np$; variance $np(1-p)$; pgf $G_X(s) = (1 - p + ps)^n$.

Poisson $\mathrm{Po}(\lambda)$, $\lambda \ge 0$:
pmf $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$ for $k = 0, 1, 2, \ldots$; mean $\lambda$; variance $\lambda$; pgf $G_X(s) = e^{\lambda(s-1)}$.

Geometric $\mathrm{Geom}(p)$, $p \in (0,1]$:
pmf $P(X = k) = (1-p)^{k-1} p$ for $k = 1, 2, \ldots$; mean $\frac{1}{p}$; variance $\frac{1-p}{p^2}$; pgf $G_X(s) = \frac{ps}{1 - (1-p)s}$.

Alternative geometric, $p \in (0,1]$:
pmf $P(X = k) = (1-p)^k p$ for $k = 0, 1, \ldots$; mean $\frac{1-p}{p}$; variance $\frac{1-p}{p^2}$; pgf $G_X(s) = \frac{p}{1 - (1-p)s}$.

Negative binomial $\mathrm{NegBin}(k, p)$, $k \in \mathbb{N}$, $p \in (0,1]$:
pmf $P(X = n) = \binom{n-1}{k-1} (1-p)^{n-k} p^k$ for $n = k, k+1, \ldots$; mean $\frac{k}{p}$; variance $\frac{k(1-p)}{p^2}$; pgf $G_X(s) = \left( \frac{ps}{1 - (1-p)s} \right)^k$.
Common continuous distributions

Uniform $U[a,b]$, $a < b$:
pdf $f_X(x) = \frac{1}{b-a}$ for $a \le x \le b$; cdf $F_X(x) = \frac{x-a}{b-a}$ for $a \le x \le b$; mean $\frac{a+b}{2}$; variance $\frac{(b-a)^2}{12}$.

Exponential $\mathrm{Exp}(\lambda)$, $\lambda > 0$:
pdf $f_X(x) = \lambda e^{-\lambda x}$ for $x \ge 0$; cdf $F_X(x) = 1 - e^{-\lambda x}$ for $x \ge 0$; mean $\frac{1}{\lambda}$; variance $\frac{1}{\lambda^2}$.

Gamma $\mathrm{Gamma}(\alpha, \lambda)$, $\alpha > 0$, $\lambda > 0$:
pdf $f_X(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\lambda x}$ for $x \ge 0$; mean $\frac{\alpha}{\lambda}$; variance $\frac{\alpha}{\lambda^2}$.

Normal $N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma^2 > 0$:
pdf $f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $x \in \mathbb{R}$; cdf $F_X(x) = \Phi\left( \frac{x-\mu}{\sigma} \right)$; mean $\mu$; variance $\sigma^2$.

Standard normal $N(0,1)$:
pdf $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ for $x \in \mathbb{R}$; cdf $F_X(x) = \Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du$; mean 0; variance 1.

Beta $\mathrm{Beta}(\alpha, \beta)$, $\alpha, \beta > 0$:
pdf $f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$ for $x \in [0,1]$; mean $\frac{\alpha}{\alpha+\beta}$; variance $\frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)}$.
