PrelimsProb MT23 26sep2023
Matthias Winkel
Overview
An understanding of random phenomena is becoming increasingly important in today’s world
within social and political sciences, finance, life sciences and many other fields. The aim of this
introduction to probability is to develop the concept of chance in a mathematical framework.
Random variables are introduced, with examples involving most of the common distributions.
Learning Outcomes
Students should have a knowledge and understanding of basic probability concepts, including
conditional probability. They should know what is meant by a random variable, and have met
the common distributions and their probability mass functions. They should understand the
concepts of expectation and variance of a random variable. A key concept is that of independence
which will be introduced for events and random variables.
Synopsis
Sample space, events, probability measure. Permutations and combinations, sampling with or
without replacement. Conditional probability, partitions of the sample space, law of total prob-
ability, Bayes’ Theorem. Independence.
Discrete random variables, probability mass functions, examples: Bernoulli, binomial, Pois-
son, geometric. Expectation, expectation of a function of a discrete random variable, variance.
Joint distributions of several discrete random variables. Marginal and conditional distributions.
Independence. Conditional expectation, law of total probability for expectations. Expectations of
functions of more than one discrete random variable, covariance, variance of a sum of dependent
discrete random variables.
Solution of first and second order linear difference equations. Random walks (finite state space
only).
Probability generating functions, use in calculating expectations. Examples including random
sums and branching processes.
Continuous random variables, cumulative distribution functions, probability density functions,
examples: uniform, exponential, gamma, normal. Expectation, expectation of a function of a
continuous random variable, variance. Distribution of a function of a single continuous random
variable. Joint probability density functions of several continuous random variables (rectangular
regions only). Marginal distributions. Independence. Expectations of functions of jointly con-
tinuous random variables, covariance, variance of a sum of dependent jointly continuous random
variables.
Random sample, sums of independent random variables. Markov’s inequality, Chebyshev’s
inequality, Weak Law of Large Numbers.
Reading List
1. G. R. Grimmett and D. J. A. Welsh, Probability: An Introduction, 2nd edition, Oxford University Press, 2014, Chapters 1–5, 6.1–6.3, 7.1–7.3, 7.5 (Markov's inequality), 8.1–8.2, 10.4.
4. D. Stirzaker, Elementary Probability, Cambridge University Press, 1994, Chapters 1–4, 5.1–
5.6, 6.1–6.3, 7.1, 7.2, 7.4, 8.1, 8.3, 8.5 (excluding the joint generating function).
Chapter 1
1.1 Introduction
We will think of performing an experiment which has a set of possible outcomes Ω. We call Ω the sample space. For example,
(a) toss a coin: Ω = {H, T};
(b) roll two dice: Ω = {(i, j) : 1 ≤ i, j ≤ 6}.
An event is a subset of Ω. An event A ⊆ Ω occurs if, when the experiment is performed, the outcome ω ∈ Ω satisfies ω ∈ A. You should think of events as things you can decide have or have not happened by looking at the outcome of your experiment. For example,
(a) A = {H}, the event that the coin shows heads;
(b) A = {(i, j) : i + j = 4}, the event that the total of the two dice is 4.
The complement of A is Ac := Ω \ A and means “A does not occur”. For events A and B,
A ∪ B means “at least one of A and B occurs”;
A ∩ B means “both A and B occur”;
A \ B means “A occurs but B does not”.
If A ∩ B = ∅ we say that A and B are disjoint – they cannot both occur.
We will assign a probability P (A) to each event A. Later on we will discuss general rules (or
“axioms”) governing how these probabilities ought to behave. For now, let’s consider a simple
and special case, where Ω is a finite set, and the probability assigned to a subset A of Ω is
proportional to the size of A; that is,
P(A) = |A| / |Ω|.
In that case for our examples above, we get:
(a) for a fair coin, P (A) = 1/2;
(b) for the two dice, P (A) = 1/12.
Example (b) illustrates the need for counting in the situation where we have a finite
number of possible outcomes to our experiment, all equally likely. The sample space Ω has 36
elements (6 ways of choosing i and 6 ways of choosing j). Since A = {(1, 3), (2, 2), (3, 1)} contains
3 sample points, and all sample points are equally likely, we get P (A) = 3/36 = 1/12.
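As a quick sanity check, one can enumerate the 36 outcomes by computer; here is a minimal Python sketch (illustrative only, not part of the notes):

```python
# Sketch: enumerate the sample space for two dice and check P(total = 4).
from itertools import product

omega = list(product(range(1, 7), repeat=2))    # all outcomes (i, j)
A = [w for w in omega if w[0] + w[1] == 4]      # the event "total is 4"
print(len(A), len(omega), len(A) / len(omega))  # 3 36 0.0833... = 1/12
```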
We want to be able to tackle much more complicated counting problems.
1.2 Counting
Many of you will have seen before the basic ideas involving permutations and combinations. If
you haven’t, or if you find them confusing, then you can find more details in the first chapter of
Introduction to Probability by Ross.
Suppose that we have n distinct objects. How many different ways are there to order them? There are n choices for the first object, n − 1 for the second, and so on, so that there are
n(n − 1) · · · 2 · 1 = n!
different orderings.
Since n! increases extremely fast, it is sometimes useful to know Stirling’s formula:
n! ∼ √(2π) n^{n+1/2} e^{−n},
where f (n) ∼ g(n) means f (n)/g(n) → 1 as n → ∞. This is astonishingly accurate even for quite
small n. For example, the error is of the order of 1% when n = 10.
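Here is a minimal Python sketch comparing n! with Stirling's approximation for a few values of n:

```python
# Sketch: compare n! with Stirling's approximation.
import math

for n in (1, 5, 10, 20):
    exact = math.factorial(n)
    approx = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    print(n, exact, f"relative error {1 - approx / exact:.4%}")
# For n = 10 the relative error is already below 1%.
```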
Suppose now that the n objects are not all distinct; say there are k types of object,
α_1, . . . , α_1 (m_1 times), α_2, . . . , α_2 (m_2 times), . . . , α_k, . . . , α_k (m_k times),
where m_1 + m_2 + · · · + m_k = n, and objects of the same type are indistinguishable. Then the number of distinguishable orderings is
n! / (m_1! m_2! · · · m_k!).   (1.1)
If there are just two types of object then, since m_1 + m_2 = n, the expression (1.1) is just a binomial coefficient,
\binom{n}{m_1} = \frac{n!}{m_1!(n − m_1)!} = \binom{n}{m_2}.
Note: we will always use the notation
\binom{n}{m} = \frac{n!}{m!(n − m)!}.
Recall the binomial theorem,
(x + y)^n = \sum_{m=0}^{n} \binom{n}{m} x^m y^{n−m}.
You can see where the binomial coefficient comes from because writing
(x + y)n = (x + y)(x + y) · · · (x + y)
and multiplying out, each term involves one pick from each bracket. The coefficient of xm y n−m
is the number of sequences of picks that give x exactly m times and y exactly n − m times and
that’s the number of ways of choosing the m “slots” for the x’s.
The expression (1.1) is called a multinomial coefficient because it is the coefficient of a_1^{m_1} · · · a_k^{m_k} in the expansion of
(a_1 + · · · + a_k)^n,
where m_1 + · · · + m_k = n. We sometimes write
\binom{n}{m_1\, m_2\, \dots\, m_k}
for the multinomial coefficient.
Instead of thinking in terms of arrangements, we can think of our binomial coefficient in terms
of choices. For example, if I have to choose a committee of size k from n people, there are \binom{n}{k} ways to do it. To see how this ties in, stand the n people in a line. For each arrangement of k 1's and n − k 0's I can create a different committee by picking the ith person for the committee if the ith term in the arrangement is a 1.
Example 1.3. Pick a team of m players from a squad of n, all possible teams being equally likely.
Set
Ω = {(i_1, i_2, . . . , i_n) : i_k ∈ {0, 1} and \sum_{k=1}^{n} i_k = m},
where
i_k = 1 if player k is picked, and i_k = 0 otherwise.
Let A = {player 1 is in the team}. Then
P(A) = \frac{\#\text{teams that include player 1}}{\#\text{possible teams}} = \frac{\binom{n−1}{m−1}}{\binom{n}{m}} = \frac{m}{n}.
Many counting problems can be solved by finding a bijection (that is, a one-to-one correspon-
dence) between the objects we want to enumerate and other objects that we already know how
to enumerate.
Example 1.4. How many distinct non-negative integer-valued solutions of the equation
x1 + x2 + · · · + xm = n
are there?
Solution. Consider a sequence of n ?’s and m−1 |’s. There is a bijection between such sequences
and non-negative integer-valued solutions to the equation. For example, if m = 4 and n = 3,
? ? | | ? |
corresponding to x_1 = 2, x_2 = 0, x_3 = 1, x_4 = 0.
There are \binom{n+m−1}{n} sequences of n ?'s and m − 1 |'s and, hence, the same number of solutions to the equation.
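One can check this count by brute force for small m and n; a minimal Python sketch:

```python
# Sketch: count non-negative integer solutions of x1 + ... + xm = n
# by enumeration and compare with the binomial coefficient C(n+m-1, n).
from itertools import product
from math import comb

m, n = 4, 3
solutions = [x for x in product(range(n + 1), repeat=m) if sum(x) == n]
print(len(solutions), comb(n + m - 1, n))  # both print 20
```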
It is often possible to perform quite complex counting arguments by manipulating binomial
coefficients. Conversely, sometimes one wants to prove relationships between binomial coefficients
and this can most easily be done by a counting argument. Here is one famous example:
Lemma 1.5 (Vandermonde's identity). For k, m, n ≥ 0,
\binom{m+n}{k} = \sum_{j=0}^{k} \binom{m}{j} \binom{n}{k−j}. (1.2)
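The identity is easy to check numerically; here is a small Python sketch using math.comb (which conveniently returns 0 when the lower index exceeds the upper):

```python
# Sketch: numerical check of Vandermonde's identity (1.2).
from math import comb

for m, n, k in [(3, 4, 2), (5, 5, 5), (6, 2, 4)]:
    lhs = comb(m + n, k)
    rhs = sum(comb(m, j) * comb(n, k - j) for j in range(k + 1))
    assert lhs == rhs
    print(m, n, k, lhs)
```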
¹P : F → R means that to each element A of F, we associate a real number which we call P(A). P is a function or mapping from F to R. Compare to a situation you are more familiar with: if f(x) = x², then we say that f is a function from R to R (or f : R → R for short).
Note that in particular (for I = {1, 2} and A_1 = A and A_2 = B):
P(A ∪ B) = P(A) + P(B) for disjoint events A and B.
(You may wonder whether this special case is enough to imply the full statement of P_3. It's easy to show, for example by induction, that this special case of the statement for two events implies the statement for all finite collections of events. It turns out that the statement about countably infinite collections is genuinely stronger – but this is rather a subtle point involving intricacies of set theory! You may also wonder what \sum_{i∈I} means when I is countable. Again, there are some subtleties, but the relevant cases for us are where I is either finite and \sum_{i∈I} the natural finite sum, or I ⊆ N is countably infinite and \sum_{i∈I} P(A_i) = \sum_{i=0}^{∞} P(A_i) 1_I(i) can be given rigorous meaning as the limit of an infinite series, where 1_I(i) = 1 if i ∈ I and 1_I(i) = 0 if i ∉ I.)
In the examples above, Ω was finite. In general Ω may be finite, countably infinite, or
uncountably infinite. If Ω is finite or countable, as it usually will be for the first half of this
course, then we normally take F to be the set of all subsets of Ω (the power set of Ω). (You
should check that, in this case, F1 –F3 are satisfied.) If Ω is uncountable, however, the set of
all subsets turns out to be too large: it ends up containing sets to which we cannot consistently
assign probabilities. This is an issue which some of you will see discussed properly in next year’s
Part A Integration course; for the moment, you shouldn’t worry about it, just make a mental
note that there is something to be resolved here.
Example 1.7. Consider a countably infinite set Ω = {ω_1, ω_2, . . .} and an arbitrary collection (p_1, p_2, . . .) of non-negative numbers with sum \sum_{i=1}^{∞} p_i = 1. Put
P(A) = \sum_{i : ω_i ∈ A} p_i.
Then P satisfies P_1–P_3. The numbers (p_1, p_2, . . .) are called a probability function.
Theorem 1.8. Suppose that (Ω, F, P) is a probability space and that A, B ∈ F. Then
1. P (Ac ) = 1 − P (A);
Example 1.9. Suppose that in a single roll of a fair die we know that the outcome is an even
number. What is the probability that it is in fact a six?
Solution. Let B = {result is even} = {2, 4, 6} and C = {result is a six} = {6}. Then P(B) = 1/2 and P(C) = 1/6, but if I know that B has happened, then P(C|B) (read "the probability of C given B") is 1/3 because given that B happened, we know the outcome was one of {2, 4, 6} and since the die is fair, in the absence of any other information, we assume each of these is equally likely.
Now let A = {result is divisible by 3} = {3, 6}. If we know that B happened, then the only way that A can also happen is if the outcome is in A ∩ B, in this case if the outcome is 6, and so P(A|B) = 1/3 again, which is P(A ∩ B)/P(B).
Definition 1.10. Let (Ω, F, P) be a probability space. If A, B ∈ F and P(B) > 0 then the
conditional probability of A given B is
P(A|B) = \frac{P(A ∩ B)}{P(B)}.
We should check that this new notion fits with our idea of probability. The next theorem says
that it does.
Theorem 1.11. Let (Ω, F, P) be a probability space and let B ∈ F satisfy P(B) > 0. Define a
new function Q : F → R by Q(A) = P(A|B). Then (Ω, F, Q) is also a probability space.
Proof. Because we’re using the same F, we need only check axioms P1 –P3 .
P_1. For any A ∈ F,
Q(A) = P(A ∩ B)/P(B) ≥ 0.
P_2. By definition,
Q(Ω) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.
P_3. For disjoint events A_i, i ∈ I,
Q(\bigcup_{i∈I} A_i) = P((\bigcup_{i∈I} A_i) ∩ B) / P(B)
= P(\bigcup_{i∈I} (A_i ∩ B)) / P(B)
= \sum_{i∈I} P(A_i ∩ B) / P(B)   (because A_i ∩ B, i ∈ I, are disjoint)
= \sum_{i∈I} Q(A_i).
From the definition of conditional probability, we get a very useful multiplication rule:
P(A ∩ B) = P(A|B) P(B). (1.3)
This generalises to
P(A_1 ∩ A_2 ∩ · · · ∩ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1 ∩ A_2) · · · P(A_n|A_1 ∩ · · · ∩ A_{n−1}). (1.4)
Example 1.12. An urn contains 8 red balls and 4 white balls. We draw 3 balls at random
without replacement. Let Ri = {the ith ball is red} for 1 ≤ i ≤ 3. Then
P(R_1 ∩ R_2 ∩ R_3) = P(R_1) P(R_2|R_1) P(R_3|R_1 ∩ R_2) = (8/12) · (7/11) · (6/10) = 14/55.
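A quick Monte Carlo sketch in Python (illustrative only; the trial count is arbitrary) agrees with this exact value:

```python
# Sketch: Monte Carlo check of P(R1 ∩ R2 ∩ R3) = 14/55 for the urn example.
import random

trials, hits = 100_000, 0
for _ in range(trials):
    urn = ["R"] * 8 + ["W"] * 4
    draw = random.sample(urn, 3)     # draw 3 balls without replacement
    hits += draw == ["R", "R", "R"]
print(hits / trials, 14 / 55)        # both close to 0.2545...
```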
Example 1.13. A bag contains 26 tickets, one with each letter of the alphabet. If six tickets
are drawn at random from the bag (without replacement), what is the chance that they can be
rearranged to spell CALVIN ?
Solution. Write Ai for the event that the ith ticket drawn is from the set {C, A, L, V, I, N }. By
(1.4),
P(A_1 ∩ . . . ∩ A_6) = (6/26) · (5/25) · (4/24) · (3/23) · (2/22) · (1/21).
Now
P(0 received) = P(0 sent and 0 received) + P(1 sent and 0 received).
Now we use (1.3) to get
Similarly,
1.5 The law of total probability and Bayes’ theorem
Definition 1.15. A family of events {B_1, B_2, . . .} is a partition of Ω if
1. Ω = \bigcup_{i≥1} B_i (so that at least one B_i must happen), and
2. B_i ∩ B_j = ∅ whenever i ≠ j (so that no two of them can happen together).
Theorem 1.16 (Law of total probability). Suppose that {B_1, B_2, . . .} is a partition of Ω such that P(B_i) > 0 for all i ≥ 1. Then for any A ∈ F,
P(A) = \sum_{i≥1} P(A|B_i) P(B_i).
This result is sometimes also called the partition theorem. We used it in our bitstream example to calculate the probability that 0 was received.
Proof. We have
P(A) = P(A ∩ \bigcup_{i≥1} B_i)   (since \bigcup_{i≥1} B_i = Ω)
= P(\bigcup_{i≥1} (A ∩ B_i))
= \sum_{i≥1} P(A ∩ B_i)   (by axiom P_3, since A ∩ B_i, i ≥ 1, are disjoint)
= \sum_{i≥1} P(A|B_i) P(B_i).
Note that if P(B_i) = 0 for some i, then the expression in Theorem 1.16 wouldn't make sense, since P(A|B_i) is undefined. (Although we could agree a convention by which P(A|B)P(B) means 0 whenever P(B) = 0; then we can make sense of the expression in Theorem 1.16 even if some of the B_i have zero probability.) In any case, we can still write P(A) = \sum_i P(A ∩ B_i).
Example 1.17. Crossword setter I composes m clues; setter II composes n clues. Alice’s prob-
ability of solving a clue is α if the clue was composed by setter I and β if the clue was composed
by setter II.
Alice chooses a clue at random. What is the probability she solves it?
Solution. Let
B_1 = {the clue was composed by setter I}, B_2 = {the clue was composed by setter II}, A = {Alice solves the clue}.
Then
P(B_1) = \frac{m}{m + n}, P(B_2) = \frac{n}{m + n}, P(A|B_1) = α, P(A|B_2) = β.
By the law of total probability,
P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2) = \frac{αm}{m + n} + \frac{βn}{m + n} = \frac{αm + βn}{m + n}.
In our solution to Example 1.14, we combined the law of total probability with the definition
of conditional probability. In general, this technique has a name:
Theorem 1.18 (Bayes' Theorem). Suppose that {B_1, B_2, . . .} is a partition of Ω by sets from F such that P(B_i) > 0 for all i ≥ 1. Then for any A ∈ F such that P(A) > 0,
P(B_k|A) = \frac{P(A|B_k) P(B_k)}{\sum_{i≥1} P(A|B_i) P(B_i)}.
Proof. We have
P(B_k|A) = \frac{P(B_k ∩ A)}{P(A)} = \frac{P(A|B_k) P(B_k)}{P(A)},
and expanding P(A) by the law of total probability gives the result.
Example 1.19. Recall Alice, from Example 1.17. Suppose that she chooses a clue at random
and solves it. What is the probability that the clue was composed by setter I?
By Bayes' Theorem,
P(B_1|A) = \frac{P(A|B_1) P(B_1)}{P(A|B_1) P(B_1) + P(A|B_2) P(B_2)} = \frac{αm/(m + n)}{(αm + βn)/(m + n)} = \frac{αm}{αm + βn}.
When doing calculations, often it can be convenient to work with Bayes’ Theorem in “odds”
form. If we have an event B with probability P(B), then the ratio P(B)/P(B c ) is sometimes
called the odds of B.
By applying Bayes' Theorem with the partition {B, B^c}, and comparing the expressions it gives for P(B|A) and for P(B^c|A), we can obtain
\frac{P(B|A)}{P(B^c|A)} = \frac{P(A|B)}{P(A|B^c)} · \frac{P(B)}{P(B^c)}.
The second fraction on the right-hand side is the odds of B. We could call the left-hand side the conditional odds of B given A. The first fraction on the right-hand side is often called a Bayes factor.
Example 1.20. A particular medical condition has prevalence 1/1000 in the population. There exists a test for the condition which has false negative rate 0 (i.e. any person with the condition will test positive) and false positive rate 0.01 (i.e. a typical person without the condition tests positive with probability 0.01).
A member of the population selected at random takes a test, and tests positive. What is the
probability that they have the condition?
Solution. Write B for the event that the individual has the condition, and A for the event that
the test outcome is positive. We are asked to calculate P(B|A).
We have P(B) = 1/1000, P(A|B) = 1 and P(A|B c ) = 0.01.
Using the odds form of Bayes' Theorem given above, we could write
\frac{P(B|A)}{P(B^c|A)} = \frac{P(A|B)}{P(A|B^c)} · \frac{P(B)}{P(B^c)} = \frac{1}{0.01} · \frac{1}{999} = \frac{100}{999}.
Writing p = P(B|A) and solving the equation p/(1 − p) = 100/999, we obtain P(B|A) = 100/1099 ≈ 0.091. Even though the test is "99% accurate", nonetheless the conditional probability of having the condition given a positive test is still less than 10%. The chance of a false positive considerably outweighs the chance of a true positive, since the prevalence in the population is very low.
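The same number can be obtained directly from Bayes' Theorem; a minimal Python sketch:

```python
# Sketch: direct computation of P(B|A) for the screening example.
prior = 1 / 1000          # P(B), prevalence
p_pos_given_B = 1.0       # P(A|B), no false negatives
p_pos_given_not_B = 0.01  # P(A|B^c), false positive rate

p_pos = p_pos_given_B * prior + p_pos_given_not_B * (1 - prior)
posterior = p_pos_given_B * prior / p_pos
print(posterior)          # ≈ 0.0909..., i.e. just over 9%
```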
Example 1.21 (Simpson's paradox). Consider the following table showing a comparison of the outcomes of two types of surgery for the removal of kidney stones (from Charig et al, 1986), giving successes/operations:

                 open surgery     percutaneous nephrolithotomy
 small stones    81/87 (93%)      234/270 (87%)
 large stones    192/263 (73%)    55/80 (69%)
 all stones      273/350 (78%)    289/350 (83%)

Open surgery has the higher success rate for small stones and for large stones separately, and yet it has the lower success rate overall.
1.6 Independence
Of course, knowing that B has happened doesn't always influence the chances of A. For example, roll two fair dice and let
A = {first die shows 4}, B = {total score is 6} and C = {total score is 7}.
Then
P(A ∩ B) = P({(4, 2)}) = 1/36,
but
P(A) P(B) = (1/6) · (5/36) = 5/216 ≠ 1/36.
So A and B are not independent. However, A and C are independent (you should check this).
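Here is a minimal Python sketch of that check, enumerating the 36 outcomes with exact fractions:

```python
# Sketch: checking that A and C are independent by enumeration.
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))
P = lambda E: Fraction(len(E), len(omega))

A = [w for w in omega if w[0] == 4]          # first die shows 4
C = [w for w in omega if w[0] + w[1] == 7]   # total score is 7
AC = [w for w in omega if w in A and w in C]
print(P(AC), P(A) * P(C))                    # both equal 1/36
```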
More generally, if {A_i, i ∈ I} is any family of independent events, then also the family {A_i^c, i ∈ I} is independent. Proof: exercise! (We need to show that the product formula in Definition 1.22 holds for all finite subsets {A_i^c, i ∈ J}. One approach is to use the inclusion-exclusion formula from Problem Sheet 1; various induction arguments are also possible.)
1.7 Some useful rules for calculating probabilities
If you’re faced with a probability calculation you don’t know how to do, here are some things to
try.
If the event is complicated, try its complement:
P(A) = 1 − P(A^c),
or its generalisation:
P(A_1 ∪ A_2 ∪ . . . ∪ A_n) = 1 − P(A_1^c ∩ A_2^c ∩ . . . ∩ A_n^c)
("the probability that at least one of the events occurs is 1 minus the probability that none of them occur"). If that's no use, try using the inclusion-exclusion formula (see the problem sheet):
P(A_1 ∪ A_2 ∪ . . . ∪ A_n) = \sum_{i=1}^{n} P(A_i) − \sum_{i<j} P(A_i ∩ A_j) + · · · + (−1)^{n+1} P(A_1 ∩ A_2 ∩ . . . ∩ A_n).
If you can’t calculate the probability of your event directly, try splitting it up according to
some partition of Ω and using the law of total probability.
Useful check: any probability that you calculate should be in the interval [0, 1]! If not, something has gone wrong. . . .
Chapter 2
Observing whether an event occurs corresponds to answering a yes-no question about a system
that we’re observing, or an experiment that we’re performing. But often we may want to answer
more general questions. In particular, we may want to ask questions such as "how many?", "how big?", "for how long?", whose answer is a real number.
Random variables are essentially real-valued measurements of this kind. We start by consid-
ering discrete random variables, which take a finite or countable set of values (for example integer
values). Here’s a formal definition.
Definition 2.1. A discrete random variable X on a probability space (Ω, F, P) is a function X : Ω → R such that
(a) {ω ∈ Ω : X(ω) = x} ∈ F for each x ∈ R,
(b) ImX := {X(ω) : ω ∈ Ω} is a finite or countable set.
Condition (b) says that X can only take countably many values. Often ImX will be a subset of N. If Ω is countable, (b) holds automatically because we can think of ImX as being indexed by Ω, and so, therefore, ImX must itself be countable. If we also take F to be the set of all subsets of Ω then (a) is also immediate.
Later in the course, we will deal with continuous random variables, which take uncountably many values; we have to be a bit more careful about what the correct analogue of (a) is; we will end up requiring that sets of the form {X ≤ x} = {ω ∈ Ω : X(ω) ≤ x} are events to which we can assign probabilities.
Example 2.2. Roll two dice and take Ω = {(i, j) : 1 ≤ i, j ≤ 6}. Examples of random variables
include X and Y where
Definition 2.3. The probability mass function (p.m.f.) of X is the function pX : R → [0, 1]
defined by
pX (x) = P(X = x).
If x ∉ ImX (that is, X(ω) never equals x) then p_X(x) = P({ω : X(ω) = x}) = P(∅) = 0. Also
\sum_{x∈ImX} p_X(x) = \sum_{x∈ImX} P({ω : X(ω) = x})
= P(\bigcup_{x∈ImX} {ω : X(ω) = x})   (since the events are disjoint)
= P(Ω)   (since every ω ∈ Ω gets mapped somewhere in ImX)
= 1.
Example 2.4. Let A be an event and define X(ω) = 1 if ω ∈ A and X(ω) = 0 otherwise. Then
p_X(1) = P(A), p_X(0) = 1 − P(A),
and p_X(x) = 0 for all x ≠ 0, 1. We will usually write X = 1_A and call this the indicator function of the event A.
1. The Bernoulli distribution. X has the Bernoulli distribution with parameter p (where
0 ≤ p ≤ 1) if
P(X = 0) = 1 − p, P(X = 1) = p.
We often write q = 1 − p. (Of course since (1 − p) + p = 1, we must have P (X = x) = 0 for
all other values of x.) We write X ∼ Ber(p).
We showed in Example 2.4 that the indicator function 1A of an event A is an example of a
Bernoulli random variable with parameter p = P (A), constructed on an explicit probability
space.
The Bernoulli distribution is used to model, for example, the outcome of the flip of a coin
with “1” representing heads and “0” representing tails. It is also a basic building block for
other classical distributions.
3. The geometric distribution. X has a geometric distribution with parameter p (where 0 < p ≤ 1) if
P(X = k) = p(1 − p)^{k−1}, k = 1, 2, . . . .
Notice that now X takes values in a countably infinite set – the whole of the positive integers. We write X ∼ Geom(p).
Again we can interpret the distribution in terms of a sequence of independent trials, each
with probability p of success. Now X models the number of independent trials needed until
we see the first success.
NOTE: there is an alternative and also common definition for the geometric distribution
as the distribution of the number of failures, Y , before the first success. If X is defined as
above, then Y corresponds to X − 1 and so
P(Y = k) = p(1 − p)^k, k = 0, 1, . . . .
4. The Poisson distribution. X has the Poisson distribution with parameter λ ≥ 0 if
P(X = k) = \frac{λ^k e^{−λ}}{k!}, k = 0, 1, . . . .
We write X ∼ Po(λ).
This distribution arises in many applications. It frequently provides a good model for the
number of events which occur in some situation where there are a large number of possible
events, each of which has a very small probability. For example, the number of decaying
particles detected by a Geiger counter near a lump of radioactive material, or the number of
calls arriving at a call centre in a given time period. Formally speaking it can be obtained
as a limit of the binomial distribution Bin(n, λ/n) as n becomes large (see problem sheet
3).
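A small Python sketch illustrating this limit, comparing Bin(n, λ/n) probabilities with Po(λ) for λ = 2 (the choice of λ and the values of n are arbitrary):

```python
# Sketch: Bin(n, λ/n) probabilities approach Po(λ) as n grows.
from math import comb, exp, factorial

lam = 2.0
poisson = [lam**k * exp(-lam) / factorial(k) for k in range(5)]
for n in (10, 100, 1000):
    p = lam / n
    binom = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(5)]
    print(n, [f"{b:.4f}" for b in binom])
print("Po(2)", [f"{q:.4f}" for q in poisson])
```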
Exercise 2.5. Check that each of these really does define a probability mass function. That is, check that \sum_x p_X(x) = 1. You may find it useful to refer to the reminders about series which you can find in the Appendix at the end of these notes.
The distributions mentioned above are important, but naturally they do not constitute an exhaustive list! In fact, given any function p_X which satisfies the two properties above, we can write down a probability space and a random variable X defined on it whose probability mass function is p_X. Most directly, we could take Ω = {x ∈ R : p(x) ≠ 0}, take F to consist of all subsets of Ω, define
P({ω}) = p(ω) for each ω ∈ Ω
and more generally
P(S) = \sum_{ω∈S} p(ω) for each S ⊆ Ω,
and then take X to be the identity function i.e. X(ω) = ω. However, this is often not the
most natural probability space to take. For example, suppose that pX is the mass function of
a Binomial(3, 1/2) distribution (which can model the number of heads obtained in a sequence
of three fair coin tosses). We could proceed as just outlined. But we could also take Ω =
{(i, j, k) : i, j, k ∈ {0, 1}}, with a 0 representing a tail and a 1 representing a head, so that an
element of Ω tells us exactly what the three coin tosses were. Then take F to be the power set
of Ω,
P({(i, j, k)}) = 2^{−3} for all i, j, k ∈ {0, 1},
so that every sequence of coin tosses is equally likely, and finally set X((i, j, k)) = i + j + k. In
both cases, X has the same distribution, but the probability spaces are quite different.
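A small Python sketch (illustrative, not part of the development) confirming that the two constructions produce the same probability mass function:

```python
# Sketch: two probability spaces for a Binomial(3, 1/2) random variable.
from itertools import product
from math import comb
from fractions import Fraction

# Construction 2: Omega = all sequences of three coin tosses, X = #heads.
omega = list(product((0, 1), repeat=3))
pmf_from_tosses = {
    k: Fraction(sum(1 for w in omega if sum(w) == k), 8) for k in range(4)
}
# Direct Binomial(3, 1/2) mass function.
pmf_direct = {k: Fraction(comb(3, k), 8) for k in range(4)}
print(pmf_from_tosses == pmf_direct)  # True
```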
Up to this point, we have often been quite explicit in describing our sample space Ω. Once
we get to more complicated examples, it can quickly become impractical to specify Ω explicitly.
Although the concept of a probability space (Ω, F, P) underlies everything, in practice it will be
rare that we think about Ω itself – instead we will talk directly about events and their probabilities,
and random variables and their distributions (and we can do that without assuming any particular
structure for Ω).
2.2 Expectation
We can encapsulate certain information about the distribution of a random variable through what
a statistician might call “summary statistics”. The most fundamental of these is the expectation,
which tells us the “average value” of the random variable.
Definition 2.6. The expectation (or expected value or mean) of a discrete random variable X is
E[X] = \sum_{x∈ImX} x P(X = x),
provided that the sum converges absolutely.
Example 2.7. 1. Suppose that X is the number obtained when we roll a fair die. Then
E[X] = \sum_{k=1}^{6} k · (1/6) = 7/2.
3. Suppose that P(X = n) = \frac{6}{π² n²}, n ≥ 1. Then
\sum_{n=1}^{∞} n P(X = n) = \frac{6}{π²} \sum_{n=1}^{∞} \frac{1}{n} = ∞
and so the expectation does not exist (or we may say E[X] = ∞).
Theorem 2.8. Let X be a discrete random variable and h : R → R. Then, whenever the sum converges absolutely,
E[h(X)] = \sum_{x∈ImX} h(x) P(X = x).
Proof. Let A = {y : y = h(x) for some x ∈ ImX}. Then, starting from the right-hand side,
\sum_{x∈ImX} h(x) P(X = x) = \sum_{y∈A} \sum_{x∈ImX : h(x)=y} h(x) P(X = x)
= \sum_{y∈A} \sum_{x∈ImX : h(x)=y} y P(X = x)
= \sum_{y∈A} y \sum_{x∈ImX : h(x)=y} P(X = x)
= \sum_{y∈A} y P(h(X) = y)
= E[h(X)].
Example 2.9. Take h(x) = x^k. Then E[X^k] is called the kth moment of X, when it exists.
Let us now prove some properties of the expectation which will be useful to us later on.
Theorem 2.10. Let X be a discrete random variable such that E[X] exists.
(a) If X is non-negative (that is, ImX ⊆ [0, ∞)) then E[X] ≥ 0.
(b) For a, b ∈ R, E[aX + b] = aE[X] + b.
Proof. (a) We have ImX ⊆ [0, ∞) and so
E[X] = \sum_{x∈ImX} x P(X = x)
is a sum whose terms are all non-negative and so must itself be non-negative.
(b) Exercise.
The expectation tells us important information about the average value of a random variable,
but there is a lot of information that it doesn’t provide. If you are investing in the stock market,
the expected rate of return of some stock is not the only quantity you will be interested in;
you would also like to get some idea of the riskiness of the investment. The typical size of
the fluctuations of a random variable around its expectation is described by another summary
statistic, the variance.
Definition 2.11. The variance of a discrete random variable X is
var(X) = E[(X − E[X])²],
whenever this exists. (This is E[f(X)] where f is given by f(x) = (x − E[X])² – remember that E[X] is just a number.)
Note that, since (X − E[X])² is a non-negative random variable, by part (a) of Theorem 2.10, var(X) ≥ 0. The variance is a measure of how much the distribution of X is spread out about its mean: the more the distribution is spread out, the larger the variance. If X is, in fact, deterministic (i.e. P(X = a) = 1 for some a ∈ R) then E[X] = a also and so var(X) = 0: only randomness gives rise to variance.
Writing µ = E[X] and expanding the square, we see that
var(X) = E[X² − 2µX + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − (E[X])².
Theorem 2.12. Suppose that X is a discrete random variable whose variance exists. Then if a
and b are (finite) fixed real numbers, then the variance of the discrete random variable Y = aX +b
is given by
var (Y ) = var (aX + b) = a2 var (X) .
The proof is an exercise, but notice that of course b doesn’t come into it because it simply
shifts the whole distribution – and hence the mean – by b, whereas variance measures relative to
the mean.
In view of Theorem 2.12, why do you think statisticians often prefer to use the standard
deviation rather than variance as a measure of spread?
2.3 Conditional distributions
Back in Section 1.4 we talked about conditional probability P(A|B). In the same way, for a
discrete random variable X we can define its conditional distribution, given the event B. This is
what it sounds like: the mass function obtained by conditioning on the outcome B.
Definition 2.13. Suppose that B is an event such that P(B) > 0. Then the conditional probability mass function of X given B is given by
P(X = x|B) = \frac{P({X = x} ∩ B)}{P(B)}.
We write p_{X|B}(x) = P(X = x|B). The conditional expectation of X given B is
E[X|B] = \sum_x x P(X = x|B),
whenever the sum converges absolutely.
Theorem 2.14 (Law of total probability for expectations). Let {B_1, B_2, . . .} be a partition of Ω such that P(B_i) > 0 for all i ≥ 1. Then
E[X] = \sum_i E[X|B_i] P(B_i),
whenever E[X] exists.
Proof.
E[X] = \sum_x x P(X = x)
= \sum_x x \left(\sum_i P(X = x|B_i) P(B_i)\right)   (by the law of total probability)
= \sum_x \sum_i x P(X = x|B_i) P(B_i)
= \sum_i P(B_i) \left(\sum_x x P(X = x|B_i)\right)
= \sum_i E[X|B_i] P(B_i).
Notice that again we could include cases where P (Bi ) = 0 for some i, if we agree to interpret
E [X|Bi ] P (Bi ) to be 0 in that situation.
Example 2.15. Let X be the number of rolls of a fair die required to get the first 6. (So X is
geometrically distributed with parameter 1/6.) Find E[X] and var (X).
Solution. Let B_1 be the event that the first roll of the die gives a 6, so that B_1^c is the event that it does not. Then
E[X] = E[X|B_1] P(B_1) + E[X|B_1^c] P(B_1^c) = 1 · (1/6) + (1 + E[X]) · (5/6).
Rearrange to get E[X] = 6 (as our intuition would have us guess). Similarly, conditioning on B_1 gives an equation for E[X²], from which var(X) = E[X²] − (E[X])² = 30.
We'll see a powerful approach to moment calculations in Chapter 4, but first we must find a way to deal with more than one random variable at a time.
Definition 2.16. Given discrete random variables X and Y, their joint probability mass function is
p_{X,Y}(x, y) = P({X = x} ∩ {Y = y}).
We usually write the right-hand side simply as P(X = x, Y = y). We have p_{X,Y}(x, y) ≥ 0 for all x, y ∈ R and \sum_x \sum_y p_{X,Y}(x, y) = 1.
If we are given the joint probability mass function of X and Y, we can recover the probability mass functions of either of X or Y individually by summing over the possible values of the other:
p_X(x) = \sum_y p_{X,Y}(x, y),
p_Y(y) = \sum_x p_{X,Y}(x, y).
In this context these distributions of X and Y alone are often called marginal distributions. The name comes from the way they appear in the margins when we write the joint probability mass function as a table:
Example 2.17. Suppose that X and Y take only the values 0 or 1 and their joint mass function p_{X,Y} is given by

 Y \ X |  0     1
   0   | 1/3   1/2
   1   | 1/12  1/12
Observe that \sum_{x,y} p_{X,Y}(x, y) = 1 (always a good check when modelling).
The marginals are found by summing the rows and columns:

 Y \ X  |  0     1    | p_Y(y)
   0    | 1/3   1/2   |  5/6
   1    | 1/12  1/12  |  1/6
 p_X(x) | 5/12  7/12  |
Notice that P(X = 1) = 7/12, P(Y = 1) = 1/6 and P(X = 1, Y = 1) = 1/12 ≠ (7/12) × (1/6), so {X = 1} and {Y = 1} are not independent events.
Whenever p_X(x) > 0 for some x ∈ R, we can also write down the conditional probability mass function of Y given that X = x:
p_{Y|X=x}(y) = P(Y = y|X = x) = \frac{p_{X,Y}(x, y)}{p_X(x)} for y ∈ R.
Discrete random variables X and Y are called independent if
P(X = x, Y = y) = P(X = x) P(Y = y) for all x, y ∈ R.
In other words, X and Y are independent if and only if the events {X = x} and {Y = y} are independent for all choices of x and y. We can also write this as
p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y ∈ R.
Example 2.20 (Part of an old exam question). A coin when flipped shows heads with probability p and tails with probability q = 1 − p. It is flipped repeatedly. Assume that the outcomes of different flips are independent. Let U be the length of the initial run and V the length of the second run. Find P(U = m, V = n), P(U = m), P(V = n). Are U and V independent?
Solution. We condition on the outcome of the first flip and use the law of total probability.
P(U = m, V = n) = P(U = m, V = n | 1st flip H) P(1st flip H) + P(U = m, V = n | 1st flip T) P(1st flip T)
= p^{m−1} q^n p · p + q^{m−1} p^n q · q
= p^{m+1} q^n + q^{m+1} p^n.
P(U = m) = \sum_{n=1}^{∞} (p^{m+1} q^n + q^{m+1} p^n) = p^{m+1} \frac{q}{1 − q} + q^{m+1} \frac{p}{1 − p} = p^m q + q^m p.
P(V = n) = \sum_{m=1}^{∞} (p^{m+1} q^n + q^{m+1} p^n) = q^n \frac{p²}{1 − p} + p^n \frac{q²}{1 − q} = p² q^{n−1} + q² p^{n−1}.
We have P(U = m, V = n) ≠ f(m) g(n) unless p = q = 1/2. So U, V are not independent unless p = 1/2. To see why, suppose that p < 1/2; then knowing that U is small, say, tells you that the first run is more likely to be a run of H's and so V is likely to be longer. Conversely, knowing that U is big will tell us that V is likely to be small. U and V are negatively correlated.
In the same way as we defined expectation for a single discrete random variable, so in the
bivariate case we can define expectation of any function of the random variables X and Y . Let
h : R2 → R. Then h(X, Y ) is itself a random variable, and we can show as in Theorem 2.8 that
E[h(X, Y)] = \sum_x \sum_y h(x, y) P(X = x, Y = y) = \sum_x \sum_y h(x, y) p_{X,Y}(x, y). (2.2)
Theorem 2.21. Let X and Y be discrete random variables and a, b ∈ R. Then
E[aX + bY] = aE[X] + bE[Y].
Proof. Take h(x, y) = ax + by in (2.2):
E[aX + bY] = \sum_x \sum_y (ax + by) p_{X,Y}(x, y) = a \sum_x x p_X(x) + b \sum_y y p_Y(y) = aE[X] + bE[Y].
Theorem 2.21 tells us that expectation is linear. This is a very important property. We can
easily extend by induction to get E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ] for any finite
collection of random variables X1 , . . . , Xn . Note that we don’t need to make any assumption
about independence of the random variables.
Example 2.22. Your spaghetti bowl contains n strands of spaghetti. You repeatedly choose 2
ends at random, and join them together. What is the average number of loops in the bowl, once
no ends remain?
Solution. We start with 2n ends, and the number decreases by 2 at each step. When we have k ends, the probability of forming a loop is 1/(k − 1). Before the ith step, we have 2(n − i + 1) ends, so we form a loop with probability 1/(2(n − i) + 1).
Let X_i be the indicator function of the event that we form a loop at the ith step. Then E[X_i] = 1/(2(n − i) + 1). Let M be the total number of loops formed. Then M = X_1 + · · · + X_n, so using linearity of expectation,
E[M] = \sum_{i=1}^{n} \frac{1}{2(n − i) + 1} = 1 + \frac{1}{3} + \frac{1}{5} + · · · + \frac{1}{2n − 1}.
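As a check, here is a Python sketch that computes the exact sum and compares it with a simulation of the joining procedure (the simulation tracks, for each loose end, which strand it currently belongs to; names and trial count are arbitrary):

```python
# Sketch: exact expectation 1 + 1/3 + ... + 1/(2n-1) versus simulation.
import random

def simulate_loops(n: int) -> int:
    ends = list(range(2 * n))
    strand = {e: e // 2 for e in ends}   # end e starts on strand e // 2
    loops = 0
    while ends:
        a, b = random.sample(ends, 2)
        if strand[a] == strand[b]:
            loops += 1                   # joined the two ends of one strand
        else:
            sb = strand[b]               # merge b's strand into a's strand
            for e in ends:
                if strand[e] == sb:
                    strand[e] = strand[a]
        ends.remove(a)
        ends.remove(b)
    return loops

n, trials = 10, 20_000
exact = sum(1 / (2 * k + 1) for k in range(n))
mc = sum(simulate_loops(n) for _ in range(trials)) / trials
print(exact, mc)   # both ≈ 2.133 for n = 10
```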
Theorem 2.23. If X and Y are independent discrete random variables whose expectations exist,
then
E[XY ] = E[X]E[Y ].
Proof. We have
E[XY] = \sum_x \sum_y xy P(X = x, Y = y)
= \sum_x \sum_y xy P(X = x) P(Y = y)   (by independence)
= \left(\sum_x x P(X = x)\right) \left(\sum_y y P(Y = y)\right)
= E[X] E[Y].
Exercise 2.24. Show that var (X + Y ) = var (X) + var (Y ) when X and Y are independent.
What happens when X and Y are not independent? It's useful to define the covariance,
cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y].
Notice that this means that if X and Y are independent, their covariance is 0. In general, the covariance can be either positive or negative valued.
WARNING: cov (X, Y ) = 0 DOES NOT IMPLY THAT X AND Y ARE INDEPENDENT.
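A standard counterexample illustrating the warning: take X uniform on {−1, 0, 1} and Y = X² (this particular example is illustrative, not from the notes). A minimal Python sketch with exact fractions:

```python
# Sketch: zero covariance without independence, X uniform on {-1, 0, 1}, Y = X^2.
from fractions import Fraction

xs = [-1, 0, 1]
p = Fraction(1, 3)
EX = sum(p * x for x in xs)             # E[X]  = 0
EY = sum(p * x * x for x in xs)         # E[Y]  = 2/3
EXY = sum(p * x * (x * x) for x in xs)  # E[XY] = 0
print(EXY - EX * EY)                    # cov(X, Y) = 0
# yet P(X = 0, Y = 0) = 1/3, while P(X = 0) P(Y = 0) = 1/9
print(p, p * p)
```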
By analogy with the way we defined independence for a sequence of events, we can define
independence for a family of random variables.
Definition 2.27. A family {Xi : i ∈ I} of discrete random variables are independent if for all
finite sets J ⊆ I and all collections {Ai : i ∈ J} of subsets of R,
P(\bigcap_{i∈J} {X_i ∈ A_i}) = \prod_{i∈J} P(X_i ∈ A_i).
Suppose that X1 , X2 , . . . are independent random variables which all have the same distribution.
Then we say that X1 , X2 , . . . are independent and identically distributed (i.i.d.).
Chapter 3
The next theorem says that we can split the problem of finding a solution to our difference
equations into two parts.
Theorem 3.3. The general solution (u_n)_{n≥0} (i.e. if no boundary conditions are specified) of
\sum_{j=0}^{k} a_j u_{n+j} = f(n)
can be written as u_n = v_n + w_n, where (v_n)_{n≥0} is a particular solution to the equation and (w_n)_{n≥0} is any solution of the associated homogeneous equation
\sum_{j=0}^{k} a_j w_{n+j} = 0.
To see that there always are solutions, we choose u_0 = · · · = u_{k−1} = 0 and note that this equation determines u_{n+k} inductively for all n ≥ 0 so that the equation holds. Now fix a particular solution (v_n)_{n≥0} and consider any solution (u_n)_{n≥0}. Then (u_n − v_n)_{n≥0} satisfies
\sum_{j=0}^{k} a_j (u_{n+j} − v_{n+j}) = 0.
Solution. The homogeneous equation is w_{n+1} = a w_n. "Putting it into itself", we get
w_n = a w_{n−1} = · · · = a^n w_0 = A a^n.
Hence,
u_n = \left(3 − \frac{b}{1 − a}\right) a^n + \frac{b}{1 − a} = 3a^n + \frac{b(1 − a^n)}{1 − a}.
What happens if a = 1? An applied-maths-type approach would set a = 1 + ε and try to see what happens as ε → 0:
u_n = u_0 (1 + ε)^n + \frac{b(1 − (1 + ε)^n)}{1 − (1 + ε)}
= u_0 + b \frac{1 − (1 + nε)}{−ε} + O(ε)
= u_0 + bn + O(ε) → u_0 + bn as ε → 0.
An alternative approach is to mimic what you did for differential equations and “try the next most
complex thing”. We have un+1 = un + b and the homogeneous equation has solution wn = A (a
constant). For a particular solution try vn = Cn (note that there is no point in adding a constant
term because the constant solves the homogeneous equation and so it makes no contribution to
the right-hand side when we substitute).
Then C(n + 1) = Cn + b gives C = b and we obtain once again the general solution
un = A + bn.
Example 3.5. Solve
u_{n+1} = a u_n + bn.
Solution. As above, the homogeneous equation has solution w_n = A a^n. For a particular solution, try v_n = Cn + D. Substituting, we obtain
C = aC + b, C + D = aD,
so again provided a ≠ 1 we can solve to obtain C = \frac{b}{1 − a} and D = −\frac{C}{1 − a} = −\frac{b}{(1 − a)²}. Thus for a ≠ 1,
u_n = A a^n + \frac{bn}{1 − a} − \frac{b}{(1 − a)²}.
To find A, we need a boundary condition (e.g. the value of u0 ).
For a non-trivial solution we can divide by Aλ^{n−1} and see that λ must solve the quadratic equation
λ² + aλ + b = 0.
This is called the auxiliary equation. (So just as when you solve 2nd order ordinary differential
equations you obtain a quadratic equation by considering solutions of the form eλt , so here we
obtain a quadratic in λ by considering solutions of the form λn .)
If the auxiliary equation has distinct roots λ_1 and λ_2, then the general solution to the homogeneous equation is
w_n = A_1 λ_1^n + A_2 λ_2^n.
If λ_1 = λ_2 = λ, try the next most complicated thing (or mimic what you do for ordinary differential equations) to get
w_n = (A + Bn) λ^n.
Exercise 3.7. Check that this solution works.
How about particular solutions? The same tricks as for the first order case apply. We can start by trying something of the same form as f, and if that fails then try the next most
complicated thing. You can save yourself work by not including components that you already
know solve the homogeneous equation.
Example 3.8. Solve
u_{n+1} + 2u_n − 3u_{n−1} = 1.
Solution. The auxiliary equation is just
λ² + 2λ − 3 = 0,
which has roots λ = 1 and λ = −3, so the general solution to the homogeneous equation is
w_n = A(−3)^n + B.
For a particular solution, we'd like to try a constant, but that won't work because we know that it solves the homogeneous equation (it's a special case of w_n). So try the next most complicated thing, which is v_n = Cn. Substituting, we obtain
C(n + 1) + 2Cn − 3C(n − 1) = 4C = 1,
so C = 1/4 and the general solution is u_n = A(−3)^n + B + n/4.
Example 3.9. Solve
u_{n+1} − 2u_n + u_{n−1} = 1.
Solution. The auxiliary equation λ² − 2λ + 1 = 0 has the repeated root λ = 1, so the homogeneous equation has general solution w_n = A + Bn. For a particular solution, try v_n = Cn². (There is no point including a constant or a multiple of n, since each solves the homogeneous equation, so including it on the left cannot contribute anything to the 1 that we are trying to obtain on the right of the equation.) Substituting, we obtain
C(n + 1)² − 2Cn² + C(n − 1)² = 2C = 1,
so C = 1/2 and
u_n = An + B + \frac{1}{2} n².
Example 3.10 (The Fibonacci numbers). The Fibonacci numbers 1, 1, 2, 3, 5, 8, 13, . . . are defined by the second-order linear difference equation
u_{n+1} = u_n + u_{n−1}, n ≥ 1, with u_0 = u_1 = 1.
When the auxiliary equation has complex roots, the same method applies. For example, the solution
u_n = \frac{1}{2}(1 + i√3)^n + \frac{1}{2}(1 − i√3)^n, n ≥ 0,
is, in fact, real for every n ≥ 0. In order to see this, recall that 1 + i√3 = 2e^{iπ/3} and 1 − i√3 = 2e^{−iπ/3}. So
u_n = 2^{n−1}(e^{inπ/3} + e^{−inπ/3}) = 2^n \cos(nπ/3),
which is real.
3.4 Random walks
We return to the gambler’s ruin problem of Example 3.1. The gambler’s fluctuating wealth is an
example of a more general class of random processes called random walks (sometimes the more
evocative phrase drunkard’s walk is used). Imagine a particle moving around a network. At each
step, it can move to one of the other nodes of the network: there are rules determining where the
particle can move to at the next time step from that position and with what probability it moves
to each of the possible new positions. The important point is that these rules only depend on
the current position, not on the earlier positions that the particle has visited. Random walks can
be used to model various real-world situations. For example, the path traced by a molecule as it
moves in a liquid or a gas; the path of an animal searching for food; or the price of a particular
stock every Monday morning. There are various examples on the problem sheets and later in the
course.
Let’s return to the setting of Example 3.1 and solve the recurrence relation we obtained there.
Recall that un = P (bankruptcy) if the gambler’s initial fortune is £n, and that (rearranging
(3.1)),
p u_{n+1} − u_n + q u_{n−1} = 0, 1 ≤ n ≤ M − 1, (3.5)
(where q = 1 − p), with u0 = 1, uM = 0. This is a homogeneous second-order difference equation.
The auxiliary equation is
pλ² − λ + q = 0,
which factorises as
(pλ − q)(λ − 1) = 0.
So λ = q/p or λ = 1. If p ≠ 1/2, then
u_n = A + B \left(\frac{q}{p}\right)^n
for some constants A and B which we can find using the boundary conditions:
u_0 = 1 = A + B and u_M = 0 = A + B \left(\frac{q}{p}\right)^M.
These give
A = −\frac{((1 − p)/p)^M}{1 − ((1 − p)/p)^M}, B = \frac{1}{1 − ((1 − p)/p)^M},
and so
u_n = \frac{((1 − p)/p)^n − ((1 − p)/p)^M}{1 − ((1 − p)/p)^M}.
Exercise 3.12. Check that in the case p = 1/2 we get
u_n = 1 − \frac{n}{M}, 0 ≤ n ≤ M.
Figure 3.1 shows a simulation of paths in the gambler’s ruin model.
Example 3.13. What is the expected number of plays in the gambler’s ruin model before the
gambler’s fortune hits either 0 or M ?
[Figure 3.1 appears here: a plot of position against time for simulated random-walk paths, p = 0.4, M = 20, start at 10.]
Figure 3.1: 10 simulated paths in the gambler’s ruin model, with M = 20, n = 10 and p = 0.4.
We see some get absorbed at 0, one at 20, and two which have not yet reached either boundary
at time 80.
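Such simulations are easy to reproduce; here is a minimal Python sketch that estimates the ruin probability for the parameters of Figure 3.1 and compares it with the formula for u_n (the trial count is an arbitrary choice):

```python
# Sketch: estimating the ruin probability and comparing with the formula.
import random

def ruined(n: int, M: int, p: float) -> bool:
    while 0 < n < M:
        n += 1 if random.random() < p else -1
    return n == 0

n, M, p = 10, 20, 0.4
q = 1 - p
trials = 20_000
estimate = sum(ruined(n, M, p) for _ in range(trials)) / trials
exact = ((q / p) ** n - (q / p) ** M) / (1 - (q / p) ** M)
print(estimate, exact)   # both ≈ 0.983
```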
Solution. Just as we used the partition theorem to get a recurrence for the probability of
bankruptcy, we can use the partition theorem for expectations to get a recurrence for the ex-
pected length of the process.
Let the initial fortune be £n, and let X be the number of steps until the walk reaches one of
the barriers at 0 or M . Write en for the expectation of X. Then
en = pE [X|first step is to n + 1] + qE [X|first step is to n − 1] .
Let’s think carefully about the conditional expectations on the right-hand side. If the first step is
to n + 1, then we have already spent one step to get there, and thereafter the number of steps to
reach the boundary is just the time to reach the boundary in a walk starting from n + 1. Hence
we get
E [X|first step is to n + 1] = 1 + en+1 ,
and similarly
E [X|first step is to n − 1] = 1 + en−1 .
So we obtain the recurrence
en = p(1 + en+1 ) + q(1 + en−1 )
which rearranges to give
p e_{n+1} − e_n + q e_{n−1} = −1. (3.6)
Our boundary conditions are e0 = eM = 0. Note that (3.6) has exactly the same form as (3.5),
except that the equation is no longer homogeneous: we have the constant −1 on the right-hand
side instead of 0.
Take the case p ≠ q. As above, we have the general solution to the homogeneous equation
w_n = A + B \left(\frac{q}{p}\right)^n.
For a particular solution to (3.6), try v_n = Cn (note that there's no point trying a constant since we already know that any constant solves the homogeneous equation). This yields
pC(n + 1) − Cn + qC(n − 1) = −1
for 0 ≤ n ≤ M, which simplifies to C(p − q) = −1, so that v_n = n/(q − p).
Exercise 3.14. Find en for p = q = 1/2 (the expression is rather simpler in that case!).
Finally, consider what happens if we remove the upper barrier at M , and instead have a
random walk on the infinite set {0, 1, 2, . . . }, starting from some site n > 0. Does the walk ever
reach the site 0, or does it stay strictly positive for ever? Let's look at the probability of the event that it hits 0. A natural idea is to let M → ∞ in the finite problem. Write u_n^{(M)} for the probability of hitting 0 before M, which we calculated above. Then we have
\lim_{M→∞} u_n^{(M)} = \begin{cases} (q/p)^n & \text{if } p > q, \\ 1 & \text{if } p ≤ q, \end{cases}
since (q/p)^M → 0 as M → ∞ when p > q, (q/p)^M → ∞ when p < q, and u_n^{(M)} = 1 − n/M → 1 when p = q = 1/2.
It turns out that this limit as M → ∞ really does give the appropriate probability that the
random walk on {0, 1, 2, . . . } hits 0. In particular, the walk has positive probability to stay away
from 0 for ever if and only if p > q. There are various ways to prove this; the idea below is not
complicated, but is nonetheless somewhat subtle.
Theorem 3.15. (Non-examinable) Consider a random walk on the integers Z, started from some n > 0, which at each step increases by 1 with probability p, and decreases by 1 with probability q = 1 − p. Then the probability u_n that the walk ever hits 0 is given by
u_n = \begin{cases} (q/p)^n & \text{if } p > q, \\ 1 & \text{if } p ≤ q. \end{cases}
Proof. In Proposition A.8 in the Appendix, we prove a useful result about increasing sequences of events. A sequence of events A_k, k ≥ 1 is called increasing if A_1 ⊆ A_2 ⊆ A_3 ⊆ . . . . Then Proposition A.8 says that for such a sequence of events,
P(\bigcup_{k=1}^{∞} A_k) = \lim_{k→∞} P(A_k).
(This can be regarded as a sort of continuity result for the probability function P.)
To apply this result to the random walk started from n, consider the event H that the random
walk reaches 0, and for each M , consider the event AM that the random walk reaches 0 before
M . If the walk ever reaches 0, then there must be some M such that the walk reaches 0 before
M , so that AM occurs.
Conversely, if any event A_M occurs, then clearly the event H also occurs. Hence we have H = \bigcup_M A_M.
Then indeed we have
u_n = P(H) = P(\bigcup_{M=1}^{∞} A_M) = \lim_{M→∞} P(A_M) = \lim_{M→∞} u_n^{(M)},
as desired.
Chapter 4
We’re now going to turn to an extremely powerful tool, not just in calculations but also in proving
more abstract results about discrete random variables.
From now on we consider non-negative integer-valued random variables, i.e. X takes values in {0, 1, 2, . . .}. Writing p_k = P(X = k), the probability generating function (p.g.f.)¹ of X is
G_X(s) = E[s^X] = \sum_{k=0}^{∞} p_k s^k.
Theorem 4.2 (Uniqueness theorem). The distribution of X is uniquely determined by its prob-
ability generating function, GX .
¹The probability generating function is an example of a power series, that is a function of the form f(x) = \sum_{n=0}^{∞} c_n x^n. It may be that this sum diverges for some values of x; the radius of convergence is the value r such that the sum converges if |x| < r and diverges if |x| > r. For a probability generating function, we can see that the radius of convergence must be at least 1. For the purposes of this course, you are safe to assume that the derivative of f is well-defined for |x| < r and is given by
f′(x) = \sum_{n=1}^{∞} n c_n x^{n−1},
i.e. what you would get differentiating term-by-term. Those of you who are doing Analysis I & II will learn more about power series there.
Proof. First note that G_X(0) = p_0. Now, for |s| < 1, we can differentiate G_X(s) term-by-term to get
G′_X(s) = p_1 + 2p_2 s + 3p_3 s² + · · · .
Setting s = 0, we see that G′_X(0) = p_1. Similarly, by differentiating repeatedly, we see that
\frac{d^k}{ds^k} G_X(s) \Big|_{s=0} = k!\, p_k.
So we can recover p_0, p_1, . . . from G_X.
Examples:
1. Bernoulli distribution with parameter p:
G_X(s) = (1 − p) + ps for all s ∈ R.
2. Binomial distribution Bin(n, p):
G_X(s) = ((1 − p) + ps)^n for all s ∈ R.
3. Poisson distribution Po(λ):
G_X(s) = e^{λ(s−1)} for all s ∈ R.
4. Geometric distribution with parameter p. Exercise on the problem sheet: check that
G_X(s) = \frac{ps}{1 − (1 − p)s},
provided that |s| < \frac{1}{1 − p}.
Proof. We have
G_{X+Y}(s) = E[s^{X+Y}] = E[s^X s^Y].
Since X and Y are independent, s^X and s^Y are independent (see a question on the problem sheet). So then by Theorem 2.23, this is equal to
E[s^X] E[s^Y] = G_X(s) G_Y(s).
Theorem 4.4. Suppose that X1 , X2 , . . . , Xn are independent Ber(p) random variables and let
Y = X1 + · · · + Xn . Then Y ∼ Bin(n, p).
Proof. We have
G_Y(s) = E[s^{X_1+···+X_n}] = \prod_{i=1}^{n} E[s^{X_i}] = ((1 − p) + ps)^n.
As Y has the same p.g.f. as a Bin(n, p) random variable, the Uniqueness theorem yields that Y ∼ Bin(n, p).
The interpretation of this is that Xi tells us whether the ith of a sequence of independent
coin flips is heads or tails (where heads has probability p). Then Y counts the number of heads
in n independent coin flips and so must be distributed as Bin(n, p).
Theorem 4.5. Suppose that X_1, X_2, . . . , X_n are independent random variables such that X_i ∼ Po(λ_i). Then
\sum_{i=1}^{n} X_i ∼ Po\left(\sum_{i=1}^{n} λ_i\right).
In particular, if λ_i = λ for all 1 ≤ i ≤ n then \sum_{i=1}^{n} X_i ∼ Po(nλ).
Proof. We have
E[s^{X_1+X_2+···+X_n}] = \prod_{i=1}^{n} E[s^{X_i}] = \prod_{i=1}^{n} e^{λ_i(s−1)} = \exp\left((s − 1) \sum_{i=1}^{n} λ_i\right).
Since this is the p.g.f. of the Po(\sum_{i=1}^{n} λ_i) distribution and probability generating functions uniquely determine distributions, the result follows.
So
G′_X(1) = E[X]
(as long as E[X] exists). Differentiating again, we then similarly get G″_X(1) = E[X(X − 1)], so that
var(X) = G″_X(1) + G′_X(1) − (G′_X(1))².
Example 4.6. Let X_1, X_2, X_3 be independent, identically distributed random variables with common p.g.f.
G(s) = \frac{1}{6} + \frac{s}{3} + \frac{s²}{2},
and let Y = X_1 + X_2 + X_3.
1. Find the mean and variance of X_1.
2. Find the p.g.f. of Y, and hence P(Y = 3).
3. What is the p.g.f. of 3X_1? Why is it not the same as the p.g.f. of Y? What is P(3X_1 = 3)?
Solution. 1. We have E[X_1] = G′(1) = 1/3 + 1 = 4/3 and
var(X_1) = G″(1) + G′(1) − (G′(1))² = 1 + \frac{4}{3} − \frac{16}{9} = \frac{5}{9}.
2. Just as in our derivation of the probability generating function for the binomial distribution, G_Y(s) = (G(s))³, and so
G_Y(s) = \left(\frac{1}{6} + \frac{s}{3} + \frac{s²}{2}\right)³ = \frac{1}{216}\left(1 + 6s + 21s² + 44s³ + 63s⁴ + 54s⁵ + 27s⁶\right).
P(Y = 3) is the coefficient of s³ in G_Y(s), that is \frac{44}{216} = \frac{11}{54}. (As an exercise, calculate P(Y = 3) directly.)
3. We have
G_{3X_1}(s) = E[s^{3X_1}] = E[(s³)^{X_1}] = G_{X_1}(s³) = \frac{1}{6} + \frac{s³}{3} + \frac{s⁶}{2}.
This is different from G_Y(s) because 3X_1 and Y have different distributions – knowing X_1 does not tell you Y, but it does tell you 3X_1. Finally, P(3X_1 = 3) = P(X_1 = 1) = 1/3.
Of course, for each fixed s ∈ R, s^X is itself a discrete random variable. So we can use the law of total probability when calculating its expectation.
Example 4.7. Suppose that there are n red balls, n white balls and 1 blue ball in an urn. A ball
is selected at random and then replaced. Let X be the number of red balls selected before a blue
ball is chosen. Find
ball is chosen. Find
(a) the p.g.f. of X,
(b) E[X],
(c) var(X).
Solution. (a) We will use the law of total probability for expectations. Let R be the event that
the first ball is red, W be the event that the first ball is white and B be the event that the first
ball is blue. Then
Of course, the value of X is affected by the first ball which is picked. If the first ball is blue then
we know that X = 0. If the first ball is white, we learn nothing about the value of X. If the first
ball is red then effectively we start over again counting numbers of red balls, but we add 1 for the red ball we have already seen. This yields
G_X(s) = E[s^{1+X}] P(R) + E[s^X] P(W) + E[s^0] P(B)
= \frac{n}{2n + 1} s G_X(s) + \frac{n}{2n + 1} G_X(s) + \frac{1}{2n + 1},
and so
G_X(s) = \frac{1}{n + 1 − ns} = \frac{1/(n + 1)}{1 − (1 − 1/(n + 1))s}.
(b) Differentiating, we get
G′_X(s) = \frac{n}{(n + 1 − ns)²}
and so
E[X] = G′_X(1) = n.
(c) Recall that
var(X) = G″_X(1) + G′_X(1) − (G′_X(1))².
Differentiating the p.g.f. again we get
G″_X(s) = \frac{2n²}{(n + 1 − ns)³}
and so G″_X(1) = 2n². Hence,
var(X) = 2n² + n − n² = n(n + 1).
If we were just asked for E[X] it would be easier to calculate
E[X] = E[X|R] P(R) + E[X|W] P(W) + E[X|B] P(B)
= (1 + E[X]) \frac{n}{2n + 1} + E[X] \frac{n}{2n + 1} + 0 · \frac{1}{2n + 1},
which rearranges to give E[X] = n. In order to calculate var(X), however, we need both E[X] and E[X²], and so it's easier just to work with the p.g.f.
Theorem 4.8. Let X_1, X_2, . . . be independent, identically distributed non-negative integer-valued random variables with common p.g.f. G_X, and let N be another non-negative integer-valued random variable, independent of the X_i, with p.g.f. G_N. Then
E[s^{\sum_{i=1}^{N} X_i}] = G_N(G_X(s)).
Corollary 4.9. Suppose that X_1, X_2, . . . are independent and identically distributed Ber(p) random variables and that N ∼ Po(λ), independently of X_1, X_2, . . .. Then \sum_{i=1}^{N} X_i ∼ Po(λp).
(Notice that we saw this result in disguise via a totally different method in a problem sheet question.)
Proof. We have G_X(s) = 1 − p + ps and G_N(s) = \exp(λ(s − 1)), and so by Theorem 4.8,
E[s^{\sum_{i=1}^{N} X_i}] = G_N(G_X(s)) = \exp(λ(1 − p + ps − 1)) = \exp(λp(s − 1)).
Since this is the p.g.f. of Po(λp) and p.g.f.'s uniquely determine distributions, the result follows.
Example 4.10. In a short fixed time period, a photomultiplier detects 0, 1 or 2 photons with
probabilities 12 , 13 and 61 respectively. The photons detected by the photomultiplier cause it to give
off a charge of 2, 3, 4 or 5 electrons (with equal probability) independently for every one photon
originally detected. What is the probability generating function of the number of electrons given
off in the time period? What is the probability that exactly five electrons are given off in that
period?
Solution. Let N be the number of photons detected. Then the probability generating function of N is
G_N(s) = \frac{1}{2} + \frac{1}{3} s + \frac{1}{6} s².
Let X_i be the number of electrons given off by the ith photon detected. Then Y = X_1 + · · · + X_N is the total number given off in the period (remember that N here is random). Now
G_X(s) = \frac{1}{4}(s² + s³ + s⁴ + s⁵),
and so, by Theorem 4.8,
G_Y(s) = G_N(G_X(s)) = \frac{1}{2} + \frac{1}{3} · \frac{1}{4}(s² + s³ + s⁴ + s⁵) + \frac{1}{6} · \frac{1}{16}(s² + s³ + s⁴ + s⁵)².
The probability that exactly five electrons are given off is the coefficient of s⁵, namely \frac{1}{12} + \frac{2}{96} = \frac{5}{48}.
We start at the top of the tree, with a single individual in generation 0. Then there are 3
individuals in generations 1 and 2, 5 individuals in generation 3, a single individual in generation
4 and no individuals in subsequent generations.
Let X_n be the size of the population in generation n, so that X_0 = 1. Let C_i^{(n)} be the number of children of the ith individual in generation n ≥ 0, so that we may write
X_{n+1} = C_1^{(n)} + C_2^{(n)} + · · · + C_{X_n}^{(n)}.
(We interpret this sum as 0 if X_n = 0.) Note that C_1^{(n)}, C_2^{(n)}, . . . are independent and identically distributed. Let G(s) = \sum_{i=0}^{∞} p(i) s^i and let G_n(s) = E[s^{X_n}].
Theorem 4.11. For n ≥ 0,
G_{n+1}(s) = G_n(G(s)) = \underbrace{G(G(. . . G(s) . . .))}_{n+1 \text{ times}} = G(G_n(s)).
Proof. Since X_0 = 1, we have G_0(s) = s. Also, X_1 = C_1^{(0)}, which has p.m.f. p(i), i ≥ 0, so G_1(s) = E[s^{X_1}] = G(s). Since
X_{n+1} = \sum_{i=1}^{X_n} C_i^{(n)},
by Theorem 4.8 we get
G_{n+1}(s) = E[s^{X_{n+1}}] = E[s^{\sum_{i=1}^{X_n} C_i^{(n)}}] = G_n(G(s)).
Corollary 4.12. Suppose that the mean number of children of a single individual is µ, i.e. \sum_{i=1}^{∞} i p(i) = µ. Then
E[X_n] = µ^n.
Proof. We have E[X_n] = G′_n(1). By the chain rule,
G′_n(s) = \frac{d}{ds} G(G_{n−1}(s)) = G′_{n−1}(s) G′(G_{n−1}(s)).
Plugging in s = 1 (and using G_{n−1}(1) = 1), we get E[X_n] = E[X_{n−1}] G′(1) = E[X_{n−1}] µ and, inductively, E[X_n] = µ^n.
In particular, notice that we get exponential growth on average if µ > 1 and exponential decrease if µ < 1. This raises an interesting question: can the population die out? If p(0) = 0 then every individual has at least one child and so the population clearly grows forever. If p(0) > 0, on the other hand, then the population dies out with positive probability because
P(the population dies out) ≥ P(X_1 = 0) = p(0) > 0.
(Notice that this holds even in the cases where E[X_n] grows as n → ∞!)
Example 4.13. Suppose that p(i) = (1/2)i+1 , i ≥ 0, so that each individual has a geometric
number of offspring. What is the distribution of Xn ?
Solution. First calculate
G(s) = \sum_{k=0}^{∞} s^k \left(\frac{1}{2}\right)^{k+1} = \frac{1}{2 − s}.
By plugging this into itself a couple of times, we get
G_2(s) = \frac{2 − s}{3 − 2s}, G_3(s) = \frac{3 − 2s}{4 − 3s}.
A natural guess is that G_n(s) = \frac{n − (n − 1)s}{(n + 1) − ns}, which is, in fact, the case, as can be proved by induction. If we want the probability mass function of X_n, we need to expand this quantity out in powers of s. We have
\frac{1}{(n + 1) − ns} = \frac{1}{n + 1} · \frac{1}{1 − ns/(n + 1)} = \sum_{k=0}^{∞} \frac{n^k s^k}{(n + 1)^{k+1}}.
In particular, P(X_n = 0) = G_n(0) = n/(n + 1). Notice that P(X_n = 0) → 1 as n → ∞, which indicates that the population dies out eventually in this case.
Let q denote the probability that the population eventually dies out. Now remember that each of the k individuals in the first generation behaves exactly like the parent. In particular, we can think of each of them starting its own family, which is an independent copy of the original family. Moreover, the whole population dies out if and only if all of these sub-populations die out. If we have k families, this occurs with probability q^k. So
q = \sum_{k=0}^{∞} q^k p(k) = G(q). (4.1)
The equation q = G(q) doesn’t quite enable us to determine q: notice that 1 is always a solution,
but it’s not necessarily the only solution in [0, 1].
Using Proposition A.8 about increasing sequences of events (see Appendix), we have
q = P(\bigcup_n {X_n = 0}) = \lim_{n→∞} P(X_n = 0) = \lim_{n→∞} G_n(0). (4.2)
Theorem. The extinction probability q is the smallest non-negative solution of the equation
x = G(x). (4.3)
Proof. From (4.1) we know that q solves (4.3). Suppose some r ≥ 0 also solves (4.3). We claim that in that case, G_n(0) ≤ r for all n ≥ 0. In that case we are done, since then also q = \lim_{n→∞} G_n(0) ≤ r, and so indeed q is smaller than any other solution of (4.3).
We use induction to prove the claim that G_n(0) ≤ r for all n. For the base case n = 0, we have G_0(0) = 0 ≤ r as required.
For the induction step, suppose that G_{n−1}(0) ≤ r. Now notice that the generating function G(s) = \sum_{k=0}^{∞} p(k) s^k is a non-decreasing function for s ≥ 0. Hence
G_n(0) = G(G_{n−1}(0)) ≤ G(r) = r,
which completes the induction.
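The proof suggests a practical recipe: iterate G starting from 0, and the iterates G_n(0) increase to q. A minimal Python sketch, for the illustrative offspring p.m.f. p(0) = 1/4, p(1) = 1/4, p(2) = 1/2 (not from the notes):

```python
# Sketch: extinction probability as lim G_n(0), by iterating G from 0.
def G(s: float) -> float:
    # p.g.f. of the offspring p.m.f. p(0)=1/4, p(1)=1/4, p(2)=1/2
    return 0.25 + 0.25 * s + 0.5 * s**2

q = 0.0
for _ in range(100):
    q = G(q)          # q_n = G_n(0) is non-decreasing and converges to q
print(q)              # ≈ 0.5, the smallest root of s = G(s); here µ = 1.25 > 1
```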
It turns out that the question of whether the branching process inevitably dies out is deter-
mined by the mean number of children of a single individual. To avoid a trivial case, we assume
in the next result that p(1) 6= 1. (If p(1) = 1 then Xn = 1 with probability 1, for all n.) Then we
find that there is a positive probability of survival of the process for ever if and only if µ > 1.
Proof. Note first that there's a quick argument for the case where µ is strictly less than 1. Note that as X_n takes non-negative integer values,
P(X_n > 0) = P(X_n ≥ 1) ≤ E[X_n] = µ^n → 0 as n → ∞,
so that q = \lim_{n→∞} P(X_n = 0) = 1.
(1) Suppose µ > 1. Since the gradient of the curve y = G(x) is more than 1 at x = 1, and
the curve starts on the non-negative y-axis at x = 0, it must cross the line y = x at some
x ∈ [0, 1). Hence indeed the smallest non-negative fixed point q of G is less than 1. This is
illustrated on the left side of Figure 4.1.
[Figure 4.1 appears here: two plots of y = G(x) against y = x on [0, 1], each curve starting at height p_0.]
Figure 4.1: On the left, the case µ > 1; on the right, the case µ ≤ 1.
(2) Suppose µ ≤ 1. The gradient at 1 is at most 1, and in fact the gradient is strictly less than 1 for all x ∈ [0, 1). (We excluded the case p(1) = 1, for which the gradient is 1 everywhere.) Now the function y = G(x) must stay above the line y = x throughout [0, 1). So the smallest non-negative fixed point q of G is 1. This is illustrated on the right side of Figure 4.1.
Chapter 5
Even for discrete quantities, it is often useful to think instead in terms of continuous ap-
proximations. For example, suppose you wish to consider the number of working adults
who regularly contribute to charity. You might model this number as X ∈ {0, 1, . . . , n},
where n is the total number of working adults in the UK. We could, in theory, model this as a Bin(n, p) random variable where p = P(adult contributes). But n is measured in millions. So instead model Y ≈ X/n as a continuous random variable taking values in [0, 1] and giving the proportion of adults who contribute.
To give a concrete example of a random variable which is not discrete, imagine you have a
board game spinner. You spin the arrow and it lands pointing at an angle somewhere between
0 and 2π in such a way that every angle is equally likely; we want to model this angle as a
random variable X. How can we describe its distribution? We can’t assign a positive probability
to each angle – our probabilities wouldn’t sum to 1. To get around this, we don’t define the
probability of individual sample points, but only of certain natural events. For example, by
symmetry we expect that P(X ≤ π) = 1/2. More generally, we expect the probability that X lies in an interval [a, b] ⊆ [0, 2π) to be proportional to the length of that interval:
P(X ∈ [a, b]) = \frac{b − a}{2π}, 0 ≤ a < b < 2π.
Definition 5.1. A random variable X defined on a probability space (Ω, F, P) is a function
X : Ω → R such that {ω : X(ω) ≤ x} ∈ F for each x ∈ R.
Let's just check that this includes our earlier definition. If X is a discrete random variable then
{ω : X(ω) ≤ x} = \bigcup_{y≤x,\, y∈ImX} {ω : X(ω) = y}.
Since ImX is countable, this is a countable union of events in F and, therefore, itself belongs to F.
Of course, {ω : X(ω) ≤ x} ∈ F means precisely that we can assign a probability to this event.
The collection of these probabilities as x varies in R will play a central part in what follows.
Definition 5.2. The cumulative distribution function (c.d.f.) of a random variable X is the
function FX : R → [0, 1] defined by
FX (x) = P (X ≤ x) .
Example 5.3. Let X be the number of heads obtained in three tosses of a fair coin. Then
P(X = 0) = 1/8, P(X = 1) = P(X = 2) = 3/8 and P(X = 3) = 1/8. So
FX(x) = 0                       if x < 0
        1/8                     if 0 ≤ x < 1
        1/8 + 3/8 = 1/2         if 1 ≤ x < 2
        1/8 + 3/8 + 3/8 = 7/8   if 2 ≤ x < 3
        1                       if x ≥ 3.
(The graph of FX is a step function with jumps at x = 0, 1, 2, 3.)
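For those who like to check such calculations by computer, here is a minimal Python sketch (assuming Python 3.8+ for math.comb) tabulating this p.m.f. and c.d.f.:

    from math import comb

    # X ~ Bin(3, 1/2): number of heads in three tosses of a fair coin.
    n, p = 3, 0.5
    pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

    cum = 0.0
    for k in range(n + 1):
        cum += pmf[k]
        print(f"P(X = {k}) = {pmf[k]:.3f},  F_X({k}) = {cum:.3f}")
    # F_X takes the values 1/8, 1/2, 7/8, 1 at k = 0, 1, 2, 3.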
Example 5.4. Let X be the angle of the board game spinner. Then
FX(x) = 0        if x < 0,
        x/(2π)   if 0 ≤ x < 2π,
        1        if x ≥ 2π.
We can immediately write down some properties of the c.d.f. FX corresponding to a general random variable X.

Theorem 5.5. Let X be a random variable with c.d.f. FX. Then
1. FX is non-decreasing;
2. P(a < X ≤ b) = FX(b) − FX(a) for all a < b;
3. as x → −∞, FX(x) → 0;
4. as x → ∞, FX(x) → 1.

Proof. 1. If a ≤ b then {X ≤ a} ⊆ {X ≤ b}, so
FX(a) = P(X ≤ a) ≤ P(X ≤ b) = FX(b).
2. {X ≤ b} is the disjoint union of {X ≤ a} and {a < X ≤ b}, so P(a < X ≤ b) = FX(b) − FX(a).
3 & 4. (sketch) Intuitively, we want to put “FX (−∞) = P (X ≤ −∞)” and then, since X can’t
possibly be −∞ (or less!), the only sensible interpretation we could give the right-hand side would
be 0. Likewise, we would like to put “FX (∞) = P (X ≤ ∞)” and, since X cannot be larger than
∞, the only sensible interpretation we could give the right-hand side would be 1. The problem
is that ∞ and −∞ aren’t real numbers, but FX is a function on R. The only sensible way to
deal with this problem is by taking limits and to do this carefully involves using the countable
additivity axiom P3 in a somewhat intricate way.
Conversely, any function F satisfying conditions 1, 3 and 4 of Theorem 5.5 plus right-continuity
is the cumulative distribution function of some random variable defined on some probability space,
although we will not prove this fact.
As you can see from the coin-tossing example, FX need not be a smooth function. Indeed,
for a discrete random variable, FX is always a step function. However, in the rest of this chapter,
we’re going to concentrate on the case where FX is very smooth in that it has a derivative (except
possibly at a collection of isolated points).
Definition 5.6. A continuous random variable X is a random variable whose c.d.f. satisfies
FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(u) du
for some non-negative function fX. The function fX is called the probability density function (p.d.f.) of X or, sometimes, just its density.
Remark 5.7. The definition of a continuous random variable leaves implicit which functions fX
might possibly serve as a probability density function. Part of this is a more fundamental question
concerning which functions we are allowed to integrate (and for some of you, that will be resolved
in the Analysis III course in Trinity Term and in Part A Integration). For the purposes of this
course, you may assume that fX is a function which has at most countably many jumps and is
smooth everywhere else. Indeed, in almost all of the examples we will consider, fX will have 0, 1
or 2 jumps.
Remark 5.8. The Fundamental Theorem of Calculus (which some of you will see proved in Analysis III) tells us that FX of the form given in the definition is differentiable with
dFX(x)/dx = fX(x)
at any point x where fX is continuous.
Example 5.9. Consider
f(x) = 0       for x < 0
       e^{−x}  for x ≥ 0.
Then
∫_{−∞}^{x} f(u) du = 0                                   if x < 0
                     ∫_{0}^{x} e^{−u} du = 1 − e^{−x}    if x ≥ 0,
and so X is a continuous random variable with density fX(x) = f(x). Notice that fX(0) = 1 and so fX has a jump at x = 0. On the other hand, FX is continuous at 0, but it isn't differentiable there. To see this, if we approach 0 from the right, FX has gradient tending to 1; if we approach 0 from the left, FX has gradient 0 and, since these don't agree, there isn't a well-defined derivative. On the other hand, everywhere apart from 0 we do have F′X(x) = fX(x).
Example 5.10. Suppose that a continuous random variable X has p.d.f.
fX(x) = cx²(1 − x)   for x ∈ [0, 1]
        0            otherwise.
For fX to be a density we need ∫_{0}^{1} cx²(1 − x) dx = c/12 = 1, so c = 12. Since
∫_{0}^{x} 12u²(1 − u) du = 12 (x³/3 − x⁴/4) = 4x³ − 3x⁴,
we get
FX(x) = 0            for x < 0
        4x³ − 3x⁴    for 0 ≤ x < 1
        1            for x ≥ 1.
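The normalising constant can also be checked numerically; the following crude Riemann-sum sketch (the number of steps is an arbitrary choice) is one way:

    # Left Riemann sum checking that c = 12 makes f_X integrate to 1 on [0, 1].
    n = 10_000
    h = 1 / n
    total = sum(12 * (i * h) ** 2 * (1 - i * h) for i in range(n)) * h
    print(total)  # ~ 1.0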
Example 5.11. The duration in minutes of mobile phone calls made by students is modelled by a random variable, X, with p.d.f.
fX(x) = (1/6) e^{−x/6}   if x ≥ 0
        0                otherwise.
Find (i) P(3 < X ≤ 6) and (ii) P(X > 6).
Solution. (i)
P(3 < X ≤ 6) = ∫_{3}^{6} fX(x) dx = ∫_{3}^{6} (1/6) e^{−x/6} dx = e^{−1/2} − e^{−1}.
(ii)
P(X > 6) = ∫_{6}^{∞} fX(x) dx = ∫_{6}^{∞} (1/6) e^{−x/6} dx = e^{−1}.
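A Monte Carlo sketch can corroborate these values; note that random.expovariate takes the rate λ = 1/6 as its argument, and the seed and sample size below are arbitrary choices:

    import random
    from math import exp

    random.seed(1)
    N = 10**6
    samples = [random.expovariate(1 / 6) for _ in range(N)]

    est1 = sum(1 for x in samples if 3 < x <= 6) / N
    est2 = sum(1 for x in samples if x > 6) / N
    print(est1, exp(-0.5) - exp(-1))   # both ~ 0.2387
    print(est2, exp(-1))               # both ~ 0.3679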
We often use the p.d.f. of a continuous random variable analogously to the way we used the
p.m.f. of a discrete random variable. There are several similarities between the two:
Probability density function (continuous)        Probability mass function (discrete)
fX(x) ≥ 0 for all x ∈ R                          pX(x) ≥ 0 for all x ∈ R
∫_{−∞}^{∞} fX(x) dx = 1                          Σ_{x ∈ Im X} pX(x) = 1
FX(x) = ∫_{−∞}^{x} fX(u) du                      FX(x) = Σ_{u ∈ Im X : u ≤ x} pX(u)
However, the analogy can be misleading. For example, there's nothing to prevent fX(x) exceeding 1. On the other hand, for small ε > 0,
P(x ≤ X ≤ x + ε) = ∫_{x}^{x+ε} fX(u) du ≈ ε fX(x).
So ε fX(x) is approximately the probability that X falls between x and x + ε (or, indeed, between x − ε and x). What happens as ε → 0? We get
P(X = x) = 0 for all x ∈ R
and
P(a ≤ X ≤ b) = ∫_{a}^{b} fX(x) dx for all a ≤ b.
So for a continuous r.v. X, the probability of getting any fixed value x is 0! Why doesn’t this
break our theory of probability? We have
{ω : X(ω) ≤ x} = ⋃_{y ≤ x} {ω : X(ω) = y}
and the right-hand side is an uncountable union of disjoint events of probability 0. If the union
were countable, this would entail that the left-hand side had probability 0 also, which wouldn’t
make much sense. But because the union is uncountable, we cannot expect to “sum up” these
zeros in order to get the probability of the left-hand side. The right way to resolve this problem
is using a probability density function.
Remark 5.13. There do exist random variables which are neither discrete nor continuous. To
give a slightly artificial example, suppose that we flip a fair coin. If it comes up heads, sample U
uniformly from [0, 1] and set X to be the value obtained; if it comes up tails, let X = 1/2. Then
X can take uncountably many values but does not have a density. Indeed, as you can check,
P(X ≤ x) = x/2         if 0 ≤ x < 1/2
           (x + 1)/2   if 1/2 ≤ x ≤ 1,
and there does not exist a function fX which integrates to give this.
The theory is particularly nice in the discrete and continuous cases because we can work
with probability mass functions and probability density functions respectively. But the cumulative
distribution function is a more general concept which makes sense for all random variables.
1. The uniform distribution. X has the uniform distribution on an interval [a, b] if it has
p.d.f.
fX(x) = 1/(b − a)   for a ≤ x ≤ b,
        0           otherwise.
We write X ∼ U[a, b].
2. The exponential distribution. X has the exponential distribution with parameter λ > 0
if it has p.d.f.
fX(x) = λ e^{−λx}, x ≥ 0.
We write X ∼ Exp(λ). The exponential distribution is often used to model lifetimes or
the time elapsing between unpredictable events (such as telephone calls, arrivals of buses,
earthquakes, emissions of radioactive particles, etc).
3. The gamma distribution. X has the gamma distribution with parameters α > 0 and λ > 0 if it has p.d.f.
fX(x) = (λ^α / Γ(α)) x^{α−1} e^{−λx}, x ≥ 0.
Here, Γ(α) is the so-called gamma function, which is defined by
Γ(α) = ∫_{0}^{∞} u^{α−1} e^{−u} du
for α > 0. For most values of α this integral does not have a closed form. However, for a
strictly positive integer n, we have Γ(n) = (n − 1)!. (See the Wikipedia “Gamma function”
page for lots more information about this fascinating function!)
If X has the above p.d.f. we write X ∼ Gamma(α, λ). The gamma distribution is a
generalisation of the exponential distribution and possesses many nice properties. The Chi-
squared distribution with d degrees of freedom, χ²_d, which you may have seen at ‘A’ Level, is
the same as Gamma(d/2, 1/2) for d ∈ N.
4. The normal (or Gaussian) distribution. X has the normal distribution with parameters µ ∈ R and σ² > 0 if it has p.d.f.
fX(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), x ∈ R.
We write X ∼ N(µ, σ 2 ). The standard normal distribution is N(0, 1). The normal distri-
bution is used to model all sorts of characteristics of large populations and samples. Its
fundamental importance across Probability and Statistics is a consequence of the Central
Limit Theorem, which you will use in Prelims Statistics and see proved in Part A Proba-
bility.
Check that for each of these fX really is a p.d.f. (i.e. that it is non-negative and integrates
to 1).
Example 5.15. We check this for the standard normal distribution, where the computation is not obvious. Let
I = ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx.
It follows that
I² = (∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx) (∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} dy)
   = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (1/(2π)) exp(−(x² + y²)/2) dx dy.
Now convert to polar co-ordinates: let r and θ be such that x = r cos θ and y = r sin θ. Then the Jacobian is |J| = r and so we get
I² = ∫_{0}^{2π} ∫_{0}^{∞} (r/(2π)) exp(−r²/2) dr dθ = [−e^{−r²/2}]_{0}^{∞} = 1.
Since I is clearly non-negative (it's the integral of a non-negative function), we must have I = 1. (The general N(µ, σ²) case then follows by the substitution u = (x − µ)/σ.)
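A numerical check is reassuring here; the sketch below approximates the integral of the standard normal density by a Riemann sum, with the truncation point and step size as arbitrary choices:

    from math import exp, pi, sqrt

    # Riemann sum of exp(-x^2/2) over [-10, 10]; the tails beyond are negligible.
    h = 0.001
    total = sum(exp(-(k * h) ** 2 / 2) for k in range(-10_000, 10_000)) * h
    print(total / sqrt(2 * pi))  # ~ 1.0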
5.3 Expectation
Recall that for a discrete r.v. we defined
E[X] = Σ_{x ∈ Im X} x pX(x)    (5.1)
whenever the sum is absolutely convergent and, more generally, for any function h : R → R, we proved that
E[h(X)] = Σ_{x ∈ Im X} h(x) pX(x)    (5.2)
whenever this sum is absolutely convergent. We want to make an analogous definition for con-
tinuous random variables. Suppose X has a smooth p.d.f. fX . Then for any x and small δ > 0,
P (x ≤ X ≤ x + δ) ≈ fX (x)δ
and, in particular,
P (nδ ≤ X ≤ (n + 1)δ) ≈ fX (nδ)δ.
So for the expectation, we want something like
Σ_{n=−∞}^{∞} (nδ) fX(nδ) δ,
which, as δ → 0, is a Riemann sum approximating ∫_{−∞}^{∞} x fX(x) dx. This motivates the following definition.
Definition 5.16. Let X be a continuous random variable with probability density function fX .
The expectation or mean of X is defined to be
E[X] = ∫_{−∞}^{∞} x fX(x) dx    (5.3)
whenever ∫_{−∞}^{∞} |x| fX(x) dx < ∞. Otherwise, we say that the mean is undefined (or, as in the discrete case, if only the positive tail diverges, we might say that E[X] = ∞).
Theorem 5.17. Let X be a continuous random variable with probability density function fX ,
and let h be a function from R to R. Then
E[h(X)] = ∫_{−∞}^{∞} h(x) fX(x) dx    (5.4)
(whenever ∫_{−∞}^{∞} |h(x)| fX(x) dx < ∞).
Notice that (5.4) is analogous to (5.2) in the same way that (5.3) is analogous to (5.1). Proving
Theorem 5.17 in full generality, for any function h, is rather technical. Here we just give an idea
of one approach to the proof for a particular class of functions.
So now suppose h is such that h(X) is a non-negative continuous random variable. Then, using the tail formula E[Y] = ∫_{0}^{∞} P(Y > y) dy, valid for non-negative random variables Y, we get
E[h(X)] = ∫_{y=0}^{∞} P(h(X) > y) dy
        = ∫_{y=0}^{∞} (∫_{x : h(x) > y} fX(x) dx) dy
        = ∫_{x=−∞}^{∞} fX(x) (∫_{y : y < h(x)} dy) dx
        = ∫_{x=−∞}^{∞} fX(x) h(x) dx,
whenever the right-hand side is defined. (Interchanging the order of integration here can be justified because the integrand is non-negative.)

The variance of a continuous random variable is defined, exactly as in the discrete case, by var(X) = E[(X − E[X])²], which we may compute via Theorem 5.17 with h(x) = (x − µ)². For simplicity of notation, write µ = E[X]. Then we have
var(X) = ∫_{−∞}^{∞} (x − µ)² fX(x) dx
       = ∫_{−∞}^{∞} (x² − 2xµ + µ²) fX(x) dx
       = ∫_{−∞}^{∞} x² fX(x) dx − 2µ ∫_{−∞}^{∞} x fX(x) dx + µ² ∫_{−∞}^{∞} fX(x) dx
       = E[X²] − µ²,
since ∫_{−∞}^{∞} x fX(x) dx = µ and ∫_{−∞}^{∞} fX(x) dx = 1. So we recover the expression var(X) = E[X²] − (E[X])², familiar from the discrete case.

Example 5.19. Let X ∼ N(µ, σ²). Then X has c.d.f. FX(x) = Φ((x − µ)/σ), where Φ is the standard normal c.d.f., and
E[X] = µ,  var(X) = σ².
Solution. First suppose that µ = 0 and σ² = 1. Then the first two assertions are trivial and
E[X] = ∫_{−∞}^{∞} (x/√(2π)) e^{−x²/2} dx,
which must equal 0 since the integrand is an odd function. Since the mean is 0,
var(X) = E[X²] = ∫_{−∞}^{∞} (x²/√(2π)) e^{−x²/2} dx = ∫_{−∞}^{∞} x · (x e^{−x²/2}/√(2π)) dx.
Integrating by parts (and taking limits of the bounds to ±∞), we get that this equals
[−x e^{−x²/2}/√(2π)]_{−∞}^{∞} + ∫_{−∞}^{∞} (1/√(2π)) e^{−x²/2} dx = 0 + 1 = 1.
So var(X) = 1.
Suppose now that Z ∼ N(0, 1). Then
P(µ + σZ ≤ x) = P(Z ≤ (x − µ)/σ) = Φ((x − µ)/σ).
Let φ(x) = (1/√(2π)) e^{−x²/2}, the standard normal density. Differentiating P(µ + σZ ≤ x) in x, we get
(1/σ) φ((x − µ)/σ) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
So µ + σZ ∼ N(µ, σ²). Finally,
E[X] = E[µ + σZ] = µ + σ E[Z] = µ
and
var(X) = var(µ + σZ) = σ² var(Z) = σ².
Exercise 5.20. Show that if X ∼ U[a, b] and Y ∼ Exp(λ) then
E[X] = (a + b)/2,  var(X) = (b − a)²/12,  E[Y] = 1/λ,  var(Y) = 1/λ².
Notice, in particular, that the parameter of the Exponential distribution is the reciprocal of its
mean.
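A simulation sketch can be used to check the answers to Exercise 5.20; the parameter values a = 2, b = 5, λ = 3 and the sample size below are arbitrary choices:

    import random

    random.seed(2)
    N = 10**6
    a, b, lam = 2.0, 5.0, 3.0

    xs = [random.uniform(a, b) for _ in range(N)]
    ys = [random.expovariate(lam) for _ in range(N)]

    def mean_var(zs):
        m = sum(zs) / len(zs)
        return m, sum((z - m)**2 for z in zs) / len(zs)

    print(mean_var(xs), ((a + b) / 2, (b - a)**2 / 12))  # ~ (3.5, 0.75)
    print(mean_var(ys), (1 / lam, 1 / lam**2))           # ~ (0.333, 0.111)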
Example 5.21. Suppose that X ∼ Gamma(2, 2), so that it has p.d.f.
fX(x) = 4x e^{−2x}   for x ≥ 0,
        0            otherwise.
Find E[X] and E[1/X].

Solution. We have
E[X] = ∫_{0}^{∞} x · 4x e^{−2x} dx = ∫_{0}^{∞} (2³/2!) x^{3−1} e^{−2x} dx
and, since Γ(3) = 2!, we recognise the integrand as the density of a Gamma(3, 2) random variable. So it must integrate to 1 and we get E[X] = 1.
On the other hand,
E[1/X] = ∫_{0}^{∞} (1/x) · 4x e^{−2x} dx = 2 ∫_{0}^{∞} 2 e^{−2x} dx
and again we recognise the integrand as the density of an Exp(2) random variable which must integrate to 1. So we get E[1/X] = 2.

WARNING: IN GENERAL, E[1/X] ≠ 1/E[X].
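This warning is easy to illustrate by simulation. Note that Python's random.gammavariate is parametrised by shape and scale, so rate λ = 2 corresponds to scale 1/2 (the seed and sample size are arbitrary choices):

    import random

    random.seed(3)
    N = 10**6
    xs = [random.gammavariate(2, 0.5) for _ in range(N)]  # Gamma(2, 2)
    print(sum(xs) / N)                 # ~ 1.0  (= E[X])
    print(sum(1 / x for x in xs) / N)  # ~ 2.0  (= E[1/X], not 1/E[X])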
5.4 Examples of functions of continuous random variables
Example 5.22. Imagine a forest. Suppose that R is the distance from a tree to the nearest neighbouring tree. Suppose that R has p.d.f.
fR(r) = r e^{−r²/2}   for r ≥ 0,
        0             otherwise.
Find the distribution of the area of the tree-free circle around the original tree.

Solution. Let A be the area of the tree-free circle; then A = πR². We begin by finding the c.d.f. of R and then use it to find the c.d.f. of A. FR(r) is clearly 0 for r < 0. For r ≥ 0,
FR(r) = P(R ≤ r) = ∫_{0}^{r} s e^{−s²/2} ds = [−e^{−s²/2}]_{0}^{r} = 1 − e^{−r²/2}.
Hence, for a ≥ 0,
FA(a) = P(πR² ≤ a) = P(R ≤ √(a/π)) = 1 − e^{−a/(2π)},
and FA(a) = 0 for a < 0. So A ∼ Exp(1/(2π)).
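A simulation sketch of Example 5.22: R can be sampled by inverse transform, since FR(r) = 1 − e^{−r²/2} inverts to r = √(−2 ln(1 − u)); the test point a = 5 is an arbitrary choice:

    import random
    from math import pi, log, exp

    random.seed(4)
    N = 10**6
    # R^2 = -2 log(1 - U) by inverse transform, so A = pi * R^2 directly.
    areas = [pi * (-2 * log(1 - random.random())) for _ in range(N)]

    a = 5.0
    print(sum(1 for x in areas if x <= a) / N)  # empirical P(A <= a)
    print(1 - exp(-a / (2 * pi)))               # ~ 0.549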
Remark 5.23. The distribution of R in Example 5.22 is called the Rayleigh distribution. One
way in which this distribution occurs is as follows. Pick a point in R2 such that the x and y
co-ordinates are independent N(0, 1) random variables. Then the Euclidean distance of that point
from the origin (0, 0) has the Rayleigh distribution (see Part A Probability for a proof of this fact;
there is a connection to Example 5.15).
We can generalise the idea in Example 5.22 to prove the following theorem.
Theorem 5.24. Suppose that X is a continuous random variable with density fX and that h : R → R is a differentiable function with dh(x)/dx > 0 for all x, so that h is strictly increasing. Then Y = h(X) is a continuous random variable with p.d.f.
fY(y) = fX(h^{−1}(y)) · (d/dy) h^{−1}(y)
for y in the range of h, and fY(y) = 0 otherwise.

Proof. Since h is strictly increasing, h(X) ≤ y if and only if X ≤ h^{−1}(y). So the c.d.f. of Y is
FY(y) = P(h(X) ≤ y) = P(X ≤ h^{−1}(y)) = FX(h^{−1}(y)),
and differentiating using the chain rule gives
F′Y(y) = fX(h^{−1}(y)) · (d/dy) h^{−1}(y),
as claimed.
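As an illustration of Theorem 5.24, take X ∼ N(0, 1) and the strictly increasing map h(x) = e^x; then Y = h(X) should have c.d.f. Φ(ln y) for y > 0. A Monte Carlo sketch (the test point y₀ = 2 is an arbitrary choice):

    import random
    from math import erf, exp, log, sqrt

    random.seed(5)
    N = 10**6
    ys = [exp(random.gauss(0, 1)) for _ in range(N)]  # Y = e^X

    def Phi(x):
        # standard normal c.d.f. via the error function
        return 0.5 * (1 + erf(x / sqrt(2)))

    y0 = 2.0
    print(sum(1 for y in ys if y <= y0) / N)  # empirical P(Y <= y0)
    print(Phi(log(y0)))                       # ~ 0.756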
Example 5.25. Suppose that a point is chosen uniformly from the perimeter of the unit circle.
What is the distribution of its x-co-ordinate?
Solution. Represent the chosen point by its angle, Θ. So then Θ has a uniform distribution on [0, 2π), with p.d.f.
fΘ(θ) = 1/(2π)   for 0 ≤ θ < 2π
        0        otherwise.
Moreover, the x-co-ordinate is X = cos Θ, which takes values in [−1, 1]. We again work via
c.d.f.’s:
FΘ(θ) = 0        for θ < 0
        θ/(2π)   for 0 ≤ θ < 2π
        1        for θ ≥ 2π.
Notice that there are two angles in [0, 2π) corresponding to each x-co-ordinate in (−1, 1), namely arccos x and 2π − arccos x.
Then FX(x) = 0 for x ≤ −1, FX(x) = 1 for x ≥ 1 and, for x ∈ (−1, 1), we can express the c.d.f. in terms of arccos : [−1, 1] → [0, π] as
FX(x) = P(cos Θ ≤ x)
      = P(arccos x ≤ Θ ≤ 2π − arccos x)
      = FΘ(2π − arccos x) − FΘ(arccos x)
      = 1 − arccos x/(2π) − arccos x/(2π)
      = 1 − (1/π) arccos x.
This completely determines the distribution of X, but we might also be interested in the p.d.f.
Differentiating FX, we get
dFX(x)/dx = (1/π) · 1/√(1 − x²)   for −1 < x < 1
            0                     for x < −1 or x > 1
            undefined             for x = −1 or x = 1.
So we can take
fX(x) = 1/(π√(1 − x²))   for −1 < x < 1
        0                for x ≤ −1 or x ≥ 1
and get FX(x) = ∫_{−∞}^{x} fX(u) du.
Notice that fX(x) → ∞ as x → 1 or x → −1 even though ∫_{−∞}^{∞} fX(x) dx = 1.
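A Monte Carlo sketch checking the c.d.f. of Example 5.25 at the arbitrary test point x₀ = 1/2, where 1 − arccos(1/2)/π = 2/3:

    import random
    from math import pi, cos, acos

    random.seed(6)
    N = 10**6
    xs = [cos(random.uniform(0, 2 * pi)) for _ in range(N)]

    x0 = 0.5
    print(sum(1 for x in xs if x <= x0) / N)  # empirical F_X(0.5)
    print(1 - acos(x0) / pi)                  # = 2/3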
5.5 Joint distributions
We will often want to think of different random variables defined on the same probability space.
In the discrete case, we studied pairs of random variables via their joint probability mass function.
For a pair of arbitrary random variables, we use instead the joint cumulative distribution function,
FX,Y : R2 → [0, 1], given by
FX,Y (x, y) = P (X ≤ x, Y ≤ y) .
It’s again possible to show that this function is non-decreasing in each of its arguments, that
lim_{x→−∞} FX,Y(x, y) = 0 = lim_{y→−∞} FX,Y(x, y),
and that
lim_{x→∞} lim_{y→∞} FX,Y(x, y) = 1.
Definition 5.26. Suppose that
FX,Y(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} fX,Y(u, v) du dv
for some non-negative function fX,Y. Then X and Y are jointly continuous and fX,Y is their joint density function. At any point (x, y) where fX,Y is continuous,
fX,Y(x, y) = ∂²/∂x∂y FX,Y(x, y).
For a single continuous random variable X, it turns out that the probability that it lies in
some nice set A ⊆ R (see Part A Integration to see what we mean by “nice”, but note that any
set you can think of or write down will be!) can be obtained by integrating its density over A:
P(X ∈ A) = ∫_{A} fX(x) dx.
Likewise, for nice sets B ⊆ R2 we obtain the probability that the pair (X, Y ) lies in B by
integrating the joint density over the set B:
P((X, Y) ∈ B) = ∬_{(x,y)∈B} fX,Y(x, y) dx dy.
Theorem 5.27. For a pair of jointly continuous random variables X and Y, we have
P(a < X ≤ b, c < Y ≤ d) = ∫_{c}^{d} ∫_{a}^{b} fX,Y(x, y) dx dy
for all a ≤ b and c ≤ d.
Proof. We have
P (a < X ≤ b, c < Y ≤ d)
= P (X ≤ b, Y ≤ d) − P (X ≤ a, Y ≤ d) + P (X ≤ a, Y ≤ c) − P (X ≤ b, Y ≤ c)
= FX,Y (b, d) − FX,Y (a, d) + FX,Y (a, c) − FX,Y (b, c)
= ∫_{c}^{d} ∫_{a}^{b} fX,Y(x, y) dx dy.
Theorem 5.28. Suppose X and Y are jointly continuous with joint density fX,Y. Then X is a continuous random variable with density
fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy,
and, likewise, Y is a continuous random variable with density fY(y) = ∫_{−∞}^{∞} fX,Y(x, y) dx.
In this context the one-dimensional densities fX and fY are called the marginal densities of
the joint distribution with joint density fX,Y , just as in the discrete case at Definition 2.16.
Proof. If fX is defined by fX(x) = ∫_{−∞}^{∞} fX,Y(x, y) dy, then we have
∫_{−∞}^{x} fX(u) du = ∫_{−∞}^{x} (∫_{−∞}^{∞} fX,Y(u, y) dy) du = P(X ≤ x, Y ∈ R) = P(X ≤ x),
so fX is indeed the density of X.
The definitions and results above generalise straightforwardly to the case of n random vari-
ables, X1 , X2 , . . . , Xn .
Example 5.29. Suppose that X and Y have joint density
fX,Y(x, y) = (1/2)(x + y)   for 0 ≤ x ≤ 1 and 1 ≤ y ≤ 2,
             0              otherwise.
Check that fX,Y(x, y) is a joint density. What is P(X ≤ 1/2, Y ≥ 3/2)? What are the marginal densities? What is P(X ≥ 1/2)?
Solution. First, fX,Y ≥ 0 and
∫_{1}^{2} ∫_{0}^{1} (1/2)(x + y) dx dy = ∫_{1}^{2} (1/4 + y/2) dy = 1/4 + 3/4 = 1,
so fX,Y is indeed a joint density. Next, we have
P(X ≤ 1/2, Y ≥ 3/2) = ∫_{3/2}^{2} ∫_{0}^{1/2} (1/2)(x + y) dx dy
                    = ∫_{3/2}^{2} [x²/4 + xy/2]_{x=0}^{x=1/2} dy
                    = ∫_{3/2}^{2} (1/16 + y/4) dy
                    = [y/16 + y²/8]_{3/2}^{2}
                    = 1/4.
Integrating out y we get
fX(x) = ∫_{1}^{2} (1/2)(x + y) dy = x/2 + 3/4
for x ∈ [0, 1], and integrating out x we get
fY(y) = ∫_{0}^{1} (1/2)(x + y) dx = 1/4 + y/2
for y ∈ [1, 2]. Finally,
P(X ≥ 1/2) = ∫_{1/2}^{1} (x/2 + 3/4) dx = [x²/4 + 3x/4]_{1/2}^{1} = 1 − 7/16 = 9/16.
Definition 5.30. Jointly continuous random variables X and Y with joint density fX,Y are
independent if
fX,Y (x, y) = fX (x)fY (y)
for all x, y ∈ R. Likewise, jointly continuous random variables X1 , X2 , . . . , Xn with joint density
fX1 ,X2 ,...,Xn are independent if
fX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = fX1 (x1 )fX2 (x2 ) . . . fXn (xn )
for all x1 , x2 , . . . , xn ∈ R.
Equivalently, X and Y are independent if and only if FX,Y(x, y) = FX(x) FY(y) for all x, y ∈ R.
5.5.1 Expectation
We can write the expectation of a function h of a pair (X, Y ) of jointly continuous random
variables in a natural way.
Theorem 5.32. Let X and Y be jointly continuous with joint density fX,Y and let h : R² → R. Then
E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) fX,Y(x, y) dx dy,
whenever ∫_{−∞}^{∞} ∫_{−∞}^{∞} |h(x, y)| fX,Y(x, y) dx dy < ∞.
As in the case of Theorem 5.17, a general proof of this result is rather technical, and we don’t
cover it here. However, note again that there is a very direct analogy with the discrete case which
we saw in equation (2.2).
In particular, the covariance of X and Y is
cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y],
just as in the discrete case.
Remark 5.34. We have now shown that the rules for calculating expectations (and derived
quantities such as variances and covariances) of continuous random variables are exactly the
same as for discrete random variables. This isn’t a coincidence! We can make a more general
definition of expectation which covers both cases (and more besides) but in order to do so we need
a more general theory of integration, which some of you will see in the Part A Integration course.
Example 5.35. Let −1 < ρ < 1. The standard bivariate normal distribution has joint density
fX,Y(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² − 2ρxy + y²)/(2(1 − ρ²)))
for x, y ∈ R. What are the marginal distributions of X and Y ? Find the covariance of X and Y .
Solution. We have
fX(x) = ∫_{−∞}^{∞} (1/(2π√(1 − ρ²))) exp(−(x² − 2ρxy + y²)/(2(1 − ρ²))) dy
      = ∫_{−∞}^{∞} (1/(2π√(1 − ρ²))) exp(−[(y − ρx)² + x²(1 − ρ²)]/(2(1 − ρ²))) dy
      = (1/√(2π)) e^{−x²/2} ∫_{−∞}^{∞} (1/√(2π(1 − ρ²))) exp(−(y − ρx)²/(2(1 − ρ²))) dy.
But the integrand is now the density of a normal random variable with mean ρx and variance 1 − ρ². So it integrates to 1 and we are left with
fX(x) = (1/√(2π)) e^{−x²/2}.
So X ∼ N(0, 1) and, by symmetry, the same is true for Y. Notice that X and Y are only independent if ρ = 0.
Since X and Y both have mean 0, we only need to calculate E[XY]. We can use a similar trick:
E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (xy/(2π√(1 − ρ²))) exp(−(x² − 2ρxy + y²)/(2(1 − ρ²))) dy dx
      = ∫_{−∞}^{∞} (x/√(2π)) e^{−x²/2} (∫_{−∞}^{∞} (y/√(2π(1 − ρ²))) exp(−(y − ρx)²/(2(1 − ρ²))) dy) dx.
The inner integral now gives us the mean of a N(ρx, 1 − ρ²) random variable, which is ρx. So we get
cov(X, Y) = ∫_{−∞}^{∞} (ρx²/√(2π)) e^{−x²/2} dx = ρ E[X²] = ρ,
since E[X²] = 1.
This yields the interesting conclusion that standard bivariate normal random variables X and
Y are independent if and only if their covariance is 0. This is a nice property of normal random
variables which is not true for general random variables, as we have already observed in the
discrete case.
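A simulation sketch: one standard way to generate a standard bivariate normal pair is X = Z₁, Y = ρZ₁ + √(1 − ρ²) Z₂ with Z₁, Z₂ independent N(0, 1); the empirical average of XY should then be close to ρ (here ρ = 0.6 and the sample size are arbitrary choices):

    import random
    from math import sqrt

    random.seed(7)
    rho, N = 0.6, 10**6
    acc = 0.0
    for _ in range(N):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        x, y = z1, rho * z1 + sqrt(1 - rho**2) * z2
        acc += x * y
    print(acc / N)  # ~ 0.6 = rho (= cov(X, Y), since both means are 0)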
Chapter 6
Random samples and the weak law of large numbers
One of the reasons that we are interested in sequences of i.i.d. random variables is that we can
view them as repeated samples from some underlying distribution.
Definition 6.1. Let X1 , X2 , . . . , Xn denote i.i.d. random variables. Then these random variables
are said to constitute a random sample of size n from the distribution.
Statistics often involves random samples where the underlying distribution (the “parent dis-
tribution”) is unknown. A realisation of such a random sample is used to make inferences about
the parent distribution. Suppose, for example, we want to know about the mean of the parent
distribution. An important estimator is the sample mean.
Definition 6.2. The sample mean is defined to be
X̄n = (1/n) Σ_{i=1}^{n} Xi.
This is a key random variable which itself has an expectation and a variance. Recall that for
random variables X and Y (discrete or continuous),
var (X + Y ) = var (X) + var (Y ) + 2cov (X, Y ) .
We can extend this (by induction) to n random variables as follows:
var(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} var(Xi) + Σ_{i≠j} cov(Xi, Xj)
                    = Σ_{i=1}^{n} var(Xi) + 2 Σ_{i<j} cov(Xi, Xj).
Theorem 6.3. Suppose that X1, X2, . . . , Xn form a random sample from a distribution with mean µ and variance σ². Then the expectation and variance of the sample mean are
E[X̄n] = µ  and  var(X̄n) = σ²/n.
Proof. We have E[Xi] = µ and var(Xi) = σ² for 1 ≤ i ≤ n. So by linearity of expectation and the variance rules recalled above,
E[X̄n] = E[(1/n) Σ_{i=1}^{n} Xi] = (1/n) Σ_{i=1}^{n} E[Xi] = µ,
var(X̄n) = var((1/n) Σ_{i=1}^{n} Xi) = (1/n²) var(Σ_{i=1}^{n} Xi) = (1/n²) Σ_{i=1}^{n} var(Xi) = σ²/n,
where in the variance calculation the covariance terms vanish because the Xi are independent.
Example 6.4. Let X1, . . . , Xn be a random sample from a Bernoulli distribution with parameter p. Then E[Xi] = p and var(Xi) = p(1 − p) for all 1 ≤ i ≤ n. Hence E[X̄n] = p and var(X̄n) = p(1 − p)/n.
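A simulation sketch for Example 6.4, estimating the mean and variance of X̄n over many repetitions (p, n and the number of repetitions are arbitrary choices):

    import random

    random.seed(8)
    p, n, reps = 0.3, 100, 20_000
    means = [sum(1 if random.random() < p else 0 for _ in range(n)) / n
             for _ in range(reps)]

    m = sum(means) / reps
    v = sum((x - m)**2 for x in means) / reps
    print(m, p)                # ~ 0.3
    print(v, p * (1 - p) / n)  # ~ 0.0021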
In order for X̄n to be a good estimator of the mean, we would like to know that for large
sample sizes n, X̄n is not too far away from µ i.e. that |X̄n − µ| is small. The result which
tells us that this is true is called the law of large numbers and is of fundamental importance in
probability. Before we state it, let’s step away from the sample mean and consider a more basic
situation.
Suppose that A is an event with probability P (A) and write p = P (A). Let X be the indicator
function of the event A, i.e. the random variable defined by
X(ω) = 1_A(ω) = 1   if ω ∈ A
                0   if ω ∉ A.
Then X ∼ Ber(p) and E [X] = p. Suppose now that we perform our experiment repeatedly
and let Xi be the indicator of the event that A occurs on the ith trial. Our intuitive notion of
probability leads us to believe that if the number n of trials is large then the proportion of the
time that A occurs should be close to p, i.e.
| (1/n) Σ_{i=1}^{n} Xi − p |
should be small. So proving that the sample mean is close to the true mean in this situation will
also provide some justification for the way we have set up our mathematical theory of probability.
Theorem 6.5 (Weak law of large numbers). Suppose that X1, X2, . . . are independent and identically distributed random variables with mean µ. Then for any fixed ε > 0,
P(| (1/n) Σ_{i=1}^{n} Xi − µ | > ε) → 0
as n → ∞. (Equivalently, P(|X̄n − µ| ≤ ε) → 1 as n → ∞.)
In other words, the probability that the sample mean deviates from the true mean by more
than some small quantity tends to 0 as n → ∞. Notice that the result only depends on the
underlying distribution through its mean.
We will give a proof of the weak law under an additional assumption that the variance of the
distribution is finite. To do that, we’ll first prove a couple of very useful inequalities.
Theorem 6.6 (Markov’s inequality). Suppose that Y is a non-negative random variable whose
expectation exists. Then
P(Y ≥ t) ≤ E[Y]/t
for all t > 0.
Proof. Let A = {Y ≥ t}. We may assume that P(A) ∈ (0, 1), since otherwise the result is trivially true. Then by the law of total probability for expectations,
E[Y] = E[Y | A] P(A) + E[Y | Aᶜ] P(Aᶜ) ≥ E[Y | A] P(A),
since P(Aᶜ) > 0 and E[Y | Aᶜ] ≥ 0. Now, we certainly have E[Y | A] = E[Y | Y ≥ t] ≥ t. So, rearranging, we get
P(Y ≥ t) ≤ E[Y]/t
as we wanted.
Theorem 6.7 (Chebyshev's inequality). Suppose that Z is a random variable with a finite variance. Then for any t > 0,
P(|Z − E[Z]| ≥ t) ≤ var(Z)/t².

Proof. Note that P(|Z − E[Z]| ≥ t) = P((Z − E[Z])² ≥ t²) and then apply Markov's inequality to the non-negative random variable (Z − E[Z])², whose expectation is var(Z).
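The two inequalities give genuine, if often crude, bounds. A sketch comparing them with the truth for Y ∼ Exp(1), where E[Y] = var(Y) = 1 (t = 3 is an arbitrary choice):

    import random

    random.seed(9)
    N = 10**5
    ys = [random.expovariate(1.0) for _ in range(N)]

    t = 3.0
    print(sum(1 for y in ys if y >= t) / N)           # ~ exp(-3) = 0.0498
    print(1 / t)                                      # Markov bound: 0.333
    print(sum(1 for y in ys if abs(y - 1) >= t) / N)  # ~ P(Y >= 4) = 0.0183
    print(1 / t**2)                                   # Chebyshev bound: 0.111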
Proof of Theorem 6.5 (under the assumption of finite variance). Suppose the common distribution of the random variables Xi has mean µ and variance σ². Set
Z = (1/n) Σ_{i=1}^{n} Xi.
Then E[Z] = µ and, by Theorem 6.3, var(Z) = σ²/n. So by Chebyshev's inequality,
P(| (1/n) Σ_{i=1}^{n} Xi − µ | > ε) ≤ σ²/(nε²),
which tends to 0 as n → ∞, as required.
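The weak law can be watched in action by simulation; the sketch below estimates P(|X̄n − µ| > ε) for Exp(1) samples (so µ = 1) at a few sample sizes (ε and the repetition count are arbitrary choices):

    import random

    random.seed(10)
    eps, reps = 0.05, 2_000
    for n in (10, 100, 1000):
        bad = 0
        for _ in range(reps):
            xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
            if abs(xbar - 1.0) > eps:
                bad += 1
        print(n, bad / reps)  # the proportion should decrease towards 0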
Appendix
Countability
A set S is countable if either it’s finite, or its elements can be written as a list: S = {x1 , x2 , x3 , . . . }.
Put another way, S is countable if there is a bijection from a subset of N to S. The set N itself
is countable; so is the set of rational numbers Q, for example. The set of real numbers R is not
countable.
Limits
Even if you haven’t seen a definition, you probably have an idea of what it means for a sequence
to converge to a limit. Formally, we say that a sequence of real numbers (a1, a2, a3, . . . ) converges to a limit L ∈ R if the following holds: for all ε > 0, there exists N ∈ N such that |an − L| ≤ ε whenever n ≥ N.
Then we may write “L = lim_{n→∞} an”, or “an → L as n → ∞”.
Infinite sums
Finite sums are easy. If we have a sequence (a1, a2, a3, . . . ), then for any n ∈ N we can define
sn = Σ_{k=1}^{n} ak = a1 + a2 + · · · + an.
Then we define the infinite sum Σ_{k=1}^{∞} ak to be lim_{n→∞} sn, whenever this limit exists. Infinite sums can behave in unexpected ways: for example, the alternating series 1 − 1/2 + 1/3 − 1/4 + 1/5 − · · · converges to ln 2, but if we reorder the terms as 1 + 1/3 − 1/2 + 1/5 + 1/7 − 1/4 + 1/9 + 1/11 − 1/6 + · · · , then the sum instead becomes (3/2) ln 2.
Power series
A (real) power series is a function of the form
f(x) = Σ_{k=0}^{∞} ck x^k,
where the coefficients ck, k ≥ 0, are real constants. For any such series, there exists a radius of convergence R ∈ [0, ∞) ∪ {∞} such that Σ_{k=0}^{∞} ck x^k converges absolutely for |x| < R, and does not converge for |x| > R.
In this course we will meet a particular class of power series called probability generating
functions, with the property that the coefficients ck are non-negative and sum to 1. In that case,
R is at least 1.
Power series behave well when differentiated! A power series f(x) = Σ_{k=0}^{∞} ck x^k with radius of convergence R is differentiable on the interval (−R, R), and its derivative is also a power series with radius of convergence R, given by
f′(x) = Σ_{k=0}^{∞} (k + 1) c_{k+1} x^k.
Series identities
Here is a reminder of some useful identities:
Geometric series: if a ∈ R and 0 ≤ r < 1 then
Σ_{k=0}^{n−1} a r^k = a(1 − r^n)/(1 − r)
and
Σ_{k=0}^{∞} a r^k = a/(1 − r).
Exponential function: for λ ∈ R,
Σ_{n=0}^{∞} λ^n/n! = e^λ.
Differentiation and integration give us variants of these. For example, for 0 < r < 1,
Σ_{k=1}^{∞} k r^{k−1} = d/dr (Σ_{k=0}^{∞} r^k) = 1/(1 − r)²
and
Σ_{k=1}^{∞} r^k/k = ∫_{0}^{r} (Σ_{k=0}^{∞} t^k) dt = −ln(1 − r).
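These variants are easy to check numerically, for instance (r = 0.3 and the truncation point are arbitrary choices):

    from math import log

    r = 0.3
    print(sum(k * r**(k - 1) for k in range(1, 200)), 1 / (1 - r)**2)  # ~ 2.041
    print(sum(r**k / k for k in range(1, 200)), -log(1 - r))           # ~ 0.357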
A.2 Increasing sequences of events
We mentioned the following result in the later part of the course. A sequence of events An, n ≥ 1, is called increasing if A1 ⊆ A2 ⊆ A3 ⊆ . . . .

Theorem. If An, n ≥ 1, is an increasing sequence of events, then
P(⋃_{n=1}^{∞} An) = lim_{n→∞} P(An).

Proof. The proof uses countable additivity. Using the fact that the sequence is increasing, we can write ⋃_{k=1}^{∞} Ak as a disjoint union
⋃_{k=1}^{∞} Ak = A1 ∪ ⋃_{k=2}^{∞} (Ak \ A_{k−1}).
Hence, by countable additivity (axiom P3),
P(⋃_{k=1}^{∞} Ak) = P(A1) + Σ_{k=2}^{∞} P(Ak \ A_{k−1})
                  = lim_{n→∞} (P(A1) + Σ_{k=2}^{n} P(Ak \ A_{k−1}))
(since, by definition of an infinite sum, Σ_{k=2}^{∞} bk = lim_{n→∞} Σ_{k=2}^{n} bk)
                  = lim_{n→∞} P(An),
where the last step uses finite additivity: A1 ∪ (A2 \ A1) ∪ · · · ∪ (An \ A_{n−1}) = An.
Common discrete distributions

Binomial Bin(n, p), n ∈ N, p ∈ [0, 1]:
P(X = k) = C(n, k) p^k (1 − p)^{n−k}, k = 0, 1, . . . , n;  mean np;  variance np(1 − p);  p.g.f. GX(s) = (1 − p + ps)^n.

Poisson Po(λ), λ ≥ 0:
P(X = k) = (λ^k/k!) e^{−λ}, k = 0, 1, 2, . . . ;  mean λ;  variance λ;  p.g.f. GX(s) = e^{λ(s−1)}.

Geometric Geom(p), p ∈ [0, 1]:
P(X = k) = (1 − p)^{k−1} p, k = 1, 2, . . . ;  mean 1/p;  variance (1 − p)/p²;  p.g.f. GX(s) = ps/(1 − (1 − p)s).

Alternative geometric, p ∈ [0, 1]:
P(X = k) = (1 − p)^k p, k = 0, 1, . . . ;  mean (1 − p)/p;  variance (1 − p)/p²;  p.g.f. GX(s) = p/(1 − (1 − p)s).

Negative binomial NegBin(k, p), k ∈ N, p ∈ [0, 1]:
P(X = n) = C(n − 1, k − 1) (1 − p)^{n−k} p^k, n = k, k + 1, . . . ;  mean k/p;  variance k(1 − p)/p²;  p.g.f. GX(s) = (ps/(1 − (1 − p)s))^k.

(Here C(n, k) denotes the binomial coefficient n!/(k!(n − k)!).)
Common continuous distributions

Uniform U[a, b]:
fX(x) = 1/(b − a), a ≤ x ≤ b;  mean (a + b)/2;  variance (b − a)²/12.

Exponential Exp(λ), λ > 0:
fX(x) = λ e^{−λx}, x ≥ 0;  mean 1/λ;  variance 1/λ².

Gamma Gamma(α, λ), α > 0, λ > 0:
fX(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx}, x ≥ 0;  mean α/λ;  variance α/λ².

Normal N(µ, σ²), µ ∈ R, σ² > 0:
fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}, x ∈ R;  FX(x) = Φ((x − µ)/σ);  mean µ;  variance σ².

Standard normal N(0, 1):
fX(x) = (1/√(2π)) e^{−x²/2}, x ∈ R;  FX(x) = Φ(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−u²/2} du;  mean 0;  variance 1.

Beta Beta(α, β), α, β > 0:
fX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1}, x ∈ [0, 1];  mean α/(α + β);  variance αβ/((α + β)²(α + β + 1)).