
ST218 Mathematical Statistics A

Jon Warren
Contents

1 Probability Distributions
1.1 One random variable: discrete and continuous distributions
1.2 Expectation and variance
1.3 A reminder of some important distributions
1.4 Many random variables and joint distributions
1.5 Dependence and Independence
1.6 Covariance
1.7 Transformations of densities

2 The Multivariate Normal Distribution
2.1 The standard Gaussian distribution
2.2 The general Gaussian distribution
2.2.1 Definition
2.2.2 Characterization by mean and variance-covariance matrix
2.2.3 Properties
2.3 Fisher’s Theorem and a statistical application

3 Conditioning
3.1 Conditional distributions
3.2 Regression
3.3 The law of total probability
3.4 Priors and posteriors
3.5 Conditional expectation
3.6 The Tower property

4 Infinite Sequences of Random Variables
4.1 Markov’s inequality and the weak law of large numbers
4.2 Convergence in Distribution
4.3 The Central Limit Theorem
4.4 Convergence of random variables
4.5 More on convergence of random variables
4.6 The strong law of large numbers

Chapter 1

Probability Distributions

1.1 One random variable: discrete and continuous distributions
A random variable is a quantity measured in an experiment having random outcomes
whose value depends on the outcome of that experiment. More formally, if Ω is the sample
space that is used to model the experiment, then a random variable X is a function

X : Ω → R.

Then, X(ω) is the value you observe for X if the outcome of the experiment corre-
sponds to the sample point ω. When modelling the experiment, probabilities are assigned
to events by a probability measure P on the sample space Ω: the probability of an event
A ⊂ Ω is denoted by P(A).
Given a probability measure P on the sample space1 Ω we can determine the proba-
bility of observing different values for X. If B ⊆ R then the probability2 that X takes a
value in B is denoted by P(X ∈ B) which is shorthand for

P({ω ∈ Ω : X(ω) ∈ B}).

Notice how this mathematical quantity is defined: {ω ∈ Ω : X(ω) ∈ B} is an event, so a


subset of the sample space Ω, and consequently its probability is determined by the given
measure P. If B = {x} for some x ∈ R, then we have,

P(X = x) = P({ω ∈ Ω : X(ω) = x}),

which illustrates the need to distinguish the random variable X and the ordinary variable
x denoting some particular, but unspecified, value. When we let the subset B vary we
end up with a function B ↦ P(X ∈ B) which defines a probability measure (or simply a
probability distribution) on R. It is the distribution of X.
It's important to appreciate the fundamental difference between the concept of a ran-
dom variable which is a function on the sample space, and its distribution which is a
function defined on subsets of R.
How do we describe probability measures (or distributions) on R? A very useful
analogy is to think about distributions of mass along R. We distinguish two types of
distribution. The mass could be concentrated at certain selected points, or it could be
“smeared” along the line. These two possibilities correspond to discrete and continuous
probability distributions which we will now formally define.
Definition 1. A random variable X has a discrete distribution if there exists a finite or
countably infinite set of values {x1, x2, . . . , xk, . . .} such that

P(X ∈ R \ {x1 , x2 , . . .}) = 0.


1 Sample spaces will be very much in the background in this module. We will rarely refer to them, but
it's important to remember that they are always there, part of the mathematics we are describing.
2 In general this probability will only exist if B is a measurable set, but unmeasurable sets are so weird
you will probably never meet one. Their structure is so complicated that it's impossible to measure their
size in any sensible way.

In this case there exist corresponding probabilities p1 , p2 , . . . summing to 1 such that,
P(X = xi ) = pi for all i.
If pi > 0 for all i then the set X = {x1 , x2 , . . .} is called the support of the distribution
and the function fX : X → [0, 1] defined by
fX (xi ) = pi for all i,
is called the probability mass function.
Knowing the support and probability mass function determines the distribution en-
tirely because for B ⊆ R,

P(X ∈ B) = Σ_{x∈B∩X} fX(x).
Example 1. Suppose a die is rolled twice and let X denote the score obtained on the first
roll, and Y that obtained on the second roll. What’s the distribution of X + Y ?
We model the experiment with the sample space
Ω = {(i, j) ∈ Z2 : 1 ≤ i ≤ 6, 1 ≤ j ≤ 6},
and define X and Y via
X(ω) = i, and Y (ω) = j for ω = (i, j) ∈ Ω.
To compute P(X + Y = 6) we identify the event
{ω ∈ Ω : (X + Y )(ω) = 6} = {(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}.
Thus, counting the sample points, since we assume each sample point has an equal prob-
ability, P(X + Y = 6) = 5/36. For k = 2, 3, . . . , 12 similar considerations give P(X + Y = k),
and we conclude the distribution is discrete with support {2, 3, . . . , 12} and mass function
fX+Y(k) = (k − 1)/36 for k = 2, 3, . . . , 7, and fX+Y(k) = (13 − k)/36 for k = 8, 9, . . . , 12.
Notice how we can combine the random variables X and Y using algebraic operations such
as summation, using the fact that they are functions. Notice too that this doesn't correspond
to “summing” the distributions: fX+Y ≠ fX + fY. In fact the distribution of X + Y is
computed from the distributions of each of X and Y using an operation called convolution.
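The convolution just mentioned is easily carried out by machine. Here is a minimal Python sketch (an illustration, not part of the original notes; it uses only the standard library) that enumerates the 36 equally likely sample points and recovers the mass function above.

from collections import Counter
from fractions import Fraction

# Enumerate the 36 equally likely outcomes (i, j) of the two rolls.
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))

# Mass function of X + Y; Fraction reduces 2/36 to 1/18, and so on.
pmf = {k: Fraction(c, 36) for k, c in sorted(counts.items())}
print(pmf)   # f(2) = 1/36, f(7) = 1/6, f(12) = 1/36, matching the formula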
Our second type of distribution is characterised by the fact the mass is “smeared out”
along the line, with the probability of observing any specific value given in advance always
being 0. So in order to describe the distribution we give the probability of seeing a value
falling in given intervals.
Definition 2. A random variable has a continuous distribution if there exists a function
fX : R → [0, ∞) so that
P(a ≤ X ≤ b) = ∫_a^b fX(x) dx, for all a ≤ b.

In this case the function fX is called the probability density function.

A probability density function always satisfies
∫_{−∞}^{∞} fX(x) dx = 1.

Note that because P(X = x) = 0 for all x we also have


P(a < X < b) = ∫_a^b fX(x) dx, for all a ≤ b.

Once again, knowing the p.d.f. determines the distribution entirely because for B ⊆ R,

P(X ∈ B) = Area({(x, y) ∈ R² : x ∈ B, 0 ≤ y ≤ fX(x)}),

which in the case of B = [a, b] reduces to the interpretation of the definite integral as an
area under the graph of fX .
Suppose the density fX is continuous at x = a. Then for x close to a, fX (x) ≈ fX (a)
and hence for small h,

∫_a^{a+h} fX(x) dx ≈ h fX(a).
Thus
fX(a) ≈ (1/h) P(a ≤ X ≤ a + h),
which becomes an exact equality in the limit h ↓ 0. This explains the terminology
“density”: it is (the limit of) a probability divided by a length3. Notice that a density,
unlike a probability, can be greater than 1.
Example 2. Suppose (X, Y ) are the co-ordinates of the point chosen uniformly at random
from the square with side length L,

Ω = {(x, y) ∈ R² : 0 ≤ x ≤ L, 0 ≤ y ≤ L}.

The random variable X is the function X(ω) = x for ω = (x, y) ∈ Ω. It has a continuous
distribution. If we consider 0 ≤ a ≤ b ≤ L, then P(a ≤ X ≤ b) is equal to the ratio of the area
of the rectangle {(x, y) : a ≤ x ≤ b, 0 ≤ y ≤ L} to the area of the whole sample space Ω,
and so is (b − a)/L. On the other hand if a ≤ b ≤ 0 or L ≤ a ≤ b, then P(a ≤ X ≤ b) = 0,
so we see that if we take

fX(x) = 1/L if 0 ≤ x ≤ L, and fX(x) = 0 otherwise,
then

P(a ≤ X ≤ b) = ∫_a^b fX(x) dx, for all a ≤ b.
Notice that the value of the density at x = 0 and x = L could equally be defined to be 0,
and this formula would still hold. This illustrates that densities are not unique and can
be changed arbitrarily at a finite (or countable) set of values.
3 Densities in the physical world usually involve dividing mass by volume. But that's because the mass
is distributed in the three dimensions of the physical universe, not along a line as here.

Example 3. This is a continuation of the previous example. Consider the random variable
max(X, Y ). It has a continuous distribution also. If we consider 0 ≤ a ≤ b ≤ L, then
P(a ≤ max(X, Y ) ≤ b) is proportional to the area of the set {(x, y) : a ≤ max(x, y) ≤ b},
and so is (b² − a²)/L². On the other hand if a ≤ b ≤ 0 or L ≤ a ≤ b, then
P(a ≤ max(X, Y) ≤ b) = 0, so we see that we can take as density the function

fmax(X,Y)(x) = 2x/L² if 0 ≤ x ≤ L, and fmax(X,Y)(x) = 0 otherwise.

Just check that when we calculate a definite integral of this function we get the right
expression for P(a ≤ max(X, Y) ≤ b)4.

4 And to derive the formula for the density in the first place, we could put a = 0 and differentiate with
respect to b to obtain fmax(X,Y)(b), from the formula for P(a ≤ max(X, Y) ≤ b) when b is between 0
and L.
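As a numerical aside, the density found in this example can be checked by simulation. A hedged Python sketch, assuming numpy is available and taking L = 1 for concreteness: it compares the empirical value of P(a ≤ max(X, Y) ≤ b) with the exact value (b² − a²)/L².

import numpy as np

rng = np.random.default_rng(0)
L, n = 1.0, 10**6                      # side length (set to 1 here) and sample size
x = rng.uniform(0, L, n)
y = rng.uniform(0, L, n)
m = np.maximum(x, y)

a, b = 0.3, 0.8
print(np.mean((a <= m) & (m <= b)))    # empirical probability
print((b**2 - a**2) / L**2)            # from integrating 2x/L^2 over [a, b]; both ~0.55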

1.2 Expectation and variance
The expectation or expected value of a random variable X, denoted by E[X] is the mean
of its distribution: loosely speaking the average of the possible values of X weighted
according to their probability. Classically it is also the basis for the idea of a game being
fair: if you win a random amount X but have to pay a fixed amount in order to play, then
the game is considered fair if you pay exactly E[X]. It is also an analogue of the centre
of mass (or gravity): if a probability distribution is considered as a description of mass
distributed along a line, then its mean is the point along the line at which the distribution
would balance if pivoted there.
In general we have the following definition.

Definition 3. If X has a discrete distribution with support {x1, x2, x3, . . .} and mass
function fX then the expected value of X is defined as

E[X] = Σ_i xi fX(xi),

provided this sum is absolutely convergent, meaning: Σ_i |xi| fX(xi) < ∞.
If X has a continuous distribution with density function fX then the expected value of
X is defined as

E[X] = ∫_{−∞}^{∞} x fX(x) dx,

provided this integral is absolutely convergent, meaning: ∫_{−∞}^{∞} |x| fX(x) dx < ∞.

A rather trivial, but nevertheless important, case of the definition of E[X] for a discrete
distribution is the following. Suppose that there is a non-random c ∈ R so that P(X =
c) = 1, then E[X] = c. This is because such an X has a discrete distribution with support
{c} and mass function fX (c) = 1.
The next two propositions state very important properties of expectations, which we
will use frequently. We won’t give proofs.

Proposition 1. If X has a discrete distribution with support {x1, x2, x3, . . .} and mass
function fX, then the expected value of g(X) is given by

E[g(X)] = Σ_i g(xi) fX(xi),

provided this sum is absolutely convergent, meaning: Σ_i |g(xi)| fX(xi) < ∞.
If X has a continuous distribution with density function fX then the expected value of
g(X) is given by

E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx,

provided this integral is absolutely convergent, meaning: ∫_{−∞}^{∞} |g(x)| fX(x) dx < ∞.

Proposition 2. Assume all expectations are defined by absolutely convergent sums or
integrals.

(1) Positivity. If P(Z ≥ 0) = 1 then E[Z] ≥ 0. If additionally P(Z > 0) > 0 then
E[Z] > 0. If P(Z1 ≥ Z2) = 1 then E[Z1] ≥ E[Z2].

(2) Linearity. If X and Y are random variables and a, b ∈ R non-random constants,
then
E[aX + bY] = aE[X] + bE[Y].
More generally for finite sums,

E[Σ_{i=1}^n ai Xi] = Σ_{i=1}^n ai E[Xi].

(3) Fubini’s formula. If X and Y are independent5 random variables then

E[XY] = E[X]E[Y].

More generally, for a finite sequence of independent random variables,

E[Π_{i=1}^n Xi] = Π_{i=1}^n E[Xi].

A natural question is to find a quantity that helps describe the spread or dispersion of
a distribution. Compare the uniform distribution on the interval [−1/2, +1/2] with the
uniform distribution on the interval [−2, +2]. Both have mean µ = 0 but on average
the value of a random variable with the second distribution is further away from 0 than
the value of a random variable with the first distribution. This suggests measuring the
dispersion of the distribution of a random variable X with E[|X − µ|] where µ = E[X].
Notice that the modulus inside the expectation is essential: E[X − µ] = 0!
In fact a better choice to measure dispersion is E[(X − µ)²], the average squared
distance that the random variable is from its mean. We will also consider its square
root. To see how mathematically natural this quantity is, consider its generalization

√(E[(X − Y)²])

where Y is another random variable. In some sense this measures how far apart the values
of X and Y tend to be. It is the natural analogue of the quantity

√(Σ_{i=1}^n (xi − yi)²)

which measures the Euclidean distance between points (x1 , x2 , . . . xn ) and (y1 , y2 , . . . yn )
in Rn .
5 Independence is a very important relationship which may hold between random variables. You will
have met it in ST115, and we will look at the concept again in more detail in a few lectures' time.
In the meantime recall that independent random variables arise when modelling quantities that do not
influence each other, such as in repeated rolls of a die.

Definition 4. We define the variance of a random variable X by

var(X) = E[(X − µ)²],

where µ = E[X], and the standard deviation of X to be

√(E[(X − µ)²]),

provided these expectations exist.

Proposition 3. Assuming the expectations all exist, then for any random variable X,

var(X) = E[X²] − (E[X])² ≥ 0.

Moreover var(X) = 0 if and only if P(X = µ) = 1 where µ = E[X].

Proof.

var(X) = E[(X − µ)²] = E[X² − 2µX + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − (E[X])²,

by linearity and using E[µ²] = µ² because µ is a non-random constant. Since P((X −
µ)² ≥ 0) = 1 we have by positivity of expectation that E[(X − µ)²] ≥ 0, with equality
only if P((X − µ)² = 0) = 1.
The next proposition will be used many times in the future; the calculation technique
of expanding the square of a sum within the expectation that we use in its proof is very
common too.
Proposition 4. Suppose that X1 , X2 , . . . , Xn are independent random variables. Then
var(Σ_{i=1}^n Xi) = Σ_{i=1}^n var(Xi).

Proof. Let µ = E[X1 + X2 + . . . + Xn ] = E[X1 ] + . . . + E[Xn ], and denote E[Xi ] by µi .


Then, by definition of variance, multiplying out the brackets and using linearity of
expectation,

var(Σ_{i=1}^n Xi) = E[(Σ_{i=1}^n Xi − µ)²] = E[(Σ_{i=1}^n (Xi − µi))²]
= E[Σ_{i,j=1}^n (Xi − µi)(Xj − µj)] = Σ_{i,j=1}^n E[(Xi − µi)(Xj − µj)]
= Σ_{i=1}^n E[(Xi − µi)²] + Σ_{i≠j} E[(Xi − µi)(Xj − µj)].

Now, when i ≠ j, Xi and Xj are independent, and Fubini gives

E[(Xi − µi)(Xj − µj)] = E[Xi Xj] − µi µj = 0,

so the cross terms vanish and we are left with Σ_{i=1}^n E[(Xi − µi)²] = Σ_{i=1}^n var(Xi).
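Proposition 4 is easy to test by simulation. A short Python sketch, assuming numpy, with three independent variables of quite different distributions:

import numpy as np

rng = np.random.default_rng(1)
n = 10**6
x1 = rng.exponential(2.0, n)      # exponential with mean 2, so variance 4
x2 = rng.uniform(0, 1, n)         # variance 1/12
x3 = rng.normal(3.0, 0.5, n)      # variance 0.25
s = x1 + x2 + x3

print(np.var(s))                               # empirical variance of the sum
print(np.var(x1) + np.var(x2) + np.var(x3))    # sum of variances; both ~4.333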

1.3 A reminder of some important distributions
The Binomial distribution with parameters p ∈ [0, 1] and integer n ≥ 1. Suppose that
A1 , A2 , A3 , . . . , An is a sequence of independent events with P(Ai ) = p for each i. Let X
be the number of these events that occur. Then X has the Binomial distribution with
parameters n and p. This distribution has support {0, 1, 2, . . . , n} and mass function

fX(k) = (n choose k) p^k (1 − p)^{n−k} for k = 0, 1, 2, . . . , n.

By writing X = Σ_{i=1}^n εi, where ε1, ε2, . . . , εn are the indicator random variables defined by

εi = 1 if Ai occurs, and εi = 0 otherwise,

we can find the mean and variance of the Binomial distribution using linearity of expec-
tation and Proposition 4. We find that E[X] = np and var(X) = np(1 − p).
The Geometric distribution with parameter p ∈ (0, 1]. Suppose that (Ai ; i ≥ 1)
is an infinite sequence of independent events with P(Ai ) = p for each i. Let the random
variable X be defined by
X + 1 = min{k : Ak occurs}
Then X has a geometric distribution with support {0, 1, 2, . . .} and mass function

fX(k) = p(1 − p)^k for k = 0, 1, 2, . . . .

By direct calculation we find that the mean and variance of the geometric distribution
are E[X] = (1 − p)/p and var(X) = (1 − p)/p².
The Exponential distribution with parameter α > 0. The continuous analogue of
the geometric distribution, which is a natural model for a waiting time, is the exponential
distribution with density

fX(x) = αe^{−αx} for x ≥ 0, and fX(x) = 0 otherwise.
By direct calculation we find that the mean and variance of the exponential distribution
are E[X] = 1/α and var(X) = 1/α².
The Gamma distribution. Suppose that X1 , X2 , . . . , Xn are independent random
variables each having the exponential distribution with parameter α. Then the sum
Sn = X1 + X2 + . . . + Xn has the Gamma distribution with parameters n (called shape) and
α (called rate), which has density

fSn(x) = (α^n x^{n−1} / (n − 1)!) e^{−αx} for x ≥ 0, and fSn(x) = 0 otherwise.

Using linearity of expectation and Proposition 4, we can deduce the mean and variance of
Sn from the mean and variance of the exponential distribution, and find that
E[Sn] = n/α and var(Sn) = n/α².
The Poisson Distribution. A random variable N has the Poisson distribution with
parameter λ if it has a discrete distribution supported on {0, 1, 2, . . .} with

P(N = k) = (λ^k / k!) e^{−λ}.
By direct calculation we find that the mean and variance of the Poisson distribution are
E[N ] = var(N ) = λ. The Poisson distribution arises when modelling points in space or
instants of time occurring at random: for example the instants of time at which a radioac-
tive sample emits a particle. Suppose that X1 , X2 , . . . , Xn is a sequence of independent
random variables, each having the exponential distribution with parameter α. For a fixed
t ≥ 0 define a random variable N via

N = 0 if X1 > t, and N = max{k : X1 + X2 + . . . + Xk ≤ t} if X1 ≤ t.

Then N has the Poisson distribution with parameter λ = αt.
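This connection between exponential gaps and Poisson counts is easy to see numerically. A Python sketch, assuming numpy: simulate many sequences of exponential gaps, count the arrivals up to time t, and compare the mean and variance of N with λ = αt.

import numpy as np

rng = np.random.default_rng(2)
alpha, t = 2.0, 3.0                   # rate and time horizon, so lambda = alpha*t = 6
trials = 10**5

# 50 gaps per trial is far more than needed here, since E[N] = 6.
gaps = rng.exponential(1 / alpha, size=(trials, 50))
arrival_times = np.cumsum(gaps, axis=1)
N = (arrival_times <= t).sum(axis=1)  # number of arrivals in [0, t]

print(N.mean(), N.var())              # both close to lambda = 6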


The Gaussian distribution. A random variable Z has the standard Gaussian dis-
tribution if it has a continuous distribution with density

fZ(z) = (1/√(2π)) e^{−z²/2}, z ∈ R.

If Z has the standard normal distribution then the random variable σZ + µ, where σ > 0
and µ ∈ R are constants, has the Gaussian distribution with mean µ and variance σ². This
distribution has density

fX(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, x ∈ R.
This distribution is often used when considering random errors in measuring quantities,
or fluctuations in quantities that result from combining many sources of variation.
The importance of the Gaussian distribution stems from the fact that if X and Y are
two independent random variables each having a Gaussian distribution (say with means
µX and µY respectively and variances σX² and σY² respectively) then the sum X + Y also
has a Gaussian distribution (necessarily with mean µX + µY and variance σX² + σY²). In
fact the family of Gaussian distributions is the only family of probability distributions,
generated by translation and scaling of a single distribution with a finite variance, having
this property.

1.4 Many random variables and joint distributions
Suppose that a pair of random variables X and Y are defined on the same sample space,
and so correspond to two quantities measured in the same experiment. We can ask how
the value observed for X relates to the value observed for Y , if it does at all. For example,
in tossing a coin twice, we may take the sample space {H, T }2 and consider the random
variables
X(HH) = 2 Y (HH) = 0,
X(HT ) = X(T H) = 1 Y (HT ) = Y (T H) = 1,
X(T T ) = 0 Y (T T ) = 2.
X and Y are different functions on the sample space representing the number of heads
tossed, and the number of tails respectively. Both have the Binomial distribution with
parameters n = 2 and p = 1/2 (if the coin is unbiased). The notion of the joint distribution
of X and Y captures the relationship between X and Y .
In general the joint distribution of two random variables X and Y which are defined
on the same sample space is the mapping
B ↦ P((X, Y) ∈ B), B ⊆ R².
This is a probability distribution (or probability measure) on R². Similarly the joint
distribution of n random variables X1 , X2 , . . . , Xn is the mapping

B ↦ P((X1, X2, . . . , Xn) ∈ B), B ⊆ R^n.
Definition 5. The joint distribution of random variables X and Y is discrete if there
exist two finite or countably infinite sets X = {x1 , x2 , . . .} and Y = {y1 , y2 , . . .} such that
P(X ∈ X and Y ∈ Y) = 1.
In this case we define their joint mass function to be
fXY (x, y) = P(X = x and Y = y) for each pair (x, y) ∈ X × Y.
This definition extends easily to the case of n random variables.
In the example of coin tossing given above, the joint mass function is defined on
{0, 1, 2} × {0, 1, 2} via
fXY (0, 2) = 1/4,
fXY (1, 1) = 1/2,
fXY (2, 0) = 1/4
and fXY(x, y) = 0 in all other cases.
Definition 6. The joint distribution of random variables X and Y is continuous if there
exists a joint density function fXY : R2 → [0, ∞) so that
Z dZ b
P(a ≤ X ≤ b and c ≤ Y ≤ d) = fXY (x, y)dxdy for all a ≤ b and c ≤ d.
c a

Note that the definite integral appearing in this definition has the geometric interpre-
tation
Volume({(x, y, z) : a ≤ x ≤ b, c ≤ y ≤ d and 0 ≤ z ≤ fXY (x, y)}).
More generally if X and Y have a continuous joint distribution with density fXY then
P((X, Y) ∈ B) = ∫∫_B fXY(x, y) dx dy = Volume({(x, y, z) : (x, y) ∈ B and 0 ≤ z ≤ fXY(x, y)}).

Again this definition extends easily to the case of n random variables having a continuous
joint distribution.
We are already familiar with a pair of random variables having a continuous joint
distribution. If (X, Y ) are the co-ordinates of a point chosen uniformly at random from
the square [0, L]² as in example 2, then their joint distribution has the density

fXY(x, y) = 1/L² if 0 ≤ x ≤ L and 0 ≤ y ≤ L, and fXY(x, y) = 0 otherwise.

To see that this is the right density, note that for this choice of density,
P((X, Y) ∈ B) = Volume({(x, y, z) : (x, y) ∈ B and 0 ≤ z ≤ 1/L²}) = (1/L²) Area(B ∩ [0, L]²),
which agrees with the notion that the point is chosen uniformly at random from the
square.
It is easy to recover the (marginal) distributions of each of X and Y from their joint
mass function or joint density.
If X and Y have a discrete joint distribution with mass function fXY defined on X × Y
then X has a discrete distribution with mass function
fX(x) = Σ_{y∈Y} fXY(x, y) for each x ∈ X.

This is because we have the equality between events {X = x} = ∪_{y∈Y} {X = x and Y = y}6.
Analogously if X and Y have a continuous joint distribution with density function
fXY then X has a continuous distribution with density given by

fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy.

Example 4. Suppose that random variables X and Y have a discrete joint distribution
with support {0, 1, 2, . . .}² and mass function

fXY(k, l) = p1 p2 (1 − p1)^k (1 − p2)^l

where p1 and p2 are constants with values in (0, 1).


6 This is either the union of a finite number of sets, or of countably infinitely many sets.

Then

P(Y > X) = Σ_{(k,l): 0≤k<l} fXY(k, l) = Σ_{k=0}^∞ Σ_{l=k+1}^∞ p1 p2 (1 − p1)^k (1 − p2)^l
= Σ_{k=0}^∞ p1 (1 − p1)^k (1 − p2)^{k+1} = p1(1 − p2) / (1 − (1 − p1)(1 − p2)).

Example 5. Suppose that X and Y have a continuous joint distribution with density
fXY(x, y) = k(y − x) if 0 ≤ x ≤ y ≤ 1, and fXY(x, y) = 0 otherwise,

where k ∈ R is some constant. Find k, compute P(Y ≥ 2X), and find the marginal
distribution of X.
We can find k via the calculation

1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} fXY(x, y) dx dy = ∫_0^1 (∫_0^y k(y − x) dx) dy = (k/2) ∫_0^1 y² dy = k/6,

so k = 6.

Next we can compute, as follows:

P(Y ≥ 2X) = ∫∫_{y≥2x} fXY(x, y) dx dy = ∫_0^1 (∫_0^{y/2} 6(y − x) dx) dy = (9/4) ∫_0^1 y² dy = 3/4.

Finally X has a continuous distribution with density

∫_{−∞}^{∞} fXY(x, y) dy.

This is zero unless x ∈ [0, 1], in which case we have

∫_{−∞}^{∞} fXY(x, y) dy = ∫_x^1 6(y − x) dy = 3(x − 1)².

So X has density

fX(x) = 3(x − 1)² if 0 ≤ x ≤ 1, and fX(x) = 0 otherwise.
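All three computations can be double-checked numerically. A sketch assuming scipy is available (note that scipy's dblquad integrates the inner variable first and passes it as the first argument of the integrand):

from scipy.integrate import dblquad

f = lambda x, y: 6 * (y - x)   # candidate joint density; inner variable x, outer y

# Total mass: x from 0 to y, then y from 0 to 1; should equal 1.
total, _ = dblquad(f, 0, 1, lambda y: 0.0, lambda y: y)
# P(Y >= 2X): the inner variable x runs from 0 to y/2.
p, _ = dblquad(f, 0, 1, lambda y: 0.0, lambda y: y / 2)

print(total, p)                # 1.0 and 0.75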
Example 6. Let (X, Y, Z) be the co-ordinates of a point chosen uniformly at random from
the cube [0, 1]³ ⊂ R³. Let

M1 = min(X, Y, Z) and M2 = max(X, Y, Z).

Find the density of the joint distribution of M1 and M2 .


First note that the density will be zero outside the set {(m1, m2) ∈ R² : 0 ≤ m1 ≤
m2 ≤ 1} because P(0 ≤ M1 ≤ M2 ≤ 1) = 1.

Consider for 0 ≤ m1 ≤ m2 ≤ 1,

P(M1 ≥ m1 and M2 ≤ m2) = P((X, Y, Z) ∈ [m1, m2]³) = (m2 − m1)³.

This gives us the joint distribution function: for 0 ≤ m1 ≤ m2 ≤ 1, we have

F(m1, m2) = P(M1 ≤ m1, M2 ≤ m2) = P(M2 ≤ m2) − P(M2 ≤ m2, M1 > m1) = m2³ − (m2 − m1)³.

Now we know that the joint density f must satisfy


F(m1, m2) = ∫_{−∞}^{m2} ∫_{−∞}^{m1} f(z1, z2) dz1 dz2,

and hence differentiating first with respect to m2 and then with respect to m1,

f(m1, m2) = ∂²F(m1, m2)/∂m1∂m2.

This gives a density of

f(m1, m2) = 6(m2 − m1) if 0 ≤ m1 ≤ m2 ≤ 1, and f(m1, m2) = 0 otherwise,

The same distribution as in the previous example!
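A Monte Carlo check of the rectangle probability used in this derivation (a sketch assuming numpy):

import numpy as np

rng = np.random.default_rng(3)
pts = rng.uniform(0, 1, size=(10**6, 3))    # points in the cube [0, 1]^3
m1, m2 = pts.min(axis=1), pts.max(axis=1)

a, b = 0.2, 0.7
print(np.mean((m1 >= a) & (m2 <= b)))       # empirical P(M1 >= a and M2 <= b)
print((b - a)**3)                           # exact value; both ~0.125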


Example 7. Suppose that n independent and identical trials are performed each of which
has r possible outcomes with respective probabilities p1 , p2 , . . . , pr which sum to 1. For
example sampling with replacement from a population, individuals of which have one of
r possible types. Let Xi denote the number of the n trials that result in outcome i. Then the
joint distribution of X1, X2, . . . , Xr is a discrete distribution with support

{(n1, n2, . . . , nr) ∈ Z^r : Σ_i ni = n and each ni ≥ 0}.

The joint mass function is given by

P(X1 = n1, X2 = n2, . . . and Xr = nr) = (n! / (n1! n2! . . . nr!)) p1^{n1} p2^{n2} . . . pr^{nr}.

Notice that this formula is not valid for arbitrary non-negative integers n1, n2, . . . , nr; it only
holds if n1 + n2 + . . . + nr = n. This distribution is called the multinomial distribution
with parameters n, r and (p1 , p2 , . . . , pr ).
Suppose we fix i ∈ {1, 2, . . . r}, then what is the marginal distribution of Xi ? There is
no need to do any computations. Simply let Ak be the event that the kth trial results in
outcome i. Then the events A1, A2, . . . , An are independent, each with probability pi of
occurring. Xi is the number of these events that occur, and so has the Binomial distribution
with parameters n and pi .
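The same marginal appears in simulation. A sketch assuming numpy, with illustrative parameter values:

import numpy as np

rng = np.random.default_rng(4)
n, p = 20, [0.5, 0.3, 0.2]                  # n trials, r = 3 outcome probabilities
samples = rng.multinomial(n, p, size=10**5)

x1 = samples[:, 0]                          # count of outcome 1 in each sample
print(x1.mean(), x1.var())                  # ~ n*p1 = 10 and n*p1*(1 - p1) = 5,
                                            # as for the Binomial(n, p1) distribution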
We finish this lecture by stating a proposition without proof which allows the compu-
tation of expected values from joint distributions.

Proposition 5. If X and Y each have discrete distributions with supports X and Y
respectively and joint mass function fXY, then the expected value of g(X, Y) is given by

E[g(X, Y)] = Σ_{x∈X, y∈Y} g(x, y) fXY(x, y),

provided this sum is absolutely convergent, meaning: Σ_{x∈X, y∈Y} |g(x, y)| fXY(x, y) < ∞.
If X and Y have a continuous joint distribution with density function fXY then the
expected value of g(X, Y) is given by

E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fXY(x, y) dx dy,

provided this integral is absolutely convergent, meaning: ∫_{−∞}^{∞} ∫_{−∞}^{∞} |g(x, y)| fXY(x, y) dx dy < ∞.

1.5 Dependence and Independence
Whereas from knowledge of the joint distribution one may determine the marginal dis-
tributions of each of X and Y, it is not true that the two marginal distributions determine
the joint distribution!
Suppose that X and Y have distributions specified by
P(X = 0) = P(X = 1) = 1/2,
P(Y = 0) = P(Y = 1) = 1/2.
The joint distributions which are consistent with this information are described by

P(X = 0 and Y = 0) = θ, P(X = 0 and Y = 1) = 1/2 − θ,


P(X = 1 and Y = 0) = 1/2 − θ, P(X = 1 and Y = 1) = θ,
where θ ∈ [0, 1/2]. Notice that in the extreme case θ = 1/2 we have P(X = Y) = 1, and if
θ = 0 then P(X = 1 − Y) = 1. More generally if θ > 1/4 then X and Y tend to agree in
value, whereas if θ < 1/4 then they tend to disagree in value.
Suppose X and Y are each uniformly distributed on the interval [0, 1]. It's not possible
to describe simply all possible joint distributions they may have, but here are three
examples.
(i) (X, Y) is uniformly distributed on the square [0, 1]².
(ii) (X, Y) is uniformly distributed on [0, 1/2]² ∪ [1/2, 1]².
(iii) (X, Y) is uniformly distributed on the diagonal {(x, y) ∈ R² : 0 ≤ x = y ≤ 1}.
In this last case the joint distribution is neither discrete nor continuous: there is no joint
density function such that definition 6 is satisfied.
Definition 7. Two random variables X and Y (defined on a common sample space) are
called independent if

P(X ∈ B and Y ∈ B′) = P(X ∈ B)P(Y ∈ B′),

for all choices B, B′ ⊆ R.
Otherwise said, for all choices B, B′ ⊆ R, the events {X ∈ B} and {Y ∈ B′} are
independent. This captures the informal idea that the value of X that is observed does not
influence the value of Y that is observed, and vice versa.
In practice this definition is hard to check because there are so many possible choices of
the subsets B and B′. However we can give some equivalent conditions for independence
that are easier to check. Before we do so, let's define the joint distribution function of a
pair of random variables X and Y to be the function FXY : R² → [0, 1] given by
FXY(x, y) = P(X ≤ x and Y ≤ y) for (x, y) ∈ R².
Then we state but do not prove7 the following fact.
7 We saw in example 6 of the last lecture how, if the joint distribution is continuous, its density
can be calculated from its distribution function, but the general proposition is much harder to prove.

Proposition 6. The joint distribution of X and Y is determined by its distribution func-
tion.

You can think of this proposition as saying: suppose you know the values of P(X ∈
B and Y ∈ B′) for all choices of B and B′ of the form (−∞, x] and (−∞, y]; then, in
theory, you can calculate from this information the probability P(X ∈ B and Y ∈ B′)
for a general choice of B and B′. In the light of this, the next proposition, which we state
but also don’t prove, is not surprising.

Proposition 7. X and Y are independent if and only if their joint distribution function
satisfies
FXY(x, y) = FX(x)FY(y) for all x, y ∈ R,
where FX and FY are the distribution functions of X and Y respectively.

In the case we assume the joint distribution is either discrete or continuous we can
give further characterizations of independence.

Proposition 8. If the joint distribution of X and Y is discrete then their being indepen-
dent is equivalent to the mass function satisfying

fXY (x, y) = fX (x)fY (y) for all x and y in the support of X and Y respectively,

where fX and fY are the mass functions of X and Y .

Proposition 9. If the joint distribution of X and Y is continuous then their being inde-
pendent is equivalent to some version8 of the joint density satisfying

fXY(x, y) = fX(x)fY(y) for all x, y ∈ R,

where fX and fY are versions of the density functions of X and Y.

In the examples of joint distributions given at the beginning of this lecture, we saw
two instances of random variables X and Y which are independent. Firstly when their
discrete joint distribution was given by

P(X = 0 and Y = 0) = P(X = 0 and Y = 1) = P(X = 1 and Y = 0) = P(X = 1 and Y = 1) = 1/4.

Then secondly when the continuous joint distribution was uniformly distributed on the
square [0, 1]².
Example 8. Here is an example of checking independence from a given joint density.
Suppose that X and Y have a joint distribution with density

fXY(x, y) = e^{−(x+y)} if x ≥ 0 and y ≥ 0, and fXY(x, y) = 0 otherwise.
8 Recall the density of a distribution isn't unique because its value can be changed arbitrarily on a
“small” set of points. So by some version of the density we mean some choice of the density.

Then the density of the marginal distribution of X is computed as

fX(x) = ∫_{−∞}^{∞} fXY(x, y) dy = ∫_0^∞ e^{−(x+y)} dy = e^{−x} if x ≥ 0, and fX(x) = 0 otherwise.

Similarly we find fY(y) = e^{−y} if y ≥ 0 and zero otherwise. Thus we have fXY(x, y) =
fX(x)fY(y) for all x, y ∈ R, and so X and Y are independent.
Independence of more than two random variables is defined as follows. X1, X2, . . . , Xn
is a sequence of independent random variables if the events

{X1 ∈ B1 }, {X2 ∈ B2 }, . . . , {Xn ∈ Bn } are mutually independent,

for all choices of subsets B1 , B2 , . . . Bn ⊆ R. All the propositions we have stated above
for a pair of random variables have straightforward generalizations to n random variables.
We finish with a proof of the convolution formula for the density of a sum of indepen-
dent random variables with continuous distributions.

Proposition 10. Suppose X and Y are independent random variables with continuous
distributions having densities fX and fY respectively. Then the random variable X + Y
has a continuous distribution with density
fX+Y(x) = ∫_{−∞}^{∞} fX(x − y) fY(y) dy for x ∈ R.

Proof. The probability P(X + Y ≤ z) can be written as P((X, Y ) ∈ B) for the region B
given by
B = {(x, y) ∈ R² : x + y ≤ z}.
Since X and Y are independent their joint density is fXY(x, y) = fX(x)fY(y), and hence

P(X + Y ≤ z) = ∫∫_B fXY(x, y) dx dy = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{z} fX(x − y) fY(y) dx dy = ∫_{−∞}^{z} ∫_{−∞}^{∞} fX(x − y) fY(y) dy dx
= ∫_{−∞}^{z} fX+Y(x) dx,

which shows fX+Y to be the density of the distribution of X + Y.
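The convolution formula is easy to check on a grid: convolving the exponential(α) density with itself should reproduce the Gamma density with shape 2 from Section 1.3. A numerical sketch assuming numpy:

import numpy as np

alpha, dx = 1.0, 0.001
x = np.arange(0, 20, dx)
f = alpha * np.exp(-alpha * x)               # exponential density sampled on a grid

# Riemann-sum approximation of (f * f)(x) = integral of f(x - y) f(y) dy.
conv = np.convolve(f, f)[:len(x)] * dx

gamma2 = alpha**2 * x * np.exp(-alpha * x)   # Gamma density with shape 2, rate alpha
print(np.max(np.abs(conv - gamma2)))         # small; only discretization error remains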

1.6 Covariance
When random variables are independent, the variance of their sum is given by the sum
of the variances of each variable, as stated in Proposition 4. So what can we say about
the variance of a sum of random variables which are not independent? We need some
more information, which comes from the joint distribution, to calculate the variance of a
sum in general.
Definition 8. The covariance of two random variables X and Y is defined to be

cov(X, Y ) = E[(X − µX )(Y − µY )],

where µX = E[X] and µY = E[Y ].


Notice that cov(X, X) = var(X). Expanding the bracket in the definition of covari-
ance, analogously to the proof of Proposition 3, gives the alternative expression

cov(X, Y ) = E[XY ] − E[X]E[Y ].

With this definition we can calculate, as in the proof of Proposition 4,


var(X + Y) = E[((X + Y) − (µX + µY))²] = E[((X − µX) + (Y − µY))²]
= E[(X − µX)² + 2(X − µX)(Y − µY) + (Y − µY)²]
= E[(X − µX)²] + 2E[(X − µX)(Y − µY)] + E[(Y − µY)²] = var(X) + 2cov(X, Y) + var(Y).

This calculation generalizes to the following bilinearity property of covariance, which we


will use a lot in the future. You can remember the formula by noticing the analogy with
expanding the algebraic expression (x1 + y1 )(x2 + y2 ).
Proposition 11. Suppose X1, X2, Y1 and Y2 are random variables, then

cov(X1 + Y1 , X2 + Y2 ) = cov(X1 , X2 ) + cov(X1 , Y2 ) + cov(Y1 , X2 ) + cov(Y1 , Y2 ).

Also if a and b are constants and X and Y are random variables, then,

cov(aX, bY ) = ab cov(X, Y ).

Notice that if X1 = X2 and Y1 = Y2 then the first result stated in the proposition
agrees with our formula for var (X + Y ), and the proof is very similar: simply write the
covariance as an expectation and use the linearity of expectations.
What properties of the joint distribution are described by cov(X, Y )? We know that
if X and Y are independent, then using Fubini as in the proof of Proposition 4 gives
cov(X, Y ) = 0. But there exist pairs of random variables X and Y that are not indepen-
dent with cov(X, Y ) = 0 also. Consider the following example.
Let (X, Y ) be the co-ordinates of a point chosen uniformly at random from the unit
disc {(x, y) ∈ R² : x² + y² ≤ 1}. The density of the joint distribution is just

fXY(x, y) = 1/π if x² + y² ≤ 1, and fXY(x, y) = 0 otherwise.

Because the disc is symmetric E[X] = E[Y ] = 0. If this isn’t clear immediately, then
consider the density of the marginal distribution of X: it is an even function. Now
consider

E[XY] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy fXY(x, y) dx dy.
Breaking this integral up into the sum of four integrals, each over one of the four quadrants
of the plane, shows it to be 0 too. Thus cov(X, Y ) = 0, and yet X and Y are not
independent.
We can develop this example to gain some insight into non-zero covariances. Suppose
we define two new random variables U and V via
U = aX + bY and V = bX + aY,
where a and b are non random constants. Let us consider the joint distribution of U and
V: the transformation (x, y) ↦ (ax + by, bx + ay) is a linear transformation that maps the
unit disc to an ellipse. Because the transformation is linear, (X, Y ) having the uniform
distribution on the disc is transformed into (U, V) having a uniform distribution on the
ellipse. In fact U and V have joint density
fUV(u, v) = 1/(π|a² − b²|) if (a² + b²)(u² + v²) − 4abuv ≤ (a² − b²)², and fUV(u, v) = 0 otherwise.
The exact form of this density does not matter to us because using linearity we can
compute easily that E[U ] = E[V ] = 0, and that,
cov(U, V) = E[UV] = E[(aX + bY)(bX + aY)] = abE[X²] + (a² + b²)E[XY] + abE[Y²].
Recall that E[XY] = 0, and we can also compute9 that E[X²] = E[Y²] = 1/4. Thus we
obtain
cov(U, V ) = ab/2.
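This value is easily confirmed by simulation. A sketch assuming numpy, using rejection sampling to draw points uniformly from the disc:

import numpy as np

rng = np.random.default_rng(5)
pts = rng.uniform(-1, 1, size=(4 * 10**6, 2))
pts = pts[(pts**2).sum(axis=1) <= 1]       # keep only points inside the unit disc
X, Y = pts[:, 0], pts[:, 1]

a, b = 2.0, 1.0                            # illustrative parameter values
U, V = a * X + b * Y, b * X + a * Y
print(np.cov(U, V)[0, 1], a * b / 2)       # both close to 1.0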

Figure 1.1 shows the ellipse for varying values of the parameters a and b. There are
two cases we will consider: case (a) that a > b > 0, illustrated on the top row of the
figure and case (b) a > −b > 0, illustrated on the bottom row. In case (a) the major axis
of the ellipse is the line {v = u} and the minor axis is the line {v = −u}. In case (b) the
major axis of the ellipse is the line {v = −u} and the minor axis is the line {v = u}. Note
that the covariance of U and V is positive in case (a) and negative in case (b). In case
(a) values of U and V that are either larger than average (i.e. positive) or smaller than
average (i.e. negative) have a tendency to occur together. This leads to U and V tending
to reinforce each other in the sum U + V, and var(U + V) > var(U) + var(V). In case (b),
positive values of U tend to occur with negative values of V and vice versa. Thus U and V
have a tendency to (partially) cancel in the sum U + V, and var(U + V) < var(U) + var(V).
How strong these effects are depends on the shape, but not the size of the ellipse.
Scaling a and b by some constant does not alter the shape of the ellipse but only its
size. To obtain a quantity that depends only on the shape of the ellipse we consider the
covariance of U and V appropriately scaled using their variances.
9 We found the distribution of X in lecture 2.

Figure 1.1: The random variable (U, V ) is uniformly distributed within an ellipse. This
is illustrated for different values of the parameters a and b. Top left: a = 1.8, b = 1.33.
Top middle: a = 2.0, b = 1.0, Top right: a = 2.2, b = 0.4. Btm left: a = 1.8, b = −1.33.
Btm middle: a = 2.0, b = −1.0, Btm right: a = 2.2, b = −0.4.


Definition 9. Suppose X and Y are random variables with non-zero variances, then we
define the correlation of X and Y to be

cov(X, Y) / √(var(X) var(Y)).

In our example,

var(V) = var(U) = E[(aX + bY)²] = a²E[X²] + 2abE[XY] + b²E[Y²] = (a² + b²)/4,

and so the correlation between U and V is 2ab/(a² + b²). Notice that this value always
lies in the range [−1, +1]. In Figure 1.1 the correlation between U and V is positive on
the top row, decreasing from right to left, and negative on the bottom row, increasing
from right to left.
In fact the correlation of any pair of random variables always lies in the range [−1, +1].
This follows from the next proposition. Moreover a value of the correlation close to +1
indicates that the mass of the joint distribution is concentrated near a line in the plane
having positive gradient, and a value of the correlation close to −1 indicates that the mass
of the joint distribution is concentrated near a line in the plane having negative gradient.

Proposition 12 (Cauchy-Schwarz inequality). For any random variables X and Y
such that the expectations exist,

(E[XY])² ≤ E[X²]E[Y²],

with equality if and only if there exists a c ∈ R so that P(Y = cX) = 1.

Proof. Consider

E[(X + tY)²] = E[X²] + 2tE[XY] + t²E[Y²] ≥ 0

by positivity of expectation. We can assume E[Y²] > 0, otherwise the claimed inequality
is trivially true because P(Y = 0) = 1. Take t = −E[XY]/E[Y²], which minimizes the
quadratic expression. Then we obtain

E[X²] − (E[XY])²/E[Y²] ≥ 0,

which rearranges to give the result. If we have equality, then, for this choice of t, we have
E[(X + tY)²] = 0, which implies P(X + tY = 0) = 1.
By applying the Cauchy-Schwarz inequality to random variables X − µX and Y − µY
we obtain

(cov(X, Y))² ≤ var(X) var(Y),
from which it follows immediately that correlation lies in the range [−1, +1].

1.7 Transformations of densities
Suppose X has a continuous distribution with density fX and g : R → R is some function
which we will assume is continuous, increasing and a bijection of R. We also assume
h = g⁻¹ is differentiable. Let Y be the random variable Y = g(X). How do we calculate
the distribution of Y from fX and g?
For a < b we have10

P(a ≤ Y ≤ b) = P(a ≤ g(X) ≤ b) = P(g⁻¹(a) ≤ X ≤ g⁻¹(b))
= ∫_{g⁻¹(a)}^{g⁻¹(b)} fX(x) dx = ∫_a^b fX(g⁻¹(y)) (1/g′(g⁻¹(y))) dy,

where we make the substitution y = g(x) in the integral. Since this holds for all a < b
we deduce that Y has a continuous distribution with density
fY(y) = (1/g′(g⁻¹(y))) fX(g⁻¹(y)) = h′(y) fX(h(y)). (1.1)

A very important special case is Y = aX + b where a > 0 and b ∈ R, and then we have

fY(y) = (1/a) fX((y − b)/a).
We can understand the factor 1/g′(g⁻¹(y)) appearing in (1.1) intuitively by remembering that
fX and fY are densities. The function g stretches or contracts space (the real line) near
a point x according to whether g′(x) > 1 or g′(x) < 1. If g stretches space, then the same
amount of mass (probability) is spread more thinly, and so its density falls. If g contracts
space then the density rises.
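Formula (1.1) can be tested numerically. A sketch assuming numpy and scipy, taking X standard Gaussian and g(x) = x³, a continuous increasing bijection of R:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rng = np.random.default_rng(6)
y = rng.normal(size=10**6) ** 3          # samples of Y = g(X) = X^3

def f_Y(t):
    # formula (1.1): f_Y(t) = h'(t) f_X(h(t)), with h(t) the (signed) cube root
    h = np.sign(t) * np.abs(t) ** (1 / 3)
    return np.abs(t) ** (-2 / 3) / 3 * norm.pdf(h)

a, b = 0.5, 2.0
print(np.mean((a <= y) & (y <= b)))      # empirical P(a <= Y <= b), ~0.110
print(quad(f_Y, a, b)[0])                # integral of the claimed density; agrees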
Now let's consider how joint densities behave when we transform variables. There is
a change of variable formula for multidimensional integrals which involves the Jacobian
matrix. Let us recall11 the definition of this. Suppose g is a differentiable R²-valued
function defined on R² (or some subset of R²). Write

g(x, y) = (g1 (x, y), g2 (x, y))

where g1 and g2 are R-valued functions. Then the Jacobian matrix of g at a point
(x, y) ∈ R² is the matrix

Jg(x, y) = ( ∂g1/∂x  ∂g1/∂y )
           ( ∂g2/∂x  ∂g2/∂y )

and the Jacobian of g at the point (x, y) is the determinant of this matrix, det(Jg(x, y)).
Now we can state our formula for the transformation of joint densities.
10 The second equality here depends on g being increasing and a bijection. For functions such as
g(x) = x² or g(x) = e^{−x} it is necessary to modify the argument, and the formula (1.1) does not hold in
general.
11 If you haven't met this already, then you will soon in ST208.

Proposition 13. Suppose X and Y have a joint distribution with density fXY. Suppose
that D ⊆ R² is such that P((X, Y) ∈ D) = 1 and g : D → D′ is a bijection between D
and D′ ⊆ R² with non-vanishing Jacobian. Then the random variables U and V defined
by (U, V) = g(X, Y) have a joint distribution with density

fUV(u, v) = (1/|det(Jg(g⁻¹(u, v)))|) fXY(g⁻¹(u, v)) if (u, v) ∈ D′, and fUV(u, v) = 0 otherwise.

Notice the modulus of the determinant in the statement! There is an extension of this
result to more than two variables; it is entirely as you would expect.
Example 9. In the last lecture we considered the case that (X, Y ) was uniformly dis-
tributed in the disc D = {(x, y) ∈ R² : x² + y² ≤ 1} and we defined random variables U
and V by
U = aX + bY and V = bX + aY,
where a and b are constants. We can put this into the framework of the proposition by
defining g : R² → R² by

g(x, y) = (g1 (x, y), g2 (x, y)) = (ax + by, bx + ay).

Calculating the Jacobian matrix of g we obtain

Jg(x, y) = ( ∂g1/∂x  ∂g1/∂y ) = ( a  b )
           ( ∂g2/∂x  ∂g2/∂y )   ( b  a )

It doesn’t depend on x and y! This is a really important point: the Jacobian matrix of a
linear transformation is a constant matrix ( compare with the derivative of a linear func-
tion of one variable), and actually its the same matrix as describes the transformation.
Geometrically this means linear transformations stretch and contract space in the same
way everywhere12 .
Now the boundary of D is described by the equation x² + y² = 1. We need to find
the image of the disc under the transformation given by g. We do this first by finding
the inverse of g: if u = ax + by and v = bx + ay, then x = (au − bv)/(a² − b²) and
y = (−bu + av)/(a² − b²). Substituting into the equation of the circle we find that the
boundary of D′ = g(D) is described by the equation

((au − bv)/(a² − b²))² + ((−bu + av)/(a² − b²))² = 1,

which simplifies to

(a² + b²)(u² + v²) − 4abuv = (a² − b²)².
12 The determinant of a matrix tells you how the corresponding linear transformation scales area (or,
in higher dimensions, volume). But Proposition 13 also deals with transformations that are not linear.
In this case, at a point (x, y), the Jacobian matrix is telling you what linear transformation approximates
g close to the point (x, y). And then the determinant defining the Jacobian calculates how this
approximating linear transformation scales area.

This we recognise13 as the equation of an ellipse. What is the area of this ellipse? Because
the Jacobian matrix of g is constant it must be given by

|det(Jg)| × Area of D = |a² − b²| π.

What happens if a = b, when the area of the ellipse is zero? In this case the transformation
g is singular and not invertible: the image of the plane R² is just the line {(u, v) ∈ R² :
u = v}. The proposition does not apply, and the joint distribution of U and V does not
have a density.
In order to use Proposition 13 we need to calculate the inverse g⁻¹ of our transfor-
mation g. In some cases, having done this, it is easier to compute the Jacobian we need
from g⁻¹ rather than from g, using the fact that

Jg(g⁻¹(u, v)) is the inverse of the matrix Jg⁻¹(u, v),

and hence

det(Jg(g⁻¹(u, v))) = 1 / det(Jg⁻¹(u, v)).
We will see how this works in the next example.
Example 10. Suppose X and Y are independent and each have the exponential distribution
with parameter α > 0. Let U = X + Y and V = X/(X + Y ). We want to find the joint
distribution of U and V. So let g : (0, ∞)² → (0, ∞) × (0, 1) be the function

g(x, y) = (g1 (x, y), g2 (x, y)) = (x + y, x/(x + y)).

This has inverse h = g⁻¹ given by

h(u, v) = (h1(u, v), h2(u, v)) = (uv, u(1 − v)).

The Jacobian matrix of h is then calculated14 as

Jh(u, v) = ( ∂h1/∂u  ∂h1/∂v ) = ( v      u  )
           ( ∂h2/∂u  ∂h2/∂v )   ( 1 − v  −u )

which has determinant −u. Now the joint density of X and Y is α²e^{−α(x+y)} on the set
{(x, y) ∈ R² : x > 0, y > 0} and zero otherwise. Consequently the proposition gives the
joint density of U and V as

fUV(u, v) = α²ue^{−αu} if u > 0 and 0 < v < 1, and fUV(u, v) = 0 otherwise.

Thus U has a Gamma distribution15 with shape 2 and rate α, V is uniformly distributed


on (0, 1), and U and V are independent. Nice answer!
13 I do, maybe you didn't at first: but you've just learnt to! Actually a quadratic expression in two
variables can in general describe an ellipse, a parabola or a hyperbola, and telling which from the equation
is a bit tricky.
14 You have to be very careful calculating the derivatives; it is easy to make mistakes. For example
∂(uv)/∂u = v, because v behaves like a constant when differentiating with respect to u.
15 As we knew it would have, from Lecture 3.
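The nice answer invites a quick simulation. A sketch assuming numpy, with an illustrative value of α:

import numpy as np

rng = np.random.default_rng(7)
alpha, n = 1.5, 10**6
x = rng.exponential(1 / alpha, n)     # numpy parametrizes by the mean 1/alpha
y = rng.exponential(1 / alpha, n)
u, v = x + y, x / (x + y)

print(u.mean(), 2 / alpha)            # Gamma with shape 2 and rate alpha has mean 2/alpha
print(v.mean(), v.var())              # ~1/2 and ~1/12, as for the uniform on (0, 1)
print(np.corrcoef(u, v)[0, 1])        # ~0, consistent with independence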

Chapter 2

The Multivariate Normal Distribution
2.1 The standard Gaussian distribution
Definition 10. A random vector1 Z = (Z1, Z2, . . . , Zn)^T has the standard Gaussian2 dis-
tribution in R^n if the joint distribution of Z1, Z2, . . . , Zn is continuous with density

f(z1, z2, . . . , zn) = (1/(2π)^{n/2}) exp(−(1/2) Σ_{i=1}^n zi²) for (z1, z2, . . . , zn) ∈ R^n.

This distribution is very special for it combines probability and geometry in a unique
way.
(a) The random variables Z1, Z2, . . . , Zn are independent because their joint density fac-
torizes:

(1/(2π)^{n/2}) exp(−(1/2) Σ_{i=1}^n zi²) = Π_{i=1}^n (1/√(2π)) exp(−zi²/2).

Notice this also implies that each component Zi has the standard Gaussian distri-
bution on R.
(b) The joint density is a function of Σ_{i=1}^n zi², which by Pythagoras is the squared
distance of the point (z1, z2, . . . , zn) to the origin. Consequently the joint density is
rotationally symmetric.

A linear transformation of R^n is a rotation only if it can be written3 as z ↦ Oz where O
is an orthogonal n × n matrix4. An orthogonal matrix is a matrix satisfying

O^T O = OO^T = I, (2.1)

where I denotes the identity matrix. Another way of expressing this is to say the rows
(and the columns too) of O form an orthonormal basis for R^n. Computing the determinant
of both sides of the above equation gives 1 = det(O^T O) = det(O^T) det(O) = (det(O))².
So the determinant of an orthogonal matrix is equal to either 1 or −1.
The following proposition formalises the observation (b) made above.
Proposition 14. Suppose that Z has the standard Gaussian distribution in R^n and O
is a non-random orthogonal n × n matrix. Let W = OZ. Then W has the standard
Gaussian distribution in R^n too.

Proof. Because |det(O)| = 1, and the map z ↦ Oz has inverse w ↦ O^T w, Proposition
13⁵ implies that the distribution of the vector W has density

fW(w) = fZ(O^T w)
1 By a random vector we just mean a vector whose components are random variables. Our vector will
be a column vector, and so it can be written, as here, as the transpose of a row vector.
2 Gaussian, named after Carl Friedrich Gauss, one of the greatest mathematicians of all time. Also known
more prosaically as the normal distribution.
3 z is an n-dimensional column vector in what follows.
4 Not all orthogonal transformations are rotations; some are reflections.
5 Actually this is the generalization of Proposition 13 to n variables. Remember too that in example
9 we saw that the Jacobian of a linear transformation is the determinant of the matrix describing the
transformation, so in this case O.

where fZ(z) denotes the density of Z. Now notice that if z = (z1, z2, . . . , zn)^T is an n-dimen-
sional column vector then z^T z = Σ_{i=1}^n zi², and if z = O^T w then6 z^T z = (O^T w)^T (O^T w) =
w^T OO^T w = w^T w. Consequently the density of W is given by

fW(w) = (1/(2π)^{n/2}) exp(−w^T w/2),
and W thus has the standard Gaussian distribution.


Here are two very significant consequences of this proposition. Suppose Z has the
standard Gaussian distribution in R^n and a is a non-random n-dimensional column vector
of length one. Then
a^T Z = Σ_{i=1}^n ai Zi has the standard Gaussian distribution.

To deduce this from the proposition we just choose an orthogonal matrix O whose first
row7 is a^T. Then the first component of W = OZ will be exactly a^T Z. Further suppose
that b is another non-random n-dimensional column vector of length one and that a^T b = 0;
then

the random variables a^T Z = Σ_{i=1}^n ai Zi and b^T Z = Σ_{i=1}^n bi Zi are independent.

To deduce this, just make a^T and b^T the first and second rows of O respectively8.

6 I'm using properties of the transpose here: (AB)^T = B^T A^T and (A^T)^T = A.
7 We can always do this, by results from linear algebra.
8 Again we can always do this, by results from linear algebra: any collection of orthonormal vectors
can be extended to an orthonormal basis.
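Both consequences are visible in simulation. A sketch assuming numpy, with one particular orthonormal pair a, b in R³:

import numpy as np

rng = np.random.default_rng(8)
z = rng.normal(size=(10**6, 3))                # standard Gaussian vectors in R^3

a = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)     # unit length
b = np.array([1.0, -1.0, 0.0]) / np.sqrt(2)    # unit length, with a^T b = 0

az, bz = z @ a, z @ b
print(az.mean(), az.var())                     # ~0 and ~1: a^T Z is standard Gaussian
print(np.corrcoef(az, bz)[0, 1])               # ~0: the two projections are uncorrelated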

2.2 The general Gaussian distribution
2.2.1 Definition
Recall that if X has a general Gaussian distribution on the real line, then it can be
constructed from a random variable Z having the standard Gaussian distribution on R
by a combination of a scaling and a translation:

X = aZ + b

where a and b are real constants and we can take a > 0. In fact this is really the
definition of general Gaussian distribution on R: any distribution that can be obtained
by scaling and translating the standard Gaussian distribution. Notice also that we then
have E[X] = b and var(X) = a², so knowing the mean and variance of X, together with
knowing it has a Gaussian distribution, completely specifies its distribution. We want to
do something similar in higher dimensions.

Definition 11. Let b = (b1, b2, . . . , bn)^T be a constant n-dimensional vector, and A =
(Aij ; 1 ≤ i ≤ n, 1 ≤ j ≤ m) be an n × m non-random matrix. Suppose that Z has the
standard Gaussian distribution in R^m and let X = (X1, X2, . . . , Xn)^T be the n-dimensional
random vector given by
X = AZ + b.
Then we say X has a Gaussian distribution in R^n, and that X1, X2, . . . , Xn have a joint
distribution which is Gaussian.

There is something very subtle about this definition9 which is that we don’t assume
m = n. We could assume m = n; a priori that might lead to a smaller family of
distributions being called Gaussian, but in fact it wouldn’t: we would in fact define the
same family that way. The advantage of allowing m to be different from n comes later,
when we prove a theorem which describes the properties of the Gaussian distribution. We
pay a price: the theorem in the next lecture is harder with our definition.
We want to generalise the fact that mean and variance determine a Gaussian distri-
bution in one dimension. So we first need to calculate the higher dimensional analogues
of mean and variance.
Suppose that X = AZ + b as in the definition. Then, for each 1 ≤ i ≤ n,

E[Xi] = E[Σ_{j=1}^m Aij Zj + bi] = bi. (2.2)

We say that the mean vector of X is

(E[X1], E[X2], . . . , E[Xn])^T = b.


9 It's important to understand that there may be many different ways of defining what turns out to be
the same mathematical object. Think about defining the sine function; this can be done geometrically
using triangles or via a power series in analysis. It turns out to be the same function in the end. Here
we are defining a certain family of distributions.

Next we can calculate, for 1 ≤ i ≤ n and 1 ≤ j ≤ n,

cov(Xi, Xj) = E[(Σ_{k=1}^m Aik Zk)(Σ_{l=1}^m Ajl Zl)]
= E[Σ_{k=1}^m Σ_{l=1}^m Aik Ajl Zk Zl]
= Σ_{k=1}^m Σ_{l=1}^m Aik Ajl E[Zk Zl] = Σ_{k=1}^m Aik Ajk. (2.3)

And notice that this is exactly the ijth entry of the matrix AA^T. We say that the
variance-covariance matrix of X is the n × n matrix Σ with entries Σij = cov(Xi, Xj).
And so we have Σ = AA^T. Notice that Σ is a symmetric matrix, Σij = Σji, with its
diagonal elements given by the variances of X1 , X2 , . . . , Xn .
Given a random vector X = (X1, X2, . . . , Xn)^T with mean vector b and variance-
covariance matrix Σ we can easily calculate the expected value and variance of a random
variable Y of the form

Y = v^T X = Σ_{i=1}^n vi Xi,

where v = (v1, . . . , vn)^T is some given constant vector. First we have

E[v^T X] = E[Σ_{i=1}^n vi Xi] = Σ_{i=1}^n vi E[Xi] = v^T b.

Then secondly we have

var(v^T X) = var(Σ_{i=1}^n vi Xi) = Σ_{i,j=1}^n vi vj cov(Xi, Xj) = v^T Σv.

This establishes another very important property of the matrix Σ: it is non-negative definite, which means that

    v^T Σ v ≥ 0 for all n-dimensional vectors v.    (2.4)
This holds, of course, because we have just seen that the quantity v^T Σ v is the variance of a random variable.
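Here is a quick numerical sanity check of these two calculations, written in Python with NumPy (an added illustration, not part of the printed notes; the particular A and b are arbitrary choices): simulating X = AZ + b many times, the sample mean should approach b and the sample covariance should approach AA^T.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, N = 3, 5, 200_000
    A = rng.normal(size=(n, m))        # an arbitrary n x m matrix
    b = np.array([1.0, -2.0, 0.5])

    Z = rng.standard_normal((N, m))    # N independent copies of Z in R^m
    X = Z @ A.T + b                    # each row is one realization of X = A Z + b

    print(np.abs(X.mean(axis=0) - b).max())                  # close to 0
    print(np.abs(np.cov(X, rowvar=False) - A @ A.T).max())   # close to 0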
2.2.2 Characterization by mean and variance-covariance matrix
Theorem 1. The distribution of a random vector X = (X_1, X_2, . . . , X_n)^T having a Gaussian distribution¹⁰ is characterized by its mean vector and variance-covariance matrix. Moreover, for any vector µ ∈ R^n and non-negative definite, symmetric n × n matrix Σ, there exists a random vector X having a Gaussian distribution with mean vector µ and variance-covariance matrix Σ.
For the proof we are going to use moment generating functions. If X = (X_1, X_2, . . . , X_n)^T is a random vector then its moment generating function is the function φ_X defined by

    φ_X(t_1, t_2, . . . , t_n) = E[ exp( Σ_{i=1}^n t_i X_i ) ]    (2.5)

whenever the expectation exists (and is finite). If X and Y are two n-dimensional random vectors and their moment generating functions exist and are equal (in a neighbourhood of the origin) then¹¹ X and Y have the same distribution.
Proof of theorem. Recall that if Z has the standard Gaussian distribution on R, then

    E[e^{tZ}] = e^{t²/2}.
Consequently if (Z_1, Z_2, . . . , Z_n)^T has the standard Gaussian distribution on R^n, then

    E[ e^{t_1 Z_1 + t_2 Z_2 + . . . + t_n Z_n} ] = Π_{i=1}^n E[ e^{t_i Z_i} ] = e^{(t_1² + t_2² + . . . + t_n²)/2}.
More generally if X = (X_1, X_2, . . . , X_n)^T has any Gaussian distribution on R^n, then by Definition 11, it can be written as X = AZ + b where Z has the standard Gaussian distribution on R^m for some m, A is an n × m constant matrix and b a constant vector of dimension n. Consequently, using the independence of Z_1, . . . , Z_m, we can calculate that

    E[ exp( Σ_{i=1}^n t_i X_i ) ] = E[ exp( Σ_{i=1}^n Σ_{k=1}^m t_i A_ik Z_k + Σ_{i=1}^n t_i b_i ) ]
      = exp( Σ_{i=1}^n t_i b_i ) E[ exp( Σ_{k=1}^m ( Σ_{i=1}^n t_i A_ik ) Z_k ) ]
      = exp( Σ_{i=1}^n t_i b_i ) Π_{k=1}^m E[ exp( ( Σ_{i=1}^n t_i A_ik ) Z_k ) ]
      = exp( Σ_{i=1}^n t_i b_i ) Π_{k=1}^m exp( (1/2)( Σ_{i=1}^n t_i A_ik )² )
      = exp( Σ_{i=1}^n t_i b_i + (1/2) Σ_{k=1}^m ( Σ_{i=1}^n t_i A_ik )² ).
¹⁰ If n ≥ 2 we often say a multivariate Gaussian distribution.
¹¹ This is a very deep result from analysis, even when n = 1.
This last expression can be written as

    exp( t^T b + t^T Σ t / 2 ),

where t = (t_1, t_2, . . . , t_n)^T and Σ = AA^T. Thus the moment generating function, and hence the distribution, of X is determined by b, its mean vector, and Σ, its variance-covariance matrix.
For the second half of the theorem, suppose we are given an n-dimensional vector b and a non-negative definite matrix Σ. We recall that the symmetric matrix Σ can be written¹² as O^T Λ O where O is an orthogonal matrix and Λ is diagonal. Since Σ is supposed to be non-negative definite¹³, the entries of Λ must all be non-negative, and we define the n × n matrix A by A = O^T Λ^{1/2}, where Λ^{1/2} denotes the diagonal matrix whose entries are the square roots of the entries of Λ. Then let X = AZ + b where Z has the standard Gaussian distribution on R^n. By Definition 11, X has a Gaussian distribution, and by the calculation we made in the previous lecture, the mean vector of X is b and its variance-covariance matrix is AA^T = (O^T Λ^{1/2})(O^T Λ^{1/2})^T = O^T Λ O = Σ.
¹² Take the diagonal entries of Λ to be the eigenvalues of Σ and the rows of O to be the corresponding eigenvectors, normalized to have length 1. It is then easy to check that O^T Λ O is a matrix with these same eigenvalues and eigenvectors, and so must be equal to Σ.
¹³ Recall from the last lecture that a symmetric matrix Σ is called non-negative definite if v^T Σ v ≥ 0 for all vectors v; this is equivalent to saying all the eigenvalues of Σ are non-negative.
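The second half of this proof is also a practical recipe for sampling from any Gaussian distribution. Here is a sketch in Python with NumPy (an added illustration; the function name is my own): it builds A = O^T Λ^{1/2} from an eigendecomposition of Σ exactly as in the proof. Note that NumPy's eigh returns eigenvectors as columns, i.e. it returns the matrix O^T directly.

    import numpy as np

    def sample_gaussian(b, Sigma, size, rng):
        # Sigma = O^T Lambda O; eigh returns the eigenvector matrix V = O^T.
        lam, V = np.linalg.eigh(Sigma)
        lam = np.clip(lam, 0.0, None)          # remove tiny negative rounding errors
        A = V @ np.diag(np.sqrt(lam))          # A = O^T Lambda^{1/2}
        Z = rng.standard_normal((size, len(b)))
        return Z @ A.T + b                     # rows are samples of X = A Z + b

    rng = np.random.default_rng(1)
    Sigma = np.array([[2.0, 1.0, 0.0],
                      [1.0, 2.0, 1.0],
                      [0.0, 1.0, 2.0]])
    X = sample_gaussian(np.zeros(3), Sigma, 100_000, rng)
    print(np.cov(X, rowvar=False).round(2))    # close to Sigma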
2.2.3 Properties
Proposition 15. Suppose that the random vector X = (X_1, X_2, . . . , X_n)^T has a Gaussian distribution with mean vector b and variance-covariance matrix Σ.
If Σ is invertible then the distribution of X is continuous with a density given by

    f_X(x_1, x_2, . . . , x_n) = (1 / ((2π)^{n/2} √det(Σ))) exp( −(1/2)(x − b)^T Σ^{−1} (x − b) ),

where x = (x_1, x_2, . . . , x_n)^T ∈ R^n.
If Σ is not invertible then there exist one or more linear relationships between the random variables X_1, X_2, . . . , X_n which hold with probability one. Consequently the distribution of X is not continuous.
Proof. Suppose Σ is invertible; then all its eigenvalues are strictly positive. If we write Σ = O^T Λ O with O orthogonal and Λ diagonal then the matrix Λ has strictly positive diagonal entries and is invertible too. Moreover Σ^{−1} = O^T Λ^{−1} O. Define A = O^T Λ^{1/2} and suppose, as we may, that X = AZ + b where Z has the standard Gaussian distribution on R^n. Notice that the map z ↦ Az + b is a bijection on R^n because the matrix A is invertible. In fact if x = Az + b then z = A^{−1}(x − b). By the extension of Proposition 13 to n variables we have that the density of the distribution of X is given by

    f_X(x) = (1 / |det(A)|) f_Z(A^{−1}(x − b)),

where f_Z is the density of the standard Gaussian distribution on R^n. To complete the calculation of f_X we note that det(Σ) = det(AA^T) = det(A)², so |det(A)| = √det(Σ), and also that A^{−1} = Λ^{−1/2} O, so that

    (A^{−1}(x − b))^T (A^{−1}(x − b)) = (x − b)^T O^T Λ^{−1/2} Λ^{−1/2} O (x − b) = (x − b)^T Σ^{−1} (x − b).
If Σ is not invertible then there exists at least one vector v ≠ 0 so that Σv, and hence v^T Σ v, is zero. Then

    var( Σ_{i=1}^n v_i X_i ) = v^T Σ v = 0,

which implies that

    P( Σ_{i=1}^n v_i X_i = Σ_{i=1}^n v_i b_i ) = 1,

since a random variable with zero variance is equal to its mean with probability one.
The following theorem describes key properties of random variables X_1, X_2, . . . , X_n whose joint distribution is Gaussian. These properties wouldn't hold if all you knew was that each of the random variables X_1, X_2, . . . , X_n individually had a Gaussian distribution. The theorem has a very quick proof that seems suspiciously easy. This is because of our clever definition of Gaussian distributions, which involved two different dimensions m and n.
Theorem 2. Suppose that the random vector X = (X_1, X_2, . . . , X_n)^T has a Gaussian distribution.

(a) For any positive integer p, if B is a p × n matrix and c a p-dimensional constant vector, then the random vector

    Y = BX + c

has a Gaussian distribution¹⁴.

(b) For any positive integer p ≤ n, and 1 ≤ i_1 < i_2 < . . . < i_p ≤ n, the vector

    X̃ = (X_{i_1}, X_{i_2}, . . . , X_{i_p})^T

has a Gaussian distribution too.

(c) X_1, X_2, . . . , X_n are independent random variables if and only if

    cov(X_i, X_j) = 0 for all 1 ≤ i ≠ j ≤ n.
Proof. Since X has a Gaussian distribution we can write X = AZ + b where Z has a standard Gaussian distribution in some number of dimensions m. Then

    Y = (BA)Z + (Bb + c) = CZ + d,

and this representation shows that Y has a Gaussian distribution. Let Ã be the p × m matrix consisting of the rows i_1, i_2, . . . , i_p of A, and let b̃ = (b_{i_1}, b_{i_2}, . . . , b_{i_p})^T. Then

    X̃ = ÃZ + b̃,

and this representation shows that X̃ has a Gaussian distribution.
For the final part of the theorem, recall that independence of X_i and X_j implies that cov(X_i, X_j) = 0 for any pair of random variables, so we must prove the converse. If

    cov(X_i, X_j) = 0 for all 1 ≤ i ≠ j ≤ n,

then the variance-covariance matrix of X is a diagonal matrix Σ. Let Σ^{1/2} be the diagonal matrix whose entries are the square roots of those of Σ. Let Z be an n-dimensional random vector with the standard Gaussian distribution, and consider the random vector X′ = Σ^{1/2} Z + b, where b is the mean vector of X. Since each component X′_i is a function of Z_i alone, and the random variables Z_1, Z_2, . . . , Z_n are independent, X′_1, X′_2, . . . , X′_n are independent random variables too. But calculating the mean vector and variance-covariance matrix of X′ we find they agree with those of X, and so by the preceding theorem X and X′ have the same distribution. And thus it follows that X_1, X_2, . . . , X_n are independent.
¹⁴ It might be that Y is just a deterministic constant vector; that's OK, we include constants as “degenerate” Gaussian distributions too.
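The warning before the theorem can be made concrete with a small simulation, again in Python with NumPy (an added illustration, not from the notes). Take X standard Gaussian and Y = SX with S a random sign independent of X: both X and Y are standard Gaussian and cov(X, Y) = 0, yet they are certainly not independent, since |Y| = |X|. Part (c) fails here precisely because (X, Y) is not jointly Gaussian.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 200_000
    X = rng.standard_normal(N)
    S = rng.choice([-1.0, 1.0], size=N)            # random sign, independent of X
    Y = S * X                                      # Y is again standard Gaussian

    print(np.cov(X, Y)[0, 1].round(3))             # ~ 0: uncorrelated
    print(np.corrcoef(X**2, Y**2)[0, 1].round(3))  # exactly 1: not independent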
2.3 Fisher’s Theorem and a statistical application
We proved on the first tutorial sheet that if Z_1, Z_2, . . . , Z_n are independent random variables each having the standard Gaussian distribution on R then the random variable

    Z_1² + Z_2² + . . . + Z_n²

has the Gamma distribution¹⁵ with parameters 1/2 and n/2. This distribution is also known as the χ²-distribution with n degrees of freedom.
If X and Y are independent random variables having continuous distributions with densities f_X and f_Y and P(Y > 0) = 1, then a very similar argument to that used to prove Proposition 10 shows that the random variable X/Y has a continuous distribution with density

    f_{X/Y}(t) = ∫_0^∞ y f_X(ty) f_Y(y) dy.    (2.6)

Calculating, using this formula and the densities of the Gaussian and Gamma distributions, the following proposition can be proved.
Proposition 16. If Z has the standard Gaussian distribution on R, Y has the χ²-distribution with n degrees of freedom, and Z and Y are independent, then

    Z / √(Y/n)

has a continuous distribution with density f(t) proportional to

    (1 + t²/n)^{−(n+1)/2}, t ∈ R.

The distribution appearing in this proposition is called the t-distribution with n degrees of freedom.
Theorem 3. Suppose that X_1, X_2, . . . , X_n are independent random variables each having the Gaussian distribution on R with mean µ and variance σ² > 0. Let

    X̄ = (1/n) Σ_{i=1}^n X_i and S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)².
Then

(a) X̄ has the Gaussian distribution on R with mean µ and variance σ²/n.

(b) (n − 1)S²/σ² has the χ² distribution with n − 1 degrees of freedom.

(c) X̄ and S² are independent.

(d) (X̄ − µ)/√(S²/n) has the t-distribution with n − 1 degrees of freedom.

¹⁵ See also the Lecture 8 video for the case n = 2.
Proof. A linear combination of independent random variables each having a Gaussian distribution has a Gaussian distribution itself. Hence X̄ has a Gaussian distribution and we can calculate

    E[X̄] = E[ (1/n) Σ_{i=1}^n X_i ] = (1/n) Σ_{i=1}^n E[X_i] = µ,

and

    var(X̄) = var( (1/n) Σ_{i=1}^n X_i ) = (1/n²) Σ_{i=1}^n var(X_i) = σ²/n.

This proves (a).
Let X = (X_1, X_2, . . . , X_n)^T; this random vector has a Gaussian distribution¹⁶ with covariance matrix Σ = σ²I and mean vector µ1, where I denotes the n × n identity matrix and 1 = (1, 1, . . . , 1)^T is the n-dimensional vector whose every component is equal to 1.
The key idea of the proof is to transform the random vector X into a new random vector Y with some nice properties. To this end let O = (O_ij)_{1≤i,j≤n} be an orthogonal matrix whose first row¹⁷ is

    (n^{−1/2}, n^{−1/2}, . . . , n^{−1/2})^T = (1/√n) 1.
Because O is an orthogonal matrix, if i ≠ 1, then its ith row (O_i1, O_i2, . . . , O_in) is a vector which is orthogonal to 1 and consequently

    Σ_{j=1}^n O_ij = 0.
Now let Y = OX; by Theorem 2 it has a Gaussian distribution. Notice that, for i ≠ 1,

    E[Y_i] = E[ Σ_{j=1}^n O_ij X_j ] = ( Σ_{j=1}^n O_ij ) µ = 0,

and that the variance-covariance matrix of Y is σ² O O^T = σ² I. Because this matrix is diagonal, the random variables Y_1, Y_2, . . . , Y_n are independent.
The next stage of the proof is to write the two random variables we are interested in, X̄ and S², in terms of the random vector Y. Firstly,

    Y_1 = Σ_{j=1}^n O_1j X_j = √n X̄.
¹⁶ In fact each of X_1, X_2, . . . , X_n can be written as a linear transformation of a standard Gaussian random variable Z_i, namely X_i = σZ_i + µ, and because X_1, X_2, . . . , X_n are independent so are Z_1, . . . , Z_n. But this means the vector Z = (Z_1, Z_2, . . . , Z_n)^T has the standard Gaussian distribution on R^n, and the vector X can be written as a linear transformation of Z.
¹⁷ Notice this vector has length one, and so there exists an orthogonal matrix with it as its first row.
Next, using the orthogonality of O, we have

    Σ_{i=1}^n Y_i² = Y^T Y = (OX)^T (OX) = X^T (O^T O) X = Σ_{i=1}^n X_i².
Consequently

    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)² = (1/(n−1)) ( Σ_{i=1}^n X_i² − nX̄² )
       = (1/(n−1)) ( Σ_{i=1}^n Y_i² − Y_1² ) = (1/(n−1)) Σ_{i=2}^n Y_i².
So because Y_1 is independent of Y_2, Y_3, . . . , Y_n, we deduce that X̄ is independent of S², proving (c). Moreover

    (n − 1)S²/σ² = Σ_{i=2}^n (Y_i/σ)²,

and the random variables Y_2/σ, Y_3/σ, . . . , Y_n/σ are independent and each have the standard Gaussian distribution on R. Thus (b) follows from the definition of the χ² distribution.
Finally, for (d), we write

    (X̄ − µ)/√(S²/n) = ( (X̄ − µ)/√(σ²/n) ) / √(S²/σ²).

Then, from (a), (b) and (c), we see that the preceding proposition applies to the right-hand side.
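As a quick Monte Carlo illustration of the theorem (an added Python/NumPy sketch, not part of the notes, with arbitrary parameter values), we can simulate many samples of size n and check parts (a), (b) and (c) empirically:

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 2.0, 3.0, 5, 200_000
    X = rng.normal(mu, sigma, size=(reps, n))
    Xbar = X.mean(axis=1)
    S2 = X.var(axis=1, ddof=1)                 # sample variance, 1/(n-1) convention

    print(Xbar.mean().round(3), Xbar.var().round(3))   # ~ mu and sigma^2/n = 1.8
    print(((n - 1) * S2 / sigma**2).mean().round(3))   # ~ n - 1, the chi^2 mean
    print(np.corrcoef(Xbar, S2)[0, 1].round(3))        # ~ 0, consistent with (c)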
A very common statistical problem is the following. We have data x_1, x_2, . . . , x_n which we model as being generated by observing values of independent random variables X_1, X_2, . . . , X_n, where each X_i is supposed to have a Gaussian distribution on R with some mean µ and variance σ². In applications where we do this, µ usually represents something in the world which we want to estimate from the data. For example x_1, x_2, . . . , x_n could be n noisy measurements of some quantity whose precise value is µ. The natural choice of an estimate for µ is the sample mean

    x̄ = (1/n) Σ_{i=1}^n x_i.
However it is very important to understand how accurate this estimate is likely to be. This can be done by considering the distribution of the random variable

    X̄ = (1/n) Σ_{i=1}^n X_i,
and seeing how far away X̄ tends to be from its mean µ. We know that the distribution of X̄ − µ is Gaussian with mean zero and variance σ²/n. So we see immediately how our estimate will (probably) be more accurate for larger sample sizes n, and smaller σ², which describes the inherent variability of the data. The problem is that σ² also needs to be estimated from the data. The natural way to do this is with the sample variance

    s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)².

But since this is an estimate, we now need to understand how accurate it is likely to be too! And so we are led to consider the distribution of the random variable

    S² = (1/(n−1)) Σ_{i=1}^n (X_i − X̄)².
According to the theorem, (n − 1)S²/σ² has the χ² distribution with n − 1 degrees of freedom. This latter distribution has mean¹⁸ n − 1. Consequently

    E[S²] = σ²,

which is reassuring: we say that s² is an unbiased estimate of σ². To fully understand how accurate our estimate x̄ of µ is likely to be we need to study the joint distribution of the random variables X̄ and S², and this is exactly what the theorem does.
¹⁸ You can view this as a special case of the mean of a Gamma distribution, or just calculate E[Z_1² + Z_2² + . . . + Z_{n−1}²] where the Z_i each have the standard Gaussian distribution.
Chapter 3

Conditioning
3.1 Conditional distributions
Suppose A is an event with probability P(A) > 0 and X a random variable. Then the conditional distribution of X given A is the probability distribution on R given by

    B ↦ P(X ∈ B|A) = P(X ∈ B and A occurs)/P(A)    (3.1)

for subsets B ⊆ R. Conditional distributions can be discrete and described by a mass function, or continuous and described by a density.
Example 11. Let T have an exponential distribution on R with parameter α. Fix some t > 0 and let A be the event that T ≥ t. Then for any a and b satisfying t ≤ a ≤ b we have

    P(a ≤ T ≤ b | T ≥ t) = P(a ≤ T ≤ b and T ≥ t)/P(T ≥ t) = P(a ≤ T ≤ b)/P(T ≥ t)
      = (e^{−αa} − e^{−αb})/e^{−αt} = e^{−α(a−t)} − e^{−α(b−t)} = ∫_a^b αe^{−α(u−t)} du.

Together with the fact that P(T ≤ t | T ≥ t) = 0 this shows that the conditional distribution of T given T ≥ t has density

    f(u) = αe^{−α(u−t)} if u ≥ t, and f(u) = 0 otherwise.    (3.2)
This is a shifted exponential distribution: in fact it is the same distribution as that of T + t. Another way of saying an equivalent thing: if you are waiting for a time T which has the exponential distribution to occur, and it hasn't occurred by some time t, then the amount of time you have to wait further for T to occur has an exponential distribution (with the same parameter). This is the key property of the exponential distribution that makes it the natural model for waiting times¹.

¹ Not all waiting times: the older I get, the less time I can expect to have left to live.
Example 12. Pick one of two coins out of a pocket. The first coin is unbiased, the second biased so it lands head up with probability 2/3. You are equally likely to choose either coin. Toss the chosen coin 10 times. What is the distribution of X, the number of times a head is tossed?
This is a good example to demonstrate that it can be natural, when building a model, to specify (one or more) conditional distributions of a random variable, and then calculate probabilities from those conditional probabilities. Suppose A is the event we pick the unbiased coin. Then P(A) = 1/2. Since if A occurs we are tossing an unbiased coin, we want the conditional distribution of X given A to be a Binomial distribution with parameters 10 and 1/2. If A doesn't occur then we are tossing the biased coin, and so we want the conditional distribution of X given A^c to be a Binomial distribution with parameters 10 and 2/3. Using the law of total probability we can easily compute the distribution of X. For k = 0, 1, 2, . . . , 10 we have

    P(X = k) = P(X = k|A)P(A) + P(X = k|A^c)P(A^c)
             = C(10, k) [ (1/2)^{11} + (1/2)(2/3)^k (1/3)^{10−k} ],    (3.3)

where C(10, k) denotes a binomial coefficient.
Now we want to develop the idea of a conditional distribution of a random variable X further, and condition on the value of another random variable Y. The easiest case is if Y has a discrete distribution, because then if y is a value that belongs to the support of the distribution of Y, the event Y = y has probability greater than zero, and we are back in our previous setting where we were conditioning on an event. Let's consider the case where X and Y both have discrete distributions.
Definition 12. Let X and Y be random variables with discrete distributions having supports 𝒳 and 𝒴 respectively. Let f_XY denote the mass function of the joint distribution of X and Y, and f_Y be the mass function of Y. Then for each y ∈ 𝒴 define the conditional mass function of X given Y = y to be

    f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y) for x ∈ 𝒳.

Then, just by thinking of mass functions as probabilities, it is clear that the conditional distribution of X given Y = y is the discrete distribution with mass function x ↦ f_{X|Y}(x|y) for x ∈ 𝒳.
Example 13. Suppose that the joint distribution of X and Y has mass function

    f_XY(x, y) = (1/2) C(y, x) (1/6)^x (1/3)^{y−x} for x = 0, 1, 2, . . . , y, and y = 0, 1, 2, 3, . . . ,

where C(y, x) is a binomial coefficient. What is the conditional distribution of X given Y = y?
First we find the marginal distribution of Y. For any non-negative integer y we have

    f_Y(y) = (1/2) Σ_{x=0}^y C(y, x) (1/6)^x (1/3)^{y−x} = (1/2)(1/6 + 1/3)^y = (1/2)^{y+1}.

So in fact we see Y has a geometric distribution. Now the conditional mass function of X given Y = y is

    f_{X|Y}(x|y) = (1/2) C(y, x) (1/6)^x (1/3)^{y−x} / (1/2)^{y+1} = C(y, x) (1/3)^x (2/3)^{y−x} for x = 0, 1, 2, . . . , y.

So the conditional distribution of X given Y = y is a Binomial² distribution with parameters y and 1/3.
² Why does this work out so nicely? It models pulling balls out of an (infinite) box where each ball can be red with probability 1/2, blue with probability 1/6 and green with probability 1/3. The colours of the balls are independent of one another. You carry on drawing balls until you first get a red ball. Y counts the number of balls this takes to happen (not counting the red one), and X counts the number of blue balls drawn (prior to the red one).
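The footnote's ball-drawing description is easy to simulate; here is an added Python/NumPy sketch (not part of the notes) checking that, conditionally on Y = 4, the count X of blue balls behaves like a Binomial(4, 1/3) variable.

    import numpy as np

    rng = np.random.default_rng(5)

    def draw_once():
        # Draw balls (red w.p. 1/2, blue 1/6, green 1/3) until the first red;
        # return (blue count, total non-red count).
        x = y = 0
        while True:
            u = rng.random()
            if u < 1/2:
                return x, y
            y += 1
            if u < 1/2 + 1/6:
                x += 1

    xs, ys = np.array([draw_once() for _ in range(200_000)]).T
    print(ys.mean())                 # ~ 1, the mean of the geometric distribution of Y
    print(xs[ys == 4].mean(), 4/3)   # conditional mean ~ y/3, as for Binomial(4, 1/3)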
Now suppose that X and Y have a continuous joint distribution. Then if y is some given value, the event Y = y has zero probability, and so it doesn't make sense to condition on it happening. But the preceding definition for mass functions still seems to suggest a natural formula.

Definition 13. Let X and Y have a continuous joint distribution with density f_XY. Denote the density of the (marginal) distribution of Y by f_Y. Then for each y ∈ R such that f_Y(y) > 0 we define the conditional density of X given Y = y to be

    f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y) for x ∈ R.

We then define the conditional distribution of X given Y = y to be the distribution on R with density x ↦ f_{X|Y}(x|y).
There is a problem with this definition which is that we know densities are not unique,
and their values can be changed on small sets. We tend to ignore this issue, but it means
that really, for any particular value y, such as y = 0, the conditional density of X given
Y = y isn’t meaningfully defined at all. In practice we choose versions of our densities
which are functions which are continuous wherever possible, and this mostly avoids the
problem.
Why is this definition reasonable? One justification is to consider conditioning on the event y − ε ≤ Y ≤ y + ε. This makes sense in the usual way because we are conditioning on an event of positive probability. Then let ε tend to 0 and see what happens. Suppose B ⊆ R; then we would like

    P(X ∈ B|Y = y) = lim_{ε↓0} P(X ∈ B | y − ε ≤ Y ≤ y + ε).    (3.4)
But providing f_XY is a continuous function,

    P(X ∈ B | y − ε ≤ Y ≤ y + ε) = ( ∫_{y−ε}^{y+ε} ∫_B f_XY(x, z) dx dz ) / ( ∫_{y−ε}^{y+ε} f_Y(z) dz ) → ∫_B ( f_XY(x, y)/f_Y(y) ) dx    (3.5)

as ε tends to zero. So our definition of conditional distribution is sensible, at least in this case.
Example 14. Suppose that X and Y are the coordinates of a point chosen uniformly in the unit disc {(x, y) ∈ R² : x² + y² ≤ 1}. What is the conditional distribution of X given Y = y for some y ∈ (−1, 1)? What is the conditional probability P(X > 0|Y = y)?
The density of the joint distribution of X and Y is

    f_XY(x, y) = 1/π if x² + y² ≤ 1, and f_XY(x, y) = 0 otherwise.

The density of the marginal distribution of Y is

    f_Y(y) = ∫_{−∞}^∞ f_XY(x, y) dx = (2/π)√(1 − y²) if −1 < y < 1,
and f_Y(y) = 0 otherwise. Thus we find that the conditional density of X given Y = y satisfying −1 < y < 1 is

    f_{X|Y}(x|y) = 1/(2√(1 − y²)) for −√(1 − y²) ≤ x ≤ √(1 − y²),

and f_{X|Y}(x|y) = 0 otherwise. Thus the conditional distribution of X given Y = y is the uniform distribution on the interval [−√(1 − y²), √(1 − y²)]³. The conditional probability P(X > 0|Y = y) is now computed (not that a calculation is really needed in this case) from this conditional distribution:

    P(X > 0|Y = y) = ∫_0^∞ f_{X|Y}(x|y) dx = 1/2.
It's important to reflect on this example and see that it is exactly as you would expect: the random point (X, Y) is somewhere in the disc. If you are told that in fact Y = y_0 for some given value of y_0, then the point (X, Y) must be somewhere on the line segment

    {(x, y) ∈ R² : x² + y² ≤ 1} ∩ {(x, y) ∈ R² : y = y_0}
      = {(x, y) ∈ R² : −√(1 − y_0²) ≤ x ≤ √(1 − y_0²), y = y_0}.    (3.6)

Moreover, since the point is chosen uniformly at random it has no preference for one region of the disc over another, so it has no preference as to where on this line segment it lies either.
Example 15. Suppose the joint distribution of X and Y is given by

    f_XY(x, y) = √(y/(2π)) e^{−y(x²/2 + 1)} for x ∈ R and y > 0,    (3.7)

and f_XY(x, y) = 0 if y ≤ 0. Then, for a fixed y > 0, the conditional density of X given Y = y must be

    C(y) e^{−yx²/2} for x ∈ R,    (3.8)

where C(y) = √(y/(2π)) e^{−y}/f_Y(y) is some constant depending on y. This density is proportional to, and hence must be equal⁴ to, the density of the Gaussian distribution with mean zero and variance 1/y. Since that implies that the constant C(y) must then be √(y/(2π)), it follows that the marginal distribution of Y is the exponential with rate 1. This is an argument really worth thinking about! First, we didn't need to find f_Y to find the conditional distribution of X. Secondly, because we know the value of the normalizing constant in the Gaussian density we didn't have to calculate the marginal distribution of Y by doing an integration. Magic!
³ Take care here: remember f_{X|Y} is the density of a distribution when treated as a function of x with y thought of as fixed.
⁴ If two probability density functions are proportional they must be equal, since the integral of both (over R) is equal to one.
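Example 15's conclusion, that f_Y(y) = e^{−y} without doing the integral, can also be double-checked numerically; the added Python/NumPy sketch below (not part of the notes) simply integrates f_XY over x on a fine grid.

    import numpy as np

    # Check Example 15: integrating f_XY over x should give the Exp(1) density e^{-y}.
    x = np.linspace(-60, 60, 400_001)
    dx = x[1] - x[0]
    for y in (0.5, 1.0, 2.0):
        f = np.sqrt(y / (2 * np.pi)) * np.exp(-y * (x**2 / 2 + 1))
        print((f.sum() * dx).round(6), np.exp(-y).round(6))   # the two columns agree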
3.2 Regression
Proposition 17. Suppose X and Y have a joint distribution which is Gaussian, with the means and variances of X and Y being µ_X, µ_Y, σ_X² > 0 and σ_Y² > 0. Suppose the covariance between X and Y is σ_XY and that⁵ σ_XY² < σ_X² σ_Y². Then the conditional distribution of Y given X = x ∈ R is⁶ the Gaussian distribution with mean

    µ_{Y|X=x} = µ_Y + (σ_XY/σ_X²)(x − µ_X)

and variance

    σ²_{Y|X} = σ_Y² − σ_XY²/σ_X².
Notice that the variance of the conditional distribution doesn't depend on x, but the mean of the conditional distribution does, though only as a linear function. The exact form of that linear function is a very famous formula in statistics. Suppose we are interested in how the value of X can be used to predict the value of Y; then if we observe that the value of X is x, say, a sensible⁷ prediction for Y is the conditional mean µ_{Y|X=x}. So we are using a linear function of X to predict Y. And the line described by that linear function is known as the regression line. The variance σ²_{Y|X} is a measure of how accurate this prediction will be; we see that

    σ²_{Y|X} / σ_Y² = 1 − ρ²    (3.9)

where ρ is the correlation between X and Y. So for a fixed variance of Y the correlation controls how accurate the prediction is.
Proof. Under the conditions stated in the proposition the variance-covariance matrix Σ of the vector (X, Y)^T is invertible. We have

    Σ = ( σ_X²   σ_XY )    and    Σ^{−1} = (1/(σ_X²σ_Y² − σ_XY²)) (  σ_Y²   −σ_XY )
        ( σ_XY   σ_Y² )                                           ( −σ_XY    σ_X² ).

Denote the determinant σ_X²σ_Y² − σ_XY² by D. The joint density of X and Y is, according to Proposition 15, proportional to

    exp( −σ_Y²(x − µ_X)²/(2D) − σ_X²(y − µ_Y)²/(2D) + σ_XY(x − µ_X)(y − µ_Y)/D ).

We want to treat x as fixed and consider this expression as a function of y. Notice that D/σ_X² = σ²_{Y|X} and σ_X²µ_Y/D + σ_XY(x − µ_X)/D = µ_{Y|X=x}/σ²_{Y|X}, and checking the coefficients of y² and y on both sides verifies that

    σ_X²(y − µ_Y)²/(2D) − σ_XY(x − µ_X)(y − µ_Y)/D = (y − µ_{Y|X=x})²/(2σ²_{Y|X}) + some stuff,

where “some stuff” doesn't have any dependency on y. Thus the joint density of X and Y, treated as a function of y alone, is proportional to the density of the Gaussian distribution on R with mean µ_{Y|X=x} and variance σ²_{Y|X}. This proves the proposition by the same argument as used in Example 15.

⁵ Notice this is the same as saying the correlation between X and Y is neither 1 nor −1.
⁶ Note we are reversing the roles of X and Y compared to the last lecture.
⁷ Optimal in a sense we will see in a few lectures.
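Here is an added Python/NumPy sketch (not from the notes; the parameter values are arbitrary) that makes the regression line visible in a simulation: conditioning a large bivariate Gaussian sample on X lying in a thin slice around x_0 and comparing with the formulas of Proposition 17.

    import numpy as np

    rng = np.random.default_rng(7)
    mu = np.array([1.0, -1.0])
    Sigma = np.array([[2.0, 1.2],
                      [1.2, 1.5]])
    xy = rng.multivariate_normal(mu, Sigma, size=500_000)

    x0 = 2.0
    ys = xy[np.abs(xy[:, 0] - x0) < 0.02, 1]   # Y-values with X in a thin slice at x0
    print(ys.mean(), mu[1] + Sigma[0, 1] / Sigma[0, 0] * (x0 - mu[0]))  # ~ -0.4
    print(ys.var(), Sigma[1, 1] - Sigma[0, 1]**2 / Sigma[0, 0])         # ~ 0.78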
The notion of a conditional distribution can be extended to more random variables. In general if we have random variables X_1, X_2, . . . , X_n and some k between 1 and n − 1, then we can consider the conditional distribution of X_1, X_2, . . . , X_k given values for the remaining random variables X_{k+1}, . . . , X_n. If the joint distribution of X_1, X_2, . . . , X_n has a density then the definition of a conditional density given in Definition 13 is easily extended. Our proposition, just proved, extends too. If the joint distribution of X_1, X_2, . . . , X_n is Gaussian (with non-singular variance-covariance matrix) then the conditional distribution of any k of the variables given the values of the other n − k variables is a Gaussian distribution on R^k, with a mean which is a linear function of the given values for the n − k variables, and a variance that does not depend on those values.
3.3 The law of total probability
Recall that the law of total probability from elementary probability states that if B_1, B_2, . . . , B_n are a partition⁸ and A is some further event then

    P(A) = Σ_{i=1}^n P(A|B_i) P(B_i).    (3.10)
This extends to the case where the partition is an infinite sequence of sets too. As a consequence, if X and Y have discrete distributions then, for any x ∈ 𝒳, the support of the distribution of X,

    P(X = x) = Σ_{y∈𝒴} P(X = x|Y = y) P(Y = y),    (3.11)
where 𝒴 is the support of the distribution of Y. What is the analogous statement for continuous distributions? Suppose X and Y have a joint distribution which is continuous; then

    P(a ≤ X ≤ b) = ∫_{−∞}^∞ P(a ≤ X ≤ b|Y = y) f_Y(y) dy,    (3.12)

where f_Y is the density of the distribution of Y, and the conditional probability P(a ≤ X ≤ b|Y = y) is defined to mean

    ∫_a^b f_{X|Y}(x|y) dx,
where the conditional density f_{X|Y} is defined as in Definition 13. Notice this conditional density is only defined for y such that f_Y(y) > 0, but this doesn't matter in making sense of equation (3.12)! It is very easy to check (3.12); just substitute the formula for P(a ≤ X ≤ b|Y = y) into the right-hand side, further write the conditional density as the ratio which defines it in Definition 13, and then we see that (3.12) is equivalent to

    P(a ≤ X ≤ b) = ∫_{−∞}^∞ ∫_a^b f_XY(x, y) dx dy,    (3.13)

which of course is true.
Here is a generalization of (3.12) to more general events. Continue to suppose X and Y have a continuous joint distribution. Let B be a subset of R². Now we “slice” B up, and define, for each y ∈ R:

    B_y = {x ∈ R : (x, y) ∈ B}.    (3.14)

Then we have

    P((X, Y) ∈ B) = ∫_{−∞}^∞ P(X ∈ B_y|Y = y) f_Y(y) dy.    (3.15)

⁸ This means these sets are disjoint and their union is the entire sample space.
The proof⁹ is exactly the same as for (3.12), with the conditional probability being defined as an integral of the conditional density.
Let's look at an example where we can solve an interesting problem by using conditioning. We are going to use a generalization of (3.15) that involves conditioning on two random variables. Suppose that B is a subset of R², and that X, Y and Z are three random variables whose joint distribution is continuous. Then

    P((X, Y) ∈ B|Z = z) = ∫_{−∞}^∞ P(X ∈ B_y|Y = y, Z = z) f_{Y|Z}(y|z) dy.    (3.16)

The proof is similar to before, noting that the conditional density of X given Y = y and Z = z satisfies

    f_{X|YZ}(x|y, z) = f_XYZ(x, y, z)/f_YZ(y, z) = f_{XY|Z}(x, y|z)/f_{Y|Z}(y|z),    (3.17)

where the first equality is taken to be a definition, and the conditional joint distribution of X and Y given Z = z is defined to have density

    f_{XY|Z}(x, y|z) = f_XYZ(x, y, z)/f_Z(z).    (3.18)

⁹ I am hiding what is really going on mathematically speaking. The deep mathematics is that we can always calculate double integrals by integrating first over one variable then over the remaining variable, but we have been assuming we could do that all through the module, so this isn't the right time to make a fuss about it.
Example 16. Consider the following game between two players. Player A goes first. They observe a random number X_1 which is uniformly distributed in [0, 1]. They then choose whether to “stick” and score X_1, or continue and observe a second uniformly distributed random number X_2. If X_1 + X_2 > 1 they go “bust” and lose automatically. Otherwise they score X_1 + X_2, and player B begins to play. This second player observes a random number Y_1, again uniformly distributed in [0, 1]. If Y_1 is greater than the score of player A then player B wins; if not, then a second random number Y_2 with uniform distribution is observed. If Y_1 + Y_2 is both greater than the score of player A and less than 1, then player B wins; otherwise player A has won. Assume all four random variables, X_1, X_2, Y_1 and Y_2, are independent. Find the optimal strategy for player A.
Let X denote the score¹⁰ of player A; because we don't know the strategy of player A, we don't know the distribution of X, but we do know that X must be independent of Y_1 and Y_2, since it will be some function of X_1 and X_2. Let A ⊂ R³ be given by

    A = {(x, y_1, y_2) ∈ [0, 1]³ : y_1 + y_2 < x or y_1 < x < 1 < y_1 + y_2}.

This is exactly the set of values for the random variables X, Y_1 and Y_2 that correspond to player A winning. We can't calculate P((X, Y_1, Y_2) ∈ A) because we don't yet know the distribution of X. But we can calculate the conditional probability that A wins given their score X = x, which more formally is

    P((Y_1, Y_2) ∈ A_x|X = x),

where A_x is the slice through A:

    A_x = {(y_1, y_2) ∈ [0, 1]² : y_1 + y_2 < x or y_1 < x < 1 < y_1 + y_2} for x ∈ [0, 1],

and A_x is empty if x > 1. We do this as follows. The conditional distribution of Y_1 and Y_2 given X = x is the uniform distribution on the square [0, 1]², because X is independent of Y_1 and Y_2. So

    P((Y_1, Y_2) ∈ A_x|X = x) = area(A_x) = x²,

provided x ≤ 1. If x > 1 then the conditional probability is zero.

¹⁰ If player A continues and then X_1 + X_2 > 1 we can still call this the score of player A.
Now let's think about the choice player A has to make once they know the value of the first random number X_1 is x_1, say. They can “stick”, in which case their score X is equal to x_1 and, by the above calculation, their conditional probability of winning is x_1². Alternatively they can continue and observe the value of X_2 (and risk going bust in the process). To make their decision they need to know the conditional probability of winning if they continue. We calculate this using a version¹¹ of (3.16), where now, because we are supposing player A is continuing, we let X = X_1 + X_2:

    P((X, Y_1, Y_2) ∈ A|X_1 = x_1) = ∫_{−∞}^∞ P((Y_1, Y_2) ∈ A_x|X = x) f_{X|X_1}(x|x_1) dx
      = ∫_{x_1}^1 x² f_{X|X_1}(x|x_1) dx = ∫_{x_1}^1 x² dx = (1 − x_1³)/3,

since the conditional distribution of X = X_1 + X_2 given X_1 = x_1 is uniform on [x_1, x_1 + 1]. So finally we know how player A should make their decision as to whether to continue or not, having observed that X_1 = x_1. If x_1² > (1 − x_1³)/3 they should stick; otherwise¹² they should continue. By considering solutions of the polynomial equation 1 − x_1³ = 3x_1² we can see that player A should continue if x_1 < x* ≈ 0.53209.

¹¹ Don't worry that it doesn't look exactly like (3.16); it's the same idea. Concentrate on convincing yourself that it makes intuitive sense. The step that I'm skipping over is justifying that P((Y_1, Y_2) ∈ A_x|X = x, X_1 = x_1) = P((Y_1, Y_2) ∈ A_x|X = x).
¹² The probability that the value of X_1 gives equality here is zero, so we can be sloppy about whether the inequality is strict or not.
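The threshold x* is the positive root of x³ + 3x² − 1 = 0; the added Python/NumPy snippet below (not part of the notes) finds it numerically.

    import numpy as np

    # Player A is indifferent when x^2 = (1 - x^3)/3, i.e. x^3 + 3x^2 - 1 = 0.
    roots = np.roots([1.0, 3.0, 0.0, -1.0])
    x_star = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1][0]
    print(x_star)   # ~ 0.53209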
Up until now, we have mostly considered joint distributions that are either continuous or discrete, although we have seen, in the setting of Gaussian distributions, cases where the distribution “lives” on a subspace of R^n and so is not considered to be either. Let's now look at how we can describe the joint distribution of a pair of random variables X and Y where X has a discrete distribution with support 𝒳 and Y has a continuous distribution. In this case (X, Y) belongs to the set

    {(x, y) ∈ R² : x ∈ 𝒳}

with probability one, and yet this set has zero area¹³, so we cannot say the joint distribution is a continuous distribution. Nevertheless we should be able to describe the joint distribution with a function¹⁴ f_XY : 𝒳 × R → [0, ∞) such that for each x ∈ 𝒳 and a ≤ b,

    P(X = x and a ≤ Y ≤ b) = ∫_a^b f_XY(x, y) dy.    (3.19)

¹³ It is a finite or countably infinite collection of vertical lines in the plane.
In many situations it's natural to describe such a joint distribution using conditional distributions. We can write the function f_XY appearing in (3.19) in the form

    f_XY(x, y) = f_X(x) f_{Y|X}(y|x),    (3.20)

where f_X(x) = ∫_{−∞}^∞ f_XY(x, y) dy is the mass function of the distribution of X defined for x ∈ 𝒳, and y ↦ f_{Y|X}(y|x) is the density of the conditional distribution of Y given X = x.
If instead we wish to condition on Y = y, then we run into the problem of conditioning on an event of zero probability again. However, the sensible way to define the conditional distribution of X given Y = y is to say that this conditional distribution is the discrete distribution on 𝒳 with mass function

    f_{X|Y}(x|y) = f_XY(x, y)/f_Y(y) for x ∈ 𝒳,    (3.21)

defined at y ∈ R for which the density of Y, f_Y(y), is strictly positive. Here f_XY denotes the function appearing in (3.19). Why is this a good¹⁵ definition of the conditional distribution of X? Because it makes the law of total probability work: for each x ∈ 𝒳,

    P(X = x) = ∫_{−∞}^∞ P(X = x|Y = y) f_Y(y) dy.    (3.22)
Example 17. An electrical device contains a fragile component that fails at a random time Y which has a continuous distribution. However, the component has been manufactured in one of two factories. If it has been produced in factory one then Y is exponentially distributed with rate λ_1. If it has been produced in factory two then Y is exponentially distributed with rate λ_2. Factory one produces a proportion p of these components and factory two a proportion 1 − p. What is the distribution of the random time Y, and if the component fails at some time y > 0, what is the probability that it was manufactured in factory one?
Let X be the random variable taking the value 1 if the component is manufactured in factory one and the value 2 if it is manufactured in factory two. The question tells us the mass function of X, and the conditional distributions of Y given X = 1 and given X = 2. So using (3.20) we obtain the following description of the joint distribution of X and Y. For 0 ≤ a ≤ b,

    P(X = 1 and a ≤ Y ≤ b) = ∫_a^b p λ_1 e^{−λ_1 y} dy

and

    P(X = 2 and a ≤ Y ≤ b) = ∫_a^b (1 − p) λ_2 e^{−λ_2 y} dy.

¹⁴ I'm going to avoid calling this a density function because it is not a density in the same sense as the density of a continuous joint distribution. Nevertheless it is a sort of density, however with respect to length rather than area.
¹⁵ Other than that the formula just looks right!
Summing these two expressions shows the distribution of Y has density given by

    f_Y(y) = p λ_1 e^{−λ_1 y} + (1 − p) λ_2 e^{−λ_2 y} for y ≥ 0, and f_Y(y) = 0 otherwise.

Finally the conditional probability that X = 1 given Y = y is given by (3.21):

    P(X = 1|Y = y) = p λ_1 e^{−λ_1 y} / ( p λ_1 e^{−λ_1 y} + (1 − p) λ_2 e^{−λ_2 y} ).
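As a small added illustration (Python, not part of the notes; the parameter values are made up for the sketch), the posterior probability of factory one decreases in y when λ_1 > λ_2, since long-lived components are more likely to come from the low-rate factory:

    import numpy as np

    def posterior_factory_one(y, p=0.6, lam1=2.0, lam2=0.5):
        # P(X = 1 | Y = y) from the formula above, for illustrative parameters.
        a = p * lam1 * np.exp(-lam1 * y)
        b = (1 - p) * lam2 * np.exp(-lam2 * y)
        return a / (a + b)

    for y in (0.1, 1.0, 5.0):
        print(y, posterior_factory_one(y).round(3))   # decreasing in y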
3.4 Priors and posteriors
We begin with an example using elementary probability (very relevant to real life right now!) to illustrate the notion of prior and posterior probabilities, and the use of Bayes' formula.
Example 18. Mass testing of the people of Liverpool (population approximately 500,000) for Covid-19 began last week. According to Public Health England the test being used has a sensitivity of 70% and a specificity of 99.68%. If one in fifty people living in Liverpool are currently infected, and everyone is tested, how many positive test results do we expect to obtain, and of those testing positive how many are expected to in fact be healthy?
This is the important question of false positives¹⁶ in testing. Suppose we test a random Liverpudlian, let D denote the event they are infected, and we can reasonably assert that P(D) = 1/50. This is known as the prior probability of being infected, because this is before we take account of the result of the test. Let T be the event that this person has a positive test result. Then the Public Health England data states that

    P(T|D) = 0.7 and P(T|D^c) = 0.0032.

The total number of positive test results we expect to obtain in Liverpool is therefore, by the law of total probability,

    10,000 × 0.7 + 490,000 × 0.0032 = 8,568.

Of these 7,000 were genuinely infected, and 1,568 were “false positives”. If our random Liverpudlian tests positive, then the posterior (because it is after the test) probability of being infected is

    P(D|T) = P(D)P(T|D) / ( P(D)P(T|D) + P(D^c)P(T|D^c) ) = 7,000/8,568 ≈ 0.82.
The method illustrated above, combining prior probabilities with the evidence from
the test to arrive at posterior probabilities, is the basis of an entire system for doing
statistics: Bayesian Statistics. But to appreciate Bayesian Statistics fully, we have to
revisit the question of what probability actually means in the real world.
Classically, probabilities are assigned to events according to the characteristic fre-
quency with which the event tends to occur in repetitions of the experiment. For example
in many tosses of an unbiased coin, we expect to see half the tosses result in a head.
The weekend before the recent US election, CNN estimated Trump's chances of winning to be ten percent. But what did this mean? Did it mean they were planning to rerun the general election many times¹⁷ and observe that Trump won in approximately ten percent of them? This is an example of using probability to quantify our beliefs about how likely something is to happen. However, the problem with this interpretation of probability is that different individuals may have different beliefs, and then what? It is because of this subjectivity that Bayesian Statistics can occasionally be controversial, but it is also an enormously powerful method, and very widely used.

¹⁶ However, the argument that there are many false positives arising from the NHS tests used for people with symptoms is without foundation. Those people arguing for this position either lack a basic grounding in statistics, or have not made their argument in full knowledge of the facts, or are attempting to deliberately deceive the public.
¹⁷ Thankfully they didn't mean this, although it does seem now that Trump wants the election rerun!
Example 19. The estimate that one in fifty people in Liverpool has Covid-19 comes from a survey conducted by the Office for National Statistics. Suppose that an unknown proportion of a population is infected and that in a random sample of n people, k test¹⁸ positive. The naive estimate is that a proportion k/n of the population is infected. That's perfectly good, but we would like to understand how uncertain this estimate is.
The Bayesian approach to this is to treat the unknown proportion of the population which is infected as a random variable, X, and assign a prior distribution to X which reflects our uncertainty about its value before doing any sampling. In this case, if we know nothing about X, we may decide¹⁹ to model it as having a uniform distribution on [0, 1]. We next let Y be a random variable representing the number of people in our sample who test positive. If the proportion of the population that is infected was known to be p then we would model Y as having a Binomial distribution with parameters p and n (the sample size, remember). So since we are now treating the proportion of the population that is infected as a random variable X, we specify that the conditional distribution of Y given X = p is Binomial with parameters p and n. So X has a continuous distribution with density

    f_X(x) = 1 if 0 ≤ x ≤ 1, and f_X(x) = 0 otherwise.

The conditional distribution of Y given X = x has mass function

    f_{Y|X}(k|x) = C(n, k) x^k (1 − x)^{n−k} for k = 0, 1, 2, . . . , n.
We can now compute the marginal distribution of Y using (3.22). For k = 0, 1, 2, . . . , n we have²⁰

    P(Y = k) = ∫_{−∞}^∞ f_{Y|X}(k|x) f_X(x) dx = C(n, k) ∫_0^1 x^k (1 − x)^{n−k} dx
             = C(n, k) k!(n − k)!/(n + 1)! = 1/(n + 1).

So Y is uniformly distributed! That's nice.
Now return to the original question: we have observed that k people in the sample have tested positive. The posterior distribution of X describes our uncertainty concerning the proportion of the population who are infected in the light of this evidence. This is the conditional distribution of X given Y = k, and is computed by combining (3.20) and (3.21). It is a continuous distribution with density

    f_{X|Y}(x|k) = f_X(x) f_{Y|X}(k|x)/f_Y(k) = ((n + 1)!/(k!(n − k)!)) x^k (1 − x)^{n−k} if 0 ≤ x ≤ 1, and 0 otherwise.
¹⁸ The PCR testing used by the ONS is very accurate and we can now ignore the possibility of false positives.
¹⁹ This is where it is reasonable to start arguing!
²⁰ The integral is known as a beta integral; see https://en.wikipedia.org/wiki/Beta_function
Figure 3.1: The density of the posterior distribution of the proportion of the population who are infected: if 6 are infected in a sample of size 10 (blue); if 12 are infected in a sample of size 20 (red); if 30 are infected in a sample of size 50 (green); if 60 are infected in a sample of size 100 (brown).
This distribution is called a beta distribution with parameters k + 1 and n − k + 1. Notice that the first equality is a version of Bayes' formula in our setting: but be careful²¹ interpreting it, as it mixes up mass functions and densities!

²¹ Particularly because, as compared to (3.20) and (3.21), I have swapped which variable has a discrete distribution and which has a continuous distribution; just to keep you on your toes!
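Both conclusions of Example 19 are easy to check by simulation; the added Python/NumPy sketch below (not part of the notes) draws X from the uniform prior, Y from the conditional Binomial, and inspects the marginal of Y and the posterior sample given Y = k.

    import numpy as np

    rng = np.random.default_rng(8)
    n, reps = 20, 1_000_000
    X = rng.uniform(size=reps)              # prior draw of the infected proportion
    Y = rng.binomial(n, X)                  # sample count given X

    print(np.bincount(Y, minlength=n + 1) / reps)   # each ~ 1/(n+1) = 0.0476

    k = 12
    post = X[Y == k]                        # draws from the posterior given Y = k
    print(post.mean(), (k + 1) / (n + 2))   # ~ the Beta(k+1, n-k+1) mean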
3.5 Conditional expectation
Whenever X and Y are two random variables such that we can make sense of the conditional distribution of X given Y = y, we can also define the notion of the conditional expectation of X given Y = y. This is simply the mean of the conditional distribution.

Definition 14. Suppose X and Y are two random variables, for which the conditional distribution of X given Y = y is defined. Then the conditional expectation of X given Y = y is defined by

    E[X|Y = y] = Σ_{x∈𝒳} x f_{X|Y}(x|y)

if the conditional distribution of X is discrete with support 𝒳 and mass function f_{X|Y}(·|y), or by

    E[X|Y = y] = ∫_{−∞}^∞ x f_{X|Y}(x|y) dx

if the conditional distribution of X is continuous with density f_{X|Y}(·|y). We also require that the sum or integral be absolutely convergent in order for the conditional expectation to be defined.
Proposition 18. For any function h : R² → R,

    E[h(X, Y)|Y = y] = Σ_{x∈𝒳} h(x, y) f_{X|Y}(x|y)

if the conditional distribution of X is discrete with support 𝒳 and mass function f_{X|Y}(·|y), or

    E[h(X, Y)|Y = y] = ∫_{−∞}^∞ h(x, y) f_{X|Y}(x|y) dx

if the conditional distribution of X is continuous with density f_{X|Y}(·|y), provided the sum or integral is absolutely convergent.
This proposition can save a lot of work. Just imagine if you forgot about it and had to use Definition 14 instead; you would first need to compute the conditional distribution of h(X, Y) given Y = y.
Example 20. Let X and Y be the coordinates of a random point uniformly distributed in the disc {(x, y) ∈ R² : x² + y² ≤ 1}. As we saw in Example 14, the conditional distribution of X given Y = y, where y ∈ (−1, 1), is the uniform distribution on the interval [−√(1 − y²), √(1 − y²)]. Consequently

    E[X|Y = y] = ∫_{−√(1−y²)}^{√(1−y²)} x dx / (2√(1 − y²)) = 0,

and

    E[X²|Y = y] = ∫_{−√(1−y²)}^{√(1−y²)} x² dx / (2√(1 − y²)) = (1 − y²)/3.
Proposition 18 can also be extended to cover more random variables. So for example, if X, Y and Z had a continuous joint distribution we would have

    E[h(X, Y, Z)|Z = z] = ∫_{−∞}^∞ ∫_{−∞}^∞ h(x, y, z) f_{XY|Z}(x, y|z) dx dy,    (3.23)

where f_{XY|Z}(x, y|z) is the density of the conditional joint distribution of X and Y given Z = z which we defined previously at (3.18).
Just as linearity is a key property of expectations, so it is for conditional expectations also. Except now it is more general, because we can extend it so that it applies not just to constant coefficients, but to any function of the random variable on which we are conditioning. The proof of the following proposition is a simple application of the preceding Proposition 18, together with properties of sums or integrals.
Proposition 19 (Linearity of conditional expectation). (a) If a : R → R is any (non-random) function, then, for any random variable Y,

    E[a(Y)|Y = y] = a(y).

(b) If a : R → R and b : R → R are (non-random) functions, then, for any random variables X, Y and Z such that the conditional expectations exist,

    E[a(Z)X + b(Z)Y|Z = z] = a(z)E[X|Z = z] + b(z)E[Y|Z = z].
The next definition may seem insignificant, but in fact realising that it is natural to
think of a conditional expectation as a random variable itself turns out to be a big deal.
Definition 15. The conditional expectation of X given Y , denoted by E[X|Y ] is the
random variable which takes the value E[X|Y = y] whenever the event Y = y occurs.
Example 21. Suppose we roll a pair of dice. The score on the first die is X, and that on the second die is Y. Then for x = 1, 2, 3, . . . , 6,

E[X + Y |X = x] = x + E[Y |X = x] = x + 7/2.
The first equality is linearity, and then, because X and Y are independent, the conditional
distribution of Y is uniform on {1, 2, 3, . . . , 6} and so E[Y |X = x] = E[Y ]. According to
Definition 15, we therefore write
E[X + Y |X] = X + 7/2.
Another application of linearity gives,
E[X|X + Y ] + E[Y |X + Y ] = E[X + Y |X + Y ] = X + Y,
and since by symmetry it must be the case that E[X|X + Y] = E[Y|X + Y], we deduce²² that

    E[X|X + Y] = (X + Y)/2.

²² It's rather neat that we can calculate this without thinking at all about the conditional distribution of X given X + Y = k, but if you have any doubts, then check it by working out the conditional distribution of X.
Thinking about this further may help you appreciate why it is so natural to treat a conditional expectation as a random variable. Imagine a game in which you win the total score on the two dice. We know that your expected winnings are E[X + Y ] = E[X] + E[Y ] = 7.
The dice are rolled one by one, and after the first die has been rolled and you see the
result, you assess your expected winnings to be whatever you have scored on the first roll
plus E[Y ] = 7/2. But whatever you score on the first roll is the value of the random
variable X. So your expected winnings at this stage of the game are X + 7/2.
Similarly, in an alternative game, you win just whatever the score X on the first die
is. If the value of the combined score X + Y on the two dice is revealed to you, then you
would assess your expected winnings to be half of this, so (X + Y )/2.
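These two conditional expectations are easy to confirm by simulation; the added Python/NumPy sketch below (not part of the notes) estimates them by averaging over the relevant events.

    import numpy as np

    rng = np.random.default_rng(9)
    X = rng.integers(1, 7, size=1_000_000)
    Y = rng.integers(1, 7, size=1_000_000)

    print((X + Y)[X == 3].mean())    # ~ 3 + 7/2 = 6.5, i.e. E[X + Y | X = 3]
    print(X[X + Y == 8].mean())      # ~ 8/2 = 4, i.e. E[X | X + Y = 8]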
3.6 The Tower property
Proposition 20 (Tower property). For random variables X, Y and Z such that the conditional expectations exist,

    E[ E[X|Y] ] = E[X],

and

    E[ E[X|Y, Z] | Z ] = E[X|Z].
The tower property is actually a generalization of the law of total probability. So, for example, to deduce (3.12) from the preceding proposition we take h : R → R to be the function defined by h(x) = 1 if a ≤ x ≤ b, and h(x) = 0 otherwise; then the equation E[E[h(X)|Y]] = E[h(X)] is equivalent²³ to

    E[ P(a ≤ X ≤ b|Y) ] = P(a ≤ X ≤ b),    (3.24)

and if Y has a continuous distribution then the left-hand side can be computed as an integral.
Example 22. Consider the scores X and Y on a pair of dice as in Example 21 again. Then we can easily verify these two examples of the tower property holding:

    E[ E[X + Y|X] ] = E[X + 7/2] = 7 = E[X + Y],

and

    E[ E[X|X + Y] ] = E[(X + Y)/2] = 7/2 = E[X].
Example 23. Suppose that X and Y are a pair of random variables with the property that

    E[Y|X] = aX + b

for some a, b ∈ R. Remember that if the joint distribution of X and Y is Gaussian then this will be the case. Let us use the notation:

    E[X] = µ_X, E[Y] = µ_Y, var(X) = σ_X², var(Y) = σ_Y² and cov(X, Y) = σ_XY.

Now the tower property implies that

    µ_Y = E[Y] = E[ E[Y|X] ] = E[aX + b] = aµ_X + b,

and also that

    σ_XY = E[(X − µ_X)(Y − µ_Y)] = E[ E[(X − µ_X)(Y − µ_Y)|X] ]
         = E[ (X − µ_X) E[(Y − µ_Y)|X] ] = E[(X − µ_X)(aX + b − µ_Y)] = aσ_X².
²³ Note we treat the conditional probability as a random variable here, in just the same way as we can treat a conditional expectation as a random variable.
With these equations, we can solve to find a and b:

    a = σ_XY/σ_X² and b = µ_Y − (σ_XY/σ_X²) µ_X,

in agreement with Proposition 17.
We can use the tower property to prove a very general result which can be interpreted as saying that if you want to make a prediction for a random variable Y having observed the value of a random variable X, then the conditional expectation E[Y|X] is the best prediction for Y in the sense of minimizing the average squared error.

Proposition 21. Consider a pair of random variables X and Y. The choice of random variable Z which minimizes

    E[(Z − Y)²]

amongst all random variables which can be written Z = f(X) for some function f is

    Z = E[Y|X].
Proof. Let Ŷ = E[Y|X] and consider any Z which can be written as a function of X; then

    E[(Z − Y)²] = E[(Z − Ŷ + Ŷ − Y)²] = E[(Z − Ŷ)²] + E[(Ŷ − Y)²] + 2E[(Z − Ŷ)(Ŷ − Y)].

Now we will show that the last term is zero, and hence E[(Z − Y)²] is minimized by Z = Ŷ. Using the facts that Z − Ŷ is some function of X, and E[(Ŷ − Y)|X] = Ŷ − E[Y|X] = 0, we have

    E[(Z − Ŷ)(Ŷ − Y)] = E[ E[(Z − Ŷ)(Ŷ − Y)|X] ] = E[ (Z − Ŷ) E[(Ŷ − Y)|X] ] = 0.
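A quick numerical illustration of this optimality (an added Python/NumPy sketch, not from the notes): with Y = X² + noise, the conditional expectation E[Y|X] = X² has a visibly smaller mean squared error than, say, the best constant predictor E[Y].

    import numpy as np

    rng = np.random.default_rng(10)
    X = rng.standard_normal(500_000)
    Y = X**2 + rng.standard_normal(500_000)   # so that E[Y | X] = X^2

    print(np.mean((X**2 - Y)**2))             # ~ 1, the noise variance
    print(np.mean((Y.mean() - Y)**2))         # ~ 3: the best constant does much worse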
Proposition 22. Suppose that X and Y are independent random variables; then

    E[Y|X] = E[Y].

More generally, if X and Y are independent given Z, then

    E[Y|X, Z] = E[Y|Z].

Saying X and Y are independent given Z means that for any subsets B and B′ of R and any value z such that the conditional probabilities are defined,

    P(X ∈ B and Y ∈ B′|Z = z) = P(X ∈ B|Z = z) P(Y ∈ B′|Z = z).    (3.25)
Example 24. Consider again Example 19. X represents the proportion of the population that is infected, assumed to have a uniform distribution. Y is the number of individuals who are infected in a sample of size n, which we assume to have a conditional distribution given X = x which is the Binomial distribution with parameters x and n. Then we saw in Example 19 that the conditional distribution of X given Y = k is the beta distribution with density

    f_{X|Y}(x|k) = ((n + 1)!/(k!(n − k)!)) x^k (1 − x)^{n−k} if 0 ≤ x ≤ 1, and 0 otherwise.

We said previously that the naive estimate of X given Y = k would be k/n, and it's easy to check that the function x ↦ f_{X|Y}(x|k) attains its maximum at x = k/n. However this is not the conditional expectation of X given Y = k. We have²⁴

    ∫_0^1 x f_{X|Y}(x|k) dx = ∫_0^1 ((n + 1)!/(k!(n − k)!)) x^{k+1} (1 − x)^{n−k} dx = (k + 1)/(n + 2).

So

    E[X|Y] = (Y + 1)/(n + 2).

This is not greatly different from the naive estimate, unless n is small.
Now let Z be the number of individuals who are infected in a second sample of size m, and suppose that we want to predict Z from the value we observe for Y. We assume that the conditional distribution of Z given X = x is the Binomial distribution with parameters x and m, and consequently

    E[Z|X] = mX.

We further assume that Z and Y are independent given X. Using the tower property and the preceding proposition gives

    E[Z|Y] = E[ E[Z|X, Y] | Y ] = E[ E[Z|X] | Y ] = E[ mX | Y ] = m (Y + 1)/(n + 2).

²⁴ Notice the integral is another beta integral with one parameter changed.
Chapter 4

Infinite Sequences of Random Variables
4.1 Markov's inequality and the weak law of Large Numbers
Proposition 23 (Markov's inequality). Let Z be a random variable whose expectation exists and which satisfies P(Z ≥ 0) = 1. Then for any c > 0,

    P(Z ≥ c) ≤ E[Z]/c.

Proof. Consider the random variable Y defined by

    Y = c if Z ≥ c, and Y = 0 otherwise.

Y has a discrete distribution with two possible values, and we have P(Y = c) = P(Z ≥ c). Consequently

    E[Y] = c P(Y = c) = c P(Z ≥ c).

Notice also it follows from the definition of Y, and the non-negativity of Z, that P(Y ≤ Z) = 1. This implies, by positivity of expectation, that

    E[Y] ≤ E[Z],

from which the result follows.
Effective use of Markov's inequality depends on making a good choice of the random variable Z. Here is a classic case.

Proposition 24 (Chebyshev's inequality). Suppose X is a random variable whose expected value and variance both exist. Then, for any ε > 0,

    P(|X − E[X]| ≥ ε) ≤ var(X)/ε².

Proof. Let Z = (X − E[X])². Then E[Z] = var(X), and P(Z ≥ 0) = 1. Take c = ε² in Markov's inequality to obtain

    P(|X − E[X]| ≥ ε) = P(Z ≥ ε²) ≤ E[Z]/ε² = var(X)/ε².

This result is important because it allows us to use the variance, which we know describes the spread of the distribution about its mean, to obtain quantitative bounds on the distribution.
The following theorem describes the behaviour of the averages of n random quantities, (1/n) Σ_{k=1}^n X_k, when n is large. The key to the proof is to combine Chebyshev's inequality with the fact that the variance of the average (1/n) Σ_{k=1}^n X_k decreases with n.

Theorem 4 (Weak law of large numbers). Let X_1, X_2, X_3, . . . be a sequence of independent random variables having common mean µ and variance σ². Let S_n = Σ_{k=1}^n X_k. Then for any ε > 0,

    P(|S_n/n − µ| ≥ ε) → 0, as n tends to infinity.
Proof. First, by linearity of expectation,

    E[S_n/n] = (1/n) Σ_{i=1}^n E[X_i] = µ.

Secondly, the independence of the X_i implies that

    var(S_n/n) = (1/n²) Σ_{i=1}^n var(X_i) = σ²/n,

where we also use var(aX) = a² var(X) for any non-random constant a. Then applying Chebyshev's inequality to S_n/n gives

    P(|S_n/n − µ| ≥ ε) ≤ σ²/(nε²),

and the left-hand side tends to 0 as n tends to infinity.
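Both the theorem and the fact that the Chebyshev bound σ²/(nε²) is usually far from tight are visible in a simulation; the added Python/NumPy sketch below (not part of the notes) uses Bernoulli(1/2) variables.

    import numpy as np

    rng = np.random.default_rng(11)
    eps, mu, sigma2 = 0.1, 0.5, 0.25          # Bernoulli(1/2): mean 1/2, variance 1/4
    for n in (100, 1_000, 10_000):
        means = rng.integers(0, 2, size=(2_000, n)).mean(axis=1)
        print(n, (np.abs(means - mu) >= eps).mean(), sigma2 / (n * eps**2))
        # empirical P(|S_n/n - mu| >= eps) vs the Chebyshev bound sigma^2/(n eps^2)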
Suppose that A_1, A_2, . . . , A_n, . . . is a sequence of independent events with P(A_n) = p for all n ≥ 1. Define random variables X_n via

    X_n = 1 if A_n occurs, and X_n = 0 otherwise.    (4.1)

Since E[X_i] = p, the law of large numbers is, in this case, a result which is in accord¹ with the frequentist interpretation of probability: in many repetitions of the activity with the uncertain outcome², there is some characteristic frequency with which a particular outcome tends to occur.
The assumption that the variance of the X_i exists in the theorem is not necessary for the conclusion to hold, at least if we assume the X_i are identically distributed, but the proof is then more involved. The assumption that the mean of the X_i exists is, however, essential; otherwise the behaviour can be very different indeed.
Example 25. Consider a random path along the edges of an n × n chess board. The path starts at the bottom left corner, must always move in a direction towards the top right corner, and finishes there. All such paths are equally likely. The figure shows such a path on a 4 × 4 chess board. Suppose n = 100; use Chebyshev's inequality to estimate the probability that the path passes no more than 10 squares away from the centre of the chess board.
Define random variables X_1, X_2, . . . , X_{2n} by setting X_i = 1 if the ith edge of the path is “upwards” and X_i = −1 if the ith edge of the path is “rightwards”. The joint distribution of X_1, X_2, . . . , X_{2n} is uniform on the set

    { x ∈ {−1, 1}^{2n} : Σ_{i=1}^{2n} x_i = 0 }.
¹ This argument can seem a bit circular at first: we have proved something that justifies what we
assumed probability meant in the first place.
² Think of An as describing the outcome of the nth repetition of an activity like tossing a (biased)
coin.

Figure 4.1: A path on a 4 × 4 chess board

It follows³ that

P(Xi = 1) = 1/2 for each i,

and⁴

P(Xi = 1 and Xj = 1) = (n − 1)/(4n − 2) for all i ≠ j.
From this we find that E[Xi²] = 1 for each i and that E[Xi Xj] = −1/(2n − 1) for i ≠ j.
Now assume n is even; then the path passes more than k squares from the centre of
the board if and only if

|∑_{i=1}^n Xi| ≥ 2(k + 1).

We compute the variance of ∑_{i=1}^n Xi as

var(∑_{i=1}^n Xi) = E[(∑_{i=1}^n Xi)²] = ∑_{1≤i,j≤n} E[Xi Xj] = n E[X1²] + n(n − 1) E[X1 X2] = n²/(2n − 1).

Now Chebyshev's inequality gives us that the probability of not passing within k squares
of the centre is less than

n²/(4(k + 1)²(2n − 1)).

This is just less than 0.11 when n = 100 and k = 10, so there is at least an 89% chance of
the path passing within 10 squares of the centre of the board.
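The probability can also be estimated directly by Monte Carlo simulation: a uniformly
random path is just a uniformly random arrangement of n "up" edges and n "right"
edges, and the criterion derived above involves only the first n edges. A Python sketch
(the trial count is an arbitrary choice); it suggests the true probability is far smaller
than the Chebyshev bound, which illustrates how conservative the bound can be.

import numpy as np

# Monte Carlo for Example 25: the path misses the centre by more than k
# squares iff |X_1 + ... + X_n| >= 2(k + 1).
rng = np.random.default_rng(2)
n, k, trials = 100, 10, 20_000
steps = np.array([1] * n + [-1] * n)  # n "up" edges (+1) and n "right" edges (-1)
misses = 0
for _ in range(trials):
    rng.shuffle(steps)                # a uniformly random path
    if abs(steps[:n].sum()) >= 2 * (k + 1):
        misses += 1
print("estimated probability:", misses / trials)  # Chebyshev bound: about 0.104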

³ See tutorial problem sheet one; there are (2n choose n) paths in total, of which (2n − 1 choose n) have
the ith edge being "upwards".
⁴ Because there are (2n − 2 choose n) paths which have both the ith and jth edges being "upwards".

4.2 Convergence in Distribution
The law of large numbers says something about the distribution of the random variable of
interest, Sn/n, when n is large. It tells us that the mass of the distribution is concentrated
near the point µ = E[Sn /n]. In other problems we may be interested in a sequence of
random variables having distributions which have some other limiting behaviour. This is
best illustrated through examples.
Example 26. Suppose that (Xn, n ≥ 1) is a sequence of independent random variables
with each Xn uniformly distributed on [0, 1]. Let Mn = min_{1≤i≤n} Xi. We can find the
distribution of Mn easily: for any 0 ≤ x ≤ 1,

P(Mn > x) = P(Xi > x for all 1 ≤ i ≤ n) = (1 − x)^n.     (4.2)

So now consider nMn. The distribution of this random variable is described by the fact
that

P(nMn > x) = P(Mn > x/n) = (1 − x/n)^n,
provided that 0 ≤ x ≤ n. Letting n tend to infinity, and using a celebrated fact from
analysis, we obtain
P(nMn > x) → e^{−x},     (4.3)
for every x ≥ 0. Recall that if X is a random variable that has the exponential distribution
with rate 1, then

P(X > x) = e^{−x},

for all x ≥ 0. Comparing this with (4.3) we see that as n grows large the distribution of nMn
gets close to the distribution of X. In fact the density of the distribution of nMn is

fnMn(x) = (1 − x/n)^{n−1} if 0 ≤ x ≤ n, and 0 otherwise,     (4.4)

which converges to the density of the exponential distribution for each x ∈ R as illustrated
in Figure 4.2.
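The convergence can also be checked without plotting, by comparing tail probabilities
of nMn with e^{−x} directly. A Python sketch (the value n = 50 and the grid of x values
are arbitrary choices):

import numpy as np

# Compare P(nM_n > x), estimated by simulation, with the limit e^{-x}.
rng = np.random.default_rng(3)
n, trials = 50, 200_000
scaled_min = n * rng.uniform(size=(trials, n)).min(axis=1)

for x in [0.5, 1.0, 2.0, 3.0]:
    print(x, np.mean(scaled_min > x), np.exp(-x))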
In the preceding example the densities of the random variables of interest converged.
But the next example shows that it makes sense to talk about convergence in distribution
without densities existing at all.
Example 27. Suppose that (Xn , n ≥ 1) is a sequence of random variables with the distri-
bution of Xn being the discrete uniform distribution on {1, 2, 3, . . . , n}. Should we say that
Xn converges in distribution as n tends to infinity? It is certainly the case that mass functions
of Xn converge: for each positive integer k the mass function fXn (k) = 1/n provided
n ≥ k, so
fXn (k) → 0 as n → ∞.
But there is no random variable X whose distribution could be described by this limit.
Consider instead the random variables Xn /n. These random variables have discrete
distributions, but the support of the distribution of Xn /n changes with n in such a way

Figure 4.2: The density of the random variable nMn for n = 2 (green), n = 4 (red), n = 8
(cyan), n = 16 (blue) and n = 32 (purple)

that it is impossible to say a limit of the mass functions exists. However if we look at the
probabilities

P(Xn/n ≤ x) = ⌊nx⌋/n for any x ∈ [0, 1],     (4.5)

then we see (see Figure 4.3) that these converge:

P(Xn/n ≤ x) → x as n → ∞ for any x ∈ [0, 1].     (4.6)

This reflects that the distribution of Xn /n is becoming close to the distribution of a


random variable X which has the continuous uniform distribution on the interval [0, 1].
In our next example the random variables in our sequence have continuous distributions,
but the limiting distribution is discrete.
Example 28. Suppose that (Xn , n ≥ 1) is a sequence of random variables with the distri-
bution of Xn being a Gaussian distribution with mean c + 1/n and variance 1/n². Then,
since the variance of Xn is very small when n is large we know that the mass of the dis-
tribution of Xn will be concentrating close to its mean. But these means are approaching
c as n tends to infinity. We have

E[(Xn − c)²] = (c − E[Xn])² + var(Xn) = 2/n² → 0 as n → ∞,     (4.7)

and hence by Markov's inequality⁵

lim_{n→∞} P(|Xn − c| > ε) = 0, for all ε > 0.     (4.8)

⁵ Notice that this is exactly the form of the statement made in the law of large numbers.

Figure 4.3: The cumulative distribution function of Xn/n for n = 2, 10 and 100.

The limiting distribution puts mass one at the single point c, so if⁶ P(X = c) = 1, then
we have

P(Xn ≤ x) → P(X ≤ x) = 0 for all x < c,     (4.9)

and

P(Xn ≤ x) → P(X ≤ x) = 1 for all x > c.     (4.10)

But it is important to notice that we don't have the same statement holding at x = c:

P(Xn ≤ c) = Φ(−1) ≠ 1 = P(X ≤ c),     (4.11)

where, here, Φ denotes the cumulative distribution function of the standard Gaussian
distribution.
The following definition encompasses all three of the previous examples.
Definition 16. Let (Xn; n ≥ 1) be a sequence of random variables, and X some further
random variable. Then we say Xn converges in distribution to X as n tends to infinity,
written

Xn →dist X as n → ∞,

if the distribution functions FXn of Xn, and FX of X, satisfy

FXn(x) → FX(x) as n → ∞,
⁶ It might seem strange to treat this as a probability distribution, since the random variable X isn't
"random" at all, but it is perfectly fine to do this.

for every x ∈ R at which the function FX is continuous.

Notice that if X has a continuous distribution then FX is continuous everywhere, and


consequently
P(Xn ≤ x) = FXn (x) → FX (x) = P(X ≤ x)
for all x ∈ R. In fact, we have, for all a ≤ b

P(a ≤ Xn ≤ b) → P(a ≤ X ≤ b). (4.12)

To prove this we need to verify that P(Xn < a) converges to P(X < a), which we do as
follows. By continuity, for any ε > 0 there exists an h > 0 so that

FX(a − h) > FX(a) − ε/2.

Then, by convergence of FXn there exists an n0 such that for all n ≥ n0,

FXn(a − h) > FX(a − h) − ε/2.

Also, there exists an n1 so that for all n ≥ n1,

FXn(a) < FX(a) + ε.

So combining the statements, we see that for n ≥ max(n0, n1),

FX(a) + ε > FXn(a) ≥ P(Xn < a) ≥ FXn(a − h) > FX(a) − ε.

This shows that P(Xn < a) converges to FX(a) which, by continuity of the distribution
of X, is equal to P(X < a).
We now turn to look at distributions which are discrete, and whose support is always
a subset of Z. In this case convergence in distribution is equivalent to convergence of
mass functions. Suppose (Xn , n ≥ 1) is a sequence of random variables, each of which has
a discrete distribution with support contained in Z. Let X be a further random variable
whose support is contained in Z. Then Xn converges in distribution to X if, and only if,
for all k ∈ Z,
P(Xn = k) → P(X = k) as n → ∞. (4.13)
Example 29. Fix a constant λ > 0. Suppose, for n > λ, that Xn has the Binomial
distribution with parameters λ/n and n. Then Xn converges in distribution to X where
X has a Poisson distribution with parameter λ. To verify this we just compute, for a
fixed integer k ≥ 0,

lim_{n→∞} (n choose k) (λ/n)^k (1 − λ/n)^{n−k} = (λ^k/k!) e^{−λ}.     (4.14)
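The limit (4.14) is easy to check numerically; the following Python sketch (the choices
λ = 2 and k = 3 are arbitrary) evaluates the Binomial mass for increasing n alongside
the Poisson limit.

from math import comb, exp, factorial

# Binomial(n, lam/n) mass at k approaches the Poisson(lam) mass as n grows.
lam, k = 2.0, 3
for n in [10, 100, 1_000, 10_000]:
    p = lam / n
    print(n, comb(n, k) * p**k * (1 - p)**(n - k))
print("limit:", lam**k / factorial(k) * exp(-lam))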

4.3 The Central Limit Theorem
Stirling's formula

lim_{n→∞} n!/(√(2π) n^{n+1/2} e^{−n}) = 1

can be used to investigate the behaviour of Binomial probabilities

p(k, n, p) = (n choose k) p^k (1 − p)^{n−k}

when n is large and k is close to np. In particular we see (for n = 2m even)

p(m, n, 1/2) = ((2m)!/(m!)²) 2^{−2m} ∼ (√(4πm) (2m)^{2m} e^{−2m})/(2πm · m^{2m} e^{−2m}) · 2^{−2m} = √(2/(πn)).

More generally, fixing x ∈ R and taking a sequence of integers kn so that (2/√n)(kn − n/2) → x,
we have⁷

lim_{n→∞} (√n/2) p(kn, n, 1/2) = φ(x),     (4.15)

where φ(x) = (1/√(2π)) e^{−x²/2} is the standard Gaussian density. This was a calculation
first performed by De Moivre in 1738. No one really realised how important it was at the
time. Laplace (1810) extended this result to general p ∈ (0, 1) and at about the same
time Gauss used the density φ to model the distribution of general measurement errors.
Notice that the Binomial probabilities p(kn, n, 1/2) in (4.15) are all tending to zero as
n tends to infinity. Often to be useful we need to have a statement about the behaviour
of sums of Binomial probabilities. By realising that the sum

∑_k p(k, n, p),

where k ranges between np + z1√(np(1 − p)) and np + z2√(np(1 − p)), is an approximation
for the area under the graph of the function z ↦ φ(z) for z ∈ [z1, z2], it is possible to
prove the following.

Theorem 5. Fix p ∈ (0, 1). For n = 1, 2, 3, . . . let Xn have the Binomial distribution
with parameters n and p. Then for all −∞ ≤ z1 < z2 ≤ ∞,

P(np + z1√(np(1 − p)) ≤ Xn ≤ np + z2√(np(1 − p))) → ∫_{z1}^{z2} φ(z) dz as n tends to infinity,

where φ(z) = (1/√(2π)) e^{−z²/2} is the standard Gaussian density.

p
In the language of the previous lecture we say that (Xn −np)/ np(1 − p) converges in
distribution to a random variable having the standard Gaussian distribution. Remember
that, for a Binomially distributed random variable, E[Xn ] = np and var(Xn ) = np(1 − p).
7
Dont try to check this for yourself; its not particularly difficult but its quite messy!

69
p
So the mean and variance of (Xn − np)/ np(1 − p) are 0 and 1 which agree with the
mean and variance of the limiting distribution.
This theorem suggests approximating suitable sums of Binomial probabilities with
probabilities calculated from the Gaussian density, but the Theorem doesn't tell us how
good the approximation will be. In fact these approximations are very good if n is
only moderately large (such as n = 20), provided neither p nor 1 − p is too small.
Example 30. Suppose we toss a fair coin 10,000 times. Then the number of heads X has
the Binomial distribution with parameters n = 10,000 and p = 1/2. We know that the
most likely value of X is np = 5,000. The probability of getting exactly 5,000 heads is,
by (4.15), approximately 2φ(0)/√n = 0.008 (3 d.p.). The law of large numbers tells us
that we should expect, however, to observe a value of X close to 5,000, but it gives no
information that allows us to make sense of "close" in a more quantitative way. Using
the Gaussian approximation to the Binomial distribution helps. Taking⁸ z1 = −1.96,
z2 = 1.96 we have

P(4902 ≤ Xn ≤ 5098) ≈ ∫_{−1.96}^{1.96} φ(z) dz = 0.95.

We could use Chebyshev to give a bound on the probability of the same event; this would
give

P(4902 ≤ Xn ≤ 5098) ≥ 1 − var(Xn)/99² = 0.74 (3 d.p.).

In this case the Gaussian approximation is accurate to about 0.01.
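The exact Binomial probability is computable, so the quality of both approximations
can be checked directly; working with logarithms of the individual probabilities avoids
the enormous binomial coefficients. A Python sketch:

from math import erf, exp, lgamma, log, sqrt

def log_binom_pmf(k, n, p):
    # log of C(n, k) p^k (1 - p)^(n - k), via log-gamma to avoid overflow
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n, p = 10_000, 0.5
exact = sum(exp(log_binom_pmf(k, n, p)) for k in range(4902, 5099))
Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))
print("exact Binomial probability:", exact)
print("Gaussian approximation:", Phi(1.96) - Phi(-1.96))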
Theorem 5 turns out to be a special case of a much, much more general result known
as the Central Limit Theorem. Before we get to this, we need to learn how to use moment
generating functions to investigate the convergence of distributions. Recall that if X is a
random variable, then its moment generating function MX(t) is defined by

MX(t) = E[e^{tX}]

for all t ∈ R such that the expectation exists. In fact, either the moment generating
function exists only at t = 0, in which case it is essentially useless, or there exists some
δ > 0 so that it exists for t ∈ (−δ, δ). Possibly it exists for all t ∈ R, but this isn't necessary
for it to be useful.
We will make use of the following property of moment generating functions, which
explains their name.
Proposition 25. Suppose that X is a random variable with moment generating function
M(t) = E[e^{tX}] existing for t ∈ (−δ, δ) for some δ > 0. Then

M(t) = ∑_{n=0}^∞ (mn/n!) t^n,

where mn = E[X^n], the nth moment of X, exists for every n ≥ 1, and the power series
converges for all t ∈ (−δ, δ).
⁸ These numbers come from a table of the Gaussian distribution function. See https://www.mathsisfun.com/data/standard-normal-distribution-table.html

It is easy to see why this proposition should be true if one replaces e^{tX} by its Taylor
expansion, and then uses the linearity of expectations. However because it deals with an
infinite sum, justifying the use of linearity here requires some sophisticated analysis.
The key property of moment generating functions that makes them so useful is that
they characterise distributions.
Theorem 6. Suppose that X and Y are two random variables whose moment generating
functions are defined and equal for all |t| < δ for some δ > 0. Then X and Y have the
same distribution.
Closely related to this uniqueness property, and of interest to us now, is that the
convergence of moment generating functions implies convergence of distributions.
Theorem 7. Suppose that (Xn , n ≥ 1) is a sequence of random variables such that the
moment generating function Mn (t) of Xn exists if |t| < δ for some δ > 0 not depending
on n. Suppose further that X is a random variable whose moment generating function
M (t) also exists for |t| < δ and that

Mn (t) → M (t) as n → ∞ for each t ∈ (−δ, δ).

Then Xn converges in distribution to X as n tends to infinity.


Example 31. Suppose that Xn has the exponential distribution with rate parameter n.
Then the moment generating function of Xn is

Mn(t) = n/(n − t) for t < n.     (4.16)

Notice that although the function t ↦ n/(n − t) makes sense for t > n, it is not equal to
E[e^{tXn}], and Mn(t) doesn't exist for t ≥ n. Thus, for every⁹ t ∈ R,

Mn(t) → 1 as n → ∞.     (4.17)

Now M(t) = 1 for all t ∈ R is the moment generating function of X satisfying P(X =
0) = 1, and so Xn converges in distribution to such an X. This is a result which is very
easy to check directly from Definition 16, and so here using moment generating functions
was overkill, but it is still interesting to see how the argument works.
Suppose on the other hand that Xn has the exponential distribution with rate parameter
1/n. Then the moment generating function of Xn is

Mn(t) = 1/(1 − nt) for t < 1/n.     (4.18)

So now the limit is given by

lim_{n→∞} 1/(1 − nt) = 0 if t ≠ 0, and = 1 if t = 0.     (4.19)
⁹ This might worry you, since Mn(t) isn't defined for every t ∈ R, but it's okay because for each t,
Mn(t) is defined for all n > t and so it still makes sense to talk about a limit; you can always throw
away the first few terms of the sequence of random variables.

But in fact this is not the limit of the functions Mn, which are only defined if t < 1/n.
Because 1/n → 0 as n → ∞, there is no choice of δ > 0 so as to satisfy the conditions in
the statement of Theorem 7. In fact this sequence of random variables Xn does not converge
in distribution at all¹⁰.
Example 32. Suppose that Xn is uniformly distributed on the set {1, 2, . . . , n}. Then the
moment generating function of Xn is given by Mn(0) = 1 and

Mn(t) = (e^t/n) · (e^{tn} − 1)/(e^t − 1) for t ∈ R \ {0}.     (4.20)

The moment generating function of the random variable Xn/n is given by

E[e^{tXn/n}] = Mn(t/n) = (e^{t/n}/n) · (e^t − 1)/(e^{t/n} − 1).     (4.21)

Since (e^x − 1)/x → 1 as x → 0, we have, for a fixed t ≠ 0,

lim_{n→∞} Mn(t/n) = (e^t − 1)/t.     (4.22)

Since M(t) = (e^t − 1)/t for t ≠ 0 and M(0) = 1 is the moment generating function of a
random variable X with the uniform distribution on the interval [0, 1], this is in agreement
with Example 27.
Here is a useful result from first year analysis.

Lemma 1. Suppose (an, n ≥ 1) is a sequence of numbers satisfying an → 1 as n tends to
infinity, and

lim_{n→∞} n(an − 1) = b,

for some b ∈ R. Then

lim_{n→∞} an^n = e^b.

Example 33. Suppose that Xn has the Binomial distribution with parameters 1/2 and n.
Then the moment generating function of Xn is given by

Mn(t) = ((1 + e^t)/2)^n for t ∈ R.     (4.23)

Let's consider the random variable (2Xn − n)/√n. Notice that the mean of this random
variable is zero, and its variance one. Its moment generating function is given by

Mn(2t/√n) e^{−t√n} = ((1 + e^{2t/√n})/2)^n e^{−t√n} = ((e^{−t/√n} + e^{t/√n})/2)^n.     (4.24)

Since by Taylor expanding the exponentials we obtain

(e^{−t/√n} + e^{t/√n})/2 = 1 + t²/(2n) + · · · ,
¹⁰ Despite the fact their distribution functions satisfy lim_{n→∞} Fn(x) = 0 for all x ∈ R. This is because
F(x) = 0 for all x ∈ R isn't a distribution function corresponding to any probability distribution on R.

Lemma 1 implies that

Mn(2t/√n) e^{−t√n} → e^{t²/2} as n → ∞,     (4.25)

and we recognize the limit as the moment generating function of the standard Gaussian
distribution. This proves Theorem 5 in the case p = 1/2.
Now we come to one of the deepest and most beautiful results in probability and
statistics. The proof is a straightforward generalization¹¹ of the argument we used in the
preceding example.
Theorem 8 (The Central Limit Theorem). Let X1, X2, X3, . . . be a sequence of independent
and identically distributed random variables having common mean µ and variance σ².
Let Sn = ∑_{k=1}^n Xk. Then, as n tends to infinity,

(Sn − nµ)/√(nσ²)

converges in distribution to a random variable having the standard Gaussian distribution.
Proof. We give a proof under the additional assumption¹² that the moment generating
function of Xi exists. Let

M(t) = E[exp{t(Xi − µ)/σ}],

which we assume exists for t ∈ (−δ, δ) for some δ > 0. Now compute the moment
generating function of Sn* = (Sn − nµ)/√(nσ²), using independence, as follows:

Mn(t) = E[e^{tSn*}] = E[exp{(t/√n) ∑_{i=1}^n (Xi − µ)/σ}] = M(t/√n)^n.

The function M is infinitely differentiable, by Proposition 25, and so we can Taylor
expand M to give

M(t) = M(0) + M′(0)t + (M″(0)/2)t² + R(t),

where |R(t)| ≤ C|t|³ for t ∈ (−δ/2, δ/2) for some constant C > 0. Now, M(0) = 1, and
recalling Proposition 25 again we have

M′(0) = E[(Xi − µ)/σ] = 0 and M″(0) = var((Xi − µ)/σ) = 1.

So

Mn(t) = (1 + t²/(2n) + R(t/√n))^n → e^{t²/2},

as n tends to infinity by Lemma 1. Consequently the statement of the Theorem follows
from Theorem 7, and the fact that the moment generating function of the standard
Gaussian distribution is e^{t²/2}.
¹¹ But don't be misled; this is a very hard mathematical result; the difficulty is proving the properties
of moment generating functions we are using.
¹² The proof in the general case is very similar but we replace the moment generating function of X by
the characteristic function of X, defined by t ↦ E[e^{itX}]. Characteristic functions have similar properties
to moment generating functions, but they have the great advantage that they always exist, because
|e^{itX}| ≤ 1.

What is remarkable in this result is that the limiting distribution is always the same,
no matter what the distribution of the random variables Xi , subject to the assumption
of having a variance. This phenomenon is called¹³ the universality of the Gaussian
distribution.
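This universality is easy to witness by simulation. The following Python sketch (the
sample sizes are arbitrary choices) standardizes sums of uniform, exponential and
Bernoulli random variables; each estimate of P(Z > 1.96) should be close to
1 − Φ(1.96) = 0.025, whatever the underlying distribution.

import numpy as np

# Standardized sums of three quite different distributions.
rng = np.random.default_rng(4)
n, trials = 1_000, 10_000

samplers = {
    "uniform":     (lambda s: rng.uniform(size=s), 0.5, 1 / 12),
    "exponential": (lambda s: rng.exponential(size=s), 1.0, 1.0),
    "bernoulli":   (lambda s: rng.integers(0, 2, size=s), 0.5, 0.25),
}
for name, (draw, mu, var) in samplers.items():
    S = draw((trials, n)).sum(axis=1)
    Z = (S - n * mu) / np.sqrt(n * var)
    print(name, np.mean(Z > 1.96))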
In Theorem 8, we assume the distribution of each Xi is the same and the random
variables are independent of each other. There are other results which relax these
assumptions and give Gaussian approximations for sums of random variables each having
different distributions, and which may be (weakly) dependent. This is the reason the
Gaussian distribution is such a good model for "random errors" in many situations: such
errors arise by combining many smaller random effects. For similar reasons the log-normal
distribution is an excellent model for the variability of such diverse quantities as stock
market prices and blood pressures.
You will see the central limit theorem used to prove results in statistics about the
distribution of estimators of unknown parameters in ST219. The Gaussian distribution
(because of the central limit theorem) also appears in physics, geometry and even number
theory.

¹³ See https://terrytao.files.wordpress.com/2011/01/universality.pdf for more about this.

4.4 Convergence of random variables
We have previously given a definition of the statement "random variables X1, X2, . . .
converge in distribution to a random variable X". This is standard terminology used by
everyone, but it is actually quite misleading. Remember that from the very beginning of
the module we stressed that a random variable and its distribution are very different
mathematical objects. When we write Xn →dist X, what we are really saying is that the
distribution of Xn becomes close, in a certain sense, to the distribution of X as n increases.
But this does not necessarily mean that the random variable Xn gets close to the random
variable X in any reasonable sense.
Example 34. Remember random variables are functions on some sample space Ω. Our
sample space could have just two sample points Ω = {H, T}, with each sample point
being equally likely. Then let

Xn(H) = 1 and Xn(T) = 0 if n is even,

and

Xn(T) = 1 and Xn(H) = 0 if n is odd.

It's pretty clear that the distribution of Xn is the same for every n: P(Xn = 0) =
P(Xn = 1) = 1/2 for each n. So there is no doubt that this sequence of random variables
converges in distribution. But as functions, the sequence Xn is alternating between two
fixed functions, which are different, and in no sense getting close together as n increases.
This is reflected in what happens if we conduct the corresponding experiment: when we
observe the values of the sequence of random variables we either see 1, 0, 1, 0, 1, 0, 1, . . . ,
if we choose ω = T, or we see 0, 1, 0, 1, 0, . . . if we choose ω = H. In neither eventuality
does the sequence we observe converge!
A slightly less artificial example of the same phenomenon would be to take P(Xn =
0) = P(Xn = 1) = 1/2 for each n with X1, X2, . . . independent. Then, again, since the
distribution of Xn doesn't vary with n at all, it is trivially true that Xn →dist X where
P(X = 0) = P(X = 1) = 1/2. But if we observed values for the random variables
X1, X2, . . ., for example generated by repeatedly tossing a coin, we would see a "random"
sequence of 0s and 1s with no tendency to converge to any particular value.
An even more interesting example occurs in the setting of the Central Limit Theo-
rem. The random variables Sn∗ converge in distribution to a random variable having the
standard Gaussian distribution. However if we generate observations of the sequence of
random variables S1∗ , S2∗ , . . . then the values obtained will oscillate, with the oscillations
slowly becoming larger as n increases. There is a beautiful theorem in advanced prob-
ability theory called the Law of the Iterated Logarithm that describes this oscillating
behaviour very precisely.
We want a notion of convergence for random variables that relates to the values we
would observe if the experiment was conducted. Your first guess for a definition that
captures this idea might be to say Xn converges to X if Xn (ω) converges to X(ω) for
every ω ∈ Ω, the sample space. This is a good idea; it doesn’t quite work, but it can
be fixed. But fixing it turns out to be quite interesting! So we will start with something
easier, if slightly less intuitive.

Definition 17. Let X1, X2, . . . be a sequence of random variables, and X a further random
variable, all defined on the same sample space. Then we say Xn converges to X in
probability, written

Xn →prob X,

if for every ε > 0,

P(|Xn − X| > ε) → 0, as n → ∞.

This definition should remind you of the statement of the law of large numbers, the
conclusion of which we can now write as

(1/n) ∑_{i=1}^n Xi →prob µ,

where µ = E[Xi].
Example 35. Suppose X is some random variable, and ε1, ε2, . . . a sequence of random
variables each having the standard Gaussian distribution. Let

Xn = X + (1/n) εn,

so we can think of Xn as a noisy measurement of X, with an error that tends to decrease
with n. We have Xn →prob X because

P(|Xn − X| > ε) = P(|εn| > nε) = (2/√(2π)) ∫_{nε}^∞ e^{−z²/2} dz → 0.

Notice that we don't have to evaluate the integral to see that it converges to 0: this is a
consequence simply of nε → ∞ and the fact the integral is finite.
Now let’s go back to that first idea of convergence for random variables. We need to
consider the possibility that there might be sample points in the sample space for which
it is not true that Xn(ω) converges to X(ω). But provided the set of all these "bad"
sample points has probability zero, it shouldn’t stop us from saying Xn converges to X,
because when we conduct the experiment we will never choose¹⁴ one of these bad ω.

Definition 18. Let X1, X2, . . . be a sequence of random variables, and X a further random
variable, all defined on the same sample space Ω. Then we say Xn converges to X
almost surely, written

Xn →a.s. X,

if

P({ω ∈ Ω : Xn(ω) → X(ω) as n → ∞}) = 1.

Sometimes this convergence is easy to check as in the following example.


¹⁴ "We could choose a bad ω, but we won't" might seem paradoxical, but it's really no different from
what happens with picking a point uniformly at random from, say, the disc in the plane: you could pick
the origin, but you won't, because that is an event with probability zero.

Example 36. Suppose that X is some random variable. Define random variables

Xn = ⌊nX⌋.

Then no matter what the value of X, we have |Xn/n − X| ≤ 1/n and consequently
Xn(ω)/n → X(ω) for every sample point in the sample space on which X is defined.
Thus Xn/n →a.s. X.
But in general the event {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞} can be complicated,
and it can be difficult to compute its probability. A key tool is the following notion.

Definition 19. Let A1, A2, . . . be a sequence of events, all subsets of some sample space
Ω. Then the event An occurs infinitely often is defined to be the event

{ω ∈ Ω : ω ∈ An for infinitely many n} = ⋂_{m=1}^∞ ⋃_{n=m}^∞ An.

Then we have the following criterion for almost sure convergence.


Proposition 26. Xn →a.s. X if and only if

P(|Xn − X| > ε occurs infinitely often) = 0,

for every ε > 0.

Theorem 9 (The Borel-Cantelli Lemmas). Let A1, A2, . . . be a sequence of events, all
subsets of some sample space Ω.

If ∑_{n=1}^∞ P(An) < ∞ then

P(An occurs infinitely often) = 0.

If ∑_{n=1}^∞ P(An) = ∞ and A1, A2, . . . are independent events then

P(An occurs infinitely often) = 1.

Example 37. Suppose X is some random variable, and ε1, ε2, . . . a sequence of independent
random variables each having the standard Cauchy distribution. Let

Xn = X + (1/n) εn,

so as in Example 35 we can think of Xn as a noisy measurement of X, with an error that
decreases with n. The same argument as in Example 35 shows that Xn →prob X. But let
An be the event that {|Xn − X| > 1} = {|εn| > n}. Then

∑_{n=1}^∞ P(An) = ∑_{n=1}^∞ ∫_n^∞ 2dz/(π(1 + z²)) ≥ (1/π) ∑_{n=1}^∞ 1/n = ∞,

where we have estimated the integral using ∫_n^∞ 2dz/(π(1 + z²)) ≥ (1/π) ∫_n^∞ dz/z² = 1/(πn).
Consequently by the second Borel-Cantelli Lemma, P(|Xn − X| > 1 infinitely often) = 1
and Xn cannot be converging almost surely to X.
This is strange! What is happening? Convergence in probability of Xn to X tells us
that the error |Xn − X| will be below any given threshold with probability that tends to
one as n tends to infinity. But nevertheless, occasionally large errors occur, and although
these become increasingly rare as n increases they continue to occur indefinitely.
Now suppose instead that

Xn = X + (1/n²) εn.

Then, fixing an arbitrary ε > 0, let An be the event that {|Xn − X| > ε} = {|εn| > n²ε}.
Then

∑_{n=1}^∞ P(An) = ∑_{n=1}^∞ ∫_{n²ε}^∞ 2dz/(π(1 + z²)) ≤ (2/(επ)) ∑_{n=1}^∞ 1/n² < ∞.

Thus by the first Borel-Cantelli Lemma P(|Xn − X| > ε infinitely often) = 0 and Xn
does converge almost surely to X.
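The contrast between the two dampings shows up on a single simulated sample path.
The Python sketch below (the run length is an arbitrary choice) counts the errors
exceeding 1 in each case; with 1/n damping large errors keep recurring, while with 1/n²
damping they typically die out almost immediately.

import numpy as np

# One realization of the noise sequences from Example 37, under both dampings.
rng = np.random.default_rng(5)
N = 100_000
noise = np.abs(rng.standard_cauchy(size=N))
n = np.arange(1, N + 1, dtype=np.float64)

for label, err in [("1/n", noise / n), ("1/n^2", noise / n**2)]:
    big = np.flatnonzero(err > 1)
    last = int(big[-1]) + 1 if big.size else None
    print(label, "errors exceeding 1:", big.size, "last one at n =", last)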

4.5 More on convergence of random variables
The relationship between the three notions of convergence we have discussed is given in
the following theorem; the reverse implications do not hold.

Theorem 10. Let X1, X2, . . . be a sequence of random variables, and X a further random
variable, all defined on the same sample space Ω. Then

Xn →a.s. X  =⇒  Xn →prob X  =⇒  Xn →dist X.

Proof. We will give the proof of the first implication. So suppose that (Xn, n ≥ 1) is a
sequence of random variables converging almost surely to a limit X.
Fix some ε > 0, and consider the event

A = {ω ∈ Ω : |Xn(ω) − X(ω)| > ε for infinitely many n}.

By Proposition 26 the probability of this event is zero, and by Definition 19 it can be
written as

⋂_{m=1}^∞ ⋃_{n=m}^∞ An,

where An = {ω ∈ Ω : |Xn(ω) − X(ω)| > ε}. For each m, let

Em = ⋃_{n=m}^∞ An = {ω ∈ Ω : |Xn(ω) − X(ω)| > ε for some n ≥ m}.

The sequence of events (Em, m ≥ 1) satisfies

Em ⊇ Em+1 for all m,

and consequently¹⁵

lim_{m→∞} P(Em) = P(⋂_{m=1}^∞ Em).

But the right-hand side is the probability of the event A and so is zero. Finally we have
An ⊆ En for each n, and so since

0 ≤ P(An) ≤ P(En) → 0 as n tends to infinity,

we deduce that

lim_{n→∞} P(|Xn − X| > ε) = lim_{n→∞} P(An) = 0.

Since ε > 0 was arbitrary this proves convergence of Xn to X in probability.

¹⁵ This is a fundamental property of the probability measure P which follows from countable additivity;
you may have seen it in ST115. If not, don't worry: using countable additivity of P is very important in
more advanced work, and you may do more of this in modules next year.

We have seen examples in the previous lecture which show that convergence in probability
does not imply convergence almost surely, and neither does convergence in distribution
imply convergence in probability in general. The exception is the special case in
which the limit is a constant c. In this case convergence in probability and convergence
in distribution are equivalent and we usually write Xn →prob c.
Example 38. This is a continuation of Example 36. Suppose the random variable X has
the exponential distribution so that P(X > t) = e^{−t} for any t ≥ 0. Then Xn = ⌊nX⌋ has
a geometric distribution, for we have, for any non-negative integer k,

P(Xn = k) = P(k ≤ nX < k + 1) = e^{−k/n}(1 − e^{−1/n}).

We saw that Xn/n tends to X almost surely. From the preceding Theorem we can
conclude that if X1, X2, . . . , Xn, . . . are random variables having geometric distributions
with parameters (1 − e^{−1/n}) then Xn/n →dist X where X has the exponential distribution
with rate parameter 1.
The properties of limits for sums and products that you are familiar with for sequences
of real numbers hold too for convergence of random variables in either the almost sure
sense or in probability.

Proposition 27. Let (Xn, n ≥ 1) and (Yn, n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,

Xn →a.s. X, and Yn →a.s. Y.

Then

Xn + Yn →a.s. X + Y,

and

Xn Yn →a.s. XY.

The analogous statements hold for convergence in probability.

We need to be more careful about making similar statements about convergence in


distribution. The problem is that if Xn and Yn are known to be converging in distribution
to limits X and Y , then the distribution of the sum Xn + Yn can behave in different
ways, and possibly not converge at all, depending on the joint distributions of Xn and Yn .
However we do have the following very useful results.

Proposition 28. Let (Xn, n ≥ 1) and (Yn, n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,

Xn →dist X, and Yn →dist Y.

Suppose also that, for each n, Xn and Yn are independent. Then

Xn + Yn →dist X + Y,

where X and Y are independent.

80
In this next proposition, we don't assume anything about the relationship between Xn
and Yn but instead consider the case that the limit of Yn is a constant.

Proposition 29. Let (Xn, n ≥ 1) and (Yn, n ≥ 1) be two sequences of random variables.
Suppose that, as n tends to infinity,

Xn →dist X, and Yn →prob c,

where c is some constant. Then

Xn + Yn →dist X + c,

and

Xn Yn →dist cX,

and provided c ≠ 0,

Xn/Yn →dist X/c.
Finally we consider the effect of composing with a continuous function.

Proposition 30. Let (Xn, n ≥ 1) be a sequence of random variables and g : R → R be a
continuous function. Suppose

Xn →a.s. X.

Then

g(Xn) →a.s. g(X).

The analogous statement holds if convergence almost surely is replaced by convergence in
probability, and also if it is replaced by convergence in distribution.

Example 39. Fix p ∈ (0, 1) and let Xn be a random variable having the Binomial
distribution with parameters n and p. Let us show that, as n tends to infinity,

(Xn − np)/√(Xn(n − Xn)/n) →dist Z,     (4.26)

where Z is a random variable with the standard Gaussian distribution. Note that there
appears to be a problem with this statement; it is a ratio of random variables and the
denominator can be zero with positive probability. But the probability that the denominator
equals zero tends to zero as n tends to infinity. So, you could define the ratio to take any
value you like in the eventuality the denominator is zero, without affecting its convergence
in distribution. For this reason, we don't bother about the possibility that the denominator
is zero.
First, the Central Limit Theorem¹⁶ implies

(Xn − np)/√(np(1 − p)) →dist Z,     (4.27)

¹⁶ This is because we can write Xn as a sum ∑_{i=1}^n εi of independent random variables, as in Section 1.3.

where Z is a random variable with the standard Gaussian distribution. On the other hand
the law of large numbers tells us that

Xn/n →prob p.

Using Proposition 30 we can deduce from this that

√(Xn(n − Xn)/(n²p(1 − p))) →prob 1.     (4.28)

Finally we combine the two limits (4.27) and (4.28) using the third statement from
Proposition 29 to obtain (4.26).
This result is used in practical applications in Statistics. The context is the same
as Example 19; when estimating the proportion of a population infected with a disease,
we test a sample of size n and find k infected individuals. The natural estimate of the
proportion of the population that is infected is then k/n, but how accurate should we
judge this estimate to be? We can treat the number k as the observed value of a random
variable X having the Binomial distribution with parameters p and n where p represents
the unknown proportion of the population which is infected. The convergence (4.26) is
used to construct an approximate confidence interval¹⁷ for p, giving a range of values of
p that are “reasonable” having observed k infected individuals in the sample.
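A minimal sketch of that construction, often called the Wald interval: rearranging
|Xn − np| ≤ 1.96 √(Xn(n − Xn)/n) with Xn = k gives a range of values of p centred at
k/n. The numbers k = 37 and n = 500 in the Python sketch below are invented purely
for illustration.

from math import sqrt

# Approximate 95% confidence interval for p based on (4.26): all p with
# |k/n - p| <= 1.96 * sqrt((k/n)(1 - k/n)/n).
def interval(k, n, z=1.96):
    phat = k / n
    half = z * sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half

print(interval(37, 500))  # roughly (0.051, 0.097)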

¹⁷ I don't want to explain this now, but it's something to look forward to in ST219.

4.6 The strong law of large numbers
We finish the module with one of the greatest results of probability theory.

Theorem 11 (The strong law of large numbers). Let (Xi, i ≥ 1) be a sequence of
independent and identically distributed random variables whose mean µ exists. Then

(1/n) ∑_{i=1}^n Xi →a.s. µ.

Proof. We will prove the special case in which we make the strong extra assumption that
the moment generating function of Xi exists.
Let M(t) = E[e^{tXi}], which we assume exists for t ∈ (−δ, δ). By independence

E[e^{t ∑_{i=1}^n Xi}] = M(t)^n.

Suppose ε > 0. Then M(0) = 1, and by Proposition 25, M′(0) = µ, and so for some
h > 0,

M(t) < 1 + (µ + ε)t

for all 0 < t < h. Pick such a t. By Markov's inequality

P(∑_{i=1}^n Xi ≥ n(µ + ε)) = P(e^{t ∑_{i=1}^n Xi} ≥ e^{n(µ+ε)t}) ≤ M(t)^n e^{−n(µ+ε)t} = e^{nβ},

where β = log M(t) − (µ + ε)t < log(1 + (µ + ε)t) − (µ + ε)t ≤ 0, since log(1 + x) ≤ x for
all x > −1. Since β < 0 we have

∑_{n=1}^∞ P(∑_{i=1}^n Xi ≥ n(µ + ε)) ≤ ∑_{n=1}^∞ e^{nβ} < ∞,

and hence by the Borel-Cantelli Lemma,

P(∑_{i=1}^n Xi ≥ n(µ + ε) for infinitely many n) = 0.

An analogous argument shows that

P(∑_{i=1}^n Xi ≤ n(µ − ε) for infinitely many n) = 0,

so

P(|(1/n) ∑_{i=1}^n Xi − µ| ≥ ε for infinitely many n) = 0,

and the conclusion follows from Proposition 26.

The full proof, without making extra assumptions, is much harder.
An immediate consequence of this result, and the fact that almost sure convergence
implies convergence in probability, is that we can improve on the statement of the weak
law of large numbers, Theorem 4, which we previously proved using Chebyshev's
inequality: an argument which required finite variances to exist.

Corollary 1. Let (Xi, i ≥ 1) be a sequence of independent and identically distributed
random variables whose mean µ exists. Then

(1/n) ∑_{i=1}^n Xi →prob µ.

The hypotheses in the statement of the strong law are all needed for the conclusion to
hold. If the mean of the distribution of the random variables Xi does not exist then
(1/n) ∑_{i=1}^n Xi need not converge almost surely at all. Taking the sequence (Xi; i ≥ 1) to
be independent random variables each having the standard Cauchy distribution, we have
seen that

(1/n) ∑_{i=1}^n Xi =dist X1,     (4.29)

and consequently (1/n) ∑_{i=1}^n Xi cannot converge almost surely to any constant. In fact
(1/n) ∑_{i=1}^n Xi doesn't converge in either the sense of probability or almost surely to any
limit at all.
The next example shows that the assumption that the random variables Xi be iden-
tically distributed is also essential.
Example 40. Let (Xi, i ≥ 1) be a sequence of independent random variables satisfying

P(Xi = i² − 1) = 1/i² and P(Xi = −1) = 1 − 1/i².     (4.30)

Then E[Xi] = 0 for each i, but we can show that

(1/n) ∑_{i=1}^n Xi →a.s. −1.     (4.31)

Let E be the event defined by

E = {ω ∈ Ω : Xi(ω) = i² − 1 for infinitely many i}.     (4.32)

We can compute the probability of E using the Borel-Cantelli Lemma. We have

∑_{i=1}^∞ P(Xi = i² − 1) = ∑_{i=1}^∞ 1/i² < ∞,     (4.33)

and so by Theorem 9, P(E) = 0. The complement of the event E is

E^c = {ω ∈ Ω : there exists an N such that Xi(ω) = −1 for i ≥ N}.     (4.34)

It is important to note that the N that appears in the above expression can depend on
the sample point ω: that is to say, N is a random variable. If ω ∈ E^c, then for n ≥ N(ω),

(1/n) ∑_{i=1}^n Xi(ω) = (1/n) ∑_{i=1}^{N(ω)} Xi(ω) − (n − N(ω))/n,     (4.35)

and so as n → ∞, we have (1/n) ∑_{i=1}^n Xi(ω) → −1. But we know that P(E^c) = 1, so this
proves (4.31).
Example 41. Suppose x is a real number satisfying 0 ≤ x < 1; it can be expressed in
decimal form

x = 0.x1x2x3 . . .

where xi ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} is the ith digit in the expansion. More formally we
have

x = ∑_{i=1}^∞ xi · 10^{−i}.

This expansion is essentially unique¹⁸. Let us define for each k ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and n ≥ 1 the function fn,k : [0, 1) → Z,

fn,k(x) = |{1 ≤ i ≤ n : xi = k}|;

in other words fn,k counts the number of times the digit k appears in the first n decimal
places of the expansion of x. Now let

Ek = {x ∈ [0, 1) : lim_{n→∞} (1/n) fn,k(x) exists and equals 1/10}.

It's quite tricky at first to find a choice of x ∈ [0, 1) that belongs to any of these sets. Any
number that has a decimal expansion that terminates definitely doesn't belong to any Ek.
The cleverly chosen rational number

123,456,789/9,999,999,999 = 0.012345678901234567890 . . .

belongs to Ek for every k because the decimal expansion keeps repeating the sequence
0123456789. What about irrational choices of x such as x = π − 3? Surprisingly, it is not
even known whether the digit k occurs infinitely often in the decimal expansion of π − 3,
and so we cannot say whether π − 3 belongs to Ek or not¹⁹.
However there is a very simple way to find numbers x which do belong to Ek. Suppose X
is a random number chosen uniformly at random from [0, 1), and let Xi be the ith digit in
its decimal expansion. Then²⁰ (Xi; i ≥ 1) is a sequence of independent random variables
each uniformly distributed in the set {0, 1, 2, . . . , 9}. Now for any digit k of interest let

Yi = 1 if Xi = k, and Yi = 0 otherwise.
¹⁸ To get uniqueness you just have to agree that you will write one tenth as 0.1 and not 0.099999999 . . .,
and so on.
¹⁹ If you want to read about more examples, look at https://en.wikipedia.org/wiki/Normal_number
²⁰ It's messy to write out a formal proof of this, but not surprising: just think about, for example,
P(X1 = 1 and X2 = 9) to get the idea.

The sequence of random variables (Yi ; i ≥ 1) is a sequence of independent and identically
distributed random variables with E[Yi ] = 1/10. Applying the strong law of large numbers
to (Yi ; i ≥ 1) shows that the probability that X belongs to Ek is one.
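Since, by footnote 20, the digits of a uniformly chosen X are independent and uniform
on {0, 1, . . . , 9}, we can generate them directly and watch the strong law at work. A
Python sketch (the number of digits is an arbitrary choice):

import random

# Digit frequencies in the first n decimal digits of a uniformly chosen X;
# by the strong law each frequency should be close to 1/10.
random.seed(6)
n = 100_000
digits = [random.randrange(10) for _ in range(n)]
for k in range(10):
    print(k, digits.count(k) / n)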
Example 42. Imagine a simple gambling game, where you bet on repeated tosses of a
biased coin. On each toss, if you stake a certain amount x then you either lose your stake
with probability q or receive back 2x with probability p = 1 − q. Assume p > 1/2 > q so
the game is biased in your favour. You start with an initial fund of money W0 and make
repeated bets. What is a good strategy to follow for deciding how much to bet on each
toss?
Since the game is biased in your favour it might at first appear sensible to bet as much
as possible so as to maximize your expected winnings. But in fact this is a foolish idea,
since if you bet your entire fund each time, you will, sooner or later, lose everything and
go bust (you can't bet using borrowed money).
Suppose you bet a fraction f of your current funds on each toss. Then your funds Wn
after the nth toss satisfy

Wn = (1 + f)W_{n−1} if you win the nth toss, and Wn = (1 − f)W_{n−1} if you lose it.     (4.36)

Notice that E[Wn] = (1 + (p − q)f)^n W0 grows geometrically, and to maximize this growth
rate you might pick f close to 1 (although not equal to 1, bearing in mind our earlier
comment about going bankrupt).
But consider how the random variable log Wn behaves. From (4.36) we obtain

Wn = W0 (1 + f)^{Xn} (1 − f)^{n−Xn},     (4.37)

where Xn is the number of the first n tosses that resulted in wins for you. Consequently,
since by the strong law of large numbers Xn/n → p almost surely,

(1/n) log Wn = (1/n) log W0 + (Xn/n) log(1 + f) + ((n − Xn)/n) log(1 − f) →a.s. p log(1 + f) + q log(1 − f),     (4.38)

as n tends to infinity. This is bad news, because for f close to one this limit is negative,
and consequently, for such an f,

Wn →a.s. 0.     (4.39)

Even though your expected winnings are growing, in fact you are guaranteed to lose all
your money if you keep playing with this strategy!
We can now suggest a good strategy: pick f so as to maximize the long term growth
rate of Wn given by the limit in (4.38). Some easy calculus shows this maximum occurs
at f = p − q, and moreover the growth rate is positive there. Good news!
If p = 0.6 then the proportion to bet to maximize the growth rate would be 20%.
In a real life experiment, 61 people were given an initial $25 to play this game (knowing
the bias of the coin was 3:2 in favour of heads) for up to 300 coin tosses, or until they
had reached maximum winnings of $250. Betting 20% of their available funds at each
toss, one would expect nearly all the participants to reach the $250 maximum. In fact 17
participants went bust and only 13 reached the maximum. Average winnings were $90.
40 participants gambled on tails at some point in the experiment²¹.
²¹ The moral is most people are very poor gamblers. No wonder Las Vegas makes a lot of money.
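The conclusions of Example 42 are easy to reproduce by simulating many plays of the
game. The following Python sketch (which, for simplicity, ignores the $250 cap used in
the real experiment) compares the growth-optimal fraction f = p − q = 0.2 with the
reckless choice f = 0.9, summarising each by the median of the final funds; the first
grows enormously, the second collapses to essentially zero, in line with (4.38) and (4.39).

import numpy as np

# Final funds after 300 tosses with p = 0.6 and W_0 = 25, for two choices of f.
rng = np.random.default_rng(7)
p, n_tosses, trials = 0.6, 300, 10_000

for f in [0.2, 0.9]:
    wins = rng.random(size=(trials, n_tosses)) < p
    log_w = np.log(25.0) + np.where(wins, np.log(1 + f), np.log(1 - f)).sum(axis=1)
    print(f, "median final funds:", np.exp(np.median(log_w)))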
