
Chapter 3

Random variables and univariate distributions

3.1 Recommended reading


Casella, G. and R.L. Berger, Statistical Inference. Chapter 1, Sections 1.4–1.6; Chapter
2, Sections 2.1–2.3; Chapter 3, Sections 3.1–3.3; Chapter 5, Section 5.5.

3.2 Learning outcomes


On completion of this chapter, you should be able to:

provide both formal and informal definitions of a random variable

formulate problems in terms of random variables

explain the characteristics of distribution functions

explain the distinction between discrete and continuous random variables

provide the probability mass function (pmf) and support for some common discrete
distributions

provide the probability density function (pdf) and support for some common
continuous distributions

explain whether a function defines a valid mass or density

calculate moments for discrete and continuous distributions

prove and manipulate inequalities involving the expectation operator

derive moment generating functions for discrete and continuous distributions

calculate moments from a moment generating function

calculate cumulants from a cumulant generating function

determine the distribution of a function of a random variable

summarise scale/location and probability integral transformations.


3.3 Introduction
This chapter extends the discussion of (univariate) random variables and common distributions of random variables introduced in ST104b Statistics 2, so it is advisable to review that material first before proceeding. Here, we continue with the study of random variables (which, recall, associate real numbers – the sample space – with experimental outcomes), probability distributions (which, recall, associate probability with the sample space), and expectations (which, recall, provide the central tendency, or location, of a distribution). Fully appreciating random variables, and the associated statistical notation, is a core part of understanding distribution theory.

3.4 Mapping outcomes to real numbers


A random variable is used to assign real numbers to the outcomes of an experiment. The outcomes themselves may be non-numeric (in which case a mapping of outcomes to real numbers is required), or numeric (when no such mapping is necessary). A probability distribution then shows how likely different outcomes are.

Example 3.1 Continuing Example 2.26 (flipping a coin three times), suppose we define the random variable X to denote the number of heads. We assume the outcomes of the flips are independent (which is a reasonable assumption), with a constant probability of success (which is also reasonable, as it is the same coin). Hence X can take the values 0, 1, 2 and 3, and so is a discrete random variable such that $\Omega = \{0, 1, 2, 3\}$. For completeness, when writing out the probability (mass) function we should specify the probability of X for all real values, not just those in $\Omega$. This is easily achieved with '0 otherwise'. Hence:
$$P(X = x) = p_X(x) = \begin{cases} (1-\pi)^3 & \text{for } x = 0 \\ 3\pi(1-\pi)^2 & \text{for } x = 1 \\ 3\pi^2(1-\pi) & \text{for } x = 2 \\ \pi^3 & \text{for } x = 3 \\ 0 & \text{otherwise.} \end{cases}$$

The cumulative distribution function, $F_X(x) = P(X \le x)$, in this case is:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ (1-\pi)^3 & \text{for } 0 \le x < 1 \\ (1-\pi)^2(1+2\pi) & \text{for } 1 \le x < 2 \\ 1 - \pi^3 & \text{for } 2 \le x < 3 \\ 1 & \text{for } x \ge 3. \end{cases}$$
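As an aside, the mass and distribution functions above are those of a Bin(3, $\pi$) random variable, so they can be checked numerically. The following sketch is an addition (Python with scipy is assumed, and $\pi = 0.4$ is an arbitrary choice); it compares the formulas with scipy.stats.binom.

from scipy.stats import binom

pi = 0.4   # arbitrary illustrative value of the success probability
pmf = {0: (1 - pi)**3, 1: 3*pi*(1 - pi)**2, 2: 3*pi**2*(1 - pi), 3: pi**3}
cdf = {0: (1 - pi)**3, 1: (1 - pi)**2*(1 + 2*pi), 2: 1 - pi**3, 3: 1.0}

for x in range(4):
    assert abs(pmf[x] - binom.pmf(x, 3, pi)) < 1e-12   # mass function matches
    assert abs(cdf[x] - binom.cdf(x, 3, pi)) < 1e-12   # distribution function matches
print('pmf and cdf agree with Bin(3, pi)')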

Formally, we can define a random variable drawing on our study of probability space in Chapter 2. Recall that we define probability as a measure which maps events to the unit interval, i.e. [0, 1]. Hence in the expression '$P(A)$', the argument '$A$' must represent an event. So $P(X = x)$ denotes the probability of the event $\{X = x\}$, similarly $P(X \le x)$ denotes the probability of the event $\{X \le x\}$.

Random variable

A random variable is a function $X : \Omega \to \mathbb{R}$ with the property that if:
$$A_x = \{\omega \in \Omega : X(\omega) \le x\}$$
then:
$$A_x \in \mathcal{F} \quad \text{for all } x \in \mathbb{R}.$$
Therefore, $A_x$ is an event for every real-valued x.

The above definition tells us that random variables map experimental outcomes to real numbers, i.e. $X : \Omega \to \mathbb{R}$. Hence the random variable X is a function, such that if $\omega$ is an experimental outcome, then $X(\omega)$ is a real number. The remainder of the definition allows us to discuss quantities such as $P(X \le x)$. Technically, we should write in full $P(\{\omega \in \Omega : X(\omega) \le x\})$, but will use $P(X \le x)$ for brevity.

3.4.1 Functions of random variables

Our interest usually extends beyond a random variable X, such that we may wish to consider functions of a random variable.

Example 3.2 Continuing Example 3.1, let Y denote the number of tails. Hence $Y(\omega) = 3 - X(\omega)$, for any outcome $\omega$. More concisely, this can be written as $Y = 3 - X$, i.e. as a linear transformation of X.

Function of a random variable

If $g : \mathbb{R} \to \mathbb{R}$ is a well-behaved¹ function, $X : \Omega \to \mathbb{R}$ is a random variable and $Y = g(X)$, then Y is a random variable, $Y : \Omega \to \mathbb{R}$ with $Y(\omega) = g(X(\omega))$ for all $\omega \in \Omega$.

3.4.2 Positive random variables

Many observed phenomena relate to quantities which cannot be negative, by definition.


Examples include workers’ incomes, examination marks etc. When modelling such
real-world phenomena we need to capture this ‘stylised fact’ by using random variables
with sample spaces restricted to non-negative real values.

¹ All functions considered in this course are well-behaved, and so we will omit a technical discussion.


Positive random variable

A random variable X is positive, denoted $X \ge 0$, if it takes a non-negative value for every possible outcome, i.e. $X(\omega) \ge 0$ for all $\omega \in \Omega$.

Example 3.3 We extend Examples 3.1 and 3.2. When flipping a coin three times, the sample space is:
$$\Omega = \{TTT, TTH, THT, HTT, HHT, HTH, THH, HHH\}.$$
Let $\mathcal{F}$ be the collection of all possible subsets of $\Omega$, i.e. $\mathcal{F} = \{0, 1\}^\Omega$, such that any set of outcomes is an event. As the random variable X denotes the number of heads, its full specification is:
$$X(TTT) = 0$$
$$X(TTH) = X(THT) = X(HTT) = 1$$
$$X(HHT) = X(HTH) = X(THH) = 2$$
and $X(HHH) = 3$.

A few examples of cumulative probabilities are:
$$P(X \le -1) = P(\{\omega \in \Omega : X(\omega) \le -1\}) = P(\emptyset) = 0$$
$$P(X \le 0) = P(\{\omega \in \Omega : X(\omega) \le 0\}) = P(\{TTT\}) = (1-\pi)^3$$
$$P(X \le 1) = P(\{\omega \in \Omega : X(\omega) \le 1\}) = P(\{TTT, TTH, THT, HTT\}) = (1-\pi)^2(1+2\pi)$$
$$P(X \le 1.5) = P(\{\omega \in \Omega : X(\omega) \le 1.5\}) = P(\{TTT, TTH, THT, HTT\}) = (1-\pi)^2(1+2\pi)$$
$$P(X \le 4) = P(\{\omega \in \Omega : X(\omega) \le 4\}) = P(\Omega) = 1.$$
Note that $P(X \le 1) = P(X \le 1.5)$.

Similarly, as Y denotes the number of tails, we have:
$$Y(TTT) = 3 - X(TTT) = 3, \quad Y(TTH) = 3 - X(TTH) = 2 \quad \text{etc.}$$
and the following illustrative probabilities:
$$P(Y = 2) = P(3 - X = 2) = P(X = 1) = 3\pi(1-\pi)^2$$
and:
$$P(Y \le 2) = P(X > 0) = 1 - P(X \le 0) = 1 - (1-\pi)^3.$$
Since X and Y are simple counts, i.e. the number of heads and tails, respectively, these are positive random variables.
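The event calculations in Example 3.3 can also be reproduced by brute force. The short sketch below is an addition, not part of the subject guide (Python is assumed and $\pi = 0.3$ is an arbitrary choice); it enumerates $\Omega$, defines $X(\omega)$ as the number of heads, and recovers $P(X \le 1)$ by summing outcome probabilities.

from itertools import product

pi = 0.3                                                         # arbitrary illustrative value
omega = [''.join(flips) for flips in product('HT', repeat=3)]    # the sample space
X = {w: w.count('H') for w in omega}                             # X maps outcomes to real numbers
prob = {w: pi**w.count('H') * (1 - pi)**w.count('T') for w in omega}

p_le_1 = sum(prob[w] for w in omega if X[w] <= 1)
print(p_le_1, (1 - pi)**2 * (1 + 2*pi))                          # both equal P(X <= 1)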


Activity 3.1 Two fair dice are rolled. Let X denote the absolute value of the difference between the values shown on the top face of the dice. Express each of the following in words.

(a) $\{X \le 2\}$.

(b) $\{X = 0\}$.

(c) $P(X \le 2)$.

(d) $P(X = 0)$.

Activity 3.2 A die is rolled and a coin is tossed. Defining the random variable X to be the value shown on the die, and the random variable Y to represent the coin outcome such that:
$$Y = \begin{cases} 1 & \text{if heads} \\ 0 & \text{if tails} \end{cases}$$
write down concise mathematical expressions for each of the following.

(a) The value on the die is less than 3.

(b) The probability that the value on the die is less than 3.

(c) The coin shows a head.

(d) The probability that the number of heads shown is less than 1.

(e) The die roll is a 6 and there are no heads.

(f) The probability that the number of heads is less than the value on the die.

3.5 Distribution functions


Although a random variable maps outcomes to real numbers, our interest usually lies in
probabilities associated with the random variable. The (cumulative) distribution
function fully characterises the probability distribution associated with a random
variable.

Distribution function

The distribution function, or cumulative distribution function (cdf), of a random variable X is the function $F_X : \mathbb{R} \to [0, 1]$ given by:
$$F_X(x) = P(X \le x).$$

Some remarks on distribution functions are the following.

i. The terms 'distribution function', 'cumulative distribution function' and 'cdf' can be used synonymously.

ii. $F_X$ denotes the cdf of the random variable X, similarly $F_Y$ denotes the cdf of the random variable Y etc. If an application has only one random variable, such as X, where it is unambiguous what the random variable is, we may simply write F.

We now define right continuity, which is required to establish the properties of distribution functions.

Right continuity

A function $g : \mathbb{R} \to \mathbb{R}$ is right continuous if $g(x+) = g(x)$ for all $x \in \mathbb{R}$, where:
$$g(x+) = \lim_{h \downarrow 0} g(x + h).$$
(Note that $g(x+)$ refers to the limit of the values given by g as we approach point x from the right.²)

We now consider properties of distribution functions. Non-examinable proofs can be found in Appendix A.

Properties of distribution functions

A distribution function $F_X$ has the following properties.

i. $F_X$ is a non-decreasing function, i.e. if $x < y$ then $F_X(x) \le F_X(y)$.

ii. $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to \infty} F_X(x) = 1$.

iii. $F_X$ is right continuous, i.e. $F_X(x+) = F_X(x)$ for all $x \in \mathbb{R}$.

Since distribution functions return cumulative probabilities, we can use them to determine probabilities of specific events of interest. Non-examinable proofs can be found in Appendix A.

Probabilities from distribution functions

For real numbers x and y, with x < y, we have the following.

i. $P(X > x) = 1 - F_X(x)$.

ii. $P(x < X \le y) = F_X(y) - F_X(x)$.

iii. $P(X < x) = \lim_{h \downarrow 0} F_X(x - h) = F_X(x-)$.

iv. $P(X = x) = F_X(x) - F_X(x-)$.

² Similarly, we may define left continuity as $g(x-) = g(x)$ for all $x \in \mathbb{R}$, where $g(x-) = \lim_{h \downarrow 0} g(x - h)$.


Example 3.4 Find the cumulative distribution functions corresponding to the following density functions.

(a) Standard Cauchy:
$$f_X(x) = \frac{1}{\pi(1 + x^2)} \quad \text{for } -\infty < x < \infty.$$

(b) Logistic:
$$f_X(x) = \frac{e^{-x}}{(1 + e^{-x})^2} \quad \text{for } -\infty < x < \infty.$$

(c) Pareto:
$$f_X(x) = \frac{a - 1}{(1 + x)^a} \quad \text{for } 0 < x < \infty.$$

(d) Weibull:
$$f_X(x) = c\tau x^{\tau - 1} e^{-cx^\tau} \quad \text{for } x \ge 0,\ c > 0 \text{ and } \tau > 0.$$

Solution

(a) We have:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{-\infty}^{x} \frac{1}{\pi(1 + t^2)}\,dt = \left[\frac{1}{\pi}\arctan t\right]_{-\infty}^{x} = \frac{1}{\pi}\arctan x - \frac{1}{\pi}\left(-\frac{\pi}{2}\right) = \frac{1}{\pi}\arctan x + \frac{1}{2}.$$

(b) We have:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{-\infty}^{x} \frac{e^{-t}}{(1 + e^{-t})^2}\,dt = \left[\frac{1}{1 + e^{-t}}\right]_{-\infty}^{x} = \frac{1}{1 + e^{-x}}.$$

(c) We have:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{0}^{x} \frac{a - 1}{(1 + t)^a}\,dt = \left[-\frac{1}{(1 + t)^{a-1}}\right]_{0}^{x} = 1 - \frac{1}{(1 + x)^{a-1}}.$$
For x < 0 it is obvious that $F_X(x) = 0$, so in full:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - \dfrac{1}{(1 + x)^{a-1}} & \text{for } x \ge 0. \end{cases}$$

(d) We have:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \int_{0}^{x} c\tau t^{\tau - 1} e^{-ct^\tau}\,dt = \left[-e^{-ct^\tau}\right]_{0}^{x} = 1 - e^{-cx^\tau}.$$
For x < 0 it is obvious that $F_X(x) = 0$, so in full:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - e^{-cx^\tau} & \text{for } x \ge 0. \end{cases}$$
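Each of the four cdfs derived above can be checked by numerical integration of the corresponding density. The sketch below is an addition (Python with numpy and scipy assumed; the test point x = 1.7 and the parameter values a, c, $\tau$ are arbitrary choices).

import numpy as np
from scipy.integrate import quad

x, a, c, tau = 1.7, 3.0, 0.5, 2.0                # arbitrary test point and parameters

cauchy = quad(lambda t: 1/(np.pi*(1 + t**2)), -np.inf, x)[0]
logistic = quad(lambda t: 0.25/np.cosh(t/2)**2, -np.inf, x)[0]   # e^(-t)/(1+e^(-t))^2 rewritten for stability
pareto = quad(lambda t: (a - 1)/(1 + t)**a, 0, x)[0]
weibull = quad(lambda t: c*tau*t**(tau - 1)*np.exp(-c*t**tau), 0, x)[0]

print(np.isclose(cauchy, np.arctan(x)/np.pi + 0.5))       # (a)
print(np.isclose(logistic, 1/(1 + np.exp(-x))))           # (b)
print(np.isclose(pareto, 1 - 1/(1 + x)**(a - 1)))         # (c)
print(np.isclose(weibull, 1 - np.exp(-c*x**tau)))         # (d)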


Example 3.5 Suppose X is a discrete random variable with distribution function:
$$F_X(x) = \frac{x(x + 1)}{42}$$
over the support $\{1, 2, \ldots, 6\}$. Determine the mass function of X, i.e. $p_X$.

Solution

We have:
$$p_X(x) = F_X(x) - F_X(x - 1) = \frac{x(x + 1)}{42} - \frac{(x - 1)x}{42} = \frac{x}{21}.$$
In full:
$$p_X(x) = \begin{cases} x/21 & \text{for } x = 1, 2, \ldots, 6 \\ 0 & \text{otherwise.} \end{cases}$$
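The differencing step in Example 3.5 is easy to reproduce programmatically. The following is an illustrative sketch only (Python assumed; not part of the original solution).

F = lambda x: x*(x + 1)/42 if 1 <= x <= 6 else (0.0 if x < 1 else 1.0)   # the given step cdf

pmf = {x: F(x) - F(x - 1) for x in range(1, 7)}
print(pmf)                    # x/21 for x = 1, ..., 6
print(sum(pmf.values()))      # 1.0 (up to rounding), so p_X is a valid mass function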

Activity 3.3 Let X be a random variable which models the value of claims received
at an insurance company. Suppose that only claims greater than k are paid. Write
an expression for the distribution functions of claims paid and claims not paid in
terms of the distribution function of X.

While the distribution function returns $P(X \le x)$, there are occasions when we are interested in $P(X > x)$, i.e. the probability that X is larger than x. In models of lifetimes, the event $\{X > x\}$ represents survival beyond time x. This gives rise to the survival function.

Survival function

If X is a random variable with distribution function $F_X$, the survival function $\bar{F}_X$ is defined as:
$$\bar{F}_X(x) = P(X > x) = 1 - F_X(x).$$

Example 3.6 If X is a non-negative continuous random variable, then $\bar{F}_X(x) = P(X > x)$ is called the survival function, and $h_X(x) = f_X(x)/\bar{F}_X(x)$ is the hazard function.

Find the hazard function for the following distributions.

(a) Pareto:
$$f_X(x) = \frac{a - 1}{(1 + x)^a}$$
for $0 < x < \infty$ and $a > 1$.

(b) Weibull:
$$f_X(x) = c\tau x^{\tau - 1} e^{-cx^\tau}$$
for $x \ge 0$, $c > 0$ and $\tau > 0$.


Solution

First notice that the hazard function, often denoted by $\lambda(x)$, is (applying the chain rule):
$$\lambda(x) = -\frac{d}{dx}\ln \bar{F}_X(x) = -\frac{-f_X(x)}{\bar{F}_X(x)} = \frac{f_X(x)}{\bar{F}_X(x)}.$$

(a) For the Pareto distribution, using Example 3.4:
$$\bar{F}_X(x) = 1 - F_X(x) = \frac{1}{(1 + x)^{a-1}}$$
hence:
$$\lambda(x) = -\frac{d(-(a - 1)\ln(1 + x))}{dx} = \frac{a - 1}{1 + x}.$$

(b) For the Weibull distribution, using Example 3.4:
$$\bar{F}_X(x) = 1 - F_X(x) = e^{-cx^\tau}$$
hence:
$$\lambda(x) = -\frac{d(-cx^\tau)}{dx} = c\tau x^{\tau - 1}.$$
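As a quick numerical sanity check of part (b), the ratio $f_X(x)/\bar{F}_X(x)$ computed directly from the Weibull density and survival function should coincide with $c\tau x^{\tau-1}$. The sketch below is an addition (Python with numpy assumed; the values of c and $\tau$ are arbitrary).

import numpy as np

c, tau = 0.8, 1.5                              # arbitrary parameter values
x = np.linspace(0.1, 5, 50)

f = c*tau*x**(tau - 1)*np.exp(-c*x**tau)       # Weibull density
F_bar = np.exp(-c*x**tau)                      # survival function, from Example 3.4
print(np.allclose(f/F_bar, c*tau*x**(tau - 1)))   # True: the hazard is c*tau*x^(tau-1)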

Example 3.7 Show that, in general, the hazard function of a non-negative continuous random variable X does not decrease as x increases if:
$$\frac{\bar{F}_X(x + y)}{\bar{F}_X(x)}$$
does not increase as x increases, for all $y \ge 0$.

Hint: Differentiate the logarithm of the given expression with respect to x.

Solution

We are asked to show that $\lambda(x) \le \lambda(x + y)$ when $y \ge 0$. We are told that $\bar{F}_X(x + y)/\bar{F}_X(x)$ does not increase as x increases, and so $\ln \bar{F}_X(x + y) - \ln \bar{F}_X(x)$ has a non-positive derivative with respect to x. Differentiating with respect to x, we have:
$$-\lambda(x + y) + \lambda(x) \le 0$$
which is the required result.

3.6 Discrete vs. continuous random variables

As in ST104b Statistics 2, we focus on two classes of random variables and their associated probability distributions:

discrete random variables – typically assumed for variables which we can count

continuous random variables – typically assumed for variables which we can measure.

Example 3.8 We consider some examples to demonstrate the distinction between discreteness and continuity.

1. Discrete model: Any real-world variable which can be counted, taking natural numbers 0, 1, 2, . . . , is a candidate for being modelled with a discrete random variable. Examples include the number of children in a household, the number of passengers on a flight etc.

2. Continuous model: Any real-world variables which can be measured on a continuous scale would be a candidate for being modelled with a continuous random variable. In practice, measurement is limited by the accuracy of the measuring device (for example, think how accurately you can read off a ruler). Examples include height and weight of people, duration of a flight etc.

3. Continuous model for a discrete situation: Consider the value of claims received at an insurance company. All values will be monetary amounts, which can be expressed in terms of the smallest unit of currency. For example, in the UK, a claim could be £100 (equivalently 10,000 pence, since £1 = 100 pence) or £10,000 (equivalently 1,000,000 pence). Since pence are the smallest unit of the currency, the value of claims must be a positive integer number of pence. Hence the value of claims is a discrete variable. However, due to the (very) large number of distinct possible values, we may consider using a continuous random variable as an approximating model for the value of claims.

4. Neither discrete nor continuous model: A random variable can also be a mixture of discrete and continuous parts. For example, consider the value of payments which an insurance company needs to make on all insurance policies of a particular type. Most policies result in no claims, so the payment for them is 0. For those policies which do result in a claim, the size of each claim is some number greater than 0. The resulting model has some discrete characteristics (a policy either results in a claim or it does not) and some continuous characteristics (treating the value of the claim as being measured on a continuous scale). Therefore, we may choose to model this situation using a random variable which is neither discrete nor continuous, i.e. with a mixture distribution.

In ST104b Statistics 2, the possible values which a random variable could take were referred to as the sample space, which for many distributions is a subset of $\mathbb{R}$, the real line. In this course, we will use the term support.

Support of a function

The support of a positive real-valued function, f, is the subset of the real line where f takes values strictly greater than zero:
$$\{x \in \mathbb{R} : f(x) > 0\}.$$


3.7 Discrete random variables

Discrete random variables have supports which are in some countable subset $\{x_1, x_2, \ldots\}$ of $\mathbb{R}$. This means that the probability that a discrete random variable X can take a value outside $\{x_1, x_2, \ldots\}$ is zero, i.e. $P(X = x) = 0$ for $x \notin \{x_1, x_2, \ldots\}$. Probability distributions of discrete random variables can, of course, be represented by distribution functions, but we may also consider probability at a point using the probability mass function (pmf). We have the following definition, along with the properties of a pmf.

Probability mass function

The probability mass function of a discrete random variable X is the function $p_X : \mathbb{R} \to [0, 1]$ given by:
$$p_X(x) = P(X = x).$$
For brevity, we may refer to these simply as 'mass functions'. If $p_X$ is a mass function then:

i. $0 \le p_X(x) \le 1$ for all $x \in \mathbb{R}$

ii. $p_X(x) = 0$ for $x \notin \{x_1, x_2, \ldots\}$

iii. $\sum_x p_X(x) = 1$.

Since the support of a mass function is the countable set $\{x_1, x_2, \ldots\}$, in summations such as in property iii. above, where the limits of summation are not explicitly provided, the sum will be assumed to be over the support of the mass function.
The probability distribution of a discrete random variable may be represented either by its distribution function or its mass function. Unsurprisingly, these two types of function are related.

Relationship between mass and distribution functions

If X is a discrete random variable such that $p_X$ is its mass function and $F_X$ is its distribution function, then:

i. $p_X(x) = F_X(x) - F_X(x-)$

ii. $F_X(x) = \sum_{x_i \le x} p_X(x_i)$.

From i. above we can deduce that $F_X(x) = F_X(x-) + p_X(x)$. Since $p_X(x) = 0$ for $x \notin \{x_1, x_2, \ldots\}$, we have that $F_X(x) = F_X(x-)$ for $x \notin \{x_1, x_2, \ldots\}$. This means that the distribution function is a step function, i.e. flat except for discontinuities at the points $\{x_1, x_2, \ldots\}$ which represent the non-zero probabilities at the given points in the support.
We now consider some common discrete probability distributions, several of which were first mentioned in ST104b Statistics 2. We refer to 'families' of probability distributions, with different members of each family distinguished by one or more parameters.

3.7.1 Degenerate distribution

A degenerate distribution concentrates all probability at a single point. If X is a degenerate random variable, its support is $\{a\}$, for some constant $-\infty < a < \infty$. Denoted $X \sim \text{Degenerate}(a)$, its mass function is:
$$p_X(x) = \begin{cases} 1 & \text{for } x = a \\ 0 & \text{otherwise} \end{cases}$$
and its distribution function is:
$$F_X(x) = \begin{cases} 0 & \text{for } x < a \\ 1 & \text{for } x \ge a. \end{cases}$$

3.7.2 Discrete uniform distribution

A discrete uniform distribution assigns equal (i.e. the same, or 'uniform') probabilities to each member of its support. If X is a discrete uniform random variable with support $\{x_1, x_2, \ldots, x_n\}$, denoted $X \sim \text{Uniform}\{x_1, \ldots, x_n\}$, its mass function is:
$$p_X(x) = \begin{cases} \dfrac{1}{n} & \text{for } x \in \{x_1, x_2, \ldots, x_n\} \\ 0 & \text{otherwise.} \end{cases}$$

The corresponding distribution function is a step function with steps of equal magnitude 1/n. This can be expressed as:
$$F_X(x) = \begin{cases} 0 & \text{for } x < x_1 \\ \dfrac{\lfloor x \rfloor - x_1 + 1}{n} & \text{for } x_1 \le x \le x_n \\ 1 & \text{for } x > x_n \end{cases}$$
where $\lfloor x \rfloor$ is the 'floor' of x, i.e. the largest integer which is less than or equal to x.

3.7.3 Bernoulli distribution

A Bernoulli distribution assigns probabilities $\pi$ and $1 - \pi$ to the only two possible outcomes, often referred to as 'success' and 'failure', respectively, although these do not necessarily have to represent 'good' and 'bad' outcomes, respectively. Since the support must be a set of real numbers, these are assigned the values 1 and 0, respectively. If X is a Bernoulli random variable, denoted $X \sim \text{Bernoulli}(\pi)$, its mass function is:
$$p_X(x) = \begin{cases} \pi^x (1 - \pi)^{1-x} & \text{for } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$
Sometimes the notation p is used to denote $\pi$. The corresponding distribution function is a step function, given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - \pi & \text{for } 0 \le x < 1 \\ 1 & \text{for } x \ge 1. \end{cases}$$
3
3.7.4 Binomial distribution

A binomial distribution assigns probabilities to the number of successes in n independent trials each with only two possible outcomes with a constant probability of success. If X is a binomial random variable, denoted $X \sim \text{Bin}(n, \pi)$, its mass function is:
$$p_X(x) = \begin{cases} \dbinom{n}{x} \pi^x (1 - \pi)^{n-x} & \text{for } x = 0, 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
where:
$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$
is the binomial coefficient. Sometimes the notation p and q are used to denote $\pi$ and $1 - \pi$, respectively. This has the benefit of brevity, since q is more concise than $1 - \pi$. Proof of the validity of this mass function was shown in ST104b Statistics 2.

The corresponding distribution function is a step function, given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \displaystyle\sum_{i=0}^{\lfloor x \rfloor} \binom{n}{i} \pi^i (1 - \pi)^{n-i} & \text{for } 0 \le x < n \\ 1 & \text{for } x \ge n. \end{cases}$$

Note the special case when n = 1 corresponds to the Bernoulli($\pi$) distribution, and that the sum of n independent and identically distributed Bernoulli($\pi$) random variables has a Bin($n, \pi$) distribution.

3.7.5 Geometric distribution

Somewhat confusingly, there are two versions of the geometric distribution with subtle differences over the support and hence of what the random variable represents. Regardless of this nuance, both versions involve the repetition of independent and identically distributed Bernoulli trials until the first success occurs.

First version

In the first version of the geometric distribution, X represents the trial number of the first success. As such the support is $\{1, 2, \ldots\}$, with the mass function:
$$p_X(x) = \begin{cases} (1 - \pi)^{x-1} \pi & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
with its distribution function given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 1 \\ 1 - (1 - \pi)^{\lfloor x \rfloor} & \text{for } x \ge 1. \end{cases}$$

Second version

In the second version of the geometric distribution, X represents the number of failures before the first success. As such the support is $\{0, 1, 2, \ldots\}$, with the mass function:
$$p_X(x) = \begin{cases} (1 - \pi)^x \pi & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
with its distribution function given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - (1 - \pi)^{\lfloor x \rfloor + 1} & \text{for } x \ge 0. \end{cases}$$
Either version could be denoted as $X \sim \text{Geo}(\pi)$, although be sure to be clear which version is being used in an application.

3.7.6 Negative binomial distribution

The negative binomial distribution extends the geometric distribution, and hence also has two versions.

First version

In the first version this distribution is used to represent the trial number of the rth success in independent Bernoulli($\pi$) trials, where $r = 1, 2, \ldots$. When r = 1, this is a special case which is the (first) version of the geometric distribution above. If X is a negative binomial random variable, denoted $X \sim \text{Neg. Bin}(r, \pi)$, where $\pi$ is the constant probability of success, its mass function is:
$$p_X(x) = \begin{cases} \dbinom{x - 1}{r - 1} \pi^r (1 - \pi)^{x-r} & \text{for } x = r, r + 1, r + 2, \ldots \\ 0 & \text{otherwise.} \end{cases}$$
Note why the mass function has this form. In order to have r successes for the first time on the xth trial, we must have $r - 1$ successes in the first $x - 1$ trials. The number of ways in which these $r - 1$ successes may occur is $\binom{x-1}{r-1}$ and the probability associated with each of these sequences is $\pi^{r-1}(1 - \pi)^{x-r}$. Since the xth, i.e. final, trial must be a success, we then multiply $\binom{x-1}{r-1}\pi^{r-1}(1 - \pi)^{x-r}$ by $\pi$. From the mass function it can be seen that the negative binomial distribution generalises the (first version of the) geometric distribution, such that if $X \sim \text{Geo}(\pi)$ then $X \sim \text{Neg. Bin}(1, \pi)$.

Its distribution function is given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < r \\ \displaystyle\sum_{i=r}^{\lfloor x \rfloor} p_X(i) & \text{for } x \ge r. \end{cases}$$


Second version

The second version of the negative binomial distribution is formulated as the number of failures before the rth success occurs. In this formulation the mass function is:
$$p_X(x) = \begin{cases} \dbinom{x + r - 1}{r - 1} \pi^r (1 - \pi)^x & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
while its distribution function is:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \displaystyle\sum_{i=0}^{\lfloor x \rfloor} p_X(i) & \text{for } x \ge 0. \end{cases}$$

3.7.7 Polya distribution

The Polya distribution extends the second version of the negative binomial distribution to allow for non-integer values of r. If X is a Polya random variable, denoted by $X \sim \text{Polya}(r, \pi)$, its mass function is:
$$p_X(x) = \begin{cases} \dfrac{\Gamma(r + x)}{x!\,\Gamma(r)}\,\pi^r (1 - \pi)^x & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
where $\Gamma$ is the gamma function defined as:
$$\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\,dt.$$
Integration by parts yields a useful property of the gamma function:
$$\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) \quad \text{for } \alpha > 1$$
and we also have:
$$\Gamma(1) = 1. \tag{3.1}$$
One interpretation of the above property is that the gamma function extends the factorial function to non-integer values. It is clear from (3.1) that:
$$\Gamma(n) = (n - 1)!$$
for any positive integer n.

3.7.8 Hypergeometric distribution

A hypergeometric distribution is used to represent the number of successes when n objects are selected without replacement from a population of N objects, where $K \le N$ of the objects represent 'success' and the remaining $N - K$ objects represent 'failure'. If X is a hypergeometric random variable, denoted $X \sim \text{Hyper}(n, N, K)$, with support $\{\max(0, n + K - N), \ldots, \min(n, K)\}$, its mass function is:
$$p_X(x) = \begin{cases} \dfrac{\dbinom{K}{x}\dbinom{N - K}{n - x}}{\dbinom{N}{n}} & \text{for } x = \max(0, n + K - N), \ldots, \min(n, K) \\ 0 & \text{otherwise} \end{cases}$$
where $n \in \{0, 1, 2, \ldots, N\}$, $K \in \{0, 1, 2, \ldots, N\}$ and $N \in \{0, 1, 2, \ldots\}$ are parameters. We omit the distribution function of the hypergeometric distribution in this course.

Note this is similar to the binomial distribution except that sampling is with replacement in a binomial model, whereas sampling is without replacement in a hypergeometric model.

3.7.9 Poisson distribution

A Poisson distribution is used to model the number of occurrences of events over a fixed interval, typically in space or time. This distribution has a single parameter, $\lambda > 0$, and the support of the distribution is $\{0, 1, 2, \ldots\}$. If X is a Poisson random variable, denoted $X \sim \text{Pois}(\lambda)$, its mass function is:
$$p_X(x) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!} & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
while its distribution function is:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \displaystyle\sum_{i=0}^{\lfloor x \rfloor} e^{-\lambda}\frac{\lambda^i}{i!} & \text{for } x \ge 0. \end{cases}$$

Example 3.9 Suppose a fair die is rolled 10 times. Let X be the random variable to represent the number of 6s which appear. Derive the distribution of X, i.e. $F_X$.

Solution

For a fair die, the probability of a 6 is 1/6, and hence the probability of a non-6 is 5/6. By independence of the outcome of each roll, we have:
$$F_X(x) = P(X \le x) = \sum_{i=0}^{x} P(X = i) = \sum_{i=0}^{x} \binom{10}{i}\left(\frac{1}{6}\right)^i\left(\frac{5}{6}\right)^{10-i}.$$
In full:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \displaystyle\sum_{i=0}^{\lfloor x \rfloor} \binom{10}{i}\left(\frac{1}{6}\right)^i\left(\frac{5}{6}\right)^{10-i} & \text{for } 0 \le x \le 10 \\ 1 & \text{for } x > 10. \end{cases}$$
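Since $X \sim \text{Bin}(10, 1/6)$, the partial sums above are available directly from scipy. The sketch below is an addition (Python with scipy assumed) and evaluates $F_X(2)$ both ways.

from math import comb
from scipy.stats import binom

F_2 = sum(comb(10, i) * (1/6)**i * (5/6)**(10 - i) for i in range(0, 3))   # the sum above with floor(x) = 2
print(F_2, binom.cdf(2, 10, 1/6))                                          # both approximately 0.7752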


Example 3.10 Show that the ratio of any two successive hypergeometric probabilities, i.e. $P(X = x + 1)$ and $P(X = x)$, equals:
$$\frac{n - x}{x + 1}\cdot\frac{K - x}{N - K - n + x + 1}$$
for any valid x and x + 1.

Solution

If $X \sim \text{Hyper}(n, N, K)$, then from its mass function we have:
$$\frac{p_X(x + 1)}{p_X(x)} = \frac{\dbinom{K}{x+1}\dbinom{N - K}{n - x - 1}\bigg/\dbinom{N}{n}}{\dbinom{K}{x}\dbinom{N - K}{n - x}\bigg/\dbinom{N}{n}} = \frac{\dbinom{K}{x+1}\dbinom{N - K}{n - x - 1}}{\dbinom{K}{x}\dbinom{N - K}{n - x}} = \frac{\dfrac{K!\,(N - K)!}{(x + 1)!\,(K - x - 1)!\,(n - x - 1)!\,(N - K - n + x + 1)!}}{\dfrac{K!\,(N - K)!}{x!\,(K - x)!\,(n - x)!\,(N - K - n + x)!}} = \frac{n - x}{x + 1}\cdot\frac{K - x}{N - K - n + x + 1}.$$
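The identity in Example 3.10 is easy to verify numerically for particular parameter values. The sketch below is an addition (Python with scipy assumed; N = 20, K = 8, n = 6 are arbitrary choices); note that scipy's hypergeom takes its arguments in the order (population size, number of successes, sample size).

from scipy.stats import hypergeom

N, K, n = 20, 8, 6                                       # arbitrary parameter values
for x in range(0, min(n, K)):                            # x and x + 1 both lie in the support
    ratio = hypergeom.pmf(x + 1, N, K, n) / hypergeom.pmf(x, N, K, n)
    claimed = (n - x)*(K - x) / ((x + 1)*(N - K - n + x + 1))
    assert abs(ratio - claimed) < 1e-10
print('ratio identity holds')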

Example 3.11 Consider a generalisation of the hypergeometric distribution, such that in a population of N objects, $N_1$ are of type 1, $N_2$ are of type 2, . . . and $N_k$ are of type k, where:
$$\sum_{i=1}^{k} N_i = N.$$
Derive an expression for the probability of $n_1, n_2, \ldots, n_k$ of the first, second etc. up to the kth type of object, when a random sample of size n is selected without replacement.

Solution

In total there are $\binom{N}{n}$ ways of selecting a random sample of size n from the population of N objects. There are $\binom{N_i}{n_i}$ ways to arrange the $n_i$ of the $N_i$ objects for $i = 1, 2, \ldots, k$. By the rule of product, the required probability is:
$$\frac{\prod_{i=1}^{k}\binom{N_i}{n_i}}{\binom{N}{n}}.$$


Activity 3.4 Consider the first version of the geometric distribution.

(a) Show that its mass function:
$$p_X(x) = \begin{cases} (1 - \pi)^{x-1}\pi & \text{for } x = 1, 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
is a valid mass function.

(b) Use the mass function in (a) to derive the associated distribution function.

Activity 3.5 Consider the first version of the negative binomial distribution. Show that its mass function:
$$p_X(x) = \begin{cases} \dbinom{x - 1}{r - 1}\pi^r (1 - \pi)^{x-r} & \text{for } x = r, r + 1, r + 2, \ldots \\ 0 & \text{otherwise} \end{cases}$$
is a valid mass function.

3.8 Continuous random variables

Continuous random variables have supports which are either the real numbers $\mathbb{R}$, or one or more intervals in $\mathbb{R}$. Equivalently, this means that the distribution function of a continuous random variable is continuous (unlike the step functions for discrete random variables). Instead of a (probability) mass function, we describe a continuous distribution with a (probability) density function.

Continuous random variable

A random variable X is continuous if its distribution function can be expressed as:
$$F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt \quad \text{for } x \in \mathbb{R}$$
for some integrable function $f_X : \mathbb{R} \to [0, \infty)$, known as the (probability) density function. In reverse, the density function can be derived from the distribution function by differentiating:
$$f_X(x) = \frac{d}{dt}F_X(t)\bigg|_{t=x} = F_X'(x) \quad \text{for all } x \in \mathbb{R}.$$
If $f_X$ is a valid density function, then:

i. $f_X(x) \ge 0$ for all $x \in \mathbb{R}$

ii. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$.


Example 3.12 Show that:
$$f_X(x) = \begin{cases} (n + 2)(n + 1)x^n(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
is a valid density function, where n is a positive integer.

Solution

Immediately, $f_X(x) \ge 0$ for all real x. So it remains to check that the function integrates to 1 over its support. We have:
$$\begin{aligned}
\int_{-\infty}^{\infty} f_X(x)\,dx &= \int_0^1 (n + 2)(n + 1)x^n(1 - x)\,dx = \int_0^1 (n + 2)(n + 1)(x^n - x^{n+1})\,dx \\
&= \left[(n + 2)(n + 1)\left(\frac{x^{n+1}}{n + 1} - \frac{x^{n+2}}{n + 2}\right)\right]_0^1 = \left[(n + 2)x^{n+1} - (n + 1)x^{n+2}\right]_0^1 \\
&= (n + 2) - (n + 1) = 1.
\end{aligned}$$
Hence $f_X$ is a valid density function.
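A numerical check of this result, for a few values of n, is sketched below (an addition; Python with scipy assumed).

from scipy.integrate import quad

for n in (1, 2, 5, 10):
    total, _ = quad(lambda x, n=n: (n + 2)*(n + 1)*x**n*(1 - x), 0, 1)
    print(n, round(total, 10))   # 1.0 in every case, as shown above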

Example 3.13 Let X be a random variable with density function:
$$f_X(x) = \begin{cases} x e^{-x} & \text{for } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Derive the distribution function of X, i.e. $F_X$.

Solution

Applying integration by parts, we have:
$$F_X(x) = \int_0^x t e^{-t}\,dt = \left[-t e^{-t}\right]_0^x + \int_0^x e^{-t}\,dt = -x e^{-x} + \left[-e^{-t}\right]_0^x = -x e^{-x} - e^{-x} + 1 = 1 - (1 + x)e^{-x}.$$
In full:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - (1 + x)e^{-x} & \text{for } x \ge 0. \end{cases}$$


Example 3.14 A logistic curve has distribution function:
$$F_X(x) = \frac{1}{1 + e^{-x}} \quad \text{for } -\infty < x < \infty.$$

(a) Verify this is a valid distribution function.

(b) Derive the corresponding density function.

Solution

(a) Applying the chain rule, we have that:
$$F_X'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} > 0$$
and so $F_X$ is strictly increasing. We also have that:
$$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F_X(x) = 1$$
verifying this is a valid distribution function.

(b) The density function is simply $F_X'(x)$, i.e. we have:
$$f_X(x) = \frac{e^{-x}}{(1 + e^{-x})^2} \quad \text{for } -\infty < x < \infty.$$
Note this is a member of the logistic distribution family.

Example 3.15 Let X be a continuous random variable with density function $f_X(x)$ which is symmetric, i.e. $f_X(x) = f_X(-x)$ for all x. For any real constant k, show that:
$$P(-k < X < k) = 2F_X(k) - 1.$$

Solution

We have:
$$\begin{aligned}
P(-k < X < k) &= P(-k < X \le 0) + P(0 < X < k) \\
&= \int_{-k}^{0} f_X(x)\,dx + \int_0^k f_X(x)\,dx \\
&= \int_{-k}^{0} f_X(-x)\,dx + \int_0^k f_X(x)\,dx \\
&= \int_0^k f_X(x)\,dx + \int_0^k f_X(x)\,dx \\
&= 2(F_X(k) - F_X(0)).
\end{aligned}$$
Due to the symmetry of X, we have that $F_X(0) = 0.5$. Therefore:
$$2(F_X(k) - F_X(0)) = 2(F_X(k) - 0.5) = 2F_X(k) - 1.$$

It is important to remember that whereas mass functions return probabilities, since $p_X(x) = P(X = x)$, and hence values of mass functions must be within [0, 1], values of a density function are not probabilities; rather, probabilities are given by the area below the density function (and above the x-axis).

Probability of an event for a continuous random variable

If X is a continuous random variable with density function $f_X$, then for $a, b \in \mathbb{R}$ such that $a \le b$, we have:
$$P(a < X \le b) = \int_a^b f_X(x)\,dx = F_X(b) - F_X(a).$$
Setting $a = b = x$, we have that:
$$P(X = x) = 0 \quad \text{for all } x \in \mathbb{R}.$$
More generally, for any well-behaved subset A of $\mathbb{R}$, i.e. $A \subseteq \mathbb{R}$, then:
$$P(X \in A) = \int_A f_X(x)\,dx$$
where $\{X \in A\}$ means, in full, $\{\omega \in \Omega : X(\omega) \in A\}$ such that A is an interval or a countable union of intervals.

Note the seemingly counterintuitive result that $P(X = x) = 0$. This seems strange because we can observe real-valued measurements of a continuous variable, such as height. If a person was measured to be 170 cm, what does it mean if we said $P(X = 170) = 0$, when clearly we have observed this event? Well, 170 cm has been expressed to the nearest centimetre, so this simply means the observed height (in centimetres) fell in the interval [169.5, 170.5], and there is a strictly positive probability associated with this interval. Even if we used a measuring device with (far) greater accuracy, there will always be practical limitations to the precision with which we can measure. Therefore, any measurement on a continuous scale produces an interval rather than a single value.

We now consider some common continuous probability distributions, several of which were first mentioned in ST104b Statistics 2. As with discrete distributions, we refer to 'families' of probability distributions, with different members of each family distinguished by one or more parameters.

3.8.1 Continuous uniform distribution

A continuous uniform distribution assigns probability equally (uniformly, hence the name) over its support [a, b], for a < b. If X is a continuous random variable, denoted $X \sim \text{Uniform}[a, b]$, its density function is:
$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\ 0 & \text{otherwise.} \end{cases}$$
Its distribution function is given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x - a}{b - a} & \text{for } a \le x \le b \\ 1 & \text{for } x > b. \end{cases}$$
The special case of $X \sim \text{Uniform}[0, 1]$, i.e. when the support is the unit interval, is used in simulations of random samples from distributions, by treating a random drawing from Uniform[0, 1] as a randomly drawn value of a distribution function, since:
$$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to \infty} F_X(x) = 1.$$
Inverting the distribution function recovers the (simulated) random drawing from the desired distribution.
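A minimal sketch of the simulation idea just described follows (an addition; Python with numpy assumed). It draws from Uniform[0, 1] and inverts the exponential cdf $F(x) = 1 - e^{-\lambda x}$ of Section 3.8.2 below, whose inverse is $-\ln(1 - u)/\lambda$.

import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                               # arbitrary rate parameter
u = rng.uniform(size=100_000)           # drawings from Uniform[0, 1]
x = -np.log(1 - u) / lam                # invert the Exp(lam) distribution function

print(x.mean())                         # close to 1/lam = 0.5
print((x <= 1.0).mean())                # close to F(1) = 1 - exp(-2), about 0.8647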

3.8.2 Exponential distribution

An exponential distribution arises in reliability theory and queuing theory. For example, in queuing theory we can model the distribution of interarrival times (if, as is often assumed, arrivals are treated as having a Poisson distribution with a rate parameter of $\lambda > 0$) with a positive-valued random variable following an exponential distribution, i.e. with support $\{x \ge 0\}$. If X is an exponential random variable, denoted $X \sim \text{Exp}(\lambda)$, its density function is:
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$
Its distribution function is given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 - e^{-\lambda x} & \text{for } x \ge 0. \end{cases}$$

3.8.3 Normal distribution

A normal distribution is (one of) the most important distribution(s) in statistics. It has been covered extensively in ST104a Statistics 1 and ST104b Statistics 2. Recall that a normal distribution is completely specified by its mean, $\mu$, and its variance, $\sigma^2$ (such that $-\infty < \mu < \infty$ and $\sigma^2 > 0$), and has a support of $\mathbb{R}$. If X is a normal random variable, denoted $X \sim N(\mu, \sigma^2)$, its density function is:
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \quad \text{for } -\infty < x < \infty.$$
The distribution function of a normal random variable does not have a closed form.

An important special case is the standard normal distribution with $\mu = 0$ and $\sigma^2 = 1$, denoted $Z \sim N(0, 1)$, with density function:
$$f_Z(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right) \quad \text{for } -\infty < z < \infty.$$
X and Z are related through the linear transformation:
$$Z = \frac{X - \mu}{\sigma}, \quad X = \mu + \sigma Z.$$
The distribution function of Z is denoted by $\Phi$, such that if $Z \sim N(0, 1)$, then:
$$\Phi(z) = F_Z(z) = P(Z \le z).$$

3.8.4 Gamma distribution

A gamma distribution is a positively-skewed distribution with numerous practical applications, such as modelling the size of insurance claims and the size of defaults on loans. The distribution is characterised by two parameters – a shape parameter, $\alpha > 0$, and a scale parameter, $\lambda > 0$. If X is a gamma-distributed random variable, denoted $X \sim \text{Gamma}(\alpha, \lambda)$, its density function is:
$$f_X(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)}\,\lambda^\alpha x^{\alpha - 1} e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
where recall $\Gamma$ is the gamma function, defined in Section 3.7.7. We omit the distribution function of the gamma distribution in this course.

Note that when $\alpha = 1$ the density function reduces to an exponential distribution, i.e. if $X \sim \text{Gamma}(1, \lambda)$, then $X \sim \text{Exp}(\lambda)$.

3.8.5 Beta distribution

A beta distribution is a generalisation of the continuous uniform distribution, defined over the support [0, 1]. A beta distribution is characterised by two shape parameters, $\alpha > 0$ and $\beta > 0$. If X is a beta random variable, denoted $X \sim \text{Beta}(\alpha, \beta)$, its density function is:
$$f_X(x) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\,x^{\alpha - 1}(1 - x)^{\beta - 1} & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$
where $B(\alpha, \beta)$ is the beta function defined as:
$$B(\alpha, \beta) = \int_0^1 t^{\alpha - 1}(1 - t)^{\beta - 1}\,dt = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$
We omit the distribution function of the beta distribution in this course.

Note that when $\alpha = 1$ and $\beta = 1$ the density function reduces to the continuous uniform distribution over [0, 1], i.e. if $X \sim \text{Beta}(1, 1)$, then $X \sim \text{Uniform}[0, 1]$.


3.8.6 Triangular distribution

A triangular distribution is a popular choice of input distribution in Monte Carlo simulation studies. It is specified by easy-to-understand parameters: the minimum possible value, a, the maximum possible value, b (with a < b), and the modal (i.e. most likely) value, c, such that $a \le c \le b$. The support is [a, b]. If X is a triangular random variable, denoted $X \sim \text{Triangular}(a, b, c)$, its density function is:
$$f_X(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)(c - a)} & \text{for } a \le x < c \\ \dfrac{2}{b - a} & \text{for } x = c \\ \dfrac{2(b - x)}{(b - a)(b - c)} & \text{for } c < x \le b \\ 0 & \text{otherwise.} \end{cases}$$
Its distribution function is given by:
$$F_X(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{(x - a)^2}{(b - a)(c - a)} & \text{for } a \le x \le c \\ 1 - \dfrac{(b - x)^2}{(b - a)(b - c)} & \text{for } c < x \le b \\ 1 & \text{for } x > b. \end{cases}$$

Example 3.16 A random variable has 'no memory' if for all x and for y > 0 it holds that:
$$P(X > x + y \mid X > x) = P(X > y).$$
Show that if X has either the exponential distribution, or a geometric distribution with $P(X = x) = q^{x-1}p$, then X has no memory. Interpret this property.

Solution

We must check that $P(X > x + y \mid X > x) = P(X > y)$. This can be written in terms of the distribution function of X because for y > 0 we have:
$$1 - F_X(y) = P(X > y) = P(X > x + y \mid X > x) = \frac{P(\{X > x + y\} \cap \{X > x\})}{P(X > x)} = \frac{P(X > x + y)}{P(X > x)} = \frac{1 - F_X(x + y)}{1 - F_X(x)}.$$
If $X \sim \text{Exp}(\lambda)$, then:
$$1 - F_X(x) = e^{-\lambda x}.$$
The 'no memory' property is verified by noting that:
$$e^{-\lambda y} = \frac{e^{-\lambda(x + y)}}{e^{-\lambda x}}.$$
If X has a geometric distribution, then:
$$1 - F_X(x) = q^x.$$
The 'no memory' property is verified by noting that:
$$q^y = \frac{q^{x+y}}{q^x}.$$
The 'no memory' property is saying that 'old is as good as new'. If we think in terms of lifetimes, it says that you are equally likely to survive for y more years whatever your current age x may be. This is unrealistic for humans for widely different ages x, but may work as a base model in other applications.
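The exponential case can also be illustrated numerically. The sketch below is an addition (Python with scipy assumed; the values of $\lambda$, x and y are arbitrary) and uses the survival function directly.

from scipy.stats import expon

lam, x, y = 1.3, 2.0, 0.7                     # arbitrary values
surv = lambda t: expon.sf(t, scale=1/lam)     # P(X > t) = exp(-lam*t)

lhs = surv(x + y) / surv(x)                   # P(X > x + y | X > x)
rhs = surv(y)                                 # P(X > y)
print(lhs, rhs)                               # equal: the exponential has no memory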

3.9 Expectation, variance and higher moments

3.9.1 Mean of a random variable

Measures of central tendency (or location) were introduced as descriptive statistics in ST104a Statistics 1. For simple datasets, the mean, median and mode were considered. Central tendency allows us to summarise a single feature of datasets as a 'typical' value. This is a simple example of data reduction, i.e. reducing a random sample of n > 1 observations into a single value, in this instance to reflect where a sample distribution is centred.

As seen in ST104b Statistics 2, central tendency measures can also be applied to probability distributions. For example, if X is a continuous random variable with density function $f_X$ and distribution function $F_X$, then the mode is:
$$\text{mode}(X) = \arg\max_x f_X(x)$$
i.e. the value of X where the density function reaches a maximum (which may or may not be unique), and the median is the value m satisfying:
$$F_X(m) = 0.5.$$
Hereafter, we will focus our attention on the mean of X, often referred to as the expected value of X, or simply the expectation of X.

Mean of a random variable

If X is a random variable with mean $\mu$, then:
$$\mu = E(X) = \begin{cases} \sum_x x\,p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} x\,f_X(x)\,dx & \text{for continuous } X \end{cases} \tag{3.2}$$
where to ensure that E(X) is well-defined, we usually require that $\sum_x |x|\,p_X(x) < \infty$ for discrete X, and that $\int_{-\infty}^{\infty} |x|\,f_X(x)\,dx < \infty$ for continuous X.


Example 3.17 Suppose that $X \sim \text{Bin}(n, \pi)$. Hence its mass function is:
$$p_X(x) = \begin{cases} \dbinom{n}{x}\pi^x(1 - \pi)^{n-x} & \text{for } x = 0, 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$
and so the mean, E(X), is:
$$\begin{aligned}
E(X) &= \sum_x x\,p_X(x) && \text{(by definition)} \\
&= \sum_{x=0}^{n} x\binom{n}{x}\pi^x(1 - \pi)^{n-x} && \text{(substituting } p_X(x)\text{)} \\
&= \sum_{x=1}^{n} x\binom{n}{x}\pi^x(1 - \pi)^{n-x} && \text{(since } x\,p_X(x) = 0 \text{ when } x = 0\text{)} \\
&= \sum_{x=1}^{n} \frac{n(n - 1)!}{(x - 1)!\,[(n - 1) - (x - 1)]!}\,\pi\,\pi^{x-1}(1 - \pi)^{n-x} && \text{(as } x/x! = 1/(x - 1)!\text{)} \\
&= n\pi\sum_{x=1}^{n}\binom{n - 1}{x - 1}\pi^{x-1}(1 - \pi)^{n-x} && \text{(taking } n\pi \text{ outside)} \\
&= n\pi\sum_{y=0}^{n-1}\binom{n - 1}{y}\pi^y(1 - \pi)^{(n-1)-y} && \text{(setting } y = x - 1\text{)} \\
&= n\pi \times 1 && \text{(since the sum is a Bin}(n - 1, \pi)\text{ mass function)} \\
&= n\pi.
\end{aligned}$$
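The algebra above can be confirmed by computing the sum directly for particular parameter values. The following sketch is an addition (Python with scipy assumed; n = 12 and $\pi = 0.3$ are arbitrary).

from scipy.stats import binom

n, pi = 12, 0.3                                                   # arbitrary parameter values
mean = sum(x * binom.pmf(x, n, pi) for x in range(n + 1))         # sum of x p_X(x) over the support
print(mean, n * pi)                                               # both 3.6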

Example 3.18 Suppose that $X \sim \text{Exp}(\lambda)$. Hence its density function is:
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
and so the mean, E(X), is:
$$E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx = \int_0^{\infty} x\lambda e^{-\lambda x}\,dx.$$
Note that:
$$x\lambda e^{-\lambda x} = -\lambda\,\frac{d}{d\lambda}e^{-\lambda x}.$$
We may switch the order of differentiation with respect to $\lambda$ and integration with respect to x, hence:
$$E(X) = -\lambda\,\frac{d}{d\lambda}\int_0^{\infty} e^{-\lambda x}\,dx = -\lambda\,\frac{d}{d\lambda}\left[-\frac{1}{\lambda}e^{-\lambda x}\right]_0^{\infty} = -\lambda\,\frac{d}{d\lambda}\,\frac{1}{\lambda} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}.$$
An alternative approach makes use of integration by parts.

Example 3.19 If the random variable $X \sim \text{Hyper}(n, N, K)$, show that for a random sample of size n, the expected value of the number of successes is:
$$E(X) = \frac{nK}{N}.$$

Solution

We have:
$$E(X) = \sum_{x=0}^{K} x\,\frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}} = \sum_{x=1}^{K} x\,\frac{\dfrac{K!}{x!\,(K-x)!}\dbinom{N-K}{n-x}}{\dfrac{N!}{n!\,(N-n)!}}$$
noting that $0 \times p_X(0) = 0$, so we may start the summation at x = 1. We proceed to factor out the result of nK/N, such that:
$$E(X) = \frac{nK}{N}\sum_{x=1}^{K}\frac{\dfrac{(K-1)!}{(x-1)!\,(K-x)!}\dbinom{N-K}{n-x}}{\dfrac{(N-1)!}{(n-1)!\,(N-n)!}} = \frac{nK}{N}\sum_{x=1}^{K}\frac{\dbinom{K-1}{x-1}\dbinom{N-K}{n-x}}{\dbinom{N-1}{n-1}}.$$
If we now change the summation index to start at 0, we have:
$$E(X) = \frac{nK}{N}\sum_{x=0}^{K-1}\frac{\dbinom{K-1}{x}\dbinom{N-K}{n-1-x}}{\dbinom{N-1}{n-1}} = \frac{nK}{N}$$
since the summation is of $X \sim \text{Hyper}(n-1, N-1, K-1)$ over its support, and hence is equal to 1.

3.9.2 Expectation operator

ST104b Statistics 2 introduced properties of the expectation operator, E, notably 'the expectation of the sum' equals the 'sum of the expectations', i.e. linearity – reviewed below. We have seen above how the expectation operator is applied to determine the mean of a random variable X. On many occasions we may be interested in a function of X, i.e. g(X). Rather than determine the mass or density function of g(X) and then work out E(g(X)) using (3.2), it is often easier to work directly with the original mass or density function of X.

Expectation of functions of a random variable

For any well-behaved function $g : \mathbb{R} \to \mathbb{R}$, the expectation of g(X) is defined as:
$$E(g(X)) = \begin{cases} \sum_x g(x)\,p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx & \text{for continuous } X \end{cases}$$
where we usually require that $\sum_x |g(x)|\,p_X(x) < \infty$ for discrete X, and that $\int_{-\infty}^{\infty} |g(x)|\,f_X(x)\,dx < \infty$ for continuous X, to ensure that E(g(X)) is well-defined.


A key property of the expectation is linearity.

Linearity of the expectation operator

For a random variable X and real constants $a_0, a_1, a_2, \ldots, a_k$, then:
$$E\left(\sum_{i=0}^{k} a_i X^i\right) = \sum_{i=0}^{k} a_i E(X^i).$$

The proof of this result is trivial since the property of linearity is inherited directly from the definition of expectation in terms of a sum or an integral. Note that since:
$$E(a_0 + a_1 X + a_2 X^2 + \cdots + a_k X^k) = a_0 + a_1 E(X) + a_2 E(X^2) + \cdots + a_k E(X^k)$$
when k = 1, i.e. for any real constants $a_0$ and $a_1$, we have:
$$E(a_0) = a_0 \quad \text{and} \quad E(a_0 + a_1 X) = a_0 + a_1 E(X).$$
Also note that if X is a positive random variable, then $E(X) \ge 0$, as we would expect.

Example 3.20 Suppose X is a random variable with density function:
$$f_X(x) = \begin{cases} 2(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Define $Y = X^2$, with corresponding density function:
$$f_Y(y) = \begin{cases} 1/\sqrt{y} - 1 & \text{for } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Determine E(Y).

Solution

We can derive E(Y) in one of two ways. Working directly with $f_Y(y)$, we have:
$$E(Y) = \int_{-\infty}^{\infty} y\,f_Y(y)\,dy = \int_0^1 y\left(\frac{1}{\sqrt{y}} - 1\right)dy = \left[\frac{2y^{3/2}}{3} - \frac{y^2}{2}\right]_0^1 = \frac{1}{6}.$$
Working with $f_X(x)$, we have:
$$E(Y) = E(X^2) = \int_{-\infty}^{\infty} x^2\,f_X(x)\,dx = \int_0^1 2x^2(1 - x)\,dx = \left[\frac{2x^3}{3} - \frac{x^4}{2}\right]_0^1 = \frac{1}{6}.$$

Example 3.21 A bowl has n balls, numbered 1 to n. One ball is selected at random. Let X be the random variable representing the number of this ball, hence the support of X is $\{1, 2, \ldots, n\}$. Suppose the probability that ball x is chosen is kx. Calculate E(1/X).

Solution

We must have that:
$$\sum_x p_X(x) = \sum_{x=1}^{n} kx = k\sum_{x=1}^{n} x = \frac{kn(n + 1)}{2} = 1$$
noting that $\sum_{i=1}^{n} i = n(n + 1)/2$. Therefore:
$$k = \frac{2}{n(n + 1)}.$$
Hence:
$$E\left(\frac{1}{X}\right) = \sum_x \frac{1}{x}\,p_X(x) = \sum_{x=1}^{n}\frac{1}{x}\cdot\frac{2x}{n(n + 1)} = \sum_{x=1}^{n}\frac{2}{n(n + 1)} = \frac{2}{n + 1}.$$

3.9.3 Variance of a random variable

Measures of dispersion (or spread) were similarly introduced in ST104a Statistics 1 in the context of descriptive statistics for simple univariate datasets. As with measures of central tendency, we may apply these to probability distributions. For example, if X is a random variable with distribution function $F_X$, then the interquartile range (IQR) is:
$$\text{IQR}(X) = Q_3 - Q_1 = F_X^{-1}(0.75) - F_X^{-1}(0.25).$$
Hereafter, we will focus our attention on the variance of X – the average squared distance from the mean (the standard deviation of X is then simply the positive square root of the variance).

Variance and standard deviation of a random variable

If X is a random variable, the variance of X is defined as:
$$\sigma^2 = \text{Var}(X) = E((X - E(X))^2) = \begin{cases} \sum_x (x - E(X))^2\,p_X(x) & \text{for discrete } X \\ \int_{-\infty}^{\infty} (x - E(X))^2\,f_X(x)\,dx & \text{for continuous } X \end{cases}$$
whenever the sum or integral is finite. The standard deviation is defined as:
$$\sigma = \sqrt{\text{Var}(X)}.$$

Recall the following properties of variance from ST104b Statistics 2.

i. $\text{Var}(X) \ge 0$, i.e. variance is always non-negative.

ii. $\text{Var}(a_0 + a_1 X) = a_1^2\,\text{Var}(X)$, i.e. variance is invariant to a change in location.


Proof:

i. Since $(X - E(X))^2$ is a positive random variable, it follows that:
$$\text{Var}(X) = E((X - E(X))^2) \ge 0.$$

ii. Define $Y = a_0 + a_1 X$, i.e. Y is a linear transformation of X, then by linearity $E(Y) = a_0 + a_1 E(X)$. Hence $Y - E(Y) = a_1(X - E(X))$ and so:
$$\text{Var}(a_0 + a_1 X) = \text{Var}(Y) = E((Y - E(Y))^2) = E(a_1^2(X - E(X))^2) = a_1^2\,\text{Var}(X).$$

In practice it is often easier to derive the variance of a random variable X using one of the following alternative, but equivalent, results.

i. $\text{Var}(X) = E(X^2) - (E(X))^2$.

ii. $\text{Var}(X) = E(X(X - 1)) - E(X)\,E(X - 1)$.

Example 3.22 Suppose that $X \sim \text{Bernoulli}(\pi)$. The mass function is:
$$p_X(x) = \begin{cases} \pi^x(1 - \pi)^{1-x} & \text{for } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$
Hence the mean of X is:
$$E(X) = \sum_x x\,p_X(x) = \sum_{x=0}^{1} x\,\pi^x(1 - \pi)^{1-x} = 0 \times (1 - \pi) + 1 \times \pi = \pi.$$
Also, we have:
$$E(X^2) = \sum_x x^2\,p_X(x) = \sum_{x=0}^{1} x^2\,\pi^x(1 - \pi)^{1-x} = 0^2 \times (1 - \pi) + 1^2 \times \pi = \pi.$$
Therefore:
$$\text{Var}(X) = E(X^2) - (E(X))^2 = \pi - \pi^2 = \pi(1 - \pi).$$


Example 3.23 Suppose that $X \sim \text{Bin}(n, \pi)$. We know that $E(X) = n\pi$. To find Var(X) it is most convenient to calculate $E(X(X - 1))$, from which we can recover $E(X^2)$. We have:
$$\begin{aligned}
E(X(X - 1)) &= \sum_x x(x - 1)\,p_X(x) && \text{(by definition)} \\
&= \sum_{x=0}^{n} x(x - 1)\binom{n}{x}\pi^x(1 - \pi)^{n-x} && \text{(substituting } p_X(x)\text{)} \\
&= \sum_{x=2}^{n} x(x - 1)\binom{n}{x}\pi^x(1 - \pi)^{n-x} && \text{(since } x(x - 1)\,p_X(x) = 0 \text{ when } x = 0 \text{ and } 1\text{)} \\
&= \sum_{x=2}^{n} \frac{n(n - 1)(n - 2)!}{(x - 2)!\,[(n - 2) - (x - 2)]!}\,\pi^2\,\pi^{x-2}(1 - \pi)^{n-x} && \text{(as } x(x - 1)/x! = 1/(x - 2)!\text{)} \\
&= n(n - 1)\pi^2\sum_{x=2}^{n}\binom{n - 2}{x - 2}\pi^{x-2}(1 - \pi)^{n-x} && \text{(taking } n(n - 1)\pi^2 \text{ outside)} \\
&= n(n - 1)\pi^2\sum_{y=0}^{n-2}\binom{n - 2}{y}\pi^y(1 - \pi)^{(n-2)-y} && \text{(setting } y = x - 2\text{)} \\
&= n(n - 1)\pi^2 \times 1 && \text{(since the sum is a Bin}(n - 2, \pi)\text{ mass function)} \\
&= n(n - 1)\pi^2.
\end{aligned}$$
Therefore:
$$\text{Var}(X) = E(X(X - 1)) - E(X)\,E(X - 1) = n(n - 1)\pi^2 - n\pi(n\pi - 1) = n\pi((n - 1)\pi - (n\pi - 1)) = n\pi(1 - \pi).$$
In practice, this is often written as $\text{Var}(X) = npq$, where $p = \pi$ and $q = 1 - p$.
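The factorial-moment route used in Example 3.23 can be verified numerically. The sketch below is an addition (Python with scipy assumed; n = 9 and $\pi = 0.25$ are arbitrary).

from scipy.stats import binom

n, pi = 9, 0.25                                   # arbitrary parameter values
p = lambda x: binom.pmf(x, n, pi)

e_x = sum(x * p(x) for x in range(n + 1))                     # E(X)
e_xx1 = sum(x*(x - 1) * p(x) for x in range(n + 1))           # E(X(X - 1))
print(e_xx1 - e_x*(e_x - 1), n*pi*(1 - pi))                   # both 1.6875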

Activity 3.6 Consider a continuous random variable X with density function:
$$f_X(x) = \begin{cases} x - \dfrac{x^3}{4} & \text{for } 0 \le x \le 2 \\ 0 & \text{otherwise.} \end{cases}$$
Calculate E(X) and Var(X).

Activity 3.7 Prove that, if $\lambda > 0$, then $xe^{-\lambda x} \to 0$ as $x \to \infty$. Hence use integration by parts to show that if $X \sim \text{Exp}(\lambda)$, then $E(X) = 1/\lambda$.

Activity 3.8 For a random variable X, prove that:

(a) $\text{Var}(X) = E(X^2) - (E(X))^2$.

(b) $\text{Var}(X) = E(X(X - 1)) - E(X)\,E(X - 1)$.

Activity 3.9 Suppose $X \sim \text{Exp}(\lambda)$. Calculate Var(X).

Activity 3.10 Suppose X is a random variable. Show that:
$$E(I_{(-\infty, x]}(X)) = F_X(x).$$

3.9.4 Inequalities involving expectation

Here we consider bounds for probabilities and expectations which are beneficial due to their generality. These can be useful in proofs of convergence results. We begin with the Markov inequality.

Markov inequality

Let X be a positive random variable with $E(X) < \infty$, then:
$$P(X \ge a) \le \frac{E(X)}{a}$$
for any constant a > 0.

Proof: Here we consider the continuous case. A similar argument holds for the discrete case.
$$\begin{aligned}
P(X \ge a) &= \int_a^{\infty} f_X(x)\,dx && \text{(by definition)} \\
\Rightarrow \quad P(X \ge a) &\le \int_a^{\infty}\frac{x}{a}\,f_X(x)\,dx && \text{(since } 1 \le x/a \text{ for } x \in [a, \infty)\text{)} \\
\Rightarrow \quad P(X \ge a) &\le \frac{1}{a}\int_0^{\infty} x\,f_X(x)\,dx && \text{(since } \textstyle\int_a^{\infty} g(x)\,dx \le \int_0^{\infty} g(x)\,dx \text{ for positive } g\text{)} \\
\Rightarrow \quad P(X \ge a) &\le \frac{1}{a}\,E(X). && \text{(by definition of } E(X)\text{)}
\end{aligned}$$

So the Markov inequality provides an upper bound on the probability in the upper tail of a distribution. Its appeal lies in its generality, since no distributional assumptions are required. However, a consequence is that the bound may be very loose as the following example demonstrates.


Example 3.24 Suppose human life expectancy in a developed country is 80 years. If we let X denote the positive random variable of lifespan, then E(X) = 80. Without imposing a distributional assumption on lifespan, we may find an upper bound on the probability that a human in the country lives to be over 160. Using the Markov inequality we have:
$$P(X \ge 160) \le \frac{E(X)}{160} = \frac{80}{160} = 0.5$$
which is unrealistic as we would expect this probability to be (very) close to zero!

We now extend the Markov inequality to consider random variables which are not constrained to be positive, using the Chebyshev inequality.

Chebyshev inequality

Let X be a random variable with $\text{Var}(X) < \infty$, then:
$$P(|X - E(X)| \ge a) \le \frac{\text{Var}(X)}{a^2}$$
for any constant a > 0.

Proof: This follows from the Markov inequality by setting $Y = (X - E(X))^2$. Hence Y is a positive random variable so the Markov inequality holds. By definition $E(Y) = \text{Var}(X)$, so the Markov inequality gives:
$$P((X - E(X))^2 \ge a^2) = P(|X - E(X)| \ge a) \le \frac{\text{Var}(X)}{a^2}$$
using $a^2$ in place of a. $\blacksquare$

Applying the Chebyshev inequality to a standardised distribution, i.e. if X is a random variable with $E(X) = \mu$ and $\text{Var}(X) = \sigma^2$, then for a real constant k > 0 we have:
$$P\left(\frac{|X - \mu|}{\sigma} \ge k\right) \le \frac{1}{k^2}.$$
Proof: This follows immediately from the Chebyshev inequality by setting $a = k\sigma$. $\blacksquare$

Example 3.25 Suppose we seek an upper bound on the probability of a random variable lying beyond two standard deviations from its mean. We have:
$$P(|X - \mu| \ge 2\sigma) \le \frac{1}{4}.$$
If $X \sim N(\mu, \sigma^2)$, then it is known that there is (approximately) a 0.05 probability of being beyond two standard deviations from the mean, which is clearly much lower than 0.25.
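The comparison in Example 3.25 is quickly reproduced with a normal tail probability. The line below is an added sketch (Python with scipy assumed).

from scipy.stats import norm

exact = 2 * norm.sf(2)        # P(|Z| >= 2) for Z ~ N(0, 1)
print(exact, 1/4)             # about 0.0455 versus the Chebyshev bound of 0.25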

The above example demonstrates that (for the normal distribution at least) the bound
can be very inaccurate. However, it is the generalisability of the result to all
distributions with finite variance which makes this a useful result.


We now consider a final inequality – the Jensen inequality – but first we begin with the definition of a convex function.

Convex function

A function $g : \mathbb{R} \to \mathbb{R}$ is a convex function if for any real constant a we can find k such that:
$$g(x) \ge g(a) + k(x - a) \quad \text{for all } x \in \mathbb{R}.$$

Example 3.26 The function $g(x) = x^2$ is a convex function.

Jensen inequality

If X is a random variable with $E(X) < \infty$ and g is a convex function such that $E(g(X)) < \infty$, then:
$$E(g(X)) \ge g(E(X)).$$

Example 3.27 We consider applications of the Jensen inequality such that we may derive various relationships involving expectation.

1. Linear: If $g(x) = a_0 + a_1 x$, then g is convex. Applying the Jensen inequality we have:
$$E(a_0 + a_1 X) \ge a_0 + a_1 E(X).$$
Indeed, by linearity of expectation, $E(a_0 + a_1 X) = a_0 + a_1 E(X)$.

2. Quadratic: If $g(x) = x^2$, then g is convex. Applying the Jensen inequality we have:
$$E(X^2) \ge (E(X))^2.$$
It then follows that $\text{Var}(X) = E(X^2) - (E(X))^2 \ge 0$, ensuring the variance is non-negative.

3. Reciprocal: If $g(x) = 1/x$, then g is convex for $x > 0$. Applying the Jensen inequality we have:
$$E\left(\frac{1}{X}\right) \ge \frac{1}{E(X)}.$$

3.9.5 Moments

Characterising a probability distribution by key attributes is desirable. For a random variable X the mean, E(X), is our preferred measure of central tendency, while the variance, $\text{Var}(X) = E((X - E(X))^2)$, is our preferred measure of dispersion (or its standard deviation). However, these are not exhaustive of distribution attributes which may interest us. Skewness (the departure from symmetry) and kurtosis (the fatness of tails) are also important, albeit less important than the mean and variance on a relative basis.

On a rank-order basis we will think of the mean as being the most important attribute of a distribution, followed by the variance, skewness and then kurtosis. Nonetheless, all of these attributes may be expressed in terms of moments and central moments, now defined. (Note that moments were introduced in ST104b Statistics 2 in the context of method of moments estimation.)

Moments

If X is a random variable, and r is a positive integer, then the rth moment of X is:
$$\mu_r = E(X^r)$$
whenever this is well-defined.

Example 3.28 Setting r = 1 produces the first moment, which is the mean of the distribution since:
$$\mu_1 = E(X^1) = E(X) = \mu$$
provided $E(X) < \infty$.
Setting r = 2 produces the second moment, which combined with the mean can be used to determine the variance since:
$$\text{Var}(X) = E(X^2) - (E(X))^2 = \mu_2 - (\mu_1)^2 = \sigma^2$$
provided $E(X^2) < \infty$.

Moments are determined by the horizontal location of the distribution. For the variance, our preferred measure of dispersion, we would wish this to be invariant to a shift up or down the horizontal axis. This leads to central moments which account for the value of the mean.

Central moments

If X is a random variable, and r is a positive integer, then the rth central moment of X is:
$$\mu_r' = E((X - E(X))^r)$$
whenever this is well-defined.

Example 3.29 Setting r = 1 produces the first central moment, which is always zero since:
$$\mu_1' = E((X - E(X))^1) = E(X - \mu_1) = E(X) - \mu_1 = E(X) - E(X) = 0$$
provided $E(X) < \infty$.
Setting r = 2 produces the second central moment, which is the variance of the distribution since:
$$\mu_2' = E((X - E(X))^2) = \text{Var}(X) = \sigma^2$$
provided $E(X^2) < \infty$.


Central moments can be expressed in terms of (non-central) moments by:

µ′r = Σ_{i=0}^{r} C(r, i) (−µ₁)^i µ_{r−i}   (3.3)

where C(r, i) denotes the binomial coefficient.

Example 3.30 Using (3.3), the second central moment can be expressed as:

µ′₂ = Σ_{i=0}^{2} C(2, i) (−µ₁)^i µ_{2−i} = µ₂ − 2(µ₁)² + (µ₁)² = µ₂ − (µ₁)²

noting that µ₀ = E(X⁰) = E(1) = 1. Of course, this is just an alternative way of
saying that Var(X) = E(X²) − (E(X))².

Example 3.31 Using (3.3), the third central moment can be expressed as:

µ′₃ = Σ_{i=0}^{3} C(3, i) (−µ₁)^i µ_{3−i} = µ₃ − 3µ₁µ₂ + 3(µ₁)³ − (µ₁)³ = µ₃ − 3µ₁µ₂ + 2(µ₁)³.
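Equation (3.3) is easy to apply mechanically. The sketch below (illustrative only; the helper function and the Exp(2) test case are assumptions made for demonstration, using the raw moments µr = r!/λ^r derived in Example 3.33 below) converts raw moments into central moments via the binomial sum:

```python
from math import comb, factorial

def central_moment(raw, r):
    # rth central moment from raw moments, using (3.3); raw[i] holds mu_i, with raw[0] = 1.
    mu1 = raw[1]
    return sum(comb(r, i) * (-mu1)**i * raw[r - i] for i in range(r + 1))

# Raw moments of Exp(lambda) are mu_r = r!/lambda^r (Example 3.33 below); take lambda = 2.
lam = 2.0
raw = [factorial(r) / lam**r for r in range(5)]
print(central_moment(raw, 2))  # second central moment: 1/lambda^2 = 0.25
print(central_moment(raw, 3))  # third central moment:  2/lambda^3 = 0.25
```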

Example 3.32 Let X be a random variable which has a Bernoulli distribution with
parameter π.

(a) Show that E(X^r) = π for r = 1, 2, . . ..

(b) Find the third central moment of X.

(c) Show that the mean of the binomial distribution with parameters n and π is
equal to nπ.

Solution
We have that X ∼ Bernoulli(π).

(a) Since X^r = X it follows that:

E(X^r) = E(X) = 0 × (1 − π) + 1 × π = π.

(b) We have:

E((X − E(X))³) = π(1 − π)³ − (1 − π)π³
               = π(1 − π)((1 − π)² − π²)
               = π(1 − π)(1 − 2π).

(c) Define Y = Σ_{i=1}^{n} Xi, where the Xi s are i.i.d. Bernoulli(π) random variables. Hence:

E(Y) = E(Σ_{i=1}^{n} Xi) = Σ_{i=1}^{n} E(Xi) = Σ_{i=1}^{n} π = nπ.


Example 3.28 showed that the first moment is the mean (our preferred measure of
central tendency), while Example 3.29 showed that the second central moment is the
variance (our preferred measure of dispersion). We now express skewness and kurtosis in
terms of moments.

Coefficient of skewness

If X is a random variable with Var(X) = σ² < ∞, the coefficient of skewness is:

Skew(X) = γ₁ = E(((X − µ)/σ)³) = E((X − µ)³)/σ³ = µ′₃/(µ′₂)^{3/2}.

So we see that skewness depends on the third central moment, although the reason for
this may not be immediately clear. It can be explained by first noting that if g(x) = x³,
then g is an odd function, meaning g(−x) = −g(x). For a continuous random variable X
with density function fX, the third central moment is:

E((X − µ)³) = ∫_{−∞}^{∞} (x − µ)³ fX(x) dx
            = ∫_{−∞}^{∞} z³ fX(µ + z) dz                   (where z = x − µ)
            = ∫_0^∞ z³ (fX(µ + z) − fX(µ − z)) dz.          (since z³ is odd)

The term (fX(µ + z) − fX(µ − z)) compares the density at the points a distance z above
and below µ, such that any difference signals asymmetry. If such sources of asymmetry
are far from µ, then when multiplied by z³ they result in a large coefficient of skewness.

Example 3.33 Suppose X ∼ Exp(λ). To derive the coefficient of skewness we
require the second and third central moments, i.e. µ′₂ and µ′₃, respectively. Here these
will be calculated from the first three (non-central) moments, µ₁, µ₂ and µ₃. We will
proceed to find a general expression for the rth moment for no additional effort. We
have:

µr = E(X^r) = ∫_{−∞}^{∞} x^r fX(x) dx
            = ∫_0^∞ x^r λ e^{−λx} dx                              (using the exponential density)
            = [−x^r e^{−λx}]_0^∞ + r ∫_0^∞ x^{r−1} e^{−λx} dx      (using integration by parts)
            = (r/λ) µ_{r−1}.

Noting that µ₀ = 1, by recursion we have:

µr = (r/λ) ((r − 1)/λ) · · · (2/λ) (1/λ) = r!/λ^r

from which we obtain:

µ₁ = 1/λ,   µ₂ = 2/λ²   and   µ₃ = 6/λ³.

Hence the second central moment is:

µ′₂ = µ₂ − (µ₁)² = 2/λ² − (1/λ)² = 1/λ²

and the third central moment is:

µ′₃ = µ₃ − 3µ₁µ₂ + 2(µ₁)³ = 6/λ³ − 3 × (1/λ) × (2/λ²) + 2 × (1/λ)³ = 2/λ³.

Therefore, the coefficient of skewness is:

γ₁ = µ′₃/(µ′₂)^{3/2} = (2/λ³)/(1/λ²)^{3/2} = (2/λ³)/(1/λ³) = 2

which is positive (recall that the exponential distribution is positively skewed).
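A quick simulation (a sketch only; the rate λ = 1.5, the seed and the sample size are arbitrary assumed values) reproduces the value 2, which does not depend on λ:

```python
import numpy as np

# Sample skewness of simulated Exp(lambda) data; the population coefficient is 2 for any lambda.
rng = np.random.default_rng(seed=2)
lam = 1.5
x = rng.exponential(scale=1/lam, size=1_000_000)

mu = x.mean()
m2 = np.mean((x - mu)**2)    # sample second central moment
m3 = np.mean((x - mu)**3)    # sample third central moment
print(m3 / m2**1.5)          # approximately 2
```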

Coefficient of kurtosis

If X is a random variable with Var(X) = σ² < ∞, the coefficient of kurtosis is:

Kurt(X) = γ₂ = E(((X − µ)/σ)⁴) − 3 = E((X − µ)⁴)/σ⁴ − 3 = µ′₄/(µ′₂)² − 3.   (3.4)

We see that the term ‘−3’ appears in the definition of kurtosis. Convention means we
measure kurtosis with respect to a normal distribution. Noting that the fourth central
moment of a normal distribution is 3σ⁴, we have that the kurtosis for a normal
distribution is:

γ₂ = E((X − µ)⁴)/σ⁴ − 3 = 3σ⁴/σ⁴ − 3 = 0.

The coefficient of kurtosis as defined in (3.4) with the ‘−3’ term is often called excess
kurtosis, i.e. kurtosis in excess of that of a normal distribution.

Example 3.34 Suppose X ∼ Exp(λ). To derive the coefficient of kurtosis we
require the fourth central moment. Example 3.33 gave us the general expression for
the rth moment:

µr = r!/λ^r.

The fourth central moment is:

µ′₄ = Σ_{i=0}^{4} C(4, i) (−µ₁)^i µ_{4−i} = µ₄ − 4µ₁µ₃ + 6(µ₁)²µ₂ − 4(µ₁)⁴ + (µ₁)⁴ = 9/λ⁴.

Since µ′₂ = 1/λ², the coefficient of kurtosis is:

γ₂ = µ′₄/(µ′₂)² − 3 = (9/λ⁴)/(1/λ²)² − 3 = 9 − 3 = 6.


Example 3.35 Find the mean and variance of the gamma distribution:

fX(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx} = (λ^α/(α − 1)!) x^{α−1} e^{−λx}

for x ≥ 0, α > 0 and λ > 0, where Γ(α) = (α − 1)!.

Hint: Note that since fX(x) is a density function, we can write:

∫_0^∞ (λ^α/(α − 1)!) x^{α−1} e^{−λx} dx = 1.

Solution
We can find the rth moment, and use this result to get the mean and variance. We
have:

E(X^r) = µr = ∫_{−∞}^{∞} x^r fX(x) dx = ∫_0^∞ x^r (λ^α/(α − 1)!) x^{α−1} e^{−λx} dx
       = ∫_0^∞ (1/(α − 1)!) x^{r+α−1} λ^α e^{−λx} dx
       = ((r + α − 1)!/((α − 1)! λ^r)) ∫_0^∞ (1/(r + α − 1)!) x^{(r+α)−1} λ^{r+α} e^{−λx} dx
       = (r + α − 1)!/((α − 1)! λ^r)

since the integrand is a Gamma(r + α, λ) density function, which integrates to 1. So:

µr = (r + α − 1)!/((α − 1)! λ^r).

Using the result:

E(X) = µ₁ = α!/((α − 1)! λ) = α/λ

and:

E(X²) = µ₂ = (α + 1)!/((α − 1)! λ²) = α(α + 1)/λ².

Therefore, the variance is:

µ₂ − (µ₁)² = α(α + 1)/λ² − α²/λ² = α/λ².

Both the mean and variance increase with α increasing and decrease with λ
increasing.
We can also compute E(X) and E(X²) by substituting y = λx. Note that this gives
dx = (1/λ) dy. For example:

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^∞ x (λ^α/(α − 1)!) x^{α−1} e^{−λx} dx = (1/(λ(α − 1)!)) ∫_0^∞ y^α e^{−y} dy

and recognise that the integral is the definition of Γ(α + 1) = α!.


Example 3.36 Find the mean and variance of the Poisson distribution:

pX(x) = e^{−λ} λ^x/x!

for x = 0, 1, 2, . . ., and λ > 0.

Hint: Note that since pX(x) is a mass function, we can write:

Σ_{x=0}^{∞} e^{−λ} λ^x/x! = 1.

Solution
By direct calculation we have:

E(X) = Σ_{x=0}^{∞} x e^{−λ} λ^x/x! = e^{−λ} Σ_{x=1}^{∞} λ^x/(x − 1)! = λ e^{−λ} Σ_{y=0}^{∞} λ^y/y! = λ e^{−λ} e^{λ} = λ

setting y = x − 1, and:

E(X²) = Σ_{x=0}^{∞} x² e^{−λ} λ^x/x! = Σ_{x=1}^{∞} x² e^{−λ} λ^x/x! = e^{−λ} Σ_{x=1}^{∞} x λ^x/(x − 1)!
      = λ e^{−λ} Σ_{y=0}^{∞} (y + 1) λ^y/y!
      = λ e^{−λ} Σ_{y=0}^{∞} y λ^y/y! + λ e^{−λ} Σ_{y=0}^{∞} λ^y/y!
      = λ e^{−λ} e^{λ} E(Y) + λ e^{−λ} e^{λ}
      = λ² + λ

again setting y = x − 1, and noting that for Y ∼ Pois(λ) then E(Y) = λ, so that
Σ_{y=0}^{∞} y λ^y/y! = e^{λ} E(Y).

Another way here is to find the rth factorial moment:

µ(r) = E(X^(r)) = E(X(X − 1) · · · (X − r + 1)).

This works out very simply. We can then convert to the mean and variance. The
critical property that makes µ(r) work out easily is that for an integer x ≥ r we have:

x^(r)/x! = 1/(x − r)!

while x^(r) = 0 for integers x from 0 to r − 1, so those terms drop out of the sum. We
have:

E(X^(r)) = Σ_{x=0}^{∞} x^(r) e^{−λ} λ^x/x! = e^{−λ} Σ_{x=r}^{∞} x^(r) λ^x/x!
         = e^{−λ} Σ_{x=r}^{∞} λ^x/(x − r)!
         = λ^r e^{−λ} Σ_{x=r}^{∞} λ^{x−r}/(x − r)!
         = λ^r Σ_{y=0}^{∞} e^{−λ} λ^y/y!
         = λ^r.

The last step follows because we are adding together all the probabilities for a
Poisson distribution with parameter λ.
Now it is straightforward to get the mean and variance. For the mean we have:

E(X) = E(X^(1)) = λ.

Since E(X(X − 1)) + E(X) = E(X²), then:

µ₂ = E(X^(2)) + E(X) = µ(2) + µ(1) = λ² + λ

and so:

Var(X) = µ₂ − (µ₁)² = λ² + λ − λ² = λ.
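The factorial-moment result E(X^(r)) = λ^r is easy to check numerically for small r (a sketch only; λ = 3, the seed and the sample size are assumptions made for illustration):

```python
import numpy as np

# Check E[X(X - 1)] = lambda^2 and E(X) = Var(X) = lambda for X ~ Pois(lambda) by simulation.
rng = np.random.default_rng(seed=3)
lam = 3.0
x = rng.poisson(lam=lam, size=1_000_000)

print(np.mean(x * (x - 1)), lam**2)   # second factorial moment, approximately lambda^2 = 9
print(x.mean(), x.var())              # both approximately lambda = 3
```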

Example 3.37 Find the mean and variance of the Pareto distribution:

fX(x) = (a − 1)/(1 + x)^a

for x > 0 and a > 1.

Hint: It is easier to find E(X + 1) and E((X + 1)²). We then have that
E(X) = E(X + 1) − 1 and E(X²) comes from E((X + 1)²) in a similar manner.

Solution
We can directly integrate for µr, writing the integral as a Pareto distribution with
parameter a − r after a transformation. It is easier to work with Y = X + 1, noting
that E(X) = E(Y) − 1 and Var(Y) = Var(X). We have:

E(Y^r) = ∫_0^∞ (x + 1)^r (a − 1)/(1 + x)^a dx = ((a − 1)/(a − r − 1)) ∫_0^∞ (a − r − 1)/(1 + x)^{a−r} dx = (a − 1)/(a − r − 1)

provided that a − r > 1 (otherwise the integral is not defined). So:

E(Y) = (a − 1)/(a − 2)   ⇒   E(X) = 1/(a − 2)

for a > 2. Provided a > 3, then:

Var(X) = Var(Y) = E(Y²) − (E(Y))² = (a − 1)/(a − 3) − ((a − 1)/(a − 2))²
       = (a − 1)((a − 2)² − (a − 1)(a − 3))/((a − 2)²(a − 3))
       = (a − 1)/((a − 2)²(a − 3)).

3.10 Generating functions

3.10.1 Moment generating functions


In the previous section we saw how useful moments (and central moments) are for
expressing important attributes of a probability distribution such as the mean, variance,
skewness and kurtosis. For many distributions, all of the moments E(X), E(X²), . . .
can be summarised in a single function known as the moment generating function.
This is, literally, a function (when it exists) which can be used to generate the moments
of a distribution.

Moment generating function

The moment generating function (mgf) of a random variable X is a function
MX : ℝ → [0, ∞) defined as:

MX(t) = E(e^{tX}) = Σ_x e^{tx} pX(x)              for discrete X
MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} fX(x) dx    for continuous X

where to be well-defined we require MX(t) < ∞ for all t ∈ [−h, h] for some h > 0.
So the mgf must be defined in an interval around the origin, which will be necessary
when derivatives of the mgf with respect to t are taken and evaluated when t = 0,
i.e. such as MX′(0), MX″(0) etc.

If the expected value E(e^{tX}) is infinite, the random variable X does not have an mgf.

MX(t) is a function of real numbers t. It is not a random variable itself.

The form of the mgf is not interesting or informative in itself. Instead, the reason we
define the mgf is that it is a convenient tool for deriving means and variances of
distributions, using the following results:

MX′(0) = E(X) and MX″(0) = E(X²)

which also gives:

Var(X) = E(X²) − (E(X))² = MX″(0) − (MX′(0))².


This is useful if the mgf is easier to derive than E(X) and Var(X) directly.
Other moments are obtained from the mgf similarly:

MX^(r)(0) = E(X^r) for r = 1, 2, . . . .

To see why, note that MX(t) is the expected value of an exponential function of X.
Recall the Taylor expansion of e^x is:

e^x = 1 + x + x²/2! + · · · + x^r/r! + · · · = Σ_{i=0}^{∞} x^i/i!.

All the derivatives of e^x are also e^x, i.e. the rth derivative is:

(d^r/dx^r) e^x = e^x for r = 1, 2, . . . .

Therefore, we may express the moment generating function as a polynomial in t, i.e. we
have:

MX(t) = 1 + t E(X) + (t²/2!) E(X²) + · · · + (t^r/r!) E(X^r) + · · · = Σ_{i=0}^{∞} (t^i/i!) E(X^i).

Proof: This follows immediately from the series expansion of e^x and the linearity of
expectation:

MX(t) = E(e^{tX}) = E(Σ_{i=0}^{∞} (tX)^i/i!) = Σ_{i=0}^{∞} (t^i/i!) E(X^i).  ∎


We are now in a position to understand how moments can be generated from a moment
generating function. There are two approaches.

1. Use the coefficients of E(X^r), for r = 1, 2, . . ., in the series expansion of MX(t).

2. Use derivatives of MX (t).

Determining moments by comparing coefficients

The coefficient of t^r in the series expansion of MX(t) is the rth moment divided by
r!. Hence the rth moment can be determined by comparing coefficients. We have:

MX(t) = Σ_{i=0}^{∞} (t^i/i!) E(X^i) = Σ_{i=0}^{∞} ai t^i   ⇒   E(X^r) = r! ar

where ai = E(X^i)/i!.

This method is quick provided it is easy to derive the polynomial expansion in t.


Determining moments using derivatives

The rth derivative of a moment generating function evaluated at zero is the rth
moment, that is:

MX^(r)(0) = (d^r/dt^r) MX(t) |_{t=0} = E(X^r) = µr.

When the polynomial expansion in t is not easy to derive, calculating derivatives of
MX(t) is more convenient.

Proof: Since:

MX(t) = 1 + t E(X) + (t²/2!) E(X²) + · · · + (t^r/r!) E(X^r) + · · ·

then:

MX^(r)(t) = E(X^r) + t E(X^{r+1}) + (t²/2!) E(X^{r+2}) + · · · = Σ_{i=r}^{∞} (t^{i−r}/(i − r)!) E(X^i).

When evaluated at t = 0 only the first term, E(X^r), is non-zero, proving the result.  ∎
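Both approaches can be automated with a computer algebra system. The sketch below (illustrative only; it assumes the Exp(λ) mgf MX(t) = λ/(λ − t), which is derived in Example 3.40 below) obtains µ₁ and µ₂ by differentiating and evaluating at t = 0:

```python
import sympy as sp

# Moments from an mgf by repeated differentiation at t = 0.
t, lam = sp.symbols('t lambda', positive=True)
M = lam / (lam - t)                 # mgf of Exp(lambda), from Example 3.40

mu1 = sp.diff(M, t, 1).subs(t, 0)   # E(X)   = 1/lambda
mu2 = sp.diff(M, t, 2).subs(t, 0)   # E(X^2) = 2/lambda^2
print(sp.simplify(mu1), sp.simplify(mu2), sp.simplify(mu2 - mu1**2))  # variance = 1/lambda^2
```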
The moment generating function uniquely determines a probability distribution. In
other words, if for two random variables X and Y we have MX (t) = MY (t) (for points
around t = 0), then X and Y have the same distribution.

Uniqueness of the moment generating function

If X and Y are random variables and we can find h > 0 such that MX(t) = MY(t)
for all t ∈ [−h, h], then FX(x) = FY(x) for all x ∈ ℝ.

We now show examples of deriving moment generating functions and subsequently
using them to obtain moments.

Example 3.38 Suppose X ∼ Pois(λ), i.e. we have:

pX(x) = e^{−λ} λ^x/x! for x = 0, 1, 2, . . ., and 0 otherwise.

The moment generating function for this distribution is:

MX(t) = E(e^{tX}) = Σ_x e^{tx} pX(x) = Σ_{x=0}^{∞} e^{tx} e^{−λ} λ^x/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x! = exp(λ(e^t − 1)).

From MX(t) = exp(λ(e^t − 1)) we obtain:

MX′(t) = λe^t e^{λ(e^t − 1)}

and:

MX″(t) = λe^t (1 + λe^t) e^{λ(e^t − 1)}

and hence (since e⁰ = 1):

MX′(0) = λ = E(X)   and   MX″(0) = λ(1 + λ) = E(X²).

Therefore:

Var(X) = E(X²) − (E(X))² = λ(1 + λ) − λ² = λ.
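As a numerical sanity check (a sketch; λ = 2, t = 0.3, the seed and the sample size are arbitrary assumptions), the sample average of e^{tX} over simulated Poisson draws should be close to exp(λ(e^t − 1)):

```python
import numpy as np

# Compare a Monte Carlo estimate of E(e^{tX}) with the closed-form mgf exp(lambda(e^t - 1)).
rng = np.random.default_rng(seed=4)
lam, t = 2.0, 0.3
x = rng.poisson(lam=lam, size=1_000_000)

print(np.mean(np.exp(t * x)))          # Monte Carlo estimate of the mgf at t
print(np.exp(lam * (np.exp(t) - 1)))   # exact value of the mgf at t
```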
Example 3.39 Suppose X ∼ Geo(π), for the second version of the geometric
distribution, i.e. we have:

pX(x) = (1 − π)^x π for x = 0, 1, 2, . . ., and 0 otherwise.

The moment generating function for this distribution is:

MX(t) = E(e^{tX}) = Σ_x e^{tx} pX(x) = Σ_{x=0}^{∞} e^{tx} (1 − π)^x π = π Σ_{x=0}^{∞} (e^t(1 − π))^x = π/(1 − e^t(1 − π))

using the sum to infinity of a geometric series, for t < −ln(1 − π) to ensure
convergence of the sum.
From MX(t) = π/(1 − e^t(1 − π)) we obtain, using the chain rule:

MX′(t) = π(1 − π)e^t/(1 − e^t(1 − π))²

also, using the quotient rule:

MX″(t) = π(1 − π)e^t (1 − (1 − π)e^t)(1 + (1 − π)e^t)/(1 − e^t(1 − π))⁴

and hence (since e⁰ = 1):

MX′(0) = (1 − π)/π = E(X).

For the variance:

MX″(0) = (1 − π)(2 − π)/π² = E(X²)

and so:

Var(X) = E(X²) − (E(X))² = (1 − π)(2 − π)/π² − (1 − π)²/π² = (1 − π)/π².

Example 3.40 Suppose X ∼ Exp(λ), i.e. with density function:

fX(x) = λe^{−λx} for x ≥ 0, and 0 otherwise.

The moment generating function for this distribution is:

MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} fX(x) dx
      = ∫_0^∞ e^{tx} λ e^{−λx} dx
      = λ ∫_0^∞ e^{−(λ−t)x} dx
      = (λ/(λ − t)) ∫_0^∞ (λ − t) e^{−(λ−t)x} dx
      = λ/(λ − t)   for t < λ

where note the integral is that of an Exp(λ − t) distribution over its support, hence
is equal to 1.
From MX(t) = λ/(λ − t) we obtain:

MX′(t) = λ/(λ − t)²   and   MX″(t) = 2λ/(λ − t)³

so:

E(X) = MX′(0) = 1/λ   and   E(X²) = MX″(0) = 2/λ²

and:

Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².

Example 3.41 Suppose X ∼ Gamma(α, λ), i.e. with density function:

fX(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx} for x ≥ 0, and 0 otherwise.

The moment generating function for this distribution is:

MX(t) = E(e^{tX}) = ∫_{−∞}^{∞} e^{tx} fX(x) dx
      = ∫_0^∞ e^{tx} (λ^α/Γ(α)) x^{α−1} e^{−λx} dx
      = (λ^α/(λ − t)^α) ∫_0^∞ ((λ − t)^α/Γ(α)) x^{α−1} e^{−(λ−t)x} dx
      = (λ/(λ − t))^α   for t < λ

where note the integral is that of a Gamma(α, λ − t) distribution over its support,
hence is equal to 1, and where we multiplied by (λ − t)^α/(λ − t)^α = 1 (hence not
affecting the integral) to ‘create’ a Gamma(α, λ − t) density function. Since (λ − t)^α
does not depend on x we can place the numerator term inside the integral and the
denominator term outside the integral.
We can divide through by λ to obtain:

MX(t) = (1 − t/λ)^{−α}   for t < λ.

Noting the negative binomial expansion given by:

(1 − a)^{−n} = Σ_{i=0}^{∞} C(i + n − 1, n − 1) a^i

we have:

MX(t) = Σ_{i=0}^{∞} C(i + α − 1, α − 1) (t/λ)^i = Σ_{i=0}^{∞} ((i + α − 1)!/((α − 1)! λ^i)) t^i/i!.

Since the rth moment is the coefficient of t^r/r! in the polynomial expansion of
MX(t), we deduce that if X ∼ Gamma(α, λ), the rth moment is:

E(X^r) = µr = (r + α − 1)!/((α − 1)! λ^r).

We have previously seen that if X ∼ Gamma(1, λ), then X ∼ Exp(λ), and so:

MX(t) = (λ/(λ − t))¹ = λ/(λ − t)   for t < λ

as derived in Example 3.40. Note the choice of parameter symbol is arbitrary.

Example 3.42 Find the moment generating function of X, where X ∼ N(µ, σ²).
Also, find the mean and the variance of Y = e^X. (Y has a log-normal distribution,
popular as a skewed distribution for positive random variables.)
Hint: Compute the mgf for a standard normal random variable Z = (X − µ)/σ,
where the density function of Z is:

fZ(z) = (1/√(2π)) e^{−z²/2}

then once you have the mgf of Z you can easily find the mgfs of X and Y without
integrals.

Solution
We are asked to find:

MX(t) = E(e^{tX})

for X ∼ N(µ, σ²). We may write X = µ + σZ, where Z ∼ N(0, 1), such that:

MX(t) = E(e^{t(µ+σZ)}) = e^{µt} E(e^{σtZ}) = e^{µt} MZ(σt).

So we only need to derive the mgf for a standard normal random variable. We have:

MZ(t) = ∫_{−∞}^{∞} e^{zt} (1/√(2π)) e^{−z²/2} dz = ∫_{−∞}^{∞} (1/√(2π)) e^{−((z−t)² − t²)/2} dz
      = e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−t)²/2} dz
      = e^{t²/2}.

The last integral is that of a N(t, 1) density function, and so is equal to 1. The step
from line 1 to line 2 follows from the simple algebraic identity:

−((z − t)² − t²)/2 = −z²/2 + zt.

The mgf for the general normal distribution is:

MX(t) = exp(µt + σ²t²/2).

The mean of Y is:

E(Y) = E(e^X) = MX(1) = exp(µ + σ²/2).

Also, noting Y² = e^{2X}, we have:

E(Y²) = E(e^{2X}) = MX(2) = exp(2µ + 4σ²/2) = exp(2µ + 2σ²)

hence the variance of Y is:

Var(Y) = E(Y²) − (E(Y))²
       = exp(2µ + 2σ²) − exp(2µ + σ²)
       = e^{2µ+σ²}(e^{σ²} − 1).
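These log-normal formulae are easy to verify by simulation (a sketch; µ = 0.5, σ = 0.8, the seed and the sample size are arbitrary assumed values):

```python
import numpy as np

# Check E(Y) = exp(mu + sigma^2/2) and Var(Y) = exp(2*mu + sigma^2)(exp(sigma^2) - 1) for Y = e^X.
rng = np.random.default_rng(seed=5)
mu, sigma = 0.5, 0.8
y = np.exp(rng.normal(loc=mu, scale=sigma, size=1_000_000))

print(y.mean(), np.exp(mu + sigma**2 / 2))
print(y.var(), np.exp(2*mu + sigma**2) * (np.exp(sigma**2) - 1))
```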

Example 3.43 Find the moment generating function of the double exponential, or
Laplace, distribution with density function:

fX(x) = (1/2) e^{−|x|} for −∞ < x < ∞.

Solution
We have:

MX(t) = ∫_{−∞}^{∞} e^{xt} (e^{−|x|}/2) dx = ∫_{−∞}^{0} e^{xt} (e^{x}/2) dx + ∫_0^∞ e^{xt} (e^{−x}/2) dx
      = ∫_{−∞}^{0} e^{x(1+t)}/2 dx + ∫_0^∞ e^{−x(1−t)}/2 dx
      = [e^{x(1+t)}/(2(1 + t))]_{−∞}^{0} + [−e^{−x(1−t)}/(2(1 − t))]_0^∞
      = 1/(2(1 + t)) + 1/(2(1 − t))
      = 1/(1 − t²)

where we require |t| < 1.

Example 3.44 A random variable X follows the Laplace distribution with
parameter λ > 0 if its density function has the form:

fX(x) = k e^{−λ|x|} for −∞ < x < ∞

where k is a normalising constant.

(a) Find k in terms of λ.

(b) Compute E(X³).

(c) Derive the moment generating function of X and provide the interval on which
this function is well-defined.

(d) Find the variance of X using the moment generating function derived in (c).

Solution

(a) Since:

1 = ∫_{−∞}^{∞} fX(x) dx = k ∫_{−∞}^{∞} e^{−λ|x|} dx = k (∫_{−∞}^{0} e^{λx} dx + ∫_0^∞ e^{−λx} dx)
  = k ([e^{λx}/λ]_{−∞}^{0} + [−e^{−λx}/λ]_0^∞)
  = k (1/λ + 1/λ)
  = 2k/λ

it follows that k = λ/2.

(b) As X has a symmetric distribution, i.e. f(−x) = f(x), and g(x) = x³ is an odd
function then g(−x) = −g(x), it follows that:

E(X³) = 0.

(c) We have:

MX(t) = E(e^{tX}) = (λ/2) ∫_{−∞}^{∞} e^{tx} e^{−λ|x|} dx
      = (λ/2) (∫_{−∞}^{0} e^{(t+λ)x} dx + ∫_0^∞ e^{(t−λ)x} dx)
      = (λ/2) (1/(t + λ) − 1/(t − λ))
      = λ²/(λ² − t²)

for |t| < λ.

(d) Since E(X) = 0 it holds that the variance of X is equal to E(X²). For |t| < λ,
we have:

MX″(t) = (d/dt) (2λ²t/(λ² − t²)²) = (2λ²(λ² − t²)² + 2λ²t (4t(λ² − t²)))/(λ² − t²)⁴.

Setting t = 0 we have:

E(X²) = MX″(0) = 2λ⁶/λ⁸ = 2/λ².

Note that this should not come as a surprise since X can be written as the
difference of two independent and identically distributed exponential random
variables, each with parameter λ, say X = T₁ − T₂. Hence, due to independence:

Var(X) = Var(T₁ − T₂) = Var(T₁) + Var(T₂) = 1/λ² + 1/λ² = 2/λ².

Activity 3.11 Consider the following game. You pay £5 to play the game. A fair
coin is tossed three times. If the first and last tosses are heads, you receive £10 for
each head.

(a) What is the expected return from playing this game?

(b) Derive the moment generating function of the return.


3.10.2 Cumulant generating functions and cumulants


Often we may choose to work with the logarithm of the moment generating function as
the coefficients of the polynomial expansion of this log-transformation have convenient
moment and central moment interpretations.

Cumulant generating function and cumulants


A random variable X with moment generating function MX (t) has a cumulant
generating function defined as:

KX (t) = log MX (t)

where ‘log’ is the natural logarithm, i.e. to the base e.

The rth cumulant, κr, is the coefficient of t^r/r! in the expansion of KX(t), so:

KX(t) = κ₁ t + κ₂ t²/2! + · · · + κr t^r/r! + · · · = Σ_{i=1}^{∞} κi t^i/i!.

As with the relationship between a moment generating function and moments, the same
relationship holds for a cumulant generating function and cumulants. There are two
approaches.

1. Use the coefficients of κk, for k = 1, 2, . . ., in the series expansion of KX(t).

2. Use derivatives of KX(t).

Determining cumulants by comparing coefficients

The coefficient of t^r in the series expansion of KX(t) is the rth cumulant divided by
r!. Hence the rth cumulant can be determined by comparing coefficients. We have:

KX(t) = Σ_{i=1}^{∞} κi t^i/i! = Σ_{i=1}^{∞} ai t^i   ⇒   κr = r! ar

where ai = κi/i!.

Determining cumulants using derivatives

The rth derivative of a cumulant generating function evaluated at zero is the rth
cumulant, that is:

KX^(r)(0) = (d^r/dt^r) KX(t) |_{t=0} = κr.

Cumulants may be expressed in terms of moments and central moments. In particular,


the first cumulant is the mean and the second cumulant is the variance.


Relationship between cumulants and moments

If X is a random variable with moments {µr}, central moments {µ′r} and cumulants
{κr}, for r = 1, 2, . . ., then:

i. the first cumulant is the mean:

κ₁ = E(X) = µ₁ = µ

ii. the second cumulant is the variance:

κ₂ = Var(X) = µ′₂ = σ²

iii. the third cumulant is the third central moment:

κ₃ = µ′₃

iv. a function of the fourth and second cumulants yields the fourth central moment:

κ₄ + 3κ₂² = µ′₄.

In this course we only prove the first two of these results.

Proof:

i. Applying the chain rule, noting that MX(0) = E(e⁰) = 1, we have:

KX′(t) = MX′(t)/MX(t)   ⇒   KX′(0) = MX′(0) = µ₁ = µ   ⇒   κ₁ = µ.

ii. Applying the product rule, writing KX′(t) = MX′(t)(MX(t))⁻¹, we have:

KX″(t) = MX″(t)/MX(t) − (MX′(t))²/(MX(t))²   ⇒   KX″(0) = µ₂ − (µ₁)²   ⇒   κ₂ = σ².  ∎

Example 3.45 Suppose that X is a degenerate random variable with probability
mass function:

pX(x) = 1 for x = µ, and 0 otherwise.

Show that the cumulant generating function of X is:

KX(t) = µt.

Solution
We have:

MX(t) = E(e^{tX}) = Σ_x e^{tx} pX(x) = e^{tµ} pX(µ) = e^{tµ} × 1 = e^{tµ}.

Hence:

KX(t) = log MX(t) = µt.

Example 3.46 For the Poisson distribution, its cumulant generating function has a
simpler functional form than its moment generating function. If X ∼ Pois(λ), its
cumulant generating function is:

KX(t) = log MX(t) = log exp(λ(e^t − 1)) = λ(e^t − 1).

Applying the series expansion of e^t, we have:

KX(t) = λ(t + t²/2! + · · ·) = Σ_{i=1}^{∞} λ t^i/i!.

Comparing coefficients, the rth cumulant is λ, i.e. κr = λ for all r = 1, 2, . . . .

Example 3.47 Suppose Z ∼ N(0, 1). From Example 3.42 we have MZ(t) = e^{t²/2},
hence taking the logarithm yields the cumulant generating function:

KZ(t) = log MZ(t) = log e^{t²/2} = t²/2.

Hence for a standard normal random variable, κ₂ = 1 and all other cumulants are
zero.
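Cumulants can also be read off a series expansion symbolically. The sketch below (illustrative only) expands the Poisson cumulant generating function λ(e^t − 1) from Example 3.46 and extracts each cumulant as r! times the coefficient of t^r:

```python
import sympy as sp

# Read cumulants off the series expansion of a cumulant generating function.
t, lam = sp.symbols('t lambda', positive=True)
K = lam * (sp.exp(t) - 1)                # cgf of Pois(lambda), Example 3.46

expansion = sp.series(K, t, 0, 5).removeO()
for r in range(1, 5):
    kappa_r = sp.factorial(r) * expansion.coeff(t, r)
    print(r, sp.simplify(kappa_r))       # every cumulant equals lambda
```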

Activity 3.12 Suppose X ∼ Exp(λ). Use the moment generating function of X to
show that E(X^r) = r!/λ^r.

Activity 3.13 For each of the following distributions derive the moment generating
function and the cumulant generating function:

(a) Bernoulli(π)

(b) Bin(n, π)

(c) Geometric(π), first version

(d) Neg. Bin(r, π), first version.

Comment on any relationships found.

Activity 3.14 Use cumulants to calculate the coefficient of skewness for a Poisson
distribution.

3.11 Functions of random variables


A well-behaved function, g : ℝ → ℝ, of a random variable X is also a random variable.
So, if Y = g(X), then Y is a random variable. While we have seen how to determine the


expectation of functions of a random variable (Section 3.9.2), in matters of statistical


inference we often need to know the actual distribution of g(X), not just its
expectation. We now proceed to show how to determine the distribution.

3.11.1 Distribution, mass and density functions of Y = g(X)

Let X be a random variable defined on (Ω, F, P) and let g : ℝ → ℝ be a well-behaved
function. Suppose Y = g(X). While g is a function, and hence each input has a single
output, g⁻¹ is not guaranteed to be a function as a single input could produce multiple
outputs. An obvious example is g(x) = x², for which g⁻¹(x²) = ±√(x²), with positive and
negative roots, hence there is not a single output for x² ≠ 0. In general:

FY(y) = P(Y ≤ y) = P(g(X) ≤ y) ≠ P(X ≤ g⁻¹(y)).

How to proceed? Well, we begin with the concept of the inverse image of a set.

Inverse image of a set

If g : ℝ → ℝ is a function and B is a subset of ℝ, then the inverse image of B
under g is the set of real numbers whose images under g lie in B, i.e. for all B ⊆ ℝ
the inverse image of B under g is:

g⁻¹(B) = {x ∈ ℝ : g(x) ∈ B}.

So the inverse image of B under g is the image of B under g⁻¹. Hence for any
well-behaved B ⊆ ℝ, we have:

P(Y ∈ B) = P(g(X) ∈ B) = P(X ∈ g⁻¹(B))

that is, the probability that g(X) is in B equals the probability that X is in the
inverse image of B.

We can now derive the distribution function of Y = g(X).

Distribution function of Y = g(X)

Suppose Y = g(X). The distribution function of Y is:

FY(y) = P(Y ≤ y) = P(Y ∈ (−∞, y]) = P(g(X) ∈ (−∞, y]) = P(X ∈ g⁻¹((−∞, y])).

Hence:

FY(y) = Σ_{x: g(x)≤y} pX(x) for discrete X, and FY(y) = ∫_{x: g(x)≤y} fX(x) dx for continuous X.


Probability mass and density functions of Y = g(X)

For the discrete case, the probability mass function of Y is simply:

pY(y) = P(Y = y) = P(g(X) = y) = P(X ∈ g⁻¹(y)) = Σ_{x: g(x)=y} pX(x).

For the continuous case, the probability density function of Y is simply:

fY(y) = (d/dy) FY(y).

Example 3.48 Let X be a random variable with continuous cdf FX. Find
expressions for the cdf of the following random variables.

(a) X²

(b) √X

(c) G⁻¹(X)

(d) G⁻¹(FX(X))

where G is continuous and strictly increasing.

Solution

(a) If y ≥ 0, then:

P(X² ≤ y) = P(X ≤ √y) − P(X < −√y) = FX(√y) − FX(−√y).

(b) We must assume X ≥ 0. If y ≥ 0, then:

P(√X ≤ y) = P(0 ≤ X ≤ y²) = FX(y²).

(c) We have:

P(G⁻¹(X) ≤ y) = P(X ≤ G(y)) = FX(G(y)).

(d) We have:

P(G⁻¹(FX(X)) ≤ y) = P(FX(X) ≤ G(y)) = P(X ≤ FX⁻¹(G(y))) = FX(FX⁻¹(G(y))) = G(y).

Example 3.49 Suppose X ∼ Bin(n, π) and hence X is the total number of
successes in n independent Bernoulli(π) trials. Hence Y = n − X is the total number
of failures. We seek the distribution of Y.
We have Y = g(X) such that g(x) = n − x. Hence:

FY(y) = Σ_{x: g(x)≤y} pX(x)
      = Σ_{x=n−y}^{n} C(n, x) π^x (1 − π)^{n−x}      (note the limits of x)
      = Σ_{i=0}^{y} C(n, n − i) π^{n−i} (1 − π)^i     (setting i = n − x)
      = Σ_{i=0}^{y} C(n, i) (1 − π)^i π^{n−i}.        (since C(n, n − i) = C(n, i))

The summand is the mass function of a Bin(n, 1 − π) distribution, hence by
symmetry Y ∼ Bin(n, 1 − π), as we would expect.

Example 3.50 Let X be a continuous random variable and suppose Y = X². We
seek the distribution function and density function of Y. We have:

FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = FX(√y) − FX(−√y).

Noting that the support must be for {y : y ≥ 0}, in full the distribution function of
Y is:

FY(y) = FX(√y) − FX(−√y) for y ≥ 0, and 0 otherwise.

Differentiating, we obtain the density function of Y, noting the application of the
chain rule:

fY(y) = (fX(√y) + fX(−√y))/(2√y) for y ≥ 0, and 0 otherwise.

Example 3.51 Let X be a continuous random variable with cdf FX(x). Determine
the distribution of Y = FX(X). What do you observe?

Solution
For 0 ≤ y ≤ 1 we have:

FY(y) = P(Y ≤ y) = P(FX(X) ≤ y) = P(X ≤ FX⁻¹(y)) = FX(FX⁻¹(y)) = y.

Hence the density function of Y is:

fY(y) = (d/dy) FY(y) = 1

for 0 ≤ y ≤ 1, and 0 otherwise. Therefore, Y ∼ Uniform[0, 1].

Example 3.52 We apply the density function result in Example 3.50 to the case
where X ∼ N(0, 1), i.e. the standard normal distribution. Since the support of X is
ℝ, the support of Y = X² is the positive real line. The density function of X is:

fX(x) = (1/√(2π)) e^{−x²/2} for −∞ < x < ∞.

Therefore, setting y = x², we have:

fY(y) = (1/(2√y)) ((1/√(2π)) e^{−y/2} + (1/√(2π)) e^{−y/2}) = (1/√π) (1/2)^{1/2} y^{−1/2} e^{−y/2}.

Noting that Γ(1/2) = √π, the density function is:

fY(y) = (1/Γ(1/2)) (1/2)^{1/2} y^{1/2 − 1} e^{−y/2} for y ≥ 0, and 0 otherwise.

Note that this is the density function of a Gamma(1/2, 1/2) distribution, hence if
X ∼ N(0, 1) and Y = X², then Y ∼ Gamma(1/2, 1/2).
In passing we also note, from ST104b Statistics 2, that the square of a standard
normal random variable has a chi-squared distribution with 1 degree of freedom, i.e.
χ²₁, hence it is also true that Y ∼ χ²₁ and so we can see that the chi-squared
distribution is a special case of the gamma distribution – there are many
relationships between the various families of distributions!
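A simulation is consistent with this (a sketch; the seed and the sample size are arbitrary assumptions): squared standard normal draws have sample mean close to 1 and sample variance close to 2, matching the Gamma(1/2, 1/2) values α/λ = 1 and α/λ² = 2 from Example 3.35:

```python
import numpy as np

# Squares of standard normal draws behave like Gamma(1/2, 1/2), i.e. chi-squared with 1 df.
rng = np.random.default_rng(seed=6)
y = rng.normal(size=1_000_000)**2

print(y.mean(), y.var())   # approximately 1 and 2
```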

Example 3.53 Suppose that X is a continuous random variable taking values
between −∞ and ∞ with distribution function FX(x). Sometimes we want to fold
the distribution of X about the value x = a, that is we want the distribution
function FY(y) of the random variable Y obtained from X by taking Y = X − a, for
X > a, and Y = a − X, for X < a (in other words Y = |X − a|). Find FY(y) by
working out directly P(Y ≤ y). What is the density function of Y?
A particularly important application is the case when X has a N(µ, σ²) distribution,
and a = µ, which has pdf:

fX(x) = (1/√(2πσ²)) exp(−(x − a)²/(2σ²)).

Apply the result to this case.

Solution
The description of Y says that Y = |X − a|, hence:

FY(y) = P(Y ≤ y) = P(a − y ≤ X ≤ a + y)
      = P(a − y < X ≤ a + y)
      = FX(a + y) − FX(a − y) for y ≥ 0, and 0 for y < 0.

The density function fY(y) is obtained by differentiating with respect to y, hence:

fY(y) = fX(a + y) + fX(a − y) for y ≥ 0, and 0 otherwise.

In the case where X ∼ N(µ, σ²) and a = µ, the density function of Y is:

fY(y) = √(2/(πσ²)) exp(−y²/(2σ²)) for y ≥ 0, and 0 otherwise.

This is sometimes called a half-normal distribution.

Activity 3.15 Let X be a continuous random variable with a support of ℝ.
Determine the density function of |X| in terms of fX.

Activity 3.16 Let X ∼ N(µ, σ²). Determine the density function of |X − µ|.

3.11.2 Monotone functions of random variables


We now apply the material from Section 3.11.1 to the case when the function g is
monotone.

Monotone function

A function g : ℝ → ℝ is monotone in both of the following cases.

i. g is monotone increasing if:

g(x₁) ≤ g(x₂) for all x₁ < x₂

ii. g is monotone decreasing if:

g(x₁) ≥ g(x₂) for all x₁ < x₂.

Strict monotonicity replaces the above inequalities with strict inequalities. For a
strictly monotone function, the inverse image of an interval is also an interval.


Distribution function of Y = g(X) when g is strictly monotone

If X is a random variable, g : ℝ → ℝ is strictly monotone and Y = g(X), then the
distribution function of Y is:

FY(y) = P(X ∈ g⁻¹((−∞, y])) = P(X ≤ g⁻¹(y)) for g increasing, and P(X ≥ g⁻¹(y)) for g decreasing
      = FX(g⁻¹(y)) for g increasing, and 1 − FX(g⁻¹(y)⁻) for g decreasing   (3.5)

where y is in the range of g and:

FX(g⁻¹(y)⁻) = FX(g⁻¹(y)) − P(X = g⁻¹(y)) for discrete X, and FX(g⁻¹(y)) for continuous X.

Density function of Y = g(X) when g is monotone

If X is a continuous random variable, g : ℝ → ℝ is monotone and Y = g(X), then
the density function of Y is:

fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| for y in the range of g, and 0 otherwise.

Proof: Let X be a random variable with density function fX(x) and let Y = g(X), i.e.
X = g⁻¹(Y).
If g⁻¹(·) is increasing, then:

FY(y) = P(Y ≤ y) = P(g⁻¹(Y) ≤ g⁻¹(y)) = P(X ≤ g⁻¹(y)) = FX(g⁻¹(y))

hence:

fY(y) = (d/dy) FY(y) = (d/dy) FX(g⁻¹(y)) = fX(g⁻¹(y)) (d/dy) g⁻¹(y).

If g⁻¹(·) is decreasing, then:

FY(y) = P(Y ≤ y) = P(g⁻¹(Y) ≥ g⁻¹(y)) = P(X ≥ g⁻¹(y)) = 1 − FX(g⁻¹(y))

hence:

fY(y) = −(d/dy) FX(g⁻¹(y)) = fX(g⁻¹(y)) (−(d/dy) g⁻¹(y)).

Recall that the derivative of a decreasing function is negative.
Combining both cases, if g⁻¹(·) is monotone, then:

fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)|.  ∎


Example 3.54 For any continuous random variable X we consider, its distribution
function, FX, is strictly increasing over its support, say S ⊆ ℝ. The inverse function,
FX⁻¹, known as the quantile function, is strictly increasing on [0, 1], such that
FX⁻¹ : [0, 1] → S.
Let U ∼ Uniform[0, 1], hence its distribution function is:

FU(u) = 0 for u < 0,   FU(u) = u for 0 ≤ u ≤ 1,   FU(u) = 1 for u > 1.

Let X = FX⁻¹(U), hence its distribution function is:

FX(x) = FU((FX⁻¹)⁻¹(x)) for x ∈ S

which can be used to simulate random samples from a required distribution by
simulating values of u from Uniform[0, 1], and then viewing FX⁻¹(u) as a random
drawing from FX.
Suppose X ∼ Exp(λ), hence its distribution function is:

FX(x) = 1 − e^{−λx} for x ≥ 0, and 0 for x < 0.

The quantile function (i.e. the inverse distribution function) is:

FX⁻¹(u) = (1/λ) log(1/(1 − u)) for 0 ≤ u ≤ 1.

Therefore, if U ∼ Uniform[0, 1], then:

(1/λ) log(1/(1 − U)) ∼ Exp(λ).
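This is exactly how the inverse transform (probability integral transform) method is used to simulate exponential variates in practice. A minimal sketch, assuming λ = 2 and a sample of 10⁶ draws purely for illustration:

```python
import numpy as np

# Inverse transform sampling: X = F^{-1}(U) with U ~ Uniform[0, 1] gives X ~ Exp(lambda).
rng = np.random.default_rng(seed=7)
lam = 2.0
u = rng.uniform(size=1_000_000)
x = np.log(1 / (1 - u)) / lam    # quantile function of Exp(lambda) applied to uniform draws

print(x.mean(), x.var())         # approximately 1/lambda = 0.5 and 1/lambda^2 = 0.25
```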

Example 3.55 Let X ∼ Gamma(α, λ), hence:

fX(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx}

for x ≥ 0, and 0 otherwise.
We seek Y = 1/X. We have that:

g⁻¹(y) = 1/y   and   (d/dy) g⁻¹(y) = −1/y² < 0.

Therefore:

fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| = (λ^α/Γ(α)) (1/y)^{α−1} e^{−λ/y} (1/y²)
      = (λ^α/Γ(α)) y^{−α−1} e^{−λ/y}

for y > 0, and 0 otherwise. This is the inverse gamma distribution.

Example 3.56 Suppose X has the Weibull distribution:

fX(x) = cτ x^{τ−1} e^{−cx^τ}

for x ≥ 0, where c, τ > 0 are constants. What is the density function of Y = cX^τ?

Solution
The transformation y = cx^τ is strictly increasing on [0, ∞). The inverse
transformation is x = (y/c)^{1/τ}. It follows that for y ≥ 0 we have:

FY(y) = P(Y ≤ y) = P((Y/c)^{1/τ} ≤ (y/c)^{1/τ}) = P(X ≤ (y/c)^{1/τ}) = FX((y/c)^{1/τ}).

Differentiating with respect to y, for y ≥ 0 we have:

fY(y) = fX((y/c)^{1/τ}) (1/τ)(y/c)^{1/τ − 1}(1/c)
      = cτ ((y/c)^{1/τ})^{τ−1} e^{−y} (1/τ)(y/c)^{1/τ − 1}(1/c)
      = e^{−y}.

This is an exponential distribution, specifically Exp(1).

Example 3.57 Standardisation and ‘reverse’ standardisation are examples of a
scale or location transformation. A classic example is:

X ∼ N(µ, σ²)  ⇒  Z = (X − µ)/σ ∼ N(0, 1)

or, in reverse:

Z ∼ N(0, 1)  ⇒  X = µ + σZ ∼ N(µ, σ²).

In fact, the distributional assumption of normality is not essential for standardisation
(or its reverse). Let Z be any random variable, and g be the linear function:

g(z) = µ + σz

for µ ∈ ℝ and σ > 0. If X = g(Z), i.e. X = µ + σZ, then X is a linear
transformation of Z, being related through a scale (or location) transformation. This
means properties of X can be derived from properties of Z.
The distribution function of X is:

FX(x) = P(X ≤ x) = P(µ + σZ ≤ x) = P(Z ≤ (x − µ)/σ) = FZ((x − µ)/σ).

If Z is a continuous random variable, then the density function of X is (applying the
chain rule):

fX(x) = (1/σ) fZ((x − µ)/σ).

The moment generating functions of X and Z are also related, since:

MX(t) = E(e^{tX}) = E(e^{t(µ+σZ)}) = E(e^{µt} e^{σtZ}) = e^{µt} E(e^{σtZ}) = e^{µt} MZ(σt).

The cumulant generating functions are also related, as seen by taking logarithms:

KX(t) = µt + KZ(σt).

Cumulants of X and Z are hence related, as seen by comparing coefficients:

κ_{X,1} = µ + σκ_{Z,1}   and   κ_{X,r} = σ^r κ_{Z,r} for r = 2, 3, . . . .

Therefore:

E(X) = µ + σE(Z)   and   Var(X) = σ² Var(Z).

If we impose the distributional assumption of normality, such that Z ∼ N(0, 1), and
continue to let X = µ + σZ, the density function of X is:

fX(x) = (1/σ) fZ((x − µ)/σ) = (1/(σ√(2π))) exp(−((x − µ)/σ)²/2) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))

and hence X ∼ N(µ, σ²).
Recall that the cumulant generating function of the standard normal distribution
(Example 3.47) is KZ(t) = t²/2, hence the cumulant generating function of X is:

KX(t) = µt + KZ(σt) = µt + σ²t²/2.

So if X ∼ N(µ, σ²), then:

KX′(t) = µ + σ²t   and   KX″(t) = σ²

hence κ₁ = KX′(0) = µ and κ₂ = KX″(0) = σ² (as always for the first two cumulants).
However, we see that:

κr = 0 for r > 2

and so by the one-to-one mapping (i.e. uniqueness) of a cumulant generating
function to a probability distribution, any distribution for which κr = 0 for r > 2 is
a normal distribution.

Example 3.58 Let X ∼ N(µ, σ²) and Y = e^X. Find the density function of Y.

Solution
Let Y = g(X) = exp(X). Hence X = g⁻¹(Y) = ln(Y), and:

(d/dy) g⁻¹(y) = 1/y.

Therefore:

fY(y) = fX(g⁻¹(y)) |(d/dy) g⁻¹(y)| = (1/√(2πσ²)) exp(−(ln y − µ)²/(2σ²)) (1/y).

Example 3.59 Suppose X has the density function (sometimes called a Type II
Beta distribution):

fX(x) = (1/B(α, β)) x^{β−1}/(1 + x)^{α+β}

for 0 < x < ∞, where α, β > 0 are constants. What is the density function of
Y = X/(1 + X)?

Solution
The transformation y = x/(1 + x) is strictly increasing on (0, ∞), and has the
unique inverse function x = y/(1 − y). Hence:

fY(y) = fX(x) |dx/dy|.

Since dx/dy = 1/(1 − y)² > 0, we have:

fY(y) = fX(x) dx/dy
      = fX(y/(1 − y)) (1/(1 − y)²)
      = (1/B(α, β)) ((y/(1 − y))^{β−1}/(1 + y/(1 − y))^{α+β}) (1/(1 − y)²)
      = (1/B(α, β)) y^{β−1} (1 − y)^{α+1} (1/(1 − y)²)
      = (1/B(α, β)) y^{β−1} (1 − y)^{α−1}.

This is a beta density, so Y follows a beta distribution.


Example 3.60 Let X ∼ Uniform[0, 1]. Suppose Y = a + (b − a)X.

(a) Determine the distribution of Y.

(b) Determine Var(Y).

Solution

(a) Rearranging, we have X = (Y − a)/(b − a), hence dx/dy = 1/(b − a) so:

fY(y) = fX((y − a)/(b − a)) (1/(b − a)) = 1 × 1/(b − a) = 1/(b − a).

In full:

fY(y) = 1/(b − a) for a ≤ y ≤ b, and 0 otherwise

that is, Y ∼ Uniform[a, b].

(b) We have that Var(X) = E(X²) − (E(X))², where:

E(X) = ∫_{−∞}^{∞} x fX(x) dx = ∫_0^1 x dx = [x²/2]_0^1 = 1/2

and:

E(X²) = ∫_{−∞}^{∞} x² fX(x) dx = ∫_0^1 x² dx = [x³/3]_0^1 = 1/3

giving:

Var(X) = E(X²) − (E(X))² = 1/3 − (1/2)² = 1/12.

Hence:

Var(Y) = Var(a + (b − a)X) = Var((b − a)X) = (b − a)² Var(X) = (b − a)²/12.

Activity 3.17 Let X be a positive, continuous random variable. Determine the
density function of 1/X in terms of fX.

Activity 3.18 Let X ∼ Exp(λ). Determine the density function of 1/X.

3.12 Convergence of sequences of random variables


In this section we consider aspects related to convergence of sequences of random
variables. However, we begin with the definition of convergence for a sequence of real
numbers (that is, constants) x1 , x2 , . . . which we denote by {xn }.


Convergence of a real sequence

If {xn} is a sequence of real numbers, then xn converges to a real number x if and
only if, for all ε > 0, there exists an integer N such that:

|xn − x| < ε for all n > N.

The convergence of a real sequence can be written as xn → x as n → ∞.

If we now consider a sequence of random variables rather than constants, i.e. {Xn}, in
matters of convergence it does not make sense to compare |Xn − X| (a random variable)
to a constant ε > 0. Below, we introduce four different types of convergence.

Convergence in distribution

A sequence of random variables {Xn} converges in distribution if:

P(Xn ≤ x) → P(X ≤ x) as n → ∞

equivalently:

FXn(x) → FX(x) as n → ∞

for all x at which the distribution function is continuous. This is denoted as:

Xn →d X.

Convergence in probability

A sequence of random variables {Xn} converges in probability if, for any ε > 0,
we have:

P(|Xn − X| < ε) → 1 as n → ∞.

This is denoted as:

Xn →p X.

Convergence almost surely

A sequence of random variables {Xn} converges almost surely to X if, for any
ε > 0, we have:

P(lim_{n→∞} |Xn − X| < ε) = 1.

This is denoted as:

Xn →a.s. X.

Convergence in mean square

A sequence of random variables {Xn} converges in mean square to X if:

E((Xn − X)²) → 0 as n → ∞.

This is denoted as:

Xn →m.s. X.

The above types of convergence differ in terms of their strength. If {Xn} converges
almost surely, then {Xn} converges in probability:

Xn →a.s. X  ⇒  Xn →p X.

If {Xn} converges in mean square, then {Xn} converges in probability:

Xn →m.s. X  ⇒  Xn →p X.

If {Xn} converges in probability, then {Xn} converges in distribution:

Xn →p X  ⇒  Xn →d X.

Combining these results, we can say that the set of all sequences which converge in
distribution contains the set of all sequences which converge in probability, which in
turn contains the set of all sequences which converge almost surely and in mean square.
We may write this as:

Xn →a.s. X or Xn →m.s. X  ⇒  Xn →p X  ⇒  Xn →d X.

Example 3.61 For a sequence of random variables {Xn}, we prove that
convergence in mean square implies convergence in probability.
Consider P(|Xn − X| > ε) for ε > 0. Applying the Chebyshev inequality, we have:

P(|Xn − X| > ε) ≤ E((Xn − X)²)/ε².

If Xn →m.s. X, then E((Xn − X)²) → 0. Therefore, {P(|Xn − X| > ε)} is a sequence of
positive real numbers bounded above by a sequence which converges to zero. Hence
we conclude P(|Xn − X| > ε) → 0 as n → ∞, and so Xn →p X.

Example 3.62 For a sequence of random variables {Xn}, we prove that if
Xn →d a, where a is a constant, then this implies Xn →p a.
Making use of the distribution function of the degenerate distribution (Section
3.7.1), we have:

P(|Xn − a| > ε) = P(Xn − a > ε) + P(Xn − a < −ε)
               = P(Xn > a + ε) + P(Xn < a − ε)
               ≤ (1 − FXn(a + ε)) + FXn(a − ε).

If Xn →d a, then FXn → Fa and hence FXn(a + ε) → 1 and FXn(a − ε) → 0.
Therefore, {P(|Xn − a| > ε)} is a sequence of positive real numbers bounded above
by a sequence which converges to zero. Hence we conclude P(|Xn − a| > ε) → 0 as
n → ∞ and hence Xn →p a.

Example 3.63 In the following two cases explain in which, if any, of the three
modes (in mean square, in probability, in distribution) Xn converges to 0.

(a) Let Xn = 1 with probability 2^{−n} and 0 otherwise.

(b) Let Xn = n with probability n^{−1} and 0 otherwise, and assume that the Xn s are
independent.

Solution

(a) We have P(Xn = 1) = 2^{−n} and P(Xn = 0) = 1 − 2^{−n}. Since:

E(|Xn − 0|²) = E(Xn²) = 0² × P(Xn = 0) + 1² × P(Xn = 1) = 2^{−n} → 0

as n → ∞, then Xn →m.s. 0. Hence also Xn →p 0 and Xn →d 0.

(b) We have P(Xn = n) = n^{−1} and P(Xn = 0) = 1 − n^{−1}. Since:

E(|Xn − 0|²) = E(Xn²) = 0² × P(Xn = 0) + n² × P(Xn = n) = n → ∞

as n → ∞, then Xn is not mean square convergent. For all ε > 0 we have (for n
large enough that n > ε):

P(|Xn − 0| > ε) = P(Xn > ε) = P(Xn = n) = n^{−1} → 0

as n → ∞. Therefore, Xn →p 0 and Xn →d 0.
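Part (b) can be visualised by simulation (a sketch; the seed, the replication count and the values of n are arbitrary assumptions): the probability that Xn is non-zero shrinks like 1/n even though E(Xn²) = n grows without bound.

```python
import numpy as np

# Simulate X_n = n with probability 1/n (else 0) and track P(|X_n| > eps) and E(X_n^2).
rng = np.random.default_rng(seed=8)
reps, eps = 100_000, 0.5

for n in (10, 100, 1000):
    x = np.where(rng.uniform(size=reps) < 1/n, n, 0.0)
    print(n, np.mean(np.abs(x) > eps), np.mean(x**2))  # approximately 1/n and n, respectively
```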

Activity 3.19 Consider a sequence of random variables {Xn} with cumulant
generating functions {KXn} and a random variable X with cumulant generating
function KX(t). Suppose, in addition, that all these cumulant generating functions
are well-defined for |t| < a. If KXn(t) → KX(t) as n → ∞ for all t such that |t| < a,
what can we conclude?

Activity 3.20 Consider a sequence of random variables {Xn} and a constant a.
Prove that Xn →p a implies Xn →d a.

3.13 A reminder of your learning outcomes


On completion of this chapter, you should be able to:

provide both formal and informal definitions of a random variable


formulate problems in terms of random variables

explain the characteristics of distribution functions

explain the distinction between discrete and continuous random variables

provide the probability mass function (pmf) and support for some common discrete
distributions

provide the probability density function (pdf) and support for some common
continuous distributions

explain whether a function defines a valid mass or density

calculate moments for discrete and continuous distributions

prove and manipulate inequalities involving the expectation operator

derive moment generating functions for discrete and continuous distributions

calculate moments from a moment generating function

calculate cumulants from a cumulant generating function

determine the distribution of a function of a random variable

summarise scale/location and probability integral transformations.

3.14 Sample examination questions


Solutions can be found in Appendix C.

1. Let X be a discrete random variable with mass function defined by:

pX(x) = k^x for x = 1, 2, . . ., and 0 otherwise

where k is a constant with 0 < k < 1.

(a) Show that k = 1/2.

(b) Show that the distribution function FX(x) of X has values at x = 1, 2, . . .
given by:

FX(x) = 1 − (1/2)^x for x = 1, 2, . . . .

(c) Show that the moment generating function of X is given by:

MX(t) = e^t/(2 − e^t) for t < log 2.

Hence find E(X).


2. (a) Let X be a positive random variable with E(X) < ∞. Prove the Markov
inequality:

P(X ≥ a) ≤ E(X)/a

for any constant a > 0.

(b) For a random variable X with E(X) = µ and Var(X) = σ², state Chebyshev's
inequality.

(c) By considering an Exp(1) random variable show, for 0 < a < 1, that:

a e^{−a} ≤ 1   and   a²(1 + e^{−(1+a)} − e^{−(1−a)}) ≤ 1.

You can use the mean and variance of an exponential random variable without
proof, as long as they are stated clearly.

3. Let X1 and X2 be independent continuous uniform random variables, with X1
defined over [−1, 1] and X2 defined over [−2, 1].

(a) Show, by considering P(W1 < w), that the density function of W1 = X1² is:

fW1(w) = 1/(2√w) for 0 ≤ w ≤ 1, and 0 otherwise.

(b) Show that the density function of W2 = X2² is:

fW2(w) = 1/(3√w) for 0 ≤ w ≤ 1,
fW2(w) = 1/(6√w) for 1 ≤ w < 4,
fW2(w) = 0 otherwise.

(Hint: Consider P(W2 < w) and P(1 ≤ W2 < w) for 0 < w < 1 and 1 ≤ w < 4,
respectively.)

(c) Show that the density function of Y = √W2 is:

fY(y) = 2/3 for 0 ≤ y < 1,
fY(y) = 1/3 for 1 ≤ y < 2,
fY(y) = 0 otherwise.


B.2 Chapter 3 – Random variables and univariate distributions
1. It is essential to distinguish between events and the probabilities of events.
(a) {X ≤ 2} is the event that the absolute value of the difference between the
values is at most 2.
(b) {X = 0} is the event that both dice show the same value.
(c) P (X ≤ 2) is the probability that the absolute value of the difference between
the values is at most 2.
(d) P (X = 0) is the probability that both dice show the same value.

2. We have the following given the definitions of the random variables X and Y .
(a) {X < 3}.
(b) P (X < 3).
(c) {Y = 1}.
(d) P (Y = 0).
(e) P (X = 6, Y = 0).
(f) P (Y < X).

3. Let Y denote the value of claims paid. The distribution function of Y is:
FY (x) = P (Y ≤ x) = P (X ≤ x | X > k) = P (X ≤ x ∩ X > k)/P (X > k) = P (k < X ≤ x)/P (X > k).

Hence, in full: FY (x) = 0 for x ≤ k, and FY (x) = (FX (x) − FX (k))/(1 − FX (k)) for x > k.

Let Z denote the value of claims not paid. The distribution function of Z is:
FZ (x) = P (Z ≤ x) = P (X ≤ x | X ≤ k) = P (X ≤ x ∩ X ≤ k)/P (X ≤ k) = P (X ≤ x)/P (X ≤ k).

Hence, in full: FZ (x) = FX (x)/FX (k) for x ≤ k, and FZ (x) = 1 for x > k.

4. This problem makes use of the following results from mathematics, concerning
sums of geometric series. If r ≠ 1, then:

Σ_{x=0}^{n−1} a r^x = a(1 − r^n)/(1 − r)


and if |r| < 1, then:

Σ_{x=0}^{∞} a r^x = a/(1 − r).

(a) We first note that pX is a positive real-valued function with respect to its
support. Noting that 1 − π < 1, we have:

Σ_{x=1}^{∞} (1 − π)^{x−1} π = π/(1 − (1 − π)) = 1.

Hence the two necessary conditions for a valid mass function are satisfied.

(b) The distribution function for the (first version of the) geometric distribution is:
FX (x) = Σ_{t: t≤x} pX (t) = Σ_{t=1}^{x} (1 − π)^{t−1} π = π(1 − (1 − π)^x)/(1 − (1 − π)) = 1 − (1 − π)^x.

In full: (
0 for x < 1
FX (x) =
1 − (1 − π)bxc for x ≥ 1.

5. We first note that pX is a positive real-valued function with respect to its support.
We then have:
Σ_{x=r}^{∞} pX (x) = Σ_{x=r}^{∞} C(x − 1, r − 1) π^r (1 − π)^{x−r} = π^r Σ_{y=0}^{∞} C(y + r − 1, r − 1) (1 − π)^y = π^r (1 − (1 − π))^{−r} = 1

where y = x − r. Hence the two necessary conditions for a valid mass function are
satisfied.

6. We have:
E(X) = ∫_{−∞}^{∞} x fX (x) dx = ∫_0^2 (x² − x⁴/4) dx = [x³/3 − x⁵/20]_0^2 = 16/15.

Also:

E(X²) = ∫_{−∞}^{∞} x² fX (x) dx = ∫_0^2 (x³ − x⁵/4) dx = [x⁴/4 − x⁶/24]_0^2 = 4/3.

Hence:

Var(X) = E(X²) − (E(X))² = 4/3 − (16/15)² = 44/225.


7. We need to evaluate:
E(X) = ∫_{−∞}^{∞} x fX (x) dx = ∫_0^∞ x λ e^{−λx} dx.

We note that:

x e^{−λx} = x/e^{λx} = x/(1 + λx + λ²x²/2 + · · ·) = 1/(1/x + λ + λ²x/2 + · · ·)

such that the numerator is fixed (the constant 1), and the denominator tends to
infinity as x → ∞. Hence:

x e^{−λx} → 0 as x → ∞.

Applying integration by parts, we have:

E(X) = ∫_0^∞ x λ e^{−λx} dx = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 + [−e^{−λx}/λ]_0^∞ = 1/λ.

8. (a) We have:
Var(X) = E((X − E(X))2 )
= E(X 2 − 2X E(X) + (E(X))2 )
= E(X 2 ) − 2(E(X))2 + (E(X))2
= E(X 2 ) − (E(X))2 .

(b) We have:
E(X(X − 1)) − E(X) E(X − 1) = E(X 2 ) − E(X) − (E(X))2 + E(X)
= E(X 2 ) − (E(X))2
= Var(X).

9. We note that:
x² λ e^{−λx} = λ (d²/dλ²) e^{−λx}

and so:

E(X²) = ∫_{−∞}^{∞} x² fX (x) dx = ∫_0^∞ x² λ e^{−λx} dx = λ (d²/dλ²) λ^{−1} = 2/λ².

Therefore:

Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².

10. We consider the proof for a continuous random variable X. The indicator function
takes the value 1 for values ≤ x and 0 otherwise. Hence:
E(I_{(−∞,x]}(X)) = ∫_{−∞}^{∞} I_{(−∞,x]}(t) fX (t) dt = ∫_{−∞}^{x} fX (t) dt = FX (x).


11. (a) Let the random variable X denote the return from the game. Hence X is a
discrete random variable, which can take three possible values: −£5 (if we do
not throw a head first and last), £15 (if we throw HT H) and £25 (if we throw
HHH). The probabilities associated with these values are 3/4, 1/8 and 1/8,
respectively. Therefore, the expected return from playing the game is:

E(X) = Σ_x x pX (x) = −5 × 3/4 + 15 × 1/8 + 25 × 1/8 = £1.25.

(b) The moment generating function is:

MX (t) = E(e^{tX}) = Σ_x e^{tx} pX (x) = e^{−5t} × 3/4 + e^{15t} × 1/8 + e^{25t} × 1/8 = (e^{−5t}/8)(6 + e^{20t} + e^{30t}).
4 8 8 8

12. If X ∼ Exp(λ), we have that:


λ 1
MX (t) = = for t < λ.
λ−t 1 − t/λ
Writing as a polynomial in t, we have:
∞  i
1 X t
MX (t) = =
1 − t/λ i=0
λ

since t/λ < 1. Since the coefficient of tr is E(X r )/r! in the polynomial expansion of
MX (t), for the exponential distribution we have:
E(X r ) 1 r!
= r ⇒ E(X r ) = .
r! λ λr

13. (a) If X ∼ Bernoulli(π), then the moment generating function is:


X 1
X
tX tx
MX (t) = E(e ) = e pX (x) = etx π x (1 − π)1−x = (1 − π) + πet
x x=0

and hence the cumulant generating function is:

KX (t) = log MX (t) = log((1 − π) + πet ).

(b) If X ∼ Bin(n, π), then the moment generating function is:


X
MX (t) = E(etX ) = etx pX (x)
x
n  
X n x
tx
= e π (1 − π)n−x
x=0
x
n  
X n
= (πet )x (1 − π)n−x
x=0
x

= ((1 − π) + πet )n


using the binomial expansion. Hence the cumulant generating function is:

KX (t) = log MX (t) = log(((1 − π) + πet )n ) = n log((1 − π) + πet ).


Note that if X ∼ Bernoulli(π) and Y ∼ Bin(n, π), then:

MY (t) = (MX (t))n and KY (t) = nKX (t).

This is not a coincidence, as a Bin(n, π) random variable is equal to the sum of


n independent Bernoulli(π) random variables (since π is constant, the sum of n
independent and identically distributed Bernoulli random variables).
(c) If X ∼ Geo(π), then the moment generating function is:
X
MX (t) = E(etX ) = etx pX (x)
x

X
= etx (1 − π)x−1 π
x=1

X
= πet ((1 − π)et )x−1
x=1

πet
=
1 − (1 − π)et

provided |(1 − π)et | < 1. Hence the cumulant generating function is:

πet
 
KX (t) = log MX (t) = log .
1 − (1 − π)et

(d) If X ∼ Neg. Bin(r, π), then the moment generating function is:
X
MX (t) = E(etX ) = etx pX (x)
x
∞  
X x−1 x
tx
= e π (1 − π)x−r
x=r
r−1

t r
X (x − 1)!
= (πe ) ((1 − π)et )x−r
x=r
(r − 1)! (x − r)!

X (y + r − 1)!
= (πet )r ((1 − π)et )y (setting y = x − r)
y=0
(r − 1)! y!
r
πet

=
1 − (1 − π)et

using the negative binomial expansion. Hence the cumulant generating


function is: r 
πet

KX (t) = log MX (t) = log .
1 − (1 − π)et


Note the relationship between the moment generating functions of Geo(π) and
Neg. Bin(r, π). A Neg. Bin(r, π) random variable is equal to the sum of r
independent Geometric(π) random variables (since π is constant, the sum of r
independent and identically distributed geometric random variables).

14. If X ∼ Pois(λ), then we know:


MX (t) = exp(λ(et − 1)).
Hence the cumulant generating function is:
t2 t3
 
t t

KX (t) = log MX (t) = log exp(λ(e − 1)) = λ(e − 1) = λ t + + + · · · .
2! 3!
By comparing coefficients, the third cumulant is κ3 = λ. We also have that
µ02 = Var(X) = λ and µ03 = κ3 = λ. Therefore, the coefficient of skewness is:
µ03 1
γ1 = 0 3/2
=√ .
(µ2 ) λ

15. Let Y = |X|. We have:


FY (y) = P (Y ≤ y) = P (|X| ≤ y) = P (−y ≤ X ≤ y) = FX (y) − FX (−y).
In full, the distribution function of Y is:
(
FX (y) − FX (−y) for y ≥ 0
FY (y) =
0 otherwise.

The density function is obtained by differentiating, hence:


(
fX (y) + fX (−y) for y ≥ 0
fY (y) =
0 otherwise.

16. Let Y = |X − µ|, where X ∼ N (µ, σ 2 ), hence X − µ ∼ N (0, σ 2 ). Therefore:


r
1 −y 2 /(2σ 2 ) 1 −y 2 /(2σ 2 ) 2 −y2 /(2σ2 )
fX (y) + fX (−y) = √ e +√ e = e .
2πσ 2 2πσ 2 πσ 2
In full, the density function of Y is:
r
2 −y2 /(2σ2 )
e for y ≥ 0

fY (y) = πσ 2
0 otherwise.

17. Let Y = 1/X. As X is a positive random variable, the function g(x) = 1/x is
well-behaved and monotonic. Therefore:
(
fX (1/y)/y 2 for y > 0
fY (y) =
0 otherwise.


18. If X ∼ Exp(λ), then its density function is:


(
λe−λx for x ≥ 0
fX (x) =
0 otherwise.
Hence:  
1 1
fY (y) = fX − 2 = λeλ/y y −2 .
y y
In full, the density function of Y is:
(
λy −2 e−λ/y for y > 0
fY (y) =
0 otherwise.

19. A distribution is uniquely characterised by its cumulant generating function.


d
Therefore, if KXn (t) → KX (t) as n → ∞, then Xn −→ x.

20. This is a special case of the result that convergence in probability implies
convergence in distribution. Note that convergence in distribution requires
convergence to the distribution function, except at discontinuities. We note that:

P (|Xn − a| < ) = (1 − FXn (a + )) + FXn (a − ) + P (Xn = a − ).

Since Xn converges in probability to a, for any  > 0, the left-hand side converges
to zero. Each element on the right-hand side is positive, so we must have:

FXn (a + ) → 1 and FXn (a − ) → 0.

Therefore, at each point where the distribution function is continuous, FXn → FX ,


where FX is the distribution function of a degenerate random variable with all
d
mass at a. Hence Xn −→ a.

B.3 Chapter 4 – Multivariate distributions


1. If FX1 ,X2 ,...,Xn is the joint distribution function of X1 , X2 , . . . , Xn , then for any
i = 1, 2, . . . , n we have:
• lim FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = 0
xi →−∞

• lim FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ) = 1


x1 →∞,x2 →∞,...,xn →∞

• lim FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xi−1 , xi + h, xi+1 , . . . , xn ) =


h↓0
FX1 ,X2 ,...,Xn (x1 , x2 , . . . , xn ).

2. (a) This cannot work, because:

G(∞, ∞) = FX (∞) + FY (∞) = 1 + 1 = 2

which is inconsistent with G(∞, ∞) being a probability.


3. We have:

πn = P (nth toss is heads and there are two heads in the first n − 1 tosses)
(n − 1)!
= 0.5 × × (0.5)2 × (1 − 0.5)n−3
2! (n − 3)!
(n − 1)(n − 2)
= (0.5)n ×
2
= (n − 1)(n − 2)2−n−1 .

Let N be the toss number when the third head occurs in the repeated tossing of a
fair coin. Therefore:

X ∞
X ∞
X
1 = P (N < ∞) = P (N = n) = πn = (n − 1)(n − 2)2−n−1
n=3 n=3 n=3

and so: ∞
X
(n − 1)(n − 2)2−n = 2.
n=3

C.2 Chapter 3 – Random variables and univariate distributions
1. (a) We must have:

X k
1= kx = .
x=1
1−k
Solving, k = 1/2.
(b) For x = 1, 2, . . ., we have:
x  i  x
X 1 (1/2)(1 − (1/2)x ) 1
FX (x) = P (X ≤ x) = = =1− .
i=1
2 1 − 1/2 2

(c) We have:
∞ ∞
tX
X
x tx ket
X et
t x
MX (t) = E(e ) = k e = (ke ) = = .
x=1 x=1
1 − ket 2 − et

For the above to be valid, the sum to infinity has to be valid. That is, ket < 1,
meaning t < log 2. We then have:

2et
MX0 (t) =
(2 − et )2

so that E(X) = MX0 (0) = 2.


2. (a) Let I(A) be the indicator function equal to 1 under A, and 0 otherwise. For
any a > 0, we have:
 
I(X ≥ a)X E(X)
P (X ≥ a) = E(I(X ≥ a)) ≤ E ≤ .
a a

(b) Substituting Y = (X − E(X))² in (a) and replacing a by a², we have:

P (|X − E(X)| ≥ a) ≤ Var(X)/a².

(c) Let X ∼ Exp(1), with mean and variance both equal to 1. Hence, for a > 0, we
have: Z ∞
P (X > a) = e−x dx = e−a .
a
So, by the Markov inequality, e ≤ E(X)/a = 1/a, implying ae−a ≤ 1. At the
−a

same time, for 0 < a < 1, we have:

P (|X − 1| > a) = P (X > 1 + a) + P (X < 1 − a)


Z ∞ Z 1−a
−x
= e dx + e−x dx
1+a 0

= e−(1+a) + 1 − e−(1−a) .

Hence, using (b), we have:


Var(X) 1
e−(1+a) + 1 − e−(1−a) ≤ 2
= 2
a a
implying the other inequality.

3. (a) For 0 < w < 1, we have:



√ √
Z
√ w
1
P (W1 ≤ w) = P (− w < X1 < w) = dx = w.

− w 2

Hence the cumulative distribution function of W1 is FW1 (w) = w, for
0 < w < 1, and so:
1

 √ for 0 < w < 1
0
fW1 (w) = FW1 (w) = 2 w
0 otherwise.

(b) The range of W2 is [0, 4]. For 0 < w < 1, we have:



w √
√ √
Z
1 2 w
FW2 (w) = P (W2 < w) = P (− w < X2 < w) = √
dx = .
− w 3 3

For 1 ≤ w < 4, X2 is in the range [−2, −1]. Hence for 1 ≤ w < 4 we have:
Z −1 √
√ 1 −1 + w
FW2 (w)−FW2 (1) = P (1 ≤ W2 < w) = P (− w < X2 < −1) = √ dx = .
− w 3 3


Hence differentiating with respect to w, we get:


 √
1/(3√w)
 for 0 < w < 1
fW2 (w) = 1/(6 w) for 1 ≤ w < 4

0 otherwise.

(c) For 0 < y < 2, we have:



2y/3y
 for 0 < y 2 < 1
d 2
fY (y) = fW2 (y 2 ) y = 2y/6y for 1 ≤ y 2 < 4
dy 
0 otherwise.

Hence: 
2/3 for 0 < y < 1

fY (y) = 1/3 for 1 ≤ y < 2

0 otherwise.

C.3 Chapter 4 – Multivariate distributions


1. (a) We must have:
Z 2Z 2 2 2 2
x2
Z  Z
1= a(x+y) dx dy = a + xy dy = a (2+2y) dy = a[2y+y 2 ]20 = 8a
0 0 0 2 0 0

so that a = 1/8.
(b) For 0 < x < 2, we have:

1 2 x 0 1 2 x2 2x + x2
Z Z Z  
0
FX (x) = (x + y) dx dy = + xy dy = .
8 0 0 8 0 2 8
The mean is:
2 2 2 2
x3 x2 y
Z Z Z 
1 2 1
E(X) = (x + xy) dx dy = + dy
8 0 0 8 0 3 2 0
1 2 8
Z  
= + 2y dy
8 0 3
 2
1 8y 2
= +y
8 3 0

7
= .
6

2. (a) Integrating out x first, we have:


Z 2Z y 2
y4 25
Z
2
1= axy dx dy = a dy = × a.
0 0 0 2 10
Hence a = 5/16.
