ST2133 Chapter 3

3. Random variables and univariate distributions

By the end of this chapter you should be able to:

provide the probability mass function (pmf) and support for some common discrete distributions

provide the probability density function (pdf) and support for some common continuous distributions
3.3 Introduction
This chapter extends the discussion of (univariate) random variables and common
distributions of random variables introduced in ST104b Statistics 2, so it is advisable
to review that material first before proceeding. Here, we continue with the study of
random variables (which, recall, associate real numbers – the sample space – with
experimental outcomes), probability distributions (which, recall, associate probability
with the sample space), and expectations (which, recall, provide the central tendency,
or location, of a distribution). Fully appreciating random variables, and the associated
statistical notation, is a core part of understanding distribution theory.
Example 3.1 Continuing Example 2.26 (flipping a coin three times), suppose we
define the random variable X to denote the number of heads. We assume the
outcomes of the flips are independent (which is a reasonable assumption), with a
constant probability of success (which is also reasonable, as it is the same coin).
Hence X can take the values 0, 1, 2 and 3, and so is a discrete random variable such
that Ω = {0, 1, 2, 3}. For completeness, when writing out the probability (mass)
function we should specify the probability of X for all real values, not just those in
Ω. This is easily achieved with ‘0 otherwise’. Hence:
P(X = x) = p_X(x) =
  (1 − π)^3      for x = 0
  3π(1 − π)^2    for x = 1
  3π^2(1 − π)    for x = 2
  π^3            for x = 3
  0              otherwise.
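As a rough illustration (not part of the subject guide), the following Python sketch enumerates the eight outcomes of three independent flips and checks that the resulting mass function matches the closed form above; the value of π used is an arbitrary illustrative choice.

```python
# Minimal sketch: enumerate all outcomes of three flips and tally P(X = x).
from itertools import product
from math import comb

pi_ = 0.3  # assumed illustrative value of pi

pmf = {x: 0.0 for x in range(4)}
for outcome in product("HT", repeat=3):
    p = 1.0
    for flip in outcome:
        p *= pi_ if flip == "H" else (1 - pi_)
    pmf[outcome.count("H")] += p   # X = number of heads

# compare with C(3, x) * pi^x * (1 - pi)^(3 - x)
for x in range(4):
    assert abs(pmf[x] - comb(3, x) * pi_**x * (1 - pi_)**(3 - x)) < 1e-12
print(pmf)
```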
Formally, we can define a random variable drawing on our study of probability space in
Chapter 2. Recall that we define probability as a measure which maps events to the
unit interval, i.e. [0, 1]. Hence in the expression ‘P(A)’, the argument ‘A’ must represent
an event.

Random variable

A random variable, X, is a function mapping experimental outcomes to real numbers, i.e. X : Ω → R, such that if:
A_x = {ω ∈ Ω : X(ω) ≤ x}
then:
A_x ∈ F for all x ∈ R.
Therefore, A_x is an event for every real-valued x.
The above definition tells us that random variables map experimental outcomes to real
numbers, i.e. X : Ω → R. Hence the random variable X is a function, such that if ω is
an experimental outcome, then X(ω) is a real number. The remainder of the definition
allows us to discuss quantities such as P(X ≤ x). Technically, we should write in full
P({ω ∈ Ω : X(ω) ≤ x}), but will use P(X ≤ x) for brevity.
Our interest usually extends beyond a random variable X, such that we may wish to
consider functions of a random variable.
Example 3.2 Continuing Example 3.1, let Y denote the number of tails. Hence
Y(ω) = 3 − X(ω), for any outcome ω. More concisely, this can be written as
Y = 3 − X, i.e. as a linear transformation of X.
¹ All functions considered in this course are well-behaved, and so we will omit a technical discussion.
Example 3.3 We extend Examples 3.1 and 3.2. When flipping a coin three times,
the sample space is:
Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Let F be the collection of all possible subsets of Ω, i.e. F = {0, 1}^Ω, such that any
set of outcomes is an event. As the random variable X denotes the number of heads,
its full specification is:
X(TTT) = 0, X(HTT) = X(THT) = X(TTH) = 1, X(HHT) = X(HTH) = X(THH) = 2
and X(HHH) = 3.
For example:
P(Y = 2) = P(3 − X = 2) = P(X = 1) = 3π(1 − π)^2
and:
P(Y ≤ 2) = P(X > 0) = 1 − P(X ≤ 0) = 1 − (1 − π)^3.
Since X and Y are simple counts, i.e. the number of heads and tails, respectively,
these are positive random variables.
Activity 3.1 Two fair dice are rolled. Let X denote the absolute value of the
difference between the values shown on the top face of the dice. Express each of the
following in words.
(a) {X ≤ 2}.
(b) {X = 0}.
(c) P(X ≤ 2).
(d) P(X = 0).
Activity 3.2 A die is rolled and a coin is tossed. Defining the random variable X to
be the value shown on the die, and the random variable Y to represent the coin
outcome such that:
Y =
  1   if heads
  0   if tails
write down concise mathematical expressions for each of the following.
(b) The probability that the value on the die is less than 3.
(d) The probability that the number of heads shown is less than 1.
(f) The probability that the number of heads is less than the value on the die.
Distribution function

The distribution function of a random variable X is defined as:
F_X(x) = P(X ≤ x).
i. The terms ‘distribution function’, ‘cumulative distribution function’ and ‘cdf’ can
be used synonymously.
ii. FX denotes the cdf of the random variable X, similarly FY denotes the cdf of the
random variable Y etc. If an application has only one random variable, such as X,
where it is unambiguous what the random variable is, we may simply write F .
Right continuity

A function g is right-continuous if g(x+) = g(x) for all x ∈ R, where g(x+) = lim_{h↓0} g(x + h).
(Note that g(x+) refers to the limit of the values given by g as we approach point x
from the right.²)
i. P(X > x) = 1 − F_X(x).
iv. P(X = x) = F_X(x) − F_X(x−).
² Similarly, we may define left continuity as g(x−) = g(x) for all x ∈ R, where g(x−) = lim_{h↓0} g(x − h).
Solution

(a) We have:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt = ∫_{−∞}^{x} 1/(π(1 + t²)) dt = [arctan(t)/π]_{−∞}^{x}
= (1/π) arctan(x) − (1/π)(−π/2)
= (1/π) arctan(x) + 1/2.

(b) We have:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt = ∫_{−∞}^{x} e^{−t}/(1 + e^{−t})² dt = [1/(1 + e^{−t})]_{−∞}^{x} = 1/(1 + e^{−x}).

(c) We have:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt = ∫_{0}^{x} (a − 1)/(1 + t)^{a} dt = [−1/(1 + t)^{a−1}]_{0}^{x} = 1 − 1/(1 + x)^{a−1}.
Suppose the random variable X has the distribution function:
F_X(x) = x(x + 1)/42
over the support {1, 2, . . . , 6}. Determine the mass function of X, i.e. p_X.

Solution
We have:
p_X(x) = F_X(x) − F_X(x − 1) = x(x + 1)/42 − (x − 1)x/42 = x/21.
In full:
p_X(x) =
  x/21   for x = 1, 2, . . . , 6
  0      otherwise.
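A quick Python sketch (my own illustration, not from the guide) of recovering the mass function from this distribution function by differencing, and checking that the probabilities sum to one:

```python
# Difference the cdf F_X(x) = x(x+1)/42 on {1, ..., 6} to obtain the pmf.
def F(x):
    if x < 1:
        return 0.0
    if x > 6:
        return 1.0
    return x * (x + 1) / 42

pmf = {x: F(x) - F(x - 1) for x in range(1, 7)}
print(pmf)                    # each value equals x/21
print(sum(pmf.values()))      # sums to 1
```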
Activity 3.3 Let X be a random variable which models the value of claims received
at an insurance company. Suppose that only claims greater than k are paid. Write
an expression for the distribution functions of claims paid and claims not paid in
terms of the distribution function of X.
While the distribution function returns P(X ≤ x), there are occasions when we are
interested in P(X > x), i.e. the probability that X is larger than x. In models of
lifetimes, the event {X > x} represents survival beyond time x. This gives rise to
the survival function.
Survival function

The survival function of X gives the probability of survival beyond x, i.e.:
F̄_X(x) = P(X > x) = 1 − F_X(x).

Derive the hazard function for each of the following distributions.

(a) Pareto:
f_X(x) = (a − 1)/(1 + x)^{a}
for 0 < x < ∞ and a > 1.

(b) Weibull:
f_X(x) = cτx^{τ−1} e^{−cx^{τ}}
for x ≥ 0, c > 0 and τ > 0.
Solution

First notice that the hazard function, often denoted by λ(x), is (applying the chain
rule):
λ(x) = −(d/dx) ln F̄_X(x) = f_X(x)/F̄_X(x).

(a) For the Pareto distribution, using Example 3.4:
F̄_X(x) = 1 − F_X(x) = 1/(1 + x)^{a−1}
hence:
λ(x) = −(d/dx)(−(a − 1) ln(1 + x)) = (a − 1)/(1 + x).

(b) For the Weibull distribution, using Example 3.4:
F̄_X(x) = 1 − F_X(x) = e^{−cx^{τ}}
hence:
λ(x) = −(d/dx)(−cx^{τ}) = cτx^{τ−1}.
Solution

We are asked to show that λ(x) ≤ λ(x + y) when y ≥ 0. We are told that
F̄_X(x + y)/F̄_X(x) does not increase as x increases, and so ln F̄_X(x + y) − ln F̄_X(x)
has a non-positive derivative with respect to x. Differentiating with respect to x, we
have:
−λ(x + y) + λ(x) ≤ 0
which is the required result.
discrete random variables – typically assumed for variables which we can count
1. Discrete model: Any real-world variable which can be counted, taking natural
numbers 0, 1, 2, . . . , is a candidate for being modelled with a discrete random
variable. Examples include the number of children in a household, the number
of passengers on a flight etc.
In ST104b Statistics 2, the possible values which a random variable could take was
referred to as the sample space, which for many distributions is a subset of R, the real
line. In this course, we will use the term support.
Support of a function
The support of a positive real-valued function, f, is the subset of the real line where
f takes values strictly greater than zero:
supp(f) = {x ∈ R : f(x) > 0}.
Discrete random variables have supports which are in some countable subset
{x₁, x₂, . . .} of R. This means that the probability that a discrete random variable X
can take a value outside {x₁, x₂, . . .} is zero, i.e. P(X = x) = 0 for x ∉ {x₁, x₂, . . .}.
Probability distributions of discrete random variables can, of course, be represented by
distribution functions, but we may also consider probability at a point using the
probability mass function (pmf). We have the following definition, along with the
properties of a pmf.
Since the support of a mass function is the countable set {x1 , x2 , . . .}, in summations
such as in property iii. above, where the limits of summation are not explicitly
provided, the sum will be assumed to be over the support of the mass function.
The probability distribution of a discrete random variable may be represented either by
its distribution function or its mass function. Unsurprisingly, these two types of function
are related.
If X is a discrete random variable such that pX is its mass function and FX is its
distribution function, then:
i. p_X(x) = F_X(x) − F_X(x−)

ii. F_X(x) = Σ_{x_i ≤ x} p_X(x_i).

From i. above we can deduce that F_X(x) = F_X(x−) + p_X(x). Since p_X(x) = 0 for
x ∉ {x₁, x₂, . . .}, we have that F_X(x) = F_X(x−) for x ∉ {x₁, x₂, . . .}. This means that
the distribution function is a step function, i.e. flat except for discontinuities at the
points {x1 , x2 , . . .} which represent the non-zero probabilities at the given points in the
support.
We now consider some common discrete probability distributions, several of which were
introduced in ST104b Statistics 2.
where bxc is the ‘floor’ of x, i.e. the largest integer which is less than or equal to x.
where:
C(n, x) = n!/(x!(n − x)!)
is the binomial coefficient. Sometimes the notation p and q are used to denote π and
1 − π, respectively. This has the benefit of brevity, since q is more concise than 1 − π.
Proof of the validity of this mass function was shown in ST104b Statistics 2.
The corresponding distribution function is a step function, given by:
F_X(x) =
  0                                                for x < 0
  Σ_{i=0}^{⌊x⌋} C(n, i) π^{i}(1 − π)^{n−i}          for 0 ≤ x < n
  1                                                for x ≥ n.
Note the special case when n = 1 corresponds to the Bernoulli(⇡) distribution, and that
the sum of n independent and identically distributed Bernoulli(⇡) random variables has
a Bin(n, ⇡) distribution.
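The following simulation sketch (an illustration of my own, with arbitrary parameter values) sums n i.i.d. Bernoulli(π) variables and compares the empirical frequencies with the Bin(n, π) mass function:

```python
# Sum of n independent Bernoulli(pi) variables compared with Bin(n, pi).
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, pi_, reps = 10, 0.25, 200_000                 # assumed illustrative values

x = (rng.random((reps, n)) < pi_).sum(axis=1)    # each row: sum of n Bernoullis
for k in range(n + 1):
    empirical = np.mean(x == k)
    exact = comb(n, k) * pi_**k * (1 - pi_)**(n - k)
    print(k, round(empirical, 4), round(exact, 4))
```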
First version
In the first version of the geometric distribution, X represents the trial number of the
first success. As such the support is {1, 2, . . .}, with the mass function:
p_X(x) =
  (1 − π)^{x−1} π   for x = 1, 2, . . .
  0                 otherwise
Second version

In the second version of the geometric distribution, X represents the number of failures
before the first success. As such the support is {0, 1, 2, . . .}, with the mass function:
p_X(x) =
  (1 − π)^{x} π   for x = 0, 1, 2, . . .
  0               otherwise
with its distribution function given by:
F_X(x) =
  0                       for x < 0
  1 − (1 − π)^{⌊x⌋+1}     for x ≥ 0.
Either version could be denoted as X ⇠ Geo(⇡), although be sure to be clear which
version is being used in an application.
First version
In the first version this distribution is used to represent the trial number of the rth
success in independent Bernoulli(⇡) trials, where r = 1, 2, . . .. When r = 1, this is a
special case which is the (first) version of the geometric distribution above. If X is a
negative binomial random variable, denoted X ∼ Neg. Bin(r, π), where π is the
constant probability of success, its mass function is:
p_X(x) =
  C(x − 1, r − 1) π^{r}(1 − π)^{x−r}   for x = r, r + 1, r + 2, . . .
  0                                    otherwise.
Note why the mass function has this form. In order to have r successes for the first time
on the xth trial, we must have r − 1 successes in the first x − 1 trials. The number of
ways in which these r − 1 successes may occur is C(x − 1, r − 1) and the probability
associated with each of these sequences is π^{r−1}(1 − π)^{x−r}. Since the xth, i.e. final,
trial must be a success, we then multiply C(x − 1, r − 1) π^{r−1}(1 − π)^{x−r} by π. From
the mass function it can be seen that the negative binomial distribution generalises the
(first version of the) geometric distribution, such that if X ∼ Geo(π) then
X ∼ Neg. Bin(1, π).
Its distribution function is given by:
F_X(x) =
  0                       for x < r
  Σ_{i=r}^{⌊x⌋} p_X(i)    for x ≥ r.
Second version
The second version of the negative binomial distribution is formulated as the number of
failures before the rth success occurs. In this formulation the mass function is:
p_X(x) =
  C(x + r − 1, r − 1) π^{r}(1 − π)^{x}   for x = 0, 1, 2, . . .
  0                                      otherwise
while its distribution function is:
F_X(x) =
  0                       for x < 0
  Σ_{i=0}^{⌊x⌋} p_X(i)    for x ≥ 0.
For a positive integer n, the gamma function satisfies Γ(n) = (n − 1)!.
Example 3.9 Suppose a fair die is rolled 10 times. Let X be the random variable
to represent the number of 6s which appear. Derive the distribution of X, i.e. FX .
Solution
For a fair die, the probability of a 6 is 1/6, and hence the probability of a non-6 is
5/6. By independence of the outcome of each roll, we have:
F_X(x) = P(X ≤ x) = Σ_{i=0}^{x} P(X = i) = Σ_{i=0}^{x} C(10, i) (1/6)^{i}(5/6)^{10−i}.
In full:
F_X(x) =
  0                                              for x < 0
  Σ_{i=0}^{⌊x⌋} C(10, i) (1/6)^{i}(5/6)^{10−i}    for 0 ≤ x ≤ 10
  1                                              for x > 10.
Example 3.10 Show that the ratio of any two successive hypergeometric
probabilities, i.e. P(X = x + 1) and P(X = x), equals:
((n − x)(K − x))/((x + 1)(N − K − n + x + 1))
for any valid x and x + 1.

Solution

If X ∼ Hyper(n, N, K), then from its mass function we have:
p_X(x + 1)/p_X(x) = [C(K, x + 1) C(N − K, n − x − 1)/C(N, n)] / [C(K, x) C(N − K, n − x)/C(N, n)]
= [C(K, x + 1) C(N − K, n − x − 1)] / [C(K, x) C(N − K, n − x)]
= [K!/((x + 1)!(K − x − 1)!)] × [(N − K)!/((n − x − 1)!(N − K − n + x + 1)!)]
  × [x!(K − x)! (n − x)!(N − K − n + x)!] / [K!(N − K)!]
= ((n − x)(K − x))/((x + 1)(N − K − n + x + 1)).
Solution

In total there are C(N, n) ways of selecting a random sample of size n from the
population of N objects. There are C(N_i, n_i) ways to select n_i of the N_i objects for
i = 1, 2, . . . , k. By the rule of product, the required probability is:
(Π_{i=1}^{k} C(N_i, n_i)) / C(N, n).
(b) Use the mass function in (a) to derive the associated distribution function.
Activity 3.5 Consider the first version of the negative binomial distribution. Show
that its mass function:
p_X(x) =
  C(x − 1, r − 1) π^{r}(1 − π)^{x−r}   for x = r, r + 1, r + 2, . . .
  0                                    otherwise
is a valid mass function.
A continuous random variable X has a distribution function which can be written as:
F_X(x) = ∫_{−∞}^{x} f_X(t) dt for all x ∈ R
for some integrable function f_X : R → [0, ∞), known as the (probability) density
function. In reverse, the density function can be derived from the distribution
function by differentiating:
f_X(x) = (d/dt) F_X(t) |_{t=x} = F′_X(x) for all x ∈ R.
= (n + 2) − (n + 1)
= 1.
Solution

Applying integration by parts, we have:
F_X(x) = ∫_{0}^{x} t e^{−t} dt = [−t e^{−t}]_{0}^{x} + ∫_{0}^{x} e^{−t} dt
= −x e^{−x} + [−e^{−t}]_{0}^{x}
= −x e^{−x} − e^{−x} + 1
= 1 − (1 + x) e^{−x}.
In full:
F_X(x) =
  0                     for x < 0
  1 − (1 + x) e^{−x}    for x ≥ 0.
Solution

We have:
F′_X(x) = e^{−x}/(1 + e^{−x})² > 0
hence the density function is:
f_X(x) = e^{−x}/(1 + e^{−x})² for −∞ < x < ∞.
Solution

We have:
P(X = x) = 0 for all x ∈ R.
The special case of X ∼ Uniform[0, 1], i.e. when the support is the unit interval, is used
in simulations of random samples from distributions, by treating a random drawing
from Uniform[0, 1] as a randomly drawn value of a distribution function (a distribution
function takes values in the unit interval). Inverting the distribution function recovers
the (simulated) random drawing from the desired distribution.
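A minimal sketch of this inverse-transform idea in Python, assuming for illustration that we want Exp(λ) draws, for which the inverse of F_X(x) = 1 − e^{−λx} is −ln(1 − u)/λ:

```python
# Inverse-transform sampling: uniform draws pushed through the quantile function.
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                                  # assumed rate parameter
u = rng.uniform(0.0, 1.0, size=100_000)    # simulated values of F_X(X)
x = -np.log(1.0 - u) / lam                 # invert F_X(x) = 1 - exp(-lam * x)

print(x.mean(), 1 / lam)                   # sample mean should be close to 1/lambda
```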
The distribution function of a normal random variable does not have a closed form.
An important special case is the standard normal distribution with µ = 0 and
σ² = 1, denoted Z ∼ N(0, 1), with density function:
f_Z(z) = (1/√(2π)) exp(−z²/2) for −∞ < z < ∞.
X and Z are related through the linear transformation:
Z = (X − µ)/σ  ⟺  X = µ + σZ.
The gamma distribution, denoted X ∼ Gamma(α, λ), has density function:
f_X(x) = (1/Γ(α)) λ^{α} x^{α−1} e^{−λx} for x ≥ 0, α > 0 and λ > 0
where recall Γ is the gamma function, defined in Section 3.7.7. We omit the distribution
function of the gamma distribution in this course.
Note that when α = 1 the density function reduces to an exponential distribution, i.e. if
X ∼ Gamma(1, λ), then X ∼ Exp(λ).
Example 3.16 A random variable has ‘no memory’ if for all x and for y > 0 it
holds that:
P(X > x + y | X > x) = P(X > y).
Show that if X has either the exponential distribution, or a geometric distribution
with P(X = x) = q^{x−1} p, then X has no memory. Interpret this property.

Solution

We must check that P(X > x + y | X > x) = P(X > y). This can be written in
terms of the distribution function of X because for y > 0 we have:
1 − F_X(y) = P(X > y) = P(X > x + y | X > x) = P({X > x + y} ∩ {X > x})/P(X > x)
= P(X > x + y)/P(X > x)
= (1 − F_X(x + y))/(1 − F_X(x)).
If X ∼ Exp(λ), then:
1 − F_X(x) = e^{−λx}.
The ‘no memory’ property is verified by noting that:
e^{−λy} = e^{−λ(x+y)}/e^{−λx}.
If X has the geometric distribution with P(X = x) = q^{x−1} p, then:
1 − F_X(x) = q^{x}.
The ‘no memory’ property is verified by noting that:
q^{y} = q^{x+y}/q^{x}.
The ‘no memory’ property is saying that ‘old is as good as new’. If we think in terms
of lifetimes, it says that you are equally likely to survive for y more years whatever
your current age x may be. This is unrealistic for humans for widely different ages x,
but may work as a base model in other applications.
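As a quick numerical check (my own sketch, with arbitrary values of λ, x and y), the conditional survival probability of an exponential random variable can be compared with the unconditional one by simulation:

```python
# Check the 'no memory' property of Exp(lambda) by simulation.
import numpy as np

rng = np.random.default_rng(2)
lam, x, y = 1.5, 0.8, 0.5
draws = rng.exponential(scale=1 / lam, size=1_000_000)

survived_x = draws[draws > x]
lhs = np.mean(survived_x > x + y)     # P(X > x + y | X > x), estimated
rhs = np.mean(draws > y)              # P(X > y), estimated
print(lhs, rhs, np.exp(-lam * y))     # all three should be close
```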
i.e. the value of X where the density function reaches a maximum (which may or may
not be unique), and the median is the value m satisfying:
median(X) = FX (m) = 0.5.
Hereafter, we will focus our attention on the mean of X, often referred to as the
expected value of X, or simply the expectation of X.
Example 3.17 Suppose that X ∼ Bin(n, π). Hence its mass function is:
p_X(x) =
  C(n, x) π^{x}(1 − π)^{n−x}   for x = 0, 1, 2, . . . , n
  0                            otherwise
and its expectation is:
E(X) = Σ_{x} x p_X(x)   (by definition)
= Σ_{x=0}^{n} x C(n, x) π^{x}(1 − π)^{n−x}   (substituting p_X(x))
= Σ_{x=1}^{n} x C(n, x) π^{x}(1 − π)^{n−x}   (since x p_X(x) = 0 when x = 0)
= Σ_{x=1}^{n} [n(n − 1)!/((x − 1)! [(n − 1) − (x − 1)]!)] π π^{x−1}(1 − π)^{n−x}   (as x/x! = 1/(x − 1)!)
= nπ Σ_{x=1}^{n} C(n − 1, x − 1) π^{x−1}(1 − π)^{n−x}   (taking nπ outside)
= nπ Σ_{y=0}^{n−1} C(n − 1, y) π^{y}(1 − π)^{(n−1)−y}   (setting y = x − 1)
= nπ
since the final sum is the sum of all Bin(n − 1, π) probabilities, which equals 1.
Example 3.18 Suppose that X ∼ Exp(λ). Hence its density function is:
f_X(x) =
  λe^{−λx}   for x ≥ 0
  0          otherwise.
Note that:
x e^{−λx} = −(d/dλ) e^{−λx}.
We may switch the order of differentiation with respect to λ and integration with
respect to x, hence:
E(X) = ∫_{0}^{∞} x λ e^{−λx} dx = −λ ∫_{0}^{∞} (d/dλ) e^{−λx} dx = −λ (d/dλ) ∫_{0}^{∞} e^{−λx} dx = −λ (d/dλ)(1/λ) = λ/λ² = 1/λ.
An alternative approach makes use of integration by parts.
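A further quick check (not part of the example) is to evaluate E(X) by numerical integration of x f_X(x), which should return 1/λ for whatever rate is assumed:

```python
# Numerical check of E(X) = 1/lambda for the exponential distribution.
import numpy as np
from scipy.integrate import quad

lam = 3.0   # assumed rate
mean, _ = quad(lambda x: x * lam * np.exp(-lam * x), 0, np.inf)
print(mean, 1 / lam)
```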
Example 3.19 If the random variable X ∼ Hyper(n, N, K), show that for a
random sample of size n, the expected value of the number of successes is:
E(X) = nK/N.

Solution

We have:
E(X) = Σ_{x=0}^{K} x [C(K, x) C(N − K, n − x)/C(N, n)]
= Σ_{x=1}^{K} x [K!/(x!(K − x)!)] C(N − K, n − x) / [N!/(n!(N − n)!)].
Using x/x! = 1/(x − 1)! and rearranging, this is (nK/N) multiplied by the sum of all
Hyper(n − 1, N − 1, K − 1) probabilities, which equals 1, hence E(X) = nK/N.
The proof of this result is trivial since the property of linearity is inherited directly from
the definition of expectation in terms of a sum or an integral. Note that since:
E(a0 + a1 X + a2 X 2 + · · · + ak X k ) = a0 + a1 E(X) + a2 E(X 2 ) + · · · + ak E(X k )
when k = 1, i.e. for any real constants a0 and a1 , we have:
E(a0 ) = a0 and E(a0 + a1 X) = a0 + a1 E(X).
Also note that if X is a positive random variable, then E(X) ≥ 0, as we would expect.
Determine E(Y).

Solution

We can derive E(Y) in one of two ways. Working directly with f_Y(y), we have:
E(Y) = ∫_{−∞}^{∞} y f_Y(y) dy = ∫_{0}^{1} y (1/√y − 1) dy = [2y^{3/2}/3 − y²/2]_{0}^{1} = 2/3 − 1/2 = 1/6.
Solution

We must have that:
Σ_{x} p_X(x) = Σ_{x=1}^{n} kx = k Σ_{x=1}^{n} x = k n(n + 1)/2 = 1
noting that Σ_{i=1}^{n} i = n(n + 1)/2. Therefore:
k = 2/(n(n + 1)).
Hence:
E(1/X) = Σ_{x} (1/x) p_X(x) = Σ_{x=1}^{n} (1/x) (2x/(n(n + 1))) = Σ_{x=1}^{n} 2/(n(n + 1)) = 2/(n + 1).
Hereafter, we will focus our attention on the variance of X – the average squared
distance from the mean (the standard deviation of X is then simply the positive
square root of the variance).
whenever the sum or integral is finite. The standard deviation is defined as:
σ = √(Var(X)).
Proof:
∎
In practice it is often easier to derive the variance of a random variable X using one of
the following alternative, but equivalent, results.
For X ∼ Bernoulli(π) we have:
E(X) = Σ_{x} x p_X(x) = Σ_{x=0}^{1} x π^{x}(1 − π)^{1−x}
= 0 × (1 − π) + 1 × π
= π.
Also, we have:
E(X²) = Σ_{x} x² p_X(x) = Σ_{x=0}^{1} x² π^{x}(1 − π)^{1−x}
= 0² × (1 − π) + 1² × π
= π.
Therefore:
Var(X) = E(X²) − (E(X))² = π − π² = π(1 − π).
Example 3.23 Suppose that X ∼ Bin(n, π). We know that E(X) = nπ. To find
Var(X) it is most convenient to calculate E(X(X − 1)), from which we can recover
E(X²). We have:
E(X(X − 1)) = Σ_{x} x(x − 1) p_X(x)   (by definition)
= Σ_{x=0}^{n} x(x − 1) C(n, x) π^{x}(1 − π)^{n−x}   (substituting p_X(x))
= Σ_{x=2}^{n} x(x − 1) C(n, x) π^{x}(1 − π)^{n−x}   (since x(x − 1) p_X(x) = 0 when x = 0 and 1)
= Σ_{x=2}^{n} [n(n − 1)(n − 2)!/((x − 2)! [(n − 2) − (x − 2)]!)] π² π^{x−2}(1 − π)^{n−x}   (as x(x − 1)/x! = 1/(x − 2)!)
= n(n − 1)π² Σ_{x=2}^{n} C(n − 2, x − 2) π^{x−2}(1 − π)^{n−x}   (taking n(n − 1)π² outside)
= n(n − 1)π² Σ_{y=0}^{n−2} C(n − 2, y) π^{y}(1 − π)^{(n−2)−y}   (setting y = x − 2)
= n(n − 1)π².
Therefore:
Var(X) = E(X(X − 1)) + E(X) − (E(X))² = n(n − 1)π² + nπ − n²π² = nπ(1 − π).
Markov inequality

If X is a positive random variable, then:
P(X ≥ a) ≤ E(X)/a
for any constant a > 0.

Proof: Here we consider the continuous case. A similar argument holds for the discrete
case.
P(X ≥ a) = ∫_{a}^{∞} f_X(x) dx   (by definition)
⟹ P(X ≥ a) ≤ ∫_{a}^{∞} (x/a) f_X(x) dx   (since 1 ≤ x/a for x ∈ [a, ∞))
⟹ P(X ≥ a) ≤ (1/a) ∫_{0}^{∞} x f_X(x) dx   (since ∫_{a}^{∞} g(x) dx ≤ ∫_{0}^{∞} g(x) dx for positive g)
⟹ P(X ≥ a) ≤ (1/a) E(X).   (by definition of E(X))
∎
So the Markov inequality provides an upper bound on the probability in the upper tail
of a distribution. Its appeal lies in its generality, since no distributional assumptions are
required. However, a consequence is that the bound may be very loose as the following
example demonstrates.
For example, suppose that E(X) = 80. The Markov inequality gives:
P(X ≥ 160) ≤ E(X)/160 = 80/160 = 0.5
which is unrealistic as we would expect this probability to be (very) close to zero!
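To make the looseness concrete, here is a small sketch (my own, assuming purely for illustration that X follows an Exp(1/80) distribution so that E(X) = 80) comparing the Markov bound with an actual tail probability:

```python
# Markov bound E(X)/a versus the tail probability of an assumed Exp(1/80) model.
import numpy as np

a, mean = 160.0, 80.0
markov_bound = mean / a                      # = 0.5
exact_exponential = np.exp(-a / mean)        # P(X >= 160) under the assumed model
print(markov_bound, exact_exponential)       # 0.5 versus roughly 0.135
```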
We now extend the Markov inequality to consider random variables which are not
constrained to be positive, using the Chebyshev inequality.
Chebyshev inequality

P(|X − E(X)| ≥ a) ≤ Var(X)/a²
for any constant a > 0.

Proof: This follows from the Markov inequality by setting Y = (X − E(X))². Hence Y
is a positive random variable so the Markov inequality holds. By definition
E(Y) = Var(X), so the Markov inequality gives:
P((X − E(X))² ≥ a²) = P(|X − E(X)| ≥ a) ≤ Var(X)/a²
using a² in place of a. ∎
Applying the Chebyshev inequality to a standardised distribution, i.e. if X is a random
variable with E(X) = µ and Var(X) = σ², then for a real constant k > 0 we have:
P(|X − µ|/σ ≥ k) ≤ 1/k².
Proof: This follows immediately from the Chebyshev inequality by setting a = kσ. ∎
The above example demonstrates that (for the normal distribution at least) the bound
can be very inaccurate. However, it is the generalisability of the result to all
distributions with finite variance which makes this a useful result.
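A short numerical comparison (my own sketch) of the Chebyshev bound 1/k² with the exact normal tail probability P(|X − µ| ≥ kσ) = 2(1 − Φ(k)) illustrates how conservative the bound is:

```python
# Chebyshev bound versus exact normal tail probabilities.
from scipy.stats import norm

for k in (1, 2, 3):
    chebyshev = 1 / k**2
    exact_normal = 2 * (1 - norm.cdf(k))
    print(k, round(chebyshev, 4), round(exact_normal, 4))
```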
We now consider a final inequality – the Jensen inequality – but first we begin with
the definition of a convex function.
Convex function
Jensen inequality
If X is a random variable with E(X) < ∞ and g is a convex function such that
E(g(X)) < ∞, then:
E(g(X)) ≥ g(E(X)).
Example 3.27 We consider applications of the Jensen inequality such that we may
derive various relationships involving expectation.
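A rough numerical illustration of the inequality (not part of the original example) for the convex function g(x) = x², using simulated Exp(1) draws:

```python
# Jensen's inequality: E(g(X)) >= g(E(X)) for the convex function g(x) = x**2.
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=500_000)

lhs = np.mean(x**2)          # estimate of E(g(X))
rhs = np.mean(x)**2          # g applied to the estimate of E(X)
print(lhs, rhs, lhs >= rhs)
```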
3.9.5 Moments
Characterising a probability distribution by key attributes is desirable. For a random
variable X the mean, E(X), is our preferred measure of central tendency, while the
variance, Var(X) = E((X − E(X))²), is our preferred measure of dispersion (or its
standard deviation). However, these are not exhaustive of distribution attributes which
may interest us. Skewness (the departure from symmetry) and kurtosis (the fatness
of tails) are also important, albeit less important than the mean and variance on a
relative basis.
On a rank-order basis we will think of the mean as being the most important attribute
of a distribution, followed by the variance, skewness and then kurtosis. Nonetheless, all
of these attributes may be expressed in terms of moments and central moments,
now defined. (Note that moments were introduced in ST104b Statistics 2 in the
context of method of moments estimation.)
Moments
If X is a random variable, and r is a positive integer, then the rth moment of X is:
µ_r = E(X^{r})
Example 3.28 Setting r = 1 produces the first moment, which is the mean of the
distribution since:
µ₁ = E(X¹) = E(X) = µ
provided E(X) < ∞.
Setting r = 2 produces the second moment, which combined with the mean can be
used to determine the variance since:
Var(X) = E(X²) − (E(X))² = µ₂ − (µ₁)².
Moments are determined by the horizontal location of the distribution. For the
variance, our preferred measure of dispersion, we would wish this to be invariant to a
shift up or down the horizontal axis. This leads to central moments which account for
the value of the mean.
Central moments

If X is a random variable, and r is a positive integer, then the rth central moment
of X is:
µ′_r = E((X − E(X))^{r})
whenever this is well-defined.
Example 3.29 Setting r = 1 produces the first central moment, which is always
zero since:
µ′₁ = E((X − E(X))¹) = E(X − µ₁) = E(X) − µ₁ = E(X) − E(X) = 0
provided E(X) < ∞.
Setting r = 2 produces the second central moment, which is the variance of the
distribution since:
µ′₂ = E((X − E(X))²) = Var(X) = σ²
provided E(X²) < ∞.
Example 3.30 Using (3.3), the second central moment can be expressed as:
µ′₂ = Σ_{i=0}^{2} C(2, i) (−µ₁)^{i} µ_{2−i} = µ₂ − 2(µ₁)² + (µ₁)² = µ₂ − (µ₁)².

Example 3.31 Using (3.3), the third central moment can be expressed as:
µ′₃ = Σ_{i=0}^{3} C(3, i) (−µ₁)^{i} µ_{3−i} = µ₃ − 3µ₁µ₂ + 3(µ₁)³ − (µ₁)³ = µ₃ − 3µ₁µ₂ + 2(µ₁)³.
Example 3.32 Let X be a random variable which has a Bernoulli distribution with
parameter π.
(c) Show that the mean of the binomial distribution with parameters n and π is
equal to nπ.

Solution

We have that X ∼ Bernoulli(π).
(a) Since X only takes the values 0 and 1, X^{r} = X for any positive integer r, hence:
E(X^{r}) = E(X) = 0 × (1 − π) + 1 × π = π.
(b) We have:
(c) Define Y = Σ_{i=1}^{n} X_i, where the X_i s are i.i.d. Bernoulli(π) random variables. Hence:
E(Y) = E(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} E(X_i) = Σ_{i=1}^{n} π = nπ.
Example 3.28 showed that the first moment is the mean (our preferred measure of
central tendency), while Example 3.29 showed that the second central moment is the
variance (our preferred measure of dispersion). We now express skewness and kurtosis in
terms of moments.
Coefficient of skewness

The coefficient of skewness is:
γ₁ = µ′₃/(µ′₂)^{3/2} = E((X − µ)³)/σ³.

So we see that skewness depends on the third central moment, although the reason for
this may not be immediately clear. It can be explained by first noting that if g(x) = x³,
then g is an odd function meaning g(−x) = −g(x). For a continuous random variable X
with density function f_X, the third central moment is:
E((X − µ)³) = ∫_{−∞}^{∞} (x − µ)³ f_X(x) dx
= ∫_{−∞}^{∞} z³ f_X(µ + z) dz   (where z = x − µ)
= ∫_{0}^{∞} z³ (f_X(µ + z) − f_X(µ − z)) dz.   (since z³ is odd)
The term (f_X(µ + z) − f_X(µ − z)) compares the density at the points a distance z above
and below µ, such that any difference signals asymmetry. If such sources of asymmetry
are far from µ, then when multiplied by z³ they result in a large coefficient of skewness.
For the exponential distribution, for which µ′₂ = 1/λ² and µ′₃ = 2/λ³, the coefficient of
skewness is:
γ₁ = µ′₃/(µ′₂)^{3/2} = (2/λ³)/((1/λ²)^{3/2}) = (2/λ³)/(1/λ³) = 2.
Coefficient of kurtosis

The coefficient of kurtosis is:
γ₂ = µ′₄/(µ′₂)² − 3 = E((X − µ)⁴)/σ⁴ − 3.   (3.4)

We see that the term ‘−3’ appears in the definition of kurtosis. Convention means we
measure kurtosis with respect to a normal distribution. Noting that the fourth central
moment of a normal distribution is 3σ⁴, we have that the kurtosis for a normal
distribution is:
γ₂ = E((X − µ)⁴)/σ⁴ − 3 = 3σ⁴/σ⁴ − 3 = 0.
The coefficient of kurtosis as defined in (3.4) with the ‘−3’ term is often called excess
kurtosis, i.e. kurtosis in excess of that of a normal distribution.
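As a small simulation sketch (my own addition), sample coefficients of skewness and excess kurtosis for Exp(1) data should be close to the theoretical values of 2 and 6 respectively:

```python
# Sample skewness and excess kurtosis of simulated exponential data.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(4)
x = rng.exponential(size=1_000_000)
print(skew(x), kurtosis(x))   # kurtosis() returns excess kurtosis by default
```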
Example 3.35 Find the mean and variance of the gamma distribution:
f_X(x) = (1/Γ(α)) λ^{α} x^{α−1} e^{−λx} = (1/(α − 1)!) λ^{α} x^{α−1} e^{−λx}
for x ≥ 0, α > 0 and λ > 0, where Γ(α) = (α − 1)!.
Hint: Note that since f_X(x) is a density function, we can write:
∫_{0}^{∞} (1/(α − 1)!) λ^{α} x^{α−1} e^{−λx} dx = 1.

Solution

We can find the rth moment, and use this result to get the mean and variance. We
have:
E(X^{r}) = µ_r = ∫_{−∞}^{∞} x^{r} f_X(x) dx = ∫_{0}^{∞} x^{r} (1/(α − 1)!) λ^{α} x^{α−1} e^{−λx} dx
= ∫_{0}^{∞} (1/(α − 1)!) x^{r+α−1} λ^{α} e^{−λx} dx
= [(r + α − 1)!/((α − 1)! λ^{r})] ∫_{0}^{∞} (1/(r + α − 1)!) x^{(r+α)−1} λ^{r+α} e^{−λx} dx
= (r + α − 1)!/((α − 1)! λ^{r})
since the integrand is a Gamma(r + α, λ) density function, which integrates to 1. So:
µ_r = (r + α − 1)!/((α − 1)! λ^{r}).
Using the result:
E(X) = µ₁ = α!/((α − 1)! λ) = α/λ
and:
E(X²) = µ₂ = (α + 1)!/((α − 1)! λ²) = α(α + 1)/λ².
Therefore, the variance is:
µ₂ − (µ₁)² = α(α + 1)/λ² − α²/λ² = α/λ².
Both the mean and variance increase with α increasing and decrease with λ
increasing.
We can also compute E(X) and E(X²) by substituting y = λx. Note that this gives
dx = (1/λ) dy. For example:
E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{0}^{∞} x (1/(α − 1)!) λ^{α} x^{α−1} e^{−λx} dx
= (1/((α − 1)! λ)) ∫_{0}^{∞} y^{α} e^{−y} dy = α!/((α − 1)! λ) = α/λ.
Example 3.36 Find the mean and variance of the Poisson distribution:
p_X(x) = e^{−λ} λ^{x}/x!
for x = 0, 1, 2, . . ., noting that:
Σ_{x=0}^{∞} e^{−λ} λ^{x}/x! = 1.

Solution

By direct calculation we have:
E(X) = Σ_{x=0}^{∞} x e^{−λ} λ^{x}/x! = e^{−λ} Σ_{x=1}^{∞} λ^{x}/(x − 1)! = λ e^{−λ} Σ_{y=0}^{∞} λ^{y}/y! = λ e^{−λ} e^{λ} = λ
setting y = x − 1, and:
E(X²) = Σ_{x=0}^{∞} x² e^{−λ} λ^{x}/x! = Σ_{x=1}^{∞} x² e^{−λ} λ^{x}/x! = e^{−λ} Σ_{x=1}^{∞} x λ^{x}/(x − 1)!
= λ e^{−λ} Σ_{y=0}^{∞} (y + 1) λ^{y}/y!
= λ e^{−λ} Σ_{y=0}^{∞} y λ^{y}/y! + λ e^{−λ} Σ_{y=0}^{∞} λ^{y}/y!   (the first sum is e^{λ} E(Y) for Y ∼ Pois(λ))
= λ e^{−λ} e^{λ} λ + λ e^{−λ} e^{λ}
= λ² + λ.
Alternatively, we can work with the factorial moments E(X^{(r)}), where
x^{(r)} = x(x − 1) · · · (x − r + 1). This works out very simply. We can then convert to the
mean and variance. The critical property that makes µ_{(r)} work out easily is that for an
integer x ≥ r we have:
x^{(r)}/x! = x^{(r)}/(x^{(r)} (x − r)!) = 1/(x − r)!.
We have:
E(X^{(r)}) = Σ_{x=0}^{∞} x^{(r)} e^{−λ} λ^{x}/x! = Σ_{x=r}^{∞} e^{−λ} x^{(r)} λ^{x}/x!
= Σ_{x=r}^{∞} e^{−λ} λ^{x}/(x − r)!
= λ^{r} Σ_{x=r}^{∞} e^{−λ} λ^{x−r}/(x − r)!
= λ^{r} Σ_{y=0}^{∞} e^{−λ} λ^{y}/y!
= λ^{r}.
The last step follows because we are adding together all the probabilities for a
Poisson distribution with parameter λ.
Now it is straightforward to get the mean and variance. For the mean we have:
E(X) = E(X^{(1)}) = λ
and since E(X^{(2)}) = E(X(X − 1)) = E(X²) − E(X) = λ², we have E(X²) = λ² + λ, and so:
Var(X) = µ₂ − (µ₁)² = λ² + λ − λ² = λ.
Example 3.37 Find the mean and variance of the Pareto distribution:
f_X(x) = (a − 1)/(1 + x)^{a}
for x > 0 and a > 1.
Hint: It is easier to find E(X + 1) and E((X + 1)²). We then have that
E(X) = E(X + 1) − 1 and E(X²) comes from E((X + 1)²) in a similar manner.

Solution

We can directly integrate for µ_r, writing the integral as a Pareto distribution with
parameter a − r after a transformation. It is easier to work with Y = X + 1, noting
that E(X) = E(Y) − 1 and Var(Y) = Var(X). We have:
E(Y^{r}) = ∫_{0}^{∞} (x + 1)^{r} (a − 1)/(1 + x)^{a} dx = ((a − 1)/(a − r − 1)) ∫_{0}^{∞} (a − r − 1)/(1 + x)^{a−r} dx = (a − 1)/(a − r − 1)
provided that a − r > 1 (otherwise the integral is not defined). So:
E(Y) = (a − 1)/(a − 2)  ⟹  E(X) = 1/(a − 2)
where to be well-defined we require M_X(t) < ∞ for all t ∈ [−h, h] for some h > 0.
So the mgf must be defined in an interval around the origin, which will be necessary
when derivatives of the mgf with respect to t are taken and evaluated when t = 0,
i.e. such as M′_X(0), M″_X(0) etc.
If the expected value E(e^{tX}) is infinite, the random variable X does not have an mgf.
The form of the mgf is not interesting or informative in itself. Instead, the reason we
define the mgf is that it is a convenient tool for deriving means and variances of
distributions, using the following results:
M′_X(0) = E(X) and M″_X(0) = E(X²)
which also gives:
Var(X) = E(X²) − (E(X))² = M″_X(0) − (M′_X(0))².
This is useful if the mgf is easier to derive than E(X) and Var(X) directly.
Other moments are obtained from the mgf similarly:
M_X^{(r)}(0) = E(X^{r}) for r = 1, 2, . . . .
To see why, note that M_X(t) is the expected value of an exponential function of X.
Recall the Taylor expansion of e^{x} is:
e^{x} = 1 + x + x²/2! + · · · + x^{r}/r! + · · · = Σ_{i=0}^{∞} x^{i}/i!.
All the derivatives of e^{x} are also e^{x}, i.e. the rth derivative is:
(d^{r}/dx^{r}) e^{x} = e^{x} for r = 1, 2, . . . .
Therefore, we may express the moment generating function as a polynomial in t, i.e. we
have:
M_X(t) = 1 + t E(X) + (t²/2!) E(X²) + · · · + (t^{r}/r!) E(X^{r}) + · · · = Σ_{i=0}^{∞} (t^{i}/i!) E(X^{i}).
Proof: This follows immediately from the series expansion of e^{x} and the linearity of
expectation:
M_X(t) = E(e^{tX}) = E(Σ_{i=0}^{∞} (tX)^{i}/i!) = Σ_{i=0}^{∞} (t^{i}/i!) E(X^{i}).
∎
We are now in a position to understand how moments can be generated from a moment
generating function. There are two approaches.
The coefficient of t^{r} in the series expansion of M_X(t) is the rth moment divided by
r!. Hence the rth moment can be determined by comparing coefficients. We have:
M_X(t) = Σ_{i=0}^{∞} (t^{i}/i!) E(X^{i}) = Σ_{i=0}^{∞} a_i t^{i}  ⟹  E(X^{r}) = r! a_r.
The rth derivative of a moment generating function evaluated at zero is the rth
moment, that is:
M_X^{(r)}(0) = (d^{r}/dt^{r}) M_X(t) |_{t=0} = E(X^{r}) = µ_r.
Proof: Since:
M_X(t) = 1 + t E(X) + (t²/2!) E(X²) + · · · + (t^{r}/r!) E(X^{r}) + · · ·
then:
M_X^{(r)}(t) = E(X^{r}) + t E(X^{r+1}) + (t²/2!) E(X^{r+2}) + · · · = Σ_{i=r}^{∞} (t^{i−r}/(i − r)!) E(X^{i}).
When evaluated at t = 0 only the first term, E(X^{r}), is non-zero, proving the result. ∎
The moment generating function uniquely determines a probability distribution. In
other words, if for two random variables X and Y we have MX (t) = MY (t) (for points
around t = 0), then X and Y have the same distribution.
If X and Y are random variables and we can find h > 0 such that M_X(t) = M_Y(t)
for all t ∈ [−h, h], then F_X(x) = F_Y(x) for all x ∈ R.
We now show examples of deriving the moment generating function and subsequently
using them to obtain moments.
Example 3.38 Suppose X ∼ Pois(λ), for which M_X(t) = exp(λ(e^{t} − 1)) and hence
M′_X(t) = λe^{t} exp(λ(e^{t} − 1)), giving E(X) = M′_X(0) = λ, and:
M″_X(t) = λe^{t}(1 + λe^{t}) e^{λ(e^{t}−1)}.
Therefore:
Var(X) = E(X²) − (E(X))² = λ(1 + λ) − λ² = λ.
Example 3.39 Suppose X ∼ Geo(π), for the second version of the geometric
distribution, i.e. we have:
p_X(x) =
  (1 − π)^{x} π   for x = 0, 1, 2, . . .
  0               otherwise.
Hence:
M_X(t) = E(e^{tX}) = Σ_{x=0}^{∞} e^{tx} (1 − π)^{x} π = π Σ_{x=0}^{∞} ((1 − π)e^{t})^{x} = π/(1 − e^{t}(1 − π))
using the sum to infinity of a geometric series, for t < −ln(1 − π) to ensure
convergence of the sum.
From M_X(t) = π/(1 − e^{t}(1 − π)) we obtain, using the chain rule:
M′_X(t) = π(1 − π)e^{t}/(1 − e^{t}(1 − π))²
For X ∼ Exp(λ) we have:
M_X(t) = E(e^{tX}) = ∫_{0}^{∞} e^{tx} λe^{−λx} dx = (λ/(λ − t)) ∫_{0}^{∞} (λ − t) e^{−(λ−t)x} dx = λ/(λ − t) for t < λ
where note the integral is that of an Exp(λ − t) distribution over its support, hence
is equal to 1.
From M_X(t) = λ/(λ − t) we obtain:
M′_X(t) = λ/(λ − t)² and M″_X(t) = 2λ/(λ − t)³
so:
E(X) = M′_X(0) = 1/λ and E(X²) = M″_X(0) = 2/λ²
and:
Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².
For X ∼ Gamma(α, λ) we similarly obtain M_X(t) = λ^{α}/(λ − t)^{α} for t < λ,
where note the integral is that of a Gamma(α, λ − t) distribution over its support,
hence is equal to 1, and where we multiplied by (λ − t)^{α}/(λ − t)^{α} = 1 (hence not
affecting the integral) to ‘create’ a Gamma(α, λ − t) density function. Since (λ − t)^{α}
does not depend on x we can place the numerator term inside the integral and the
denominator term outside the integral.
We can divide through by λ to obtain:
M_X(t) = (1 − t/λ)^{−α} for t < λ.
Using the series expansion (1 − t/λ)^{−α} = Σ_{i=0}^{∞} C(i + α − 1, α − 1)(t/λ)^{i}, we have:
M_X(t) = Σ_{i=0}^{∞} C(i + α − 1, α − 1)(t/λ)^{i} = Σ_{i=0}^{∞} [(i + α − 1)!/((α − 1)! λ^{i})] (t^{i}/i!).
Since the rth moment is the coefficient of t^{r}/r! in the polynomial expansion of
M_X(t), we deduce that if X ∼ Gamma(α, λ), the rth moment is:
E(X^{r}) = µ_r = (r + α − 1)!/((α − 1)! λ^{r})
as derived in Example 3.40. Note the choice of parameter symbol is arbitrary, such
that λ = β.
Solution
We are asked to find:
MX (t) = E(etX )
for X ∼ N(µ, σ²). We may write X = µ + σZ, where Z ∼ N(0, 1), such that:
M_X(t) = E(e^{t(µ+σZ)}) = e^{µt} E(e^{σtZ}) = e^{µt} M_Z(σt).
So we only need to derive the mgf for a standard normal random variable. We have:
M_Z(t) = ∫_{−∞}^{∞} e^{zt} (1/√(2π)) e^{−z²/2} dz = ∫_{−∞}^{∞} (1/√(2π)) e^{−((z−t)²−t²)/2} dz
= e^{t²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(z−t)²/2} dz
= e^{t²/2}.
The last integral is that of a N(t, 1) density function, and so is equal to 1. The step
from line 1 to line 2 follows from the simple algebraic identity:
−((z − t)² − t²)/2 = −z²/2 + zt.
The mgf for the general normal distribution is:
M_X(t) = exp(µt + σ²t²/2).
Example 3.43 Find the moment generating function of the double exponential, or
Laplace, distribution with density function:
f_X(x) = (1/2) e^{−|x|} for −∞ < x < ∞.
Solution

We have:
M_X(t) = ∫_{−∞}^{∞} e^{xt} (e^{−|x|}/2) dx = ∫_{−∞}^{0} e^{xt} (e^{x}/2) dx + ∫_{0}^{∞} e^{xt} (e^{−x}/2) dx
= ∫_{−∞}^{0} (e^{x(1+t)}/2) dx + ∫_{0}^{∞} (e^{−x(1−t)}/2) dx
= [e^{x(1+t)}/(2(1 + t))]_{−∞}^{0} + [−e^{−x(1−t)}/(2(1 − t))]_{0}^{∞}
= 1/(2(1 + t)) + 1/(2(1 − t))
= 1/(1 − t²)
where we require |t| < 1.
Solution

(a) Since:
1 = ∫_{−∞}^{∞} f_X(x) dx = k ∫_{−∞}^{∞} e^{−λ|x|} dx = k (∫_{−∞}^{0} e^{λx} dx + ∫_{0}^{∞} e^{−λx} dx)
= k ([e^{λx}/λ]_{−∞}^{0} + [−e^{−λx}/λ]_{0}^{∞})
= k (1/λ + 1/λ)
= 2k/λ
we have k = λ/2.
(b) Since the density function is symmetric about zero, we have:
E(X³) = 0.
(c) We have:
M_X(t) = E(e^{tX}) = (λ/2) ∫_{−∞}^{∞} e^{tx} e^{−λ|x|} dx
= (λ/2) (∫_{−∞}^{0} e^{(t+λ)x} dx + ∫_{0}^{∞} e^{(t−λ)x} dx)
= (λ/2) (1/(t + λ) − 1/(t − λ))
= λ²/(λ² − t²)
for |t| < λ.
(d) Since E(X) = 0 it holds that the variance of X is equal to E(X²). For |t| < λ,
we have:
M″_X(t) = (d/dt) [2λ²t/(λ² − t²)²] = [2λ²(λ² − t²)² + 2λ²t(4t(λ² − t²))]/(λ² − t²)⁴.
Setting t = 0 we have:
E(X²) = M″_X(0) = 2λ⁶/λ⁸ = 2/λ².
Note that this should not come as a surprise since X can be written as the
difference of two independent and identically distributed exponential random
variables, each with parameter λ, say X = T₁ − T₂. Hence, due to independence:
Var(X) = Var(T₁) + Var(T₂) = 1/λ² + 1/λ² = 2/λ².
Activity 3.11 Consider the following game. You pay £5 to play the game. A fair
coin is tossed three times. If the first and last tosses are heads, you receive £10 for
each head.
The cumulant generating function (cgf) is K_X(t) = log M_X(t). The rth cumulant, κ_r,
is the coefficient of t^{r}/r! in the expansion of K_X(t), so:
K_X(t) = κ₁ t + κ₂ t²/2! + · · · + κ_r t^{r}/r! + · · · = Σ_{i=1}^{∞} κ_i t^{i}/i!.
As with the relationship between a moment generating function and moments, the same
relationship holds for a cumulant generating function and cumulants. There are two
approaches.
The coefficient of t^{r} in the series expansion of K_X(t) is the rth cumulant divided by
r!. Hence the rth cumulant can be determined by comparing coefficients. We have:
K_X(t) = Σ_{i=0}^{∞} κ_i t^{i}/i! = Σ_{i=0}^{∞} a_i t^{i}  ⟹  κ_r = r! a_r
where a_i = κ_i/i!.
The rth derivative of a cumulant generating function evaluated at zero is the rth
cumulant, that is:
K_X^{(r)}(0) = (d^{r}/dt^{r}) K_X(t) |_{t=0} = κ_r.
If X is a random variable with moments {µ_r}, central moments {µ′_r} and cumulants
{κ_r}, for r = 1, 2, . . ., then:
i. κ₁ = E(X) = µ₁ = µ
ii. κ₂ = Var(X) = µ′₂ = σ²
iii. κ₃ = µ′₃
iv. a function of the fourth and second cumulants yields the fourth central moment:
κ₄ + 3κ₂² = µ′₄.
Suppose X is a degenerate random variable with P(X = µ) = 1. Show that:
K_X(t) = µt.

Solution

We have:
M_X(t) = E(e^{tX}) = Σ_{x} e^{tx} p_X(x) = e^{tµ} p_X(µ) = e^{tµ} × 1 = e^{tµ}.
Hence:
KX (t) = log MX (t) = µt.
Example 3.46 For the Poisson distribution, its cumulant generating function has a
simpler functional form than its moment generating function. If X ∼ Pois(λ), its
cumulant generating function is:
K_X(t) = log M_X(t) = log exp(λ(e^{t} − 1)) = λ(e^{t} − 1).
Example 3.47 Suppose Z ∼ N(0, 1). From Example 3.42 we have M_Z(t) = e^{t²/2},
hence taking the logarithm yields the cumulant generating function:
K_Z(t) = log M_Z(t) = log e^{t²/2} = t²/2.
Hence for a standard normal random variable, κ₂ = 1 and all other cumulants are
zero.
Activity 3.13 For each of the following distributions derive the moment generating
function and the cumulant generating function:
(a) Bernoulli(π)
(b) Bin(n, π)
Activity 3.14 Use cumulants to calculate the coefficient of skewness for a Poisson
distribution.
How to proceed? Well, we begin with the concept of the inverse image of a set.
So the inverse image of B under g is the image of B under g^{−1}. Hence for any
well-behaved B ⊆ R, we have:
P(Y ∈ B) = P(g(X) ∈ B) = P(X ∈ g^{−1}(B))
that is, the probability that g(X) is in B equals the probability that X is in the
inverse image of B.
Hence:
F_Y(y) =
  Σ_{x : g(x) ≤ y} p_X(x)       for discrete X
  ∫_{x : g(x) ≤ y} f_X(x) dx    for continuous X.
For a random variable X with distribution function F_X, consider the distribution
function of each of the following:
(a) X²
(b) √X
(c) G^{−1}(X)

Solution

(a) If y ≥ 0, then:
P(X² ≤ y) = P(X ≤ √y) − P(X < −√y) = F_X(√y) − F_X(−√y).
(c) We have:
P(G^{−1}(X) ≤ y) = P(X ≤ G(y)) = F_X(G(y)).
(d) We have:
= Σ_{x=n−y}^{n} C(n, x) π^{x}(1 − π)^{n−x}   (note the limits of x)
= Σ_{i=0}^{y} C(n, n − i) π^{n−i}(1 − π)^{i}   (setting i = n − x)
= Σ_{i=0}^{y} C(n, i) (1 − π)^{i} π^{n−i}.   (since C(n, n − i) = C(n, i))
Noting that the support must be for {y : y ≥ 0}, in full the distribution function of
Y is:
F_Y(y) =
  F_X(√y) − F_X(−√y)   for y ≥ 0
  0                    otherwise.
Differentiating, we obtain the density function of Y, noting the application of the
chain rule:
f_Y(y) =
  (f_X(√y) + f_X(−√y))/(2√y)   for y ≥ 0
  0                            otherwise.
Example 3.51 Let X be a continuous random variable with cdf FX (x). Determine
the distribution of Y = FX (X). What do you observe?
Solution
For 0 ≤ y ≤ 1 we have:
F_Y(y) = P(F_X(X) ≤ y) = P(X ≤ F_X^{−1}(y)) = F_X(F_X^{−1}(y)) = y
which is the distribution function of a Uniform[0, 1] random variable, hence
Y ∼ Uniform[0, 1].
Example 3.52 We apply the density function result in Example 3.50 to the case
where X ∼ N(0, 1), i.e. the standard normal distribution. Since the support of X is
R, the support of Y = X² is the positive real line. The density function of X is:
f_X(x) = (1/√(2π)) e^{−x²/2} for −∞ < x < ∞
and so, for y > 0:
f_Y(y) = (f_X(√y) + f_X(−√y))/(2√y) = (1/√(2πy)) e^{−y/2}.
Note that this is the density function of a Gamma(1/2, 1/2) distribution, hence if
X ∼ N(0, 1) and Y = X², then Y ∼ Gamma(1/2, 1/2).
In passing we also note, from ST104b Statistics 2, that the square of a standard
normal random variable has a chi-squared distribution with 1 degree of freedom, i.e.
χ²₁, hence it is also true that Y ∼ χ²₁ and so we can see that the chi-squared
distribution is a special case of the gamma distribution – there are many
relationships between the various families of distributions!
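A simulation sketch (my own addition) of this relationship: squaring standard normal draws and comparing their empirical distribution function with the χ²₁ distribution function:

```python
# Squared standard normals compared with the chi-squared(1) distribution.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
y = rng.standard_normal(1_000_000) ** 2

for q in (0.5, 1.0, 2.0, 4.0):
    print(q, np.mean(y <= q), chi2.cdf(q, df=1))   # empirical vs exact cdf
```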
Solution

The description of Y says that Y = |X − a|, hence:
F_Y(y) = P(Y ≤ y) = P(a − y ≤ X ≤ a + y) = P(a − y < X ≤ a + y)
=
  F_X(a + y) − F_X(a − y)   for y ≥ 0
  0                         for y < 0.
Monotone function

A function g is increasing if g(x₁) ≤ g(x₂) whenever x₁ ≤ x₂, and decreasing if
g(x₁) ≥ g(x₂) whenever x₁ ≤ x₂.
Strict monotonicity replaces the above inequalities with strict inequalities. For a
strictly monotone function, the inverse image of an interval is also an interval.
F_Y(y) =
  P(X ≤ g^{−1}(y))   for g increasing
  P(X ≥ g^{−1}(y))   for g decreasing
=
  F_X(g^{−1}(y))          for g increasing
  1 − F_X(g^{−1}(y)−)     for g decreasing.   (3.5)
Proof: Let X be a random variable with density function f_X(x) and let Y = g(X), i.e.
X = g^{−1}(Y).
If g^{−1}(·) is increasing, then:
F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≤ g^{−1}(y)) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y))
hence:
f_Y(y) = (d/dy) F_Y(y) = (d/dy) F_X(g^{−1}(y)) = f_X(g^{−1}(y)) (d/dy) g^{−1}(y).
If g^{−1}(·) is decreasing, then:
F_Y(y) = P(Y ≤ y) = P(g^{−1}(Y) ≥ g^{−1}(y)) = P(X ≥ g^{−1}(y)) = 1 − F_X(g^{−1}(y))
hence:
f_Y(y) = −(d/dy) F_X(g^{−1}(y)) = −f_X(g^{−1}(y)) (dg^{−1}(y)/dy).
Recall that the derivative of a decreasing function is negative.
Combining both cases, if g^{−1}(·) is monotone, then:
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|.
∎
Example 3.54 For any continuous random variable X we consider, its distribution
function, F_X, is strictly increasing over its support, say S ⊆ R. The inverse function,
F_X^{−1}, known as the quantile function, is strictly increasing on [0, 1], such that
F_X^{−1} : [0, 1] → S.
Let U ∼ Uniform[0, 1], hence its distribution function is:
F_U(u) =
  0   for u < 0
  u   for 0 ≤ u ≤ 1
  1   for u > 1.
The cumulant generating functions are also related, as seen by taking logarithms:
K_X(t) = µt + K_Z(σt).
Therefore:
E(X) = µ + σE(Z) and Var(X) = σ²Var(Z).
If we impose the distributional assumption of normality, such that Z ∼ N(0, 1), and
continue to let X = µ + σZ, the density function of X is:
f_X(x) = (1/σ) f_Z((x − µ)/σ) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²)
= (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
and hence X ∼ N(µ, σ²).
Recall that the cumulant generating function of the standard normal distribution
(Example 3.47) is K_Z(t) = t²/2, hence the cumulant generating function of X is:
K_X(t) = µt + K_Z(σt) = µt + σ²t²/2.
So if X ∼ N(µ, σ²), then:
K′_X(t) = µ + σ²t and K″_X(t) = σ²
hence κ₁ = K′_X(0) = µ and κ₂ = K″_X(0) = σ² (as always for the first two cumulants).
However, we see that:
κ_r = 0 for r > 2
and so by the one-to-one mapping (i.e. uniqueness) of a cumulant generating
function to a probability distribution, any distribution for which κ_r = 0 for r > 2 is
a normal distribution.
Suppose X ∼ N(µ, σ²) and let Y = exp(X).

Solution

Let Y = g(X) = exp(X). Hence X = g^{−1}(Y) = ln(Y), and:
(d/dy) g^{−1}(y) = 1/y.
Therefore:
f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|
= (1/√(2πσ²)) exp(−(ln y − µ)²/(2σ²)) (1/y).
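As a rough simulation check (my own sketch, with arbitrary values of µ and σ), the transformed density above can be compared with histogram frequencies of exp(X) for simulated normal X:

```python
# Compare the transformed density of Y = exp(X) with simulated frequencies.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 0.2, 0.5
y = np.exp(rng.normal(mu, sigma, size=1_000_000))

def f_Y(v):
    # density from the example: normal density evaluated at ln(v), times 1/v
    return np.exp(-(np.log(v) - mu) ** 2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma * v)

counts, edges = np.histogram(y, bins=50, range=(0.2, 4.0), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(counts - f_Y(mids))))   # should be small
```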
Example 3.59 Suppose X has the density function (sometimes called a Type II
Beta distribution):
f_X(x) = (1/B(α, β)) x^{β−1}/(1 + x)^{α+β}
for 0 < x < ∞, where α, β > 0 are constants. What is the density function of
Y = X/(1 + X)?

Solution

The transformation y = x/(1 + x) is strictly increasing on (0, ∞), and has the
unique inverse function x = y/(1 − y). Hence:
f_Y(y) = f_X(x) |dx/dy|
= f_X(y/(1 − y)) (1/(1 − y)²)
= (1/B(α, β)) [(y/(1 − y))^{β−1}/(1 + y/(1 − y))^{α+β}] (1/(1 − y)²)
= (1/B(α, β)) y^{β−1}(1 − y)^{α+1} (1/(1 − y)²)
= (1/B(α, β)) y^{β−1}(1 − y)^{α−1}.
Solution

In full:
f_Y(y) =
  1/(b − a)   for a ≤ y ≤ b
  0           otherwise
that is, Y ∼ Uniform[a, b].
Also:
E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_{0}^{1} x² dx = [x³/3]_{0}^{1} = 1/3
giving:
Var(X) = E(X²) − (E(X))² = 1/3 − (1/2)² = 1/12.
Hence:
Var(Y) = Var(a + (b − a)X) = Var((b − a)X) = (b − a)² Var(X) = (b − a)²/12.
If we now consider a sequence of random variables rather than constants, i.e. {X_n}, in
matters of convergence it does not make sense to compare |X_n − X| (a random variable)
to a constant ε > 0. Below, we introduce four different types of convergence.

Convergence in distribution

A sequence of random variables {X_n} converges in distribution to X if:
P(X_n ≤ x) → P(X ≤ x) as n → ∞
equivalently:
F_{X_n}(x) → F_X(x) as n → ∞
for all x at which the distribution function is continuous. This is denoted as:
X_n →d X.

Convergence in probability

A sequence of random variables {X_n} converges in probability to X if, for any ε > 0,
we have:
P(|X_n − X| < ε) → 1 as n → ∞.
This is denoted as:
X_n →p X.

Almost sure convergence

A sequence of random variables {X_n} converges almost surely to X if, for any
ε > 0, we have:
P(lim_{n→∞} |X_n − X| < ε) = 1.
This is denoted as:
X_n →a.s. X.
Convergence in mean square

A sequence of random variables {X_n} converges in mean square to X if:
E((X_n − X)²) → 0 as n → ∞.
This is denoted as:
X_n →m.s. X.

The above types of convergence differ in terms of their strength. If {X_n} converges
almost surely, then {X_n} converges in probability:
X_n →a.s. X  ⟹  X_n →p X.
If {X_n} converges in mean square, then {X_n} converges in probability:
X_n →m.s. X  ⟹  X_n →p X.
If {X_n} converges in probability, then {X_n} converges in distribution:
X_n →p X  ⟹  X_n →d X.
Combining these results, we can say that the set of all sequences which converge in
distribution contains the set of all sequences which converge in probability, which in
turn contains the set of all sequences which converge almost surely and in mean square.
We may write this as:
X_n →a.s. X or X_n →m.s. X  ⟹  X_n →p X  ⟹  X_n →d X.
By the Markov inequality applied to (X_n − X)² we have:
P(|X_n − X| > ε) ≤ E((X_n − X)²)/ε².
If X_n →m.s. X, then E((X_n − X)²) → 0. Therefore, {P(|X_n − X| > ε)} is a sequence of
positive real numbers bounded above by a sequence which converges to zero. Hence
we conclude P(|X_n − X| > ε) → 0 as n → ∞, and so X_n →p X.
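A small simulation sketch (my own construction, not from the guide): with X_n = X + Z_n/n for standard normal Z_n, the sequence converges to X in mean square, so the estimated probabilities P(|X_n − X| > ε) should shrink as n grows.

```python
# Convergence in probability: P(|X_n - X| > eps) shrinks as n increases.
import numpy as np

rng = np.random.default_rng(7)
eps, reps = 0.1, 200_000
x = rng.standard_normal(reps)

for n in (1, 10, 100, 1000):
    x_n = x + rng.standard_normal(reps) / n
    print(n, np.mean(np.abs(x_n - x) > eps))
```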
If X_n →d a, then F_{X_n} → F_a and hence F_{X_n}(a + ε) → 1 and F_{X_n}(a − ε) → 0.
Therefore, {P(|X_n − a| > ε)} is a sequence of positive real numbers bounded above
by a sequence which converges to zero. Hence we conclude P(|X_n − a| > ε) → 0 as
n → ∞ and hence X_n →p a.
Example 3.63 In the following two cases explain in which, if any, of the three
modes (in mean square, in probability, in distribution) X_n converges to 0.
(a) Let X_n = 1 with probability 2^{−n} and 0 otherwise.
(b) Let X_n = n with probability n^{−1} and 0 otherwise, and assume that the X_n s are
independent.

Solution

(a) We have P(X_n = 1) = 2^{−n} and P(X_n = 0) = 1 − 2^{−n}. Since:
E((X_n − 0)²) = 2^{−n} → 0 as n → ∞
X_n converges to 0 in mean square, and hence also in probability and in distribution.
3.13 A reminder of your learning outcomes

provide the probability mass function (pmf) and support for some common discrete
distributions

provide the probability density function (pdf) and support for some common
continuous distributions
3.14 Sample examination questions

1. (c) Show that the moment generating function of X is:
M_X(t) = e^{t}/(2 − e^{t}) for t < log 2.
Hence find E(X).
2. (a) Let X be a positive random variable with E(X) < ∞. Prove the Markov
inequality:
P(X ≥ a) ≤ E(X)/a
for any constant a > 0.
(b) For a random variable X with E(X) = µ and Var(X) = σ², state Chebyshev’s
inequality.
(c) By considering an Exp(1) random variable show, for 0 < a < 1, that:
ae^{−a} ≤ 1 and a²(1 + e^{−(1+a)} − e^{−(1−a)}) ≤ 1.
You can use the mean and variance of an exponential random variable without
proof, as long as they are stated clearly.
(Hint: Consider P(W₂ < w) and P(1 ≤ W₂ < w) for 0 < w < 1 and 1 ≤ w < 4,
respectively.)
(c) Show that the density function of Y = √W₂ is:
f_Y(y) =
  2/3   for 0 ≤ y < 1
  1/3   for 1 ≤ y < 2
  0     otherwise.
B.2 Chapter 3 – Random variables and univariate distributions (Solutions to Activities)
2. We have the following given the definitions of the random variables X and Y .
(a) {X < 3}.
(b) P (X < 3).
(c) {Y = 1}.
(d) P (Y = 0).
(e) P (X = 6, Y = 0).
(f) P (Y < X).
3. Let Y denote the value of claims paid. The distribution function of Y is:
F_Y(x) = P(Y ≤ x) = P(X ≤ x | X > k) = P(X ≤ x ∩ X > k)/P(X > k) = P(k < X ≤ x)/P(X > k).
Hence, in full:
F_Y(x) =
  0                                  for x ≤ k
  (F_X(x) − F_X(k))/(1 − F_X(k))     for x > k.
Let Z denote the value of claims not paid. The distribution function of Z is:
F_Z(x) = P(Z ≤ x) = P(X ≤ x | X ≤ k) = P(X ≤ x ∩ X ≤ k)/P(X ≤ k) = P(X ≤ x)/P(X ≤ k).
Hence, in full:
F_Z(x) =
  F_X(x)/F_X(k)   for x ≤ k
  1               for x > k.
4. This problem makes use of the following results from mathematics, concerning
sums of geometric series. If r ≠ 1, then:
Σ_{x=0}^{n−1} ar^{x} = a(1 − r^{n})/(1 − r).
(a) We first note that p_X is a positive real-valued function with respect to its
support. Noting that 1 − π < 1, we have:
Σ_{x=1}^{∞} (1 − π)^{x−1} π = π/(1 − (1 − π)) = 1.
Hence the two necessary conditions for a valid mass function are satisfied.
(b) The distribution function for the (first version of the) geometric distribution is:
F_X(x) = Σ_{t ≤ x} p_X(t) = Σ_{t=1}^{x} (1 − π)^{t−1} π = π(1 − (1 − π)^{x})/(1 − (1 − π)) = 1 − (1 − π)^{x}.
In full:
F_X(x) =
  0                      for x < 1
  1 − (1 − π)^{⌊x⌋}       for x ≥ 1.
5. We first note that p_X is a positive real-valued function with respect to its support.
We then have:
Σ_{x=r}^{∞} p_X(x) = Σ_{x=r}^{∞} C(x − 1, r − 1) π^{r}(1 − π)^{x−r} = π^{r} Σ_{y=0}^{∞} C(y + r − 1, r − 1)(1 − π)^{y} = π^{r}(1 − (1 − π))^{−r} = 1
where y = x − r. Hence the two necessary conditions for a valid mass function are
satisfied.
6. We have:
E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{0}^{2} (x² − x⁴/4) dx = [x³/3 − x⁵/20]_{0}^{2} = 16/15.
Also:
E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_{0}^{2} (x³ − x⁵/4) dx = [x⁴/4 − x⁶/24]_{0}^{2} = 4/3.
Hence:
Var(X) = E(X²) − (E(X))² = 4/3 − (16/15)² = 44/225.
7. We need to evaluate:
E(X) = ∫_{−∞}^{∞} x f_X(x) dx = ∫_{0}^{∞} x λ e^{−λx} dx.
We note that:
x e^{−λx} = x/e^{λx} = x/(1 + λx + λ²x²/2 + · · ·) = 1/(1/x + λ + λ²x/2 + · · ·)
such that the numerator is fixed (the constant 1), and the denominator tends to
infinity as x → ∞. Hence:
x e^{−λx} → 0 as x → ∞.
Applying integration by parts, we have:
E(X) = ∫_{0}^{∞} x λe^{−λx} dx = [−x e^{−λx}]_{0}^{∞} + ∫_{0}^{∞} e^{−λx} dx = 0 + [−e^{−λx}/λ]_{0}^{∞} = 1/λ.
8. (a) We have:
Var(X) = E((X − E(X))2 )
= E(X 2 − 2X E(X) + (E(X))2 )
= E(X 2 ) − 2(E(X))2 + (E(X))2
= E(X 2 ) − (E(X))2 .
(b) We have:
E(X(X − 1)) − E(X) E(X − 1) = E(X 2 ) − E(X) − (E(X))2 + E(X)
= E(X 2 ) − (E(X))2
= Var(X).
9. We note that:
x² λ e^{−λx} = λ (d²/dλ²) e^{−λx}
and so:
E(X²) = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_{0}^{∞} x² λ e^{−λx} dx = λ (d²/dλ²) λ^{−1} = 2/λ².
Therefore:
Var(X) = E(X²) − (E(X))² = 2/λ² − 1/λ² = 1/λ².
10. We consider the proof for a continuous random variable X. The indicator function
takes the value 1 for values ≤ x and 0 otherwise. Hence:
E(I_{(−∞,x]}(X)) = ∫_{−∞}^{∞} I_{(−∞,x]}(t) f_X(t) dt = ∫_{−∞}^{x} f_X(t) dt = F_X(x).
11. (a) Let the random variable X denote the return from the game. Hence X is a
discrete random variable, which can take three possible values: −£5 (if we do
not throw a head first and last), £15 (if we throw HTH) and £25 (if we throw
HHH). The probabilities associated with these values are 3/4, 1/8 and 1/8,
respectively. Therefore, the expected return from playing the game is:
E(X) = Σ_{x} x p_X(x) = −5 × 3/4 + 15 × 1/8 + 25 × 1/8 = £1.25.
since t/λ < 1. Since the coefficient of t^{r} is E(X^{r})/r! in the polynomial expansion of
M_X(t), for the exponential distribution we have:
E(X^{r})/r! = 1/λ^{r}  ⟹  E(X^{r}) = r!/λ^{r}.

= ((1 − π) + πe^{t})^{n}
using the binomial expansion. Hence the cumulant generating function is:
K_X(t) = log M_X(t) = n log((1 − π) + πe^{t}).

= πe^{t}/(1 − (1 − π)e^{t})
provided |(1 − π)e^{t}| < 1. Hence the cumulant generating function is:
K_X(t) = log M_X(t) = log(πe^{t}/(1 − (1 − π)e^{t})).
(d) If X ∼ Neg. Bin(r, π), then the moment generating function is:
M_X(t) = E(e^{tX}) = Σ_{x} e^{tx} p_X(x)
= Σ_{x=r}^{∞} e^{tx} C(x − 1, r − 1) π^{r}(1 − π)^{x−r}
= (πe^{t})^{r} Σ_{x=r}^{∞} [(x − 1)!/((r − 1)!(x − r)!)] ((1 − π)e^{t})^{x−r}
= (πe^{t})^{r} Σ_{y=0}^{∞} [(y + r − 1)!/((r − 1)! y!)] ((1 − π)e^{t})^{y}   (setting y = x − r)
= (πe^{t}/(1 − (1 − π)e^{t}))^{r}.
Note the relationship between the moment generating functions of Geo(π) and
Neg. Bin(r, π). A Neg. Bin(r, π) random variable is equal to the sum of r
independent Geometric(π) random variables (since π is constant, the sum of r
independent and identically distributed geometric random variables).
17. Let Y = 1/X. As X is a positive random variable, the function g(x) = 1/x is
well-behaved and monotonic. Therefore:
f_Y(y) =
  f_X(1/y)/y²   for y > 0
  0             otherwise.
20. This is a special case of the result that convergence in probability implies
convergence in distribution. Note that convergence in distribution requires
convergence to the distribution function, except at discontinuities. We note that:
Since X_n converges in probability to a, for any ε > 0, the left-hand side converges
to zero. Each element on the right-hand side is positive, so we must have:
203
C.2 Chapter 3 – Random variables and univariate distributions (Solutions to Sample examination questions)
3. We have:
π_n = P(nth toss is heads and there are two heads in the first n − 1 tosses)
= 0.5 × [(n − 1)!/(2!(n − 3)!)] × (0.5)² × (1 − 0.5)^{n−3}
= (0.5)^{n} × (n − 1)(n − 2)/2
= (n − 1)(n − 2)2^{−n−1}.
Let N be the toss number when the third head occurs in the repeated tossing of a
fair coin. Therefore:
1 = P(N < ∞) = Σ_{n=3}^{∞} P(N = n) = Σ_{n=3}^{∞} π_n = Σ_{n=3}^{∞} (n − 1)(n − 2)2^{−n−1}
and so:
Σ_{n=3}^{∞} (n − 1)(n − 2)2^{−n} = 2.
(c) We have:
M_X(t) = E(e^{tX}) = Σ_{x=1}^{∞} k^{x} e^{tx} = Σ_{x=1}^{∞} (ke^{t})^{x} = ke^{t}/(1 − ke^{t}) = e^{t}/(2 − e^{t}).
For the above to be valid, the sum to infinity has to be valid. That is, ke^{t} < 1,
meaning t < log 2. We then have:
M′_X(t) = 2e^{t}/(2 − e^{t})²
hence E(X) = M′_X(0) = 2.
2. (a) Let I(A) be the indicator function equal to 1 under A, and 0 otherwise. For
any a > 0, we have:
P(X ≥ a) = E(I(X ≥ a)) ≤ E(I(X ≥ a)X/a) ≤ E(X)/a.
(c) Let X ∼ Exp(1), with mean and variance both equal to 1. Hence, for a > 0, we
have:
P(X > a) = ∫_{a}^{∞} e^{−x} dx = e^{−a}.
So, by the Markov inequality, e^{−a} ≤ E(X)/a = 1/a, implying ae^{−a} ≤ 1. At the
same time, by the Chebyshev inequality, 1/a² ≥ P(|X − 1| ≥ a), where:
P(|X − 1| ≥ a) = P(X ≥ 1 + a) + P(X ≤ 1 − a) = e^{−(1+a)} + 1 − e^{−(1−a)}
implying a²(1 + e^{−(1+a)} − e^{−(1−a)}) ≤ 1.
For 1 ≤ w < 4, X₂ is in the range [−2, −1]. Hence for 1 ≤ w < 4 we have:
F_{W₂}(w) − F_{W₂}(1) = P(1 ≤ W₂ < w) = P(−√w < X₂ < −1) = ∫_{−√w}^{−1} (1/3) dx = (√w − 1)/3.
Hence:
f_Y(y) =
  2/3   for 0 < y < 1
  1/3   for 1 ≤ y < 2
  0     otherwise.
so that a = 1/8.
(b) For 0 < x < 2, we have:
F_X(x) = (1/8) ∫_{0}^{2} ∫_{0}^{x} (x′ + y) dx′ dy = (1/8) ∫_{0}^{2} [x′²/2 + x′y]_{0}^{x} dy = (2x + x²)/8.
The mean is:
E(X) = (1/8) ∫_{0}^{2} ∫_{0}^{2} (x² + xy) dx dy = (1/8) ∫_{0}^{2} [x³/3 + x²y/2]_{0}^{2} dy
= (1/8) ∫_{0}^{2} (8/3 + 2y) dy
= (1/8) [8y/3 + y²]_{0}^{2}
= 7/6.