Chapter 2

Multivariate Distributions

2.1 Distributions of Two Random Variables


We begin the discussion of a pair of random variables with the following example. A
coin is tossed three times and our interest is in the ordered number pair (number of
H’s on first two tosses, number of H’s on all three tosses), where H and T represent,
respectively, heads and tails. Let C = {T T T, T T H, T HT, HT T, T HH, HT H, HHT,
HHH} denote the sample space. Let X1 denote the number of H’s on the first
two tosses and X2 denote the number of H’s on all three flips. Then our inter-
est can be represented by the pair of random variables (X1 , X2 ). For example,
(X1 (HT H), X2 (HT H)) represents the outcome (1, 2). Continuing in this way, X1
and X2 are real-valued functions defined on the sample space C, which take us from
the sample space to the space of ordered number pairs.

D = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.

Thus X1 and X2 are two random variables defined on the space C, and, in this
example, the space of these random variables is the two-dimensional set D, which is
a subset of two-dimensional Euclidean space R2 . Hence (X1 , X2 ) is a vector function
from C to D. We now formulate the definition of a random vector.

Definition 2.1.1 (Random Vector). Given a random experiment with a sample


space C, consider two random variables X1 and X2 , which assign to each element
c of C one and only one ordered pair of numbers X1 (c) = x1 , X2 (c) = x2 . Then
we say that (X1 , X2 ) is a random vector. The space of (X1 , X2 ) is the set of
ordered pairs D = {(x1 , x2 ) : x1 = X1 (c), x2 = X2 (c), c ∈ C}.

We often denote random vectors using vector notation X = (X1, X2)′, where the ′ denotes the transpose of the row vector (X1, X2). Also, we often use (X, Y) to denote random vectors.
Let D be the space associated with the random vector (X1 , X2 ). Let A be a
subset of D. As in the case of one random variable, we speak of the event A. We
wish to define the probability of the event A, which we denote by PX1 ,X2 [A]. As


with random variables in Section 1.5 we can uniquely define PX1 ,X2 in terms of the
cumulative distribution function (cdf), which is given by

FX1 ,X2 (x1 , x2 ) = P [{X1 ≤ x1 } ∩ {X2 ≤ x2 }], (2.1.1)

for all (x1 , x2 ) ∈ R2 . Because X1 and X2 are random variables, each of the events
in the above intersection and the intersection of the events are events in the original
sample space C. Thus the expression is well defined. As with random variables, we
write P [{X1 ≤ x1 } ∩ {X2 ≤ x2 }] as P [X1 ≤ x1 , X2 ≤ x2 ]. As Exercise 2.1.3 shows,

P[a1 < X1 ≤ b1, a2 < X2 ≤ b2] = FX1,X2(b1, b2) − FX1,X2(a1, b2) − FX1,X2(b1, a2) + FX1,X2(a1, a2).   (2.1.2)

Hence, all induced probabilities of sets of the form (a1 , b1 ]×(a2 , b2 ] can be formulated
in terms of the cdf. We often call this cdf the joint cumulative distribution
function of (X1 , X2 ).
As with random variables, we are mainly concerned with two types of random
vectors, namely discrete and continuous. We first discuss the discrete type.
A random vector (X1 , X2 ) is a discrete random vector if its space D is finite
or countable. Hence, X1 and X2 are both discrete also. The joint probability
mass function (pmf) of (X1 , X2 ) is defined by

pX1 ,X2 (x1 , x2 ) = P [X1 = x1 , X2 = x2 ], (2.1.3)

for all (x1 , x2 ) ∈ D. As with random variables, the pmf uniquely defines the cdf. It
also is characterized by the two properties

(i) 0 ≤ pX1,X2(x1, x2) ≤ 1 and (ii) Σ_{(x1,x2)∈D} pX1,X2(x1, x2) = 1.   (2.1.4)

For an event B ∈ D, we have

P[(X1, X2) ∈ B] = Σ_{(x1,x2)∈B} pX1,X2(x1, x2).

Example 2.1.1. Consider the example at the beginning of this section where a fair
coin is flipped three times and X1 and X2 are the number of heads on the first two
flips and all 3 flips, respectively. We can conveniently table the pmf of (X1 , X2 ) as

                              Support of X2
                        0      1      2      3
                 0     1/8    1/8     0      0
Support of X1    1      0     2/8    2/8     0
                 2      0      0     1/8    1/8
For instance, P (X1 ≥ 2, X2 ≥ 2) = p(2, 2) + p(2, 3) = 2/8.
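For readers who wish to verify the table numerically, the following R sketch (not part of the text; the object names are arbitrary) enumerates the eight equally likely outcomes and tabulates (X1, X2):

# Enumerate the 8 equally likely outcomes of three tosses of a fair coin (1 = head).
outcomes <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1)
x1 <- outcomes$t1 + outcomes$t2      # number of heads on the first two tosses
x2 <- x1 + outcomes$t3               # number of heads on all three tosses
table(x1, x2) / 8                    # reproduces the tabled joint pmf
mean(x1 >= 2 & x2 >= 2)              # P(X1 >= 2, X2 >= 2) = 2/8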


At times it is convenient to speak of the support of a discrete random vector (X1, X2). These are all the points (x1, x2) in the space of (X1, X2) such that p(x1, x2) > 0. In the last example the support consists of the six points {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.
We say a random vector (X1 , X2 ) with space D is of the continuous type if its
cdf FX1 ,X2 (x1 , x2 ) is continuous. For the most part, the continuous random vectors
in this book have cdfs that can be represented as integrals of nonnegative functions.
That is, FX1 ,X2 (x1 , x2 ) can be expressed as
FX1,X2(x1, x2) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} fX1,X2(w1, w2) dw2 dw1,   (2.1.5)

for all (x1, x2) ∈ R². We call the integrand the joint probability density function (pdf) of (X1, X2). Then

∂²FX1,X2(x1, x2) / (∂x1 ∂x2) = fX1,X2(x1, x2),
except possibly on events that have probability zero. A pdf is essentially character-
ized by the two properties
(i) fX1,X2(x1, x2) ≥ 0 and (ii) ∫∫_D fX1,X2(x1, x2) dx1 dx2 = 1.   (2.1.6)
For the reader’s benefit, Section 4.2 of the accompanying resource Mathematical Comments¹ offers a short review of double integration. For an event A ∈ D, we have

P[(X1, X2) ∈ A] = ∫∫_A fX1,X2(x1, x2) dx1 dx2.
Note that P[(X1, X2) ∈ A] is just the volume under the surface z = fX1,X2(x1, x2)
over the set A.
Remark 2.1.1. As with univariate random variables, we often drop the subscript
(X1 , X2 ) from joint cdfs, pdfs, and pmfs, when it is clear from the context. We also
use notation such as f12 instead of fX1 ,X2 . Besides (X1 , X2 ), we often use (X, Y )
to express random vectors.
We next present two examples of jointly continuous random variables.
Example 2.1.2. Consider a continuous random vector (X, Y ) which is uniformly
distributed over the unit circle in R2 . Since the area of the unit circle is π, the joint
pdf is

f(x, y) = { 1/π   −1 < y < 1, −√(1 − y²) < x < √(1 − y²)
          { 0     elsewhere.
Probabilities of certain events follow immediately from geometry. For instance, let A be the interior of the circle with radius 1/2. Then P[(X, Y) ∈ A] = π(1/2)²/π = 1/4. Next, let B be the ring formed by the concentric circles with the respective radii 1/2 and √2/2. Then P[(X, Y) ∈ B] = π[(√2/2)² − (1/2)²]/π = 1/4. The regions A and B have the same area and hence, for this uniform pdf, are equilikely.
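These two probabilities can also be approximated by a simple Monte Carlo experiment in R; the sketch below (illustrative only, with an arbitrary sample size) samples uniformly over the unit circle by rejection:

# Monte Carlo check of Example 2.1.2: sample (X, Y) uniformly over the unit circle.
set.seed(20)
n <- 1e6
x <- runif(n, -1, 1); y <- runif(n, -1, 1)
keep <- x^2 + y^2 < 1                  # accept-reject: keep points inside the circle
x <- x[keep]; y <- y[keep]
r2 <- x^2 + y^2
mean(r2 < (1/2)^2)                     # P[(X,Y) in A], about 1/4
mean(r2 > (1/2)^2 & r2 < 1/2)          # P[(X,Y) in B], about 1/4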

1 Downloadable at the site listed in the Preface.



In the next example, we use the general fact that double integrals can be ex-
pressed as iterated univariate integrals. Thus double integrations can be carried
out using iterated univariate integrations. This is discussed in some detail with
examples in Section 4.2 of the accompanying resource Mathematical Comments.2
The aid of a simple sketch of the region of integration is valuable in setting up the
upper and lower limits of integration for each of the iterated integrals.
Example 2.1.3. Suppose an electrical component has two batteries. Let X and Y
denote the lifetimes in standard units of the respective batteries. Assume that the
pdf of (X, Y) is

f(x, y) = { 4xy e^{−(x²+y²)}   x > 0, y > 0
          { 0                  elsewhere.
The surface z = f(x, y) is sketched in Figure 2.1.1, where the grid squares are 0.1 by 0.1. From the figure, the pdf peaks at about (x, y) = (0.7, 0.7). Solving the equations ∂f/∂x = 0 and ∂f/∂y = 0 simultaneously shows that the maximum of f(x, y) actually occurs at (x, y) = (√2/2, √2/2). The batteries are more likely to die in regions near the peak. The surface tapers to 0 as x and y get large in any direction. For instance, the probability that both batteries survive beyond √2/2 units is given by
P(X > √2/2, Y > √2/2) = ∫_{√2/2}^{∞} ∫_{√2/2}^{∞} 4xy e^{−(x²+y²)} dx dy
                      = ∫_{√2/2}^{∞} 2x e^{−x²} [ ∫_{√2/2}^{∞} 2y e^{−y²} dy ] dx
                      = ∫_{1/2}^{∞} e^{−z} [ ∫_{1/2}^{∞} e^{−w} dw ] dz = (e^{−1/2})² ≈ 0.3679,

where we made use of the change-in-variables z = x² and w = y². In contrast to the last example, consider the regions A = {(x, y) : |x − (1/2)| < 0.3, |y − (1/2)| < 0.3} and B = {(x, y) : |x − 2| < 0.3, |y − 2| < 0.3}. The reader should locate these regions
on Figure 2.1.1. The areas of A and B are the same, but it is clear from the figure
that P [(X, Y ) ∈ A] is much larger than P [(X, Y ) ∈ B]. Exercise 2.1.6 confirms this
by showing that P [(X, Y ) ∈ A] = 0.1879 while P [(X, Y ) ∈ B] = 0.0026.
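The survival probability (e^{−1/2})² ≈ 0.3679 computed above can also be checked numerically; the following R sketch (an illustration, not part of the text) evaluates the iterated integral with integrate():

# Numerical check of P(X > sqrt(2)/2, Y > sqrt(2)/2) for the pdf 4*x*y*exp(-(x^2+y^2)).
inner <- function(x) sapply(x, function(xx)
  integrate(function(y) 4 * xx * y * exp(-(xx^2 + y^2)), sqrt(2)/2, Inf)$value)
integrate(inner, sqrt(2)/2, Inf)$value    # about 0.3679 = exp(-1)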
For a continuous random vector (X1 , X2 ), the support of (X1 , X2 ) contains all
points (x1 , x2 ) for which f (x1 , x2 ) > 0. We denote the support of a random vector
by S. As in the univariate case, S ⊂ D.
As in the last two examples, we extend the definition of a pdf fX1 ,X2 (x1 , x2 )
over R2 by using zero elsewhere. We do this consistently so that tedious, repetitious
references to the space D can be avoided. Once this is done, we replace
∫∫_D fX1,X2(x1, x2) dx1 dx2   by   ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x1, x2) dx1 dx2.
2 Downloadable at the site listed in the Preface.
Figure 2.1.1: A sketch of the surface of the joint pdf discussed in Example 2.1.3. On the figure, the origin is located at the intersection of the x and z axes and the grid squares are 0.1 by 0.1, so points are easily located. As discussed in the text, the peak of the pdf occurs at the point (√2/2, √2/2).

Likewise we may extend the pmf pX1,X2(x1, x2) over a convenient set by using zero elsewhere. Hence, we replace

Σ_{(x1,x2)∈D} pX1,X2(x1, x2)   by   Σ_{x2} Σ_{x1} p(x1, x2).

2.1.1 Marginal Distributions


Let (X1 , X2 ) be a random vector. Then both X1 and X2 are random variables.
We can obtain their distributions in terms of the joint distribution of (X1 , X2 ) as
follows. Recall that the event which defined the cdf of X1 at x1 is {X1 ≤ x1 }.
However,

{X1 ≤ x1 } = {X1 ≤ x1 } ∩ {−∞ < X2 < ∞} = {X1 ≤ x1 , −∞ < X2 < ∞}.

Taking probabilities, we have

FX1 (x1 ) = P [X1 ≤ x1 , −∞ < X2 < ∞], (2.1.7)


Table 2.1.1: Joint and Marginal Distributions for the discrete random vector (X1, X2) of Example 2.1.1.

                              Support of X2
                        0      1      2      3    | pX1(x1)
                 0     1/8    1/8     0      0    |   2/8
Support of X1    1      0     2/8    2/8     0    |   4/8
                 2      0      0     1/8    1/8   |   2/8
                 ----------------------------------
      pX2(x2)          1/8    3/8    3/8    1/8

for all x1 ∈ R. By Theorem 1.3.6 we can write this equation as FX1 (x1 ) =
limx2 ↑∞ F (x1 , x2 ). Thus we have a relationship between the cdfs, which we can
extend to either the pmf or pdf depending on whether (X1 , X2 ) is discrete or con-
tinuous.
First consider the discrete case. Let DX1 be the support of X1 . For x1 ∈ DX1 ,
Equation (2.1.7) is equivalent to
FX1(x1) = Σ_{w1 ≤ x1, −∞ < x2 < ∞} pX1,X2(w1, x2) = Σ_{w1 ≤ x1} { Σ_{x2 < ∞} pX1,X2(w1, x2) }.

By the uniqueness of cdfs, the quantity in braces must be the pmf of X1 evaluated at w1; that is,

pX1(x1) = Σ_{x2 < ∞} pX1,X2(x1, x2),   (2.1.8)

for all x1 ∈ DX1 . Hence, to find the probability that X1 is x1 , keep x1 fixed and
sum pX1 ,X2 over all of x2 . In terms of a tabled joint pmf with rows comprised of
X1 support values and columns comprised of X2 support values, this says that the
distribution of X1 can be obtained by the marginal sums of the rows. Likewise, the
pmf of X2 can be obtained by marginal sums of the columns.
Consider the joint discrete distribution of the random vector (X1 , X2 ) as pre-
sented in Example 2.1.1. In Table 2.1.1, we have added these marginal sums. The
final row of this table is the pmf of X2 , while the final column is the pmf of X1 .
In general, because these distributions are recorded in the margins of the table, we
often refer to them as marginal pmfs.
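In R, these marginal sums are simply the row and column sums of the joint pmf matrix; the sketch below (illustrative only) reproduces the margins of Table 2.1.1:

# Joint pmf of Example 2.1.1; rows index X1 = 0,1,2 and columns index X2 = 0,1,2,3.
p <- matrix(c(1, 1, 0, 0,
              0, 2, 2, 0,
              0, 0, 1, 1) / 8,
            nrow = 3, byrow = TRUE,
            dimnames = list(X1 = 0:2, X2 = 0:3))
rowSums(p)   # marginal pmf of X1: 2/8, 4/8, 2/8
colSums(p)   # marginal pmf of X2: 1/8, 3/8, 3/8, 1/8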

Example 2.1.4. Consider a random experiment that consists of drawing at random


one chip from a bowl containing 10 chips of the same shape and size. Each chip has
an ordered pair of numbers on it: one with (1, 1), one with (2, 1), two with (3, 1),
one with (1, 2), two with (2, 2), and three with (3, 2). Let the random variables
X1 and X2 be defined as the respective first and second values of the ordered pair.
Thus the joint pmf p(x1 , x2 ) of X1 and X2 can be given by the following table, with
p(x1 , x2 ) equal to zero elsewhere.
                  x2
      x1        1       2    | p1(x1)
      1        1/10    1/10  |  2/10
      2        1/10    2/10  |  3/10
      3        2/10    3/10  |  5/10
      ------------------------
    p2(x2)     4/10    6/10

The joint probabilities have been summed in each row and each column and these
sums recorded in the margins to give the marginal probability mass functions of X1
and X2 , respectively. Note that it is not necessary to have a formula for p(x1 , x2 )
to do this.
We next consider the continuous case. Let DX1 be the support of X1 . For
x1 ∈ DX1 , Equation (2.1.7) is equivalent to
FX1(x1) = ∫_{−∞}^{x1} ∫_{−∞}^{∞} fX1,X2(w1, x2) dx2 dw1 = ∫_{−∞}^{x1} { ∫_{−∞}^{∞} fX1,X2(w1, x2) dx2 } dw1.

By the uniqueness of cdfs, the quantity in braces must be the pdf of X1, evaluated at w1; that is,

fX1(x1) = ∫_{−∞}^{∞} fX1,X2(x1, x2) dx2   (2.1.9)

for all x1 ∈ DX1 . Hence, in the continuous case the marginal pdf of X1 is found by
integrating out x2 . Similarly, the marginal pdf of X2 is found by integrating out
x1 .
Example 2.1.5 (Example 2.1.2, continued). Consider the vector of continuous
random variables (X, Y ) discussed in Example 2.1.2. The space of the random
vector is the unit circle with center at (0, 0) as shown in Figure 2.1.2. To find the
marginal distribution of X, fix x between −1 and 1 and then integrate out y from −√(1 − x²) to √(1 − x²), as the arrow shows on Figure 2.1.2. Hence, the marginal pdf of X is

fX(x) = ∫_{−√(1−x²)}^{√(1−x²)} (1/π) dy = (2/π)√(1 − x²),   −1 < x < 1.
Although (X, Y ) has a joint uniform distribution, the distribution of X is unimodal
with peak at 0. This is not surprising. Since the joint distribution is uniform, from
Figure 2.1.2 X is more likely to be near 0 than at either extreme −1 or 1. Because
the joint pdf is symmetric in x and y, the marginal pdf of Y is the same as that of
X.

Example 2.1.6. Let X1 and X2 have the joint pdf

f(x1, x2) = { x1 + x2   0 < x1 < 1, 0 < x2 < 1
            { 0         elsewhere.
Figure 2.1.2: Region of integration for Example 2.1.5. It depicts the integration with respect to y, from (x, −√(1 − x²)) to (x, √(1 − x²)), at a fixed but arbitrary x.

Notice the space of the random vector is the interior of the square with vertices
(0, 0), (1, 0), (1, 1) and (0, 1). The marginal pdf of X1 is
f1(x1) = ∫_0^1 (x1 + x2) dx2 = x1 + 1/2,   0 < x1 < 1,

zero elsewhere, and the marginal pdf of X2 is

f2(x2) = ∫_0^1 (x1 + x2) dx1 = 1/2 + x2,   0 < x2 < 1,

zero elsewhere. A probability like P(X1 ≤ 1/2) can be computed from either f1(x1) or f(x1, x2) because

∫_0^{1/2} ∫_0^1 f(x1, x2) dx2 dx1 = ∫_0^{1/2} f1(x1) dx1 = 3/8.

Suppose, though, we want to find the probability P (X1 + X2 ≤ 1). Notice that
the region of integration is the interior of the triangle with vertices (0, 0), (1, 0), and (0, 1). The reader should sketch this region on the space of (X1, X2). Fixing x1 and
integrating with respect to x2 , we have
P(X1 + X2 ≤ 1) = ∫_0^1 ∫_0^{1−x1} (x1 + x2) dx2 dx1
               = ∫_0^1 [ x1(1 − x1) + (1 − x1)²/2 ] dx1
               = ∫_0^1 [ 1/2 − (1/2)x1² ] dx1 = 1/3.

This latter probability is the volume under the surface f(x1, x2) = x1 + x2 above the set {(x1, x2) : 0 < x1, 0 < x2, x1 + x2 ≤ 1}.
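The value 1/3 can be reproduced by iterated numerical integration in R; the following sketch (not part of the text) nests one call to integrate() inside another:

# P(X1 + X2 <= 1) for f(x1, x2) = x1 + x2 on the unit square, as an iterated integral.
inner <- function(x1) sapply(x1, function(a)
  integrate(function(x2) a + x2, lower = 0, upper = 1 - a)$value)
integrate(inner, lower = 0, upper = 1)$value   # about 0.3333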
Example 2.1.7 (Example 2.1.3, Continued). Recall that the random variables X
and Y of Example 2.1.3 were the lifetimes of two batteries installed in an electrical
component. The joint pdf of (X, Y ) is sketched in Figure 2.1.1. Its space is the
positive quadrant of R2 so there are no constraints involving both x and y. Using
the change-in-variable w = y², the marginal pdf of X is

fX(x) = ∫_0^∞ 4xy e^{−(x²+y²)} dy = 2x e^{−x²} ∫_0^∞ e^{−w} dw = 2x e^{−x²},

for x > 0. By the symmetry of x and y in the model, the pdf of Y is the same as
that of X. To determine the median lifetime, θ, of these batteries, we need to solve
1/2 = ∫_0^θ 2x e^{−x²} dx = 1 − e^{−θ²},

where again we have made use of the change-in-variables z = x². Solving this equation, we obtain θ = √(log 2) ≈ 0.8326. So 50% of the batteries have lifetimes exceeding 0.83 units.
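As a quick numerical check (not from the text), the median can also be found in R by solving the cdf equation with uniroot():

# Solve 1 - exp(-theta^2) = 1/2 for the median battery lifetime.
cdfX <- function(x) 1 - exp(-x^2)                    # cdf of the marginal pdf 2*x*exp(-x^2)
uniroot(function(x) cdfX(x) - 0.5, c(0, 5))$root     # about 0.8326 = sqrt(log(2))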

2.1.2 Expectation
The concept of expectation extends in a straightforward manner. Let (X1 , X2 ) be a
random vector and let Y = g(X1 , X2 ) for some real-valued function; i.e., g : R2 → R.
Then Y is a random variable and we could determine its expectation by obtaining
the distribution of Y . But Theorem 1.8.1 is true for random vectors also. Note the
proof we gave for this theorem involved the discrete case, and Exercise 2.1.12 shows
its extension to the random vector case.
Suppose (X1 , X2 ) is of the continuous type. Then E(Y ) exists if
∫_{−∞}^{∞} ∫_{−∞}^{∞} |g(x1, x2)| fX1,X2(x1, x2) dx1 dx2 < ∞.

Then

E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x1, x2) fX1,X2(x1, x2) dx1 dx2.   (2.1.10)
Likewise, if (X1, X2) is discrete, then E(Y) exists if

Σ_{x1} Σ_{x2} |g(x1, x2)| pX1,X2(x1, x2) < ∞.

Then

E(Y) = Σ_{x1} Σ_{x2} g(x1, x2) pX1,X2(x1, x2).   (2.1.11)

We can now show that E is a linear operator.

Theorem 2.1.1. Let (X1 , X2 ) be a random vector. Let Y1 = g1 (X1 , X2 ) and Y2 =


g2 (X1 , X2 ) be random variables whose expectations exist. Then for all real numbers
k1 and k2 ,
E(k1 Y1 + k2 Y2 ) = k1 E(Y1 ) + k2 E(Y2 ). (2.1.12)

Proof: We prove it for the continuous case. The existence of the expected value of
k1 Y1 + k2 Y2 follows directly from the triangle inequality and linearity of integrals;
i.e.,
∫_{−∞}^{∞} ∫_{−∞}^{∞} |k1 g1(x1, x2) + k2 g2(x1, x2)| fX1,X2(x1, x2) dx1 dx2
   ≤ |k1| ∫_{−∞}^{∞} ∫_{−∞}^{∞} |g1(x1, x2)| fX1,X2(x1, x2) dx1 dx2
     + |k2| ∫_{−∞}^{∞} ∫_{−∞}^{∞} |g2(x1, x2)| fX1,X2(x1, x2) dx1 dx2 < ∞.

By once again using linearity of the integral, we have


E(k1 Y1 + k2 Y2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [k1 g1(x1, x2) + k2 g2(x1, x2)] fX1,X2(x1, x2) dx1 dx2
                 = k1 ∫_{−∞}^{∞} ∫_{−∞}^{∞} g1(x1, x2) fX1,X2(x1, x2) dx1 dx2
                   + k2 ∫_{−∞}^{∞} ∫_{−∞}^{∞} g2(x1, x2) fX1,X2(x1, x2) dx1 dx2
                 = k1 E(Y1) + k2 E(Y2),

i.e., the desired result.

We also note that the expected value of any function g(X2 ) of X2 can be found
in two ways:
E(g(X2)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x2) f(x1, x2) dx1 dx2 = ∫_{−∞}^{∞} g(x2) fX2(x2) dx2,

the latter single integral being obtained from the double integral by integrating on
x1 first. The following example illustrates these ideas.
Example 2.1.8. Let X1 and X2 have the pdf

f(x1, x2) = { 8x1x2   0 < x1 < x2 < 1
            { 0       elsewhere.

Figure 2.1.3 shows the space for (X1, X2). Then

E(X1X2²) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 x2² f(x1, x2) dx1 dx2.

To compute the integration, as shown by the arrow on Figure 2.1.3, we fix x2 and then integrate x1 from 0 to x2. We then integrate out x2 from 0 to 1. Hence,

∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 x2² f(x1, x2) dx1 dx2 = ∫_0^1 ∫_0^{x2} 8x1² x2³ dx1 dx2 = ∫_0^1 (8/3) x2⁶ dx2 = 8/21.

In addition,

E(X2) = ∫_0^1 ∫_0^{x2} x2 (8x1x2) dx1 dx2 = 4/5.

Since X2 has the pdf f2(x2) = 4x2³, 0 < x2 < 1, zero elsewhere, the latter expectation can also be found by

E(X2) = ∫_0^1 x2 (4x2³) dx2 = 4/5.

Using Theorem 2.1.1,

E(7X1X2² + 5X2) = 7E(X1X2²) + 5E(X2) = (7)(8/21) + (5)(4/5) = 20/3.
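As an illustration (not part of the text), the value 20/3 ≈ 6.667 can be confirmed by nested numerical integration over the support 0 < x1 < x2 < 1:

# E[7*X1*X2^2 + 5*X2] for f(x1, x2) = 8*x1*x2 on 0 < x1 < x2 < 1.
inner <- function(x2) sapply(x2, function(b)
  integrate(function(x1) (7 * x1 * b^2 + 5 * b) * 8 * x1 * b,
            lower = 0, upper = b)$value)
integrate(inner, lower = 0, upper = 1)$value   # about 6.6667 = 20/3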

Example 2.1.9. Continuing with Example 2.1.8, suppose the random variable Y
is defined by Y = X1 /X2 . We determine E(Y ) in two ways. The first way is by
definition; i.e., find the distribution of Y and then determine its expectation. The
cdf of Y, for 0 < y ≤ 1, is

FY(y) = P(Y ≤ y) = P(X1 ≤ yX2) = ∫_0^1 ∫_0^{yx2} 8x1x2 dx1 dx2 = ∫_0^1 4y²x2³ dx2 = y².

Hence, the pdf of Y is

fY(y) = F′Y(y) = { 2y   0 < y < 1
                 { 0    elsewhere,

which leads to

E(Y) = ∫_0^1 y(2y) dy = 2/3.
Figure 2.1.3: Region of integration for Example 2.1.8. The arrow depicts the integration with respect to x1 at a fixed but arbitrary x2.

For the second way, we make use of expression (2.1.10) and find E(Y) directly by

E(Y) = E(X1/X2) = ∫_0^1 [ ∫_0^{x2} (x1/x2) 8x1x2 dx1 ] dx2 = ∫_0^1 (8/3) x2³ dx2 = 2/3.

We next define the moment generating function of a random vector.

Definition 2.1.2 (Moment Generating Function of a Random Vector). Let X = (X1, X2)′ be a random vector. If E(e^{t1X1 + t2X2}) exists for |t1| < h1 and |t2| < h2, where h1 and h2 are positive, it is denoted by MX1,X2(t1, t2) and is called the moment generating function (mgf) of X.

As in the one-variable case, if it exists, the mgf of a random vector uniquely


determines the distribution of the random vector.
Let t = (t1, t2)′. Then we can write the mgf of X as

MX1,X2(t) = E(e^{t′X}),   (2.1.13)
so it is quite similar to the mgf of a random variable. Also, the mgfs of X1 and X2
are immediately seen to be MX1 ,X2 (t1 , 0) and MX1 ,X2 (0, t2 ), respectively. If there
is no confusion, we often drop the subscripts on M .
Example 2.1.10. Let the continuous-type random variables X and Y have the joint pdf

f(x, y) = { e^{−y}   0 < x < y < ∞
          { 0        elsewhere.
The reader should sketch the space of (X, Y ). The mgf of this joint distribution is
M(t1, t2) = ∫_0^∞ ∫_x^∞ exp(t1x + t2y − y) dy dx = 1 / [(1 − t1 − t2)(1 − t2)],

provided that t1 + t2 < 1 and t2 < 1. Furthermore, the moment-generating functions of the marginal distributions of X and Y are, respectively,

M(t1, 0) = 1/(1 − t1),   t1 < 1,
M(0, t2) = 1/(1 − t2)²,   t2 < 1.
These moment-generating functions are, of course, respectively, those of the
marginal probability density functions,
f1(x) = ∫_x^∞ e^{−y} dy = e^{−x},   0 < x < ∞,

zero elsewhere, and

f2(y) = ∫_0^y e^{−y} dx = y e^{−y},   0 < y < ∞,

zero elsewhere.
We also need to define the expected value of the random vector itself, but this
is not a new concept because it is defined in terms of componentwise expectation:
Definition 2.1.3 (Expected Value of a Random Vector). Let X = (X1, X2)′ be a random vector. Then the expected value of X exists if the expectations of X1 and X2 exist. If it exists, then the expected value is given by

E[X] = (E(X1), E(X2))′.   (2.1.14)

EXERCISES
2.1.1. Let f(x1, x2) = 4x1x2, 0 < x1 < 1, 0 < x2 < 1, zero elsewhere, be the pdf of X1 and X2. Find P(0 < X1 < 1/2, 1/4 < X2 < 1), P(X1 = X2), P(X1 < X2), and P(X1 ≤ X2).
Hint: Recall that P (X1 = X2 ) would be the volume under the surface f (x1 , x2 ) =
4x1 x2 and above the line segment 0 < x1 = x2 < 1 in the x1 x2 -plane.
2.1.2. Let A1 = {(x, y) : x ≤ 2, y ≤ 4}, A2 = {(x, y) : x ≤ 2, y ≤ 1}, A3 = {(x, y) : x ≤ 0, y ≤ 4}, and A4 = {(x, y) : x ≤ 0, y ≤ 1} be subsets of the space A of two random variables X and Y, which is the entire two-dimensional plane. If P(A1) = 7/8, P(A2) = 4/8, P(A3) = 3/8, and P(A4) = 2/8, find P(A5), where A5 = {(x, y) : 0 < x ≤ 2, 1 < y ≤ 4}.

2.1.3. Let F (x, y) be the distribution function of X and Y . For all real constants
a < b, c < d, show that P (a < X ≤ b, c < Y ≤ d) = F (b, d) − F (b, c) − F (a, d) +
F (a, c).

2.1.4. Show that the function F (x, y) that is equal to 1 provided that x + 2y ≥ 1,
and that is equal to zero provided that x + 2y < 1, cannot be a distribution function
of two random variables.
Hint: Find four numbers a < b, c < d, so that

F (b, d) − F (a, d) − F (b, c) + F (a, c)

is less than zero.

2.1.5. Given that the nonnegative function g(x) has the property that

∫_0^∞ g(x) dx = 1,

show that

f(x1, x2) = 2g(√(x1² + x2²)) / [π√(x1² + x2²)],   0 < x1 < ∞, 0 < x2 < ∞,
zero elsewhere, satisfies the conditions for a pdf of two continuous-type random
variables X1 and X2 .
Hint: Use polar coordinates.

2.1.6. Consider Example 2.1.3.

(a) Show that P (a < X < b, c < Y < d) = (exp{−a2 } − exp{−b2 })(exp{−c2 } −
exp{−d2 }).

(b) Using Part (a) and the notation in Example 2.1.3, show that P [(X, Y ) ∈ A] =
0.1879 while P [(X, Y ) ∈ B] = 0.0026.

(c) Show that the following R program computes P (a < X < b, c < Y < d).
Then use it to compute the probabilities in Part (b).

plifetime <- function(a,b,c,d)
  {(exp(-a^2) - exp(-b^2))*(exp(-c^2) - exp(-d^2))}

2.1.7. Let f (x, y) = e−x−y , 0 < x < ∞, 0 < y < ∞, zero elsewhere, be the pdf of
X and Y . Then if Z = X + Y , compute P (Z ≤ 0), P (Z ≤ 6), and, more generally,
P (Z ≤ z), for 0 < z < ∞. What is the pdf of Z?

2.1.8. Let X and Y have the pdf f (x, y) = 1, 0 < x < 1, 0 < y < 1, zero elsewhere.
Find the cdf and pdf of the product Z = XY .
2.1.9. Let 13 cards be taken, at random and without replacement, from an ordinary
deck of playing cards. If X is the number of spades in these 13 cards, find the pmf of
X. If, in addition, Y is the number of hearts in these 13 cards, find the probability
P (X = 2, Y = 5). What is the joint pmf of X and Y ?
2.1.10. Let the random variables X1 and X2 have the joint pmf described as follows:

(x1, x2)      (0, 0)   (0, 1)   (0, 2)   (1, 0)   (1, 1)   (1, 2)
p(x1, x2)      2/12     3/12     2/12     2/12     2/12     1/12

and p(x1 , x2 ) is equal to zero elsewhere.


(a) Write these probabilities in a rectangular array as in Example 2.1.4, recording
each marginal pdf in the “margins.”
(b) What is P (X1 + X2 = 1)?
2.1.11. Let X1 and X2 have the joint pdf f (x1 , x2 ) = 15x21 x2 , 0 < x1 < x2 < 1,
zero elsewhere. Find the marginal pdfs and compute P (X1 + X2 ≤ 1).
Hint: Graph the space X1 and X2 and carefully choose the limits of integration
in determining each marginal pdf.
2.1.12. Let X1 , X2 be two random variables with the joint pmf p(x1 , x2 ), (x1 , x2 ) ∈
S, where S is the support of X1 , X2 . Let Y = g(X1 , X2 ) be a function such that

Σ_{(x1,x2)∈S} |g(x1, x2)| p(x1, x2) < ∞.

By following the proof of Theorem 1.8.1, show that

E(Y) = Σ_{(x1,x2)∈S} g(x1, x2) p(x1, x2).

2.1.13. Let X1, X2 be two random variables with the joint pmf p(x1, x2) = (x1 + x2)/12, for x1 = 1, 2, x2 = 1, 2, zero elsewhere. Compute E(X1), E(X1²), E(X2), E(X2²), and E(X1X2). Is E(X1X2) = E(X1)E(X2)? Find E(2X1 − 6X2² + 7X1X2).
2.1.14. Let X1, X2 be two random variables with joint pdf f(x1, x2) = 4x1x2, 0 < x1 < 1, 0 < x2 < 1, zero elsewhere. Compute E(X1), E(X1²), E(X2), E(X2²), and E(X1X2). Is E(X1X2) = E(X1)E(X2)? Find E(3X2 − 2X1² + 6X1X2).
2.1.15. Let X1, X2 be two random variables with joint pmf p(x1, x2) = (1/2)^{x1+x2}, for 1 ≤ xi < ∞, i = 1, 2, where x1 and x2 are integers, zero elsewhere. Determine the joint mgf of X1, X2. Show that M(t1, t2) = M(t1, 0)M(0, t2).
2.1.16. Let X1 , X2 be two random variables with joint pdf f (x1 , x2 ) = x1 exp{−x2 },
for 0 < x1 < x2 < ∞, zero elsewhere. Determine the joint mgf of X1 , X2 . Does
M (t1 , t2 ) = M (t1 , 0)M (0, t2 )?
2.1.17. Let X and Y have the joint pdf f(x, y) = 6(1 − x − y), x + y < 1, 0 < x, 0 < y, zero elsewhere. Compute P(2X + 3Y < 1) and E(XY + 2X²).

2.2 Transformations: Bivariate Random Variables


Let (X1 , X2 ) be a random vector. Suppose we know the joint distribution of
(X1 , X2 ) and we seek the distribution of a transformation of (X1 , X2 ), say, Y =
g(X1 , X2 ). We may be able to obtain the cdf of Y . Another way is to use a trans-
formation as we did for univariate random variables in Sections 1.6 and 1.7. In this
section, we extend this theory to random vectors. It is best to discuss the discrete
and continuous cases separately. We begin with the discrete case.
There are no essential difficulties involved in a problem like the following. Let
pX1 ,X2 (x1 , x2 ) be the joint pmf of two discrete-type random variables X1 and X2
with S the (two-dimensional) set of points at which pX1 ,X2 (x1 , x2 ) > 0; i.e., S is the
support of (X1 , X2 ). Let y1 = u1 (x1 , x2 ) and y2 = u2 (x1 , x2 ) define a one-to-one
transformation that maps S onto T . The joint pmf of the two new random variables
Y1 = u1(X1, X2) and Y2 = u2(X1, X2) is given by

pY1,Y2(y1, y2) = { pX1,X2[w1(y1, y2), w2(y1, y2)]   (y1, y2) ∈ T
                 { 0                                elsewhere,

where x1 = w1 (y1 , y2 ), x2 = w2 (y1 , y2 ) is the single-valued inverse of y1 = u1 (x1 , x2 ),


y2 = u2 (x1 , x2 ). From this joint pmf pY1 ,Y2 (y1 , y2 ) we may obtain the marginal pmf
of Y1 by summing on y2 or the marginal pmf of Y2 by summing on y1 .
In using this change of variable technique, it should be emphasized that we need
two “new” variables to replace the two “old” variables. An example helps explain
this technique.

Example 2.2.1. In a large metropolitan area during flu season, suppose that two
strains of flu, A and B, are occurring. For a given week, let X1 and X2 be the
respective number of reported cases of strains A and B with the joint pmf

pX1,X2(x1, x2) = (μ1^{x1} μ2^{x2} e^{−μ1} e^{−μ2}) / (x1! x2!),   x1 = 0, 1, 2, 3, . . . ,  x2 = 0, 1, 2, 3, . . . ,
and is zero elsewhere, where the parameters μ1 and μ2 are positive real numbers.
Thus the space S is the set of points (x1 , x2 ), where each of x1 and x2 is a non-
negative integer. Further, repeatedly using the Maclaurin series for the exponential
function,3 we have

E(X1) = Σ_{x1=0}^{∞} x1 e^{−μ1} (μ1^{x1}/x1!) Σ_{x2=0}^{∞} e^{−μ2} (μ2^{x2}/x2!)
      = e^{−μ1} μ1 Σ_{x1=1}^{∞} μ1^{x1−1}/(x1 − 1)! · 1 = μ1.

Thus μ1 is the mean number of cases of Strain A flu reported during a week.


Likewise, μ2 is the mean number of cases of Strain B flu reported during a week.
3 See for example the discussion on Taylor series in Mathematical Comments as referenced in

the Preface.

A random variable of interest is Y1 = X1 + X2 ; i.e., the total number of reported


cases of A and B flu during a week. By Theorem 2.1.1, we know E(Y1 ) = μ1 + μ2 ;
however, we wish to determine the distribution of Y1 . If we use the change of
variable technique, we need to define a second random variable Y2 . Because Y2 is
of no interest to us, let us choose it in such a way that we have a simple one-to-one
transformation. For this example, we take Y2 = X2 . Then y1 = x1 + x2 and y2 = x2
represent a one-to-one transformation that maps S onto
T = {(y1 , y2 ) : y2 = 0, 1, . . . , y1 and y1 = 0, 1, 2, . . .}.
Note that if (y1 , y2 ) ∈ T , then 0 ≤ y2 ≤ y1 . The inverse functions are given by
x1 = y1 − y2 and x2 = y2 . Thus the joint pmf of Y1 and Y2 is
pY1,Y2(y1, y2) = (μ1^{y1−y2} μ2^{y2} e^{−μ1−μ2}) / [(y1 − y2)! y2!],   (y1, y2) ∈ T,
and is zero elsewhere. Consequently, the marginal pmf of Y1 is given by

pY1(y1) = Σ_{y2=0}^{y1} pY1,Y2(y1, y2)
        = (e^{−μ1−μ2}/y1!) Σ_{y2=0}^{y1} [y1!/((y1 − y2)! y2!)] μ1^{y1−y2} μ2^{y2}
        = (μ1 + μ2)^{y1} e^{−μ1−μ2} / y1!,   y1 = 0, 1, 2, . . . ,
and is zero elsewhere, where the third equality follows from the binomial expansion.
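Since the Poisson pmf is available in R as dpois, the marginal pmf of Y1 just derived is easy to verify numerically; the sketch below (not part of the text) uses arbitrarily chosen values of μ1 and μ2:

# Check that the convolution of two Poisson pmfs is Poisson(mu1 + mu2).
mu1 <- 2; mu2 <- 3.5
y1 <- 7                                          # any fixed total count
sum(dpois(y1 - 0:y1, mu1) * dpois(0:y1, mu2))    # sum of the joint pmf over y2
dpois(y1, mu1 + mu2)                             # same value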

For the continuous case we begin with an example that illustrates the cdf tech-
nique.
Example 2.2.2. Consider an experiment in which a person chooses at random
a point (X1 , X2 ) from the unit square S = {(x1 , x2 ) : 0 < x1 < 1, 0 < x2 < 1}.
Suppose that our interest is not in X1 or in X2 but in Z = X1 + X2 . Once a suitable
probability model has been adopted, we shall see how to find the pdf of Z. To be
specific, let the nature of the random experiment be such that it is reasonable to
assume that the distribution of probability over the unit square is uniform. Then
the pdf of X1 and X2 may be written

fX1,X2(x1, x2) = { 1   0 < x1 < 1, 0 < x2 < 1      (2.2.1)
                 { 0   elsewhere,
and this describes the probability model. Now let the cdf of Z be denoted by
FZ (z) = P (X1 + X2 ≤ z). Then


FZ(z) = { 0                                                     z < 0
        { ∫_0^z ∫_0^{z−x1} dx2 dx1 = z²/2                       0 ≤ z < 1
        { 1 − ∫_{z−1}^1 ∫_{z−x1}^1 dx2 dx1 = 1 − (2 − z)²/2     1 ≤ z < 2
        { 1                                                     2 ≤ z.
Since F′Z(z) exists for all values of z, the pdf of Z may then be written

fZ(z) = { z       0 < z < 1
        { 2 − z   1 ≤ z < 2      (2.2.2)
        { 0       elsewhere.

In the last example, we used the cdf technique to find the distribution of the
transformed random vector. Recall in Chapter 1, Theorem 1.7.1 gave a transfor-
mation technique to directly determine the pdf of the transformed random variable
for one-to-one transformations. As discussed in Section 4.1 of the accompanying re-
source Mathematical Comments,4 this is based on the change-in-variable technique
for univariate integration. Further Section 4.2 of this resource shows that a simi-
lar change-in-variable technique exists for multiple integration. We now discuss in
general the transformation technique for the continuous case based on this theory.
Let (X1 , X2 ) have a jointly continuous distribution with pdf fX1 ,X2 (x1 , x2 ) and
support set S. Consider the transformed random vector (Y1 , Y2 ) = T (X1 , X2 ) where
T is a one-to-one continuous transformation. Let T = T (S) denote the support of
(Y1 , Y2 ). The transformation is depicted in Figure 2.2.1. Rewrite the transforma-
tion in terms of its components as (Y1 , Y2 ) = T (X1 , X2 ) = (u1 (X1 , X2 ), u2 (X1 , X2 )),
where the functions y1 = u1 (x1 , x2 ) and y2 = u2 (x1 , x2 ) define T . Since the trans-
formation is one-to-one, the inverse transformation T −1 exists. We write it as
x1 = w1 (y1 , y2 ), x2 = w2 (y1 , y2 ). Finally, we need the Jacobian of the transfor-
mation, which is the determinant of order 2 given by

J = | ∂x1/∂y1   ∂x1/∂y2 |
    | ∂x2/∂y1   ∂x2/∂y2 |.

Note that J plays the role of dx/dy in the univariate case. We assume that these
first-order partial derivatives are continuous and that the Jacobian J is not identi-
cally equal to zero in T .
Let B be any region5 in T and let A = T −1 (B) as shown in Figure 2.2.1.
Because the transformation T is one-to-one, P [(X1 , X2 ) ∈ A] = P [T (X1 , X2 ) ∈
T (A)] = P [(Y1 , Y2 ) ∈ B]. Then based on the change-in-variable technique, cited
above, we have

P[(X1, X2) ∈ A] = ∫∫_A fX1,X2(x1, x2) dx1 dx2
                = ∫∫_{T(A)} fX1,X2[T^{−1}(y1, y2)] |J| dy1 dy2
                = ∫∫_B fX1,X2[w1(y1, y2), w2(y1, y2)] |J| dy1 dy2.
4 See the reference for Mathematical Comments in the Preface.
5 Technically an event in the support of (Y1 , Y2 ).
Since B is arbitrary, the last integrand must be the joint pdf of (Y1, Y2). That is, the pdf of (Y1, Y2) is

fY1,Y2(y1, y2) = { fX1,X2[w1(y1, y2), w2(y1, y2)] |J|   (y1, y2) ∈ T      (2.2.3)
                { 0                                     elsewhere.

Several examples of this result are given next.

Figure 2.2.1: A general sketch of the supports of (X1, X2), (S), and (Y1, Y2), (T).

Example 2.2.3. Reconsider Example 2.2.2, where (X1 , X2 ) have the uniform dis-
tribution over the unit square with the pdf given in expression (2.2.1). The support
of (X1 , X2 ) is the set S = {(x1 , x2 ) : 0 < x1 < 1, 0 < x2 < 1} as depicted in Figure
2.2.2.
Suppose Y1 = X1 + X2 and Y2 = X1 − X2 . The transformation is given by

y1 = u1 (x1 , x2 ) = x1 + x2
y2 = u2 (x1 , x2 ) = x1 − x2 .

This transformation is one-to-one. We first determine the set T in the y1 y2 -plane


that is the mapping of S under this transformation. Now

x1 = w1(y1, y2) = (1/2)(y1 + y2)
x2 = w2(y1, y2) = (1/2)(y1 − y2).

To determine the set T in the y1y2-plane onto which S is mapped under the transformation, note that the boundaries of S are transformed as follows into the boundaries
Figure 2.2.2: The support of (X1, X2) of Example 2.2.3: the unit square with boundaries x1 = 0, x1 = 1, x2 = 0, and x2 = 1.

of T:

x1 = 0  into  0 = (1/2)(y1 + y2)
x1 = 1  into  1 = (1/2)(y1 + y2)
x2 = 0  into  0 = (1/2)(y1 − y2)
x2 = 1  into  1 = (1/2)(y1 − y2).

Accordingly, T is shown in Figure 2.2.3. Next, the Jacobian is given by

J = | ∂x1/∂y1   ∂x1/∂y2 |   | 1/2    1/2 |
    | ∂x2/∂y1   ∂x2/∂y2 | = | 1/2   −1/2 | = −1/2.

Although we suggest transforming the boundaries of S, others might want to use the inequalities

0 < x1 < 1 and 0 < x2 < 1

directly. These four inequalities become

0 < (1/2)(y1 + y2) < 1 and 0 < (1/2)(y1 − y2) < 1.

It is easy to see that these are equivalent to

−y1 < y2,   y2 < 2 − y1,   y2 < y1,   y1 − 2 < y2;

and they define the set T. Hence, the joint pdf of (Y1, Y2) is given by

fY1,Y2(y1, y2) = { fX1,X2[(1/2)(y1 + y2), (1/2)(y1 − y2)] |J| = 1/2   (y1, y2) ∈ T
                { 0                                                  elsewhere.
Figure 2.2.3: The support T of (Y1, Y2) of Example 2.2.3: the square bounded by the lines y2 = y1, y2 = 2 − y1, y2 = −y1, and y2 = y1 − 2.

The marginal pdf of Y1 is given by

fY1(y1) = ∫_{−∞}^{∞} fY1,Y2(y1, y2) dy2.

If we refer to Figure 2.2.3, we can see that

fY1(y1) = { ∫_{−y1}^{y1} (1/2) dy2 = y1           0 < y1 ≤ 1
          { ∫_{y1−2}^{2−y1} (1/2) dy2 = 2 − y1    1 < y1 < 2
          { 0                                     elsewhere,

which agrees with expression (2.2.2) of Example 2.2.2. In a similar manner, the
marginal pdf fY2 (y2 ) is given by
fY2(y2) = { ∫_{−y2}^{y2+2} (1/2) dy1 = y2 + 1    −1 < y2 ≤ 0
          { ∫_{y2}^{2−y2} (1/2) dy1 = 1 − y2      0 < y2 < 1
          { 0                                     elsewhere.

Example 2.2.4. Let Y1 = (1/2)(X1 − X2), where X1 and X2 have the joint pdf

fX1,X2(x1, x2) = { (1/4) exp{−(x1 + x2)/2}   0 < x1 < ∞, 0 < x2 < ∞
                { 0                          elsewhere.

Let Y2 = X2 so that y1 = (1/2)(x1 − x2), y2 = x2 or, equivalently, x1 = 2y1 + y2, x2 = y2, define a one-to-one transformation from S = {(x1, x2) : 0 < x1 < ∞, 0 < x2 < ∞} onto T = {(y1, y2) : −2y1 < y2 and 0 < y2 < ∞, −∞ < y1 < ∞}. The Jacobian of the transformation is

J = | 2  1 |
    | 0  1 | = 2;
hence the joint pdf of Y1 and Y2 is

fY1,Y2(y1, y2) = { (|2|/4) e^{−y1−y2}   (y1, y2) ∈ T
                { 0                     elsewhere.

Thus the pdf of Y1 is given by

fY1(y1) = { ∫_{−2y1}^{∞} (1/2) e^{−y1−y2} dy2 = (1/2) e^{y1}     −∞ < y1 < 0
          { ∫_0^{∞} (1/2) e^{−y1−y2} dy2 = (1/2) e^{−y1}          0 ≤ y1 < ∞,

or

fY1(y1) = (1/2) e^{−|y1|},   −∞ < y1 < ∞.   (2.2.4)
Recall from expression (1.9.20) of Chapter 1 that Y1 has the Laplace distribution.
This pdf is also frequently called the double exponential pdf.
Example 2.2.5. Let X1 and X2 have the joint pdf

fX1,X2(x1, x2) = { 10x1x2²   0 < x1 < x2 < 1
                 { 0         elsewhere.

Suppose Y1 = X1/X2 and Y2 = X2. Hence, the inverse transformation is x1 = y1y2 and x2 = y2, which has the Jacobian

J = | y2  y1 |
    | 0   1  | = y2.

The inequalities defining the support S of (X1 , X2 ) become

0 < y1 y2 , y1 y2 < y2 , and y2 < 1.

These inequalities are equivalent to

0 < y1 < 1 and 0 < y2 < 1,

which defines the support set T of (Y1 , Y2 ). Hence, the joint pdf of (Y1 , Y2 ) is

fY1,Y2(y1, y2) = 10(y1y2)(y2²)|y2| = 10y1y2⁴,   (y1, y2) ∈ T.

The marginal pdfs are

fY1(y1) = ∫_0^1 10y1y2⁴ dy2 = 2y1,   0 < y1 < 1,

zero elsewhere, and

fY2(y2) = ∫_0^1 10y1y2⁴ dy1 = 5y2⁴,   0 < y2 < 1,

zero elsewhere.
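As an illustration (not from the text), one can simulate from this joint pdf by writing it as 10x1x2² = (5x2⁴)(2x1/x2²) and drawing X2 = U^{1/5} and X1 = X2√V for independent uniforms U and V; the sketch below then checks the marginal pdf fY1(y1) = 2y1:

# Simulate (X1, X2) with joint pdf 10*x1*x2^2 on 0 < x1 < x2 < 1 and check fY1(y1) = 2*y1.
set.seed(20)
u <- runif(1e5); v <- runif(1e5)
x2 <- u^(1/5)                 # inverse-cdf draw from 5*x2^4
x1 <- x2 * sqrt(v)            # given x2, a draw from the factor 2*x1/x2^2 on (0, x2)
y1 <- x1 / x2
hist(y1, breaks = 40, probability = TRUE, main = "Y1 = X1/X2")
curve(2 * x, from = 0, to = 1, add = TRUE, lwd = 2)   # marginal pdf 2*y1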

In addition to the change-of-variable and cdf techniques for finding distributions


of functions of random variables, there is another method, called the moment
generating function (mgf ) technique, which works well for linear functions of
random variables. In Subsection 2.1.2, we pointed out that if Y = g(X1 , X2 ), then
E(Y ), if it exists, could be found by
E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x1, x2) fX1,X2(x1, x2) dx1 dx2

in the continuous case, with summations replacing integrals in the discrete case.
Certainly, that function g(X1 , X2 ) could be exp{tu(X1 , X2 )}, so that in reality
we would be finding the mgf of the function Z = u(X1 , X2 ). If we could then
recognize this mgf as belonging to a certain distribution, then Z would have that
distribution. We give two illustrations that demonstrate the power of this technique
by reconsidering Examples 2.2.1 and 2.2.4.
Example 2.2.6 (Continuation of Example 2.2.1). Here X1 and X2 have the joint
pmf
pX1,X2(x1, x2) = { (μ1^{x1} μ2^{x2} e^{−μ1} e^{−μ2}) / (x1! x2!)   x1 = 0, 1, 2, 3, . . . , x2 = 0, 1, 2, 3, . . .
                 { 0                                               elsewhere,

where μ1 and μ2 are fixed positive real numbers. Let Y = X1 + X2 and consider
E(e^{tY}) = Σ_{x1=0}^{∞} Σ_{x2=0}^{∞} e^{t(x1+x2)} pX1,X2(x1, x2)
          = Σ_{x1=0}^{∞} e^{tx1} (μ1^{x1} e^{−μ1}/x1!) Σ_{x2=0}^{∞} e^{tx2} (μ2^{x2} e^{−μ2}/x2!)
          = [ e^{−μ1} Σ_{x1=0}^{∞} (e^t μ1)^{x1}/x1! ] [ e^{−μ2} Σ_{x2=0}^{∞} (e^t μ2)^{x2}/x2! ]
          = [ e^{μ1(e^t−1)} ] [ e^{μ2(e^t−1)} ]
          = e^{(μ1+μ2)(e^t−1)}.

Notice that the factors in the brackets in the next-to-last equality are the mgfs of
X1 and X2 , respectively. Hence, the mgf of Y is the same as that of X1 except μ1
has been replaced by μ1 + μ2 . Therefore, by the uniqueness of mgfs, the pmf of Y
must be
pY(y) = e^{−(μ1+μ2)} (μ1 + μ2)^y / y!,   y = 0, 1, 2, . . . ,
which is the same pmf that was obtained in Example 2.2.1.
Example 2.2.7 (Continuation of Example 2.2.4). Here X1 and X2 have the joint
pdf

fX1,X2(x1, x2) = { (1/4) exp{−(x1 + x2)/2}   0 < x1 < ∞, 0 < x2 < ∞
                { 0                          elsewhere.
So the mgf of Y = (1/2)(X1 − X2) is given by

E(e^{tY}) = ∫_0^∞ ∫_0^∞ e^{t(x1−x2)/2} (1/4) e^{−(x1+x2)/2} dx1 dx2
          = [ ∫_0^∞ (1/2) e^{−x1(1−t)/2} dx1 ] [ ∫_0^∞ (1/2) e^{−x2(1+t)/2} dx2 ]
          = [1/(1 − t)] [1/(1 + t)] = 1/(1 − t²),

provided that 1 − t > 0 and 1 + t > 0; i.e., −1 < t < 1. However, the mgf of a Laplace distribution with pdf (1.9.20) is

∫_{−∞}^{∞} e^{tx} (e^{−|x|}/2) dx = ∫_{−∞}^{0} (e^{(1+t)x}/2) dx + ∫_0^{∞} (e^{(t−1)x}/2) dx
                                  = 1/[2(1 + t)] + 1/[2(1 − t)] = 1/(1 − t²),
provided −1 < t < 1. Thus, by the uniqueness of mgfs, Y has a Laplace distribution
with pdf (1.9.20).

EXERCISES
2.2.1. If p(x1, x2) = (2/3)^{x1+x2} (1/3)^{2−x1−x2}, (x1, x2) = (0, 0), (0, 1), (1, 0), (1, 1), zero elsewhere, is the joint pmf of X1 and X2, find the joint pmf of Y1 = X1 − X2 and Y2 = X1 + X2.
2.2.2. Let X1 and X2 have the joint pmf p(x1 , x2 ) = x1 x2 /36, x1 = 1, 2, 3 and
x2 = 1, 2, 3, zero elsewhere. Find first the joint pmf of Y1 = X1 X2 and Y2 = X2 ,
and then find the marginal pmf of Y1 .
2.2.3. Let X1 and X2 have the joint pdf h(x1 , x2 ) = 2e−x1 −x2 , 0 < x1 < x2 < ∞,
zero elsewhere. Find the joint pdf of Y1 = 2X1 and Y2 = X2 − X1 .
2.2.4. Let X1 and X2 have the joint pdf h(x1 , x2 ) = 8x1 x2 , 0 < x1 < x2 < 1, zero
elsewhere. Find the joint pdf of Y1 = X1 /X2 and Y2 = X2 .
Hint: Use the inequalities 0 < y1 y2 < y2 < 1 in considering the mapping from S
onto T .
2.2.5. Let X1 and X2 be continuous random variables with the joint probability
density function fX1 ,X2 (x1 , x2 ), −∞ < xi < ∞, i = 1, 2. Let Y1 = X1 + X2 and
Y2 = X2 .
(a) Find the joint pdf fY1 ,Y2 .
(b) Show that

fY1(y1) = ∫_{−∞}^{∞} fX1,X2(y1 − y2, y2) dy2,   (2.2.5)
which is sometimes called the convolution formula.

2.2.6. Suppose X1 and X2 have the joint pdf fX1 ,X2 (x1 , x2 ) = e−(x1 +x2 ) , 0 < xi <
∞, i = 1, 2, zero elsewhere.
(a) Use formula (2.2.5) to find the pdf of Y1 = X1 + X2 .
(b) Find the mgf of Y1 .
2.2.7. Use the formula (2.2.5) to find the pdf of Y1 = X1 + X2 , where X1 and X2
have the joint pdf fX1 ,X2 (x1 , x2 ) = 2e−(x1 +x2 ) , 0 < x1 < x2 < ∞, zero elsewhere.
2.2.8. Suppose X1 and X2 have the joint pdf

f(x1, x2) = { e^{−x1} e^{−x2}   x1 > 0, x2 > 0
            { 0                 elsewhere.
For constants w1 > 0 and w2 > 0, let W = w1 X1 + w2 X2 .
(a) Show that the pdf of W is

fW(w) = { [1/(w1 − w2)] (e^{−w/w1} − e^{−w/w2})   w > 0
        { 0                                       elsewhere.

(b) Verify that fW (w) > 0 for w > 0.


(c) Note that the pdf fW (w) has an indeterminate form when w1 = w2 . Rewrite
fW (w) using h defined as w1 − w2 = h. Then use l’Hôpital’s rule to show that
when w1 = w2 , the pdf is given by fW (w) = (w/w12 ) exp{−w/w1 } for w > 0
and zero elsewhere.

2.3 Conditional Distributions and Expectations


In Section 2.1 we introduced the joint probability distribution of a pair of random
variables. We also showed how to recover the individual (marginal) distributions
for the random variables from the joint distribution. In this section, we discuss
conditional distributions, i.e., the distribution of one of the random variables when
the other has assumed a specific value. We discuss this first for the discrete case,
which follows easily from the concept of conditional probability presented in Section
1.4.
Let X1 and X2 denote random variables of the discrete type, which have the joint
pmf pX1 ,X2 (x1 , x2 ) that is positive on the support set S and is zero elsewhere. Let
pX1 (x1 ) and pX2 (x2 ) denote, respectively, the marginal probability mass functions
of X1 and X2 . Let x1 be a point in the support of X1 ; hence, pX1 (x1 ) > 0. Using
the definition of conditional probability, we have
P(X2 = x2 | X1 = x1) = P(X1 = x1, X2 = x2)/P(X1 = x1) = pX1,X2(x1, x2)/pX1(x1)

for all x2 in the support SX2 of X2. Define this function as

pX2|X1(x2|x1) = pX1,X2(x1, x2)/pX1(x1),   x2 ∈ SX2.   (2.3.1)

For any fixed x1 with pX1 (x1 ) > 0, this function pX2 |X1 (x2 |x1 ) satisfies the con-
ditions of being a pmf of the discrete type because pX2 |X1 (x2 |x1 ) is nonnegative
and
Σ_{x2} pX2|X1(x2|x1) = Σ_{x2} pX1,X2(x1, x2)/pX1(x1)
                     = [1/pX1(x1)] Σ_{x2} pX1,X2(x1, x2) = pX1(x1)/pX1(x1) = 1.

We call pX2 |X1 (x2 |x1 ) the conditional pmf of the discrete type of random variable
X2 , given that the discrete type of random variable X1 = x1 . In a similar manner,
provided x2 ∈ SX2 , we define the symbol pX1 |X2 (x1 |x2 ) by the relation

pX1|X2(x1|x2) = pX1,X2(x1, x2)/pX2(x2),   x1 ∈ SX1,

and we call pX1 |X2 (x1 |x2 ) the conditional pmf of the discrete type of random variable
X1 , given that the discrete type of random variable X2 = x2 . We often abbreviate
pX1 |X2 (x1 |x2 ) by p1|2 (x1 |x2 ) and pX2 |X1 (x2 |x1 ) by p2|1 (x2 |x1 ). Similarly, p1 (x1 )
and p2 (x2 ) are used to denote the respective marginal pmfs.
Now let X1 and X2 denote random variables of the continuous type and have the
joint pdf fX1 ,X2 (x1 , x2 ) and the marginal probability density functions fX1 (x1 ) and
fX2 (x2 ), respectively. We use the results of the preceding paragraph to motivate
a definition of a conditional pdf of a continuous type of random variable. When
fX1 (x1 ) > 0, we define the symbol fX2 |X1 (x2 |x1 ) by the relation

fX2|X1(x2|x1) = fX1,X2(x1, x2)/fX1(x1).   (2.3.2)
In this relation, x1 is to be thought of as having a fixed (but any fixed) value for
which fX1 (x1 ) > 0. It is evident that fX2 |X1 (x2 |x1 ) is nonnegative and that
∫_{−∞}^{∞} fX2|X1(x2|x1) dx2 = ∫_{−∞}^{∞} [fX1,X2(x1, x2)/fX1(x1)] dx2
                             = [1/fX1(x1)] ∫_{−∞}^{∞} fX1,X2(x1, x2) dx2
                             = fX1(x1)/fX1(x1) = 1.
That is, fX2 |X1 (x2 |x1 ) has the properties of a pdf of one continuous type of random
variable. It is called the conditional pdf of the continuous type of random variable
X2 , given that the continuous type of random variable X1 has the value x1 . When
fX2 (x2 ) > 0, the conditional pdf of the continuous random variable X1 , given that
the continuous type of random variable X2 has the value x2 , is defined by
fX1|X2(x1|x2) = fX1,X2(x1, x2)/fX2(x2),   fX2(x2) > 0.
2.3. Conditional Distributions and Expectations 111

We often abbreviate these conditional pdfs by f1|2 (x1 |x2 ) and f2|1 (x2 |x1 ), respec-
tively. Similarly, f1 (x1 ) and f2 (x2 ) are used to denote the respective marginal pdfs.
Since each of f2|1 (x2 |x1 ) and f1|2 (x1 |x2 ) is a pdf of one random variable, each
has all the properties of such a pdf. Thus we can compute probabilities and math-
ematical expectations. If the random variables are of the continuous type, the
probability
P(a < X2 < b | X1 = x1) = ∫_a^b f2|1(x2|x1) dx2
is called “the conditional probability that a < X2 < b, given that X1 = x1 .” If there
is no ambiguity, this may be written in the form P (a < X2 < b|x1 ). Similarly, the
conditional probability that c < X1 < d, given X2 = x2 , is
P(c < X1 < d | X2 = x2) = ∫_c^d f1|2(x1|x2) dx1.

If u(X2) is a function of X2, the conditional expectation of u(X2), given that X1 = x1, if it exists, is given by

E[u(X2)|x1] = ∫_{−∞}^{∞} u(x2) f2|1(x2|x1) dx2.

Note that E[u(X2 )|x1 ] is a function of x1 . If they do exist, then E(X2 |x1 ) is the
mean and E{[X2 − E(X2 |x1 )]2 |x1 } is the conditional variance of the conditional
distribution of X2 , given X1 = x1 , which can be written more simply as Var(X2 |x1 ).
It is convenient to refer to these as the “conditional mean” and the “conditional
variance” of X2 , given X1 = x1 . Of course, we have

Var(X2 |x1 ) = E(X22 |x1 ) − [E(X2 |x1 )]2

from an earlier result. In a like manner, the conditional expectation of u(X1 ), given
X2 = x2 , if it exists, is given by
E[u(X1)|x2] = ∫_{−∞}^{∞} u(x1) f1|2(x1|x2) dx1.

With random variables of the discrete type, these conditional probabilities and
conditional expectations are computed by using summation instead of integration.
An illustrative example follows.
Example 2.3.1. Let X1 and X2 have the joint pdf

f(x1, x2) = { 2   0 < x1 < x2 < 1
            { 0   elsewhere.

Then the marginal probability density functions are, respectively,

f1(x1) = { ∫_{x1}^1 2 dx2 = 2(1 − x1)   0 < x1 < 1
         { 0                            elsewhere,
and

f2(x2) = { ∫_0^{x2} 2 dx1 = 2x2   0 < x2 < 1
         { 0                      elsewhere.

The conditional pdf of X1, given X2 = x2, 0 < x2 < 1, is

f1|2(x1|x2) = { 2/(2x2) = 1/x2   0 < x1 < x2 < 1
              { 0                elsewhere.

Here the conditional mean and the conditional variance of X1 , given X2 = x2 , are
respectively,
E(X1|x2) = ∫_{−∞}^{∞} x1 f1|2(x1|x2) dx1 = ∫_0^{x2} x1 (1/x2) dx1 = x2/2,   0 < x2 < 1,

and

Var(X1|x2) = ∫_0^{x2} (x1 − x2/2)² (1/x2) dx1 = x2²/12,   0 < x2 < 1.
Finally, we compare the values of

P(0 < X1 < 1/2 | X2 = 3/4)  and  P(0 < X1 < 1/2).

We have

P(0 < X1 < 1/2 | X2 = 3/4) = ∫_0^{1/2} f1|2(x1 | 3/4) dx1 = ∫_0^{1/2} (4/3) dx1 = 2/3,

but

P(0 < X1 < 1/2) = ∫_0^{1/2} f1(x1) dx1 = ∫_0^{1/2} 2(1 − x1) dx1 = 3/4.
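A two-line R check (not part of the text) of these values:

integrate(function(x1) 0 * x1 + 4/3, lower = 0, upper = 1/2)$value   # 2/3, conditional on X2 = 3/4
integrate(function(x1) 2 * (1 - x1), lower = 0, upper = 1/2)$value   # 3/4, unconditional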
Since E(X2 |x1 ) is a function of x1 , then E(X2 |X1 ) is a random variable with its
own distribution, mean, and variance. Let us consider the following illustration of
this.
Example 2.3.2. Let X1 and X2 have the joint pdf

f(x1, x2) = { 6x2   0 < x2 < x1 < 1
            { 0     elsewhere.

Then the marginal pdf of X1 is

f1(x1) = ∫_0^{x1} 6x2 dx2 = 3x1²,   0 < x1 < 1,
zero elsewhere. The conditional pdf of X2, given X1 = x1, is

f2|1(x2|x1) = 6x2/(3x1²) = 2x2/x1²,   0 < x2 < x1,

zero elsewhere, where 0 < x1 < 1. The conditional mean of X2, given X1 = x1, is

E(X2|x1) = ∫_0^{x1} x2 (2x2/x1²) dx2 = (2/3)x1,   0 < x1 < 1.

Now E(X2 |X1 ) = 2X1 /3 is a random variable, say Y . The cdf of Y = 2X1 /3 is
 
G(y) = P(Y ≤ y) = P(X1 ≤ 3y/2),   0 ≤ y < 2/3.

From the pdf f1(x1), we have

G(y) = ∫_0^{3y/2} 3x1² dx1 = 27y³/8,   0 ≤ y < 2/3.

Of course, G(y) = 0 if y < 0, and G(y) = 1 if 2/3 < y. The pdf, mean, and variance of Y = 2X1/3 are

g(y) = 81y²/8,   0 ≤ y < 2/3,

zero elsewhere,

E(Y) = ∫_0^{2/3} y (81y²/8) dy = 1/2,

and

Var(Y) = ∫_0^{2/3} y² (81y²/8) dy − 1/4 = 1/60.
Since the marginal pdf of X2 is

f2(x2) = ∫_{x2}^1 6x2 dx1 = 6x2(1 − x2),   0 < x2 < 1,

zero elsewhere, it is easy to show that E(X2) = 1/2 and Var(X2) = 1/20. That is, here

E(Y ) = E[E(X2 |X1 )] = E(X2 )

and
Var(Y ) = Var[E(X2 |X1 )] ≤ Var(X2 ).

Example 2.3.2 is excellent, as it provides us with the opportunity to apply many of these new definitions as well as review the cdf technique for finding the distribution of a function of a random variable, namely Y = 2X1/3. Moreover, the two observations at the end of this example are no accident, because they are true in general.

Theorem 2.3.1. Let (X1 , X2 ) be a random vector such that the variance of X2 is
finite. Then,
(a) E[E(X2 |X1 )] = E(X2 ).
(b) Var[E(X2 |X1 )] ≤ Var(X2 ).
Proof: The proof is for the continuous case. To obtain it for the discrete case,
exchange summations for integrals. We first prove (a). Note that
E(X2) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x2 f(x1, x2) dx2 dx1
      = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} x2 (f(x1, x2)/f1(x1)) dx2 ] f1(x1) dx1
      = ∫_{−∞}^{∞} E(X2|x1) f1(x1) dx1
      = E[E(X2|X1)],

which is the first result.


Next we show (b). Consider, with μ2 = E(X2),

Var(X2) = E[(X2 − μ2)²]
        = E{[X2 − E(X2|X1) + E(X2|X1) − μ2]²}
        = E{[X2 − E(X2|X1)]²} + E{[E(X2|X1) − μ2]²}
          + 2E{[X2 − E(X2|X1)][E(X2|X1) − μ2]}.

We show that the last term of the right-hand member of the immediately preceding
equation is zero. It is equal to
2 ∫_{−∞}^{∞} ∫_{−∞}^{∞} [x2 − E(X2|x1)][E(X2|x1) − μ2] f(x1, x2) dx2 dx1
   = 2 ∫_{−∞}^{∞} [E(X2|x1) − μ2] { ∫_{−∞}^{∞} [x2 − E(X2|x1)] (f(x1, x2)/f1(x1)) dx2 } f1(x1) dx1.

But E(X2 |x1 ) is the conditional mean of X2 , given X1 = x1 . Since the expression
in the inner braces is equal to

E(X2 |x1 ) − E(X2 |x1 ) = 0,

the double integral is equal to zero. Accordingly, we have

Var(X2) = E{[X2 − E(X2|X1)]²} + E{[E(X2|X1) − μ2]²}.

The first term in the right-hand member of this equation is nonnegative because it is the expected value of a nonnegative function, namely [X2 − E(X2|X1)]². Since E[E(X2|X1)] = μ2, the second term is Var[E(X2|X1)]. Hence we have

Var(X2) ≥ Var[E(X2|X1)],

which completes the proof.

Intuitively, this result has a useful interpretation. Both the random variables X2 and E(X2|X1) have the same mean μ2. If we did not know μ2, we could use either of the two random variables to guess at the unknown μ2. Since, however, Var(X2) ≥ Var[E(X2|X1)], we would put more reliance in E(X2|X1) as a guess. That is, if we observe the pair (X1, X2) to be (x1, x2), we would prefer to use E(X2|x1) rather than x2 as a guess at the unknown μ2. When studying the use of sufficient statistics in estimation in Chapter 7, we make use of this famous result, attributed to C. R. Rao and David Blackwell.
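A small simulation in R (an illustration, not from the text) of Example 2.3.2 exhibits both parts of Theorem 2.3.1 numerically:

# Simulate Example 2.3.2: X1 has pdf 3*x1^2; given X1 = x1, X2 has pdf 2*x2/x1^2 on (0, x1).
set.seed(20)
n <- 1e5
x1 <- runif(n)^(1/3)          # inverse-cdf draw from 3*x1^2
x2 <- x1 * sqrt(runif(n))     # inverse-cdf draw from the conditional pdf
condmean <- 2 * x1 / 3        # E(X2 | X1)
c(mean(x2), mean(condmean))   # both near E(X2) = 1/2            [part (a)]
c(var(condmean), var(x2))     # about 1/60 versus about 1/20     [part (b)]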
We finish this section with an example illustrating Theorem 2.3.1.
Example 2.3.3. Let X1 and X2 be discrete random variables. Suppose the condi-
tional pmf of X1 given X2 and the marginal distribution of X2 are given by
p(x1|x2) = (x2 choose x1) (1/2)^{x2},   x1 = 0, 1, . . . , x2,
p(x2) = (2/3)(1/3)^{x2−1},   x2 = 1, 2, 3, . . . .
Let us determine the mgf of X1. For fixed x2, by the binomial theorem,

E(e^{tX1} | x2) = Σ_{x1=0}^{x2} (x2 choose x1) e^{tx1} (1/2)^{x2−x1} (1/2)^{x1}
               = (1/2 + (1/2)e^t)^{x2}.
Hence, by the geometric series and Theorem 2.3.1,

E(e^{tX1}) = E[E(e^{tX1} | X2)]
           = Σ_{x2=1}^{∞} (1/2 + (1/2)e^t)^{x2} (2/3)(1/3)^{x2−1}
           = (2/3)(1/2 + (1/2)e^t) Σ_{x2=1}^{∞} (1/6 + (1/6)e^t)^{x2−1}
           = (2/3)(1/2 + (1/2)e^t) · 1/(1 − [(1/6) + (1/6)e^t]),

provided (1/6) + (1/6)e^t < 1, or t < log 5 (which includes t = 0).

EXERCISES
2.3.1. Let X1 and X2 have the joint pdf f (x1 , x2 ) = x1 + x2 , 0 < x1 < 1, 0 <
x2 < 1, zero elsewhere. Find the conditional mean and variance of X2 , given
X1 = x1 , 0 < x1 < 1.
2.3.2. Let f1|2(x1|x2) = c1x1/x2², 0 < x1 < x2, 0 < x2 < 1, zero elsewhere, and f2(x2) = c2x2⁴, 0 < x2 < 1, zero elsewhere, denote, respectively, the conditional pdf of X1, given X2 = x2, and the marginal pdf of X2. Determine:

(a) The constants c1 and c2.

(b) The joint pdf of X1 and X2.

(c) P(1/4 < X1 < 1/2 | X2 = 5/8).

(d) P(1/4 < X1 < 1/2).

2.3.3. Let f(x1, x2) = 21x1²x2³, 0 < x1 < x2 < 1, zero elsewhere, be the joint pdf of X1 and X2.

(a) Find the conditional mean and variance of X1 , given X2 = x2 , 0 < x2 < 1.

(b) Find the distribution of Y = E(X1 |X2 ).

(c) Determine E(Y ) and Var(Y ) and compare these to E(X1 ) and Var(X1 ), re-
spectively.

2.3.4. Suppose X1 and X2 are random variables of the discrete type that have
the joint pmf p(x1 , x2 ) = (x1 + 2x2 )/18, (x1 , x2 ) = (1, 1), (1, 2), (2, 1), (2, 2), zero
elsewhere. Determine the conditional mean and variance of X2 , given X1 = x1 , for
x1 = 1 or 2. Also, compute E(3X1 − 2X2 ).

2.3.5. Let X1 and X2 be two random variables such that the conditional distribu-
tions and means exist. Show that:

(a) E(X1 + X2 | X2 ) = E(X1 | X2 ) + X2 ,

(b) E(u(X2 ) | X2 ) = u(X2 ).

2.3.6. Let the joint pdf of X and Y be given by

f(x, y) = { 2/(1 + x + y)³   0 < x < ∞, 0 < y < ∞
          { 0                elsewhere.

(a) Compute the marginal pdf of X and the conditional pdf of Y , given X = x.

(b) For a fixed X = x, compute E(1 + x + Y |x) and use the result to compute
E(Y |x).

2.3.7. Suppose X1 and X2 are discrete random variables which have the joint pmf
p(x1 , x2 ) = (3x1 + x2 )/24, (x1 , x2 ) = (1, 1), (1, 2), (2, 1), (2, 2), zero elsewhere. Find
the conditional mean E(X2 |x1 ), when x1 = 1.

2.3.8. Let X and Y have the joint pdf f (x, y) = 2 exp{−(x + y)}, 0 < x < y < ∞,
zero elsewhere. Find the conditional mean E(Y |x) of Y , given X = x.

2.3.9. Five cards are drawn at random and without replacement from an ordinary
deck of cards. Let X1 and X2 denote, respectively, the number of spades and the
number of hearts that appear in the five cards.

(a) Determine the joint pmf of X1 and X2 .


(b) Find the two marginal pmfs.

(c) What is the conditional pmf of X2 , given X1 = x1 ?


2.3.10. Let X1 and X2 have the joint pmf p(x1, x2) described as follows:

(x1, x2)      (0, 0)   (0, 1)   (1, 0)   (1, 1)   (2, 0)   (2, 1)
p(x1, x2)      1/18     3/18     4/18     3/18     6/18     1/18

and p(x1 , x2 ) is equal to zero elsewhere. Find the two marginal probability mass
functions and the two conditional means.
Hint: Write the probabilities in a rectangular array.

2.3.11. Let us choose at random a point from the interval (0, 1) and let the random
variable X1 be equal to the number that corresponds to that point. Then choose
a point at random from the interval (0, x1 ), where x1 is the experimental value of
X1 ; and let the random variable X2 be equal to the number that corresponds to
this point.

(a) Make assumptions about the marginal pdf f1 (x1 ) and the conditional pdf
f2|1 (x2 |x1 ).

(b) Compute P (X1 + X2 ≥ 1).


(c) Find the conditional mean E(X1 |x2 ).
2.3.12. Let f (x) and F (x) denote, respectively, the pdf and the cdf of the random
variable X. The conditional pdf of X, given X > x0 , x0 a fixed number, is defined
by f (x|X > x0 ) = f (x)/[1−F (x0 )], x0 < x, zero elsewhere. This kind of conditional
pdf finds application in a problem of time until death, given survival until time x0 .

(a) Show that f (x|X > x0 ) is a pdf.


(b) Let f (x) = e−x , 0 < x < ∞, and zero elsewhere. Compute P (X > 2|X > 1).

2.4 Independent Random Variables


Let X1 and X2 denote the random variables of the continuous type that have the
joint pdf f (x1 , x2 ) and marginal probability density functions f1 (x1 ) and f2 (x2 ),
respectively. In accordance with the definition of the conditional pdf f2|1 (x2 |x1 ),
we may write the joint pdf f (x1 , x2 ) as

f (x1 , x2 ) = f2|1 (x2 |x1 )f1 (x1 ).



Suppose that we have an instance where f2|1 (x2 |x1 ) does not depend upon x1 . Then
the marginal pdf of X2 is, for random variables of the continuous type,
 ∞
f2 (x2 ) = f2|1 (x2 |x1 )f1 (x1 ) dx1
−∞
 ∞
= f2|1 (x2 |x1 ) f1 (x1 ) dx1
−∞
= f2|1 (x2 |x1 ).

Accordingly,

f2 (x2 ) = f2|1 (x2 |x1 ) and f (x1 , x2 ) = f1 (x1 )f2 (x2 ),

when f2|1 (x2 |x1 ) does not depend upon x1 . That is, if the conditional distribution
of X2 , given X1 = x1 , is independent of any assumption about x1 , then f (x1 , x2 ) =
f1 (x1 )f2 (x2 ).
The same discussion applies to the discrete case too, which we summarize in
parentheses in the following definition.

Definition 2.4.1 (Independence). Let the random variables X1 and X2 have the
joint pdf f (x1 , x2 ) [joint pmf p(x1 , x2 )] and the marginal pdfs [pmfs] f1 (x1 ) [p1 (x1 )]
and f2 (x2 ) [p2 (x2 )], respectively. The random variables X1 and X2 are said to be
independent if, and only if, f (x1 , x2 ) ≡ f1 (x1 )f2 (x2 ) [p(x1 , x2 ) ≡ p1 (x1 )p2 (x2 )].
Random variables that are not independent are said to be dependent.

Remark 2.4.1. Two comments should be made about the preceding definition.
First, the product of two positive functions f1 (x1 )f2 (x2 ) means a function that is
positive on the product space. That is, if f1 (x1 ) and f2 (x2 ) are positive on, and
only on, the respective spaces S1 and S2 , then the product of f1 (x1 ) and f2 (x2 )
is positive on, and only on, the product space S = {(x1 , x2 ) : x1 ∈ S1 , x2 ∈ S2 }.
For instance, if S1 = {x1 : 0 < x1 < 1} and S2 = {x2 : 0 < x2 < 3}, then
S = {(x1 , x2 ) : 0 < x1 < 1, 0 < x2 < 3}. The second remark pertains to the
identity. The identity in Definition 2.4.1 should be interpreted as follows. There
may be certain points (x1 , x2 ) ∈ S at which f (x1 , x2 ) = f1 (x1 )f2 (x2 ). However, if A
is the set of points (x1 , x2 ) at which the equality does not hold, then P (A) = 0. In
subsequent theorems and the subsequent generalizations, a product of nonnegative
functions and an identity should be interpreted in an analogous manner.

Example 2.4.1. Suppose an urn contains 10 blue, 8 red, and 7 yellow balls that
are the same except for color. Suppose 4 balls are drawn without replacement. Let
X and Y be the number of blue and red balls drawn, respectively. The joint pmf
of (X, Y) is

p(x, y) = \frac{\binom{10}{x}\binom{8}{y}\binom{7}{4-x-y}}{\binom{25}{4}}, \quad 0 \le x, y \le 4;\; x + y \le 4.

Since X + Y ≤ 4, it would seem that X and Y are dependent. To see that this is
true by definition, we first find the marginal pmfs, which are:

p_X(x) = \frac{\binom{10}{x}\binom{15}{4-x}}{\binom{25}{4}}, \quad 0 \le x \le 4; \qquad p_Y(y) = \frac{\binom{8}{y}\binom{17}{4-y}}{\binom{25}{4}}, \quad 0 \le y \le 4.

To show dependence, we need to find only one point in the support of (X, Y) where
the joint pmf does not factor into the product of the marginal pmfs. Suppose we
select the point x = 1 and y = 1. Then, using R for calculation, we compute (to 4
places):

p(1, 1) = 10 \cdot 8 \cdot \binom{7}{2} \Big/ \binom{25}{4} = 0.1328

p_X(1) = 10\binom{15}{3} \Big/ \binom{25}{4} = 0.3597

p_Y(1) = 8\binom{17}{3} \Big/ \binom{25}{4} = 0.4300.

Since 0.1328 \neq 0.1547 = 0.3597 \cdot 0.4300, X and Y are dependent random variables.
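Because the text refers to using R for these numbers, here is a sketch of the computation with R's choose() function (the variable names are ours):

# R computation of the probabilities used above (a sketch).
p11 <- choose(10, 1) * choose(8, 1) * choose(7, 2) / choose(25, 4)
pX1 <- choose(10, 1) * choose(15, 3) / choose(25, 4)
pY1 <- choose(8, 1)  * choose(17, 3) / choose(25, 4)
round(c(p11 = p11, pX1 = pX1, pY1 = pY1, product = pX1 * pY1), 4)
# p11 = 0.1328 differs from pX1 * pY1 = 0.1547, so X and Y are dependent.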

Example 2.4.2. Let the joint pdf of X1 and X2 be

f(x_1, x_2) = x_1 + x_2, \quad 0 < x_1 < 1,\; 0 < x_2 < 1, zero elsewhere.

We show that X1 and X2 are dependent. Here the marginal probability density
functions are

f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2 = \int_0^1 (x_1 + x_2)\, dx_2 = x_1 + \tfrac{1}{2}, \quad 0 < x_1 < 1, zero elsewhere,

and

f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1 = \int_0^1 (x_1 + x_2)\, dx_1 = \tfrac{1}{2} + x_2, \quad 0 < x_2 < 1, zero elsewhere.

Since f(x_1, x_2) \not\equiv f_1(x_1)f_2(x_2), the random variables X1 and X2 are dependent.
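A numerical illustration (ours, not the text's): the marginal pdfs can be obtained by numerical integration in R, and evaluating the joint pdf and the product of the marginals at a single point already exhibits the failure of factorization.

# Sketch: check f(x1, x2) against f1(x1)*f2(x2) at one point.
f  <- function(x1, x2) x1 + x2
f1 <- function(a) integrate(function(x2) f(a, x2), 0, 1)$value   # equals a + 1/2
f2 <- function(b) integrate(function(x1) f(x1, b), 0, 1)$value   # equals 1/2 + b
x1 <- 0.25; x2 <- 0.75
c(joint = f(x1, x2), product = f1(x1) * f2(x2))   # 1.0000 versus 0.9375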

The following theorem makes it possible to assert that the random variables X1
and X2 of Example 2.4.2 are dependent, without computing the marginal probability
density functions.

Theorem 2.4.1. Let the random variables X1 and X2 have supports S1 and S2 ,
respectively, and have the joint pdf f (x1 , x2 ). Then X1 and X2 are independent if

and only if f (x1 , x2 ) can be written as a product of a nonnegative function of x1


and a nonnegative function of x2 . That is,

f (x1 , x2 ) ≡ g(x1 )h(x2 ),

where g(x1 ) > 0, x1 ∈ S1 , zero elsewhere, and h(x2 ) > 0, x2 ∈ S2 , zero elsewhere.
Proof. If X1 and X2 are independent, then f (x1 , x2 ) ≡ f1 (x1 )f2 (x2 ), where f1 (x1 )
and f2 (x2 ) are the marginal probability density functions of X1 and X2 , respectively.
Thus the condition f (x1 , x2 ) ≡ g(x1 )h(x2 ) is fulfilled.
Conversely, if f (x1 , x2 ) ≡ g(x1 )h(x2 ), then, for random variables of the continuous type, we have

f_1(x_1) = \int_{-\infty}^{\infty} g(x_1)h(x_2)\, dx_2 = g(x_1)\int_{-\infty}^{\infty} h(x_2)\, dx_2 = c_1 g(x_1)

and

f_2(x_2) = \int_{-\infty}^{\infty} g(x_1)h(x_2)\, dx_1 = h(x_2)\int_{-\infty}^{\infty} g(x_1)\, dx_1 = c_2 h(x_2),

where c1 and c2 are constants, not functions of x1 or x2. Moreover, c1 c2 = 1 because

1 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x_1)h(x_2)\, dx_1\, dx_2 = \left[\int_{-\infty}^{\infty} g(x_1)\, dx_1\right]\left[\int_{-\infty}^{\infty} h(x_2)\, dx_2\right] = c_2 c_1.

These results imply that

f (x1 , x2 ) ≡ g(x1 )h(x2 ) ≡ c1 g(x1 )c2 h(x2 ) ≡ f1 (x1 )f2 (x2 ).

Accordingly, X1 and X2 are independent.

This theorem is true for the discrete case also. Simply replace the joint pdf by
the joint pmf. For instance, the discrete random variables X and Y of Example
2.4.1 are immediately seen to be dependent because the support of (X, Y ) is not a
product space.
Next, consider the joint distribution of the continuous random vector (X, Y )
given in Example 2.1.3. The joint pdf is

f(x, y) = 4x e^{-x^2} y e^{-y^2}, \quad x > 0,\; y > 0,

which is a product of a nonnegative function of x and a nonnegative function of y.


Further, the joint support is a product space. Hence, X and Y are independent
random variables.
Example 2.4.3. Let the pdf of the random variable X1 and X2 be f (x1 , x2 ) =
8x1 x2 , 0 < x1 < x2 < 1, zero elsewhere. The formula 8x1 x2 might suggest to some
that X1 and X2 are independent. However, if we consider the space S = {(x1 , x2 ) :
0 < x1 < x2 < 1}, we see that it is not a product space. This should make it clear
that, in general, X1 and X2 must be dependent if the space of positive probability
density of X1 and X2 is bounded by a curve that is neither a horizontal nor a
vertical line.

Instead of working with pdfs (or pmfs) we could have presented independence
in terms of cumulative distribution functions. The following theorem shows the
equivalence.

Theorem 2.4.2. Let (X1 , X2 ) have the joint cdf F (x1 , x2 ) and let X1 and X2 have
the marginal cdfs F1 (x1 ) and F2 (x2 ), respectively. Then X1 and X2 are independent
if and only if

F (x1 , x2 ) = F1 (x1 )F2 (x2 ) for all (x1 , x2 ) ∈ R2 . (2.4.1)

Proof: We give the proof for the continuous case. Suppose expression (2.4.1) holds.
Then the mixed second partial is

\frac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2) = f_1(x_1)f_2(x_2).

Hence, X1 and X2 are independent. Conversely, suppose X1 and X2 are independent. Then by the definition of the joint cdf,

F(x_1, x_2) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2} f_1(w_1)f_2(w_2)\, dw_2\, dw_1 = \int_{-\infty}^{x_1} f_1(w_1)\, dw_1 \cdot \int_{-\infty}^{x_2} f_2(w_2)\, dw_2 = F_1(x_1)F_2(x_2).

Hence, condition (2.4.1) is true.

We now give a theorem that frequently simplifies the calculations of probabilities


of events that involves independent variables.

Theorem 2.4.3. The random variables X1 and X2 are independent random vari-
ables if and only if the following condition holds,

P (a < X1 ≤ b, c < X2 ≤ d) = P (a < X1 ≤ b)P (c < X2 ≤ d) (2.4.2)

for every a < b and c < d, where a, b, c, and d are constants.

Proof: If X1 and X2 are independent, then an application of the last theorem and
expression (2.1.2) shows that

P(a < X_1 \le b, c < X_2 \le d) = F(b, d) - F(a, d) - F(b, c) + F(a, c)
  = F_1(b)F_2(d) - F_1(a)F_2(d) - F_1(b)F_2(c) + F_1(a)F_2(c)
  = [F_1(b) - F_1(a)][F_2(d) - F_2(c)],

which is the right side of expression (2.4.2). Conversely, condition (2.4.2) implies
that the joint cdf of (X1 , X2 ) factors into a product of the marginal cdfs, which in
turn by Theorem 2.4.2 implies that X1 and X2 are independent.

Example 2.4.4 (Example 2.4.2, Continued). Independence is necessary for condi-


tion (2.4.2). For example, consider the dependent variables X1 and X2 of Example
2.4.2. For these random variables, we have

P(0 < X_1 < \tfrac{1}{2}, 0 < X_2 < \tfrac{1}{2}) = \int_0^{1/2}\int_0^{1/2} (x_1 + x_2)\, dx_1\, dx_2 = \frac{1}{8},

whereas

P(0 < X_1 < \tfrac{1}{2}) = \int_0^{1/2} \left(x_1 + \tfrac{1}{2}\right) dx_1 = \frac{3}{8}

and

P(0 < X_2 < \tfrac{1}{2}) = \int_0^{1/2} \left(\tfrac{1}{2} + x_2\right) dx_2 = \frac{3}{8}.
Hence, condition (2.4.2) does not hold.
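The three probabilities above are easy to reproduce numerically; the following R sketch (ours) evaluates them by iterated integration.

# Sketch: numerical version of Example 2.4.4.
joint <- integrate(function(x2) sapply(x2, function(b)
           integrate(function(x1) x1 + b, 0, 0.5)$value), 0, 0.5)$value
marg1 <- integrate(function(x1) x1 + 1/2, 0, 0.5)$value
marg2 <- integrate(function(x2) 1/2 + x2, 0, 0.5)$value
c(joint = joint, product = marg1 * marg2)          # 0.125 versus 0.140625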
Not merely are calculations of some probabilities usually simpler when we have
independent random variables, but many expectations, including certain moment
generating functions, have comparably simpler computations. The following result
proves so useful that we state it in the form of a theorem.
Theorem 2.4.4. Suppose X1 and X2 are independent and that E(u(X1 )) and
E(v(X2 )) exist. Then

E[u(X1 )v(X2 )] = E[u(X1 )]E[v(X2 )].

Proof. We give the proof in the continuous case. The independence of X1 and X2
implies that the joint pdf of X1 and X2 is f1 (x1 )f2 (x2 ). Thus we have, by definition
of expectation,

E[u(X_1)v(X_2)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} u(x_1)v(x_2)f_1(x_1)f_2(x_2)\, dx_1\, dx_2
  = \left[\int_{-\infty}^{\infty} u(x_1)f_1(x_1)\, dx_1\right]\left[\int_{-\infty}^{\infty} v(x_2)f_2(x_2)\, dx_2\right]
  = E[u(X_1)]E[v(X_2)].

Hence, the result is true.

Upon taking the functions u(·) and v(·) to be the identity functions in Theorem
2.4.4, we have that for independent random variables X1 and X2 ,

E(X1 X2 ) = E(X1 )E(X2 ). (2.4.3)

We next prove a very useful theorem about independent random variables. The
proof of the theorem relies heavily upon our assertion that an mgf, when it exists,
is unique and that it uniquely determines the distribution of probability.
Theorem 2.4.5. Suppose the joint mgf, M (t1 , t2 ), exists for the random variables
X1 and X2 . Then X1 and X2 are independent if and only if

M (t1 , t2 ) = M (t1 , 0)M (0, t2 );

that is, the joint mgf is identically equal to the product of the marginal mgfs.

Proof. If X1 and X2 are independent, then

M (t1 , t2 ) = E(et1 X1 +t2 X2 )


= E(et1 X1 et2 X2 )
= E(et1 X1 )E(et2 X2 )
= M (t1 , 0)M (0, t2 ).

Thus the independence of X1 and X2 implies that the mgf of the joint distribution
factors into the product of the moment-generating functions of the two marginal
distributions.
Suppose next that the mgf of the joint distribution of X1 and X2 is given by
M(t_1, t_2) = M(t_1, 0)M(0, t_2). Now X1 has the unique mgf, which, in the continuous
case, is given by

M(t_1, 0) = \int_{-\infty}^{\infty} e^{t_1 x_1} f_1(x_1)\, dx_1.

Similarly, the unique mgf of X2, in the continuous case, is given by

M(0, t_2) = \int_{-\infty}^{\infty} e^{t_2 x_2} f_2(x_2)\, dx_2.

Thus we have

M(t_1, 0)M(0, t_2) = \int_{-\infty}^{\infty} e^{t_1 x_1} f_1(x_1)\, dx_1 \int_{-\infty}^{\infty} e^{t_2 x_2} f_2(x_2)\, dx_2
  = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f_1(x_1)f_2(x_2)\, dx_1\, dx_2.

We are given that M(t_1, t_2) = M(t_1, 0)M(0, t_2); so

M(t_1, t_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f_1(x_1)f_2(x_2)\, dx_1\, dx_2.

But M(t_1, t_2) is the mgf of X1 and X2. Thus

M(t_1, t_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{t_1 x_1 + t_2 x_2} f(x_1, x_2)\, dx_1\, dx_2.

The uniqueness of the mgf implies that the two distributions of probability that are
described by f1 (x1 )f2 (x2 ) and f (x1 , x2 ) are the same. Thus

f (x1 , x2 ) ≡ f1 (x1 )f2 (x2 ).

That is, if M (t1 , t2 ) = M (t1 , 0)M (0, t2 ), then X1 and X2 are independent. This
completes the proof when the random variables are of the continuous type. With
random variables of the discrete type, the proof is made by using summation instead
of integration.

Example 2.4.5 (Example 2.1.10, Continued). Let (X, Y ) be a pair of random


variables with the joint pdf

f(x, y) = e^{-y}, \quad 0 < x < y < \infty, zero elsewhere.

In Example 2.1.10, we showed that the mgf of (X, Y) is

M(t_1, t_2) = \int_0^{\infty}\int_x^{\infty} \exp(t_1 x + t_2 y - y)\, dy\, dx = \frac{1}{(1 - t_1 - t_2)(1 - t_2)},

provided that t1 + t2 < 1 and t2 < 1. Because M(t_1, t_2) \neq M(t_1, 0)M(0, t_2), the
random variables are dependent.

Example 2.4.6 (Exercise 2.1.15, Continued). For the random variables X1 and X2
defined in Exercise 2.1.15, we showed that the joint mgf is

M(t_1, t_2) = \frac{\exp\{t_1\}}{2 - \exp\{t_1\}} \cdot \frac{\exp\{t_2\}}{2 - \exp\{t_2\}}, \quad t_i < \log 2,\; i = 1, 2.

We showed further that M (t1 , t2 ) = M (t1 , 0)M (0, t2 ). Hence, X1 and X2 are inde-
pendent random variables.

EXERCISES

2.4.1. Show that the random variables X1 and X2 with joint pdf

f(x_1, x_2) = 12x_1x_2(1 - x_2), \quad 0 < x_1 < 1,\; 0 < x_2 < 1, zero elsewhere,

are independent.

2.4.2. If the random variables X1 and X2 have the joint pdf f (x1 , x2 ) = 2e−x1 −x2 , 0 <
x1 < x2 , 0 < x2 < ∞, zero elsewhere, show that X1 and X2 are dependent.
2.4.3. Let p(x_1, x_2) = 1/16, x_1 = 1, 2, 3, 4, and x_2 = 1, 2, 3, 4, zero elsewhere, be the
joint pmf of X1 and X2. Show that X1 and X2 are independent.

2.4.4. Find P(0 < X_1 < \tfrac{1}{3}, 0 < X_2 < \tfrac{1}{3}) if the random variables X1 and X2 have
the joint pdf f(x_1, x_2) = 4x_1(1 - x_2), 0 < x_1 < 1, 0 < x_2 < 1, zero elsewhere.

2.4.5. Find the probability of the union of the events a < X_1 < b, −∞ < X_2 < ∞,
and −∞ < X_1 < ∞, c < X_2 < d if X1 and X2 are two independent variables with
P(a < X_1 < b) = 2/3 and P(c < X_2 < d) = 5/8.

2.4.6. If f (x1 , x2 ) = e−x1 −x2 , 0 < x1 < ∞, 0 < x2 < ∞, zero elsewhere, is the
joint pdf of the random variables X1 and X2 , show that X1 and X2 are independent
and that M (t1 , t2 ) = (1 − t1 )−1 (1 − t2 )−1 , t2 < 1, t1 < 1. Also show that
E(et(X1 +X2 ) ) = (1 − t)−2 , t < 1.
Accordingly, find the mean and the variance of Y = X1 + X2 .
2.4.7. Let the random variables X1 and X2 have the joint pdf f (x1 , x2 ) = 1/π, for
(x1 − 1)2 + (x2 + 2)2 < 1, zero elsewhere. Find f1 (x1 ) and f2 (x2 ). Are X1 and X2
independent?
2.4.8. Let X and Y have the joint pdf f (x, y) = 3x, 0 < y < x < 1, zero elsewhere.
Are X and Y independent? If not, find E(X|y).
2.4.9. Suppose that a man leaves for work between 8:00 a.m. and 8:30 a.m. and
takes between 40 and 50 minutes to get to the office. Let X denote the time of
departure and let Y denote the time of travel. If we assume that these random
variables are independent and uniformly distributed, find the probability that he
arrives at the office before 9:00 a.m.
2.4.10. Let X and Y be random variables with the space consisting of the four
points (0, 0), (1, 1), (1, 0), (1, −1). Assign positive probabilities to these four points
so that the correlation coefficient is equal to zero. Are X and Y independent?
2.4.11. Two line segments, each of length two units, are placed along the x-axis.
The midpoint of the first is between x = 0 and x = 14 and that of the second is
between x = 6 and x = 20. Assuming independence and uniform distributions for
these midpoints, find the probability that the line segments overlap.
2.4.12. Cast a fair die and let X = 0 if 1, 2, or 3 spots appear, let X = 1 if 4 or 5
spots appear, and let X = 2 if 6 spots appear. Do this two independent times,
obtaining X1 and X2 . Calculate P (|X1 − X2 | = 1).
2.4.13. For X1 and X2 in Example 2.4.6, show that the mgf of Y = X1 + X2 is
e2t /(2 − et )2 , t < log 2, and then compute the mean and variance of Y .

2.5 The Correlation Coefficient


Let (X, Y ) denote a random vector. In the last section, we discussed the concept
of independence between X and Y . What if, though, X and Y are dependent
and, if so, how are they related? There are many measures of dependence. In
this section, we introduce a parameter ρ of the joint distribution of (X, Y ) which
measures linearity between X and Y . In this section, we assume the existence of
all expectations under discussion.
Definition 2.5.1. Let (X, Y ) have a joint distribution. Denote the means of X
and Y respectively by μ1 and μ2 and their respective variances by σ12 and σ22 . The
covariance of (X, Y ) is denoted by cov(X, Y ) and is defined by the expectation
cov(X, Y ) = E[(X − μ1 )(Y − μ2 )]. (2.5.1)

It follows by the linearity of expectation, Theorem 2.1.1, that the covariance of


X and Y can also be expressed as

cov(X, Y ) = E(XY − μ2 X − μ1 Y + μ1 μ2 )
= E(XY ) − μ2 E(X) − μ1 E(Y ) + μ1 μ2
= E(XY ) − μ1 μ2 , (2.5.2)

which is often easier to compute than using the definition, (2.5.1).


The measure that we seek is a standardized (unitless) version of the covariance.

Definition 2.5.2. If each of σ1 and σ2 is positive, then the correlation coefficient


between X and Y is defined by

\rho = \frac{E[(X - \mu_1)(Y - \mu_2)]}{\sigma_1\sigma_2} = \frac{\operatorname{cov}(X, Y)}{\sigma_1\sigma_2}.     (2.5.3)

It should be noted that the expected value of the product of two random variables
is equal to the product of their expectations plus their covariance; that is, E(XY ) =
μ1 μ2 + cov(X, Y ) = μ1 μ2 + ρσ1 σ2 .
As illustrations, we present two examples. The first is for a discrete model while
the second concerns a continuous model.

Example 2.5.1. Reconsider the random vector (X1 , X2 ) of Example 2.1.1 where a
fair coin is flipped three times and X1 is the number of heads on the first two flips
while X2 is the number of heads on all three flips. Recall that Table 2.1.1 contains
the marginal distributions of X1 and X2 . By symmetry of these pmfs, we have
E(X1 ) = 1 and E(X2 ) = 3/2. To compute the correlation coefficient of (X1 , X2 ),
we next sketch the computation of the required moments:

E(X_1^2) = \frac{1}{2} + 2^2 \cdot \frac{1}{4} = \frac{3}{2} \;\Rightarrow\; \sigma_1^2 = \frac{3}{2} - 1^2 = \frac{1}{2};

E(X_2^2) = \frac{3}{8} + 4 \cdot \frac{3}{8} + 9 \cdot \frac{1}{8} = 3 \;\Rightarrow\; \sigma_2^2 = 3 - \left(\frac{3}{2}\right)^2 = \frac{3}{4};

E(X_1X_2) = \frac{2}{8} + 1\cdot 2\cdot\frac{2}{8} + 2\cdot 2\cdot\frac{1}{8} + 2\cdot 3\cdot\frac{1}{8} = 2 \;\Rightarrow\; \operatorname{cov}(X_1, X_2) = 2 - 1\cdot\frac{3}{2} = \frac{1}{2}.

From which it follows that \rho = (1/2)\big/\sqrt{(1/2)(3/4)} = 0.816.
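These moments can also be obtained directly in R from the joint pmf of the coin-toss example; the six support points and their probabilities 1/8, 1/8, 2/8, 2/8, 1/8, 1/8 follow from the three fair tosses. The sketch below is ours.

# Sketch: correlation of (X1, X2) computed from the joint pmf.
pts <- rbind(c(0,0), c(0,1), c(1,1), c(1,2), c(2,2), c(2,3))
pr  <- c(1, 1, 2, 2, 1, 1) / 8
m1  <- sum(pts[, 1] * pr);  m2 <- sum(pts[, 2] * pr)
v1  <- sum(pts[, 1]^2 * pr) - m1^2
v2  <- sum(pts[, 2]^2 * pr) - m2^2
cv  <- sum(pts[, 1] * pts[, 2] * pr) - m1 * m2
cv / sqrt(v1 * v2)                                  # about 0.816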

Example 2.5.2. Let the random variables X and Y have the joint pdf

f(x, y) = x + y, \quad 0 < x < 1,\; 0 < y < 1, zero elsewhere.

We next compute the correlation coefficient ρ of X and Y. Now

\mu_1 = E(X) = \int_0^1\int_0^1 x(x + y)\, dx\, dy = \frac{7}{12}

and

\sigma_1^2 = E(X^2) - \mu_1^2 = \int_0^1\int_0^1 x^2(x + y)\, dx\, dy - \left(\frac{7}{12}\right)^2 = \frac{11}{144}.

Similarly,

\mu_2 = E(Y) = \frac{7}{12} \quad\text{and}\quad \sigma_2^2 = E(Y^2) - \mu_2^2 = \frac{11}{144}.

The covariance of X and Y is

E(XY) - \mu_1\mu_2 = \int_0^1\int_0^1 xy(x + y)\, dx\, dy - \left(\frac{7}{12}\right)^2 = -\frac{1}{144}.

Accordingly, the correlation coefficient of X and Y is

\rho = \frac{-1/144}{\sqrt{(11/144)(11/144)}} = -\frac{1}{11}.
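The same value can be checked by numerical integration; the short R sketch below (ours) computes every moment as an expectation under f(x, y) = x + y on the unit square.

# Sketch: numerical check that rho = -1/11 in Example 2.5.2.
Ef <- function(g)                         # E[g(X, Y)] under f(x, y) = x + y
  integrate(function(y) sapply(y, function(b)
    integrate(function(x) g(x, b) * (x + b), 0, 1)$value), 0, 1)$value
m1 <- Ef(function(x, y) x);  m2 <- Ef(function(x, y) y)
v1 <- Ef(function(x, y) x^2) - m1^2;  v2 <- Ef(function(x, y) y^2) - m2^2
cv <- Ef(function(x, y) x * y) - m1 * m2
cv / sqrt(v1 * v2)                        # about -0.0909 = -1/11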

We next establish that, in general, |ρ| ≤ 1.


Theorem 2.5.1. For all jointly distributed random variables (X, Y ) whose corre-
lation coefficient ρ exists, −1 ≤ ρ ≤ 1.
Proof: Consider the polynomial in v given by

h(v) = E\left\{[(X - \mu_1) + v(Y - \mu_2)]^2\right\}.

Then h(v) ≥ 0, for all v. Hence, the discriminant of h(v) is less than or equal to 0.
To obtain the discriminant, we expand h(v) as

h(v) = \sigma_1^2 + 2v\rho\sigma_1\sigma_2 + v^2\sigma_2^2.

Hence, the discriminant of h(v) is 4\rho^2\sigma_1^2\sigma_2^2 - 4\sigma_2^2\sigma_1^2. Since this is less than or equal to 0, we have

4\rho^2\sigma_1^2\sigma_2^2 \le 4\sigma_2^2\sigma_1^2 \quad\text{or}\quad \rho^2 \le 1,
which is the result sought.
Theorem 2.5.2. If X and Y are independent random variables then cov(X, Y ) = 0
and, hence, ρ = 0.
Proof: Because X and Y are independent, it follows from expression (2.4.3) that
E(XY ) = E(X)E(Y ). Hence, by (2.5.2) the covariance of X and Y is 0; i.e., ρ = 0.

As the following example shows, the converse of this theorem is not true:
Example 2.5.3. Let X and Y be jointly discrete random variables whose distri-
bution has mass 1/4 at each of the four points (−1, 0), (0, −1), (1, 0) and (0, 1). It
follows that both X and Y have the same marginal distribution with range {−1, 0, 1}
and respective probabilities 1/4, 1/2, and 1/4. Hence, μ1 = μ2 = 0 and a quick cal-
culation shows that E(XY ) = 0. Thus, ρ = 0. However, P (X = 0, Y = 0) = 0
while P (X = 0)P (Y = 0) = (1/2)(1/2) = 1/4. Thus, X and Y are dependent but
the correlation coefficient of X and Y is 0.
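A short R sketch (ours) makes the point of this example concrete: the correlation is exactly zero, yet the joint and product probabilities at (0, 0) disagree.

# Sketch: zero covariance but dependence in Example 2.5.3.
pts <- rbind(c(-1, 0), c(0, -1), c(1, 0), c(0, 1));  pr <- rep(1/4, 4)
exy <- sum(pts[, 1] * pts[, 2] * pr)              # E(XY) = 0, so rho = 0
p00 <- sum(pr[pts[, 1] == 0 & pts[, 2] == 0])     # P(X = 0, Y = 0) = 0
px0 <- sum(pr[pts[, 1] == 0]);  py0 <- sum(pr[pts[, 2] == 0])
c(E_XY = exy, joint = p00, product = px0 * py0)   # 0, 0, 0.25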

Although the converse of Theorem 2.5.2 is not true, the contrapositive is; i.e.,
if ρ ≠ 0, then X and Y are dependent. For instance, in Example 2.5.1, since
ρ = 0.816, we know that the random variables X1 and X2 discussed in this example
are dependent. As discussed in Section 10.8, this contrapositive is often used in
Statistics.
Exercise 2.5.7 points out that in the proof of Theorem 2.5.1, the discriminant
of the polynomial h(v) is 0 if and only if ρ = ±1. In that case X and Y are linear
functions of one another with probability one; although, as shown, the relationship is
degenerate. This suggests the following interesting question: When ρ does not have
one of its extreme values, is there a line in the xy-plane such that the probability
for X and Y tends to be concentrated in a band about this line? Under certain
restrictive conditions this is, in fact, the case, and under those conditions we can
look upon ρ as a measure of the intensity of the concentration of the probability for
X and Y about that line.
We summarize these thoughts in the next theorem. For notation, let f (x, y)
denote the joint pdf of two random variables X and Y and let f1 (x) denote the
marginal pdf of X. Recall from Section 2.3 that the conditional pdf of Y , given
X = x, is

f_{2|1}(y|x) = \frac{f(x, y)}{f_1(x)}

at points where f_1(x) > 0, and the conditional mean of Y, given X = x, is given by

E(Y|x) = \int_{-\infty}^{\infty} y f_{2|1}(y|x)\, dy = \frac{\int_{-\infty}^{\infty} y f(x, y)\, dy}{f_1(x)},

when dealing with random variables of the continuous type. This conditional mean
of Y , given X = x, is, of course, a function of x, say u(x). In a like vein, the
conditional mean of X, given Y = y, is a function of y, say v(y).
In case u(x) is a linear function of x, say u(x) = a + bx, we say the conditional
mean of Y is linear in x; or that Y has a linear conditional mean. When u(x) =
a + bx, the constants a and b have simple values which we show in the following
theorem.

Theorem 2.5.3. Suppose (X, Y ) have a joint distribution with the variances of X
and Y finite and positive. Denote the means and variances of X and Y by μ1 , μ2
and σ12 , σ22 , respectively, and let ρ be the correlation coefficient between X and Y . If
E(Y|X) is linear in X, then

E(Y|X) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1)     (2.5.4)

and

E(\operatorname{Var}(Y|X)) = \sigma_2^2(1 - \rho^2).     (2.5.5)

Proof: The proof is given in the continuous case. The discrete case follows similarly

by changing integrals to sums. Let E(Y|x) = a + bx. From

E(Y|x) = \frac{\int_{-\infty}^{\infty} y f(x, y)\, dy}{f_1(x)} = a + bx,

we have

\int_{-\infty}^{\infty} y f(x, y)\, dy = (a + bx)f_1(x).     (2.5.6)

If both members of Equation (2.5.6) are integrated on x, it is seen that

E(Y) = a + bE(X)

or

\mu_2 = a + b\mu_1,     (2.5.7)

where μ1 = E(X) and μ2 = E(Y). If both members of Equation (2.5.6) are first
multiplied by x and then integrated on x, we have

E(XY) = aE(X) + bE(X^2),

or

\rho\sigma_1\sigma_2 + \mu_1\mu_2 = a\mu_1 + b(\sigma_1^2 + \mu_1^2),     (2.5.8)

where ρσ1σ2 is the covariance of X and Y. The simultaneous solution of equations
(2.5.7) and (2.5.8) yields

a = \mu_2 - \rho\frac{\sigma_2}{\sigma_1}\mu_1 \quad\text{and}\quad b = \rho\frac{\sigma_2}{\sigma_1}.

These values give the first result (2.5.4).
Next, the conditional variance of Y is given by

\operatorname{Var}(Y|x) = \int_{-\infty}^{\infty}\left[y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right]^2 f_{2|1}(y|x)\, dy
  = \frac{\int_{-\infty}^{\infty}\left[(y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right]^2 f(x, y)\, dy}{f_1(x)}.     (2.5.9)

This variance is nonnegative and is at most a function of x alone. If it is multiplied
by f_1(x) and integrated on x, the result obtained is nonnegative. This result is

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left[(y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right]^2 f(x, y)\, dy\, dx
  = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left[(y - \mu_2)^2 - 2\rho\frac{\sigma_2}{\sigma_1}(y - \mu_2)(x - \mu_1) + \rho^2\frac{\sigma_2^2}{\sigma_1^2}(x - \mu_1)^2\right] f(x, y)\, dy\, dx
  = E[(Y - \mu_2)^2] - 2\rho\frac{\sigma_2}{\sigma_1}E[(X - \mu_1)(Y - \mu_2)] + \rho^2\frac{\sigma_2^2}{\sigma_1^2}E[(X - \mu_1)^2]
  = \sigma_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}\rho\sigma_1\sigma_2 + \rho^2\frac{\sigma_2^2}{\sigma_1^2}\sigma_1^2
  = \sigma_2^2 - 2\rho^2\sigma_2^2 + \rho^2\sigma_2^2 = \sigma_2^2(1 - \rho^2),

which is the desired result.

Note that if the variance, Equation (2.5.9), is denoted by k(x), then E[k(X)] =
σ22 (1 − ρ2 ) ≥ 0. Accordingly, ρ2 ≤ 1, or −1 ≤ ρ ≤ 1. This verifies Theorem 2.5.1
for the special case of linear conditional means.
As a corollary to Theorem 2.5.3, suppose that the variance, Equation (2.5.9), is
positive but not a function of x; that is, the variance is a constant k > 0. Now if k
is multiplied by f1 (x) and integrated on x, the result is k, so that k = σ22 (1 − ρ2 ).
Thus, in this case, the variance of each conditional distribution of Y , given X = x, is
σ22 (1 − ρ2 ). If ρ = 0, the variance of each conditional distribution of Y , given X = x,
is σ22 , the variance of the marginal distribution of Y . On the other hand, if ρ2 is near
1, the variance of each conditional distribution of Y , given X = x, is relatively small,
and there is a high concentration of the probability for this conditional distribution
near the mean E(Y |x) = μ2 + ρ(σ2 /σ1 )(x − μ1 ). Similar comments can be made
about E(X|y) if it is linear. In particular, E(X|y) = μ1 + ρ(σ1 /σ2 )(y − μ2 ) and
E[Var(X|Y )] = σ12 (1 − ρ2 ).
Example 2.5.4. Let the random variables X and Y have the linear conditional
means E(Y|x) = 4x + 3 and E(X|y) = (1/16)y − 3. In accordance with the general
formulas for the linear conditional means, we see that E(Y|x) = μ2 if x = μ1 and
E(X|y) = μ1 if y = μ2. Accordingly, in this special case, we have μ2 = 4μ1 + 3
and μ1 = (1/16)μ2 − 3, so that μ1 = −15/4 and μ2 = −12. The general formulas for the
linear conditional means also show that the product of the coefficients of x and y,
respectively, is equal to ρ^2 and that the quotient of these coefficients is equal to
σ_2^2/σ_1^2. Here ρ^2 = 4(1/16) = 1/4 with ρ = 1/2 (not −1/2), and σ_2^2/σ_1^2 = 64. Thus, from the
two linear conditional means, we are able to find the values of μ1, μ2, ρ, and σ2/σ1,
but not the values of σ1 and σ2.

[Figure 2.5.1: Illustration for Example 2.5.5 — the band −a + bx < y < a + bx, −h < x < h, about the line E(Y|x) = bx.]



Example 2.5.5. To illustrate how the correlation coefficient measures the intensity
of the concentration of the probability for X and Y about a line, let these random
variables have a distribution that is uniform over the area depicted in Figure 2.5.1.
That is, the joint pdf of X and Y is

f(x, y) = \frac{1}{4ah}, \quad -a + bx < y < a + bx,\; -h < x < h, zero elsewhere.

We assume here that b ≥ 0, but the argument can be modified for b ≤ 0. It is easy
to show that the pdf of X is uniform, namely

f_1(x) = \int_{-a+bx}^{a+bx} \frac{1}{4ah}\, dy = \frac{1}{2h}, \quad -h < x < h, zero elsewhere.

The conditional mean and variance are

E(Y|x) = bx \quad\text{and}\quad \operatorname{var}(Y|x) = \frac{a^2}{3}.

From the general expressions for those characteristics we know that

b = \rho\frac{\sigma_2}{\sigma_1} \quad\text{and}\quad \frac{a^2}{3} = \sigma_2^2(1 - \rho^2).

Additionally, we know that σ_1^2 = h^2/3. If we solve these three equations, we obtain
an expression for the correlation coefficient, namely

\rho = \frac{bh}{\sqrt{a^2 + b^2h^2}}.
Referring to Figure 2.5.1, we note
1. As a gets small (large), the straight-line effect is more (less) intense and ρ is
closer to 1 (0).
2. As h gets large (small), the straight-line effect is more (less) intense and ρ is
closer to 1 (0).
3. As b gets large (small), the straight-line effect is more (less) intense and ρ is
closer to 1 (0).
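A Monte Carlo check of this expression is straightforward (our sketch): given X = x the conditional distribution of Y is uniform on (bx − a, bx + a), so the pair is easy to simulate and the sample correlation can be compared with bh/√(a² + b²h²).

# Sketch: simulate from the uniform band of Example 2.5.5 and compare correlations.
set.seed(1)
a <- 1; h <- 2; b <- 0.5; n <- 1e5
x <- runif(n, -h, h)                 # X is uniform on (-h, h)
y <- b * x + runif(n, -a, a)         # given X = x, Y is uniform on (bx - a, bx + a)
c(simulated = cor(x, y), formula = b * h / sqrt(a^2 + b^2 * h^2))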
Recall that in Section 2.1 we introduced the mgf for the random vector (X, Y ).
As for random variables, the joint mgf also gives explicit formulas for certain mo-
ments. In the case of random variables of the continuous type,

\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\, \partial t_2^m} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^m e^{t_1 x + t_2 y} f(x, y)\, dx\, dy,

so that

\left.\frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\, \partial t_2^m}\right|_{t_1 = t_2 = 0} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^m f(x, y)\, dx\, dy = E(X^k Y^m).

For instance, in a simplified notation that appears to be clear,

\mu_1 = E(X) = \frac{\partial M(0, 0)}{\partial t_1}

\mu_2 = E(Y) = \frac{\partial M(0, 0)}{\partial t_2}

\sigma_1^2 = E(X^2) - \mu_1^2 = \frac{\partial^2 M(0, 0)}{\partial t_1^2} - \mu_1^2

\sigma_2^2 = E(Y^2) - \mu_2^2 = \frac{\partial^2 M(0, 0)}{\partial t_2^2} - \mu_2^2

E[(X - \mu_1)(Y - \mu_2)] = \frac{\partial^2 M(0, 0)}{\partial t_1 \partial t_2} - \mu_1\mu_2,     (2.5.10)
and from these we can compute the correlation coefficient ρ.
It is fairly obvious that the results of equations (2.5.10) hold if X and Y are
random variables of the discrete type. Thus the correlation coefficients may be com-
puted by using the mgf of the joint distribution if that function is readily available.
An illustrative example follows.
Example 2.5.6 (Example 2.1.10, Continued). In Example 2.1.10, we considered
the joint density

f(x, y) = e^{-y}, \quad 0 < x < y < \infty, zero elsewhere,

and showed that the mgf was

M(t_1, t_2) = \frac{1}{(1 - t_1 - t_2)(1 - t_2)},

for t1 + t2 < 1 and t2 < 1. For this distribution, equations (2.5.10) become

\mu_1 = 1, \quad \mu_2 = 2, \quad \sigma_1^2 = 1, \quad \sigma_2^2 = 2, \quad E[(X - \mu_1)(Y - \mu_2)] = 1.     (2.5.11)

Verification of (2.5.11) is left as an exercise; see Exercise 2.5.5. If, momentarily, we
accept these results, the correlation coefficient of X and Y is ρ = 1/√2.
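Exercise 2.5.5 asks for a hand verification; as a sketch of a machine check (ours), R's symbolic derivative D() can be applied to the mgf and evaluated at t1 = t2 = 0, reproducing equations (2.5.11) and ρ = 1/√2.

# Sketch: moments of Example 2.5.6 from the joint mgf via symbolic differentiation.
M   <- expression(1 / ((1 - t1 - t2) * (1 - t2)))
at0 <- list(t1 = 0, t2 = 0)
mu1 <- eval(D(M, "t1"), at0)                  # E(X)   = 1
mu2 <- eval(D(M, "t2"), at0)                  # E(Y)   = 2
ex2 <- eval(D(D(M, "t1"), "t1"), at0)         # E(X^2) = 2, so var(X) = 1
ey2 <- eval(D(D(M, "t2"), "t2"), at0)         # E(Y^2) = 6, so var(Y) = 2
exy <- eval(D(D(M, "t1"), "t2"), at0)         # E(XY)  = 3, so cov(X, Y) = 1
(exy - mu1 * mu2) / sqrt((ex2 - mu1^2) * (ey2 - mu2^2))   # 1/sqrt(2)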

EXERCISES
2.5.1. Let the random variables X and Y have the joint pmf

(a) p(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 2), zero elsewhere.

(b) p(x, y) = 1/3, (x, y) = (0, 2), (1, 1), (2, 0), zero elsewhere.

(c) p(x, y) = 1/3, (x, y) = (0, 0), (1, 1), (2, 0), zero elsewhere.

In each case compute the correlation coefficient of X and Y .


2.5.2. Let X and Y have the joint pmf described as follows:

(x, y)     (1, 1)   (1, 2)   (1, 3)   (2, 1)   (2, 2)   (2, 3)
p(x, y)     2/15     4/15     3/15     1/15     1/15     4/15

and p(x, y) is equal to zero elsewhere.


(a) Find the means μ1 and μ2 , the variances σ12 and σ22 , and the correlation
coefficient ρ.

(b) Compute E(Y |X = 1), E(Y |X = 2), and the line μ2 + ρ(σ2 /σ1 )(x − μ1 ). Do
the points [k, E(Y |X = k)], k = 1, 2, lie on this line?
2.5.3. Let f(x, y) = 2, 0 < x < y, 0 < y < 1, zero elsewhere, be the joint pdf of
X and Y. Show that the conditional means are, respectively, (1 + x)/2, 0 < x < 1,
and y/2, 0 < y < 1. Show that the correlation coefficient of X and Y is ρ = 1/2.

2.5.4. Show that the variance of the conditional distribution of Y , given X = x, in


Exercise 2.5.3, is (1 − x)2 /12, 0 < x < 1, and that the variance of the conditional
distribution of X, given Y = y, is y 2 /12, 0 < y < 1.

2.5.5. Verify the results of equations (2.5.11) of this section.


2.5.6. Let X and Y have the joint pdf f (x, y) = 1, −x < y < x, 0 < x < 1,
zero elsewhere. Show that, on the set of positive probability density, the graph of
E(Y |x) is a straight line, whereas that of E(X|y) is not a straight line.
2.5.7. In the proof of Theorem 2.5.1, consider the case when the discriminant of
the polynomial h(v) is 0. Show that this is equivalent to ρ = ±1. Consider the case
when ρ = 1. Find the unique root of h(v) and then use the fact that h(v) is 0 at
this root to show that Y is a linear function of X with probability 1.
2.5.8. Let ψ(t1, t2) = log M(t1, t2), where M(t1, t2) is the mgf of X and Y. Show
that

\frac{\partial\psi(0, 0)}{\partial t_i}, \quad \frac{\partial^2\psi(0, 0)}{\partial t_i^2}, \quad i = 1, 2,

and

\frac{\partial^2\psi(0, 0)}{\partial t_1 \partial t_2}
yield the means, the variances, and the covariance of the two random variables.
Use this result to find the means, the variances, and the covariance of X and Y of
Example 2.5.6.

2.5.9. Let X and Y have the joint pmf p(x, y) = 1/7, (x, y) = (0, 0), (1, 0), (0, 1), (1, 1), (2, 1),
(1, 2), (2, 2), zero elsewhere. Find the correlation coefficient ρ.
2.5.10. Let X1 and X2 have the joint pmf described by the following table:

(x1, x2)      (0, 0)   (0, 1)   (0, 2)   (1, 1)   (1, 2)   (2, 2)
p(x1, x2)      1/12     2/12     1/12     3/12     4/12     1/12

Find p1 (x1 ), p2 (x2 ), μ1 , μ2 , σ12 , σ22 , and ρ.


2.5.11. Let σ_1^2 = σ_2^2 = σ^2 be the common variance of X1 and X2 and let ρ be the
correlation coefficient of X1 and X2. Show for k > 0 that

P[|(X_1 - \mu_1) + (X_2 - \mu_2)| \ge k\sigma] \le \frac{2(1 + \rho)}{k^2}.

2.6 Extension to Several Random Variables


The notions about two random variables can be extended immediately to n random
variables. We make the following definition of the space of n random variables.
Definition 2.6.1. Consider a random experiment with the sample space C. Let
the random variable Xi assign to each element c ∈ C one and only one real num-
ber Xi (c) = xi , i = 1, 2, . . . , n. We say that (X1 , . . . , Xn ) is an n-dimensional
random vector. The space of this random vector is the set of ordered n-tuples
D = {(x1 , x2 , . . . , xn ) : x1 = X1 (c), . . . , xn = Xn (c), c ∈ C}. Furthermore, let A be
a subset of the space D. Then P [(X1 , . . . , Xn ) ∈ A] = P (C), where C = {c : c ∈
C and (X1 (c), X2 (c), . . . , Xn (c)) ∈ A}.
In this section, we often use vector notation. We denote (X1 , . . . , Xn ) by the
n-dimensional column vector X and the observed values (x1 , . . . , xn ) of the random
variables by x. The joint cdf is defined to be

FX (x) = P [X1 ≤ x1 , . . . , Xn ≤ xn ]. (2.6.1)

We say that the n random variables X1 , X2 , . . . , Xn are of the discrete type or


of the continuous type and have a distribution of that type according to whether
the joint cdf can be expressed as

F_{\mathbf{X}}(\mathbf{x}) = \sum_{w_1 \le x_1, \ldots, w_n \le x_n} \cdots \sum p(w_1, \ldots, w_n),

or as

F_{\mathbf{X}}(\mathbf{x}) = \int_{-\infty}^{x_1}\int_{-\infty}^{x_2}\cdots\int_{-\infty}^{x_n} f(w_1, \ldots, w_n)\, dw_n \cdots dw_1.

For the continuous case,

\frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{\mathbf{X}}(\mathbf{x}) = f(\mathbf{x}),     (2.6.2)
except possibly on points that have probability zero.
In accordance with the convention of extending the definition of a joint pdf,
it is seen that a continuous function f essentially satisfies the conditions of being
a pdf if (a) f is defined and is nonnegative for all real values of its argument(s)

and (b) its integral over all real values of its argument(s) is 1. Likewise, a point
function p essentially satisfies the conditions of being a joint pmf if (a) p is defined
and is nonnegative for all real values of its argument(s) and (b) its sum over all real
values of its argument(s) is 1. As in previous sections, it is sometimes convenient
to speak of the support set of a random vector. For the discrete case, this would be
all points in D that have positive mass, while for the continuous case these would
be all points in D that can be embedded in an open set of positive probability. We
use S to denote support sets.
Example 2.6.1. Let

f(x, y, z) = e^{-(x+y+z)}, \quad 0 < x, y, z < \infty, zero elsewhere,

be the pdf of the random variables X, Y, and Z. Then the distribution function of
X, Y, and Z is given by

F(x, y, z) = P(X \le x, Y \le y, Z \le z) = \int_0^z\int_0^y\int_0^x e^{-u-v-w}\, du\, dv\, dw = (1 - e^{-x})(1 - e^{-y})(1 - e^{-z}), \quad 0 \le x, y, z < \infty,

and is equal to zero elsewhere. The relationship (2.6.2) can easily be verified.
Let (X1 , X2 , . . . , Xn ) be a random vector and let Y = u(X1 , X2 , . . . , Xn ) for
some function u. As in the bivariate case, the expected value of the random variable
exists if the n-fold integral

\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} |u(x_1, x_2, \ldots, x_n)|\, f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n

exists when the random variables are of the continuous type, or if the n-fold sum

\sum_{x_n}\cdots\sum_{x_1} |u(x_1, x_2, \ldots, x_n)|\, p(x_1, x_2, \ldots, x_n)

exists when the random variables are of the discrete type. If the expected value of
Y exists, then its expectation is given by

E(Y) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} u(x_1, x_2, \ldots, x_n) f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n     (2.6.3)

for the continuous case, and by

E(Y) = \sum_{x_n}\cdots\sum_{x_1} u(x_1, x_2, \ldots, x_n) p(x_1, x_2, \ldots, x_n)     (2.6.4)

for the discrete case. The properties of expectation discussed in Section 2.1 hold
for the n-dimensional case also. In particular, E is a linear operator. That is, if

Y_j = u_j(X_1, \ldots, X_n) for j = 1, \ldots, m and each E(Y_j) exists, then

E\left[\sum_{j=1}^{m} k_j Y_j\right] = \sum_{j=1}^{m} k_j E[Y_j],     (2.6.5)

where k1 , . . . , km are constants.


We next discuss the notions of marginal and conditional probability density
functions from the point of view of n random variables. All of the preceding defini-
tions can be directly generalized to the case of n variables in the following manner.
Let the random variables X1 , X2 , . . . , Xn be of the continuous type with the joint
pdf f (x1 , x2 , . . . , xn ). By an argument similar to the two-variable case, we have for
every b,

F_{X_1}(b) = P(X_1 \le b) = \int_{-\infty}^{b} f_1(x_1)\, dx_1,

where f_1(x_1) is defined by the (n − 1)-fold integral

f_1(x_1) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n)\, dx_2 \cdots dx_n.

Therefore, f1 (x1 ) is the pdf of the random variable X1 and f1 (x1 ) is called the
marginal pdf of X1 . The marginal probability density functions f2 (x2 ), . . . , fn (xn )
of X2 , . . . , Xn , respectively, are similar (n − 1)-fold integrals.
Up to this point, each marginal pdf has been a pdf of one random variable.
It is convenient to extend this terminology to joint probability density functions,
which we do now. Let f (x1 , x2 , . . . , xn ) be the joint pdf of the n random variables
X1 , X2 , . . . , Xn , just as before. Now, however, take any group of k < n of these
random variables and find the joint pdf of them. This joint pdf is called the marginal
pdf of this particular group of k variables. To fix the ideas, take n = 6, k = 3, and
let us select the group X2 , X4 , X5 . Then the marginal pdf of X2 , X4 , X5 is the joint
pdf of this particular group of three variables, namely,

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4, x_5, x_6)\, dx_1\, dx_3\, dx_6,

if the random variables are of the continuous type.


Next we extend the definition of a conditional pdf. Suppose f1 (x1 ) > 0. Then
we define the symbol f_{2,\ldots,n|1}(x_2, \ldots, x_n | x_1) by the relation

f_{2,\ldots,n|1}(x_2, \ldots, x_n | x_1) = \frac{f(x_1, x_2, \ldots, x_n)}{f_1(x_1)},
and f2,...,n|1 (x2 , . . . , xn |x1 ) is called the joint conditional pdf of X2 , . . . , Xn ,
given X1 = x1 . The joint conditional pdf of any n − 1 random variables, say
X1 , . . . , Xi−1 , Xi+1 , . . . , Xn , given Xi = xi , is defined as the joint pdf of X1 , . . . , Xn
divided by the marginal pdf fi (xi ), provided that fi (xi ) > 0. More generally, the
joint conditional pdf of n−k of the random variables, for given values of the remain-
ing k variables, is defined as the joint pdf of the n variables divided by the marginal

pdf of the particular group of k variables, provided that the latter pdf is positive.
We remark that there are many other conditional probability density functions; for
instance, see Exercise 2.3.12.
Because a conditional pdf is the pdf of a certain number of random variables,
the expectation of a function of these random variables has been defined. To em-
phasize the fact that a conditional pdf is under consideration, such expectations
are called conditional expectations. For instance, the conditional expectation of
u(X2 , . . . , Xn ), given X1 = x1 , is, for random variables of the continuous type,
given by

E[u(X_2, \ldots, X_n)|x_1] = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} u(x_2, \ldots, x_n) f_{2,\ldots,n|1}(x_2, \ldots, x_n | x_1)\, dx_2 \cdots dx_n,

provided f1 (x1 ) > 0 and the integral converges (absolutely). A useful random
variable is given by h(X1 ) = E[u(X2 , . . . , Xn )|X1 )].
The above discussion of marginal and conditional distributions generalizes to
random variables of the discrete type by using pmfs and summations instead of
integrals.
Let the random variables X1 , X2 , . . . , Xn have the joint pdf f (x1 , x2 , . . . , xn ) and
the marginal probability density functions f1 (x1 ), f2 (x2 ), . . . , fn (xn ), respectively.
The definition of the independence of X1 and X2 is generalized to the mutual
independence of X1 , X2 , . . . , Xn as follows: The random variables X1 , X2 , . . . , Xn
are said to be mutually independent if and only if

f (x1 , x2 , . . . , xn ) ≡ f1 (x1 )f2 (x2 ) · · · fn (xn ),

for the continuous case. In the discrete case, X1 , X2 , . . . , Xn are said to be mutually
independent if and only if

p(x1 , x2 , . . . , xn ) ≡ p1 (x1 )p2 (x2 ) · · · pn (xn ).

Suppose X1, X2, . . . , Xn are mutually independent. Then

P(a_1 < X_1 < b_1, a_2 < X_2 < b_2, \ldots, a_n < X_n < b_n)
  = P(a_1 < X_1 < b_1)P(a_2 < X_2 < b_2)\cdots P(a_n < X_n < b_n)
  = \prod_{i=1}^{n} P(a_i < X_i < b_i),

where the symbol \prod_{i=1}^{n} \varphi(i) is defined to be

\prod_{i=1}^{n} \varphi(i) = \varphi(1)\varphi(2)\cdots\varphi(n).

The theorem that

E[u(X_1)v(X_2)] = E[u(X_1)]E[v(X_2)]

for independent random variables X1 and X2 becomes, for mutually independent
random variables X1, X2, . . . , Xn,

E[u_1(X_1)u_2(X_2)\cdots u_n(X_n)] = E[u_1(X_1)]E[u_2(X_2)]\cdots E[u_n(X_n)],

or

E\left[\prod_{i=1}^{n} u_i(X_i)\right] = \prod_{i=1}^{n} E[u_i(X_i)].
The moment-generating function (mgf) of the joint distribution of n random
variables X1 , X2 , . . . , Xn is defined as follows. Suppose that
E[exp(t1 X1 + t2 X2 + · · · + tn Xn )]
exists for −hi < ti < hi , i = 1, 2, . . . , n, where each hi is positive. This expectation
is denoted by M (t1 , t2 , . . . , tn ) and it is called the mgf of the joint distribution of
X1 , . . . , Xn (or simply the mgf of X1 , . . . , Xn ). As in the cases of one and two
variables, this mgf is unique and uniquely determines the joint distribution of the
n variables (and hence all marginal distributions). For example, the mgf of the
marginal distributions of Xi is M (0, . . . , 0, ti , 0, . . . , 0), i = 1, 2, . . . , n; that of the
marginal distribution of Xi and Xj is M (0, . . . , 0, ti , 0, . . . , 0, tj , 0, . . . , 0); and so on.
Theorem 2.4.5 of this chapter can be generalized, and the factorization

M(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} M(0, \ldots, 0, t_i, 0, \ldots, 0)     (2.6.6)

is a necessary and sufficient condition for the mutual independence of X1 , X2 , . . . , Xn .


Note that we can write the joint mgf in vector notation as
M (t) = E[exp(t X)], for t ∈ B ⊂ Rn ,
where B = {t : −hi < ti < hi , i = 1, . . . , n}.
The following is a theorem that proves useful in the sequel. It gives the mgf of
a linear combination of independent random variables.
Theorem 2.6.1. Suppose X1, X2, . . . , Xn are n mutually independent random variables. Suppose, for all i = 1, 2, . . . , n, Xi has mgf M_i(t), for −h_i < t < h_i, where
h_i > 0. Let T = \sum_{i=1}^{n} k_i X_i, where k_1, k_2, \ldots, k_n are constants. Then T has the
mgf given by

M_T(t) = \prod_{i=1}^{n} M_i(k_i t), \quad -\min_i\{h_i\} < t < \min_i\{h_i\}.     (2.6.7)

Proof. Assume t is in the interval (-\min_i\{h_i\}, \min_i\{h_i\}). Then, by independence,

M_T(t) = E\left[e^{\sum_{i=1}^{n} t k_i X_i}\right] = E\left[\prod_{i=1}^{n} e^{(t k_i)X_i}\right] = \prod_{i=1}^{n} E\left[e^{t k_i X_i}\right] = \prod_{i=1}^{n} M_i(k_i t),

which completes the proof.



Example 2.6.2. Let X1, X2, and X3 be three mutually independent random variables and let each have the pdf

f(x) = 2x, \quad 0 < x < 1, zero elsewhere.     (2.6.8)

The joint pdf of X1, X2, X3 is f(x_1)f(x_2)f(x_3) = 8x_1x_2x_3, 0 < x_i < 1, i = 1, 2, 3,
zero elsewhere. Then, for illustration, the expected value of 5X_1X_2^3 + 3X_2X_3^4 is

\int_0^1\int_0^1\int_0^1 (5x_1x_2^3 + 3x_2x_3^4)\, 8x_1x_2x_3\, dx_1\, dx_2\, dx_3 = 2.

Let Y be the maximum of X1, X2, and X3. Then, for instance, we have

P(Y \le \tfrac{1}{2}) = P(X_1 \le \tfrac{1}{2}, X_2 \le \tfrac{1}{2}, X_3 \le \tfrac{1}{2}) = \int_0^{1/2}\int_0^{1/2}\int_0^{1/2} 8x_1x_2x_3\, dx_1\, dx_2\, dx_3 = \left(\tfrac{1}{2}\right)^6 = \tfrac{1}{64}.

In a similar manner, we find that the cdf of Y is

G(y) = P(Y \le y) = \begin{cases} 0 & y < 0 \\ y^6 & 0 \le y < 1 \\ 1 & 1 \le y. \end{cases}

Accordingly, the pdf of Y is

g(y) = 6y^5, \quad 0 < y < 1, zero elsewhere.
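A small simulation (our sketch) agrees with P(Y ≤ 1/2) = 1/64: since F(x) = x² on (0, 1) for the pdf (2.6.8), each Xi can be generated as the square root of a uniform variate.

# Sketch: simulate Y = max(X1, X2, X3) for Xi iid with pdf 2x on (0, 1).
set.seed(2)
n <- 1e5
x <- matrix(sqrt(runif(3 * n)), ncol = 3)   # inverse cdf: F(x) = x^2, so X = sqrt(U)
y <- apply(x, 1, max)
c(simulated = mean(y <= 0.5), exact = 1/64)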

Remark 2.6.1. If X1, X2, and X3 are mutually independent, they are pairwise
independent (that is, Xi and Xj, i ≠ j, where i, j = 1, 2, 3, are independent).
However, the following example, attributed to S. Bernstein, shows that pairwise
independence does not necessarily imply mutual independence. Let X1, X2, and X3
have the joint pmf

p(x_1, x_2, x_3) = \tfrac{1}{4}, \quad (x_1, x_2, x_3) \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)\}, zero elsewhere.

The joint pmf of Xi and Xj, i ≠ j, is

p_{ij}(x_i, x_j) = \tfrac{1}{4}, \quad (x_i, x_j) \in \{(0, 0), (1, 0), (0, 1), (1, 1)\}, zero elsewhere,

whereas the marginal pmf of Xi is

p_i(x_i) = \tfrac{1}{2}, \quad x_i = 0, 1, zero elsewhere.

Obviously, if i ≠ j, we have

p_{ij}(x_i, x_j) \equiv p_i(x_i)p_j(x_j),

and thus Xi and Xj are independent. However,

p(x_1, x_2, x_3) \not\equiv p_1(x_1)p_2(x_2)p_3(x_3).

Thus X1 , X2 , and X3 are not mutually independent.


Unless there is a possible misunderstanding between mutual and pairwise inde-
pendence, we usually drop the modifier mutual. Accordingly, using this practice in
Example 2.6.2, we say that X1 , X2 , X3 are independent random variables, meaning
that they are mutually independent. Occasionally, for emphasis, we use mutually
independent so that the reader is reminded that this is different from pairwise in-
dependence.
In addition, if several random variables are mutually independent and have
the same distribution, we say that they are independent and identically dis-
tributed, which we abbreviate as iid. So the random variables in Example 2.6.2
are iid with the common pdf given in expression (2.6.8).

The following is a useful corollary to Theorem 2.6.1 for iid random variables. Its
proof is asked for in Exercise 2.6.7.

Corollary 2.6.1. Suppose X1, X2, . . . , Xn are iid random variables with the common mgf M(t), for −h < t < h, where h > 0. Let T = \sum_{i=1}^{n} X_i. Then T has the
mgf given by

M_T(t) = [M(t)]^n, \quad -h < t < h.     (2.6.9)
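As an illustration (ours, and not tied to the corollary's proof): for iid exponential variables with mgf M(t) = 1/(1 − t) (see Exercise 2.4.6), the corollary gives T = X1 + X2 + X3 the mgf (1 − t)^{-3}, which a simulation can approximate.

# Sketch: empirical check of M_T(t) = [M(t)]^n for three iid exponential(1) variables.
set.seed(3)
n <- 1e5;  t <- 0.3
T <- rexp(n) + rexp(n) + rexp(n)
c(empirical = mean(exp(t * T)), closed_form = (1 - t)^-3)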


2.6.1 Multivariate Variance-Covariance Matrix
This section makes explicit use of matrix algebra and it is considered as an optional
section.
In Section 2.5 we discussed the covariance between two random variables. In
this section we want to extend this discussion to the n-variate case. Let X =
(X1 , . . . , Xn ) be an n-dimensional random vector. Recall that we defined E(X) =
(E(X1 ), . . . , E(Xn )) , that is, the expectation of a random vector is just the vector
of the expectations of its components. Now suppose W is an m × n matrix of
random variables, say, W = [Wij ] for the random variables Wij , 1 ≤ i ≤ m and
1 ≤ j ≤ n. Note that we can always string out the matrix into an mn × 1 random
vector. Hence, we define the expectation of a random matrix

E[W] = [E(Wij )]. (2.6.10)

As the following theorem shows, the linearity of the expectation operator easily
follows from this definition:

Theorem 2.6.2. Let W1 and W2 be m × n matrices of random variables, let A1


and A2 be k × m matrices of constants, and let B be an n × l matrix of constants.

Then

E[A1 W1 + A2 W2 ] = A1 E[W1 ] + A2 E[W2 ] (2.6.11)


E[A1 W1 B] = A1 E[W1 ]B. (2.6.12)

Proof: Because of the linearity of the operator E on random variables, we have for
the (i, j)th components of expression (2.6.11) that

E\left[\sum_{s=1}^{m} a_{1is}W_{1sj} + \sum_{s=1}^{m} a_{2is}W_{2sj}\right] = \sum_{s=1}^{m} a_{1is}E[W_{1sj}] + \sum_{s=1}^{m} a_{2is}E[W_{2sj}].

Hence by (2.6.10), expression (2.6.11) is true. The derivation of expression (2.6.12)


follows in the same manner.

Let X = (X1 , . . . , Xn ) be an n-dimensional random vector, such that σi2 =


Var(Xi ) < ∞. The mean of X is μ = E[X] and we define its variance-covariance
matrix as
Cov(X) = E[(X − μ)(X − μ) ] = [σij ], (2.6.13)
where σii denotes σi2 . As Exercise 2.6.8 shows, the ith diagonal entry of Cov(X) is
σi2 = Var(Xi ) and the (i, j)th off diagonal entry is Cov(Xi , Xj ).

Example 2.6.3 (Example 2.5.6, Continued). In Example 2.5.6, we considered


the joint pdf
" −y
e 0<x<y<∞
f (x, y) =
0 elsewhere,
and showed that the first two moments are

μ1 = 1, μ2 = 2
σ12 = 1, σ22 = 2 (2.6.14)
E[(X − μ1 )(Y − μ2 )] = 1.

Let Z = (X, Y)'. Then using the present notation, we have

E[Z] = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \quad\text{and}\quad \operatorname{Cov}(Z) = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}.

Two properties of Cov(Xi , Xj ) needed later are summarized in the following


theorem:

Theorem 2.6.3. Let X = (X1 , . . . , Xn ) be an n-dimensional random vector, such


that σi2 = σii = Var(Xi ) < ∞. Let A be an m × n matrix of constants. Then

Cov(X) = E[XX ] − μμ (2.6.15)


Cov(AX) = ACov(X)A . (2.6.16)

Proof: Use Theorem 2.6.2 to derive (2.6.15); i.e.,

Cov(X) = E[(X − μ)(X − μ) ]


= E[XX − μX − Xμ + μμ ]
= E[XX ] − μE[X ] − E[X]μ + μμ ,

which is the desired result. The proof of (2.6.16) is left as an exercise.

All variance-covariance matrices are positive semi-definite matrices; that is,


a Cov(X)a ≥ 0, for all vectors a ∈ Rn . To see this let X be a random vector and
let a be any n × 1 vector of constants. Then Y = a X is a random variable and,
hence, has nonnegative variance; i.e.,

0 ≤ Var(Y ) = Var(a X) = a Cov(X)a; (2.6.17)

hence, Cov(X) is positive semi-definite.
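As a small numerical illustration (ours), the matrix of Example 2.6.3 can be checked for positive semi-definiteness, and formula (2.6.16) applied to a particular matrix A of constants:

# Sketch: the variance-covariance matrix of Example 2.6.3.
C <- matrix(c(1, 1, 1, 2), nrow = 2)        # Cov(Z) for Z = (X, Y)'
eigen(C)$values                              # both positive: positive (semi-)definite
A <- matrix(c(1, 1, 1, -1), nrow = 2, byrow = TRUE)
A %*% C %*% t(A)                             # Cov(AZ) by (2.6.16): variances 5 and 1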

EXERCISES
2.6.1. Let X, Y, Z have joint pdf f (x, y, z) = 2(x + y + z)/3, 0 < x < 1, 0 < y <
1, 0 < z < 1, zero elsewhere.
(a) Find the marginal probability density functions of X, Y, and Z.
(b) Compute P(0 < X < 1/2, 0 < Y < 1/2, 0 < Z < 1/2) and P(0 < X < 1/2) = P(0 <
Y < 1/2) = P(0 < Z < 1/2).
(c) Are X, Y , and Z independent?
(d) Calculate E(X 2 Y Z + 3XY 4 Z 2 ).
(e) Determine the cdf of X, Y, and Z.
(f ) Find the conditional distribution of X and Y , given Z = z, and evaluate
E(X + Y |z).
(g) Determine the conditional distribution of X, given Y = y and Z = z, and
compute E(X|y, z).
2.6.2. Let f (x1 , x2 , x3 ) = exp[−(x1 + x2 + x3 )], 0 < x1 < ∞, 0 < x2 < ∞, 0 <
x3 < ∞, zero elsewhere, be the joint pdf of X1 , X2 , X3 .
(a) Compute P (X1 < X2 < X3 ) and P (X1 = X2 < X3 ).
(b) Determine the joint mgf of X1 , X2 , and X3 . Are these random variables
independent?
2.6.3. Let X1 , X2 , X3 , and X4 be four independent random variables, each with
pdf f (x) = 3(1 − x)2 , 0 < x < 1, zero elsewhere. If Y is the minimum of these four
variables, find the cdf and the pdf of Y .
Hint: P (Y > y) = P (Xi > y , i = 1, . . . , 4).

2.6.4. A fair die is cast at random three independent times. Let the random variable
Xi be equal to the number of spots that appear on the ith trial, i = 1, 2, 3. Let the
random variable Y be equal to max(Xi ). Find the cdf and the pmf of Y .
Hint: P (Y ≤ y) = P (Xi ≤ y, i = 1, 2, 3).
2.6.5. Let M (t1 , t2 , t3 ) be the mgf of the random variables X1 , X2 , and X3 of
Bernstein’s example, described in the remark following Example 2.6.2. Show that

M (t1 , t2 , 0) = M (t1 , 0, 0)M (0, t2 , 0), M (t1 , 0, t3 ) = M (t1 , 0, 0)M (0, 0, t3),

and
M (0, t2 , t3 ) = M (0, t2 , 0)M (0, 0, t3)
are true, but that

M (t1 , t2 , t3 ) = M (t1 , 0, 0)M (0, t2 , 0)M (0, 0, t3 ).

Thus X1 , X2 , X3 are pairwise independent but not mutually independent.


2.6.6. Let X1 , X2 , and X3 be three random variables with means, variances, and
correlation coefficients, denoted by μ1 , μ2 , μ3 ; σ12 , σ22 , σ32 ; and ρ12 , ρ13 , ρ23 , respec-
tively. For constants b2 and b3 , suppose E(X1 −μ1 |x2 , x3 ) = b2 (x2 −μ2 )+b3 (x3 −μ3 ).
Determine b2 and b3 in terms of the variances and the correlation coefficients.
2.6.7. Prove Corollary 2.6.1.
2.6.8. Let X = (X1 , . . . , Xn ) be an n-dimensional random vector, with the variance-
covariance matrix given in display (2.6.13). Show that the ith diagonal entry of
Cov(X) is σi2 = Var(Xi ) and that the (i, j)th off diagonal entry is Cov(Xi , Xj ).
2.6.9. Let X1 , X2 , X3 be iid with common pdf f (x) = exp(−x), 0 < x < ∞, zero
elsewhere. Evaluate:
(a) P (X1 < X2 |X1 < 2X2 ).
(b) P (X1 < X2 < X3 |X3 < 1).

2.7 Transformations for Several Random Variables


In Section 2.2 it was seen that the determination of the joint pdf of two functions of
two random variables of the continuous type was essentially a corollary to a theorem
in analysis having to do with the change of variables in a twofold integral. This
theorem has a natural extension to n-fold integrals. This extension is as follows.
Consider an integral of the form

\int\cdots\int_A f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n

taken over a subset A of an n-dimensional space S. Let

y1 = u1 (x1 , x2 , . . . , xn ), y2 = u2 (x1 , x2 , . . . , xn ), . . . , yn = un (x1 , x2 , . . . , xn ),



together with the inverse functions

x1 = w1 (y1 , y2 , . . . , yn ), x2 = w2 (y1 , y2 , . . . , yn ), . . . , xn = wn (y1 , y2 , . . . , yn )

define a one-to-one transformation that maps S onto T in the y1 , y2 , . . . , yn space


and, hence, maps the subset A of S onto a subset B of T . Let the first partial
derivatives of the inverse functions be continuous and let the n by n determinant
(called the Jacobian)

J = \begin{vmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{vmatrix}

not be identically zero in T. Then

\int\cdots\int_A f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n
  = \int\cdots\int_B f[w_1(y_1, \ldots, y_n), w_2(y_1, \ldots, y_n), \ldots, w_n(y_1, \ldots, y_n)]\, |J|\, dy_1\, dy_2 \cdots dy_n.

Whenever the conditions of this theorem are satisfied, we can determine the joint pdf
of n functions of n random variables. Appropriate changes of notation in Section
2.2 (to indicate n-space as opposed to 2-space) are all that are needed to show
that the joint pdf of the random variables Y1 = u1 (X1 , X2 , . . . , Xn ), . . . , Yn =
un (X1 , X2 , . . . , Xn ), where the joint pdf of X1 , . . . , Xn is f (x1 , . . . , xn ), is given by

g(y1 , y2 , . . . , yn ) = f [w1 (y1 , . . . , yn ), . . . , wn (y1 , . . . , yn )]|J|,

where (y1 , y2 , . . . , yn ) ∈ T , and is zero elsewhere.


Example 2.7.1. Let X1, X2, X3 have the joint pdf

f(x_1, x_2, x_3) = 48x_1x_2x_3, \quad 0 < x_1 < x_2 < x_3 < 1, zero elsewhere.     (2.7.1)

If Y_1 = X_1/X_2, Y_2 = X_2/X_3, and Y_3 = X_3, then the inverse transformation is given by

x_1 = y_1y_2y_3, \quad x_2 = y_2y_3, \quad\text{and}\quad x_3 = y_3.

The Jacobian is given by

J = \begin{vmatrix} y_2y_3 & y_1y_3 & y_1y_2 \\ 0 & y_3 & y_2 \\ 0 & 0 & 1 \end{vmatrix} = y_2y_3^2.
Moreover, inequalities defining the support are equivalent to

0 < y1 y2 y3 , y1 y2 y3 < y2 y3 , y2 y3 < y3 , and y3 < 1,



which reduces to the support T of Y1 , Y2 , Y3 of

T = {(y1 , y2 , y3 ) : 0 < yi < 1, i = 1, 2, 3}.

Hence the joint pdf of Y1, Y2, Y3 is

g(y_1, y_2, y_3) = 48(y_1y_2y_3)(y_2y_3)y_3\,|y_2y_3^2| = 48y_1y_2^3y_3^5, \quad 0 < y_i < 1,\; i = 1, 2, 3, zero elsewhere.     (2.7.2)

The marginal pdfs are

g_1(y_1) = 2y_1, \quad 0 < y_1 < 1, zero elsewhere
g_2(y_2) = 4y_2^3, \quad 0 < y_2 < 1, zero elsewhere
g_3(y_3) = 6y_3^5, \quad 0 < y_3 < 1, zero elsewhere.

Because g(y1 , y2 , y3 ) = g1 (y1 )g2 (y2 )g3 (y3 ), the random variables Y1 , Y2 , Y3 are mu-
tually independent.
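A Monte Carlo check is possible here (our sketch) because (2.7.1) equals 3! times the product of three copies of the pdf 2x restricted to x1 < x2 < x3, i.e., the joint pdf of the ordered values of three iid draws with pdf 2x; that observation is used below to simulate (Y1, Y2, Y3) and compare sample means with the means of g1, g2, g3.

# Sketch: simulate (Y1, Y2, Y3) of Example 2.7.1 and check the marginal means.
set.seed(4)
n  <- 1e5
x  <- t(apply(matrix(sqrt(runif(3 * n)), ncol = 3), 1, sort))  # ordered draws, pdf 2x
y1 <- x[, 1] / x[, 2];  y2 <- x[, 2] / x[, 3];  y3 <- x[, 3]
round(c(mean(y1), mean(y2), mean(y3)), 3)   # near 2/3, 4/5, 6/7
round(cor(cbind(y1, y2, y3)), 2)            # off-diagonals near zero, as independence suggests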
Example 2.7.2. Let X1, X2, X3 be iid with common pdf

f(x) = e^{-x}, \quad 0 < x < \infty, zero elsewhere.

Consequently, the joint pdf of X1, X2, X3 is

f_{X_1,X_2,X_3}(x_1, x_2, x_3) = e^{-\sum_{i=1}^{3} x_i}, \quad 0 < x_i < \infty,\; i = 1, 2, 3, zero elsewhere.

Consider the random variables Y1, Y2, Y3 defined by

Y_1 = \frac{X_1}{X_1 + X_2 + X_3}, \quad Y_2 = \frac{X_2}{X_1 + X_2 + X_3}, \quad\text{and}\quad Y_3 = X_1 + X_2 + X_3.

Hence, the inverse transformation is given by

x_1 = y_1y_3, \quad x_2 = y_2y_3, \quad\text{and}\quad x_3 = y_3 - y_1y_3 - y_2y_3,

with the Jacobian

J = \begin{vmatrix} y_3 & 0 & y_1 \\ 0 & y_3 & y_2 \\ -y_3 & -y_3 & 1 - y_1 - y_2 \end{vmatrix} = y_3^2.

The support of X1 , X2 , X3 maps onto

0 < y1 y3 < ∞, 0 < y2 y3 < ∞, and 0 < y3 (1 − y1 − y2 ) < ∞,

which is equivalent to the support T given by

T = {(y1 , y2 , y3 ) : 0 < y1 , 0 < y2 , 0 < 1 − y1 − y2 , 0 < y3 < ∞}.



Hence the joint pdf of Y1 , Y2 , Y3 is

g(y1 , y2 , y3 ) = y32 e−y3 , (y1 , y2 , y3 ) ∈ T .

The marginal pdf of Y1 is


 1−y1  ∞
g1 (y1 ) = y32 e−y3 dy3 dy2 = 2(1 − y1 ), 0 < y1 < 1,
0 0

zero elsewhere. Likewise the marginal pdf of Y2 is

g2 (y2 ) = 2(1 − y2 ), 0 < y2 < 1,

zero elsewhere, while the pdf of Y3 is
\[
g_3(y_3) = \int_0^1 \int_0^{1-y_1} y_3^2 e^{-y_3}\, dy_2\, dy_1 = \frac{1}{2}\, y_3^2 e^{-y_3}, \quad 0 < y_3 < \infty,
\]
zero elsewhere. Because g(y1, y2, y3) ≠ g1(y1)g2(y2)g3(y3), Y1, Y2, Y3 are dependent random variables.
Note, however, that the joint pdf of Y1 and Y3 is
\[
g_{13}(y_1, y_3) = \int_0^{1-y_1} y_3^2 e^{-y_3}\, dy_2 = (1 - y_1)\, y_3^2 e^{-y_3}, \quad 0 < y_1 < 1,\ 0 < y_3 < \infty,
\]

zero elsewhere. Hence Y1 and Y3 are independent. In a similar manner, Y2 and Y3 are also independent. Because the joint pdf of Y1 and Y2 is
\[
g_{12}(y_1, y_2) = \int_0^{\infty} y_3^2 e^{-y_3}\, dy_3 = 2, \quad 0 < y_1,\ 0 < y_2,\ y_1 + y_2 < 1,
\]

zero elsewhere, Y1 and Y2 are seen to be dependent.
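A short simulation is consistent with these statements; the following sketch (added here for illustration, assuming NumPy) estimates a few moments and correlations.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(size=(200_000, 3))   # X1, X2, X3 iid with pdf e^(-x), x > 0
    s = x.sum(axis=1)
    y1, y2, y3 = x[:, 0] / s, x[:, 1] / s, s
    print(y1.mean(), y3.mean())              # approx 1/3 and 3, the means under g1 and g3
    print(np.corrcoef(y1, y3)[0, 1])         # approx 0, consistent with Y1 and Y3 independent
    print(np.corrcoef(y1, y2)[0, 1])         # approx -0.5, so Y1 and Y2 are dependent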


We now consider some other problems that are encountered when transforming variables. Let X have the Cauchy pdf
\[
f(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty,
\]

and let Y = X². We seek the pdf g(y) of Y. Consider the transformation y = x².


This transformation maps the space of X, namely S = {x : −∞ < x < ∞}, onto
T = {y : 0 ≤ y < ∞}. However, the transformation is not one-to-one. To each
y ∈ T , with the exception of y = 0, there correspond two points x ∈ S. For
example, if y = 4, we may have either x = 2 or x = −2. In such an instance, we
represent S as the union of two disjoint sets A1 and A2 such that y = x² defines
a one-to-one transformation that maps each of A1 and A2 onto T . If we take A1
to be {x : −∞ < x < 0} and A2 to be {x : 0 ≤ x < ∞}, we see that A1 is
mapped onto {y : 0 < y < ∞}, whereas A2 is mapped onto {y : 0 ≤ y < ∞},
and these sets are not the same. Our difficulty is caused by the fact that x = 0
is an element of S. Why, then, do we not return to the Cauchy pdf and take
f(0) = 0? Then our new S is S = {x : −∞ < x < ∞, x ≠ 0}. We then take A1 = {x : −∞ < x < 0} and A2 = {x : 0 < x < ∞}. Thus y = x², with the inverse x = −√y, maps A1 onto T = {y : 0 < y < ∞} and the transformation is one-to-one. Moreover, the transformation y = x², with inverse x = √y, maps A2 onto T = {y : 0 < y < ∞} and the transformation is one-to-one. Consider the probability P(Y ∈ B), where B ⊂ T. Let A3 = {x : x = −√y, y ∈ B} ⊂ A1 and let A4 = {x : x = √y, y ∈ B} ⊂ A2. Then Y ∈ B when and only when X ∈ A3 or X ∈ A4. Thus we have

\[
P(Y \in B) = P(X \in A_3) + P(X \in A_4) = \int_{A_3} f(x)\, dx + \int_{A_4} f(x)\, dx.
\]
In the first of these integrals, let x = −√y. Thus the Jacobian, say J1, is −1/(2√y); furthermore, the set A3 is mapped onto B. In the second integral let x = √y. Thus the Jacobian, say J2, is 1/(2√y); furthermore, the set A4 is also mapped onto B.
Finally,
\[
P(Y \in B) = \int_B f(-\sqrt{y}) \left| -\frac{1}{2\sqrt{y}} \right| dy + \int_B f(\sqrt{y})\, \frac{1}{2\sqrt{y}}\, dy
= \int_B \left[ f(-\sqrt{y}) + f(\sqrt{y}) \right] \frac{1}{2\sqrt{y}}\, dy.
\]

Hence the pdf of Y is given by
\[
g(y) = \frac{1}{2\sqrt{y}} \left[ f(-\sqrt{y}) + f(\sqrt{y}) \right], \quad y \in T.
\]

With f(x) the Cauchy pdf we have
\[
g(y) = \begin{cases} \dfrac{1}{\pi (1 + y) \sqrt{y}} & 0 < y < \infty \\ 0 & \text{elsewhere.} \end{cases}
\]
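As an informal check of this pdf (a sketch added here, assuming NumPy and SciPy are available), the probability that Y falls in an interval can be estimated by squaring simulated Cauchy variates and compared with the corresponding integral of g.

    import numpy as np
    from scipy.integrate import quad

    rng = np.random.default_rng(0)
    y = rng.standard_cauchy(size=500_000) ** 2              # Y = X^2 with X Cauchy
    g = lambda t: 1.0 / (np.pi * (1.0 + t) * np.sqrt(t))    # pdf derived above
    print(np.mean((0.5 < y) & (y < 2.0)))                   # empirical P(0.5 < Y < 2)
    print(quad(g, 0.5, 2.0)[0])                             # integral of g over (0.5, 2)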

In the preceding discussion of a random variable of the continuous type, we had two inverse functions, x = −√y and x = √y. That is why we sought to partition S (or a modification of S) into two disjoint subsets such that the transformation y = x² maps each onto the same T. Had there been three inverse functions, we would have sought to partition S (or a modified form of S) into three disjoint subsets, and so on. It is hoped that this detailed discussion makes the following paragraph easier to read.
Let f (x1 , x2 , . . . , xn ) be the joint pdf of X1 , X2 , . . . , Xn , which are random vari-
ables of the continuous type. Let S denote the n-dimensional space where this joint
pdf f (x1 , x2 , . . . , xn ) > 0, and consider the transformation y1 = u1 (x1 , x2 , . . . , xn ),
. . . , yn = un (x1 , x2 , . . . , xn ), which maps S onto T in the y1 , y2 , . . . , yn space. To
each point of S there corresponds, of course, only one point in T ; but to a point
in T there may correspond more than one point in S. That is, the transformation
may not be one-to-one. Suppose, however, that we can represent S as the union of
a finite number, say k, of mutually disjoint sets A1 , A2 , . . . , Ak so that

y1 = u1 (x1 , x2 , . . . , xn ), . . . , yn = un (x1 , x2 , . . . , xn )

define a one-to-one transformation of each Ai onto T . Thus to each point in T


there corresponds exactly one point in each of A1 , A2 , . . . , Ak . For i = 1, . . . , k, let

x1 = w1i (y1 , y2 , . . . , yn ), x2 = w2i (y1 , y2 , . . . , yn ), . . . , xn = wni (y1 , y2 , . . . , yn ),

denote the k groups of n inverse functions, one group for each of these k transfor-
mations. Let the first partial derivatives be continuous and let each
\[
J_i = \begin{vmatrix}
\dfrac{\partial w_{1i}}{\partial y_1} & \dfrac{\partial w_{1i}}{\partial y_2} & \cdots & \dfrac{\partial w_{1i}}{\partial y_n} \\[8pt]
\dfrac{\partial w_{2i}}{\partial y_1} & \dfrac{\partial w_{2i}}{\partial y_2} & \cdots & \dfrac{\partial w_{2i}}{\partial y_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial w_{ni}}{\partial y_1} & \dfrac{\partial w_{ni}}{\partial y_2} & \cdots & \dfrac{\partial w_{ni}}{\partial y_n}
\end{vmatrix}, \quad i = 1, 2, \ldots, k,
\]

be not identically equal to zero in T . Considering the probability of the union


of k mutually exclusive events and by applying the change-of-variable technique
to the probability of each of these events, it can be seen that the joint pdf of
Y1 = u1 (X1 , X2 , . . . , Xn ), Y2 = u2 (X1 , X2 , . . . , Xn ), . . . , Yn = un (X1 , X2 , . . . , Xn ),
is given by
\[
g(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{k} f[w_{1i}(y_1, \ldots, y_n), \ldots, w_{ni}(y_1, \ldots, y_n)]\, |J_i|,
\]

provided that (y1 , y2 , . . . , yn ) ∈ T , and equals zero elsewhere. The pdf of any Yi ,
say Y1, is then
\[
g_1(y_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(y_1, y_2, \ldots, y_n)\, dy_2 \cdots dy_n.
\]

Example 2.7.3. Let X1 and X2 have the joint pdf defined over the unit circle given by
\[
f(x_1, x_2) = \begin{cases} \dfrac{1}{\pi} & 0 < x_1^2 + x_2^2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\]
Let Y1 = X1² + X2² and Y2 = X1²/(X1² + X2²). Thus y1 y2 = x1² and x2² = y1(1 − y2). The support S maps onto T = {(y1, y2) : 0 < yi < 1, i = 1, 2}. For each ordered pair (y1, y2) ∈ T, there are four points in S, given by

(x1, x2) such that x1 = √(y1 y2) and x2 = √(y1(1 − y2)),
(x1, x2) such that x1 = √(y1 y2) and x2 = −√(y1(1 − y2)),
(x1, x2) such that x1 = −√(y1 y2) and x2 = √(y1(1 − y2)),
and (x1, x2) such that x1 = −√(y1 y2) and x2 = −√(y1(1 − y2)).

The value of the first Jacobian is
\[
J_1 = \begin{vmatrix}
\tfrac{1}{2}\sqrt{y_2/y_1} & \tfrac{1}{2}\sqrt{y_1/y_2} \\[6pt]
\tfrac{1}{2}\sqrt{(1-y_2)/y_1} & -\tfrac{1}{2}\sqrt{y_1/(1-y_2)}
\end{vmatrix}
= \frac{1}{4}\left[ -\sqrt{\frac{1-y_2}{y_2}} - \sqrt{\frac{y_2}{1-y_2}} \right]
= -\frac{1}{4\sqrt{y_2(1-y_2)}}.
\]

It is easy to see that the absolute value of each of the four Jacobians equals 1/[4√(y2(1 − y2))]. Hence, the joint pdf of Y1 and Y2 is the sum of four terms and can be written as
\[
g(y_1, y_2) = 4\, \frac{1}{\pi}\, \frac{1}{4\sqrt{y_2(1 - y_2)}} = \frac{1}{\pi \sqrt{y_2(1 - y_2)}}, \quad (y_1, y_2) \in T.
\]

Thus Y1 and Y2 are independent random variables by Theorem 2.4.1.
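The factorization can also be seen numerically; here is a minimal sketch (not part of the text; it assumes NumPy) that draws uniformly from the unit circle by rejection and examines Y1 and Y2.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.uniform(-1, 1, size=(400_000, 2))
    p = p[(p ** 2).sum(axis=1) < 1]      # rejection: keep points uniform over the unit circle
    y1 = (p ** 2).sum(axis=1)            # Y1 = X1^2 + X2^2
    y2 = p[:, 0] ** 2 / y1               # Y2 = X1^2 / (X1^2 + X2^2)
    print(y1.mean(), y2.mean())          # both approx 1/2, the means of the two marginals
    print(np.corrcoef(y1, y2)[0, 1])     # approx 0, consistent with independence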

Of course, as in the bivariate case, we can use the mgf technique by noting that
if Y = g(X1 , X2 , . . . , Xn ) is a function of the random variables, then the mgf of Y
is given by
\[
E\left(e^{tY}\right) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} e^{t\, g(x_1, x_2, \ldots, x_n)} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n,
\]

in the continuous case, where f (x1 , x2 , . . . , xn ) is the joint pdf. In the discrete case,
summations replace the integrals. This procedure is particularly useful in cases in
which we are dealing with linear functions of independent random variables.

Example 2.7.4 (Extension of Example 2.2.6). Let X1, X2, X3 be independent random variables with joint pmf
\[
p(x_1, x_2, x_3) = \begin{cases} \dfrac{\mu_1^{x_1} \mu_2^{x_2} \mu_3^{x_3}\, e^{-\mu_1 - \mu_2 - \mu_3}}{x_1!\, x_2!\, x_3!} & x_i = 0, 1, 2, \ldots,\ i = 1, 2, 3 \\[8pt] 0 & \text{elsewhere.} \end{cases}
\]

If Y = X1 + X2 + X3, the mgf of Y is
\[
E\left(e^{tY}\right) = E\left(e^{t(X_1 + X_2 + X_3)}\right) = E\left(e^{tX_1} e^{tX_2} e^{tX_3}\right) = E\left(e^{tX_1}\right) E\left(e^{tX_2}\right) E\left(e^{tX_3}\right),
\]

because of the independence of X1, X2, X3. In Example 2.2.6, we found that
\[
E\left(e^{tX_i}\right) = \exp\{\mu_i(e^t - 1)\}, \quad i = 1, 2, 3.
\]
Hence,
\[
E\left(e^{tY}\right) = \exp\{(\mu_1 + \mu_2 + \mu_3)(e^t - 1)\}.
\]

This, however, is the mgf of the pmf
\[
p_Y(y) = \begin{cases} \dfrac{(\mu_1 + \mu_2 + \mu_3)^y\, e^{-(\mu_1 + \mu_2 + \mu_3)}}{y!} & y = 0, 1, 2, \ldots \\[8pt] 0 & \text{elsewhere,} \end{cases}
\]

so Y = X1 + X2 + X3 has this distribution.
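A quick empirical check of this conclusion (a sketch added here, assuming NumPy and SciPy; the rates below are arbitrary illustrative values) compares the simulated pmf of Y with the pmf above.

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(0)
    mu = np.array([1.0, 2.0, 0.5])                        # arbitrary values for mu1, mu2, mu3
    y = rng.poisson(mu, size=(200_000, 3)).sum(axis=1)    # Y = X1 + X2 + X3
    for k in range(5):
        # empirical P(Y = k) versus the pmf with parameter mu1 + mu2 + mu3
        print(k, np.mean(y == k), poisson.pmf(k, mu.sum()))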


Example 2.7.5. Let X1, X2, X3, X4 be independent random variables with common pdf
\[
f(x) = \begin{cases} e^{-x} & x > 0 \\ 0 & \text{elsewhere.} \end{cases}
\]
If Y = X1 + X2 + X3 + X4 , then similar to the argument in the last example, the
independence of X1 , X2 , X3 , X4 implies that
         
\[
E\left(e^{tY}\right) = E\left(e^{tX_1}\right) E\left(e^{tX_2}\right) E\left(e^{tX_3}\right) E\left(e^{tX_4}\right).
\]

In Section 1.9, we saw that


 
\[
E\left(e^{tX_i}\right) = (1 - t)^{-1}, \quad t < 1,\ i = 1, 2, 3, 4.
\]
Hence,
\[
E\left(e^{tY}\right) = (1 - t)^{-4}.
\]
In Section 3.3, we find that this is the mgf of a distribution with pdf
\[
f_Y(y) = \begin{cases} \dfrac{1}{3!}\, y^3 e^{-y} & 0 < y < \infty \\[4pt] 0 & \text{elsewhere.} \end{cases}
\]

Accordingly, Y has this distribution.
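Again, a brief simulation (added here as a sketch; it assumes NumPy and SciPy) agrees with this conclusion.

    import numpy as np
    from scipy.stats import gamma

    rng = np.random.default_rng(0)
    y = rng.exponential(size=(200_000, 4)).sum(axis=1)   # Y = X1 + X2 + X3 + X4
    print(y.mean(), y.var())                             # both approx 4 under the pdf y^3 e^(-y)/3!
    # The cdf of that pdf is available in SciPy as the gamma distribution with shape 4:
    print(np.mean(y < 3.0), gamma.cdf(3.0, a=4))         # empirical vs exact P(Y < 3)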


EXERCISES
2.7.1. Let X1, X2, X3 be iid, each with the distribution having pdf f(x) = e^{-x}, 0 < x < ∞, zero elsewhere. Show that
\[
Y_1 = \frac{X_1}{X_1 + X_2}, \quad Y_2 = \frac{X_1 + X_2}{X_1 + X_2 + X_3}, \quad Y_3 = X_1 + X_2 + X_3
\]
are mutually independent.
2.7.2. If f(x) = 1/2, −1 < x < 1, zero elsewhere, is the pdf of the random variable X, find the pdf of Y = X².

2.7.3. If X has the pdf f(x) = 1/4, −1 < x < 3, zero elsewhere, find the pdf of Y = X².
Hint: Here T = {y : 0 ≤ y < 9} and the event Y ∈ B is the union of two mutually
exclusive events if B = {y : 0 < y < 1}.
2.7.4. Let X1, X2, X3 be iid with common pdf f(x) = e^{-x}, x > 0, 0 elsewhere. Find the joint pdf of Y1 = X1, Y2 = X1 + X2, and Y3 = X1 + X2 + X3.

2.7.5. Let X1, X2, X3 be iid with common pdf f(x) = e^{-x}, x > 0, 0 elsewhere. Find the joint pdf of Y1 = X1/X2, Y2 = X3/(X1 + X2), and Y3 = X1 + X2. Are Y1, Y2, Y3 mutually independent?
2.7.6. Let X1, X2 have the joint pdf f(x1, x2) = 1/π, 0 < x1² + x2² < 1. Let Y1 = X1² + X2² and Y2 = X2. Find the joint pdf of Y1 and Y2.

2.7.7. Let X1, X2, X3, X4 have the joint pdf f(x1, x2, x3, x4) = 24, 0 < x1 < x2 < x3 < x4 < 1, 0 elsewhere. Find the joint pdf of Y1 = X1/X2, Y2 = X2/X3, Y3 = X3/X4, Y4 = X4, and show that they are mutually independent.
2.7.8. Let X1, X2, X3 be iid with common mgf M(t) = ((3/4) + (1/4)e^t)², for all t ∈ R.
(a) Determine the probabilities, P (X1 = k), k = 0, 1, 2.
(b) Find the mgf of Y = X1 + X2 + X3 and then determine the probabilities,
P (Y = k), k = 0, 1, 2, . . . , 6.

2.8 Linear Combinations of Random Variables


In this section, we summarize some results on linear combinations of random vari-
ables that follow from Section 2.6. These results will prove to be quite useful in
Chapter 3 as well as in succeeding chapters.
Let (X1 , . . . , Xn ) denote a random vector. In this section, we consider linear
combinations of these variables, writing them, generally, as
\[
T = \sum_{i=1}^{n} a_i X_i, \tag{2.8.1}
\]

for specified constants a1 , . . . , an . We obtain expressions for the mean and variance
of T .
The mean of T follows immediately from linearity of expectation. For reference,
we state it formally as a theorem.
Theorem 2.8.1. Suppose T is given by expression (2.8.1). Suppose E(Xi) = μi, for i = 1, . . . , n. Then
\[
E(T) = \sum_{i=1}^{n} a_i \mu_i. \tag{2.8.2}
\]

In order to obtain the variance of T , we first state a general result on covariances.


Theorem 2.8.2. Suppose T is the linear combination (2.8.1) and that W is another linear combination given by W = Σ_{i=1}^{m} bi Yi, for random variables Y1, . . . , Ym and specified constants b1, . . . , bm. If E[Xi²] < ∞ and E[Yj²] < ∞ for i = 1, . . . , n and j = 1, . . . , m, then
\[
\operatorname{Cov}(T, W) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \operatorname{Cov}(X_i, Y_j). \tag{2.8.3}
\]

Proof: Using the definition of the covariance and Theorem 2.8.1, we have the first equality below, while the second equality follows from the linearity of E:
\[
\operatorname{Cov}(T, W) = E\left[ \sum_{i=1}^{n} \sum_{j=1}^{m} \big(a_i X_i - a_i E(X_i)\big)\big(b_j Y_j - b_j E(Y_j)\big) \right]
= \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j\, E\big[(X_i - E(X_i))(Y_j - E(Y_j))\big],
\]
which is the desired result.

To obtain the variance of T, simply replace W by T in expression (2.8.3). We state the result as a corollary:

Corollary 2.8.1. Let T = Σ_{i=1}^{n} ai Xi. Provided E[Xi²] < ∞, for i = 1, . . . , n,
\[
\operatorname{Var}(T) = \operatorname{Cov}(T, T) = \sum_{i=1}^{n} a_i^2 \operatorname{Var}(X_i) + 2 \sum_{i<j} a_i a_j \operatorname{Cov}(X_i, X_j). \tag{2.8.4}
\]

Note that if X1, . . . , Xn are independent random variables, then by Theorem 2.5.2 all the pairwise covariances are 0; i.e., Cov(Xi, Xj) = 0 for all i ≠ j. This leads to a simplification of (2.8.4), which we record in the following corollary.
Corollary 2.8.2. If X1, . . . , Xn are independent random variables and Var(Xi) = σi², for i = 1, . . . , n, then
\[
\operatorname{Var}(T) = \sum_{i=1}^{n} a_i^2 \sigma_i^2. \tag{2.8.5}
\]

Note that we need only Xi and Xj to be uncorrelated for all i ≠ j to obtain this result.
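For a concrete illustration of (2.8.4) (a sketch added here, assuming NumPy; the coefficients, variances, and covariances are arbitrary choices), the right side of the formula can be compared with a Monte Carlo estimate of Var(T) for correlated normal variables having those variances and covariances.

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([2.0, -1.0, 3.0])            # arbitrary coefficients a_1, a_2, a_3
    Sigma = np.array([[4.0, 1.0, 0.5],        # Var(X_i) on the diagonal,
                      [1.0, 2.0, 0.3],        # Cov(X_i, X_j) off the diagonal
                      [0.5, 0.3, 1.0]])
    # Right side of (2.8.4): sum a_i^2 Var(X_i) + 2 sum_{i<j} a_i a_j Cov(X_i, X_j)
    var_T = sum(a[i] ** 2 * Sigma[i, i] for i in range(3)) \
        + 2 * sum(a[i] * a[j] * Sigma[i, j] for i in range(3) for j in range(i + 1, 3))
    print(var_T)                              # 27.2 for these numbers
    x = rng.multivariate_normal(np.zeros(3), Sigma, size=500_000)
    print((x @ a).var())                      # Monte Carlo estimate of Var(T), approx 27.2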
Next, in addition to independence, we assume that the random variables have
the same distribution. We call such a collection of random variables a random
sample, a term we now define formally.
Definition 2.8.1. If the random variables X1 , X2 , . . . , Xn are independent and
identically distributed, i.e. each Xi has the same distribution, then we say that
these random variables constitute a random sample of size n from that common
distribution. We abbreviate independent and identically distributed by iid.
In the next two examples, we find some properties of two functions of a random
sample, namely the sample mean and variance.
Example 2.8.1 (Sample Mean). Let X1, . . . , Xn be independent and identically distributed random variables with common mean μ and variance σ². The sample mean is defined by X̄ = n⁻¹ Σ_{i=1}^{n} Xi. This is a linear combination of the sample observations with ai ≡ n⁻¹; hence, by Theorem 2.8.1 and Corollary 2.8.2, we have
\[
E(\bar{X}) = \mu \quad \text{and} \quad \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}. \tag{2.8.6}
\]
Because E(X̄) = μ, we often say that X̄ is unbiased for μ.
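A small simulation illustrating (2.8.6) (a sketch added here, assuming NumPy; the normal distribution, sample size, and moments are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, mu, sigma = 25, 3.0, 2.0                              # arbitrary n, mean, and st. dev.
    xbar = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
    print(xbar.mean(), xbar.var())                           # approx mu = 3 and sigma^2/n = 0.16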



Example 2.8.2 (Sample Variance). Define the sample variance by
\[
S^2 = (n-1)^{-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = (n-1)^{-1} \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right), \tag{2.8.7}
\]
where the second equality follows after some algebra; see Exercise 2.8.1.
In the average that defines the sample variance S², the division is by n − 1 instead of n. One reason for this is that it makes S² unbiased for σ², as next shown. Using the above theorems, the results of the last example, and the facts that E(Xi²) = σ² + μ² and E(X̄²) = (σ²/n) + μ², we have the following:
\[
E(S^2) = (n-1)^{-1} \left( \sum_{i=1}^{n} E(X_i^2) - n E(\bar{X}^2) \right)
= (n-1)^{-1} \left\{ n\sigma^2 + n\mu^2 - n[(\sigma^2/n) + \mu^2] \right\}
= \sigma^2. \tag{2.8.8}
\]
Hence, S² is unbiased for σ².
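The role of the n − 1 divisor is easy to see numerically; a minimal sketch (added here, assuming NumPy; the exponential distribution and sample size are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=(200_000, 10))   # many samples of size n = 10, sigma^2 = 4
    print(x.var(axis=1, ddof=1).mean())                  # divisor n - 1 as in (2.8.7): approx 4 = sigma^2
    print(x.var(axis=1, ddof=0).mean())                  # divisor n: approx 3.6, biased downward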

EXERCISES
2.8.1. Derive the second equality in expression (2.8.7).
2.8.2. Let X1 , X2 , X3 , X4 be four iid random variables having the same pdf f (x) =
2x, 0 < x < 1, zero elsewhere. Find the mean and variance of the sum Y of these
four random variables.
2.8.3. Let X1 and X2 be two independent random variables so that the variances of X1 and X2 are σ1² = k and σ2² = 2, respectively. Given that the variance of Y = 3X2 − X1 is 25, find k.

2.8.4. If the independent variables X1 and X2 have means μ1, μ2 and variances σ1², σ2², respectively, show that the mean and variance of the product Y = X1X2 are μ1μ2 and σ1²σ2² + μ1²σ2² + μ2²σ1², respectively.

2.8.5. Find the mean and variance of the sum Y = Σ_{i=1}^{5} Xi, where X1, . . . , X5 are iid, having pdf f(x) = 6x(1 − x), 0 < x < 1, zero elsewhere.

2.8.6. Determine the mean and variance of the sample mean X̄ = 5⁻¹ Σ_{i=1}^{5} Xi, where X1, . . . , X5 is a random sample from a distribution having pdf f(x) = 4x³, 0 < x < 1, zero elsewhere.

2.8.7. Let X and Y be random variables with μ1 = 1, μ2 = 4, σ1² = 4, σ2² = 6, ρ = 1/2. Find the mean and variance of the random variable Z = 3X − 2Y.

2.8.8. Let X and Y be independent random variables with means μ1, μ2 and variances σ1², σ2². Determine the correlation coefficient of X and Z = X − Y in terms of μ1, μ2, σ1², σ2².

2.8.9. Let μ and σ² denote the mean and variance of the random variable X. Let Y = c + bX, where b and c are real constants. Show that the mean and variance of Y are, respectively, c + bμ and b²σ².

2.8.10. Determine the correlation coefficient of the random variables X and Y if


var(X) = 4, var(Y ) = 2, and var(X + 2Y ) = 15.
2.8.11. Let X and Y be random variables with means μ1, μ2; variances σ1², σ2²; and correlation coefficient ρ. Show that the correlation coefficient of W = aX + b, a > 0, and Z = cY + d, c > 0, is ρ.
2.8.12. A person rolls a die, tosses a coin, and draws a card from an ordinary
deck. He receives $3 for each point up on the die, $10 for a head and $0 for a
tail, and $1 for each spot on the card (jack = 11, queen = 12, king = 13). If we
assume that the three random variables involved are independent and uniformly
distributed, compute the mean and variance of the amount to be received.
2.8.13. Let X1 and X2 be independent random variables with nonzero variances.
Find the correlation coefficient of Y = X1 X2 and X1 in terms of the means and
variances of X1 and X2 .

2.8.14. Let X1 and X2 have a joint distribution with parameters μ1, μ2, σ1², σ2², and ρ. Find the correlation coefficient of the linear functions Y = a1X1 + a2X2 and Z = b1X1 + b2X2 in terms of the real constants a1, a2, b1, b2, and the parameters of the distribution.
2.8.15. Let X1 , X2 , and X3 be random variables with equal variances but with
correlation coefficients ρ12 = 0.3, ρ13 = 0.5, and ρ23 = 0.2. Find the correlation
coefficient of the linear functions Y = X1 + X2 and Z = X2 + X3 .

2.8.16. Find the variance of the sum of 10 random variables if each has variance 5
and if each pair has correlation coefficient 0.5.

2.8.17. Let X and Y have the parameters μ1, μ2, σ1², σ2², and ρ. Show that the correlation coefficient of X and [Y − ρ(σ2/σ1)X] is zero.

2.8.18. Let S² be the sample variance of a random sample from a distribution with variance σ² > 0. Since E(S²) = σ², why isn't E(S) = σ?
Hint: Use Jensen's inequality to show that E(S) < σ.
