Transformations
The topic for this chapter is transformations of random variables and random vectors. After applying a function to a random variable X or random vector X, the goal is to find the distribution of the transformed random variable or the joint distribution of the transformed random vector.
Transformations of random variables appear all over the place in statistics. Here are
a few examples, to preview the kinds of transformations we’ll be looking at in this
chapter.
• Unit conversion: In one dimension, we’ve already seen how standardization and
location-scale transformations can be useful tools for learning about an entire
family of distributions. A location-scale change is linear, converting an r.v. X to
the r.v. Y = aX + b where a and b are constants (with a > 0).
There are also many situations in which we may be interested in nonlinear transformations, e.g., converting from the dollar-yen exchange rate to the yen-dollar exchange rate, or converting information like "Janet's waking hours yesterday consisted of 8 hours of work, 4 hours visiting friends, and 4 hours surfing the web" to the format "Janet was awake for 16 hours yesterday; she spent 1/2 of that time working, 1/4 of that time visiting friends, and 1/4 of that time surfing the web". The change of variables formula, which is the first result in this chapter, shows what happens to the distribution when a random vector is transformed.
• Sums and averages as summaries: It is common in statistics to summarize n
observations by their sum or sample average. Turning X1 , . . . , Xn into the sum
T = X1 + · · · + Xn or sample mean X̄n = T /n is a transformation from Rn to R.
The term for a sum of independent random variables is convolution. We have
already encountered stories and MGFs as two techniques for dealing with convo-
lutions. In this chapter, convolution sums and integrals, which are based on the
law of total probability, will give us another way of obtaining the distribution of
a sum of r.v.s.
• Extreme values: In many contexts, we may be interested in the distribution of
the most extreme observations. For disaster preparedness, government agencies
may be concerned about the most extreme flood or earthquake in a 100-year
period; in finance, a portfolio manager with an eye toward risk management will
want to know the worst 1% or 5% of portfolio returns. In these applications,
we are concerned with the maximum or minimum of a set of observations.
For a one-to-one g, the situation is particularly simple, because there is only one value of x such that g(x) = y, namely g⁻¹(y). Then we can use

P(g(X) = y) = P(X = g⁻¹(y))
to convert between the PMFs of X and g(X), as also discussed in Section 3.7. For example, it is extremely easy to convert between the Geometric and First Success distributions (a numerical sketch follows this list).
• In the continuous case, a universal approach is to start from the CDF of g(X), and translate the event g(X) ≤ y into an equivalent event involving X. For general g, we may have to think carefully about how to express g(X) ≤ y in terms of X, and there is no easy formula we can plug into. But when g is continuous and strictly increasing, the translation is easy: g(X) ≤ y is the same as X ≤ g⁻¹(y), so

F_g(X)(y) = P(g(X) ≤ y) = P(X ≤ g⁻¹(y)) = FX(g⁻¹(y)).

We can then differentiate with respect to y to get the PDF of g(X). This gives a
one-dimensional version of the change of variables formula, which generalizes to
invertible transformations in multiple dimensions.
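Here is a minimal numerical sketch of the discrete conversion (in Python; the names and the library check are ours, not the book's). It uses the book's conventions: X ∼ Geom(p) counts failures before the first success (support 0, 1, 2, ...), while Y ∼ FS(p) counts trials including the success (support 1, 2, 3, ...), so Y ∼ FS(p) exactly when Y − 1 ∼ Geom(p).

```python
import numpy as np
from scipy import stats

# Sketch: if Y ~ FS(p), then Y - 1 ~ Geom(p), so P(Y = y) = P(X = y - 1),
# where X ~ Geom(p) has PMF (1 - p)^k * p on k = 0, 1, 2, ....
p = 0.3
ys = np.arange(1, 9)                        # possible values of Y ~ FS(p)
geom_pmf_shifted = (1 - p) ** (ys - 1) * p  # P(X = y - 1) for X ~ Geom(p)
fs_pmf = stats.geom.pmf(ys, p)              # SciPy's "geom" uses the FS convention
print(np.allclose(fs_pmf, geom_pmf_shifted))  # True
```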
For Y = g(X) with g differentiable and strictly monotone, the change of variables formula says that the PDF of Y is

fY(y) = fX(x) |dx/dy|,

where x = g⁻¹(y).
When applying the change of variables formula, we can choose whether to compute dx/dy, or to compute dy/dx and take the reciprocal. By the chain rule, these give the same result, so we can do whichever is easier.
⚠ 8.1.2. When finding the distribution of Y, be sure to:
• Check the assumptions of the change of variables theorem carefully if you wish to
apply it (if it doesn’t apply, a good strategy is to start with the CDF of Y ).
• Express your final answer for the PDF of Y as a function of y.
• Specify the support of Y .
The change of variables formula (in the strictly increasing g case) is easy to remem-
ber when written in the form
fY (y)dy = fX (x)dx,
which has an aesthetically pleasing symmetry to it. This formula also makes sense if
we think about units. For example, let X be a measurement in inches and Y = 2.54X
be the conversion into centimeters (cm). Then the units of fX(x) are inches⁻¹ and the units of fY(y) are cm⁻¹, so it would be absurd to say something like "fY(y) = fX(x)". But dx is measured in inches and dy is measured in cm, so fY(y)dy
and fX (x)dx are unitless quantities, and it makes sense to equate them. Better yet,
fX (x)dx and fY (y)dy have probability interpretations (recall from Chapter 5 that
fX (x)dx is essentially the probability that X is in a tiny interval of length dx,
centered at x), which makes it easier to think intuitively about what the change of
variables formula is saying.
The next two examples derive the PDFs of two r.v.s that are defined as transforma-
tions of a standard Normal r.v. In the first example the change of variables formula
applies; in the second example it does not.
Example 8.1.3 (Log-Normal PDF). Let X ∼ N(0, 1), Y = e^X. In Chapter 6 we named the distribution of Y the Log-Normal, and we found all of its moments using the MGF of the Normal distribution. Now we can use the change of variables formula to find the PDF of Y, since g(x) = e^x is strictly increasing. Let y = e^x, so x = log y and dy/dx = e^x. Then

fY(y) = fX(x) |dx/dy| = φ(x) (1/e^x) = φ(log y) (1/y), y > 0.
Note that after applying the change of variables formula, we write everything on
the right-hand side in terms of y, and we specify the support of the distribution. To
determine the support, we just observe that as x ranges from −∞ to ∞, e^x ranges from 0 to ∞.
We can get the same result by working from the definition of the CDF, translating the event Y ≤ y into an equivalent event involving X. For y > 0,

FY(y) = P(e^X ≤ y) = P(X ≤ log y) = Φ(log y),

so

fY(y) = (d/dy) Φ(log y) = φ(log y) (1/y), y > 0. □
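As a sanity check, here is a small simulation sketch (in Python; not from the text, and all names and interval choices are ours) comparing the empirical density of Y = e^X against the formula φ(log y)/y at a few points.

```python
import numpy as np

# Sketch: estimate the density of Y = e^X, X ~ N(0,1), from simulation
# and compare with the change-of-variables answer phi(log y)/y.
rng = np.random.default_rng(0)
y_sim = np.exp(rng.standard_normal(10**6))

def f_Y(y):
    return np.exp(-np.log(y) ** 2 / 2) / (np.sqrt(2 * np.pi) * y)

for lo, hi in [(0.5, 0.6), (1.0, 1.1), (2.0, 2.1)]:
    mid = (lo + hi) / 2
    empirical = np.mean((y_sim > lo) & (y_sim < hi)) / (hi - lo)
    print(f"near y = {mid:.2f}: empirical {empirical:.4f}, formula {f_Y(mid):.4f}")
```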
Example 8.1.4 (Chi-Square PDF). Now let Y = X², still with X ∼ N(0, 1). Here g(x) = x² is not one-to-one on the real line, so the change of variables formula does not apply. Working from the CDF instead, for y > 0,

FY(y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = 2Φ(√y) − 1,

so

fY(y) = 2φ(√y) · (1/2)y^(−1/2) = φ(√y) y^(−1/2), y > 0. □
Example 8.1.5 (Lighthouse). A lighthouse shines a ray of light toward a beach at a uniformly random angle U ∼ Unif(−π/2, π/2). Consider a line which is parallel to the shore and 1 mile away from the shore, as illustrated in Figure 8.1. An angle of 0 would mean the ray of light is perpendicular to the shore, while an angle of π/2 would mean the ray is along the shore, shining to the right from the perspective of the figure.
Let X be the point that the light hits on the line, where the line’s origin is the point
on the line that is closest to the lighthouse. Find the distribution of X.
FIGURE 8.1: A lighthouse shining light at a random angle U, viewed from above.
Solution: Looking at the right triangle in Figure 8.1, the length of the side opposite U divided by the length of the side adjacent to U is X/1 = X, so

X = tan(U).

(The figure illustrates a case where U > 0 and, correspondingly, X > 0, but the same relationship holds when U ≤ 0.) Let x be a possible value of X and u be the corresponding possible value of U, so x = tan(u).
By the change of variables formula, which applies since tan is a differentiable, strictly increasing function on (−π/2, π/2),

fX(x) = fU(u) |du/dx| = (1/π) · 1/(1 + x²),
which shows that X is Cauchy. In particular, this implies that E|X| is infinite (since
the expected value of a Cauchy does not exist), so on average X is infinitely far
from the origin of the line!
The fact that X is Cauchy also makes sense in light of universality of the Uniform.
As shown in Example 7.1.25, the Cauchy CDF is
F(x) = (1/π) arctan(x) + 0.5.

The inverse is F⁻¹(v) = tan(π(v − 0.5)), so for V ∼ Unif(0, 1) we have

F⁻¹(V) = tan(π(V − 0.5)) ∼ Cauchy.

This agrees with our earlier result since π(V − 0.5) ∼ Unif(−π/2, π/2). □
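A brief simulation sketch (Python; the names and check points are ours, not the book's) makes this concrete: generate tan(π(V − 0.5)) for V ∼ Unif(0, 1) and compare the empirical CDF with (1/π) arctan(x) + 0.5.

```python
import numpy as np

# Sketch: by universality of the Uniform, F^{-1}(V) = tan(pi*(V - 0.5))
# is Cauchy when V ~ Unif(0, 1). Check against F(x) = arctan(x)/pi + 0.5.
rng = np.random.default_rng(0)
v = rng.uniform(size=10**6)
x = np.tan(np.pi * (v - 0.5))

for t in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    empirical = np.mean(x <= t)
    theory = np.arctan(t) / np.pi + 0.5
    print(f"F({t:+.1f}): empirical {empirical:.4f}, theory {theory:.4f}")
```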
We can also use the change of variables formula to find the PDF of a location-scale
transformation.
Example 8.1.6 (PDF of a location-scale transformation). Let X have PDF fX, and let Y = a + bX, with b ≠ 0. Let y = a + bx, to mirror the relationship between Y and X. Then dy/dx = b, so the PDF of Y is

fY(y) = fX(x) |dx/dy| = fX((y − a)/b) · (1/|b|). □
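A quick numerical sketch of this formula (Python; the choices a = 2 and b = −3 are ours, with b negative to exercise the absolute value):

```python
import numpy as np

# Sketch: Y = a + b*X with X ~ N(0,1) should have PDF
# f_Y(y) = f_X((y - a)/b) / |b|, even when b < 0.
rng = np.random.default_rng(0)
a, b = 2.0, -3.0
y_sim = a + b * rng.standard_normal(10**6)

def f_Y(y):
    x = (y - a) / b
    return np.exp(-x**2 / 2) / (np.sqrt(2 * np.pi) * abs(b))

for lo, hi in [(-4.1, -3.9), (1.9, 2.1), (4.9, 5.1)]:
    mid = (lo + hi) / 2
    empirical = np.mean((y_sim > lo) & (y_sim < hi)) / (hi - lo)
    print(f"near y = {mid:+.1f}: empirical {empirical:.4f}, formula {f_Y(mid):.4f}")
```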
In n dimensions, if Y = g(X) for an invertible, differentiable transformation g, the change of variables formula says that the joint PDF of Y is

fY(y) = fX(x) |det(∂x/∂y)|

for y in the support of Y, and 0 otherwise. (The inner bars around the Jacobian say to take the determinant and the outer bars say to take the absolute value.) That is, to convert fX(x) to fY(y) we express the x in fX(x) in terms of y and then multiply by the absolute value of the determinant of the Jacobian matrix ∂x/∂y.
As in the 1D case,

∂x/∂y = (∂y/∂x)⁻¹,
so we can compute whichever of the two Jacobians is easier, and then at the end
express the joint PDF of Y as a function of y.
We will not prove the change of variables formula here, but the idea is to apply the change of variables formula from multivariable calculus, together with the fact that if A is a region within the support A₀ of X and B = {g(x) : x ∈ A} is the corresponding region within the support B₀ of Y, then X ∈ A is equivalent to Y ∈ B; they are the same event. So P(X ∈ A) = P(Y ∈ B), which shows that

∫_A fX(x)dx = ∫_B fY(y)dy.
As a caution about forgetting the Jacobian, let Y = X³. If X is a discrete r.v., we can directly write P(Y = y) = P(X = y^(1/3)), but if X is continuous, a Jacobian is needed:

fY(y) = fX(x) |dx/dy| = fX(y^(1/3)) · 1/(3y^(2/3)).
Exercise 23 is a cautionary tale about someone who failed to use a Jacobian when
it was needed.
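To see the Jacobian factor at work, here is a small simulation sketch (Python; we take X ∼ N(0, 1) for concreteness, an assumption of ours) comparing the naive guess fX(y^(1/3)) with the Jacobian-corrected PDF.

```python
import numpy as np

# Sketch: for Y = X^3 with X ~ N(0,1), the naive guess phi(y^(1/3)) ignores
# the Jacobian; the correct PDF is phi(y^(1/3)) / (3 * y^(2/3)).
rng = np.random.default_rng(0)
y_sim = rng.standard_normal(10**6) ** 3

def phi(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

for lo, hi in [(0.5, 0.6), (1.0, 1.1), (2.0, 2.1)]:
    mid = (lo + hi) / 2
    empirical = np.mean((y_sim > lo) & (y_sim < hi)) / (hi - lo)
    naive = phi(mid ** (1 / 3))
    corrected = naive / (3 * mid ** (2 / 3))
    print(f"near y = {mid:.2f}: empirical {empirical:.4f}, "
          f"naive {naive:.4f}, with Jacobian {corrected:.4f}")
```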
The next two examples apply the 2D change of variables formula.
Example 8.1.9 (Box-Muller). Let U ∼ Unif(0, 2π), and let T ∼ Expo(1) be independent of U. Define

X = √(2T) cos U and Y = √(2T) sin U.
Find the joint PDF of (X, Y ). Are they independent? What are their marginal
distributions?
Solution:
The joint PDF of U and T is
fU,T(u, t) = (1/(2π)) e^(−t),

for u ∈ (0, 2π) and t > 0. Viewing (X, Y) as a point in the plane,

X² + Y² = 2T(cos²U + sin²U) = 2T

is the squared distance from the origin and U is the angle; that is, (√(2T), U) expresses (X, Y) in polar coordinates.
Since we can recover (U, T) from (X, Y), the transformation is invertible. The Jacobian matrix is

∂(x, y)/∂(u, t) = [ −√(2t) sin u   (1/√(2t)) cos u ]
                  [  √(2t) cos u   (1/√(2t)) sin u ],

which has determinant −sin²u − cos²u = −1, whose absolute value is 1 (so the determinant is never 0). Then letting x = √(2t) cos u, y = √(2t) sin u to mirror the transformation from (U, T) to (X, Y), we have
fX,Y(x, y) = fU,T(u, t) · |∂(u, t)/∂(x, y)|
           = (1/(2π)) e^(−t) · 1
           = (1/(2π)) e^(−(x² + y²)/2)
           = (1/√(2π)) e^(−x²/2) · (1/√(2π)) e^(−y²/2),
for all real x and y.
The joint PDF fX,Y factors into a function of x times a function of y, so X and Y
are independent. Furthermore, we recognize the joint PDF as the product of two
standard Normal PDFs, so X and Y are i.i.d. N (0, 1) r.v.s! This result is called the
Box-Muller method for generating Normal r.v.s. □
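Here is a minimal implementation sketch in Python of the method just derived (the function and variable names are ours):

```python
import numpy as np

# Box-Muller: U ~ Unif(0, 2*pi) and T ~ Expo(1), independent, give
# X = sqrt(2T) cos U and Y = sqrt(2T) sin U, which are i.i.d. N(0, 1).
def box_muller(n, rng):
    u = rng.uniform(0.0, 2 * np.pi, size=n)
    t = rng.exponential(1.0, size=n)
    r = np.sqrt(2 * t)
    return r * np.cos(u), r * np.sin(u)

rng = np.random.default_rng(0)
x, y = box_muller(10**6, rng)
print(x.mean(), x.std(), y.mean(), y.std())  # approx 0, 1, 0, 1
print(np.corrcoef(x, y)[0, 1])               # approx 0 (independence)
```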
Example 8.1.10 (Bivariate Normal joint PDF). In Chapter 7, we saw some prop-
erties of the Bivariate Normal distribution and found its joint MGF. Now let’s find
its joint PDF.
Let (Z, W) be BVN with N(0, 1) marginals and Corr(Z, W) = ρ. (If we want the joint PDF when the marginals are not standard Normal, we can standardize both components separately and use the result below.) Assume that −1 < ρ < 1, since otherwise the distribution is degenerate (with Z and W perfectly correlated).
Using the construction from Chapter 7, we can write

Z = X,
W = ρX + τY,

with τ = √(1 − ρ²) and X, Y i.i.d. N(0, 1). We also need the inverse transformation. Solving Z = X for X, we have X = Z. Plugging this into W = ρX + τY and solving for Y, we have

X = Z,
Y = −(ρ/τ)Z + (1/τ)W.
The Jacobian matrix is

∂(x, y)/∂(z, w) = [  1       0  ]
                  [ −ρ/τ   1/τ ],
which has absolute determinant 1/τ. So by the change of variables formula,

fZ,W(z, w) = fX,Y(x, y) · |∂(x, y)/∂(z, w)|
           = (1/(2πτ)) exp(−(1/2)(x² + y²))
           = (1/(2πτ)) exp(−(1/2)(z² + (−(ρ/τ)z + (1/τ)w)²))
           = (1/(2πτ)) exp(−(1/(2τ²))(z² + w² − 2ρzw)), for all real z, w.

In the last step we multiplied things out and used the fact that ρ² + τ² = 1. □
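A short simulation sketch in Python of this construction (the parameter ρ = 0.7 and all names are our choices):

```python
import numpy as np

# Sketch: Z = X, W = rho*X + tau*Y with X, Y i.i.d. N(0,1) and
# tau = sqrt(1 - rho^2) gives N(0,1) marginals with Corr(Z, W) = rho.
rng = np.random.default_rng(0)
rho = 0.7
tau = np.sqrt(1 - rho**2)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
z, w = x, rho * x + tau * y
print(z.std(), w.std())          # both approx 1
print(np.corrcoef(z, w)[0, 1])   # approx 0.7
```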
8.2 Convolutions
Finding the distribution of a sum of independent r.v.s is a recurring task. For example, we used stories to show that the sum of independent Binomials with the same success probability is Binomial, and that the sum of i.i.d. Geometrics is Negative Binomial. We used MGFs to show that a sum of independent Normals is Normal.
⚠ 8.2.2. We use the assumption that X and Y are independent in order to get from P(Y = t − x | X = x) to P(Y = t − x) in the last step. We are only justified in dropping the condition X = x if the conditional distribution of Y given X = x is the same as the marginal distribution of Y, i.e., X and Y are independent. A common mistake is to assume that after plugging in x for X, we've "already used the information" that X = x, when in fact we need an independence assumption to drop the condition. Otherwise we destroy information without justification.
In the continuous case, since the value of a PDF at a point is not a probability, we first find the CDF and then differentiate to get the PDF. By LOTP,

FT(t) = P(X + Y ≤ t) = ∫_{−∞}^{∞} P(X + Y ≤ t | X = x) fX(x) dx
      = ∫_{−∞}^{∞} P(Y ≤ t − x) fX(x) dx
      = ∫_{−∞}^{∞} FY(t − x) fX(x) dx.

Differentiating with respect to t (and swapping the derivative and the integral, which is justified under regularity conditions) leads to

fT(t) = ∫_{−∞}^{∞} fY(t − x) fX(x) dx.
But care is still needed. For example, Exercise 23 shows that an analogous-looking
formula for the PDF of the product of two independent continuous r.v.s is wrong:
a Jacobian is needed (for convolutions, the absolute Jacobian determinant is 1 so it
isn’t noticeable in the convolution integral formula).
Since convolution sums are just the law of total probability, we have already used
them in previous chapters without mentioning the word convolution; see, for ex-
ample, the first and most tedious proof of Theorem 3.8.9 (sum of independent
Binomials), as well as the proof of Theorem 4.8.1 (sum of independent Poissons).
In the following examples, we find the distribution of a sum of Exponentials and a
sum of Uniforms using a convolution integral.
Example 8.2.4 (Exponential convolution). Let X, Y be i.i.d. Expo(λ). Find the distribution of T = X + Y.
Solution:
For t > 0, the convolution formula gives

fT(t) = ∫_{−∞}^{∞} fY(t − x) fX(x) dx = ∫_0^t λe^(−λ(t−x)) · λe^(−λx) dx,

where we restricted the integral to be from 0 to t since we need t − x > 0 and x > 0 for the PDFs inside the integral to be nonzero. Simplifying, we have

fT(t) = λ²e^(−λt) ∫_0^t dx = λ²te^(−λt), for t > 0.
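As a quick simulation sketch (Python; the choice λ = 2 and all names are ours), we can compare the empirical density of T = X + Y with λ²te^(−λt):

```python
import numpy as np

# Sketch: T = X + Y for X, Y i.i.d. Expo(lam) should have density
# lam^2 * t * exp(-lam * t).  Note numpy parameterizes by scale = 1/lam.
rng = np.random.default_rng(0)
lam = 2.0
n = 10**6
t_sim = rng.exponential(1 / lam, size=n) + rng.exponential(1 / lam, size=n)

def f_T(t):
    return lam**2 * t * np.exp(-lam * t)

for lo, hi in [(0.2, 0.3), (0.5, 0.6), (1.5, 1.6)]:
    mid = (lo + hi) / 2
    empirical = np.mean((t_sim > lo) & (t_sim < hi)) / (hi - lo)
    print(f"near t = {mid:.2f}: empirical {empirical:.4f}, formula {f_T(mid):.4f}")
```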
Example 8.2.5 (Uniform convolution). Now let X, Y be i.i.d. Unif(0, 1) and T = X + Y. The convolution formula gives fT(t) = ∫ g(t − x)g(x) dx, where g is the Unif(0, 1) PDF (equal to 1 on (0, 1) and 0 otherwise). The integrand is 1 if and only if 0 < t − x < 1 and 0 < x < 1; this is a parallelogram-shaped constraint. Equivalently, the constraint is max(0, t − 1) < x < min(t, 1).
FIGURE 8.2: Region in the (t, x)-plane where g(t − x)g(x) is 1.
From Figure 8.2, we see that for 0 < t ≤ 1, x is constrained to be in (0, t), and for 1 < t < 2, x is constrained to be in (t − 1, 1). Therefore, the PDF of T is a piecewise linear function:

fT(t) = ∫_0^t dx = t            for 0 < t ≤ 1,
fT(t) = ∫_{t−1}^1 dx = 2 − t    for 1 < t < 2.
Figure 8.3 plots the PDF of T . It is shaped like a triangle with vertices at 0, 1, and
2, so it is called the Triangle(0, 1, 2) distribution.
Heuristically, it makes sense that T is more likely to take on values near the mid-
dle than near the extremes: a value near 1 can be obtained if both X and Y are
moderate, if X is large but Y is small, or if Y is large but X is small. In contrast, a
value near 2 is only possible if both X and Y are large. Thinking back to Example
3.2.5, the PMF of the sum of two die rolls was also shaped like a triangle. A single
die roll has a Discrete Uniform distribution on the integers 1 through 6, so in that
problem we were looking at a convolution of two Discrete Uniforms. It makes sense
that the PDF we obtained here is similar in shape. □
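A quick simulation sketch (Python; names and interval choices ours) confirming the triangular shape:

```python
import numpy as np

# Sketch: T = X + Y for X, Y i.i.d. Unif(0,1) should have the
# Triangle(0, 1, 2) density: t on (0, 1], and 2 - t on (1, 2).
rng = np.random.default_rng(0)
n = 10**6
t_sim = rng.uniform(size=n) + rng.uniform(size=n)

def f_T(t):
    return np.where(t <= 1, t, 2 - t)

for lo, hi in [(0.2, 0.3), (0.9, 1.0), (1.7, 1.8)]:
    mid = (lo + hi) / 2
    empirical = np.mean((t_sim > lo) & (t_sim < hi)) / (hi - lo)
    print(f"near t = {mid:.2f}: empirical {empirical:.4f}, formula {f_T(mid):.4f}")
```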
FIGURE 8.3: PDF of T = X + Y, where X and Y are i.i.d. Unif(0, 1).

8.3 Beta

In this section and the next, we will introduce two continuous distributions, the Beta and Gamma, which are related to several named distributions we have already
studied and are also related to each other via a shared story. This is an interlude
from the subject of transformations, but we’ll eventually need to use a change of
variables to tie the Beta and Gamma distributions together.
The Beta distribution is a continuous distribution on the interval (0, 1). It is a
generalization of the Unif(0, 1) distribution, allowing the PDF to be non-constant
on (0, 1).
Definition 8.3.1 (Beta distribution). An r.v. X is said to have the Beta distribution with parameters a and b, where a > 0 and b > 0, if its PDF is

f(x) = (1/β(a, b)) x^(a−1) (1 − x)^(b−1),   0 < x < 1,

where the constant β(a, b) is chosen to make the PDF integrate to 1. We write this as X ∼ Beta(a, b).
Taking a = b = 1, the Beta(1, 1) PDF is constant on (0, 1), so the Beta(1, 1) and
Unif(0, 1) distributions are the same. By varying the values of a and b, we get PDFs
with a variety of shapes; Figure 8.4 shows four examples. Here are a couple of general
patterns:
• If a < 1 and b < 1, the PDF is U-shaped and opens upward. If a > 1 and b > 1, the PDF opens downward.
• If a = b, the PDF is symmetric about 1/2. If a > b, the PDF favors values larger
than 1/2; if a < b, the PDF favors values smaller than 1/2.
By definition, the constant β(a, b) satisfies

β(a, b) = ∫_0^1 x^(a−1) (1 − x)^(b−1) dx.
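The normalizing constant can be evaluated numerically; here is a small sketch (Python; the parameter pairs are our choices) comparing quadrature of the integral above with SciPy's built-in beta function:

```python
import numpy as np
from scipy import integrate, special

# Sketch: compute beta(a, b) = integral of x^(a-1) (1-x)^(b-1) over (0, 1)
# by numerical quadrature, and compare with scipy.special.beta.
for a, b in [(1, 1), (2, 1), (2, 8), (5, 5)]:
    val, _ = integrate.quad(lambda x, a=a, b=b: x**(a - 1) * (1 - x)**(b - 1), 0, 1)
    print(f"beta({a}, {b}): quadrature {val:.6f}, scipy {special.beta(a, b):.6f}")
```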