STAT 516 Course Notes Part 0: Review of STAT 515: 1 Probability
Ted Westling
Department of Mathematics and Statistics
University of Massachusetts Amherst
[email protected]
These notes are based on Chapters 1–6 of Wackerly, Mendenhall, and Scheaffer’s Mathematical
Statistics with Applications, 7th ed. However, in some cases the notation, order, and topics covered
here differ slightly from Wackerly’s. In addition, since this part is intended to be a brief review of
the core concepts covered in 515, motivation and examples are limited.
1 Probability
1.1 Sets
A set is a collection of objects, which are called elements of the set. We write A = {a1 , a2 , a3 } to
say that the set A is comprised of the elements a1 , a2 , and a3 , and we write a1 ∈ A to say that a1
is an element of A. We also write a ∉ A to say that a is not an element of A.
A subset B of a set A is any set containing only elements of A, but not necessarily all elements
of A. We write B ⊆ A if B is a subset of A. Formally, B ⊆ A if and only if b ∈ B implies b ∈ A.
Every set is a subset of itself, so A ⊆ A is always true. We write B ⊂ A or B ⊊ A if B is a proper
subset of A, meaning that B ⊆ A and B ≠ A. Formally, B ⊂ A if and only if B ⊆ A and there is
at least one a ∈ A such that a ∉ B. Note that a set is not a proper subset of itself.
The empty set, denoted ∅ or {}, is the unique set with no elements. The empty set is a subset of
all sets, so ∅ ⊆ A for all sets A.
The union of two sets A and B, denoted A ∪ B, is the set containing the pooled elements of A and
B. Thus, A ∪ B := {c : c ∈ A or c ∈ B}.
The intersection of two sets A and B, denoted A ∩ B, is the set containing the common elements
of A and B. Thus, A ∩ B := {c : c ∈ A and c ∈ B}.
The complement of a set A relative to a set B, also known as the set difference and denoted A^c_B,
also denoted B − A or B\A, is the set of elements of B that are not contained in A: A^c_B := {b ∈ B : b ∉ A}.
Sometimes, if the universal or reference set S containing all elements under consideration
is known or implicit, then A^c or Ā will mean A^c_S.
There are many useful identities for sets; here are a few:
• (A ∩ B) ∪ (A ∩ C) = A ∩ (B ∪ C),
• (A ∪ B) ∩ (A ∪ C) = A ∪ (B ∩ C),
• (A ∩ B)^c_S = (A^c_S) ∪ (B^c_S), and
• (A ∪ B)^c_S = (A^c_S) ∩ (B^c_S).
These and other identities are most easily proved using Venn diagrams.
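These identities can also be spot-checked numerically. The following Python sketch (the sets chosen are arbitrary examples, not from the text) verifies the distributive laws and De Morgan's laws for one particular choice of sets — a check, not a proof:

```python
# Spot-check the identities on arbitrary finite sets (a check, not a proof).
S = set(range(10))                       # reference set
A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7}

assert (A & B) | (A & C) == A & (B | C)   # first distributive identity
assert (A | B) & (A | C) == A | (B & C)   # second distributive identity
assert S - (A & B) == (S - A) | (S - B)   # De Morgan
assert S - (A | B) == (S - A) & (S - B)   # De Morgan
print("all identities hold")
```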
A set is finite if the number of elements in the set is finite, and is infinite otherwise. A set
A is countably infinite if it is in one-to-one correspondence with the natural numbers N :=
{1, 2, 3, . . . }.
Note that axioms 3 and 1 together imply that if B ⊆ A are two events on S, then P (B) ≤ P (A).
Together with axiom 2, this implies that P (A) ≤ 1 for all events A on S. Therefore, a probability
is a function from the events on S to [0, 1]. For any event A, P(A^c_S) = P(Ā) = 1 − P(A).
For any two events that may not be disjoint, the additive law of probability, also called the
inclusion/exclusion principle, states that
P (A ∪ B) = P (A) + P (B) − P (A ∩ B) .
The conditional probability of an event A given an event B with P(B) > 0 is defined as

P(A | B) := P(A ∩ B) / P(B).
Intuitively, the conditional probability is the probability that the event A occurs given that B is
known to have occurred. For example, the conditional probability of the event A = {1, 2, or 3 is
rolled} given the event B = {an even number is rolled} when rolling a fair six-sided die is

P(A | B) = P(A ∩ B) / P(B) = P(2 is rolled) / P(2, 4, or 6 is rolled) = (1/6) / (1/2) = 1/3.
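The die example can be reproduced by direct enumeration. A small Python sketch (an illustration, not part of the text):

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die; outcomes are equally likely.
S = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}   # event: a 1, 2, or 3 is rolled
B = {2, 4, 6}   # event: an even number is rolled

def prob(event):
    # P(event) on a finite sample space with equally likely outcomes
    return Fraction(len(event & S), len(S))

p_given = prob(A & B) / prob(B)   # P(A | B) = P(A ∩ B) / P(B)
print(p_given)  # 1/3
```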
The definition of conditional probability implies that P (A ∩ B) = P (A | B)P (B) = P (B | A)P (A).
This is sometimes called the multiplicative law of probability.
Two events A and B are independent if P(A ∩ B) = P(A)P(B). When P(A) and P(B) are positive,
this is equivalent to P(A | B) = P(A) and to P(B | A) = P(B), and any of these criteria can be
used to assess whether two events are independent.
Suppose that B1, B2, . . . , Bk are events on a sample space S that partition S, meaning that
B1 ∪ B2 ∪ · · · ∪ Bk = S and Bi ∩ Bj = ∅ for each i ≠ j. Then the law of total probability states that

P(A) = Σ_{j=1}^k P(A | Bj)P(Bj)
for any event A on S. Bayes’ rule (not to be confused with Bae’s rule) states that if in addition
P(Bi) > 0 for all i,

P(Bi | A) = P(A | Bi)P(Bi) / Σ_{j=1}^k P(A | Bj)P(Bj).
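As a sketch of how the law of total probability and Bayes' rule combine in practice, here is a small Python example; the prior and conditional probabilities below are hypothetical numbers chosen purely for illustration:

```python
# Partition B1, B2, B3 with hypothetical prior probabilities, and
# hypothetical conditional probabilities P(A | Bi).
priors = [0.5, 0.3, 0.2]          # P(B1), P(B2), P(B3); sums to 1
likelihoods = [0.9, 0.5, 0.1]     # P(A | B1), P(A | B2), P(A | B3)

# Law of total probability: P(A) = sum_j P(A | Bj) P(Bj)
p_a = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' rule: P(Bi | A) = P(A | Bi) P(Bi) / P(A)
posteriors = [l * p / p_a for l, p in zip(likelihoods, priors)]
print(round(p_a, 4), [round(q, 4) for q in posteriors])
```

Note that the posteriors necessarily sum to one, since the Bi partition S.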
A discrete sample space is a finite or countably infinite set S = {s1 , s2 , . . . }. For example, if
the experiment is rolling a single die and observing the number of dots facing up, then the sample
space is {1, 2, 3, 4, 5, 6}. If the experiment is rolling two dice and observing the total number of
dots facing up, the sample space is {2, 3, . . . , 11, 12}. If we pick a “random” real number between
0 and 1, invert it, and round down, then the sample space is {1, 2, 3, 4, . . . }. For discrete sample
spaces, any subset is an event. For a discrete sample space, the probability of any event is simply
the sum of the probabilities of each element in the event: P(A) = Σ_{a∈A} P({a}). Therefore, to
specify a probability function on a discrete sample space, it suffices to specify the probabilities of
each element in the space in such a way that the probabilities are non-negative and sum to one.
Be careful—this is only true for discrete sample spaces!
For calculating probabilities on finite sample spaces in practice, several formulas from combinatorics
are useful:
• If A has m unique elements and B has n unique elements, then there are mn unique ordered
pairs (a, b), where a ∈ A and b ∈ B.
• The number of ways of constructing ordered arrangements (called permutations) of length
r from a set of n distinct objects is

P^n_r := n! / (n − r)!.
• The number of unordered subsets (called combinations) of size r from a set of n unique
elements is

(n choose r) := n! / [r!(n − r)!].
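Both counting formulas are available directly in the Python standard library (`math.perm` and `math.comb`), which makes a quick sanity check easy:

```python
import math

n, r = 10, 3
perms = math.perm(n, r)   # n!/(n-r)!: ordered arrangements of length r
combs = math.comb(n, r)   # n!/(r!(n-r)!): unordered subsets of size r
print(perms, combs)       # 720 120
```

Each unordered subset of size r corresponds to r! ordered arrangements, so perms = combs × r!.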
2 Univariate random variables
A random variable (RV) X is a function from a sample space S to R. For any A ⊆ R, we define
P(X ∈ A) := P({s ∈ S : X(s) ∈ A}).
While sample spaces can be sets of any objects, including non-numeric objects, random variables
are always numeric. Sometimes, these are called real random variables or univariate random
variables to distinguish from multivariate random variables and other types of random variables
that we will not discuss.
The distribution function, also known as the cumulative distribution function (CDF), of a
random variable X is defined as

FX(x) := P(X ≤ x)

for all x ∈ R. Distribution functions are always monotonic, meaning they are non-decreasing:
x ≤ y implies FX(x) ≤ FX(y). In addition, FX(−∞) = lim_{x→−∞} FX(x) = 0 and FX(∞) =
lim_{x→∞} FX(x) = 1.
A random variable is called discrete if it can take on a finite or countably infinite number of values.
The support 𝒳 or supp(X) of a discrete random variable X is defined as {x : P(X = x) > 0} ⊂ R.
For a discrete random variable, the function x ↦ fX(x) := P(X = x), known as the probability
mass function (pmf) of X, contains all the information about the distribution of X, since

P(X ∈ A) = Σ_{x ∈ A ∩ 𝒳} fX(x)

for any A ⊆ R.
The expected value, also known as the mean, of a random variable X is defined as

µX := E[X] := Σ_{x ∈ supp(X)} x P(X = x).

Technically, E[X] is only defined if Σ_{x ∈ supp(X)} |x| P(X = x) < ∞. Otherwise, the expected value
doesn’t exist. The mean always exists for random variables with finite support, but may not exist
for random variables with countably infinite support. For example, if P(X = x) = 1/[x(x + 1)] for
x ∈ {1, 2, 3, . . . } and 0 otherwise, then Σ_{x ∈ supp(X)} |x| P(X = x) does not converge, so X does not
have a mean.
Similarly, if Σ_{x ∈ supp(X)} |g(x)| P(X = x) < ∞ for some function g : R → R, then

E[g(X)] := Σ_{x ∈ supp(X)} g(x) P(X = x).

Expectation is linear:

E[a g(X) + b h(X)] = a E[g(X)] + b E[h(X)]

for any constants a, b ∈ R, functions g, h : R → R, and RV X. In addition, the expectation of any
non-random constant is itself:

E[c] = c

for all c ∈ R.
The variance of a discrete random variable is then defined as

σ²X := Var(X) := E[(X − µX)²] = Σ_{x ∈ supp(X)} (x − µX)² P(X = x).

The standard deviation of a random variable is σX := √(σ²X). Expanding the square and using
the linearity of expectation, we have

E[(X − µX)²] = E[X² − 2XµX + µ²X] = E[X²] − 2µX E[X] + µ²X = E[X²] − 2µ²X + µ²X
= E[X²] − µ²X.
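The identity Var(X) = E[X²] − µ²X can be checked numerically for any small pmf. A minimal Python sketch (the pmf below is an arbitrary illustration):

```python
# A small pmf with support {0, 1, 2}; the probabilities are arbitrary illustrations.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

mean = sum(x * p for x, p in pmf.items())                     # E[X]
var_def = sum((x - mean) ** 2 * p for x, p in pmf.items())    # E[(X - mu)^2]
ex2 = sum(x ** 2 * p for x, p in pmf.items())                 # E[X^2]
var_short = ex2 - mean ** 2                                   # E[X^2] - mu^2
print(mean, var_def, var_short)   # the two variance computations agree
```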
There are a number of special discrete distributions that arise fairly frequently, including the
uniform distribution, Bernoulli distribution, binomial distribution, Poisson distribution, geometric
distribution, hypergeometric distribution, and negative binomial distribution. Full definitions of all
of these distributions, their importance, and their properties, can be found in Wackerly Chapter
3 or on Wikipedia. Here, we review the three most important of these distributions: Bernoulli,
binomial, and Poisson.
Suppose that 𝒳 = {x1 , x2 , . . . , xk }, and P (X = xi ) = 1/k for i = 1, 2, . . . , k. Then X is uniformly
distributed on the set 𝒳. All of the values in 𝒳 are equally likely.
Suppose that P (X = 1) = p and P (X = 0) = 1 − p for some p ∈ [0, 1]. Then X is said to follow
the Bernoulli distribution with success probability p, and we write X ∼ Bernoulli(p). The mean
of X is p and the variance is p(1 − p), which is largest when p = 0.5. The canonical example of
a random variable following the Bernoulli distribution is the flip of a weighted coin: suppose we
flip a biased/weighted coin that has probability p of coming up heads. Our sample space for this
experiment is {heads, tails}, with probabilities p and 1 − p, respectively. Let X(heads) = 1 and
X(tails) = 0. Then X follows a Bernoulli distribution with probability p.
Suppose that P(X = x) = (n choose x) p^x (1 − p)^(n−x) for all x ∈ {0, 1, 2, . . . , n}, and P(X = x) = 0 otherwise,
where n ∈ {1, 2, . . . } and p ∈ [0, 1]. Then X is said to follow the binomial distribution with
n trials and success probability p, and we write X ∼ Binomial(n, p). The expectation of X is
np and the variance is np(1 − p). The canonical example of a random variable following the
binomial distribution is the number of heads observed when flipping n independent coins, each
with probability p of showing heads. Note that the Bernoulli distribution is a special case of the
binomial distribution with n = 1.
Suppose that P(X = x) = λ^x e^{−λ}/x! for x ∈ {0, 1, 2, . . . } and P(X = x) = 0 otherwise, where
λ ∈ (0, ∞). Then X follows the Poisson distribution with mean λ, and we write X ∼ Poisson(λ).
The mean of X is λ and the variance is also λ. The Poisson distribution is a good model for the
number of events that occur over a period of time if the event is very unlikely to occur in any
given short time interval and the events are independent.
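The "rare events" intuition can be illustrated numerically: a Binomial(n, λ/n) pmf is close to a Poisson(λ) pmf when n is large. A Python sketch (the parameter values are arbitrary illustrations):

```python
import math

lam, n = 3.0, 10_000
p = lam / n   # a rare event per trial

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

# With many trials and a small per-trial probability, Binomial(n, lam/n)
# is close to Poisson(lam):
gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
print(gap)   # largest pointwise difference; very small
```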
2.3 Continuous RVs
A random variable X is continuous if it can take any value in an interval (a, b) for a < b. Hence, X
can take an uncountably infinite number of values, so it is not discrete. Note that the distribution
function of a discrete random variable is discontinuous: it has jumps at each value x in the support
of X. In contrast, continuous random variables have continuous distribution functions. For a
continuous RV X, P (X = x) = 0 for all x ∈ R, so the summations developed for discrete random
variables do not apply. Instead, we use integrals to calculate probabilities for continuous RVs.
The derivative of the distribution function of a continuous RV is called the probability density
function (pdf):

fX(x) := (d/dx) FX(x) = (d/dx) P(X ≤ x).

By the fundamental theorem of calculus,

FX(x) = ∫_{−∞}^{x} fX(t) dt.
For example, consider the function FX(x) = 0 for x < 0, FX(x) = x/c for 0 ≤ x ≤ c, and FX(x) = 1
for x > c. This is a valid CDF for any c ∈ (0, ∞). The density function is given by fX(x) = 1/c if x ∈ [0, c]
and fX(x) = 0 otherwise. Therefore, if c ∈ (0, 1), fX(x) > 1 for x ∈ [0, c]: unlike a pmf, a pdf can
exceed one.
For a continuous random variable X, we define the support of X as supp(X) := {x ∈ R : fX(x) > 0}.
For any a < b, we can calculate P(a < X < b) for a continuous RV X using the pdf:

P(a < X < b) = P(a ≤ X < b) = P(a < X ≤ b) = P(a ≤ X ≤ b) = ∫_{a}^{b} fX(t) dt.
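This integral formula can be checked numerically. The sketch below approximates P(a < X < b) for an Exponential(2) RV with a midpoint rule and compares it to the closed-form answer F(b) − F(a) (a sanity check, not part of the text):

```python
import math

lam = 2.0

def f(x):
    # Exponential(lam) pdf on [0, infinity)
    return lam * math.exp(-lam * x)

def midpoint_integral(g, a, b, n=100_000):
    # simple midpoint-rule approximation of the integral of g over [a, b]
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

a, b = 0.5, 1.0
approx = midpoint_integral(f, a, b)
exact = math.exp(-lam * a) - math.exp(-lam * b)   # F(b) - F(a) in closed form
print(approx, exact)   # the two values agree to high precision
```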
Expected values for continuous RVs are also calculated using integrals:

µX := E[X] := ∫_{−∞}^{∞} x fX(x) dx,

and similarly E[g(X)] := ∫_{−∞}^{∞} g(x) fX(x) dx when the integral converges absolutely. As in the
discrete case, expectation is linear, and the expectation of a constant is itself: E[c] = c.
Variance for continuous RVs is defined the same as for discrete RVs:

σ²X := Var(X) := E[(X − µX)²],

and as before, we have the identity σ²X = E[X²] − µ²X.
As with discrete distributions, there are a number of special continuous distributions that arise
frequently in practice, including the uniform, normal, gamma, beta, exponential, logistic, t, chi-
squared, F, Weibull, Gumbel, Cauchy, Laplace, Pareto, and so on. Extensive Wikipedia pages
(including references to research articles) can be found about many of these distributions. Here,
we review a few of the most important of these distributions.
Suppose X has pdf fX(x) = 1/(b − a) for x ∈ [a, b] and fX(x) = 0 otherwise. Then X ∼
Uniform(a, b). E[X] = (a + b)/2 and Var(X) = (b − a)²/12. The uniform distribution, as suggested
by its name, is an analogue of the discrete uniform distribution. In particular, P(c < X < d) =
(d − c)/(b − a) for all c, d ∈ [a, b] with c < d: the probability that X falls in a subinterval of [a, b]
only depends on the length of the subinterval.
Suppose X has pdf fX (x) = λe−λx for x ∈ [0, ∞), and fX (x) = 0 otherwise. Then X ∼
Exponential(λ). The mean of X is 1/λ, and the variance is 1/λ2 . Sometimes the exponential
distribution is parametrized in terms of 1/λ instead. The exponential distribution is used to model
memoryless random variables, because the distribution of X given that X > a is a+Exponential(λ).
That is, X “forgets” that it has already made it to a.
Suppose X has pdf

fX(x) := (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
for all x ∈ R and some µ ∈ R, σ ∈ (0, ∞). Then X is said to follow a normal, or Gaussian,
distribution, and we write X ∼ N (µ, σ 2 ). E[X] = µ and Var(X) = σ 2 . As we will see, the normal
distribution plays a central role in statistics due to the Central Limit Theorem.
The kth moment of a random variable X is defined as E[X k ]. The kth central moment of X is
defined as E[(X − µX )k ].
The moment generating function of a random variable X is defined as

m(t) := E[e^{tX}]

if this expectation exists. If |m(t)| < ∞ for |t| ≤ c and some c > 0, then

(d^k/dt^k) m(t) |_{t=0} = E[X^k]

for all k ∈ {1, 2, . . . }. This is why m is called the moment generating function: if it exists, it can be
used to find all moments of a random variable.
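The moment-generating property can be illustrated numerically with finite differences. The sketch below uses the mgf of an Exponential(λ) RV, m(t) = λ/(λ − t) for t < λ (a standard fact, stated here as an assumption), and recovers the first two moments 1/λ and 2/λ²:

```python
lam = 2.0

def m(t):
    # mgf of Exponential(lam): m(t) = lam / (lam - t), valid for t < lam
    return lam / (lam - t)

h = 1e-4
# central finite differences approximate m'(0) and m''(0)
m1 = (m(h) - m(-h)) / (2 * h)              # approx E[X]   = 1/lam
m2 = (m(h) - 2 * m(0.0) + m(-h)) / h**2    # approx E[X^2] = 2/lam^2
print(m1, m2)
```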
Random variables can be neither discrete nor continuous. For example, an RV can be a mix of a
discrete and continuous RV. In addition, not all continuous RVs have density functions, since it is
possible for a distribution function to be continuous but nowhere differentiable (look up the Cantor
function, also known as the devil’s staircase function, or Minkowski’s question mark function).
However, all real RVs have a distribution function, and are uniquely determined by their distribution
function.
Markov’s inequality states that P(X ≥ t) ≤ µX/t for any non-negative random variable X and
any t > 0. In particular, P(|X| ≥ t) ≤ E[|X|]/t for any random variable. Chebyshev’s inequality states that

P(|X − µX| ≥ t) ≤ σ²X/t²
for any t > 0. These are examples of concentration inequalities or tail bounds for random
variables. Sometimes, the exact distribution of a random variable may be very complicated or
unknown, so it can be useful to have upper bounds on large deviations of the random variable.
For example, suppose someone tells you that the expected value for the maximum temperature
tomorrow is 40 degrees. Then from Markov’s inequality, you know that the probability that the
temperature exceeds 100 degrees is less than or equal to 40/100 = 0.4. That’s pretty crude... But
if they tell you that the variance of the maximum temperature tomorrow is 10, then you know that
the probability that the temperature exceeds 100 degrees is no larger than 10/60² < 0.01, and the
probability that the temperature exceeds 60 degrees is no larger than 10/20² = 0.025.
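These bounds are easy to check against a simulation; the sketch below uses an Exponential(1/2) RV (mean 2, variance 4), an arbitrary choice for illustration:

```python
import random

random.seed(0)
lam = 0.5
draws = [random.expovariate(lam) for _ in range(200_000)]  # mean 2, variance 4

mean, var, t = 2.0, 4.0, 6.0
tail = sum(x >= t for x in draws) / len(draws)             # P(X >= t)
dev = sum(abs(x - mean) >= t for x in draws) / len(draws)  # P(|X - mean| >= t)
print(tail, "<=", mean / t)     # Markov: P(X >= t) <= mean/t
print(dev, "<=", var / t**2)    # Chebyshev: P(|X - mean| >= t) <= var/t^2
```

As in the temperature example, the bounds hold but are quite conservative for this distribution.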
3 Bivariate RVs
3.1 Definition
A bivariate random variable X = (X1, X2) is a pair of random variables defined on a common
sample space S. The joint distribution function of X is

FX(x1, x2) := P(X1 ≤ x1, X2 ≤ x2).
A continuous bivariate RV X is one that can take an uncountably infinite number of values.
The joint probability density function (pdf) of X is defined as

fX(x1, x2) := (∂²/∂x1∂x2) FX(x1, x2).

Therefore, we have

FX(x1, x2) = ∫_{−∞}^{x1} ∫_{−∞}^{x2} fX(t1, t2) dt2 dt1.

The pdf is always non-negative. We can use the pdf to calculate probabilities for a continuous
bivariate RV:

P(a1 ≤ X1 ≤ b1, a2 ≤ X2 ≤ b2) = ∫_{a2}^{b2} ∫_{a1}^{b1} fX(x1, x2) dx1 dx2.
The prefix “joint” for distribution, density, and mass functions of bivariate random variables means
informally that we are looking at both coordinates of the random variable at once. This is to
distinguish from the marginal and conditional distribution, mass, and density functions. The
marginal distribution function of X1 is

FX1(x1) := P(X1 ≤ x1) = lim_{x2→∞} FX(x1, x2).

For a discrete bivariate RV with support 𝒳, the marginal probability mass function of X1 is

P(X1 = x1) = fX1(x1) := Σ_{x2 ∈ 𝒳2} fX(x1, x2),

where 𝒳2 := {x2 : (x1, x2) ∈ 𝒳}. Similarly, the marginal probability mass function for X2 is given by

P(X2 = x2) = fX2(x2) := Σ_{x1 ∈ 𝒳1} fX(x1, x2).
Note that fX1 and fX2 are valid univariate pmfs. For a continuous bivariate RV, the marginal pdfs
of X1 and X2 are

fX1(x1) := ∫_{−∞}^{∞} fX(x1, x2) dx2 and fX2(x2) := ∫_{−∞}^{∞} fX(x1, x2) dx1.
Note that these are valid univariate pdfs. Intuitively, marginal distributions give the distribution
of a single component of a random vector, ignoring the other components. For example, consider
the bivariate RV (X1 , X2 ), where X1 is a random person’s height and X2 is their wingspan. The
marginal distribution of X1 is then the height of a random person.
For a discrete bivariate RV X, the conditional pmf of X1 given X2 = x2 is

fX1|X2(x1 | x2) := P(X1 = x1 | X2 = x2) = P(X1 = x1, X2 = x2)/P(X2 = x2) = fX(x1, x2)/fX2(x2)

for any x2 such that fX2(x2) > 0. An analogous formula holds for the conditional pmf of X2 given
X1 = x1. For a continuous bivariate RV, the conditional pdf of X1 given X2 = x2 is

fX1|X2(x1 | x2) := fX(x1, x2)/fX2(x2)
and analogously for the conditional pdf of X2 given X1 = x1 . Intuitively, conditional pmfs and
pdfs provide the distribution of one component of a random vector given that we know the value
of the other component. In the height/wingspan example above, the conditional distribution of X1
given X2 = x2 is the distribution of heights among people with wingspan x2 .
Two random variables X1 and X2 are independent if FX (x1 , x2 ) = FX1 (x1 )FX2 (x2 ) for all x1 , x2 ∈
R, where FX1 and FX2 are the marginal distribution functions of X1 and X2 . This is equivalent
to fX(x1, x2) = fX1(x1)fX2(x2) for all (x1, x2) ∈ 𝒳, where fX, fX1, and fX2 are the joint and
marginal pmfs or pdfs, respectively. This implies that fX1 |X2 (x1 | x2 ) = fX1 (x1 ) for all x1 , x2 as
well. Intuitively, this means that knowing something about X2 doesn’t provide any information
about the distribution of X1 , and vice-versa.
The covariance of two random variables X1 and X2 is Cov(X1, X2) := E[(X1 − µX1)(X2 − µX2)] =
E[X1X2] − µX1µX2. The correlation of X1 and X2 is

ρ = Cor(X1, X2) := Cov(X1, X2)/(σX1 σX2).
It turns out that the correlation of two random variables is always between −1 and 1. Correlation
measures the strength of the linear relationship between two random variables: Cor(X1 , X2 ) = 1
if and only if X1 = aX2 + b for some a > 0, and Cor(X1 , X2 ) = −1 if and only if X1 = aX2 + b
for some a < 0. In general, values of Cor(X1 , X2 ) in (0, 1] indicate that X1 and X2 tend to vary
together—i.e. if we observe a large value of X2 , we will also tend to observe a large value of X1
and vice versa—while values in [−1, 0) indicate that if we observe a large value of X2 , we will tend
to observe a small value of X1 and vice versa. However, reported correlations should always be
accompanied by a graphical display of the relationship between the two random variables, since
correlation can be misleading. See, for instance, Anscombe’s quartet.
Variance is invariant to translations: Var(c + X) = Var(X) for all constants c ∈ R and RVs X. The
variance of a linear combination of random variables X1, . . . , Xk can be decomposed as:

Var(c + Σ_{i=1}^k ai Xi) = Σ_{i=1}^k a²i Var(Xi) + 2 Σ_{i<j} ai aj Cov(Xi, Xj).

If Xi and Xj are independent for each i ≠ j, then this reduces to Σ_{i=1}^k a²i Var(Xi).
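The decomposition can be verified exactly on a small joint pmf. A Python sketch (the joint pmf and coefficients below are arbitrary illustrations):

```python
# Joint pmf of (X1, X2) on {0,1}^2; the probabilities are arbitrary illustrations.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def e(g):
    # expectation of g(X1, X2) under the joint pmf
    return sum(g(x1, x2) * p for (x1, x2), p in joint.items())

m1 = e(lambda x1, x2: x1)
m2 = e(lambda x1, x2: x2)
v1 = e(lambda x1, x2: (x1 - m1) ** 2)
v2 = e(lambda x1, x2: (x2 - m2) ** 2)
cov = e(lambda x1, x2: (x1 - m1) * (x2 - m2))

a1, a2, c = 2.0, -3.0, 5.0
mean_lin = e(lambda x1, x2: c + a1 * x1 + a2 * x2)
lhs = e(lambda x1, x2: (c + a1 * x1 + a2 * x2 - mean_lin) ** 2)  # Var(c + a1 X1 + a2 X2)
rhs = a1**2 * v1 + a2**2 * v2 + 2 * a1 * a2 * cov                # decomposition
print(lhs, rhs)   # the two computations agree
```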
Covariance is also invariant to translations: Cov(a+X, b+Y ) = Cov(X, Y ) for all constants a, b ∈ R
and RVs X, Y. The covariance of a linear transformation is given by:

Cov(aX, bY) = ab Cov(X, Y) for all constants a, b ∈ R.
The law of iterated expectation, also called the tower property, says that

E[X1] = E{E[X1 | X2]} = E[µX1|X2(X2)]

and similarly

E[X2] = E{E[X2 | X1]} = E[µX2|X1(X1)].
These formulas hold for continuous and discrete random variables.
Here is an example. Suppose that X2 ∼ N(µ, σ²), and that conditional upon X2 = x2, X1 is
distributed according to N(x2, γ²). The joint density is then

fX(x1, x2) = fX1|X2(x1 | x2) fX2(x2) = (1/√(2πγ²)) e^{−(x1−x2)²/(2γ²)} · (1/√(2πσ²)) e^{−(x2−µ)²/(2σ²)}.
What is E[X1 ]? We first note that µX1 |X2 (x2 ) = E[X1 | X2 = x2 ] = x2 . We can use the law of
iterated expectation to find that E[X1 ] = E[E[X1 | X2 ]] = E[X2 ] = µ.
How can we prove the law of total variance, Var(X1) = E[Var(X1 | X2)] + Var(E[X1 | X2])?
First we note that

E[Var(X1 | X2)] = E[E(X1² | X2) − E(X1 | X2)²]
= E[E(X1² | X2)] − E[E(X1 | X2)²]
= E[X1²] − E[E(X1 | X2)²].

The second line follows by linearity of expectation, and the third line by the law of iterated expectation.
We also have

Var(E[X1 | X2]) = E[E(X1 | X2)²] − (E[E(X1 | X2)])²
= E[E(X1 | X2)²] − (E[X1])².

Adding these together,

E[Var(X1 | X2)] + Var(E[X1 | X2]) = E[X1²] − (E[X1])² = Var(X1).
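Both laws can be checked by simulation on the normal example above (X2 ∼ N(µ, σ²) and X1 | X2 = x2 ∼ N(x2, γ²)): the laws give E[X1] = µ and Var(X1) = γ² + σ². A seeded Python sketch (parameter values are arbitrary illustrations):

```python
import random

random.seed(1)
mu, sigma, gamma = 3.0, 2.0, 1.0   # arbitrary illustrative parameters
n = 200_000

x1 = []
for _ in range(n):
    x2 = random.gauss(mu, sigma)        # X2 ~ N(mu, sigma^2)
    x1.append(random.gauss(x2, gamma))  # X1 | X2 = x2 ~ N(x2, gamma^2)

mean_x1 = sum(x1) / n
var_x1 = sum((v - mean_x1) ** 2 for v in x1) / n
print(mean_x1, var_x1)
# theory: E[X1] = mu = 3, Var(X1) = gamma^2 + sigma^2 = 5
```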
4 Transformations of random variables
One of the central tasks in statistics is to study, well, statistics. Given some data (i.e. random
variables) X1 , X2 , . . . , Xn , a statistic is simply a function T := T (X1 , . . . , Xn ) of the data. Starting
in Chapter 8, we will provide an in-depth look at why we care about statistics. To find the
distribution of T , we need to know how to find the distribution of a function, or transformation, of
a random variable.
In this chapter, we will discuss three ways of finding the distribution of a transformation T of
a random variable: (1) the change-of-variables formula (called the method of transformations by
Wackerly), which is only valid for monotone functions, (2) the method of the moment generating
function, and (3) the method of distribution functions. Method (3) is the most general, but also
the most tedious. Methods (1) and (2) can’t always be used, but are easier when they apply.
Suppose first that X is a discrete RV and T is invertible. Then U := T(X) is discrete with pmf
fU(u) = P(T(X) = u) = P(X = T⁻¹(u)) = fX(T⁻¹(u)). For example, suppose that T(x) := ax + b
for some a ≠ 0. Then T⁻¹(u) = (u − b)/a, so fU(u) = fX((u − b)/a) for any discrete RV X.
If T is injective, meaning T(x) = T(y) if and only if x = y, but not surjective, meaning there are
some elements u ∈ R such that there is no x ∈ R with T(x) = u, then we need to slightly amend
our formula as follows:

fU(u) := P(U = u) = P(T(X) = u) = fX(T⁻¹(u)) if u ∈ Ran(T), and 0 otherwise.
Here, Ran(T ) = {u : T (x) = u for some x ∈ R} is the range of T .
As an example, consider T(x) := e^x. Then Ran(T) = (0, ∞) and for u > 0, T⁻¹(u) = log(u), so

fU(u) = fX(log(u)) if u > 0, and 0 if u ≤ 0.
(Here and throughout, log(x) is the so-called natural logarithm with respect to base e, sometimes
denoted ln(x).)
Now suppose that X is a continuous univariate RV with density function fX and T : supp(X) → R
is an invertible and strictly increasing transformation. Then

FU(u) = FT(X)(u) := P(U ≤ u) = P(T(X) ≤ u) = P(X ≤ T⁻¹(u)) = FX(T⁻¹(u)).
If T is also differentiable, then by the chain rule,

fU(u) = fT(X)(u) := (d/du) FX(T⁻¹(u)) = fX(T⁻¹(u)) (d/du) T⁻¹(u) = fX(T⁻¹(u)) / T′(T⁻¹(u)).
These formulas give the distribution and density functions of U = T (X) in terms of the distribution
and density functions of X and the transformation T. If T is strictly decreasing, then

FU(u) = P(T(X) ≤ u) = P(X ≥ T⁻¹(u)) = 1 − FX(T⁻¹(u))

and

fU(u) = −fX(T⁻¹(u)) (d/du) T⁻¹(u) = −fX(T⁻¹(u)) / T′(T⁻¹(u)).
If T is invertible and either strictly increasing or decreasing, we have

fU(u) = fX(T⁻¹(u)) |(d/du) T⁻¹(u)| = fX(T⁻¹(u)) / |T′(T⁻¹(u))|.
For example, suppose that T(x) := ax + b for a ≠ 0. Then T⁻¹(u) = (u − b)/a and (d/du) T⁻¹(u) = 1/a,
so fU(u) = fX((u − b)/a)/|a|. If a > 0, then T is strictly increasing, so FU(u) = FX((u − b)/a).
If a < 0, then T is strictly decreasing, so FU(u) = 1 − FX((u − b)/a).
Suppose that supp(X) ⊆ (0, ∞) and T(x) := log(x). The support condition is important since
log(x) is not defined for x ≤ 0. Then T is strictly increasing, T⁻¹(u) = e^u and (d/du) T⁻¹(u) = e^u, so
fU(u) = f_{log(X)}(u) = fX(e^u) e^u and FU(u) = FX(e^u). Note that although X is positive, U might
have support equal to all of R. For example, if X ∼ Exp(λ), so that fX(x) = λe^{−λx} for x ≥ 0, then

f_{log(X)}(u) = λ e^{u − λe^u}.
As with discrete random variables, we need to be careful when the range of T is not the entirety
of R. For example, set T(x) = e^x, so that T⁻¹(u) = log(u) and (d/du) T⁻¹(u) = 1/u. Then since
Ran(T) = (0, ∞), FT(X)(u) = 0 and fT(X)(u) = 0 for u ≤ 0, and FT(X)(u) = FX(log(u)) and
fT(X)(u) = fX(log(u))/u for u > 0. If X ∼ N(µ, σ²), then

fT(X)(u) = (1/(√(2π) σ u)) e^{−[log(u)−µ]²/(2σ²)}.

T(X) then follows the log-normal distribution with parameters µ and σ².
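The change-of-variables density for U = e^X can be cross-checked by numerically differentiating FU(u) = FX(log u). A Python sketch (the parameter values are arbitrary illustrations):

```python
import math

def norm_cdf(x, mu=0.0, sigma=1.0):
    # normal CDF via the standard-library error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 0.3, 0.8   # arbitrary illustrative parameters
u = 1.7

# change-of-variables density of U = e^X for X ~ N(mu, sigma^2)
pdf_formula = (math.exp(-(math.log(u) - mu) ** 2 / (2 * sigma**2))
               / (math.sqrt(2 * math.pi) * sigma * u))

# numerical derivative of F_U(u) = F_X(log u)
h = 1e-6
pdf_numeric = (norm_cdf(math.log(u + h), mu, sigma)
               - norm_cdf(math.log(u - h), mu, sigma)) / (2 * h)
print(pdf_formula, pdf_numeric)   # the two values agree closely
```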
Recall that the moment generating function (mgf) of a random variable X is defined as mX(t) :=
E[e^{tX}]. The mgf of an RV, when it exists, determines its distribution. That is, if two RVs X and Y have mgfs mX
and mY such that mX(t) = mY(t) for all t, then X and Y have the same distribution. Therefore,
another way to find the distribution of a transformation T of a random variable is to find the mgf
mT(X)(t) := E[e^{tT(X)}] of the transformation, then see if we can match this mgf to a list of mgfs
of common distributions. This method works for non-invertible transformations T.
For example, suppose that X ∼ N(0, 1) and T(x) = x². Then

mX²(t) = E[e^{tX²}] = ∫_{−∞}^{∞} (1/√(2π)) e^{tx²} e^{−x²/2} dx = ∫_{−∞}^{∞} (1/√(2π)) e^{−(1−2t)x²/2} dx.
Suppose that t < 1/2, so that 0 < 1 − 2t. Then e^{−(1−2t)x²/2} = e^{−x²/(2s²)}, where s = (1 − 2t)^{−1/2}.
Therefore, we have

mX²(t) = s ∫_{−∞}^{∞} (1/(√(2π) s)) e^{−x²/(2s²)} dx.

Now, ∫_{−∞}^{∞} (1/(√(2π) s)) e^{−x²/(2s²)} dx = 1 because this is the integral of a normal density. Therefore,
mX²(t) = s = (1 − 2t)^{−1/2}. This is the mgf of a chi-squared distribution with k = 1 degree
of freedom, so X² follows the chi-squared distribution with one degree of freedom, and we write
X² ∼ χ²(1) or X² ∼ χ²₁.
As another example, suppose that fX(x) = (τ/2) e^{−τ|x|} for τ > 0 (the Laplace density; the factor
of 1/2 makes it integrate to one), and T(x) = |x|. Then for t < τ,

m|X|(t) = ∫_{−∞}^{∞} (τ/2) e^{t|x|−τ|x|} dx = ∫_{−∞}^{∞} (τ/2) e^{−(τ−t)|x|} dx.

Setting c := τ − t > 0 and using ∫_{−∞}^{∞} c e^{−c|x|} dx = 2 ∫_{0}^{∞} c e^{−cx} dx = 2, we obtain

m|X|(t) = (τ/(2c)) ∫_{−∞}^{∞} c e^{−c|x|} dx = (τ/(2c)) · 2 = τ/(τ − t)

by a similar argument to the one used above. This is the mgf of an exponential distribution with rate λ = τ,
so |X| ∼ Exp(τ).
The method of moment generating functions can also be used to find the distribution of a function
of multiple random variables. In particular, if U = Σ_{i=1}^n Xi, where the Xi’s are all independent,
then

mU(t) := E[e^{t Σ_{i=1}^n Xi}] = E[∏_{i=1}^n e^{tXi}] = ∏_{i=1}^n E[e^{tXi}] = ∏_{i=1}^n mXi(t).
One of the most important examples of the use of this result is showing that the sum of independent
normal random variables is still normal. Suppose that Xi ∼ N(µi, σ²i), where the Xi’s are all
independent, and let U = Σ_{i=1}^n ai Xi. The mgf of Xi is mXi(t) = e^{µi t + σ²i t²/2}, so the mgf of ai Xi is

m_{ai Xi}(t) = E[e^{t ai Xi}] = E[e^{(ai t) Xi}] = mXi(ai t) = e^{ai µi t + a²i σ²i t²/2}.

(Note that for any random variable Y, m_{aY}(t) = mY(at).) Therefore,

mU(t) = ∏_{i=1}^n m_{ai Xi}(t) = ∏_{i=1}^n exp(ai µi t + a²i σ²i t²/2) = exp(Σ_{i=1}^n ai µi t + Σ_{i=1}^n a²i σ²i t²/2).

We recognize this as the mgf of a normal distribution with mean Σ_{i=1}^n ai µi and variance Σ_{i=1}^n a²i σ²i.
This is a fundamental result about normal distributions: if Xi ∼ N(µi, σ²i), where the Xi’s are all
independent, then

Σ_{i=1}^n ai Xi ∼ N(Σ_{i=1}^n ai µi, Σ_{i=1}^n a²i σ²i).

We say that the family of normal distributions is closed under linear transformations.
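This closure property is easy to probe by simulation; the coefficients, means, and variances below are arbitrary illustrations:

```python
import random

random.seed(2)
n_draws = 200_000
a = [0.5, 3.0]        # arbitrary coefficients
mus = [1.0, -2.0]     # arbitrary means
sigmas = [1.0, 2.0]   # arbitrary standard deviations

# draws of U = a1*X1 + a2*X2 with X1, X2 independent normals
draws = [sum(ai * random.gauss(mi, si) for ai, mi, si in zip(a, mus, sigmas))
         for _ in range(n_draws)]

mean_u = sum(draws) / n_draws
var_u = sum((d - mean_u) ** 2 for d in draws) / n_draws
print(mean_u, var_u)
# theory: mean = 0.5*1 + 3*(-2) = -5.5, variance = 0.25*1 + 9*4 = 36.25
```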
4.4 The method of distribution functions
In some cases, the change-of-variables formula (aka the method of transformations) doesn’t apply
because the transformation is not monotone, and the method of moment generating functions
doesn’t work because the mgf doesn’t exist or we can’t find a distribution that the mgf compares
to, but there is still a way to find the distribution function of a transformation of a random variable.
Suppose first that X ∈ R is a continuous univariate (real) random variable with distribution function
FX, and that we want to know the distribution of U := T(X), where T : R → R is a known function.
Then

FU(u) := P(U ≤ u) = P(T(X) ≤ u) = P(X ∈ {x : T(x) ≤ u}) = ∫_{{x : T(x) ≤ u}} fX(x) dx.
If the set {x : T (x) ≤ u} can be described simply, it may then be possible to find an explicit form
of the distribution function FU of U . The density function of U is then
fU(u) := (d/du) FU(u) = (d/du) ∫_{{x : T(x) ≤ u}} fX(x) dx.
For example, we can also derive the fact that if X ∼ N(0, 1) then X² ∼ χ²(1). By the method of
distribution functions, for u ≥ 0 we have

FX²(u) = P(X² ≤ u) = P(−√u ≤ X ≤ √u) = FX(√u) − FX(−√u).
Differentiating, for u > 0,

fX²(u) = (d/du)[FX(√u) − FX(−√u)] = [fX(√u) + fX(−√u)]/(2√u)

by the fundamental theorem of calculus and the chain rule. As another example, suppose that
X ∼ Cauchy(0, γ), so that fX(x) = 1/(πγ[1 + (x/γ)²]), and consider T(x) := 1/x with U := 1/X.
For u < 0, the set {x : 1/x ≤ u} is [1/u, 0), so

fU(u) = (d/du) ∫_{{x : T(x) ≤ u}} fX(x) dx = (d/du) ∫_{1/u}^{0} fX(x) dx = (1/u²) fX(1/u).
Now we have

(1/u²) fX(1/u) = 1/(u² πγ [1 + (1/(uγ))²]) = 1/(πγ [u² + (1/γ)²]) = 1/(π(1/γ)[1 + (u/(1/γ))²]).

But this is just the pdf of a Cauchy distribution with scale parameter 1/γ. Therefore, 1/X ∼
Cauchy(0, 1/γ).
As noted above, if Xi ∼ N(µi, σ²i), where the Xi’s are all independent for i = 1, . . . , n, then

Σ_{i=1}^n ai Xi ∼ N(Σ_{i=1}^n ai µi, Σ_{i=1}^n a²i σ²i).

This implies that if X1, . . . , Xn ∼ N(µ, σ²) are all independent, then the sample mean
X̄n := (1/n) Σ_{i=1}^n Xi satisfies X̄n ∼ N(µ, σ²/n).
There are several other distributions related to transformations of normal random variables that
will be important. First, suppose that Z ∼ N(0, 1). Then, as discussed above, Z² ∼ χ²(1). The
chi-squared distribution with n degrees of freedom is defined as the distribution of the sum of
n independent χ²(1) random variables, i.e. Σ_{i=1}^n Z²i ∼ χ²(n) = χ²n if Z1, . . . , Zn ∼ N(0, 1) are
independent. Another important distribution is the (Student’s) t-distribution with k degrees of freedom,
which is defined as the distribution of the random variable Z/√(Wk/k), where Z ∼ N(0, 1), Wk ∼
χ²(k), and Z and Wk are independent, and we write Z/√(Wk/k) ∼ t(k).
It turns out to be the case that if Z1, . . . , Zn ∼ N(0, 1) are IID, then Σ_{i=1}^n (Zi − Z̄n)² ∼ χ²(n − 1),
and furthermore that Σ_{i=1}^n (Zi − Z̄n)² and Z̄n are independent random variables. This is by no
means an obvious statement, but unfortunately its proof is beyond the scope of this class.
Suppose that X1, X2, . . . , Xn are independent real random variables with common distribution
function FX. Let X(1) := min_{1≤i≤n} Xi and X(n) := max_{1≤i≤n} Xi. These are examples of order
statistics of the random sample X1, X2, . . . , Xn. The distribution of X(n) is defined, as usual:

FX(n)(x) := P(X(n) ≤ x) = P(max_{1≤i≤n} Xi ≤ x).

Now, we note that max_{1≤i≤n} Xi ≤ x if and only if Xi ≤ x for all 1 ≤ i ≤ n. Thus, by independence
of X1, X2, . . . , Xn,

FX(n)(x) = P(max_{1≤i≤n} Xi ≤ x) = P(X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x) = ∏_{i=1}^n P(Xi ≤ x) = FX(x)^n.
If the Xi’s are continuous RVs, then

fX(n)(x) = (d/dx) FX(n)(x) = (d/dx) FX(x)^n = n fX(x) FX(x)^{n−1}.
For the distribution of X(1), we use a similar approach, but we work with the complementary
cdf 1 − FX instead:

FX(1)(x) := P(X(1) ≤ x) = P(min_{1≤i≤n} Xi ≤ x) = 1 − P(min_{1≤i≤n} Xi > x).

We then note that min_{1≤i≤n} Xi > x if and only if Xi > x for all 1 ≤ i ≤ n, so

FX(1)(x) = 1 − P(X1 > x, X2 > x, . . . , Xn > x) = 1 − ∏_{i=1}^n P(Xi > x) = 1 − [1 − FX(x)]^n.
Here is an example. A random variable X follows the Weibull distribution with scale λ > 0 and
shape k > 0 if its density function is given by

fX(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}

for x > 0, and 0 for x ≤ 0. The cdf of X is given by FX(x) = 1 − e^{−(x/λ)^k} for x > 0, and 0 for
x ≤ 0. Suppose X1, X2, . . . , Xn are independent Weibull(λ, k) RVs. Then the distribution function
of X(1) is given by

FX(1)(x) = 1 − [1 − FX(x)]^n = 1 − [e^{−(x/λ)^k}]^n = 1 − e^{−n(x/λ)^k} = 1 − exp{−(x/(λ/n^{1/k}))^k}.
This is none other than the distribution function of a Weibull(λ/n^{1/k}, k) random variable! This
property of the Weibull distribution is useful in reliability analysis. Suppose that a machine or
system is made up of a bunch of sub-components, and the system fails whenever any of the sub-
components fails. Then the time to failure of the system is the minimum of the time to failure of
each of the sub-components. The closure of Weibull under minimums makes it useful for this type
of modeling.
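The closure of Weibull under minimums can be checked by simulation (the parameter values below are arbitrary illustrations):

```python
import math
import random

random.seed(3)
lam, k, n = 2.0, 1.5, 5   # arbitrary scale, shape, and number of components
reps = 100_000

# minimum of n independent Weibull(lam, k) draws, repeated many times
mins = [min(random.weibullvariate(lam, k) for _ in range(n)) for _ in range(reps)]

x = 0.8
empirical = sum(m <= x for m in mins) / reps
# closure property: the minimum is Weibull(lam / n^(1/k), k)
theoretical = 1 - math.exp(-((x / (lam / n ** (1 / k))) ** k))
print(empirical, theoretical)   # empirical CDF matches the Weibull CDF
```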
Suppose that X1, . . . , Xn are IID with CDF FX and have mean µX = E[X] and variance σ²X =
Var(X). In general, it can be difficult to find the exact distribution function
P(X̄n ≤ x) of the sample mean X̄n. However, it turns out that for large n, X̄n looks approximately like a N(µ, σ²/n)
random variable, regardless of what the distribution FX is. This is a deep and important theorem
in probability theory, and is extremely useful throughout statistics.
Formally, we introduce the random variable

Zn := (X̄n − µ)/(σ/√n) = √n (X̄n − µ)/σ = (1/√n) Σ_{i=1}^n (Xi − µ)/σ.

The Central Limit Theorem (CLT) states that Zn converges in distribution to N(0, 1) as n → ∞.
As an example, suppose Xn ∼ Binomial(n, p). Then Xn is a sum of n IID Bernoulli(p) random
variables, which by the CLT converges in distribution to N(0, 1) after standardization. Hence, for large n, (Xn − np)/√(np(1 − p)) is approximately
N(0, 1), so Xn is approximately N(np, np(1 − p)). Note that E[Xn] = np and Var(Xn) = np(1 − p).
Figure 2 shows the Binomial(n, p) CDF and PMF for n = 5, 10, 50, 100 and the N(np, np(1 − p))
CDF and PDF approximation. The approximation is excellent for n = 50 and n = 100.
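The quality of the normal approximation can also be probed directly by comparing the exact Binomial CDF with its normal approximation; the sketch below uses n = 100 and p = 0.3 (an arbitrary illustration):

```python
import math

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_cdf(x, n, p):
    # exact Binomial(n, p) CDF at integer x
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x + 1))

n, p = 100, 0.3
mu, sd = n * p, math.sqrt(n * p * (1 - p))

# exact CDF vs. its CLT normal approximation at a few points
gaps = [abs(binom_cdf(x, n, p) - norm_cdf((x - mu) / sd)) for x in (20, 30, 40)]
print(gaps)   # all gaps are small
```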
It is also worth remarking that the CLT in part motivates the use of the normal distribution for
modeling in statistics. The CLT tells us that any random variable that is the sum of a moderate or
large number of other roughly independent factors is approximately normally distributed, so it can
be reasonably modeled using a normal distribution even if this is not quite realistic. For
Figure 1: Top: cumulative distribution functions (CDFs) of (X̄n − µ)/(σ/√n) for X1, . . . , Xn ∼
Exp(2) (solid black line) and the CDF of N(0, 1) (dotted red line). Bottom: probability density
functions of these two random variables.
Figure 2: Top: cumulative distribution functions (CDFs) of a Binomial(n, p) random variable and
its N (np, np(1 − p)) approximation. Bottom: probability mass/density functions of these two
random variables.