MA225 L3 Notes
(MA 225)
Class Notes
September – November, 2020
Instructor
Ayon Ganguly
Department of Mathematics
IIT Guwahati
Contents
1 Probability
  1.1 Probability
    1.1.1 Classical Probability
    1.1.2 Countable and Uncountable Sets
    1.1.3 Axiomatic Probability
    1.1.4 Continuity of Probability
  1.2 Conditional Probability
  1.3 Independence
2 Random Variable
  2.1 Random Variable
  2.2 Discrete Random Variable
  2.3 Continuous Random Variable
  2.4 Expectation of Random Variable
  2.5 Transformation of Random Variable
    2.5.1 Technique 1
    2.5.2 Technique 2
    2.5.3 Expectation of Function of RV
    2.5.4 Technique 3
  2.6 Moment Inequality
  2.A Gamma Integral
  2.B Beta Integral
  3.9.3 Computing Expectation by Conditioning
  3.10 Bivariate Normal Distribution
  3.11 Some Results on Independent and Identically Distributed Normal RVs
Chapter 3
In the previous chapter, we studied RVs and the associated concepts. One of the main uses of a random variable is to model a numerical characteristic of a natural phenomenon. For example, we may assume that the RV X denotes the income of a household. Now, assume that Y denotes the spending of the household. Then Z = X − Y is a RV and it is the savings of the household. Clearly, if we know the values of any two of the RVs X, Y, and Z, the value of the third one is known. In such a situation, we may want to study the relationship between X and Z. Clearly, the concepts of the previous chapter are not sufficient. Using the tools of the previous chapter, we can study X and Z separately, but not jointly. For example, we cannot answer the question: does savings increase with income? To answer it, we need to know the probability of the joint occurrence of the events X ≤ x and Z ≤ z for different values of x and z. In this chapter, we will study the concepts relating to the joint occurrence of multiple RVs. There are plenty of examples where we need to consider multiple RVs. These examples include the height and weight of a person, pollution level and blood pressure, the lifetime of a product and its cause of failure, etc.
[Figure 3.1: the JCDF F_{X,Y}(x, y) is the probability that (X, Y) lies in the marked region {(s, t) : s ≤ x, t ≤ y}.]
Theorem 3.1. Let X = (X, Y) be a random vector with JCDF F_{X,Y}(·, ·). Then the CDF of X is given by F_X(x) = lim_{y→∞} F_{X,Y}(x, y) for all x ∈ R. Similarly, the CDF of Y is given by F_Y(y) = lim_{x→∞} F_{X,Y}(x, y) for all y ∈ R.
Proof: Fix x ∈ R. Let {y_n}_{n≥1} be an increasing sequence of real numbers such that y_n → ∞ as n → ∞. Let us define
A_n = {ω ∈ S : X(ω) ≤ x, Y(ω) ≤ y_n} and A = lim_{n→∞} A_n = ∪_{n=1}^{∞} A_n = {ω ∈ S : X(ω) ≤ x}.
Now,
lim_{n→∞} F_{X,Y}(x, y_n) = lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n) = P(A) = F_X(x).
This shows that for any increasing sequence of real numbers {y_n}_{n≥1} with y_n → ∞ as n → ∞, lim_{n→∞} F_{X,Y}(x, y_n) = F_X(x). Thus, lim_{y→∞} F_{X,Y}(x, y) = F_X(x) for each fixed x ∈ R. Similarly, one can prove that lim_{x→∞} F_{X,Y}(x, y) = F_Y(y) for each fixed y ∈ R.
Remark 3.1. The previous theorem can be extended to more than two RVs. For example, let X = (X, Y, Z). In this case, we can find the CDFs of X, Y, and Z using the following formulas:
F_X(x) = lim_{y→∞} lim_{z→∞} F_{X,Y,Z}(x, y, z) = lim_{z→∞} lim_{y→∞} F_{X,Y,Z}(x, y, z) for all x ∈ R,
F_Y(y) = lim_{x→∞} lim_{z→∞} F_{X,Y,Z}(x, y, z) = lim_{z→∞} lim_{x→∞} F_{X,Y,Z}(x, y, z) for all y ∈ R,
F_Z(z) = lim_{x→∞} lim_{y→∞} F_{X,Y,Z}(x, y, z) = lim_{y→∞} lim_{x→∞} F_{X,Y,Z}(x, y, z) for all z ∈ R.
Let X = (X_1, . . . , X_n) and let A = {i_1, i_2, . . . , i_k} ⊂ {1, 2, . . . , n}. If we want to find the JCDF of (X_{i_1}, X_{i_2}, . . . , X_{i_k}), we need to take limits (tending to infinity) with respect to all the components that are not present in A. †
In the context of random vectors, the JCDF of a subset of the components is called a marginal CDF. Thus, if X = (X, Y, Z), the CDF of X is called the marginal CDF of X. Similarly, the JCDF of (X, Y) is called the marginal CDF of (X, Y).
Theorem 3.2 (Properties of JCDF). Let X = (X, Y) be a random vector with JCDF F_{X,Y}(·, ·). Then
1. lim_{x→∞} lim_{y→∞} F_{X,Y}(x, y) = 1;
2. lim_{x→−∞} F_{X,Y}(x, y) = 0 for all y ∈ R;
3. lim_{y→−∞} F_{X,Y}(x, y) = 0 for all x ∈ R;
4. F_{X,Y}(·, ·) is right continuous in each of its arguments;
5. for all real numbers a_1 < b_1 and a_2 < b_2,
F_{X,Y}(b_1, b_2) − F_{X,Y}(a_1, b_2) − F_{X,Y}(b_1, a_2) + F_{X,Y}(a_1, a_2) ≥ 0.
Proof: The proof of this theorem is similar to that of Theorem 2.1 with standard modifications for 2-dimensional functions. Therefore, the proof is skipped here.
Though we are skipping the proof of the previous theorem, let us make a comparison between the properties (presented in the Theorem 2.1) of the CDF of a RV and those of the JCDF of a 2-dimensional random vector. Property 1 in the Theorem 2.1 states that F_X(·) is non-decreasing. This can alternatively be written as F_X(x_2) − F_X(x_1) = P(x_1 < X ≤ x_2) ≥ 0 for all x_1 < x_2. Thus, the non-decreasing property is a consequence of the fact that the probability that a RV lies in an interval must be non-negative. Now, a natural extension of an interval in one dimension is a rectangle in two dimensions. Thus, the equivalent property should be based on the fact that the probability that a 2-dimensional random vector lies in a rectangle must be non-negative. Let a_1 < b_1 and a_2 < b_2 be four real numbers. Then the points (a_1, a_2), (a_1, b_2), (b_1, a_2), and (b_1, b_2) form the vertices of the rectangle (a_1, b_1] × (a_2, b_2]. Now,
P(a_1 < X ≤ b_1, a_2 < Y ≤ b_2) = F_{X,Y}(b_1, b_2) − F_{X,Y}(a_1, b_2) − F_{X,Y}(b_1, a_2) + F_{X,Y}(a_1, a_2).
Thus, the fact that the probability that a random vector lies in a rectangle is non-negative gives Property 5 of Theorem 3.2.
Property 2 of Theorem 2.1 states that lim_{x→∞} F_X(x) = 1. Intuitively, this tells us that if we cover the whole of R, then the probability is one. Similarly, for a 2-dimensional random vector, if the whole of R² is covered, the probability is one, and we have Property 1 of Theorem 3.2.
Property 3 of Theorem 2.1 states that lim_{x→−∞} F_X(x) = 0. Loosely speaking, {X ≤ x} becomes ∅ as x tends to −∞. For a 2-dimensional random vector, if one of the components
tends to −∞, the set becomes empty. If x → −∞, then {X ≤ x} becomes ∅, and hence, {X ≤ x, Y ≤ y} becomes ∅. Similarly, as y → −∞, {X ≤ x, Y ≤ y} becomes ∅. Thus, we have Properties 2 and 3 of Theorem 3.2. Note that for the first property of Theorem 3.2, both the components need to tend to ∞. However, for Properties 2 and 3, if any one of the components tends to −∞ keeping the other fixed, then the JCDF tends to zero. Property 4 is a straightforward extension of Property 4 of the Theorem 2.1.
Theorem 3.3. Let G : R2 → R be a function satisfying conditions 1–5 of the Theorem 3.2.
Then G is a JCDF of some 2-dimensional random vector.
This theorem can be used to check if a function is a JCDF or not. Theorems 3.2 and 3.3 can be extended to random vectors having more than two components. However, writing Property 5 involves complicated expressions.
Definition 3.4 (Joint PMF). Let (X, Y ) be a discrete random vector with support SX, Y .
Define a function fX, Y : R2 → R by
f_{X,Y}(x, y) = P(X = x, Y = y) if (x, y) ∈ S_{X,Y}, and f_{X,Y}(x, y) = 0 otherwise.
The function fX, Y is called joint probability mass function (JPMF) of the discrete random
vector (X, Y ).
Note that a discrete random vector is a straightforward extension of a DRV. In the case of a DRV, we need to find an at most countable set S_X in R such that P(X ∈ S_X) = 1. For a 2-dimensional discrete random vector, S_{X,Y} is an at most countable subset of R² such that P((X, Y) ∈ S_{X,Y}) = 1. Similarly, the definition of the JPMF is also a natural extension of the PMF of a DRV. These definitions can easily be extended to random vectors of more than two dimensions.
Theorem 3.4 (Properties of JPMF). Let (X, Y) be a discrete random vector with JPMF f_{X,Y}(·, ·) and support S_{X,Y}. Then
1. f_{X,Y}(x, y) ≥ 0 for all (x, y) ∈ R²;
2. ∑_{(x, y) ∈ S_{X,Y}} f_{X,Y}(x, y) = 1.
Proof: The proof of the theorem is straightforward from the definitions of a discrete random vector and the JPMF.
Theorem 3.5. If a function g : R² → R satisfies Properties 1 and 2 above for the at most countable set D = {(x, y) ∈ R² : g(x, y) > 0} in place of S_{X,Y}, then g is the JPMF of some 2-dimensional discrete random vector.
Theorem 3.5 can be used to check if a function is a JPMF or not. Again, Theorems 3.4 and 3.5 can be extended to discrete random vectors of more than two dimensions.
Theorem 3.6 (Marginal PMF from JPMF). Let (X, Y) be a discrete random vector with JPMF f_{X,Y}(·, ·) and support S_{X,Y}. Then X and Y are DRVs. The PMF of X is
f_X(x) = ∑_{y : (x, y) ∈ S_{X,Y}} f_{X,Y}(x, y) for all fixed x ∈ R, (3.1)
and the PMF of Y is
f_Y(y) = ∑_{x : (x, y) ∈ S_{X,Y}} f_{X,Y}(x, y) for all fixed y ∈ R. (3.2)
In this context, f_X(·) and f_Y(·) are called the marginal PMF of X and the marginal PMF of Y, respectively.
Proof: Let D = {x ∈ R : (x, y) ∈ S_{X,Y} for some y ∈ R}. As S_{X,Y} is at most countable, D is also at most countable. Fix x_0 ∈ D. Consider (x_0, y) ∈ S_{X,Y} and (x_0, y′) ∈ S_{X,Y} such that y ≠ y′. Then the events
{X = x_0, Y = y} and {X = x_0, Y = y′}
are disjoint. Now, using the theorem of total probability (Theorem 1.16), P(X = x_0) can be found by taking the sum over all the points in S_{X,Y} whose first component is x_0. Thus,
P(X = x_0) = ∑_{(x_0, y) ∈ S_{X,Y}} P(X = x_0, Y = y) = ∑_{(x_0, y) ∈ S_{X,Y}} f_{X,Y}(x_0, y),
and
∑_{x ∈ D} P(X = x) = ∑_{x ∈ D} ∑_{(x, y) ∈ S_{X,Y}} f_{X,Y}(x, y) = ∑_{(x, y) ∈ S_{X,Y}} f_{X,Y}(x, y) = 1.
Hence, X is a DRV with PMF given by (3.1). Similarly, we can prove that Y is also a DRV with PMF given by (3.2).
where c is a constant. We can find the value of c using the properties of a JPMF. If f(·, ·) has to be a JPMF, then f(x, y) ≥ 0 for all (x, y) ∈ R², which implies that c ≥ 0. Also,
∑_{x=1}^{n} ∑_{y=1}^{n} f(x, y) = 1 ⟹ c = 2 / (n²(n + 1)).
We can also find the marginal PMF of X as follows. Fix x ∈ {1, 2, . . . , n}. Then
P(X = x) = ∑_{y : (x, y) ∈ S_{X,Y}} f(x, y) = ∑_{y=1}^{n} cy = 1/n.
where c is a constant. Note that the function f(·, ·) is almost the same as that of the previous example. The only difference is in the sets where the functions are strictly positive. In the previous example, f was positive on {1, 2, . . . , n} × {1, 2, . . . , n}. In the current example the set is {(x, y) ∈ R² : x = 1, 2, . . . , n; y = 1, 2, . . . , n; x ≤ y}. However, this changes the probability distribution completely. We will see that the marginal PMFs are also different. Hence, the support is an important issue.
The constant c is positive and can be found as follows:
∑_{(x, y) ∈ S_{X,Y}} f(x, y) = 1 ⟹ ∑_{y=1}^{n} ∑_{x=1}^{y} cy = 1 ⟹ c = 6 / (n(n + 1)(2n + 1)).
Please note the range of the summations. Thus, the JPMF of (X, Y) is given by
f(x, y) = 6y / (n(n + 1)(2n + 1)) if x = 1, 2, . . . , n; y = 1, 2, . . . , n; x ≤ y, and f(x, y) = 0 otherwise.
The marginal PMF of X can be obtained by summing the JPMF over y: for x ∈ {1, 2, . . . , n},
f_X(x) = ∑_{y=x}^{n} cy = c (n(n + 1) − x(x − 1)) / 2 = 3(n + x)(n − x + 1) / (n(n + 1)(2n + 1)).
Please note the range of the summation above. Thus, the marginal PMF of X is given by
f_X(x) = 3(n + x)(n − x + 1) / (n(n + 1)(2n + 1)) if x = 1, 2, . . . , n, and f_X(x) = 0 otherwise.
||
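As an illustrative numerical check, the following Python sketch (with an arbitrary choice of n) verifies that the JPMF above sums to one and that the summed marginal matches the closed-form expression.

```python
import numpy as np

n = 5
c = 6 / (n * (n + 1) * (2 * n + 1))

# JPMF f(x, y) = c*y on the support {(x, y) : 1 <= x <= y <= n}, 0 elsewhere
f = np.zeros((n + 1, n + 1))
for x in range(1, n + 1):
    for y in range(x, n + 1):
        f[x, y] = c * y

print(f.sum())  # should be 1 (Property 2 of a JPMF)

# marginal PMF of X versus the closed form 3(n + x)(n - x + 1) / (n(n + 1)(2n + 1))
for x in range(1, n + 1):
    closed_form = 3 * (n + x) * (n - x + 1) / (n * (n + 1) * (2 * n + 1))
    print(x, f[x, :].sum(), closed_form)
```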
Proof: The proof of Property 1 is straightforward from the definition of a continuous random vector. For Property 2,
∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = lim_{A→∞} lim_{B→∞} ∫_{−∞}^{B} ∫_{−∞}^{A} f_{X,Y}(x, y) dx dy = lim_{A→∞} lim_{B→∞} F_{X,Y}(A, B) = 1.
The PDF of Y is given by
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx for all fixed y ∈ R. (3.4)
In the context of continuous random vectors, f_X(·) and f_Y(·) are called the marginal PDF of X and the marginal PDF of Y, respectively.
Proof: For x ∈ R,
F_X(x) = lim_{y→∞} F_{X,Y}(x, y)
= lim_{y→∞} ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(s, t) dt ds
= ∫_{−∞}^{x} ( lim_{y→∞} ∫_{−∞}^{y} f_{X,Y}(s, t) dt ) ds
= ∫_{−∞}^{x} ( ∫_{−∞}^{∞} f_{X,Y}(s, t) dt ) ds
= ∫_{−∞}^{x} g(s) ds,
where g(s) = ∫_{−∞}^{∞} f_{X,Y}(s, t) dt. The third equality holds true as f_{X,Y}(x, y) ≥ 0 for all (x, y) ∈ R². Thus, X is a CRV with PDF as given in (3.3). Similarly, we can prove that Y is also a CRV with PDF given in (3.4).
Example 3.3. Let (X, Y) be a continuous random vector with JPDF
f(x, y) = c e^{−(2x+3y)} if 0 < x < y < ∞, and f(x, y) = 0 otherwise,
where c is a constant. Clearly, c > 0 as f_{X,Y}(x, y) ≥ 0 for all (x, y) ∈ R². The value of c can be found as follows:
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dy dx = 1 ⟹ c ∫_{0}^{∞} ∫_{x}^{∞} e^{−(2x+3y)} dy dx = 1 ⟹ c = 15.
Note the range of integration. We can find the marginal PDF of X as follows. For x ≤ 0,
f_X(x) = ∫_{−∞}^{∞} f(x, y) dy = 0.
3.4 Expectation of Function of Random Vector
Definition 3.6 (Expectation of Function of Discrete Random Vector). Let (X, Y ) be a
discrete random vector with JPMF fX, Y (·, ·) and support SX, Y . Let h : R2 → R. Then the
expectation of h(X, Y ) is defined by
E(h(X, Y)) = ∑_{(x, y) ∈ S_{X,Y}} h(x, y) f_{X,Y}(x, y),
provided ∑_{(x, y) ∈ S_{X,Y}} |h(x, y)| f_{X,Y}(x, y) < ∞.
where ai ∈ R is a constant for all i = 1, 2, . . . , n. Here we assume that all the expectations
exist.
Proof: We will prove the theorem for a continuous random vector X = (X, Y). The proof for a general value of n is similar. Also, the proof for a discrete random vector can be written easily by replacing the integration signs with summation signs. Let f be the JPDF of X. Then
E(a_1 X + a_2 Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (a_1 x + a_2 y) f_{X,Y}(x, y) dx dy
= a_1 ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{X,Y}(x, y) dx dy + a_2 ∫_{−∞}^{∞} ∫_{−∞}^{∞} y f_{X,Y}(x, y) dx dy
= a_1 ∫_{−∞}^{∞} x ( ∫_{−∞}^{∞} f_{X,Y}(x, y) dy ) dx + a_2 ∫_{−∞}^{∞} y ( ∫_{−∞}^{∞} f_{X,Y}(x, y) dx ) dy
= a_1 ∫_{−∞}^{∞} x f_X(x) dx + a_2 ∫_{−∞}^{∞} y f_Y(y) dy
= a_1 E(X) + a_2 E(Y).
The previous theorem tells us that we can compute the expectation of a linear combination of the components of a random vector by computing the expectations of the individual components and then taking the linear combination. Note that the left hand side involves the
joint distribution and an n-dimensional integration (for a continuous random vector) or an n-dimensional summation (for a discrete random vector). However, the right hand side involves n one-dimensional integrations or n one-dimensional summations. Also, we need only the marginal distributions (not the joint distribution) to compute the right hand side. Sometimes it is much easier to compute n one-dimensional integrations than a single n-dimensional integration. The same is true for summations. Also, in many problems, it is easier to obtain the marginal distributions than the joint distribution. The following example illustrates this.
Example 3.4. At a party n men throw their hats into the center of a room. The hats are mixed up and each man randomly selects one. Suppose that we want to calculate the expected number of men who select their own hats. Let X denote the number of men who select their own hats. We are interested in finding E(X). To compute E(X) directly, we need the distribution of X. It is clear that X should be a DRV with support S_X = {0, 1, . . . , n − 2, n}. Note that n − 1 ∉ S_X (why?). It is quite easy to find the values P(X = k) for k = 0 and n. However, it is quite difficult to find P(X = k) for the other values of k ∈ S_X. Thus, it becomes a difficult problem if we try to compute E(X) directly.
Let us try to solve it by converting the problem into a multidimensional problem. For i = 1, 2, . . . , n, let us define the RV
X_i = 1 if the ith person takes his own hat, and X_i = 0 otherwise.
Then X = ∑_{i=1}^{n} X_i, and hence, E(X) = ∑_{i=1}^{n} E(X_i). Now, E(X_i) = P(X_i = 1) = 1/n, as the ith person is equally likely to end up with any of the n hats. Therefore,
E(X) = ∑_{i=1}^{n} (1/n) = 1.
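The indicator decomposition makes the answer easy to check by simulation; the following sketch (the value of n and the sample size are arbitrary choices) estimates E(X) by Monte Carlo and should give a value close to 1 for any n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 100_000

# a random permutation models the random assignment of hats; a fixed point
# means that a man received his own hat
matches = np.array([
    np.sum(rng.permutation(n) == np.arange(n)) for _ in range(reps)
])
print(matches.mean())  # close to 1, whatever the value of n
```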
Remark 3.3. If (X, Y) is a continuous random vector, then
P((X, Y) ∈ A) = ∫∫_{(x, y) ∈ A} f_{X,Y}(x, y) dx dy
for all A ⊆ R² such that the integration is possible. This statement can be seen as an extension of the fact that if X is a CRV with PDF f(·), then P(X ∈ B) = ∫_{B} f(x) dx. However, the mathematical proof is out of the scope of this course. †
Remark 3.4. In Theorem 3.9, we have seen that if (X, Y) is a continuous random vector, then X and Y are continuous random variables. However, the converse, in general, is not true. Thus, (X, Y) may not be a continuous random vector even if X and Y are CRVs. Consider the following example in this regard. Let X be a CRV. Suppose that Y = X. Then X and Y are CRVs. It is clear that P(X = Y) = 1. Now, if possible, assume that (X, Y) is a continuous random vector. Thus, (X, Y) has a JPDF, say f(·, ·). Then
P(X = Y) = ∫∫_{x = y} f(x, y) dx dy = 0.
The last equality is true as a double integral ∫∫_{B} g(x, y) dx dy can be interpreted as the volume under the function g(·, ·) over the set B. As the area of the set {(x, y) ∈ R² : x = y} is zero, the volume is also zero. This is a contradiction to the fact that P(X = Y) = 1.
Thus, our assumption is wrong and (X, Y ) is not a continuous random vector.
In general, if there exists a set A ⊂ R2 whose area is zero and P ((X, Y ) ∈ A) > 0, then
(X, Y ) does not have a JPDF, and hence, (X, Y ) is not a continuous random vector. †
Remark 3.5. In Theorems 3.6 and 3.9, we have seen that the marginal distributions can be recovered from the joint distribution. However, the converse, in general, is not true. Let us illustrate it using the following example. †
Example 3.5. Let f(·) and g(·) be two PDFs and F(·) and G(·) be the corresponding CDFs, respectively. Define, for −1 < α < 1,
h(x, y) = f(x) g(y) [1 + α (1 − 2F(x)) (1 − 2G(y))] for all (x, y) ∈ R².
Thus, h(·, ·) is a JPDF of a 2-dimensional continuous random vector, say (X, Y). Let us try to find the marginal PDFs of X and Y. The marginal PDF of X is
f_X(x) = ∫_{−∞}^{∞} h(x, y) dy
= f(x) ∫_{−∞}^{∞} g(y) dy + α f(x) (1 − 2F(x)) ∫_{−∞}^{∞} g(y) (1 − 2G(y)) dy
= f(x).
Similarly, the marginal PDF of Y is g(·). Thus, the marginal PDFs of X and Y do not depend on α. However, the JPDF depends on the value of α. For different values of α, we have different JPDFs, but the marginals remain the same. Hence, given the marginal distributions, in general, we cannot construct the joint distribution. ||
Thus, two RVs X and Y are independent if and only if the events E_x = {X ≤ x} and F_y = {Y ≤ y} are independent for all (x, y) ∈ R². For a discrete random vector, an equivalent definition of independence can be given in terms of the JPMF and the marginal PMFs. Similarly, for a continuous random vector, an equivalent definition of independence can be given in terms of the JPDF and the marginal PDFs. Next, we will give these two alternative definitions; however, we will not prove the equivalence of the respective definitions. Nonetheless, these alternative definitions are very handy in many applications.
Definition 3.9 (Alternative Definition for Discrete Random Vector). Let (X1 , X2 , . . . , Xn )
be a discrete random vector with JPMF fX1 , X2 , ..., Xn (·, . . . , ·). Then X1 , X2 , . . . , Xn are said
to be independent if
f_{X_1, X_2, ..., X_n}(x_1, x_2, . . . , x_n) = ∏_{i=1}^{n} f_{X_i}(x_i) for all (x_1, x_2, . . . , x_n) ∈ Rⁿ.
Definition 3.10 (Alternative Definition for Continuous Random Vector). Let (X1 , X2 , . . . , Xn )
be a continuous random vector with JPDF fX1 , X2 , ..., Xn (·, . . . , ·). Then X1 , X2 , . . . , Xn are
said to be independent if
f_{X_1, X_2, ..., X_n}(x_1, x_2, . . . , x_n) = ∏_{i=1}^{n} f_{X_i}(x_i) for all (x_1, x_2, . . . , x_n) ∈ Rⁿ.
In the last section, we have pointed out that, in general, the joint distribution cannot be recovered from the marginal distributions. However, if X_1, X_2, . . . , X_n are independent RVs and we know the marginal distributions of X_i for all i = 1, 2, . . . , n, then we can write F_{X_1, ..., X_n}(x_1, . . . , x_n) = ∏_{i=1}^{n} F_{X_i}(x_i) for all (x_1, . . . , x_n) ∈ Rⁿ. Thus, we can recover the joint distribution from the marginal distributions if the RVs are known to be independent. Moreover, if X_1, X_2, . . . , X_n are independent CRVs, then (X_1, . . . , X_n) is a continuous random vector.
Theorem 3.11. If X and Y are independent, then
E (g(X)h(Y )) = E (g(X)) E (h(Y )) ,
provided all the expectations exist.
Proof: We will prove it for a continuous random vector (X, Y). For a discrete random vector, it can be proved by replacing the integration signs with summation signs.
E(g(X)h(Y)) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) f_{X,Y}(x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) f_X(x) f_Y(y) dx dy, as X and Y are independent
= ( ∫_{−∞}^{∞} g(x) f_X(x) dx ) ( ∫_{−∞}^{∞} h(y) f_Y(y) dy )
= E(g(X)) E(h(Y)).
It is easy to see that E(X) = 0 and E(X³) = 0. Hence, Cov(X, Y) = 0. Now, P(X ≤ −5) = Φ(−5) ≠ 0 and P(Y ≤ 1) = P(−1 ≤ X ≤ 1) = 2Φ(1) − 1 ≠ 0. However, P(X ≤ −5, Y ≤ 1) = 0. Thus, X and Y are not independent. ||
Theorem 3.12. Let X, Y, Z, X_1, . . . , X_n, Y_1, . . . , Y_m be RVs such that all the necessary expectations for the following exist. Then
1. Cov(X, X) = Var(X).
7. If the X_i's are independent, then Var(∑_{i=1}^{n} X_i) = ∑_{i=1}^{n} Var(X_i).
3. That means that the quadratic equation λ² Var(Y) + 2λ Cov(X, Y) + Var(X) = 0 either has one solution or no solution. Hence, its discriminant is non-positive, that is, (Cov(X, Y))² ≤ Var(X) Var(Y).
3.8 Transformation Techniques
Let X = (X_1, X_2, . . . , X_n) be a random vector and g : Rⁿ → Rᵐ. Clearly, Y = g(X) is an m-dimensional random vector. In this section, we will discuss different methods to find the distribution of the random vector Y = g(X). Like the previous chapter, there are mainly three techniques to obtain the distribution of Y = g(X).
3.8.1 Technique 1
In Technique 1, we try to find the JCDF of Y = g(X) given the distribution of X. As
before, we will discuss this technique using examples.
Example 3.7. Let X_1 and X_2 be independent and identically distributed (i.i.d.) U(0, 1) random variables. Suppose we want to find the CDF of Y = X_1 + X_2. Now,
F_Y(y) = P(Y ≤ y) = P(X_1 + X_2 ≤ y) = ∫∫_{x_1 + x_2 ≤ y} f_{X_1, X_2}(x_1, x_2) dx_1 dx_2. (3.5)
As X1 ∼ U (0, 1), X2 ∼ U (0, 1) and X1 and X2 are independent RVs, the JPDF of (X1 , X2 )
is given by
f_{X_1, X_2}(x_1, x_2) = 1 if 0 < x_1 < 1 and 0 < x_2 < 1, and f_{X_1, X_2}(x_1, x_2) = 0 otherwise.
Thus, the JPDF of (X1 , X2 ) is positive only on the unit square (0, 1) × (0, 1), which is
indicated by gray shade in Figure 3.2. Now, to compute the integration in (3.5), we need to
consider the following cases.
For y < 0, consider the Figure 3.2a. As the integrand in (3.5) is zero over the region
{(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y} for y < 0,
FY (y) = 0.
For 0 ≤ y < 1, consider the Figure 3.2b. The integrand is positive only on the shaded
region in the set {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y}. Therefore,
F_Y(y) = ∫_{0}^{y} ∫_{0}^{y − x_2} dx_1 dx_2 = (1/2) y².
For 1 ≤ y < 2, consider the Figure 3.2c. The integrand is positive only on the shaded
region in the set {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y}. Therefore,
F_Y(y) = 1 − ∫_{y−1}^{1} ∫_{y − x_2}^{1} dx_1 dx_2 = 1 − (1/2)(2 − y)².
For y ≥ 2, consider the Figure 3.2d. The integrand is positive on the shaded region in
the set {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y} and the square (0, 1) × (0, 1) is completely inside the
set {(x1 , x2 ) ∈ R2 : x1 + x2 ≤ y}. Therefore,
FY (y) = 1.
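As a quick sanity check, the piecewise CDF derived above can be compared with an empirical CDF obtained by simulating two independent U(0, 1) variables; the sample size and evaluation points in the sketch below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(size=200_000)
x2 = rng.uniform(size=200_000)
y_sum = x1 + x2

def F_Y(y):
    # piecewise CDF of X1 + X2 derived in Example 3.7
    if y < 0:
        return 0.0
    if y < 1:
        return 0.5 * y ** 2
    if y < 2:
        return 1 - 0.5 * (2 - y) ** 2
    return 1.0

for y in [-0.5, 0.3, 0.8, 1.2, 1.7, 2.5]:
    print(y, np.mean(y_sum <= y), F_Y(y))
```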
[Figure 3.2: the unit square (0, 1) × (0, 1) and the region {(x_1, x_2) : x_1 + x_2 ≤ y} for the cases (a) y < 0, (b) 0 ≤ y < 1, (c) 1 ≤ y < 2, and (d) y ≥ 2.]
Suppose that we want to find the JCDF of Y1 = X1 +X2 and Y2 = X2 −X1 . Note that the
JPDF of (X1 , X2 ) is positive only on the set SX1 , X2 = {(x1 , x2 ) ∈ R2 : 0 < x1 < x2 < ∞}.
[Figure 3.3: (a) the support S_{X_1, X_2} = {(x_1, x_2) : 0 < x_1 < x_2 < ∞}; (b) the case y_1 < 0; (c) the case 0 < y_2 < y_1 < ∞; (d) the case 0 < y_1 < y_2 < ∞. The region A_{y_1, y_2} ∩ S_{X_1, X_2} is bounded by the lines x_2 = x_1, x_1 + x_2 = y_1, and x_2 − x_1 = y_2.]
Suppose that y1 < 0. Then FY1 (y1 ) = 0. See the Figure 3.3b. As FY1 , Y2 (y1 , y2 ) ≤
min {FY1 (y1 ), FY2 (y2 )}, FY1 , Y2 (y1 , y2 ) = 0 for y1 < 0. Similarly, FY1 , Y2 (y1 , y2 ) = 0 for y2 < 0.
For 0 < y2 < y1 < ∞, Ay1 ,y2 ∩ SX1 , X2 is the shaded region of the Figure 3.3c. Therefore,
F_{Y_1, Y_2}(y_1, y_2) = ∫_{0}^{(y_1 − y_2)/2} ∫_{x_1}^{x_1 + y_2} e^{−x_1} dx_2 dx_1 + ∫_{(y_1 − y_2)/2}^{y_1/2} ∫_{x_1}^{y_1 − x_1} e^{−x_1} dx_2 dx_1
= y_1 + e^{−y_1/2} − (y_1 − y_2 + 2) e^{−(y_1 − y_2)/2}.
For 0 < y_1 < y_2 < ∞, A_{y_1, y_2} ∩ S_{X_1, X_2} is indicated by the shaded region in the Figure 3.3d. Therefore,
F_{Y_1, Y_2}(y_1, y_2) = ∫_{0}^{y_1/2} ∫_{x_1}^{y_1 − x_1} e^{−x_1} dx_2 dx_1 = y_1 + 2e^{−y_1/2} − 2.
Thus, the JCDF of (Y_1, Y_2) = (X_1 + X_2, X_2 − X_1) is given by
F_{Y_1, Y_2}(y_1, y_2) = 0 if y_1 < 0 or y_2 < 0,
F_{Y_1, Y_2}(y_1, y_2) = y_1 + e^{−y_1/2} − (y_1 − y_2 + 2) e^{−(y_1 − y_2)/2} if 0 < y_2 ≤ y_1 < ∞, and
F_{Y_1, Y_2}(y_1, y_2) = y_1 + 2e^{−y_1/2} − 2 if 0 < y_1 < y_2 < ∞.
It can be shown that (Y1 , Y2 ) is a continuous random vector. This is easy and therefore left
as an exercise. ||
3.8.2 Technique 2
Like the previous chapter, this technique is based on two theorems, one for discrete random
vector and another for continuous random vector.
Theorem 3.14. Let X = (X1 , X2 , . . . , Xn ) be a discrete random vector with JPMF fX
and support SX . Let gi : Rn → R for all i = 1, 2, . . . , k. Let Yi = gi (X) for i = 1, 2, . . . , k.
Then Y = (Y1 , . . . , Yk ) is a discrete random vector with JPMF
f_Y(y_1, . . . , y_k) = ∑_{x ∈ A_y} f_X(x) if (y_1, . . . , y_k) ∈ S_Y, and f_Y(y_1, . . . , y_k) = 0 otherwise,
where A_y = {x ∈ S_X : g_i(x) = y_i for all i = 1, 2, . . . , k} and S_Y is the support of Y.
Therefore, SX1 , X2 = {0, 1, 2, . . .}×{0, 1, 2, . . .}, which implies that SY = {0, 1, 2, . . .}. For
y ∈ SY , Ay = {(x, y − x) : x = 0, 1, . . . , y}. Hence, using the Theorem 3.14, for y ∈ SY ,
f_Y(y) = ∑_{(x_1, x_2) ∈ A_y} e^{−(λ_1 + λ_2)} λ_1^{x_1} λ_2^{x_2} / (x_1! x_2!) = (e^{−(λ_1 + λ_2)} / y!) ∑_{x=0}^{y} C(y, x) λ_1^{x} λ_2^{y−x} = (1/y!) e^{−(λ_1 + λ_2)} (λ_1 + λ_2)^{y},
where C(y, x) denotes the binomial coefficient.
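The computation above is a discrete convolution, so it can be checked directly; in the sketch below (the values of λ1 and λ2 are arbitrary), summing the JPMF over A_y reproduces the Poisson(λ1 + λ2) PMF.

```python
from math import exp, factorial

lam1, lam2 = 1.5, 2.3

def pois(lam, k):
    return exp(-lam) * lam ** k / factorial(k)

# f_Y(y) = sum over A_y = {(x, y - x) : x = 0, ..., y} of the JPMF
for y in range(6):
    conv = sum(pois(lam1, x) * pois(lam2, y - x) for x in range(y + 1))
    print(y, conv, pois(lam1 + lam2, y))
```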
of successes out of n1 + n2 Bernoulli trials with success probability p. As X1 and X2 are
independent, these n1 + n2 Bernoulli trials can be assumed to be independent. Hence, the
distribution of Y must be Bin(n1 + n2 , p). Let us now check if we get the same distribution
using the Theorem 3.14. The JPMF of (X_1, X_2) is
f_{X_1, X_2}(x_1, x_2) = C(n_1, x_1) C(n_2, x_2) p^{x_1 + x_2} (1 − p)^{n_1 + n_2 − x_1 − x_2} if x_1 = 0, 1, . . . , n_1 and x_2 = 0, 1, . . . , n_2, and f_{X_1, X_2}(x_1, x_2) = 0 otherwise.
For y ∈ {0, 1, . . . , n_1 + n_2},
A_y = {(x_1, x_2) ∈ S_{X_1, X_2} : x_1 + x_2 = y}
= {(x, y − x) : x = 0, 1, . . . , y} if 0 ≤ y ≤ n_1,
= {(x, y − x) : x = 0, 1, . . . , n_1} if n_1 < y ≤ n_2,
= {(x, y − x) : x = y − n_2, . . . , n_1} if n_2 < y ≤ n_1 + n_2.
The last equality can be proved by collecting the coefficient of xy from both sides of the
following expression:
(1 + x)^{n_1} (1 + x)^{n_2} = ( ∑_{i=0}^{n_1} C(n_1, i) x^{i} ) × ( ∑_{i=0}^{n_2} C(n_2, i) x^{i} ).
Thus, X_1 + X_2 ∼ Bin(n_1 + n_2, p). Note that the independence of X_1 and X_2 and the same value of the probability of success are important for the result. ||
is one-to-one. That means that there exists an inverse transformation x_i = h_i(y), i = 1, 2, . . . , n, defined on the range of the transformation.
2. Assume that both the mapping and its inverse are continuous.
3. Assume that the partial derivatives ∂x_i/∂y_j, i = 1, 2, . . . , n, j = 1, 2, . . . , n, exist and are continuous.
4. Assume that the Jacobian J = det((∂x_i/∂y_j)) of the inverse transformation is non-zero on the range of the transformation. Then Y = g(X) is a continuous random vector with JPDF
f_Y(y_1, . . . , y_n) = f_X(h_1(y), . . . , h_n(y)) |J| for y = (y_1, . . . , y_n) in the range of the transformation, and f_Y(y_1, . . . , y_n) = 0 otherwise.
Proof: The proof of this theorem can be done using transformation of variable technique
for multiple integration. However, the proof is skipped here.
Remark 3.7. Note that g is a vector valued function. As g should be one-to-one, the dimension of g should be the same as the dimension of the argument of g. Though we have written g_i : Rⁿ → R in the previous theorem, the conclusion of the theorem is valid if we replace g_i : Rⁿ → R by g_i : S_X → R. Moreover, the theorem gives us sufficient conditions for g(X) to be a continuous random vector when X is a continuous random vector. Thus, g(X) can be a continuous random vector even if the conditions of the previous theorem do not hold true. †
Example 3.11. Let X_1 and X_2 be i.i.d. U(0, 1) random variables. We want to find the JPDF of Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2. Clearly, the transformation is one-to-one, with inverse x_1 = (y_1 + y_2)/2 and x_2 = (y_1 − y_2)/2, and the Jacobian of the inverse transformation is J = −1/2. Thus, all the four conditions of the Theorem 3.15 hold, and hence, Y = (Y_1, Y_2) is a continuous random vector with JPDF
f_{Y_1, Y_2}(y_1, y_2) = f_{X_1, X_2}((y_1 + y_2)/2, (y_1 − y_2)/2) |−1/2|
= 1/2 if 0 < y_1 + y_2 < 2 and 0 < y_1 − y_2 < 2, and f_{Y_1, Y_2}(y_1, y_2) = 0 otherwise.
Note that in Example 3.7, we have found the distribution of X_1 + X_2. You may find the marginal distribution of X_1 + X_2 from the JPDF above and check whether you get the same marginal distribution. ||
Example 3.12. Let X_1 and X_2 be i.i.d. N(0, 1) random variables. We want to find the PDF of Y_1 = X_1/X_2. Note that we cannot use Theorem 3.15 directly here as we have a single function g_1(x_1, x_2) = x_1/x_2. Thus, we need to introduce an auxiliary function g_2(x_1, x_2) such that g(x_1, x_2) = (g_1(x_1, x_2), g_2(x_1, x_2)) satisfies all the conditions of Theorem 3.15. Let us take g_2(x_1, x_2) = x_2. Clearly, g(x_1, x_2) is a one-to-one function. Here, the inverse function is h(y_1, y_2) = (h_1(y_1, y_2), h_2(y_1, y_2)), where x_1 = h_1(y_1, y_2) = y_1 y_2 and x_2 = h_2(y_1, y_2) = y_2. It is easy to see that the mapping g and its inverse are continuous. Also,
∂x_1/∂y_1 = y_2, ∂x_1/∂y_2 = y_1, ∂x_2/∂y_1 = 0, and ∂x_2/∂y_2 = 1.
All the partial derivatives are continuous. Hence, the Jacobian is
J = y_2 · 1 − y_1 · 0 = y_2.
Thus, all the four conditions of the Theorem 3.15 hold, and hence, Y = (X_1/X_2, X_2) is a continuous random vector with JPDF
f_{Y_1, Y_2}(y_1, y_2) = (1/(2π)) e^{−(1 + y_1²) y_2² / 2} |y_2| for (y_1, y_2) ∈ R².
Now, we can find the marginal PDF of Y_1 from the JPDF of (Y_1, Y_2). The marginal PDF of Y_1 is given by
f_{Y_1}(y_1) = ∫_{−∞}^{∞} (|y_2|/(2π)) e^{−(1 + y_1²) y_2² / 2} dy_2 = (1/π) ∫_{0}^{∞} y_2 e^{−(1 + y_1²) y_2² / 2} dy_2 = 1 / (π(1 + y_1²))
for all y1 ∈ R. Thus, Y1 ∼ Cauchy(0, 1). ||
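A small simulation (sample size and evaluation points are arbitrary choices) can be used to check that the ratio of two independent N(0, 1) variables behaves like a Cauchy(0, 1) variable, whose CDF is 1/2 + arctan(y)/π.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.standard_normal(500_000)
x2 = rng.standard_normal(500_000)
ratio = x1 / x2

# empirical CDF of X1/X2 versus the Cauchy(0, 1) CDF
for y in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(y, np.mean(ratio <= y), 0.5 + np.arctan(y) / np.pi)
```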
Theorem 3.16. If X and Y are independent, then g(X) and h(Y ) are also independent.
Proof: The exact proof of the theorem cannot be given in the course. However, intuitively
it makes sense. X and Y are independent. That means that there is no effect of one of the
RVs on the other. Now, g being a function of X only and h being a function of Y only, there
should be no effect of g(X) on h(Y ) and vice versa.
3.8.3 Technique 3
The Technique 3 depends on the MGF. Hence, first we need to define the MGF of a random
vector.
Definition 3.13 (Moment Generating Function). Let X = (X1 , X2 , . . . , Xn ) be a random
vector. The MGF of X at t = (t1 , t2 , . . . , tn ) is defined by
M_X(t) = E[ exp( ∑_{i=1}^{n} t_i X_i ) ].
Theorem 3.17. E(X_1^{r_1} X_2^{r_2} · · · X_n^{r_n}) = ∂^{r_1 + r_2 + ... + r_n} M_X(t) / (∂t_1^{r_1} ∂t_2^{r_2} . . . ∂t_n^{r_n}) evaluated at t = 0.
Theorem 3.18. X and Y are independent iff M_{X,Y}(t_1, t_2) = M_X(t_1) M_Y(t_2) in a neighborhood of the origin.
Note that if X and Y are independent, then using Theorem 3.11, it is straightforward to see that M_{X,Y}(t, s) = M_X(t) M_Y(s). Also, note that E(g(X)h(Y)) = E(g(X)) E(h(Y)) for some functions g and h does not imply that X and Y are independent. In particular, E(XY) = E(X)E(Y) does not imply that X and Y are independent. Please revisit Example 3.6 in this regard.
Definition 3.14. Two n-dimensional random vectors X and Y are said to have the same distribution, denoted by X =_d Y, if F_X(x) = F_Y(x) for all x ∈ Rⁿ.
Theorem 3.19. Let X and Y be two n-dimensional random vectors. If M_X(t) = M_Y(t) for all t in a neighborhood around 0, then X =_d Y.
The fourth equality is true as the RVs X_1, X_2, . . . , X_k are independent. In Example 2.40, we have seen that the MGF of X ∼ Bin(n, p) is M_X(t) = (1 − p + pe^t)^{n} for all t ∈ R. Thus, the MGF of Y is
M_Y(t) = ∏_{i=1}^{k} (1 − p + pe^t)^{n_i} = (1 − p + pe^t)^{∑_{i=1}^{k} n_i}
for t ∈ R. Let Z ∼ Bin(∑_{i=1}^{k} n_i, p); then M_Z(t) = M_Y(t) for all t ∈ R. Thus, Y =_d Z ∼ Bin(∑_{i=1}^{k} n_i, p). Note that this example is an extension of Example 3.10. ||
Example 3.14. Let X_1, X_2, . . . , X_k be i.i.d. Exp(λ) RVs and let Y = ∑_{i=1}^{k} X_i. Then the MGF of Y is
M_Y(t) = ∏_{i=1}^{k} M_{X_i}(t) = [M_{X_1}(t)]^{k} = (1 − t/λ)^{−k}
for all t < λ. The second equality is due to the fact that the X_i's have the same distribution for all i = 1, 2, . . . , k. The third equality holds true from Example 2.41. Let Z ∼ Gamma(k, λ). Then M_Z(t) = M_Y(t) for all t < λ. Hence, Y ∼ Gamma(k, λ). ||
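Since Gamma(k, λ) in the rate parameterization used here has mean k/λ and variance k/λ², the conclusion can be checked by simulation; the chosen k and λ below are arbitrary, and note that numpy parameterizes the exponential by the scale 1/λ.

```python
import numpy as np

rng = np.random.default_rng(3)
k, lam, reps = 5, 2.0, 200_000

# sums of k i.i.d. Exp(lam) variables
y = rng.exponential(scale=1 / lam, size=(reps, k)).sum(axis=1)

print(y.mean(), k / lam)        # Gamma(k, lam) mean
print(y.var(), k / lam ** 2)    # Gamma(k, lam) variance
```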
Example 3.15. Let X_i, i = 1, 2, . . . , k, be independent N(µ_i, σ_i²) RVs. Then ∑_{i=1}^{k} X_i ∼ N(∑_{i=1}^{k} µ_i, ∑_{i=1}^{k} σ_i²). This can be proved following the same technique as in the last example. I am leaving it as an exercise. ||
Definition 3.15 (Expectation of a Random Vector). The expectation of a random vector X = (X_1, X_2, . . . , X_n)′ is given by
E(X) = (EX_1, EX_2, . . . , EX_n)′ = µ.
Definition 3.16 (Variance-Covariance Matrix of a Random Vector). The variance-covariance matrix of an n-dimensional random vector X, denoted by Σ, is defined by Σ = E[(X − µ)(X − µ)′]; its (i, j)th entry is Cov(X_i, X_j).
Thus, X_1 | X_1 + X_2 = y ∼ Bin(y, λ_1/(λ_1 + λ_2)). Note that the support of the conditional PMF is {0, 1, . . . , y}. ||
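A simulation sketch (λ1, λ2, the conditioning value y, and the sample size are arbitrary choices) can be used to check this conditional distribution: among simulated pairs with X1 + X2 = y, the observed values of X1 should follow Bin(y, λ1/(λ1 + λ2)).

```python
import numpy as np
from math import comb

rng = np.random.default_rng(4)
lam1, lam2, reps = 2.0, 3.0, 500_000
x1 = rng.poisson(lam1, reps)
x2 = rng.poisson(lam2, reps)

y = 4                      # condition on X1 + X2 = y
sel = x1[x1 + x2 == y]     # values of X1 given the condition
p = lam1 / (lam1 + lam2)
for k in range(y + 1):
    print(k, np.mean(sel == k), comb(y, k) * p ** k * (1 - p) ** (y - k))
```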
Definition 3.18. The conditional CDF of X given Y = y is defined by
F_{X|Y}(x|y) = P(X ≤ x | Y = y) = ∑_{u ≤ x : (u, y) ∈ S_{X,Y}} f_{X|Y}(u|y) for y ∈ S_Y.
The last equality is due to the fact that P (Xi |Y = 1) = pi for all i = 1, 2, . . . , n. ||
Theorem 3.20. If X and Y are independent DRVs, then fX|Y (x|y) = fX (x) for all x ∈ R
and y ∈ SY .
Proof: The proof is straightforward from the definition of the conditional PMF.
3.9.2 For Continuous Random Vector
Let (X, Y) be a continuous random vector. Note that, in this case, Y is a CRV, and hence, P(Y = y) = 0 for all y ∈ R. As a result, we cannot define the conditional probabilities in the same way as we did for a discrete random vector in the previous subsection. As for a CRV or a continuous random vector, we will first define the conditional CDF for a continuous random vector (X, Y), and then the conditional PDF.
Definition 3.20 (Conditional CDF). Let (X, Y) be a continuous random vector. The conditional CDF of X given Y = y is defined as
F_{X|Y}(x|y) = lim_{ε↓0} P(X ≤ x | Y ∈ (y − ε, y + ε]).
Theorem 3.21. Let fX,Y be the JPDF of (X, Y ) and let fY be the marginal PDF of Y . If
fY (y) > 0, then the conditional PDF of X given Y = y exists and is given by
f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y).
Proof: This is not an exact proof, but an overview of it.
lim_{ε↓0} P(X ≤ x | y − ε < Y ≤ y + ε) = lim_{ε↓0} [ P(X ≤ x, y − ε < Y ≤ y + ε) / P(y − ε < Y ≤ y + ε) ]
= lim_{ε↓0} [ ∫_{−∞}^{x} ∫_{y−ε}^{y+ε} f_{X,Y}(t, s) ds dt / ∫_{y−ε}^{y+ε} f_Y(s) ds ]
= ∫_{−∞}^{x} [ lim_{ε↓0} (1/(2ε)) ∫_{y−ε}^{y+ε} f_{X,Y}(t, s) ds / lim_{ε↓0} (1/(2ε)) ∫_{y−ε}^{y+ε} f_Y(s) ds ] dt
= ∫_{−∞}^{x} [ f_{X,Y}(t, y) / f_Y(y) ] dt.
The last equality is due to the fundamental theorem of calculus. Thus, the conditional PDF is given by f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y) for those values of y ∈ R for which f_Y(y) > 0.
Definition 3.22 (Conditional Expectation for Continuous Random Vector). The conditional
expectation of h(X) given Y = y, is defined for all values of y such that fY (y) > 0, by
E(h(X) | Y = y) = ∫_{−∞}^{∞} h(x) f_{X|Y}(x|y) dx.
Note that Y ∼ U(0, 2) and X|Y = y ∼ Exp(y) for y ∈ (0, 2). Hence,
E(e^{X/2} | Y = 1) = ∫_{0}^{∞} e^{x/2} e^{−x} dx = ∫_{0}^{∞} e^{−x/2} dx = 2.
||
Theorem 3.22. If (X, Y ) is a continuous random vector such that X and Y are indepen-
dent random variables, then fX|Y (x|y) = fX (x) for all x ∈ R and for all y ∈ SY .
Proof: The proof is straightforward from the definition.
Theorem 3.23. Let (X, Y) be a random vector such that E(X) exists. Then E(X) = E(E(X|Y)).
In the previous theorem, the outside expectation is with respect to the distribution of Y, as E(X|Y) is a function of Y. This theorem can be used to solve many problems. It states that we can compute the averages of different parts of a population separately and then take a weighted sum of those averages to obtain the overall average. Let there be
three columns of seating (like in some of the lecture halls at IITG) in a class, and suppose we want to find the average height of the students in the class. Let x̄_i denote the average height of the students sitting in the ith column and n_i be the number of students sitting in the ith column. Then the overall average is
x̄ = (n_1/n) x̄_1 + (n_2/n) x̄_2 + (n_3/n) x̄_3,
where n = n_1 + n_2 + n_3. Note that n_i/n can be interpreted as the probability that a student is in column i, and x̄_i is the conditional expectation of height given that the student is in column i. Therefore, the overall average is E(E(X|Y)), where X denotes the height of a student and Y is an indicator of the column number. We can take Y = i if the student is in the ith column. Thus, the above theorem is a generalization of what we have learned in school.
Though we have discussed expectations when (X, Y) is either a discrete random vector or a continuous random vector, the above theorem is still valid if one of them is a DRV and the other is a CRV. We will see some applications where one of them is a CRV and the other one is a DRV. Of course, we will not go into a proper definition (which is out of the scope of this course) or a proof in these cases. If Y is a DRV, then
E(X) = E(E(X|Y)) = ∑_{y ∈ S_Y} E(X|Y = y) f_Y(y).
If Y is a CRV, then
E(X) = E(E(X|Y)) = ∫_{−∞}^{∞} E(X|Y = y) f_Y(y) dy.
Example 3.21. Virat will read either one chapter of his probability book or one chapter of his history book. If the number of misprints in a chapter of his probability and history books is Poisson with mean 2 and 5, respectively, then, assuming that Virat is equally likely to choose either book, we can compute the expected number of misprints that he will come across using the above theorem. Let X denote the number of misprints and
Y = 1 if Virat reads the probability book, and Y = 2 if Virat reads the history book.
We need to find E(X). Note that P(Y = 1) = P(Y = 2) = 1/2, E(X|Y = 1) = 2, and E(X|Y = 2) = 5. Hence,
E(X) = E(E(X|Y)) = P(Y = 1) E(X|Y = 1) + P(Y = 2) E(X|Y = 2) = (2 + 5)/2 = 3.5.
Thus, the expected number of misprints that Virat will come across is 3.5. ||
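The same answer is obtained by directly simulating the two-stage experiment (choose a book, then generate a Poisson number of misprints); the sample size below is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(5)
reps = 200_000

book = rng.integers(1, 3, size=reps)            # Y: 1 = probability book, 2 = history book
mean_misprints = np.where(book == 1, 2.0, 5.0)  # E(X | Y)
misprints = rng.poisson(mean_misprints)         # X | Y = y is Poisson with mean 2 or 5
print(misprints.mean())                         # close to 3.5
```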
Theorem 3.24. E(X − E(X|Y))² ≤ E(X − f(Y))² for any function f.
Proof: Let us denote µ(Y) = E(X|Y). Then
E(X − f(Y))² = E(X − µ(Y) + µ(Y) − f(Y))² = E(X − µ(Y))² + E(µ(Y) − f(Y))² + 2E[(X − µ(Y))(µ(Y) − f(Y))].
Now,
E[(X − µ(Y))(µ(Y) − f(Y))] = E( E[(X − µ(Y))(µ(Y) − f(Y)) | Y] ) = E( (µ(Y) − f(Y)) E[X − µ(Y) | Y] ) = E( (µ(Y) − f(Y))(E(X|Y) − µ(Y)) ) = 0.
The first equality is due to the Theorem 3.23. For the second equality, notice that µ(Y) − f(Y), being a function of Y only, acts as a constant when Y is given. Hence, µ(Y) − f(Y) comes out of the conditional expectation. Thus,
E(X − f(Y))² = E(X − µ(Y))² + E(µ(Y) − f(Y))² ≥ E(X − µ(Y))²,
as E(µ(Y) − f(Y))² ≥ 0. The equality holds if and only if E(µ(Y) − f(Y))² = 0. It can be shown that E(µ(Y) − f(Y))² = 0 if and only if f(Y) = µ(Y) = E(X|Y).
Recall the Theorem 2.12, which states that if we do not have any extra information, then the "best estimate" of X is E(X). The Theorem 3.24 states that if we have information on the RV Y, then the "best estimate" of X changes and becomes E(X|Y).
Definition 3.23 (Conditional Variance). Let (X, Y) be a random vector. Then the conditional variance of X given Y = y is defined by
Var(X|Y = y) = E[ (X − E(X|Y = y))² | Y = y ].
Let µ = E(X) and µ(Y) = E(X|Y). Then
Var(X) = E(X − µ)² = E(X − µ(Y))² + E(µ(Y) − µ)² + 2E[(X − µ(Y))(µ(Y) − µ)].
As in the proof of Theorem 3.24, the cross term is zero, and E(X − µ(Y))² = E( E[(X − µ(Y))² | Y] ) = E(Var(X|Y)). Thus,
Var(X) = E(Var(X|Y)) + Var(µ(Y)) = E(Var(X|Y)) + Var(E(X|Y)).
Note that the formula is easy to remember: on the right hand side, one term is the expectation of the conditional variance and the other is the variance of the conditional expectation.
Example 3.22. Let X_0, X_1, X_2, . . . be a sequence of i.i.d. RVs with mean µ and variance σ². Let N ∼ Bin(n, p) be independent of the X_i's for all i = 0, 1, . . .. Define S = ∑_{i=0}^{N} X_i. Note that the RV S is the sum of a random number of RVs. Such RVs are called compound RVs. Compound random variables are quite important in many practical situations. For example, consider a car insurance company. The number of accidents that a customer meets in a year is a RV. Let N denote the number of accidents in a year. Now, assume that X_i denotes the claim by the customer after the ith accident. Then S is the total claim made by the customer. Now, it is important for the insurance company to have an idea of the average and variance of the claims made by a customer. Let us try to compute E(S) and Var(S). Note that
E(S|N = n) = E( ∑_{i=0}^{N} X_i | N = n ) = E( ∑_{i=0}^{n} X_i | N = n ) = E( ∑_{i=0}^{n} X_i ) = (n + 1)µ.
The second equality is true as, under the condition N = n, the sum has n + 1 components. The third equality is true due to the fact that N and the X_i's are independent. Note that ∑_{i=0}^{N} X_i
and N are not independent. However, when we put a specific value of N in ∑_{i=0}^{N} X_i to get ∑_{i=0}^{n} X_i, the latter does not involve N and becomes independent of N. Thus, we have E(S|N) = (N + 1)µ. Hence,
E(S) = E(E(S|N)) = E[(N + 1)µ] = (np + 1)µ.
Now,
Var(S|N = n) = Var( ∑_{i=0}^{N} X_i | N = n ) = Var( ∑_{i=0}^{n} X_i | N = n ) = Var( ∑_{i=0}^{n} X_i ) = (n + 1)σ².
Example 3.25. Suppose X and Y are two independent RVs, each of which is either discrete or continuous. Let us study the RV Z = X + Y and try to see whether it is a CRV or a DRV.
We know that (X, Y) is a discrete random vector if X and Y are DRVs, and hence, Z = X + Y is a DRV. The PMF of Z, for z ∈ R, is
f_Z(z) = P(X + Y = z)
= ∑_{y ∈ S_Y} P(X + Y = z | Y = y) P(Y = y)
= ∑_{y ∈ S_Y} P(X + y = z | Y = y) f_Y(y)
= ∑_{y ∈ S_Y} P(X = z − y) f_Y(y)
= ∑_{y ∈ S_Y} f_X(z − y) f_Y(y).
Now, assume that X and Y are CRVs. Let us first find the CDF of Z and then check what type of RV Z is. The CDF of Z, for z ∈ R, is
F_Z(z) = P(X + Y ≤ z)
= ∫_{−∞}^{∞} P(X + Y ≤ z | Y = y) f_Y(y) dy
= ∫_{−∞}^{∞} P(X + y ≤ z | Y = y) f_Y(y) dy
= ∫_{−∞}^{∞} P(X ≤ z − y) f_Y(y) dy, as X and Y are independent
= ∫_{−∞}^{∞} ∫_{−∞}^{z−y} f_X(x) f_Y(y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{z} f_X(x′ − y) f_Y(y) dx′ dy, taking x′ = x + y
= ∫_{−∞}^{z} ( ∫_{−∞}^{∞} f_X(x′ − y) f_Y(y) dy ) dx′ for all z ∈ R.
Thus, Z = X + Y is a CRV with PDF
f_{X+Y}(z) = ∫_{−∞}^{∞} f_X(z − y) f_Y(y) dy for all z ∈ R.
Note that by changing the roles of X and Y, we can write the PDF of Z as
f_{X+Y}(z) = ∫_{−∞}^{∞} f_Y(z − x) f_X(x) dx for all z ∈ R.
Now, assume that X is a CRV and Y is a DRV, independent of X. Then the CDF of Z, for z ∈ R, is
F_Z(z) = P(X + Y ≤ z)
= ∑_{y ∈ S_Y} P(X + Y ≤ z | Y = y) f_Y(y)
= ∑_{y ∈ S_Y} ∫_{−∞}^{z−y} f_X(x) f_Y(y) dx
= ∑_{y ∈ S_Y} ∫_{−∞}^{z} f_X(x′ − y) f_Y(y) dx′, taking x′ = x + y
= ∫_{−∞}^{z} ( ∑_{y ∈ S_Y} f_X(x′ − y) f_Y(y) ) dx′ for all z ∈ R.
||
Example 3.27. Let (X, Y) be uniform on the unit square. Then
E(X | X + Y > 1) = E(X I_{(1, ∞)}(X + Y)) / P(X + Y > 1) = ( ∫_{0}^{1} ∫_{1−x}^{1} x dy dx ) / ( ∫_{0}^{1} ∫_{1−y}^{1} dx dy ) = 2/3.
||
Example 3.28. A rod of length l is broken into two parts. Then the expected length of the shorter part is
E(X | X < l/2),
where X ∼ U(0, l). Thus, the expected length of the shorter part is
E(X | X < l/2) = E(X I_{(−∞, l/2)}(X)) / P(X < l/2) = ( (1/l) ∫_{0}^{l/2} x dx ) / ( (1/l) ∫_{0}^{l/2} dx ) = l/4.
An alternative formulation is as follows: The required quantity is E [min {X, l − X}].
Please calculate and check if you are getting the same value. ||
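Both formulations are easy to check by simulation; l and the sample size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
l = 1.0
x = rng.uniform(0, l, size=500_000)   # break point X ~ U(0, l)

shorter = np.minimum(x, l - x)        # length of the shorter piece
print(shorter.mean(), l / 4)          # E[min(X, l - X)] = l/4
print(x[x < l / 2].mean())            # E(X | X < l/2), also close to l/4
```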
3.10 Bivariate Normal Distribution
Recall that a CRV X is said to have a univariate normal distribution if the PDF of X is given by
f(x) = (1/(σ√(2π))) e^{−(1/2)((x − µ)/σ)²} for all x ∈ R,
where µ ∈ R and σ > 0. In this case, X ∼ N(µ, σ²) is used to denote that the RV X follows a normal distribution with parameters µ and σ². Note that if X ∼ N(µ, σ²), then all moments of X exist. In particular, E(X) and Var(X) exist, and they are given by E(X) = µ and Var(X) = σ². This means that a normal distribution is completely specified by its mean and variance.
Definition 3.25 (Bivariate Normal). A two-dimensional random vector X = (X_1, X_2)′ is said to have a bivariate normal distribution if aX_1 + bX_2 is univariate normal for all (a, b) ∈ R² \ {(0, 0)}.
Theorem 3.26. If X has bivariate normal distribution, then each of X1 and X2 is uni-
variate normal. Hence, E(X1 ), E(X2 ), V ar(X1 ), V ar(X2 ), and Cov(X1 , X2 ) exist.
Let µ = E(X), Σ = Var(X), and u = (a, b)′ ∈ R² \ {(0, 0)}. By the definition of the bivariate normal distribution, u′X = aX_1 + bX_2 is univariate normal, with E(u′X) = aE(X_1) + bE(X_2) = u′µ and
Var(u′X) = a²σ_{11} + b²σ_{22} + 2abσ_{12} = u′Σu.
Thus, u′X ∼ N(u′µ, u′Σu).
Theorem 3.28. Let X be a bivariate normal random vector with µ = E (X) and Σ =
V ar (X), then the MGF of X is given by
M_X(t) = e^{t′µ + (1/2) t′Σt}
for all t ∈ R².
Proof: The JMGF of X is
M_X(t) = E(e^{t′X}) = M_{t′X}(1). (3.7)
As X has a bivariate normal distribution, t′X ∼ N(t′µ, t′Σt). Now, using Example 2.42, the theorem is immediate.
The Theorem 3.28 shows that the bivariate normal distribution is completely specified
by the mean vector µ and the variance-covariance matrix Σ. We will use the notation
X ∼ N2 (µ, Σ) to denote that the random vector X follows a bivariate normal distribution
with mean vector µ and variance-covariance matrix Σ.
Theorem 3.29. If X ∼ N2 (µ, Σ), then X1 ∼ N (µ1 , σ11 ) and X2 ∼ N (µ2 , σ22 ).
The converse of the Theorem 3.29 is not true in general. Consider the following example
in this regard.
Example 3.29. Let X ∼ N(0, 1). Let Z be a DRV, which is independent of X, with P(Z = 1) = 0.5 = P(Z = −1). Define Y = ZX. Then, for y ∈ R,
P(Y ≤ y) = P(ZX ≤ y)
= P(ZX ≤ y | Z = 1) P(Z = 1) + P(ZX ≤ y | Z = −1) P(Z = −1)
= (1/2) P(X ≤ y) + (1/2) P(X ≥ −y)
= Φ(y).
Thus, X ∼ N(0, 1) and Y ∼ N(0, 1). However, (X, Y) is not a bivariate normal random vector. To see it, observe that
P(X + Y = 0) = P(X + ZX = 0) = P(Z = −1) = 1/2.
That means that X + Y does not follow a univariate normal distribution, and hence, (X, Y) is not a bivariate normal random vector, though X ∼ N(0, 1) and Y ∼ N(0, 1). ||
where M_{X_i}(·) is the MGF of X_i, i = 1, 2. This shows that X_1 and X_2 are independent. Note that we have discussed that two random variables can be dependent even if the covariance between them is zero. The bivariate normal random vector is special in this respect.
Theorem 3.31 (Probability Density Function). Let X ∼ N_2(µ, Σ) be such that Σ is invertible. Then, for all x ∈ R², X has a joint PDF given by
f(x) = (1/(2π|Σ|^{1/2})) exp( −(1/2)(x − µ)′Σ^{−1}(x − µ) )
= (1/(2πσ_1σ_2√(1 − ρ²))) exp( −(1/(2(1 − ρ²))) [ ((x_1 − µ_1)/σ_1)² − 2ρ((x_1 − µ_1)/σ_1)((x_2 − µ_2)/σ_2) + ((x_2 − µ_2)/σ_2)² ] ),
where σ_1 = √σ_{11}, σ_2 = √σ_{22}, and ρ is the correlation coefficient between X_1 and X_2.
Proof: The proof of this theorem is out of scope.
Theorem 3.32 (Conditional Probability Density Function). Let X ∼ N_2(µ, Σ) be such that Σ is invertible. Then, for all x_2 ∈ R, the conditional PDF of X_1 given X_2 = x_2 is given by
f_{X_1|X_2}(x_1|x_2) = (1/(σ_{1|2}√(2π))) exp( −(1/2)((x_1 − µ_{1|2})/σ_{1|2})² ) for x_1 ∈ R,
where µ_{1|2} = µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2) and σ_{1|2}² = σ_1²(1 − ρ²). Thus, X_1 | X_2 = x_2 ∼ N(µ_{1|2}, σ_{1|2}²).
Proof: Easy to see from the fact that
f_{X_1|X_2}(x_1|x_2) = f_{X_1, X_2}(x_1, x_2) / f_{X_2}(x_2).
Of course, you need to perform some algebra.
Corollary 3.1. Under the conditions of the Theorem 3.32, E(X_1|X_2 = x_2) = µ_{1|2} = µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2) and Var(X_1|X_2 = x_2) = σ_{1|2}² = σ_1²(1 − ρ²) for all x_2 ∈ R. Hence, the conditional variance does not depend on x_2.
Proof: Straightforward from the previous theorem.
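A simulation sketch (all parameter values are arbitrary choices) illustrates the corollary: conditioning a bivariate normal sample on X2 lying in a narrow band around a point x2 gives a conditional mean close to µ1 + ρ(σ1/σ2)(x2 − µ2) and a conditional variance close to σ1²(1 − ρ²).

```python
import numpy as np

rng = np.random.default_rng(8)
mu = np.array([1.0, -2.0])
sigma1, sigma2, rho = 2.0, 3.0, 0.6
Sigma = np.array([[sigma1 ** 2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2 ** 2]])

x = rng.multivariate_normal(mu, Sigma, size=1_000_000)
x2_0 = 0.5
band = np.abs(x[:, 1] - x2_0) < 0.05     # crude conditioning on X2 = x2_0
x1_cond = x[band, 0]

print(x1_cond.mean(), mu[0] + rho * sigma1 / sigma2 * (x2_0 - mu[1]))
print(x1_cond.var(), sigma1 ** 2 * (1 - rho ** 2))
```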
M_T(t) = ∏_{i=1}^{n} M_{X_i²}(t) = (1 − 2t)^{−n/2},
where t < 1/2. Thus, T = ∑_{i=1}^{n} X_i² ∼ Gamma(n/2, 1/2). This distribution is also known as the χ² distribution with n degrees of freedom. Thus, the sum of squares of n i.i.d. N(0, 1) RVs has a χ² distribution with n degrees of freedom.
Theorem 3.34. Let X_1, X_2, . . . , X_n be i.i.d. N(µ, σ²) random variables. Let X̄ = (1/n) ∑_{i=1}^{n} X_i and S² = (1/(n − 1)) ∑_{i=1}^{n} (X_i − X̄)². Then X̄ and S² are independently distributed, and
X̄ ∼ N(µ, σ²/n) and (n − 1)S²/σ² ∼ χ²_{n−1}.
Proof: Let A be an n × n orthogonal matrix whose first row is
(1/√n, 1/√n, . . . , 1/√n).
Note that such a matrix exists, as we can start with this row and construct a basis of Rⁿ; the Gram-Schmidt orthogonalization will then give us the required matrix. As A is orthogonal, its inverse exists and A^{−1} = A′, the transpose of A. Now consider the transformation of the random vector X = (X_1, X_2, . . . , X_n)′ given by
Y = AX.
First, we shall try to find the distribution of Y. Note that the transformation g(x) = Ax is a one-to-one transformation as A is invertible. The inverse transformation is given by x = A′y. Hence, the Jacobian of the inverse transformation is J = det(A). As A is orthogonal, the absolute value of det(A) is one. Now, as X_1, X_2, . . . , X_n are i.i.d. N(µ, σ²) RVs, the JPDF of X, for x = (x_1, x_2, . . . , x_n)′ ∈ Rⁿ, is
f_X(x) = (1/(σ√(2π))ⁿ) exp( −(1/(2σ²)) ∑_{i=1}^{n} (x_i − µ)² )
= (1/(σ√(2π))ⁿ) exp( −(1/(2σ²)) (x − µ)′(x − µ) ),
where µ = (µ, µ, . . . , µ)′.
Again,
Y′Y = X′X ⟹ ∑_{i=2}^{n} Y_i² = ∑_{i=1}^{n} X_i² − Y_1² = ∑_{i=1}^{n} X_i² − nX̄² = (n − 1)S².
For i = 2, 3, . . . , n, the Y_i/σ are i.i.d. N(0, 1) RVs. Thus, using the previous theorem,
(n − 1)S²/σ² = ∑_{i=2}^{n} (Y_i/σ)² ∼ χ²_{n−1}.
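Theorem 3.34 is easy to illustrate numerically (parameter values below are arbitrary choices): the sample mean and sample variance computed from repeated N(µ, σ²) samples should be uncorrelated, and (n − 1)S²/σ² should have the mean and variance of a χ²_{n−1} variable.

```python
import numpy as np

rng = np.random.default_rng(9)
n, mu, sigma, reps = 8, 1.0, 2.0, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(np.corrcoef(xbar, s2)[0, 1])    # near 0: X-bar and S^2 are uncorrelated
q = (n - 1) * s2 / sigma ** 2         # should behave like chi-square with n - 1 d.o.f.
print(q.mean(), n - 1)                # chi^2_{n-1} mean
print(q.var(), 2 * (n - 1))           # chi^2_{n-1} variance
```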