
Probability Theory and Random Processes

(MA 225)

Class Notes
September – November, 2020

Instructor
Ayon Ganguly
Department of Mathematics
IIT Guwahati
Contents

1 Probability
  1.1 Probability
      1.1.1 Classical Probability
      1.1.2 Countable and Uncountable Sets
      1.1.3 Axiomatic Probability
      1.1.4 Continuity of Probability
  1.2 Conditional Probability
  1.3 Independence

2 Random Variable
  2.1 Random Variable
  2.2 Discrete Random Variable
  2.3 Continuous Random Variable
  2.4 Expectation of Random Variable
  2.5 Transformation of Random Variable
      2.5.1 Technique 1
      2.5.2 Technique 2
      2.5.3 Expectation of Function of RV
      2.5.4 Technique 3
  2.6 Moment Inequality
  2.A Gamma Integral
  2.B Beta Integral

3 Jointly Distributed Random Variables
  3.1 Random Vector
  3.2 Discrete Random Vector
  3.3 Continuous Random Vector
  3.4 Expectation of Function of Random Vector
  3.5 Some Useful Remarks
  3.6 Independent Random Variables
  3.7 Covariance and Correlation Coefficient
  3.8 Transformation Techniques
      3.8.1 Technique 1
      3.8.2 Technique 2
      3.8.3 Technique 3
  3.9 Conditional Distribution
      3.9.1 For Discrete Random Vector
      3.9.2 For Continuous Random Vector
      3.9.3 Computing Expectation by Conditioning
  3.10 Bivariate Normal Distribution
  3.11 Some Results on Independent and Identically Distributed Normal RVs
Chapter 3

Jointly Distributed Random Variables

In the previous chapter, we studied RVs and associated concepts. One of the main uses of a random variable is to model a numerical characteristic of a natural phenomenon. For example, we may assume that the RV X denotes the income of a household. Now, assume that Y denotes the spending of the household. Then Z = X − Y is a RV, and it is the savings of the household. Clearly, if we know the values of any two of the RVs among X, Y, and Z, the value of the third one is known. In such a situation, we may want to study the relationship between X and Z. Clearly, the concepts of the previous chapter are not sufficient. Using the tools of the previous chapter, we can study X and Z separately, but not jointly. For example, we cannot answer the question: does savings increase with income? To answer it, we need to know the probability of the joint occurrence of the events X ≤ x and Z ≤ z for different values of x and z. In this chapter, we will study the concepts relating to the joint occurrence of multiple RVs. There are plenty of examples where we need to consider multiple RVs. These examples include the height and weight of a person, pollution level and blood pressure, the lifetime of a product and its cause of failure, etc.

3.1 Random Vector


Definition 3.1 (Random Vector). A function X : S → Rn is called a random vector.
Clearly, a random vector is a generalization of a RV. Note that as X is a function from S to Rn, X(ω) can be written as (X1(ω), X2(ω), . . . , Xn(ω)), where Xi : S → R for all i = 1, 2, . . . , n. Thus, Xi is a RV for all i = 1, 2, . . . , n. Therefore, each component of a random vector is a RV, and we will write X = (X1, X2, . . . , Xn).
Definition 3.2 (Joint CDF). For any random vector X = (X1 , X2 , . . . , Xn ), the joint
cumulative distribution function (JCDF) is defined by
FX (x) = P (X1 ≤ x1 , . . . , Xn ≤ xn ) ,
for all x = (x1 , . . . , xn ) ∈ Rn . Here
{X1 ≤ x1 , . . . , Xn ≤ xn } = {X1 ≤ x1 } ∩ {X2 ≤ x2 } ∩ . . . ∩ {Xn ≤ xn } .
From now on, most of the definitions, theorems, and results will be presented for n = 2. Thus, we will use X = (X1, X2) or X = (X, Y). This is for simplicity of the expressions. However, most of the definitions, theorems, and results can be extended to any general value of n. For n = 2, the JCDF at the point (x, y) is the probability that the random vector (X, Y) belongs to the shaded region of Figure 3.1.

[Figure 3.1: the JCDF at (x, y) is the probability that (X, Y) lies in the marked region {(s, t) : s ≤ x, t ≤ y} to the lower-left of the point (x, y).]

Theorem 3.1. Let X = (X, Y) be a random vector with JCDF FX,Y(·, ·). Then the CDF of X is given by FX(x) = lim_{y→∞} FX,Y(x, y) for all x ∈ R. Similarly, the CDF of Y is given by FY(y) = lim_{x→∞} FX,Y(x, y) for all y ∈ R.

Proof: Fix x ∈ R. Let {yn}n≥1 be an increasing sequence of real numbers such that yn → ∞ as n → ∞. Let us define

An = {ω ∈ S : X(ω) ≤ x, Y(ω) ≤ yn}

for n = 1, 2, 3, . . .. Clearly, {An}n≥1 is an increasing sequence of events, and hence,

A = lim_{n→∞} An = ∪_{n=1}^∞ An = {ω ∈ S : X(ω) ≤ x}.

Now, by the continuity of probability,

lim_{n→∞} FX,Y(x, yn) = lim_{n→∞} P(An) = P(lim_{n→∞} An) = P(A) = FX(x).

This shows that for any increasing sequence of real numbers {yn}n≥1 with yn → ∞ as n → ∞, lim_{n→∞} FX,Y(x, yn) = FX(x). Thus, lim_{y→∞} FX,Y(x, y) = FX(x) for each fixed x ∈ R. Similarly, one can prove that lim_{x→∞} FX,Y(x, y) = FY(y) for each fixed y ∈ R.
Remark 3.1. The previous theorem can be extended to more than two RVs. For example, let X = (X, Y, Z). In this case we can find the CDFs of X, Y, and Z using the following formulas:

FX(x) = lim_{y→∞} lim_{z→∞} FX,Y,Z(x, y, z) = lim_{z→∞} lim_{y→∞} FX,Y,Z(x, y, z) for all x ∈ R,
FY(y) = lim_{x→∞} lim_{z→∞} FX,Y,Z(x, y, z) = lim_{z→∞} lim_{x→∞} FX,Y,Z(x, y, z) for all y ∈ R,
FZ(z) = lim_{x→∞} lim_{y→∞} FX,Y,Z(x, y, z) = lim_{y→∞} lim_{x→∞} FX,Y,Z(x, y, z) for all z ∈ R.

We can find the JCDFs of (X, Y), (X, Z), and (Y, Z) as

FX,Y(x, y) = lim_{z→∞} FX,Y,Z(x, y, z) for all (x, y) ∈ R²,
FX,Z(x, z) = lim_{y→∞} FX,Y,Z(x, y, z) for all (x, z) ∈ R²,
FY,Z(y, z) = lim_{x→∞} FX,Y,Z(x, y, z) for all (y, z) ∈ R².
Let X = (X1, . . . , Xn) and let A = {i1, i2, . . . , ik} ⊂ {1, 2, . . . , n}. If we want to find the JCDF of (Xi1, Xi2, . . . , Xik), we take limits (tending to infinity) with respect to all the components that are not present in A. †

In the context of random vectors, the JCDF of a subset of the components is called a marginal CDF. Thus, if X = (X, Y, Z), the CDF of X is called the marginal CDF of X. Similarly, the JCDF of (X, Y) is called the marginal CDF of (X, Y).

Theorem 3.2 (Properties of JCDF). Let X = (X, Y) be a random vector with JCDF FX,Y(·, ·). Then

1. lim_{x→∞} lim_{y→∞} FX,Y(x, y) = 1.

2. lim_{x→−∞} FX,Y(x, y) = 0 for all y ∈ R.

3. lim_{y→−∞} FX,Y(x, y) = 0 for all x ∈ R.

4. FX,Y(·, ·) is right continuous in each argument keeping the other fixed.

5. For −∞ < a1 < b1 < ∞ and −∞ < a2 < b2 < ∞,

FX,Y(b1, b2) − FX,Y(b1, a2) − FX,Y(a1, b2) + FX,Y(a1, a2) ≥ 0.

Proof: The proof of this theorem is similar to that of Theorem 2.1 with standard modifications for 2-dimensional functions. Therefore, the proof is skipped here.

Though we are skipping the proof of the previous theorem, let us make a comparison between the properties of the CDF of a RV (presented in Theorem 2.1) and those of the JCDF of a 2-dimensional random vector. Property 1 in Theorem 2.1 states that FX(·) is non-decreasing. This can alternatively be written as FX(x2) − FX(x1) = P(x1 < X ≤ x2) ≥ 0 for all x1 < x2. Thus, the non-decreasing property is a consequence of the fact that the probability that a RV lies in an interval must be non-negative. Now, a natural extension of an interval in one dimension is a rectangle in two dimensions. Thus, the equivalent property should be based on the fact that the probability that a 2-dimensional random vector lies in a rectangle must be non-negative. Let a1 < b1 and a2 < b2 be real numbers. Then the points (a1, a2), (a1, b2), (b1, a2), and (b1, b2) form the vertices of the rectangle (a1, b1] × (a2, b2]. Now,

P((X, Y) ∈ (a1, b1] × (a2, b2]) = P(a1 < X ≤ b1, a2 < Y ≤ b2)
= FX,Y(b1, b2) − FX,Y(b1, a2) − FX,Y(a1, b2) + FX,Y(a1, a2).

Thus, the fact that the probability that a random vector lies in a rectangle is non-negative gives Property 5 of Theorem 3.2.

Property 2 of Theorem 2.1 states that lim_{x→∞} FX(x) = 1. Intuitively, this says that if we cover the whole of R, then the probability is one. Similarly, for a 2-dimensional random vector, if the whole of R² is covered, the probability is one, and we have Property 1 of Theorem 3.2.

Property 3 of Theorem 2.1 states that lim_{x→−∞} FX(x) = 0. Loosely speaking, {X ≤ x} becomes ∅ as x tends to −∞. For a 2-dimensional random vector, if one of the components tends to −∞, the set becomes empty: if x → −∞, then {X ≤ x} becomes ∅, and hence {X ≤ x, Y ≤ y} becomes ∅. Similarly, as y → −∞, {X ≤ x, Y ≤ y} becomes ∅. Thus, we have Properties 2 and 3 of Theorem 3.2. Note that for the first property of Theorem 3.2, both components need to tend to ∞. However, for Properties 2 and 3, if either component tends to −∞ keeping the other fixed, then the JCDF tends to zero. Property 4 is a straightforward extension of Property 4 of Theorem 2.1.

Theorem 3.3. Let G : R² → R be a function satisfying conditions 1–5 of Theorem 3.2. Then G is the JCDF of some 2-dimensional random vector.

Proof: The proof of this theorem is out of the scope of this course.

This theorem can be used to check whether a function is a JCDF or not. Theorems 3.2 and 3.3 can be extended to random vectors having more than two components. However, writing Property 5 then involves complicated expressions.

3.2 Discrete Random Vector


Definition 3.3 (Discrete Random Vector). A random vector (X, Y) is said to have a discrete distribution if there exists an at most countable set SX,Y ⊂ R² such that P((X, Y) = (x, y)) = P(X = x, Y = y) > 0 for all (x, y) ∈ SX,Y and P((X, Y) ∈ SX,Y) = 1. SX,Y is called the support of (X, Y).

Definition 3.4 (Joint PMF). Let (X, Y) be a discrete random vector with support SX,Y. Define a function fX,Y : R² → R by

fX,Y(x, y) = P(X = x, Y = y) if (x, y) ∈ SX,Y, and fX,Y(x, y) = 0 otherwise.

The function fX,Y is called the joint probability mass function (JPMF) of the discrete random vector (X, Y).

Note that a discrete random vector is a straightforward extension of a DRV. In the case of a DRV, we need to find an at most countable set SX in R such that P(X ∈ SX) = 1. For a 2-dimensional discrete random vector, SX,Y is at most countable and a subset of R² such that P((X, Y) ∈ SX,Y) = 1. Similarly, the definition of the JPMF is a natural extension of the PMF of a DRV. These definitions can easily be extended to random vectors of more than two dimensions.

Theorem 3.4 (Properties of JPMF). Let (X, Y) be a discrete random vector with JPMF fX,Y(·, ·) and support SX,Y. Then

1. fX,Y(x, y) ≥ 0 for (x, y) ∈ R².

2. Σ_{(x,y)∈SX,Y} fX,Y(x, y) = 1.

Proof: The proof of the theorem is straightforward from the definitions of a discrete random vector and the JPMF.

Theorem 3.5. If a function g : R² → R satisfies Properties 1 and 2 above with the at most countable set D = {(x, y) ∈ R² : g(x, y) > 0} in place of SX,Y, then g is the JPMF of some 2-dimensional discrete random vector.

Proof: The proof of this theorem is out of the scope of this course.

Theorem 3.5 can be used to check whether a function is a JPMF or not. Again, Theorems 3.4 and 3.5 can be extended to discrete random vectors of more than two dimensions.

Theorem 3.6 (Marginal PMF from JPMF). Let (X, Y) be a discrete random vector with JPMF fX,Y(·, ·) and support SX,Y. Then X and Y are DRVs. The PMF of X is

fX(x) = Σ_{y:(x,y)∈SX,Y} fX,Y(x, y) for all fixed x ∈ R. (3.1)

The PMF of Y is given by

fY(y) = Σ_{x:(x,y)∈SX,Y} fX,Y(x, y) for all fixed y ∈ R. (3.2)

In this context, fX(·) and fY(·) are called the marginal PMF of X and the marginal PMF of Y, respectively.

Proof: Define the set

D = {x ∈ R : (x, y) ∈ SX,Y for some y ∈ R}.

As SX,Y is at most countable, D is also at most countable. Fix x0 ∈ D. Consider (x0, y) ∈ SX,Y and (x0, y′) ∈ SX,Y such that y ≠ y′. Then the events

{X = x0, Y = y} and {X = x0, Y = y′}

are disjoint. Now, using the theorem of total probability (Theorem 1.16), P(X = x0) can be found by summing over all the points in SX,Y whose first component is x0. Thus,

P(X = x0) = Σ_{(x0,y)∈SX,Y} P(X = x0, Y = y) = Σ_{(x0,y)∈SX,Y} fX,Y(x0, y),

and

Σ_{x∈D} P(X = x) = Σ_{x∈D} Σ_{y:(x,y)∈SX,Y} fX,Y(x, y) = Σ_{(x,y)∈SX,Y} fX,Y(x, y) = 1.

Hence, X is a DRV with PMF given by (3.1). Similarly, we can prove that Y is also a DRV with PMF given by (3.2).

Example 3.1. Let (X, Y) be a discrete random vector with JPMF

f(x, y) = cy if x = 1, 2, . . . , n; y = 1, 2, . . . , n, and f(x, y) = 0 otherwise,

where c is a constant. We can find the value of c based on the properties of the JPMF. If f(·, ·) is to be a JPMF, then f(x, y) ≥ 0 for all (x, y) ∈ R², which implies that c ≥ 0. Also,

Σ_{x=1}^n Σ_{y=1}^n f(x, y) = 1 =⇒ c = 2 / (n²(n + 1)).

Thus, the JPMF of (X, Y) is given by

f(x, y) = 2y / (n²(n + 1)) if x = 1, 2, . . . , n; y = 1, 2, . . . , n, and f(x, y) = 0 otherwise.

We can also find the marginal PMF of X as follows. Fix x ∈ {1, 2, . . . , n}. Then

P(X = x) = Σ_{y:(x,y)∈SX,Y} f(x, y) = Σ_{y=1}^n cy = 1/n.

Thus, the marginal PMF of X is given by

fX(x) = 1/n if x = 1, 2, . . . , n, and fX(x) = 0 otherwise.

Similarly, we can find the marginal PMF of Y. I leave it as an exercise. ||

Example 3.2. Let (X, Y) be a discrete random vector with JPMF

f(x, y) = cy if x = 1, 2, . . . , n; y = 1, 2, . . . , n; x ≤ y, and f(x, y) = 0 otherwise,

where c is a constant. Note that the function f(·, ·) is almost the same as in the previous example. The only difference is the set on which the function is strictly positive. In the previous example, f was positive on {1, 2, . . . , n} × {1, 2, . . . , n}. In the current example the set is {(x, y) ∈ R² : x = 1, 2, . . . , n; y = 1, 2, . . . , n; x ≤ y}. However, this changes the probability distribution completely. We will see that the marginal PMFs are also different. Hence, the support is an important issue.

The constant c is positive and can be found as follows:

Σ_{(x,y)∈SX,Y} f(x, y) = 1 =⇒ Σ_{y=1}^n Σ_{x=1}^y cy = 1 =⇒ c = 6 / (n(n + 1)(2n + 1)).

Please note the ranges of the summations. Thus, the JPMF of (X, Y) is given by

f(x, y) = 6y / (n(n + 1)(2n + 1)) if x = 1, 2, . . . , n; y = 1, 2, . . . , n; x ≤ y, and f(x, y) = 0 otherwise.

The marginal PMF of X can be found as follows. For x ∈ {1, 2, . . . , n},

P(X = x) = Σ_{y=x}^n cy = 3(n + x)(n − x + 1) / (n(n + 1)(2n + 1)).

Please note the range of the summation above. Thus, the marginal PMF of X is given by

fX(x) = 3(n + x)(n − x + 1) / (n(n + 1)(2n + 1)) if x = 1, 2, . . . , n, and fX(x) = 0 otherwise.

The marginal PMF of Y can be found similarly and is given by

fY(y) = 6y² / (n(n + 1)(2n + 1)) if y = 1, 2, . . . , n, and fY(y) = 0 otherwise. ||

3.3 Continuous Random Vector


Definition 3.5 (Continuous Random Vector). A random vector (X, Y) is said to have a continuous distribution if there exists a non-negative integrable function fX,Y : R² → R such that

FX,Y(x, y) = ∫_{−∞}^x ∫_{−∞}^y fX,Y(t, s) ds dt

for all (x, y) ∈ R². The function fX,Y is called the joint probability density function (JPDF) of (X, Y). The set SX,Y = {(x, y) ∈ R² : fX,Y(x, y) > 0} is called the support of (X, Y).

Again, a continuous random vector is a natural extension of a CRV. A JPMF exists only for a discrete random vector, and a JPDF exists only for a continuous random vector.
Theorem 3.7 (Properties of JPDF). Let (X, Y) be a continuous random vector with JPDF fX,Y(·, ·). Then

1. fX,Y(x, y) ≥ 0 for (x, y) ∈ R².

2. ∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y(x, y) dx dy = 1.

Proof: The proof of Property 1 is straightforward from the definition of a continuous random vector. For Property 2,

∫_{−∞}^∞ ∫_{−∞}^∞ fX,Y(x, y) dx dy = lim_{A→∞} lim_{B→∞} ∫_{−∞}^A ∫_{−∞}^B fX,Y(x, y) dx dy = lim_{A→∞} lim_{B→∞} FX,Y(A, B) = 1.

Theorem 3.8. If a function g : R² → R satisfies Properties 1 and 2 of Theorem 3.7, then g(·, ·) is the JPDF of some 2-dimensional continuous random vector.

Proof: The proof of this theorem is out of the scope of this course.

Theorem 3.9 (Marginal PDF from JPDF). Let (X, Y) be a continuous random vector with JPDF fX,Y(·, ·). Then X and Y are CRVs. The PDF of X is given by

fX(x) = ∫_{−∞}^∞ fX,Y(x, y) dy for all fixed x ∈ R. (3.3)
The PDF of Y is given by

fY(y) = ∫_{−∞}^∞ fX,Y(x, y) dx for all fixed y ∈ R. (3.4)

In the context of a continuous random vector, fX(·) and fY(·) are called the marginal PDF of X and the marginal PDF of Y, respectively.

Proof: For x ∈ R,

FX(x) = lim_{y→∞} FX,Y(x, y)
= lim_{y→∞} ∫_{−∞}^x ∫_{−∞}^y fX,Y(s, t) dt ds
= ∫_{−∞}^x ( lim_{y→∞} ∫_{−∞}^y fX,Y(s, t) dt ) ds
= ∫_{−∞}^x ( ∫_{−∞}^∞ fX,Y(s, t) dt ) ds
= ∫_{−∞}^x g(s) ds,

where g(s) = ∫_{−∞}^∞ fX,Y(s, t) dt. The third equality holds true as fX,Y(x, y) ≥ 0 for all (x, y) ∈ R². Thus, X is a CRV with PDF as given in (3.3). Similarly, we can prove that Y is also a CRV with PDF given in (3.4).
Example 3.3. Let (X, Y) be a continuous random vector with JPDF

f(x, y) = c e^{−(2x+3y)} if 0 < x < y < ∞, and f(x, y) = 0 otherwise,

where c is a constant. Clearly, c > 0 as fX,Y(x, y) ≥ 0 for all (x, y) ∈ R². The value of c can be found as follows:

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dy dx = 1 =⇒ c ∫_0^∞ ∫_x^∞ e^{−(2x+3y)} dy dx = 1 =⇒ c = 15.

Note the range of integration. We can find the marginal PDF of X as follows. For x ≤ 0,

fX(x) = ∫_{−∞}^∞ f(x, y) dy = 0,

as the integrand is zero for all y when x ≤ 0. For x > 0,

fX(x) = ∫_{−∞}^x f(x, y) dy + ∫_x^∞ f(x, y) dy = ∫_x^∞ f(x, y) dy = 15 ∫_x^∞ e^{−(2x+3y)} dy = 5e^{−5x},

as f(x, y) = 0 for y < x. Thus, the marginal PDF of X is given by

fX(x) = 5e^{−5x} if x > 0, and fX(x) = 0 otherwise.

Hence, X ∼ Exp(5). Similarly, the marginal PDF of Y can be calculated and is given by

fY(y) = (15/2) e^{−3y}(1 − e^{−2y}) if y > 0, and fY(y) = 0 otherwise. ||
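As a sanity check on Example 3.3, one can integrate the JPDF numerically. Below is a minimal sketch of mine (not from the notes; it assumes scipy is available) that confirms the total mass is one and compares the computed marginal of X with the derived 5e^{−5x}.

import numpy as np
from scipy.integrate import dblquad, quad

# JPDF of Example 3.3: f(x, y) = 15 exp(-(2x + 3y)) on 0 < x < y < infinity.
f = lambda y, x: 15.0 * np.exp(-(2 * x + 3 * y))   # dblquad expects f(y, x)

# total mass: y runs from x to infinity, x from 0 to infinity
total, _ = dblquad(f, 0, np.inf, lambda x: x, np.inf)
print("total mass:", total)                        # ~ 1.0, confirming c = 15

# marginal PDF of X at a few points versus 5 exp(-5x)
for x in (0.1, 0.5, 1.0):
    fx, _ = quad(lambda y: 15.0 * np.exp(-(2 * x + 3 * y)), x, np.inf)
    print(x, fx, 5 * np.exp(-5 * x))               # the two columns agree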

3.4 Expectation of Function of Random Vector

Definition 3.6 (Expectation of Function of Discrete Random Vector). Let (X, Y) be a discrete random vector with JPMF fX,Y(·, ·) and support SX,Y. Let h : R² → R. Then the expectation of h(X, Y) is defined by

E(h(X, Y)) = Σ_{(x,y)∈SX,Y} h(x, y) fX,Y(x, y),

provided Σ_{(x,y)∈SX,Y} |h(x, y)| fX,Y(x, y) < ∞.

Definition 3.7 (Expectation of Function of Continuous Random Vector). Let (X, Y) be a continuous random vector with JPDF fX,Y(·, ·). Let h : R² → R. Then the expectation of h(X, Y) is defined by

E(h(X, Y)) = ∫_{−∞}^∞ ∫_{−∞}^∞ h(x, y) fX,Y(x, y) dx dy,

provided ∫_{−∞}^∞ ∫_{−∞}^∞ |h(x, y)| fX,Y(x, y) dx dy < ∞.

Theorem 3.10 (Linearity Property of Expectation). Let X = (X1, X2, . . . , Xn) be either a discrete random vector or a continuous random vector. Then

E(Σ_{i=1}^n ai Xi) = Σ_{i=1}^n ai E(Xi),

where ai ∈ R is a constant for all i = 1, 2, . . . , n. Here we assume that all the expectations exist.

Proof: We will prove the theorem for a continuous random vector X = (X, Y). The proof for a general value of n is similar. Also, the proof for a discrete random vector can be written easily by replacing the integration signs by summation signs. Let fX,Y be the JPDF of X. Then

E(a1 X + a2 Y) = ∫_{−∞}^∞ ∫_{−∞}^∞ (a1 x + a2 y) fX,Y(x, y) dx dy
= a1 ∫_{−∞}^∞ ∫_{−∞}^∞ x fX,Y(x, y) dx dy + a2 ∫_{−∞}^∞ ∫_{−∞}^∞ y fX,Y(x, y) dx dy
= a1 ∫_{−∞}^∞ x ( ∫_{−∞}^∞ fX,Y(x, y) dy ) dx + a2 ∫_{−∞}^∞ y ( ∫_{−∞}^∞ fX,Y(x, y) dx ) dy
= a1 ∫_{−∞}^∞ x fX(x) dx + a2 ∫_{−∞}^∞ y fY(y) dy
= a1 E(X) + a2 E(Y).

The previous theorem tells us that we can compute the expectation of a linear combination of the components of a random vector by computing the expectations of the individual components and then taking the linear combination. Note that the left-hand side involves the joint distribution and an n-dimensional integration (for a continuous random vector) or an n-dimensional summation (for a discrete random vector). However, the right-hand side involves n one-dimensional integrations or n one-dimensional summations. Also, we only need the marginal distributions (not the joint distribution) to compute the right-hand side. Sometimes it is much easier to compute n one-dimensional integrations than a single n-dimensional integration. The same is true for summations. Also, in many problems, it is easier to obtain the marginal distributions than the joint distribution. The following example illustrates this.
Example 3.4. At a party, n men throw their hats into the center of a room. The hats are mixed up, and each man randomly selects one. Suppose that we want to calculate the expected number of men who select their own hat. Let X denote the number of men who select their own hat. We are interested in finding E(X). To compute E(X) directly, we need the distribution of X. It is clear that X should be a DRV with support SX = {0, 1, . . . , n − 2, n}. Note that n − 1 ∉ SX (why?). It is quite easy to find the values P(X = k) for k = 0 and k = n. However, it is quite difficult to find P(X = k) for the other values of k ∈ SX. Thus, it becomes a difficult problem if we try to compute E(X) directly.

Let us try to solve it by converting the problem into a multidimensional problem. For i = 1, 2, . . . , n, let us define the RV

Xi = 1 if the ith person takes his own hat, and Xi = 0 otherwise.

Clearly, X = X1 + X2 + . . . + Xn. Thus, using the previous theorem, E(X) = E(X1) + . . . + E(Xn). Now, we need to compute E(Xi) for all i = 1, 2, . . . , n. Note that P(Xi = 1) = 1/n for all i = 1, 2, . . . , n. Hence, E(Xi) = 1/n for all i = 1, 2, . . . , n, which implies E(X) = 1. Thus, on average, only one person takes his own hat, and this does not depend on n, the number of persons present in the game. ||
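The answer E(X) = 1 is easy to confirm by simulation. Below is a minimal Monte Carlo sketch of mine (not part of the notes): a random permutation models the hat assignment, and fixed points of the permutation are the men who get their own hat.

import random

def mean_matches(n, trials=200_000, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hats = list(range(n))
        rng.shuffle(hats)                      # random hat assignment
        total += sum(1 for i, h in enumerate(hats) if i == h)
    return total / trials

for n in (2, 5, 50):
    print(n, mean_matches(n))                  # all close to 1.0, independent of n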

3.5 Some Useful Remarks


Remark 3.2. In Theorem 3.6, we have seen that if (X, Y) is a discrete random vector, then X and Y are DRVs. On the other hand, suppose that X and Y are DRVs with supports SX and SY, respectively. Let S = SX × SY. Clearly, S is at most countable. Then

P((X, Y) ∈ S) = Σ_{(x,y)∈S} P(X = x, Y = y)
= Σ_{x∈SX} { Σ_{y∈SY} P(X = x, Y = y) }
= Σ_{x∈SX} P(X = x), using the theorem of total probability,
= 1.

Let T = {(x, y) ∈ R² : P(X = x, Y = y) > 0}. Then T ⊆ S, and hence T is at most countable. Also, P((X, Y) ∈ T) = 1. Thus, (X, Y) is a discrete random vector. This discussion shows that (X, Y) is a discrete random vector if and only if X and Y are DRVs. †

Remark 3.3. If (X, Y) is a continuous random vector, then

P((X, Y) ∈ A) = ∫∫_{(x,y)∈A} fX,Y(x, y) dx dy,

for all A ⊆ R² such that the integration is possible. This statement can be seen as an extension of the fact that if X is a CRV with PDF f(·), then P(X ∈ B) = ∫_B f(x) dx. However, the mathematical proof is out of the scope of this course. †

Remark 3.4. In Theorem 3.9, we have seen that if (X, Y) is a continuous random vector, then X and Y are continuous random variables. However, the converse, in general, is not true. Thus, (X, Y) may not be a continuous random vector even if X and Y are CRVs. Consider the following example in this regard. Let X be a CRV, and suppose that Y = X. Then X and Y are CRVs, and it is clear that P(X = Y) = 1. Now, if possible, assume that (X, Y) is a continuous random vector. Thus, (X, Y) has a JPDF, say f(·, ·). Then

P(X = Y) = ∫∫_{x=y} f(x, y) dx dy = 0.

The last equality is true because a double integral ∫∫_B g(x, y) dx dy can be interpreted as the volume under the function g(·, ·) over the set B. As the area of the set {(x, y) ∈ R² : x = y} is zero, the volume is also zero. This contradicts the fact that P(X = Y) = 1. Thus, our assumption is wrong, and (X, Y) is not a continuous random vector.

In general, if there exists a set A ⊂ R² whose area is zero and P((X, Y) ∈ A) > 0, then (X, Y) does not have a JPDF, and hence (X, Y) is not a continuous random vector. †

Remark 3.5. In Theorems 3.6 and 3.9, we have seen that the marginal distributions can be recovered from the joint distribution. However, the converse, in general, is not true. Let us illustrate this using the following example. †

Example 3.5. Let f(·) and g(·) be two PDFs and F(·) and G(·) be the corresponding CDFs, respectively. Define, for −1 < α < 1,

h(x, y) = f(x) g(y) {1 + α(1 − 2F(x))(1 − 2G(y))}.

First, we will show that h(·, ·) is the JPDF of a two-dimensional random vector. As 0 ≤ F(x) ≤ 1, we have −1 ≤ 1 − 2F(x) ≤ 1. Similarly, −1 ≤ 1 − 2G(y) ≤ 1. Hence, for all (x, y) ∈ R²,

−|α| ≤ α(1 − 2F(x))(1 − 2G(y)) ≤ |α| =⇒ 1 + α(1 − 2F(x))(1 − 2G(y)) ≥ 0.

As f(x) ≥ 0 and g(y) ≥ 0, h(x, y) ≥ 0 for all (x, y) ∈ R². Also,

∫_{−∞}^∞ ∫_{−∞}^∞ h(x, y) dx dy = ∫_{−∞}^∞ ∫_{−∞}^∞ f(x) g(y) {1 + α(1 − 2F(x))(1 − 2G(y))} dx dy
= 1 + α ( ∫_{−∞}^∞ f(x)(1 − 2F(x)) dx ) ( ∫_{−∞}^∞ g(y)(1 − 2G(y)) dy )
= 1 + α ( ∫_{−∞}^∞ (1 − 2F(x)) dF(x) ) ( ∫_{−∞}^∞ (1 − 2G(y)) dG(y) )
= 1,

since each of the last two integrals equals ∫_0^1 (1 − 2u) du = 0.

Thus, h(·, ·) is the JPDF of a 2-dimensional continuous random vector, say (X, Y). Let us try to find the marginal PDFs of X and Y. The marginal PDF of X is

fX(x) = ∫_{−∞}^∞ h(x, y) dy
= f(x) ∫_{−∞}^∞ g(y) dy + α f(x)(1 − 2F(x)) ∫_{−∞}^∞ g(y)(1 − 2G(y)) dy
= f(x).

Similarly, the marginal PDF of Y is g(·). Thus, the marginal PDFs of X and Y do not depend on α. However, the JPDF depends on the value of α. For different values of α, we have different JPDFs, but the marginals remain the same. Hence, given the marginal distributions, in general, we cannot construct the joint distribution. ||
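The claim of Example 3.5 can also be checked numerically. The sketch below is mine (not from the notes); it assumes scipy is available and takes f = g = the standard normal PDF. It integrates h(x, ·) for several values of α and confirms that the X-marginal at a fixed point equals f(x) regardless of α.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def h(x, y, alpha):
    # the construction of Example 3.5 with f = g = standard normal
    return norm.pdf(x) * norm.pdf(y) * (
        1 + alpha * (1 - 2 * norm.cdf(x)) * (1 - 2 * norm.cdf(y)))

for alpha in (-0.9, 0.0, 0.9):
    fx = quad(lambda y: h(0.5, y, alpha), -np.inf, np.inf)[0]
    print(alpha, fx, norm.pdf(0.5))   # marginal at x = 0.5 equals phi(0.5) for every alpha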

3.6 Independent Random Variables


Definition 3.8 (Independent RVs). The random variables X1, X2, . . . , Xn are said to be independent if

FX1,X2,...,Xn(x1, x2, . . . , xn) = Π_{i=1}^n FXi(xi)

for all (x1, x2, . . . , xn) ∈ Rⁿ, where FXi(·) is the marginal CDF of Xi.

Thus, two RVs X and Y are independent if and only if the events Ex = {X ≤ x} and Fy = {Y ≤ y} are independent for all (x, y) ∈ R². For a discrete random vector, an equivalent definition of independence can be given in terms of the JPMF and the marginal PMFs. Similarly, for a continuous random vector, an equivalent definition of independence can be given in terms of the JPDF and the marginal PDFs. Next, we give these two alternative definitions; however, we will not prove the equivalence of the respective definitions. Nonetheless, these alternative definitions are very handy in many applications.

Definition 3.9 (Alternative Definition for Discrete Random Vector). Let (X1, X2, . . . , Xn) be a discrete random vector with JPMF fX1,X2,...,Xn(·, . . . , ·). Then X1, X2, . . . , Xn are said to be independent if

fX1,X2,...,Xn(x1, x2, . . . , xn) = Π_{i=1}^n fXi(xi)

for all (x1, x2, . . . , xn) ∈ Rⁿ, where fXi(·) is the marginal PMF of Xi, i = 1, 2, . . . , n.

Definition 3.10 (Alternative Definition for Continuous Random Vector). Let (X1, X2, . . . , Xn) be a continuous random vector with JPDF fX1,X2,...,Xn(·, . . . , ·). Then X1, X2, . . . , Xn are said to be independent if

fX1,X2,...,Xn(x1, x2, . . . , xn) = Π_{i=1}^n fXi(xi)

for all (x1, x2, . . . , xn) ∈ Rⁿ, where fXi(·) is the marginal PDF of Xi, i = 1, 2, . . . , n.

In the last section, we pointed out that, in general, the joint distribution cannot be recovered from the marginal distributions. However, if X1, X2, . . . , Xn are independent RVs and we know the marginal distributions of Xi for all i = 1, 2, . . . , n, then we can write FX1,...,Xn(x1, . . . , xn) = Π_{i=1}^n FXi(xi) for all (x1, . . . , xn) ∈ Rⁿ. Thus, we can recover the joint distribution from the marginal distributions if the RVs are known to be independent. Moreover, if X1, X2, . . . , Xn are independent CRVs, then (X1, . . . , Xn) is a continuous random vector.

Theorem 3.11. If X and Y are independent, then

E(g(X)h(Y)) = E(g(X)) E(h(Y)),

provided all the expectations exist.

Proof: We will prove it for a continuous random vector (X, Y). For a discrete random vector, it can be proved by replacing the integration signs by summation signs.

E(g(X)h(Y)) = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y) fX,Y(x, y) dx dy
= ∫_{−∞}^∞ ∫_{−∞}^∞ g(x)h(y) fX(x) fY(y) dx dy, as X and Y are independent,
= ( ∫_{−∞}^∞ g(x) fX(x) dx ) ( ∫_{−∞}^∞ h(y) fY(y) dy )
= E(g(X)) E(h(Y)).

3.7 Covariance and Correlation Coefficient

Definition 3.11 (Covariance). The covariance of two random variables X and Y is defined by

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).

Definition 3.12 (Correlation Coefficient). The correlation coefficient of X and Y is defined by

ρ(X, Y) = Cov(X, Y) / √(Var(X) Var(Y)).

Remark 3.6. Let X and Y be independent. Then

Cov(X, Y) = E(XY) − E(X)E(Y) = E(X)E(Y) − E(X)E(Y) = 0.

We get the second equality using the previous theorem. Thus, ρ(X, Y) = 0. However, the converse is not true in general. That means there exist dependent RVs X and Y such that Cov(X, Y) = 0. Let us consider the following example in this regard. †
Example 3.6. Let X ∼ N(0, 1) and Y = X². Then

Cov(X, Y) = E(XY) − E(X)E(Y) = E(X³) − E(X)E(Y).

It is easy to see that E(X) = 0 and E(X³) = 0. Hence, Cov(X, Y) = 0. Now, P(X ≤ −5) = Φ(−5) ≠ 0 and P(Y ≤ 1) = P(−1 ≤ X ≤ 1) = 2Φ(1) − 1 ≠ 0. However, P(X ≤ −5, Y ≤ 1) = 0, as X ≤ −5 forces Y ≥ 25. Thus, X and Y are not independent. ||
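Example 3.6 can also be seen empirically. A minimal sketch of mine (not from the notes; it uses the cutoff −1 instead of −5 so that the relevant probabilities are visible at simulation scale — {X ≤ −1} ∩ {Y ≤ 1} can only occur on {X = −1}, an event of probability zero):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2
print("sample covariance:", np.cov(x, y)[0, 1])                       # close to 0
print("P(X <= -1, Y <= 1):", ((x <= -1) & (y <= 1)).mean())           # essentially 0
print("P(X <= -1) * P(Y <= 1):", (x <= -1).mean() * (y <= 1).mean())  # ~ 0.108

The product is far from the joint probability, so X and Y are dependent despite having zero covariance.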

Theorem 3.12. Let X, Y, Z, X1, . . . , Xn, Y1, . . . , Ym be RVs such that all the necessary expectations for the following exist. Then

1. Cov(X, X) = Var(X).

2. Cov(X, Y) = Cov(Y, X).

3. Cov(aX, Y) = a Cov(X, Y) for a real constant a.

4. Cov(X + Z, Y) = Cov(X, Y) + Cov(Z, Y).

5. Cov(Σ_{i=1}^n ai Xi, Σ_{j=1}^m bj Yj) = Σ_{i=1}^n Σ_{j=1}^m ai bj Cov(Xi, Yj) for real constants a1, a2, . . . , an and b1, b2, . . . , bm.

6. Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2 Σ_{i<j} Cov(Xi, Xj).

7. If the Xi's are independent, then Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi).

Proof: 1. Straightforward from the definition.

2. Straightforward from the definition.

3.

Cov(aX, Y) = E((aX − E(aX))(Y − E(Y)))
= a E((X − E(X))(Y − E(Y)))
= a Cov(X, Y).

4. Straightforward from the definition.

5. Combining 2, 3, and 4, we can prove it.

6. Combining 1 and 5, the proof is immediate.

7. Using Remark 3.6, it can be readily obtained from 5.

Theorem 3.13. |ρ(X, Y)| ≤ 1, provided it exists.

Proof: Note that for any λ ∈ R,

Var(X + λY) ≥ 0 =⇒ λ² Var(Y) + 2λ Cov(X, Y) + Var(X) ≥ 0.

That means the quadratic equation λ² Var(Y) + 2λ Cov(X, Y) + Var(X) = 0 in λ has at most one real root. Hence, its discriminant is non-positive:

4(Cov(X, Y))² − 4 Var(X) Var(Y) ≤ 0 =⇒ |ρ(X, Y)| ≤ 1.

3.8 Transformation Techniques

Let X = (X1, X2, . . . , Xn) be a random vector and g : Rⁿ → Rᵐ. Clearly, Y = g(X) is an m-dimensional random vector. In this section, we will discuss different methods to find the distribution of the random vector Y = g(X). As in the previous chapter, there are mainly three techniques to obtain the distribution of Y = g(X).

3.8.1 Technique 1

In Technique 1, we try to find the JCDF of Y = g(X) given the distribution of X. As before, we will discuss this technique using examples.

Example 3.7. Let X1 and X2 be independent and identically distributed (i.i.d.) U(0, 1) random variables. Suppose we want to find the CDF of Y = X1 + X2. Now,

FY(y) = P(Y ≤ y) = P(X1 + X2 ≤ y) = ∫∫_{x1+x2≤y} fX1,X2(x1, x2) dx1 dx2. (3.5)

As X1 ∼ U(0, 1), X2 ∼ U(0, 1), and X1 and X2 are independent RVs, the JPDF of (X1, X2) is given by

fX1,X2(x1, x2) = 1 if 0 < x1 < 1, 0 < x2 < 1, and fX1,X2(x1, x2) = 0 otherwise.

Thus, the JPDF of (X1, X2) is positive only on the unit square (0, 1) × (0, 1), which is indicated by the gray shade in Figure 3.2. Now, to compute the integral in (3.5), we need to consider the following cases.

For y < 0, consider Figure 3.2a. As the integrand in (3.5) is zero over the region {(x1, x2) ∈ R² : x1 + x2 ≤ y} for y < 0,

FY(y) = 0.

For 0 ≤ y < 1, consider Figure 3.2b. The integrand is positive only on the shaded region in the set {(x1, x2) ∈ R² : x1 + x2 ≤ y}. Therefore,

FY(y) = ∫_0^y ∫_0^{y−x2} dx1 dx2 = y²/2.

For 1 ≤ y < 2, consider Figure 3.2c. The integrand is positive only on the shaded region in the set {(x1, x2) ∈ R² : x1 + x2 ≤ y}. Therefore,

FY(y) = 1 − ∫_{y−1}^1 ∫_{y−x2}^1 dx1 dx2 = 1 − (2 − y)²/2.

For y ≥ 2, consider Figure 3.2d. The integrand is positive on the shaded region in the set {(x1, x2) ∈ R² : x1 + x2 ≤ y}, and the square (0, 1) × (0, 1) is completely inside that set. Therefore,

FY(y) = 1.

[Figure 3.2 (Plot for Example 3.7): the region {x1 + x2 ≤ y} and the unit square, for the cases (a) y < 0, (b) 0 ≤ y < 1, (c) 1 ≤ y < 2, and (d) y ≥ 2.]

Thus, the CDF of Y = X1 + X2 is given by

FY(y) = 0 if y < 0,
FY(y) = y²/2 if 0 ≤ y < 1,
FY(y) = 1 − (2 − y)²/2 if 1 ≤ y < 2,
FY(y) = 1 if y ≥ 2.

It can be shown that Y is a CRV (why?). ||
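The piecewise CDF derived above matches simulation. A quick empirical check (my own sketch, not part of the notes):

import numpy as np

rng = np.random.default_rng(42)
y = rng.random(500_000) + rng.random(500_000)   # Y = X1 + X2 for i.i.d. U(0,1)

def F(t):
    # the CDF derived in Example 3.7
    if t < 0:
        return 0.0
    if t < 1:
        return 0.5 * t ** 2
    if t < 2:
        return 1 - 0.5 * (2 - t) ** 2
    return 1.0

for t in (0.25, 0.75, 1.0, 1.5, 1.9):
    print(t, (y <= t).mean(), F(t))              # empirical and derived values agree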

Example 3.8. Let the JPDF of (X1, X2) be given by

fX1,X2(x1, x2) = e^{−x1} if 0 < x1 < x2 < ∞, and fX1,X2(x1, x2) = 0 otherwise.

Suppose that we want to find the JCDF of Y1 = X1 + X2 and Y2 = X2 − X1. Note that the JPDF of (X1, X2) is positive only on the set SX1,X2 = {(x1, x2) ∈ R² : 0 < x1 < x2 < ∞}.

[Figure 3.3 (Plot for Example 3.8): (a) the support of (X1, X2); (b)–(d) the region {x1 + x2 ≤ y1, x2 − x1 ≤ y2} intersected with the support, for the cases y1 < 0, 0 < y2 < y1 < ∞, and 0 < y1 < y2 < ∞.]

See Figure 3.3a. Now, let Ay1,y2 = {(x1, x2) ∈ R² : x1 + x2 ≤ y1, x2 − x1 ≤ y2}. Then

FY1,Y2(y1, y2) = P(X1 + X2 ≤ y1, X2 − X1 ≤ y2) = ∫∫_{Ay1,y2} fX1,X2(x1, x2) dx2 dx1. (3.6)

Suppose that y1 < 0. Then FY1(y1) = 0; see Figure 3.3b. As FY1,Y2(y1, y2) ≤ min{FY1(y1), FY2(y2)}, we get FY1,Y2(y1, y2) = 0 for y1 < 0. Similarly, FY1,Y2(y1, y2) = 0 for y2 < 0.

For 0 < y2 < y1 < ∞, Ay1,y2 ∩ SX1,X2 is the shaded region of Figure 3.3c. Therefore,

FY1,Y2(y1, y2) = ∫_0^{(y1−y2)/2} ∫_{x1}^{x1+y2} e^{−x1} dx2 dx1 + ∫_{(y1−y2)/2}^{y1/2} ∫_{x1}^{y1−x1} e^{−x1} dx2 dx1
= y2 + 2e^{−y1/2} − 2e^{−(y1−y2)/2}.

For 0 < y1 < y2 < ∞, Ay1,y2 ∩ SX1,X2 is indicated by the shaded region in Figure 3.3d. Therefore,

FY1,Y2(y1, y2) = ∫_0^{y1/2} ∫_{x1}^{y1−x1} e^{−x1} dx2 dx1 = y1 + 2e^{−y1/2} − 2.

Thus, the JCDF of (Y1, Y2) = (X1 + X2, X2 − X1) is given by

FY1,Y2(y1, y2) = 0 if y1 < 0 or y2 < 0,
FY1,Y2(y1, y2) = y2 + 2e^{−y1/2} − 2e^{−(y1−y2)/2} if 0 < y2 ≤ y1 < ∞,
FY1,Y2(y1, y2) = y1 + 2e^{−y1/2} − 2 if 0 < y1 < y2 < ∞.

(Note that the two cases agree on the boundary y2 = y1, as they must.) It can be shown that (Y1, Y2) is a continuous random vector. This is easy and is therefore left as an exercise. ||

3.8.2 Technique 2

Like the previous chapter, this technique is based on two theorems, one for discrete random vectors and another for continuous random vectors.

Theorem 3.14. Let X = (X1, X2, . . . , Xn) be a discrete random vector with JPMF fX and support SX. Let gi : Rⁿ → R for all i = 1, 2, . . . , k, and let Yi = gi(X) for i = 1, 2, . . . , k. Then Y = (Y1, . . . , Yk) is a discrete random vector with JPMF

fY(y1, . . . , yk) = Σ_{x∈Ay} fX(x) if (y1, . . . , yk) ∈ SY, and fY(y1, . . . , yk) = 0 otherwise,

where Ay = {x ∈ SX : gi(x) = yi, i = 1, . . . , k} and SY = {(g1(x), . . . , gk(x)) : x ∈ SX}.

Proof: The proof of the theorem is similar to that of Theorem 2.7.
Example 3.9. Let X1 ∼ Poi(λ1) and X2 ∼ Poi(λ2). Also, assume that X1 and X2 are independent. Then Y = X1 + X2 ∼ Poi(λ1 + λ2). To see this, we can apply Theorem 3.14. First note that the JPMF of (X1, X2) is given by

fX1,X2(x1, x2) = e^{−(λ1+λ2)} λ1^{x1} λ2^{x2} / (x1! x2!) if x1 = 0, 1, . . . ; x2 = 0, 1, . . ., and fX1,X2(x1, x2) = 0 otherwise.

Therefore, SX1,X2 = {0, 1, 2, . . .} × {0, 1, 2, . . .}, which implies that SY = {0, 1, 2, . . .}. For y ∈ SY, Ay = {(x, y − x) : x = 0, 1, . . . , y}. Hence, using Theorem 3.14, for y ∈ SY,

fY(y) = Σ_{(x1,x2)∈Ay} e^{−(λ1+λ2)} λ1^{x1} λ2^{x2} / (x1! x2!) = (e^{−(λ1+λ2)} / y!) Σ_{x=0}^y (y choose x) λ1^x λ2^{y−x} = e^{−(λ1+λ2)} (λ1 + λ2)^y / y!.

Thus, the PMF of Y = X1 + X2 is

fY(y) = e^{−(λ1+λ2)} (λ1 + λ2)^y / y! if y = 0, 1, . . ., and fY(y) = 0 otherwise,

which is the PMF of a Poi(λ1 + λ2) distribution. Hence, X1 + X2 ∼ Poi(λ1 + λ2). ||
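The conclusion of Example 3.9 can be verified by computing the convolution directly. A short sketch of mine (not from the notes):

from math import exp, factorial

def poi(k, lam):
    # Poisson PMF at k with mean lam
    return exp(-lam) * lam ** k / factorial(k)

l1, l2 = 1.3, 2.2
for y in range(8):
    conv = sum(poi(x, l1) * poi(y - x, l2) for x in range(y + 1))
    print(y, conv, poi(y, l1 + l2))   # the convolution equals the Poi(l1 + l2) PMF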


Example 3.10. Let X1 ∼ Bin(n1, p) and X2 ∼ Bin(n2, p). We also assume that X1 and X2 are independent. Suppose that we want to find the PMF of Y = X1 + X2. Note that X1 and X2 are the numbers of successes out of n1 and n2 independent Bernoulli trials, respectively. In both cases the probability of success is p. Therefore, Y is the number of successes out of n1 + n2 Bernoulli trials with success probability p. As X1 and X2 are independent, these n1 + n2 Bernoulli trials can be assumed to be independent. Hence, the distribution of Y must be Bin(n1 + n2, p). Let us now check if we get the same distribution using Theorem 3.14. The JPMF of (X1, X2) is

fX1,X2(x1, x2) = (n1 choose x1)(n2 choose x2) p^{x1+x2} (1 − p)^{n1+n2−x1−x2} if x1 = 0, 1, . . . , n1; x2 = 0, 1, . . . , n2, and fX1,X2(x1, x2) = 0 otherwise.

Therefore, SX1,X2 = {0, 1, . . . , n1} × {0, 1, . . . , n2}. Without loss of generality, we assume that n1 ≤ n2; if not, exchange the roles of X1 and X2. Now, SY = {0, 1, . . . , n1 + n2}. For y ∈ SY,

Ay = {(x1, x2) ∈ SX1,X2 : x1 + x2 = y}
= {(x, y − x) : x = 0, 1, . . . , y} if 0 ≤ y ≤ n1,
= {(x, y − x) : x = 0, 1, . . . , n1} if n1 < y ≤ n2,
= {(x, y − x) : x = y − n2, . . . , n1} if n2 < y ≤ n1 + n2.

Hence, for y ∈ SY and y ≤ n1,

fY(y) = Σ_{x=0}^y (n1 choose x)(n2 choose y − x) p^y (1 − p)^{n1+n2−y} = (n1 + n2 choose y) p^y (1 − p)^{n1+n2−y}.

The last equality can be proved by collecting the coefficient of x^y from both sides of the following identity:

(1 + x)^{n1} (1 + x)^{n2} = { Σ_{i=0}^{n1} (n1 choose i) x^i } × { Σ_{i=0}^{n2} (n2 choose i) x^i }.

For y ∈ SY and n1 < y ≤ n2,

fY(y) = Σ_{x=0}^{n1} (n1 choose x)(n2 choose y − x) p^y (1 − p)^{n1+n2−y} = (n1 + n2 choose y) p^y (1 − p)^{n1+n2−y}.

For y ∈ SY and n2 < y ≤ n1 + n2,

fY(y) = Σ_{x=y−n2}^{n1} (n1 choose x)(n2 choose y − x) p^y (1 − p)^{n1+n2−y} = (n1 + n2 choose y) p^y (1 − p)^{n1+n2−y}.

Thus, X1 + X2 ∼ Bin(n1 + n2, p). Note that the independence of X1 and X2 and the common value of the success probability are important for this result. ||

Theorem 3.15. Let X = (X1, . . . , Xn) be a continuous random vector with JPDF fX.

1. Let yi = gi(x), i = 1, 2, . . . , n, be Rⁿ → R functions such that

y = g(x) = (g1(x), . . . , gn(x))

is one-to-one. That means there exists the inverse transformation xi = hi(y), i = 1, 2, . . . , n, defined on the range of the transformation.

2. Assume that both the mapping and its inverse are continuous.

3. Assume that the partial derivatives ∂xi/∂yj, i = 1, 2, . . . , n, j = 1, 2, . . . , n, exist and are continuous.

4. Assume that the Jacobian of the inverse transformation

J = det( ∂xi/∂yj )_{i,j=1,2,...,n} ≠ 0

on the range of the transformation.

Then Y = (g1(X), . . . , gn(X)) is a continuous random vector with JPDF

fY(y) = fX(h1(y), . . . , hn(y)) |J|.

Proof: The proof of this theorem can be done using the change-of-variables technique for multiple integrals. However, the proof is skipped here.
Remark 3.7. Note that g is a vector-valued function. As g should be one-to-one, the dimension of g should be the same as the dimension of its argument. Though we have written gi : Rⁿ → R in the previous theorem, the conclusion of the theorem remains valid if we replace gi : Rⁿ → R by gi : SX → R. Moreover, the theorem gives us sufficient conditions for g(X) to be a continuous random vector when X is a continuous random vector. Thus, g(X) can be a continuous random vector even if the conditions of the previous theorem do not hold true. †
Example 3.11. Let X1 and X2 be i.i.d. U(0, 1) random variables. We want to find the JPDF of Y1 = X1 + X2 and Y2 = X1 − X2. Clearly,

g1(x1, x2) = x1 + x2 and g2(x1, x2) = x1 − x2.

Thus, y = (y1, y2) = g(x1, x2) = (g1(x1, x2), g2(x1, x2)) = (x1 + x2, x1 − x2). Now, if (x1, x2) ≠ (x̃1, x̃2), then g(x1, x2) ≠ g(x̃1, x̃2): if not, then x1 + x2 = x̃1 + x̃2 and x1 − x2 = x̃1 − x̃2, which implies x1 = x̃1 and x2 = x̃2, a contradiction. Hence, the function g(·, ·) is one-to-one. The inverse function is given by h(y1, y2) = (h1(y1, y2), h2(y1, y2)), where x1 = h1(y1, y2) = (y1 + y2)/2 and x2 = h2(y1, y2) = (y1 − y2)/2. Clearly, both the mapping and the inverse mapping are continuous. Now,

∂x1/∂y1 = 1/2, ∂x1/∂y2 = 1/2, ∂x2/∂y1 = 1/2, and ∂x2/∂y2 = −1/2.

All the partial derivatives are continuous. The Jacobian is

J = det [ 1/2  1/2 ; 1/2  −1/2 ] = −1/2 ≠ 0.

Thus, all four conditions of Theorem 3.15 hold, and hence Y = (Y1, Y2) is a continuous random vector with JPDF

fY1,Y2(y1, y2) = fX1,X2((y1 + y2)/2, (y1 − y2)/2) · |−1/2|
= 1/2 if 0 < y1 + y2 < 2 and 0 < y1 − y2 < 2, and fY1,Y2(y1, y2) = 0 otherwise.

Note that in Example 3.7, we found the distribution of X1 + X2. You may find the marginal distribution of X1 + X2 from the JPDF above and check that you get the same marginal distribution. ||
Example 3.12. Let X1 and X2 be i.i.d. N(0, 1) random variables. We want to find the PDF of Y1 = X1/X2. Note that we cannot use Theorem 3.15 directly here, as we have a single function g1(x1, x2) = x1/x2. Thus, we need to introduce an auxiliary function g2(x1, x2) such that g(x1, x2) = (g1(x1, x2), g2(x1, x2)) satisfies all the conditions of Theorem 3.15. Let us take g2(x1, x2) = x2. Clearly, g(x1, x2) is a one-to-one function. Here, the inverse function is h(y1, y2) = (h1(y1, y2), h2(y1, y2)), where x1 = h1(y1, y2) = y1 y2 and x2 = h2(y1, y2) = y2. It is easy to see that the mapping g and its inverse are continuous. Also,

∂x1/∂y1 = y2, ∂x1/∂y2 = y1, ∂x2/∂y1 = 0, and ∂x2/∂y2 = 1.

All the partial derivatives are continuous. Hence, the Jacobian is

J = det [ y2  y1 ; 0  1 ] = y2.

Thus, all four conditions of Theorem 3.15 hold, and hence Y = (X1/X2, X2) is a continuous random vector with JPDF

fY1,Y2(y1, y2) = (1/2π) e^{−(1+y1²)y2²/2} |y2| for (y1, y2) ∈ R².

Now, we can find the marginal PDF of Y1 from the JPDF of (Y1, Y2). The marginal PDF of Y1 is given by

fY1(y1) = ∫_{−∞}^∞ (|y2|/2π) e^{−(1+y1²)y2²/2} dy2 = (1/π) ∫_0^∞ y2 e^{−(1+y1²)y2²/2} dy2 = 1 / (π(1 + y1²))

for all y1 ∈ R. Thus, Y1 ∼ Cauchy(0, 1). ||
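Since a Cauchy RV has no mean, the natural empirical check compares CDFs rather than sample averages. The sketch below is mine (not from the notes); it compares the empirical CDF of X1/X2 with the Cauchy(0, 1) CDF, 1/2 + arctan(t)/π.

import numpy as np

rng = np.random.default_rng(7)
r = rng.standard_normal(1_000_000) / rng.standard_normal(1_000_000)
for t in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(t, (r <= t).mean(), 0.5 + np.arctan(t) / np.pi)   # the two columns agree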
Theorem 3.16. If X and Y are independent, then g(X) and h(Y) are also independent.

Proof: The exact proof of the theorem cannot be given in this course. However, intuitively it makes sense: X and Y are independent, meaning that neither RV has any effect on the other. Now, g being a function of X only and h being a function of Y only, there should be no effect of g(X) on h(Y), and vice versa.

3.8.3 Technique 3

Technique 3 depends on the MGF. Hence, we first need to define the MGF of a random vector.

Definition 3.13 (Moment Generating Function). Let X = (X1, X2, . . . , Xn) be a random vector. The MGF of X at t = (t1, t2, . . . , tn) is defined by

MX(t) = E[exp(Σ_{i=1}^n ti Xi)],

provided the expectation exists in a neighborhood of the origin 0 = (0, 0, . . . , 0).

Theorem 3.17. E(X1^{r1} X2^{r2} · · · Xn^{rn}) = [∂^{r1+r2+...+rn} / (∂t1^{r1} ∂t2^{r2} · · · ∂tn^{rn})] MX(t), evaluated at t = 0.

Proof: The proof is out of scope of the course.

Theorem 3.18. X and Y are independent if and only if MX,Y(t1, t2) = MX(t1) MY(t2) in a neighborhood of the origin.

Proof: The proof is out of the scope of the course.

Note that if X and Y are independent, then using Theorem 3.11, it is straightforward to see that MX,Y(t, s) = MX(t) MY(s). Also, note that E(g(X)h(Y)) = E(g(X)) E(h(Y)) for some functions g and h does not imply that X and Y are independent. In particular, E(XY) = E(X)E(Y) does not imply that X and Y are independent. Please revisit Example 3.6 in this regard.

Definition 3.14. Two n-dimensional random vectors X and Y are said to have the same distribution, denoted by X =d Y, if FX(x) = FY(x) for all x ∈ Rⁿ.

Theorem 3.19. Let X and Y be two n-dimensional random vectors. If MX(t) = MY(t) for all t in a neighborhood around 0, then X =d Y.

Proof: The proof is out of scope of the course.

Example 3.13. Let Xi, i = 1, 2, . . . , k, be independent Bin(ni, p) RVs. Let us try to find the distribution of Y = Σ_{i=1}^k Xi. Now, the MGF of Y is

MY(t) = E(e^{tY}) = E(exp(t Σ_{i=1}^k Xi)) = E(Π_{i=1}^k e^{tXi}) = Π_{i=1}^k E(e^{tXi}) = Π_{i=1}^k MXi(t).

The fourth equality is true as the RVs X1, X2, . . . , Xk are independent. In Example 2.40, we have seen that the MGF of X ∼ Bin(n, p) is MX(t) = (1 − p + pe^t)^n for all t ∈ R. Thus, the MGF of Y is

MY(t) = Π_{i=1}^k (1 − p + pe^t)^{ni} = (1 − p + pe^t)^{Σ_{i=1}^k ni}

for t ∈ R. Let Z ∼ Bin(Σ_{i=1}^k ni, p); then MZ(t) = MY(t) for all t ∈ R. Thus, Y =d Z ∼ Bin(Σ_{i=1}^k ni, p). Note that this example is an extension of Example 3.10. ||

Example 3.14. Let X1, X2, . . . , Xk be i.i.d. Exp(λ) RVs and Y = Σ_{i=1}^k Xi. Then the MGF of Y is

MY(t) = Π_{i=1}^k MXi(t) = [MX1(t)]^k = (1 − t/λ)^{−k}

for all t < λ. The second equality is due to the fact that the Xi have the same distribution for all i = 1, 2, . . . , k. The third equality holds true by Example 2.41. Let Z ∼ Gamma(k, λ). Then MZ(t) = MY(t) for all t < λ. Hence, Y ∼ Gamma(k, λ). ||
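Example 3.14 is also easy to check by simulation. The sketch below is mine (not from the notes; it assumes scipy, and uses the convention that Exp(λ) is the rate-λ exponential, i.e., scale 1/λ). It compares the empirical CDF of the sum with the Gamma(k, λ) CDF.

import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
k, lam = 4, 2.0
y = rng.exponential(scale=1 / lam, size=(200_000, k)).sum(axis=1)
for t in (0.5, 1.0, 2.0, 4.0):
    print(t, (y <= t).mean(), gamma.cdf(t, a=k, scale=1 / lam))   # the columns agree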

Example 3.15. Let Xi, i = 1, 2, . . . , k, be independent N(µi, σi²) RVs. Then Σ_{i=1}^k Xi ∼ N(Σ_{i=1}^k µi, Σ_{i=1}^k σi²). This can be proved following the same technique as in the last example. I am leaving it as an exercise. ||

Definition 3.15 (Expectation of a Random Vector). The expectation of a random vector is given by

E(X) = (EX1, EX2, . . . , EXn)′ = µ.

Definition 3.16 (Variance-Covariance Matrix of a Random Vector). The variance-covariance matrix of an n-dimensional random vector, denoted by Σ, is defined by

Σ = [Cov(Xi, Xj)]_{i,j=1}^n = E[(X − µ)(X − µ)′].

3.9 Conditional Distribution


3.9.1 For Discrete Random Vector
Definition 3.17. Let (X, Y) be a discrete random vector with JPMF fX,Y(·, ·). Suppose the marginal PMF of Y is fY(·). The conditional PMF of X, given Y = y, is defined by

fX|Y(x|y) = fX,Y(x, y) / fY(y),

provided fY(y) > 0.

Note that fX,Y(x, y) = P(X = x, Y = y) and fY(y) = P(Y = y). Thus, the conditional PMF of X given Y = y is P(X = x|Y = y). As we know that P(A|B) is defined only if P(B) > 0, here we need the condition fY(y) = P(Y = y) > 0. Hence, fX|Y(x|y) is only defined for y ∈ SY.
Example 3.16. Let X1 ∼ Poi(λ1) and X2 ∼ Poi(λ2), and assume that X1 and X2 are independent. In Example 3.9, we have seen that X1 + X2 ∼ Poi(λ1 + λ2) and the support of X1 + X2 is S = {0, 1, 2, . . .}. Hence, the conditional PMF of X1 given X1 + X2 = y is defined for all y ∈ S and is given by

fX1|X1+X2(x|y) = fX1,X1+X2(x, y) / fX1+X2(y)
= P(X1 = x, X1 + X2 = y) / P(X1 + X2 = y)
= P(X1 = x, X2 = y − x) / P(X1 + X2 = y)
= P(X1 = x) P(X2 = y − x) / P(X1 + X2 = y), as X1 and X2 are independent.

For x = 0, 1, 2, . . . , y, this equals

[e^{−λ1} λ1^x / x!][e^{−λ2} λ2^{y−x} / (y − x)!] / [e^{−(λ1+λ2)} (λ1 + λ2)^y / y!] = (y choose x) (λ1/(λ1+λ2))^x (1 − λ1/(λ1+λ2))^{y−x},

and it is zero otherwise. Thus, X1 | X1 + X2 = y ∼ Bin(y, λ1/(λ1+λ2)). Note that the support of the conditional PMF is {0, 1, . . . , y}. ||
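The conditional binomial law of Example 3.16 shows up directly in simulation. A minimal sketch of mine (not from the notes; it assumes numpy and scipy):

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(11)
l1, l2, y = 2.0, 3.0, 5
x1 = rng.poisson(l1, 2_000_000)
x2 = rng.poisson(l2, 2_000_000)
sel = x1[x1 + x2 == y]                              # condition on X1 + X2 = y
emp = np.bincount(sel, minlength=y + 1) / len(sel)  # empirical conditional PMF
print(np.round(emp, 4))
print(np.round(binom.pmf(np.arange(y + 1), y, l1 / (l1 + l2)), 4))  # Bin(y, l1/(l1+l2))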
Definition 3.18. The conditional CDF of X given Y = y is defined by

FX|Y(x|y) = P(X ≤ x|Y = y) = Σ_{u≤x:(u,y)∈SX,Y} fX|Y(u|y),

provided fY(y) > 0.

Definition 3.19 (Conditional Expectation for Discrete Random Vector). The conditional expectation of h(X) given Y = y is defined by

E(h(X)|Y = y) = Σ_{x:(x,y)∈SX,Y} h(x) fX|Y(x|y),

provided it is absolutely summable.

Remark 3.8. Note that for fixed y ∈ SY, fX|Y(·|y) is a PMF. Thus, the conditional expectation is an expectation with respect to the distribution specified by the PMF fX|Y(·|y), and hence the conditional expectation satisfies all the properties of expectation. For example, if h1(x) ≤ h2(x) for all x ∈ R, then

E(h1(X)|Y = y) ≤ E(h2(X)|Y = y),

provided the expectations exist. †


Example 3.17. Let X ∼ Poi(λ1) and Y ∼ Poi(λ2), and let X and Y be independent. In Example 3.16, we have seen that X | X + Y = n ∼ Bin(n, λ1/(λ1+λ2)) for all n = 0, 1, . . .. Hence, the conditional expectation of X given X + Y = n is nλ1/(λ1+λ2). ||
Example 3.18. Suppose that a system has n components, and that on a rainy day, component i functions with probability pi, i = 1, 2, . . . , n. Also, assume that the components work independently. We want to calculate the conditional expected number of components that will function tomorrow given that it rains tomorrow. Again, we use indicator RVs, as in Example 3.4, to count the number of components that will work tomorrow. Let

Xi = 1 if component i functions tomorrow and Xi = 0 otherwise; Y = 1 if it rains tomorrow and Y = 0 otherwise.

Then the desired expectation can be obtained as follows:

E[Σ_{i=1}^n Xi | Y = 1] = Σ_{i=1}^n E(Xi|Y = 1) = Σ_{i=1}^n pi.

The last equality is due to the fact that P(Xi = 1|Y = 1) = pi for all i = 1, 2, . . . , n. ||
Theorem 3.20. If X and Y are independent DRVs, then fX|Y(x|y) = fX(x) for all x ∈ R and y ∈ SY.

Proof: The proof is straightforward from the definition of the conditional PMF.
3.9.2 For Continuous Random Vector

Let (X, Y) be a continuous random vector. Note that, in this case, Y is a CRV, and hence P(Y = y) = 0 for all y ∈ R. As a result, we cannot define the conditional probabilities in the same way as we did for a discrete random vector in the previous subsection. As for a CRV, we will first define the conditional CDF for a continuous random vector (X, Y), and then the conditional PDF.

Definition 3.20 (Conditional CDF). Let (X, Y) be a continuous random vector. The conditional CDF of X given Y = y is defined as

FX|Y(x|y) = lim_{ε↓0} P(X ≤ x | Y ∈ (y − ε, y + ε]),

provided the limit exists.

Note that the CDF of a random variable X is defined by P(X ≤ x). Ideally, we would like to consider P(X ≤ x|Y = y). However, when (X, Y) is a continuous random vector, we have a problem in defining P(X ≤ x|Y = y), as P(Y = y) = 0 for all y ∈ R. One way to overcome this difficulty is to proceed as follows: we replace the event {Y = y} by {Y ∈ (y − ε, y + ε]} and then take the limit as ε decreases to zero. Of course, this makes sense only if the limit exists. Motivated by this intuition, we have the previous definition of the conditional CDF of X given Y = y for a continuous random vector (X, Y).

Definition 3.21 (Conditional PDF). Let (X, Y) be a continuous random vector with conditional CDF FX|Y(·|y) of X given Y = y. The conditional PDF of X given Y = y, fX|Y(x|y), is the non-negative integrable function satisfying

FX|Y(x|y) = ∫_{−∞}^x fX|Y(t|y) dt for all x ∈ R.

Theorem 3.21. Let fX,Y be the JPDF of (X, Y) and let fY be the marginal PDF of Y. If fY(y) > 0, then the conditional PDF of X given Y = y exists and is given by

fX|Y(x|y) = fX,Y(x, y) / fY(y).

Proof: This is not an exact proof, but an overview of it.

lim_{ε↓0} P(X ≤ x | y − ε < Y ≤ y + ε) = lim_{ε↓0} [ P(X ≤ x, y − ε < Y ≤ y + ε) / P(y − ε < Y ≤ y + ε) ]
= lim_{ε↓0} [ ∫_{−∞}^x ∫_{y−ε}^{y+ε} fX,Y(t, s) ds dt ] / [ ∫_{y−ε}^{y+ε} fY(s) ds ]
= ∫_{−∞}^x [ lim_{ε↓0} (1/2ε) ∫_{y−ε}^{y+ε} fX,Y(t, s) ds ] / [ lim_{ε↓0} (1/2ε) ∫_{y−ε}^{y+ε} fY(s) ds ] dt
= ∫_{−∞}^x [ fX,Y(t, y) / fY(y) ] dt.

The last equality is due to the fundamental theorem of calculus. Thus, the conditional PDF is given by fX|Y(x|y) = fX,Y(x, y)/fY(y) for those values of y ∈ R for which fY(y) > 0.

Definition 3.22 (Conditional Expectation for Continuous Random Vector). The conditional expectation of h(X) given Y = y is defined, for all values of y such that fY(y) > 0, by

E(h(X)|Y = y) = ∫_{−∞}^∞ h(x) fX|Y(x|y) dx,

provided it is absolutely integrable.

Remark 3.9. Note that for fixed y ∈ SY, fX|Y(·|y) is a PDF. Therefore, the conditional expectation is an expectation with respect to the PDF fX|Y(·|y). Thus, E(X|Y = y) satisfies all properties of unconditional expectation. †
Example 3.19. Suppose the JPDF of (X, Y) is given by

fX,Y(x, y) = 6xy(2 − x − y) if 0 < x < 1, 0 < y < 1, and fX,Y(x, y) = 0 otherwise.

The marginal PDF of Y is

fY(y) = y(4 − 3y) if 0 < y < 1, and fY(y) = 0 otherwise.

Hence, the conditional PDF of X given Y = y ∈ (0, 1) is given by

fX|Y(x|y) = fX,Y(x, y) / fY(y) = 6x(2 − x − y)/(4 − 3y) if 0 < x < 1, and fX|Y(x|y) = 0 otherwise.

The conditional expectation of X given Y = y ∈ (0, 1) is

E(X|Y = y) = ∫_{−∞}^∞ x fX|Y(x|y) dx = [6/(4 − 3y)] ∫_0^1 x²(2 − x − y) dx = (5 − 4y) / (2(4 − 3y)).

Note that for the conditional PDF, the ranges of both x and y are important and need to be mentioned unambiguously. Similarly, for computing the conditional expectation, we need the appropriate ranges of x and y. ||
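The conditional expectation of Example 3.19 can be checked by one-dimensional numerical integration (a sketch of mine, not from the notes; it assumes scipy):

from scipy.integrate import quad

# E(X | Y = y) = [6/(4 - 3y)] * integral of x^2 (2 - x - y) over (0, 1)
for y in (0.2, 0.5, 0.8):
    num = quad(lambda x: 6 * x ** 2 * (2 - x - y), 0, 1)[0]
    print(y, num / (4 - 3 * y), (5 - 4 * y) / (2 * (4 - 3 * y)))   # the two agree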
Example 3.20. Let the joint PDF of (X, Y) be

fX,Y(x, y) = (1/2) y e^{−xy} if 0 < x < ∞, 0 < y < 2, and fX,Y(x, y) = 0 otherwise.

The marginal PDF of Y is

fY(y) = 1/2 if 0 < y < 2, and fY(y) = 0 otherwise.

For y ∈ (0, 2), the conditional PDF of X given Y = y is

fX|Y(x|y) = y e^{−yx} if x > 0, and fX|Y(x|y) = 0 otherwise.

Note that Y ∼ U(0, 2) and X | Y = y ∼ Exp(y) for y ∈ (0, 2). Hence,

E(e^{X/2} | Y = 1) = ∫_0^∞ e^{x/2} e^{−x} dx = ∫_0^∞ e^{−x/2} dx = 2. ||
Theorem 3.22. If (X, Y) is a continuous random vector such that X and Y are independent random variables, then fX|Y(x|y) = fX(x) for all x ∈ R and for all y ∈ SY.

Proof: The proof is straightforward from the definition.

3.9.3 Computing Expectation by Conditioning

Suppose that (X, Y) is either a discrete random vector or a continuous random vector. Then the conditional expectation E(X|Y = y) is a function of y; let us denote g(y) = E(X|Y = y). Then g(Y) is a function of the RV Y. Thus, g(Y) = E(X|Y) is again a random variable. With this understanding, we have the following theorem.

Theorem 3.23. E(X) = E(E(X|Y)).

Proof: We will prove it for a continuous random vector (X, Y). For a discrete random vector, the proof can be obtained by replacing the integration signs by summation signs.

E(E(X|Y)) = ∫_{−∞}^∞ E(X|Y = y) fY(y) dy
= ∫_{−∞}^∞ ∫_{−∞}^∞ x fX|Y(x|y) fY(y) dx dy
= ∫_{−∞}^∞ ∫_{−∞}^∞ x fX,Y(x, y) dx dy
= E(X).

In the previous theorem, the outer expectation is with respect to the distribution of Y, as E(X|Y) is a function of Y. This theorem can be used to solve many problems. It states that we can compute the averages of different parts of a population separately and then take a weighted sum of those averages to obtain the overall average. Suppose there are three columns of seats (as in some of the lecture halls at IITG) in a class, and we want to find the average height of the students in the class. Let x̄i denote the average height of the students sitting in the ith column and ni be the number of students sitting in the ith column. Then the overall average is

x̄ = (n1/n) x̄1 + (n2/n) x̄2 + (n3/n) x̄3,

where n = n1 + n2 + n3. Note that ni/n can be interpreted as the probability that a student is in column i, and x̄i is the conditional expectation of the height given that the student is in column i. Therefore, the overall average is E(E(X|Y)), where X denotes the height of a student and Y is an indicator of the column number; we can take Y = i if the student is in the ith column. Thus, the above theorem is a generalization of what we learned in school.

Though we have discussed expectations when (X, Y) is either a discrete random vector or a continuous random vector, the above theorem is still valid if one of them is a DRV and the other is a CRV. We will see some applications where one of them is a CRV and the other one is a DRV. Of course, we will not go into the proper definition (which is out of the scope of this course) or the proof in these cases. If Y is a DRV, then

E(X) = E(E(X|Y)) = Σ_{y∈SY} E(X|Y = y) fY(y).

If Y is a CRV, then

E(X) = E(E(X|Y)) = ∫_{−∞}^∞ E(X|Y = y) fY(y) dy.

Example 3.21. Virat will read either one chapter of his probability book or one chapter of his history book. If the number of misprints in a chapter of his probability book and of his history book is Poisson with mean 2 and 5, respectively, then, assuming that Virat is equally likely to choose either book, we can compute the expected number of misprints that he will come across using the above theorem. Let X denote the number of misprints and

Y = 1 if Virat reads the probability book, and Y = 2 if Virat reads the history book.

We need to find E(X). Note that P(Y = 1) = P(Y = 2) = 1/2, E(X|Y = 1) = 2 and E(X|Y = 2) = 5. Hence,

E(X) = E(E(X|Y)) = P(Y = 1) E(X|Y = 1) + P(Y = 2) E(X|Y = 2) = (1/2)(2 + 5) = 3.5.

Thus, the expected number of misprints that Virat will come across is 3.5. ||
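A hedged simulation of this example (the sample size is arbitrary): draw the book first, then a Poisson count of misprints with the corresponding mean; the empirical mean should be close to 3.5.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
book = rng.integers(1, 3, size=n)               # Y: 1 = probability book, 2 = history book
misprints = np.where(book == 1,
                     rng.poisson(2.0, size=n),
                     rng.poisson(5.0, size=n))  # X given the chosen book
print(misprints.mean())                         # close to 3.5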
Theorem 3.24. E(X − E(X|Y))^2 ≤ E(X − f(Y))^2 for any function f.
Proof: Let us denote µ(Y) = E(X|Y). Then

E(X − f(Y))^2 = E(X − µ(Y) + µ(Y) − f(Y))^2
             = E(X − µ(Y))^2 + E(µ(Y) − f(Y))^2 + 2E[(X − µ(Y))(µ(Y) − f(Y))].

Now,

E[(X − µ(Y))(µ(Y) − f(Y))] = E(E[(X − µ(Y))(µ(Y) − f(Y))|Y])
                           = E[(µ(Y) − f(Y)) E(X − µ(Y)|Y)]
                           = E[(µ(Y) − f(Y))(µ(Y) − µ(Y))]
                           = 0.

The first equality is due to the Theorem 3.23. For the second equality, notice that µ(Y) − f(Y), being a function of Y only, acts as a constant when Y is given. Hence, µ(Y) − f(Y) comes out of the conditional expectation. Thus,

E(X − f(Y))^2 = E(X − µ(Y))^2 + E(µ(Y) − f(Y))^2 ≥ E(X − µ(Y))^2,

as E(µ(Y) − f(Y))^2 ≥ 0. The equality holds if and only if E(µ(Y) − f(Y))^2 = 0. It can be shown that E(µ(Y) − f(Y))^2 = 0 if and only if f(Y) = µ(Y) = E(X|Y).

Recall the Theorem 2.12, which states that if we do not have any extra information, then the "best estimate" of X is E(X). The Theorem 3.24 states that if we have information on the RV Y, then the "best estimate" of X changes and becomes E(X|Y).
Definition 3.23 (Conditional Variance). Let (X, Y) be a random vector. Then the conditional variance of X given Y = y is defined by

Var(X|Y = y) = E((X − E(X|Y = y))^2 | Y = y) = E(X^2 |Y = y) − (E(X|Y = y))^2.

Like the conditional expectation, Var(X|Y) = σ^2(Y) is a RV, where σ^2(y) = Var(X|Y = y). We then have the following theorem, which says that the overall variance can be computed by calculating variances and expectations of different parts and then aggregating them.
Theorem 3.25. Var(X) = E(Var(X|Y)) + Var(E(X|Y)).
Proof: Let µ(Y) = E(X|Y) and µ = E(X) = E(E(X|Y)). Then

Var(X) = E(X − µ)^2
       = E(X − µ(Y))^2 + E(µ(Y) − µ)^2 + 2E[(X − µ(Y))(µ(Y) − µ)].

Now, using Theorem 3.23,

E[(X − µ(Y))(µ(Y) − µ)] = E(E[(X − µ(Y))(µ(Y) − µ)|Y]) = 0.

Thus,

Var(X) = E(X − µ(Y))^2 + E(µ(Y) − µ)^2
       = E(E[(X − µ(Y))^2 |Y]) + E[µ(Y) − E(µ(Y))]^2
       = E(Var(X|Y)) + Var(µ(Y))
       = E(Var(X|Y)) + Var(E(X|Y)).

Note that the formula is easy to remember: on the right-hand side, one term is the expectation of the conditional variance and the other is the variance of the conditional expectation.
Example 3.22. Let X0, X1, X2, . . . be a sequence of i.i.d. RVs with mean µ and variance σ^2. Let N ∼ Bin(n, p) be independent of the Xi's for all i = 0, 1, . . .. Define

S = Σ_{i=0}^{N} Xi.

Note that the RV S is the sum of a random number of RVs. RVs of this type are called compound RVs. Compound random variables are quite important in many practical situations. For example, consider a car insurance company. The number of accidents that a customer meets in a year is a RV. Let N denote the number of accidents in a year. Now, assume that Xi denotes the claim by the customer after the ith accident. Then S is the total claim made by the customer. Now, it is important for the insurance company to have an idea of the average and the variance of the claims made by a customer. Let us try to compute E(S) and Var(S). Note that

E(S|N = n) = E(Σ_{i=0}^{N} Xi | N = n) = E(Σ_{i=0}^{n} Xi | N = n) = E(Σ_{i=0}^{n} Xi) = (n + 1)µ.

The second equality is true as, under the condition N = n, the sum has n + 1 terms. The third equality is true due to the fact that N and the Xi's are independent. Note that Σ_{i=0}^{N} Xi and N are not independent. However, when we put a specific value of N in Σ_{i=0}^{N} Xi to get Σ_{i=0}^{n} Xi, the latter does not involve N and becomes independent of N. Thus, we have E(S|N) = (N + 1)µ. Hence,

E(S) = E(E(S|N)) = E[(N + 1)µ] = (np + 1)µ.

Now,

Var(S|N = n) = Var(Σ_{i=0}^{N} Xi | N = n) = Var(Σ_{i=0}^{n} Xi | N = n) = Var(Σ_{i=0}^{n} Xi) = (n + 1)σ^2.

Thus, Var(S|N) = (N + 1)σ^2. Hence,

Var(S) = E(Var(S|N)) + Var(E(S|N))
       = E[(N + 1)σ^2] + Var[(N + 1)µ]
       = (np + 1)σ^2 + np(1 − p)µ^2.

||
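A minimal Monte Carlo sketch of the compound sum (not part of the notes); the choices N ∼ Bin(10, 0.3) and normal Xi with µ = 2, σ = 1.5 are purely illustrative.

import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 0.3          # N ~ Bin(n, p)
mu, sigma = 2.0, 1.5    # each X_i has mean mu and s.d. sigma (normal, for concreteness)

reps = 50_000
N = rng.binomial(n, p, size=reps)
S = np.array([rng.normal(mu, sigma, size=k + 1).sum() for k in N])  # S = X_0 + ... + X_N

print(S.mean(), (n * p + 1) * mu)                                   # E(S) = (np + 1) mu
print(S.var(), (n * p + 1) * sigma**2 + n * p * (1 - p) * mu**2)    # Var(S)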
Theorem 3.23 can be used to compute probabilities by conditioning. We have seen that P(X ∈ A) = E(IA(X)), where IA is the indicator function of the set A. Also, note that E(IA(X)|Y = y) = P(X ∈ A|Y = y). Therefore, we can write

P(A) = P(X ∈ A) = E(IA(X)) = E(E(IA(X)|Y))
     = Σ_{y∈SY} P(A|Y = y) P(Y = y)        for Y discrete,
     = ∫_{−∞}^{∞} P(A|Y = y) fY(y) dy      for Y continuous.
When Y is a DRV, P(E) = Σ_{y∈SY} P(E|Y = y) P(Y = y) can be concluded from the Theorem 1.16. Of course, the result for a CRV cannot be obtained from the Theorem 1.16.
Example 3.23. Let X and Y be independent CRVs having PDFs fX and fY, respectively. Then

P(X < Y) = ∫_{−∞}^{∞} P(X < Y|Y = y) fY(y) dy
         = ∫_{−∞}^{∞} P(X < y|Y = y) fY(y) dy
         = ∫_{−∞}^{∞} P(X < y) fY(y) dy
         = ∫_{−∞}^{∞} FX(y) fY(y) dy,

where FX(·) is the CDF corresponding to fX(·). ||
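As an illustration with a hypothetical choice of distributions, the last formula can be evaluated numerically for X ∼ Exp(1) and Y ∼ Exp(2), where the exact answer is P(X < Y) = 1/3, and compared with a direct simulation.

import numpy as np
from scipy import integrate, stats

# P(X < Y) = integral of F_X(y) f_Y(y) dy with X ~ Exp(1), Y ~ Exp(2)
val, _ = integrate.quad(lambda y: stats.expon(scale=1.0).cdf(y) *
                                  stats.expon(scale=0.5).pdf(y), 0.0, np.inf)

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=200_000)   # Exp(1)
y = rng.exponential(scale=0.5, size=200_000)   # Exp(2) has scale 1/2
print(val, (x < y).mean())                     # both close to 1/3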


Example 3.24. Let X and Y be i.i.d. CRVs having common PDF f(·) and CDF F(·). Then, using the last example,

P(X < Y) = ∫_{−∞}^{∞} F(y) f(y) dy = 1/2.

Now, as X and Y are i.i.d., P(Y < X) = P(X > Y) = 1/2. Thus, P(X = Y) = 1 − P(X < Y) − P(X > Y) = 0. ||

Example 3.25. Suppose X and Y are two independent RVs, either discrete or continuous. Let us study the RV Z = X + Y and try to see whether it is a CRV or a DRV.

We know that (X, Y) is a discrete random vector if X and Y are DRVs, and hence, Z = X + Y is a DRV. The PMF of Z, for z ∈ R, is

fZ(z) = P(X + Y = z)
      = Σ_{y∈SY} P(X + Y = z|Y = y) P(Y = y)
      = Σ_{y∈SY} P(X + y = z|Y = y) fY(y)
      = Σ_{y∈SY} P(X = z − y) fY(y)
      = Σ_{y∈SY} fX(z − y) fY(y).

Now, assume that X and Y are CRVs. Let us first find the CDF of Z and then check what type of RV Z is. The CDF of Z, for z ∈ R, is

FZ(z) = P(X + Y ≤ z)
      = ∫_{−∞}^{∞} P(X + Y ≤ z|Y = y) fY(y) dy
      = ∫_{−∞}^{∞} P(X + y ≤ z|Y = y) fY(y) dy
      = ∫_{−∞}^{∞} P(X ≤ z − y) fY(y) dy, as X and Y are independent
      = ∫_{−∞}^{∞} ∫_{−∞}^{z−y} fX(x) fY(y) dx dy
      = ∫_{−∞}^{∞} ∫_{−∞}^{z} fX(x′ − y) fY(y) dx′ dy, taking x′ = x + y
      = ∫_{−∞}^{z} ( ∫_{−∞}^{∞} fX(x′ − y) fY(y) dy ) dx′ for all z ∈ R.

Thus, X + Y is a CRV with PDF

fX+Y(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy for all z ∈ R.

Note that, interchanging the roles of X and Y, we can write the PDF of Z as

fX+Y(z) = ∫_{−∞}^{∞} fY(z − x) fX(x) dx for all z ∈ R.

Now, assume that X is a CRV and Y is a DRV. Then the CDF of Z = X + Y is

FZ(z) = P(X + Y ≤ z)
      = Σ_{y∈SY} P(X + Y ≤ z|Y = y) fY(y)
      = Σ_{y∈SY} ∫_{−∞}^{z−y} fX(x) fY(y) dx
      = Σ_{y∈SY} ∫_{−∞}^{z} fX(x′ − y) fY(y) dx′, taking x′ = x + y
      = ∫_{−∞}^{z} { Σ_{y∈SY} fX(x′ − y) fY(y) } dx′ for all z ∈ R.

Therefore, X + Y is a CRV with PDF

fX+Y(z) = Σ_{y∈SY} fX(z − y) fY(y) for all z ∈ R.

To summarize, if X and Y are independent, then the RV X + Y is continuous if at least one of X or Y is a CRV. If both of them are DRVs, then X + Y is a DRV. ||
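A short sketch of the mixed case (with simple, assumed choices): take X ∼ N(0, 1) and Y ∈ {0, 1} with P(Y = 0) = P(Y = 1) = 1/2, so that the last formula gives fX+Y(z) = φ(z)/2 + φ(z − 1)/2; the code compares this with an empirical density from simulation.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.standard_normal(200_000)         # X, continuous
y = rng.integers(0, 2, size=200_000)     # Y, discrete on {0, 1}
z = x + y

grid = np.linspace(-3, 4, 8)
formula = 0.5 * stats.norm.pdf(grid) + 0.5 * stats.norm.pdf(grid - 1)

hist, edges = np.histogram(z, bins=200, range=(-4, 5), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
empirical = np.interp(grid, centers, hist)

print(np.round(formula, 3))
print(np.round(empirical, 3))   # close to the formula values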
Definition 3.24 (Conditional Expectation given an Event). Let (X, Y) be a random vector and A a set with P((X, Y) ∈ A) > 0. Then

E(h(X, Y)|(X, Y) ∈ A) = E(h(X, Y) IA(X, Y)) / P((X, Y) ∈ A).
Example 3.26. Let X ∼ Exp(1). Then

E(X|X ≥ 2) = E(X I_{[2,∞)}(X)) / P(X ≥ 2) = (∫_0^∞ x I_{[2,∞)}(x) e^{−x} dx) / (∫_2^∞ e^{−x} dx) = e^2 ∫_2^∞ x e^{−x} dx = 3.

||
Example 3.27. Let (X, Y) be uniform on the unit square. Then

E(X|X + Y > 1) = E(X I_{(1,∞)}(X + Y)) / P(X + Y > 1) = (∫_0^1 ∫_{1−x}^1 x dy dx) / (∫_0^1 ∫_{1−y}^1 dx dy) = (1/3)/(1/2) = 2/3.

||
Example 3.28. A rod of length l is broken into two parts at a point chosen uniformly at random. Then the expected length of the shorter part is E(X | X < l/2), where X ∼ U(0, l). Thus, the expected length of the shorter part is

E(X | X < l/2) = E(X I_{(−∞, l/2)}(X)) / P(X < l/2) = ((1/l) ∫_0^{l/2} x dx) / (1/2) = l/4.

An alternative formulation is as follows: the required quantity is E[min{X, l − X}]. Please calculate and check whether you get the same value. ||
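Both formulations are easy to check by simulation. The sketch below (not from the notes) takes l = 1, so both answers should be close to 0.25.

import numpy as np

rng = np.random.default_rng(5)
l = 1.0
x = rng.uniform(0.0, l, size=500_000)      # break point, U(0, l)

print(x[x < l / 2].mean())                 # E(X | X < l/2)
print(np.minimum(x, l - x).mean())         # E[min{X, l - X}]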

3.10 Bivariate Normal Distribution
Recall that a CRV X is said to have a univariate normal distribution if the PDF of X is given by

f(x) = (1/(σ√(2π))) e^{−(1/2)((x − µ)/σ)^2} for all x ∈ R,

where µ ∈ R and σ > 0. In this case, X ∼ N(µ, σ^2) is used to denote that the RV X follows a normal distribution with parameters µ and σ^2. Note that if X ∼ N(µ, σ^2), then all moments of X exist. In particular, E(X) and Var(X) exist, and they are given by E(X) = µ and Var(X) = σ^2. This means that a normal distribution is completely specified by its mean and variance.
 
Definition 3.25 (Bivariate Normal). A two-dimensional random vector X = (X1, X2)′ is said to have a bivariate normal distribution if aX1 + bX2 is univariate normal for all (a, b) ∈ R^2 \ {(0, 0)}.

Theorem 3.26. If X has a bivariate normal distribution, then each of X1 and X2 is univariate normal. Hence, E(X1), E(X2), Var(X1), Var(X2), and Cov(X1, X2) exist.

Proof: Taking a = 1 and b = 0, aX1 + bX2 = X1 follows a normal distribution. Similarly, X2 follows a normal distribution. As all moments of a normal RV exist, E(X1), E(X2), Var(X1), and Var(X2) exist. As |Cov(X1, X2)| ≤ √(Var(X1) Var(X2)) (see the proof of the Theorem 3.13), Cov(X1, X2) exists.
   
Let us denote µ = E(X) = (µ1, µ2)′ and

Σ = Var(X) = [ σ11  σ12 ]
             [ σ21  σ22 ],

where µ1 = E(X1), µ2 = E(X2), σ11 = Var(X1), σ22 = Var(X2), and σ12 = σ21 = Cov(X1, X2).

Theorem 3.27. Let X be a bivariate normal random vector. If µ = E(X) and Σ = Var(X), then for any fixed u = (a, b)′ ∈ R^2 \ {(0, 0)},

u′X ∼ N(u′µ, u′Σu).

Proof: As u′X = aX1 + bX2, u′X follows a univariate normal distribution. Now,

E(u′X) = aµ1 + bµ2 = u′µ

and

Var(u′X) = a^2 σ11 + b^2 σ22 + 2ab σ12 = u′Σu.

Thus, u′X ∼ N(u′µ, u′Σu).

Theorem 3.28. Let X be a bivariate normal random vector with µ = E(X) and Σ = Var(X). Then the MGF of X is given by

MX(t) = e^{t′µ + (1/2) t′Σt}

for all t ∈ R^2.

Proof: The JMGF of X is

MX(t) = E(e^{t′X}) = M_{t′X}(1).     (3.7)

As X has a bivariate normal distribution, t′X ∼ N(t′µ, t′Σt). Now, using Example 2.42, the theorem is immediate.

The Theorem 3.28 shows that the bivariate normal distribution is completely specified
by the mean vector µ and the variance-covariance matrix Σ. We will use the notation
X ∼ N2 (µ, Σ) to denote that the random vector X follows a bivariate normal distribution
with mean vector µ and variance-covariance matrix Σ.

Theorem 3.29. If X ∼ N2 (µ, Σ), then X1 ∼ N (µ1 , σ11 ) and X2 ∼ N (µ2 , σ22 ).

Proof: The proof of the theorem is immediate from Theorem 3.27.

The converse of the Theorem 3.29 is not true in general. Consider the following example
in this regard.

Example 3.29. Let X ∼ N(0, 1). Let Z be a DRV, which is independent of X and

P(Z = 1) = 0.5 = P(Z = −1).

Then Y = ZX ∼ N(0, 1). To see it, notice that for all y ∈ R,

P(Y ≤ y) = P(ZX ≤ y)
         = P(ZX ≤ y|Z = 1) P(Z = 1) + P(ZX ≤ y|Z = −1) P(Z = −1)
         = (1/2) P(X ≤ y) + (1/2) P(X ≥ −y)
         = Φ(y).

Thus, X ∼ N(0, 1) and Y ∼ N(0, 1). However, (X, Y) is not a bivariate normal random vector. To see it, observe that

P(X + Y = 0) = P(X + ZX = 0) = P(Z = −1) = 1/2.

This means that X + Y does not follow a univariate normal distribution, and hence, (X, Y) is not a bivariate normal random vector, though X ∼ N(0, 1) and Y ∼ N(0, 1). ||
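A small simulation of this example (sample size chosen arbitrarily): Y is indistinguishable from N(0, 1), yet X + Y equals zero for about half of the draws, so (X, Y) cannot be bivariate normal.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.standard_normal(200_000)
z = rng.choice([-1.0, 1.0], size=200_000)
y = z * x                                  # Y = ZX

print(stats.kstest(y, "norm").pvalue)      # Y passes a test of standard normality
print(np.mean(x + y == 0.0))               # about 0.5: X + Y has an atom at 0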

Theorem 3.30. If X ∼ N2(µ, Σ) and Cov(X1, X2) = 0, then X1 and X2 are independent.

Proof: In this case, Σ = diag(σ11, σ22). Hence, the JMGF of (X1, X2) is

MX1,X2(t1, t2) = e^{t1 µ1 + (1/2) σ11 t1^2} × e^{t2 µ2 + (1/2) σ22 t2^2} = MX1(t1) MX2(t2),

where MXi(·) is the MGF of Xi, i = 1, 2. This shows that X1 and X2 are independent.

Note that we have discussed that two random variables can be dependent even if the covariance between them is zero. The bivariate normal random vector is special in this respect.

Theorem 3.31 (Probability Density Function). Let X ∼ N2(µ, Σ) be such that Σ is invertible. Then, for all x ∈ R^2, X has a joint PDF given by

f(x) = (1/(2π|Σ|^{1/2})) exp{ −(1/2)(x − µ)′Σ^{−1}(x − µ) }
     = (1/(2πσ1σ2√(1 − ρ^2))) exp{ −[((x1 − µ1)/σ1)^2 − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)^2] / (2(1 − ρ^2)) },

where σ1 = √σ11, σ2 = √σ22, and ρ is the correlation coefficient between X1 and X2.
Proof: The proof of this theorem is out of the scope of this course.
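As a numerical sanity check (with assumed parameter values), the matrix form and the explicit (σ1, σ2, ρ) form of the density can be evaluated at a point and compared, here also against scipy's multivariate normal.

import numpy as np
from scipy import stats

mu = np.array([1.0, -2.0])
s1, s2, rho = 2.0, 1.5, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])

x = np.array([0.5, -1.0])
d = x - mu

# matrix form of the bivariate normal PDF
matrix_form = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

# explicit (sigma1, sigma2, rho) form
q = ((d[0] / s1) ** 2 - 2 * rho * (d[0] / s1) * (d[1] / s2) + (d[1] / s2) ** 2) / (1 - rho**2)
explicit_form = np.exp(-0.5 * q) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

print(matrix_form, explicit_form, stats.multivariate_normal(mu, Sigma).pdf(x))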
Theorem 3.32 (Conditional Probability Density Function). Let X ∼ N2(µ, Σ) be such that Σ is invertible. Then, for all x2 ∈ R, the conditional PDF of X1 given X2 = x2 is given by

fX1|X2(x1|x2) = (1/(σ1|2 √(2π))) exp[ −(1/2)((x1 − µ1|2)/σ1|2)^2 ] for x1 ∈ R,

where µ1|2 = µ1 + ρ(σ1/σ2)(x2 − µ2) and σ1|2^2 = σ1^2 (1 − ρ^2). Thus, X1|X2 = x2 ∼ N(µ1|2, σ1|2^2).

Proof: Easy to see from the fact that

fX1|X2(x1|x2) = fX1,X2(x1, x2) / fX2(x2).

Of course, you need to perform some algebra.

Corollary 3.1. Under the conditions of the Theorem 3.32, E(X1|X2 = x2) = µ1|2 = µ1 + ρ(σ1/σ2)(x2 − µ2) and Var(X1|X2 = x2) = σ1|2^2 = σ1^2 (1 − ρ^2) for all x2 ∈ R. Hence, the conditional variance does not depend on x2.

Proof: Straightforward from the previous theorem.
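A hedged simulation of the conditional distribution (parameters are made up): keep the sampled pairs with X2 close to a fixed x2 and compare the conditional mean and variance of X1 with µ1|2 and σ1|2^2.

import numpy as np

rng = np.random.default_rng(7)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 1.5, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])

X = rng.multivariate_normal([mu1, mu2], Sigma, size=2_000_000)

x2 = -1.0
sel = X[np.abs(X[:, 1] - x2) < 0.02, 0]                  # X1 values with X2 close to x2
print(sel.mean(), mu1 + rho * (s1 / s2) * (x2 - mu2))    # conditional mean mu_{1|2}
print(sel.var(), s1**2 * (1 - rho**2))                   # conditional variance sigma^2_{1|2}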

3.11 Some Results on Independent and Identically Distributed Normal RVs
Theorem 3.33. Let X1, X2, . . . , Xn be i.i.d. N(0, 1) random variables. Then

Σ_{i=1}^{n} Xi^2 ∼ Gamma(n/2, 1/2) ≡ χ²_n.

Proof: The MGF of X1^2 is given by

M_{X1^2}(t) = E(e^{tX1^2}) = (1/√(2π)) ∫_{−∞}^{∞} e^{−(1/2 − t)x^2} dx = (1 − 2t)^{−1/2},

for t < 1/2. Hence, as the Xi (and so the Xi^2) are independent, the MGF of T = Σ_{i=1}^{n} Xi^2 is

MT(t) = Π_{i=1}^{n} M_{Xi^2}(t) = (1 − 2t)^{−n/2},

where t < 1/2. Thus, T = Σ_{i=1}^{n} Xi^2 ∼ Gamma(n/2, 1/2). This distribution is also known as the χ² distribution with n degrees of freedom. Thus, the sum of squares of n i.i.d. N(0, 1) RVs has a χ² distribution with n degrees of freedom.
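A quick Monte Carlo illustration (with n = 5 chosen arbitrarily): the sum of squares of n i.i.d. N(0, 1) variables should match the χ² distribution with n degrees of freedom in mean, variance, and a goodness-of-fit test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 5
T = (rng.standard_normal((200_000, n)) ** 2).sum(axis=1)   # sum of n squared N(0,1) draws

print(T.mean(), n)                                  # chi-square mean is n
print(T.var(), 2 * n)                               # chi-square variance is 2n
print(stats.kstest(T, "chi2", args=(n,)).pvalue)    # p-value should not be tiny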

Theorem 3.34. Let X1, X2, . . . , Xn be i.i.d. N(µ, σ^2) random variables. Let

X̄ = (1/n) Σ_{i=1}^{n} Xi and S^2 = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)^2.

Then X̄ and S^2 are independently distributed, and

X̄ ∼ N(µ, σ^2/n) and (n − 1)S^2/σ^2 ∼ χ²_{n−1}.

Proof: Let A be an n × n orthogonal matrix whose first row is

(1/√n, 1/√n, . . . , 1/√n).

Note that such a matrix exists, as we can start with this row, extend it to a basis of R^n, and apply Gram–Schmidt orthogonalization to get the required matrix. As A is orthogonal, its inverse exists and A^{−1} = A′, the transpose of A. Now consider the transformation of the random vector X = (X1, X2, . . . , Xn)′ given by

Y = AX.

First, we shall find the distribution of Y. Note that the transformation g(x) = Ax is one-to-one as A is invertible. The inverse transformation is given by x = A′y. Hence, the Jacobian of the inverse transformation is J = det(A′) = det(A). As A is orthogonal, the absolute value of det(A) is one. Now, as X1, X2, . . . , Xn are i.i.d. N(µ, σ^2) RVs, the JPDF of X, for x = (x1, x2, . . . , xn)′ ∈ R^n, is

fX(x) = (1/(σ√(2π)))^n exp[ −(1/(2σ^2)) Σ_{i=1}^{n} (xi − µ)^2 ] = (1/(σ√(2π)))^n exp[ −(1/(2σ^2)) (x − µ)′(x − µ) ],

where µ = (µ, µ, . . . , µ)′ is an n-component vector. Thus, the JPDF of Y, for y ∈ R^n, is

fY(y) = fX(A′y) = (1/(σ√(2π)))^n exp[ −(1/(2σ^2)) (A′y − µ)′(A′y − µ) ] = (1/(σ√(2π)))^n exp[ −(1/(2σ^2)) (y − η)′(y − η) ],

where η = (η1, η2, . . . , ηn)′ = Aµ. Note that η1 = √n µ. Moreover,

η′η = µ′µ ⟹ Σ_{i=1}^{n} ηi^2 = nµ^2 ⟹ Σ_{i=2}^{n} ηi^2 = nµ^2 − η1^2 = 0.

Thus, ηi = 0 for i = 2, 3, . . . , n. Hence, the JPDF of Y is

fY(y) = (1/(σ√(2π))) e^{−(y1 − √n µ)^2/(2σ^2)} Π_{i=2}^{n} (1/(σ√(2π))) e^{−yi^2/(2σ^2)} for y = (y1, y2, . . . , yn)′ ∈ R^n.

Therefore, Y1, Y2, . . . , Yn are independent RVs with Y1 ∼ N(√n µ, σ^2) and Yi ∼ N(0, σ^2) for i = 2, 3, . . . , n, where Y = (Y1, Y2, . . . , Yn)′. Now,

Y1 = √n X̄ ⟹ √n X̄ ∼ N(√n µ, σ^2) ⟹ X̄ ∼ N(µ, σ^2/n).

Again,

Y′Y = X′X ⟹ Σ_{i=2}^{n} Yi^2 = Σ_{i=1}^{n} Xi^2 − Y1^2 = Σ_{i=1}^{n} Xi^2 − n X̄^2 = (n − 1)S^2.

For i = 2, 3, . . . , n, the Yi/σ are i.i.d. N(0, 1) RVs. Thus, using the previous theorem,

(n − 1)S^2/σ^2 = Σ_{i=2}^{n} (Yi/σ)^2 ∼ χ²_{n−1}.

Notice that X̄ is a function of Y1 only, and S^2 is a function of Y2, Y3, . . . , Yn. As the Yi's are independent, X̄ and S^2 are independent.
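A minimal simulation sketch of this theorem (µ, σ, and n below are arbitrary): the sample mean should behave like N(µ, σ^2/n), the scaled sample variance like χ²_{n−1}, and the correlation between the two should be near zero.

import numpy as np

rng = np.random.default_rng(9)
mu, sigma, n = 3.0, 2.0, 8
X = rng.normal(mu, sigma, size=(200_000, n))   # 200,000 samples of size n

xbar = X.mean(axis=1)                          # sample means
s2 = X.var(axis=1, ddof=1)                     # sample variances

print(xbar.mean(), mu, xbar.var(), sigma**2 / n)    # Xbar ~ N(mu, sigma^2/n)
print(((n - 1) * s2 / sigma**2).mean(), n - 1)      # chi-square mean is n - 1
print(np.corrcoef(xbar, s2)[0, 1])                  # near 0: Xbar and S^2 uncorrelated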
