
CHAPTER 7

SUM OF INDEPENDENT RANDOM VARIABLES

CHAPTER CONTENTS
Convolution
Reproductive Property
Law of Large Numbers
Central Limit Theorem

In this chapter, the behavior of the sum of independent random variables is first
investigated. Then the limiting behavior of the mean of independent and identically
distributed (i.i.d.) samples when the number of samples tends to infinity is discussed.

7.1 CONVOLUTION
Let x and y be independent discrete random variables, and let z be their sum:

z = x + y.

Since z = x + y is satisfied when y = z − x, the probability of z can be computed by
summing the joint probability of x and z − x over all x. For example, let z be the sum of
the outcomes of two 6-sided dice, x and y. When z = 7, the dice take

(x, y) = (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1),

and summing up the probabilities of these combinations gives the probability of z = 7.
The probability mass function of z, denoted by k(z), can be expressed as

k(z) = Σ_x g(x) h(z − x),

where g(x) and h(y) are the probability mass functions of x and y, respectively. This
operation is called the convolution of x and y and denoted by x ∗ y. When x and y are
continuous, the probability density function of z = x + y, denoted by k(z), is given
similarly as

k(z) = ∫ g(x) h(z − x) dx,

where g(x) and h(y) are the probability density functions of x and y, respectively.
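As a concrete sketch of the discrete case (not from the book; it assumes NumPy is available), the dice example above can be computed by convolving the two probability mass functions with numpy.convolve:

    import numpy as np

    # Probability mass function of a single fair 6-sided die on outcomes 1..6.
    g = np.full(6, 1 / 6)

    # Convolving the two PMFs gives the PMF of the sum z = x + y on outcomes 2..12.
    k = np.convolve(g, g)
    outcomes = np.arange(2, 13)

    for z, p in zip(outcomes, k):
        print(f"P(z = {z:2d}) = {p:.4f}")

    # The six combinations (1,6), ..., (6,1) give P(z = 7) = 6/36.
    print(np.isclose(k[outcomes == 7], 6 / 36))  # [ True]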


7.2 REPRODUCTIVE PROPERTY


When the convolution of two probability distributions in the same family again yields
a probability distribution in the same family, that family of probability distributions is
said to be reproductive. For example, the normal distribution is reproductive, i.e., the
convolution of normal distributions N(µ_x, σ_x²) and N(µ_y, σ_y²) yields N(µ_x + µ_y, σ_x² + σ_y²).
When x and y are independent, the moment-generating function of their sum,
x + y, agrees with the product of their moment-generating functions:

M_{x+y}(t) = M_x(t) M_y(t).
Let x and y follow N(µ_x, σ_x²) and N(µ_y, σ_y²), respectively. As shown in Eq. (4.1), the
moment-generating function of normal distribution N(µ_x, σ_x²) is given by

M_x(t) = exp(µ_x t + σ_x² t²/2).

Thus, the moment-generating function of the sum, M_{x+y}(t), is given by

M_{x+y}(t) = M_x(t) M_y(t)
           = exp(µ_x t + σ_x² t²/2) exp(µ_y t + σ_y² t²/2)
           = exp((µ_x + µ_y) t + (σ_x² + σ_y²) t²/2).

Since this is the moment-generating function of N(µ_x + µ_y, σ_x² + σ_y²), the reproductive
property of normal distributions is proved.
Similarly, computing the moment-generating function M_{x+y}(t) for independent
random variables x and y proves the reproductive properties of the binomial,
Poisson, negative binomial, gamma, and chi-squared distributions (see Table 7.1).
The Cauchy distribution does not have a moment-generating function, but computation
of the characteristic function φ_x(t) = M_{ix}(t) (see Section 2.4.3) shows
that the convolution of Ca(a_x, b_x) and Ca(a_y, b_y) yields Ca(a_x + a_y, b_x + b_y).
On the other hand, the geometric distribution Ge(p) (which is equivalent to the
negative binomial distribution NB(1, p)) and the exponential distribution Exp(λ) (which is
equivalent to the gamma distribution Ga(1, λ)) are not reproductive: the sum of two
independent Ge(p) samples follows NB(2, p) and the sum of two independent Exp(λ)
samples follows Ga(2, λ), which lie outside the geometric and exponential families.
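The following Monte Carlo sketch (my own illustration, assuming NumPy is available) checks one row of Table 7.1 empirically: the sum of independent Po(2) and Po(3) samples is distributed like Po(5). Comparing the empirical and theoretical probability mass functions is only a sanity check, not a proof:

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Sum of independent Po(2) and Po(3) samples; Table 7.1 says this should be Po(5).
    z = rng.poisson(2, n) + rng.poisson(3, n)

    emp = np.bincount(z, minlength=15)[:15] / n            # empirical P(z = k), k = 0..14
    po5 = np.array([math.exp(-5) * 5**k / math.factorial(k) for k in range(15)])

    print(np.abs(emp - po5).max())   # small (Monte Carlo error only)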

7.3 LAW OF LARGE NUMBERS


Let x_1, . . . , x_n be random variables and f(x_1, . . . , x_n) be their joint probability
mass/density function. If f(x_1, . . . , x_n) can be represented by using a probability
mass/density function g(x) as

f(x_1, . . . , x_n) = g(x_1) × · · · × g(x_n),

Table 7.1 Convolution

Distribution         x               y               x + y
Normal               N(µ_x, σ_x²)    N(µ_y, σ_y²)    N(µ_x + µ_y, σ_x² + σ_y²)
Binomial             Bi(n_x, p)      Bi(n_y, p)      Bi(n_x + n_y, p)
Poisson              Po(λ_x)         Po(λ_y)         Po(λ_x + λ_y)
Negative binomial    NB(k_x, p)      NB(k_y, p)      NB(k_x + k_y, p)
Gamma                Ga(α_x, λ)      Ga(α_y, λ)      Ga(α_x + α_y, λ)
Chi-squared          χ²(n_x)         χ²(n_y)         χ²(n_x + n_y)
Cauchy               Ca(a_x, b_x)    Ca(a_y, b_y)    Ca(a_x + a_y, b_x + b_y)

x_1, . . . , x_n are mutually independent and follow the same probability distribution.
Such x_1, . . . , x_n are said to be i.i.d. with probability density/mass function g(x) and
denoted by

x_1, . . . , x_n ∼ g(x)   (i.i.d.).

When x_1, . . . , x_n are i.i.d. random variables having expectation µ and variance
σ², the sample mean (Fig. 7.1),

x̄ = (1/n) Σ_{i=1}^n x_i,

satisfies

E[x̄] = (1/n) Σ_{i=1}^n E[x_i] = µ,

V[x̄] = (1/n²) Σ_{i=1}^n V[x_i] = σ²/n.

This means that the average of n samples has the same expectation as the original
single sample, while the variance is reduced by a factor of 1/n. Thus, as the number
of samples tends to infinity, the variance vanishes and the sample average x̄
converges to the true expectation µ.
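A quick numerical check of these two identities (my own sketch, assuming NumPy is available): draw many samples of size n from N(µ, σ²) and inspect the mean and variance of the resulting sample means:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 3.0, 2.0, 50, 200_000

    # Draw `trials` independent samples of size n and compute each sample mean.
    xbar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

    print(xbar.mean())   # close to mu = 3
    print(xbar.var())    # close to sigma^2 / n = 4 / 50 = 0.08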
The weak law of large numbers asserts this fact more precisely. When the original
distribution has expectation µ, the characteristic function φ_x̄(t) of the average of
independent samples can be expressed by using the characteristic function φ_x(t) of a
single sample x as

φ_x̄(t) = (φ_x(t/n))^n = (1 + iµ t/n + · · ·)^n.

The mean of samples x_1, . . . , x_n usually refers to the arithmetic mean, but other means
such as the geometric mean and the harmonic mean are also often used:

Arithmetic mean:  (1/n) Σ_{i=1}^n x_i,

Geometric mean:   (∏_{i=1}^n x_i)^{1/n},

Harmonic mean:    1 / ((1/n) Σ_{i=1}^n (1/x_i)).

For example, suppose that a weight increased by 2%, 12%, and 4% in the last three
years, respectively. Then the average increase rate is not given by the arithmetic mean
(0.02 + 0.12 + 0.04)/3 = 0.06, but by the geometric mean
(1.02 × 1.12 × 1.04)^{1/3} ≈ 1.0591. When climbing up a mountain at 2 kilometers per
hour and going back at 6 kilometers per hour, the mean velocity is not given by the
arithmetic mean (2 + 6)/2 = 4 but by the harmonic mean 2d/(d/2 + d/6) = 3 for distance
d, according to the formula “velocity = distance/time.” When x_1, . . . , x_n > 0, the
arithmetic, geometric, and harmonic means satisfy

(1/n) Σ_{i=1}^n x_i  ≥  (∏_{i=1}^n x_i)^{1/n}  ≥  1 / ((1/n) Σ_{i=1}^n (1/x_i)),

and the equality holds if and only if x_1 = · · · = x_n. The generalized mean is defined
for p ≠ 0 as

((1/n) Σ_{i=1}^n x_i^p)^{1/p}.

The generalized mean is reduced to the arithmetic mean when p = 1, the geometric
mean when p → 0, and the harmonic mean when p = −1. The maximum of x_1, . . . , x_n
is given when p → +∞, and the minimum of x_1, . . . , x_n is given when p → −∞.
When p = 2, it is called the root mean square.

FIGURE 7.1
Arithmetic mean, geometric mean, and harmonic mean.
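The generalized mean from the box above can be written as a small function; the sketch below (my own, plain Python) uses the logarithmic form for the p → 0 limit and reproduces the three means of the weight example:

    import math

    def generalized_mean(xs, p):
        """Generalized mean of positive numbers xs; p = 0 is the geometric-mean limit."""
        if p == 0:
            return math.exp(sum(math.log(x) for x in xs) / len(xs))
        return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

    xs = [1.02, 1.12, 1.04]
    print(generalized_mean(xs, 1))    # arithmetic mean = 1.06
    print(generalized_mean(xs, 0))    # geometric mean ≈ 1.0591
    print(generalized_mean(xs, -1))   # harmonic mean (smallest of the three)
    print(generalized_mean(xs, 2))    # root mean square (largest of the four here)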

Then Eq. (3.5) shows that the limit n → ∞ of the above equation yields

lim_{n→∞} φ_x̄(t) = e^{iµt}.

FIGURE 7.2
Law of large numbers. (a) Standard normal distribution N(0, 1). (b) Standard Cauchy distribution Ca(0, 1).

Since e^{iµt} is the characteristic function of the constant µ,

lim_{n→∞} Pr(|x̄ − µ| < ε) = 1

holds for any ε > 0. This is the weak law of large numbers, and x̄ is said to
converge in probability to µ. If the original distribution also has a finite variance, the proof
is straightforward: consider the limit n → ∞ of Chebyshev’s inequality (8.4)
(see Section 8.2.2).
On the other hand, the strong law of large numbers asserts

Pr( lim_{n→∞} x̄ = µ ) = 1,

and x̄ is said to converge almost surely to µ. Almost sure convergence is a more
direct and stronger notion than convergence in probability.
Fig. 7.2 exhibits the behavior of the sample average x̄ = (1/n) Σ_{i=1}^n x_i when
x_1, . . . , x_n are i.i.d. with the standard normal distribution N(0, 1) or the standard
Cauchy distribution Ca(0, 1). The graphs show that, for the normal distribution, which
possesses an expectation, increasing n makes the sample average x̄ converge to the
true expectation 0. On the other hand, for the Cauchy distribution, which does not
have an expectation, the sample average x̄ does not converge even as n is increased.
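The behavior shown in Fig. 7.2 is easy to reproduce; the sketch below (my own, assuming NumPy and Matplotlib are available) plots the running sample averages of standard normal and standard Cauchy samples:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    n = 10_000
    idx = np.arange(1, n + 1)

    # Running sample averages x̄_1, x̄_2, ..., x̄_n for the two distributions.
    normal_avg = np.cumsum(rng.standard_normal(n)) / idx
    cauchy_avg = np.cumsum(rng.standard_cauchy(n)) / idx

    plt.plot(idx, normal_avg, label="N(0, 1): converges to 0")
    plt.plot(idx, cauchy_avg, label="Ca(0, 1): does not converge")
    plt.xlabel("number of samples n")
    plt.ylabel("sample average")
    plt.legend()
    plt.show()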

7.4 CENTRAL LIMIT THEOREM


As explained in Section 7.2, the average of independent normal samples follows the
normal distribution. If the samples follow other distributions, which distribution does

FIGURE 7.3
Central limit theorem. The solid lines denote the normal densities. (a) Continuous uniform distribution U(0, 1). (b) Exponential distribution Exp(1). (c) Distribution used in Fig. 19.11.

the sample average follow? Fig. 7.3 exhibits the histograms of the sample averages
for the continuous uniform distribution U(0, 1), the exponential distribution Exp(1),
and the probability distribution used in Fig. 19.11, together with the normal densities
with the same expectation and variance. This shows that the histogram of the sample
average approaches the normal density as the number of samples, n, increases.
The central limit theorem asserts this fact more precisely: for the standardized random
variable

z = (x̄ − µ) / (σ/√n),

the following property holds:

lim_{n→∞} Pr(a ≤ z ≤ b) = ∫_a^b (1/√(2π)) e^{−x²/2} dx.
Since the right-hand side is the probability density function of the standard normal
distribution integrated from a to b, z is shown to follow the standard normal
distribution in the limit n → ∞. In this case, z is said to converge in law or
converge in distribution to the standard normal distribution. More informally, z is
said to asymptotically follow the normal distribution, or to possess asymptotic normality.
Intuitively, the central limit theorem shows that, for any distribution, as long as it has
expectation µ and variance σ², the sample average x̄ approximately follows the
normal distribution with expectation µ and variance σ²/n when n is large.
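In the spirit of Fig. 7.3, the following sketch (my own, assuming NumPy and Matplotlib are available) compares the histogram of the standardized sample mean z of Exp(1) samples with the standard normal density:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    n, trials = 30, 100_000
    mu, sigma = 1.0, 1.0                      # expectation and standard deviation of Exp(1)

    # Standardized sample means of n i.i.d. Exp(1) samples.
    xbar = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))

    grid = np.linspace(-4, 4, 200)
    plt.hist(z, bins=80, density=True, alpha=0.5, label="standardized sample mean")
    plt.plot(grid, np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi), label="N(0, 1) density")
    plt.legend()
    plt.show()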
Let us prove the central limit theorem by showing that the moment-generating
function of

z = (x̄ − µ) / (σ/√n)

is given by the moment-generating function of the standard normal distribution, e^{t²/2}.
Let

y_i = (x_i − µ) / σ

and express z as

z = (1/√n) Σ_{i=1}^n (x_i − µ)/σ = (1/√n) Σ_{i=1}^n y_i.

Since y_i has expectation 0 and variance 1, the moment-generating function of y_i is
given by

M_{y_i}(t) = 1 + t²/2 + · · · .

This implies that the moment-generating function of z is given by

M_z(t) = (M_{y_i/√n}(t))^n = (M_{y_i}(t/√n))^n = (1 + t²/(2n) + · · ·)^n.

If the limit n → ∞ of the above equation is considered, Eq. (3.5) yields

lim_{n→∞} M_z(t) = e^{t²/2},

which means that z follows the standard normal distribution.
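As a small numeric illustration of this last limit (my own sketch, plain Python), (1 + t²/(2n))^n indeed approaches e^{t²/2} as n grows:

    import math

    t = 1.5
    target = math.exp(t**2 / 2)
    for n in (10, 100, 1000, 10_000):
        approx = (1 + t**2 / (2 * n)) ** n
        print(n, approx, target)   # approx tends to target ≈ 3.0802 as n grows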
