Mathematical Foundations of Computer Science

The document discusses sampling and estimation. It defines key concepts like population, sample, random sample, statistics, and sampling distributions. It provides examples of common statistics like the mean, median, mode, range, and variance. It also discusses how the sampling distribution of the mean approximates the normal distribution as sample size increases, and how this can be used to estimate probabilities about the true population parameters based on sample statistics.

Uploaded by

murugesh72
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
45 views

Mathematical Foundations of Computer Science

The document discusses sampling and estimation. It defines key concepts like population, sample, random sample, statistics, and sampling distributions. It provides examples of common statistics like the mean, median, mode, range, and variance. It also discusses how the sampling distribution of the mean approximates the normal distribution as sample size increases, and how this can be used to estimate probabilities about the true population parameters based on sample statistics.

Uploaded by

murugesh72
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 24

Sampling and Estimation - 134

Chapter 6. Sampling and Estimation

6.1. Introduction
Frequently the engineer is unable to completely characterize the entire population. She/he
must be satisfied with examining some subset of the population, or several subsets of the
population, in order to infer information about the entire population. Such subsets are called
samples. A population is the entirety of observations and a sample is a subset of the population.
A sample chosen so that it gives unbiased inferences about the population is a random sample;
otherwise, the sample is biased.
Statistics are given different symbols than expectation values because a statistic is only an
approximation of the corresponding expectation value. The statistic called the mean approximates the
expectation value of the mean: the statistic is the mean of the sample, while the expectation
value is the mean of the entire population. Calculating an expectation value requires
knowledge of the PDF. In practice, the motivation for calculating a statistic is precisely that one has no
knowledge of the underlying PDF.

6.2. Statistics
Any function of the random variables constituting a random sample is called a statistic.

Example 6.1.: Mean


The mean is a statistic of a random sample of size n and is defined as

\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i    (6.1)

Example 6.2.: Median


The median is a statistic of a random sample of size n, which represents the “middle” value of
the sample and, for a sampling arranged in increasing order of magnitude, is defined as


\tilde{X} = X_{(n+1)/2}  for odd n

\tilde{X} = \frac{X_{n/2} + X_{n/2+1}}{2}  for even n    (6.2)

The median of the sample {1,2,3} is 2.


The median of the sample {3,1,2} is 2.
The median of the sample {1,2,3,4} is 2.5.

Example 6.3.: Mode


The mode is a statistic of a random sample of size n, which represents the most frequently
appearing value in the sample. The mode may not exist and, if it does, it may not be unique.

The mode of the sample {2,1,2,3} is 2.


The modes of the sample {2,1,2,3,4,4} are 2 and 4 (the sample is bimodal).
The mode of the sample {1,2,3} does not exist, since each entry occurs only once.

Example 6.4.: Range


The range is a statistic of a random sample of size n, which represents the “span” of the sample
and, for a sampling arranged in increasing order of magnitude, is defined as

range(X) = X_n - X_1    (6.3)

The range of {1,2,3,4,5} is 5-1=4.

Example 6.5.: Variance


The variance is a statistic of a random sample of size n, which represents the “spread” of the
sample and is defined as

S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}{n(n-1)}    (6.4)

The reason for using (n-1) in the denominator rather than n is given later.

Example 6.6.: Standard Deviation


The standard deviation, S, is a statistic of a random sample of size n, which represents the
“spread” of the sample and is defined as the positive square root of the variance.

S = \sqrt{S^2}    (6.5)
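These sample statistics can be cross-checked with Python's standard `statistics` module (an aside, not part of the original text), which uses the same (n − 1) denominator for the sample variance. The sample values below are illustrative:

```python
import statistics

sample = [2, 1, 2, 3, 4, 4]  # an illustrative random sample, n = 6

mean = statistics.mean(sample)        # (1/n) * sum(X_i), eq. (6.1)
median = statistics.median(sample)    # "middle" value of the sorted sample, eq. (6.2)
modes = statistics.multimode(sample)  # most frequent value(s); may not be unique
rng = max(sample) - min(sample)       # range: X_n - X_1, eq. (6.3)
s2 = statistics.variance(sample)      # sample variance with (n-1) denominator, eq. (6.4)
s = statistics.stdev(sample)          # standard deviation, sqrt(S^2), eq. (6.5)

print(mean, median, modes, rng, s2, s)
```

Note that `multimode` returns a list, reflecting the fact that the mode may not be unique (this sample is bimodal, giving [2, 4]).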


6.3. Sampling Distributions


We have now stated the definitions of the statistics we are interested in. Now, we need to
know the distribution of the statistics to determine how good these sampling approximations are to
the true expectation values of the population.

Statistic 1. Mean when the variance is known: Sampling Distribution


If X is the mean of a random sample of size n taken from a population with mean µ and
variance σ2, then the limiting form of the distribution of

Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}    (6.6)

as n → ∞, is the standard normal distribution n(z;0,1). This is known as the Central Limit
Theorem. What this says is that, given a collection of random samples, each of size n, yielding a
mean \bar{X}, the distribution of \bar{X} approximates a normal distribution, and becomes exactly a normal
distribution as the sample size goes to infinity. The distribution of the underlying population does not
have to be normal. Generally, the normal approximation for \bar{X} is good if n > 30.
We provide a derivation in Appendix V proving that the distribution of the sample mean is
given by the normal distribution.

Example 6.7.: distribution of the mean, variance known


In a reactor intended to grow crystals in solution, a “seed” is used to encourage nucleation.
Individual crystals are randomly sampled from the reactor effluent, with sample size n = 10. The
population has a variance in crystal size of σ² = 1.0 µm². (We must know this from previous
research.) The sample yields a mean crystal size of x̄ = 15.0 µm. What is the likelihood that the
true population mean, µ, is actually less than 14.0 µm?

z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{15 - 14}{1/\sqrt{10}} = 3.162

P(µ < 14 ) = P(z > 3.162 )

We have the change in sign because as µ increases, z decreases.


The evaluation of the cumulative normal probability distribution can be performed several
ways. First, when the pioneers were crossing the plains in their covered wagons and they wanted
to evaluate probabilities from the normal distribution, they used tables of the cumulative normal
PDF, such as those provided in the back of any statistics textbook. These tables are also available
online. For example, Wikipedia has a table of the cumulative standard normal PDF at


http://en.wikipedia.org/wiki/Standard_normal_table

Using the table, we find

P(µ < 14 ) = P(z > 3.162 ) = 1 − P(z < 3.162 ) = 1 − 0.9992 = 0.0008

Second, we can use a modern computational tool like MATLAB to evaluate the probability.
The problem can be worked in terms of the standard normal PDF (µ = 0 and σ = 1), which for
P(µ < 14 ) = P(z > 3.162 ) = 1 − P(z < 3.162 ) is

>> p = 1 - cdf('normal',3.162,0,1)

p = 7.834478217108032e-04

Alternatively, the problem can be worked in terms of the non-standard normal PDF (with mean
µ = 15 and standard deviation σ/√n = 1/√10), which for P(µ < 14) is

>> p = cdf('normal',14,15,1/sqrt(10))

p = 7.827011290012762e-04

The difference in these results is due to the round-off in 3.162, used as an argument in the
function call for the standard normal distribution.
Based on our sampling data, the probability that the true population mean is less than 14.0 µm is
0.078%.
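For readers without MATLAB, the same two evaluations can be reproduced with the Python standard library's `statistics.NormalDist`; this is a sketch offered as an alternative to the `cdf` calls above, not part of the original text:

```python
from math import sqrt
from statistics import NormalDist

# Standard-normal form: P(z > 3.162) = 1 - Phi(3.162)
p_std = 1 - NormalDist(0, 1).cdf(3.162)

# Non-standard form: X-bar ~ Normal(mu = 15, sigma = 1/sqrt(10)); evaluate P(X-bar < 14)
p_raw = NormalDist(15, 1 / sqrt(10)).cdf(14)

print(p_std, p_raw)  # both about 7.8e-4, i.e. about 0.078%
```

As in the MATLAB transcript, the two answers differ slightly because 3.162 is a rounded argument.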

Statistic 2. difference of means when the variance is known: Sampling Distribution

It is useful to know the sampling distribution of the difference of two means when you want to
determine whether there is a significant difference between two populations. This situation applies
when you take two random samples of sizes n1 and n2 from two different populations, with means
µ1 and µ2 and variances σ1² and σ2², respectively. Then the sampling distribution of the difference
of means, \bar{X}_1 - \bar{X}_2, is approximately normal, distributed with mean

\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2

and variance

\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}


Hence,

Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}    (6.7)

is approximately a standard normal variable.

Example 6.8.: distribution of the difference of means, variances known


In a reactor intended to grow crystals, two different types of “seeds” are used to encourage
nucleation. Individual crystals are randomly sampled from the effluent of each reactor of sizes
n1 = 10 and n2 = 20 . The populations have variances in crystal size of σ 12 = 1.0 µm2 and
σ 22 = 2.0 µm2. (We must know this from previous research.) The samples yield mean crystal
sizes of X 1 = 15.0 µm and X 2 = 10.0 µm. How confident can we be that the true difference in
population means, µ1 − µ 2 , is actually 4.0 µm or greater?
Using equation (6.7) we have:

Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(15 - 10) - (4)}{\sqrt{\frac{1}{10} + \frac{2}{20}}} = 2.2361

P(µ1 − µ 2 > 4.0 ) = P(z < 2.2361)

We have the change in sign because as ∆µ increases, z decreases. The probability that
µ1 − µ2 is greater than 4.0 µm is then given by P(Z < 2.2361). How do we know that we want
P(Z < 2.2361) and not P(Z > 2.2361)? We just have to sit down and think about what the problem
physically means. Since we want the probability that µ1 − µ2 is greater than 4.0 µm, we need
to include the area due to higher values of µ1 − µ2. Higher values of µ1 − µ2 yield lower values
of Z. Therefore, we need the less-than sign.
The evaluation of the cumulative normal probability distribution can again be performed two
ways. First, using a standard normal table, we have

P(Z < 2.24 ) = 0.9875

Second, using MATLAB we have

>> p = cdf('normal',2.2361,0,1)

p = 0.987327389270190

We can be 98.73% confident that the true difference in population mean crystal sizes is at least
4.0 µm.
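The same z statistic and probability can be reproduced with the Python standard library (a sketch, offered as an alternative to the MATLAB call above):

```python
from math import sqrt
from statistics import NormalDist

n1, n2 = 10, 20
var1, var2 = 1.0, 2.0          # known population variances, um^2
xbar1, xbar2 = 15.0, 10.0      # sample means, um

# eq. (6.7) with hypothesized difference mu1 - mu2 = 4.0 um
z = ((xbar1 - xbar2) - 4.0) / sqrt(var1 / n1 + var2 / n2)

# P(mu1 - mu2 > 4.0) = P(Z < z), per the sign argument in the text
p = NormalDist(0, 1).cdf(z)

print(z, p)  # z about 2.2361, p about 0.9873
```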

Statistic 3. Mean when the variance is unknown: Sampling Distribution

Of course, usually we don’t know the population variance. In that case, we have to use some
other statistic to get a handle on the distribution of the mean.
If \bar{X} is the mean of a random sample of size n taken from a population with mean µ and
unknown variance σ², then the distribution of

T = \frac{\bar{X} - \mu}{S / \sqrt{n}}    (6.8)

is the t distribution f_T(t; v). The T-statistic has a t-distribution with v = n − 1 degrees of
freedom. The t-distribution is just another continuous PDF, like the others we learned about in the
previous section.
The t distribution is given by

f(t) = \frac{\Gamma[(v+1)/2]}{\Gamma(v/2)\sqrt{\pi v}} \left( 1 + \frac{t^2}{v} \right)^{-(v+1)/2}  for  -\infty < t < \infty
As a reminder, the t distribution is plotted again in Figure 6.1.

Figure 6.1. The t distribution as a function of the degrees of freedom (v = 5, 10, 20, 50, 100) and the normal distribution.

Example 6.9.: distribution of the mean, variance unknown


In a reactor intended to grow crystals, a “seed” is used to encourage nucleation. Individual
crystals are randomly sampled from the reactor effluent, with sample size n = 10. The population
has unknown variance in crystal size. The sample yields a mean crystal size of x̄ = 15.0 µm and a
sample variance of s² = 1.0 µm².


What is the likelihood that the true population mean, µ, is actually less than 14.0 µm?

t = \frac{\bar{x} - \mu}{s / \sqrt{n}} = \frac{15 - 14}{1/\sqrt{10}} = 3.162

P(µ < 14 ) = P(t > 3.162 )

We have the change in sign because as µ increases, t decreases. The parameter v = n-1 = 9.

The evaluation of the cumulative t probability distribution can again be performed two ways.
First, we can use a table of critical values of the t-distribution. It is crucial to note that such a table
does not provide cumulative PDFs; rather, it provides one minus the cumulative PDF. In other
words, whereas the standard normal table provides the probability less than z (the cumulative
PDF), the t-distribution table provides the probability greater than t (one minus the cumulative
PDF). We then have

P(µ < 14 ) = P(t > 3.162 ) ≈ 0.007

Second, using MATLAB we have P(µ < 14) = P(t > 3.162) = 1 − P(t < 3.162)

>> p = 1 - cdf('t',3.162,9)

p = 0.005756562560207

Based on our sampling data, the probability that the true population mean is less than 14.0 µm is
0.58%.
We should point out that this percentage is substantially greater than the corresponding
percentage when we knew the population variance (0.078%). That is because knowing the population
variance reduces our uncertainty. Approximating the population variance with the sample
variance adds to the uncertainty and results in a larger fraction of the distribution deviating
farther from the sample mean.

Example 6.10.: distribution of the mean, variance unknown


An engineer claims that the population mean yield of a batch process is 500 g/ml of raw
material. To verify this, she samples 25 batches each month. One month the sample has a mean
\bar{X} = 518 g and a standard deviation of s = 40 g. Does this sample support her claim that µ = 500 g?

The first step in solving this problem is to compute the T statistic.

T = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{518 - 500}{40 / \sqrt{25}} = 2.25

Second, using MATLAB we have P(\bar{X} \geq 518 \mid \mu = 500) = P(T > 2.25) = P(T < -2.25)

>> p = cdf('t',-2.25,24)

p = 0.016944255452754

(Or, using a table, we find that for v = 24 and T = 2.25, α ≈ 0.02.) This means there is only a
1.7% probability that a population with µ = 500 would yield a sample with \bar{X} = 518 or higher.
Therefore, it is unlikely that 500 is the population mean.
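The two one-tailed t probabilities above (Examples 6.9 and 6.10) can be cross-checked without any statistics package by numerically integrating the t PDF given earlier. The Simpson's-rule helper, its integration limit, and its step count are our own choices, not part of the original text:

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """t-distribution PDF with df degrees of freedom, as defined in the text."""
    return gamma((df + 1) / 2) / (gamma(df / 2) * sqrt(pi * df)) \
        * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(t0, df, upper=60.0, n=20000):
    """P(T > t0) by Simpson's rule on [t0, upper]; the tail beyond upper is negligible here."""
    h = (upper - t0) / n
    s = t_pdf(t0, df) + t_pdf(upper, df)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(t0 + i * h, df)
    return s * h / 3

p_ex9 = t_tail(3.162, 9)    # Example 6.9: P(t > 3.162) with v = 9
p_ex10 = t_tail(2.25, 24)   # Example 6.10: P(t < -2.25) = P(t > 2.25) by symmetry, v = 24
print(p_ex9, p_ex10)        # about 0.00576 and 0.0169
```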

Statistic 4. difference of means when the variance is unknown: Sampling Distribution


It is useful to know the sampling distribution of the difference of two means when you want to
determine whether there is a significant difference between two populations. Sometimes you want
to do this when you don’t know the population variances. This situation applies when you take two
random samples of sizes n1 and n2 from two different populations, with means µ1 and µ2 and
unknown variances. Then the sampling distribution of the difference of means, \bar{X}_1 - \bar{X}_2, follows
the t-distribution.

transformation: T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (6.9)

symmetry: t_{1-\alpha} = -t_\alpha

parameters: v = n_1 + n_2 - 2  if  \sigma_1 = \sigma_2

parameters: v = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}  if  \sigma_1 \neq \sigma_2

Since we don’t know either population variance in this case, we can’t assume they are equal
unless we are told they are equal.

Example 6.11.: distribution of the difference of means, variances unknown


In a reactor intended to grow crystals, two different types of “seeds” are used to encourage
nucleation. Individual crystals are randomly sampled from the effluent of each reactor of sizes
n1 = 10 and n2 = 20 . The populations have unknown variances in crystal size. The samples yield


mean crystal sizes of \bar{X}_1 = 15.0 µm and \bar{X}_2 = 10.0 µm and sample variances of s_1^2 = 1.0 µm² and
s_2^2 = 2.0 µm². How confident can we be that the true difference in population means, µ1 − µ2, is
4.0 µm or greater?

T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(15 - 10) - (4)}{\sqrt{\frac{1}{10} + \frac{2}{20}}} = 2.2361

The degrees of freedom parameter is given by:

v = \frac{\left( \frac{1}{10} + \frac{2}{20} \right)^2}{\frac{(1/10)^2}{10 - 1} + \frac{(2/20)^2}{20 - 1}} = \frac{0.04}{0.001637} = 24.4 \approx 24

P(\mu_1 - \mu_2 > 4.0) = P(t < 2.2361) = 1 - P(t > 2.2361)

The evaluation of the cumulative t probability distribution can again be performed two
ways. First, using a table of critical values of the t-distribution with v = 24, we have

P(\mu_1 - \mu_2 > 4.0) = P(t < 2.2361) = 1 - P(t > 2.2361) \approx 1 - 0.017 = 0.983

Second, using MATLAB we have for P(\mu_1 - \mu_2 > 4.0) = P(t < 2.2361)

>> p = cdf('t',2.2361,24)

p ≈ 0.983

We can be about 98.3% confident that the true difference in population mean crystal sizes is at least
4.0 µm.
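This example can be cross-checked in Python using only the standard library; the t-tail probability is obtained by integrating the t PDF defined earlier with Simpson's rule (the helper names, integration limit, and step counts are our own choices). Note that the Welch-Satterthwaite formula with the stated inputs gives v ≈ 24.4:

```python
from math import gamma, pi, sqrt

n1, n2 = 10, 20
s1sq, s2sq = 1.0, 2.0
a, b = s1sq / n1, s2sq / n2

# T statistic of eq. (6.9) for a hypothesized difference of 4.0 um
t_stat = ((15.0 - 10.0) - 4.0) / sqrt(a + b)

# Welch-Satterthwaite degrees of freedom
v = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

def t_pdf(t, df):
    return gamma((df + 1) / 2) / (gamma(df / 2) * sqrt(pi * df)) \
        * (1 + t * t / df) ** (-(df + 1) / 2)

def t_tail(t0, df, upper=60.0, n=20000):
    # P(T > t0) by Simpson's rule on [t0, upper]
    h = (upper - t0) / n
    s = t_pdf(t0, df) + t_pdf(upper, df)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * t_pdf(t0 + i * h, df)
    return s * h / 3

p = 1 - t_tail(t_stat, round(v))  # P(mu1 - mu2 > 4.0) = P(T < t_stat)
print(t_stat, v, p)
```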

Statistic 5. Variance: Sampling Distribution


We now wish to know the sampling distribution of the sample variance, S2. If S2 is the
variance of a random sample of size n taken from a population with mean µ and variance σ2, then
the statistic

\chi^2 = \frac{(n-1) S^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}    (6.10)


has a chi-squared distribution with v = n − 1 degrees of freedom, f_{\chi^2}(\chi^2; n-1). The chi-squared
distribution is defined as

f_{\chi^2}(x; v) = \frac{1}{2^{v/2} \Gamma(v/2)} x^{v/2-1} e^{-x/2}  for x > 0, and 0 elsewhere

It is a special case of the Gamma distribution, with α = v/2 and β = 2, where v is called the
“degrees of freedom” and is a positive integer. As a reminder, we provide a plot of the chi-squared
distribution in Figure 6.2.

Figure 6.2. The chi-squared distribution for various values of v (v = 5, 10, 20, 30, 40, 50).

Example 6.12.: distribution of the variance


In a reactor intended to grow crystals, a “seed” is used to encourage nucleation. Individual
crystals are randomly sampled from the reactor effluent, with sample size n = 10. The sample
yields a mean crystal size of x̄ = 15.0 µm and a sample variance of s² = 1.0 µm². What is the
likelihood that the true population variance, σ², is actually less than 0.5 µm²?

\chi^2 = \frac{(n-1) S^2}{\sigma^2} = \frac{(10-1) \cdot 1.0}{0.5} = 18

P(\sigma^2 < 0.5) = P(\chi^2 > 18)

We have the change in sign because as σ² increases, χ² decreases. The parameter v = n − 1 = 9.

The evaluation of the cumulative χ² probability distribution can again be performed two
ways. First, we can use a table of critical values of the χ²-distribution. It is crucial to note that
such a table does not provide cumulative PDFs; rather, it provides one minus the cumulative PDF.
We then have

P(\sigma^2 < 0.5) = P(\chi^2 > 18) \approx 0.04

Second, using MATLAB we have P(\sigma^2 < 0.5) = P(\chi^2 > 18) = 1 - P(\chi^2 < 18)
>> p = 1 - cdf('chi2',18,9)

p = 0.035173539466985

Based on our sampling data, the probability that the true variance is less than 0.5 µm2 is 3.5%.
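The tail probability P(χ² > 18) can be cross-checked by numerically integrating the chi-squared PDF defined above; the Simpson's-rule helper, integration limit, and step count below are our own choices, not part of the original text:

```python
from math import gamma, exp

def chi2_pdf(x, v):
    """Chi-squared PDF for x > 0, as defined in the text."""
    return x ** (v / 2 - 1) * exp(-x / 2) / (2 ** (v / 2) * gamma(v / 2))

def chi2_tail(x0, v, upper=200.0, n=40000):
    """P(chi^2 > x0) by Simpson's rule on [x0, upper]; the tail beyond upper is negligible."""
    h = (upper - x0) / n
    s = chi2_pdf(x0, v) + chi2_pdf(upper, v)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * chi2_pdf(x0 + i * h, v)
    return s * h / 3

chi2_stat = (10 - 1) * 1.0 / 0.5   # (n-1) S^2 / sigma^2 = 18, per eq. (6.10)
p = chi2_tail(chi2_stat, 9)        # P(chi^2 > 18) with v = 9
print(p)                           # about 0.0352
```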

Statistic 6. the ratio of 2 Variances: Sampling Distribution (F-distribution)


Just as we studied the distribution of two sample means, so too are we interested in the
distribution of two variances. In the case of the mean, it was a difference. In the case of the
variance, the ratio is more useful. Now consider taking two random samples of sizes n1 and n2
from two different populations, with variances \sigma_1^2 and \sigma_2^2, respectively. The statistic, F,

F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} = \frac{S_1^2 \sigma_2^2}{S_2^2 \sigma_1^2}    (6.11)

provides a distribution of the ratio of two variances. This distribution is called the F-distribution
with v1 = n1 − 1 and v2 = n2 − 1 degrees of freedom. The f-distribution is defined as

h_f(f; v_1, v_2) = \frac{\Gamma\left(\frac{v_1+v_2}{2}\right) (v_1/v_2)^{v_1/2}}{\Gamma(v_1/2)\,\Gamma(v_2/2)} \cdot \frac{f^{v_1/2 - 1}}{\left(1 + \frac{v_1}{v_2} f\right)^{(v_1+v_2)/2}}  for f > 0, and 0 elsewhere

As a reminder, the f-distribution is plotted in Figure 6.3.

Example 6.13.: ratio of the variances


In a reactor intended to grow crystals, two different types of “seeds” are used to encourage
nucleation. Individual crystals are randomly sampled from the effluent of each reactor of sizes
n1 = 10 and n2 = 20 . The populations have unknown variances in crystal size. The samples yield

mean crystal sizes of \bar{X}_1 = 15.0 µm and \bar{X}_2 = 10.0 µm and sample variances of s_1^2 = 1.0 µm² and
s_2^2 = 2.0 µm². What is the probability that the ratio of variances, \sigma_1^2/\sigma_2^2, is less than 0.25?

F = \frac{S_1^2 \sigma_2^2}{S_2^2 \sigma_1^2} = \frac{1}{2 \cdot 0.25} = 2

P\left( \frac{\sigma_1^2}{\sigma_2^2} < 0.25 \right) = P(F > 2)

Figure 6.3. The F distribution for various values of v1 and v2 (v1 = 5, 10; v2 = 5, 10, 20).

We have the change in sign because as \sigma_1^2/\sigma_2^2 increases, F decreases. The parameters are
v_1 = n_1 - 1 = 9 and v_2 = n_2 - 1 = 19.
The evaluation of the cumulative F probability distribution can be performed in only one
way. We cannot use tables, because there are no tables for arbitrary values of the probability;
there are only tables for two values of the probability, 0.01 and 0.05. Therefore, using MATLAB
we have P(\sigma_1^2/\sigma_2^2 < 0.25) = P(F > 2) = 1 - P(F < 2)

>> p = 1 - cdf('f',2,9,19)

p = 0.097413204997132

Based on our sampling data, the probability that the ratio of variances is less than 0.25 is 9.7%.
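The tail probability P(F > 2) can likewise be cross-checked by numerically integrating the F PDF defined above; the Simpson's-rule helper, integration limit, and step count are our own choices:

```python
from math import gamma

def f_pdf(f, v1, v2):
    """F-distribution PDF for f > 0, as defined in the text."""
    c = gamma((v1 + v2) / 2) / (gamma(v1 / 2) * gamma(v2 / 2)) * (v1 / v2) ** (v1 / 2)
    return c * f ** (v1 / 2 - 1) / (1 + v1 * f / v2) ** ((v1 + v2) / 2)

def f_tail(f0, v1, v2, upper=400.0, n=80000):
    """P(F > f0) by Simpson's rule on [f0, upper]; the tail beyond upper is negligible here."""
    h = (upper - f0) / n
    s = f_pdf(f0, v1, v2) + f_pdf(upper, v1, v2)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f_pdf(f0 + i * h, v1, v2)
    return s * h / 3

p = f_tail(2.0, 9, 19)   # P(F > 2) with v1 = 9, v2 = 19
print(p)                 # about 0.0974
```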

6.4. Confidence Intervals


In the previous section we showed what types of distributions describe various statistics of a
random sample. In this section, we discuss estimating the population mean and variance from the
sample mean and variance. In addition, we introduce confidence intervals to quantify the
goodness of these estimates.


A confidence interval is a subset of random variable space about which someone can
say something like, “I am 95% sure that the true population mean is between µ_low and µ_hi.” In
this section, we discuss how a confidence interval is defined and calculated.
The confidence interval is defined by a percentage, (1 − 2α)·100%. So if α = 0.05,
then you would have a 90% confidence interval.
The concept of a confidence interval is illustrated in graphical terms in Figure 6.4.

Figure 6.4. A schematic illustrating a confidence interval.

The trick then is to find µ_low (corresponding to z_α) and µ_hi (corresponding to z_{1−α}) so that you
can say, for a given α, “I am (1 − 2α)·100% confident that µ_low < µ < µ_hi.”

Statistic 1. mean, σ known: confidence interval


We now know that the standardized sample mean follows the standard normal distribution. For a
symmetric PDF centered on zero, like the standard normal, the limits are symmetric: µ_low = −µ_hi.
We can then make the statement:

P( zα < Z < z1−α ) = 1 − 2α

Now the normal distribution is symmetric about the y-axis so we can write

zα = − z1−α

so

P( zα < Z < z1−α ) = P( zα < Z < − zα ) = 1 − 2α


where

Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}

We can rearrange this equation to read

P\left( \bar{X} + z_\alpha \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} - z_\alpha \frac{\sigma}{\sqrt{n}} \right) = 1 - 2\alpha    (6.12)

where we now have µlow and µ hi explicitly.

Example 6.14.: confidence interval on mean, variance known


Samples of dioxin contamination in 36 front yards in St. Louis show a concentration of 6 ppm.
Find the 95% confidence interval for the population mean. Assume that the standard deviation is
1.0 ppm.
To solve this, first calculate α , zα , z1−α .

1 − 2α = 0.95
α = 0.025
zα = z0.025 = −1.96
z1−α = − zα = 1.96

The z value came from a standard normal table. Alternatively, we can compute this value from
MATLAB,

>> z = icdf('normal',0.025,0,1)

z = -1.959963984540055

Here we used the inverse cumulative distribution function (icdf) command. Since we have the
standard normal PDF, the mean is 0 and the variance is 1. The value of 0.025 corresponds to
alpha, the probability.
To get the value of the other limit, we either rely on symmetry, or compute it directly,

>> z = icdf('normal',0.975,0,1)

z = 1.959963984540054

Note that these values of z are independent of all aspects of the problem except the value of the
confidence interval.

Therefore, by equation (6.12),

P\left( 6 + (-1.96)\frac{1}{\sqrt{36}} < \mu < 6 - (-1.96)\frac{1}{\sqrt{36}} \right) = 1 - 0.05 = 0.95

so the 95% confidence interval for the mean is 5.673 < µ < 6.327.
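The same interval can be computed with Python's standard library, where `NormalDist.inv_cdf` plays the role of MATLAB's `icdf` (a sketch, not part of the original text):

```python
from math import sqrt
from statistics import NormalDist

xbar, sigma, n = 6.0, 1.0, 36
alpha = 0.025                          # (1 - 2*alpha) = 0.95

z = NormalDist(0, 1).inv_cdf(alpha)    # z_alpha, about -1.96
lo = xbar + z * sigma / sqrt(n)        # lower limit of eq. (6.12)
hi = xbar - z * sigma / sqrt(n)        # upper limit of eq. (6.12)

print(lo, hi)  # about 5.673 and 6.327
```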

Statistic 2. mean, σ unknown: confidence interval


Now usually, we don’t know the variance. We have to use our estimate of the variance, s, for
σ. In that case, estimating the mean requires the t-distribution. (See previous section.) Let me
stress that we do everything exactly as we did before, but we use s for σ and use the t-distribution
instead of the normal distribution. Remember, the t-distribution is also symmetric about the origin,
so t_{1−α} = −t_α. (This means you only have to compute the t critical value once.) Remember, v = n − 1.

P (tα < T < t1−α ) = P(tα < T < −tα ) = 1 − 2α

where

T = \frac{\bar{X} - \mu}{s / \sqrt{n}}

Just as before, we can rearrange this equation to read

P\left( \bar{X} + t_\alpha \frac{s}{\sqrt{n}} < \mu < \bar{X} - t_\alpha \frac{s}{\sqrt{n}} \right) = 1 - 2\alpha    (6.13)

where we now have µlow and µ hi explicitly.

Example 6.15.: confidence interval on mean, variance unknown


Samples of dioxin contamination in 36 front yards in St. Louis show a concentration of 6 ppm.
Find the 95% confidence interval for the population mean. The sample standard deviation, s, was
measured to be 1.0.
To solve this, first calculate α , tα , t1−α for v = 35.

1 − 2α = 0.95
α = 0.025
tα = t 0.025 = −2.03
t1−α = −tα = +2.03


The t value came from a table of t-distribution values. Alternatively, we can compute this
value using MATLAB,

>> t = icdf('t',0.025,35)

t = -2.030107928250342

and for the upper limit

>> t = icdf('t',0.975,35)

t = 2.030107928250342,

which can also be obtained by symmetry. Note that these values of t are independent of all aspects
of the problem except the value of the confidence interval and the number of sample points, n.

Therefore, by equation (6.13),

P\left( 6 + (-2.03)\frac{1}{\sqrt{36}} < \mu < 6 - (-2.03)\frac{1}{\sqrt{36}} \right) = 1 - 0.05 = 0.95

so the 95% confidence interval for the mean is 5.662 < µ < 6.338.
You should note that we are a little less confident about the mean when we use the sample
variance as the estimate for the population variance: when the population variance was known, the
95% confidence interval for the mean was the narrower 5.673 < µ < 6.327.
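A quick arithmetic check of this interval, using the t critical value 2.0301 reported by `icdf` above, also confirms that the t-based interval is wider than the z-based interval of Example 6.14 (a sketch, not part of the original text):

```python
from math import sqrt

xbar, s, n = 6.0, 1.0, 36
t_crit = 2.0301                # t_{1-alpha} with v = 35, from the table / icdf call above

half = t_crit * s / sqrt(n)    # half-width of the interval in eq. (6.13)
lo, hi = xbar - half, xbar + half
print(lo, hi)                  # about 5.662 and 6.338

z_half = 1.96 * s / sqrt(n)    # half-width of the z-based interval of Example 6.14
assert half > z_half           # the t interval is wider: less confidence for the same data
```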

Statistic 3. difference of means, σ known: confidence interval


The exact same derivation that we used above for a single mean can be used for the difference
of means. When the variances of the two populations are known, we have:

P\left( (\bar{X}_1 - \bar{X}_2) + z_\alpha \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) - z_\alpha \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \right) = 1 - 2\alpha    (6.14)

where z is a random variable obeying the standard normal PDF.

Example 6.16.: confidence interval on the difference of means, variances known


Samples of dioxin contamination in 36 front yards in Times Beach, a suburb of St. Louis, show
a concentration of 6 ppm with a population variance of 1.0 ppm². Samples of dioxin
contamination in 16 front yards in Quail Run, another suburb of St. Louis, show a concentration of
8 ppm with a population variance of 3.0 ppm². Find the 95% confidence interval for the difference
of population means.


To solve this, first calculate α , zα , z1−α .

1 − 2α = 0.95
α = 0.025
zα = z0.025 = −1.96
z1−α = − zα = 1.96

The z value came from a table of standard normal PDF values. Alternatively, we can compute
this value from MATLAB,

>> z = icdf('normal',0.025,0,1)

z = -1.959963984540055

Therefore, by equation (6.14),

P\left( (6-8) - 1.96\sqrt{\frac{1}{36} + \frac{3}{16}} < \mu_1 - \mu_2 < (6-8) + 1.96\sqrt{\frac{1}{36} + \frac{3}{16}} \right) = 1 - 2(0.025)

P\left( -2.909 < \mu_1 - \mu_2 < -1.091 \right) = 0.95

So the 95% confidence interval for the difference of means is −2.909 < µ1 − µ2 < −1.091.
If we are determining which site is more contaminated, then we are 95% sure that site 2 (Quail
Run) is more contaminated than site 1 (Times Beach) by 1 to 3 ppm.
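This interval, too, follows directly from the Python standard library (a sketch offered as an alternative to the MATLAB calls):

```python
from math import sqrt
from statistics import NormalDist

x1, x2 = 6.0, 8.0          # sample mean concentrations, ppm
var1, n1 = 1.0, 36         # Times Beach: population variance and sample size
var2, n2 = 3.0, 16         # Quail Run: population variance and sample size

z = NormalDist(0, 1).inv_cdf(0.975)        # +1.96 for a 95% interval
half = z * sqrt(var1 / n1 + var2 / n2)     # half-width in eq. (6.14)
lo = (x1 - x2) - half
hi = (x1 - x2) + half

print(lo, hi)  # about -2.909 and -1.091
```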

Statistic 4. difference of means, σ unknown: confidence interval


When the variances of the two populations are unknown, we have:

P\left( (\bar{X}_1 - \bar{X}_2) + t_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < \mu_1 - \mu_2 < (\bar{X}_1 - \bar{X}_2) - t_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \right) = 1 - 2\alpha    (6.15)

where the number of degrees of freedom for the t-distribution is

v = n_1 + n_2 - 2  if  \sigma_1 = \sigma_2


v = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}  if  \sigma_1 \neq \sigma_2

Example 6.17.: confidence interval on the difference of means, variances unknown


Samples of dioxin contamination in 36 front yards in Times Beach, a suburb of St. Louis, show
a concentration of 6 ppm with a sample variance of 1.0 ppm². Samples of dioxin contamination in
16 front yards in Quail Run, another suburb of St. Louis, show a concentration of 8 ppm with a
sample variance of 3.0 ppm². Find the 95% confidence interval for the difference of population
means.
To solve this, first calculate α , tα , t1−α .

v = \frac{\left( \frac{1}{36} + \frac{3}{16} \right)^2}{\frac{(1/36)^2}{36 - 1} + \frac{(3/16)^2}{16 - 1}} = 19.59 \approx 20

1 − 2α = 0.95
α = 0.025
t_α = t_{0.025} = −2.086
t_{1−α} = −t_α = +2.086

The t value came from a table of t-PDF values. Alternatively, we can compute this value using
MATLAB,

>> t = icdf('t',0.025,20)

t = -2.085963447265864

Therefore, substituting into equation (6.15) yields

P\left( (6-8) - 2.086\sqrt{\frac{1}{36} + \frac{3}{16}} < \mu_1 - \mu_2 < (6-8) + 2.086\sqrt{\frac{1}{36} + \frac{3}{16}} \right) = 1 - 2(0.025)

P\left( -2.97 < \mu_1 - \mu_2 < -1.03 \right) = 0.95


So the 95% confidence interval for the difference of means is −2.97 < µ1 − µ2 < −1.03.

If we are determining which site is more contaminated, then we are 95% sure that site 2 (Quail
Run) is more contaminated than site 1 (Times Beach) by 1 to 3 ppm.

Statistic 5. variance: confidence interval


The confidence interval of the variance can be estimated in a precisely analogous way,
knowing that the statistic

\chi^2 = \frac{(n-1) S^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}

has a chi-squared distribution with v = n − 1 degrees of freedom, f_{\chi^2}(\chi^2; n-1). So


P\left( \frac{(n-1) S^2}{\chi^2_{1-\alpha}} < \sigma^2 < \frac{(n-1) S^2}{\chi^2_{\alpha}} \right) = 1 - 2\alpha    (6.16)

Perversely, the tables of critical values for the χ² distribution define α as the upper-tail area (what
we would call 1 − α), so the indices have to be switched when using the table:

P\left( \frac{(n-1) S^2}{\chi^2_{\alpha}} < \sigma^2 < \frac{(n-1) S^2}{\chi^2_{1-\alpha}} \right) = 1 - 2\alpha  when using the χ² critical values table only!

If you get confused, just remember that the upper limit must be greater than the lower limit.
Remember also that the f χ 2 ( χ 2 ; n − 1) is not symmetric about the origin, so we cannot use the
symmetry arguments used for the confidence intervals for functions of the mean.

Example 6.18.: confidence interval on the variance


Samples of dioxin contamination in 16 front yards in St. Louis show a concentration of 6 ppm.
Find the 95% confidence interval for the population variance. The sample standard deviation, s, was
measured to be 1.0.
To solve this, first calculate α , χ α2 , χ 12−α .

For v = n – 1 = 15, we have

1 − 2α = 0.95
α = 0.025
χα2 = χ 02.025 = 27.488
χ12−α = χ 02.975 = 6.262

These values came from a table of χ²-distribution critical values. Alternatively, we can compute
them using MATLAB,

>> chi2 = icdf('chi2',0.025,15)

chi2 = 6.262137795043251

and

>> chi2 = icdf('chi2',0.975,15)

chi2 = 27.488392863442972

Therefore, substituting into equation (6.16) yields

P\left( \frac{(16-1) \cdot 1.0}{27.488} < \sigma^2 < \frac{(16-1) \cdot 1.0}{6.262} \right) = 1 - 2(0.025)

P\left( 0.5457 < \sigma^2 < 2.395 \right) = 0.95

So the 95% confidence interval for the variance is 0.5457 < σ² < 2.395.
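As a stdlib-only sanity check, integrating the chi-squared PDF (v = 15) between the two critical values should recover 0.95, and the interval limits follow from equation (6.16); the Simpson's-rule helper and step count are our own choices:

```python
from math import gamma, exp

def chi2_pdf(x, v):
    """Chi-squared PDF for x > 0, as defined in the text."""
    return x ** (v / 2 - 1) * exp(-x / 2) / (2 ** (v / 2) * gamma(v / 2))

def simpson(f, a, b, n=20000):
    """Integrate f on [a, b] by Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3

# probability mass between the two critical values used above (v = 15)
mass = simpson(lambda x: chi2_pdf(x, 15), 6.262, 27.488)

n, s2 = 16, 1.0
lo = (n - 1) * s2 / 27.488    # lower limit of eq. (6.16)
hi = (n - 1) * s2 / 6.262     # upper limit of eq. (6.16)
print(mass, lo, hi)           # about 0.95, 0.546, 2.395
```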

Statistic 6. ratio of variances: confidence interval


The ratio of two population variances can be estimated in a precisely analogous way, knowing
that the statistic

F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} = \frac{S_1^2 \sigma_2^2}{S_2^2 \sigma_1^2}

follows the F-distribution with v_1 = n_1 - 1 and v_2 = n_2 - 1 degrees of freedom. Remember, the
F-distribution has a symmetry, f_{1-\alpha}(v_1, v_2) = \frac{1}{f_{\alpha}(v_2, v_1)}. This symmetry relation is essential if one
is to use tables for the critical values of the F-distribution. It is not essential if one uses MATLAB
commands.
If one is computing the cumulative PDF for the F distribution, then one simply rearranges this
equation for \sigma_1^2/\sigma_2^2:

\frac{\sigma_2^2}{\sigma_1^2} = \frac{S_2^2}{S_1^2} F

\frac{\sigma_1^2}{\sigma_2^2} = \frac{1}{F} \frac{S_1^2}{S_2^2}

P\left( \frac{S_1^2}{S_2^2} \frac{1}{f_{1-\alpha}(v_1, v_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S_1^2}{S_2^2} \frac{1}{f_{\alpha}(v_1, v_2)} \right) = 1 - 2\alpha    (6.17)

One notes that the order of the limits has changed here, since as σ₁²/σ₂² goes up, F goes down. In
any case, the lower limit must be smaller than the upper limit. If one chooses to use tables of
critical values, one must take into account two idiosyncrasies of the procedure. First, as was the
case with the t and chi-squared distributions, the tables provide the probability that f is greater than
a value, not the cumulative PDF, which is the probability that f is less than a value. Second, the
tables only provide data for small values of α. Therefore, we must eliminate all instances of 1 − α
using the symmetry relation. The result is

P( (S₁²/S₂²)(1/fα(v₁, v₂)) < σ₁²/σ₂² < (S₁²/S₂²)fα(v₂, v₁) ) = 1 − 2α    when using the tables only!

Example 6.18.: confidence interval on the ratio of variances

Samples of dioxin contamination in 20 front yards in Times Beach, a suburb of St. Louis, show
a concentration of 6 ppm with a sample variance of 1.0 ppm². Samples of dioxin contamination in
16 front yards in Quail Run, another suburb of St. Louis, show a concentration of 8 ppm with a
sample variance of 3.0 ppm². Find the 90% confidence interval for the ratio of the population
variances.
To solve this, first calculate α, Fα, and F₁₋α, with v₁ = n₁ − 1 = 19 and v₂ = n₂ − 1 = 15.

1 − 2α = 0.90
α = 0.05

We can compute the f probabilities using MATLAB,

>> f = icdf('f',0.05,19,15)

f = 0.447614966503185


and

>> f = icdf('f',0.95,19,15)

f = 2.339819281665456

Substituting into equation (6.17) yields

P( (1/3)(1/2.3398) < σ₁²/σ₂² < (1/3)(1/0.4476) ) = 1 − 2(0.05)

P( 0.1425 < σ₁²/σ₂² < 0.7447 ) = 0.90

Alternatively, we can use the table of critical values:

Fα = F₀.₀₅(v₁ = 19, v₂ = 15) ≈ F₀.₀₅(v₁ = 20, v₂ = 15) = 2.33
F₀.₀₅(v₁ = 15, v₂ = 19) = 2.23

P( (1/3)(1/2.33) < σ₁²/σ₂² < (1/3)(2.23) ) = 1 − 2(0.05)

P( 0.1431 < σ₁²/σ₂² < 0.7433 ) = 0.90

So the 90% confidence interval for the ratio of variances is 0.1425 < σ₁²/σ₂² < 0.7447.
If we are determining which site has the greater variance of contamination levels, then we are
90% sure that site 2 (Quail Run) has more variance, by a factor of 1.3 to 7.0.
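Equation (6.17) is easy to apply programmatically. As a sketch of this example's calculation in Python, using SciPy's `f.ppf` in place of the MATLAB `icdf` calls shown above (the SciPy API is our substitution, not part of the text):

```python
from scipy.stats import f

v1, v2, alpha = 19, 15, 0.05   # degrees of freedom and tail probability
ratio = 1.0 / 3.0              # s1^2 / s2^2 from the example

# equation (6.17): ratio/f_{1-alpha}(v1,v2) < sigma1^2/sigma2^2 < ratio/f_{alpha}(v1,v2)
lower = ratio / f.ppf(1 - alpha, v1, v2)  # matches icdf('f',0.95,19,15) = 2.3398
upper = ratio / f.ppf(alpha, v1, v2)      # matches icdf('f',0.05,19,15) = 0.4476

print(lower, upper)  # about 0.1425 and 0.7447
```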

6.5. Problems
We intend to purchase a liquid as a raw material for a material we are designing. Two vendors
offer us samples of their product and a statistics sheet. We run the samples in our own labs and
come up with the following data:


Vendor 1 Vendor 2
sample # outcome sample # outcome
1 2.3 1 2.49
2 2.49 2 1.98
3 2.05 3 2.18
4 2.4 4 2.36
5 2.18 5 2.47
6 2.12 6 2.36
7 2.38 7 1.82
8 2.39 8 1.88
9 2.4 9 1.87
10 2.46 10 1.87
11 2.19
12 2.04
13 2.43
14 2.34
15 2.19
16 2.12

Vendor Specification Claims:


Vendor 1: µ = 2.0 and σ 2 = 0.05 , σ = 0.2236
Vendor 2: µ = 2.3 and σ 2 = 0.12 , σ = 0.3464

Sample statistics, based on the data provided in the table above (the sample variance uses the
n − 1 denominator):

n₁ = 16    x̄₁ = (1/16) Σᵢ₌₁¹⁶ xᵢ = 2.280

s₁² = (1/(16 − 1)) Σᵢ₌₁¹⁶ (xᵢ − x̄₁)² = 0.0229    s₁ = 0.1513

n₂ = 10    x̄₂ = (1/10) Σᵢ₌₁¹⁰ xᵢ = 2.128

s₂² = (1/(10 − 1)) Σᵢ₌₁¹⁰ (xᵢ − x̄₂)² = 0.0744    s₂ = 0.2728
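These sample statistics can be reproduced directly from the table. A sketch using Python's standard `statistics` module, whose `variance` function uses the n − 1 denominator (Python here is our substitution for the chapter's MATLAB workflow):

```python
import statistics

# vendor data transcribed from the table above
vendor1 = [2.3, 2.49, 2.05, 2.4, 2.18, 2.12, 2.38, 2.39,
           2.4, 2.46, 2.19, 2.04, 2.43, 2.34, 2.19, 2.12]
vendor2 = [2.49, 1.98, 2.18, 2.36, 2.47, 2.36, 1.82, 1.88, 1.87, 1.87]

x1 = statistics.mean(vendor1)         # 2.280
s1_sq = statistics.variance(vendor1)  # about 0.023, agreeing with the 0.0229 quoted above to rounding
x2 = statistics.mean(vendor2)         # 2.128
s2_sq = statistics.variance(vendor2)  # about 0.0744

print(x1, s1_sq, x2, s2_sq)
```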

Problem 6.1.
Determine a 95% confidence interval on the mean of sample 1. Use the value of the
population variance given. Is the given population mean legitimate?

Problem 6.2.
Determine a 95% confidence interval on the difference of means between samples 1 and 2.
Use the values of the population variance given. Is the difference between the given population
means legitimate?


Problem 6.3.
Determine a 95% confidence interval on the mean of sample 1. Assume the given values of
the population variances are suspect and not to be trusted. Is the given population mean
legitimate?

Problem 6.4.
Determine a 95% confidence interval on the difference of means between samples 1 and 2.
Assume the given values of the population variances are suspect and not to be trusted. Is the
difference between the given population means legitimate?

Problem 6.5.
Determine a 95% confidence interval on the variance of sample 1. Is the given population
variance legitimate?

Problem 6.6.
Determine a 98% confidence interval on the ratio of variance of samples 1 & 2. Is the ratio of
the given population variances legitimate?
