Mathematical Foundations of Computer Science
6.1. Introduction
Frequently the engineer is unable to completely characterize the entire population. She/he
must be satisfied with examining some subset of the population, or several subsets of the
population, in order to infer information about the entire population. Such subsets are called
samples. A population is the entirety of observations and a sample is a subset of the population.
A sample that gives correct inferences about the population is a random sample, otherwise it is
biased.
Statistics are given different symbols than expectation values because statistics are approximations of the expectation values. The statistic called the mean is an approximation to the expectation value of the mean: the statistic is the mean of the sample, while the expectation value is the mean of the entire population. In order to calculate an expectation value, one requires knowledge of the PDF. In practice, the motivation for calculating a statistic is that one has no knowledge of the underlying PDF.
6.2. Statistics
Any function of the random variables constituting a random sample is called a statistic.
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad (6.1)
\tilde{X} = X_{(n+1)/2} \quad \text{for odd } n; \qquad \tilde{X} = \frac{X_{n/2} + X_{n/2+1}}{2} \quad \text{for even } n \qquad (6.2)
\mathrm{range}(X) = X_n - X_1 \qquad (6.3)

S^2 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}{n-1} = \frac{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n-1)} \qquad (6.4)
The reason for using (n-1) in the denominator rather than n is given later.
S = \sqrt{S^2} \qquad (6.5)
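As a quick illustration (not part of the original text), these statistics can be computed in MATLAB. The data vector below reuses the first ten Vendor 1 outcomes from the problem data in Section 6.5; any sample vector would do.

% Sample statistics of equations (6.1)-(6.5)
x = [2.3 2.49 2.05 2.4 2.18 2.12 2.38 2.39 2.4 2.46];
n = numel(x);
xbar   = sum(x)/n;                     % sample mean, Eq. (6.1); same as mean(x)
xmed   = median(x);                    % sample median, Eq. (6.2)
xrange = max(x) - min(x);              % sample range, Eq. (6.3)
s2     = sum((x - xbar).^2)/(n - 1);   % sample variance, Eq. (6.4); same as var(x)
s      = sqrt(s2);                     % sample standard deviation, Eq. (6.5); same as std(x)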
If X̄ is the mean of a random sample of size n taken from a population with mean µ and variance σ², then the limiting form of the distribution of

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (6.6)

as n → ∞ is the standard normal distribution n(z;0,1). This is known as the Central Limit
Theorem. What this says is that, given a collection of random samples, each of size n and each yielding a mean X̄, the distribution of X̄ approximates a normal distribution and becomes exactly normal as the sample size goes to infinity. The distribution of the underlying population X does not have to be normal. Generally, the normal approximation for X̄ is good if n > 30.
We provide a derivation in Appendix V proving that the distribution of the sample mean is
given by the normal distribution.
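As a brief aside (an illustration not in the original text), the Central Limit Theorem is easy to see by simulation in MATLAB: draw many samples from a decidedly non-normal (here uniform) population, standardize each sample mean as in Eq. (6.6), and the resulting histogram is close to n(z;0,1).

% CLT demonstration by simulation (uniform(0,1) population)
n  = 30;                                % size of each sample
M  = 10000;                             % number of samples
mu = 0.5;  sigma = sqrt(1/12);          % population mean and standard deviation
xbar = mean(rand(n, M));                % one sample mean per column
z    = (xbar - mu)./(sigma/sqrt(n));    % standardize, Eq. (6.6)
histogram(z, 'Normalization', 'pdf')    % compare visually with the standard normal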
For example, suppose samples of size n = 10 yield a sample mean of x̄ = 15.0 µm, and the population standard deviation is known to be σ = 1.0 µm. What is the likelihood that the true population mean, µ, is actually less than 14.0 µm? Then

Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{15 - 14}{1/\sqrt{10}} = 3.162
First, we can use a standard normal table (for example, http://en.wikipedia.org/wiki/Standard_normal_table):

P(µ < 14) = P(z > 3.162) = 1 − P(z < 3.162) = 1 − 0.9992 = 0.0008
Second, we can use a modern computational tool like MATLAB to evaluate the probability.
The problem can be worked in terms of the standard normal PDF (µ = 0 and σ = 1), which for
P(µ < 14 ) = P(z > 3.162 ) = 1 − P(z < 3.162 ) is
>> p = 1 - cdf('normal',3.162,0,1)
p = 7.834478217108032e-04
Alternatively, the problem can be worked in terms of the non-standard normal PDF (mean x̄ = 15 and standard deviation σ/√n = 1/√10), which for P(µ < 14) is
>> p = cdf('normal',14,15,1/sqrt(10))
p = 7.827011290012762e-04
The difference in these results is due to the round-off in 3.162, used as an argument in the
function call for the standard normal distribution.
Based on our sampling data, the probability that the true population mean is less than 14.0 µm is 0.078%.
It is useful to know the sampling distribution of the difference of two means when you want to determine whether there is a significant difference between two populations. This situation applies when you take two random samples of sizes n1 and n2 from two different populations, with means µ1 and µ2 and variances σ1² and σ2², respectively. Then the sampling distribution of the difference of means, X̄1 − X̄2, is approximately normally distributed with mean

\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2

and variance

\sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
Hence,
Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \qquad (6.7)
Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(15 - 10) - (4)}{\sqrt{\frac{1}{10} + \frac{2}{20}}} = 2.2361
We have the change in sign because as ∆µ increases, z decreases. The probability that µ1 − µ2 is greater than 4.0 µm is then given by P(Z < 2.2361). How do we know that we want P(Z < 2.2361) and not P(Z > 2.2361)? We just have to sit down and think about what the problem physically means. Since we want the probability that µ1 − µ2 is greater than 4.0 µm, we know we need to include the area due to higher values of µ1 − µ2. Higher values of µ1 − µ2 yield lower values of Z. Therefore, we need the less-than sign.
The evaluation of the cumulative normal probability distribution can again be performed two ways, with a standard normal table or with MATLAB. Using MATLAB, we have
>> p = cdf('normal',2.2361,0,1)
p = 0.987327389270190
We expect 98.73% of the differences in crystal size of the two populations to be at least 4.0
µm.
Of course, usually we don’t know the population variance. In that case, we have to use some
other statistic to get a handle on the distribution of the mean.
If X̄ is the mean of a random sample of size n taken from a population with mean µ and unknown variance, then the limiting form of the distribution of

T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \qquad (6.8)

is the t-distribution with v = n − 1 degrees of freedom, whose PDF is

f(t) = \frac{\Gamma[(v+1)/2]}{\Gamma(v/2)\sqrt{\pi v}}\left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}, \qquad -\infty < t < \infty
Figure 6.1. The t distribution as a function of the degrees of freedom and the normal distribution.

Samples of size n = 10 yield mean crystal sizes of x̄ = 15.0 µm and a sample variance of s² = 1.0 µm².
What is the likelihood that the true population mean, µ, is actually less than 14.0 µm?
t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{15 - 14}{1/\sqrt{10}} = 3.162
We have the change in sign because as µ increases, t decreases. The parameter v = n-1 = 9.
The evaluation of the cumulative t probability distribution can again be performed two ways. First, we can use a table of critical values of the t-distribution. It is crucial to note that such a table does not provide cumulative PDFs; rather, it provides one minus the cumulative PDF. In other words, whereas the standard normal table provides the probability less than z (the cumulative PDF), the t-distribution table provides the probability greater than t (one minus the cumulative PDF).
Second, using MATLAB we have P(µ < 14) = P(t > 3.162) = 1 − P(t < 3.162)
>> p = 1 - cdf('t',3.162,9)
p = 0.005756562560207
Based on our sampling data, the probability that the true population mean is less than 14.0 µm is 0.57%.
We should point out that this percentage is substantially greater than the percentage we obtained when we knew the population variance (0.078%). That is because knowing the population variance reduces our uncertainty. Approximating the population variance with the sample variance adds to the uncertainty and results in a larger percentage of the distribution deviating farther from the sample mean.
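This comparison can be made directly in MATLAB (a small sketch, not in the original text): the same standardized value of 3.162 leaves a much larger tail area under the t distribution with 9 degrees of freedom than under the standard normal.

>> p_normal = 1 - cdf('normal',3.162,0,1)   % about 7.8e-4, known variance
>> p_t = 1 - cdf('t',3.162,9)               % about 5.8e-3, variance estimated by s^2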
T = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{500 - 518}{40/\sqrt{25}} = -2.25
Second, using MATLAB we have P(µ > 518) = P(t < −2.25)
>> p = cdf('t',-2.25,24)
p = 0.016944255452754
(Or using a Table, we find that when v=24 and T=2.25, α=0.02). This means there is only a
1.6% probability that a population with µ = 500 would yield a sample with X = 518 or higher.
Therefore, it is unlikely that 500 is the population mean.
transformation: T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \qquad (6.9)

symmetry: t_{1-\alpha} = -t_{\alpha}

parameters: v = n_1 + n_2 - 2 \quad \text{if } \sigma_1 = \sigma_2

parameters: v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} \quad \text{if } \sigma_1 \neq \sigma_2
Since we don’t know either population variance in this case, we can’t assume they are equal
unless we are told they are equal.
The samples yield mean crystal sizes of X̄1 = 15.0 µm and X̄2 = 10.0 µm and sample variances of s1² = 1.0 µm² and s2² = 2.0 µm². What percentage of true population differences yielding these sampling results would have a true difference in population means, µ1 − µ2, of 4.0 µm or greater?
T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(15 - 10) - (4)}{\sqrt{\frac{1}{10} + \frac{2}{20}}} = 2.2361
v = \frac{\left(\frac{1^2}{10} + \frac{2^2}{20}\right)^2}{\frac{(1^2/10)^2}{10 - 1} + \frac{(2^2/20)^2}{20 - 1}} = 27.98 \approx 28
The evaluation of the cumulative t probability distribution can again be performed two ways. First, using a table of critical values of the t-distribution, we have

P(µ1 − µ2 > 4.0) = P(t < 2.2361) = 1 − P(t > 2.2361) = 1 − 0.0217 = 0.9783
Second, using MATLAB we have for P(µ1 − µ 2 > 4.0 ) = P(t < 2.2361)
>> p = cdf('t',2.2361,28)
p = 0.983252747598848
We expect 98.3% of the differences in crystal size of the two populations to be at least 4.0 µm.
If S² is the variance of a random sample of size n taken from a normal population with variance σ², then the statistic

\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n}\frac{(X_i - \bar{X})^2}{\sigma^2} \qquad (6.10)

has a chi-squared distribution with v = n − 1 degrees of freedom, whose PDF is
f_{\chi^2}(x; v) = \begin{cases} \dfrac{1}{2^{v/2}\,\Gamma(v/2)}\, x^{v/2 - 1} e^{-x/2} & x > 0 \\ 0 & \text{elsewhere} \end{cases}
Figure 6.2. The chi-squared distribution for various values of v.

Samples of size n = 10 are taken from the effluent of each reactor. The samples yield mean crystal sizes of x̄ = 15.0 µm and a sample variance of s² = 1.0 µm². What is the likelihood that the true population variance, σ², is actually less than 0.5 µm²?
\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \frac{(10-1)\cdot 1}{0.5} = 18
P(σ² < 0.5) = P(χ² > 18)
We have the change in sign because as σ 2 increases, χ 2 decreases. The parameter v = n-1 =
9.
The evaluation of the cumulative χ 2 probability distribution can again be performed two
ways. First, we can use a table of critical values of the χ 2 -distribution. It is crucial to note that
such a table does not provide cumulative PDFs, rather it provides one minus the cumulative PDF.
We then have
P(σ² < 0.5) = P(χ² > 18) ≈ 0.04

Second, using MATLAB we have P(σ² < 0.5) = P(χ² > 18) = 1 − P(χ² < 18)
>> p = 1 - cdf('chi2',18,9)
p = 0.035173539466985
Based on our sampling data, the probability that the true variance is less than 0.5 µm2 is 3.5%.
F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2\,\sigma_2^2}{S_2^2\,\sigma_1^2} \qquad (6.11)
provides a distribution of the ratio of two variances. This distribution is called the F-distribution
with v1 = n1 − 1 and v2 = n2 − 1 degrees of freedom. The f-distribution is defined as
h_f(f; v_1, v_2) = \begin{cases} \dfrac{\Gamma\!\left(\frac{v_1+v_2}{2}\right)\left(\frac{v_1}{v_2}\right)^{v_1/2} f^{\,v_1/2 - 1}}{\Gamma\!\left(\frac{v_1}{2}\right)\Gamma\!\left(\frac{v_2}{2}\right)\left(1 + \frac{v_1}{v_2} f\right)^{(v_1+v_2)/2}} & f > 0 \\ 0 & \text{elsewhere} \end{cases}
Figure 6.3. The F distribution h(f; v1, v2) for (v1 = 5, v2 = 10) and (v1 = 5, v2 = 5).

What is the probability that the ratio of variances, σ1²/σ2², is less than 0.25?

F = \frac{S_1^2\,\sigma_2^2}{S_2^2\,\sigma_1^2} = \frac{1}{2 \cdot 0.25} = 2

We have the change in sign because as σ1²/σ2² increases, F decreases. The parameters are v1 = n1 − 1 = 9 and v2 = n2 − 1 = 19.
The evaluation of the cumulative F probability distribution can again be performed in one
way. We cannot use tables because there are no tables for arbitrary values of the probability.
There are only tables for two values of the probability, 0.01 and 0.05. Therefore, using MATLAB
we have P(σ1²/σ2² < 0.25) = P(F > 2) = 1 − P(F < 2)
>> p = 1 - cdf('f',2,9,19)
p = 0.097413204997132
Based on our sampling data, the probability that the ratio of variances is less than 0.25 is 9.7%.
A confidence interval is some subset of random variable space with which someone can
say something like, “I am 95% sure that the true population mean is between µlow and µ hi .” In
this section, we discuss how a confidence interval is defined and calculated.
The confidence interval is defined by a percent. This percent is called (1-2α). So if α=0.05,
then you would have a 90% confidence interval.
The concept of a confidence interval is illustrated in graphical terms in Figure 6.4.
The trick then is to find µ_low (determined by z_α) and µ_hi (determined by z_{1−α}) so that you can say, for a given α: I am (1 − 2α)% confident that µ_low < µ < µ_hi.
Now the normal distribution is symmetric about the y-axis so we can write
zα = − z1−α
so
P\left(\bar{X} + z_{\alpha}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} - z_{\alpha}\frac{\sigma}{\sqrt{n}}\right) = 1 - 2\alpha \qquad (6.12)

where

Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}.
1 − 2α = 0.95
α = 0.025
zα = z0.025 = −1.96
z1−α = − zα = 1.96
The z value came from a standard normal table. Alternatively, we can compute this value from
MATLAB,
>> z = icdf('normal',0.025,0,1)
z = -1.959963984540055
Here we used the inverse cumulative distribution function (icdf) command. Since we have the
standard normal PDF, the mean is 0 and the variance is 1. The value of 0.025 corresponds to
alpha, the probability.
To get the value of the other limit, we either rely on symmetry, or compute it directly,
>> z = icdf('normal',0.975,0,1)
z = 1.959963984540054
Note that these values of z are independent of all aspects of the problem except the value of the
confidence interval.
P\left(6 + (-1.96)\frac{1}{\sqrt{36}} < \mu < 6 - (-1.96)\frac{1}{\sqrt{36}}\right) = 1 - 0.05 = 0.95
so the 95% confidence interval for the mean is 5.673 < µ < 6.327 .
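A small MATLAB sketch (not in the original text) reproduces this interval from Eq. (6.12), using the numbers of this example (x̄ = 6, σ = 1, n = 36, 95% confidence):

xbar = 6; sigma = 1; n = 36; alpha = 0.025;
z_lo  = icdf('normal',alpha,0,1);      % z_alpha = -1.96
mu_lo = xbar + z_lo*sigma/sqrt(n)      % lower limit, about 5.673
mu_hi = xbar - z_lo*sigma/sqrt(n)      % upper limit, about 6.327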
When the population variance is unknown, we use the sample variance and the t-distribution:

P\left(\bar{X} + t_{\alpha}\frac{s}{\sqrt{n}} < \mu < \bar{X} - t_{\alpha}\frac{s}{\sqrt{n}}\right) = 1 - 2\alpha \qquad (6.13)

where

T = \frac{\bar{X} - \mu}{s/\sqrt{n}}.
1 − 2α = 0.95
α = 0.025
tα = t 0.025 = −2.03
t1−α = −tα = +2.03
The t value came from a table of t-distribution values. Alternatively, we can compute this
value using MATLAB,
>> t = icdf('t',0.025,35)
t = -2.030107928250342
>> t = icdf('t',0.975,35)
t = 2.030107928250342,
which can also be obtained by symmetry. Note that these values of t are independent of all aspects
of the problem except the value of the confidence interval and the number of sample points, n.
P\left(6 - (2.03)\frac{1}{\sqrt{36}} < \mu < 6 + (2.03)\frac{1}{\sqrt{36}}\right) = 1 - 0.05 = 0.95
so the 95% confidence interval for the mean is 5.662 < µ < 6.338 .
You should note that we are a little less certain about the mean when we use the sample variance as the estimate of the population variance: this interval is wider than the one obtained with the known population variance, for which the 95% confidence interval was 5.673 < µ < 6.327.
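The analogous sketch for the unknown-variance case, Eq. (6.13), with x̄ = 6, s = 1, n = 36 (again an illustration, not in the original text):

xbar = 6; s = 1; n = 36; alpha = 0.025;
t_lo  = icdf('t',alpha,n-1);       % t_alpha = -2.03 with v = 35
mu_lo = xbar + t_lo*s/sqrt(n)      % lower limit, about 5.662
mu_hi = xbar - t_lo*s/sqrt(n)      % upper limit, about 6.338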
P\left((\bar{X}_1 - \bar{X}_2) + z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < (\mu_1 - \mu_2) < (\bar{X}_1 - \bar{X}_2) - z_{\alpha}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - 2\alpha \qquad (6.14)
1 − 2α = 0.95
α = 0.025
zα = z0.025 = −1.96
z1−α = − zα = 1.96
The z value came from a table of standard normal PDF values. Alternatively, we can compute
this value from MATLAB,
>> z = icdf('normal',0.025,0,1)
z = -1.959963984540055
P\left((6 - 8) - 1.96\sqrt{\frac{1}{36} + \frac{3}{16}} < (\mu_1 - \mu_2) < (6 - 8) + 1.96\sqrt{\frac{1}{36} + \frac{3}{16}}\right) = 1 - 2(0.025)
So the 95% confidence interval for the mean is − 2.909 < (µ1 − µ 2 ) < −1.091 .
If we are determining which site is more contaminated, then we are 95% sure that site 2 (Quail Run) is more contaminated than site 1 (Times Beach) by 1 to 3 ppm.
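A sketch of Eq. (6.14) with these numbers (x̄1 = 6, x̄2 = 8, σ1² = 1, n1 = 36, σ2² = 3, n2 = 16; not in the original text):

xbar1 = 6; xbar2 = 8; var1 = 1; n1 = 36; var2 = 3; n2 = 16; alpha = 0.025;
z_lo = icdf('normal',alpha,0,1);         % -1.96
hw   = -z_lo*sqrt(var1/n1 + var2/n2);    % half-width, about 0.909
d_lo = (xbar1 - xbar2) - hw              % about -2.909
d_hi = (xbar1 - xbar2) + hw              % about -1.091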
P\left((\bar{X}_1 - \bar{X}_2) + t_{\alpha}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < (\mu_1 - \mu_2) < (\bar{X}_1 - \bar{X}_2) + t_{1-\alpha}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right) = 1 - 2\alpha \qquad (6.15)
v = n1 + n2 − 2 if σ 1 = σ 2
v = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} \quad \text{if } \sigma_1 \neq \sigma_2
v = \frac{\left(\frac{1}{36} + \frac{3}{16}\right)^2}{\frac{(1/36)^2}{36 - 1} + \frac{(3/16)^2}{16 - 1}} = 19.59 \approx 20
1 − 2α = 0.95
α = 0.025
tα = t0.025 = −2.086
t1−α = −tα = +2.086
The t value came from a table of t-PDF values. Alternatively, we can compute this value using
MATLAB,
>> t = icdf('t',0.025,20)
t = -2.085963447265864
P\left((6 - 8) - 2.086\sqrt{\frac{1}{36} + \frac{3}{16}} < (\mu_1 - \mu_2) < (6 - 8) + 2.086\sqrt{\frac{1}{36} + \frac{3}{16}}\right) = 1 - 2(0.025)
So the 95% confidence interval for the mean is − 2.97 < (µ1 − µ 2 ) < −1.03 .
If we are determining which site is more contaminated, then we are 95% sure that site 2 (Quail Run) is more contaminated than site 1 (Times Beach) by 1 to 3 ppm.
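A sketch of Eq. (6.15) with the Welch degrees of freedom used above (x̄1 = 6, x̄2 = 8, s1² = 1, n1 = 36, s2² = 3, n2 = 16; not in the original text):

xbar1 = 6; xbar2 = 8; s1sq = 1; n1 = 36; s2sq = 3; n2 = 16; alpha = 0.025;
w1 = s1sq/n1;  w2 = s2sq/n2;
v  = (w1 + w2)^2/(w1^2/(n1-1) + w2^2/(n2-1));   % about 19.6, rounded to 20
t_lo = icdf('t',alpha,round(v));                % about -2.086
hw   = -t_lo*sqrt(w1 + w2);                     % half-width, about 0.968
d_lo = (xbar1 - xbar2) - hw                     % about -2.97
d_hi = (xbar1 - xbar2) + hw                     % about -1.03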
\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n}\frac{(X_i - \bar{X})^2}{\sigma^2}
Perversely, the tables of the critical values for the χ² distribution have defined α to be 1 − α, so the indices have to be switched when using the table.

P\left(\frac{(n-1)S^2}{\chi^2_{\alpha}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha}}\right) = 1 - 2\alpha \quad \text{when using the } \chi^2 \text{ critical values table only!}
If you get confused, just remember that the upper limit must be greater than the lower limit.
Remember also that f_χ²(χ²; n − 1) is not symmetric about the origin, so we cannot use the symmetry arguments used for the confidence intervals for functions of the mean.
1 − 2α = 0.95
α = 0.025
χ²_α = χ²_{0.025} = 27.488
χ²_{1−α} = χ²_{0.975} = 6.262
The χ² values came from a table of χ²-distribution values. Alternatively, we can compute these values using MATLAB (here the degrees of freedom v = 15 are implied by the tabulated values),
>> chi2 = icdf('chi2',0.025,15)
chi2 = 6.262137795043251
and
>> chi2 = icdf('chi2',0.975,15)
chi2 = 27.488392863442972
P(0.5457 < σ² < 2.395) = 0.95

So the 95% confidence interval for the variance is 0.5457 < σ² < 2.395.
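A sketch of this interval in MATLAB, with the sample size and variance implied by the numbers above (n = 16, s² = 1.0; not stated explicitly in the surviving text):

n = 16; s2 = 1.0; alpha = 0.025;
chi2_hi = icdf('chi2',1-alpha,n-1);   % 27.488
chi2_lo = icdf('chi2',alpha,n-1);     % 6.262
var_lo  = (n-1)*s2/chi2_hi            % about 0.5457
var_hi  = (n-1)*s2/chi2_lo            % about 2.395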
F = \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{S_1^2\,\sigma_2^2}{S_2^2\,\sigma_1^2}
\frac{\sigma_2^2}{\sigma_1^2} = F\,\frac{S_2^2}{S_1^2}

\frac{\sigma_1^2}{\sigma_2^2} = \frac{1}{F}\,\frac{S_1^2}{S_2^2}
P\left(\frac{S_1^2}{S_2^2}\,\frac{1}{f_{1-\alpha}(v_1,v_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S_1^2}{S_2^2}\,\frac{1}{f_{\alpha}(v_1,v_2)}\right) = 1 - 2\alpha \qquad (6.17)
One notes that the order of the limits has changed here, since as σ1²/σ2² goes up, F goes down. In any case, the lower limit must be smaller than the upper limit. If one chooses to use tables of critical values, one must take into account two idiosyncrasies of the procedure. First, as was the case with the t and chi-squared distributions, the tables provide the probability that f is greater than a value, not the cumulative PDF, which is the probability that f is less than a value. Second, the tables only provide data for small values of α. Therefore, we must eliminate all instances of 1 − α using a symmetry relation. The result is
P\left(\frac{S_1^2}{S_2^2}\,\frac{1}{f_{\alpha}(v_1,v_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{S_1^2}{S_2^2}\, f_{\alpha}(v_2,v_1)\right) = 1 - 2\alpha \quad \text{when using the tables only!}
1 − 2α = 0.90
α = 0.05
>> f = icdf('f',0.05,19,15)
f = 0.447614966503185
and
>> f = icdf('f',0.95,19,15)
f = 2.339819281665456
P\left(\frac{1}{3}\cdot\frac{1}{2.3398} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1}{3}\cdot\frac{1}{0.4476}\right) = 1 - 2(0.05)

P\left(0.1425 < \frac{\sigma_1^2}{\sigma_2^2} < 0.7447\right) = 0.90
F_{\alpha} = F_{0.05}(v_1 = 19, v_2 = 15) \approx F_{0.05}(v_1 = 20, v_2 = 15) = 2.33

F_{0.05}(v_1 = 15, v_2 = 19) = 2.23

P\left(\frac{1}{3}\cdot\frac{1}{2.33} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{1}{3}\cdot 2.23\right) = 1 - 2(0.05)

P\left(0.1431 < \frac{\sigma_1^2}{\sigma_2^2} < 0.7433\right) = 0.90
So the 90% confidence interval for the ratio of variances is 0.1425 < σ1²/σ2² < 0.7447.
If we are determining which site has a greater variance of contamination levels, then we are 90% sure that site 2 (Quail Run) has more variance by a factor of 1.3 to 7.0.
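A sketch of Eq. (6.17) in MATLAB with the values implied by the worked numbers above (s1²/s2² = 1/3, n1 = 20, n2 = 16, 90% confidence; not stated explicitly in the surviving text):

ratio = 1/3; n1 = 20; n2 = 16; alpha = 0.05;
f_hi = icdf('f',1-alpha,n1-1,n2-1);   % about 2.3398
f_lo = icdf('f',alpha,n1-1,n2-1);     % about 0.4476
r_lo = ratio/f_hi                     % about 0.1425
r_hi = ratio/f_lo                     % about 0.7447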
6.5. Problems
We intend to purchase a liquid as a raw material for a material we are designing. Two vendors
offer us samples of their product and a statistic sheet. We run the samples in our own labs and
come up with the following data:
Vendor 1 Vendor 2
sample # outcome sample # outcome
1 2.3 1 2.49
2 2.49 2 1.98
3 2.05 3 2.18
4 2.4 4 2.36
5 2.18 5 2.47
6 2.12 6 2.36
7 2.38 7 1.82
8 2.39 8 1.88
9 2.4 9 1.87
10 2.46 10 1.87
11 2.19
12 2.04
13 2.43
14 2.34
15 2.19
16 2.12
n_1 = 16, \quad \bar{x}_1 = \frac{1}{16}\sum_{i=1}^{16} x_i = 2.280, \quad s_1^2 = \frac{1}{16-1}\sum_{i=1}^{16}\left(x_i - \bar{x}_1\right)^2 = 0.0229, \quad s_1 = 0.1513

n_2 = 10, \quad \bar{x}_2 = \frac{1}{10}\sum_{i=1}^{10} x_i = 2.128, \quad s_2^2 = \frac{1}{10-1}\sum_{i=1}^{10}\left(x_i - \bar{x}_2\right)^2 = 0.0744, \quad s_2 = 0.2728
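These summary statistics can be checked directly in MATLAB (a sketch, not part of the original problem statement); var and std use the (n − 1) denominator of Eq. (6.4):

x1 = [2.3 2.49 2.05 2.4 2.18 2.12 2.38 2.39 2.4 2.46 2.19 2.04 2.43 2.34 2.19 2.12];
x2 = [2.49 1.98 2.18 2.36 2.47 2.36 1.82 1.88 1.87 1.87];
xbar1 = mean(x1)   % 2.280
xbar2 = mean(x2)   % 2.128
s1sq  = var(x1)    % about 0.023
s2sq  = var(x2)    % about 0.074
s1    = std(x1)    % about 0.15
s2    = std(x2)    % about 0.27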
Problem 6.1.
Determine a 95% confidence interval on the mean of sample 1. Use the value of the
population variance given. Is the given population mean legitimate?
Problem 6.2.
Determine a 95% confidence interval on the difference of means between samples 1 and 2.
Use the values of the population variance given. Is the difference between the given population
means legitimate?
Problem 6.3.
Determine a 95% confidence interval on the mean of sample 1. Assume the given values of
the population variances are suspect and not to be trusted. Is the given population mean
legitimate?
Problem 6.4.
Determine a 95% confidence interval on the difference of means between samples 1 and 2.
Assume the given values of the population variances are suspect and not to be trusted. Is the
difference between the given population means legitimate?
Problem 6.5.
Determine a 95% confidence interval on the variance of sample 1. Is the given population
variance legitimate?
Problem 6.6.
Determine a 98% confidence interval on the ratio of variance of samples 1 & 2. Is the ratio of
the given population variances legitimate?