Notes 2022
A sample is a finite representative part of a statistical population whose properties are studied to
gain information about the whole population (Webster, 1985). When dealing with people, it can
be defined as a set of respondents (people) selected from a larger population for the purpose of a
survey.
Sampling is the act, process, or technique of selecting a suitable sample, or a representative part
of a population for the purpose of determining parameters or characteristics of the whole
population.
The main purpose of sampling is to draw conclusions about populations by directly observing a
portion of the population.
Definition 5.1 Target population: The totality of elements which are under discussion and about
which information is desired will be called the target population.
How should a sample be chosen from the target population for the inference to be credible?
A population has a density if we can assume that each element in the population has some
numerical value associated with it and the distribution of these numerical values is given by the
density.
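Definition 5.2 Random sample: Let the random variables \(X_1, X_2, X_3, \ldots, X_n\) have a joint
density \(f(x_1, x_2, \ldots, x_n) = f(x_1) f(x_2) \cdots f(x_n)\), where \(f\) is the common density of each \(X_i\).
Then \(X_1, X_2, \ldots, X_n\) is called a random sample of size n from a population with density \(f\).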
Often it is not possible to select a random sample from the target population, but a random sample
can be selected from some related population.
Valid probability statements can be made about sampled populations based on random samples,
but statements about the target populations are not valid in a relative-frequency probability sense
unless the target population is also the sampled population.
Example 5.1 Suppose that a sociologist desires to study the religious habits of 20-year-old males
in South Africa. He draws a sample from the 20-year-old males of Kimberley to conduct his study.
In this case, the target population is the 20-year-old males in South Africa, and the sampled
population is the 20-year-old males in Kimberley, which he sampled. He can draw valid relative-
frequency-probabilistic conclusions about his sampled population, but he must use his personal
judgment to extrapolate to the target population, and the reliability of the extrapolation cannot be
measured in relative-frequency probability terms.
Example 5.2 A wheat researcher is studying the yield of a certain variety of wheat in Free State.
She has at her disposal five farms scattered throughout Free State on which she can plant the wheat
and observe the yield. The sampled population consists of the yields on these five farms, whereas
the target population consists of the yields of wheat on every farm in Free State.
A central problem in discovering new knowledge in the real world consists of observing a few of
the elements under discussion, and based on these few we make a statement about the totality of
elements. A population distribution is the distribution of all the individual measurements in a
population, and a sample distribution is the distribution of the individual values included in a
sample.
In contrast to such distributions for individual measurements, a sampling distribution refers to the
distribution of different values that a sample statistic, or estimator, would have over many samples
of the same size. Thus, even though we typically would have just one random sample or rational
subgroup, we recognize that the particular sample statistic that we determine, such as the sample
mean or median, is not exactly equal to the respective population parameter.
Further, a sample statistic will vary in value from sample to sample because of random sampling
variability, or sampling error. This is the idea underlying the concept that any sample statistic is in
fact a type of variable whose distribution of values is represented by a sampling distribution.
A point estimator is the numeric value of a sample statistic that is used to estimate the value of a
population or process parameter. One of the most important characteristics of an estimator is that
it be unbiased. An unbiased estimator is a sample statistic whose expected value is equal to the
parameter being estimated.
Advantages of Sampling
5.2.1 Sampling Distribution of Means and the Central Limit Theorem
Suppose that a random sample of n observations is taken from a normal population with mean μ
and variance σ². Each observation \(X_i\), i = 1, 2, . . . , n, of the random sample will then have the
same normal distribution as the population being sampled. Hence, the sample mean defined by
\[ \bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right) \]
has a normal distribution with mean and variance
\[ \mu_{\bar{X}} = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \mu \quad\text{and}\quad \sigma^2_{\bar{X}} = \frac{1}{n^2}\left(\sigma^2 + \sigma^2 + \cdots + \sigma^2\right) = \frac{\sigma^2}{n}. \]
If we are sampling from a population with unknown distribution, either finite or infinite, the
sampling distribution of \(\bar{X}\) will still be approximately normal with mean μ and variance σ²/n, if the
sample size is large. This is an immediate consequence of the following theorem, called the Central Limit
Theorem.
Theorem 5.1 Central Limit Theorem: If \(\bar{X}\) is the mean of a random sample of size n taken from
a population with mean μ and finite variance σ², then the limiting form of the distribution of
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
as n → ∞, is the standard normal distribution N(z; 0, 1).
The normal approximation for \(\bar{X}\) will generally be good if n ≥ 30, provided the population
distribution is not terribly skewed. If n < 30, the approximation is good only if the population is
not too different from a normal distribution and, as stated above, if the population is known to be
normal, the sampling distribution of \(\bar{X}\) will follow a normal distribution exactly, no matter how
small the size of the samples.
The sample size n = 30 is a guideline to use for the Central Limit Theorem. However, as the
statement of the theorem implies, the presumption of normality on the distribution of \(\bar{X}\) becomes
more accurate as n grows larger.
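The behaviour described by the Central Limit Theorem can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population (mean 1, variance 1), the sample size n = 30, the number of repetitions and the seed are all choices made here, not part of the notes. It compares the mean and variance of the simulated sample means with the theoretical values μ and σ²/n.

```python
import random
import statistics

def simulate_sample_means(n, reps, rate=1.0, seed=0):
    """Draw `reps` samples of size `n` from an exponential population
    (mean 1/rate, variance 1/rate**2) and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(reps)
    ]

# Assumed illustration values: skewed population, n = 30, 20000 repetitions.
means = simulate_sample_means(n=30, reps=20000)

# Theory: the sample means should centre on mu = 1 with variance sigma^2 / n = 1/30.
mean_of_means = statistics.fmean(means)
var_of_means = statistics.pvariance(means)
print(round(mean_of_means, 3), round(var_of_means, 4))
```

Even though the exponential population is skewed, the simulated values should sit close to 1 and 1/30 respectively.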
5.2.3 t-Distribution
The applications of the Central Limit Theorem revolve around inferences on a population mean or
the difference between two population means. Use of the Central Limit Theorem and the normal
distribution is certainly helpful in this context. However, it was assumed that the population
standard deviation is known. In many situations, knowledge of σ is certainly no more reasonable
than knowledge of the population mean μ. Often, in fact, an estimate of σ must be supplied by the
same sample information that produced the sample average \(\bar{X}\).
Let \(X_1, X_2, \ldots, X_n\) be independent random variables that are all normal with mean μ and
standard deviation σ. Let
\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad\text{and}\quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2. \]
Then the random variable
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
has a t-distribution with v = n − 1 degrees of freedom.
If the sample size is small, the values of S² fluctuate considerably from sample to sample and the
distribution of T deviates appreciably from that of a standard normal distribution. If the sample
size is large enough, say n ≥ 30, the distribution of T does not differ considerably from the standard
normal. However, for n < 30, it is useful to deal with the exact distribution of T. In developing the
sampling distribution of T, we shall assume that our random sample was selected from a normal
population.
1. The distribution of T is similar to the distribution of Z in that they both are symmetric about
a mean of zero.
2. Both distributions are bell shaped, but the t-distribution is more variable, owing to the fact
that the T-values depend on the fluctuations of \(\bar{X}\) and S², whereas the Z-values depend
only on the changes in \(\bar{X}\) from sample to sample.
3. The variance of T depends on the sample size n and is always greater than 1. When the
sample size n→∞ the two distributions become the same.
Applications of t-Distribution
The t-distribution is used extensively in problems that deal with inference about the population
mean or in problems that involve comparative samples (i.e., in cases where one is trying to
determine if means from two samples are significantly different).
NB
The use of the t-distribution requires that the sample is obtained from a normal distribution.
The use of the t-distribution and the sample size consideration do not relate to the Central
Limit Theorem.
The use of the standard normal distribution rather than T for n ≥ 30 merely implies that S
is a sufficiently good estimator of σ in this case.
Statistical inference consists of those methods by which one makes inferences or generalizations
about a population. The trend today is to distinguish between the classical method of estimating a
population parameter, whereby inferences are based strictly on information obtained from a
random sample selected from the population, and the Bayesian method, which utilizes prior
subjective knowledge about the probability distribution of the unknown parameters in conjunction
with the information provided by the sample data. Throughout this module, we shall use classical
methods to estimate unknown population parameters such as the mean, the proportion, and the
variance by computing statistics from random samples and applying the theory of sampling
distributions.
When choosing an estimator for a population parameter, certain desirable properties need to be
taken into account. These properties include unbiasedness, efficiency, consistency and
sufficiency.
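(a) Unbiased Estimator
Let \(\hat{\Theta}\) be an estimator whose value \(\hat{\theta}\) is a point estimate of some unknown population parameter
θ. If the mean of the sampling distribution of \(\hat{\Theta}\) is equal to the parameter being estimated, then \(\hat{\Theta}\) is
said to be unbiased, that is,
\[ \mu_{\hat{\Theta}} = E(\hat{\Theta}) = \theta. \]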
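Note that
\[ \sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2 = \sum_{i=1}^{n}\left[(X_i - \mu) - (\bar{X} - \mu)\right]^2 \]
\[ = \sum_{i=1}^{n}(X_i - \mu)^2 - 2(\bar{X} - \mu)\sum_{i=1}^{n}(X_i - \mu) + n(\bar{X} - \mu)^2 \]
\[ = \sum_{i=1}^{n}(X_i - \mu)^2 - n(\bar{X} - \mu)^2, \]
since \(\sum_{i=1}^{n}(X_i - \mu) = n(\bar{X} - \mu)\). With
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2, \]
we have
\[ E(s^2) = E\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n}E(X_i - \mu)^2 - nE(\bar{X} - \mu)^2\right] = \frac{1}{n-1}\left(\sum_{i=1}^{n}\sigma^2_{X_i} - n\sigma^2_{\bar{X}}\right). \]
However,
\[ \sigma^2_{X_i} = \sigma^2, \text{ for } i = 1, 2, \ldots, n, \quad\text{and}\quad \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}. \]
Therefore,
\[ E(s^2) = \frac{1}{n-1}\left(\sum_{i=1}^{n}\sigma^2 - n\,\frac{\sigma^2}{n}\right) = \frac{1}{n-1}\left(n\sigma^2 - \sigma^2\right) = \frac{1}{n-1}(n-1)\sigma^2 = \sigma^2. \]
Hence s² is an unbiased estimator of σ².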
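The unbiasedness of the sample variance with divisor n − 1 can also be seen numerically. The following is a minimal sketch, assuming a normal population with μ = 0 and σ = 2 (so σ² = 4), samples of size n = 5 and a fixed seed; all of these values are illustrative choices. It averages the two candidate estimators, with divisors n − 1 and n, over many samples.

```python
import random

def average_variance_estimates(n=5, reps=20000, mu=0.0, sigma=2.0, seed=1):
    """Average the divisor-(n-1) and divisor-n variance estimates over
    `reps` normal samples of size `n`."""
    rng = random.Random(seed)
    sum_unbiased = 0.0
    sum_biased = 0.0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
        sum_unbiased += ss / (n - 1)               # s^2, divisor n - 1
        sum_biased += ss / n                       # divisor n
    return sum_unbiased / reps, sum_biased / reps

avg_s2, avg_biased = average_variance_estimates()
# Theory: E(s^2) = sigma^2 = 4, while the divisor-n version has
# expectation (n - 1) / n * sigma^2 = 3.2.
print(round(avg_s2, 2), round(avg_biased, 2))
```

The first average should be close to 4 and the second close to 3.2, which is the systematic underestimation that the divisor n − 1 removes.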
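If \(\hat{\Theta}_1\) and \(\hat{\Theta}_2\) are two unbiased estimators of the same population parameter θ, we want to choose
the estimator whose sampling distribution has the smaller variance.
Definition 5.5. Let \(\hat{\Theta}_1\) and \(\hat{\Theta}_2\) be two unbiased estimators of the population parameter θ. If
\[ \mathrm{Var}(\hat{\Theta}_1) < \mathrm{Var}(\hat{\Theta}_2), \]
then \(\hat{\Theta}_1\) is said to be more efficient than \(\hat{\Theta}_2\). Also, the relative efficiency of \(\hat{\Theta}_1\) with respect to \(\hat{\Theta}_2\)
is given by
\[ \frac{\mathrm{Var}(\hat{\Theta}_2)}{\mathrm{Var}(\hat{\Theta}_1)}. \]
If the relative efficiency is more than 1, then \(\hat{\Theta}_1\) is more efficient; if it is less than 1, \(\hat{\Theta}_2\) is more efficient.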
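For estimators that are weighted sums of independent observations with common variance σ², the variance is σ² times the sum of the squared weights, so relative efficiency reduces to a ratio of those sums. The sketch below illustrates this with two hypothetical estimators of the mean from a sample of size 2 (the weights are chosen here for illustration and do not come from the notes); the function names are likewise assumptions.

```python
from fractions import Fraction

def is_unbiased(weights):
    """A weighted sum of observations is unbiased for the mean
    exactly when the weights add up to 1."""
    return sum(Fraction(w) for w in weights) == 1

def variance_factor(weights):
    """Var(sum a_i X_i) / sigma^2 for independent X_i with common variance."""
    return sum(Fraction(w) ** 2 for w in weights)

def relative_efficiency(weights_1, weights_2):
    """Relative efficiency of estimator 1 with respect to estimator 2:
    Var(estimator 2) / Var(estimator 1)."""
    return variance_factor(weights_2) / variance_factor(weights_1)

# Hypothetical estimators: the sample mean and an unequally weighted average.
w_mean = [Fraction(1, 2), Fraction(1, 2)]
w_other = [Fraction(1, 4), Fraction(3, 4)]

eff = relative_efficiency(w_mean, w_other)
print(is_unbiased(w_mean), is_unbiased(w_other), eff)
```

Here the ratio exceeds 1, so the equally weighted sample mean is the more efficient of the two unbiased estimators.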
Example 5.2:
Let \(X_1, X_2, X_3\) be a sample of size n = 3 from a distribution with unknown mean μ,
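Homework.
Let \(X_1, X_2, X_3\) be a random sample from the normal distribution where both the mean and the
variance are unknown. Consider the estimators
\[ \hat{\Theta}_1 = \frac{1}{4}X_1 + \frac{1}{2}X_2 + \frac{1}{4}X_3 \quad\text{and}\quad \hat{\Theta}_2 = \frac{1}{3}\left(X_1 + X_2 + X_3\right). \]
(a) Which of the two estimators is more efficient?
(b) Calculate the relative efficiency of \(\hat{\Theta}_2\) with respect to \(\hat{\Theta}_1\) and comment on it.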
We discuss the estimation of population means for one sample when the population variance is
either known or unknown. If our sample is selected from a normal population or, if n is sufficiently
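large, we can establish a confidence interval for μ by considering the sampling distribution of \(\bar{X}\).
From the Central Limit Theorem, we can expect the sampling distribution of \(\bar{X}\) to be
approximately normally distributed with mean \(\mu_{\bar{X}} = \mu\) and variance \(\sigma^2_{\bar{X}} = \sigma^2/n\).
Writing
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}, \]
we have
\[ P\left(-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2}\right) = 1 - \alpha, \]
which rearranges to
\[ P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha. \]
In constructing the 100(1 − α)% confidence interval for the mean, we may come across two main
situations. The first situation is when the population variance is known and the second situation is
when the population variance is unknown.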
(a) Situation 1: Confidence interval for the population mean when σ is known
When the population variance is known we use the normal distribution in the following two
cases. In both cases, a 100(1 − α)% confidence interval for μ is given by \(\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}\) and the
corresponding \(z_{\alpha/2}\) values are read from the standard normal tables.
Case 2: We use normal distribution by invoking the central limit theorem when the
Example 5.3
Suppose the weights in kg of 5 newly born babies randomly selected from a maternity clinic are 3,
4, 5, 5, 8. Find a 95% confidence interval for the mean weight of all the newly born babies at the
clinic if the population variance of the weights of the babies is 2 kg². Assume the weights of the
babies are approximately normal.
Solution
In this example, the population variance is known, the sample size is small and the population is
approximately normally distributed, hence we make use of the formula \(\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}\).
Here \(\sigma = \sqrt{2}\), \(\bar{x} = (3 + 4 + 5 + 5 + 8)/5 = 5\), n = 5 and α = 0.05.
\(z_{0.05/2} = z_{0.025}\); the z value corresponding to a cumulative probability of 1 − 0.025 = 0.975 in the normal table is 1.96.
Therefore
\[ 5 \pm 1.96\,\frac{\sqrt{2}}{\sqrt{5}} = 5 \pm 1.239 \;\Rightarrow\; 3.761 < \mu < 6.239. \]
Thus, we are 95% confident that the true mean is between 3.761 and 6.239.
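The calculation in Example 5.3 can be reproduced with a short script. This is a sketch rather than a prescribed method: the function name is an assumption, and the critical value is taken from Python's `statistics.NormalDist` instead of a printed table, so the last digit may differ slightly from a hand calculation with z = 1.96.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_confidence_interval(data, sigma, confidence=0.95):
    """Confidence interval for the mean when the population standard
    deviation `sigma` is known: xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    n = len(data)
    xbar = fmean(data)
    # Upper alpha/2 critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Example 5.3: population variance 2, so sigma = sqrt(2).
low, high = z_confidence_interval([3, 4, 5, 5, 8], sigma=sqrt(2))
print(round(low, 3), round(high, 3))
```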
Example 5.4
Scholastic Aptitude Test (SAT) mathematics scores of a random sample of 300 high school seniors
in the state of Texas are collected, and the sample mean is found to be 401. If the population
standard deviation is 112, obtain a 95% confidence interval on the mean SAT mathematics score
for seniors in the state of Texas.
There are two cases involved when the population variance is unknown. In case 1 a 100(1 − α)%
confidence interval for μ is given by \(\bar{x} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}\), while in case 2 the confidence interval is
given by either \(\bar{x} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}\) or \(\bar{x} \pm z_{\alpha/2}\, s/\sqrt{n}\). The \(t_{\alpha/2,\,n-1}\) value is read from the Student t-table.
NB: The central limit theorem may also be invoked here, and it is expected that the two methods
will yield approximately the same results as the sample size increases.
Example 5.5
The contents of seven similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and
9.6 liters. Find a 95% confidence interval for the mean contents of all such containers, assuming
an approximately normal distribution.
Solution
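The sample mean and standard deviation for the given data are \(\bar{x} = 10.0\) and s = 0.283.
From the t-table, \(t_{0.025} = 2.447\) for v = n − 1 = 6 degrees of freedom. Hence, the 95% confidence interval for μ is
\[ 10.0 \pm 2.447\,\frac{0.283}{\sqrt{7}}, \]
that is, 9.74 < μ < 10.26.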
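The same interval can be computed from the raw data of Example 5.5. Python's standard library has no t-distribution quantile function, so in this sketch the critical value is passed in as a parameter read from the t-table (2.447 for 6 degrees of freedom); the function name is an assumption.

```python
from math import sqrt
from statistics import fmean, stdev

def t_confidence_interval(data, t_crit):
    """Confidence interval for the mean with unknown variance:
    xbar +/- t_crit * s / sqrt(n), where t_crit = t_{alpha/2, n-1}
    is supplied by the caller from a t-table."""
    n = len(data)
    xbar = fmean(data)
    s = stdev(data)  # sample standard deviation, divisor n - 1
    margin = t_crit * s / sqrt(n)
    return xbar - margin, xbar + margin

# Example 5.5: seven containers, t_{0.025} = 2.447 with 6 degrees of freedom.
contents = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]
low, high = t_confidence_interval(contents, t_crit=2.447)
print(round(low, 2), round(high, 2))
```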
Example 5.6
Scholastic Aptitude Test (SAT) mathematics scores of a random sample of 500 high school seniors
in the state of Texas are collected, and the sample mean and standard deviation are found to be
501 and 112, respectively. Find a 99% confidence interval on the mean SAT mathematics score
for seniors in the state of Texas. Hint: The population variance is unknown and the distribution is
unknown, thus invoke the central limit theorem and replace σ by s.
There are many applications in which only a one-sided bound is desired. For example, if the
measurement of interest is tensile strength, the engineer receives better information from a lower
bound only. This bound communicates the worst-case scenario. On the other hand, if the
measurement is something for which a relatively large value of μ is not profitable or desirable,
then an upper confidence bound is of interest. An example would be a case in which inferences
need to be made concerning the mean mercury composition in a river. An upper bound is very
informative in this case.
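A one-sided upper confidence bound for a normal distribution is given by \(\bar{x} + z_{\alpha}\,\sigma/\sqrt{n}\), while the
lower confidence bound is \(\bar{x} - z_{\alpha}\,\sigma/\sqrt{n}\). For the t-distribution, the bounds are respectively
\(\bar{x} + t_{\alpha,\,n-1}\, s/\sqrt{n}\) and \(\bar{x} - t_{\alpha,\,n-1}\, s/\sqrt{n}\).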
5.7 Determining the sample size for the estimation of the mean
We now see that a longer interval is required to estimate μ with a higher degree of confidence. The
100(1 − α)% confidence interval provides an estimate of the accuracy of our point estimate. If μ
is actually the center value of the interval, then \(\bar{x}\) estimates μ without error. Most of the time,
however, \(\bar{x}\) will not be exactly equal to μ and the point estimate will be in error. The size of this
error will be the absolute value of the difference between μ and \(\bar{x}\), and we can be 100(1 − α)%
confident that this difference will not exceed \(z_{\alpha/2}\,\sigma/\sqrt{n}\). Hence we can choose n such that \(\varepsilon = z_{\alpha/2}\,\sigma/\sqrt{n}\).
Let \(X_1, X_2, \ldots, X_n\) be a random sample of size n from the normal distribution and let \(\bar{x}\) and s be the
sample mean and standard deviation respectively. Then the minimum sample size required for \(\bar{x}\) to be
within ε of the true mean with 100(1 − α)% confidence is
\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{\varepsilon}\right)^2, \]
rounded up to the next whole number.
Strictly speaking, the formula is applicable only if we know the variance of the population from
which we select our sample. If we do not know the variance, we could take a preliminary sample
of size n ≥ 30 to provide an estimate of σ. Then, using s as an approximation for σ, we could
determine approximately how many observations are needed to provide the desired degree of
accuracy.
Example 5.7
How large a sample is required if we want to be 95% confident that our estimate of μ in Example
9.2 is off by less than 0.05?
Solution
\[ n = \left(\frac{z_{\alpha/2}\,\sigma}{\varepsilon}\right)^2 = \left(\frac{(1.96)(0.3)}{0.05}\right)^2 = 138.3. \]
Therefore, we can be 95% confident that a random sample of size 139 will provide an estimate \(\bar{x}\)
differing from μ by an amount less than 0.05.
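The sample-size rule is easy to wrap in a function. The sketch below assumes σ (or an estimate of it) is supplied by the user and obtains the critical value from `statistics.NormalDist`; the function name is illustrative. The result is rounded up, since a fractional sample size is not possible.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, epsilon, confidence=0.95):
    """Smallest n such that the error of xbar is at most `epsilon`
    with the given confidence: n = (z_{alpha/2} * sigma / epsilon)^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / epsilon) ** 2)

# Values from Example 5.7: sigma = 0.3, epsilon = 0.05, 95% confidence.
n_required = sample_size_for_mean(sigma=0.3, epsilon=0.05, confidence=0.95)
print(n_required)
```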
Example 5.8
A soft drink machine is regulated so that the amount of drink dispensed is approximately normally
distributed with a mean of μ and a variance of σ². If a random sample of selected drinks yields a
standard deviation of 0.3, what size of the sample is needed to be:
i) 95% confident that the sample mean will be within 0.02 of the true mean?
ii) 90% confident that the sample mean will be within 0.01 of the true mean?
Exercise
2. The height (ft.) of a random sample of 64 college students produced a mean height of
4.4583ft and sample standard deviation of 1.889. Construct a 95%
(a) Confidence interval for the mean height.
(b) Lower bound for the mean height.
3. Consider a random sample 2.4; 1.6; 3.8; 4.0; 5.2 and 2.0 from a normal distribution.
Construct a 90% confidence interval for the mean in each of the following cases:
(a) Variance is 1.44
(b) Variance is unknown
4. An efficiency expert wishes to determine the average time that it takes to drill three holes
in a certain metal clamp. How large a sample will she need to be 95% confident that her
sample mean will be within 15 seconds of the true mean? Assume that it is known from
previous studies that σ = 40 seconds.
5. A machine produces metal pieces that are cylindrical in shape. A sample of pieces is taken,
and the diameters are found to be 1.01, 0.97, 1.03, 1.04, 0.99, 0.98, 0.99, 1.01, and 1.03
centimeters. Find a 99% confidence interval for the mean diameter of pieces from this
machine, assuming an approximately normal distribution.
Sometimes, it is necessary to predict the possible value of a future observation. For instance,
In quality control, the experimenter may need to use the observed data to predict a new
observation.
A weather forecaster may be interested in predicting tomorrow’s temperature.
In these cases, a confidence interval on the mean (for example, the mean tensile strength or the mean temperature) does
not capture the required information. These types of requirement are addressed by constructing a
prediction interval.
A prediction interval provides a good estimate of the location of a future observation, which is quite
different from the estimate of the sample mean value. The variation of the prediction interval is
the sum of the variation due to an estimation of the mean and the variation of a single observation.
One important use of a prediction interval is the detection of outlying observations. The majority
of scientific investigators are keenly sensitive to the existence of outlying observations or so-called
faulty or “bad data.” An outlying observation can be viewed as an observation that comes from a
population with a mean that is different from the mean that governs the rest of the sample of size
n being studied. The prediction interval produces a bound that “covers” a future single observation
with probability 1 − α if it comes from the population from which the sample was drawn. As a
result, a methodology for outlier detection involves the rule that an observation is an outlier if it
falls outside the prediction interval computed without including the questionable observation in
the sample.
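Here \(x_0\) is the new observation and \(\bar{x}\) comes from the sample. Since \(x_0\) and \(\bar{x}\) are independent, we know
that
\[ z = \frac{x_0 - \bar{x}}{\sqrt{\sigma^2 + \sigma^2/n}} = \frac{x_0 - \bar{x}}{\sigma\sqrt{1 + 1/n}} \]
is n(z; 0, 1).
Recall that
\[ P\left(-z_{\alpha/2} < z < z_{\alpha/2}\right) = 1 - \alpha. \]
Therefore, for a normal distribution of measurements with unknown mean μ and known variance
σ², a 100(1 − α)% prediction interval of a future observation \(x_0\) is
\[ \bar{x} - z_{\alpha/2}\,\sigma\sqrt{1 + 1/n} < x_0 < \bar{x} + z_{\alpha/2}\,\sigma\sqrt{1 + 1/n}. \]
Similarly, for a normal distribution of measurements with unknown mean μ and unknown variance
σ², a 100(1 − α)% prediction interval of a future observation \(x_0\) is
\[ \bar{x} - t_{\alpha/2}\, s\sqrt{1 + 1/n} < x_0 < \bar{x} + t_{\alpha/2}\, s\sqrt{1 + 1/n}, \]
where \(t_{\alpha/2}\) is the t-value with v = n − 1 degrees of freedom.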
Example 5.9
Due to the decrease in interest rates, the First Citizens Bank received a lot of mortgage applications.
A recent sample of 50 mortgage loans resulted in an average loan amount of $257,300. Assume a
population standard deviation of $25,000.
(a) For the next customer who fills out a mortgage application, find a 95% prediction interval
for the loan amount.
(b) Determine whether it would be abnormal if the next customer who fills out a mortgage application
obtains a loan amount of $200,700.
Solution
(a) The point prediction of the next customer’s loan amount is x̄ = $257,300. The z-value here
is z0.025 = 1.96. Hence, a 95% prediction interval for the future loan amount is
\[ 257{,}300 - (1.96)(25{,}000)\sqrt{1 + 1/50} < x_0 < 257{,}300 + (1.96)(25{,}000)\sqrt{1 + 1/50}, \]
which gives the interval ($207,812.43, $306,787.57).
(b) A loan amount of $200,700 would be abnormal because it falls outside the prediction interval.
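The prediction interval of Example 5.9 and the associated outlier rule can be sketched in a few lines. The function name is an assumption, and the critical value comes from `statistics.NormalDist` rather than the rounded table value 1.96, so the endpoints differ from a hand calculation by about a dollar.

```python
from math import sqrt
from statistics import NormalDist

def z_prediction_interval(xbar, sigma, n, confidence=0.95):
    """Prediction interval for one future observation when sigma is known:
    xbar +/- z_{alpha/2} * sigma * sqrt(1 + 1/n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma * sqrt(1 + 1 / n)
    return xbar - margin, xbar + margin

# Example 5.9: n = 50 loans, mean 257300, population sd 25000.
low, high = z_prediction_interval(xbar=257300, sigma=25000, n=50)

# Outlier rule: a new observation outside the interval is flagged as abnormal.
new_loan = 200700
is_outlier = not (low < new_loan < high)
print(round(low, 2), round(high, 2), is_outlier)
```

Because 200,700 lies below the lower endpoint, the rule flags that loan amount as an outlier.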
Example 5.10
A meat inspector has randomly selected 30 packs of 95% lean beef. The sample resulted in a mean
of 96.2% with a sample standard deviation of 0.8%. Find a 99% prediction interval for the leanness
of a new pack. Assume that the distribution of the meat packs is normal.
Solution
For v = 29 degrees of freedom, t0.005 = 2.756 . Hence, a 99% prediction interval for a new
observation is
\[ 96.2 - (2.756)(0.8)\sqrt{1 + 1/30} < x_0 < 96.2 + (2.756)(0.8)\sqrt{1 + 1/30}. \]
Sometimes, one is interested in determining bounds that in some probabilistic sense “cover” values
in the population (i.e., the measured values of the dimension). One method of establishing the
desired bounds is to determine a confidence interval on a fixed proportion of the measurements. A
bound that covers the middle 95% of the population of observations is μ ± 1.96σ. This is called a
tolerance interval, and indeed its coverage of 95% of measured observations is exact. However, in
practice, μ and σ are seldom known; thus, the user must apply
\[ \bar{x} \pm ks. \]
The interval is a random variable, and hence the coverage of a proportion of the population by the
interval is not exact. As a result, a 100(1 − γ)% confidence interval must be used since \(\bar{x} \pm ks\)
cannot be expected to cover any specified proportion all the time. For a normal distribution of
measurements with unknown mean μ and unknown standard deviation σ, tolerance limits are
given by \(\bar{x} \pm ks\), where k is determined such that one can assert with 100(1 − γ)% confidence
that the given limits contain at least the proportion 1 − α of the measurements.
Example 5.11
A machine produces metal pieces that are cylindrical in shape. Assuming that the diameters are
approximately normally distributed, find the 99% tolerance limits that will contain 95% of the
metal pieces produced by this machine if a random sample of 9 selected metal pieces has a mean
and standard deviation of \(\bar{x} = 1.0056\) and s = 0.0246 respectively.
Solution
For n = 9, 1 − γ = 0.99 and 1 − α = 0.95, we find k = 4.550 for two-sided limits. Hence, the 99%
tolerance limits are given by
\[ \bar{x} \pm ks = 1.0056 \pm (4.550)(0.0246), \]
with the bounds being 0.8937 and 1.1175. We are 99% confident that the tolerance interval from
0.8937 to 1.1175 will contain the central 95% of the distribution of diameters produced.
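Once the tolerance factor k has been read from a table, the limits are a one-line calculation. The sketch below assumes k is supplied by the user (the standard library does not provide tolerance factors) and uses the values quoted in Example 5.11; the function name is illustrative.

```python
def tolerance_limits(xbar, s, k):
    """Two-sided normal tolerance limits xbar +/- k * s, where the
    tolerance factor k is looked up in a table for the chosen
    n, confidence 1 - gamma and coverage 1 - alpha."""
    return xbar - k * s, xbar + k * s

# Example 5.11: xbar = 1.0056, s = 0.0246, table value k = 4.550.
low, high = tolerance_limits(xbar=1.0056, s=0.0246, k=4.550)
print(round(low, 4), round(high, 4))
```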
Exercise (Tutorials)
1. The following measurements were recorded for the drying time, in hours, of a certain brand
of latex paint:
3.4; 2.5; 4.8; 2.9; 3.6; 2.8; 3.3; 5.6; 3.7; 2.8; 4.4; 4.0; 5.2; 3.0; 4.8.
If the measurements represent a random sample from a normal population, find a 95%
prediction interval for the drying time for the next trial of the paint.
2. A random sample of 100 automobile owners in the state of Virginia shows that an
automobile is driven on average 23,500 kilometers per year with a standard deviation of
3900 kilometers. Assuming the distribution of measurements to be approximately normal,
construct a 99% prediction interval for the kilometers traveled annually by an automobile
owner in Virginia.
3. A random sample of 12 graduates of a certain secretarial school typed an average of 79.3
words per minute with a standard deviation of 7.8 words per minute. Assuming a normal
distribution for the number of words typed per minute, find a 95% prediction interval for
the number of words per minute typed by the next graduate of this school.
4. A random sample of 12 shearing pins is taken in a study of the Rockwell hardness of the
pin head. Measurements on the Rockwell hardness are made for each of the 12, yielding
an average value of 48.50 with a sample standard deviation of 1.5. Assuming the
measurements to be normally distributed, construct a 95% tolerance interval containing
90% of the measurements.
5. A random sample of 25 tablets of buffered aspirin contains, on average, 325.05 mg of
aspirin per tablet, with a standard deviation of 0.5 mg. Find the 95% tolerance limits that
will contain 95% of the tablet contents for this brand of buffered aspirin. Assume that the
aspirin content is normally distributed.
6. The following data are diameter measurements from a sample of metal pieces: 1.01, 0.97,
1.03, 1.04, 0.99, 0.98, 1.01, 1.03, 0.99, 1.00, 1.00, 0.99, 0.98, 1.01, 1.02, 0.99 centimeters.
If the sample is drawn from a normal distribution, construct the following intervals and
interpret your results to those of the case study.
(a) Compute a 99% confidence interval on the mean diameter.
(b) Compute a 99% prediction interval on the next diameter to be measured.
(c) Compute a 99% tolerance interval for coverage of the central 95% of the distribution
of diameters.
5.9 Confidence Interval for a proportion
Let p be the proportion of individuals with a certain property P, for example, the proportion of
students, adults, etc. Suppose that a random sample of size n is drawn from such a population and
let k be the number of individuals in the sample with property P. Then a 100(1 − α)%
confidence interval for p is
\[ \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \]
where \(\hat{p} = k/n\) is the sample proportion and \(z_{\alpha/2}\) is the 100(1 − α/2)th percentile of the standard normal
distribution. NB: It is always assumed that the sample size is large since we are invoking the central
limit theorem here.
Example 5.12
In a random sample of n = 500 families owning television sets in the city, it is found that x = 340
subscribe to ETV. Find a 95% confidence interval for the actual proportion of families with
television sets in this city that subscribe to ETV.
Solution
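The point estimate of p is \(\hat{p} = 340/500 = 0.68\). Recall that \(z_{0.025} = 1.96\). Therefore, a 95%
confidence interval for p is
\[ 0.68 \pm 1.96\sqrt{\frac{(0.68)(0.32)}{500}}, \]
which simplifies to 0.6391 < p < 0.7209.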
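The large-sample interval for a proportion can be checked numerically with the data of Example 5.12 (340 subscribers out of 500 families). This is a minimal sketch; the function name is an assumption and the critical value is taken from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(k, n, confidence=0.95):
    """Large-sample confidence interval for a proportion:
    p_hat +/- z_{alpha/2} * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = k / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Example 5.12: 340 of 500 sampled families subscribe.
low, high = proportion_confidence_interval(k=340, n=500)
print(round(low, 4), round(high, 4))
```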
Example 5.13
A new rocket-launching system is being considered for deployment of small, short-range missiles.
The existing system has a success probability of 0.8. A sample of 40 experimental launches is
made with the new system and 34 are successful.
(a) Construct a 90% confidence interval for the success probability
(b) Based on the 90% confidence interval, would you conclude that the new system is
better?
If \(\hat{p}\) is used as an estimate of p, we can be 100(1 − α)% confident that the error will be less than
a specified amount ε when the sample size is approximately
\[ n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{\varepsilon^2}. \]
If a crude estimate of p can be made without taking a sample, this value can be used to determine
n. However, if this is not possible, we could take a preliminary sample of size n ≥ 30 to provide an
estimate of p. Using the above formula, we could determine approximately how many
observations are needed to provide the desired degree of accuracy.
Example 5.14
A random sample of families owning television sets in the city is to be selected such that we are
95% confident that our estimate of the proportion of families with television sets that subscribe to
ETV is within 0.02 of the true value. How large should the sample be?
Solution
Let us treat the 500 families as a preliminary sample, providing an estimate \(\hat{p} = 0.68\). Then,
\[ n = \frac{(1.96)^2(0.68)(0.32)}{(0.02)^2} = 2089.8 \approx 2090. \]
Sometimes, it is not practical to obtain an estimate of p to be used for determining the sample size
for a specified degree of confidence. If this happens, an upper bound for n is established by noting
that \(\hat{p}\hat{q} = \hat{p}(1 - \hat{p})\), which must be at most 1/4, since \(\hat{p}\) must lie between 0 and 1. In such a
case, the sample size is
\[ n = \frac{z_{\alpha/2}^2}{4\varepsilon^2}. \]
Example 5.15
How large a sample is required if we want to be at least 95% confident that our estimate of p in
Example 6.10 is within 0.02 of the true value?
Solution
Substituting the relevant data into the formula \(n = z_{\alpha/2}^2/(4\varepsilon^2)\), we obtain
\[ n = \frac{(1.96)^2}{4(0.02)^2} = 2401. \]
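Both sample-size formulas for a proportion can be combined in one function: with a preliminary estimate of p it uses \(\hat{p}(1 - \hat{p})\), and without one it falls back on the conservative bound 1/4. The function and parameter names below are assumptions made for this sketch.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(epsilon, confidence=0.95, p_guess=None):
    """Sample size so that p_hat is within `epsilon` of p.
    With a preliminary estimate `p_guess`: n = z^2 * p(1 - p) / epsilon^2.
    Without one, use the upper bound p(1 - p) <= 1/4: n = z^2 / (4 epsilon^2)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    pq = 0.25 if p_guess is None else p_guess * (1 - p_guess)
    return ceil(z ** 2 * pq / epsilon ** 2)

n_with_estimate = sample_size_for_proportion(0.02, p_guess=0.68)
n_conservative = sample_size_for_proportion(0.02)
print(n_with_estimate, n_conservative)
```

The conservative value is never smaller than the one based on a preliminary estimate, which is the price paid for not knowing p in advance.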
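If a sample of size n is drawn from a normal population with variance σ² and the sample variance
s² is computed, we obtain a value of the statistic S². This computed sample variance is used as a
point estimate of σ². Hence, the statistic S² is called an estimator of σ². An interval estimate of σ²
can be established by using the statistic
\[ \chi^2 = \frac{(n - 1)S^2}{\sigma^2}, \]
which has a chi-squared distribution with v = n − 1 degrees of freedom when the sample comes from a
normal population. A 100(1 − α)% confidence interval for σ² is
\[ \frac{(n - 1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}, \]
where \(\chi^2_{\alpha/2}\) and \(\chi^2_{1-\alpha/2}\) are χ² values with v = n − 1 degrees of freedom, leaving areas of α/2 and
1 − α/2, respectively, to the right. An approximate 100(1 − α)% confidence interval for σ is
obtained by taking the square root of each endpoint of the interval for σ².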
Example 5.16
The following are the weights, in decagrams, of 10 packages of grass seed distributed by a certain
company: 46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, and 46.0. Find a 95% confidence
interval for the variance of the weights of all such packages of grass seed distributed by this
company, assuming a normal population.
Solution
Note that
\[ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \]
hence s² = 0.286.
To obtain a 95% confidence interval, we choose α = 0.05. Then, using the Chi-square Table with
v = 9 degrees of freedom, we find \(\chi^2_{0.025} = 19.023\) and \(\chi^2_{0.975} = 2.700\). Therefore, the 95%
confidence interval for σ² is
\[ \frac{(10 - 1)(0.286)}{19.023} < \sigma^2 < \frac{(10 - 1)(0.286)}{2.700}, \]
which implies
0.135 < σ² < 0.953.
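The interval in Example 5.16 can be recomputed from the raw data. Python's standard library has no chi-squared quantile function, so this sketch takes the two table values as arguments; the function and parameter names are assumptions. Because the script uses the unrounded s², the upper endpoint comes out near 0.954 rather than the 0.953 obtained from the rounded value 0.286.

```python
from statistics import variance

def variance_confidence_interval(data, chi2_upper, chi2_lower):
    """Confidence interval for sigma^2 from a normal sample:
    (n-1)s^2 / chi2_upper < sigma^2 < (n-1)s^2 / chi2_lower,
    where chi2_upper = chi^2_{alpha/2} and chi2_lower = chi^2_{1-alpha/2}
    are read from a chi-squared table with n - 1 degrees of freedom."""
    n = len(data)
    s2 = variance(data)  # sample variance, divisor n - 1
    return (n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower

# Example 5.16: ten package weights, table values for 9 degrees of freedom.
weights = [46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, 46.0]
low, high = variance_confidence_interval(weights, chi2_upper=19.023, chi2_lower=2.700)
print(round(variance(weights), 3), round(low, 3), round(high, 3))
```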
Exercise
1. A machine is used to fill boxes with product in an assembly line operation. Much concern
centers around the variability in the number of ounces of product in a box. The standard
deviation in weight of product is known to be 0.3 ounce. An improvement is implemented,
after which a random sample of 20 boxes is selected and the sample variance is found to
be 0.045 ounce². Find a 95% confidence interval on the variance in the weight of the
product. Does it appear from the range of the confidence interval that the improvement of
the process enhanced quality as far as variability is concerned? Assume normality on the
distribution of weights of product.
2. A manufacturer of car batteries claims that the batteries will last, on average, 3 years with
a variance of 1 year. If 5 of these batteries have lifetimes of 1.9, 2.4, 3.0, 3.5, and 4.2 years,
construct a 95% confidence interval for σ² and decide if the manufacturer’s claim that
σ² = 1 is valid. Assume the population of battery lives to be approximately normally
distributed.
3. A random sample of 20 students yielded a mean of 72 and a variance of 16 for scores on a
college placement test in mathematics. Assuming the scores to be normally distributed,
construct a 98% confidence interval for σ².
CHAPTER 6
A statistical hypothesis is an assertion or conjecture concerning one or more populations. The truth
or falsity of a statistical hypothesis is never known with absolute certainty unless we examine the
entire population. This, of course, would be impractical in most situations. Instead, we take a
random sample from the population of interest and use the data contained in this sample to provide
evidence that either supports or does not support the hypothesis. Evidence from the sample that is
inconsistent with the stated hypothesis leads to a rejection of the hypothesis.
In hypothesis testing, the hypothesis we wish to test is called the null hypothesis and is denoted
by H0. The alternative hypothesis H1 usually represents the question to be answered or the theory
to be tested. The rejection of H0 leads to the acceptance of an alternative hypothesis.
The null hypothesis H0 nullifies or opposes H1 and is often the logical complement to H1. The
following are the plausible conclusions made in hypothesis testing:
Reject H0 in favor of H1 because of sufficient evidence in the data.
Fail to reject H0 because of insufficient evidence in the data.
Note that the conclusions do not involve a formal and literal “accept H0”. The statement of H0
often represents the “status quo” in opposition to the new idea, conjecture, and so on, stated in
H1, while failure to reject H0 represents the proper conclusion.
Example
Consider a person who is accused of a crime and has been brought before a jury for trial. The status quo
is that the accused is not guilty until proven otherwise, thus the trial involves the following
hypotheses:
H0: the accused is not guilty
H1: the accused is guilty
The accusation comes because of suspicion of guilt. The hypothesis H0 (the status quo) stands in
opposition to H1 and is maintained unless H1 is supported by evidence “beyond a reasonable
doubt.” However, “failure to reject H0 ” in this case does not imply innocence, but merely that the
evidence was insufficient for conviction. Therefore, the jury does not necessarily accept H0 but
fails to reject H0.
A hypothesis test is characterized by its alternative hypothesis, since the null hypothesis is often
the same. Depending on the way the alternative hypothesis is stated, a test can be either one-tailed or
two-tailed. In a one-tailed test the rejection region lies in only one tail of the distribution,
while in a two-tailed test it is split between both tails.
Definition 6.1 A test of any statistical hypothesis where the alternative is one-sided, that is
H0: θ = θ0 against H1: θ > θ0, or H0: θ = θ0 against H1: θ < θ0, is called a one-tailed test.
Generally, the critical region for the alternative hypothesis θ > θ0 lies in the right tail of the
distribution of the test statistic, while the critical region for the alternative hypothesis θ < θ0 lies
entirely in the left tail. (In a sense, the inequality symbol points in the direction of the critical
region.)
Definition 6.2 The critical region is the area(s) of the sampling distribution of a
statistic that will lead to the rejection of the hypothesis tested when that hypothesis is true. The
critical region is also called the rejection region.
Definition 6.3 A test of any statistical hypothesis where the alternative is two-sided, such as
H0: θ = θ0 against H1: θ ≠ θ0, is called a two-tailed test.
The critical region of a two-tailed test is split into two parts, often having equal probabilities, in
each tail of the distribution of the test statistic. The alternative hypothesis θ ≠ θ0 states that either
θ < θ0 or θ > θ0.
Examples 6.1
A manufacturer of a certain brand of rice cereal claims that the average saturated fat content does
not exceed 1.5 grams per serving. State the null and alternative hypotheses to be used in testing
this claim and determine where the critical region is located.
Solution
The claim implies that μ ≤ 1.5; thus, the claim can only be rejected if μ > 1.5. We therefore test the
hypothesis H0: μ = 1.5 against H1: μ > 1.5. This is a one-tailed test whose critical region lies entirely
in the right tail of the distribution of the sample mean, i.e. the claim is rejected when the sample
mean is sufficiently greater than 1.5.
Examples 6.2
A real estate agent claims that 60% of all private residences being built today are 3-bedroom
homes. To test this claim, a large sample of new residences is inspected; the proportion of these
homes with 3 bedrooms is recorded and used as the test statistic. State the null and alternative
hypotheses to be used in this test and determine the location of the critical region.
Consider the case of a judicial trial. Sometimes an innocent person is jailed because of a false witness or
false evidence, and this leads to rejecting H0 when in fact it should not be rejected (H0 is true). In such
cases, the jury commits an error by rejecting H0 (sentencing an innocent person) when in
fact he should be acquitted and discharged (H0 is true). Such an error is called a type I error,
also known as a false positive.
Definition 6.4: Rejection of the null hypothesis when it is true is called a type I error.
The probability of committing a type I error is also called the level of significance or size of the
test, and it is denoted by α, i.e. α = P(type I error) = P(reject H0 | H0 is true).
If the size of the critical region is small, it is unlikely that a type I error will be committed. For
example, a critical region of size 0.0409 is very small, and therefore it is unlikely that a type I
error will be committed. In the case of the jury, α = 0.0409 would imply that it is very unlikely
that the jury will sentence an innocent person to imprisonment.
Sometimes, due to lack of evidence, a guilty person is let go without punishment. In such cases,
the jury fails to sentence a guilty person (fails to reject H0) although he is in fact guilty (H0 is false).
This type of mistake is called a type II error, also known as a false negative.
Definition 6.5: Non-rejection of the null hypothesis when it is false is called a type II error.
The probability of committing a type II error, denoted by β, is impossible to compute unless we have a
specific alternative hypothesis. The smaller the probability of committing a type II error, the higher the
power of the test.
Definition 6.6: The power of a test is the probability of rejecting H0 given that a specific alternative
is true. Mathematically, the power of a test is 1 − β.
Power is the probability of correctly rejecting H0 when it is indeed false and H1 is true. Ideally, we like to use
a test procedure for which the type I and type II error probabilities are both small. Statistical
hypothesis testing leads to four possible situations that determine whether our decision is correct
or not. These are summarized in the table below:
                      H0 is true          H0 is false
Fail to reject H0     Correct decision    Type II error
Reject H0             Type I error        Correct decision
The probability of committing both types of error can be reduced by increasing the sample size. In
the Jury case, both errors can be reduced by collecting more evidence and considering other factors
that may affect the decision. The following are the important properties of a test of hypothesis:
1. The type I error and type II error are related. A decrease in the probability of one generally
results in an increase in the probability of the other.
2. The size of the critical region, and therefore the probability of committing a type I error,
can always be reduced by adjusting the critical value(s).
3. An increase in the sample size n will reduce α and β simultaneously.
4. If the null hypothesis is false, β is a maximum when the true value of a parameter
approaches the hypothesized value.
5. The greater the distance between the true value and the hypothesized value, the smaller β
will be.
Examples 6.3
The standard deviation of a population of weights of students is 3.6 kilograms. A sample of 36 students is
selected and weighed. The researcher is interested in testing the null hypothesis that the average
weight of male students in a certain college is 68 kilograms against the alternative hypothesis that
it is not equal to 68 kilograms.
(a) Compute the probability of committing a type I error.
(b) Assuming that the true mean is in fact 70 kilograms, compute the probability of
committing a type II error.
Solution
(a) H0: μ = 68 versus H1: μ ≠ 68
A critical region for the test statistic might arbitrarily be chosen to be the two intervals
$\bar{x} < 67$ and $\bar{x} > 69$. The non-rejection region will then be the interval $67 \le \bar{x} \le 69$. Thus
\[ Z_1 = \frac{67 - 68}{3.6/\sqrt{36}} = -1.67 \quad\text{and}\quad Z_2 = \frac{69 - 68}{3.6/\sqrt{36}} = 1.67 \]
Therefore, α = P(Z < −1.67) + P(Z > 1.67) = 2P(Z < −1.67) = 2(0.0475) = 0.0950.
If the sample size is increased to, say, 64, the probability of a type I error reduces to 0.0264.
(b) The reduction in α is not sufficient by itself to guarantee a good testing procedure. We must
also evaluate β for various alternative hypotheses. If it is important to reject H0 when the true mean
is some value μ ≥ 70 or μ ≤ 66, then the probability of committing a type II error should be
computed and examined for the alternatives μ = 66 and μ = 70. Because of symmetry, it is only
necessary to consider the probability of not rejecting the null hypothesis that μ = 68 when the
alternative μ = 70 is true. A type II error will result when the sample mean falls between 67 and
69 when H1 is true. Therefore, we are interested in $P(67 \le \bar{X} \le 69)$ when μ = 70. Continuing with
the larger sample size n = 64,
\[ Z_1 = \frac{67 - 70}{3.6/\sqrt{64}} = -6.67 \quad\text{and}\quad Z_2 = \frac{69 - 70}{3.6/\sqrt{64}} = -2.22 \]
Therefore,
β = P(−6.67 < Z < −2.22) = P(Z < −2.22) − P(Z < −6.67)
= 0.0132 − 0.0000 = 0.0132.
(With the original sample size n = 36, the same calculation gives β = P(−5 < Z < −1.67) = 0.0475.)
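The calculations in Examples 6.3 can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `acceptance_probability` is hypothetical and introduced here for illustration. Because the z values are not rounded to two decimals, the results differ slightly from the table-based figures (for example, about 0.0956 instead of 0.0950 for α with n = 36).

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def acceptance_probability(lower, upper, mu, sigma, n):
    """P(lower <= sample mean <= upper) when the true mean is mu."""
    se = sigma / sqrt(n)  # standard error of the sample mean
    return Z.cdf((upper - mu) / se) - Z.cdf((lower - mu) / se)


sigma = 3.6
# Type I error: probability of falling outside [67, 69] when mu = 68
alpha_36 = 1 - acceptance_probability(67, 69, mu=68, sigma=sigma, n=36)
alpha_64 = 1 - acceptance_probability(67, 69, mu=68, sigma=sigma, n=64)
# Type II error: probability of falling inside [67, 69] when mu = 70
beta_36 = acceptance_probability(67, 69, mu=70, sigma=sigma, n=36)
beta_64 = acceptance_probability(67, 69, mu=70, sigma=sigma, n=64)

print(f"alpha: n=36 -> {alpha_36:.4f}, n=64 -> {alpha_64:.4f}")
print(f"beta : n=36 -> {beta_36:.4f}, n=64 -> {beta_64:.4f}")
```

Both error probabilities shrink when n increases from 36 to 64, which illustrates the property that a larger sample reduces α and β simultaneously.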
Examples 6.4
Assume that the time spent at an ATM follows a normal distribution with mean 2.1 minutes and
standard deviation 0.9 minutes. A new system is to be implemented to reduce the average time
spent at the ATM. The null hypothesis is that the mean time spent is the same under the new and
the old systems. It is decided to reject the null hypothesis and conclude that the new system is
quicker if the mean withdrawal time from a random sample of 20 cash withdrawals is less than 1.7
minutes.
Assume that, for the new system, the standard deviation is still 0.9 minutes, and the time spent still
follows a normal distribution.
(a) Calculate the probability of committing a Type I error
(b) If the mean withdrawal time under the new system is actually 1.25 minutes, calculate the
probability of a Type II error
(c) Calculate the power of this test.
Solution
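(a) The probability of committing a type I error is given by
α = P(reject H0 | H0 is true) = $P(\bar{X} < 1.7 \mid \mu = 2.1)$.
But
\[ P(\bar{X} < 1.7 \mid H_0: \mu = 2.1) = P\left(Z < \frac{1.7 - 2.1}{0.9/\sqrt{20}}\right) = P(Z < -1.99) = 0.0233 \]
Therefore about 2.33% of all samples of size 20 will lead us to reject μ = 2.1 when in
fact it is true.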
(b) The probability of committing a type II error is given by
β = P(fail to reject H0 | H0 is false) = $P(\bar{X} \ge 1.7 \mid \mu = 1.25)$.
But
\[ P(\bar{X} \ge 1.7 \mid \mu = 1.25) = P\left(Z \ge \frac{1.7 - 1.25}{0.9/\sqrt{20}}\right) = P(Z \ge 2.24) = 0.0125 \]
Therefore about 1.25% of all samples of size 20 will lead us to fail to reject μ = 2.1
when in fact the true mean is 1.25.
(c) Power of the test = 1 − β = 1 − 0.0125 = 0.9875.
Therefore, about 98.75% of all samples of size 20 will lead us to reject μ = 2.1 when
the true mean is in fact 1.25.
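The ATM example can be reproduced with a short sketch that models the sampling distribution of the mean directly. It assumes, as the example does, that the standard deviation stays at 0.9 minutes; the variable names are illustrative only, and the unrounded z values give results that differ from the table-based answers in the third or fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, cutoff = 0.9, 20, 1.7
se = sigma / sqrt(n)  # standard error of the sample mean

# Sampling distribution of the mean under H0 (mu = 2.1)
# and under the specific alternative (mu = 1.25)
under_h0 = NormalDist(mu=2.1, sigma=se)
under_alt = NormalDist(mu=1.25, sigma=se)

alpha = under_h0.cdf(cutoff)       # reject H0 when the sample mean < 1.7
beta = 1 - under_alt.cdf(cutoff)   # fail to reject when the sample mean >= 1.7
power = 1 - beta

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```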
Answer
(a) P(Z < −1.77) = 0.0384.
(b) β = P(Z > 0) = 0.5 and β = P(Z > 0.59) = 0.2776
Exercise 6
1. Suppose that an allergist wishes to test the hypothesis that at least 30% of the public is
allergic to some cheese products. Explain how the allergist could commit
(a) a type I error;
(b) a type II error.
2. A sociologist is concerned about the effectiveness of a training course designed to get more
drivers to use seat belts in automobiles.
(a) What hypothesis is she testing if she commits a type I error by erroneously
concluding that the training course is ineffective?
(b) What hypothesis is she testing if she commits a type II error by erroneously
concluding that the training course is effective?
3. A large manufacturing firm is being charged with discrimination in its hiring practices.
(a) What hypothesis is being tested if a jury commits a type I error by finding the firm
guilty?
(b) What hypothesis is being tested if a jury commits a type II error by finding the firm
guilty?
4. A random sample of 400 voters in a certain city are asked if they favor an additional 4%
gasoline sales tax to provide badly needed revenues for street repairs. If more than 220 but
fewer than 260 favor the sales tax, we shall conclude that 60% of the voters are for it.
(a) Find the probability of committing a type I error if 60% of the voters favor the increased
tax.
(b) What is the probability of committing a type II error using this test procedure if actually
only 48% of the voters are in favor of the additional gasoline tax?
5. A soft-drink machine at a steak house is regulated so that the amount of drink dispensed is
approximately normally distributed with a mean of 200 milliliters and a standard deviation
of 15 milliliters. The machine is checked periodically by taking a sample of 9 drinks and
computing the average content. If $\bar{x}$ falls in the interval $191 < \bar{x} < 209$, the machine is
thought to be operating satisfactorily; otherwise, we conclude that μ ≠ 200 milliliters.
(a) Find the probability of committing a type I error when μ = 200 milliliters.
(b) Find the probability of committing a type II error when μ = 215 milliliters.
6.4 Hypothesis Testing
In this section, we formally consider tests of hypotheses on a single population mean, proportion
and variance. A hypothesis test involves four main steps, and either the p-value
or the critical region approach may be used. The general approach is given below:
1. From the problem, identify the parameter of interest and state the null and alternative
hypothesis and the type of the test (i.e. two tail or one tail test).
2. Identify the level of significance, the distribution and select or identify the critical region.
3. Compute the test statistic and/or the p-value.
4. Using your test statistic or p-value, make a decision with regards to H0.
A hypothesis test involving a population mean can take two main forms, i.e. when the population
standard deviation is known, and when it is not known.
Hypothesis test about population mean when the population variance is known
1. State the null and the alternative hypothesis and state the type of the test.
2. Identify the level of significance and the type of the test (i.e. two tail or one tail test)
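3. Compute the test statistic $Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ and/or the p-value.
4. Make a decision with regard to H0.
Using the p-value
Reject H0 if p-value ≤ α and conclude H1.
Fail to reject H0 if p-value > α and retain H0.
Using critical values
For H1: μ ≠ μ0, reject H0 if $Z < Z_{\alpha/2}$ or $Z > Z_{1-\alpha/2}$.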
Hypothesis test about population mean when the population variance is unknown
1. State the null and the alternative hypothesis and state the type of the test.
2. Identify the level of significance and the type of the test (i.e. two tail or one tail test)
Examples 6.5
A random sample of 100 recorded deaths in the United States during the past year showed an
average life span of 71.8 years. Assuming a population standard deviation of 8.9 years, does this
seem to indicate that the mean life span today is greater than 70 years? Use a 0.05 level of
significance.
Solution
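One way to work this example is sketched below, following the four-step approach with H0: μ = 70 against the one-sided alternative H1: μ > 70 (an assumption read off the wording "greater than 70 years"). The function name `z_statistic` is hypothetical; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(xbar, mu0, sigma, n):
    """z statistic for H0: mu = mu0 when the population sigma is known."""
    return (xbar - mu0) / (sigma / sqrt(n))


alpha = 0.05
z = z_statistic(xbar=71.8, mu0=70, sigma=8.9, n=100)
p_value = 1 - NormalDist().cdf(z)            # upper-tail p-value for H1: mu > 70
critical = NormalDist().inv_cdf(1 - alpha)   # upper 5% point, about 1.645
reject_h0 = z > critical

print(f"z = {z:.2f}, critical value = {critical:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Under these assumptions z is about 2.02, which exceeds 1.645, so the data suggest that the mean life span is greater than 70 years at the 0.05 level.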
Examples 6.6
A manufacturer of sports equipment has developed a new synthetic fishing line that the company
claims has a mean breaking strength of 8 kilograms with a standard deviation of 0.5 kilogram. A
random sample of 50 lines is tested and found to have a mean breaking strength of 7.8 kilograms.
Does this support the company’s claim at the 99% confidence level?
Solution
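A possible working is sketched below. It assumes a two-tailed test of H0: μ = 8 against H1: μ ≠ 8 and reads "99% confidence level" as a significance level of α = 0.01; both are interpretations rather than statements from the example, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 7.8, 8.0, 0.5, 50
alpha = 0.01  # assumed from the "99% confidence level" wording

z = (xbar - mu0) / (sigma / sqrt(n))
p_value = 2 * NormalDist().cdf(-abs(z))            # two-tailed p-value
critical = NormalDist().inv_cdf(1 - alpha / 2)     # about 2.576
reject_h0 = abs(z) > critical

print(f"z = {z:.2f}, critical values = +/-{critical:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

With these assumptions z is about −2.83, which falls in the lower critical region, so the sample does not support the claimed mean of 8 kilograms.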
Examples 6.7
Aflatoxin levels produced by mold on peanut crops in Kimberley are assumed to be normally distributed.
A sample of 15 batches of peanuts reveals levels of 17.9 ppm, on average, with a standard deviation
of 3 ppm. On the basis of these data, can we conclude at the 1% level of significance that the mean
aflatoxin level is more than 16.5 ppm?
Solution
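1. H0: μ = 16.5 versus H1: μ > 16.5 (one-tailed test).
2. α = 0.01. Since the population standard deviation is unknown, the test statistic follows the
t-distribution with n − 1 = 14 degrees of freedom.
3. Critical region: reject H0 if $t_{cal} > t_{0.01,14} = 2.624$.
4. $t_{cal} = \dfrac{17.9 - 16.5}{3/\sqrt{15}} = 1.81$.
5. Decision: Since $t_{cal} = 1.81 < 2.624$, we fail to reject the null hypothesis. There is insufficient
evidence to conclude that the mean aflatoxin level is more than 16.5 ppm.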
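The t statistic in step 4 can be verified with a few lines of Python. This sketch uses only the standard library, so the critical value 2.624 is taken from the worked solution (a t-table value for 14 degrees of freedom) rather than computed.

```python
from math import sqrt

n, xbar, s, mu0 = 15, 17.9, 3.0, 16.5

# One-sample t statistic with the sample standard deviation s
t_cal = (xbar - mu0) / (s / sqrt(n))

t_crit = 2.624  # table value quoted in the solution, not computed here
reject_h0 = t_cal > t_crit  # upper-tail test, H1: mu > 16.5

print(f"t_cal = {t_cal:.2f}, critical value = {t_crit}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```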
CHAPTER 7
7.1 Correlation
Correlation analysis is a statistical technique used to quantify the direction and strength of
association between two variables. An estimate of the correlation between two variables is called
the correlation coefficient, denoted by r. In correlation analysis, we estimate a sample correlation
coefficient. The sample correlation coefficient ranges between -1 and +1. When the correlation
coefficient is positive, higher levels of one variable are associated with higher levels of the other
and when it is negative, higher levels of one variable are associated with lower levels of the other.
The sign of the correlation coefficient indicates the direction of the association while the magnitude
of the correlation coefficient indicates the strength of the association.
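There are two main approaches commonly used in computing the sample correlation coefficient.
These are the Pearson product-moment correlation and Spearman's rank correlation.
Pearson product-moment
This is used for quantitative data measured on an interval or ratio scale. For given observations x
and y, the Pearson product-moment correlation coefficient is defined by:
\[ r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} \]
where
\[ \operatorname{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \quad \sigma_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \quad\text{and}\quad \sigma_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2 \]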
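Spearman's rank correlation
This is applied to the ranks instead of the actual data. It is defined by:
\[ r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]
where $d_i$ is the difference between the rank of $x_i$ and the rank of $y_i$, and n is the number of
pairs of observations.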
Example 7.1
The table below provides the scores for Math and English for 10 students in a test.
Math scores (x)      5 0 3 1 2 2 5 3 5 4
English scores (y)   1 2 1 3 3 1 3 1 6 2
a) Find the correlation coefficient between the math grade and English grade and discuss or
interpret it.
b) Test the hypothesis that there is no linear association among the variables at 5% level of
significance.
Solution
Math xi   English yi   xi − x̄   yi − ȳ   (xi − x̄)²   (yi − ȳ)²   (xi − x̄)(yi − ȳ)
5         1            2        -1.3     4           1.69        -2.6
0         2            -3       -0.3     9           0.09        0.9
3         1            0        -1.3     0           1.69        0
1         3            -2       0.7      4           0.49        -1.4
2         3            -1       0.7      1           0.49        -0.7
2         1            -1       -1.3     1           1.69        1.3
5         3            2        0.7      4           0.49        1.4
3         1            0        -1.3     0           1.69        0
5         6            2        3.7      4           13.69       7.4
4         2            1        -0.3     1           0.09        -0.3
Mean: 3   Mean: 2.3                      Sum: 28     Sum: 22.1   Sum: 6
For example, in the first row xi − x̄ = 5 − 3 = 2, yi − ȳ = 1 − 2.3 = −1.3, (xi − x̄)² = 2² = 4 and
(yi − ȳ)² = (−1.3)² = 1.69.
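a) $\bar{y} = 2.3$ and $\bar{x} = 3$
\[ \sigma_x^2 = \frac{1}{10-1}(28) = 3.1111, \qquad \sigma_y^2 = \frac{1}{10-1}(22.1) = 2.4556 \]
\[ \operatorname{cov}(x, y) = \frac{1}{10-1}(6) = 0.6667 \]
\[ r = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{0.6667}{\sqrt{3.1111 \times 2.4556}} = 0.2412 \]
The correlation coefficient between the math grade and the English grade is about 0.24, which indicates
a weak direct (positive) linear relationship: higher math grades tend to go with slightly higher English
grades. Squaring r gives r² ≈ 0.058, so only about 6% of the variation in one grade is explained by the other.
b)
1. H0: ρ = 0 versus H1: ρ ≠ 0
2. The distribution of the population correlation is unknown, thus the test statistic follows the
t-distribution with n − 2 = 8 degrees of freedom.
3. At the 5% level of significance, reject H0 if $|T| > t_{0.025,8} = 2.306$.
4. Compute the test statistic using $T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, hence $T = \dfrac{0.2412\sqrt{10-2}}{\sqrt{1-0.2412^2}} = 0.7030$
5. Decision: Do not reject H0 because |T| < 2.306 and conclude that there is no evidence of a linear
association between the variables.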
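The Pearson coefficient and its t statistic for Example 7.1 can be computed directly from the raw scores. The sketch below is a plain-Python illustration; the function name `pearson_r` is hypothetical. The common factor 1/(n − 1) cancels between the covariance and the two variances, so the code works with the sums of squares and cross-products. Carried at full precision, the calculation gives r ≈ 0.24 and T ≈ 0.70.

```python
from math import sqrt

math_scores = [5, 0, 3, 1, 2, 2, 5, 3, 5, 4]
english_scores = [1, 2, 1, 3, 3, 1, 3, 1, 6, 2]


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


n = len(math_scores)
r = pearson_r(math_scores, english_scores)
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)  # test statistic for H0: rho = 0

t_crit = 2.306  # t_{0.025, 8}, table value quoted in the solution
print(f"r = {r:.4f}, T = {t_stat:.4f}")
print("Reject H0" if abs(t_stat) > t_crit else "Do not reject H0")
```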
Example 7.2
x 69 66 68 73 71 74 71 69
y 163 153 185 186 157 220 190 185
Solution.
x    y     Rx    Ry    d = Rx − Ry   d²
69   163   3.5   3     0.5           0.25
66   153   1     1     0             0
68   185   2     4.5   -2.5          6.25
73   186   7     6     1             1
71   157   5.5   2     3.5           12.25
74   220   8     8     0             0
71   190   5.5   7     -1.5          2.25
69   185   3.5   4.5   -1            1
\[ \sum_{i=1}^{n} d_i^2 = 0.25 + 0 + 6.25 + 1 + 12.25 + 0 + 2.25 + 1 = 23 \]
\[ r = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6(23)}{8(64 - 1)} = 0.7262 \]
The rank correlation coefficient between the two variables is about 0.73, which indicates a fairly strong
direct (positive) relationship: larger values of x tend to be associated with larger values of y.
(b) The percentage of the total variation in y explained by x is obtained by squaring r:
r² = 0.7262² ≈ 0.527, i.e. about 53%.
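The ranking step, including the average ranks given to tied values, can be sketched as follows. The helper `average_ranks` is a hypothetical name written for this illustration. Note that the formula based on d² is the one used in these notes; with tied observations it is an approximation to the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


x = [69, 66, 68, 73, 71, 74, 71, 69]
y = [163, 153, 185, 186, 157, 220, 190, 185]

rx, ry = average_ranks(x), average_ranks(y)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
n = len(x)
r_s = 1 - 6 * d2 / (n * (n ** 2 - 1))

print("Rx:", rx)
print("Ry:", ry)
print(f"sum d^2 = {d2}, r = {r_s:.4f}")
```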
Exercise 7.3
(a) Consider the Table of grades below:
Mathematics grade 70 92 80 74 65 83
English grade 74 84 63 87 78 90
(i) Compute and interpret the correlation coefficient if the grades of students are selected at
random.
(ii) Test the hypothesis that there is no linear association among the variables at 5% level of
significance.
(b) The Statistics Consulting Center at Virginia Tech analyzed data on normal woodchucks for the
Department of Veterinary Medicine. The variables of interest were body weight in grams and heart
weight in grams. It was desired to develop a linear regression equation in order to determine if
there is a significant linear relationship between heart weight and total body weight.
Weight (kg)       2.75 2.15 4.41 5.52 3.21 4.32 2.31 4.3 3.71
Chest size (cm)   29.5 26.3 32.2 36.5 27.2 27.7 28.3 30.3 28.7
(i) Calculate r and interpret it.
(ii) Test the null hypothesis that ρ = 0 against the alternative that ρ > 0. Use α=0.01 level.
(iii) What percentage of the variation in infant chest sizes is explained by difference in weight?
{NB: square r to obtain this percentage}.
(iv) Use Spearman’s rank approach to compute r using the data in question 8.1.
Regression analysis is a statistical technique in which we use observed data to relate a variable of
interest (the response variable) to one or more independent (or predictor) variables. A regression
analysis in which the response (dependent) variable depends on one independent or
predictor variable is called a simple regression model. The main objective of regression analysis
is to build a regression model or prediction equation that can be used to describe, predict, and
control the dependent variable on the basis of the independent variables (Bowerman et al., 2001).
7.2.2 Scatter Diagram
One way to explore the relationship between a dependent variable y and an independent variable
(denoted x ) is to make a scatter diagram, or scatter plot, of y versus x . First, data concerning
the two variables are observed in pairs. To construct the scatter plot, each value of y is plotted
against its corresponding value of x .If y and x are related, the plot shows us the direction of the
relationship. That is, y could be positively related to x ( y increases as x increases) or y could
be negatively related to x ( y decreases as x increases). (Bowerman et al, 2001). The figures
below show some examples of scatter plots.
7.2.3 Simple linear regression model
The simple linear regression model assumes that the relationship between the dependent variable,
denoted y and the independent variable, denoted x, can be approximated by a straight line. We can
tentatively decide whether there is an approximate straight-line relationship between y and x by
making a scatter diagram, or scatter plot, of y versus x. The simple linear regression model is given by:
\[ y_i = \mu_{y|x_i} + \varepsilon_i \]
\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
where
$\beta_0$ = the y-intercept and $\beta_1$ = the slope of the line, and
$\varepsilon_i$ = the random error or noise term which accounts for errors due to chance and neglected factors
assumed not important.
(v) The random errors $\varepsilon_i$ and $\varepsilon_j$ are independent, i.e. $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \ne j$.
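From the above assumptions, we establish the following:
\[ \hat{y}_i = E(y_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i \]
\[ \operatorname{Var}(\varepsilon_i) = \sigma^2 \quad \forall\, x_i \]
(a) $y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$
(b) $\operatorname{cov}(y_i, y_j) = 0$ for $i \ne j$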
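We seek to find the estimates $\hat\beta_0$ and $\hat\beta_1$ of $\beta_0$ and $\beta_1$ respectively by minimizing the sum of
squared errors (SSE):
\[ SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2 \tag{1} \]
\[ \frac{\partial (SSE)}{\partial \hat\beta_0} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i), \qquad \frac{\partial (SSE)}{\partial \hat\beta_1} = -2\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i \tag{2} \]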
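By setting the partial derivatives to zero and rearranging the terms, we obtain the normal equations:
\[ n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i, \qquad \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i \tag{3} \]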
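Solving the normal equations gives
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{4} \]
and
\[ \hat\beta_0 = \frac{\sum_{i=1}^{n} y_i - \hat\beta_1 \sum_{i=1}^{n} x_i}{n} = \bar{y} - \hat\beta_1 \bar{x} \tag{5} \]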
Example 7.4
The table below provides the scores for Math and English for 10 students in a test.
Math scores (x)      5 0 3 1 2 2 5 3 5 4
English scores (y)   1 2 1 3 3 1 3 1 6 2
a) Fit the linear regression line $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ to the data.
b) Determine with reason the nature of the relationship between the grades for Math and English.
Solution
Math xi   English yi   xi − x̄   yi − ȳ   (xi − x̄)²   (xi − x̄)(yi − ȳ)
5         1            2        -1.3     4           -2.6
0         2            -3       -0.3     9           0.9
3         1            0        -1.3     0           0
1         3            -2       0.7      4           -1.4
2         3            -1       0.7      1           -0.7
2         1            -1       -1.3     1           1.3
5         3            2        0.7      4           1.4
3         1            0        -1.3     0           0
5         6            2        3.7      4           7.4
4         2            1        -0.3     1           -0.3
Mean: 3   Mean: 2.3                      Sum: 28     Sum: 6
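a)
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{6}{28} = 0.2143, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 2.3 - 0.2143 \times 3 = 1.657 \]
The fitted line is therefore $\hat{y} = 1.657 + 0.2143x$.
b) $\hat\beta_1 = 0.2143$: the slope is positive, so the relationship is direct. All things being equal, a unit
increase in x (the math score) results in an estimated 0.2143 increase in y (the English score).
$\hat\beta_0 = 1.657$: this is the estimated value of y (the English score) when x (the math score) is 0.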
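The least-squares estimates in equations (4) and (5) translate directly into code. The sketch below fits the line to the math and English scores of Example 7.4; the function name `fit_line` is hypothetical and chosen for illustration.

```python
def fit_line(x, y):
    """Return (b0, b1), the least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx           # equation (4)
    b0 = ybar - b1 * xbar    # equation (5)
    return b0, b1


math_scores = [5, 0, 3, 1, 2, 2, 5, 3, 5, 4]
english_scores = [1, 2, 1, 3, 3, 1, 3, 1, 6, 2]

b0, b1 = fit_line(math_scores, english_scores)
print(f"fitted line: y = {b0:.3f} + {b1:.4f} x")
```

The intercept is the fitted English score at a math score of 0, and the slope is the estimated change in the English score per one-point increase in the math score.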
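\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \]
Since the cross-product term is zero for the least-squares line,
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]
that is, SST = SSE + SSR.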
The SST measures the total variation of the observed values $y_i$ about their mean.
The SSR measures the amount of the total variation in the observed values of $y_i$
that is accounted for by the model.
The SSE measures the dispersion of the observed values $y_i$ about the regression line.
The coefficient of determination,
\[ r^2 = \frac{SSR}{SST}, \]
is the explained fraction of the total variation. It determines the percentage of the total variation in $y_i$
accounted for by the model.
We can deduce from $r^2 = \dfrac{SSR}{SST}$ that
i) $SSR = r^2\, SST$
Also $0 \le r^2 \le 1$ and $r^2 = \hat\beta_1^2\, \dfrac{SS_{xx}}{SST}$, where $SS_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.
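ANOVA table for simple linear regression:
Source       Sum of squares   Degrees of freedom   Mean square
Regression   SSR              1                    MSR = SSR/1
Error        SSE              n − 2                MSE = SSE/(n − 2)
Total        SST = SS_yy      n − 1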
From the ANOVA table the following valid conclusions can be drawn:
i) $E(SST) = (n-1)\sigma^2 + \beta_1^2\, SS_{xx}$
iv) $E(MST) = \sigma^2 + \dfrac{\beta_1^2\, SS_{xx}}{n-1}$
$E(MSR) = \sigma^2 + \beta_1^2\, SS_{xx}$
$E(MSE) = \sigma^2$
However, MSE is an unbiased estimator for $\sigma^2$ whether or not x and y are related (i.e. whether $\beta_1 = 0$ or
not). If $\beta_1 = 0$, then $E(MSR) = \sigma^2$ since $\beta_1^2\, SS_{xx} = 0$. Thus, for testing whether or not
$\beta_1 = 0$, a comparison of MSR and MSE is made. If MSR and MSE are of the same order of
magnitude, it will suggest that $\beta_1 = 0$. On the other hand, if $MSR \gg MSE$ this would
suggest that $\beta_1 \ne 0$.
These two mean squares (MSR and MSE) form the basic idea underlying the ANOVA test of
the overall regression model.
The ANOVA generally provides highly useful test for regression models (and other linear
statistical models).
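(a) The hypotheses are $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$.
(b) The test statistic is $F = \dfrac{MSR}{MSE} \sim F(1, n-2)$, which is close to 1 when $\beta_1 = 0$ and bigger
than 1 when $\beta_1 \ne 0$.
(c) Decision Rule: Reject $H_0: \beta_1 = 0$ if $F > F_{\alpha}(1, n-2)$.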
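To make the ANOVA test concrete, the sketch below applies it to the math and English scores of Example 7.4 (n = 10). It is an illustration under the assumptions of the simple linear regression model; the critical value 5.32 for F(1, 8) at α = 0.05 is the table value quoted later in these notes, not computed by the code.

```python
x = [5, 0, 3, 1, 2, 2, 5, 3, 5, 4]   # math scores
y = [1, 2, 1, 3, 3, 1, 3, 1, 6, 2]   # English scores
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares estimates from equations (4) and (5)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

# Variance decomposition: SST = SSR + SSE
sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sum((fi - ybar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

msr = ssr / 1          # regression mean square, 1 degree of freedom
mse = sse / (n - 2)    # error mean square, n - 2 degrees of freedom
f_stat = msr / mse
r_squared = ssr / sst

f_crit = 5.32  # F_{0.05}(1, 8), table value
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"F = {f_stat:.4f}, r^2 = {r_squared:.4f}")
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```

Here F is well below the critical value, so the slope is not significantly different from zero, which agrees with the weak correlation found for the same data.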
The coefficient of determination, denoted by $R^2$, measures the proportion of the total variability
in the dependent variable (y) that is explained by the independent variable (x).
(a) $0 \le R^2 \le 1$
(b) If all the data points fall exactly on the regression line having a non-zero slope, then
$R^2 = 1$.
(c) If $\hat\beta_1 = 0$ then $R^2 = 0$.
(d) The square root of $R^2$ gives the correlation coefficient, where the
sign is determined by the sign of $\hat\beta_1$, i.e. $r = +\sqrt{R^2}$ if $\hat\beta_1 > 0$ and $r = -\sqrt{R^2}$ if $\hat\beta_1 < 0$.
7.2.8 Pitfalls and limitations associated with regression and Correlation analysis
coefficient of determination of r² = 0.01 for this example indicates that only 1 percent of
the variance in Y is statistically explained by knowing X.
(f) The interpretation of the coefficients of correlation and determination is based on the
assumption of a bivariate normal distribution for the population and, for each variable,
equal conditional variances.
(g) For both regression and correlation analysis, a linear model is assumed. For a relationship
that is curvilinear, a transformation to achieve linearity may be available. Another
possibility is to restrict the analysis to the range of values within which the relationship is
essentially linear.
Example 8.3
Use the estimated model below to describe the relationship between x and y
yˆ 3.829633 0.903643x .
Example 8.3
Suppose an analyst takes a random sample of 10 recent truck shipments made by a company and
records the distance in miles and delivery time to the nearest half-day from the time that the
shipment was made available for pick-up. Use the Table below to answer the following questions.
(a) Construct a scatter plot and use it to determine if a linear regression will be appropriate.
(b) Determine the least-squares regression equation for the data.
(c) What is the nature of the relationship between distance and delivery time? Motivate your
answer.
(d) Interpret the estimated value for β0.
(e) Determine the number of days it will take a shipment to arrive if the total miles is 1000.
(f) Determine the coefficient of determination and interpret it.
(g) Construct an ANOVA table to represent the data.
(h) Perform a hypothesis test to determine the significance of the estimated model at α=0.05.
(i) Compute the correlation coefficient and use it to test the hypothesis H0: ρ = 0 versus
H1: ρ ≠ 0 at the 0.05 level of significance.
Solution
(a) Looking at the scatter plot below, the points seem to form a straight line; therefore, a linear
regression model may be appropriate for the data.
[Figure: scatter plot of delivery time (vertical axis) against distance (horizontal axis, 0 to 1600)]
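(b) We will need the following summary values, since they were not given:
\[ \bar{x} = 762, \quad \bar{y} = 2.85, \quad \sum_{i=1}^{10} (x_i - \bar{x})^2 = 1297860, \quad \sum_{i=1}^{10} (y_i - \bar{y})^2 = 18.525, \quad \sum_{i=1}^{10} (x_i - \bar{x})(y_i - \bar{y}) = 4653 \]
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{4653}{1297860} = 0.003585 \]
\[ \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 2.85 - 0.003585 \times 762 = 0.118 \]
The least-squares regression equation is therefore $\hat{y} = 0.118 + 0.0036x$.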
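(c) There is a direct or positive relationship between delivery time and distance because $\hat\beta_1$ is
positive. All things being equal, a unit (one-mile) increase in distance will result in an estimated
0.0036-day increase in delivery time.
(d) $\hat\beta_0 = 0.118$ is the estimated delivery time (in days) when the distance is 0 miles.
(e) $\hat{y} = 0.118 + 0.003585 \times 1000 \approx 3.70$ days.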
(f) Note from the variance decomposition that $SST = \sum_{i=1}^{10} (y_i - \bar{y})^2 = 18.525$ and $SSE = 1.844$, so
\[ r^2 = 1 - \frac{SSE}{SST} = 1 - \frac{1.844}{18.525} = 0.90 \]
About 90% of the total variation in delivery time is explained by the distance.
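(g) ANOVA table
Source       SS       df   MS       F
Regression   16.681   1    16.681   72.396
Error        1.844    8    0.2305
Total        18.525   9
(h) Since $F = 72.396 > F_{0.05}(1, 8) = 5.32$, we reject $H_0: \beta_1 = 0$ and conclude that the model is significant.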
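(i) H0: ρ = 0 versus H1: ρ ≠ 0
Taking $r = +\sqrt{0.90} \approx 0.95$ (positive because $\hat\beta_1 > 0$),
\[ T = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}, \quad\text{hence}\quad T = \frac{0.95\sqrt{10-2}}{\sqrt{1-0.95^2}} = 8.61 \]
We reject H0 because |T| = 8.61 > 2.306 and conclude that there is a significant linear relationship between
distance and delivery time.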
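The whole solution can be reproduced from the five summary values alone. The sketch below is an illustration with hypothetical variable names; it relies on the identity SSR = β̂1 × Σ(x − x̄)(y − ȳ), which follows from equation (4). Because r is not rounded to 0.95 here, the t statistic comes out near 8.51 rather than 8.61; the conclusion is unchanged.

```python
from math import sqrt

n = 10
xbar, ybar = 762, 2.85
sxx, syy, sxy = 1297860, 18.525, 4653   # summary sums from part (b)

b1 = sxy / sxx             # slope
b0 = ybar - b1 * xbar      # intercept
ssr = b1 * sxy             # regression sum of squares (= b1^2 * sxx)
sse = syy - ssr            # error sum of squares, since SST = syy
r2 = ssr / syy             # coefficient of determination
f_stat = (ssr / 1) / (sse / (n - 2))
r = sqrt(r2)               # positive root because the slope is positive
t_stat = r * sqrt(n - 2) / sqrt(1 - r2)
pred_1000 = b0 + b1 * 1000  # predicted delivery time for 1000 miles

print(f"b1 = {b1:.6f}, b0 = {b0:.3f}")
print(f"SSE = {sse:.3f}, r^2 = {r2:.3f}, F = {f_stat:.3f}, T = {t_stat:.2f}")
print(f"predicted delivery time at 1000 miles = {pred_1000:.2f} days")
```

In simple linear regression T² equals F, so the t-test on the correlation and the ANOVA F-test always lead to the same decision.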