05 Statistical Inference-2 PDF
Topics Outline
The Process of Statistical Inference
Sampling Distributions
The t Distributions
Confidence Interval Estimation
Hypothesis Tests for a Population Mean
The Process of Statistical Inference
A population is the set of all subjects (units, individuals, members, elements) of interest in
a particular study. A parameter is a number describing a characteristic of the population.
Parameters are usually unknown.
A sample is a subset of the population consisting of units which we actually examine and
for which we do have data. Several methods can be used to select a sample from a
population. One of the most common is simple random sampling. A simple random
sample of size n consists of n units chosen in such a way that:
1. Each unit in the population has the same chance of being selected in the sample.
2. All possible samples of size n have the same chance of being drawn.
A statistic is a number describing a characteristic of a sample. We use statistics to
estimate the unknown population parameters.
The purpose of statistical inference is to develop estimates and test hypotheses about the
characteristics of a population using information contained in a sample.
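As a small illustration of these definitions, the following Python sketch draws a simple random sample and computes the sample statistics. The population values, the seed, and the sample size are made up for illustration and do not come from these notes.

```python
import random
import statistics

# Hypothetical population: 1,000 made-up measurements (an assumption for illustration).
random.seed(1)
population = [random.gauss(50, 10) for _ in range(1000)]

# Parameters describe the population; in practice they are usually unknown.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# random.sample draws without replacement, so every subset of size n
# is equally likely to be chosen: a simple random sample.
n = 25
sample = random.sample(population, n)

# Statistics describe the sample and are used to estimate the parameters.
x_bar = statistics.fmean(sample)
s2 = statistics.variance(sample)   # sample variance (divides by n - 1)
s = statistics.stdev(sample)       # sample standard deviation

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s = {s:.2f}")
```

The statistics will generally be close to, but not equal to, the parameters; a different sample gives different values.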
[Figure: Population → random sample → calculate sample statistics ($\bar{x}$, $s^2$, $s$) → make an inference
about population parameters]
Sampling Distributions
Statistical inference is based on the sampling distributions of statistics. That is, we use
probability to say what would happen if we applied the inference method many times.
The sampling distribution of a statistic is the distribution of all possible values taken by
the statistic when all possible samples of a fixed size n are taken from the population.
Sampling distribution of $\bar{X}$ ($\sigma$ known)
[Figure 2: (a) Population distribution; (b) sampling distribution of $\bar{X}$ for n = 2;
(c) sampling distribution of $\bar{X}$ for n = 10; (d) sampling distribution of $\bar{X}$ for n = 25]
Notes:
1. The Central Limit Theorem (CLT) says that if the sample size is large, then regardless of the form
of the population distribution, the random variable $\bar{X}$ is approximately normally distributed
with mean $\mu_{\bar{X}} = \mu$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n}$, where $\mu$ is the population mean
and $\sigma$ is the population standard deviation (see figure 2).
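The CLT statement can be checked with a short simulation. The sketch below is only an illustration: the exponential population with mean 1, the sample size, the number of repetitions, and the seed are all assumptions chosen for the demonstration.

```python
import math
import random
import statistics

random.seed(0)
mu, sigma = 1.0, 1.0     # an exponential population with rate 1 has mean 1 and sd 1
n = 25                   # sample size (assumed)
reps = 20000             # number of simulated samples (assumed)

# Each entry is the mean of one random sample of size n from the population.
means = [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
         for _ in range(reps)]

sim_mean = statistics.fmean(means)
sim_sd = statistics.stdev(means)
theory_sd = sigma / math.sqrt(n)

print(f"mean of sample means: {sim_mean:.3f} (theory {mu})")
print(f"sd of sample means:   {sim_sd:.3f} (theory {theory_sd:.3f})")
```

The simulated mean and standard deviation of the sample means should land close to $\mu$ and $\sigma/\sqrt{n}$, even though the population itself is strongly skewed.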
Example 1
The time (in hours) that a technician requires to perform preventive maintenance on an
air-conditioning unit is governed by the exponential distribution whose density curve appears
in figure 2 (a). The mean time is $\mu = 1$ hour and the standard deviation is $\sigma = 1$ hour.
In order to estimate the technician's time needed to maintain an air-conditioner,
a simple random sample of 70 air-conditioning units will be selected.
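(a) Find the sampling distribution of the sample mean $\bar{X}$.
The sample size n = 70 is large enough to invoke the Central Limit Theorem, so $\bar{X}$ is approximately
normal with mean $\mu = 1$ hour and standard deviation $\sigma/\sqrt{n} = 1/\sqrt{70} \approx 0.12$ hours.
(b) What is the probability that the sample mean will be less than 0.9 hours?
$$P(\bar{X} < 0.9) = P\left(z < \frac{0.9 - 1}{1/\sqrt{70}}\right) = P(z < -0.84) = 0.2005 \text{ or } 20\%$$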
Therefore, if you budget 0.9 hours per unit, there is an 80% chance that the technician
will not complete the work within the budgeted time.
(c) What is the probability that the sample mean will be greater than 1.25 hours?
$$P(\bar{X} > 1.25) = P\left(z > \frac{1.25 - 1}{1/\sqrt{70}}\right) = P(z > 2.09) = 1 - P(z \le 2.09) = 1 - 0.9817 = 0.0183$$
Thus, if you budget 1.25 hours per unit, there is only a 2% chance that the technician will
not complete the work within the budgeted time. You therefore budget 1.25 hours per unit.
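The two probabilities in Example 1 can be reproduced with Python's standard library. This is a minimal sketch that uses the normal approximation from the CLT; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1.0, 1.0, 70
se = sigma / math.sqrt(n)              # standard deviation of the sample mean
x_bar_dist = NormalDist(mu, se)        # CLT approximation for the sample mean

p_less = x_bar_dist.cdf(0.9)           # P(X-bar < 0.9)
p_greater = 1 - x_bar_dist.cdf(1.25)   # P(X-bar > 1.25)

print(f"P(X-bar < 0.9)  = {p_less:.4f}")
print(f"P(X-bar > 1.25) = {p_greater:.4f}")
```

Because the code does not round the z-scores to two decimals before looking up the area, its results can differ from the table-based values in the third decimal place.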
Sampling distribution of $\bar{X}$ ($\sigma$ unknown)
In most practical situations, the true value of the population standard deviation $\sigma$ is unknown. In
such cases, we estimate $\sigma$ by the sample standard deviation s.
We assume the population distribution is normal. Then in place of the random variable
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
we use the random variable
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}},$$
which has the t distribution with n - 1 degrees of freedom.
The t Distributions
The t distribution, also known as Student's t distribution, was discovered in 1908 by William
Gosset, a chemist employed by the Guinness brewing company.
He considered himself a student still learning statistics, so he signed his papers with the pseudonym
"Student". Or perhaps he used a pseudonym because of trade-secret restrictions imposed by Guinness.
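The probability density function for Student's t distribution with $\nu$ degrees of freedom is
$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}},$$
where $\Gamma$ is the gamma function, which satisfies $\Gamma(\alpha) = (\alpha - 1)\Gamma(\alpha - 1)$, $\Gamma(1) = 1$, and $\Gamma(1/2) = \sqrt{\pi}$.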
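The density can be evaluated directly with the gamma function from Python's standard library. The sketch below uses the standard form of the Student's t density; the function names `t_pdf` and `z_pdf` are illustrative. It compares the t density with the standard normal density at the center and in the tail.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def z_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Height at the center (t = 0) and in the tail (t = 3) for several df.
for df in (2, 9, 30):
    print(f"df = {df:2d}: f(0) = {t_pdf(0, df):.4f}, f(3) = {t_pdf(3, df):.4f}")
print(f"normal : f(0) = {z_pdf(0):.4f}, f(3) = {z_pdf(3):.4f}")
```

For small degrees of freedom the t density is lower at 0 and higher at 3 than the normal density, and the gap shrinks as the degrees of freedom grow.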
Note that there are different t distributions; it is a family of similar probability distributions.
When we speak of a specific t distribution, we have to specify the degrees of freedom.
We will write the t distribution with n - 1 degrees of freedom as t(n - 1).
Figure 3 Density curves for the t distributions with 2 and 9 degrees of freedom
and the standard normal distribution. All are symmetric with center 0.
The t distributions are somewhat more spread out.
Properties of t distributions:
1. The t density curves are symmetric, bell-shaped, and have their peak at 0 like the standard
normal distribution.
2. Since the peak of t is lower than that of the standard normal distribution, the tails of t are heavier
(because the area under the curve is 1).
Why do we need to use a distribution with heavier tails?
3. As the degrees of freedom increase the t density curve approaches the standard normal curve.
Using the t-Table
Each row in the t-Table contains critical values t* for the t distribution whose degrees of freedom
appear at the left of the row. For convenience, we label the table entries both by the confidence
level (in percent) required for confidence intervals and by the one-sided and two-sided P-values
required for hypothesis tests.
The bottom row of the t-Table contains the standard normal critical values z*. By looking down
any column, you can check that the t critical values approach the normal values as the degrees of
freedom increase.
Please note the difference in using the z- and t- tables.
The z-Table gives probabilities (areas under the curve) to the left of specified z-values.
The t-Table gives the t-values for a specified upper tail area.
Example 2 You have a simple random sample of size n = 25.
The corresponding t distribution has n - 1 = 25 - 1 = 24 degrees of freedom.
(a) What is the critical value t* such that t has probability 0.025 to the right of t*?
In the t-Table, in the row for df = 24
above one-sided P-value 0.025,
we find t* = 2.064.
With Excel, t*= TINV(0.05,24) = 2.064.
(b) What is the critical value t* such that t has probability 0.75 to the left of t*?
If t has probability 0.75 to the left of t*,
then t has probability 0.25 to the right of t*.
In the t-Table, in the row for df = 24 above
one-sided P-value 0.25, we find t* = 0.685.
With Excel, t*= TINV(0.5,24) = 0.685.
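The table lookups in Example 2 can also be approximated numerically. The following sketch uses only the Python standard library. It integrates the t density with Simpson's rule and finds the critical value by bisection. The helper names (`t_pdf`, `t_cdf`, `t_critical`) and the numerical settings are assumptions of this sketch; a statistics package such as SciPy (`scipy.stats.t.ppf`) would normally be used instead.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area under the density between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

def t_critical(upper_tail, df):
    """Value t* with P(T > t*) = upper_tail, for 0 < upper_tail < 0.5."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > upper_tail:
            lo = mid                   # too much area to the right: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(f"(a) t* = {t_critical(0.025, 24):.3f}")   # upper-tail area 0.025
print(f"(b) t* = {t_critical(0.25, 24):.3f}")    # upper-tail area 0.25
```

Note that `t_critical` takes the one-sided (upper-tail) area, whereas Excel's TINV takes the two-sided probability, which is why the example calls TINV with 0.05 and 0.5.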
The sample size required for a desired margin of error m is
$$n = \left(\frac{t^* s}{m}\right)^2$$
Remember, though, that increasing the sample size is typically associated with higher cost in
practice. The best approach is to use the smallest possible sample size that gives useful results.
Example 3
A study of commuting times reports the travel times to work of a random sample of 20 employed
adults in New York State. The mean is $\bar{x} = 31.25$ minutes and the standard deviation is s = 21.88
minutes. Assume that the travel times to work are normally distributed.
(a) What is the point estimate of the average travel time to work?
The point estimate is $\bar{x} = 31.25$ minutes.
(b) What is the standard error of the mean?
$$SE = \frac{s}{\sqrt{n}} = \frac{21.88}{\sqrt{20}} = 4.8925$$
(c) What is the critical value for 95% confidence?
df = 20 - 1 = 19
t* = 2.093 (from t-Table)
(d) What is the margin of error for 95% confidence?
m = t*SE = (2.093)(4.8925) = 10.24
(e) What is the 95% confidence interval for the mean travel time to work?
$$\bar{x} \pm m = 31.25 \pm 10.24, \text{ or } 21.01 \text{ to } 41.49$$
We are 95% confident that the mean travel time to work for employed adults in New York
State is between 21.01 and 41.49 minutes.
(f) Construct 90% and 99% confidence intervals.
The 90% and 99% confidence intervals replace the 95% critical value t* = 2.093 by the
90% and 99% critical values t* = 1.729 and t*= 2.861, respectively.
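90% confidence interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 31.25 \pm (1.729)\frac{21.88}{\sqrt{20}} = 31.25 \pm 8.46$, or 22.79 to 39.71
99% confidence interval: $\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 31.25 \pm (2.861)\frac{21.88}{\sqrt{20}} = 31.25 \pm 14.00$, or 17.25 to 45.25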
Note that: 1. The obtained intervals are centered at the point estimate $\bar{x} = 31.25$.
2. A longer interval is required to estimate $\mu$ with a higher level of confidence.
(The margin of error equals 8.46 for 90% confidence, 10.24 for 95%, and 14.00 for 99%.)
(g) How many more observations should we include in the sample if we want to estimate
the mean travel time within 6 minutes with 95% level of confidence?
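$$n = \left(\frac{t^* s}{m}\right)^2 = \left(\frac{(2.093)(21.88)}{6}\right)^2 = \left(\frac{45.7948}{6}\right)^2 = 7.6325^2 = 58.25$$
Rounding up gives a required sample size of n = 59, so 59 - 20 = 39 more observations are needed.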
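The arithmetic of Example 3 can be retraced in a few lines of Python. This is a sketch only: the critical values are copied from the example (t-Table, df = 19) and are not computed, and the variable names are illustrative.

```python
import math

n, x_bar, s = 20, 31.25, 21.88
se = s / math.sqrt(n)                  # standard error of the mean

# Critical values for df = 19, as quoted from the t-Table in the example.
t_star = {90: 1.729, 95: 2.093, 99: 2.861}

intervals = {}
for level, t in t_star.items():
    m = t * se                         # margin of error
    intervals[level] = (x_bar - m, x_bar + m)
    print(f"{level}%: margin {m:.2f}, interval {x_bar - m:.2f} to {x_bar + m:.2f}")

# Part (g): sample size for a margin of error of 6 minutes at 95% confidence.
target_m = 6
n_required = math.ceil((t_star[95] * s / target_m) ** 2)   # always round up
print(f"required n = {n_required}, additional observations = {n_required - n}")
```

`math.ceil` is used rather than ordinary rounding because a smaller sample would give a margin of error larger than the target.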
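left-sided test: $H_0: \mu = \mu_0$, $H_a: \mu < \mu_0$
right-sided test: $H_0: \mu = \mu_0$, $H_a: \mu > \mu_0$
two-sided test: $H_0: \mu = \mu_0$, $H_a: \mu \ne \mu_0$
Test statistic:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$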
The P-value is the probability, if $H_0$ is true, of randomly drawing a sample like the one obtained,
or one more extreme, in the direction of $H_a$. The P-value is calculated as the corresponding area
under the t density curve, one-tailed or two-tailed depending on $H_a$:
left-sided test: the area to the left of t
right-sided test: the area to the right of t
two-sided test: twice the area to the right of |t|
Small P-values are evidence against $H_0$, because they say that the observed result is unlikely to
occur when $H_0$ is true. Large P-values fail to give evidence against $H_0$.
A result with a certain P-value could be significant at one level of significance and not
significant at another level of significance.
There is no rule for how small a P-value we should require to reject $H_0$; it is a matter of judgment.
The most commonly used values for the level of significance $\alpha$ are 0.05 and 0.01.
If we choose $\alpha = 0.05$, we are requiring that the data give evidence against $H_0$ so strong that it
would happen no more than 5% of the time (5 times in 100 samples in the long run) when $H_0$ is true.
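The long-run interpretation of $\alpha$ can be illustrated by simulation. The sketch below repeatedly samples from a population for which $H_0$ is true and counts how often a two-sided test at $\alpha = 0.05$ rejects. To stay within the Python standard library it uses a z test with known $\sigma$ instead of the t test discussed in these notes. The population values, number of trials, and seed are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(42)
mu0, sigma, n = 40.0, 12.0, 25      # hypothetical values; H0 is true by construction
alpha = 0.05
z_star = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value, about 1.96

trials = 4000
rejections = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    if abs(z) > z_star:             # result significant at level alpha
        rejections += 1

rate = rejections / trials
print(f"H0 was rejected in {rate:.1%} of the samples")
```

The rejection rate should be close to 5%: when $H_0$ is true, evidence this strong arises only about 5 times in 100 samples.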
The following steps are appropriate for testing a claim about a population mean:
1. Check the conditions for statistical inference.
2. Formulate $H_0$ and $H_a$.
3. Select a level of significance $\alpha$.
4. Calculate the test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
Example 4
A business travel magazine wants to classify transatlantic gateway airports according to the
mean rating for the population of business travelers. A rating scale with a low score of 0 and a
high score of 10 will be used, and airports with a population mean rating greater than 7 will be
designated as superior service airports.
The magazine staff surveyed a random sample of 60 business travelers at each airport to obtain
the ratings data. The sample for London's Heathrow Airport provided a sample mean rating of
$\bar{x} = 7.25$ and a sample standard deviation of s = 1.052.
Do the data indicate that Heathrow should be designated as a superior service airport?
Solution:
1. The sample size n = 60 is large enough to perform the hypothesis test safely.
2. We want to develop a hypothesis test for which the decision to reject $H_0$ will lead to the
conclusion that the population mean rating for Heathrow Airport is greater than 7.
Thus, a right-sided test is required:
$H_0: \mu = 7$
$H_a: \mu > 7$
3. We will use a level of significance $\alpha$.
4. The test statistic is
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = \frac{0.25}{0.1358} = 1.84$$
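The test statistic of Example 4, together with a right-sided P-value, can be computed as follows. This is a sketch under stated assumptions: the P-value is obtained by numerically integrating the t density (helpers `t_pdf` and `t_cdf` are illustrative names), and it is this sketch's own calculation, not a figure quoted from the notes.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

n, x_bar, s, mu0 = 60, 7.25, 1.052, 7.0
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = 1 - t_cdf(t_stat, n - 1)   # right-sided test: area to the right of t

print(f"t = {t_stat:.2f}, df = {n - 1}, one-sided P-value = {p_value:.4f}")
```

The decision then follows by comparing the P-value with the chosen level of significance.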
Example 5
The company Holiday Toys manufactures and distributes its products through more than 1000
retail outlets. For this year's most important new toy, Holiday's marketing director is expecting
demand to average 40 units per retail outlet.
Prior to making the final production decision based upon this estimate, Holiday decided to survey a
sample of 25 retailers in order to develop more information about the demand for the new product.
Each retailer was provided with information about the features of the new toy along with the cost and
the suggested selling price. Then each retailer was asked to specify an anticipated order quantity.
Here are their answers about the number of units they anticipate ordering:
26 45 23 39 32
52 47 52 45 22
31 22 47 33 59
21 21 34 52 42
45 30 53 28 34
Should Holiday Toys continue its production planning based on the marketing director's estimate,
or should it reevaluate its production plan?
Solution:
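$H_0: \mu = 40$
$H_a: \mu \ne 40$
[Figure: histogram of the number of units (count on the vertical axis, 0 to 7), with classes
20-26, 27-33, 34-40, 41-47, 48-54, 55-61]
The sample mean is $\bar{x} = 37.4$ and the sample standard deviation is s = 11.79, so
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = \frac{-2.6}{2.358} = -1.10$$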
Because we have a two-sided test, the P-value is two times the area under the curve for the t
distribution to the left of t = -1.10. Since the t distribution is symmetric, however, the area under the
curve to the left of t = -1.10 is the same as the area under the curve to the right of t = 1.10. Using the
t-Table for df = 24, we see that t = 1.10 is very close to 1.059, which corresponds to a two-sided P-value of 0.30.
(Excel gives: P-value = TDIST(1.10,24,2) = 0.2822)
The P-value is greater than $\alpha = 0.05$, so we do not reject $H_0$.
Sufficient evidence is not available to conclude that Holiday Toys should change its production
plan for the coming season. Therefore, Holiday Toys should continue its production planning for
the coming season based on the expectation that the average demand is 40 units per retail outlet.
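The whole calculation for Example 5 can be reproduced from the raw data. The sketch below uses only the Python standard library. The two-sided P-value is obtained by numerically integrating the t density, and the helper names `t_pdf` and `t_cdf` are illustrative.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), by Simpson's rule on [0, |t|] plus symmetry about 0."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t >= 0 else 0.5 - area

# Anticipated order quantities reported by the 25 retailers.
units = [26, 45, 23, 39, 32, 52, 47, 52, 45, 22, 31, 22, 47,
         33, 59, 21, 21, 34, 52, 42, 45, 30, 53, 28, 34]
mu0 = 40

n = len(units)
x_bar = statistics.fmean(units)
s = statistics.stdev(units)
t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_value = 2 * (1 - t_cdf(abs(t_stat), n - 1))   # two-sided: twice the tail area

print(f"n = {n}, x_bar = {x_bar:.1f}, s = {s:.2f}")
print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.4f}")
```

The P-value from the code uses the unrounded test statistic, so it can differ slightly from the Excel value, which was computed from t rounded to 1.10.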
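The 95% confidence interval for $\mu$ is
$$\bar{x} \pm t^* \frac{s}{\sqrt{n}} = 37.4 \pm 2.064 \frac{11.79}{\sqrt{25}} = 37.4 \pm 4.87 = 32.53 \text{ to } 42.27$$
This interval contains the hypothesized mean ($\mu_0 = 40$).
Therefore, we do not reject the null hypothesis at the 5% ($\alpha = 0.05$) level of significance.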
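The link between the confidence interval and the two-sided test can be written as a simple check. This is a minimal sketch; the critical value 2.064 is taken from the t-Table (df = 24), and the variable names are illustrative.

```python
import math

n, x_bar, s, mu0 = 25, 37.4, 11.79, 40
t_star = 2.064                      # df = 24, 95% confidence (from the t-Table)

m = t_star * s / math.sqrt(n)       # margin of error
lower, upper = x_bar - m, x_bar + m

# A two-sided test at the 5% level rejects H0 exactly when mu0 lies outside the interval.
reject_h0 = not (lower <= mu0 <= upper)

print(f"95% CI: {lower:.2f} to {upper:.2f}; reject H0: {reject_h0}")
```

This shortcut applies only to two-sided tests, with the confidence level matched to the level of significance (95% with $\alpha = 0.05$).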
Example 6
Carpetland salespersons averaged $8,000 per week in sales. Steve Contois, the firm's vice president,
proposed a compensation plan with new selling incentives. Steve hopes that the results of a trial
selling period will enable him to conclude that the compensation plan increases the average sales
per salesperson.
(a) Develop the appropriate null and alternative hypotheses.
(b) What is the Type I error in this situation? What are the consequences of making this error?
(c) What is the Type II error in this situation? What are the consequences of making this error?
Solution:
(a) $H_0: \mu = 8000$
$H_a: \mu > 8000$
$H_a$ is a research hypothesis to see if the plan increases average sales.
reject $H_0$: accept $H_a$
do not reject $H_0$: do not accept $H_a$ (Type II error if $H_0$ is false)