Confidence_Intervals-Reader
1 Introduction
Confidence intervals (CIs) convey the uncertainty associated with sample
statistics (most often the sample mean, but also the median, mode, etc.) that
are used to estimate population parameters. For example, we estimate the
average height of Dutch nationals (the population parameter) using the sample
mean of a random sample of size n, say 1000, drawn from the population. We have
seen in the previous lecture that the sample mean is a random variable whose
value varies from sample to sample. Hence, we need to attach a measure of
variability to the point estimate (computed from the sample data) of the
population parameter. This measure of variability is called the margin of
error. The sample statistic plus or minus the margin of error gives the
confidence interval. A CI tells us how confident we can be about the point
estimates or sample statistics that we use to infer the population parameter.
Formally, a CI for a population parameter θ is a random interval calculated from
the sample data, that contains θ with some specified probability or confidence.
1.2 Confidence interval for a population mean
We will derive the confidence interval for the population mean [1]. For the
other cases we will just state the results. Generally one uses the standard
normal Z-distribution to compute the confidence interval; however, in real-life
cases where the population variance is not known, the t-distribution is used.
We will discuss both cases.
For 0 ≤ α ≤ 1, let z(α) be the value such that the area under the standard
normal density curve to the right of z(α) is α; in mathematical terms,
P(Z ≥ z(α)) = α. See Figure 1 for more clarity. By the symmetry of the curve,
the area under the curve to the left of −z(α) is also α, i.e.,
P(Z ≤ −z(α)) = α.
Figure 1: Standard normal density curve showing the value z(α) and the area
under the curve to the right of it as α
P(−z(α/2) ≤ Z ≤ z(α/2)) = 1 − α.
To understand the equation above visually, see Figure 2. Note that the area to
the right of z(α/2) is α/2 and, similarly, the area to the left of −z(α/2) is
α/2. Since the total area under the curve is 1, the area between the two grey
tail regions is 1 − α.
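The tail areas described above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `z` and the choice α = 0.05 are illustrative assumptions, not part of the lecture material.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def z(alpha: float) -> float:
    """Value with area `alpha` to its right under the standard normal curve."""
    # Area alpha to the right means area 1 - alpha to the left.
    return Z.inv_cdf(1.0 - alpha)


alpha = 0.05  # illustrative choice
right_tail = 1.0 - Z.cdf(z(alpha))                    # P(Z >= z(alpha))
left_tail = Z.cdf(-z(alpha))                          # P(Z <= -z(alpha))
middle = Z.cdf(z(alpha / 2)) - Z.cdf(-z(alpha / 2))   # P(-z(a/2) <= Z <= z(a/2))

print(f"right tail = {right_tail:.4f}")
print(f"left tail  = {left_tail:.4f}")
print(f"middle     = {middle:.4f}")
```

Both tails come out equal to α, and the middle area equals 1 − α, which is the identity illustrated in Figure 2.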
Recall the central limit theorem, which states that the sampling distribution
of the sample mean approaches a normal distribution as the sample size gets
larger, no matter what the shape of the population distribution is. Let µ be
the population mean, X̄ the mean of a sample of size n, and σX̄ the standard
deviation (or standard error) of the sample mean. Then, by the central limit
theorem, (X̄ − µ)/σX̄ approximately follows the standard normal distribution,
and hence we have:
Figure 2: Area under the standard normal curve in between −z(α/2) and
z(α/2).
\[
P\left(-z(\alpha/2) \le \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \le z(\alpha/2)\right) \approx 1 - \alpha,
\]
upon some elementary calculation we have
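The calculation can be sketched as follows: multiply the inequality through by σX̄, subtract X̄, and multiply by −1 (which reverses the inequality signs). This is a reconstruction under the definitions above, not a quotation of the original derivation.

```latex
\begin{align*}
-z(\alpha/2) \le \frac{\bar{X}-\mu}{\sigma_{\bar{X}}} \le z(\alpha/2)
&\iff -z(\alpha/2)\,\sigma_{\bar{X}} \le \bar{X}-\mu \le z(\alpha/2)\,\sigma_{\bar{X}} \\
&\iff \bar{X} - z(\alpha/2)\,\sigma_{\bar{X}} \le \mu \le \bar{X} + z(\alpha/2)\,\sigma_{\bar{X}},
\end{align*}
so that
\[
P\left(\bar{X} - z(\alpha/2)\,\sigma_{\bar{X}} \le \mu \le \bar{X} + z(\alpha/2)\,\sigma_{\bar{X}}\right) \approx 1-\alpha .
\]
```

With the standard error written as σX̄ = σ/√n, this is the interval X̄ ± z(α/2) σ/√n.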
Table 1: z(α/2) values for selected confidence level from the standard normal
Z− distribution.
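Values of z(α/2) like those in such a table can be computed directly. The sketch below uses Python's standard-library `statistics.NormalDist`; the function name `z_critical` and the three confidence levels shown are illustrative assumptions and need not match the levels listed in Table 1.

```python
from statistics import NormalDist


def z_critical(confidence_level: float) -> float:
    """Return z(alpha/2) for a two-sided interval at the given confidence level."""
    alpha = 1.0 - confidence_level
    # z(alpha/2) leaves area alpha/2 to its right, i.e. 1 - alpha/2 to its left.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Illustrative confidence levels (an assumption, not taken from Table 1).
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}  z(alpha/2) = {z_critical(level):.3f}")
# 90%  z(alpha/2) = 1.645
# 95%  z(alpha/2) = 1.960
# 99%  z(alpha/2) = 2.576
```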
To recap the results, the confidence interval of a population mean µ is given by
\[
\bar{X} \pm z(\alpha/2)\,\frac{\sigma}{\sqrt{n}}, \tag{2}
\]
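Equation (2) translates into a few lines of code. The following is a minimal sketch that assumes the population standard deviation σ is known; the function name and the numbers in the usage example (a sample mean of 175 cm, σ = 7 cm, n = 1000) are hypothetical and chosen only to echo the height example from the introduction.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_known_sigma(x_bar, sigma, n, confidence_level=0.95):
    """Confidence interval for a population mean when sigma is known (equation 2)."""
    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z(alpha/2)
    margin = z * sigma / sqrt(n)                  # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical summary statistics, for illustration only.
low, high = mean_ci_known_sigma(x_bar=175.0, sigma=7.0, n=1000)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

Because the margin of error shrinks with √n, quadrupling the sample size halves the width of the interval.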
\[
\hat{p} \pm z(\alpha/2)\,\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}}, \tag{3}
\]
where z(α/2) is the appropriate value from the standard normal Z-distribution
for the desired level of confidence.
Note: The following conditions need to be satisfied to build confidence
intervals for population proportions using sample proportions:
• Random condition: The data should come from a random sample. This
ensures we have unbiased data from the population.
• Normal condition: The sampling distribution of p̂ should be approximately
normal; for that to happen, the conditions np̂ ≥ 10 and n(1 − p̂) ≥ 10 need
to be met simultaneously.
• Independence condition: Individual observations need to be independent.
If sampling without replacement, the sample size should not be more than
10% of the population.
Before doing the actual computations of the interval, it’s important to check
whether or not the above conditions have been met, otherwise the calculations
and conclusions that follow aren’t valid.
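The checks and the interval can be combined in one routine. The sketch below follows equation (3) as stated, including its n − 1 denominator; the function name, the optional `population_size` argument and the survey numbers are hypothetical. The random condition concerns how the data were collected and cannot be verified in code.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence_level=0.95, population_size=None):
    """Confidence interval for a population proportion (equation 3)."""
    p_hat = successes / n

    # Normal condition: n*p_hat >= 10 and n*(1 - p_hat) >= 10.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("normal condition not met")

    # Independence condition (10% rule); only checkable if the population
    # size is supplied by the caller.
    if population_size is not None and n > 0.10 * population_size:
        raise ValueError("sample exceeds 10% of the population")

    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)          # z(alpha/2)
    margin = z * sqrt(p_hat * (1 - p_hat) / (n - 1))     # margin of error
    return p_hat - margin, p_hat + margin


# Hypothetical survey: 520 "yes" answers out of 1000 respondents.
low, high = proportion_ci(successes=520, n=1000)
print(f"95% CI for the proportion: ({low:.3f}, {high:.3f})")
```

Raising an error when a condition fails is one possible design choice; a warning would also work, as long as an interval is not reported silently when the conditions are violated.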
\[
(\bar{X}_1 - \bar{X}_2) \pm z(\alpha/2)\,\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, \tag{4}
\]
where X̄1 and X̄2 are the sample means, n1 and n2 are the sample sizes, and σ1
and σ2 are the standard deviations of the two populations, respectively;
z(α/2) is the appropriate value from the Z-distribution for the desired
confidence level.
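A minimal sketch of equation (4), assuming both population standard deviations are known. The function name and all numbers in the usage example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci_z(x_bar1, x_bar2, sigma1, sigma2, n1, n2, confidence_level=0.95):
    """CI for the difference of two population means, sigmas known (equation 4)."""
    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)                # z(alpha/2)
    margin = z * sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)     # margin of error
    diff = x_bar1 - x_bar2
    return diff - margin, diff + margin


# Hypothetical summary statistics for two groups, for illustration only.
low, high = diff_means_ci_z(x_bar1=181.0, x_bar2=168.0,
                            sigma1=7.0, sigma2=6.5, n1=500, n2=500)
print(f"95% CI for the difference of means: ({low:.2f}, {high:.2f})")
```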
The t-distribution with n1 + n2 − 2 degrees of freedom is used instead under
the following two conditions:
1. If one or both of the sample sizes are small (less than 30).
2. When the population standard deviations are unknown, we use the sample
standard deviations along with the t-distribution. Then the formula for
the confidence interval becomes:
\[
(\bar{X}_1 - \bar{X}_2) \pm t(\alpha/2,\, n_1 + n_2 - 2)\,\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}, \tag{5}
\]
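A minimal sketch of equation (5). Python's standard library has no t-distribution, so the critical value t(α/2, n1 + n2 − 2) is passed in as an argument, read from a t-table. The function name and the summary statistics are hypothetical; the value 2.086 is the tabulated two-sided 95% critical value for 20 degrees of freedom.

```python
from math import sqrt


def diff_means_ci_t(x_bar1, x_bar2, s1, s2, n1, n2, t_critical):
    """CI for the difference of two means using sample standard deviations (equation 5).

    `t_critical` is t(alpha/2, n1 + n2 - 2), looked up in a t-table by the caller.
    """
    margin = t_critical * sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # margin of error
    diff = x_bar1 - x_bar2
    return diff - margin, diff + margin


# Hypothetical small samples: n1 = 12, n2 = 10, so 20 degrees of freedom.
low, high = diff_means_ci_t(x_bar1=5.3, x_bar2=4.5, s1=0.6, s2=0.5,
                            n1=12, n2=10, t_critical=2.086)
print(f"95% CI for the difference of means: ({low:.3f}, {high:.3f})")
```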
\[
(\hat{p}_1 - \hat{p}_2) \pm z(\alpha/2)\,\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}, \tag{6}
\]
where p̂1 and p̂2 are the sample proportions and n1 and n2 are the sample
sizes, respectively; z(α/2) is the appropriate value from the Z-distribution
for the desired confidence level. Note that to create valid confidence
intervals for a difference of population proportions from sample proportions,
the random, normal and independence conditions described in Section 1.3 need
to be satisfied for each individual sample.
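A minimal sketch of equation (6) that also applies the normal condition to each sample separately. The function names and the counts in the usage example are hypothetical; the random and independence conditions depend on the study design and are not checked here.

```python
from math import sqrt
from statistics import NormalDist


def _check_normal_condition(p_hat, n):
    """Raise if n*p_hat >= 10 and n*(1 - p_hat) >= 10 do not both hold."""
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("normal condition not met")


def diff_proportions_ci(successes1, n1, successes2, n2, confidence_level=0.95):
    """CI for the difference of two population proportions (equation 6)."""
    p1, p2 = successes1 / n1, successes2 / n2
    _check_normal_condition(p1, n1)
    _check_normal_condition(p2, n2)

    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)                  # z(alpha/2)
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)   # margin of error
    diff = p1 - p2
    return diff - margin, diff + margin


# Hypothetical counts: 240 of 400 in sample 1, 180 of 400 in sample 2.
low, high = diff_proportions_ci(240, 400, 180, 400)
print(f"95% CI for the difference of proportions: ({low:.3f}, {high:.3f})")
```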
Hence one may report: “Based on the inference, a range of likely values for
the population parameter is xxx to yyy, with a confidence level of 95%.” Note
that the population parameter is fixed, whereas the sample estimate of the
parameter varies with the sample chosen.
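The distinction between the fixed parameter and the varying interval can be illustrated by simulation. The sketch below repeatedly draws samples from a normal population with a known mean, builds a 95% interval from each sample, and counts how often the interval contains the true mean. The population values (µ = 175, σ = 7), the sample size, the number of trials and the seed are all illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu=175.0, sigma=7.0, n=50, trials=2000, confidence_level=0.95, seed=0):
    """Fraction of simulated intervals X_bar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    alpha = 1.0 - confidence_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z(alpha/2)
    margin = z * sigma / sqrt(n)                 # same for every sample here
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = fmean(sample)                    # varies from sample to sample
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


print(f"Fraction of intervals containing mu: {coverage():.3f}")
```

The fraction should come out close to 0.95: the parameter µ never moves, while roughly 95% of the randomly located intervals happen to cover it.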
3 Resource list
The following resources and videos give a better understanding of the concept
of confidence intervals:
A t-distribution table for confidence interval
Figure 3: t-table. Figure used from this website. Please go through the website
to understand how to read t-tables.
References
[1] John A Rice. Mathematical statistics and data analysis. Cengage Learning,
2006.