Confidence intervals

Dr. Debarati Bhaumik

1 Introduction
Confidence intervals (CIs) are used to convey the uncertainty associated with
sample statistics (most often the sample mean, but also the median, mode, etc.)
that are used to estimate population parameters. For example, we can estimate
the average height of Dutch nationals (the population parameter) using the
sample mean of a random sample of size n, say 1000, drawn from the population.
We have seen in the previous lecture that the sample mean is a random variable:
its value varies from sample to sample. Hence, we need to attach a measure of
variability to the point estimate (computed from the sample data) of the
population parameter. This measure of variability is called the margin of
error. The sample statistic plus or minus the margin of error gives the
confidence interval. A CI tells us how confident we can be about the point
estimate, or sample statistic, that we use to infer the population parameter.
Formally, a CI for a population parameter θ is a random interval, calculated
from the sample data, that contains θ with some specified probability or
confidence.

1.1 Confidence level


The confidence level is the percentage of confidence intervals, constructed
from repeated random samples, that will contain the population parameter θ.
Let us take the example of the population mean µ. A 95% confidence interval for
µ is a random interval that contains µ with probability 0.95. In other words,
if we constructed a confidence interval around the sample mean X̄ of each of
many different random samples, then about 95% of these confidence intervals
would contain the population mean µ.
In more statistical notation, let α be a value (0 ≤ α ≤ 1) such that 100 × α% is
the percentage of times a confidence interval constructed from random sample
data does not contain the population parameter. We then speak of a
100 × (1 − α)% confidence interval, or of a confidence level of 1 − α.

1.2 Confidence interval for a population mean
We will derive the confidence interval for the population mean, following [1].
For the other cases we will just state the results. Generally one uses the
standard normal Z-distribution to compute the confidence interval; in real-life
cases where the population variance is not known, however, the t-distribution
is used. We will discuss both cases.
For 0 ≤ α ≤ 1, let z(α) be the value such that the area under the standard
normal density curve to the right of z(α) is α; in mathematical terms,
P(Z ≥ z(α)) = α. See Figure 1. By the symmetry of the curve, the area under
the curve to the left of −z(α) is also α, i.e., P(Z ≤ −z(α)) = α.

Figure 1: Standard normal density curve showing the value z(α) and the area
under the curve to the right of it as α

Now, if a random variable Z follows the standard normal distribution, then by
the definition of z(α):

P(−z(α/2) ≤ Z ≤ z(α/2)) = 1 − α.
To understand the equation above visually, see Figure 2. Note that the area to
the right of z(α/2) is α/2 and, similarly, the area to the left of −z(α/2) is
α/2. Since the total area under the curve is 1, the area between the two tails,
i.e. between −z(α/2) and z(α/2), is 1 − α/2 − α/2 = 1 − α.
Recall the central limit theorem, which states that the sampling distribution
of the sample mean approaches a normal distribution as the sample size gets
larger, no matter what the shape of the population distribution is. Let µ be
the population mean, X̄ the mean of a sample of size n, and σX̄ the standard
deviation (or standard error) of the sample mean. Then, by the central limit
theorem, (X̄ − µ)/σX̄ approximately follows the standard normal distribution,
and hence we have:

Figure 2: Area under the standard normal curve in between −z(α/2) and
z(α/2).

P(−z(α/2) ≤ (X̄ − µ)/σX̄ ≤ z(α/2)) ≈ 1 − α,

and, upon some elementary calculation (multiplying through by σX̄ and isolating
µ), we have

P(X̄ − z(α/2)σX̄ ≤ µ ≤ X̄ + z(α/2)σX̄ ) ≈ 1 − α.



Recall that σX̄ = σ/√n, where σ is the population standard deviation. Then the
above result becomes

P(X̄ − z(α/2) σ/√n ≤ µ ≤ X̄ + z(α/2) σ/√n) ≈ 1 − α. (1)
Equation (1) says that the probability that µ lies in the interval
X̄ ± z(α/2) σ/√n is approximately 1 − α. The interval is thus called a
100 × (1 − α)% confidence interval. Note that the interval is random, and the
probability that this random interval contains the population mean µ is
approximately 1 − α.
In practice we generally choose a small value of α such as 0.1, 0.05 or 0.01,
which leads to confidence levels of 90%, 95% and 99% respectively.

α       confidence level    z(α/2)
0.1     90%                 1.64
0.05    95%                 1.96
0.01    99%                 2.58

Table 1: z(α/2) values for selected confidence level from the standard normal
Z− distribution.
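The z(α/2) values in Table 1 can also be computed rather than looked up. The following is a minimal sketch using only the Python standard library; the helper name z_critical is an assumption made for illustration and is not part of the original text.

```python
from statistics import NormalDist


def z_critical(alpha):
    """Return z(alpha/2): the point with area alpha/2 to its right
    under the standard normal density curve."""
    # inv_cdf gives the point with the requested area to its LEFT,
    # so we ask for 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)


# Recompute the values of Table 1, rounded to two decimals.
table = {alpha: round(z_critical(alpha), 2) for alpha in (0.1, 0.05, 0.01)}
for alpha, z in table.items():
    print(f"alpha = {alpha}: confidence level {1 - alpha:.0%}, z = {z}")
```

Here 1 − α/2 is passed to inv_cdf because z(α/2) is defined through the area to its right, whereas the inverse cumulative distribution function works with the area to the left.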

To recap the results, the confidence interval of a population mean µ is given by

X̄ ± z(α/2) σ/√n, (2)

where X̄ is the sample mean, σ is the population standard deviation, n is the
sample size, and z(α/2) is the appropriate value from the Z-distribution for
the desired confidence level (see Table 1 for the values). A few points to
note:
• Unknown σ: Most of the time the population standard deviation σ will
not be known. We then substitute the sample standard deviation s for it.
For large samples this substitution has a negligible effect. There is no
strict rule, but as a rule of thumb the substitution is adequate for
n ≥ 30.
• Small sample size n: When n < 30, instead of the Z-distribution table
we use the t-distribution table values corresponding to n − 1 degrees of
freedom. In Appendix A you can find the t-distribution table values for
different confidence levels. Note how, at 30 degrees of freedom, the t and
z values get close to each other.
Example: Let us look at an example from [1]. A particular area contains 8000
condominium units. In a survey of the occupants, a simple random sample
of size 100 yields the information that the average number of motor vehicles
per unit is 1.6, with a sample standard deviation of 0.8. Here, X̄ = 1.6,
n = 100 and s = 0.8. Hence, the 95% confidence interval for the population
average is X̄ ± z(0.025) s/√n. We know that z(0.025) = 1.96 and
s/√n = 0.8/10 = 0.08, hence the 95% CI is 1.6 ± 1.96 × 0.08 ≈ 1.6 ± 0.16. We
can use the Z-distribution because n ≥ 30.
Now suppose we had a simple random sample of size 25 which gave the same
sample mean and standard deviation as above. Then we would use the t-table
with 24 degrees of freedom, and the 95% confidence interval for the population
mean would be X̄ ± t(0.025) s/√n = 1.6 ± 2.06 × 0.16 ≈ 1.6 ± 0.33. This
confidence interval is wider than in the previous case. That is because the
sample size is smaller (n = 25 rather than n = 100), which makes us less
certain about the estimate.
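The two intervals of the example can be reproduced with a few lines of Python. This is only a sketch: the function name mean_ci is hypothetical, and the t value 2.064 for 24 degrees of freedom is entered by hand as it would be read from a t-table, because the standard library offers no t-distribution.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(sample_mean, sample_sd, n, critical_value):
    """Confidence interval: sample mean +/- critical value * s / sqrt(n)."""
    margin = critical_value * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin


# Large sample (n = 100): critical value z(0.025) from the standard normal.
z = NormalDist().inv_cdf(0.975)
ci_large = mean_ci(1.6, 0.8, 100, z)

# Small sample (n = 25): critical value t(0.025) with 24 degrees of freedom,
# typed in from a t-table (assumption: 2.064).
T_24 = 2.064
ci_small = mean_ci(1.6, 0.8, 25, T_24)

print("n = 100:", ci_large)
print("n = 25: ", ci_small)
```

Passing the critical value as an argument keeps the same function usable for both the Z and the t case.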

1.3 Confidence interval for a population proportion


This is the dichotomous case, in which we are interested in knowing what
proportion (or percentage) of people or population elements fall into a
particular category: for example, the percentage of people who prefer to be
contacted through email, or the percentage of people in favor of a four-day
work week. In these cases we estimate the population proportion p by the sample
proportion p̂ plus or minus a margin of error. The sample proportion is the
proportion of individuals in the sample who have the characteristic of
interest. The confidence interval for the population proportion p is given by

p̂ ± z(α/2) √(p̂(1 − p̂)/(n − 1)), (3)

where z(α/2) is the appropriate value from the standard normal Z-distribution
for the desired level of confidence.
Note: The following conditions need to be satisfied to build confidence
intervals for population proportions using sample proportions:
• Random condition: The data should come from a random sample. This
ensures we have unbiased data from the population.
• Normal condition: The sampling distribution of p̂ should be
approximately normal. For that to happen, both np̂ ≥ 10 and n(1 − p̂) ≥ 10
must hold.
• Independence condition: Individual observations need to be
independent. If sampling without replacement, our sample size shouldn't be
more than 10% of the population.

Before doing the actual computations of the interval, it’s important to check
whether or not the above conditions have been met, otherwise the calculations
and conclusions that follow aren’t valid.
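As an illustration, the sketch below checks the normal condition and then evaluates formula (3). The function name proportion_ci and the survey counts (120 of 400 respondents in favor) are made up for this example. The random and independence conditions concern how the data were collected and cannot be checked from the counts alone.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Confidence interval for a population proportion, following (3)."""
    p_hat = successes / n
    # Normal condition: both n*p_hat and n*(1 - p_hat) must be at least 10.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("normal condition not met; interval would not be valid")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Formula (3) uses n - 1 in the denominator of the standard error.
    margin = z * sqrt(p_hat * (1 - p_hat) / (n - 1))
    return p_hat - margin, p_hat + margin


# Hypothetical survey: 120 out of 400 respondents favor a four-day work week.
low, high = proportion_ci(120, 400)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

Raising an error when the normal condition fails mirrors the advice above: the conditions are checked before any interval is computed.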

1.4 Confidence interval for the difference of two means


The goal of many surveys and studies is to compare two groups, such as men
versus women or liberals versus conservatives. When the property being
compared is numerical (for example height, weight, age or income), one is
generally interested in the difference between the means (averages) of the two
populations. For example, we may want to compare the average incomes of men and
women. The confidence interval for the difference of two population means
µ1 − µ2 is given by

(X̄1 − X̄2) ± z(α/2) √(σ1²/n1 + σ2²/n2), (4)

where X̄1 and X̄2 are the sample means, n1 and n2 are the sample sizes, σ1 and
σ2 are the population standard deviations of the two groups, and z(α/2) is the
appropriate value from the Z-distribution for the desired confidence level.
We use the t-distribution with n1 + n2 − 2 degrees of freedom instead in the
following two situations:

1. One or both of the sample sizes are small (less than 30).

2. The population standard deviations are unknown. We then use the sample
standard deviations along with the t-distribution, and the formula for the
confidence interval becomes:

(X̄1 − X̄2) ± t(α/2, n1 + n2 − 2) √(s1²/n1 + s2²/n2), (5)

where s1 and s2 are the sample standard deviations of the two groups,
t(α/2, n1 + n2 − 2) is the appropriate value from the t-distribution with
n1 + n2 − 2 degrees of freedom, and 1 − α is the desired confidence level.
Read the values from the t-table in Figure 3.
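Formulas (4) and (5) share the same structure and differ only in the critical value and in whether population or sample standard deviations are plugged in. The sketch below makes that explicit. The function name diff_means_ci and all numbers are hypothetical, and the t value 2.048 for 28 degrees of freedom is typed in as it would be read from a t-table.

```python
from math import sqrt
from statistics import NormalDist


def diff_means_ci(mean1, mean2, sd1, sd2, n1, n2, critical_value):
    """Interval (mean1 - mean2) +/- critical value * sqrt(sd1^2/n1 + sd2^2/n2)."""
    standard_error = sqrt(sd1**2 / n1 + sd2**2 / n2)
    diff = mean1 - mean2
    margin = critical_value * standard_error
    return diff - margin, diff + margin


# Large samples (n1 = n2 = 100): z(0.025), as in formula (4).
z = NormalDist().inv_cdf(0.975)
ci_large = diff_means_ci(52.0, 48.0, 6.0, 5.0, 100, 100, z)

# Small samples (n1 = n2 = 15): t(0.025, 28), as in formula (5).
# Assumption: 2.048 is the table value for 28 degrees of freedom.
T_28 = 2.048
ci_small = diff_means_ci(52.0, 48.0, 6.0, 5.0, 15, 15, T_28)

print("large samples:", ci_large)
print("small samples:", ci_small)
```

With these made-up numbers the small-sample interval contains 0 while the large-sample interval does not, which shows how much the sample size affects what can be concluded about the difference.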

1.5 Confidence interval for the difference of two proportions
Just as with population means, there are situations in which we are interested
in the difference between two population proportions, such as comparing the
opinions of males and females on a four-day work week. In these cases we
estimate the difference between the population proportions by the difference
between the sample proportions plus or minus a margin of error. The confidence
interval for the difference of two population proportions p1 − p2 is given by

(p̂1 − p̂2) ± z(α/2) √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2), (6)

where p̂1 and p̂2 are the sample proportions, n1 and n2 are the corresponding
sample sizes, and z(α/2) is the appropriate value from the Z-distribution for
the desired confidence level. Note that, to create valid confidence intervals
for a difference of population proportions from sample proportions, the random,
normal and independence conditions described in Section 1.3 need to be
satisfied for each of the individual samples.
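Formula (6) can be evaluated in the same way as the single-proportion case. The following sketch uses a hypothetical function name, diff_proportions_ci, and made-up survey counts (90 of 200 in one group, 60 of 200 in the other). It checks the normal condition for each sample separately, as required above.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(successes1, n1, successes2, n2, confidence=0.95):
    """Confidence interval for p1 - p2, following formula (6)."""
    p1 = successes1 / n1
    p2 = successes2 / n2
    # Normal condition, checked for each sample individually.
    for p, n in ((p1, n1), (p2, n2)):
        if n * p < 10 or n * (1 - p) < 10:
            raise ValueError("normal condition not met for one of the samples")
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - margin, diff + margin


# Hypothetical survey on the four-day work week: 90/200 versus 60/200 in favor.
low, high = diff_proportions_ci(90, 200, 60, 200)
print(f"95% CI for p1 - p2: ({low:.3f}, {high:.3f})")
```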

2 Interpreting confidence intervals


It is very important to interpret and report confidence intervals correctly.
It is wrong to say "Based on the inference we are 95% confident that the
population mean is between xxx and yyy". The interpretation of a confidence
interval goes back to the idea of the confidence level. Remember that a
confidence interval is a random interval constructed from random sample data.
Hence, a 95% confidence level means that, of all the possible random samples of
size n, 95% yield confidence intervals that contain the population parameter.
Hence one may report "Based on the inference, a range of likely values for
the population parameter is xxx to yyy, with a confidence level of 95%." Note
that the population parameter is fixed, while the sample estimate, and with it
the interval, varies with the sample chosen.
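The repeated-sampling meaning of the confidence level can be made concrete with a small simulation. This is a sketch under stated assumptions: the population is taken to be normal with known σ, the values µ = 170 and σ = 10 are invented, and the function name coverage is hypothetical. It draws many samples, builds interval (2) around each sample mean, and counts how often the fixed µ is covered.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, confidence, trials, seed=0):
    """Fraction of simulated confidence intervals that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)  # known sigma, as in formula (2)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = fmean(sample)
        # mu is fixed; the interval around x_bar changes with every sample.
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


rate = coverage(mu=170.0, sigma=10.0, n=50, confidence=0.95, trials=2000)
print(f"share of intervals containing mu: {rate:.3f}")
```

The printed share should come out close to 0.95, though not exactly, because only a finite number of samples is simulated.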

3 Resource list
The following resources and videos may help you better understand the concept
of confidence intervals:

1. Confidence interval estimation around the population mean for known σ. This
video explains very well how to interpret a confidence interval: follow this
video.

2. Confidence interval estimation around the population mean for unknown σ:
follow this video.

A t-distribution table for confidence interval

Figure 3: t-table. Figure used from this website. Please go through the website
to understand how to read t-tables.

References
[1] John A Rice. Mathematical statistics and data analysis. Cengage Learning,
2006.
