W5 Lecture5

1. The sampling distribution of the mean plays an important role in statistical inference. 2. According to the Central Limit Theorem, for large sample sizes the sampling distribution of the mean will be approximately normal even if the population is not normal. 3. The mean of the sampling distribution is the population mean, and its standard deviation is the population standard deviation divided by the square root of the sample size.

Uploaded by

Thi Nam Phạm

Biostatistics

Lecture 5
Sampling Distribution of the Mean
2022-2 Fall Semester

Instructor: Min Jin Ha


Department of Health Informatics and Biostatistics
Graduate School of Public Health
Yonsei University
Reading
• Pagano and Gauvreau, Chapter 8
Overview
• Although the topic may seem somewhat theoretical, the sampling
distribution of the mean plays a very important role in statistical
inference
• We need these ideas in order to make statistical inferences and draw
conclusions from the results we observe
• The basic idea, called the Central Limit Theorem, is that for any
distribution with well-defined mean and variance, the distribution of
the means computed from samples of size n is approximately normal
when n is large
• This is an extremely important result for hypothesis testing and the
construction of confidence intervals later in the course
Statistical Inference
• We examined several theoretical distributions, such as the binomial
distribution and the normal distribution
• In all cases, the relevant population parameters were assumed to be known;
this allowed us to describe the distributions completely and to calculate the
probabilities associated with various outcomes
• However, in most practical applications, we are not given the values of these
parameters
• Instead, we must attempt to describe or estimate some characteristic of a
population, such as its mean or standard deviation, using the information
contained in a sample of observations
• The process of drawing conclusions about an entire population based on the
information in a sample is known as statistical inference
Forms of Statistical Inference
• Point Estimation: estimating an unknown parameter using a single
number calculated from the sample data

• Interval Estimation: estimating an unknown parameter using an
interval or range of values that is likely to cover the true population
value

• Hypothesis Testing: checking whether sample data provide evidence
against some claim made about the population
Parameters and statistics
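• It is imperative to understand the distinction between parameters
and statistics
• A parameter is a fixed, usually unknown characteristic of the population
(e.g. 𝜇, 𝜎); a statistic is a quantity computed from the sample data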
Point Estimation for Mean and SD
We're interested in estimating the mean height of women ages 18-74 in the US
• Target population: women ages 18-74 in the US
• Quantity of interest: mean height (𝜇) and SD (𝜎)
• Randomly sample 𝑛 women from the population
• The observed heights of the 𝑛 sampled women are denoted by $x_1, x_2, \ldots, x_n$
• The statistic for the mean $\mu$ (the sample mean) is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• The statistic for the SD $\sigma$ (the sample standard deviation) is $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
• A point estimate is a single guess about the value of a parameter in the population
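
As a minimal sketch of the two formulas above, the R snippet below computes the sample mean and sample SD by hand and compares them with the built-in `mean()` and `sd()`. The heights are simulated with assumed values (mean 162 cm, SD 7 cm) purely for illustration; they are not data from the lecture.

```r
# Hypothetical data: simulated heights (cm) standing in for a real sample.
# The mean of 162 and SD of 7 are assumptions chosen for illustration only.
set.seed(1)
heights <- rnorm(50, mean = 162, sd = 7)

n     <- length(heights)
x_bar <- sum(heights) / n                          # sample mean
s     <- sqrt(sum((heights - x_bar)^2) / (n - 1))  # sample SD, divisor n - 1

# The built-in functions should agree with the hand-computed values
all.equal(x_bar, mean(heights))
all.equal(s, sd(heights))
```

Note that `sd()` in R uses the same n - 1 divisor as the formula for s.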
Sampling Distribution of the Mean
Suppose we can list all the women in the population. We then do the
following:
• Take a random sample of size 𝑛 → Sample 1
• Compute the sample mean $\bar{x}_1$
• Put the sample back, and take a second random sample, also of size 𝑛 → Sample 2
• Compute the sample mean $\bar{x}_2$
• Repeating this many times, we obtain many sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$
• What is the 'theoretical' distribution of the statistics $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$?
• Check simulations in workshop!
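
The repeated-sampling procedure above can be sketched in R as follows. This is not the workshop code: the population is a hypothetical list of 100,000 simulated heights, and the sample size and number of repetitions are arbitrary choices.

```r
set.seed(2022)

# Hypothetical population of heights (cm); the values are assumed, not real data
population <- rnorm(100000, mean = 162, sd = 7)

n         <- 25     # size of each sample (arbitrary choice)
n_samples <- 5000   # number of repeated samples (arbitrary choice)

# Each repetition draws a fresh random sample of size n and records its mean
sample_means <- replicate(n_samples, mean(sample(population, size = n)))

mean(sample_means)   # should be close to mean(population)
sd(sample_means)     # should be close to sd(population) / sqrt(n)

# Histogram of the simulated sampling distribution of the mean
hist(sample_means, breaks = 40,
     main = "Simulated sampling distribution of the mean")
```

Because every call to `sample()` draws from the full population, this corresponds to putting each sample back before drawing the next one.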
Central Limit Theorem
Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from a population
distribution with mean $\mu$ and standard deviation $\sigma$. Then the
sampling distribution of $\bar{x}$ approaches a normal distribution with
mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ as $n \to \infty$.
• The mean of the sampling distribution of the mean is identical to the population mean
• The standard deviation of the distribution of the sample mean is the
population standard deviation divided by the square root of the sample size
• For a large enough sample size, the shape of the sampling distribution is
approximately normal → this holds for any population distribution with finite
variance
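
The first two bullets follow from basic properties of expectation and variance. A short derivation sketch, assuming the observations are independent draws with common mean $\mu$ and variance $\sigma^2$ (independence is what makes the variance of the sum equal the sum of the variances):

```latex
\begin{align*}
E[\bar{x}] &= E\!\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right]
            = \frac{1}{n}\sum_{i=1}^{n} E[x_i]
            = \frac{1}{n}\cdot n\mu = \mu \\
\mathrm{Var}(\bar{x}) &= \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(x_i)
            = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n} \\
\mathrm{SD}(\bar{x}) &= \sqrt{\mathrm{Var}(\bar{x})} = \frac{\sigma}{\sqrt{n}}
\end{align*}
```

These two results hold for any sample size; only the normal shape in the third bullet needs a large $n$.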
Example
• The distribution of IQ in the general population is
normal with mean 100 and SD 15.
• Suppose we draw samples of size n=10 from this
population
• Then from the CLT, we know that the distribution of the
sample averages will also be normal with mean 100 and
standard deviation $15/\sqrt{10} \approx 4.7$.
• If our data on IQ are exactly normal, then the
distribution of the sample averages will also be exactly
normal
• If our data on IQ are not exactly normal, then the rule of
thumb is that a minimum of about 30 observations is
needed for the CLT to kick in
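
A small R sketch of the two points in this example. The first lines reproduce the standard deviation of the sample mean from the values on the slide. The second part illustrates the rule of thumb with a skewed population; the exponential distribution and the `skewness` helper are assumptions introduced here for illustration, not part of the lecture.

```r
# Standard deviation of the sample mean for the IQ example (values from the slide)
sigma <- 15
n     <- 10
sigma / sqrt(n)   # about 4.74

# Hypothetical skewed population: exponential with rate 1 (an assumption)
set.seed(10)

# Simple moment-based skewness; a value near 0 indicates a symmetric shape
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

means_n5  <- replicate(10000, mean(rexp(5)))    # sample means with n = 5
means_n30 <- replicate(10000, mean(rexp(30)))   # sample means with n = 30

# The sample means are less skewed for n = 30 than for n = 5
c(n5 = skewness(means_n5), n30 = skewness(means_n30))
```

The sample means are still visibly skewed at n = 5 and much closer to symmetric at n = 30, which is the sense in which the CLT "kicks in".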
Back to proportions
• Going back to our example, the Central Limit Theorem (CLT) tells us
that the distribution of sample averages $\hat{\pi}_i$ should approximately
follow a normal distribution with mean $\pi$ and standard deviation
$\sqrt{\pi(1-\pi)/n}$.
• The CLT tells us that the sample averages are approximately normally
distributed if we have enough data. This result holds even if our original
variables (here, a binary variable) are not normally distributed
• We will check this in the workshop
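
A minimal simulation sketch of this claim in R. The true proportion 0.3 and the sample size 100 are hypothetical values chosen for illustration; the example referred to above may use different numbers.

```r
set.seed(8)

pi_true <- 0.3    # hypothetical population proportion (assumption)
n       <- 100    # hypothetical sample size (assumption)

# Each simulated sample proportion is the mean of n binary (0/1) observations
p_hat <- rbinom(10000, size = n, prob = pi_true) / n

mean(p_hat)                          # close to pi_true
sd(p_hat)                            # close to the theoretical value below
sqrt(pi_true * (1 - pi_true) / n)    # theoretical SD of the sample proportion

hist(p_hat, breaks = 30, main = "Simulated sample proportions")
```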
Applications of CLT
• Consider the distribution of serum cholesterol levels for all 20- to 74-year-old males
living in South Korea. The mean of this population is $\mu = 211$ mg/100 ml, and the
standard deviation is $\sigma = 46$ mg/100 ml. If we select repeated samples of size 25 from
the population, what proportion of the samples will have a mean value of 230 mg/100 ml
or above?
• By the CLT, we know that $\bar{X} \sim N(211, 46^2/25)$ approximately, and we want to compute $P(\bar{X} > 230)$

Solution 1: find the right-tail probability from $N(211, 46^2/25)$

pnorm(230, lower.tail=FALSE, mean=211, sd=46/5)

Solution 2: based on $P\left(\frac{\bar{X}-211}{46/5} > \frac{230-211}{46/5}\right) = P\left(Z > \frac{230-211}{46/5}\right)$, find the right-tail probability from the
standard normal distribution

pnorm((230-211)/(46/5), lower.tail=FALSE)

Interpretation:
• About 1.9% of the samples will have a mean of 230 mg/100 ml or above
• If we select a single sample of size 25 from the population of 20- to 74-year-old males, the probability that the
mean serum cholesterol level for this sample is 230 mg/100 ml or higher is 0.019
Applications of CLT
• What mean value of serum cholesterol level cuts off the lower 10% of
the sampling distribution of means?

• $P(\bar{X} < M) = P\left(\frac{\bar{X}-211}{46/5} < \frac{M-211}{46/5}\right) = P\left(Z < \frac{M-211}{46/5}\right) = 0.1$
• Find $z_{0.1}$ = qnorm(0.1) = $\frac{M-211}{46/5}$ → $M = \frac{46}{5} z_{0.1} + 211 = 199.2$
• Or qnorm(0.1,mean=211,sd=46/5)
• Interpretation: approximately 10% of the samples of size 25 have
means that are less than or equal to 199.2mg/100ml
Applications of CLT
• Let us now calculate the upper and lower limits that enclose 95% of the means of
samples of size 25 drawn from the population
• Consider the symmetric interval, which is the shortest interval with this coverage
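• $P(z_{0.025} < Z < z_{0.975}) = 0.95$
$\Leftrightarrow P\left(-1.96 < \frac{\bar{X}-211}{46/5} < 1.96\right) = 0.95$
$\Leftrightarrow P\left(211 - 1.96 \cdot \frac{46}{5} < \bar{X} < 211 + 1.96 \cdot \frac{46}{5}\right) = 0.95$
$\Leftrightarrow P(193 < \bar{X} < 229) = 0.95$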
• Interpretation: Approximately 95% of the means of samples of size 25 lie
between 193 mg/100 ml and 229 mg/100 ml
• Calculate the same limits when we select samples of sizes 5, 15, 25, 50, 100, and 1000
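
One possible way to carry out this calculation for several sample sizes at once, sketched in R using the population values from the cholesterol example ($\mu = 211$, $\sigma = 46$). The variable names are illustrative.

```r
mu    <- 211   # population mean (mg/100 ml), from the example
sigma <- 46    # population SD (mg/100 ml), from the example

n  <- c(5, 15, 25, 50, 100, 1000)   # sample sizes to compare
se <- sigma / sqrt(n)               # SD of the sample mean for each n
z  <- qnorm(0.975)                  # about 1.96

# Limits that enclose the central 95% of the sample means
limits <- data.frame(n = n, lower = mu - z * se, upper = mu + z * se)
print(round(limits, 1))
```

The interval narrows as n grows because the standard deviation of the sample mean shrinks in proportion to $1/\sqrt{n}$.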
Applications of CLT
• Finally, let’s consider a more complicated question. How large would
our random samples need to be for 95% of their averages to lie within
+/-10 of the population mean 𝜇?
• To solve this, find the sample size 𝑛 for which
$P(\mu - 10 < \bar{X} < \mu + 10) = 0.95$

• Interpretation: Samples of size xxx would be required for 95% of the
sample means to lie within +/-10mg/100ml of the population mean 𝜇
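
A hedged sketch of how this could be solved in R, assuming the same population SD as in the cholesterol example ($\sigma = 46$ mg/100 ml). Standardizing the probability statement gives $1.96 \cdot \sigma/\sqrt{n} = 10$, so $n = (1.96\,\sigma/10)^2$, rounded up to the next whole number.

```r
sigma  <- 46    # population SD (mg/100 ml), assumed from the earlier example
margin <- 10    # required half-width around the population mean
z      <- qnorm(0.975)

# Smallest whole n with z * sigma / sqrt(n) <= margin
n_required <- ceiling((z * sigma / margin)^2)
n_required

# Check: probability that the sample mean falls within +/- margin of mu
se       <- sigma / sqrt(n_required)
coverage <- pnorm(margin / se) - pnorm(-margin / se)
coverage   # should be at least 0.95
```

Rounding up rather than to the nearest integer keeps the coverage at or above 95%.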
