
04 Normal Approximation For Data and Binomial Distribution

The document discusses the normal distribution and how it can be used to summarize and analyze data. It introduces key concepts such as the mean, standard deviation, z-scores, and the empirical rule. These concepts are used to analyze data that follow a normal distribution, for example to find the percentage of data that falls within a certain number of standard deviations of the mean. The document also discusses how data can be standardized and how the normal curve can be used to approximate other distributions such as the binomial.

Uploaded by admirodebrito
Copyright © All Rights Reserved

The normal curve

Many data have histograms that look bell-shaped, e.g. heights, weights, IQ scores:

[Figure: histogram of the heights of 928 fathers; horizontal axis from 64 to 72 inches]

‘The data follow the normal curve.’


But remember that some data have histograms that look quite different, e.g. incomes,
house prices.
The empirical rule

If the data follow the normal curve, then


- about 2/3 (68%) of the data fall within 1 standard deviation of the mean
- about 95% fall within 2 standard deviations of the mean
- about 99.7% fall within 3 standard deviations of the mean
Galton’s measurements of heights of fathers have x̄ = 68.3 in and s = 1.8 in.
Therefore about 95% of all heights are between 68.3 in −2 × 1.8 in = 64.7 in and
68.3 in +2 × 1.8 in = 71.9 in.
Recall that in a histogram, percentages are given by areas:
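Because percentages are areas, the empirical rule can be checked numerically. The following is a minimal sketch, assuming Python 3.8+ and its standard-library `statistics.NormalDist`; the helper name `area_within` is illustrative and not part of the slides. It prints the intervals around Galton's mean and the corresponding areas under the standard normal curve.

```python
from statistics import NormalDist

# Standard normal curve (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k: float) -> float:
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Summary statistics quoted in the text for Galton's fathers' heights.
mean_height = 68.3  # inches
sd_height = 1.8     # inches

for k in (1, 2, 3):
    low = mean_height - k * sd_height
    high = mean_height + k * sd_height
    print(f"within {k} SD: {low:.1f} in to {high:.1f} in, area {area_within(k):.1%}")
```

The three areas should come out close to the 68%, 95% and 99.7% stated by the rule.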
Standardizing data

A normal curve is determined by x̄ and s: If the data follow the normal curve, then
knowing x̄ and s means knowing the whole histogram.
To compute areas under the normal curve, we first standardize the data by subtracting
off x̄ and then dividing by s:
$$z = \frac{\text{height} - \bar{x}}{s}$$

z is called the standardized value or z-score.


z has no unit (height, x̄ and s all have the unit ‘inches’)
For example, z = 2 means the height is 2 standard deviations above average.
z = −1.5 means the height is 1.5 standard deviations below average.
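The standardization step can be sketched in a few lines of Python. The function name `z_score` and the two example heights (71.9 in and 65.6 in) are chosen here for illustration; only x̄ = 68.3 in and s = 1.8 in come from the text.

```python
def z_score(value: float, mean: float, sd: float) -> float:
    """Standardize a value: subtract the mean, then divide by the SD."""
    return (value - mean) / sd


mean_height = 68.3  # inches, from the text
sd_height = 1.8     # inches, from the text

# Illustrative heights: one 2 SDs above average, one 1.5 SDs below.
z_tall = z_score(71.9, mean_height, sd_height)
z_short = z_score(65.6, mean_height, sd_height)

print(round(z_tall, 2), round(z_short, 2))
```

The inches cancel in the division, which is why the resulting z-scores carry no unit.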
The standard normal curve
Standardized data have mean 0 and standard deviation equal to 1 – this is the point of
standardizing.
Fathers’ heights follow the normal curve with x̄ = 68.3 in and s = 1.8 in. Therefore the
standardized values follow the standard normal curve with mean 0 and standard
deviation 1.
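A quick way to see this is to standardize a small data set and recompute its mean and standard deviation. The sketch below uses a made-up list of seven heights (hypothetical, not Galton's data) and the standard-library `statistics` module; `stdev` uses the n − 1 divisor, which is an assumption since the slides do not say which divisor defines s.

```python
from statistics import mean, stdev

# Hypothetical sample of heights in inches (not Galton's data).
heights = [66.1, 67.4, 68.0, 68.3, 69.2, 70.5, 71.9]

x_bar = mean(heights)
s = stdev(heights)  # sample standard deviation (n - 1 divisor)

# Standardize every observation with the sample's own mean and SD.
z_values = [(h - x_bar) / s for h in heights]

print(mean(z_values), stdev(z_values))
```

Up to floating-point rounding, the standardized values have mean 0 and standard deviation 1, whatever the original data were.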
This curve is given by the function

$$y = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$$

[Figure: the standard normal curve, plotted for x from −4 to 4]
Normal approximation
Finding areas under the normal curve is called normal approximation.
What percentage of fathers have heights between 67.4 in and 71.9 in?

1. Standardize:
$$\frac{67.4\,\text{in} - 68.3\,\text{in}}{1.8\,\text{in}} = -0.5 \qquad \frac{71.9\,\text{in} - 68.3\,\text{in}}{1.8\,\text{in}} = 2$$

2. Mark the area under the normal curve:

[Figure: histogram of the heights of 928 fathers, with the area between 67.4 in and 71.9 in marked]



3. Write the desired area in a form that can be computed by software or looked up in a table. Typically we can look up the area to the left of a given value, so the area between −0.5 and 2 is the area to the left of 2 minus the area to the left of −0.5.

4. Use software or a table to find these values: 97.7% − 30.9% = 66.8%
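The four steps can be reproduced with software. This is a minimal sketch using Python's standard-library `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value; the variable names are illustrative.

```python
from statistics import NormalDist

# Normal curve with the mean and SD of the fathers' heights (from the text).
heights = NormalDist(mu=68.3, sigma=1.8)

# cdf(x) is the area under the curve to the left of x.
left_of_upper = heights.cdf(71.9)
left_of_lower = heights.cdf(67.4)
area_between = left_of_upper - left_of_lower

print(f"{left_of_upper:.1%} - {left_of_lower:.1%} = {area_between:.1%}")
```

Passing the mean and SD to `NormalDist` performs the standardization implicitly. The unrounded difference is about 66.9%, while subtracting the rounded table values gives the 66.8% quoted above.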


The empirical rule is a special case of normal approximation: the areas under the standard normal curve between −1 and 1, between −2 and 2, and between −3 and 3 are about 68%, 95%, and 99.7%.
Computing percentiles for normal data

What is the 30th percentile of the fathers’ heights?



From software or from a normal table: z = −0.52


Recall z = (height − x̄)/s. Solve for height = x̄ + zs.
Or: z = −0.52 means that the height is 0.52 standard deviations below average. So the height is x̄ − 0.52s = 68.3 in − (0.52)(1.8 in) = 67.4 in.
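The same percentile can be obtained from software instead of a table. The sketch below assumes Python 3.8+, where `statistics.NormalDist.inv_cdf` returns the value with a given area to its left; the variable names are illustrative.

```python
from statistics import NormalDist

mean_height = 68.3  # inches, from the text
sd_height = 1.8     # inches, from the text

# z-score with 30% of the standard normal curve to its left.
z_30 = NormalDist().inv_cdf(0.30)

# Undo the standardization: height = mean + z * SD.
height_30 = mean_height + z_30 * sd_height

print(f"z = {z_30:.2f}, 30th percentile = {height_30:.1f} in")
```

This mirrors the two steps on the slide: look up z first, then solve the standardization formula for the height.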
The binomial setting

We saw that for a newborn baby, there is a 49% chance that it is a girl.
What are the chances that 2 out of 3 newborns are girls?
We can compute this by listing all the possibilities (total enumeration):
P(2 out of 3 are girls) = P(GGB or GBG or BGG)
= P(GGB) + P(GBG) + P(BGG) (addition rule)
= P(G) P(G) P(B) + P(G) P(B) P(G) + ... (multiplication rule)
= 3 × (0.49)(0.49)(0.51) ≈ 0.367

‘3’ counts the number of ways one can arrange two G and one B
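Total enumeration is easy to automate for small n. The following sketch lists all 2³ birth sequences with `itertools.product`, keeps those with exactly two girls, and adds up their probabilities; the variable names are illustrative, and independence between births is assumed, as in the calculation above.

```python
from itertools import product

p_girl = 0.49          # from the text
p_boy = 1 - p_girl

total = 0.0
ways = 0
for outcome in product("GB", repeat=3):   # GGG, GGB, GBG, ...
    if outcome.count("G") == 2:
        prob = 1.0
        for child in outcome:             # multiplication rule
            prob *= p_girl if child == "G" else p_boy
        total += prob                     # addition rule
        ways += 1

print(ways, round(total, 4))
```

The loop finds the same three arrangements as the hand calculation, each with probability (0.49)(0.49)(0.51).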
The binomial setting

This is an example of the binomial setting:


- There are n = 3 independent repetitions of an experiment.
- Each of these experiments has two possible outcomes (which are generically called ‘success’ and ‘failure’).
- The probability of success p = 49% is the same in each experiment.
The binomial coefficient
What about P(2 out of 5 newborns are girls)?
In principle, we can compute this in the same way. However, there are now 10 possibilities to arrange 2 girls among 5 newborns:
GGBBB, GBGBB, GBBGB, ...
The number of possibilities grows very quickly as n gets larger. Fortunately, there is a
formula for it:
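The binomial coefficient counts the number of ways one can arrange k successes in n experiments:

$$\frac{n!}{k!\,(n-k)!} \qquad \text{where } n! = 1 \times 2 \times 3 \times \dots \times n \text{ and } 0! = 1.$$

We had n = 3 and k = 2, so

$$\frac{3!}{2!\,1!} = \frac{1 \times 2 \times 3}{(1 \times 2) \times 1} = 3.$$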
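The formula translates directly into code. This sketch writes it out with `math.factorial` and compares the result with the built-in `math.comb` (available from Python 3.8); the function name `arrangements` is illustrative.

```python
from math import comb, factorial


def arrangements(n: int, k: int) -> int:
    """Number of ways to arrange k successes in n experiments."""
    return factorial(n) // (factorial(k) * factorial(n - k))


# The two cases discussed in the text: 2 girls among 3 and among 5 newborns.
print(arrangements(3, 2), arrangements(5, 2))

# The standard library computes the same quantity directly.
print(comb(3, 2), comb(5, 2))
```

For larger n, `comb` is the more convenient choice because it avoids forming the full factorials.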
The binomial formula

Applying this coefficient in the binomial setting gives:


$$P(k \text{ successes in } n \text{ experiments}) = \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$

This is the binomial probability.


The binomial formula
You play an online game 10 times. Each time there are three possible outcomes:
P(win a big prize) = 10%, P(win a small prize) = 20%, P(win nothing) = 70%.
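What is P(win two small prizes)? Count ‘win a small prize’ as success (p = 20%) and the other two outcomes together as failure. Then this is the binomial setting with n = 10 and k = 2.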
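The game question can be answered by plugging n = 10, k = 2 and p = 0.2 into the binomial formula. A minimal sketch, assuming ‘win a small prize’ is treated as the success; the function name `binomial_probability` is illustrative.

```python
from math import comb


def binomial_probability(k: int, n: int, p: float) -> float:
    """P(k successes in n independent experiments with success chance p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Success = 'win a small prize', so p = 0.20 and n = 10 plays.
p_two_small = binomial_probability(2, 10, 0.20)

print(f"P(win two small prizes) = {p_two_small:.1%}")
```

Summing `binomial_probability(k, 10, 0.20)` over k = 0, ..., 10 gives 1, which is a useful sanity check on the formula.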
Random variables
The outcomes of the 10 experiments are due to chance, so the number of successes is
random: One set of 10 experiments might result in 4 successes, another set might
result in 7 successes.
X = ‘number of successes’ is called a random variable.
In the game example, P(X = 2) = 30.2%. X has the binomial distribution.
We can visualize the probabilities of the various outcomes of X with a probability
histogram:
[Figure: probability histogram for the binomial with n = 10, p = 0.2; horizontal axis from 0 to 10]
The probability histogram

A histogram of data gives percentages for observed data. In contrast, a probability histogram is a theoretical construct: it visualizes probabilities rather than data that have been empirically observed.
Normal approximation to the binomial

As the number of experiments n gets larger, the probability histogram of the binomial
distribution looks more and more similar to the normal curve:
[Figure: probability histograms for the binomial with n = 10, p = 0.2 (left) and with n = 50, p = 0.2 (right)]

In fact, we can approximate binomial probabilities using normal approximation: to standardize, subtract off np and then divide by √(np(1 − p)).
In the previous example, we had p = P(win a small prize) = 0.2.
Play n = 50 times. What is P(at most 12 small prizes) ?
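One way to work this example is sketched below, using only the Python standard library. It standardizes 12 with mean np and standard deviation √(np(1 − p)), reads off the area to the left, and compares it with the exact binomial sum. The variant that standardizes 12.5 instead of 12 (a continuity correction) is an addition not described in the slides.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.2                 # from the text
mu = n * p                     # subtract off np ...
sd = sqrt(n * p * (1 - p))     # ... then divide by sqrt(np(1 - p))

# Normal approximation: area to the left of the standardized value of 12.
z = (12 - mu) / sd
approx = NormalDist().cdf(z)

# Same idea, standardizing 12.5 (continuity correction; not in the slides).
approx_corrected = NormalDist().cdf((12.5 - mu) / sd)

# Exact answer from the binomial formula, summed over k = 0, ..., 12.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(13))

print(f"z = {z:.2f}")
print(f"normal approximation: {approx:.3f}")
print(f"with continuity correction: {approx_corrected:.3f}")
print(f"exact binomial: {exact:.3f}")
```

Comparing the three numbers shows how close the approximation is for n = 50. The exact sum is still cheap here, but the normal approximation needs only a single lookup however large n is.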
Sampling without replacement

A simple random sample selects subjects without replacement.


This is not the binomial setting, because p changes after a subject has been removed.
But if the population is much larger than the sample, then sampling with replacement
is about the same as sampling without replacement.
Then the number of successes will have approximately the binomial distribution
(and so it will approximately follow the normal curve).
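The claim can be illustrated by comparing the two sampling schemes directly. In this sketch the population size (10,000), the number of successes in it (2,000), and the sample size and count (50 and 12) are all made-up values chosen for illustration. The probability without replacement is computed by counting samples with `math.comb`.

```python
from math import comb

# Hypothetical population: 10,000 subjects, 2,000 of them 'successes'.
population = 10_000
successes_in_population = 2_000
n = 50   # sample size, much smaller than the population
k = 12   # number of successes in the sample

p = successes_in_population / population

# Without replacement: count the samples that contain exactly k successes.
without_replacement = (
    comb(successes_in_population, k)
    * comb(population - successes_in_population, n - k)
    / comb(population, n)
)

# With replacement: the binomial formula with constant p.
with_replacement = comb(n, k) * p**k * (1 - p) ** (n - k)

print(f"without replacement: {without_replacement:.4f}")
print(f"with replacement:    {with_replacement:.4f}")
```

Shrinking `population` toward the sample size while keeping p fixed makes the two numbers drift apart, which is the situation where the binomial setting no longer applies.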
