Estimating Numbers
Chapter Overview
As the previous chapter showed, researchers who are interested in large populations have to base their
research on the results from smaller, more manageable samples. As far as possible, the sample should
be representative of the population. However, you can describe the population exactly only if you survey
everyone in it. What you can do with the sample result is to infer from it the most probable population result
using inferential statistics.
There are two basic types of analysis using inferential statistics: estimation and hypothesis testing. They
involve looking at the sample data from different perspectives, like describing a glass as ‘half-full’ or ‘half-
empty’. This and the next chapter focus on estimation – how to estimate population values such as means
and percentages from sample results. This chapter is about numerical data – measuring variables with
numbers. The following chapter goes through the same ideas but with categorical data.
Inferential statistics work with samples that, as far as possible, are representative of the population. This
means that you use inferential statistics with data collected using probability sampling. As Chapter 10 pointed
out, with probability sampling every individual in the population has a known probability (i.e. chance) of being
in the sample. Researchers select individuals at random, using a random numbers table.
Unfortunately, even probability sampling cannot guarantee a sample that exactly reflects the population from
which it is drawn. The following very simple set of five numbers illustrates this: 1, 3, 5, 7, 9.
You paint each number on a ball and place the five balls in the lottery drum. These numbered balls make up
the population. The machine randomly picks out one ball, you make a note of its number (e.g. 3), replace it,
wait for the machine to pick out a second ball, and again note its number (e.g. 5). These two selections make
up a sample. You then work out the mean of the two numbers in the sample (i.e. 4). As the population mean
is 5 (i.e. (1 + 3 + 5 + 7 + 9) ÷ 5), you might also expect a similar sample mean. However, although a mean
of 5 is the most likely result, it is not the only one. Altogether there are nine possible sample means (1, 2, 3,
4, 5, 6, 7, 8, and 9), though some are more likely to occur than others. For example, there is only one way of
getting a sample mean of 1 – drawing 1, replacing it, and then drawing 1 again. There are two ways of getting
a sample mean of 2 (1 + 3 and 3 + 1). And there are three ways of getting a sample mean of 3 (1 + 5, 3 + 3,
and 5 + 1). Figure 11.1 shows all 25 possible results.
Figure 11.1 All possible sample results: sampling two values from a population of five values (1, 3, 5,
7, 9)
Figure 11.2 summarises this information about the sample results. The graph shows a sampling distribution,
so called because it is the distribution of all possible sample results for the same sample size.
Figure 11.2 Sampling distribution: sampling two values from a population of five values (1, 3, 5, 7, 9)
1 Most importantly, the sampling distribution is centred on the actual population mean (i.e. 5).
2 The sample result in the centre of the distribution is also the sample result most likely to
occur. The greater the difference between the population and the sample result, the
smaller the chance of drawing such a sample.
3 The sampling distribution is symmetrical. In other words, the left-hand side of the graph
is a mirror image of the right-hand side.
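The 25 possible samples and the three features just listed can be checked directly with a short script (a sketch in Python; the population and sample size come from the example in the text):

```python
from itertools import product
from collections import Counter

population = [1, 3, 5, 7, 9]

# All 25 ordered samples of size 2, drawn with replacement
samples = list(product(population, repeat=2))
means = [(a + b) / 2 for a, b in samples]

# Sampling distribution: how many samples produce each mean
distribution = Counter(means)
print(sorted(distribution.items()))

# The distribution is centred on the population mean
print(sum(means) / len(means))  # 5.0
```

Running this confirms that a mean of 5 occurs in more samples (five) than any other value, that the counts fall away on either side of 5, and that the distribution is symmetrical.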
These general features also apply to much larger studies. For example, a university careers adviser carries
out a survey of 225 of the 10,000 recent graduates, and finds that the mean salary of the sample is $50,000.
What does this sample result say about the salaries of all recent graduates from the university? The most
likely sample value is also the actual population value. Thus, provided the survey used probability sampling,
the best single estimate of the population mean is the one given by the sample result – $50,000. A best single
estimate is termed a point estimate.
However, to leave the analysis there ignores the fact that each sample is only one of a much larger number of
possible samples. Thus, the result from the one sample taken is just one of a much larger number of possible
results. Although the most likely sample result is the actual population value, there is no guarantee that the one sample taken produces it.
Your analysis needs to take this variability in sample results into account. The reasoning is similar to that
about the mean and the standard deviation. The descriptive power of the mean depends on how spread
out the distribution is around the mean. In other words, it depends on the variability of the values around
the mean, as measured by the standard deviation. Similarly, what you need here is not only the result from
the one sample taken, but also a measure of the variability of all possible results from samples of the same
size as the one sample taken. In other words, you need to know the standard deviation of the means for all
possible samples of the same size as yours.
With the previous very small example, it was easy to work out all the possible samples, calculate the
sample mean, and draw up the sampling distribution. This isn't possible with real-life research where the only
information you have comes from a single sample. The following section shows how to estimate population
values from the results of just a single sample.
In the graduate salary survey referred to above, there are 225 graduates in the sample and 10,000 graduates
in the population. This means that there are many millions of different samples of 225 graduates in addition
to the one taken. Even if you have the computing power to calculate all the sample means, you still face an
impossible task because you do not know the salaries of all 10,000 graduates in the population. And if you
knew all the salaries, there would be absolutely no point in taking a sample!
Fortunately, you don't need to do the impossible, and draw the many millions of different samples of 225
graduates from the population of 10,000, calculate the mean salary of each, and then plot all the means on a
bar graph. Drawing on work done by statisticians, you can accurately predict what the sampling distribution of
salary means looks like. Figure 11.3 shows this sampling distribution. It's called a normal distribution because
the same type of bell-shaped graph is normally produced when you plot the distribution of all the results from
samples of the same large size.
The first person to identify a normal distribution was Abraham de Moivre (1667–1754).
At the time, writers did not take spelling too seriously, spelling even their own names in
several ways. For example, Abraham at different times spelled his family name as de
Moivre, Demoivre, and De Moivre (Tankard 1984: 146). A century earlier, attitudes to
spelling were even more lax. For example, Shakespeare (1564–1616) ‘did not spell his
name the same way twice in any of his six known signatures, and even spelled it two ways
on one document, his will …. Curiously, the one spelling he never seemed to use himself
was Shakespeare’ (Bryson 1990: 116).
It is extremely important that you understand Figure 11.3 because the idea of sampling distributions runs
through much of the rest of the book. Remember, Figure 11.3 shows the results of all the millions of possible
different samples of 225 graduates that can be taken from the population of 10,000 graduates. For each
sample, you calculate the mean of the 225 salaries, and plot it along with the millions of other sample means.
The horizontal axis shows sample means. Those samples with the lowest means (i.e. when, by chance, only
the most poorly paid graduates in the population are sampled) are in the bars on the left. Those samples with
the highest means (i.e. when, by chance, only the most highly paid graduates in the population are sampled)
are in the bars on the right. The vertical axis shows the percentage of samples with each of the various
sample mean results. The taller the bar, the greater the percentage of samples with this mean result. Figure
11.4 summarises the basic steps in building a sampling distribution.
Figure 11.4 Logic of sampling distribution of means
The features of the sampling distribution of the very simple example shown in Figure 11.2 also apply to the
sampling distribution for the millions of samples about graduate salaries. First, it is centred on the actual
population mean – the average salary of all 10,000 recent graduates. For example, if the careers adviser
had surveyed all 10,000 graduates and found that the population mean was $50,000, then the sampling
distribution of salary means would be centred on $50,000.
Second, it is centred on the sample result that is most likely to occur – the bars are tallest in the centre of
the distribution. Thus, the samples that are most likely to occur are those showing a mean value equal to the
population mean.
Third, it is symmetrical – the left side is a mirror image of the right. Consequently, the umpteen million samples
with means that are less than the population mean are matched by the same number of umpteen million
samples with means that are more than the population mean. In fact, you can give a much more detailed
description of the normal distribution. To make the following account easier to understand, assume that you
already know the mean ($50,000) and the standard deviation ($10,000) of the sampling distribution of the
salaries. However, it is important to bear in mind that I have simply plucked these figures out of the air. Indeed,
the whole purpose of this section is to estimate the mean of the sampling distribution, and from there to
estimate the population mean.
Table 11.1 shows six salary categories based on a mean of $50,000 and a standard deviation of $10,000. It
is centred on $50,000 and moves away from it in units of $10,000 (i.e. in standard deviation units). To simplify
things, assume that no sample result is exactly the same as the population mean ($50,000.00) or any other
values used to divide up the categories (e.g. $60,000.00, $40,000.00).
Table 11.1 Chance of sample means occurring in sampling distribution
Table 11.1 also shows the likelihood of sample results occurring in each category. For example, there is a
34% chance of drawing a sample with a mean over $50,000 and under $60,000; a 13½% chance of drawing
a sample with a mean over $60,000 and under $70,000; and only a 2½% chance of drawing a sample with a
mean of over $70,000. Because the distribution is symmetrical, the same percentages also apply to sample
means that are less than the population mean. Figure 11.5 also shows this.
Figure 11.5 Sampling distribution: chance of salary sample means occurring
You can standardise the dollar values shown in Figure 11.5 by expressing them in terms of how many
‘standard deviation units’ they are from the mean. Recall the two statistics plucked out of the air earlier:
a mean of $50,000 and a standard deviation of $10,000. Using these figures, you can say that a sample
result of $60,000 is $10,000, or one standard deviation unit, more than the mean of the sampling distribution.
Because the mean of the sampling distribution is also the population mean, a sample result of $60,000 is
one standard deviation unit (i.e. $10,000) more than the population mean, and a sample result of $40,000
is one standard deviation unit less than the population mean. Similarly, $70,000 is two standard deviation
units (i.e. $20,000) more than the population mean, and $30,000 is two standard deviation units less than the
population mean.
Figure 11.6 replaces the salary sample means shown in Figure 11.5 with standard deviation units. It shows
that 68% of all sample means (34 + 34) are within one standard deviation of the population mean. It also
shows that 95% (13½ + 34 + 34 + 13½) of all sample means are within two standard deviation units of the
population mean. These percentages apply to all sampling distributions when the sample size is large enough
to produce a normal distribution.
Figure 11.6 Sampling distribution: chance of standardised sample means occurring
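The 68% and 95% figures are properties of the normal curve itself, and can be reproduced from the standard normal distribution using only Python's standard library (a sketch; phi is the cumulative normal probability):

```python
import math

def phi(z):
    """Cumulative probability of the standard normal distribution up to z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Chance of a sample mean falling within k standard deviation units
# of the population mean
for k in (1, 2):
    print(k, round(phi(k) - phi(-k), 4))
# k = 1 gives roughly 0.68; k = 2 gives roughly 0.95 (more precisely
# 0.9545 - the exact 95% figure uses 1.96 units, as Table 11.3 shows later)
```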
It's now time to get back to the basic problem of how to estimate the population mean based on the result
from only one of the huge number of possible samples. The best single estimate of the population mean is
that given by the one sample result. However, your results need to recognise the variability in sample results.
To do this, you need to qualify the mean result from the one sample taken with a measure of this overall
variability. A measure of the variability of a distribution is the standard deviation. Thus, the aim is to find the
standard deviation of the sampling distribution of the means.
Before you can calculate the standard deviation of the sampling distribution, you need to know the standard
deviation of the population. And to find the standard deviation of the population, you need to know the
population mean – which, of course, is what you were trying to find in the first place! However, statisticians
have found a way around this catch-22. It goes like this:
1 Estimate the standard deviation of the population from the standard deviation of the
sample:
estimated population SD = sample SD × √(n ÷ (n − 1))
2 Estimate the standard deviation of the sampling distribution of the means from the
standard deviation of the population:
estimated SD of sampling distribution = estimated population SD ÷ √n
where n is the size of the sample.
The standard deviation of the sampling distribution of the means is usually referred to as the standard error
of the sample means. This helps to distinguish it from the observed standard deviation of the one sample
taken and the estimated standard deviation of the population. The term standard error also highlights the fact
that it measures the error likely when estimating the population mean from the sample mean. I'll use the term
standard error from now on.
How do these two steps apply to the graduate salaries example? There are 225 individuals in the one sample
taken. Results from this sample show that the mean of the sample is $49,000 and the standard deviation of
the sample is $10,477.
First, you estimate the standard deviation of the population from the standard deviation of the sample:
estimated population SD = $10,477 × √(225 ÷ 224) ≈ $10,500
Second, you estimate the standard error (SE) of the sampling distribution from the standard deviation of the
population:
SE = $10,500 ÷ √225 = $10,500 ÷ 15 = $700
The estimate of the standard error of the sampling distribution (i.e. the standard deviation of the sampling
distribution) is thus $700.
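The two-step calculation can be written as a small function (a sketch; the figures are the sample statistics given in the text):

```python
import math

def standard_error(sample_sd, n):
    """Estimate the standard error of the sample means.

    Step 1: estimate the population SD from the sample SD.
    Step 2: divide the population SD estimate by the square root
            of the sample size.
    """
    population_sd = sample_sd * math.sqrt(n / (n - 1))
    return population_sd / math.sqrt(n)

# Graduate salaries example: sample SD $10,477, sample size 225
print(round(standard_error(10_477, 225)))  # 700
```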
You now have several useful bits of information about the sampling distribution – that is, about the distribution
of the mean salaries of all the millions of possible samples of 225 graduates that can be drawn from the
population of 10,000 graduates:
• It has a mean that is the same as the mean of the population from which the samples are drawn.
• It has a standard error of $700.
• It has a normal distribution with all the usual features associated with a normal distribution. For
example, 95% of values in the distribution lie within two standard errors of the mean of the sampling
distribution – and thus the mean of the population. In this example, two standard errors equals $1400
(i.e. 2 × 700).
Thus, you now know that 95% of all sample means lie within $1400 of the population mean. Consequently,
the mean of the one known sample also has a 95% chance of lying within $1400 of the population mean.
With an ingenious piece of lateral thinking, you can reverse this statement and say that there's a 95% chance
that the population mean lies within $1400 of the mean of the one sample taken. After all, if New York is 1400
miles from Houston, then Houston must be 1400 miles from New York.
If the sample mean is $49,000, then you can be 95% confident that the population mean lies within two
standard errors of $49,000. Thus, the population mean lies in the range of the sample mean minus two
standard errors ($49,000 – $1400) to the sample mean plus two standard errors ($49,000 + $1400). In other
words, you will be correct 95 times in 100 when you say that the population mean (i.e. the mean of all 10,000
graduate salaries) lies within the range of $47,600 to $50,400.
The level of confidence with which you can make this sort of statement is termed (not surprisingly) the
confidence level. Here, I've used a 95% confidence level. The range of the values within which you are
confident the population value will lie is the confidence interval. Here, the confidence interval is $2800
($50,400 minus $47,600). The two ends of the confidence interval ($47,600 and $50,400) are the confidence
limits.
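Putting the pieces together, the 95% confidence limits are simply the sample mean minus and plus two standard errors (a sketch using the chapter's figures):

```python
def confidence_limits(sample_mean, standard_error, multiplier=2):
    """Return the lower and upper 95% confidence limits.

    The multiplier of 2 is the rounded large-sample value; the more
    precise figure is 1.96, as discussed later in the chapter.
    """
    margin = multiplier * standard_error
    return sample_mean - margin, sample_mean + margin

# Graduate salaries: sample mean $49,000, standard error $700
lower, upper = confidence_limits(49_000, 700)
print(lower, upper)  # 47600 50400
```

The confidence interval is the width of this range ($2800), and the two ends are the confidence limits.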
The above discussion shows how you can go from calculating the mean of a sample ($49,000) to estimating
with a given level of confidence (95%) the range of values within which the mean of the population is likely to
be found ($47,600 to $50,400). This is termed the interval estimate. It is based on the general shape of the
sampling distribution of the means, the standard error of which you estimate from the standard deviation of
the sample. Table 11.2 summarises this procedure, using the graduate salaries survey as an example.
Table 11.2 Estimating the population mean from a sample mean (e.g. graduate salaries)
It's now time to look briefly at an issue skipped over earlier – the importance of the size of the sample.
The graduate salary study is based on quite a large sample of 225 graduates. Consequently, the sampling
distribution of the means has a normal shape, is centred on the population mean, and has 95% of sample
means within two standard errors of the population mean.
The smaller the sample, the greater the possible difference between the sampling distribution and a normal
distribution, and thus the less accurate the population estimate based on a normal distribution. Recall the
very simple example with a population of only five values: 1, 3, 5, 7, 9. With a sample size of only 1, sample
results of 1, 3, 5, 7, and 9 are equally likely, each with a 20% chance – and thus the sampling distribution of the ‘means’ is flat rather than normal.
Similarly, with a sample size of 2, the sampling distribution of the means is not normal. However, as Figure
11.2 shows, it does have several characteristics in common with a normal distribution: it is centred on the
population mean, it is symmetrical, and it falls away continuously from the centre. But it does not match the
precise characteristics of a normal distribution with, for example, 95% of sample means falling within two
standard errors of the population mean.
Fortunately, statisticians have calculated the sampling distributions for samples of all sizes. The pioneering
work was done by William Gosset, who worked for Guinness Brewery. Gosset's work on small samples
stemmed from a very practical concern – how variations in malt and temperature influenced the length of
time Guinness remained drinkable. His ‘population’ was all the Guinness the brewery made, and each of
his samples consisted of a number of glasses (or, more likely, test tubes) of Guinness. Each glass needed
lengthy analysis to produce a single result, and time restrictions meant that Gosset was often limited to
only a few dozen samples (Tankard 1984: Ch. 5). He published his statistical work under the pseudonym
‘Student’ because Guinness didn't want to publicise his association with the company. Table 11.3 shows some
results from Gosset's work on sampling distributions for small sample sizes. They are known as Student's
t distributions, though the letter t has no real significance. (Perhaps Student's G might have been more
appropriate – for Gosset or Guinness!)
Table 11.3 Sampling distributions for small samples
Sample size Standard error values enclosing 95% of all sample means
10 2.26
20 2.09
30 2.05
Large 1.96
Today, Guinness is not so reticent about its association with William Gosset. A Guinness
company history notes that in 1936 William Gosset became the Head Brewer at the
London Brewery (Guinness nd). It refers to him as ‘the father of modern statistics’ – a title
that the very modest Gosset would have strongly resisted. ‘The old slogan “Guinness Is
Good For You” may actually be true, according to new medical research that suggests the
stout may help prevent heart attacks … Guinness reduced clotting activity but lager did
not’ (Malkin 2007).
Table 11.3 is important, and worth looking at carefully. The bottom row refers to large samples. These produce
a normal sampling distribution in which 95% of all sample means lie within 1.96 standard errors of the
population mean. (Earlier, I rounded off 1.96 to 2 standard errors, but here we need a more precise value.)
With a sample size of 30, the sampling distribution is slightly wider than a normal distribution – you have to go
out to 2.05 standard errors on either side of the centre to enclose 95% of all sample means. With a sample
size of 20, the sampling distribution is wider still, with 95% of all sample means lying within 2.09 standard
errors of the population mean. And with a very small sample of only 10, the sampling distribution is even
wider, with 95% of all sample means lying within 2.26 standard errors of the population mean.
To estimate the population mean from the mean of one small sample, you use the standard error value in
exactly the same way as with large samples. The only difference comes at the end when you work out the
95% confidence interval using the standard error values in Table 11.3. The following example reworks the
earlier graduate salary example, but this time using a random sample of only 10 recent graduates rather than
225. For simplicity, assume that the sample mean is still $49,000 and the sample standard deviation is still
$10,477.
Using the information from Table 11.3 about the sampling distribution of the means with a sample size of 10,
you know that 95% of all sample means lie within 2.26 standard errors of the population mean. Applying the
same two-step calculation as before, but with a sample size of 10, gives a standard error of $3,495. Thus, you can be 95% confident that the population mean
lies between the sample mean minus 2.26 standard errors to the sample mean plus 2.26 standard errors. In
other words, you are 95% confident that the population mean lies between $41,101 and $56,899 (i.e. 49,000
– (2.26 × 3495) and 49,000 + (2.26 × 3495)).
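The same procedure works for any sample size once you swap in the multiplier from Table 11.3 (a sketch; the standard error of $3,495 is the chapter's rounded figure for the sample of 10):

```python
# Multipliers from Table 11.3 (Student's t, 95% confidence)
T_MULTIPLIER = {10: 2.26, 20: 2.09, 30: 2.05}
LARGE_SAMPLE = 1.96

def confidence_limits(sample_mean, standard_error, n):
    """95% confidence limits using the sample-size-specific multiplier."""
    t = T_MULTIPLIER.get(n, LARGE_SAMPLE)
    margin = t * standard_error
    return round(sample_mean - margin), round(sample_mean + margin)

# Sample of 10 graduates: mean $49,000, standard error $3,495
print(confidence_limits(49_000, 3_495, 10))  # (41101, 56899)
```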
The smaller the sample, the wider the confidence interval. For example, with a sample size of 225, you can be
95% confident that the population mean of graduate salaries lies between $47,600 and $50,400, a confidence
interval of $2800. In contrast, with a sample size of only 10 the 95% confidence interval is much bigger –
$15,798 (i.e. $41,101 to $56,899). This seems reasonable. You can feel more confident about relying on the
results of a large sample rather than a small one because a few extreme values are more likely to distort a
small sample than a large one.
This chapter looked at how to estimate the mean of a population from the mean of a sample drawn through
probability sampling. The next chapter, Chapter 12, covers similar ground, except that it looks at estimating
percentages in categorical variables. Chapter 12 has a similar structure to this one. First, it shows how
variability is built into the sampling process. Second, it shows how to estimate the population percentage
value based on a single sample value. Finally, it highlights the importance of sample size in determining the
variability of sample results.
http://dx.doi.org/10.4135/9781446287873.n11