Applied Statistics Ebook
James Reilly
Applied Statistics James Reilly [Link]
APPLIED STATISTICS
© 2015 James Reilly
All rights reserved
This ebook may be freely downloaded from the publisher’s website for personal use.
Advance written permission is required for redistribution, for commercial or corporate
use, or for quotations exceeding 6 pages in length. Requests for such permission
should be sent by email to info@[Link]
All quotations should be accompanied by a suitable acknowledgement such as:
Reilly, James, (2015). Applied Statistics [ebook]. Retrieved from
[Link]
Software output and graphs were produced using Minitab® statistical software.
Table of Contents
1. Observing Data
1A. Graphs ................................................................................................ - 5 -
1B. Sampling ........................................................................................... - 15 -
1C. Summary Statistics ........................................................................... - 19 -
2. Probability
2A. Calculating Simple Probabilities ........................................................ - 23 -
2B. The General Rules of Probability ....................................................... - 29 -
2C. Subtleties and Fallacies .................................................................... - 32 -
2D. Reliability........................................................................................... - 35 -
3. Probability Distributions
3A. Random Variables ............................................................................. - 39 -
3B. The Normal Distribution..................................................................... - 40 -
3C. The Binomial Distribution .................................................................. - 44 -
3D. The Poisson Distribution ................................................................... - 48 -
3E. Other Distributions ............................................................................ - 50 -
4. Estimation
4A. What Happens When We Take Samples? ........................................ - 53 -
4B. Confidence Interval for a Mean or Proportion .................................... - 57 -
4C. Sample Size for Estimation ............................................................... - 60 -
4D. Estimating a Standard Deviation ....................................................... - 62 -
4E. Difference between Means or Proportions ........................................ - 64 -
5. Testing Theories
5A. Hypothesis Testing ............................................................................ - 68 -
5B. Testing a Mean or Proportion ............................................................ - 69 -
5C. Difference between Means or Proportions ........................................ - 73 -
5D. Sample Size and Power .................................................................... - 80 -
5E. Tests of Variances and Goodness-of-fit ............................................ - 82 -
7. Design of Experiments
7A. Single-Factor Experiments and ANOVA ......................................... - 109 -
7B. Two-Factor Experiments and Interaction......................................... - 116 -
7C. General Linear Model ..................................................................... - 125 -
7D. Multi-Factor Experiments ................................................................ - 127 -
7E. Response Surface Methodology ..................................................... - 134 -
Appendix 1
Answers to Problems ............................................................................. - 161 -
Appendix 2
Statistical Tables ................................................................................... - 171 -
1
Observing Data
Having completed this chapter you will be able to:
– interpret graphs;
– draw samples;
– use a calculator in statistical mode.
Statistics allows numbers to tell their story. The story can help us to find our way
through uncertainty so that we can make decisions and take action with confidence.
1A. Graphs
A graph provides a great way to look at a set of data in order to see the big picture.
We don’t want to get lost in the details but to get an overall impression of what the
numbers are saying.
Histograms
Fig 1.1: histogram of Height in cm (vertical axis: Frequency)
When reading a histogram, focus on the shape rather than the numbers. Also, ignore
small irregularities and pay attention to the more important features. Therefore, when
you look at the histogram above you should see the frequency curve below. There is
no need to draw the frequency curve – you can see it in your imagination.
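The grouping behind a histogram can be sketched in Python. The heights below are synthetic data, generated only for this illustration; the 10 cm bins mirror the axis of Fig 1.1.

```python
import random

# Synthetic sample of 100 heights in cm (illustrative data, not from the book)
random.seed(1)
heights = [random.gauss(172, 8) for _ in range(100)]

# Count the frequency in each 10 cm bin, as a histogram does
bins = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200)]
freq = {(lo, hi): sum(lo <= h < hi for h in heights) for (lo, hi) in bins}

for (lo, hi), count in freq.items():
    print(f"{lo}-{hi} cm: {'*' * count}")
```

Printing a row of stars per bin gives a rough text histogram; the peak in the middle bins is the bell shape described above.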
Fig 1.2
QUESTION What story is this histogram telling? ANSWER There is a typical height that
is common, and while some people are taller or shorter than this, very tall and very
short people are rare. We can tell all this by noting the peak in the middle, and the
similar tails on either side. This bell-shaped pattern is very common and is referred to
as normal. Also, notice that we told the story without quoting any numbers. We
allowed the numbers to talk. We will now look at some other patterns that can arise.
Fig 1.3: histogram of Time in minutes (vertical axis: Frequency)
Fig 1.4
The difference here is that the tails are not symmetric. We say that these data are
positively skewed, or skewed to the right. The waiting time for a taxi might be much
longer than expected, but could not be much shorter than expected, because the
waiting time cannot be less than zero.
Fig 1.5: histogram of Weight in grams (vertical axis: Frequency)
Fig 1.6
Can you tell by looking at the histogram that these potatoes come from a supermarket
and not from a farm? The potatoes have been sorted, and any potato weighing less
than 50 grams has been removed. We say that the frequency curve is truncated. A
frequency curve can be truncated on the left, or on the right, or on both sides.
Fig 1.7: histogram of Weight in kg (vertical axis: Frequency)
Fig 1.8
Although these animals are all called pets, we can see from the histogram that they
are not a homogeneous group. Maybe there is a mixture of cats and dogs. We say
that these data are bimodal, because instead of having one common value, the
mode, we have two.
Fig 1.9: histogram of Age (vertical axis: Frequency)
-9-
Applied Statistics James Reilly [Link]
Fig 1.10
One of these numbers does not seem to belong with the others. It is called an outlier.
Outliers can represent the most interesting feature of a data set. On the other hand,
they may simply be the result of someone making a mistake when typing the numbers.
What do you think the outlier represents in the histogram above?
Note that a graph needs to have a sensible title, and to have labels and units on both
axes.
Fig 1.11
In the first time series plot above, there is a good deal of random variation, but there
is no obvious pattern. We conclude that the variable remained more or less steady
throughout the period. This is the most common kind of time series plot, and it
corresponds to a normal frequency curve. If the graph were rotated anticlockwise through 90 degrees, and the points fell onto the variable axis, they would form a bell-shaped mound.
Fig 1.12
The striking feature of the second time series plot is the shift, i.e. a sudden and
sustained change in the figures. There must be some explanation. This explained
variation could be as a result of the hardware store opening a new department.
Fig 1.13
This time series plot shows a spike, i.e. there is a sudden change but then the figures
return to their usual level again. Perhaps there was a one-day sale in the fashion store.
This feature corresponds to an outlier in a histogram.
Fig 1.14
A new pharmacy has opened recently. The time series plot of daily revenue shows a
trend, i.e. a gradual movement in the figures. As the new pharmacy becomes known,
its customers increase. Because there is some random variation present, this trend
may not be obvious until we view the data for a large number of days.
Fig 1.15
The final plot shows an obvious cyclical pattern. Cinemas are busier at weekends.
The word seasonal is used when the cycle in question is of one year’s duration.
Problems 1A
#1. The histogram below shows the morning journey times in minutes for a certain
commuter. What does the histogram tell you about the journey times?
Fig 1.16: histogram of Journey Times (vertical axis: Frequency)
#2. The individual weights, in mg, of tablets in a consignment are shown in the
histogram below. What can you say about the pharmaceutical production and
inspection processes?
Fig 1.17: histogram of Tablet Weights (vertical axis: Frequency)
#3. Here is a time series plot of Irish monthly new vehicle registrations for the 36-month period from August 2011 to July 2014. What story do the figures tell?
Fig 1.18: time series plot of monthly registrations (horizontal axis: Index, 1–36)
1B. Sampling
We now come to a very important and surprising concept. The data that we collect
and analyse are not the data that we are interested in! Usually we are interested in a
much larger data set, one which is simply too big to collect. The population is the
name given to the complete data set of interest. The population must be clearly defined
at the beginning of an investigation, e.g. the shoe-sizes of all the people who live in
Galway. Note that a population is not a set of people but a set of numbers, and we will
never see most of these numbers.
Sometimes the measurements of interest arise in connection with some process, e.g.
the dissolution times in water of a certain new tablet. In this case there is no population
out there to begin with: we must make some of the tablets and dissolve them in water,
or else there is nothing to measure. Also, the process could be repeated any number
of times in the future, so the population is infinitely large.
Once a tablet has been dissolved, it cannot be sold or used again. This is an example
of a destructive measurement. This is another reason why we cannot measure
everything that interests us: we would have plenty of data, but no merchandise left to
sell.
We have seen that we virtually never take a census of every unit in the population.
Instead we collect a sample from the population and we assume that the other data
in the population are similar to the data in the sample. We see only a part, but we
imagine what the complete picture looks like. This kind of reasoning is called
inference, and statistics carried out using this approach is called inferential
statistics. If we simply describe the observed data, and make no attempt to make
statements about a larger population, this is called descriptive statistics.
Because we use a sample to provide information about a population, it is essential
that the sample is representative. The sample may not give perfect information, but
it should not be misleading. Bias occurs if the sample is selected or used in such a
way that it gives a one-sided view of the population.
Sampling Techniques
Random Sampling
A simple random sample is a sample selected in such a way that every unit in the
population has an equal chance of being selected.
If the population is finite, this can be done by assigning identification numbers, from 1
to N, to all the units in the population, and then using the random number generator
on your calculator to select units for the sample.
EXAMPLE Select two letters at random from the alphabet.
First, assign an ID to each letter. Let A=1, B=2 all the way up to Z=26. The identification
numbers can often be assigned by using the order of arrival, or the positions, of the
units. Next, generate two random numbers between 1 and 27, i.e. between 1 and N+1.
We use N+1 so that the last item has a chance to be selected. To do this, enter 27 on
your calculator and press the random number key. Suppose you get 25.258, then
simply ignore what comes after the decimal point and select letter number 25, the letter
Y. Now press the random number key again. Suppose you get 18.862 this time. Select
letter number 18, the letter R.
The same letter could be chosen twice. This is called sampling with replacement
and it does not cause bias. Repeats can be disallowed if sampling without replacement
is preferred.
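The letter-selection procedure can be sketched in Python, with the built-in random module standing in for the calculator's random number key. The seed is fixed only so the illustration is reproducible.

```python
import random
import string

random.seed(42)  # fixed seed only so the illustration is reproducible

alphabet = string.ascii_uppercase   # IDs 1..26 correspond to A..Z

# Sampling with replacement: each draw is independent, repeats are allowed
with_replacement = [alphabet[random.randint(1, 26) - 1] for _ in range(2)]

# Sampling without replacement: repeats are disallowed
without_replacement = random.sample(alphabet, 2)

print(with_replacement, without_replacement)
```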
If a population is infinite, or diffuse, it can be impossible to assign identification
numbers. Suppose we wish to interview a random sample of 100 shoppers at a mall.
It is not feasible to assign a number to each shopper, and we cannot compel any
shopper to be interviewed. We have no choice but to interview willing shoppers ‘here
and there’ in the mall, and hope that the sample is representative. However, when
such informal sampling techniques are used, there is always the possibility of bias in
the sample. For example, busy shoppers will be less inclined to complete
questionnaires, and may also be less inclined to answer yes to some survey questions
such as, ‘Have you visited a restaurant at the mall today?’.
Multi-Stage Sampling
Suppose you want to draw a sample of thirty players from the premiership. Rather
than assigning IDs to all the players in the premiership, it is much simpler to begin by
assigning IDs to all the teams in the premiership. In fact the current league standings
can be used to assign these IDs. Select a team at random using a random number,
say Liverpool. This is the first stage.
Now assign IDs to all the Liverpool players. The squad numbers can be used to assign
these IDs. Now select a Liverpool player at random. This is the second stage.
Repeat these two stages until the required sample size is achieved. The sample is
random, but using multi-stage sampling rather than simple random sampling makes
the task easier when the population has a hierarchic structure.
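The two stages can be sketched in a few lines of Python; the team names and squad numbers below are made up for illustration.

```python
import random

random.seed(7)

# Hypothetical hierarchy: team names and squad numbers are made up
squads = {
    "Liverpool": list(range(1, 26)),
    "Arsenal": list(range(1, 26)),
    "Everton": list(range(1, 26)),
}

sample = []
while len(sample) < 10:
    team = random.choice(list(squads))      # stage 1: select a team at random
    number = random.choice(squads[team])    # stage 2: select a squad number
    sample.append((team, number))

print(sample)
```

No ID list for every player in the population is ever needed; only the selected team's squad is numbered at each step.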
Stratified Sampling
Let us suppose that we plan to open a hairdressing business in Kilkenny. We might
decide to interview a random sample of adults in the city to investigate their frequency
of visits to the hairdresser. If our sample just happened to include a large majority of
women, then the results could be misleading. It makes better sense to firstly divide the
population of adults into two strata, men and women, and to draw a random sample
from each stratum.
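A sketch of the idea in Python, with hypothetical sampling frames for the two strata (the names are placeholders):

```python
import random

random.seed(3)

# Hypothetical sampling frames for the two strata (names are placeholders)
men = [f"M{i}" for i in range(1, 501)]
women = [f"W{i}" for i in range(1, 501)]

# Draw a random sample from each stratum separately, so that neither
# group can dominate the overall sample by chance
sample = random.sample(men, 25) + random.sample(women, 25)
print(len(sample))
```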
Cluster Sampling
It is tempting to select naturally occurring clusters of units, rather than the units
themselves. For example, we might wish to study the tread on the tyres of cars passing
along a certain road. If we randomly select 25 cars and check all four tyres on each
car, this is not equivalent to a random sample of 100 tyres. All the tyres on the same
car may have similar tread. The selected units are random, but not independently
random. If the units of interest are tyres, then select tyres, not cars.
Quota Sampling
Quota sampling is also called convenience sampling. With this approach, it does not
matter how the data are collected so long as the sample size is reached. Usually the
first units are selected because this is the easiest thing to do. But systematic variation
in the population may mean that the first units are not representative of the rest of the
population – people who are first to purchase a product, fruit at the top of a box, calls
made early in the morning, etc.
Systematic Sampling
With this approach, units are selected at regular intervals. For example, every tenth
box from a packing operation is inspected for packing defects. But suppose there are
two packers, Emie and Grace, who take turns packing the boxes. Then, if the first box
inspected (box 1) is Emie’s, the next box inspected (box 11) will be Emie’s, and so on.
Grace’s boxes are never inspected! All of Grace’s boxes might be incorrectly packed,
but the sample indicates that everything is fine. The problem here is that the sampling
interval can easily correspond to some cyclical variation in the population itself.
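The pitfall can be demonstrated in a few lines: with alternating packers and a sampling interval of ten, every inspected box belongs to the same packer.

```python
boxes = list(range(1, 101))

# Emie packs the odd-numbered boxes, Grace the even-numbered ones
packer = {box: ("Emie" if box % 2 == 1 else "Grace") for box in boxes}

# Systematic sampling: inspect every tenth box, starting with box 1
inspected = boxes[::10]   # boxes 1, 11, 21, ..., 91

print(set(packer[b] for b in inspected))   # only Emie's boxes are ever inspected
```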
In summary, only random samples can be relied upon to be free from bias. Random
samples avoid biases due to systematic or cyclical variation in the population, or due
to personal biases, whether conscious or unconscious. Data collection is crucial and
it must be done carefully and honestly. All the formulae in this book are based on two
assumptions: that random sampling has been used, and that the population is infinitely
large. If samples are drawn by any technique that is not strictly random, then keep
your eyes open for any potential sources of bias.
Types of Data
Information comes in two forms: numbers and text.
QUESTION How many minutes long was the movie? ANSWER 115 (a number).
QUESTION What did you eat at the movie? ANSWER Popcorn (text).
The second question can be rephrased to invite a simple ‘yes or no’ answer.
QUESTION Did you eat popcorn at the movie? ANSWER Yes (yes or no).
The two types of data, which we have described here as numbers and text, are called
measurements and attributes respectively.
A measurement is said to be continuous if any value within a given range could arise,
e.g. your height could be 1.5 m or 2 m or anything in between. A measurement is said
to be discrete if only certain values are possible, e.g. the number of phone calls that
you made yesterday might be 5 or 6 but cannot be anything in between. In the case
of measurement data we will often use the mean to express how big the
measurements are.
Attribute data are called binary if there are only two categories (yes and no) and are
called nominal if there are more than two categories (popcorn, chocolate, ice-cream).
It is sometimes convenient to convert binary data into numbers by means of an
indicator variable that designates ‘yes’ as 1 and ‘no’ as 0. Attribute data can also be
referred to as categorical data. In the case of attribute data we will often use the
proportion to express how common the attribute is.
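A minimal sketch of an indicator variable, using hypothetical yes/no survey answers; a handy consequence of the 1/0 coding is that the mean of the indicator variable equals the sample proportion of 'yes' answers.

```python
answers = ["yes", "no", "yes", "yes", "no"]   # hypothetical survey responses

# Indicator variable: designate 'yes' as 1 and 'no' as 0
indicator = [1 if a == "yes" else 0 for a in answers]

# The mean of the indicator variable is the proportion of 'yes' answers
proportion = sum(indicator) / len(indicator)
print(indicator, proportion)
```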
A measurement or an attribute can be referred to as a response, and each value of
the response that is sampled is called an observation.
Very large data-sets such as those that arise in connection with social media apps or
customer loyalty cards are referred to as big data. Big data are characterised by their
volume (there is a lot of data), variety (there are different types of data such as
measurements, attributes, dates and times, pictures and video) and velocity (the data
change quickly).
Structuring Data
When entering data into a table or a spreadsheet, enter the values of a variable in a
vertical column, like this.
Table 1.1
Height
162
179
176
170
173
176
If you have a number of variables, use a column for every variable and a row for every
case, like this.
Table 1.2
Height Red Cards Position Outfield
162 1 Defender 1
179 0 Goalkeeper 0
176 0 Attacker 1
170 0 Goalkeeper 0
173 2 Defender 1
176 0 Attacker 1
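Table 1.2 can be held in plain Python as a dictionary of columns, one key per variable, with the values for each case at the same position in every list. The column names are taken from the table above.

```python
# Table 1.2 as a dictionary of columns: one key per variable, with the
# values for each case held at the same position in every list
data = {
    "Height":    [162, 179, 176, 170, 173, 176],
    "Red Cards": [1, 0, 0, 0, 2, 0],
    "Position":  ["Defender", "Goalkeeper", "Attacker",
                  "Goalkeeper", "Defender", "Attacker"],
    "Outfield":  [1, 0, 1, 0, 1, 1],
}

# Every column must have one value per case
assert len({len(column) for column in data.values()}) == 1

# A single row (case) can be assembled by taking one position across columns
first_case = {variable: values[0] for variable, values in data.items()}
print(first_case)
```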
Problems 1B
#1. Identify one potential source of bias in each of the following sampling schemes:
(a) The first 50 passengers who boarded a train were asked if they were satisfied with
the scheduled departure time.
(b) An interviewer was instructed to call to the homes of any 20 dog-owners and find
out if their dog had ever bitten anyone.
(c) An agricultural laboratory requested a farmer to provide ten stalks of wheat for
analysis. The farmer pulled a handful of ten stalks from the edge of the field.
(d) In a study into the lengths of words used in newspapers, an investigator recorded
the number of letters in the first word on every page.
#2. Use a random number generator to select three names from the following list:
Josiah, Sandrina, Matthew, Vanessa, Sinead, Sarah, Daireann, David, Johnny.
X̄ = ΣX / n
There are many other summary measures of location, and we describe two of them
here: the mode and the median. The mode is the most commonly occurring value and
it is useful if the data are nearly all the same. To the question ‘How many legs has a
dog?’ the best answer is ‘four’, which is the mode. Some dogs have three legs, so the
mean number of legs is probably about 3.99, but this is not a useful answer.
The median is the middle number when the numbers are arranged in order. It is a
robust measure of location that is insensitive to outliers. The median of the sample of
tree heights above is 23, and it would still be 23 if the largest number, 28, were
changed to 48, or to any large unknown number.
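The three measures of location can be checked with Python's standard statistics module, using the sample of five tree heights (22, 28, 25, 22, 23) discussed in this chapter.

```python
import statistics

heights = [22, 28, 25, 22, 23]   # tree heights in metres

print(statistics.mean(heights))    # the mean is 24
print(statistics.mode(heights))    # the mode is 22
print(statistics.median(heights))  # the median is 23

# The median is robust: changing the largest value leaves it untouched
print(statistics.median([22, 48, 25, 22, 23]))   # still 23
```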
Although we can calculate the value of the sample mean, what we really want to know
is the value of the population mean (symbol μ, pronounced ‘mu’). The sample mean
is merely an estimator. In general, sample statistics estimate population parameters.
Measures of Dispersion
We have been considering the heights of trees in a forest and we have estimated that
the population mean is 24 metres, based on a random sample of five trees. Does this
mean that all the trees are exactly 24 metres tall? Of course not! Some are taller and
some are shorter. If we built a wall 24 metres tall, in front of the forest, some of the
trees would extend above the top of wall, and others would stop short of it. There is a
difference (or error, or deviation) between the height of each tree and the height of
the wall. The average difference (i.e. the typical error, the standard deviation) is a
useful summary measure of dispersion or spread or variability. The sample
standard deviation is the root-mean-square deviation. The sample standard deviation
estimate is calculated as follows.
Data: 22, 28, 25, 22, 23
Deviations: -2, +4, +1, -2, -1
Squared: 4, 16, 1, 4, 1
Mean: (4 + 16 + 1 + 4 + 1) / 4 = 6.5
Root: √6.5 = 2.5495
This is the sample standard deviation estimate, denoted by the letter S. It estimates
the population standard deviation, which is denoted by σ, pronounced sigma.
Formula 1.2
Sample Standard Deviation
S = √( Σ(X − X̄)² / (n − 1) )
Note that the standard deviation is the typical error, not the maximum error. Also, the
word ‘error’ does not refer to a mistake: it just means that the trees are not all the same
height.
Did you notice that when calculating the mean of the squared deviations, we divided
by 4 rather than 5? This quantity, n-1, is called the degrees of freedom, which is
defined as the number of independent comparisons available. Five trees provide four
comparisons: one tree would tell us nothing at all about variability. Dividing by n-1 in this way ensures that S² is an unbiased estimator of σ².
The standard deviation squared is called the variance. The symbol for the population
variance is σ2, and S2 denotes its sample estimate. The variance is useful in
calculations, but the standard deviation is easier to interpret and talk about.
The sample mean and standard deviation can be easily found using a calculator in
statistical mode. Follow these steps:
1. Choose statistical mode.
2. Type the first number and then press the data entry key.
3. Repeat the above step for every number.
4. Press the relevant key to see the mean.
5. Press the relevant key to see the standard deviation.
6. When you are finished, choose computation mode.
In some situations it is convenient to express the standard deviation as a percentage
of the mean. This is called the Coefficient of Variation (CV) or the Relative Standard
Deviation (RSD). The Coefficient of Variation for the trees data is 2.5495/24×100 =
10.62%.
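The same calculations can be reproduced in Python; statistics.stdev uses the n − 1 divisor described above.

```python
import math
import statistics

data = [22, 28, 25, 22, 23]
n = len(data)
mean = sum(data) / n                       # 24.0

deviations = [x - mean for x in data]      # -2, +4, +1, -2, -1
squared = [d ** 2 for d in deviations]     # 4, 16, 1, 4, 1

s = math.sqrt(sum(squared) / (n - 1))      # divide by n - 1 degrees of freedom
cv = s / mean * 100                        # coefficient of variation, in percent

print(round(s, 4), round(cv, 2))           # 2.5495 10.62
```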
The range is another measure of dispersion. The range is the difference between the
largest and smallest values in a data set. Unfortunately the sample range is a biased
estimator of the population range. It has other weaknesses too: it is sample size
dependent, and sensitive to outliers.
Measuring variation is important because in statistics we study the differences
between numbers in order to gain insight. We ask: Why are the numbers different? In
the case of the weights of pets, the numbers are different because some of the pets
are dogs and some are cats. This is explained variation because we have identified
a source of variation. But that is only part of the story. If we separate the dogs and
cats, and just observe the cats, there will still be differences among their weights. This
is called residual variation or unexplained variation or random variation. Even
though we can’t explain this variation, we can measure it (using the standard deviation)
and we can describe it (by identifying if it is normally distributed, or if it follows some
other pattern).
Some processes are expected to achieve a target result. For production and service
processes, the target is the nominal value; for measurement processes, the target is
the correct value. A process is said to be accurate (or unbiased) if the process mean
is on target. A process is said to be precise if the standard deviation is small, i.e. the
results are close together.
Problems 1C
#1. Estimate the mean and the standard deviation of each of these populations, based
on the sample data provided.
(a) Points awarded to a gymnast: 7, 8, 9.
(b) Shoe-sizes of tram passengers: 8, 11, 2, 9, 6, 12.
#2. First guess the mean and standard deviation of the number of letters per word in
this sentence, and then use your calculator to see how you did.
#3. Three vertical lines are drawn on a frequency curve, one each at the mean, the
median and the mode. Identify which of (a), (b) and (c) below corresponds to each
measure of location.
(a) This line divides the area under the frequency curve in half.
(b) This line touches the tallest point on the frequency curve.
(c) This line passes through the centre of gravity of the area under the frequency curve.
Project 1C
Sampling Project
Select a sample of 100 measurements from any population of your choice, and write
a report in the following format.
(a) Define the population of interest and the measurement of interest.
(b) Carefully describe how the sample was selected from the population.
(c) Construct a histogram.
(d) Comment on what the shape of the histogram tells you about the population.
2
Probability
Having completed this chapter you will be able to:
– define probability from two different standpoints;
– calculate the probabilities of simple and compound events;
– calculate the reliability of a system of components.
Nearly all processes involve uncertainty. This is obvious in the case of simple games
of chance played with cards and dice, and more complex games such as horse-racing
and team sports. But uncertainty is also present in business processes, manufacturing
processes and scientific measurement processes. We cannot be certain about the
future behaviour of a customer, or the exact dimensions of the next unit of product, or
even the result that would arise if we repeated a measurement a second time.
Definitions of Probability
DEFINITION #1 The probability of an event means HOW OFTEN the event occurs. It is
the proportion of occurrences when many trials are performed.
p = 1 implies that the event always occurs.
p = 0 implies that the event never occurs.
In general, 0 ≤ p ≤ 1.
DEFINITION #2 The probability of an event is a measure of HOW LIKELY the event is to
occur on a single future trial.
Now, p = 1 implies that the event is certain to occur.
And p = 0 implies that the event is certain not to occur.
Again, in general, 0 ≤ p ≤ 1.
The probability of an event has the same value no matter which definition we use.
Definition #1 is easier to think about. But definition #2 is useful when only one trial is
to be performed, e.g. a single toss of a coin. With definition #2, when we say that the
probability of ‘heads’ is one-half, we mean that heads is just as likely to occur as not
to occur. In the case of manufactured goods, definition #1 represents the perspective
of the producer, and definition #2 represents the perspective of the consumer, in
relation to the event ‘a unit is defective’.
Probability values that are close to 1 are interesting because they indicate that such
events are likely to occur. Therefore we have confidence that these events will occur.
This is the basis of statistical estimation. Probability values that are close to 0 are
interesting because they indicate that such events are unlikely to occur. We say that
such events are significant, which means that they may have occurred for a special
reason. This is the basis of statistical hypothesis testing.
Calculating Probability
Classical Probability
We have already claimed that the probability of heads, when a coin is tossed, is one-
half. We use a capital letter to denote an event, so we may write:
P(H) = 1/2
Why is the probability of heads calculated by dividing one by two? Because there are
two sides on the coin (two possible outcomes) and only one of these is heads. This
approach is valid only because the different outcomes (heads and tails) are equally
likely to occur. It would not be true to say that the probability that it will snow in Dublin
tomorrow is one-half, because the two outcomes (‘snow’ and ‘no snow’) are not equally
likely.
Formula 2.1
Classical Probability
If all the outcomes of a trial are equally likely, then the probability of an event can be
calculated by the formula
P(E) = (the number of ways the event can occur) / (the number of possible outcomes)
Note that this formula, like most formulae related to probability, has a condition
associated with it. The formula must not be used unless the condition is satisfied.
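A sketch of the classical formula as a small function, using exact fractions; as noted above, it is valid only when all outcomes are equally likely.

```python
from fractions import Fraction

def classical_probability(ways, outcomes):
    """Valid only when all possible outcomes are equally likely."""
    return Fraction(ways, outcomes)

print(classical_probability(1, 2))   # P(heads) = 1/2
print(classical_probability(1, 6))   # P(a six on one roll of a die) = 1/6
```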
Empirical Probability
We now consider how to calculate the probabilities of events which do not lend
themselves to the classical approach. For example, what is the probability that when
a thumb-tack is tossed in the air, it lands on its back, pointing upwards? Recall our first
definition of probability: the probability of an event means the proportion of
occurrences of the event, when many trials are performed. We can perform a large
number of trials, and count the number of times it lands on its back. If it lands on its
back r times, out of n trials, then p = r/n estimates the probability of the event. Of
course, the result will be only an estimate of the true probability: to get a perfect
answer, we would have to repeat the trial an infinite number of times.
Formula 2.2
Empirical Probability
When a large number of trials are performed, the probability of an event can be
estimated by the formula
P(E) = (the number of times the event occurred) / (the total number of trials performed)
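The empirical approach can be illustrated by simulation. Since the true probability for a thumb-tack is unknown, the sketch below tosses a simulated fair coin instead, where the true value of 0.5 is known, so we can see r/n settle near it.

```python
import random

random.seed(0)   # fixed seed so the illustration is reproducible

# Simulate n trials; r counts the occurrences of the event
n = 10_000
r = sum(random.random() < 0.5 for _ in range(n))

estimate = r / n
print(estimate)   # close to the true value 0.5, but not exactly equal to it
```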
Subjective Probability
There are some events whose probabilities cannot be calculated, or estimated, by
either of the two formulae presented above: for example, the probability that a
particular horse will win a race, or the probability that a particular company will be
bankrupt within a year. These events do not satisfy the conditions associated with
either classical probability or empirical probability. To estimate the probabilities of such
events, we simply make a guess. The more we know about horse-racing, or business,
the more reliable our guess will be. This approach is called subjective probability. In
such cases it is common to refer to the odds of an event rather than its probability.
Formula 2.3
Odds
Odds = p / (1 − p)
For example, if the probability that your team will win this weekend is 0.8, then the
odds are 4 to 1. A bookmaker would offer a price of 4 to 1 to someone who wants to
bet against your team.
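The odds formula as a one-line function; a probability of 0.8 gives odds of 4 to 1, as in the example above.

```python
def odds(p):
    """Convert a probability p (with 0 <= p < 1) into odds."""
    return p / (1 - p)

print(round(odds(0.8), 6))   # a probability of 0.8 gives odds of 4 (to 1)
```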
This rule is very useful, especially if you want to calculate the probability of a complex
event, A, which has a simple complement, not A.
The Simple Multiplication Rule (AND)
This rule can only be applied to events that are independent. Independence means
that the probability of occurrence of one event stays the same, regardless of whether
or not the other event occurs.
Formula 2.5
The Simple Multiplication Rule
For independent events A, B
𝑃(𝐴 𝑎𝑛𝑑 𝐵) = 𝑃(𝐴) × 𝑃(𝐵)
Solving Problems
When you are presented with a problem in probability, the first step is to identify the
separate events involved. Next rephrase the problem, using only the words ‘and’, ‘or’
and ‘not’ to connect the events. Next, consider how multiplication, addition, etc., can
be used to calculate the probability of this combination of events, paying particular
attention to the relevant assumptions in each case. Also, look out for short-cuts, e.g.
if the event is complex, it may be simpler to calculate the probability that it does not
occur, and subtract the result from one.
EXAMPLE
A husband and wife take out a life insurance policy for a twenty-year term. It is
estimated that the probability that each of them will be alive in twenty years is 0.8 for
the husband and 0.9 for the wife. Calculate the probability that, in twenty years, just
one of them will be alive.
SOLUTION
Identify the separate events:
H: The husband will be alive in twenty years. P(H) = 0.8
W: The wife will be alive in twenty years. P(W) = 0.9
Rephrase the problem using the words ‘AND’ ‘OR’ and ‘NOT’ to connect the events.
‘just one of them’ means
H and not W or not H and W
‘or’ becomes ‘+’ because the mutually exclusive property holds; ‘and’ becomes
‘multiply’ assuming independence.
What we require is
P(H) × P(not W) + P(not H) × P(W)
= 0.8 × 0.1 + 0.2 × 0.9
= 0.08 + 0.18
= 0.26
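As a check, the arithmetic can be written out in a few lines of Python (an illustrative sketch added here, not part of the text's Minitab output):

```python
p_h = 0.8   # P(H): husband alive in twenty years
p_w = 0.9   # P(W): wife alive in twenty years

# 'Just one of them' = (H and not W) or (not H and W).
# The two branches are mutually exclusive, so their probabilities add;
# independence lets each 'and' become a multiplication.
p_just_one = p_h * (1 - p_w) + (1 - p_h) * p_w
```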
Problems 2A
#1. Explain each of the statements below twice, firstly using definition #1 and secondly
using definition #2 of probability.
(a) The probability of a ‘six’ on a roll of a die is one-sixth.
(b) The probability that an invoice will be paid within 30 days is 90%.
(c) The probability that a pig flies is 0.
#2. Calculate, or estimate, the probabilities of each of the following events:
(a) A number greater than four occurs when a die is rolled.
(b) When a thumb-tack is tossed in the air, it lands on its back, pointing upwards.
(c) The current Taoiseach will be Taoiseach on the first of January next year.
#3. A single letter is drawn at random from the alphabet. The following events are
defined:
W: the letter W is obtained.
V: a vowel is obtained.
Calculate the probability of each of the following events:
(a) W
(b) V
(c) W or V.
#4. A coin is tossed and a die is rolled. The following events are defined:
H: the coin shows ‘heads’.
S: the die shows a ‘six’.
Compute the probabilities of the following events:
(a) H
(b) S
(c) H and S
(d) H or S.
#5. A die is rolled twice. Compute the probability of obtaining:
(a) two sixes
(b) no sixes
(c) a six on the first roll but not on the second roll
(d) at least one six
(e) exactly one six.
#6. Three coins are tossed simultaneously. Calculate the probability of obtaining:
(a) three heads
(b) at least one head.
#7. A firm submits tenders for two different contracts. The tenders will be assessed
independently. The probability that the first tender will be successful is 70%, and the
probability that the second tender will be successful is 40%.
Calculate the probability that:
(a) both will be successful
(b) neither will be successful
(c) only the first will be successful
(d) only the second will be successful
(e) at least one will be successful.
Project 2A
Estimation of Odds
Identify some upcoming sporting event. Make an estimate of the probability of each
possible outcome. Express your estimates as odds. Compare your estimates with the
odds offered by a bookmaker for the same event.
EXAMPLE A letter is drawn at random from among the 26 letters of the alphabet.
L denotes the event ‘the letter drawn is from the latter half of the alphabet (N to Z)’.
V denotes the event ‘the letter drawn is a vowel’.
P(L) = 13/26
P(V) = 5/26
However, these events are not independent. If L occurs, then the probability of V is
not 5/26, but rather 2/13, because there are only two vowels in the latter half of the
alphabet. We call this the conditional probability of V given L, and write it as P(V|L).
P(V|L) = 2/13
P(L and V) = P(L)×P(V|L)
P(L and V) = 13/26 × 2/13 = 1/13
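Because the sample space is so small, the same result can be verified by direct enumeration. The Python sketch below (an added illustration) counts letters rather than applying the multiplication rule:

```python
from fractions import Fraction
import string

letters = string.ascii_uppercase        # the 26 letters A to Z
latter_half = set(letters[13:])         # the event L: letters N to Z
vowels = set("AEIOU")                   # the event V

p_l = Fraction(len(latter_half), len(letters))                       # 13/26
p_v_given_l = Fraction(len(latter_half & vowels), len(latter_half))  # 2/13
p_l_and_v = p_l * p_v_given_l
```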
De Morgan’s Rule
De Morgan’s rule provides a simple way to replace an expression involving ‘or’ with
an equivalent expression involving ‘and’, and vice versa. This can be convenient, for
example, if the word ‘or’ arises with independent events.
Formula 2.9
De Morgan’s Rule
To replace an expression with an alternative, equivalent, expression, follow these
three steps:
1. Write ‘not’ before the entire expression.
2. Write ‘not’ before each event in the expression.
3. Replace every ‘and’ with ‘or’, and every ‘or’ with ‘and’.
Example: A or B ≡ not (not A and not B)
EXAMPLE A game involves tossing a coin and rolling a die, and the player ‘wins’ if
either ‘heads’ or a ‘six’ is obtained. Calculate the probability of a win.
H denotes ‘heads’, P(H) = 1/2
S denotes a ‘Six’, P(S) = 1/6
P(win) = P(H or S)
H and S are not mutually exclusive events.
But H and S are independent events, so let us write:
P(win) = P (not (not H and not S))
P(H) = 1/2, P(not H) = 1/2
P(S) = 1/6, P(not S) = 5/6
P(win) = 1 - (1/2 × 5/6)
P(win) = 1 – 5/12
P(win) = 7/12
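The answer can also be confirmed by enumerating the twelve equally likely outcomes, a useful sanity check on the De Morgan manipulation. A Python sketch (added for illustration):

```python
from fractions import Fraction
from itertools import product

# Sample space: every (coin, die) pair is equally likely.
outcomes = list(product(["H", "T"], range(1, 7)))

# The player wins on 'heads' OR a 'six'.
wins = [(coin, die) for coin, die in outcomes if coin == "H" or die == 6]

p_win = Fraction(len(wins), len(outcomes))
```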
EXAMPLE In how many different ways can the letters A, B and C be arranged?
ANSWER 3! = 3.2.1 = 6
In this situation, three different letters can be chosen to be put first. After the first
letter has been chosen, there is a choice of two letters for second position. That
leaves one letter which must go in third place. The arrangements are ABC, ACB,
BAC, BCA, CAB, and CBA.
Formula 2.11
Permutations
The number of ways of arranging r things, taken from among n things, is ‘n
permutation r’, written nPr
nPr = n! / (n - r)!
EXAMPLE The ‘result’ of a horse race consists of the names, in order, of the first three
horses past the finishing post. How many different results are possible in a seven-
horse race?
ANSWER 7P3 = 7! / 4! = 7.6.5.4.3.2.1 / (4.3.2.1) = 7.6.5 = 210
In this situation, seven different horses can finish in first place. After the first place has
been awarded, there are six possibilities for second position. After the first and second
places have been awarded, there are five possibilities for third place.
Formula 2.12
Combinations
The number of sets of r things that can be taken from among n things is ‘n
combination r’, written nCr. The order in which the things are taken does not matter.
nCr = n! / (r! (n - r)!)
EXAMPLE In how many different ways can 6 numbers be chosen from among 45
numbers in a lottery game?
ANSWER 45C6 = 45! / (6! × 39!) = 45.44.43.42.41.40 / (6.5.4.3.2.1) = 8,145,060
If the six numbers had to be arranged in the correct order then the answer would be
45P6. But because the order does not matter, we divide this by 6! which is the number
of arrangements possible for the 6 numbers.
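Python's standard library can reproduce these counts directly (math.perm and math.comb, both available from Python 3.8; shown here as an added illustration):

```python
import math

race_results = math.perm(7, 3)     # ordered: first three past the post
lottery_picks = math.comb(45, 6)   # unordered: sets of six numbers

# A combination is a permutation with the orderings divided out.
same = math.comb(45, 6) == math.perm(45, 6) // math.factorial(6)
```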
Problems 2B
#1. In how many different orders can 8 different numbers be arranged?
#2. Leah, Josiah and John are three members of a class of 20 students. A raffle is
held among the members of the class. Each member has one ticket, and the three
winning tickets are drawn from a hat.
(a) What is the probability that Leah wins first prize, Josiah wins second prize, and
John wins third prize?
(b) What is the probability that Leah, Josiah and John are the three prize-winners?
#3. It has already been calculated that there are 8,145,060 ways in which 6 numbers
can be chosen from among 45 numbers in a lottery game. In how many different ways
can 39 numbers be chosen from among 45 numbers? Explain.
#4. In a lottery game a player selects 6 numbers, from a pool of 45 numbers. Later on
six winning numbers are drawn randomly from the pool. Calculate the probability that,
on a single play, a player achieves:
(a) the jackpot, i.e. the player’s selection matches all six winning numbers
(b) a ‘match 5’, i.e. the player’s selection matches any 5 winning numbers
(c) a ‘match 4’, i.e. the player’s selection matches any 4 winning numbers
#5. Explain in words why nPn = n!
#6. Explain in words why nCn = 1.
Bayes’ Theorem
Let A denote the event ‘the person has the disease’.
Let B denote the event ‘the person is classified as having the disease’.
We are given that
P(A) = 0.01
P(B|A) = 0.95
P(B| not A) = 0.02
We now proceed to calculate P(A|B) using the formula for Bayes’ Theorem.
Formula 2.13
Bayes’ Theorem
P(A|B) = P(A and B) / P(B)
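The calculation itself is straightforward once P(B) is found from the rule of total probability. A Python sketch using the values above (an added illustration; the final figure is this sketch's own arithmetic):

```python
p_a = 0.01               # P(A): one person in a hundred has the disease
p_b_given_a = 0.95       # P(B|A): a genuine case is classified correctly
p_b_given_not_a = 0.02   # P(B|not A): a healthy person is misclassified

# Total probability: P(B) = P(A)P(B|A) + P(not A)P(B|not A)
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

# Bayes' theorem: P(A|B) = P(A and B) / P(B)
p_a_given_b = (p_a * p_b_given_a) / p_b
```

Notice how small P(A|B) turns out to be: even after a positive classification, the person is more likely to be healthy than diseased, because the disease is rare.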
An Illustration
The table below shows how data can be cross-classified according to two sets of
criteria. This is a 2×2 table because there are only two rows and two columns of data
in the table proper. The totals are displayed in the margins and do not provide any
additional information. Various probabilities can be calculated using the data in the
table and these are outlined below.
Table 2.1
         iPhone   Android   Totals
Male         20        30       50
Female       26        24       50
Totals       46        54      100
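For instance, marginal, joint and conditional probabilities can all be read from the table. A Python sketch (added for illustration; the events named here come from the table's rows and columns):

```python
from fractions import Fraction

counts = {("Male", "iPhone"): 20, ("Male", "Android"): 30,
          ("Female", "iPhone"): 26, ("Female", "Android"): 24}
total = sum(counts.values())    # 100 people in all

# Marginal probability: P(Male), from the row total
p_male = Fraction(sum(v for (sex, _), v in counts.items() if sex == "Male"),
                  total)

# Joint probability: P(Male and iPhone), from a single cell
p_male_and_iphone = Fraction(counts[("Male", "iPhone")], total)

# Conditional probability: P(iPhone | Male) = joint / marginal
p_iphone_given_male = p_male_and_iphone / p_male
```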
Problems 2C
#1. 5% of cattle in a national herd have a particular disease. A test is used to identify
those animals who have the disease. The test is successful in 70% of genuine cases,
but also gives a ‘false positive’ result for healthy cattle, with probability 0.002. What
proportion of the animals, identified by the test, actually have the disease?
2D. Reliability
Reliability is defined as the probability that a product will function as required for a
specified period. Where a product consists of a system of components whose
individual reliabilities are known, the reliability of the system can be calculated.
Components in Series
When components are in series it means that all the components must function in
order for the system to function. Most systems are like this.
EXAMPLE A torch consists of a battery, a switch and a bulb, assembled in series. The
reliabilities of the battery, switch and bulb are 0.9, 0.8 and 0.7 respectively. Calculate
the system reliability.
Fig 2.1
The independence condition requires that a good battery has the same chance of
being combined with a good switch, as does a bad battery, etc. This condition tends
to be satisfied by random assembly.
Formula 2.14
Reliability of a System of Components in Series
The reliability of a system of components in series, assuming independence, is the
product of the reliabilities of the individual components.
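Applying the formula to the torch example gives 0.9 × 0.8 × 0.7 = 0.504. In Python (an added sketch):

```python
from math import prod

# Torch: battery, switch and bulb in series, assumed independent.
reliabilities = [0.9, 0.8, 0.7]
system = prod(reliabilities)    # all three must function
```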
Components in Parallel
Parallel systems include components in back-up mode. If one component fails, there
is another component to take its place. The back-up components can be either in
standby mode, such as a standby generator to provide power during a mains failure,
or in active mode, such as stop-lamps on a vehicle, where all of the components are
active even before one fails.
EXAMPLE The reliability of the mains power supply is 0.99 and the reliability of a
standby generator is 0.95.
Calculate the system reliability.
Fig 2.2
Formula 2.15
Reliability of a System of Components in Parallel
The unreliability of a system of components in parallel, assuming independence, is the
product of the unreliabilities of the individual components.
Unreliability = 1 – reliability
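For the mains/generator example: the system fails only if both supplies fail, so its unreliability is 0.01 × 0.05 = 0.0005 and its reliability is 0.9995. As a Python sketch (added for illustration):

```python
mains, generator = 0.99, 0.95

# Parallel: the system fails only when every component fails.
unreliability = (1 - mains) * (1 - generator)
system = 1 - unreliability
```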
Complex Systems
Complex systems contain subsystems. The reliability of each subsystem should be
calculated first. The subsystem can then be regarded as a single component.
EXAMPLE A projector has two identical bulbs and one fuse. The reliability of a bulb is
0.9 and the reliability of a fuse is 0.95. All other components are assumed to be 100%
reliable. Only one bulb is required for successful operation. Calculate the reliability of
the projector.
Fig 2.3
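Working subsystem-first for the projector: the bulb pair in parallel has reliability 1 - 0.1² = 0.99, and that pair in series with the fuse gives 0.99 × 0.95 = 0.9405. A Python sketch (added for illustration):

```python
bulb, fuse = 0.9, 0.95

# Subsystem first: two bulbs in parallel; only one is needed.
bulb_pair = 1 - (1 - bulb) ** 2

# The bulb pair now acts as a single component in series with the fuse.
projector = bulb_pair * fuse
```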
Problems 2D
#1. An electric convector heater consists of an element, a switch, and a fan assembled
together. These three components are independent and will operate successfully for
a year with probabilities 0.92, 0.98 and 0.97, respectively.
(a) Calculate the probability that the heater will operate successfully for a year.
(b) If two such heaters are purchased, calculate the probability that both heaters will
operate successfully for a year.
(c) If two such heaters are purchased, calculate the probability that neither of them will
operate successfully for a year.
(d) If two such heaters are purchased, calculate the probability that at least one of
them will operate successfully for a year.
(e) If the heater is redesigned to include a second element (similar to the first element
and as a back-up for it), what is the probability that the redesigned heater will operate
successfully for a year?
#2. A vacuum cleaner (Model M1) consists of a power supply, a switch and a motor
assembled in series. The reliabilities of these components are 0.95, 0.90 and 0.80,
respectively.
(a) Calculate the reliability of a Model M1 vacuum cleaner. Also, if a customer buys
two Model M1 vacuum cleaners, in order to have a second vacuum cleaner available
as a back-up, calculate the reliability of the system.
(b) If a customer buys two Model M1 vacuum cleaners, in order to have a second
vacuum cleaner available to provide spare parts as required, calculate the reliability of
the system.
(c) A model M2 vacuum cleaner is made from the same components as Model M1,
but it has a second motor included as a back-up. Therefore a model M2 vacuum
cleaner consists of a power supply, a switch, a motor and a back-up motor, each with
reliabilities as described above. Calculate the reliability of a Model M2 vacuum cleaner.
(d) If a customer buys two Model M2 vacuum cleaners, in order to have a second
vacuum cleaner available to provide spare parts as required, calculate the reliability of
the system.
3
Probability Distributions
Having completed this chapter you will be able to:
– recognise some common statistical distributions;
– calculate probabilities associated with these distributions.
Fig 3.1
[Bar chart: Percent against Number on a Rolled Die (1 to 6); all six bars are equally tall.]
Selecting a random digit gives rise to a similar situation. The same rule applies about
the outcomes being equally likely. The probability distributions even look alike, having
bars that are equally tall.
Fig 3.2
[Bar chart: Percent against Random Digit (0 to 9); all ten bars are equally tall.]
These random variables are alike in that they are from the same family, called the
uniform distribution. They are different members of the family as can be seen from
their different parameter values, k = 6 and k = 10, where a parameter means a number
that is constant within a given situation.
In general, if we can name the distribution that a random variable belongs to, then we
have a model that describes its behaviour. This can be useful in two ways. Firstly, if
the data fit the model well, we can estimate the parameter or parameters and the
probability of occurrence of any value of the variable. Secondly, if the data are not a
good fit to the model, this suggests that the independence and randomness
assumptions are not true, thus providing unexpected insight into the process.
Problems 3A
#1. An experiment consists of tossing a coin. Instead of expressing the outcome as
heads or tails, the outcome is expressed as the number of heads that appear on that
toss.
(a) What is the sample space for this experiment?
(b) What is the name of the distribution that describes the behaviour of this variable?
(c) What is the value of the parameter?
The normal distribution is a continuous distribution, not a discrete distribution. The total
area under the curve is 1. The height of the curve is called the probability density.
The probability of occurrence of a value in any particular range is the area under the
curve in that range.
Fig 3.3
The z-score means, ‘How many standard deviations above the mean is this value?’ It
follows that z = 0 represents the mean, while positive z-scores represent values above
the mean, and negative z-scores represent values below the mean. The normal
distribution table gives the cumulative probability corresponding to any positive z-
score.
EXAMPLE 1
The heights of men are normally distributed with mean, µ = 175 cm, and standard
deviation, σ = 10 cm. What is the proportion of men that are taller than 182.5 cm?
SOLUTION 1
z = (X - µ) / σ
z = (182.5 - 175) / 10
z = 0.75
Fig 3.4
At this point, we sketch a bell-curve with zero in the middle. We mark our z-score on
the left or right as appropriate (on the right this time, because we have +0.75), and
shade the area to the left or right of z as appropriate (to the right this time, because
we have ‘taller than’). We now ask ourselves whether the shaded area is a minority or
a majority of the total area under the curve (a minority this time, i.e. we expect that the
answer < 0.5).
The normal table gives 0.7734, but that is the area to the left of z: a majority.
We require the minority: 1 - 0.7734 = 0.2266.
22.66% of men are taller than 182.5 cm.
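Python's standard library can reproduce the table lookup (statistics.NormalDist, available from Python 3.8; an added illustration):

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=10)   # heights of men, in cm

# Area to the right of 182.5 cm = 1 - cumulative area to the left.
p_taller = 1 - heights.cdf(182.5)
```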
EXAMPLE 2
The heights of men are normally distributed with mean, µ = 175 cm, and standard
deviation, σ = 10 cm. What is the proportion of men that are between 165 cm and 170
cm in height?
SOLUTION 2
These two X values will correspond to two z-scores.
X1 = 165
X2 = 170
z = (X - µ) / σ
z1 = -1
z2 = -0.5
Fig 3.5
The following four steps can be used to find the probability corresponding to an
interval.
Left tail area = 1 - 0.8413 = 0.1587
Right tail area = 0.6915
Total tail area = 0.8502
Interval area = 1 - 0.8502 = 0.1498.
14.98% of men are between 165 cm and 170 cm in height.
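The same interval can be computed directly as a difference of two cumulative probabilities. A Python sketch (added for illustration; the exact value is 0.1499, the 0.1498 above reflecting four-figure table rounding):

```python
from statistics import NormalDist

heights = NormalDist(mu=175, sigma=10)

# P(165 < X < 170) = area left of 170 minus area left of 165.
p_between = heights.cdf(170) - heights.cdf(165)
```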
Problems 3B
#1. The heights of corn stalks are normally distributed with mean µ = 16 cm, and
standard deviation σ = 2 cm. Calculate the proportion of these stalks that are:
(a) shorter than 14 cm
(b) between 13 cm and 17 cm
(c) shorter than 17.5 cm
(d) taller than 18.25 cm
(e) taller than 13.8 cm
(f) between 11 cm and 13.5 cm
(g) between 17.6 cm and 18.4 cm
(h) taller than 16 cm
(i) taller than 30 cm
(j) between 12.08 cm and 19.92 cm.
#2. A manufacturing process makes penicillin tablets with a nominal weight of 250 mg.
The process mean is 250.2 mg, and the standard deviation is 0.8 mg. The specification
limits are: upper specification limit = 252 mg, lower specification limit = 248 mg. What
is the proportion of tablets that conform to the specifications?
#3. The heights of women are normally distributed with mean 167.5 cm and standard
deviation 7.5 cm. A range of T-shirts are made to fit women of different heights as
follows:
Small T-shirt: 155 to 165 cm
Medium T-shirt: 165 to 175 cm
Large T-shirt: 175 to 185 cm.
What percentage of the population is in each category, and what percentage is not
catered for?
EXAMPLE 10% of eggs are brown. If 6 eggs are packed, at random, into a carton, what
is the probability that exactly two are brown?
SOLUTION
n=6
p = 0.1
r=2
P(r) = nCr × p^r × (1 - p)^(n-r)
P(2) = 6C2 × (0.1)^2 × (1 - 0.1)^(6-2)
P(2) = 6C2 × (0.1)^2 × (0.9)^4
P(2) = 15 × 0.01 × 0.6561
P(2) = 0.098415
The probability that a carton contains exactly two brown eggs is 0.098415, or, to put it
another way, 9.84% of all such cartons contain exactly two brown eggs.
The complete probability distribution is shown below.
Fig 3.6
[Bar chart: the binomial distribution with n = 6 and p = 0.1; Percent against Number of Occurrences (r), 0 to 6.]
SOLUTION
n = 12
p = 0.55
‘at least three’ means that r = 3 or 4 or... or 12
P(‘at least three’) = P(3) + P(4) + ... + P(12)
(the probabilities can always be added because the exact outcomes are mutually
exclusive)
P(‘at least three’) = 1 – [P(0) + P(1) + P(2)]
(For convenience, if the list of values consists of more than half the total number, use
the r values not on the list instead, and subtract their sum from 1.)
P(0) = 12C0 × 0.55^0 × (0.45)^12 = 0.000069
P(1) = 12C1 × 0.55^1 × (0.45)^11 = 0.001011
P(2) = 12C2 × 0.55^2 × (0.45)^10 = 0.006798
P(0) + P(1) + P(2)= 0.007878
1 - 0.007878 = 0.992122
99.21% of juries will have at least three female members.
The complete probability distribution is shown below.
Fig 3.7
[Bar chart: the binomial distribution with n = 12 and p = 0.55; Percent against Number of Occurrences (r), 0 to 12.]
Formula 3.3
Normal Approximation to the Binomial Distribution
µ = np = 12 × 0.55 = 6.6
σ = √(np(1-p)) = √(12 × 0.55 × 0.45) = 1.723
‘at least three’ means that X ≥ 2.5
(Note the correction for continuity. Because the normal distribution is continuous and
the binomial distribution is discrete, all the values between 2.5 and 3.5 are allocated
to r = 3.)
z = (2.5 – 6.6) / 1.723 = -2.38
The normal table gives 0.9913.
99.13% of juries will have at least three female members.
This answer approximates closely to the exact answer obtained using the binomial
distribution.
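The comparison can be made explicit in code. The Python sketch below (an added illustration) computes both the exact binomial answer and the normal approximation with the continuity correction:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 12, 0.55

# Exact: P(at least three) = 1 - [P(0) + P(1) + P(2)]
exact = 1 - sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(3))

# Approximation: X >= 2.5 under a normal with mu = np, sigma = sqrt(np(1-p))
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = 1 - NormalDist(mu, sigma).cdf(2.5)
```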
Problems 3C
#1. Which of the following variables are binomially distributed?
Q: the number of left-handed people sitting at a table of four people
R: the number of red cars in a random sample of 20 cars
S: the time taken by a show-jumping contestant to complete the course
T: the outdoor temperature on a summer's day
U: the number of months in a year in which there are more male than female births in
a maternity unit
V: the number of weeds in a flower-bed
W: the weight of a banana
Y: the duration of a telephone call
Z: the number of players on a football team who receive an injury during a game.
#2. Construct the complete probability distribution for the number of heads obtained
when a coin is tossed four times, i.e. calculate P(0), P(1), P(2), P(3), and P(4). Verify
that the sum of these probabilities is one, and explain in words why this is so.
#3. Seeds are sold in packs of 10. If 98% of seeds germinate, calculate the probability
that when a pack of seeds is sown:
(a) more than eight will germinate
(b) eight or fewer will germinate.
#4. A certain machine produces gaskets, 5% of which are defective. Every hour a
sample of 6 gaskets are selected at random and inspected. What proportion of these
samples contain:
(a) exactly two defectives?
P(r) = e^(-λ) × λ^r / r!
λ = the mean number of occurrences per interval
P(r) is the probability of exactly r occurrences.
EXAMPLE A company receives three complaints per day on average. What is the
probability of receiving more than two complaints on a particular day?
SOLUTION
λ=3
‘more than two‘ means that r = 3 or 4 or 5 or …
P(‘more than two‘) = P(3) + P(4) + P(5) + …
P(‘more than two’) = 1 – { P(0) + P(1) + P(2)}
P(0) = e^(-3) × 3^0 / 0! = 0.0498
P(1) = e^(-3) × 3^1 / 1! = 0.1494
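The calculation continues with P(2) and a final subtraction. A Python sketch (added for illustration) carries it through:

```python
from math import exp, factorial

lam = 3   # mean number of complaints per day

def poisson(r, lam):
    """P(exactly r occurrences) for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**r / factorial(r)

# P(more than two) = 1 - [P(0) + P(1) + P(2)]
p_more_than_two = 1 - sum(poisson(r, lam) for r in range(3))
```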
[Bar chart: the Poisson distribution with λ = 3; Percent against Number of Occurrences (r), 0 to 10.]
Problems 3D
#1. Which of the following variables are Poisson distributed?
N: the number of goals scored in a football match
O: the number of students per year that are kicked by horses in a riding school
P: the number of days in a week on which the Irish Independent and the Irish Times
carry the same lead story
Q: the weight of a cocker spaniel puppy
R: the number of crimes committed in Dublin in a weekend
S: the number of telephone calls received per hour at an enquiry desk
T: the number of weeds in a flower-bed
U: the length of an adult's step
V: the number of bulls-eyes a player achieves with three darts.
#2. The number of customers who enter a shop, per hour, is Poisson distributed with
mean 8. Calculate the percentage of hours in which:
(a) fewer than two customers enter
(b) more than two customers enter.
#3. The number of blemishes that occur on a roll of carpet during manufacture is
Poisson distributed, with mean 0.4 blemishes per roll. What percentage of the rolls are
classified as:
This curve is sigmoidal in shape, but if the two ends of the vertical axis were stretched,
it could be made linear. This is what a probability plot does. It plots the cumulative
probabilities using a suitably transformed vertical axis, so that if the data are normal,
the graph will be linear.
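The idea behind the transformed axis can be sketched without any plotting: sort the data and pair each value with the normal quantile at its plotting position; for normal data the pairing is almost perfectly linear. A Python sketch (an added illustration using simulated heights; the plotting positions (i - 0.5)/n are one common convention):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2)
data = sorted(random.gauss(175, 10) for _ in range(200))  # simulated heights
n = len(data)

# Theoretical quantiles at plotting positions (i - 0.5)/n, i = 1..n
q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Correlation between data and quantiles; near 1 means a near-linear plot.
mq, md = mean(q), mean(data)
r = (sum((x - mq) * (y - md) for x, y in zip(q, data))
     / ((n - 1) * stdev(q) * stdev(data)))
```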
Fig 3.10
The plot above indicates that the heights of Irishmen are quite normal. There is some
scatter, which we expect anyway, but the points are approximately linear.
The graph below is a normal probability plot of the times spent waiting for a taxi. The
graph shows clearly that the data are not normal. This would also be obvious from a
histogram.
Fig 3.11
A lognormal distribution is a good fit to the taxi data, as the following graph shows.
Fig 3.12
Problems 3E
#1. The lifetimes of vehicle bearings, in kilometres, are explored below using a
number of alternative distributions in order to identify a model that would fit the data
reasonably well. Can you say which of these distributions seems to fit best and why?
Fig 3.13
4
Estimation
Having completed this chapter you will be able to:
– construct confidence intervals for means, proportions and standard deviations;
– determine the sample size required to make an estimate;
– compare two populations and estimate the magnitude of the difference between
their means or proportions.
σ²total = σ1² + σ2² + σ3² + ...
E(X̄) = µ
This formula is not surprising at all. It confirms that a sample mean is an unbiased
estimator of the population mean, provided that the sample is random. We could also
say that this formula merely states that the sweets do not get heavier, or lighter, as a
result of being packed in bags.
Formula 4.3
The Standard Error of the Sample Mean
SE( X )
n
This is a very useful result. It confirms our intuition that a more precise estimate of µ
is provided by a sample, rather than by an individual value, and that larger samples
are preferable to smaller samples. In other words, sample means are less variable
than individual values. When we combine this information with our next discovery,
below, we will be in a position to calculate the exact degree of confidence that can be
placed in an estimate of a population mean.
Fig 4.1
[Bar chart: Percent against Number on a Rolled Die (1 to 6); the uniform distribution of a single die.]
Now consider sample means for n = 2 dice. The sample mean is more likely to be in
the middle than at the extremes. To get a sample mean of 1, both dice must show 1,
but there are lots of ways to get a sample mean of 3: namely 3 and 3, 1 and 5, 5 and
1, 2 and 4, 4 and 2.
Fig 4.2
[Bar chart: Percent against Sample Mean on 2 Dice (1.0 to 6.0); a triangular shape, peaked in the middle.]
Fig 4.3
[Bar chart: Percent against Sample Mean on 5 Dice (1.0 to 6.0); a bell shape.]
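The behaviour in these charts is easy to reproduce by simulation. The Python sketch below (an added illustration) draws many samples of five dice and checks that the sample means centre on 3.5 with a much smaller spread than a single die:

```python
import random
from statistics import mean, stdev

random.seed(3)

def sample_mean(n):
    """Mean of n simulated dice."""
    return mean(random.randint(1, 6) for _ in range(n))

means = [sample_mean(5) for _ in range(10_000)]

centre = mean(means)    # close to the single-die mean of 3.5
spread = stdev(means)   # close to 1.708 / sqrt(5), about 0.764
```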
The distribution is normal. Sample means taken from any population are normally
distributed if the samples are big enough. And how big is ‘big enough’? If the parent
Problems 4A
#1. Compressed tablets are manufactured by a process that produces tablets on
different days with a slightly different mean tablet weight on each day. The standard
deviation of the tablet weights is constant every day, at 6 micrograms. A sample of
tablets is taken each day and the sample mean is used to estimate the mean tablet
weight for that particular day. What is the standard error of this estimate, if the sample
size is:
(a) 4 tablets?
(b) 9 tablets?
(c) 36 tablets?
X̄ ± 1.96 × σ / √n
This formula can be used only if σ is known. There are situations where this is the
case. Suppose I pick up a stone and weigh it 1000 times on a kitchen scale. The
results will differ slightly from each other, because the scale is not perfect. I can
calculate the mean of the results, and this tells me something about the stone – how
heavy it is. I can also calculate the standard deviation of the results, but this tells me
nothing about the stone. It tells me something about the scale – how precise it is. Now
if you pick up a different stone and begin to weigh it on the same scale, we know what
standard deviation to expect for your results. It will be the same as mine, because you
are using the same weighing process. So whether you decide to weigh your stone
once, or five times, we can use the formula above to estimate your process mean,
because we already know the standard deviation from previous experience. In
summary, if we have prior knowledge of the standard deviation of a process, we can
use this knowledge to estimate the current process mean.
EXAMPLE The cycle times of a washing machine are known to have a standard
deviation of 2 seconds. Construct a 95% confidence interval for the population mean,
if a random sample of three cycles had the following durations, in seconds: 2521, 2526,
2522.
SOLUTION
X̄ = 2523
σ=2
n=3
µ = 2523 ± 1.96 × 2 / √3
µ = 2523 ± 2.26
2520.74 ≤ µ ≤ 2525.26
We can state with 95% confidence that the mean cycle time for this machine lies inside
this interval. On 95% of occasions when we construct such a confidence interval, the
interval will include the population mean. There is a 5% probability that an unfortunate
sample has been drawn, leaving the population mean lying outside the interval. Other
confidence levels, such as 99%, could be used instead, but this book uses 95%
throughout. The simple estimate, 2523, is called a point estimate.
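The interval arithmetic can be scripted directly (a Python sketch added for illustration):

```python
from math import sqrt
from statistics import mean

times = [2521, 2526, 2522]   # sample of washing-machine cycle times, seconds
sigma = 2                    # known from previous experience

x_bar = mean(times)
margin = 1.96 * sigma / sqrt(len(times))
low, high = x_bar - margin, x_bar + margin
```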
Formula 4.5 is a special case of Formula 4.6, because t∞ = z. This formula represents
the culmination of all the preceding theory in this chapter.
Formula 4.6
95% Confidence Interval for a Population Mean (if σ is unknown)
X̄ ± t(n-1) × S / √n
EXAMPLE A random sample of flight times last year between Dublin and Edinburgh,
gave the following results, in minutes: 45, 49, 43, 51, 48. Construct a 95% confidence
interval for the population mean.
SOLUTION
X̄ = 47.2
S = 3.19
n=5
tn-1 = 2.776
µ = 47.2 ± 2.776 × 3.19 / √5
µ = 47.2 ± 3.96
43.24 ≤ µ ≤ 51.16
We can state with 95% confidence that the mean flight time, for all flights between
Dublin and Edinburgh last year, lies between 43.24 and 51.16 minutes.
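In Python the same interval can be reproduced; the t value (2.776, for 4 degrees of freedom) still comes from tables, since the standard library has no t distribution (an added sketch):

```python
from math import sqrt
from statistics import mean, stdev

times = [45, 49, 43, 51, 48]   # flight times, minutes
x_bar, s, n = mean(times), stdev(times), len(times)

t = 2.776                      # t table value for n - 1 = 4 degrees of freedom
margin = t * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin
```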
P ± 1.96 × √(P(1 - P) / n)
EXAMPLE Out of 50 students randomly selected from a college, 16 had been bitten by
a dog at some time. Construct an approximate 95% confidence interval for the
population proportion.
SOLUTION
P = 16 / 50 = 0.32
π = 0.32 ± 1.96 √ {(0.32 × (1 – 0.32) / 50}
π = 0.32 ± 0.13
0.19 ≤ π ≤ 0.45
We estimate that between 19% and 45% of all students in the college had been bitten
by a dog at some time, with approximately 95% confidence.
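A Python sketch of the proportion interval (added for illustration):

```python
from math import sqrt

n = 50
p_hat = 16 / n                  # sample proportion bitten by a dog

margin = 1.96 * sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin
```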
Problems 4B
#1. A machine that fills bags with rice is known to have a standard deviation of 0.5
grams. Construct a 95% confidence interval for the population mean, if a random
sample of filled bags had weights, in grams, as follows.
497.15, 498.21, 497.93, 497.46, 498.91, 497.61
#2. It is known that the standard deviation of the diameters of plastic tubes, made by
a certain process, is 0.05 mm. Five tubes were randomly selected and their diameters
were measured, with the following results in mm.
12.01, 12.05, 12.08, 12.02, 12.11
Construct a 95% confidence interval for the population mean.
#3. A random sample of service calls for a certain engineer involved journeys of the
following distances, in km:
17.1, 9.2, 4.0, 3.1, 20.7, 16.1, 11.0, 14.9
Construct a 95% confidence interval for the population mean.
#4. A motorist randomly selected five insurers from among all the providers in the
market, and obtained the following quotations for motor-insurance, in euro.
301, 295, 302, 305, 279
Construct a 95% confidence interval for the population mean.
#5. A random sample of 200 booklets was selected from a large consignment, and 60
of these were found to have defective binding. Construct an approximate 95%
confidence interval for the population proportion.
#6. Out of 633 students surveyed at a college in Tallaght, 105 walk to college.
Construct an approximate 95% confidence interval for the population proportion.
Formula 4.8
Sample Size for Estimating a Mean
n = (1.96 × σ / δ)²
EXAMPLE A random sample of four fish was taken at a fish-farm. The weights of the
fish, in grams, were 317, 340, 363 and 332. How many fish must be weighed in order
to estimate the population mean to within ± 10 grams, with 95% confidence?
SOLUTION
S = 19.2, the sample standard deviation is the planning value
δ = 10, the margin of error
n = (1.96 × 19.2 / 10)²
n = 14.16, rounded up becomes 15.
We require 15 fish. Since we already have 4 fish in the pilot sample, 11 more fish must
be sampled to bring the cumulative sample size up to 15.
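The calculation can be sketched in Python; note that the result is always rounded up, never down:

```python
import math

s = 19.2     # planning value: standard deviation of the pilot sample
delta = 10   # required margin of error, in grams

n = math.ceil((1.96 * s / delta) ** 2)   # round up to the next whole fish
```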
Formula 4.9
Sample Size for Estimating a Proportion
n = 1.96² × P(1 − P) / δ²
EXAMPLE How large a sample of voters is required to estimate, to within 1%, the level
of support among the electorate for a new proposal to amend the constitution?
SOLUTION
P = 0.5, conservative planning value
δ = 0.01, the margin of error
n = 1.96² × 0.5 × (1 − 0.5) / 0.01² = 9604
We require a sample of 9604 voters.
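In Python the same calculation looks like this (a small rounding guard is added before rounding up, to avoid floating-point error tipping the result over):

```python
import math

p = 0.5       # conservative planning value
delta = 0.01  # margin of error (1%)

n_raw = 1.96 ** 2 * p * (1 - p) / delta ** 2
n = math.ceil(round(n_raw, 6))   # guard against float error, then round up
```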
Problems 4C
#1. The lifetimes of three rechargeable batteries in a batch were measured, with the
following results, in hours: 168, 172, 163. How large a sample is required to estimate
the population mean to within ± 1 hour, with 95% confidence?
#2. The standard deviation of the lengths of roof-tiles from a certain supplier is known
to be 3 mm. How large a sample is required to estimate the population mean to within
±0.5 mm, with 95% confidence?
#3. How large a sample is required to estimate, to within 10%, the proportion of pizza
orders from a certain outlet that require home delivery?
#4. How large a sample is required to estimate, to within 3%, the level of support for
a political party which traditionally receives 22% support in opinion polls?
EXAMPLE A depth sounder was used to take three repeat measurements at a fixed
point near the coast, and another six measurements at a fixed point further out to sea.
The results, in metres, were as follows: 28, 29, 26 (coast) and 36, 38, 37, 35, 35, 36
(sea). Estimate the standard deviation of repeat measurements for this device.
SOLUTION
n1 = 3
S1 = 1.528
n2 = 6
S2 = 1.169
S² = {(3−1) × 1.528² + (6−1) × 1.169²} ÷ (3−1+6−1)
S² = 1.643 is the pooled variance estimate, and
S = 1.282 is the pooled standard deviation estimate.
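The pooling step can be sketched in Python, using the two sample standard deviations from the example:

```python
import math

# Sample sizes and standard deviations from the depth-sounder example
n1, s1 = 3, 1.528
n2, s2 = 6, 1.169

# Weight each sample variance by its degrees of freedom
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 - 1 + n2 - 1)
pooled_sd = math.sqrt(pooled_var)
```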
Formula 4.11
95% Confidence Interval for a Population Standard Deviation
√{(n − 1)S² / χ²upper} ≤ σ ≤ √{(n − 1)S² / χ²lower}
where χ²upper and χ²lower are the upper and lower 2.5% points of the chi-square
distribution with n − 1 degrees of freedom.
EXAMPLE
A random sample of five loaves had the following lengths, in cm:
26.9, 28.4, 31.6, 27.8, 30.2.
Construct a 95% confidence interval for sigma, the population standard deviation.
SOLUTION
n=5
S = 1.898
degrees of freedom, df = 4
From the chi-square table, we find
upper 2.5% point = 11.143
lower 2.5% point = 0.484 (the upper 97.5% point)
√{4 × 1.898² / 11.143} = 1.137
√{4 × 1.898² / 0.484} = 5.456
1.137 ≤ σ ≤ 5.456
We can state with 95% confidence, assuming normality, that the standard deviation of
the lengths of all the loaves lies between 1.137 and 5.456 cm.
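The two limits can be sketched in Python; the chi-square points 11.143 and 0.484 are the table values quoted in the solution above:

```python
import math

n = 5
s = 1.898             # sample standard deviation of the loaf lengths
chi2_upper = 11.143   # upper 2.5% point, 4 df (chi-square table)
chi2_lower = 0.484    # lower 2.5% point, 4 df

lower = math.sqrt((n - 1) * s**2 / chi2_upper)
upper = math.sqrt((n - 1) * s**2 / chi2_lower)
```

Note how the upper chi-square point gives the lower confidence limit, and vice versa.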
Problems 4D
#1. Kevin weighed himself three times in succession on a scale and the readings, in
kg, were: 25.4, 25.6, 24.9.
Rowena weighed herself twice on the same scale and her results were: 31.2, 31.9.
Use all the data to estimate the standard deviation of repeat measurements for this
scale.
#2. A 100 m hurdler was timed on five practice events, with the following results, in
seconds.
12.01, 12.05, 11.99, 12.06, 12.03.
Construct a 95% confidence interval for σ.
S = 13.46
n=7
tn-1 = 2.447
µ = -17.14 ± 2.447 × 13.46 / √7
µ = -17.14 ± 12.45
-29.59 ≤ µ ≤ -4.69
We are 95% confident that Max’s prices are between 4.69 and 29.59 cent lower than
Ben’s, per item, on average.
95% Confidence Interval for the Difference between Two Population Means
µ1 − µ2 = X̄1 − X̄2 ± t × √(S²/n1 + S²/n2)
where S² is the pooled variance estimate and t has n1 + n2 − 2 degrees of freedom.
X̄1 = 893.75
X̄2 = 858.00
df = 7
t = 2.365, from the t tables
µ1 - µ2 = 893.75 - 858.00 ± 2.365 √(1057.23/4 + 1057.23/5)
µ1 - µ2 = 35.75 ± 51.58
-15.83 ≤ µ1 - µ2 ≤ 87.33
We are 95% confident that Lucan is between €15.83 less expensive and €87.33 more
expensive than Naas for apartment rentals, on average. Because the confidence
interval includes zero, it may be that the mean difference is zero.
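The interval from the rental example can be sketched in Python, using the summary values from the solution above:

```python
import math

# Summary values from the rental example
xbar1, n1 = 893.75, 4
xbar2, n2 = 858.00, 5
pooled_var = 1057.23
t_crit = 2.365   # t table value, 7 df, 95% two-tailed

diff = xbar1 - xbar2
margin = t_crit * math.sqrt(pooled_var / n1 + pooled_var / n2)
lower, upper = diff - margin, diff + margin
```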
95% Confidence Interval for the Difference between Two Population Proportions
(approximate)
π1 − π2 = P1 − P2 ± 1.96 × √{P1(1 − P1)/n1 + P2(1 − P2)/n2}
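No worked example accompanies this formula here, so the sketch below uses made-up counts purely for illustration (x1 of n1 and x2 of n2 are hypothetical):

```python
import math

# Hypothetical counts, for illustration only
x1, n1 = 40, 100
x2, n2 = 30, 120

p1, p2 = x1 / n1, x2 / n2
margin = 1.96 * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lower = (p1 - p2) - margin
upper = (p1 - p2) + margin
```

As with the difference between means, an interval that straddles zero would leave open the possibility that the two population proportions are equal.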
Problems 4E
#1. Six internal wall-to-wall measurements were taken with a tape measure, and the
same measurements were also taken with a handheld laser device. The results, in cm,
are shown below, in the same order each time. Construct a 95% confidence interval
for the mean difference.
Tape Measure: 602, 406, 478, 379, 415, 477.
Laser Device: 619, 418, 483, 386, 413, 489.
#2. Construct a 95% confidence interval for the difference in weight between Kevin
and Rowena in problems 4D #1.
#3. An industrial process fills cartridges with ink. At 10 a.m. a sample of fills were
measured, with the following results, in ml: 55, 49, 57, 48, 52. At 11 a.m. another
sample was taken: 53, 59, 50, 49, 52. Construct a 95% confidence interval for the
difference between the population means.
#4. Of a sample of 150 visitors at a theme park in summer, 15 had prepaid tickets. Of
a sample of 800 visitors at the theme park in winter, only 9 had prepaid tickets.
Construct a 95% confidence interval for the difference between the population
proportions.
Project 4E #1
Estimation of a Mean
Select a sample of ten measurements from any large population of your choice and
write a report in the following format.
(a) Define the population of interest and the measurement of interest.
(b) Carefully describe how the sample was selected from the population.
(c) Present the data, compute the sample mean and the sample standard deviation
and construct a 95% confidence interval for the population mean.
(d) Explain carefully what your confidence interval tells you.
Project 4E #2
Estimation of a Proportion
Collect a sample of fifty attribute observations from any large population and write a
report in the following format.
(a) Define the population of interest and the attribute of interest.
(b) Carefully describe how the sample was selected from the population.
(c) Construct an approximate 95% confidence interval for the population proportion.
(d) Calculate the sample size required to estimate the population proportion to within
±1%, with 95% confidence.
5
Testing Theories
Having completed this chapter you will be able to:
– express a practical question as a statistical hypothesis test;
– use sample data to test hypotheses;
– appreciate the implications of sample size and power.
In order to test the hypothesis that the mean weight of a bag of peanuts is 25 g, we
write:
H0: µ = 25
A process engineer will be concerned if the bags of peanuts are either too heavy or
too light, and will state the alternative hypothesis as follows:
H1: µ ≠ 25
This is called a vague alternative hypothesis, because a departure from 25 in any
direction will lead to rejection of the null hypothesis. This will lead to a two-tailed test.
A consumer might be concerned only if the bags are too light, and will state the
alternative hypothesis as follows:
H1: µ < 25
This is called a specific alternative hypothesis, because only a departure from 25 in
a particular direction will lead to rejection of the null hypothesis. This will lead to a one-
tailed test.
Hypothesis Testing Procedure
1. State H0 and H1.
2. Draw a random sample.
3. Compute the test statistic using the appropriate formula.
4. Use statistical tables to identify the critical value. This identifies the rejection
region, which comprises one tail or two tails of the appropriate distribution, accounting
for an area of 5%.
5. If the test statistic is in the rejection region then p < 5%, so reject H0. Otherwise,
accept H0.
Alternatively, software can be used to calculate an exact p-value after steps 1 and 2
are performed.
Problems 5A
#1. A batch of shoelaces is subjected to a statistical hypothesis test to test for
conformity to a specification for length, prior to shipment of the batch to consumers. In
this context which error is more costly, a type 1 error or a type 2 error?
#2. A shopkeeper often waits in a queue at a bank to lodge coins. One day, the
shopkeeper notices that the customer waiting times seem to be longer than usual. He
records all the customer waiting times that he observes that day while waiting in the
queue. He then checks the bank’s website and reads a commitment to achieve a
particular mean waiting time for customers, and decides to carry out a statistical
hypothesis test of this hypothesised mean waiting time using his sample data. Is there
anything wrong with the shopkeeper’s approach?
Formula 5.1
One-sample z-test
H0: µ = some value
z = (X̄ − µ) / (σ / √n)
Use z if σ is known.
Notice that the numerator in the one-sample z-test measures the difference between
our observations (the sample mean) and our expectations (based on H0). The
denominator is the standard error, i.e. how big we would expect this difference to be.
The test statistic therefore shows the scale of the disagreement between the evidence
and the theory. The p-value measures exactly how likely it is that so much
disagreement would happen by chance.
EXAMPLE Bags of peanuts are claimed to weigh ‘not less than 25 g, on average’. It is
known that the standard deviation of the bag weights is σ = 3 g. Test the claim, if a
random sample of bag weights were as follows:
21, 22, 22, 26, 19, 22
SOLUTION
H0: µ = 25, H0 must state a single value for µ.
H1: µ < 25, this specific alternative hypothesis will lead to a one-tailed test, using the
left tail, because it is ‘less than’.
X̄ = 22
n=6
z = (22 - 25)/(3 / √6) = -2.449
The tables identify critical z = 1.645 (i.e. t tables, final row, one-tailed). Since we are
dealing with the left tail, critical z = -1.645. Any value more extreme than -1.645 is in
the rejection region.
-2.449 is in the rejection region (p < 5%).
Reject H0: µ = 25, in favour of H1: µ < 25
The claim, that the bags weigh not less than 25g on average, is rejected, at the 5%
level.
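The whole test can be sketched in Python; the critical value −1.645 comes from the tables, as described above:

```python
import math
import statistics

weights = [21, 22, 22, 26, 19, 22]   # bag weights from the example
mu0 = 25      # hypothesised mean under H0
sigma = 3     # known population standard deviation
n = len(weights)

z = (statistics.mean(weights) - mu0) / (sigma / math.sqrt(n))
reject = z < -1.645   # one-tailed test, left tail, 5% level
```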
Testing a Mean When Sigma is Unknown
Usually the population standard deviation is unknown, and a one-sample t-test must
be used. We repeat the previous example, without assuming any prior knowledge of
σ.
Formula 5.2
One-sample t-test
H0: µ = some value
t = (X̄ − µ) / (S / √n)
Use t if σ is unknown.
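Since the example that follows repeats the peanut data from the z-test, the t-version can be sketched in Python; the critical value 2.015 (t table, 5 df, one-tailed) is supplied by hand:

```python
import math
import statistics

weights = [21, 22, 22, 26, 19, 22]   # same peanut data as the z-test example
mu0 = 25
n = len(weights)
s = statistics.stdev(weights)   # sample S replaces the unknown sigma

t = (statistics.mean(weights) - mu0) / (s / math.sqrt(n))
reject = t < -2.015   # one-tailed, left tail, 5 df, 5% level
```

The conclusion is the same as before: the claim of 'not less than 25 g on average' is rejected at the 5% level.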
EXAMPLE Bags of peanuts are claimed to weigh ‘not less than 25 g, on average’.
Test this claim, if a random sample of bag weights were as follows:
Testing a Proportion
We may have a theory about a population proportion, for example, the proportion of
voters who favour a certain candidate, or the proportion of manufactured units that are
defective. If we draw a random sample from the population, we can use the observed
sample proportion to test the theory about the population proportion, using a one-
sample P-test.
Formula 5.3
One-sample P-test
H0: π = some value
z = (P − π) / √{π(1 − π) / n}
Conditions: nπ ≥ 5 and n(1 − π) ≥ 5.
EXAMPLE
A courier made a commitment that ‘not more than 8% of all deliveries will be late’. A
random sample of 315 deliveries included 30 late deliveries. Can it be asserted that
the commitment has been violated?
SOLUTION
H0: π = 0.08
H1: π > 0.08
n = 315
P = 30/315 = 0.095
z = (0.095 - 0.08) / √(0.08 × 0.92/315) = 0.98
Critical z = 1.645
0.98 is not in the rejection region (p>5%).
Accept H0.
The data do not establish, at the 5% level, that the proportion of all deliveries that are
late exceeds 8%. Therefore it cannot be asserted that the commitment has been
violated.
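The courier example can be sketched in Python (the sample proportion is rounded to three places, matching the worked solution above):

```python
import math

n = 315
x = 30        # late deliveries observed
pi0 = 0.08    # hypothesised proportion under H0
p = round(x / n, 3)   # 0.095, as in the worked solution

z = (p - pi0) / math.sqrt(pi0 * (1 - pi0) / n)
reject = z > 1.645    # one-tailed, right tail, 5% level
```

Note that the standard error uses the hypothesised π, not the sample P, because the test is carried out under the assumption that H0 is true.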
Problems 5B
#1. The average weight of glue beads is thought to be 5 grams. The standard deviation
is known to be 0.15 grams. The weights, in grams, of a random sample of 5 of these
beads are shown below. Do these data contradict the theory at the 5% level?
4.92, 5.06, 4.88, 4.79, 4.93
#2. The weights of chocolate bars are described as ‘average not less than 50 g’. Does
the following random sample of weights challenge this description at the 5% level?
49.76, 49.72, 50.13, 50.04, 49.54, 49.72
#3. The mean annual rainfall in a certain area is believed to be 80 mm. Test the theory
at the 5% level, using the following random sample of annual rainfall data.
75, 81, 72, 75, 69, 63, 70, 72
#4. The fuel consumption of a random sample of five vehicles, in litres per 100 km,
was measured before servicing, and again afterwards. The data are shown below.
Can it be asserted at the 5% level that fuel consumption tends to be lower after
servicing?
Table 5.2
Driver Before After
Stephanie 13 12
Lauren 15 13
Sean 16 14
Joshua 15 15
Mary-Ann 14 13
#5. A random sample of seven workers attended a course aimed at improving their
typing speeds. Their speeds, in words per minute, before and after the course, are
shown below. Do these data establish at the 5% level that the course is generally
effective?
Table 5.3
Worker Before After
Pauline 23 25
Peggy 37 37
Paddy 39 43
James 45 50
John 46 45
Marian 39 46
Tom 51 54
#6. A random sample of 220 passengers who use the Galway to Dublin train included
92 passengers who were travelling for business reasons. Can it be asserted at the 5%
level that less than half of all the passengers on this route are travelling for business
reasons?
#7. A botanist predicted that one in four of a set of laburnum trees would have pink
flowers. Out of a random sample of 96 of these trees, only 15 had pink flowers. Is there
significant evidence at the 5% level that the prediction was incorrect?
5C. Difference between Means or Proportions
One of the most common questions in statistics is whether two populations are
different with respect to some measurement or attribute. An example concerning
measurements could be: Is pollution higher in one watercourse compared to another?
An example concerning attributes could be: Are women more likely than men to
choose an iPhone as their Smartphone?
Differences between Means
To test whether pollution is higher in one watercourse compared to another, two
independent random samples must be taken, one from each watercourse.
Formula 5.4
Two-sample t-test
H0: µ1 = µ2
t = (X̄1 − X̄2) / √(S²/n1 + S²/n2)
S² is the pooled variance estimate, assuming equal variances.
EXAMPLE
The following measurements of ‘biological oxygen demand’ were taken from the
influent and effluent of a groundwater treatment tank. A two-sample t-test can be used
to investigate whether the effluent measurements are lower.
Influent: 220, 198, 198
Effluent: 186, 180, 180, 174, 177, 165, 174, 171
SOLUTION
H0: µ1 = µ2
H1: µ1 > µ2
n1 = 3
n2 = 8
S1 = 12.70
S2 = 6.40
S2 = 67.73, pooled variance estimate using formula 4.10
df = 9
X̄1 = 205.33
X̄2 = 175.88
t = (205.33-175.88)/√(67.73/3 + 67.73/8) = 5.29
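The test statistic and decision can be verified in Python; the critical value 1.833 (t table, 9 df, one-tailed) is supplied by hand:

```python
import math
import statistics

influent = [220, 198, 198]
effluent = [186, 180, 180, 174, 177, 165, 174, 171]
n1, n2 = len(influent), len(effluent)
s1, s2 = statistics.stdev(influent), statistics.stdev(effluent)

# Pooled variance, weighting each sample variance by its df
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

t = (statistics.mean(influent) - statistics.mean(effluent)) / math.sqrt(
    pooled_var / n1 + pooled_var / n2)
reject = t > 1.833   # one-tailed, 9 df, 5% level
```

Since 5.29 is in the rejection region, H0 is rejected: the data establish, at the 5% level, that the effluent measurements are lower.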
Table 5.4
iPhone Android Totals
Male 20 30 50
Female 26 24 50
Totals 46 54 100
A table like this is called a contingency table, because it is used to test whether the
proportion of iPhones is contingent (i.e. dependent) on gender. This test is called a
chi-square test of association, because we are investigating whether there is an
association between gender and smartphone. It can also be called a test of
independence because we are investigating whether the probability of someone
choosing an iPhone is independent of their gender.
Formula 5.5
Contingency Tables
H0: No association exists between the row and column categories.
χ² = Σ (O − E)² / E
O is the observed frequency and E is the expected frequency.
E = row total × column total ∕ grand total
Condition: Every E ≥ 5
df = (number of rows - 1) × (number of columns - 1)
The next step is to calculate the chi-square contribution in every cell, using the formula
above. These contributions can be displayed below the expected frequencies.
The chi-square contributions measure the degree of disagreement between the
observed and expected frequencies, which is a measure of the degree of
disagreement between the data and the null hypothesis.
Table 5.6
              iPhone    Android   Totals
Male          20        30        50
  expected    23        27
  chi-sq      0.3913    0.3333
Female        26        24        50
  expected    23        27
  chi-sq      0.3913    0.3333
Totals        46        54        100
Finally, the chi-square contributions from all the cells are added together to provide
the value of the test statistic, called the Pearson chi-square statistic.
χ² = 1.4492
The chi-square distribution table indicates that the critical value of chi-square with 1
degree of freedom is 3.841. We use the upper tail of the distribution, because we wish
to know whether the observed and expected frequencies are significantly different, not
whether they are significantly similar. There is one degree of freedom because the
degrees of freedom in a 2×2 table is df = (2-1)×(2-1) = 1.
1.4492 is not in the rejection region.
p >5%
Therefore the null hypothesis, which states that no association exists between gender
and smartphone, is accepted at the 5% level.
The data do not establish, at the 5% level, that women are any more likely than men,
or any less likely than men, to choose an iPhone as their smartphone.
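The whole calculation can be sketched in Python; the same loop works for a contingency table of any size, not just 2×2:

```python
# Observed frequencies from Table 5.4: rows = gender, columns = smartphone
observed = [[20, 30],
            [26, 24]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Sum the contribution (O - E)^2 / E over every cell
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand   # expected frequency
        chi2 += (o - e) ** 2 / e

reject = chi2 > 3.841   # critical chi-square, 1 df, 5% level
```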
Tests of Association
A test of association can be used to compare proportions in any number of
populations, not just two. In fact, a contingency table with any number of rows and any
number of columns can be used to test for association between the row categories
and the column categories.
EXAMPLE The numbers of people whose lives were saved, and lost, when the Titanic
sank, are shown below. The figures are broken down by class. Is there an association
between class and survival?
Table 5.7
Class Saved Lost
First class 202 123
Second class 118 167
Third Class 178 528
Crew 192 670
SOLUTION
H0: No association exists between class and survival.
The marginal totals are shown in the table below.
The expected frequencies are displayed below the observed frequencies.
The chi-square contributions are displayed below the expected frequencies.
Table 5.8
Class           Saved     Lost      Totals
First class     202       123       325
  expected      103.0     222.0
  chi-sq        95.265    44.175
Second class    118       167       285
  expected      90.3      194.7
  chi-sq        8.505     3.944
Third class     178       528       706
  expected      223.7     482.3
  chi-sq        9.323     4.323
Crew            192       670       862
  expected      273.1     588.9
  chi-sq        24.076    11.164
Totals          690       1488      2178
χ² = 200.775
In this case, df = 3
The critical value of chi-square with 3 degrees of freedom is 7.815.
200.775 is in the rejection region, so p < 5%.
H0 is rejected at the 5% level.
Looking at the different cells we notice that the largest chi-square contribution
corresponds to first class passengers who were saved, so it makes sense to mention
that first in our conclusions:
The data establish, at the 5% level, that an association does exist between class and
survival. First class passengers were more likely than others to be saved, and crew
members were less likely than others to be saved.
Sampling for a Test of Association
When collecting data for a test of association, it is OK to use the entire population of
data if it is available, as in the case of the Titanic data. If sampling is to be used,
sufficient data must be collected so that the expected count in every cell is at least
five.
There are three equally valid approaches to sampling for a test of association, and
usually whichever approach is most convenient is the one that is used.
1. Collect a single random sample from the entire population, and classify every
observation into the appropriate row and column. For example, a single random
sample of students could be selected from among the students at a college, and each
selected student could then be classified by their gender, either male or female, and
also by their smartphone category, either iPhone or Android.
An advantage of collecting the data in this way is that the data can be used not only
to perform a contingency table analysis, but also to estimate the proportion of the
population that belongs in any row or column category, e.g. the proportion of students
who are male, or the proportion of students who have an iPhone.
2. Collect a random sample from each row, and classify every observation into the
appropriate column. For example, collect a random sample of male students and
classify each selected student by their smartphone category, either iPhone or Android.
Then collect another random sample of female students and again classify each
selected student by their smartphone category, either iPhone or Android.
An advantage of collecting the data in this way is that we can ensure sufficient
representation of sparse categories, e.g. if there are only a small number of male
students in the college, we can ensure that we have sampled sufficient numbers of
male students by targeting them in this way.
3. Collect a random sample from each column, and classify every observation into the
appropriate row. For example, collect a random sample of students with iPhones, and
classify each selected student by their gender, either male or female. Then collect
another random sample of students with Android smartphones and again classify each
selected student by their gender, either male or female.
An advantage of collecting the data in this way is that we can ensure sufficient
representation of a sparse smartphone category, either iPhone or Android.
Now suppose that there are more than two categories of smartphone, e.g. iPhone,
Android and Windows. And suppose that Windows smartphones are so sparse that
after collecting the data the number of Windows smartphones is insufficient to carry
out a test of association, because some of the expected frequencies are less than five.
We could combine two categories together, e.g. combine the Android and Windows
categories, thus reducing the number of categories to two: ‘iPhone’ and ‘Other’. If it
doesn’t make sense to combine categories, the sparse category could be simply
omitted from the analysis.
Problems 5C
#1. Kevin weighed himself three times in succession on a scale and the readings, in
kg, were: 25.4, 25.6, 24.9. Rowena weighed herself twice on the same scale and her
results were: 31.2, 31.9. Do these data assert, at the 5% level, that Rowena is heavier?
#2. An industrial process fills cartridges with ink. At 10 a.m. a sample of fills were
measured, with the following results, in ml: 55, 49, 57, 48, 52. At 11 a.m. another
sample was taken: 53, 59, 50, 49, 52. Do these data assert, at the 5% level, that the
process mean has changed?
#3. Out of a sample of 150 visitors at a theme park in summer, 15 had prepaid tickets.
Out of a sample of 800 visitors in winter, only 9 had prepaid tickets. Do these data
assert, at the 5% level, that the population proportions are different?
#4. Broken biscuits are occasionally noticed in boxes awaiting shipment. A sample of
100 boxes packed by hand included 3 boxes with some broken biscuits. A sample of
100 boxes packed by machine included 17 boxes with some broken biscuits. Do these
data indicate that the problem is related to the packing method?
#5. A mobile communications provider has two types of customer: Prepay and
Postpay. Occasionally, complaints are received from customers and these are
categorised as relating to the customer’s Handset, Network or Account. A random
sample taken from the complaints database is tabulated below. Do these data indicate
that different types of customer tend to have different complaints?
Table 5.9
Type Handset Network Account
Prepay 15 2 4
Postpay 35 48 46
#6. A number of students were sampled from courses in Computing, Science and
Engineering, and then classified by gender. Is there an association between gender
and course?
Table 5.10
Gender Computing Science Engineering
Male 20 17 44
Female 19 45 4
#7. The numbers of people whose lives were saved, and lost, when the Lusitania sank,
are shown below. The figures are broken down by class. Is there an association
between class and survival?
Table 5.11
Class Saved Lost
First class 113 177
Second class 229 372
Third Class 134 236
Crew 292 401
2. Sigma
Secondly, it is difficult to find what we are looking for because it is hidden by random
variation (like a haystack hides a needle). The random variation is measured by the
standard deviation of the population. Its value may be known from previous experience
or estimated from a pilot sample. The larger the standard deviation, the larger the
sample size will be.
3. Power
Power is the probability that we will find what we are looking for (like the chance of
finding the needle in the haystack). In other words, it is the probability of rejecting the
null hypothesis, if the size of the difference is delta. As a rule of thumb we select a
power of 80% for detecting a difference of size delta. This gives a good chance (80%)
of detecting a practically significant difference if one exists. If the difference is greater
than delta, then the probability of rejecting the null hypothesis will be even greater than
80%. And if the difference is less than delta, then the probability of rejecting the null
hypothesis will be less than 80%.
Having identified the values of delta, sigma and power, statistical software can be used
to compute the required sample size. It is also necessary to specify whether a one-
tailed or two-tailed test is proposed. A one-tailed test requires a smaller sample size,
because we have a better idea about where we need to look.
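As an illustration of what such software computes, the normal-approximation formula for the one-sample, one-tailed case can be sketched as follows. The values of delta and sigma below are hypothetical, and real packages typically use more exact noncentral-distribution methods, so treat this only as a sketch:

```python
import math

# Hypothetical planning values, for illustration only
delta = 5         # smallest difference worth detecting
sigma = 10        # population standard deviation
z_alpha = 1.645   # one-tailed test at the 5% level
z_beta = 0.8416   # 80% power (upper 20% point of the normal distribution)

# Normal-approximation sample size; round up
n = math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
```

For validation, z_beta would be replaced by the upper 5% point (1.645), reflecting the 95% power recommended below, and the required sample size grows accordingly.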
If a sample size was decided arbitrarily in advance, and then a hypothesis test leads
to the null hypothesis being accepted, it is important to find out if the test was
sufficiently powerful. Statistical software can be used in this case to compute the
power of the test. If the power was low, then the hypothesis test does not tell us much.
Research versus Validation
We can use the word research to refer to the activity of looking for a difference, as
described above. However, on other occasions, our objective is to prove that there is
no difference. We call this validation. This is like trying to prove that there is no needle
in the haystack. This requires a larger sample size, because we will not be willing to
assume that there is no needle (based on absence of evidence), but rather we want
to assert that there is no needle (based on evidence of absence). We must make sure
to take a sample so big that if there is a difference, we will find it. We achieve this by
selecting a power of 95% for detecting a difference of size delta. This will call for a
larger sample, and if we are still unable to reject the null hypothesis, then we are no
longer merely assuming it to be true, we are asserting it to be true. In summary, a
power of 80% is recommended for research, and 95% for validation.
Problems 5D
#1. If a statistical hypothesis test is compared to a court case, where the null
hypothesis corresponds to the assumption of innocence, then what does validation
correspond to in this situation?
#2. It is believed that the regular consumption of liquorice may cause an increase in
blood pressure. A hypothesis test is planned to test the difference in blood pressure
of a sample of volunteers before and after a prescribed period of liquorice
consumption. When considering how many volunteers are needed for the study, a
value for delta will be required. How would you identify this value for delta?
Formula 5.6
One-sample Variance Test
H0: σ² = some value
χ² = (n − 1)S² / σ²
df = n − 1
Condition: A normal population.
SOLUTION
H0: σ² = 4, the test uses the variance, not the standard deviation
H1: σ² > 4
n=6
S = 2.16
χ² = (6−1) × 2.16² / 4
χ² = 5.832
df = 5
Critical χ² = 11.070
5.832 is not in the rejection region.
Accept H0.
These data do not assert, at the 5% level, that the standard deviation exceeds 2 mm.
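The test can be sketched in Python, with the critical value taken from the chi-square table as above:

```python
n = 6
s = 2.16        # sample standard deviation
sigma0_sq = 4   # hypothesised variance (sigma = 2 mm)

chi2 = (n - 1) * s**2 / sigma0_sq
reject = chi2 > 11.070   # upper 5% point, 5 df, one-tailed
```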
Two-sample Variance Test
If two samples are drawn from normal populations, the ratio of the sample variances
follows an F distribution. We define the F-statistic as the ratio of the larger sample
variance to the smaller one, and this always leads to a one-tailed test, using the upper
5% point of the F distribution. The F distribution has degrees of freedom for the
numerator and denominator, in that order.
Formula 5.7
Two-sample Variance Test
H0: σ1² = σ2²
F = S1² / S2², where S1² ≥ S2²
Condition: Normal populations.
The degrees of freedom are n1-1 and n2-1 for the numerator and denominator.
EXAMPLE
Paint can be applied by two modes: brush or sprayer. We are interested to know if the
thickness is equally consistent for the two modes of application. Two random samples
of paint specimens were selected, one from each mode. The thickness of each paint
specimen was measured in microns, with the following results.
Brush: 270, 295, 315, 249, 296, 340
Sprayer: 325, 333, 341, 334, 317, 342, 339, 321
Can it be asserted, at the 5% level, that the variances are unequal for the two modes
of application?
SOLUTION
H0: σ1² = σ2²
n1 = 6
n2 = 8
S1 = 32.13
S2 = 9.47
F = 11.51
df = 5,7
Critical F = 3.972
11.51 is in the rejection region (p < 5%).
Reject H0.
Yes. The data assert, at the 5% level, that sprayed paint exhibits less variation in
thickness than brushed paint.
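The F-statistic can be sketched in Python, using the sample standard deviations from the solution:

```python
s1 = 32.13   # brush: the larger standard deviation goes in the numerator
s2 = 9.47    # sprayer
n1, n2 = 6, 8

f = s1**2 / s2**2
reject = f > 3.972   # upper 5% point of F with 5 and 7 df
```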
Test of Goodness-of-fit
We may wish to test the goodness-of-fit of some data to a particular distribution, e.g.
a Poisson distribution. This may sound like a rather academic exercise, but in fact it is
one of the most interesting of all hypothesis tests. If we find that the Poisson
distribution is not a good fit to data, for example, on the number of breakdowns that
occur on a photocopier, this indicates that the assumptions which apply to the Poisson
distribution do not apply to that situation. The Poisson distribution assumes that events
occur at random. If the Poisson distribution is not a good fit, then the events do not
occur at random: perhaps they occur at regular intervals (i.e. uniform, like a ‘dripping
tap’), or perhaps they occur in clusters (i.e. contagious, ‘it never rains but it pours’).
This gives us useful information which may enable us to prevent future breakdowns.
If the breakdowns occur at regular intervals, they may be caused by wear-out of some
component: this could be addressed by scheduling regular servicing or replacement
of that part. If the breakdowns occur in clusters, it may indicate that the remedy being
applied is ineffective, so a new repair strategy needs to be devised. This is just one
example: the interpretation will vary greatly from one case to another.
Formula 5.8
χ² Goodness-of-fit Test
H0: The population has some specified distribution
χ² = Σ (O − E)² / E
O is the observed frequency and E the expected frequency in each of the k
categories.
df = k-j-1, where j is the number of parameters estimated from the data.
Conditions:
(i) k≥5
(ii) Every E≥1 (Adjacent categories may be combined to achieve this.)
EXAMPLE A die was rolled 30 times. The outcomes are summarised below. Is it a fair
die?
Table 5.12
Outcome 1 2 3 4 5 6
Frequency 6 3 5 8 5 3
SOLUTION
H0: The population has a uniform distribution with k = 6.
There were a total of 30 rolls, so the expected frequencies are: 5, 5, 5, 5, 5, 5.
χ² = (6−5)²/5 + (3−5)²/5 + (5−5)²/5 + (8−5)²/5 + (5−5)²/5 + (3−5)²/5
χ² = 1/5 + 4/5 + 0/5 + 9/5 + 0/5 + 4/5 = 3.6
df = 6-0-1 = 5 (No parameters were estimated from the data.)
Critical value of χ² = 11.070
3.6 is not in the rejection region.
Accept H0.
The data are consistent with the assumption that the die is fair. We cannot assert, at
the 5% level, that the die is ‘weighted’ in favour of some numbers at the expense of
others.
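The calculation above can be sketched in a few lines of Python; the function name is mine, and the critical value would still be read from a χ² table as in the solution.

```python
def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic from Formula 5.8: the sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The die example: 30 rolls, so each of the 6 faces has expected frequency 5
observed = [6, 3, 5, 8, 5, 3]
expected = [5] * 6
stat = chi_square_gof(observed, expected)
print(round(stat, 1))  # 3.6, which is below the critical value 11.070 (df = 5)
```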
Problems 5E
#1. A process that fills masses of powder into capsules has a standard deviation of 5
mg. A new process has been tried out and the masses of nine quantities of powder
filled by this process are shown. Do the data assert, at the 5% level, that the new
process is less variable?
Mass in mg: 251, 255, 252, 250, 254, 253, 253, 251, 252
#2. A hotel provides a coach to take guests to the airport. There are two alternative
routes: through the city streets, or via a ring road. A random sample of journeys was
observed for each route, and the time was recorded in minutes on each occasion. Do
the data below assert, at the 5% level, that either one of the two routes has more
consistent journey times than the other?
City Streets: 40, 36, 36, 40, 33
Ring Road: 35, 36, 37, 34, 37
#5. One Saturday, the number of goals scored in each of the 56 matches played in
the English and Scottish leagues was recorded. Is the Poisson distribution a good fit?
Table 5.14
Goals 0 1 2 3 4 5 6 7
Matches 6 13 15 12 5 1 1 2
#6. A consignment of eggs was found to contain a large number of cracked eggs.
Before disposing of the consignment, the number of cracked eggs in each carton (of
6) was counted. Is the binomial distribution a good fit? Did the damage occur before
or after the eggs were packed into the cartons?
Table 5.15
Cracked 0 1 2 3 4 5 6
Cartons 80 14 12 20 16 21 40
Project 5E
Test of Association
Design and carry out a test of association on any population of your choice, using a
contingency table, and write a report in the following format.
(a) Describe the population of interest and the purpose of your study.
(b) Explain how you sampled from the population, and why you chose this particular
sampling scheme.
(c) Display the table of observed and expected frequencies, and present the analysis.
(d) Are the categories associated? State your conclusions simply and completely,
without using any statistical jargon.
(e) Suggest two possible causes of association between the categories you have
studied.
6
Correlation and Regression
Having completed this chapter you will be able to:
– explore the relationship between two or more variables;
– identify variables that are useful for making predictions;
– make predictions and estimates using regression models.
6A. Correlation
Often we are interested in estimating the value of a variable that is not directly available.
It could be something that is difficult to measure, like temperature. Or it could be
something that will not be available until the future, like the number of pages that can
be printed with a recently purchased ink cartridge. Or it could be something that is
unknown, like the height of an intruder who departed without being seen.
But there may be some other variable that can be measured instead, and perhaps this
other variable may be useful for estimating the value of the variable of interest. This
other variable is called a predictor variable and the variable of interest is called the
response variable.
So, instead of measuring temperature, we could measure the length of a thread of
liquid in a glass tube. Instead of measuring the number of pages that can be printed
with a recently purchased ink cartridge, we could measure the volume of ink in the
cartridge. Instead of measuring the height of an intruder, we could measure the size
of a footprint left by the intruder. In each case, the value of the predictor variable could
be useful for estimating the value of the response variable.
Scatterplots
The first step is to investigate whether there seems to be any kind of linear relationship
between the two variables. Such a linear relationship is called correlation, and this
can be explored with a scatterplot. The predictor goes on the horizontal axis and the
response goes on the vertical axis. A sample is drawn, and the values of both variables
are observed. The resulting data-pairs are plotted as points on the graph.
EXAMPLE 1
A number of people were sampled and their shoe-sizes and heights were observed.
The data are represented in the scatterplot below.
We can see that as shoe-size increases, height tends to increase also, i.e. there is
some positive correlation between shoe-size and height. The correlation is fairly strong
but not very strong. There definitely seems to be a relationship but there is a lot of
random scatter as well. We could say that there is some explained variation and also
some unexplained variation. Knowing an intruder’s shoe-size would be useful for
estimating their height, but it would not provide a perfect estimate.
Fig 6.1 Scatterplot of Height versus Shoe-Size
EXAMPLE 2
A number of small cars were sampled from a car sale website. The age and asking
price of each selected car is represented in the scatterplot below.
Fig 6.2 Scatterplot of Price versus Age
In this case, there is negative correlation, i.e. as age increases, price tends to
decrease. And on this occasion the correlation is strong: the points are not widely
scattered. Instead, they exhibit a high degree of linearity. It seems that age explains a
lot of the variation in the price of second-hand cars.
EXAMPLE 3
A number of people were sampled and in each case the number of letters in their
surname and their pulse were observed. The data are represented in the scatterplot
below.
Fig 6.3 Scatterplot of Pulse versus Length of Surname
Of course, these two variables are not correlated at all, either positively or negatively.
A scatter diagram of two uncorrelated variables will reveal a set of randomly scattered
points. As X increases, Y does not tend to either increase or decrease. This can be
called a plum-pudding model, because the points are randomly scattered, like the
currants in a plum pudding.
Correlation does not prove that X causes Y. When a scatterplot of observed values of
X and Y reveals correlation between the two variables, we say that there is a predictive
relationship between X and Y. We can predict the value of Y more precisely if the value
of X is known. We do not assume that X causes Y. For example, if we take a sample
of countries from among the countries of the world, and observe the number of TV
sets per person in each country, and the life expectancy in each country, we will find
that there is positive correlation between these two variables. But this does not prove
that having more TV sets causes people to live longer. To prove causation requires
a designed experiment in which the values of X are randomly assigned.
Correlation Coefficient
The correlation coefficient (r) is a statistic that measures the degree of linearity
between two variables. Its value always lies between -1 and 1, inclusive. The sign of r
(+ or -) indicates the type of correlation (positive or negative). The size of r indicates
the strength of the correlation. If r = 0 then there is no correlation, while if r = 1 there
is perfect correlation with all the points lying on a straight line. Values close to zero
indicate that the correlation is weak, while values close to 1 indicate that the correlation
is strong.
The value of r is not easy to calculate by hand, but you can use your calculator or
statistical software to calculate it for you.
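As a sketch of what the calculator or software is doing, r can be computed directly from its definition; the data pairs below are hypothetical, not the book's sample.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical shoe-size and height pairs (not the book's data)
shoe = [3, 4, 5, 5, 6, 7]
height = [160, 163, 167, 165, 170, 174]
r = pearson_r(shoe, height)
print(round(r, 3), round(r ** 2, 3))  # r is close to 1: strong positive correlation
```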
EXAMPLE 1
For the shoe-size and height data, r = 0.814
This value indicates positive correlation, since the value of r is positive. This tells us
that as shoe-size increases, height tends to increase.
This value also indicates correlation that is fairly strong but not very strong, since the
magnitude of r is well above zero but not very close to 1. This tells us that shoe-size
is useful for predicting height, but the prediction will not be very precise.
EXAMPLE 2
For the age and price data, r = -0.996
This value indicates negative correlation, since the value of r is negative. This tells us
that as the age of a car increases, the price tends to go down.
This value also indicates very strong correlation, since the magnitude of r is very close
to 1. This tells us that age is useful for predicting price, and the prediction will be
precise.
EXAMPLE 3
For the surname and pulse data, r = 0.009
This value indicates little or no correlation, since the value of r is close to 0. This tells
us that surname length is not useful for predicting pulse.
All of these values for the correlation coefficient are based on sample data, so they
are estimates of the population correlation coefficient, which is denoted by the Greek
letter rho, symbol ρ.
Coefficient of Determination
The square of the correlation coefficient, r², is called the coefficient of determination.
It is usually expressed as a percentage, and it indicates the proportion of the variation
in Y which is explained by the variation in X. In more general terms it is the proportion
of the variation in the response that is explained by the model being considered. Any
model will appear to explain some of the unexplained variation in the sample, because
the degrees of freedom are reduced by the number of parameters in the model. The
value of R-squared can be adjusted for this loss of degrees of freedom, and the
resulting statistic is called R-squared (adjusted).
Problems 6A
#1. (Exercise) Select a sample of fellow students or work colleagues who arrived by
car today and in each case observe the journey distance and the journey time. Find
the value of the correlation coefficient and explain in words what it means. Also find
the value of the coefficient of determination and explain in words what it means.
our sample data sets, we will first take a look at the equation of a line and what it
means.
The equation of a line has the form: Y = a + b X
X and Y are the two variables. X is the independent variable (the predictor) and Y is
the dependent variable (the response). The parameter ‘b’ is the key to the relationship,
and we should look at its value first. It tells us how much Y increases when X increases
by one unit. It is called the slope of the line.
Consider this story about a taxi fare. The amount of a taxi fare (Y) depends on the
length of the journey in km (X). Suppose that the cost per km is €1.50, then b = 1.5.
For every one extra kilometre, the fare increases by €1.50.
But this is not the full story. If you call a taxi, even before you begin the journey, there
is an initial fee of about €4.50. This parameter, ‘a’, is called the intercept. It is the
value of Y when X is zero. You could say that ‘a’ represents how big Y is at the start,
and that ‘b’ represents how fast it grows.
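The taxi story can be written out as a line equation in code; this is a minimal sketch and the function name is mine.

```python
def line(a, b, x):
    """Y = a + b X: intercept a (the value of Y when X is zero) plus slope b per unit of X."""
    return a + b * x

# Taxi fare: initial fee a = 4.50 euro, cost per km b = 1.50 euro
print(line(4.5, 1.5, 0))  # 4.5  (the fare before the journey begins)
print(line(4.5, 1.5, 3))  # 9.0  (the fare for a 3 km journey)
```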
Fig 6.4 Fare versus Distance
Consider another story about a visit to a shop to buy bananas. The price (Y) that you
will pay depends on the weight (X) of the bananas. Bananas cost €2 per kg, so for
every additional kg of bananas that you take, you will pay an extra €2. But, if you
decide to come away with no bananas, you don’t pay anything, so a=0. We say that
the line passes through the origin. Not only is there a linear relationship between X
and Y, but X and Y are directly proportional. This means that if the value of X is
doubled, the value of Y will double also, and any change in X is matched by the same
proportional change in Y.
Fig 6.5 Price versus Weight
The final story concerns paying a television licence fee. You might think that the
television licence fee (Y) would depend on the number of hours per week (X) that you
spend watching television. But this is not so. Every additional hour per week adds
nothing to the licence fee, so b =0. If you have a TV set and never watch it, you must
pay the licence fee of €160. So a=160.
Fig 6.6 Fee versus Hours
Regression Equation
A regression equation is an equation of a line that represents the underlying
relationship between two variables, even though the points are scattered above and
below the line. The idea is that, even though the individual values of Y are scattered,
the mean value of Y, for each different value of X, does lie on the line. The equation
of the population regression line is
Y= α + β X
where β is the increase in the mean value of Y for every one unit increase in X, and α
is the mean value of Y when X is zero. Of course, to find out the values of α and β we
would need to take a census of the entire population, so we will have to be content
with sample estimates. The sample regression equation is
Y=a+bX
where a and b are the sample estimates of α and β respectively.
In this context, the slope, b, is called the regression coefficient, and the intercept, a,
is called the regression constant.
We now take a look at the regression equations for some of our earlier examples.
Fig 6.7 Fitted line plot of Height versus Shoe-Size
The regression coefficient indicates that every one unit increase in shoe-size tends to
be associated with an average increase of 3.677 cm in height. The regression constant
indicates that people who wear a size zero shoe (if such a thing existed) would be
148.2 cm tall, on average. Notice that the regression constant does not always make
sense in isolation (because zero may not be a realistic value for X), but it is always
meaningful as part of the regression equation.
Fig 6.8 Fitted line plot of Price versus Age
The regression coefficient indicates that the average annual depreciation for small
cars is €1340. The regression constant indicates that the mean price at which current
year small cars are offered for resale is €14,191.
Values for ‘a’ and ‘b’ can be found by using your calculator or statistical software.
Prediction
A regression equation can be used to predict an unknown value of Y, by substituting
the corresponding value of X into the equation.
For example, the sample regression equation that relates shoe-size and height is
Y = 148.2 + 3.677 X
or
Height = 148.2 + 3.677 Shoe-Size
Now suppose we find a size 6.5 shoe-print made by an unknown intruder. We can
estimate how tall this person is by substituting 6.5 for X in the equation.
Height = 148.2 + 3.677 × 6.5
= 172 cm
This prediction seems reasonable when we compare it with the heights of people in
the sample who wear size 6 and size 7 shoes. We call this type of prediction
interpolation because the new value of X lies inside the range of the sample data.
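The substitution above is a one-line calculation; a sketch, with a function name of my choosing, using the sample regression equation from the text:

```python
def predict_height(shoe_size):
    """The sample regression equation above: Height = 148.2 + 3.677 * Shoe-Size."""
    return 148.2 + 3.677 * shoe_size

print(round(predict_height(6.5)))  # 172: interpolation, inside the sample range
print(round(predict_height(14)))   # 200: extrapolation, to be treated with caution
```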
If we use the regression equation to predict the height of a person who wears a size
14 shoe, we get a value of 200cm for height. This type of prediction is called
extrapolation, since this new value of X lies outside the range of the sample data.
Extrapolation can lead to predictions that have very large errors, as a result of non-
linearity in the relationship between X and Y. Our prediction assumes that the
underlying relationship between height and shoe-size continues to be linear even
outside the range of the sample data, but there is no way to verify this. The relationship
may actually exhibit curvature rather than linearity, but this is not noticed over the
limited range of the sample data.
The Irish property crash in 2009 is an example of the dangers of extrapolation. The
crash occurred because developers, buyers and lenders assumed that property values
would continue to increase in a linear way, in line with the levels of price inflation seen
in the early years of the millennium. However property prices did not continue to rise,
but fell sharply instead. The values of properties in later years turned out to be much
lower than predicted and in many cases much lower than the values of the mortgages
on those properties, leading to negative equity.
Extrapolation can also lead to predictions that are meaningless, as a result of
discontinuity in the relationship between X and Y. Some shoe manufacturers do not
make a size 14 shoe, so estimating the height of a person who wears such a shoe is
nonsense. It follows that extrapolation must always be used with caution.
Fig 6.9
Problems 6B
#1. A lecturer made a note of the rates of attendance, and the final exam marks, of a
number of students on a science course. Take a look at the fitted line plot and explain
what is meant by b, a, R-Sq(adj) and S.
Fig 6.10 Fitted line plot of Exam Mark (%) versus Attendance %
#2. It may be possible to predict the length of the grass on a suburban lawn, in cm,
using the number of days since mowing as the predictor variable. Study the output
below, explain what is meant by b, a, R-Sq(adj) and S, and predict the length of the
grass on a suburban lawn 20 days after mowing. Comment on your answer.
Fig 6.11 Fitted line plot of Length versus Days
#3. A number of oranges were individually weighed, in grams, and then individually
juiced in a juicer. The volume of juice produced, in ml, was observed in each case.
Would it be possible to use these data to predict the volume of juice that would be
expected from an orange weighing 200 grams? Also, explain what is meant by b, R-
Sq(adj) and S, in this situation.
Fig 6.12 Fitted line plot of Juice versus Weight
#4. Construct a fitted line plot using the journey distance and the journey time of a
sample of fellow students or work colleagues who arrived by car today. Explain what
is meant by b, a, R-Sq(adj) and S, in this situation. Identify a journey distance for some
new individual and estimate the journey time for that person.
If this null hypothesis is accepted, it means that the population regression line may
pass through the origin. This would mean that Y is directly proportional to X, so that
any change in X would be matched by an identical percentage change in Y. If zero is
a meaningful value for X, it also means that the corresponding average value for Y is
zero.
Each hypothesis can be tested using a p-value. The p-value for the hypothesis H0: β
= 0 can be identified by the name of the predictor variable, and it must be tested first.
The p-value for the hypothesis H0: α = 0 will be denoted as the p-value for the constant.
EXAMPLE 1
The Minitab® software output for the shoe-size and height data is as follows.
Term Coef SE Coef T-Value P-Value
Constant 148.19 5.81 25.51 0.000
Shoe-Size 3.68 1.07 3.44 0.014
The p-value for shoe-size is 0.014, which is less than 5%, so we reject the hypothesis
that the population regression line is horizontal. We conclude that shoe-size is a useful
predictor of height.
The p-value for the constant is 0.000, which is less than 5%, so we reject the
hypothesis that the regression line passes through the origin. We conclude that height
is not directly proportional to shoe-size. This makes sense: we know that people who
wear a size 10 shoe are generally taller than people who wear a size 5 shoe, but they
are not twice as tall.
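Each T-value in this output is simply the coefficient divided by its standard error; a sketch using the rounded values printed above (the p-value itself would come from the t distribution with n − 2 degrees of freedom, and n is not shown in this extract):

```python
# T-value = Coef / SE Coef, reproduced from the shoe-size output above
t_constant = 148.19 / 5.81
t_shoe_size = 3.68 / 1.07
print(round(t_constant, 2))   # 25.51
print(round(t_shoe_size, 2))  # 3.44
```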
EXAMPLE 2
The software output for the age and price of cars data is as follows.
Term Coef SE Coef T-Value P-Value
Constant 14191 239 59.28 0.000
Age -1340.4 44.8 -29.89 0.000
The p-value for age is 0.000, which is less than 5%, so we reject the hypothesis that
the population regression line is horizontal. We conclude that age is a useful predictor
of price.
The p-value for the constant is 0.000, which is less than 5%, so we reject the
hypothesis that the regression line passes through the origin. We conclude that price
is not directly proportional to age. This makes sense, because older cars cost less, not
more.
Least Squares
Fig 6.13
represent the errors in Y. These errors (or residuals) are illustrated by the short vertical
line segments on the graph above. The regression line is chosen so that the sum of
squared errors is a minimum and therefore it is called the least squares line.
From this it follows that the least squares regression equation of Y on X is not the
same as the least squares regression equation of X on Y. So if we want to predict
shoe-size from height, we need to calculate a new regression equation, with height as
X and shoe-size as Y.
Every Y value in the sample is assumed to have been drawn at random, therefore the
values of Y must not be selected in advance, although the X values can be preselected
if desired. In a situation where the experimenter has selected the X values (e.g.
concentration), and needs to predict X from Y (e.g. absorbance), the regression
equation of Y on X should first be constructed and then transformed to provide a
formula for X.
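A minimal sketch of the least squares calculation (b = Sxy / Sxx, a = ȳ − b·x̄); the toy data are mine, chosen to lie exactly on a line so the estimates are easy to check.

```python
def least_squares(xs, ys):
    """Least squares estimates: b = Sxy / Sxx and a = ybar - b * xbar."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

# Toy data lying exactly on Y = 2 + 3X, so a = 2 and b = 3 are recovered
a, b = least_squares([1, 2, 3, 4], [5, 8, 11, 14])
print(a, b)  # 2.0 3.0
```

Running the same function with the roles of the two lists swapped gives a different line in general, which is why predicting shoe-size from height needs its own regression equation.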
Assumptions
The first assumption of linear regression is that the linear model is the appropriate
model. This means that for any value of X, the mean value of Y is given by the linear
equation Y = α + βX. In other words, the average values of Y do lie in a straight line.
This assumption can be confirmed by drawing a scatterplot and observing that the
points seem approximately linear without any obvious curvature. Linear regression is
often used in situations where this assumption is only approximately true. In the words
of George Box, “All models are wrong, but some are useful.”
A second assumption is usually made, that the errors are normally distributed. This
means that for each value of X, the population of Y is normal. The normality
assumption can be tested by drawing a histogram of the residuals, which should look
approximately normal.
A third assumption is usually made, that Y has a constant variance that does not
depend on X. This uniform variance assumption can be tested by drawing a plot of
residuals versus fits, which should show the points forming a horizontal belt of points
with roughly uniform vertical width.
Problems 6C
#1. Some output is shown below from a regression analysis of final exam mark on
attendance. Test the two null hypotheses β=0 and α=0, and state your conclusions.
Term Coef SE Coef T-Value P-Value
Constant 12.48 8.03 1.56 0.131
Attendance % 0.412 0.138 2.99 0.006
Also, use the output below to predict the mark of a student with 40% attendance.
Fit SE Fit 95% CI 95% PI
28.9585 3.75129 (21.2862, 36.6307) (-9.19061, 67.1076)
#2. Some output is shown below from a regression analysis of length of grass, in cm,
on number of days since mowing. Test the two null hypotheses β=0 and α=0, and state
your conclusions.
Term Coef SE Coef T-Value P-Value
Constant 2.950 0.298 9.88 0.002
Days 0.2650 0.0450 5.89 0.010
Also, use the output below to predict the average length of grass, 7 days after mowing.
Fit SE Fit 95% CI 95% PI
4.805 0.135 (4.37537, 5.23463) (3.80253, 5.80747)
#3. Some output is shown below from a regression analysis of volume of juice, in ml,
on weight of orange, in grams. Test the two null hypotheses β=0 and α=0, and state
your conclusions.
Term Coef SE Coef T-Value P-Value
Constant 25.4 11.0 2.31 0.147
Weight 0.4704 0.0387 12.15 0.007
Refer to the output below and explain what is meant by the CI and the PI. The new
value of X was 200 grams.
Fit SE Fit 95% CI 95% PI
119.495 4.48726 (100.188, 138.802) (82.6406, 156.350)
#4. (Exercise) Carry out a regression analysis of journey time on journey distance for
a sample of fellow students or work colleagues who arrived by car today. Test the two
null hypotheses β=0 and α=0, and state your conclusions. Find some new individual
who was not in the sample, identify this person’s journey distance, and construct an
interval estimate of this person’s journey time.
Project 6C
Regression Project
Use regression analysis to investigate the relationship between some predictor and
some response, in any application area of your choice. Collect at least 10 data pairs
and write a report under the following headings.
(a) State the purpose of your investigation, and explain how you applied randomisation
to the data collection.
(b) Display the data, a scatter plot, and the regression equation.
(c) Explain what r² (adjusted) means in this situation.
(d) Test the two hypotheses β = 0 and α = 0, and explain what your conclusions mean.
(e) Choose some new value for the predictor variable, and construct a prediction
interval for the corresponding value of the response. Explain what your prediction
interval signifies.
This analysis indicates that both Age and Weight are useful predictors of VO2 Max,
since both p-values are significant.
The regression equation below indicates that VO2Max increases with Age but
decreases with Weight.
Regression Equation
VO2Max = 56.61 + 0.371 Age - 0.1179 Weight
Notice that the p-value of Age differs depending on whether Weight is also present in
the model. The p-value 0.004 indicates that Age is useful on its own as a predictor of
VO2Max. The p-value 0.001 indicates that Age is useful as a predictor of VO2Max,
even if Weight is also available as a predictor.
Also, the coefficient of Age is not the same in the two models. In the simple regression
equation, it signifies that every one unit increase in Age is associated with an increase
of 0.347 in the VO2Max. In the multiple regression equation, it signifies that every one
unit increase in Age is associated with an increase of 0.371 in the VO2Max, provided
that the Weight remains constant.
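Evaluating the multiple regression equation is direct substitution; the athlete's values below (30 years, 70 kg) are hypothetical, chosen only for illustration.

```python
def vo2max(age, weight):
    """The fitted multiple regression equation above:
    VO2Max = 56.61 + 0.371 Age - 0.1179 Weight."""
    return 56.61 + 0.371 * age - 0.1179 * weight

# A hypothetical athlete aged 30 years and weighing 70 kg
print(round(vo2max(30, 70), 3))  # 59.487
```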
Selection of Variables
There may be other variables that could also be used to predict the VO2Max, such as
Height, BMI, Bodyfat, and so on. We are now faced with a choice among a large
number of alternative regression models. We could use any one predictor variable,
any two, any three, etc. How can we choose the best subset of predictors? A simple
criterion is to choose the model that has the highest r²(adjusted) value. Software can
be used to evaluate all possible subsets and to present the best candidates. In the
output below, each different model is represented by a different row, and an X denotes
a variable that is included in that model. The twelve candidate predictors, in column
order, are: Age, Height, Weight, BMI, Skinfold, Bodyfat, RGrip, LGrip, CMJ, Five,
Ten and Twenty.

Response is VO2Max

Vars  R-Sq  R-Sq(adj)  R-Sq(pred)  Mallows Cp  S
1 19.7 17.6 9.6 5.0 2.6626 X
2 28.6 24.9 16.2 2.3 2.5425 X X
3 35.7 30.5 20.1 0.6 2.4453 X X X
4 39.3 32.5 19.9 0.7 2.4095 X X X X
5 41.6 33.3 20.6 1.5 2.3963 X X X X X
6 42.0 31.7 12.5 3.3 2.4239 X X X X X X
7 43.9 32.0 14.6 4.3 2.4195 X X X X X X X
8 44.7 30.9 7.8 5.9 2.4388 X X X X X X X X
9 45.9 30.2 3.1 7.3 2.4508 X X X X X X X X X
10 46.2 28.3 0.0 9.1 2.4834 X X X X X X X X X X
11 46.4 26.1 0.0 11.0 2.5225 X X X X X X X X X X X
12 46.4 23.4 0.0 13.0 2.5669 X X X X X X X X X X X X
Adding more and more predictor variables to the model might seem like a good idea
but there are a number of reasons to be cautious about adding extra predictor
variables.
Firstly, although extra predictor variables will always increase the value of R-squared,
unless the value of R-squared (adjusted) increases, the variable is probably of no use.
A degree of freedom is lost every time a new variable is introduced, so it is better to
use the statistic that is adjusted for degrees of freedom. Therefore, when comparing
models with different numbers of predictor variables, use R-squared (adjusted) and
not simple R-squared.
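The adjustment itself is a short formula; a sketch, in which the sample size n = 40 is my own assumption since it is not stated in this extract.

```python
def adjusted_r_squared(r2, n, p):
    """R-squared adjusted for degrees of freedom: 1 - (1 - r2)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared = 19.7% with p = 1 predictor and an assumed n = 40 observations
print(round(100 * adjusted_r_squared(0.197, 40, 1), 1))  # 17.6
```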
Secondly, if the addition of an extra predictor variable produces only a small
improvement in R-squared (adjusted) then it may be wiser to omit the extra predictor
variable, especially if it is not statistically significant. This approach, where a simpler
model is chosen in preference to a more complex model, is often referred to as
Occam's razor or the law of parsimony.
Thirdly, we should be wary of using any pair of predictor variables that are highly
correlated with each other. This is called multicollinearity and it tends to inflate the
errors in the coefficient estimates. For example, if r > 0.9 then the variance inflation
factor, VIF, exceeds 5, which can be regarded as a threshold for a tolerable VIF value.
This indicates that the variance of the coefficient estimate for this predictor is more
than five times greater as a result of its correlation with other predictors.
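The figure quoted can be checked from the usual formula VIF = 1/(1 − r²); a minimal sketch:

```python
def vif_from_r(r):
    """Variance inflation factor for a predictor with correlation r with the other predictors."""
    return 1 / (1 - r ** 2)

print(round(vif_from_r(0.9), 2))  # 5.26, just over the tolerable threshold of 5
```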
Fourthly, whenever you select a large number of predictors, there is a possibility of
overfitting the model, i.e. including some predictor variables that are merely
explaining the random variation in the sample rather than the underlying relationship
in the population. A low value for R-squared (predicted) provides a warning that
overfitting has occurred. The value of R-squared (predicted) is obtained by omitting
the observations one at a time, and using the remaining observations to predict the
missing observation. So R-squared (predicted) gives an idea of how good the model
is at predicting new values of the response. It is also helpful to validate a model by
applying it to a fresh set of data.
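The leave-one-out idea behind R-squared (predicted) can be sketched for a simple linear regression; the function names and the toy data are mine.

```python
def fit_line(xs, ys):
    """Least squares fit of y = a + b x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def r_squared_predicted(xs, ys):
    """Omit each observation in turn, refit, and predict the omitted point (PRESS)."""
    press = 0.0
    for i in range(len(xs)):
        xr, yr = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        a, b = fit_line(xr, yr)
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / len(ys)
    sstot = sum((y - my) ** 2 for y in ys)
    return 1 - press / sstot

# Exactly linear data is predicted perfectly, so R-squared (predicted) is 1
print(r_squared_predicted([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # 1.0
```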
Fifthly, a Mallows' Cp value that is close to p+1, where p is the number of predictors,
suggests a good balance between a biased model with too few predictors and an
imprecise model with too many predictors.
If you already have historical data from some process, best-subsets analysis is a
simple way to trawl through the data for clues about which variables are useful for
predicting some important response. But perhaps the most important step is to identify
all the potentially relevant variables at the beginning and to make sure to include all
these variables among the observations. For example, in the best-subsets analysis
above, the athlete’s position was not included as a predictor, but this variable could be
relevant.
Non-Linear Regression
Sometimes a scatterplot reveals a curvilinear relationship between two variables.
EXAMPLE Different doses of a drug were administered and the responses were
observed in each case.
Fig 6.14 Scatterplot of Response versus Dose
The curvature is visually obvious. But even if we went ahead and naïvely fitted a
simple linear regression model, the Lack-of-Fit of the model would come to our
attention.
Fig 6.15 Response versus Dose, with a fitted straight line
Look at the data points for a dose of two units. There are two replicate measurements
of response and these are not identical, because the same dose will not always elicit
the same response. We see this feature in every regression plot and it is called Pure
Error. But there is another source of error present. It seems obvious that the mean
response for a dose of two units does not lie on the line of best fit. This is called Lack-
of-Fit and it indicates that the model used is inappropriate. The pure error measures
how much the replicates differ from each other, while the lack-of-fit error measures
how much the measurements differ from the fitted values. The software can test if the
lack-of-fit error is significantly greater than the pure error and provide a p-value to test
the null hypothesis that the model is a good fit to the data.
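The partition of the residual variation can be sketched as follows; the replicated doses and the deliberately poor straight-line fit y = x are my own toy example, not the book's data.

```python
def error_partition(xs, ys, fitted):
    """Split residual variation into pure error (replicates about their group mean)
    and lack-of-fit (group means about the fitted values)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    pure, lack = 0.0, 0.0
    for x, yvals in groups.items():
        gmean = sum(yvals) / len(yvals)
        pure += sum((y - gmean) ** 2 for y in yvals)
        lack += len(yvals) * (gmean - fitted(x)) ** 2
    return pure, lack

# Replicated doses; the fit y = x misses the mean response at x = 4
xs = [2, 2, 4, 4]
ys = [1.9, 2.1, 4.4, 4.6]
pure, lack = error_partition(xs, ys, lambda x: x)
print(round(pure, 3), round(lack, 3))  # 0.04 0.5
```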
Fig 6.16 Non-Linear Regression
Response = -0.0200 + 0.3618 Dose - 0.01161 Dose^2
S = 0.136684, R-Sq = 97.1%, R-Sq(adj) = 96.5%
The second approach is to apply a mathematical transformation which will have the
effect of straightening out the curve. To begin with, the X variable is simply X itself, or
X¹. This is called the identity transformation. We can stretch the X-axis by raising the
power to X² or X³. Or we can shrink the X-axis by using √X or log X or X⁻¹ or X⁻². The
further the power is raised or lowered, the more stretching or shrinking occurs.
Fig 6.17: scatterplot of Response versus log Dose, with the fitted straight line.
We are now dealing with a familiar linear regression, and we can use all the usual
techniques such as prediction intervals. Just be careful to remember that X now
represents the log of the dose and not the dose itself, so X = 1 corresponds to a dose
of 10 units.
If you try a transformation that bends the curve too far, or not far enough, just try again.
If you have a good knowledge of the underlying process, then you may be able to
identify the appropriate transformation even before looking at the data. For example,
if X is the diameter of an orange and Y is its weight, then X³ is probably the correct
transformation.
The Y variable could be transformed instead of the X variable. For a Y variable that
grows at an increasing rate with increasing X, and that also exhibits greater random
variation at large Y values, it can sometimes happen that a logarithmic transformation
of Y can provide the double benefit of both linearising the relationship and stabilising
the variance.
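The transformation idea can be sketched in Python. The data below are hypothetical and constructed to follow an exact log relationship, so the straight-line fit on the transformed predictor recovers it exactly.

```python
# Sketch: straightening a curved relationship by regressing the response
# on log10(dose) rather than dose itself (hypothetical, exactly-log data).
import numpy as np

dose = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
resp = 0.5 + 2.0 * np.log10(dose)           # constructed log relationship

x = np.log10(dose)                          # the transformed predictor
slope, intercept = np.polyfit(x, resp, 1)   # an ordinary straight-line fit now works

# Remember: x = 1 corresponds to a dose of 10 units, not a dose of 1 unit
pred_at_dose_10 = intercept + slope * 1.0
print(round(pred_at_dose_10, 2))            # 2.5
```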
Problems 6D
#1. The output below shows a regression equation to predict the drying time of glue
using the moisture content of the glue and the relative humidity of the atmosphere as
predictors. Explain in words what the two coefficients mean. Time is measured in
minutes, and moisture content and relative humidity are both measured in %.
Regression Equation
Time = -15.9 + 2.17 Moisture Content + 0.975 Relative Humidity
#2. Study the output below and identify the best subset of predictors for the asking
price of a small second-hand car, from among the available predictors: age, kilometres
and engine size.
Best Subsets Regression: Price versus Age, Kilometers, Engine
Response is Price
(An X at the right of each row marks which of the predictors Age, Kilometers and
Engine, in that order, are included in that model.)

Vars   R-Sq  R-Sq(adj)  R-Sq(pred)  Mallows Cp       S
1 99.1 99.0 98.5 1.6 407.25 X
1 78.9 76.2 66.5 175.2 1987.1 X
2 99.2 99.0 98.3 2.7 407.22 X X
2 99.1 98.9 94.8 3.5 431.06 X X
3 99.3 99.0 95.0 4.0 417.58 X X X
construct a regression model with shrinkage as the response and minutes as the
predictor, how would you proceed?
Fig 6.18: scatterplot of Shrinkage versus Minutes.
7
Design of Experiments
Having completed this chapter you will be able to:
– design experiments involving one or more factors;
– analyse experimental data using ANOVA tables;
– communicate experimental findings and identify improvement actions.
The purpose of an experiment is to discover information. This information may provide
insight leading to action that will improve a process. An experiment must be designed
properly so that it will capture the right kind of information, capture a sufficient amount
of information, and not mix this information up with something else.
Typically an experiment will consider the effect of a factor on a response. For
example, we might want to investigate the effect of different players on the distance a
football is kicked. The response (distance) is what we measure in the experiment. The
factor (player) is what we change, to see how it will affect the response. In the
experiment, a number of different players (factor levels) are studied. Notice that we
are not just watching a game as spectators: we decide who kicks the ball, in what
order, and how often. This is an experiment, and not an observational study. In an
observational study we become familiar with the behaviour of a process, and so we
can describe or predict its behaviour. For example, we might notice through
observation that the goalkeeper kicks the ball further than anyone else, but this may
be due to the position on the field, and not the ability, of the goalkeeper. In contrast,
design of experiments (DoE) sets out to identify causes, and this may enable us to
change the behaviour of the process.
that cannot be identified or eliminated, such as the exact state of the weather, the
grass, the ball, the player’s physical and mental condition, etc. Later on we will
consider exactly how many replicates are required: for now we will simply note that
one observation is not enough.
When someone carries out a new task, their performance will change as they improve
with practice, and eventually they will settle down at some steady level of performance.
This is called the learning effect and it should not be confused with random variation.
Random variation refers only to unexplained variation, and not to explained
variation. Therefore, if an experiment aims to study people as they perform some new
task, they should be allowed a sufficient number of practice runs before the experiment
begins, in order to eliminate the learning effect.
Randomisation
Where an experiment consists of a series of runs that are performed one after another,
the runs must be performed in a random run order. We do not allow one player to
perform all their kicks first, with another player taking all their kicks at the end of the
experiment. Instead, we make a list of all the runs to be performed and then we use a
random-number generator to determine the run order. The reason for this is that there
may be some progressive change in the environmental conditions, such as the wind
direction, or the wind speed, or the ambient temperature, or the emotional climate.
This could confer an advantage on the player who goes last, if the environmental
conditions later on in the experiment are more favourable for achieving long distances.
In general, anything in the experimental environment that changes over time could
affect the responses, and this could affect one factor level more than others if all the
runs of that factor level were scheduled one after another or at regular intervals.
Randomisation addresses this concern. Note that randomisation does not guarantee
rest breaks at appropriate intervals, so rest breaks should be scheduled as needed.
Sometimes, every measurement in an experiment is carried out on a different
experimental unit, e.g. a different football for every kick. In such cases, there must
be random allocation of the experimental units (footballs) to the different factor levels
(players). We do not give the first footballs we find to one player, because the first
ones may be newer or heavier or different in some other way.
Sometimes an experiment involves selecting samples from different populations in
order to compare the populations. In such cases, random selection must be used to
determine which units will be selected. For example, to investigate if the average tablet
weight differs among a number of batches, a number of tablets must be randomly
selected from each batch.
Blocking
Sometimes it is impossible to hold everything constant in an experiment. Suppose the
number of replicates is so great that we require two days to complete the experiment.
There could be some difference between the two days that would affect the responses.
Then instead of fully randomising the runs across the two days, we regard each day
as a block, and within each block we randomly allocate an equal number of runs to
each factor level. So if each player has to perform 30 kicks, they could each do 20
kicks on the first day and 10 kicks on the second day. A block is a subset of the
experimental units, which are believed to be similar in responsiveness, but which may
differ in responsiveness from the units in another block. Typical blocks are: in
manufacturing, batches; in clinical trials, gender; in measurement systems, days. We
will not mention blocks again until we come to deal with two-factor experiments, where
the blocks can be treated as the second factor in the experiment.
In the table below, a column of random numbers was generated, and these were ranked
to determine the run order. When software is used to create the design, the entire
table is usually sorted into run order and the random numbers are not displayed.
Table 7.1
Player Random No. Run order Distance
Gareth 0.368843 5 45
Jessy 0.738684 9 48
Eden 0.474818 7 43
Gareth 0.928478 12 42
Jessy 0.534303 8 50
Eden 0.212354 2 41
Gareth 0.882270 10 46
Jessy 0.437575 6 47
Eden 0.159179 1 43
Gareth 0.890182 11 48
Jessy 0.248853 3 52
Eden 0.272851 4 48
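The ranking procedure behind Table 7.1 can be sketched in Python. The seed and player names are just for illustration; any random-number generator would do.

```python
# Sketch of how a run order like Table 7.1 can be generated: attach a
# random number to every planned run, then rank the random numbers.
import random

random.seed(7)                                # any seed; fixed here for repeatability
players = ["Gareth", "Jessy", "Eden"] * 4     # 3 factor levels x 4 replicates = 12 runs
rand = [random.random() for _ in players]

# The rank of each random number (1 = smallest) is that run's position
order = [sorted(rand).index(r) + 1 for r in rand]

# The schedule, sorted into the order in which the runs are performed
schedule = sorted(zip(order, players))
print(schedule[0])                            # the first run to perform
```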
The question of interest is: does the player have an effect on the distance? In
general, in a single-factor experiment, the question of interest is: does the factor
have an effect on the response?
Yij = μ + αi + εij
ANOVA by Hand
We will now analyse the data from the experiment using Analysis of Variance
(acronym ANOVA). As its name suggests, ANOVA involves looking at the data for
evidence of every alleged source of variation. Although ANOVA is usually performed
by software, it can help your understanding to work through the calculations. We begin
by unstacking the data.
Table 7.2
Distance Gareth Distance Jessy Distance Eden
45 48 43
42 50 41
46 47 43
48 52 48
Step 1 We look at the error variance first, i.e. how different are the distances when the
player is not changed? The three sample variances 2.50², 2.217² and 2.986² all
attempt to answer this question. Each of these variances is an estimate of the error
variance, so it makes sense to combine them into a single ‘pooled’ estimate, by
averaging them. The pooled estimate is (2.50² + 2.217² + 2.986²) / 3 = 6.69, with the
symbol Sw². This is the pooled within-samples variance, also called the error variance. Each
of the individual estimates is based on 4-1 degrees of freedom, so the pooled estimate
has 3(4-1) = 9 degrees of freedom.
Step 2 Next we look at variation due to the factor, i.e. how different are the distances
when the players are changed? We can estimate this by calculating the variance
between the sample means. This between-samples variance is denoted by
Sb². It is based on 3-1 = 2 degrees of freedom. Sb² = 2.843² = 8.083. This figure needs
to be adjusted because it is a variance of sample means, not a variance of individual
values. Individual values vary more than sample means, by a factor n, where n is the
sample size. Therefore, we multiply Sb² by 4, in order to estimate the variance between
individual distances for different players. We require n × Sb² = 4 × 8.083 = 32.33.
Step 3 Remember that the question of interest is: does the factor have an effect on the
response? Maybe not. Perhaps the apparent variation due to the factor is just more
random variation. We can test this hypothesis by comparing the variance due to the
factor with the variance due to error. To see if it is significantly bigger, we compute the
variance ratio, F, which is named after Sir Ronald Fisher who developed ANOVA. In
this case F = 32.33 / 6.69 = 4.83.
Step 4 Finally, we consider whether the calculated value of F is so unusual that it could
not have occurred by chance. From the tables of the F distribution, we see that the
critical value of F, at the 5% level, having 2 and 9 degrees of freedom, is 4.256. The
calculated value of F exceeds this value, and therefore its p-value is less than 5%. We
reject the null hypothesis, which states that the player has no effect on the distance.
We say that the result is significant, i.e. the disagreement between the data and the
null hypothesis is unlikely to have occurred by chance. The data assert, at the 5%
level, that player does affect distance. Jessy kicks the ball further than the other
players.
The ANOVA calculations are summarised in the formula below.
Formula 7.1
One-Way ANOVA
H0: The factor has no effect on the response.
For k factor levels with n replications each, i.e. k samples of size n:
F = n × Sb² / Sw²
Sb² is the ‘between-samples’ variance
Sw² is the pooled ‘within-samples’ variance
df = k-1, k(n-1)
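Steps 1 to 4 can be reproduced in Python for the data of Table 7.2; scipy's f_oneway gives the same answer in a single call.

```python
# Steps 1-4 above reproduced in Python for the kicking-distance data.
import numpy as np
from scipy import stats

gareth = np.array([45, 42, 46, 48])
jessy = np.array([48, 50, 47, 52])
eden = np.array([43, 41, 43, 48])
samples = [gareth, jessy, eden]
k, n = len(samples), len(gareth)

# Step 1: pooled within-samples (error) variance
sw2 = np.mean([s.var(ddof=1) for s in samples])        # 6.69

# Step 2: between-samples variance, scaled up by the sample size n
sb2 = np.var([s.mean() for s in samples], ddof=1)      # 8.083
f_stat = n * sb2 / sw2                                 # Step 3: F = 4.83

# Step 4: p-value from the F distribution with k-1 and k(n-1) df
p = stats.f.sf(f_stat, k - 1, k * (n - 1))             # 0.038

# The same analysis in a single call
f_check, p_check = stats.f_oneway(gareth, jessy, eden)
print(round(f_stat, 2), round(p, 3))                   # 4.83 0.038
```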
ANOVA by Software
Software will produce an ANOVA table which always looks like this.
One-way ANOVA: Distance versus Player
Source DF Adj SS Adj MS F-Value P-Value
Player 2 64.67 32.333 4.83 0.038
Error 9 60.25 6.694
Total 11 124.92
The p-value is less than 5%, showing that the data were unlikely to arise on the basis
of the assumption that player has no effect on distance. We therefore reject this null
hypothesis in favour of the alternative: player does have an effect on distance.
Every ANOVA table has the same headings: Source (of variation), DF (degrees of
freedom), SS (‘sum of squares’ of the deviations), MS (‘mean square’ deviation, i.e.
variance), F (variance ratio), and P (the probability of obtaining the data if the factor
has no effect on the response).
ANOVA Assumptions
ANOVA is based on three assumptions.
1. The errors are independent. This means that if one of Gareth’s kicks is a short one
for Gareth, there is no reason to expect that his next kick will also be short. Gareth is
not being disadvantaged in any way, and neither is any other player. That is, the
experiment is free from systematic bias. Randomisation will tend to ensure that this is
true.
2. The errors are normally distributed. The population of Gareth’s distances is
normally distributed, and the same can be said for the other players’ distances. We
rely on this assumption when we calculate F, because the F statistic assumes that we
are dealing with normal populations.
3. The errors have a uniform variance, i.e. the error variance does not depend on the
factor level. Even though one player may kick the ball further on average than some
other player, the variation in distances is the same for all players. We rely on this
assumption when we calculate the pooled variance estimate, because by pooling the
sample variances we assume that they are estimating the same quantity.
The first of these assumptions must be satisfied, since systematic bias will render the
experimental results untrustworthy. The second and third assumptions are less
important and ANOVA will still work well with mild non-normality or unequal variances.
For this reason, we say that ANOVA is robust.
The ANOVA assumptions can be tested by residual analysis. First of all, calculate
the estimated fitted values (μ + α) and residuals (the errors, ε). This can be easily done
with software. Then construct a histogram of the residuals: this should look
approximately normal (assumption 2). Also draw a plot of residuals versus fits: this
should show the points forming a horizontal belt of roughly uniform vertical width
(assumption 3). The residuals can also be plotted against the run order to confirm that
there was no time-related pattern.
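As a sketch, the fits and residuals for the kicking-distance data can be computed directly, since in a one-way design the fitted value for an observation is simply the mean of its own factor level's sample.

```python
# Sketch of the residual calculation: in a one-way design the fitted value
# for an observation is the mean of its own factor level's sample.
import numpy as np

distance = np.array([45, 42, 46, 48, 48, 50, 47, 52, 43, 41, 43, 48], dtype=float)
player = np.array(["Gareth"] * 4 + ["Jessy"] * 4 + ["Eden"] * 4)

fits = np.array([distance[player == p].mean() for p in player])
residuals = distance - fits

# The residuals sum to zero within every factor level
print(abs(residuals.sum()) < 1e-9)    # True
```

A histogram of `residuals`, and a plot of `residuals` against `fits`, can then be drawn with any plotting library to check assumptions 2 and 3.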
Problems 7A
#1. Jody carried out an experiment to investigate whether the colour of a birthday
candle affects the burning time, measured in seconds. The output is shown below.
One-way ANOVA: Time versus Colour
Source DF Adj SS Adj MS F-Value P-Value
Colour 2 4594 2297 0.58 0.586
Error 6 23571 3929
Total 8 28165
(a) Explain what each term in the model represents in this situation.
(b) State and test the null hypothesis, and state your conclusions in simple language.
#2. Anne carried out an experiment to investigate whether the supermarket of origin
affects the weights of clementines (measured in grams). The output is shown below.
One-way ANOVA: Weight versus Supermarket
Source DF Adj SS Adj MS F-Value P-Value
Supermarket 1 3385.6 3385.6 27.75 0.001
Error 8 976.0 122.0
Total 9 4361.6
(a) Explain what each term in the model represents in this situation.
(b) State and test the null hypothesis, and state your conclusions in simple language.
#3. (Exercise) Design and carry out an experiment to investigate the effect of the
design of a paper airplane on its distance travelled. Five airplanes should be made
using each of two different designs. One person should throw all the airplanes, in
random order. If the airplanes are thrown over a tiled floor, the distance can be
measured by counting the number of tiles. Analyse the data using one-way ANOVA.
Project 7A
Single-Factor Experiment
Design a single-factor experiment, carry it out and analyse the results. You may
choose any area of application: a business or industrial process, an area of academic
interest, a sporting activity, or everyday life. Make sure that you have the authority to
set the factor levels, and that you have a suitable instrument for measuring the
response. Your report must consist of the following sections.
(a) State the purpose of your experiment and explain the role of randomisation in
your experiment.
(b) Write a model for the response in your experiment and carefully explain what
each term in the model represents.
(c) Display the data and the ANOVA table.
(d) Express the experimental findings in simple language.
This is referred to as a crossed design because every row crosses every column,
showing that every level of one factor is combined with every level of the other factor,
i.e. every driver drives every vehicle.
The number of entries in each cell indicates the level of replication. This experiment is
not fully satisfactory because there is only one observation per cell, but it will be
possible to carry out a limited analysis.
The data must be presented to the software in a proper structure, with a column for
every variable, and a row for every case, as shown below.
Table 7.5
Driver Vehicle Fuel Economy
Jeremy Hatchback 5.2
Jeremy Saloon 5.5
Jeremy SUV 7.1
Ralph Hatchback 5.7
Ralph Saloon 5.4
Ralph SUV 6.9
The ANOVA table has the usual headings but has two sources of explained
variation, namely driver and vehicle.
ANOVA: Fuel Economy versus Driver, Vehicle
Source DF SS MS F P
Driver 1 0.0067 0.0067 0.09 0.789
Vehicle 2 3.2033 1.6017 22.35 0.043
Error 2 0.1433 0.0717
Total 5 3.3533
Notice that the p-value for driver is greater than 0.05, but the p-value for vehicle is less
than 0.05. We accept the null hypothesis that driver has no effect on fuel economy,
but we reject the null hypothesis that vehicle has no effect on fuel economy. SUVs use
more fuel.
In this experiment, we may not have been interested in the driver effect. We may have
included two drivers in the experiment simply because there was not enough time for
one driver to perform all of the experimental runs. If so, the drivers are the blocks. A
block is like an extra factor that we include in an experiment because we cannot avoid
doing so.
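The fuel-economy ANOVA table above can be reproduced by hand with numpy, from the row (driver), column (vehicle) and leftover sums of squares.

```python
# The fuel-economy ANOVA reproduced by hand. Rows are drivers (Jeremy,
# Ralph); columns are vehicles (Hatchback, Saloon, SUV).
import numpy as np
from scipy import stats

y = np.array([[5.2, 5.5, 7.1],
              [5.7, 5.4, 6.9]])
a, b = y.shape                                             # 2 drivers, 3 vehicles
grand = y.mean()

ss_driver = b * np.sum((y.mean(axis=1) - grand) ** 2)      # 0.0067
ss_vehicle = a * np.sum((y.mean(axis=0) - grand) ** 2)     # 3.2033
ss_total = np.sum((y - grand) ** 2)                        # 3.3533
ss_error = ss_total - ss_driver - ss_vehicle               # 0.1433

df_error = (a - 1) * (b - 1)
f_vehicle = (ss_vehicle / (b - 1)) / (ss_error / df_error)
p_vehicle = stats.f.sf(f_vehicle, b - 1, df_error)
print(round(f_vehicle, 2), round(p_vehicle, 3))            # 22.35 0.043
```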
Interaction
We now consider a full two-factor experiment with replication. The experiment below
considers the effect of person and language on the time, in seconds, required to read
a page from a novel. There are three levels of person (Samuel, Alice and Manuel) and
two levels of language (English and Spanish).
Table 7.6
Samuel Alice Manuel
English 159 157 326
163 153 307
Spanish 319 306 160
302 312 152
In this experiment, there is replication in the cells: each person reads more than one
page in each language, so we can do a full analysis.
Looking at the data in the table, you will notice that the situation is not straightforward.
The question, ‘Who takes longer to read?’ does not have a simple answer. The answer
depends on the language. This is referred to as an interaction. An interaction means
that the effect of one factor depends on the level of some other factor. To put it another
way, interaction is present when the effect of the combination of factor levels is not
the same as what you would expect to get by adding the effects of the factors on their
own.
The word ‘interaction’ should be used with care. It always refers to the effect of a
combination of factor levels. It should never be used to refer to the effect of a single
factor on its own. The phrase ‘main effect’ can be used to refer to the effect of a single
factor on its own.
In generic terms:
Yijk = μ + αi + βj + ηij + εijk
Y is an individual response in the table: it is observation number k in the cell located
in row i and column j.
μ is the grand average response, averaged over all rows and columns.
α is the row effect, i.e. how much the row average exceeds the grand average.
β is the column effect, i.e. how much the column average exceeds the grand
average.
η is the interaction effect, i.e. how much the cell average exceeds what is expected.
Based on the additive model we would expect the cell average to be μ + α + β.
ε is the error, i.e. how much an individual observation exceeds the cell average.
In relation to this particular experiment:
Y is the reading time for language i and person j on occasion k.
μ is the grand average reading time, averaged over all languages and persons.
α is the language main effect, i.e. how much more time on average is required for
that language.
β is the person main effect, i.e. how much more time on average is required by that
person.
η is the interaction effect, i.e. how much more time on average is required for that
particular language-person combination, compared to what would be expected.
ε is the error, i.e. how much more time was taken on that occasion, compared to the
average time for that language-person combination.
There are three null hypotheses of interest, namely:
every η = 0, i.e. there is no interaction, i.e. there are no unusual cells in the table, i.e.
no particular factor-level combination gives an unusual response, i.e. no particular
person reading any particular language takes an unusual amount of time.
every α = 0, i.e. there is no main effect due to factor 1, i.e. there are no unusual rows
in the table, i.e. language has no effect on reading time.
every β = 0, i.e. there is no main effect due to factor 2, i.e. there are no unusual
columns in the table, i.e. person has no effect on reading time.
The data should be properly structured like this for analysis.
Table 7.7
Language Person Time
English Samuel 159
English Samuel 163
Spanish Samuel 319
Spanish Samuel 302
English Alice 157
English Alice 153
Spanish Alice 306
Spanish Alice 312
English Manuel 326
English Manuel 307
Spanish Manuel 160
Spanish Manuel 152
The ANOVA table has three sources of explained variation, namely language,
person and interaction. The interaction term can be denoted language*person.
ANOVA: Time versus Language, Person
Source DF SS MS F P
Language 1 6816 6816 104.60 0.000
Person 2 43 22 0.33 0.730
Language*Person 2 65010 32505 498.80 0.000
Error 6 391 65
Total 11 72261
The interaction p-value must always be investigated first. In this case, the interaction
p-value is less than 0.05, so we conclude that there is an interaction effect. In order
to find out about the details of the interaction we draw an interaction plot.
Fig 7.1
We can see that for Manuel, switching from English to Spanish reduces the time. But
for Samuel or Alice, switching from English to Spanish increases the time.
If there is no significant evidence of interaction in an ANOVA table, then we simply go
ahead and interpret the p-values for the main effects in the usual way.
Different kinds of interaction could arise. In the example above, some people took
longer to read in English and others took longer in Spanish. With a different set of
people, it could happen that all of them require extra time to read in Spanish, but the
amount of extra time required could vary from person to person. This is also classed
as interaction, because the effect of the combination is not obtained by adding together
the main effects.
If everyone requires the same amount of extra time to read in Spanish, that is not
interaction: it is a language main effect. An interaction is like an allergic reaction that
occurs with a particular person-food combination. In general, that person is healthy. In
general, that food does not cause problems. The unusual effect is observed only when
the two combine. In the same way, we say that interaction is present if some factor-
level combination gives rise to an unexpected average response.
It should now be apparent that a major advantage of two-factor experiments is that
interactions can be discovered. In single-factor experiments, interactions cannot be
discovered.
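As a sketch, the interaction effects η for the reading-time data can be computed directly from the cell means; squaring, summing and scaling them recovers the interaction sum of squares in the ANOVA table.

```python
# Sketch: interaction effects for the reading-time data, from cell means.
# cells[i, j, k] is language i (English, Spanish), person j (Samuel, Alice,
# Manuel), replicate k.
import numpy as np

cells = np.array([[[159, 163], [157, 153], [326, 307]],
                  [[319, 302], [306, 312], [160, 152]]], dtype=float)

cell_mean = cells.mean(axis=2)
grand = cells.mean()
alpha = cell_mean.mean(axis=1) - grand          # language main effects
beta = cell_mean.mean(axis=0) - grand           # person main effects

# Interaction effect of a cell: its mean minus the additive prediction
eta = cell_mean - (grand + alpha[:, None] + beta[None, :])

# Squaring, summing and scaling by the replication recovers the SS above
reps = cells.shape[2]
ss_interaction = reps * np.sum(eta ** 2)
print(round(ss_interaction))                    # 65010
```

Manuel's cells have large η of opposite sign to everyone else's, which is exactly what the interaction plot displays.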
Problems 7B
#1. Justin carried out an experiment to investigate the effect of river and bait on the
number of fish caught in a day. The ANOVA table and interaction plot are shown
below.
(a) Explain what each term in the model represents in this situation.
(b) State and test the null hypotheses, and state your conclusions in simple
language.
ANOVA: Fish versus River, Bait
Source DF SS MS F P
River 1 256.00 256.00 23.91 0.000
Bait 1 240.25 240.25 22.44 0.000
River*Bait 1 25.00 25.00 2.33 0.152
Error 12 128.50 10.71
Total 15 649.75
Fig 7.2
#2. Darragh carried out an experiment to investigate the effect of guitar and method
on the time required to play a scale. The ANOVA table and interaction plot are shown
below.
(a) Explain what each term in the model represents in this situation.
(b) State and test the null hypotheses, and state your conclusions in simple
language.
ANOVA: Time versus Guitar, Method
Source DF SS MS F P
Guitar 1 0.0000 0.0000 0.00 1.000
Method 1 12.0000 12.0000 13.09 0.007
Guitar*Method 1 16.3333 16.3333 17.82 0.003
Error 8 7.3333 0.9167
Total 11 35.6667
Fig 7.3
#3. Ken carried out an experiment to investigate the effect of person and paddle on
the time taken to travel 500 m in a kayak. The ANOVA table and interaction plot are
shown below.
(a) Explain what each term in the model represents in this situation.
(b) State and test the null hypotheses, and state your conclusions in simple
language.
ANOVA: Time versus Girl, Paddle
Source DF SS MS F P
Girl 1 0.394 0.394 0.09 0.767
Paddle 1 48.686 48.686 11.31 0.006
Girl*Paddle 1 1.328 1.328 0.31 0.589
Error 12 51.646 4.304
Total 15 102.053
Fig 7.4
#4. (Exercise) Working with a colleague, design and carry out a two-factor experiment
to investigate the effects of person (two levels: you and your colleague) and hand (two
levels: left and right) on writing speed. Allow three replicates. Measure the writing
speed by recording how long it takes to write the following sentence: ‘The quick brown
fox jumps over the lazy dog.’ Analyse the data using an ANOVA table, illustrate the
results on an interaction plot, and share your conclusions with others who have also
performed this exercise.
Project 7B
Two-Factor Experiment
Design a two-factor experiment with replication in any application area of your choice,
carry it out, analyse the results, and write a report consisting of the following sections.
(a) State the purpose of your experiment and explain why randomisation was
necessary in your experiment.
(b) Write a model for the response in your experiment and carefully explain what
each term in the model represents.
(c) Display the data and the ANOVA table.
(d) Show one interaction plot.
(e) Express the experimental findings in simple language.
Covariates
All of the factors that we have considered in this chapter so far have been categorical
factors. But we may wish to include a continuous variable as a factor also, as we
have done already in regression analysis. We are now combining ANOVA with
regression, and this use of the general linear model is often referred to as analysis of
covariance or ANCOVA. The continuous factor is referred to as a covariate.
Let us say that the response of interest is the VO2Max of athletes. VO2Max may be
related to age, which is a continuous variable. VO2Max may also be related to position,
which is a categorical factor. It would be unwise to investigate the effect of position on
VO2Max without correcting for age, because athletes who play in a certain position,
say goalkeepers, may be younger than athletes in other positions. The general linear
model allows us to simultaneously consider these two sources of explained variation.
The output from a general linear model will provide p-values for both age and position,
a coefficient for age, and constants for each level of position.
General Linear Model: VO2Max versus Age, Position
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Age 1 14.56 14.558 13.21 0.001
Position 3 236.82 78.940 71.65 0.000
Error 36 39.66 1.102
Lack-of-Fit 24 27.49 1.146 1.13 0.428
Pure Error 12 12.17 1.014
Total 40 344.19
Model Summary
S R-sq R-sq(adj) R-sq(pred)
1.04962 88.48% 87.20% 84.66%
Regression Equation
Position
D VO2Max = 50.88 + 0.1685 Age
F VO2Max = 54.64 + 0.1685 Age
G VO2Max = 46.41 + 0.1685 Age
M VO2Max = 52.95 + 0.1685 Age
This approach is used in Stability Studies to model the degradation of drugs over
time, in order to determine appropriate expiry dates. Batch is the categorical factor,
time is the covariate, and the response is the remaining percentage of active
pharmaceutical ingredient.
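A minimal sketch of such a general linear model fit by ordinary least squares: one dummy column per factor level plus the covariate. The ages, positions, constants and slope below are all invented, constructed so the model holds exactly and the fit recovers them.

```python
# Sketch of a general linear model (ANCOVA) fit by least squares, on
# hypothetical exact data: position constants plus a shared age slope.
import numpy as np

age = np.array([20.0, 25.0, 30.0, 22.0, 28.0, 24.0])
position = np.array(["G", "G", "D", "D", "F", "F"])
consts = {"G": 46.4, "D": 50.9, "F": 54.6}          # hypothetical constants
vo2max = np.array([consts[p] for p in position]) + 0.17 * age

# Design matrix: one dummy column per position level, plus the age covariate
levels = ["G", "D", "F"]
X = np.column_stack([(position == lv).astype(float) for lv in levels] + [age])

coef, *_ = np.linalg.lstsq(X, vo2max, rcond=None)
print(np.round(coef, 2))    # the three position constants, then the age slope
```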
The distinction between fixed and random factors is quite similar to the distinction
between descriptive and inferential statistics. If we are interested only in the factor
levels before us, then we have a fixed factor. But if the factor levels in the experiment
are merely a random sample from a larger population of factor levels, then we have a
random factor. When blocks are included in an experimental design, the blocks are
usually random factors because we have no particular interest in the individual levels
of the block, e.g. the different batches or days.
Nested Designs
In certain situations, although two factors are included in an experiment, the factors
cannot be cross-classified because of the relationship between them. For example,
suppose we are interested to know whether the length of laurel leaves varies from tree
to tree, and also from branch to branch within each tree. We may include four random
trees in our study, and four random branches from each tree. The branches could be
numbered 1, 2, 3, and 4, on each tree. But branch number 1 on tree number 1 is not
the same as branch number 1 on tree number 2, because the branches are numbered
within each tree. We say that the factor ‘branch’ is nested within the factor ‘tree’, and
such a design is said to be a nested design or hierarchic design.
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Tree 3 1168 389.42 0.45 0.719
Branch(Tree) 12 10280 856.69 11.13 0.000
Error 16 1231 76.94
Total 31 12680
The p-value for branch is 0.000, indicating that there are significant differences
between branches. The p-value for tree is 0.719, indicating that there are no significant
differences between trees. The F value for tree is obtained by dividing by the mean-
square for branch, not by the mean-square for error, because we are asking if there
are differences between trees that are not explained by the differences between
branches. In general, when considering any factor, we compare its mean-square with
the mean-square of the factor that is one stage lower in the hierarchy.
Because a branch can never be combined with a different tree, there is no possibility
of interaction in this experiment, or between nested factors in any experiment.
Other examples of nested factors include: regions nested within countries, or casks of
liquid product nested within batches.
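The nested F test for tree can be sketched from the mean squares in the table above.

```python
# Sketch of the nested F test: the tree mean-square is divided by the
# branch mean-square (one stage lower in the hierarchy), not by the error.
from scipy import stats

ms_tree, df_tree = 389.42, 3
ms_branch, df_branch = 856.69, 12

f_tree = ms_tree / ms_branch
p_tree = stats.f.sf(f_tree, df_tree, df_branch)
print(round(f_tree, 2))     # 0.45, with a large p-value: no tree effect
```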
Problems 7C
#1. Timber logs are dried by leaving them overnight on a storage heater or placing
them in an oven for either one hour or four hours. The percentage reduction in weight
of each log is observed. It is thought that the percentage reduction in weight may
depend on the weight of the log, so the original weight of each log is recorded also.
The General Linear Model output is shown below.
(a) Is the original weight a factor or a covariate? Explain.
(b) Does the weight of a log have an impact on its reduction in weight?
(c) What is the best way to dry a log? What is the next best way?
General Linear Model: %Reduction versus Original Wt, Drying Mode
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Original Wt 1 24.55 24.550 6.64 0.026
Drying Mode 2 346.96 173.480 46.89 0.000
Error 11 40.70 3.700
Total 14 419.76
Model Summary
S R-sq R-sq(adj) R-sq(pred)
1.92355 90.30% 87.66% 71.28%
Regression Equation
Drying Mode
oven 1hr %Reduction = 9.73 - 0.00343 Original Wt
oven 4hr %Reduction = 24.06 - 0.00343 Original Wt
storage heater %Reduction = 12.14 - 0.00343 Original Wt
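The fitted equations above share one covariate slope, with a separate intercept for each drying mode. A sketch that evaluates them at a hypothetical original weight of 1000 (in whatever units the study recorded; the value is chosen only for illustration):

```python
# Fitted equations from the General Linear Model output above.
# The covariate slope is shared by all three drying modes; only the
# intercepts differ.
slope = -0.00343
intercepts = {"oven 1hr": 9.73, "oven 4hr": 24.06, "storage heater": 12.14}

def predicted_reduction(mode, original_wt):
    # Predicted % reduction in weight for a drying mode and original weight.
    return intercepts[mode] + slope * original_wt

# Hypothetical original weight of 1000, for illustration only.
for mode in intercepts:
    print(mode, round(predicted_reduction(mode, 1000), 2))
```

Whatever the original weight, the ranking of the three modes is the same, because the lines are parallel.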
[Scatterplot of Distance versus Angle]
Fig 7.6
[Scatterplot of Distance versus Angle]
This problem can be overcome by including a centre point in the design. A centre
point is an experimental run at which all factors are set halfway between high and low.
Irrespective of the number of factors, one centre point will provide a p-value for
curvature and put our minds at rest about undetected curvature in the design space.
Fig 7.7
The diagram above shows a centre point in a three-factor experiment that considers
the effects of length, width and temperature on the distance travelled by a paper
airplane.
According to the curvature null hypothesis there is no curvature in any of the factors,
i.e. the average response at the centre point is equal to the average response at all
the corner points. Replication at the centre point is not essential, because there is
replication at the corner points. However, if some additional time or material is
available in an experiment then extra centre points will improve the error variance
estimate without unbalancing the design.
A centre point makes sense only for factors with numeric levels. The levels of some
factors are text, e.g. the person who throws the paper airplane is either Sean or Ben,
and there is no factor setting halfway between those two levels. If we require a centre
point in an experiment that includes text factors, we have no option but to duplicate
the centre point at each of the two levels of every text factor.
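In coded units, writing down a full factorial with a centre point is straightforward; a minimal sketch for three numeric factors:

```python
from itertools import product

# Coded units: -1 = low level, +1 = high level for each numeric factor.
corners = list(product([-1, 1], repeat=3))

# One centre point: every factor set halfway between high and low.
design = corners + [(0, 0, 0)]

print(len(design))  # 8 corner runs plus 1 centre point
```

The same construction extends to any number of numeric factors; only factors with numeric levels get a 0 setting.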
Notice that the design is balanced: of the four factor-level combinations included, two
are long and two are short, two are narrow and two are wide, two are hot and two are
cold. Also, of the two that are long, one is wide and one is narrow; and one is hot and
one is cold, and so on.
However, fractional factorial designs can give rise to two problems. Firstly, information
about certain interactions will be completely missing. If a three-factor interaction
occurs at one of the missing combinations (e.g. long-wide-hot), we will simply not find
out about it. Now, if we have a good knowledge of the process, we may be able to
assert that these interactions would be zero anyway.
The second problem with fractional factorial designs is that information about different
effects can become mixed up with each other. This is called confounding or aliasing.
Suppose that a paper airplane needs to be well proportioned, i.e. it should be long and
wide, or else short and narrow, in order to travel a long distance. This means that there
is a length-width interaction. Look at each of the design points in the diagram above
and identify which combinations give rise to large values of distance. Do you notice
something? There are two favourable combinations, and two that are unfavourable.
But the two favourable combinations occur when the temperature is cold and the two
unfavourable combinations occur when the temperature is hot. So it appears that
temperature is the explanatory factor! The main effect of temperature has been
confounded with the length-width interaction. Some confounding, especially of higher-
order interactions, is inevitable in fractional factorial designs. An outline of the effects
that are confounded with each other, called the alias structure, can be identified, so
that alternative explanations can be considered for any significant effects. The
confounding of temperature with the length-width interaction is indicated below by the
terms C and AB appearing on the same line.
Alias Structure
Factor Name
A Length
B Width
C Temperature
Aliases
I - ABC
A - BC
B - AC
C - AB
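The alias line "C - AB" can be verified by construction. A sketch that builds the half-fraction with defining relation I = −ABC and checks that the C column is the negative of the AB interaction column on every run:

```python
from itertools import product

# Half-fraction of a 2^3 design with defining relation I = -ABC:
# keep the runs of the full factorial where the product A*B*C is -1.
runs = [r for r in product([-1, 1], repeat=3) if r[0] * r[1] * r[2] == -1]

# In this fraction the C (temperature) column equals -(A*B) on every run,
# so the main effect of C is confounded with the length-width interaction,
# matching the alias line "C - AB" above.
print(len(runs), all(c == -(a * b) for a, b, c in runs))
```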
Despite their weaknesses, fractional factorial designs are amazingly powerful and very
popular. They provide a lot of information in return for a small amount of data
collection.
You may have noticed that Figure 7.7 shows uncoded units while Figure 7.8 shows
coded units. Both representations are legitimate. Uncoded units are more intuitive,
because the factor levels are familiar to someone who knows the process. But coded
units have advantages for analysis. Coded units preserve the orthogonality of the
design so that every main effect and interaction can be estimated independently of the
others. Coded units also allow the sizes of the coefficients to be compared on a
common scale.
Multi-factor experiments are often used as screening experiments, with the purpose
of identifying the factors that are worthy of further study in a larger experiment, where
additional levels of those factors can be explored.
Notice that an alpha value of 0.01, rather than the usual 0.05, was used in the plot
above. When multiple hypotheses are being tested it is advisable to reduce the value
of alpha as the family error rate exceeds the individual error rate. One suggestion,
attributed to Bonferroni, is to divide 0.05 by the number of hypotheses to be tested,
and use the result as the alpha level. Another suggestion is to use 0.01 rather than
0.05 if five or more hypotheses are to be tested.
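The Bonferroni suggestion is a one-line calculation; a sketch:

```python
def bonferroni_alpha(family_alpha, n_tests):
    # Divide the family-wise alpha equally among the hypotheses tested.
    return family_alpha / n_tests

# With five hypotheses the adjusted alpha is 0.05 / 5 = 0.01, which agrees
# with the rule of thumb of using 0.01 when five or more tests are run.
print(bonferroni_alpha(0.05, 5))
```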
Problems 7D
#1. A two-level experiment was carried out to investigate the effect of six factors on
the Core CPU Temperature of a laptop. The factors, and their low and high levels,
were: External fan (Off, On), Application Type (Browser, High-end Game), Central
Heating (Off, On), Battery (In, Out), Application Origin (Internal, External), Multiple
Apps (Off, On).
(a) Why was it not possible to include centre points in this design?
(b) What conclusion is suggested by the Pareto chart below?
(c) Given the alias structure below, what alternative conclusion could be proposed?
Fig 7.10
Alias Structure
Factor Name
A External fan
B Application Type
C Central Heating
D Battery
E Application Origin
F Multiple Apps
Aliases
I + ABD - ACE + BCF - DEF - ABEF + ACDF - BCDE
A + BD - CE - BEF + CDF + ABCF - ADEF - ABCDE
B + AD + CF - AEF - CDE - ABCE - BDEF + ABCDF
C - AE + BF + ADF - BDE + ABCD - CDEF - ABCEF
D + AB - EF + ACF - BCE - ACDE + BCDF - ABDEF
E - AC - DF - ABF - BCD + ABDE + BCEF + ACDEF
F + BC - DE - ABE + ACD + ABDF - ACEF - BCDEF
AF - BE + CD + ABC - ADE + BDF - CEF - ABCDEF
Project 7D
Multi-Factor Experiment
Design a fractional factorial experiment with at least four factors, in any application
area of your choice. Carry out the experiment, analyse the results, and write a report
under the following headings.
(a) State the purpose of your experiment, and list the factors and factor levels.
(b) Display the data.
(c) Show the ANOVA table and any other relevant output.
(d) Show an effects plot.
(e) Express the experimental findings in simple language.
Contour Plots
Contour plots use colour to represent higher and lower values of the response in the
same way that maps use colour to represent higher and lower terrain, with deeper
shades of blue for increasingly lower values (deep seas) and deeper shades of green
for increasingly higher values (high mountains). The plots below illustrate the
relationship between the distance travelled by a paper airplane and the two factors:
grade of paper and initial angle of flight path. The levels of grade were in the region of
80 g/m2 and 90 g/m2, and the levels of angle were in the region of 20 and 30 degrees
to the horizontal. The first plot shows us a number of things:
1. Greater distance tends to be achieved by using lighter paper, and larger angles.
2. The contours are roughly parallel: this indicates that the response surface is a plane,
as opposed to a ridge (Fig. 7.12), trough, peak (Fig. 7.13), valley or saddle.
3. The optimal settings for the factors are probably outside the range of levels included
in the experiment. The optimal settings may be an angle substantially greater than 30
degrees and a grade of paper substantially less than 80 g/m2.
4. As we move outside the plotted region towards the area of optimum response, the
shape of the response surface may change, e.g. there may be curvature in the surface,
leading to a ridge or a peak in the response, as illustrated in the second and third plots.
5. We may be restricted from following the path to the peak by other considerations,
e.g. a very low value for paper grade may optimise distance travelled but may have
an adverse effect on some other target response such as product lifetime. Or a very
low value for paper grade may be unattainable with the current paper manufacturing
process.
Fig 7.11
[Contour plot of Distance versus Grade and Angle]
Fig 7.12
[Contour plot showing a ridge in the response surface]
Fig 7.13
[Contour plot showing a peak in the response surface]
Although response surface methodology (RSM) can be used with many factors,
contour plots can only show two input variables at a time, like the easting and the
northing on a map. The other factors can be held at high, low, or middle settings.
Process Optimisation
The objective is usually to identify the settings of the input variables that optimise the
value of the response variable, i.e. the location on the map that achieves the desired
height. The optimum could mean that ‘largest is best’, ‘smallest is best’, or ‘target is
best’. If a number of alternative locations are equally suitable, it is best to choose a
location where the contours are widely spaced. Widely spaced contours indicate a
robust region, where variation in the process variables will cause only small
variation in the response.
To fit a geometric model to the process, data can be collected from a specially
designed off-line RSM experiment or from an on-line experiment using evolutionary
operation (EVOP) where small sequential changes are made to nudge the process
closer to its optimum settings. An off-line experiment is not restricted by the process
specifications, requires fewer data, and usually delivers much faster results. The
advantage of an on-line experiment is that the cost of the experiment is quite small as
the product made during the experiment is saleable, but it is a slow process requiring
a lot of data and repeated analysis steps.
Steepest Ascent
‘The Path of Steepest Ascent’ is an EVOP technique for moving the process closer to
the optimum operating conditions. A simple planar model is used to approximate the
response surface: this may be a useful approximation, especially if the initial
experimental region is some distance away from the optimum.
It can be seen that the linear and quadratic terms in time and temperature are the
significant terms. A contour plot could be constructed using these two variables, while
holding material constant at its high value.
Fig 7.14
[Contour plot of Strength versus Temperature and Time, with contour bands 490–500, 500–510 and > 510; hold value: Material = 30]
The optimum process setting is in the region of (85.3, 2.2, 30) in the temperature-time-
material space, giving a strength in the region of 513.6.
The analysis is very useful but, as with any statistical model, it is not magic. We have
fitted a simple geometric model that may not be the ‘true’ model: perhaps linear and
quadratic terms do not adequately represent the relationship. Also, the values of the
coefficients presented by the software are only sample estimates of the actual
parameter values. In addition to all this uncertainty, we must also note that the
response surface only represents the average response: individual responses will lie
above and below the response surface.
An overlaid contour plot can be used to simultaneously display all responses in relation
to any two input variables. Additional input variables not included in the plot must be
held at fixed settings, such as previously identified optimal settings.
Fig 7.15
The white area in the contour plot shows the ranges of these two factors that
simultaneously satisfy the specifications for all responses. Additional plots could be
drawn to consider the effects of different factors.
Problems 7E
#1. (Exercise) Construct a Box-Behnken Design and carry out an experiment to
optimise a player’s performance at throwing darts. The response is the distance from
the bull’s-eye. The factors are the height of the bull’s-eye from the floor, the distance
the player stands away from the dartboard, and the time that the player pauses to take
aim before throwing.
Project 7E
RSM Project
Design an RSM experiment, in any application area of your choice. Carry out the
experiment, analyse the results and write a report under the following headings.
(a) State the purpose of your experiment and list the factors.
(b) Display the data.
(c) Show the relevant analysis from the software output.
(d) Show one contour plot.
(e) Present your findings and recommendations.
8
Statistical Process Control
Having completed this chapter you will be able to:
– investigate the capability of a process;
– control processes, using appropriate SPC techniques;
– select and apply sampling plans;
– validate a measurement system.
EXAMPLE A process fills bottles with a target volume of 120 ml. Samples of size five
were taken from thirty different batches as a basis for a process capability study.
Fig 8.1
First, we check for normality. Ignore the bell-shaped curves which are drawn by the
software, and look at the histogram. It peaks in the middle and tails off on both sides
in a fairly symmetric pattern, so we conclude that the process is normal.
Next, we check for stability. The overall standard deviation, 1.196, is more than 10%
greater than the within standard deviation, 1.011, so we conclude that this process is
not stable: the mean fill-volume varies from batch to batch.
The curves on the graph represent the short-term and long-term behaviour of the
process, using the dashed curve and the solid curve respectively. To use an analogy,
this is like viewing a bell-shaped vehicle from behind. The dashed curve indicates the
width of the stationary vehicle in a snapshot. The solid curve indicates the width of the
vehicle’s path over a period of time in a video, with some additional side-to-side
movement as it travels. The path of the moving vehicle is wider than the stationary
vehicle because of the time-to-time variation.
Centrality
Up to this point, we have been listening to the voice of the process. We have allowed the
process to tell us what it is able to do, i.e. what standard deviation it is capable of
delivering.
We now introduce the voice of the customer, i.e. what the process is required to
do. For our example, the lower and upper specification limits are: LSL = 115 ml and
USL = 125 ml.
Looking at the graph we see that the process is not centred, and this is confirmed by
observing that the mean of the sample of 150 measurements, 119.381, does not
coincide with the target value, 120. This means that there is a greater risk of violating
the lower specification limit rather than the upper specification limit. In this case the
lower specification limit can be designated as the nearest specification limit, NSL. For
some processes, it is a simple matter to correct this asymmetry by adjusting the
process mean, µ, but for certain other processes, this is not a simple adjustment at all.
Capability
We now ask whether the process is capable of satisfying the specifications, i.e. is it
able to do what it is required to do? In other words, are the specification limits
comfortably wider than the process variation, ±3σ? By dividing the difference between
USL and LSL by 6σ we obtain the process capability, Cp. The minimum acceptable
value for process capability is 1.33, for a process to be deemed capable. A value of 2
or greater is considered ideal. This is like comparing the width of a vehicle with the
width of a road, to ensure that the vehicle fits comfortably on the road. The boundaries
on the sides of the road correspond to the LSL and the USL.
This process capability is calculated using the within standard deviation, since the
process has demonstrated that it is capable of achieving this standard deviation, at
least in the short term. This may be rather optimistic if the process is unstable, so this
is referred to as the potential process capability.
The process capability also assumes that the process can be easily centred. If this
cannot be easily done then it makes more sense to calculate a one-sided process
capability index, Cpk, by dividing the difference between the process mean and NSL
by 3σ. This should also have a value of 1.33 minimum.
Performance
If a process is unstable, then its actual long-term performance is not as good as its
potential short-term capability. Therefore, the overall standard deviation should be
used to carry out the calculations. Now, dividing the difference between USL and LSL
by 6σ we obtain the process performance, Pp.
The process performance assumes that the process can be easily centred. If this is
not realistic then it is useful to calculate a one-sided process performance index,
Ppk, by dividing the difference between the process mean and NSL by 3σ, again using
the overall standard deviation.
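All four indices follow directly from the definitions above. A sketch using the figures quoted earlier for the bottle-filling example (specs 115 to 125 ml, mean 119.381, within sd 1.011, overall sd 1.196):

```python
def cap_index(usl, lsl, sigma):
    # Two-sided index: Cp with the within sd, Pp with the overall sd.
    return (usl - lsl) / (6 * sigma)

def cap_index_k(mean, usl, lsl, sigma):
    # One-sided index: distance from the mean to the nearest spec limit.
    return min(usl - mean, mean - lsl) / (3 * sigma)

usl, lsl, mean = 125, 115, 119.381
print(round(cap_index(usl, lsl, 1.011), 2))          # Cp  (within sd)
print(round(cap_index_k(mean, usl, lsl, 1.011), 2))  # Cpk (within sd)
print(round(cap_index(usl, lsl, 1.196), 2))          # Pp  (overall sd)
print(round(cap_index_k(mean, usl, lsl, 1.196), 2))  # Ppk (overall sd)
```

Here Cp exceeds 1.33 but Ppk falls below it, which is consistent with a process that is potentially capable in the short term but not performing in the long term.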
The words ‘performance’ and ‘capability’ are similar in meaning. Cp and Cpk are short-
term measures of the potential process capability. The process may not be able to live
up to these standards in the long-term. So the measures of performance, Pp and Ppk,
are more realistic measures of what the process is able to do, in practice, in the long
term. So the process performance is often referred to as the overall capability of the
process. The word ‘performance’ is also used to refer to the rate of defectives from the
process in parts per million, PPM.
Ppk is the key measure. If the value of Ppk is at least 1.33 then it usually follows that
all the other measures are at least 1.33 also and, if this is so, we can say that the
process is both capable and performing.
Problems 8A
#1. The target duration of a castle tour is 45 minutes. The LSL and USL are 35 minutes
and 55 minutes respectively. Five tours were timed on each of thirty days. The process
capability report is shown below. Assess this process under each of the following
headings, supporting your answers with appropriate references to the software output:
(a) normality (b) stability (c) centrality (d) capability (e) performance.
Fig 8.2
#2. Refer again to the situation described in question #1 above. If the USL was
reduced to 50 minutes and the LSL remained the same at 35 minutes, would each of
your answers at (a), (b), (c), (d) & (e) change or stay the same? Defend your answer
in each case.
#3. (Exercise) Remind yourself of what 100 mm looks like by having a look at a ruler.
Then put the ruler out of sight. Now take a pencil and a blank sheet of paper and draw
a freehand line to a target length of 100 mm. Hide your first attempt and draw another
freehand line. Repeat until you have a subgroup of five lines. Take a break for about
one minute and then repeat the whole activity until you have thirty subgroups, giving
a total of 150 lines. Alternatively, share the exercise with 30 fellow students to provide
thirty subgroups, giving a total of 150 lines. Finally, measure each line with the ruler
and conduct a process capability study with respect to a LSL of 70 mm and a USL of
130 mm.
Individuals Chart
An individuals chart is the simplest control chart. Individual process measurements,
assumed normally distributed, are sampled and plotted.
EXAMPLE Customers at a drive-thru restaurant are sampled and the delay experienced
by these customers is recorded in seconds. For the first 30 customers sampled, the
data were plotted on an individuals chart.
Fig 8.3
[Individuals chart of Delay in Seconds: UCL = 199.4, X̄ = 154.5, LCL = 109.5]
The final point indicates that the process is out-of-control. The reason could be an
untrained worker (men), a shipment of lettuce heads instead of the usual lettuce leaves
(material), a broken drink dispenser (machines), failure to start cooking the food before
pouring drinks (methods), a reduction in the electrical supply voltage that affects the
cooking time (milieu), or measuring the time until the customer drives away instead of
the time until the order is delivered (measurements). The cause must be identified and
corrected. If the USL is 240 seconds, then corrective action can be taken before any
defectives occur.
Xbar Chart
An Xbar chart uses a plot of sample means to track the process mean. The process
measurements do not need to be normally distributed. The specification limits for
individual values must not be shown on the chart, or compared with values on the
chart, because these are sample means.
EXAMPLE Samples of five bottles each were taken from thirty different batches of a
filling process, and the fill volume was observed in each case.
Fig 8.4
[Xbar chart of fill volume: X̿ = 119.381, LCL = 118.025]
The default approach is to use the StDev(Within) to calculate the control limits but if
the process is unstable then it may be more realistic to use the StDev(Overall). The
final few points indicate that the process is out-of-control.
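The limits on an Xbar chart are the grand mean plus or minus three standard errors of the sample mean. A sketch that reproduces the lower limit shown in Fig 8.4 from the within standard deviation of 1.011 and subgroups of five:

```python
import math

# Xbar chart limits: grand mean +/- 3 * sigma / sqrt(n), here using the
# within standard deviation from the capability study.
grand_mean, sigma_within, n = 119.381, 1.011, 5
margin = 3 * sigma_within / math.sqrt(n)
lcl = grand_mean - margin
ucl = grand_mean + margin
print(round(lcl, 3), round(ucl, 3))  # LCL matches the 118.025 on the chart
```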
Sample standard deviation charts, S charts, or sample range charts, R charts, are
sometimes used to accompany Xbar charts. These charts give a signal when the
spread of values rises or falls significantly.
NP Chart
We now consider control charts for attributes. The number of defectives in a sample
can be monitored using an NP chart. This involves drawing a sample of fixed size, n,
at regular intervals, counting the number of defectives in the sample, and plotting this
number on the chart. Every unit in the sample is classified wholly as defective or non-
defective. The parameter P represents the proportion defective in the process, and
this can be estimated from the data. Sample sizes must be large, because an average
of at least 5 defectives per sample is required. Violation of an upper control limit
indicates an increase in the proportion of defective units, but violation of a lower control
limit indicates improvement in the process. NP charts are based on the binomial
distribution.
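The NP chart limits follow from the binomial distribution: the centre line is n·p̄ and the limits sit three binomial standard deviations either side. A sketch that reproduces the limits in Fig 8.5, assuming a hypothetical sample size of n = 50 (the sample size is not stated in the output; 50 is chosen so that the centre line matches NP = 9.47):

```python
import math

# NP chart limits: centre line n*p_bar, limits +/- 3*sqrt(n*p_bar*(1-p_bar)).
# n = 50 is a hypothetical sample size consistent with the centre line.
n, centre = 50, 9.47
p_bar = centre / n
margin = 3 * math.sqrt(centre * (1 - p_bar))
print(round(centre - margin, 2), round(centre + margin, 2))  # 1.16 17.78
```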
Fig 8.5
[NP chart of the number of defectives: centre line NP = 9.47, UCL = 17.78, LCL = 1.16]
The chart indicates that the process is in control. This means that the proportion of
late deliveries is consistent, with no evidence of special causes.
The proportion of defectives is high, and might not be acceptable to some customers.
But control charts do not exist to ensure that products conform to customer
specifications; they exist to ensure that processes perform to the highest standard of
which they are capable.
C Chart
The next attribute control chart, the C chart, tracks the number of defects in a sample.
A defect is not the same thing as a defective unit, because we can count many defects
on a single unit. The sample must be defined: it could consist of one unit of product or
a number of units, or it could be an interval of time, a length, an area, or a volume. The
number of defects in the sample (e.g. missed calls, potholes, scratches, bubbles, etc.)
is plotted on the chart.
Sample sizes must be large, because an average of at least 5 defects per sample is
required. Violation of an upper control limit indicates an increase in the level of defects,
but violation of a lower control limit indicates improvement in the process. C charts are
based on the Poisson distribution.
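Because the Poisson variance equals its mean, the C chart limits are simply c̄ ± 3√c̄. A sketch using the centre line from Fig 8.6 below (small differences from the plotted limits arise from rounding of c̄):

```python
import math

# C chart limits: centre line c_bar, limits c_bar +/- 3*sqrt(c_bar),
# because the Poisson variance equals its mean.
c_bar = 15.63
margin = 3 * math.sqrt(c_bar)
print(round(c_bar - margin, 2), round(c_bar + margin, 2))
```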
EXAMPLE The number of bags damaged at an airport each day was recorded for thirty
consecutive days. The results are plotted on the C chart below. Notice that a sample
is defined as a day.
Fig 8.6
[C chart of damaged bags per day: centre line C = 15.63, UCL = 27.50, LCL = 3.77]
The chart indicates that the process has improved. The cause should be identified and
action taken to make the improvement permanent. Possible causes include: a new worker taking
greater care with the bags (men), a reduction in the maximum allowed bag weight that
makes damage to bags less likely (materials), installation of a new carousel
(machines), a revised procedure for the handling and delivery of bags (methods),
better lighting in the baggage reclaim area (milieu), or a failure to record all the
damaged bags on a particular day (measurements).
Setting up an SPC System
The questions to be addressed when setting up an SPC system are outlined below.
1. Which process parameter needs to be controlled?
In a filling process, the critical parameter could be the fill volume, the fill height, the fill
weight, the gross weight, or the proportion of bottles that are overfull, or underfull.
2. Which statistics should be plotted?
A chart for controlling fill volume could plot individual volumes, sample means, or
sample means and standard deviations.
3. How big should the sample be?
‘Little and often’ is best. A small ‘handful’ of about five measurements, drawn
frequently, is preferable to a larger sample drawn less frequently, because this
facilitates a faster response when problems occur.
4. How will the samples be drawn?
Samples should form rational subgroups. This means that the units within a sample
should be as alike as possible, so they should be taken close together in time and not
mixed from different sources. Unexplained variation should be the only source of
variation within a sample.
#2. An SPC chart is required to monitor the number of scratches on polarising filter
screens. The process mean is 1.7 scratches per screen. What type of chart should be
used, and what sample size would you recommend?
#3. (Exercise) Construct an Xbar chart using the data from Problems 8A #3. Identify
whether the process is out-of-control and, if so, recommend some corrective action.
Project 8B
Process Capability Study
Conduct a process study, using 30 samples of size 5 from any process.
(a) Briefly describe your process and your sampling procedure.
(b) Present the software output consisting of the graphs and indices.
(c) Comment on the normality, stability, centrality, capability and performance of the
process.
(d) Present a suitable SPC chart of the data.
(e) Comment on what the SPC chart tells you about the state of the process.
happen to contain fewer than two defective units, although the proportion defective in
the batch is 40%. Hence, the probability of acceptance in this case (11%, in this
example) is called the consumer’s risk, β.
What Sampling Plans Do
Firstly, sampling plans can identify a very bad batch (LTPD or worse) immediately.
Secondly, sampling plans can identify an unacceptable process (worse than AQL)
sooner or later: as batches from such a process are frequently submitted for sampling
inspection, some of them will eventually get caught.
There are some things that sampling plans cannot do for us. Firstly, sampling plans
cannot take the place of corrective action: when batches are rejected because of poor
quality, corrective action is essential – just continuing with sampling inspection is a
waste of time. Secondly, sampling plans cannot guarantee a correct decision about
any one particular batch. Thirdly, sampling plans cannot guarantee zero defects:
either ‘right first time’ or effective 100% inspection is required for this.
Sampling Plan Calculations
Let n, p and c represent the sample size, the incoming quality level, and the
acceptance number, respectively. The probability of acceptance is the probability of
obtaining c or fewer defectives. This is the cumulative probability of c for a binomial
distribution with parameters n and p.
The table below shows the probability of acceptance, corresponding to a number of
different values of the incoming quality level, for the sampling plan n = 8, Ac 1, Re 2.
Table 8.1
p (%) PA
0 1
5 0.9428
10 0.8131
15 0.6572
20 0.5033
25 0.3671
30 0.2553
35 0.1691
40 0.1064
45 0.0632
100 0
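The entries in Table 8.1 can be reproduced with the cumulative binomial formula; a sketch for the plan n = 8, Ac 1:

```python
from math import comb

def prob_accept(n, c, p):
    # Probability of c or fewer defectives in a sample of n: the
    # cumulative binomial probability with parameters n and p.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

for p in (0.05, 0.10, 0.20, 0.40):
    print(round(prob_accept(8, 1, p), 4))  # 0.9428 0.8131 0.5033 0.1064
```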
Fig 8.7
[OC curve for the sampling plan n = 8, Ac 1, Re 2]
The OC curve is very useful. It shows the level of protection that the plan offers against
any value of the incoming quality level. Before selecting a sampling plan for use, its
OC curve should be studied, to see if it is satisfactory. For example, the above plan
could be useful in a situation where the LTPD is 40%, because PA = 11%, but it would
not be useful if the LTPD was 20% because PA = 50%. The value of p (p = 20% in this
case) that has a probability of acceptance of 50% is called the indifferent quality
level.
Outgoing Quality
Average outgoing quality (AOQ) is the proportion of defective parts downstream
from sampling inspection. If no rectification is carried out on the rejected batches, then
AOQ is only marginally better than the incoming quality level. (The rejected batches
are removed, and in the case of accepted batches, any defective part found while
sampling is replaced.) If rectification is carried out, the AOQ is better than the incoming
quality level of the submitted batches. Theoretically, AOQ is low when incoming quality
is very good and also when incoming quality is very poor, because this causes a great
amount of 100% inspection. The peak on the AOQ curve is known as the average
outgoing quality limit (AOQL). The AOQL provides a ‘worst case’ guarantee for the
average outgoing quality of a continuous series of batches from a stable process,
assuming that a rectification scheme is in place. Typically, the AOQL is marginally
greater than the AQL if large samples are used. If the sampling plan uses smaller
samples, the AOQL may be around two or three times the AQL.
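Under a rectification scheme, the average outgoing quality at incoming level p is approximately p multiplied by the probability of acceptance, since defectives reach the customer only through accepted batches. A sketch that traces the AOQ curve for the plan n = 8, Ac 1 and finds its peak, the AOQL (this ignores the small correction for the units removed during sampling):

```python
from math import comb

def prob_accept(n, c, p):
    # Cumulative binomial probability of c or fewer defectives.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# AOQ(p) ~= p * PA(p): rejected batches are 100% inspected and rectified,
# so only accepted batches pass defectives downstream.
levels = [i / 100 for i in range(1, 100)]
aoq = [p * prob_accept(8, 1, p) for p in levels]
aoql = max(aoq)  # the peak of the AOQ curve is the AOQL
print(round(aoql, 3))
```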
Double Sampling Plans
Double sampling involves taking two smaller samples rather than one sample. If the
first sample is very good or very bad, the batch can be accepted or rejected straight
away, without any need for a second sample. Otherwise the second sample is drawn
and a verdict is reached based on the cumulative sample. The advantage of double
sampling plans is that they typically involve smaller sample sizes. The disadvantages
are that they are more complex to operate and in some situations it can be very
troublesome to have to return for a second sample. Multiple sampling plans are
similar to double sampling plans, but involve taking up to seven small samples: after
each sample it may be possible to make a decision about the batch without drawing
any further samples. Sequential sampling takes this concept to the limit: after each
sampling unit is drawn, a decision is made to accept the batch, to reject the batch, or
to continue sampling.
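The accept/reject logic of a double sampling plan can be written out directly, which also gives its overall acceptance probability. The plan below (n1 = n2 = 50, accept the first sample on 1 or fewer defectives, reject on 4 or more, combined acceptance number 4) is an illustrative assumption, not a plan from the text.

```python
from math import comb

def binom_pmf(d, n, p):
    """Probability of exactly d defectives in a sample of n."""
    return comb(n, d) * p**d * (1 - p)**(n - d)

def binom_cdf(d, n, p):
    """Probability of at most d defectives in a sample of n."""
    return sum(binom_pmf(k, n, p) for k in range(d + 1))

def double_plan_accept(p, n1, c1, r1, n2, c2):
    """Acceptance probability for a double sampling plan:
    accept straight away if d1 <= c1, reject straight away if d1 >= r1,
    otherwise draw a second sample and accept if d1 + d2 <= c2."""
    pa = binom_cdf(c1, n1, p)                 # accepted on the first sample
    for d1 in range(c1 + 1, r1):              # second sample needed
        pa += binom_pmf(d1, n1, p) * binom_cdf(c2 - d1, n2, p)
    return pa

# Illustrative plan: n1 = n2 = 50, c1 = 1, r1 = 4, c2 = 4
print(round(double_plan_accept(0.02, 50, 1, 4, 50, 4), 4))
```

Plotting this acceptance probability against incoming quality would give the OC curve of the double plan, which can then be compared with single plans of similar discrimination.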
Problems 8C
#1. (Exercise) For a telephone service, a dropped call can be regarded as a defective
unit, and a day can be regarded as a batch.
(a) Identify the values of the AQL and LTPD that you consider appropriate.
(b) Use a sampling plan calculator to identify a suitable sampling plan.
(c) Explain what the producer’s risk and the consumer’s risk mean in this context.
(d) Use the OC Curve of your selected plan to identify the indifferent quality level.
Fig 8.8
EXAMPLE A kitchen scale was tested by weighing each of five consumer products
twice. The data are shown below.
Table 8.2
Part  Reference  Measurement  Deviation  Average Deviation
 1       200         260          60           65.0
                     270          70
 2       300         375          75           57.5
                     340          40
 3       400         500         100          102.5
                     505         105
 4       625         725         100          107.5
                     740         115
 5       700         780          80           85.0
                     790          90
Components of Variance
When a number of parts are measured, the total variation in the measurements arises
from two different sources. Firstly there is material variation because the parts are
not all the same. Secondly there is measurement system variation because even
when the same part is measured repeatedly, the results are not the same, because
the instruments and operators that take the measurements are not perfect. These two
sources of variation are assumed to be independent, so the following additive model
applies.
Model 8.1
σ²(Total) = σ²(Mat) + σ²(MS)
Care should be taken that any subtle material variation is not attributed to the
measurement system. Examples include time-to-time variation, such as shrinkage, or
within-part variation, such as different diameters on the same part.
It may be possible to explain some of the measurement system variation by pointing
out that the measurements arose on different sessions, i.e. different operators,
different laboratories, different instruments, or different days. But even on a single
session, there may be some unexplained variation. The explained variation between
sessions is called reproducibility (rpd) and the unexplained variation within sessions
is called repeatability (rpt). These two sources of variation are also assumed to be
independent, so an additive model can be used again.
Model 8.2
σ²(MS) = σ²(rpd) + σ²(rpt)
A single model can be used to represent all three sources of variation in a set of
measurements: differences between parts (material variation), differences between
sessions (reproducibility), and unexplained variation (repeatability).
Model 8.3
σ²(Total) = σ²(Mat) + σ²(rpd) + σ²(rpt)
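The additivity in Model 8.3 can be illustrated with a small simulation: generate part effects, session effects and repeatability errors independently, and check that the total variance of the measurements is close to the sum of the three component variances. The standard deviations below are illustrative assumptions, not values from the text.

```python
import random
import statistics

random.seed(1)

# Illustrative (assumed) standard deviations for the three components
sd_mat, sd_rpd, sd_rpt = 4.0, 1.5, 2.0

# Each measurement = part effect + session effect + repeatability error
parts = [random.gauss(0, sd_mat) for _ in range(200)]
sessions = [random.gauss(0, sd_rpd) for _ in range(50)]
measurements = [p + s + random.gauss(0, sd_rpt)
                for p in parts for s in sessions]

total_var = statistics.pvariance(measurements)
expected = sd_mat**2 + sd_rpd**2 + sd_rpt**2  # 16.0 + 2.25 + 4.0 = 22.25
print(round(total_var, 2), expected)
```

The simulated total variance lands close to 22.25, while the standard deviations (4.0, 1.5, 2.0) are clearly not additive: this is the sense in which variances, not standard deviations, add.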
Gage R&R
%Contribution
Source VarComp (of VarComp)
Total Gage R&R 7.3739 30.52
Repeatability 7.3739 30.52
Reproducibility 0.0000 0.00
Session 0.0000 0.00
Part-To-Part 16.7834 69.48
Total Variation 24.1572 100.00
Process tolerance = 80
The analysis consists of an ANOVA in which the response is ‘measurement’ and the
factors are ‘part’ and ‘session’. A preliminary ANOVA table with interaction is used to
confirm that there is no significant interaction; a significant interaction would mean
that some parts gave higher results in a particular session.
The main analysis consists of an ANOVA table without interaction, which is used to
estimate variance components, which are also presented as standard deviations.
Remember that variances are additive while standard deviations are not.
We see that ‘part’ is significant, and contributes 69% of the total variance. This is not
at all surprising since it is to be expected that parts are different.
The remaining 31% of the variation is contributed by the measurement system, and
all of this seems to be unexplained variation due to repeatability. This is unfortunate
because explained variation due to reproducibility can sometimes be corrected by
addressing the causes. If reproducibility was the main problem, corrective actions
could include training (if sessions denote different operators), or standard operating
procedures (if sessions denote different laboratories), or calibration (if sessions denote
different instruments), or environmental controls (if sessions denote different days).
When repeatability is the main problem there may be no corrective actions available:
perhaps the measuring instrument is simply not good enough.
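The variance-component step can be sketched with a method-of-moments calculation, assuming a crossed design with one measurement per part-session combination: each observed mean square is equated to its expected value and solved for the components. The design sizes and true standard deviations below are illustrative assumptions; an ANOVA package does the equivalent arithmetic.

```python
import random
from statistics import mean

random.seed(2)

P, S = 10, 3   # parts and sessions, one measurement per combination (assumed)
sd_part, sd_session, sd_rpt = 4.0, 1.0, 2.0   # assumed true values

part_eff = [random.gauss(0, sd_part) for _ in range(P)]
sess_eff = [random.gauss(0, sd_session) for _ in range(S)]
y = [[part_eff[i] + sess_eff[j] + random.gauss(0, sd_rpt)
      for j in range(S)] for i in range(P)]

grand = mean(v for row in y for v in row)
row_means = [mean(row) for row in y]
col_means = [mean(y[i][j] for i in range(P)) for j in range(S)]

# Two-way ANOVA sums of squares without interaction
ss_part = S * sum((m - grand)**2 for m in row_means)
ss_sess = P * sum((m - grand)**2 for m in col_means)
ss_tot = sum((v - grand)**2 for row in y for v in row)
ss_err = ss_tot - ss_part - ss_sess

ms_part = ss_part / (P - 1)
ms_sess = ss_sess / (S - 1)
ms_err = ss_err / ((P - 1) * (S - 1))

# Equate mean squares to their expectations and solve for the components
var_rpt = ms_err                           # repeatability
var_rpd = max((ms_sess - ms_err) / P, 0)   # reproducibility
var_mat = max((ms_part - ms_err) / S, 0)   # part-to-part
print(round(var_mat, 2), round(var_rpd, 2), round(var_rpt, 2))
```

Negative solutions are truncated to zero, which is why software output (such as the reproducibility line above) can show a component of exactly 0.0000.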
The following metrics are often used to summarise the results of a gage R&R study.
1. Number of Distinct Categories
This is the number of different groups into which the measurement system can divide
the parts. It is calculated by taking account of the variation among the parts, and
comparing this to a confidence interval for a single measurement. Anything less than
two groups (big and small) is obviously useless for distinguishing between parts. In
the automotive industry, at least five distinct categories are required for a measuring
system to be considered adequate.
2. Signal-to-noise ratio
The signal-to-noise ratio (SNR) also considers whether the measurement system is
able to tell the difference between the parts that are presented to it. The SNR is the
ratio of the material standard deviation to the measurement system standard deviation.
In the example above:
SNR = 4.09675 / 2.71549 = 1.5
Threshold values for the SNR are as follows:
>10: Good
3-10: Acceptable
<3: Unacceptable
The SNR in our example is unacceptable. The measurement system is unable to
distinguish between the parts that are presented to it. This may be because the parts
are very alike. Strangely, as production processes improve, the SNR goes down: it
appears that the measurement system is deteriorating when in fact the parts are
simply more alike, and therefore it is more difficult to tell them apart.
3. Precision to tolerance ratio
Although the measurement system in our example cannot distinguish between the
parts that are presented to it, it may be able to tell the difference between a good part
and a bad part. The precision to tolerance ratio, also called the %R&R, or the
%Tolerance, considers how well a measurement system can distinguish between
good and bad parts, by dividing the study variation by the tolerance.
In our example the %R&R = 20.37%
Threshold values for the %R&R are as follows:
< 10%: Good
10-30%: Acceptable
>30%: Unacceptable
So this measurement system is able to distinguish between good and bad parts.
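All three metrics can be computed directly from the variance components in the worked example above. The %R&R uses a study variation of six standard deviations, which reproduces the 20.37% quoted; the NDC formula shown (√2 × SNR, rounded down) is the conventional one and is an assumption on my part rather than something stated in the text.

```python
from math import sqrt, floor

# Variance components from the worked example's Gage R&R output
var_ms = 7.3739     # total gage R&R (measurement system)
var_mat = 16.7834   # part-to-part
tolerance = 80      # process tolerance

sd_ms, sd_mat = sqrt(var_ms), sqrt(var_mat)

snr = sd_mat / sd_ms               # signal-to-noise ratio
ndc = floor(sqrt(2) * snr)         # number of distinct categories (assumed formula)
prr = 100 * 6 * sd_ms / tolerance  # %R&R: study variation over tolerance

print(round(snr, 2), ndc, round(prr, 2))
```

The SNR of about 1.5 and an NDC of 2 confirm that this system cannot tell the parts apart, while the %R&R of about 20% confirms that it can still separate good parts from bad.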
Problems 8D
#1. A paper map was used to estimate a number of road distances in kilometres, and
these estimates were compared with the actual distances. A gage linearity and bias
study yielded the following regression equation:
Average Deviation = -0.950 - 0.0550 Reference
Calculate the %Linearity and explain what it means.
#2. Suzanne carried out a Gage R&R study in which three operators each measured
twelve parts three times using a vision system. The parts had LSL = 42 mm and USL
= 48 mm. Use the output below to identify the values of the Number of Distinct
Categories, the SNR and the %R&R and explain what these values indicate.
Gage R&R
%Contribution
Source VarComp (of VarComp)
Total Gage R&R 0.06335 2.94
Repeatability 0.02525 1.17
Reproducibility 0.03809 1.77
Session 0.03809 1.77
Part-To-Part 2.09107 97.06
Total Variation 2.15441 100.00
Process tolerance = 6
Project 8D
Gage R&R Study
Design and carry out a Gage R&R Study on any measurement system of your choice.
Collect the measurements, analyse the results and write a report under the following
headings.
(a) Define the measurements, the parts and the sessions.
Appendix 1
Answers to Problems
Answers 1A
1. The histogram indicates that, under ideal conditions, the journey can be completed
in about 40 minutes. Adverse travelling conditions may substantially increase the
journey time.
2. The tablets probably weigh a nominal 400 g. Most tablets are overweight.
Unacceptably overweight tablets have been removed by inspection.
3. There are high numbers of registrations at the start, and low numbers of
registrations at the end, of each calendar year. From 2013 onwards, the figures
experience a boost in July, presumably as a consequence of the new registration
arrangements. There is also some evidence of economic growth over this period.
Answers 1B
1. (a) None of the first 50 passengers has difficulty being in time for the train.
(b) The interviewer may be afraid to call at a home displaying a ‘Beware of the Dog’
sign. Also, will the owner tell the truth?
(c) The ten stalks all experienced similar ‘edge of field’ conditions of soil, irrigation and
light, that were not shared by stalks elsewhere in the field.
(d) The first word on each page is more likely to be part of a heading, and may not be
representative of normal text.
2. Assign 1 to Josiah, 2 to Sandrina, etc. N = 9, so enter 10 on your calculator and
press the random number key. Suppose the result is 3.907 then select the name with
ID = 3, i.e. Matthew. Repeat until three names are chosen.
Answers 1C
1. (a) mean 8, sd 1 (b) mean 8, sd 3.633
2. mean 4.296, sd 2.250
3. (a) The median. (b) The mode. (c) The mean.
Answers 2A
1. (a) If the die is rolled many times, then a ‘six’ occurs on one-sixth of all rolls. On a
single future roll, ‘one-sixth’ is a measure of how likely it is that a six will occur, i.e.
more likely not to occur.
(b) 90% of all invoices are paid within 30 days. 90% is a measure of how likely it is
that a single future invoice will be paid within 30 days, i.e. very likely.
Answers 2B
1. 40,320
2. (a) 1/6840 (b) 1/1140
3. 8,145,060 ways. Every time we choose six numbers to ‘take away’, we are choosing
39 numbers to ‘leave behind’, and so the answers are the same.
4. (a) 1/8145060 (b) 234/8145060 (c) 11115/8145060
5. nPn means the number of ways of arranging n things, taken from among n things,
i.e. all n things are arranged. But this is the definition of n!.
6. To take n things from among n things, you have to take them all; you have no choice.
There is only one way to do it, hence nCn = 1
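Several of these counts can be verified with Python’s math module. The 6-from-45 lottery setting is taken from answer 3; reading answer 4’s counts as “match exactly five” and “match exactly four” of the six winning numbers is my assumption.

```python
from math import comb, factorial

# Answer 1: the number of ways to arrange 8 items
print(factorial(8))                # 40320

# Answer 3: choosing 6 numbers from 45 leaves 39 behind, so the counts agree
print(comb(45, 6), comb(45, 39))   # 8145060 8145060

# Answer 4 (assumed reading): ways to match exactly five, or exactly four,
# of the six winning numbers in a 6-from-45 lottery
print(comb(6, 5) * comb(39, 1))    # 234
print(comb(6, 4) * comb(39, 2))    # 11115
```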
Answers 2C
1. 94.85%
Answers 2D
1. (a) 0.874552
(b) 0.764841
(c) 0.015737
(d) 0.984263
(e) 0.944516
2. (a) 0.6840 and 0.9001
(b) 0.948024
(c) 0.8208
(d) 0.9859449
Answers 3A
1. (a) {0,1} (b) Uniform (c) 2
Answers 3B
1. (a) 0.1587 (b) 0.6247 (c) 0.7734 (d) 0.1314 or 0.1292 (e) 0.8643 (f) 0.0994 (g)
0.0968 (h) 0.5 (i) 0 (j) 0.95 or 95%. Take note of this z-score.
2. 0.9848
3. 32.32% small, 47.06% medium, 14.88% large, 5.74% other.
Answers 3C
1. Q, R, U and Z only.
2. 0.0625 + 0.25 + 0.375 + 0.25 + 0.0625 = 1, because it is certain (p = 1) that the
number of heads will be 0 or 1 or 2 or 3 or 4.
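The five probabilities in answer 2 are the Binomial(4, ½) distribution for the number of heads in four tosses of a fair coin, and can be checked directly:

```python
from math import comb

# Number of heads in four tosses of a fair coin: Binomial(n=4, p=0.5)
pmf = [comb(4, k) * 0.5**k * 0.5**(4 - k) for k in range(5)]
print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # 1.0
```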
3. (a) 0.9838 (b) 0.0162
4. (a) 0.0305 (b) 0.0023 (c) 0.9977 (d) 0.9672 (e) 0.0328
5. 0.9664
6. 0.1314
Answers 3D
1. N, O, R, S and T only.
2. (a) 0.30% (b) 98.62%
3. (a) 67.03% (b) 32.17% (c) 0.80%
Answers 3E
1. The 3-Parameter Weibull distribution is the best fit out of the candidate distributions
considered in the graph. Its probability plot shows some random scatter without
obvious curvature.
Answers 4A
1. (a) 3 micrograms (b) 2 micrograms (c) 1 microgram
Answers 4B
1. The mean weight of all the bags of rice filled by that machine lies between 497.478
and 498.278 grams, with 95% confidence.
2. The mean diameter of all plastic tubes made by that process lies between 12.010
and 12.098 mm, with 95% confidence.
3. The mean journey distance for all service calls by that engineer lies between 6.7
and 17.3 km, with 95% confidence.
4. The mean cost of motor insurance from all the providers in the market, for that
motorist, lies between 284 and 309 euro, with 95% confidence.
5. Between 24% and 36% of all the booklets in the consignment have defective
binding, with 95% confidence.
6. Between 13.7% and 19.5% of all students at the college in Tallaght walk to college,
with 95% confidence.
Answers 4C
1. 79
2. 139
3. 97
4. 733
Answers 4D
1. 0.410
2. Between 0.017 and 0.082
Answers 4E
1. The tape measurements are between 1.52 and 15.48 shorter than the laser
measurements, on average.
2. Kevin is between 5.06 and 7.44 kg lighter than Rowena.
3. The 10am mean minus the 11am mean is between -6.05 and 5.25. There might be
no difference.
4. Between 4.02% and 13.73% more Summer visitors than Winter visitors had prepaid
tickets.
Answers 5A
1. A type 2 error is virtually always more costly. The shoelaces will be relied upon ‘in
the field’ to contribute to some larger purpose. This purpose may be thwarted by
defective shoelaces, resulting in considerable loss for the consumer, additional trouble
and expense to arrange replacement, and loss of reputation for the supplier. A type 1
error will lead only to the loss of the production and disposal costs of the batch of non-
defective shoelaces.
2. There are two major problems with the shopkeeper’s approach. Firstly, having
formed a hunch that the waiting times at the bank were longer than the hypothesised
mean waiting time, he should have drawn a (fresh) random sample to test this
hypothesis, and should not use the pessimistic sample that suggested the hypothesis
test to him.
Secondly, a series of waiting times taken together are not independently random: one
hold-up in the queue will cause all the following waiting times to be prolonged.
Answers 5B
1. No. Accept H0, z = -1.25, critical z = ±1.96
2. No. Accept H0, z = -0.89, critical z = -1.645
3. Reject H0, t = -4.24, critical t = ±2.365
4. Yes. Reject H0, t = 3.207, critical t = 2.132
5. Yes. Reject H0, t = -2.705, critical t = -1.943
6. Yes. Reject H0, z = -2.427, critical z = -1.645
7. Yes. Reject H0, z = -2.12, critical z = ±1.96
Answers 5C
1. Yes. Reject H0, t = -16.69, critical t = -2.353
2. No. Accept H0, t = -0.16, critical t = ±2.306
3. Yes. Reject H0, Chi-Sq = 40.40, critical Chi-Sq = 3.841
4. Yes. Reject H0, Chi-Sq = 10.89, critical Chi-Sq = 3.841
5. Yes. Reject H0, Chi-Sq = 16.28, critical Chi-Sq = 5.991
6. Yes. Reject H0, Chi-Sq = 45.214, critical Chi-Sq = 5.991
7. No. Accept H0, Chi-Sq = 4.185, critical Chi-Sq = 7.815
Answers 5D
1. Validation is like requiring a defendant in a trial to prove their innocence, rather than
assuming their innocence.
2. Ask a medical doctor, or other competent person, to identify the minimum increase
in average blood pressure that is of practical importance.
Answers 5E
1. Yes. Reject H0, Chi-Sq = 0.8, critical Chi-Sq = 2.733
2. No. Accept H0, F = 5.33, critical F = 6.388
3. No. Accept H0, F = 1.20, critical F = 9.01
4. No. Reject H0, Chi-Sq = 17.55, critical Chi-Sq = 16.919
5. Yes. Accept H0, Chi-Sq = 4.4, critical Chi-Sq = 11.070
6. No. Reject H0, Chi-Sq = 2124, critical Chi-Sq = 11.070. After.
Answers 6A
1. The coefficient of determination estimates the proportion of the variation in the
journey times (of all the classmates or colleagues) that is explained by the variation in
the journey distances.
Answers 6B
1. Every 1% increase in attendance tends to be associated with an average increase
in final exam mark of 0.4119%.
Students who don’t attend at all would be expected to obtain an average final exam
mark of 12.48%.
20.9% of the variation in the final exam marks, of all the students on this science
course, can be explained by the variation in their attendances.
Students with the same level of attendance will have final exam marks that differ by
18.27%, typically, from the average final exam mark of students with this level of
attendance.
2. Every extra day tends to produce an average growth of 0.265 cm in the length of
the grass.
Immediately after mowing, the grass is typically 2.95 cm long.
89.4% of the variation in the length of grass on all the suburban lawns, can be
explained by the variation in the number of days since mowing.
Lawns that were last mown on the same day will have lengths of grass that differ by
0.28 cm, typically, from the average length of grass on lawns mown on that day.
After 20 days it can be predicted that the grass would be 8.25 cm long, but this
prediction assumes that grass continues to grow at the same rate between 10 and 20
days, as has been observed between 2 and 10 days. (Extrapolation)
3. The predicted volume is 119.49 ml. Every 1 gram increase in the weight of an
orange tends to be associated with an average increase of 0.47 ml in the volume of
juice that it produces.
98% of the variation in the volume of juice produced by all the oranges can be
explained by the variation in their weight.
Oranges that weigh the same will have volumes of juice that differ by 7.30 ml, typically,
from the average volume of juice of oranges of this weight.
Answers 6C
1. Beta: reject, because p=0.006, so attendance is a useful predictor of final exam
mark. Alpha: accept, because p=0.131, so attendance could be directly proportional
to final exam mark. With 95% confidence, a student with 40% attendance would obtain
a final exam mark between zero and 67%.
2. Beta: reject, because p=0.010, so days is a useful predictor of length. Alpha: reject,
because p=0.002, so days is not directly proportional to length. With 95% confidence,
the average length of grass after 7 days, is between 4.4 cm and 5.2 cm.
3. Beta: reject, because p=0.007, so weight is a useful predictor of volume of juice.
Alpha: accept, because p=0.147, so the weight could be directly proportional to the
volume of juice. With 95% confidence, the average volume of juice in oranges that
weigh 200g, lies between 100 ml and 139 ml. With 95% confidence, the volume of
juice from a single orange that weighs 200g is between 83 ml and 156 ml.
Answers 6D
1. Every 1% increase in moisture content is associated with an average increase of
2.17 minutes in the drying time, provided that the relative humidity remains constant.
Every 1% increase in relative humidity is associated with an average increase of 0.975
minutes in the drying time, provided that the moisture content remains constant.
2. Age, by itself, is the best predictor.
3. Either fit a quadratic regression equation, or transform the minutes variable, with
something like log(minutes).
Answers 7A
1. (a)
µ represents the population grand average burning time for all the different colours.
α represents the effect of colour i, that is how much the average burning time of that
colour exceeds the grand average burning time.
ε represents the random error on that occasion, that is how much the burning time of
a single candle exceeds the average burning time for that colour.
(b) H0: Colour does not affect burning time. Accept H0, because p = 0.586. The data
do not prove, at the 5% level, that colour affects burning time.
2. (a)
µ represents the population grand average weight of clementines for both
supermarkets.
α represents the effect of supermarket i, that is how much the average weight of
clementines in that supermarket exceeds the grand average weight.
ε represents the random error on that occasion, that is how much the weight of a single
clementine exceeds the average weight of clementines in that supermarket.
(b) H0: Supermarket does not affect weight. Reject H0, because p = 0.001. The data
prove, at the 5% level, that supermarket affects weight. One supermarket stocks
clementines that are heavier, on average.
Answers 7B
1. (a)
μ is the grand average number of fish caught, averaged over all rivers and baits.
α is the river main effect, i.e. how many more fish are caught on average in that river.
β is the bait main effect, i.e. how many more fish are caught on average using that
bait.
η is the interaction effect, i.e. how many more fish are caught on average in that
particular river, using that particular bait, compared to what would be expected.
ε is the error, i.e. how many more fish were caught on that day, compared to the
average number of fish caught per day in that particular river, using that particular bait.
(b) Both bait and river affect the number of fish caught. Maggots are better bait than
pellets, and the Shannon is a better river than the Blackwater.
2. (a)
μ is the grand average time, averaged over all guitars and methods.
α is the guitar main effect, i.e. how much more time on average is required to play a
scale on that guitar.
β is the method main effect, i.e. how much more time on average is required to play a
scale using that method.
η is the interaction effect, i.e. how much more time on average is required for that
particular guitar-method combination, compared to what would be expected.
ε is the error, i.e. how much more time was taken on that occasion, compared to the
average time for that guitar-method combination.
(b) With a steel-string guitar, switching from using a plectrum to plucking the strings
increases the time required to play a scale, but with a nylon-string guitar, it does not
make much difference to the time.
3. (a)
μ is the grand average time, averaged over all paddles and girls.
α is the paddle main effect, i.e. how much longer the time is, on average, using that
paddle.
β is the girl main effect, i.e. how much longer the time is, on average, when travelled
by that girl.
η is the interaction effect, i.e. how much longer the time is, on average, for that
particular paddle-girl combination, compared to what would be expected.
ε is the error, i.e. how much longer the time was on that occasion, compared to the
average time for that particular paddle-girl combination.
(b) The paddle affects the time. The straight paddle is faster.
Answers 7C
1. (a) It is a covariate, because it is a continuous variable that is corrected for in the
model.
(b) Yes. p=0.026
(c) Put it in the oven for 4 hours. Next best, leave it on a storage heater overnight.
Answers 7D
1. (a) All the factors have text levels. No centre points are available.
(b) Application type is the only factor that affects the temperature.
(c) The combination of battery in with fan off, or the combination of multiple apps with
central heating on, are also possible explanations for high temperature.
Answers 8A
1. (a) Normality. The process appears normal, based on a visual appraisal of the
histogram. (b) Stability. The process appears to be stable, based on the values of
StDev(Overall) and StDev(Within). (c) Centrality. The process appears central since
the estimated process mean is very close to the target. (d) Capability. The process
appears capable since the Cpk value exceeds 1.33. (e) Performance. The process is
performing since the Ppk value exceeds 1.33.
2. (a) Normality. The process is normal, because the histogram will not change. (b)
Stability. The process is stable, because the values of StDev(Overall) and
StDev(Within) will not change. (c) Centrality. The process is no longer central since
the estimated process mean is not close to the new target. (d) Capability. The process
is not capable since the Cpk value will fall below 1.33. (e) Performance. The process
is not performing since the Ppk value will fall below 1.33.
Answers 8B
1. Possible causes include: an untrained worker misjudging the fill volume (men), a
less viscous batch of liquid (materials), damage to the filling head (machines),
incorrect sequencing of the steps involved in the fill cycle (methods), a change in the
voltage of the power supply used to pump the liquid into the bottles (milieu), or
incorrect measurement of the fill-volume due to instrument calibration issues
(measurements).
2. A C chart should be used, because it is required to monitor the number of scratches
(defects) and not the number of scratched screens (defectives). A sample size of three
screens is required so that the mean number of scratches per sample exceeds five.
Answers 8D
1. %Linearity = -5.5%. For every 100 km increase in the reference measurement, the
bias tends to decrease by 5.5 km.
2. The Number of Distinct Categories = 8. This indicates that the measurement system
is able to separate the parts into eight different groups. This is adequate for
distinguishing different parts because the number of groups is five or more.
SNR = 1.44605 / 0.25169 = 5.7. Since the SNR >3, the measurement system is
acceptable. It is able to distinguish between the parts that are presented to it.
%R&R = 25.17% which is acceptable because it is less than 30%. This measurement
system is able to distinguish between good and bad parts.
Appendix 2
Statistical Tables
Statistical Table 1
Normal Distribution: Cumulative Probability
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8079 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9983 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
The table provides the majority probability corresponding to any z-score, either
positive or negative.
To obtain the minority probability, subtract the tabulated value from one.
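The entries in this table can be reproduced from the error function in Python’s standard library, since the standard normal cumulative probability is Φ(z) = ½(1 + erf(z/√2)):

```python
from math import erf, sqrt

def phi(z):
    """Cumulative probability for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Reproduce a few entries of the table
print(round(phi(0.00), 4))   # 0.5
print(round(phi(1.00), 4))   # 0.8413
print(round(phi(1.96), 4))   # 0.975
print(round(phi(-1.00), 4))  # 0.1587 (the minority probability for z = 1.00)
```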
Statistical Table 2
5% Points of the t-distribution
df One-tailed Two-tailed
1 6.314 12.706
2 2.920 4.303
3 2.353 3.182
4 2.132 2.776
5 2.015 2.571
6 1.943 2.447
7 1.895 2.365
8 1.860 2.306
9 1.833 2.262
10 1.812 2.228
11 1.796 2.201
12 1.782 2.179
13 1.771 2.160
14 1.761 2.145
15 1.753 2.132
16 1.746 2.120
17 1.740 2.110
18 1.734 2.101
19 1.729 2.093
20 1.725 2.086
21 1.721 2.080
22 1.717 2.074
23 1.714 2.069
24 1.711 2.064
25 1.708 2.060
26 1.706 2.056
27 1.703 2.052
28 1.701 2.048
29 1.699 2.045
30 1.697 2.042
40 1.684 2.021
60 1.671 2.000
120 1.658 1.980
z 1.645 1.960
z = t∞
Statistical Table 3
Upper Percentage Points of the Chi-Square Distribution
Statistical Table 4
Upper 5% Points of the F-Distribution
Df 1 2 3 4 5
1 161.4 199.5 215.7 224.6 230.2
2 18.51 19.00 19.16 19.25 19.30
3 10.13 9.55 9.28 9.12 9.01
4 7.709 6.944 6.591 6.388 6.256
5 6.608 5.786 5.409 5.192 5.050
6 5.987 5.143 4.757 4.534 4.387
7 5.591 4.737 4.347 4.120 3.972
8 5.318 4.459 4.066 3.838 3.687
9 5.117 4.256 3.863 3.633 3.482
10 4.965 4.103 3.708 3.478 3.326
11 4.844 3.982 3.587 3.357 3.204
12 4.747 3.885 3.490 3.259 3.106
13 4.667 3.806 3.411 3.179 3.025
14 4.600 3.739 3.344 3.112 2.958
15 4.543 3.682 3.287 3.056 2.901
16 4.494 3.634 3.239 3.007 2.852
17 4.451 3.592 3.197 2.965 2.810
18 4.414 3.555 3.160 2.928 2.773
19 4.381 3.522 3.127 2.895 2.740
20 4.351 3.493 3.098 2.866 2.711
21 4.325 3.467 3.072 2.840 2.685
22 4.301 3.443 3.049 2.817 2.661
23 4.279 3.422 3.028 2.796 2.640
24 4.260 3.403 3.009 2.776 2.621
25 4.242 3.385 2.991 2.759 2.603
26 4.225 3.369 2.975 2.743 2.587
27 4.210 3.354 2.960 2.728 2.572
28 4.196 3.340 2.947 2.714 2.558
29 4.183 3.328 2.934 2.701 2.545
30 4.171 3.316 2.922 2.690 2.534
The top row represents the numerator degrees of freedom, and the left column
represents the denominator degrees of freedom.
Index
Acceptable quality level, 151
Accurate, 21, 154
Aliasing, Alias structure, 131
Alternative hypothesis, 68
Analysis of covariance, 125
Analysis of Variance, 112
ANCOVA, 125
ANOVA Assumptions, 114
ANOVA table, 114
AQL, 151
Assumptions (regression), 100
Attributes, 17
Audits, 121
Average outgoing quality limit, 153
Average, 19
Axial points, 137
Base rate, 32
Bayes’ Theorem, 32
Best subsets, 102
Between-samples variance, 113
Bias, 15, 155
Big data, 18
Bimodal, 9
Binary, 17
Binomial distribution, 44
Blocking, 110
C Chart, 148
Categorical data, 18
Causation, 88
Central Limit Theorem, 57
Centre point, 129, 137
Chi-square test of association, 75
Chi-square, 63, 82
Classical Probability, 24
Cluster Sampling, 16
Coded units, 131
Coefficient of Determination, 89
Coefficient of Variation, 21
Combinations, 31
Common causes, 145
Comparing Means, 64
Complement, 26
Composite Desirability, 139
Conditional Probability, 29
Confidence Interval for a Population Mean, 57
Confidence Interval for a Population Proportion, 59
Confidence Interval for sigma, 63
Confidence Intervals in Regression, 98
Confidence, 24
Confirmatory runs, 121
Confounding, 131
Consumer’s risk, 68, 152
Contingency table, 75
Continuous data, 17
Contour plot, 134
Control Limits, 145
Convenience Sampling, 17
Corner points, 137
Correlation Coefficient, 89
Correlation, 86
Covariate, 125
Critical value, 69
Cross-classified, 34
Crossed design, 116
Curvature, 95, 128
Curvilinear relationship, 104
CV, 21
Cyclical, 13
De Morgan’s Rule, 30
Defective, 147
Defects, 148
Degrees of freedom, 20
Delta, 80
Design of Experiments, 109
Deviation, 20
Difference between Proportions, 66, 74
Differences between Means, 74
Directly proportional, 91
Discontinuity, 95
Discrete, 17
Dispersion, 20