
A GENTLE INTRODUCTION

TO DATA ANALYSIS AND


STATISTICS FOR STUDENTS
TAKING MA334

Graham J. G. Upton and Dan Brawn




CONTENTS

1 First steps
1.1 Types of data
1.2 Sample and population
1.2.1 Observations and random variables
1.2.2 Sampling variation
1.3 Methods for sampling a population
1.3.1 The simple random sample
1.3.2 Cluster sampling
1.3.3 Stratified sampling
1.3.4 Systematic sampling
1.4 Oversampling and the use of weights

2 Summarising data
2.1 Measures of location
2.1.1 The mode
2.1.2 The mean
2.1.3 The trimmed mean
2.1.4 The Winsorized mean
2.1.5 The median
2.2 Measures of spread
2.2.1 The range
2.2.2 The interquartile range
2.3 Boxplot
2.4 Histograms
2.5 Cumulative frequency diagrams
2.6 Step diagrams
2.7 The variance and standard deviation
2.8 Symmetric and skewed data
2.9 Using weights

3 Probability and probability distributions
3.1 Probability
3.2 The rules of probability
3.3 Conditional probability and independence
3.4 The total probability theorem
3.5 Bayes’ theorem
3.6 Notation
3.7 Mean and variance of a probability distribution
3.8 Sample proportion and population probability
3.9 Combining means and variances
3.10 Discrete uniform distribution
3.11 Probability density function
3.12 The continuous uniform distribution

4 Estimation and confidence
4.1 Point estimates
4.1.1 Maximum likelihood estimation (mle)
4.2 Confidence intervals
4.3 Confidence interval for the population mean
4.3.1 The normal distribution
4.3.2 The Central Limit Theorem
4.3.3 Construction of the confidence interval
4.4 Confidence interval for a proportion
4.4.1 The binomial distribution
4.4.2 Confidence interval for a proportion (large sample case)
4.4.3 Confidence interval for a proportion (small sample)
4.5 Confidence bounds for other summary statistics
4.5.1 The bootstrap
4.6 Some other probability distributions
4.6.1 The Poisson and exponential distributions
4.6.2 The Weibull distribution
4.6.3 The chi-squared (χ²) distribution

5 Models, p-values, and hypotheses
5.1 Models
5.2 p-values and the null hypothesis
5.2.1 Comparing p-values
5.2.2 Link with confidence interval
5.3 p-values when comparing two samples
5.3.1 Do the two samples come from the same population?
5.3.2 Do the two populations have the same mean?

6 Comparing proportions
6.1 The 2 by 2 table
6.2 Some terminology
6.2.1 Odds, odds-ratios, and independence
6.2.2 Relative risk
6.2.3 Sensitivity, specificity, and related quantities
6.3 The R by C table
6.3.1 Residuals
6.3.2 Partitioning

7 Relations between two continuous variables
7.1 Scatter diagrams
7.2 Correlation
7.2.1 Testing for independence
7.3 The equation of a line
7.4 The method of least squares
7.5 A random dependent variable, Y
7.5.1 Estimation of σ²
7.5.2 Confidence interval for the regression line
7.5.3 Prediction interval for future values
7.6 Departures from linearity
7.6.1 Transformations
7.6.2 Extrapolation
7.6.3 Outliers
7.7 Distinguishing x and Y
7.8 Why ‘regression’?

8 Several explanatory variables
8.1 AIC and related measures
8.2 Multiple regression
8.2.1 Two variables
8.2.2 Collinearity
8.2.3 Using a dummy variable
8.2.4 The use of multiple dummy variables
8.2.5 Model selection
8.2.6 Interactions
8.2.7 Residuals
8.3 Cross-validation
8.3.1 k-fold cross-validation
8.3.2 Leave-one-out cross-validation (LOOCV)
8.4 Reconciling bias and variability
8.5 Shrinkage
8.5.1 Standardisation
8.6 Generalised linear models (GLMs)
8.6.1 Logistic regression
8.6.2 Loglinear models

9 Last words


PREFACE

This book, which will be published by Oxford University Press in June 2023,
was written with the students who will be taking MA334 very much in mind.
The book’s aim is to provide the would-be data scientist with a working idea
of the most frequently used tools of data analysis. R coding skills will be
taught on the module and will be essential for the assessed individual project.

CHAPTER 1

FIRST STEPS

This book assumes that you, the readers, are keen to become Data Scientists,
but may have limited knowledge of Statistics or Mathematics. Indeed, you
may have limited knowledge of what, exactly, this new subject called Data
Science is.
According to Wikipedia,
‘Data are individual facts, statistics, or items of information, often nu-
meric, that are collected through observation.’
So the first thing to learn about the word ‘Data’ is that it is a plural1 . The
same source gives us
‘Science is a systematic enterprise that builds and organizes knowledge in
the form of testable explanations and predictions about the world.’
So we see that the job of the Data Scientist is to form testable expla-
nations of a mass of information.
We will begin by looking at the various types of data and how those data
may have been obtained.

1 It is the plural of ‘Datum’, which is rarely used.


1.1 Types of data

There are three common types: qualitative, discrete and continuous.


Qualitative data (also referred to as categorical data) consist of de-
scriptions using names. For example:

‘Male’ or ‘Female’
‘Oak’, ‘Ash’, or ‘Birch’

In the examples above, the alternatives are just names and the variables
might be referred to as nominal variables. In other cases the categories
may have a clear ordering:
‘Poor’, ‘Adequate’, or ‘Abundant’
‘Small’, ‘Medium’, or ‘Large’
‘Cottage’, ‘House’, or ‘Mansion’
In these cases the variables are called ordinal variables.
Discrete data consist of numerical values in cases where we can make a
list of the possible values. Often the list is very short:

1, 2, 3, 4, 5 and 6.

Sometimes the list will be infinitely long, as, for example, the list
0, 0.5, 1, 1.5, 2, 2.5, 3.0, 3.5, 4.0, . . . .
Continuous data consist of numerical values in cases where it is not possi-
ble to make a list of the outcomes. Examples are measurements of physical
quantities such as weight, height and time.

The distinction between discrete and continuous data is often


blurred by the limitations of our measuring instruments. For
example, we may record our heights to the nearest centimetre,
in which case observations of a continuous quantity are being
recorded using discrete values.

1.2 Sample and population

Here the term ‘population’ refers to a complete collection of anything! It could
be pieces of paper, pebbles, tigers, sentence lengths, . . . Absolutely anything;
not just people. In the next paragraph, the population is ‘the heights of every
oak tree in the country’.
not just people. In the next paragraph, the population is ‘the heights of every
oak tree in the country’.
Suppose we want to know what proportion of oak trees are greater than
10 metres in height. We cannot examine every oak tree in the country (the
population). Instead we take a sample consisting of a relatively small number


of oak trees selected randomly from the population of oak trees. We find p,
the proportion in the sample that are greater than 10 metres in height. We
hope that our sample is representative of the population (we discuss how to
randomly sample efficiently later in the chapter). We therefore take p as our
estimate of the proportion of oak trees in the population that are greater than
10 metres in height.

1.2.1 Observations and random variables


The terms observation or observed value are used for the individual values
within our sample. Different samples will usually give rise to different sets of
observations. Imagine all the samples of oak trees that might have been
observed. Consider the first observation in the sample actually taken. That
tree could have been any one of the oak trees in the population (which consists
of every tree that might be observed). If the sample were taken at random,
then the height recorded could have been any of the heights of the oak trees in
the population. These heights vary and our choice has been made at random:
we have an observation on a random variable.
Some other examples are:

Random variable                     Range of values             Value observed
Weight of a person                  Perhaps 45 kg to 180 kg     80 kg
Speed of a car in the UK            0 to 70 mph (Legally!)      69 mph
Number of letters in a post box     Perhaps 0 to 100            23
Colour of a postage stamp           All possible colours        Tyrian plum

1.2.2 Sampling variation


Suppose that we have carefully taken a random sample from the population
of oak trees. Using the heights of the trees in our sample, we have happily
obtained an estimate, p, of the proportion greater than 10 metres in height.
That’s fine, but . . . if we take another sample, with equal care, we would not
expect to get exactly the same proportion (though we would expect it to be
similar). If we take several samples then we can get an idea of a range of
values within which the true population proportion lies. We will see in later
chapters that we can often deduce that range from a single sample.
Example 1.1

Suppose that a large population contains equal proportions of 1, 2, 3, 4, and


5. A random sample of just four observations is taken from the population.
The observations in this sample are 1, 5, 1, and 1. These have an average
of 2. The next three samples are (2, 4, 2, 2), (1, 4, 1, 5), and (4, 2, 2,
3) with averages 2.5, 2.75, and 2.75, respectively. A further 96 samples,
each containing 4 observations, are randomly generated. The 100 averages


are illustrated using a bar chart in Figure 1.1.

Figure 1.1 Bar chart showing the averages of 100 samples of size 4 taken
from a population containing an equal proportion of the numbers 1 to 5.
The true population average is 3, but the sample averages vary widely. A
sample with only four observations would never be used in practice. To get re-
liable information about a population, much larger samples are required. But
the sample will usually be much smaller than the population. For example, in
the UK, market research companies typically obtain information from about
1000 people in order to deduce the thoughts or characteristics of a population
of over 60 million.

As we saw in the example, different samples from the same


population are likely to have different means. Each mean is an
estimate of the population mean (the average of all the values
in the population).
The larger the sample size, the nearer (on average) will the
sample mean be to the unknown population mean that it aims
to estimate.

1.3 Methods for sampling a population

Although the data analyst is unlikely to be collecting the data, it is always


important to understand how the data were obtained. If the data consist of
information on the entire population of interest, then this is straightforward.
However, if the data represent information on a sample from the population,
then it is important to know how the sample was obtained. Some common
methods are briefly described in this section.

1.3.1 The simple random sample


Most sampling methods endeavour to give every member of the population
a known probability of being included in the sample. If each member of the
sample is selected by the equivalent of drawing lots, then the sample selected
is described as being a simple random sample.
One procedure for drawing lots is the following:
1. Make a list of all N members of the population.
2. Assign each member of the population a different number.
3. For each member of the population place a correspondingly numbered
ball in a bag.
4. Draw n balls from the bag, without replacement. The balls should be
chosen at random.
5. The numbers on the balls identify the chosen members of the population.
An automated version would use the computer to simulate the drawing of the
balls from the bag.
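In R, a sketch of the automated version is a single call to sample(); the values of
N and n below are just placeholders:

N <- 250                  # population size (illustrative value)
n <- 6                    # required sample size (illustrative value)
sample(1:N, size = n)     # a simple random sample, drawn without replacement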
The principal difficulty with the procedure is the first step: the creation of
a list of all N members of the population. This list is known as the sampling
frame. In many cases there will be no such central list. For example, suppose
it was desired to test the effect of a new cattle feed on a random sample of
British cows. Each individual farm may have a list of its own cows (Daisy,
Buttercup, . . .), but the government keeps no central list.
Indeed, for the country as a whole there is not even a 100% accurate list
of people (because of births, deaths, immigration and emigration).
But, because of the straightforward nature of the simple random sample,
it will be the first choice when it is practicable.

1.3.2 Cluster sampling


Even if there were a 100% accurate list of the British population, simple
random sampling would almost certainly not be performed because of expense.
It is easy to imagine the groans emitted by the pollsters on drawing two balls
from the bag corresponding to inhabitants of Land’s End and the Shetland
Isles. The intrepid interviewer would be a much travelled individual!
To avoid this problem, populations that are geographically scattered are
usually divided into conveniently sized regions. A possible procedure is then
1. Choose a region at random.
2. Choose individuals at random from that region.
The consequences of this procedure are that instead of a random scatter of
selected individuals there are scattered clusters of individuals. The selection
probabilities for the various regions are not equal, but would be adjusted to be
in proportion to the numbers of individuals that the regions contain. If there
are r regions, with the ith region containing Ni individuals, then the chance
that it is selected would be chosen to be Ni/N, where N = N1 + N2 + · · · + Nr.
The size of the chosen region is usually sufficiently small that a single in-
terviewer can perform all the interviews in that region without incurring huge
travel costs. In practice, because of the sparse population and the difficulties
of travel in the highlands and islands of Scotland, most studies of the British
population are confined to the region south of the Caledonian Canal.

1.3.3 Stratified sampling


Most populations contain identifiable strata, which are non-overlapping sub-
sets of the population. For example, for human populations, useful strata
might be ‘males’ and ‘females’, or ‘Receiving education’, ‘Working’ and ‘Re-
tired’, or combinations such as ‘Retired female’. From census data we might
know the proportions of the population falling into these different categories.
With stratified sampling, we ensure that these proportions are reproduced
by the sample. Suppose, for example, that the age distribution of the adult
population in a particular district is as given in the table below.
Aged under 40 Aged between 40 and 60 Aged over 60
38% 40% 22%
A simple random sample of 200 adults would be unlikely to exactly repro-
duce these figures. If we were very unfortunate, over half the individuals in the
sample might be aged under 40. If the sample were concerned with people’s
taste in music, then, by chance, the simple random sample might provide a
misleading view of the population.
A stratified sample is made up of separate simple random samples for
each of the strata. In the present case, we would choose a simple random
sample of 76 (38% of 200) adults aged under 40, a separate simple random
sample of 80 adults aged between 40 and 60, and a separate simple random
sample of 44 adults aged over 60.
Stratified samples exactly reproduce the characteristics of the strata and
this almost always increases the accuracy of subsequent estimates of popu-
lation parameters. Their slight disadvantage is that they are a little more
difficult to organise.
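For the example above, the stratum sample sizes are simply the census proportions
multiplied by the overall sample size; a quick sketch in R (the vector name is
illustrative):

strata_props <- c(under40 = 0.38, from40to60 = 0.40, over60 = 0.22)
strata_props * 200        # stratum sample sizes: 76, 80 and 44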

1.3.4 Systematic sampling


Suppose that we have a list of the N members of a population. We wish to
draw a sample of size n from that population. Let k be an integer close to
N/n. Systematic sampling proceeds as follows:
1. Choose one member of the population at random. This is the first member
of the sample.
2. Choose every kth individual thereafter, returning to the beginning of the


list when the end of the list is reached.

For example, suppose we wish to choose six individuals from a list of 250. A
convenient value for k might be 40. Suppose that the first individual selected
is number 138. The remainder would be numbers 178, 218, 8, 48 and 88.
If the population list has been ordered by some relevant characteristic, then
this procedure produces a spread of values for the characteristic – a type of
informal stratification.
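A minimal sketch of the scheme in R, reproducing the numbers of the example
(N = 250, n = 6, k = 40, first selection 138); in practice the starting point would
itself be drawn with sample():

N <- 250; n <- 6; k <- 40
start <- 138                              # in practice: sample(1:N, 1)
(start + (0:(n - 1)) * k - 1) %% N + 1    # 138 178 218 8 48 88, wrapping past the end of the list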

1.4 Oversampling and the use of weights

Using either strata or clusters, the sampler is dividing a population into


smaller sections. Often both strata and clusters are required, with the po-
tential result that some subsamples would include only a few individuals.
Small samples are inadvisable since, by chance, they give a false idea of the
subpopulations that they are supposed to represent.
The solution is oversampling. The sample size for the rare subpopulation
is inflated by some multiple M , so as to give a reliable view of that subpop-
ulation. Subsequent calculations will take account of this oversampling so as
to fairly represent the entire population.
CHAPTER 2

SUMMARISING DATA

Now we have a computer full of data, we have to decide what to do with it!
How should we report it? How should we present it to those who need to
know? In this chapter we present possible answers to these questions. We
apologise in advance for the number of specialised terms and their definitions.

2.1 Measures of location

2.1.1 The mode


The mode of a set of discrete data is the single outcome that occurs most
frequently. This is the simplest of the measures of location, but it is of limited
use. If there are two such outcomes that occur with equal frequency then there
is no unique mode and the data would be described as being bimodal; if there
were three or more such outcomes then the data would be called multimodal.
When measured with sufficient accuracy all observations on a continuous
variable will be different: even if John and Jim both claim to be 1.8 metres tall,
we can be certain that their heights will not be exactly the same. However, if
we plot a histogram (Section 2.4) of men’s heights, we will usually find that

it has a peak in the middle: the class corresponding to this peak is called the
modal class.
Example 2.1

A random sample of 50 place names was chosen from a gazetteer of place


names in north-west Wales. The numbers of letters in these place names are
summarised in the following table:

Length of
place name 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Frequency of
occurrence 3 5 5 5 6 8 9 3 1 1 2 1 0 1

In this sample, the most frequent length of place name is 10 (often names
starting Llan), which is therefore the mode. According to Wikipedia there
are over 600 places in Wales that start with ‘Llan’ which refers to the region
around a church.

2.1.2 The mean


This measure of location is simply the average value: the sum of the observed
values divided by the number of values in the sample. Unlike the mode, the
mean will usually not be equal to any of the observed values.
Suppose that the data set consists of the n values x1, x2, . . . , xn. The
sample mean, denoted by x̄, is given by:

    x̄ = (1/n)(x1 + x2 + · · · + xn) = (1/n) ∑_{i=1}^{n} xi.    (2.1)

The convenient capital sigma notation, ∑_{i=1}^{n}, states that summation is
needed over the index i from 1 to n.

If the data are summarised in the form of a frequency distribution, with m
distinct values, and with the value xj occurring fj times, then, since
∑_{j=1}^{m} fj = n,

    x̄ = (1/n) ∑_{j=1}^{m} fj xj.    (2.2)

This is just another way of writing Equation (2.1).¹ The formula would also be
used for grouped data, with xj being the mid-point of the jth group.
Example 2.1 (cont.)

The sum of the lengths of the 50 Welsh place names is 3 × 4 + 5 × 5 + · · · +


1 × 17 = 430 so the mean length is 430/50 = 8.6 letters.
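For readers following along in R, the mode and the mean can be read straight
off the frequency table (the vector names are purely illustrative):

len  <- 4:17                                          # possible place-name lengths
freq <- c(3, 5, 5, 5, 6, 8, 9, 3, 1, 1, 2, 1, 0, 1)   # frequencies from the table
len[which.max(freq)]            # the mode: 10
sum(freq * len)                 # total number of letters: 430
sum(freq * len) / sum(freq)     # the mean length: 430/50 = 8.6
weighted.mean(len, freq)        # the same mean via a built-in function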

2.1.3 The trimmed mean

Sometimes the data contains a number of outliers (values that are not typical
of the bulk of the data). For example, if a multi-millionaire takes up residence
in a small village, then the average income for the village inhabitants will be
greatly inflated and the resulting value would not be useful. A trimmed mean
avoids this problem by discarding a specified proportion of both the largest
and the smallest values.
Example 2.2

The numbers of words in the 18 sentences of Chapter 1 of A Tale of Two


Cities by Charles Dickens are as follows:
118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33.

The famous first sentence ("It was the best of times, it was the worst of
times, ...") is unusually long, but the thirteenth sentence is even longer. The
rest of the sentences have more usual lengths. The mean is 55.5, which is
greater than 13 of the 18 values.
A 20% trimmed mean would ignore the bottom 10% and the top 10% of
values. Taking 10% as meaning just one of the 18 values, the trimmed mean
is in this case the average of the central 16 values: 47.8. This is much more
representative, being larger than 10 values and less than 8.

2.1.4 The Winsorized mean


In this alternative to the trimmed mean, the mean is calculated after replacing
the extreme values by their less extreme neighbours. When data are collected
simultaneously on several variables (this is called multivariate data) and one
variable appears to contain unusual values, Winsorizing could be preferable
to trimming because it will retain the correct number of ‘observed’ values.
However, the Winsorizing must be reported as part of the data analysis.

1 All that is happening is that y + y + y, say, is being calculated as 3 × y.



Example 2.2 (cont.)

A 20% Winsorized mean for the previous data would use the values

16, 16, 25, 27, 29, 33, 35, 39, 41, 42, 49, 51, 54, 58, 68, 80, 118, 118

to report a value of 49.9.
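In R, a sketch of both calculations. The trim argument of mean() removes the
stated proportion from each end, so trim = 0.1 (one value from each end of the
18) matches the ‘20% trimmed’ figure; base R has no Winsorized mean, so the
replacement of the extremes is done by hand:

dickens <- c(118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33)
mean(dickens)                      # 55.5, inflated by the two very long sentences
mean(dickens, trim = 0.1)          # 47.8, the trimmed mean
w <- sort(dickens)
w[1] <- w[2]                       # replace the smallest value by its neighbour
w[length(w)] <- w[length(w) - 1]   # and the largest by its neighbour
mean(w)                            # 49.9, the Winsorized mean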

2.1.5 The median


The word ‘median’ is just a fancy name for ‘middle’. It is a useful alternative
(because it is easier to explain) to the trimmed mean (or the Winsorized
mean) when there are outliers present.
With n observed values arranged in order of size, the median is calculated
as follows. If n is odd and equal to (2k + 1), say, then the median is the
(k + 1)th ordered value. If n is even and equal to 2k, say, then the median is
the average of the kth and the (k + 1)th ordered values.
Example 2.2 (cont.)

Re-ordering the 18 sentence lengths gives

13, 16, 25, 27, 29, 33, 35, 39, 41, 42, 49, 51, 54, 58, 68, 80, 118, 221.

The median length is 41.5 words.

2.2 Measures of spread

2.2.1 The range


This usually refers to a pair of values: the smallest value and the largest value.
Example 2.2 (cont.)

The range of sentence lengths is from 13 to 221, a span of 221 − 13 = 208


words.

2.2.2 The interquartile range


The median (Section 2.1.5) is the value that subdivides ordered data into two
halves. The quartiles go one step further by using the same rules to divide
each half into halves (to form quarters).
Denoting the lower quartile and upper quartile by Q1 and Q2 , respec-
tively, the inter-quartile range is reported either as the pair of values (Q1 , Q2 ),
or as the difference Q2 − Q1 .

Quartiles are a special case of quantiles. Quantiles subdivide the ordered


observations in a sample into equal-sized chunks. Another example of quan-
tiles is provided by the percentiles, which divide the ordered data into 100
sections. Quartiles are useful for the construction of the boxplots introduced
in the next section.
Example 2.2 (cont.)

Using R, the lower and upper quartiles were reported to be 30 and 57, re-
spectively. Different computer packages may report slightly different values.
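The values quoted above come from the default settings of R’s quantile()
function; a minimal, self-contained sketch:

dickens <- c(118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33)
median(dickens)                    # 41.5
quantile(dickens, c(0.25, 0.75))   # lower and upper quartiles: 30 and 57
IQR(dickens)                       # their difference: 27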

2.3 Boxplot

This is a diagram that summarises the variability in the data. It is also known
as a box-whisker diagram. The diagram consists of a central box with a
‘whisker’ at each end.
The ends of the central box are the lower and upper quartiles, with either
a central line or a notch indicating the value of the median. The simplest
form of the diagram has two lines (the whiskers); one joining the lowest value
to the lower quartile, and the other joining the highest value to the upper
quartile.
A refined version has the length of the whiskers limited to some specified
multiple of the inter-quartile range. In this version any more extreme values
are shown individually.
Box-whisker diagrams provide a particularly convenient way of comparing
two or more sets of values. In this case, a further refinement may vary the
widths of the boxes to be proportional to the numbers of observations in each
set.
Example 2.2 (cont.)

Figure 2.1 Boxplots comparing the sentence lengths of sentences in the


first chapters of works by Charles Dickens and Jeffrey Archer.

An interesting contrast to the lengths of sentences in the Dickens work is


provided by the lengths of the seventeen sentences in Chapter 1 of Jeffrey
Archer’s Not a Penny More, Not a Penny Less:

8, 10, 15, 13, 32, 25, 14, 16, 32, 25, 5, 34, 36, 19, 20, 37, 19.

When the whisker lengths are limited to a maximum of 1.5 times the inter-
quartile range, the result is Figure 2.1.

Figure 2.2 The boxplots of Figure 2.1 presented using a log scale.

For those with keen eyesight it may be apparent that the median of the
Archer data is less than the lower quartile of the Dickens data, while the
median of the Dickens data is greater than the upper quartile of the Archer
data. We can make the difference between the two data sets much more
apparent by using a log scale as shown in Figure 2.2.
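Figures 2.1 and 2.2 can be sketched with the base boxplot() function, whose
default whisker limit is exactly 1.5 times the inter-quartile range; the log = "y"
argument produces the log scale of Figure 2.2:

dickens <- c(118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33)
archer  <- c(8, 10, 15, 13, 32, 25, 14, 16, 32, 25, 5, 34, 36, 19, 20, 37, 19)
boxplot(list(Dickens = dickens, Archer = archer),
        ylab = "Sentence length (words)")              # as in Figure 2.1
boxplot(list(Dickens = dickens, Archer = archer),
        ylab = "Sentence length (words)", log = "y")   # as in Figure 2.2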

2.4 Histograms

Bar charts are not appropriate for continuous data (such as measurements
of weight, time, etc.). A histogram is a diagram that also uses rectangle areas
to represent frequency. It differs from the bar chart in that the rectangles may
have differing widths, but the key feature is that, for each rectangle,

Area is proportional to class frequency

Some computer packages attempt to make histograms three-


dimensional. Avoid these if you can, since the effect is likely
to be misleading.

Example 2.3

As glaciers retreat they leave behind rocks known as ‘erratics’ because they
are of a different type to the rocks normally found in the area. In Scotland
many of these erratics have been used by farmers in the walls of their fields.
One measure of the size of these erratics is the cross-sectional area visible on
the outside of a wall. The areas of 30 erratics (in cm2 ) are given below:
216, 420, 240, 100, 247, 128, 540, 594, 160, 286, 216, 448, 380, 509, 90,
156, 135, 225, 304, 144, 152, 143, 135, 266, 286, 154, 154, 386, 378, 160
Area is a continuous variable, so a histogram is appropriate. Figure 2.3
uses six intervals, each spanning 100 cm2 .

Figure 2.3 Histogram of the data on the cross-sections of erratics, using


equal-width ranges.

The interpretation of the y-axis is that there are, for example, 8 observa-
tions with values between 200 and 300 cm2 . Of course the figure is not lying,
but it is, perhaps, giving a false impression of the distribution of the areas.
A better idea is provided by collecting the data into decades (90-99, 100-109,
etc.). The result is Table 2.1.

Table 2.1 Summary of the cross-section data for the erratics. Here ‘9-’ means an
observation in the range 90-99.
Decade 9- 10- 12- 13- 14- 15- 16- 21- 22- 24-
Count 1 1 1 2 2 4 2 2 1 2
Decade 26- 28- 30- 37- 38- 42- 44- 50- 54- 59-
Count 1 2 1 1 2 1 1 1 1 1

The table shows that areas around 150 cm2 are particularly frequent. To
bring this out in the histogram requires unequal intervals. For Figure 2.4 we
used ranges with the following end-points: (0, 125, 150, 200, 250, 400, 600).

Figure 2.4 A more informative histogram of the data on cross-sections


of erratics.

It is always a good idea to experiment with varying diagrams.
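Both histograms can be sketched in R (the vector name areas is illustrative);
note that with unequal break-points hist() switches to a density scale, which is
what keeps area proportional to frequency:

areas <- c(216, 420, 240, 100, 247, 128, 540, 594, 160, 286, 216, 448, 380, 509, 90,
           156, 135, 225, 304, 144, 152, 143, 135, 266, 286, 154, 154, 386, 378, 160)
hist(areas, breaks = seq(0, 600, by = 100), main = "",
     xlab = "Cross-sectional area (sq cm)")               # Figure 2.3: equal widths
hist(areas, breaks = c(0, 125, 150, 200, 250, 400, 600), main = "",
     xlab = "Cross-sectional area (sq cm)")               # Figure 2.4: unequal widths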

2.5 Cumulative frequency diagrams

These are an alternative form of diagram that provide answers to questions


such as “What proportion of the data have values less than x?”. In such
a diagram, cumulative frequency on the ‘y-axis’ is plotted against observed
value on the ‘x-axis’. The result is a graph in which, as the x-coordinate
increases, the y-coordinate cannot decrease.
With grouped data, such as in the following example, the first step is to
produce a table of cumulative frequencies. These are then plotted against
the corresponding upper class boundaries. The successive points may be con-
nected either by straight-line joins (in which case the diagram is called a
cumulative frequency polygon) or by a curve (in which case the diagram
is called an ogive).

Example 2.4

In studying bird migration, a standard technique is to put coloured rings


around the legs of the young birds at their breeding colony. Table 2.2, which
refers to recoveries of razorbills, summarises the distances (measured in miles)
between the recovery point and the breeding colony.

Table 2.2 Recovery distances of ringed razorbills.


Distance x (miles)      Freq.   Cum. freq.
x < 100                    2         2
100 ≤ x < 200              2         4
200 ≤ x < 300              4         8
300 ≤ x < 400              3        11
400 ≤ x < 500              5        16
500 ≤ x < 600              7        23
600 ≤ x < 700              5        28
700 ≤ x < 800              2        30
800 ≤ x < 900              2        32
900 ≤ x < 1000             0        32
1000 ≤ x < 1500            2        34
1500 ≤ x < 2000            0        34
2000 ≤ x < 2500            2        36

Figure 2.5 A cumulative frequency diagram of recovery distances of


razorbills.

The cumulative frequency polygon (Figure 2.5) shows that 50% of the ra-
zorbills had travelled more than 500 miles.
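A sketch of the cumulative frequency polygon in R, plotting the cumulative
counts of Table 2.2 against the upper class boundaries (the vector names are
illustrative):

upper <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 2500)
freq  <- c(  2,   2,   4,   3,   5,   7,   5,   2,   2,    0,    2,    0,    2)
plot(c(0, upper), c(0, cumsum(freq)), type = "o",
     xlab = "Recovery distance (miles)", ylab = "Cumulative frequency")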

2.6 Step diagrams

A cumulative frequency diagram for ungrouped data is sometimes referred to


as a step polygon or step diagram because of its appearance.
Example 2.5

In a compilation of Sherlock Holmes stories, the 13 stories that comprise


The Return of Sherlock Holmes have the following numbers of pages:
13.7, 15.5, 16.4, 12.8, 20.8, 13.7, 11.2, 13.7, 11.7, 15.0, 14.1, 14.8, 17.1.
The lengths are given to the nearest tenth of a page. Treating the values as
being exact, we use them as the boundaries in a cumulative frequency table.
We first need to order the values:
11.2, 11.7, 12.8, 13.7, 13.7, 13.7, 14.1, 14.8, 15.0, 15.5, 16.4, 17.1, 20.8.
The resulting diagram is Figure 2.6.

Figure 2.6 A step diagram illustrating the lengths of stories in The


Return of Sherlock Holmes.

Notice that the cumulative frequencies ‘jump’ at each of the observed val-
ues. It is this that gives rise to the vertical strokes in the diagram. The
horizontal strokes represent the ranges between the ordered values.
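For ungrouped data the step diagram is, apart from the scale of the vertical
axis, the empirical cumulative distribution function, so a quick sketch in R is:

pages <- c(13.7, 15.5, 16.4, 12.8, 20.8, 13.7, 11.2, 13.7, 11.7, 15.0, 14.1, 14.8, 17.1)
plot(ecdf(pages), main = "",
     xlab = "Story length (pages)", ylab = "Cumulative proportion")
# multiplying the proportions by length(pages) would give cumulative frequencies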

2.7 The variance and standard deviation

Denoting a sample of n values by x1 , x2 , . . . , xn , and their average (the sample


mean) by x̄, the variability is captured by the differences of the values from
their mean: x1 − x̄, x2 − x̄, . . ..

To avoid positive differences cancelling out negative differences, we work
with the squared values: (x1 − x̄)², (x2 − x̄)², . . .,² and, to get a measure of
their overall variability, we work with their sum:

    (x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)² = ∑_{i=1}^{n} (xi − x̄)².

The magnitude of this sum is affected by the number of terms in the summa-
tion. To get a proper idea of the average variability we need to take the size
of n into account. This we do by dividing by (n − 1) to obtain the quantity
s², given by³:

    s² = (1/(n − 1)) ∑_{i=1}^{n} (xi − x̄)².    (2.3)

This has a minimum value of zero when all the values are equal.
The units of variance are squared units. To obtain an idea of the variability
in terms of the units of the observations, all that is needed is to take the
square root. This quantity, denoted by s, is called the standard deviation
(or sample standard deviation).

As a general rule you can expect most data to lie in the range
(x̄ − 2s, x̄ + 2s). Observations less than x̄ − 3s, or greater than
x̄ + 3s will be very unusual. If any occur, then it would be
wise to check that they have been recorded correctly. Even with
computers, transcription errors can occur.

Example 2.2 (cont.)

To investigate the general rule given above, for one final time we look at
the sentence lengths. For the Archer data, (x̄ − 2s, x̄ + 2s) corresponds to
the interval (0.85, 41.50), which does indeed contain all the Archer sentence
lengths. For the Dickens data, because of two extraordinarily long
sentences, the interval is the bizarre (-40.94, 151.95). The lower value is
preposterous and alerts us to the unusual nature of the data. Only the longest
sentence falls outside this range.
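A hedged check of these figures in R (the intervals in the comments are those
reported above):

dickens <- c(118, 39, 27, 13, 49, 35, 51, 29, 68, 54, 58, 42, 16, 221, 80, 25, 41, 33)
archer  <- c(8, 10, 15, 13, 32, 25, 14, 16, 32, 25, 5, 34, 36, 19, 20, 37, 19)
mean(archer)  + c(-2, 2) * sd(archer)    # roughly (0.85, 41.50)
mean(dickens) + c(-2, 2) * sd(dickens)   # roughly (-40.94, 151.95)
sd(dickens)^2                            # equals var(dickens), the sample variance s^2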

2 We cannot work with the sum of the differences as this is zero. We could work with the
sum of the absolute differences, but that leads to intractable mathematics!
3 The reason for using (n − 1) rather than n, is that we want s2 , which is a sample quantity,

to be an unbiased estimate of the same quantity in the population being sampled.



2.8 Symmetric and skewed data

If a population is approximately symmetric then a reasonable sized sample


will have mean and median having similar values. Typically their values will
also be close to that of the mode of the population (if there is one!).
A population that is not symmetric is said to be skewed. A distribution
with a long ‘tail’ of high values (like the Dickens data) is said to be positively
skewed, in which case the mean is usually greater than the mode or the
median. If there is a long tail of low values then the mean is likely to be
the lowest of the three location measures and the distribution is said to be
negatively skewed.

2.9 Using weights

In Section 1.4 we noted that surveys often include oversampling of small sub-
populations in order to gain reliable information concerning the whole popu-
lation and that subsequent calculations would take that into account. Let’s
see how that works.
Suppose that a large organisation has 100 managers and 19,900
workers. We want to sample 50 individuals in order to estimate the average
salary of the employees.
If we take a simple random sample of the 20,000 individuals, then there is a
good chance that none of the managers will be included. A better plan would
be to take a stratified sample (Section 1.3.3) made up of 10 managers and
40 workers. In this sample one-fifth are managers, whereas in the population
only one in two hundred is a manager. In order to reflect this, each manager’s
salary would be divided by 40 before the sampled values are averaged4 .
Suppose that, associated with observation xi there is the weight wi. Alge-
braically we are replacing

    x̄ = (1/n) ∑_{i=1}^{n} xi    by    x̃ = ∑_{i=1}^{n} wi xi / ∑_{i=1}^{n} wi,

where wi is the weight associated with xi. For the managers wi was 1/40,
with the weights for the workers staying at 1.⁵
Similar adjustments would be used for other calculations.
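A minimal sketch of the weighted calculation in R, using entirely hypothetical
salary figures (only the relative sizes of the weights matter):

salaries <- c(rep(90000, 10), rep(25000, 40))   # hypothetical: 10 managers, then 40 workers
weights  <- c(rep(1/40, 10),  rep(1, 40))       # managers were oversampled, so carry weight 1/40
weighted.mean(salaries, weights)                # weighted estimate of the average salary
mean(salaries)                                  # the unweighted mean, distorted by the oversampling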

4 1/5 divided by 1/200 is 40.


5 All
that matters is the sizes of the weights relative to each other: multiplying every weight
by 1000, for example, would not alter the value obtained.
CHAPTER 3

PROBABILITY AND PROBABILITY


DISTRIBUTIONS

3.1 Probability

Although the data analyst is unlikely to need to perform complex


probability calculations, a general understanding of probability is
certainly required.

Suppose we have an ordinary six-sided die. When we roll it, it will show
one of six faces.

Suppose we roll the die six times. Using numbers, as they are easier to
read, we get: 1, 5, 1, 1, 2, and 4. Have we got one of each face? Certainly not:
by chance (actually using a random number generator) we find that half of the
numbers are 1s and there are no 3s or 6s. Since we are using the computer,
we can find out what happens as we ‘roll the die’ many more times (see Table
3.1).

Table 3.1 The results of ‘rolling a 6-sided die’ (using a computer)


Number of observations Proportion showing a particular face
1 2 3 4 5 6
6 0.500 0.167 0.000 0.167 0.167 0.000
60 0.133 0.283 0.100 0.183 0.167 0.133
600 0.162 0.195 0.160 0.153 0.163 0.167
6000 0.168 0.174 0.166 0.172 0.165 0.156
60,000 0.169 0.166 0.167 0.165 0.168 0.165

If the die is fair, then, on average, and in the very long run, each face will
appear on exactly 1/6 of occasions. After 60,000 repeats we are very close to
the limiting value1 .

It is often helpful to imagine probability as being the limiting


value of the proportion approached over a long series of repeats
of identical experiments.
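A sketch of the simulation behind Table 3.1 in R (the seed is arbitrary, chosen
only to make the run reproducible):

set.seed(334)                                        # arbitrary seed
rolls <- sample(1:6, size = 60000, replace = TRUE)   # 60,000 'rolls' of a fair die
round(table(rolls) / length(rolls), 3)               # proportions settling towards 1/6 = 0.167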

We start with some formal definitions and notation, using as an example


the result of a single throw of a fair six-sided die:

The sample space, S, is the set of all possible outcomes of the situation
under investigation.
In the example, that was the numbers 1 to 6.

An event, E, is a possible outcome, or group of outcomes, of special


interest.
For example, obtaining a 6.

Probability ranges from 0 to 1:

An event or value that cannot occur has probability 0.


Obtaining a 7.

An event or value that is certain to occur has probability 1.


Obtaining any number in the range 1 to 6 inclusive.

All other events have probabilities between 0 and 1.


For example, obtaining a 6.

1 The value that we would get after an infinite number of repeats (supposing we were still
alive!).

An event or value that has an equal chance of occurring or not occurring


has probability 0.5.
Obtaining an odd number.
Diagrams that help with visualising the relation between events are called
Venn diagrams.2 Figure 3.1 is a Venn diagram illustrating the sample space
for the die-rolling example. The sample space is subdivided into six equal-
sized parts corresponding to the six possible outcomes. The highlighted area
is one-sixth of the whole: P(Die shows a 5) = 1/6.

Figure 3.1 Venn diagram illustrating the sample space S and its
subdivision into six equi-probable parts including the event ‘obtaining
a 5’.

To handle probabilities involving more than one event we need some more
notation:

A∪B A union B At least one of events A and B occurs.

A∩B A intersection B Events A and B both occur.

Figure 3.2 uses Venn diagrams to illustrate the union and intersection ideas.
Suppose that the probability that an event X occurs is denoted by P(X).
Then, since the intersection, A ∩ B, is part of both A and B, we can see that

P(A) + P(B) = P(A ∪ B) + P(A ∩ B). (3.1)

Rearranging, this means that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B). (3.2)

2 John Venn (1834-1923) was a Cambridge lecturer whose major work, The Logic of Chance,
was published in 1866. It was in this book that he introduced the diagrams that now bear
his name.

Figure 3.2 Venn diagrams illustrating events A, B, their union A ∪ B,


and their intersection A ∩ B.

Example 3.1

Continuing with the fair six-sided die, define the events A and B as follows:
A : The die shows a multiple of 2. 2, 4, or 6
B : The die shows a multiple of 3. 3 or 6
Thus P(A) = 3/6 and P(B) = 2/6.
Combining the events:
A ∪ B: The die shows a multiple of 2 or 3 (or both). 2, 3, 4, or 6
A ∩ B: The die shows both a multiple of 2 and a multiple of 3. 6
Thus P(A ∪ B) = 4/6 and P(A ∩ B) = 1/6.
Alternatively, using Equation (3.2), P(A ∪ B) = 3/6 + 2/6 − 1/6 = 4/6.

3.2 The rules of probability

Addition rule: If events are mutually exclusive, then the prob-


ability that one or other occurs is the sum of the probabilities
of the individual events.

If events C and D are mutually exclusive, then they cannot occur


simultaneously, which implies that P(C ∩ D) = 0 (see Figure 3.3). The rule is
a consequence of substituting zero for the intersection term in Equation (3.2).

Example 3.2

We roll a fair six-sided die that has sides numbered 1 to 6. The die can
only show one side at a time, so the probability of occurrence for the event
‘the die shows either a six or a one on top’ is P(Die shows a six) + P(die
shows a one) = 1/6 + 1/6 = 1/3.

Figure 3.3 Events C and D have no intersection: they are mutually


exclusive.

Multiplication rule: If events are mutually independent (i.e.


the probability of an event occurring is the same, regardless of
what other events occur), then the probability that all the events
occur is the product of their separate probabilities.

Example 3.3

Suppose we now roll a fair six-sided die twice. On the first roll we are
equally likely to obtain any of 1 to 6, so that P(6) = 1/6. The same is true
for the second roll: P(6) = 1/6. The outcomes of the rolls are independent of
one another, so the probability that we obtain 6 on both rolls is 1/6 × 1/6 = 1/36.
Figure 3.4 illustrates the situation.

Figure 3.4 With a fair die, the probability of two sixes is 1/36.

3.3 Conditional probability and independence

The probability that we associate with the occurrence of an event is always


likely to be influenced by any prior information that we have available. Sup-
pose, for example, that I see a man lying motionless on the grass in a nearby
park and am interested in the probability of the event “the man is dead”. In
the absence of other information a reasonable guess might be that the proba-
bility is one in a million. However, if I have just heard a shot ring out, and a
suspicious looking man with a smoking revolver is standing nearby, then the
probability would be rather higher.

The probability that the event B occurs (or has occurred) given
the information that the event A occurs (or has occurred) is
written as P(B|A).

The quantity B|A is read as “B given A” and P(B|A) is described as a


conditional probability since it refers to the probability that B occurs (or
has occurred) conditional on the event that A occurs (or has occurred).
Example 3.4

Continuing with the example of a single throw of a fair six-sided die, define
the events A and B as follows:
A: An odd number is obtained.
B: The number 3 is obtained.
If we do not know that A has occurred, then P(B) = 1/6, as illustrated:

Now suppose that we know that the roll was an odd number. The situation
has become:

We see that P(B|A) = 1/3. Originally, we were choosing from the entire
sample space (1, 2, 3, 4, 5, and 6), but we are now choosing from the subset
(1, 3, and 5).

If it is known that event A occurs, then, as the previous example illustrated,


the question is ‘What proportion of the occasions when event A occurs, does
event B also occur?’. We define the answer to this question as the probability
of event B conditional on event A , thus:

    P(B|A) = P(A ∩ B) / P(A),    (3.3)

or, cross-multiplying,

P(A ∩ B) = P(B|A) × P(A). (3.4)

Example 3.5

An electronic display is equally likely to show any of the digits 1, 2, . . . , 9.


Thus:
P(A), that it shows an odd number (1, 3, 5, 7, or 9), is 5/9.
P(B), that it shows a prime number (2, 3, 5, or 7), is 4/9.
P(A ∩ B), that it shows an odd prime number (3, 5, or 7), is 3/9.
Suppose that we are told that the number shown is odd. There are five
equally likely odd numbers, so the conditional probability of a prime number,
given this information, is 3/5. Alternatively using Equation (3.3):

    P(B|A) = P(A ∩ B) / P(A) = (3/9) / (5/9) = 3/5.

Figure 3.5 Venn diagram illustrating the prime number example.

One final note:

If event B is independent of event A, then P(B|A) = P(B).


If P(B|A) = P(B), then event B is independent of event A.

The assumption of independence is central to many data analysis methods.


Example 3.6

Now consider again the rolling of a die with two new events:
C: A multiple of 2 is obtained.
D: A multiple of 3 is obtained.

So the probability of the event D is 2/6, while the conditional probability of
D, given that C has occurred, is 1/3. Since 2/6 = 1/3, we can deduce that the two
events are independent: knowing that an even number has occurred does not
make it any more, or less, likely that some multiple of 3 has occurred.

3.4 The total probability theorem

The total probability theorem amounts to the statement that the whole is
the sum of its parts. A simple illustration of the general idea is provided by
Figure 3.6.

Figure 3.6 An event is the sum of its parts.

It may help to think of B as a potato, with A being a 5-compartment potato


slicer. The five compartments have different widths because the events A1 ,
A2 , · · · , A5 are not equally likely.
The potato may be sliced into as many as five parts, though in this case
there are just four: the part falling in the first compartment, the part falling
in the second compartment, and the parts falling into the third and fourth
compartments. The algebraic equivalent of reassembling the potato from its
component parts is:

P(B) = P(B ∩ A1 ) + P(B ∩ A2 ) + · · · + P(B ∩ A5 ).

In this case, since the fifth compartment is empty, P(B ∩ A5 ) = 0.


Using Equation (3.4) this becomes:

P(B) = {P(B|A1 ) × P(A1 )} + {P(B|A2 ) × P(A2 )} + · · · + {P(B|A5 ) × P(A5 )}.

Generalising to the case where A has n categories we get the theorem:


    P(B) = ∑_{i=1}^{n} P(B ∩ Ai) = ∑_{i=1}^{n} P(B|Ai) × P(Ai).    (3.5)

Example 3.7

It is known that 40% of students are good at Physics. Of those students,


80% are also good at Mathematics. Of those students who are not good at
Physics, only 30% are good at Mathematics. We wish to find the overall
proportion that are good at Mathematics.
The information is collected together in Figure 3.7:

Figure 3.7 The proportions who are good or not good at Physics, and
the proportions of those groups that are good at Mathematics.

Here we take A1 to be the event ’Good at Physics’, with A2 being ‘Not good
at Physics’. We require P(B), where B is the event ’Good at Mathematics’.
We are told that P(A1 ) = 0.4, which implies that P(A2 ) = 0.6, since these are
the only possibilities. We are also told that P(B|A1) = 0.8 and P(B|A2) = 0.3.
Using Equation (3.5), we have

P(B) = {P(A1 ) × P(B|A1 )} + {P(A2 ) × P(B|A2 )}


= (0.4 × 0.8) + (0.6 × 0.3)
= 0.32 + 0.18 = 0.50

Thus half the students do well in Mathematics.

3.5 Bayes’ theorem

In introducing the idea of conditional probability we effectively asked the


question
“Given that event A has occurred in the past, what is the probability that
event B will occur?”.
We now consider the following “reverse” question:
“Given that the event B has just occurred, what is the probability that it
was preceded by the event A?”.
In the previous section we imagined a potato being sliced into segments.
Now suppose those segments are placed into a black bag and a skewer pierces
one segment. The probability of a particular segment being skewered will be
equal to the size of that segment when regarded as a proportion of the whole
potato. The biggest piece will have the highest probability.

Figure 3.8 The biggest segment is the most likely to occur.

Reverting to the general case, the whole potato represented P(B), with
segment i representing P(B ∩ Ai ). So, if we know that the event B has
occurred, then the probability that that was a consequence of event Ai having
occurred is given by P(B ∩ Ai )/P(B). Using Equation (3.5), we have Bayes’
theorem3 :

    P(Ai|B) = P(B ∩ Ai) / P(B) = [P(B|Ai) × P(Ai)] / [∑_{i=1}^{n} P(B|Ai) × P(Ai)].    (3.6)

Example 3.8

A test has been devised to establish whether a patient has a particular


disease. If the patient has the disease, then there is a 2% chance that the test
will not detect it (a false negative). If the patient does not have the disease,
there is nevertheless a 2% chance that the test will report that the patient
has the disease (a false positive). Suppose that 4% of the population have
the disease. The question of interest to the patient is,
If I am diagnosed as having the disease, what is the probability that I do
indeed have it?
It is easy to suppose that, since the test is 98% accurate, the probability
that the patient has the disease must be 98%. This is not the case. We start
by arranging the information in a table. For convenience, suppose that we
have a population of 10,000 people. Then 4% = 400 have the disease, and
9600 do not. Of the 400 with the disease, 2% = 8 test negative and 392 test
positive. Of the 9600 without the disease, 2% = 192 test positive. These
numbers are set out in the table:
                          Positive   Negative    Total
Has disease                    392          8      400
Does not have disease          192       9408     9600
Total                          584       9416    10000
3 The Reverend Thomas Bayes (1701-1761) was a Nonconformist minister in Tunbridge
Wells, Kent. He was elected a Fellow of the Royal Society in 1742. The theorem was
contained in an essay that did not appear until after his death and was largely ignored at
the time.

We see that of the 584 diagnosed positive, only 392 do actually have the
disease.
Formally, define the events B, A1 and A2 as follows:

B: Tests positive
A1 : Has the disease.
A2 : Does not have the disease.

We need to calculate P(A1 |B). To evaluate this we first calculate P(B).


Using Equation (3.5):

P(B) = 0.04 × 0.98 + 0.96 × 0.02 = 0.0392 + 0.0192 = 0.0584.

Using Equation (3.6), we find that


    P(A1|B) = 0.0392/0.0584 = 0.67.
Despite the high accuracy of the test, only about two-thirds of patients who
test positive actually have the disease. This is because, although most people
do not have the disease, and only a small proportion of these are false positives,
a small proportion of a large number can be rather a lot! In this case ‘rather
a lot’ is 192.
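The same calculation written in R directly from Equations (3.5) and (3.6); the
numbers are those of the example:

p_disease <- 0.04                        # P(A1), the prevalence
p_pos_given_disease    <- 0.98           # P(B|A1): 2% of cases are false negatives
p_pos_given_no_disease <- 0.02           # P(B|A2): 2% false positives
p_pos <- p_pos_given_disease * p_disease +
  p_pos_given_no_disease * (1 - p_disease)        # total probability: P(B) = 0.0584
p_pos_given_disease * p_disease / p_pos           # Bayes' theorem: about 0.67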

3.6 Notation

Conventionally, random variables are denoted using capital letters: X, Y ,


. . . , whereas observations and the values that random variables might take
are denoted using lower case letters: x, y, . . . . So we might write P(X = x)
to mean ‘the probability that the random variable X takes the value x’. In
this chapter we concentrate on situations where an appropriate summary of
the data might be a tally chart, and an appropriate representation would be
a bar chart (rather than a histogram).
Suppose that X is a discrete random variable that can take the values x1 ,
x2 , . . . , xm and no others. Since X must take some value:

P(X = x1 ) + P(X = x2 ) + · · · + P(X = xm ) = 1.

The sizes of P(X = x1 ), P(X = x2 ), . . ., show how the total probability


of 1 is distributed amongst the possible values of X. The collection of these
values therefore defines a probability distribution.

3.7 Mean and variance of a probability distribution

Equation (2.2) gave the sample mean as

    x̄ = (1/n) ∑_{j=1}^{m} fj xj,

where n is the sample size, x1, x2, . . . , xm are the distinct values that occur
in the sample, and fj is the frequency with which the value xj occurs.
Taking the 1/n inside the summation gives

    x̄ = ∑_{j=1}^{m} (fj/n) xj.

The ratio fj/n is the proportion of the sample for which X equals xj. Denoting
this proportion by pj, the formula for the sample mean becomes:

    x̄ = ∑_{j=1}^{m} pj xj.

In the same way, the sample variance, given by Equation (2.3), can be written
as

    (n/(n − 1)) ∑_{j=1}^{m} pj (xj − x̄)².

Now imagine the sample increasing in size to include the entire population.
Two consequences are:
1. Every possible value of X will occur at least once, and
2. The proportion of the ‘sample’ taking the value xj will be equal to the
probability that a randomly chosen member of the population has the
value xj .
The population mean, which is usually denoted by µ, is therefore given by

    µ = ∑ P(X = x) x,    (3.7)

where the summation is over all possible values of x. As n increases the
ratio n/(n − 1) comes ever closer to 1, while x̄ comes ever closer to µ. The
population variance, usually denoted by σ², is therefore given by

    σ² = ∑ P(X = x) (x − µ)².    (3.8)

A little bit of algebra gives the equivalent formula:

    σ² = ∑ P(X = x) x² − µ².    (3.9)

We can use whichever is the more convenient; we will get the same answer
either way!
Example 3.9

As an example, consider the result of rolling a fair die, numbered in the


usual way from 1 to 6. With six equally likely possibilities, the probability of
each is 1/6. So

    µ = (1/6) × 1 + (1/6) × 2 + · · · + (1/6) × 6 = (1/6) × 21 = 7/2 = 3.5.

Using Equation (3.8), the variance is:

    σ² = (1/6) × (1 − 3.5)² + (1/6) × (2 − 3.5)² + · · · + (1/6) × (6 − 3.5)² = 35/12.

This is a rather messy calculation. Using Equation (3.9) is more straightfor-
ward, since

    ∑ P(X = x) x² = (1/6) × (1 + 4 + 9 + 16 + 25 + 36) = 91/6,

so that

    σ² = 91/6 − (7/2)² = 35/12.
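The same arithmetic in R, following Equations (3.7) and (3.9):

x <- 1:6
p <- rep(1/6, 6)         # probabilities for a fair die
mu <- sum(p * x)         # population mean: 3.5
sum(p * x^2) - mu^2      # population variance: 35/12 = 2.917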

Unless the distribution is very skewed (Section 2.8) the major-


ity of values will lie in the range (µ − 2σ, µ + 2σ), where σ is
the population standard deviation.

3.8 Sample proportion and population probability

Probabilities are really just the proportions that we would get if we continued
sampling for ever!
Example 3.10

As an example of the relation between the sample and population statistics,


we again use the simulated die-rolling results given in Table 3.1 and included
now in Table 3.2.

Table 3.2 Extended results of ‘rolling a 6-sided die’.


Proportion showing a particular face
n 1 2 3 4 5 6 x̄ s2
6 0.500 0.167 0.000 0.167 0.167 0.000 2.333 3.067
60 0.133 0.283 0.100 0.183 0.167 0.133 3.367 2.779
600 0.162 0.195 0.160 0.153 0.163 0.167 3.462 2.940
6000 0.168 0.174 0.166 0.172 0.165 0.156 3.460 2.867
60,000 0.169 0.166 0.167 0.165 0.168 0.165 3.491 2.923
600,000 0.167 0.167 0.167 0.167 0.166 0.167 3.499 2.920

Extending the simulation to 600,000 ‘rolls of the die’, we can see that all
the sample proportions are very close to their common population probability,
1/6. In the same way we see that the sample mean is close to the population
mean, 3.5. The sample variance is similarly close to the population variance,
σ 2 (which, in the previous example, we found to be 35/12 = 2.917).

3.9 Combining means and variances

If X and Y are independent random variables, then writing Var for variance,
there are some simple rules:

1. The mean of X + Y equals the mean of X plus the mean of Y .

2. Var(X + Y ) equals Var(X) plus Var(Y ).

3. The mean of aX equals a times the mean of X.

4. Var(aX) equals a2 times Var(X).

Using rules 2 and 4 we find that

Var(X − Y ) = Var(X) + Var(Y ), (3.10)

which shows that the difference between X and Y has the same variance as
their sum X + Y .
Rules 1 to 4 extend naturally when combining information on more than
two variables.
A case of particular interest occurs when we take a sample of n independent
observations from a population. We can think of each member of the sample
as being an observation on its own random variable. We could call these
variables ‘The first sample value’, X1 , ‘The second sample value’, X2 , and so
on up to Xn . Since they all come from the same population, each random
variable has the same mean, µ, and the same variance, σ 2 . Thus

5. (X1 + X2 + · · · + Xn ) has mean nµ.

6. (X1 + X2 + · · · + Xn ) has variance nσ 2 .

Using these results together with rules 3 and 4, and noting that

    X̄ = (1/n)(X1 + X2 + · · · + Xn),

we have the results:

1. X̄ has mean (1/n) × nµ = µ (from rules 3 and 5).
   This shows that the sample mean, x̄, is an unbiased estimate of the
   population mean, µ.
2. X̄ has variance (1/n)² × nσ² = σ²/n (from rules 4 and 6).
   This shows that increasing the sample size, n, reduces the variance of the
   estimate and hence makes it more accurate. Specifically, variance reduces
   inversely with sample size.

In the context of determining the sample mean from rolls of a fair die
numbered from 1 to 6, Figure 3.9 shows the effect of increasing the sample
size from 60 to 600. Each diagram summarises the means of 100 samples.

Figure 3.9 As the sample size increases so the variability of the sample
mean decreases and it becomes an increasingly precise estimate of the
population mean.

3.10 Discrete uniform distribution

Here the random variable X is equally likely to take any of k values x1 , x2 , . . . ,


xk , and can take no other value. The distribution is summarised by
P(X = xi) = 1/k for i = 1, 2, . . . , k, and P(X = xi) = 0 otherwise.    (3.11)

Example 3.11

Suppose that the random variable X denotes the face obtained when rolling
a fair six-sided die with faces numbered 1 to 6. The probability distribution
is given by: P(X = x) = 1/6 for x = 1, 2, . . . , 6, and P(X = x) = 0 otherwise.
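As a small illustration (not part of the original example), the probabilities and simulated observations for this discrete uniform distribution can be handled directly in R:

    # The discrete uniform distribution on the faces 1 to 6
    k <- 6
    x <- 1:k
    p <- rep(1/k, k)                       # P(X = x) = 1/6 for every face
    sum(p * x)                             # population mean, 3.5
    sum(p * x^2) - sum(p * x)^2            # population variance, 35/12
    sample(x, size = 10, replace = TRUE)   # ten simulated rolls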

3.11 Probability density function

With a continuous variable, probability is related to a range of values rather


than any single value. For continuous data, we used histograms (Section
2.4) to illustrate the results. With a small amount of data, the histogram
is inevitably rather chunky, but, as the amount of data increases, so the
histogram can develop a more informative outline (see Figure 3.10).

Figure 3.10 As the sample size increases, so the histogram becomes more informative concerning the values in the population being sampled.

The peak of a histogram corresponds to the range of values where the values
are most dense. Ultimately, as we move from large sample to population, so we
move from proportion to probability. At that point we need only the outline,
which is a function of the values recorded. This is called the probability
density function, which is commonly shortened to pdf, and is designated by
f(). For the situation leading to the histograms in Figure 3.10, the probability
density function is that illustrated in Figure 3.11.
Figure 3.11 The probability density function that was randomly


sampled to obtain the data for the histograms of Figure 3.10.

The minimum value for a pdf is zero. Values for which f is zero are values
that will never occur.
Closely related4 to the probability density function is the cumulative dis-
tribution function (cdf), F. For a value x, the value taken by the cdf, is the
probability that the random variable X takes a value less than or equal to x:

F(x) = P(X ≤ x). (3.12)

For every distribution, F(−∞) = 0 and F(∞) = 1. Figure 3.10 showed how,
as the amount of data increases, so the outline of the histogram increasingly
resembles the probability density function describing the values in the pop-
ulation. In the same way, successive plots (for increasing amounts of data)
lead the outline of a cumulative frequency diagram (Section 2.5) to resemble
the cumulative distribution function.

3.12 The continuous uniform distribution

This is the simplest distribution to describe, as it states that all the values
in some range (from a to b, say), are equally likely to occur, with no other
values being possible5 .

4 In fact the cumulative distribution function, F(x), is the integral of the density function, f(x).
5 The pdf for a distribution that is uniform between the values a and b, with no other values being possible, is given by f(x) = 1/(b − a) for a < x < b.
The uniform distribution, also called the rectangular distribution, has mean = (a + b)/2 and variance = (b − a)²/12.

Figure 3.12 A uniform distribution between the values a and b.

When a measurement is rounded to a whole number, the associated rounding error has a uniform distribution in the range −0.5 to 0.5.

Example 3.12

The distance between points A and B is recorded as 4, while that between


B and C is recorded as 6. Distances have been rounded to whole numbers.
The best estimate of the distance between A and C via B is therefore 10, but
the question is ‘How accurate is that estimate?’.
The true distance between A and B is somewhere between 3.5 and 4.5, with
no value being more likely than any other. This is therefore an observation
from a uniform distribution with mean 4 and variance (4.5 − 3.5)²/12 = 1/12.
Similarly, the distance between B and C has mean 6 and variance 1/12.
Using the results of Section 3.9, the means and the variances simply add together, so the entire journey has mean 4 + 6 = 10 and variance equal to 1/12 + 1/12 = 1/6, which corresponds to a standard deviation for the distance of the entire journey of 1/√6 ≈ 0.408 units.
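A quick simulation (a sketch under the rounding assumptions above; the seed and number of replicates are arbitrary) confirms that the standard deviation of the combined rounding error is close to 0.408:

    # Two independent rounding errors, each uniform on (-0.5, 0.5)
    set.seed(3)
    err_AB <- runif(100000, -0.5, 0.5)   # error in the recorded A-to-B distance
    err_BC <- runif(100000, -0.5, 0.5)   # error in the recorded B-to-C distance
    sd(err_AB + err_BC)                  # close to 1/sqrt(6), about 0.408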
CHAPTER 4

ESTIMATION AND CONFIDENCE

4.1 Point estimates

A point estimate is a numerical value, calculated from a set of data, which


is used as an estimate of an unknown parameter in a population. The random
variable corresponding to an estimate is known as the estimator. The most
familiar examples of point estimates are:

the sample mean, x̄, used as an estimate of the population mean, µ;

the sample proportion, r/n, used as an estimate of the population proportion, p;

the sample variance, s² = Σ(xi − x̄)²/(n − 1), used as an estimate of the population variance, σ².

These three estimates are very natural ones and they also have a desirable
property: they are unbiased. That is to say that their long-term average
values are precisely equal to the population values.
4.1.1 Maximum likelihood estimation (mle)


This is the method most often used by statisticians for the estimation of
quantities of interest.
The likelihood is equal to a constant times the probability that a future
set of n observations have precisely the values observed in the current set. In
most cases, while the form of the distribution of X may be specified (e.g. bi-
nomial, normal, etc), the value of that distribution’s parameter or parameters
(e.g. p, µ, σ 2 ,etc) will not be known. The likelihood will therefore depend on
the values of these unknown parameter(s). A logical choice for the unknown
parameter(s) would be those value(s) that maximise the probability of reoc-
currence of the observations: this is the principle of maximum likelihood.

4.2 Confidence intervals

A point estimate is just that: an estimate. We cannot expect it to be exactly


accurate. However, we would be more confident with an estimate based on a
large sample than one based on a small sample.
A confidence interval, which takes the general form (Lower bound, Upper
bound), quantifies the uncertainty in the value of the point estimate.

4.3 Confidence interval for the population mean

Of all the properties of a population, the average value is the one that is most
likely to be of interest. Underlying the calculation of the confidence interval
for the population mean is the so-called normal distribution (also known as
the Gaussian distribution) which we now introduce.

4.3.1 The normal distribution


The normal distribution describes the very common situation in which very
large values are rather rare, very small values are rather rare, but middling
values are rather common. Figure 4.1 is a typical example of data displaying
a normal distribution. The data refer to verbal IQ scores of school-children
in the Netherlands.1
The distribution is unimodal and symmetric with two parameters, µ (the
mean) and σ 2 (the variance).2 As a shorthand we refer to a N(µ, σ 2 ) distri-
bution. Because of the symmetry, the mean is equal to both the mode and

1 The data are used in the form presented by the authors of the book in which the data first appeared.
2 The probability density function is given by f(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).
Figure 4.1 The verbal IQ scores of 2287 eighth-grade pupils (aged about
11) in schools in the Netherlands, together with a normal distribution
having the same mean and variance.

the median. The distribution is illustrated in Figure 4.2 where the x-axis is
marked in intervals of σ.

Roughly 95% of values lie within about 2 standard deviations


from the mean.

Figure 4.2 The probability density function for a normal distribution


with mean µ and variance σ 2 .

The normal distribution with mean 0 and variance 1 is described as the


standard normal distribution and you will often see it denoted as N(0,1).
4.3.2 The Central Limit Theorem


This is the reason why the normal distribution is so frequently encountered.

If X1 , X2 , . . . , Xn all have the same distribution, then as n


increases, the distribution of both their sum and their average
increasingly resembles a normal distribution.

The Central Limit Theorem is important for two reasons:

The common distribution of the X-variables is not stated — it can be almost any distribution.

In most cases the resemblance to a normal distribution holds for remarkably small values of n.

Figure 4.3 The distributions of means of observations from a uniform


distribution.

As an example, Figure 4.3 shows results for observations from a uniform distribution. Each histogram reports 1000 results for the original distribution,
for the means of pairs of observations, for the means of four observations, and
for the means of eight observations. As the group size increases so the means
become increasingly clustered in a symmetrical fashion about the mean of the
population.
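Histograms of this kind can be reproduced with a few lines of R. The sketch below is illustrative (the group sizes match those described, but the seed and plotting details are arbitrary):

    # Means of groups of observations from a continuous uniform distribution
    set.seed(4)
    par(mfrow = c(1, 4))                         # four histograms side by side
    for (m in c(1, 2, 4, 8)) {
      means <- replicate(1000, mean(runif(m)))   # 1000 means of groups of size m
      hist(means, main = paste("Means of", m, "observations"), xlab = "")
    }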
As a second example, Figure 4.4 shows histograms for successive averages
of observations from the opposite of a normal distribution: a situation where
central values are uncommon, and most values are very large or very small.
Once we start looking at averages of even as few as n = 2 observations, a
peak starts to appear.
As a final example (Figure 4.5) we look at a very skewed distribution. The
skewness has almost vanished by the time we are working with means of eight
observations.
Figure 4.4 The distributions of means of observations from a V-shaped


distribution.

Figure 4.5 The distributions of means of observations from a skewed


distribution.

The sample mean of a variable with mean µ and standard deviation σ can be regarded as being an observation from a normal distribution with mean µ and standard deviation σ/√n.

It is clear that the practical consequences of the Central Limit Theorem were understood well before the time of Laplace. A 16th century German treatise on surveying instructs the surveyor to establish the
German treatise on surveying instructs the surveyor to establish the
length of a rood (a standard unit of length) in the following manner:
“Stand at the door of a church on a Sunday and bid 16 men to
stop, tall ones and small ones, as they happen to pass out when
the service is finished; then make them put their left feet one
behind the other, and the length thus obtained shall be a right
and lawful rood to measure and survey the land with, and the
16th part of it shall be a right and lawful foot.”
4.3.3 Construction of the confidence interval


The normal distribution was introduced because the confidence interval for a
mean relies upon it. We noted earlier that roughly 95% of values lie within
about two standard deviations of the centre of a normal distribution, which
is at µ. So we know that the observed sample mean, x̄, will generally be
within two standard deviations of µ. Equally, µ will be within two standard
deviations of x̄.
Usually the value of σ² is unknown but, in a large sample, a good ap-
proximation3 will be provided by the sample variance, s2 . So an approximate
95% confidence interval for a population mean is

(x̄ − 2s/√n, x̄ + 2s/√n).    (4.1)

Example 4.1

A machine fills ‘500 ml’ containers with orange juice. A random sample of
10 containers are examined in a laboratory, and the amounts of orange juice
in each container were determined correct to the nearest 0.05 ml. The results
were as follows:
503.45, 507.80, 501.10, 506.45, 505.85, 504.40, 503.45, 505.50, 502.95, 507.75.

Since the total volume of juice in a container can be thought of as the sum
of the volumes of large numbers of juice droplets, it is reasonable to assume
that these are observations from a normal distribution.
The sample mean and standard deviation are 504.87 and 2.18, respectively,
so that an approximate 95% confidence interval for the population mean is
504.87 ± 2 × 2.18/√10, which gives (503.5, 506.3).
For a single bottle (n = 1) the approximate interval is much wider at (500.5,
509.2), which excludes 500 ml.
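The interval can be reproduced in a few lines of R (a sketch; as the footnote notes, t.test would give the slightly wider t-based interval):

    # Approximate 95% confidence interval for the mean fill volume
    juice <- c(503.45, 507.80, 501.10, 506.45, 505.85,
               504.40, 503.45, 505.50, 502.95, 507.75)
    m <- mean(juice); s <- sd(juice); n <- length(juice)
    c(m - 2 * s / sqrt(n), m + 2 * s / sqrt(n))   # roughly (503.5, 506.3)
    # t.test(juice)$conf.int gives the t-based interval mentioned in the footnote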

4.4 Confidence interval for a proportion

Once again the computer can be relied upon to do the calculations. However,
this time the user may obtain slightly different answers depending on the
choice of computer routine. To discover why this is, we first introduce the
underlying distribution.

3 Statisticians
would observe that a more accurate interval would replace the multiplier 2
by an appropriate value from a t-distribution with n − 1 degrees of freedom.
4.4.1 The binomial distribution


The word ‘binomial’ means having two terms. For this distribution to be
appropriate the following must be true:
There is a fixed number, n, of independent identical trials.
Each trial results in either a “success" or a “failure" (the two terms).
The probability of success, p, is the same for each trial.
The probability of obtaining exactly r successes in the n trials is given by4:

P(X = r) = C(n, r) p^r (1 − p)^(n−r),    (4.2)

where the binomial coefficient

C(n, r) = C(n, n − r) = n!/(r!(n − r)!)    (4.3)
is the number of ways of choosing r out of n, and

r! = r × (r − 1) × · · · × 1, for r > 0 with 0! = 1. (4.4)

Two examples with n = 10 are illustrated in Figure 4.6.

The binomial distribution has mean np and variance np(1 − p).
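In R the probabilities of Equation (4.2) are available through the dbinom function; the values chosen below are purely illustrative:

    # Binomial probabilities for n = 10 trials
    dbinom(3, size = 10, prob = 0.5)      # P(X = 3) when p = 0.5
    dbinom(0:10, size = 10, prob = 0.2)   # the full distribution when p = 0.2
    choose(10, 3)                         # the binomial coefficient of Equation (4.3)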

4.4.2 Confidence interval for a proportion (large sample case)


When n is large the shape of the binomial distribution resembles that of a nor-
mal distribution to the extent that the latter can be used as an approximation
that avoids heavy computations.
Suppose that we have r successes in n trials, so that the point estimate,
p̂, is given by p̂ = r/n. Using the normal approximation, r is an observation
from a normal distribution with µ = np and σ 2 = np(1 − p). Thus p̂ has mean
np/n = p and variance (using Rule 4 of Section 3.9) p(1 − p)/n. Substituting
p̂ for p, leads to an approximate 95% confidence interval for p as5
( p̂ − 2√(p̂(1 − p̂)/n),  p̂ + 2√(p̂(1 − p̂)/n) ).    (4.5)

4 This is found by expanding out {p + (1 − p)}^n.
5 Depending upon the package used, the method may be attributed to either Laplace or
Wald.
Figure 4.6 Two binomial distributions with n = 10. The distribution is


symmetric when p = 0.5.

Example 4.2

As an example suppose that r = 50 and n = 250, so that p̂ = 0.2. The approximate 95% confidence interval is (0.15, 0.25).
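A minimal sketch of the calculation behind this example (routines such as prop.test or binom.test may return slightly different bounds, as noted above):

    # Large-sample 95% confidence interval for a proportion
    r <- 50; n <- 250
    p_hat <- r / n
    half_width <- 2 * sqrt(p_hat * (1 - p_hat) / n)
    c(p_hat - half_width, p_hat + half_width)   # roughly (0.15, 0.25)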

4.4.3 Confidence interval for a proportion (small sample)


With small samples the normal approximation to the binomial is no longer
appropriate. We need to work directly with the binomial distribution itself.
However, this is not straightforward and there is considerable discussion in
the statistical literature on how best to proceed. Different approaches lead to
different bounds. Fortunately, the data scientist will rarely encounter small
samples!

4.5 Confidence bounds for other summary statistics

4.5.1 The bootstrap


This approach depends on the ability of the modern computer to carry out
a huge number of calculations very quickly. No assumptions are made about
the underlying population distribution, and no use is made of the Central
Limit Theorem. Instead, in response to the question ‘What is known about
the distribution?’, the answer is ‘The values x1 , x2 , . . . , xn can occur’. Assume
for clarity that all n values are different, then those n values are apparently
equally likely to appear. The assumed distribution is therefore given by
P(X = xi) = 1/n, (i = 1, 2, . . . , n).    (4.6)
Any intermediate values are supposed to have zero probability of occurrence.
The next question is ‘If this is the distribution, and we take n observations
from the distribution, with replacement after each observation, then what
sample values might have been obtained?’. This is called resampling. Note
that each new sample contains only the values from the original sample, but
some of those values may appear more than once, while some will not appear
at all. As an example, suppose that the data vector consists of the numbers
0 to 9. Successive samples of 10 observations might be

Sample 1: 5 5 3 9 7 0 7 7 3 8
Sample 2: 3 5 8 1 3 8 4 1 2 6
Sample 3: 1 7 7 7 5 1 1 9 8 0

A key feature is that not only are the samples different from one another,
but they have potentially different means, medians, variances, etc.
Suppose that for some reason we are interested in the median. We not only
have an observed value, but also the possibility, by resampling, to find the
other values that might have occurred. We would use at least 100 resamples.
Remember that the original sample will not exactly reproduce the popu-
lation from which it is drawn. Since the resampling is based on the sample
rather than the unknown population, the inevitable biases in the sample will
be manifest in the bootstrap distribution.

If the true distribution is known, or can be approximated, then


methods based on that distribution should be used in preference
to the bootstrap.

Example 4.3

To illustrate the procedure we need some data. Here are 30 observations


(randomly generated from a uniform distribution with a range from 0 to 500):

457 469 143 415 321 260 368 67 328 353


229 360 467 128 231 470 489 59 237 280
452 69 494 473 41 257 195 453 223 418

Suppose that we are interested in the coefficient of variation, cv, which is


a quantity that compares the variability of a set of data with its typical value
by dividing the standard deviation by the mean. In this case the observed
value is reported (by R) as 0.4762557.
We will now use the bootstrap with 1,000,000 resamples to find a 95%
confidence interval for this value. The computer gives the apparently highly
accurate result (0.3403718, 0.6097751).
Do not be persuaded by the large number of resamples and the seven sig-
nificant figures into thinking that high precision has thereby been achieved.
Even with such a huge number of resamples, repeat runs will produce dif-
ferent values. In this case two further runs gave the bounds (0.3405424,
0.6099814) and (0.3404129, 0.6099990) which are very similar, but
not identical. A reasonable report would be that the 95% bounds are (0.34,
0.61).
Since we generated the data from a known distribution this means we also
know the true value of cv. The distribution (see Section 3.12) has mean 250 and standard deviation 500/√12, so that the true cv is 0.58. This is
much higher than the observed value ( 0.4762557); it only just lies within the
confidence interval. This serves as a reminder that relatively small samples
may not accurately reflect the population from which they came.
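A bare-bones version of this bootstrap can be written with sample() and replicate(). The sketch below uses far fewer resamples than the 1,000,000 quoted above, so its bounds will differ a little from run to run:

    # Bootstrap confidence interval for the coefficient of variation
    x <- c(457, 469, 143, 415, 321, 260, 368,  67, 328, 353,
           229, 360, 467, 128, 231, 470, 489,  59, 237, 280,
           452,  69, 494, 473,  41, 257, 195, 453, 223, 418)
    cv <- function(v) sd(v) / mean(v)
    cv(x)                                         # observed value, about 0.476
    set.seed(5)
    boot_cv <- replicate(10000, cv(sample(x, replace = TRUE)))
    quantile(boot_cv, c(0.025, 0.975))            # approximate 95% bounds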

4.6 Some other probability distributions

4.6.1 The Poisson and exponential distributions

4.6.1.1 Poisson process When every point in space (or time, or space-time)
is equally likely to contain an event, then the events are said to occur at
random. In mathematical terms this is described as a Poisson process.6
Figure 4.7 illustrates a temporal Poisson process, with the passing of time
indicated by the position on the line from left to right. The time line is
subdivided into ten equal lengths. There is a total of 55 events, but these are
not at all evenly distributed, with, by chance, a concentration of events in the
first tenth.

Figure 4.7 Events occurring at random points on the time line: an


example of a temporal Poisson process.

This mixture of gaps and clumps is typical of events happening at random.


When the events are catastrophic (plane crashes, wild fires, etc) the clumping

6 Siméon Denis Poisson (1781-1840) was a French mathematician whose principal interest lay
in aspects of mathematical physics. His major work on probability was entitled Researches
on the probability of criminal and civil verdicts. In this long book (over 400 pages) only
about one page is devoted to the derivation of the distribution that bears his name.
of random events should be borne in mind. Whilst there may be some common
cause for the catastrophes, this need not be the case.

Figure 4.8 Three further examples of a temporal Poisson process.

Figure 4.8 gives three further examples of a Poisson process. All four
examples refer to the same length of time, with the random events occurring
at the same rate throughout. The differences between the total numbers of
events is a further example of the way in which randomness leads to clusters
of events.

4.6.1.2 Poisson distribution For events following a Poisson process, the Pois-
son distribution 7 describes the variation in the numbers of events in equal-
sized areas of space or intervals of time. Two examples of Poisson distributions
are illustrated in Figure 4.9.

Figure 4.9 The Poisson distribution becomes more symmetric as the


mean, λ, increases.

7 For events occurring at a rate of λ per unit area (or time), the probability that a unit area contains exactly x events is given by P(X = x) = λ^x e^(−λ)/x!, (x = 0, 1, 2, . . .), where e is the base of the natural logarithm and the quantity x! was given by Equation (4.4).

The Poisson distribution has mean and variance equal to one


another.

Example 4.4

Figure 4.10 illustrates a spatial Poisson process. The event positions in


two dimensions were generated using pairs of random numbers. A summary
of the counts in 100 sub-regions is given in Table 4.1.

Figure 4.10 The left-hand diagram illustrates the positions of 125


randomly positioned points. The right-hand diagram reports the counts
in the 100 sub-regions.

Table 4.1 A summary table of the numbers of events in the 100 sub-regions of
Figure 4.10.
Number of events, x, in a sub-region 0 1 2 3 4 5
Number of sub-regions containing x events 27 37 26 5 4 1

If the events are distributed at random in space, then the numbers of events
in each sub-region will be observations from a Poisson distribution. Since Pois-
son distributions have mean equal to variance, a useful check for randomness
is provided by comparing the sample mean with the sample variance. In this
case the sample mean is 1.25 and the sample variance is 1.20, so the pattern
is consistent with randomness.
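The comparison of the sample mean with the sample variance can be checked directly from Table 4.1:

    # Reconstruct the 100 sub-region counts of Table 4.1
    counts <- rep(0:5, times = c(27, 37, 26, 5, 4, 1))
    length(counts)   # 100 sub-regions
    mean(counts)     # 1.25
    var(counts)      # about 1.20, close to the mean, as expected under randomness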
4.6.1.3 The exponential distribution This distribution refers to the lengths


of the intervals between events occurring at random points in time.8 Two
examples of the distribution are illustrated in Figure 4.11.

Figure 4.11 Two exponential distributions.

The exponential distribution has mean 1/λ and variance 1/λ², with P(X > x) = exp(−λx).

Figure 4.12 Histogram of the intervals between the random events of


Figure 4.7, together with the fitted exponential distribution.

8 It has probability density function given by f(x) = λe^(−λx), for x > 0.
Figure 4.12 shows a histogram of the intervals between the random events
shown in Figure 4.7. This is an example of data from an exponential distri-
bution.

4.6.2 The Weibull distribution


The Weibull9 distribution is an example of an extreme value distribution. It is often used to model the lifetimes of systems of components. Its probability density function involves two parameters, k and λ. As Figure 4.13 demonstrates, the shape of the distribution is critically dependent on the value of k, with the case k = 1 corresponding to the exponential distribution.

Figure 4.13 Three Weibull distributions, each with λ = 1.

The exponential distribution assumed that events such as failures occurred


at a steady rate over time. However, often there is a small proportion of
components that fail almost immediately, while for other components the
failure rate increases as the components get older. The result is a characteristic
‘bathtub’ curve. The example shown in Figure 4.14 was created by mixing
the three Weibull distributions given in Figure 4.13.

4.6.3 The chi-squared (χ2 ) distribution

‘Chi’ is the Greek letter χ, pronounced ‘kye’. The distribution is continuous


and has a positive integer parameter d, known as the degrees of freedom,
which determines its shape. Some examples of chi-squared distributions are
illustrated in Figure 4.15.
The distribution has mean d and variance 2d. A random variable with a
χ2d -distribution occurs as the sum of d independent squared standard normal
random variables. Since random variations can usually be assumed to be
observations from normal distributions, the chi-squared distribution is often

9 Ernst Hjalmar Wallodi Weibull (1887-1979) was a Swedish engineer. His career was in the
Royal Swedish Coast Guard, where he studied component reliability. In 1972 the American
Society of Mechanical Engineers awarded him their Gold Medal and cited him as ‘a pioneer
in the study of fracture, fatigue and reliability’
Figure 4.14 A bathtub curve showing how the failure rate typically
varies over time.

Figure 4.15 Examples of chi-squared distributions with 1, 2, 4, 8, and 16 degrees of freedom (the suffices give the degrees of freedom).

used when assessing the appropriateness of a model as a description of a set


of data.
CHAPTER 5

MODELS, P-VALUES, AND HYPOTHESES

5.1 Models

There are two sorts of models: those that we propose in advance of seeing the
data, and those that we devise having seen the data. In this chapter we focus
on the former.
As an example of a model formulated before seeing the data, consider the
rolling of a six-sided die. The obvious model is that the die is fair, with
each of the six sides being equally likely to appear. The model will not be
entirely correct because there are sure to be imperfections. Maybe one face is very slightly more likely to face downwards than another? Maybe one side is dirtier than another?
going on, the bias will probably be so small that the die would have been
worn away before we could detect it! The model is not 100% correct, but it
should provide a very useful description.
Now suppose that, after rolling the die 100 times, we obtain roughly equal counts of the occurrence of two of the faces, but we never see any of the other four sides.
sides. This result would lead us to strongly doubt that our model was correct.

Why? Because we could easily calculate that, if the assumption of fairness had been correct, then these observed data would be highly unlikely. It is this kind of test, the so-called hypothesis test, that we consider in this chapter.
Example 5.1

A simple model, such as one that specifies that oak trees are at random
locations in a wood, has at least two uses. Taken together with a count of
the oak trees in sampled equal-sized regions of the wood, the model provides
the basis for an initial estimate of the number of oak trees in the entire wood.
Examination of the variation in the counts can then be used as potential
evidence against the simple model. For example, differences in the terrain or
other relevant features, might lead to variations in where the oak trees are
likely to be found and hence to a non-random pattern.

5.2 p-values and the null hypothesis

According to Oxford Languages a hypothesis is “a supposition or proposed


explanation made on the basis of limited evidence as a starting point for
further investigation." This is precisely what the word means to statisticians,
who elaborate on the idea by introducing a particular type of hypothesis that
they call the null hypothesis.
The null hypothesis specifies that the population possesses some statistical
feature. Some examples are

The mean is 500 g.


The probability of a six is 1/6.
Oak trees occur at random.

If we imagined that this was a court of law, then the null hypothesis would
be that the prisoner is innocent and the sample data would be the evidence.
The prisoner will be given the benefit of the doubt (the null hypothesis is
accepted) unless there is evidence of guilt (the observed data are too unlikely
for the null hypothesis to be believed).
Consider a hypothesis about the population mean. The evidence of guilt
(or otherwise) is provided by the sample mean. If it is very close to the
hypothesised value then we would be happy to accept that hypothesis. But
the greater the difference between the sample mean and the population mean,
the stronger the evidence that the null hypothesis was incorrect. The p-value
is a measure of the strength of that evidence.1
A rough interpretation of a p-value might be as follows:

1 It
is the probability that, if the null hypothesis were correct, then the observed result, or
a more extreme one, could have occurred.
p-value Interpretation
0.01 < p ≤ 0.05 There is some evidence that contradicts the null hypothesis.
Ideally we would like more data to provide confirmation.
0.001 < p ≤ 0.01 There is strong evidence to reject the null hypothesis.
p ≤ 0.001 There is overwhelming evidence against the null hypothesis.

To be clear, the case p ≤ 0.001 is stating that, if the null hypothesis were
true, and we were to look at a thousand situations similar to that currently
studied, then perhaps only one of those would have results as extreme as has
occurred in our case.
Example 5.2

As an example, suppose we test the hypothesis that a population has mean


5, using a random sample of 100 observations. Suppose that, in reality, the
population has mean 6 and variance 4. Figure 5.1 shows a histogram of the
sample data.

Figure 5.1 Histogram for a sample of 100 observations from a


population with mean 6. The hypothesised mean was 5.

The sample mean is 6.1. To assess whether this is unreasonably far from
the hypothesised mean of 5, we use a so-called t-test. This makes use of the
t-distribution, which is a symmetric distribution very similar to the normal
distribution.
The output from R states that the p-value is ‘6.955e-05’, which is a short-
hand for 0.00006955. In other words, if the mean were 5, then the probability
of a sample mean this different, or more different, from the hypothesised value
is less than 7 in one hundred thousand. The null hypothesis would most cer-
tainly be rejected.
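The test itself is a single call to t.test. The individual observations are not listed in the text, so the sketch below generates a comparable sample (normal, mean 6, variance 4) purely for illustration; its p-value will therefore not match the quoted one exactly:

    # One-sample t-test of the null hypothesis that the population mean is 5
    set.seed(6)
    x <- rnorm(100, mean = 6, sd = 2)   # stand-in data; the book's sample is not listed
    t.test(x, mu = 5)                   # reports the t statistic, p-value, and a 95% CI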
5.2.1 Comparing p-values


Individual p-values should always be quoted so as to leave the reader or re-
searcher able to decide how convincing is the evidence against the null hy-
pothesis. This is particularly useful when comparing p-values.
When a model involves several explanatory variables2 , then some of these
variables will inevitably be more important than others. Typical computer
output will associate p-values with each variable. These p-values are quanti-
fying the importance of each variable by reporting how reasonable it would
be to remove that variable from the model. As we will see in later chapters,
removal of a variable is equivalent to accepting the null hypothesis that a
model parameter has a zero value.

5.2.2 Link with confidence interval


Consider, as an example, a 95% confidence interval for the population mean,
µ, based on the sample mean, x̄. The interval is centered on x̄ and takes the
form

(x̄ − k) to (x̄ + k)

for some value of k. If, beforehand, we had hypothesised that µ had a value
that turns out to lie in the interval, then we would probably look smug and
say I thought as much. We would have, in effect, accepted our null hypothesis.
However, suppose that the value that we anticipated lies outside the 95%
confidence interval. In that case it must either be higher than the top of the
interval or lower than the bottom of the interval. With 95% inside the interval
this implies that it lies in one or other of the two 2.5% tail regions. So, the
probability of the observed value, or a more extreme one (either greater or
lesser), will be at most 5%. That probability is simply the p-value.
Example 5.2 (cont.)

The R output for the t-test included a 95% confidence interval for the true
population mean as being between 5.565 and 6.598. Suppose that our null
hypothesis had been that the population mean was not 5, but 5.6. This time,
when we perform a t-test we get the information that the p-value is 0.067.
Any hypothesised value lying inside the 95% confidence interval will lead to
a p-value greater than 5%.

2 Knowledge of the values of explanatory variables, which are also known as background or
predictor variables, may help to explain the variation in the values of the variable of prime
interest.
5.3 p-values when comparing two samples

In this section we give two examples of the use of p-values. Both situations
involve the comparison of a pair of samples.

5.3.1 Do the two samples come from the same population?

A diagram that helps to answer this question is the quantile-quantile plot, which is usually referred to as a Q-Q plot. We met quantiles in Section 2.2.2.
To see how it works, suppose that there are 500 observations in one data
set and 200 in another. For the larger data set, working for convenience with
the special case of percentiles3 , the values of the first five are the 5th, 10th,
15th, 20th and 25th largest values. For the smaller data set the first five
quantiles are the 2nd, 4th, 6th, 8th, and 10th largest values. In the q-q plot
the quantiles for one data set are plotted against those for the other data
set. If the two sets come from the same population, then the plot will be an
approximately straight line.
Example 5.3

As an example we use three data sets, x, y, and z. Each has 100 observa-
tions. Sets x and y consist of random samples from one population, whereas z
is a sample from a different population (but with the same mean and variance
as the first population).
Figure 5.2 compares the Q-Q plot for our two samples x and y with the
Q-Q plot for x and z.

Figure 5.2 Two Q-Q plots: (a) The samples x and y come from the same
distribution. (b) The samples x and z come from different distributions.

3 Quantiles that divide the ordered data into 100 sections.


At the ends of a Q-Q plot there is often some discrepancy from the central
line as these cases refer to extreme results that are, by definition, rather
variable and uncommon. However, the centre of the plot should lie close to
the line. For the plot of x against y there is a degree of wobble, but it appears
random, whereas for the plot of x against z there are marked and coherent
divergences from the central line.

5.3.1.1 Kolmogorov-Smirnov test The Q-Q plot does not provide a test, just
a visual suggestion. By contrast, the Kolmogorov-Smirnov (KS) test provides
a formal appraisal of the null hypothesis that the two samples come from the
same distribution. The test statistic, D, is the maximum difference between the two cumulative frequency distributions (Section 2.5) of the samples being compared.
Example 5.3 (cont.)

Figure 5.3 (a) compares the cumulative distributions of x and y, which


were samples of 100 observations sampled from the same population. There
is an appreciable gap between the two curves, but the KS test reports that the
maximum difference, 0.18, could reasonably have occurred by chance. The p-
value is reported as being 0.078, which is greater than 5%. The null hypothesis
that the two samples came from the same population would (correctly) be
accepted.

Figure 5.3 Superimposed cumulative distribution functions for (a) x


and y, (b) x and z.

Figure 5.3 (b), which compares the cumulative distributions of x and z, is very similar. But here the gap is larger (0.20) and the KS test reports a p-value of 0.037. Since this is smaller than 5% there is reason to doubt that the samples have come from the same population (though, since the p-value is not really small, that decision would be made cautiously).
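Both the plot and the test are single function calls; the sketch below assumes that x, y, and z are the numeric vectors described in Example 5.3:

    # Q-Q plot and Kolmogorov-Smirnov tests for pairs of samples
    qqplot(x, y)    # points close to a straight line suggest a common distribution
    ks.test(x, y)   # D and its p-value for the 'same distribution' null hypothesis
    ks.test(x, z)   # the comparison of samples from different populations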
5.3.2 Do the two populations have the same mean?


The KS test compared the entirety of two samples. If all that is of interest is
whether two populations have the same mean, then a t-test will do the trick.
Example 5.3 (cont.)

The means of the two populations being compared (x and z) were arranged
to be identical (both equal to 10). It is therefore no surprise that the sample
means are very similar (10.0 and 9.9). The t-test duly reports a large p-value
(0.48) which is much bigger than 5% and indicates that the null hypothesis of
equal means is entirely tenable. The 95% confidence interval for the magnitude
of the difference between the population means therefore includes zero (the
value corresponding to equality).
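Assuming again that x and z hold the two samples, the comparison of means is a one-line call:

    # Two-sample t-test of the null hypothesis of equal population means
    mean(x); mean(z)   # sample means, about 10.0 and 9.9 in the example
    t.test(x, z)       # large p-value; the CI for the difference includes zero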
CHAPTER 6

COMPARING PROPORTIONS

6.1 The 2 by 2 table

Suppose that we have two populations, A and B. Within each population


there are two categories of interest which we will call ‘success’ and ‘failure’.
We wish to know whether pA , the probability of a success in population A,
is equal to pB , the corresponding probability in population B. If the two are
equal, then the probability of a success is independent of the population
being sampled.
An equivalent situation is presented by a single population that contains
some individuals of category A and other individuals of category B. Within
each category the individuals can be divided into ’successes’ and ‘failures’.
The question of interest is whether the proportion of successes is the same for
each category.

Table 6.1 Hypothetical observed cell frequencies and their totals.


A B Total
Success 10 20 30
Failure 30 40 70
Total 40 60 100

Table 6.2 The cell frequencies corresponding to the observed


frequencies in Table 6.1 for the independence model.
A B Total
Success 12 18 30
Failure 28 42 70
Total 40 60 100

A test of equality of two probabilities is equivalent to a test of


independence of the two classifying variables.

Suppose that the data are as shown in Table 6.1.


If the probability of a success is not affected by whether an observation
belongs to A or to B, then (using the Total column) the best estimate of the
probability of a success is 30/100 = 0.3. If success or failure is independent of
whether an individual belongs to A or B, then there would be 0.3 × 40 = 12
successes for A. Assuming that P(success) is indeed 0.3, the complete outcome
would be the cell frequencies given in Table 6.2. The greater the difference
between the observed counts (Table 6.1), and the ideal independence outcome
(Table 6.2), the less likely is it that independence is a correct description of
the situation. The question therefore is how best to compare the two sets of
numbers. Using the notion of likelihood (Section 4.1.1), the answer is to use
the so-called likelihood-ratio statistic1 , G2 , given by
G² = 2 Σ Obs. × ln(Obs./Exp.),    (6.1)

where the summation is over every combination of row and column, Obs. refers to an observed count in Table 6.1, and Exp. refers to the
ideal values given (in Table 6.2) for the current model (which in this case is

1 The quantity G2 results from the comparison of two likelihoods. The likelihood, which
was introduced in Section 4.1.1, can be described as the probability of obtaining a future
set of data identical to the current data under a given model. Here the current model is the
model of independence. We compare that likelihood with the maximum possible likelihood.
the model of independence). Here the Obs. values are (10, 20, 30, and 40),
while the corresponding Exp. values are (12, 18, 28, and 42).
In the situations likely to be encountered by data scientists, where there is
plenty of data, the distribution of G2 can be approximated by a chi-squared
distribution (Section 4.6.3). In the case of a table with 2 rows and 2 columns
(such as Table 6.1), the reference chi-squared distribution has a single degree
of freedom.
Example 6.1

According to the 1990 report of the British Board of Trade, when the
Titanic hit the iceberg in 1912 there were 862 male crew and 805 adult male
passengers on board. There were 338 male survivors. A possible question
of interest is whether the outcome (died or survived) was independent of a
person’s role (crew or passenger). The relevant numbers are shown in Table
6.3. A table such as this, which would be described as a 2 × 2 table, is an example of a contingency table.

Table 6.3 The numbers and fates of adult male passengers and male crew on board
the Titanic at the time of its collision with an iceberg in 1912.

Crew Passenger Total


Survived 192 146 338
Died 670 659 1329
Total 862 805 1667

The probability of survival for a male crew member was 192/862 = 0.22.
The probability for an adult male passenger was rather less: 146/805 = 0.18.
Could the difference be due to chance? To assess this we first calculate the values resulting from a precise fit to the independence model. The values
are shown in Table 6.4. The resulting value of G2 is 4.42 and the question is
whether this is such an unusually large value that our model (independence)
should be rejected.

Table 6.4 The values obtained when the independence model is applied to the
frequencies in Table 6.3.

Crew Passenger Total


Survived 174.8 163.2 338
Died 687.2 641.8 1329
Total 862 805 1667
Comparing the value of G2 to a chi-squared distribution with one degree of


freedom, we find that the probability of that value, or a more extreme one, is
about 0.035. Since this is less than 0.05, we would claim that, at the 5% level
there was a significantly higher proportion of surviving male crew than there
were surviving male passengers. However, this statement needs to be put into
context. Priority was evidently given to women and children, with nearly 70%
of the 534 women and children on board surviving; we might speculate that
some male passengers sacrificed themselves for their loved ones.
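The calculation of G² for Table 6.3 can be reproduced as follows (a sketch; the built-in chisq.test function gives the closely related Pearson chi-squared statistic rather than G²):

    # Likelihood-ratio statistic G^2 for the 2 by 2 Titanic table (Table 6.3)
    obs   <- matrix(c(192, 670, 146, 659), nrow = 2,
                    dimnames = list(c("Survived", "Died"), c("Crew", "Passenger")))
    expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # Table 6.4 values
    G2    <- 2 * sum(obs * log(obs / expct))                # about 4.42
    pchisq(G2, df = 1, lower.tail = FALSE)                  # p-value, about 0.035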

6.2 Some terminology

There are a good many rather specialised terms that are used in connection
with 2 by 2 tables. We now briefly introduce some of them.

6.2.1 Odds, odds-ratios, and independence

We are continuing with situations that are classified either as a ‘success’ or a ‘failure’. The odds on a success is given by:

odds = P(outcome is a success) / P(outcome is a failure).

The odds-ratio is the ratio of the odds for two categories:

odds ratio = [P(a success in category A) / P(a failure in category A)] ÷ [P(a success in category B) / P(a failure in category B)].
If the probability of a success is the same for both categories, the odds-ratio
is 1 and its logarithm will be zero2 .
Example 6.1 (cont.)

The odds on survival for male crew was 192/670, while for male passengers
it was 146/659. The odds-ratio is therefore (192/670) ÷ (146/659) ≈ 1.3.

6.2.2 Relative risk


Suppose that we are comparing two treatments, A and B, for their success in
curing a deadly disease. Suppose that the chance of survival with treatment
A is 2%, whereas the chance of survival with treatment B is 1%. We would
strongly encourage the use of treatment A, even though the difference in the

2 Although the log(odds-ratio) seems unreasonably obscure in the current context, it is


useful in more complex situations.
probabilities of survival is just 1%. A report of this situation would state that
treatment A was twice as successful as treatment B. In a medical context, the
ratio of the success probabilities, R, is called the relative risk.

Notice that the relative risk for failures is different to that for
successes. Advertisers of the merits of a treatment will be sure
to choose whichever makes the best case for their treatment.

Example 6.1 (cont.)

For the Titanic situation the relative risk for the male passengers as opposed
to crew is 146/805 ÷ 192/862 ≈ 0.81. This implies that the probability of
surviving for a male passenger was only about 80% of that for a male crew
member, which sounds rather dramatic. If we focus instead on those who died
and calculate 659/805 ÷ 670/862, then we find that the risk of dying was just
5% greater for the male passengers.

6.2.3 Sensitivity, specificity, and related quantities


Consider the following 2 × 2 table:

Patient has disease Patient without disease Total


Patient tests positive True positives (a) False positives (b) r
Patient tests negative False negatives (c) True negatives (d) s
Total m n N

The conditional probabilities of particular interest to the clinician are the


following:

Sensitivity: P( A patient with the disease is correctly diagnosed)= a/m.


Specificity: P(A patient without the disease is correctly diagnosed)= d/n.

However, a test that suggested that every patient had the disease would have
sensitivity= 1 (which is good), but specificity= 0 (bad). Both quantities need
to be large. A useful summary is provided by Youden’s index3 :

J = Sensitivity + Specificity − 1. (6.2)

3 William John Youden (1900-1971) was an American chemical engineer who became inter-
ested in Statistics when designing experimental trials.
While J is a good measure of the usefulness of a test, it gives


little information concerning the correctness of the diagnosis.

The conditional probabilities of particular interest to the patient are:

Positive predictive value (PPV): P(Diagnosis of disease is correct)= a/r.


Negative predictive value (NPV): P(Diagnosis of no disease is correct)= d/s.

However, the values of PPV and NPV will be affected by the overall prevalence
(m/N ) of the disease. For example, if every patient had the disease then
PPV= 1 and NPV= 0. An associated term is the false discovery rate b/r
(=1 - PPV).

6.3 The R by C table

We now turn to the more general situation where a table provides information
about two variables, with one variable having R categories and the other
having C categories. For example, a table such as that shown as Table 6.5:

Table 6.5 Hypothetical observed cell frequencies and their totals.


A1 A2 A3 A4 Total
B1 10 20 30 40 100
B2 20 40 60 80 200
B3 30 60 90 120 300
Total 60 120 180 240 600

In this very contrived table, the variable A has four categories, with prob-
abilities 0.1, 0.2, 0.3, and 0.4, while the variable B has three categories with
probabilities 1/6, 2/6, and 3/6. Furthermore the two variables are completely
independent of one another. This can be seen from the fact that the frequen-
cies in each column are in the ratio 1 to 2 to 3.
Real data is not so obliging, but the question of interest remains: “Are the
variables independent of one another?" If they are completely independent,
then knowing the category for one variable will not help us to guess the
category of the other variable.
The values that would occur, if the variables were completely independent
of one another, are again found by the simple calculation:

Exp = (Row total × Column total) / Grand total.
Thus, for the top left frequency, the Exp value is given by (100 × 60)/600 = 10, which
in this case is equal to the observed value.
A test of the overall independence of the two variables is again provided by
G2 , with the reference chi-squared distribution having (R − 1)(C − 1) degrees
of freedom.
Example 6.2

The passengers on the Titanic occupied three classes of cabin. We now


inquire whether survival of male passengers was independent of their cabin
class. The data are shown in Table 6.6.

Table 6.6 The numbers and fates of male passengers on the Titanic,
with passengers subdivided by cabin class.
Class 1 Class 2 Class 3 Total
Survived 57 14 75 146
Died 118 154 387 659
Total 175 168 462 805

Approximate proportion surviving 1/3 1/12 1/6

There appear to be considerable variations in the survival rates. Table 6.7


gives the Exp values based on independence of survival and cabin class.

Table 6.7 The average numbers expected in the various category


combinations if cabin class and survival were independent.
Class 1 Class 2 Class 3 Total
Survived 31.7 30.5 83.8 146
Died 143.3 137.5 378.2 659
Total 175 168 462 805

In this case the value of G2 is 35.2. There are (2 − 1)(3 − 1) = 2 degrees


of freedom for the approximating chi-squared distribution, which therefore
has mean 2. If survival were independent of cabin class, then a value of G2
as large as that observed would only occur on about two occasions in one
hundred million. In other words we have overwhelming evidence of a lack of
independence.
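The same few lines deal with the 2 by 3 table; a sketch for Table 6.6:

    # G^2 for survival by cabin class (Table 6.6)
    obs   <- matrix(c(57, 118, 14, 154, 75, 387), nrow = 2,
                    dimnames = list(c("Survived", "Died"),
                                    c("Class 1", "Class 2", "Class 3")))
    expct <- outer(rowSums(obs), colSums(obs)) / sum(obs)    # the values of Table 6.7
    G2    <- 2 * sum(obs * log(obs / expct))                 # about 35.2
    pchisq(G2, df = (2 - 1) * (3 - 1), lower.tail = FALSE)   # a tiny tail probability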
6.3.1 Residuals
The Pearson4 residual, r, is defined by:

r = (Obs − Exp) / √Exp.    (6.3)

The idea behind the division by √Exp is to take account of the magnitude of the numbers concerned. For example, 101 differs from 102 by the same amount as 1 differs from 2, but the latter represents a doubling, whereas the former represents a negligible change. The so-called standardized residual replaces √Exp by a slightly smaller value. The difference is negligible in large samples.
As a guide, when the independence model is correct, the r-values will be
in the range (-2, 2), or, exceptionally, (-3, 3). Values with greater magnitude
indicate category combinations where the independence model is certainly
failing.
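Continuing the cabin-class sketch above, the residuals of Equation (6.3) are obtained by elementwise arithmetic (chisq.test also returns Pearson residuals in its $residuals component):

    # Pearson residuals, using obs and expct from the cabin-class sketch
    round((obs - expct) / sqrt(expct), 1)   # compare with Table 6.8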

Remember that the values of (Obs − Exp) sum to zero, so negative r-values will be offset by positive r-values.

Example 6.2 (cont.)

The Pearson residuals for the Titanic data are given in Table 6.8, with the
most extreme value shown in bold type.

Table 6.8 The Pearson residuals for the model of independence of survival on cabin class.

            Class 1   Class 2   Class 3
Survived        4.5      -3.0      -1.0
Died           -2.1       1.4       0.5

The residuals highlight that it is the high proportion of survivors amongst


male passengers who occupied Class 1 cabins that results in a failure of the
model of independence.

4 Named after Karl Pearson (1857-1936), an English biometrician who proposed the chi-
squared test in 1900.
6.3.2 Partitioning
Occasionally it is useful to divide a table into component parts, so as to make
clear departures from independence. If there are d degrees of freedom for
the G2 statistic, then notionally it would be possible to present the data as
d separate 2 by 2 tables, or some other combination of tables. Whatever
subdivision is used, the degrees of freedom for the separate tables will sum to
d and the sum of their G2 -values will equal the G2 -value for the overall table.
Example 6.2 (cont.)

The obvious subdivision for the cabin class data is to treat Class 1 cabins
as a class apart, as shown in Table 6.9.

Table 6.9 Partition of the class data into two 2 by 2 tables.


Class 1 Other classes Total Class 2 Class 3 Total
Survived 57 89 146 14 75 89
Died 118 541 659 154 387 541
Total 175 630 805 168 462 630

The two G2 -values, each with 1 degree of freedom are 28.2 and 7.0. The
latter is also unusually large (tail probability less than 1%) but no Pearson
residual has a magnitude greater than 2, suggesting that the evidence is not
compelling when it comes to a comparison of the survival rates for the pas-
sengers in Classes 2 and 3.
CHAPTER 7

RELATIONS BETWEEN TWO CONTINUOUS VARIABLES

In previous chapters we have generally looked at one variable at a time. However, the data faced by a data analyst will usually consist of information on
many different variables, and the task of the analyst will be to explore the
connections between the variables.
In this chapter we look at the case where the data consist of pairs of values
and we begin with the situation where both variables are numeric. Here are
some examples where the information on both variables is available simulta-
neously:

x y
Take-off speed of ski-jumper Distance jumped
Size of house Value of house
Depth of soil sample from lake bottom Amount of water content in sample

Sometimes data are collected on one variable later than the other variable,
though the link (the same individual, same plot of land, same family, etc.) remains clear:
x y
Amount of fertilizer Amount of growth
Height of father Height of son when 18

In all of these cases, while the left-hand variable, x, may affect the right-
hand variable, y, the reverse cannot be true.
Not all cases follow this pattern however. For example:
x y
No. of red blood cells in sample of blood No. of white blood cells in sample
Hand span Foot length

Whichever situation is relevant, a measure of the strength and type of


relation between the variables is provided by the correlation coefficient, with
which, having plotted the data, we start the chapter.

7.1 Scatter diagrams

The first step in data analysis is to plot the data in order to get an idea of any
relationship and also to check for possible false records. A scatter diagram is
simply the plot of the value of one variable against the corresponding value
of the other variable.
Example 7.1

In this example we use a data set that gives the numbers of arrests per
100,000 residents for assault, murder, and rape in each of the 50 US states in
1973. Also given is the percent of the population living in urban areas. Figure
7.1 shows two scatter diagrams involving the incidence of assaults.

Figure 7.1 Scatter diagrams of the incidence of arrests for murder


during 1973, plotted against (a) the percent of population living in urban
areas, and (b) the incidence of arrests for assault. Each data point refers
to a single state of the USA.
We might have expected that the crime rates would be greatest in the big
cities, but Figure 7.1 (a) shows no obvious pattern. By contrast, Figure 7.1
(b) shows that arrests for murders are common in states where arrests for
assault are common. There is a much clearer pattern in Figure 7.1 (b) than
in Figure 7.1 (a).

7.2 Correlation

We begin with the measure of the fuzziness in a scatter diagram that Francis
Galton1 called the index of co-relation. These days it is usually denoted by
r and referred to simply as correlation. You may also find it described as
the correlation coefficient, or even as the product-moment correlation
coefficient.
The coefficient can take any value from −1 to 1. In cases where increasing
values of one variable, x, are accompanied by generally increasing values of the
other variable, y, the variables are said to display positive correlation. The
opposite situation (negative correlation) is one where increasing values of
one variable are associated with decreasing values of the other variable. Figure
7.2 shows examples.

Figure 7.2 Examples of data sets showing positive and negative


correlation.

The case r = 0 can usually be interpreted as implying that the two variables
are unrelated to one another. However, the measure is specifically concerned
with whether or not the variables are linearly related. Figure 7.3 shows three
examples of zero correlation; these include a case where x and y clearly are
related, but not linearly. This underlines the need to plot the data whenever
possible.

1 Francis Galton (1822-1911), a cousin of Charles Darwin, turned his hand to many activities.
He studied medicine at Cambridge. He explored Africa (for which he received the gold medal
of the Royal Geographical Society). He devised an early form of weather map introducing
the term anticyclone. Possibly inspired by Darwin’s work, Galton turned to inheritance and
the relationships between the characteristics of successive generations. The term co-relation
appeared in his 1869 book Hereditary Genius.
Figure 7.3 Examples of data sets showing zero correlation.

If the data are collinear (i.e. they lie on a straight line) then
(unless that line is parallel to an axis) the correlation is ±1.

The value of r is unaffected by changes in the units of measurements.

The correlation between x and y is the same as the correlation between y and x.

Example 7.1 (cont.)

Table 7.1 summarises the correlations between the four variables in the US
states data.

Table 7.1 Correlations between the variables of the US state data.


The variables are the percent of the state population living in urban
communities, and the numbers of arrests (per 100,000 residents) for
murder, assault, and rape.
Murder Assault % Urban Rape
Murder 1.00 0.80 0.07 0.56
Assault 0.80 1.00 0.26 0.67
% Urban 0.07 0.26 1.00 0.41
Rape 0.56 0.67 0.41 1.00
The diagonal entries are all 1 because they are looking at the correlation
between two identical sets of values. The values above the diagonal are equal
to the corresponding entries below the diagonal because cor(x, y) = cor(y, x).
Figure 7.1 (a) showed no obvious pattern and this is reflected in the near-
zero correlation (0.07) between the percent of urban population and arrests
for murders. By contrast the pattern evident in Figure 7.1 (b) is reflected in
a sizeable positive correlation (0.80) between the relative numbers of arrests
for assaults and for murders.

Using R it is possible to look at several scatter diagrams simultaneously.
This is illustrated by Figure 7.4. Notice that each pair of variables appears
twice, so that, for example, the diagram at top right (the scatter diagram of
Murder against Rape) gives the same information as the diagram at bottom
left (the scatter diagram of Rape against Murder). The diagrams appear dif-
ferent because the variables are interchanged so that the points are effectively
reflected about the 45-degree line.
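If, as seems likely, the data are R's built-in USArrests data frame (an assumption: the text does not name the file), both the matrix of scatter diagrams and the correlations in Table 7.1 can be obtained with two commands:

    # All pairwise scatter diagrams (cf. Figure 7.4)
    pairs(USArrests)

    # Correlation matrix, rounded to two decimal places (cf. Table 7.1)
    round(cor(USArrests), 2)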

Figure 7.4 A complete set of scatter diagrams for the US arrests.



Correlation does not imply causation. Two variables may
appear correlated simply because each is related to some third
variable. For example, in children, hand width and foot size
are both related to age.

7.2.1 Testing for independence

If two variables are independent of one another then the population correlation
will be zero and the sample correlation should be close to zero. If a 95%
confidence interval for the correlation excludes zero, then we would reject the
hypothesis of independence at the 100 − 95 = 5% level.
Before the advent of modern computers the usual test was based on Fisher’s
z-transformation2 which required various distributional assumptions. Nowa-
days we can work directly with the observed data by resampling (Section 4.5.1)
the pairs of values and recalculating the correlation coefficient for each resam-
ple. The distribution obtained can then be used to find a bootstrap confidence
interval for the correlation coefficient. If the confidence interval includes zero,
then the hypothesis of independence would be acceptable.
An alternative test of independence requires resampling the values of one
variable only. Suppose that we have n pairs of values for the variables X and
Y and that we resample the values for Y.3 Each resample results in an original
X-value being paired with a resampled Y -value so as to produce a new set of
n pairs of values. Since the revised pairs use Y -values chosen at random from
those available, there will be no connection with the X-values. This means
that we are examining the situation where X and Y are uncorrelated, with
the aim of seeing what values might occur for the sample correlation.
By re-sampling only one variable and recalculating the correlations many
times we are effectively creating a distribution of feasible correlation values,
assuming a true value of zero. In this way we have produced a bootstrap null
distribution for correlation. We can compare the original observed correlation
value with that distribution and obtain a p-value. If this p-value is substan-
tially less than 0.05 we have clear evidence against the null hypothesis and we
can discount the idea of independence.
Example 7.2

We will use data relating to the eruptions by Old Faithful, a geyser in the
Yellowstone National Park. The geyser erupts every hour to an hour and

2 Sir Ronald Aylmer Fisher (1890-1962) was an English statistician. The fourteen editions
of his Statistical Methods for Research Workers set out the foundations of the subject for
20th century statisticians. The test statistic z is given by z = (1/2) ln{(1 + r)/(1 − r)}.
3 It does not matter which variable is resampled.

a half, but the time intervals vary. It is thought that this variation may be
linked to the lengths of the geyser eruptions. The data set consists of 272 pairs
of eruption lengths and the waiting times between eruptions, both recorded
in minutes. There is strong evidence (a correlation of 0.9) of a link between
these two variables. The data are illustrated in Figure 7.5.

Figure 7.5 Scatter diagram of waiting times between eruptions of the
Old Faithful geyser and the lengths of the eruptions.

Using 10,000 resamples on one variable we find correlations in the range
(−0.225, 0.225). The observed value of 0.9 is well outside this interval, con-
firming that there is an undoubted connection between the two variables: the
p-value is evidently much less than 1 in 10,000. The 95% confidence interval
for the correlation coefficient is found to be (0.88, 0.92).
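A minimal sketch of both resampling schemes, assuming the data are R's built-in faithful data frame (which matches the description above); the interval limits will vary slightly from run to run:

    r.obs <- cor(faithful$eruptions, faithful$waiting)     # about 0.90

    set.seed(1)
    n <- nrow(faithful)

    # Null distribution: resample one variable only, breaking the pairing
    r.null <- replicate(10000,
                        cor(faithful$eruptions,
                            sample(faithful$waiting, n, replace = TRUE)))
    quantile(r.null, c(0.025, 0.975))    # central 95% of the null correlations

    # Bootstrap confidence interval: resample the pairs intact
    r.boot <- replicate(10000, {
      rows <- sample(n, n, replace = TRUE)
      cor(faithful$eruptions[rows], faithful$waiting[rows])
    })
    quantile(r.boot, c(0.025, 0.975))    # interval for the correlation itself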

If the data consists of the entire population then the observed x-values are
the only values that can occur, with the same being true for the y-values. We
want to know whether the y-values in some way match the x-values.
A subtly different form of re-sampling is required. We keep the values of one
variable unchanged, but randomly re-order the values of the second variable.
For each re-ordering we calculate the correlation. If the original correlation
is unusually high (or low) compared to those from the random permutations,
then we have significant evidence that the variables are related. This is an
example of a permutation test.
Example 7.1 (cont.)

Figure 7.1 (a) showed no obvious relationship between the numbers of mur-
ders (per 100,000 residents) and the percent of the population living in urban
communities. The central 9,500 correlations resulting from 10,000 random
permutations of the urban values range from -0.27 to 0.27. This interval eas-
ily includes the observed value of 0.07, so the conclusion is that there is no

significant interdependence of murder rate and percent urban for the 50 U.S.
states.
By contrast, for the relation between murders and assaults illustrated in
Figure 7.1 (b), the observed correlation of 0.80 exceeds all the correlations
obtained from 10,000 random permutations of the assault rate. There is an
undoubted connection between the two variables.
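A sketch of the permutation test in R (again assuming the USArrests data); sample() without a size argument simply shuffles the values:

    r.obs <- cor(USArrests$Murder, USArrests$UrbanPop)     # about 0.07

    set.seed(1)
    r.perm <- replicate(10000,
                        cor(USArrests$Murder, sample(USArrests$UrbanPop)))
    quantile(r.perm, c(0.025, 0.975))    # central 95% of the permuted correlations
    mean(abs(r.perm) >= abs(r.obs))      # approximate two-sided p-value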

7.3 The equation of a line

Correlation is a measure of the extent to which two variables are linearly
related. If it appears that they are linearly related, then the next question
will be ‘What is that relation?’. Since the equation of a straight line has the
form
y = α + βx,
our task becomes one of finding appropriate values for α and β.
The constant α is called the intercept, and the constant β is the slope or
gradient. Figure 7.6 illustrates the relation between these quantities: when
x is zero, y is equal to α, while a unit change in the value of x causes a change
of β in the value of y.

Figure 7.6 Interpretation of the equation for a straight line.

7.4 The method of least squares

Correlation treated x and y in the same way, but we now adopt an asymmetric
approach. We assume that the value observed for one variable depends upon
the value taken by the other variable. We will refer to x as the value of the

explanatory variable, with y being the value of the dependent variable
(also called the response variable).
Suppose that the ith observation has co-ordinates (xi , yi ). Corresponding
to the value xi , the line will have a y-value of α + βxi . The discrepancy
between this value and yi is called the residual and is denoted by ri :

ri = yi − (α + βxi ). (7.1)

Figure 7.7 The residuals for the line of best fit of y on x are shown as
dotted lines. The line goes through (x̄, ȳ), shown as a larger dot.

If the values of α and β are well chosen, then all of the residuals r1 , r2 ,
. . . , rn will be relatively small in magnitude. In view of the fact that some of
the residuals will be negative and some will be positive, we work with their
squares.

The values chosen for α and β are the values that minimize the sum
of the squared residuals, Σ ri².

Example 7.3

A factory uses steam to keep its radiators hot. Records are kept of y, the
monthly consumption of steam for heating purposes (measured in lbs) and
of x, the average monthly temperature (in degrees C). The results were as
follows:
x 1.8 −1.3 −0.9 14.9 16.3 21.8 23.6 24.8
y 11.0 11.1 12.5 8.4 9.3 8.7 6.4 8.5
x 21.5 14.2 8.0 −1.7 −2.2 3.9 8.2 9.2
y 7.8 9.1 8.2 12.2 11.9 9.6 10.9 9.6
We begin by plotting the data (Figure 7.8).

Figure 7.8 The relation between monthly temperature and the amount
of steam required.

From the scatter diagram there does indeed appear to be a roughly linear
relation between the variables, with the y-values decreasing as the x-values
increase. The estimated regression line of y on x is found to be approximately

y = 11.3 − 0.156x.

This fitted line is shown on Figure 7.9, together with the line resulting from
muddling the two variables and regressing monthly temperature on pounds of
steam.

Figure 7.9 The fitted line, together with the incorrect line based on
regressing monthly temperature on steam consumption.
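A sketch of the least squares fit in R, using the data tabulated above; the coefficients should be close to the values just quoted:

    x <- c(1.8, -1.3, -0.9, 14.9, 16.3, 21.8, 23.6, 24.8,
           21.5, 14.2,  8.0, -1.7, -2.2,  3.9,  8.2,  9.2)
    y <- c(11.0, 11.1, 12.5,  8.4,  9.3,  8.7,  6.4,  8.5,
            7.8,  9.1,  8.2, 12.2, 11.9,  9.6, 10.9,  9.6)

    steam.lm <- lm(y ~ x)    # least squares regression of y on x
    coef(steam.lm)           # intercept and slope (about 11.3 and -0.156)

    plot(x, y)               # always plot the data ...
    abline(steam.lm)         # ... together with the fitted line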

Always plot the data to make sure that:
1. Fitting a straight line is appropriate.
2. There are no outliers (possibly incorrect records).
3. The fitted line goes through the data. If it doesn't then you
have probably made an error!

When using the least squares regression line there is the implicit assumption
that every deviation from the fitted line can be considered to be an observation
from the same (unknown) normal distribution. This is usually a reasonable
assumption, but there are situations where it is clearly incorrect. For example,
if we were asked to count the ducks on the village pond, then we would not
expect to miscount by very many. However, with the same task at a large
reservoir that sort of accuracy would be infeasible.

Check that the extent of the scatter about the line is the same
for small values of x (or y) as it is for large values. If this is
not the case, experiment by taking logarithms (or some other
simple function) of one or both of the variables.

7.5 A random dependent variable, Y

In the previous section no mention was made of probability, probability dis-
tributions, or random variables. The method of least squares was simply a
procedure for determining sensible values for α and β, the parameters of a line
that partially summarised the connection between y and x. We now introduce
randomness.
If there is an underlying linear relationship, then this relationship connects
x not to an individual y-value, but to the mean of the y-values for that par-
ticular value of x. The mean of Y , given the particular value x, is denoted
by
E(Y |x),
which is more formally called the conditional expectation of Y . Thus the
linear regression model is properly written as
E(Y |x) = α + βx. (7.2)
An equivalent presentation is
yi = α + βxi + εi, (7.3)
where εi, the error term, is an observation from a distribution with mean
0 and variance σ². Comparing these two equations demonstrates that the
regression line is providing an estimate of the mean (the expectation) of Y for
any given value of x.

7.5.1 Estimation of σ²
Assuming that our model is correct, the information about σ² is provided by
the residuals and an appropriate estimate of σ² is provided by

σ̂² = D/(n − 2), (7.4)

where n is the number of observations, and D is the residual sum of squares:

D = Σ ri², (7.5)

with the sum running from i = 1 to n.
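In R the residual sum of squares and the estimate of σ² can be extracted from a fitted model; a sketch using the steam.lm model fitted earlier:

    D <- sum(resid(steam.lm)^2)     # residual sum of squares
    n <- length(resid(steam.lm))    # number of observations
    D / (n - 2)                     # estimate of sigma-squared
    summary(steam.lm)$sigma^2       # the same quantity, extracted directly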

7.5.2 Confidence interval for the regression line


For each observed xi-value, there is a fitted ŷi value given by:

ŷi = α̂ + β̂xi, (7.6)

where α̂ and β̂ are the least squares estimates of the unknown regression
parameters and ŷi provides an estimate of the average of Y when the value of
x is xi.
The confidence interval surrounding a ŷi value gives an indication of how
accurately the average value of y is known for that particular x-value. The
variance associated with ŷi is a function of (xi − x̄)², which implies that the
variance is least when x = x̄ and steadily increases as x deviates from x̄.

7.5.3 Prediction interval for future values


Using Equation (7.3) a future observation, to be taken at xf, would be given
by
α + βxf + εf.

However, not only do we not know the precise values for α and β, but we
also cannot know what size random error (εf) will occur. The uncertainty
concerning α and β is responsible for the confidence interval for the regression
line. Here, however, we have a third source of variation, so the prediction
interval is always wider than the confidence interval for the regression line.
Example 7.3 (cont.)

In Figure 7.10 the inner pair of bounds are the 95% confidence bounds for
the fitted regression line. Any line that can be fitted between the bounds
would be considered to be a plausible description of the dependence of steam
consumption on average monthly temperature.

According to the predictions, 95% of future observations would fall between
the outer bounds. Reassuringly, all the observations actually made do fall
within these bounds.
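Bounds of this kind can be obtained with predict(); a sketch, again using the steam.lm model (the grid of x-values is illustrative):

    new.x <- data.frame(x = seq(-3, 25, by = 0.5))

    conf <- predict(steam.lm, new.x, interval = "confidence")   # for the mean of Y
    pred <- predict(steam.lm, new.x, interval = "prediction")   # for a future observation

    head(conf)    # fit, lower and upper confidence bounds
    head(pred)    # the prediction bounds are always the wider pair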

Figure 7.10 The fitted line, together with the 95% confidence and
prediction intervals.

7.6 Departures from linearity

7.6.1 Transformations
Not all relationships are linear! However, there are quite a few non-linear
relations which can be turned into the linear form. Here are some examples:

y = γx^β         Take logarithms           log(y) = log(γ) + β log(x)
y = γe^(βx)      Take natural logarithms   ln(y) = ln(γ) + βx
y = (α + βx)^k   Take kth root             y^(1/k) = α + βx

In each case we can find a way of linearising the relationship so that the
previously derived formulae can be used. The advantage of linearisation is that
the parameter estimates can be obtained using standard formulae. Remember
that it will be necessary to translate the line back to the original non-linear
form of the relationship when the results are reported.
Example 7.4

Anyone owning a car will be dismayed at the rate at which it loses value. A
reasonable model states that a car loses a constant proportion of its previous
value with every passing year. Thus, if initially a car is valued at V0 and its
value after a year is given by V1 = γV0 , then its value after two years will be
V2 = γV1 = γ 2 V0 , and so forth. Thus, after x years, the car will have a value
γ x V0 . If we take logarithms then we have the linear model:
log(Vx ) = log(V0 ) + log(γ)x.

Figure 7.11 (a) Price plotted against age, with the fitted curve derived
from (b). (b) As (a), but with price plotted using a logarithmic scale.

Figure 7.11 refers to the Nissan Leaf. It ignores model revisions and simply
reports the list prices of second-hand Leafs and their approximate ages based
on their number plates. Figure 7.11 (a) illustrates the data in the usual
fashion, whereas Figure 7.11 (b) uses a logarithmic scale on the y-axis. This
diagram shows the linear relation between log(price) and age. The relation
is also plotted on Figure 7.11 (a). The value of γ is estimated as 0.82: the
annual loss of value was nearly 20% of the previous year’s value.
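A sketch of the linearised fit in R. The values below are synthetic, generated from the model with γ = 0.82, because the original Nissan Leaf prices are not reproduced here:

    set.seed(1)
    age   <- 1:8                                         # ages in years (illustrative)
    price <- 25000 * 0.82^age * exp(rnorm(8, 0, 0.05))   # synthetic second-hand prices

    fit <- lm(log(price) ~ age)          # linear model for log(price)
    gamma.hat <- exp(coef(fit)["age"])   # estimated annual retention factor
    gamma.hat                            # close to 0.82 for these synthetic values
    1 - gamma.hat                        # approximate proportion of value lost each year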
7.6.2 Extrapolation
For many relations no transformation is needed because, over the restricted
range of the data, the relation does appear to be linear. As an example,
consider the following fictitious data:
x y
Amount of fertilizer per m2 Yield of tomatoes per plant
10 g 1.4 kg
20 g 1.6 kg
30 g 1.8 kg
In this tiny data set there is an exact linear relation between the yield and
the amount of fertilizer, namely y = 0.02x + 1.2. How we use that relation
will vary with the situation. Here are some examples:
1. We can reasonably guess that, for example, if we had applied 25 g of
fertilizer then we would have got a yield of about 1.7 kg. This is a
sensible guess, because 25 g is a value similar to those in the original
data.
2. We can expect that 35 g of fertilizer would give a yield of about 1.9 kg.
This is reasonable because the original data involved a range from 10 g
to 30 g of fertilizer and 35 g is only a relatively small increase beyond the
end of that range.

3. We can expect that 60 g of fertilizer might lead to a yield in excess of 2
kg, as predicted by the formula. However, this is little more than a guess,
since the original range of investigation (10 g to 30 g) is very different
from the 60 g that we are now considering.

4. If we use 600 g of fertilizer then the formula predicts over 13 kg of toma-
toes. This is obviously nonsense! In practice the yield would probably
be zero because the poor plants would be smothered in fertiliser!
Our linear relation cannot possibly hold for all values of the variables,
however well it appears to describe the relation in the given data.

The least squares regression line is not a substitute for common
sense!

7.6.3 Outliers
An outlier is an observation that has values that are very different from the
values possessed by the rest of the data. An outlier can seriously affect the
parameter estimates. Figure 7.12 shows two examples. In each case all the
points should be lying close to the regression line. However, in each case, one
y-value has been increased by 30.

Figure 7.12 (a) An inflated y-value for a central x-value shifts the fitted
line upwards. (b) An inflated y-value for an extreme x-value drastically
alters the slope.

In Figure 7.12 (a), there is little change to the slope, but the intercept
with the y-axis is slightly altered. The difference appears trivial, but is very
apparent when the residuals are studied (Figure 7.13): nine of the ten residuals
are negative.
In Figure 7.12 (b), because the outlier has an extreme x-value, both the
slope and the intercept are affected. Notice how all the residuals are affected,
not just that for the affected value. There is an obvious pattern including a
succession of five negative residuals.

Figure 7.13 The residuals corresponding to the cases illustrated in
Figure 7.12.

The most common typographical error involves the interchange
of neighbouring digits.

For any model, the good data scientist will always look for pat-
terns in the residuals. If a pattern is found then that indicates
either an outlier, or an incorrect model.

7.7 Distinguishing x and Y

Below are pairs of examples of x and Y -variables. In each case the x-variable
has a non-random value set by the person carrying out the investigation, while
the Y -variable has an unpredictable (random) value.
x                                     Y
Length of chemical reaction (mins)    Amount of compound produced (g)
Amount of chemical compound (g)       Time taken to produce this amount (mins)
An interval of time (hrs)             Number of cars passing during this interval
Number of cars passing junction       Time taken for these cars to pass (hrs)

To decide which variable is x and which is Y evidently requires some knowl-
edge of how and why the data were collected.

7.8 Why ‘regression’ ?

Figure 7.14 The start of Galton's notebook recording the heights of
family members. Notice the record of the first son's height is given as
13.2, rather than 13.25.

Galton (see footnote in Section 7.2) studied inheritance by examining the
heights of successive generations of people. He recorded his data in a notebook
that is preserved at University College, London. Photographs of the pages can
be viewed online;4 the start of the notebook is shown in Figure 7.14.
Examining Figure 7.14, it is immediately apparent that Galton was not
being over-precise with his measurements, since a majority of the heights
recorded are given in round numbers of inches. An electronic version of the
data can be downloaded.5 Using that data, the 'round-number' bias is sum-
marized in Table 7.2 using the complete data for fathers and sons.

Table 7.2 The decimal places recorded by Galton show a distinct ‘round-number’
bias.
Decimal part of height 0 1 2 3 4 5 6 7 8 9
Frequency 676 0 26 4 0 188 0 36 0 0

It seems probable that ‘0.2’ and the occasional ‘0.3’ were Galton’s render-
ings of 0.25, with 0.75 being recorded as ‘0.7’.

4 Currently at
http://www.medicine.mcgill.ca/epidemiology/hanley/galton/notebook/index.html.
5 Currently at http://www.randomservices.org/random/data/Galton.html.

Figure 7.15 Two plots of father’s heights against son’s heights. The
left-hand plot shows the height combinations that were recorded. By
adding jitter (see text) to the original data, the right-hand plot gives a
slightly clearer idea of the amount of data recorded.

Figure 7.15 plots Galton’s data. The left-hand diagram is very clear, but it
is misleading, since, because of Galton’s rounding of the data, a plotted point
may correspond to more than one father-son combination. For example, there
were 14 cases where both father and son were recorded as having heights of 70
inches. The right-hand diagram presents the data with each original pair of
values being adjusted by the addition or subtraction of small randomly chosen
increments. The process is called adding jitter.
When Galton studied the data, he noticed two things:
1. On average, the heights of adult children of tall parents were greater than
the heights of adult children of short parents: the averages appeared to
be (more or less) linearly related.
2. On average, the children of tall parents are shorter than their parents,
whereas the children of short parents are taller than their parents: the
values regress towards the mean.

These findings led Galton (in a talk entitled “Regression towards mediocrity
in hereditary stature” given to the British Association for the Advancement
of Science in 1885) to refer to his summary line drawn through the data as
being a regression line, and this name is now used to describe quite general
relationships.

Figure 7.16 On average, short fathers have taller sons; tall fathers have
shorter sons.

To better visualise Galton’s second finding, Figure 7.16 plots the difference
between the son’s height and the father’s height against the father’s height.
CHAPTER 8

SEVERAL EXPLANATORY VARIABLES

The previous chapter concerned the dependence of one variable (the response
variable) on a single explanatory variable. However, there are often several
possible explanatory variables. The challenge for the data analyst is to de-
termine which are the key variables, and which can safely be ignored. This
process is often called feature selection.
The saturated model is the most complicated model that can be con-
sidered for the given data. Although it will be the best fitting model, it will
also employ the most parameters, and may therefore be too complicated to
be useful. In effect the saturated model is attempting to describe not only
the underlying trend (a good thing), but also the random variations present
in the sample (a bad thing). This is described as overfitting. Such a model
can provide less reliable predictions than those from a model that picks out
only the underlying trend. We will return to this point in Section 8.5.
A measure of the lack of fit of a model is provided by the deviance, which
compares the likelihood (Section 4.1.1) for the model with that for the satu-
rated model.


Example 8.1

Suppose we toss a coin 100 times and obtain 43 heads and 57 tails. The
saturated model (echoing the observed outcome) would propose that the prob-
ability of a head was 0.43 and the probability of a tail was 0.57. For this model,
the likelihood, Ls, would be given by

Ls = c × 0.43^43 × 0.57^57,

where the constant c is a count of the number of ways that 43 heads could
arise during 100 tosses.
A simpler (and potentially more useful) model would state that the coin
is fair: P(Head) = P(Tail) = 0.5. The likelihood, Lf, for this model would be
given by
Lf = c × 0.5^100.
The deviance for the fair model is given by

Df = −2 ln(Lf /Ls ),

where ln signifies the natural logarithm. The value of c is irrelevant as it
cancels out when the ratio of the likelihoods is evaluated.
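Since c cancels, the deviance of the fair-coin model can be computed directly from the two log-likelihoods:

    logLs <- 43 * log(0.43) + 57 * log(0.57)   # saturated model (log c omitted)
    logLf <- 100 * log(0.5)                    # 'fair coin' model (log c omitted)

    Df <- -2 * (logLf - logLs)                 # deviance of the fair model
    Df                                         # approximately 1.97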

In the next section we see how we may decide between competing models,
in a way that takes account of their fit to the data and their complexity.

8.1 AIC and related measures

William of Ockham was a Franciscan friar believed to have been born in
Ockham (in Surrey, England) in about 1287. He suggested that, when there
are alternative explanations of some phenomenon, it is the simpler explanation
that should be preferred: this is the principle now called Ockham’s razor
or the principle of parsimony.
Suppose we have a complicated model, Mc with many explanatory vari-
ables, for which the deviance is equal to Dc . There is also the simpler model
Mb which involves just a few of the variables that were in Mc , and nothing
extra. It is impossible for this simpler model to give a better fit, so it will
have a larger deviance, Db . The question is whether the gain in simplicity
offsets the loss of fit.
Example 8.2

As an analogy, suppose that we have 3 piles of bricks, as follows:

Number of bricks 8 12 20
Weight (kg) 22 38 60

A perfectly accurate description would be: "a pile of 8 bricks weighing 22
kg, a pile of 12 bricks weighing 38 kg, and a pile of 20 bricks weighing 60 kg".
However, the description "there are piles containing 8, 12, and 20 bricks, with
an average brick weighing 3 kg" will be much more useful, even though it is
not a perfect description.

A useful measure is therefore one that balances model complexity and
goodness-of-fit. The most commonly used measure is the Akaike1 Informa-
tion Criterion (AIC) which is a function of the deviance and the degrees of
freedom.

The degrees of freedom for a model is the number of observations
minus the number of parameters in the model.

Since Akaike introduced his measure, a small correction has been suggested.
The adjusted measure is referred to as AICc.
Very similar in spirit to AIC is the Bayesian Information Criterion
(BIC), also known as the Schwarz criterion. As with AIC, small val-
ues are preferable to large ones. However, AIC and BIC have subtly differ-
ent motivations: AIC seeks to select the model, from those available, that
most closely resembles the true model (which will be governed by a myriad
of unmeasured considerations and will rarely be amongst those considered),
whereas BIC assumes that the correct model is amongst those on offer and
seeks to identify that optimal model.

With complicated situations that require a data scientist, the
true explanation for the data will involve variables unavailable
to the investigator. For this reason we recommend using AICc
(or AIC) for model selection.

8.2 Multiple regression

The previous chapter concentrated on the situations where the value exhibited
by a response variable might depend upon the value of a single explanatory
variable. Often, however, there are several possible explanatory variables.
Rather than considering a series of regression models linking the response

1 Hirotugu Akaike (1927-2009) graduated from the University of Tokyo in 1952. His career
was spent in the Institute of Statistical Mathematics where he was Director General be-
tween 1986 and 1994. He was awarded the Kyoto Prize (Japan’s highest award for global
achievement) in 2006.

variable, in turn, to each of the explanatory variables, we can incorporate
information about all the explanatory variables in a single model. This is
called a multiple regression model. We start with the simple case where there
are just two explanatory variables.

8.2.1 Two variables


Suppose we have two explanatory variables, X1 and X2 , and the response
variable Y . The straightforward extension of the linear regression model is

y = α + β1 x1 + β2 x2 (8.1)

A sensible analysis might begin with a plot of y against each x-variable in
turn, possibly with calculation of the correlation in each case. Assuming that
the data are acceptable (no obvious errors or outlying values), the AIC values
for the separate linear regression models would be compared with that for the
multiple regression model to determine which provides the best fit (i.e. has
the smallest AIC value).
The data in the example that follows were provided as the birthwt file
within the MASS library of R.
Example 8.3

As an example we use data concerning the weights of 189 babies, collected
at Baystate Medical Center, Springfield, Massachusetts, during 1986. We will
return to this data set throughout the chapter, whilst comparing possible
models.
For this first example we use the information concerning the mother’s age
and weight (in lbs). We begin with separate plots of the weights of the babies
against mother’s weight and mother’s age and calculation of the two correla-
tions. The results are shown in Figure 8.1.

Figure 8.1 Scatter diagrams and fitted regression lines for the
dependence of a baby’s weight on the mother’s weight and age.

There is a plausible positive correlation between the weights of the babies
and their mothers. The correlation with the mother's age is much less (0.09)
and hawk-eyed readers will have noticed that one mother (with a heavy baby)
was 9 years older than the next oldest. When the data for that mother and
child are removed, the correlation falls to 0.03.
Continuing by fitting models to the complete data set we get the results
summarised in Table 8.1.

Table 8.1 Fits of linear regression models and the multiple regression
model to the baby weight data. Here y denotes the baby’s weight, x1 the
mother’s age, and x2 , the mother’s weight
α βMother’s age βMother’s weight AIC
M0 y = 2945 3031
Mage y = 2656 +12.43x1 3032
Mweight y = 2370 +4.43x2 3026
Mage+weight y = 2214 +8.09x1 +4.18x2 3028

The model M0 is the model that estimates the same weight for every baby
and ignores the possible influences of any explanatory variables. It is in-
cluded for completeness and provides a means of judging the usefulness of the
explanatory variables. Note that the model Mage actually has a higher AIC
value, suggesting that age (when on its own) should be ignored.
Using AIC the best model (lowest AIC value) is the linear regression model
Mweight , with a fit indicated in Figure 8.1. Including information about the
mother’s age is again unhelpful.
Notice that the βs in the multiple regression model have different values
from those in the simpler one-variable models.
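A sketch of how the models in Table 8.1 might be fitted and compared in R; in the MASS version of the data, bwt is the baby's weight, age the mother's age, and lwt the mother's weight:

    library(MASS)    # provides the birthwt data frame

    m0     <- lm(bwt ~ 1,         data = birthwt)   # same estimate for every baby
    m.age  <- lm(bwt ~ age,       data = birthwt)
    m.wt   <- lm(bwt ~ lwt,       data = birthwt)
    m.both <- lm(bwt ~ age + lwt, data = birthwt)

    AIC(m0, m.age, m.wt, m.both)    # compare with the AIC column of Table 8.1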

The relevance of a variable depends on which other variables
are in the model.

8.2.2 Collinearity
When there are many variables that may help to explain the variation in the
values of a response variable, it will sometimes happen that two (or more)
of these variables are highly correlated because of their dependence on some
unmeasured background variable (see Figure 8.2).
In the extreme case where the correlation between variables A and B is
1 (or −1), if one variable is included in the model, then adding the second
variable adds no extra useful information: no change in the deviance (G²), but an increase in
the AIC value (because of an unnecessary extra parameter being included).

Figure 8.2 When two variables each depend on some (possibly
unmeasured) background variable, their values are often correlated.

If the correlation between A and B is large and each variable provides
information about the value of a response variable, then a useful model will
probably include one of the two variables, but not both. However, which of A
and B should be included may depend upon which other variables are present.
This can complicate stepwise selection (Section 8.2.5).

It is a good idea to examine the correlations between the potential
explanatory variables before the main analysis. If two
of the variables are highly correlated then consider using just
one of them. The extra information gained from including the
second variable can be examined after an initial 'final' model
has been established.

8.2.3 Using a dummy variable

Categorical variables (for example, gender or race) can be included in a re-
gression model using dummy variables. A dummy variable takes the value 0
or 1 depending upon whether a condition is true. Suppose, for example, that
we are attempting to estimate a person’s weight, w, from knowledge of that
person’s height, h, using the model
w = α + βh.
Suppose we collect data from two different countries, A and B, and believe
that the linear relation may differ from one country to the other. We will use
the dummy variable, D.
A possible model is:
w = (α + γD) + βh (8.2)
With the dummy variable D taking the value 0 for country A and the value
1 for country B, we are therefore fitting the models:

For country A:  w = α + βh
For country B:  w = (α + γ) + βh

Figure 8.3 shows that, by using the dummy variable with the α term, we
are, in effect, fitting parallel lines.

Figure 8.3 The model w = (α + γD) + βh.

Of course it would also be possible to fit separate models (different slopes
and intercepts) by analyzing the data from each country separately.
Example 8.3 (cont.)

Amongst other information collected concerning the mothers and their ba-
bies was whether the mothers smoked during pregnancy. Figure 8.4 distin-
guishes between the two groups and shows the regression lines that result
when each group is analysed separately.

Figure 8.4 The separate regression lines for mothers who smoked during
pregnancy and for mothers who did not smoke.

It appears that the non-smoking mothers had slightly heavier babies and,
since the two lines are reasonably parallel, this suggests that the model given
by Equation (8.2) will be appropriate for the entire dataset. The fitted model
is
y = (2501 − 272D) + 4.24x,
where y and x denote, respectively, the baby’s weight and the mother’s weight.
Here D = 0 denotes a non-smoker and D = 1 denotes a smoker. These
estimates are included in Table 8.2 along with those for simpler models and
the AIC values.

Table 8.2 Fits of the regression models of baby's weight on mother's
weight, with and without the inclusion of the smoking dummy variable.
Here y denotes the baby's weight and x the mother's weight.
α γSmoker βMother’s weight AIC
M0 y = 2945 3031
Msmoker y = 3056 -284 3026
Mweight y = 2370 +4.43x 3026
Mweight+smoker y = 2501 -272 +4.24x 3022

The model Msmoker states that the mother’s weight is irrelevant, with the
only determinant for the variation in the weights of the babies being whether
or not the mother smoked during pregnancy. The reduction from the AIC
value for the null model is the same for this model as for Mweight (which
ignores the smoking factor).
However, the model that includes both, Mweight+smoker , results in a fur-
ther reduction in the AIC value. This model estimates that a baby will weigh
272 g less for a smoking mother than for a non-smoker of the same weight
and also that an increase in the weight of the mother by 1lb will be matched
by an increase of 4.24g in the baby’s weight.
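Since smoke is already a 0/1 variable in the birthwt data, the model of Equation (8.2) is fitted directly; a sketch:

    m.wt.smoke <- lm(bwt ~ lwt + smoke, data = birthwt)
    coef(m.wt.smoke)     # should be close to 2501, 4.24 and -272
    AIC(m.wt.smoke)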

8.2.4 The use of multiple dummy variables

As an example of the use of dummy variables, we supposed, in our introduc-
tion to the previous section, that our height and weight data came from two
countries, A and B. Suppose instead that the data came from four countries,
A to D. We would now need three dummy variables, each of which takes the
value 0 or 1 as shown in Table 8.3.
For a model that uses the same slope but different intercepts, we would
have:
w = α + γ1 D1 + γ2 D2 + γ3 D3 + βh,
so that, for example, for country C, the model becomes:
w = (α + γ2) + βh.

Table 8.3 The values for three dummy variables corresponding to four
categories of the variable ‘country’.
D1 D2 D3
Country A 0 0 0
Country B 1 0 0
Country C 0 1 0
Country D 0 0 1

Here country A is the reference category with which the other countries are
compared2 .

If a variable has M non-numeric categories, then we will need
M − 1 dummy variables.

Some computer languages apply multiple dummy variables with
a single instruction. The language R uses the factor command
for this purpose.
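In the birthwt data race is coded 1, 2, 3 for 'white', 'black' and 'other'; wrapping it in factor() makes R construct the dummy variables automatically, with the first level ('white') as the reference category. A sketch:

    m.race <- lm(bwt ~ lwt + smoke + factor(race), data = birthwt)
    coef(m.race)     # one coefficient for each non-reference race category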

Example 8.3 (cont.)

Another variable reported for the baby weight data was the mother’s race,
with the 3 alternatives being ‘white’, ‘black’, or ‘other’. The result of including
this variable is reported in Table 8.4, where the parameter for the intercept
is denoted by α, the coefficient of the quantitative variable by β, and the
coefficients for the dummy variables by γ.

Table 8.4 The impact of race on the baby’s weight.

α γSmoker γBlack γOther βMum wt AIC


M0 2945 3031
Mrace3 3103 -383 -297 3025
Mrace3+weight 2486 -452 -241 4.66 3020
Mrace3+smoker 3335 -429 -450 -453 3012
Mrace3+smoker+weight 2799 -400 -504 -395 3.94 3009

2 Regression software usually chooses either the first category or the last category as the
reference category. An alternative is to compare every category with their average.

Race is evidently an important variable (reducing AIC to 3025). Indeed,
each variable reduces AIC. The best fitting model (lowest AIC) includes all
three variables. We see that mothers described as ‘black’ have babies weighing
on average about 500 g less than those of mothers classed as ‘white’ (here
chosen arbitrarily as the reference category). The reduction for the mothers
described as ‘other’ is about 400 g. Regardless of race there is a further
reduction of 400 g if the mother is a smoker.
The mean weight of a mother is given as 130 lb (just over 9 stones). The
average baby weight for a non-smoking white mother is estimated by the
model as
2799 + 3.94 × 130 ≈ 3300 g,
whereas that for a black mother of the same weight who smoked during preg-
nancy is estimated as

2799 − 400 − 504 + 3.94 × 130 ≈ 2400 g.

8.2.4.1 Eliminating dummy variables A variable with m distinct categories
will require (m − 1) dummy variables. For example, with four countries, we
needed three dummy variables. However, we may find that two (or more)
categories are sufficiently similar that the same dummy variable can be used
for both (or all).
Suppose that Countries C and D are very similar to one another, but are
quite different from the other countries. In this case we need only two dummy
variables as shown in Table 8.5.

Table 8.5 The values for two dummy variables corresponding to four
categories of the variable ‘country’, when countries C and D are very
similar.
D1 D2
Country A 0 0
Country B 1 0
Country C 0 1
Country D 0 1

Example 8.3 (cont.)

Table 8.4 showed that race was an important factor affecting the baby’s
weight, with both mothers classed as ‘Black’ and those classed as ‘Other’ hav-
ing lighter babies. For some models in the table we see that γBlack ≈ γOther .
We now consider combining these two categories into a single category which
we call ‘Non-white’. The results for the competing models are summarised in
Table 8.6.

Table 8.6 The effect on the baby's weight of combining the 'Black' and
'Other' race categories.

α γSmoker γBlack γOther βMum wt AIC
Mrace3+smoker+weight 2799 -400 -504 -395 3.94 3009

α γSmoker γNon-white βMum wt AIC
Mrace2+smoker+weight 2849 -412 -431 3.61 3007

Refitting the best model, but with the three race categories reduced to two,
results in a further decrease in AIC. The simplification is worthwhile.

8.2.5 Model selection


If there are m explanatory variables, each of which will be either in or out of
the model, then there are 2^m possible models.3 With m = 10, that means
there would be 1024 models. With 20 variables there would be more than
a million possible models. So an automated procedure for choosing the best
model is badly needed.
A useful technique is to create the model that contains all of them and
require the computer to perform stepwise selection starting with that overly
complex saturated model. Stepwise selection proceeds by alternately removing
(or adding) terms one at a time, with the aim of optimising some function
(typically minimising the AIC value).4
However, the bottom line is that it is you who is in charge, not the com-
puter! If there is a simple model that explains the vast majority of the vari-
ation in the response variable, then the pragmatic choice may be to use that
model rather than the computer’s more complicated suggestion.
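In R, AIC-based stepwise selection is provided by step(); a sketch that starts from a deliberately complicated model for the baby weight data:

    big <- lm(bwt ~ age + lwt + smoke + factor(race) + ptl + ht + ui + ftv,
              data = birthwt)

    chosen <- step(big, direction = "both", trace = 0)   # add/remove terms to reduce AIC
    formula(chosen)    # the variables retained
    AIC(chosen)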

8.2.6 Interactions
It is sometimes the case that the combined effect of two explanatory variables
is very different from the simple sum of their separate effects. For example

3 When interactions (Section 8.2.6) are included the number is much greater still.
4 Some regression software includes an automated stepwise routine.

the noise made by striking a match is slight, and a pile of gunpowder on its own
makes no noise at all, but the outcome is very different when the match is
struck beside a pile of gunpowder.
If x1 and x2 are two explanatory variables, then we can introduce their
interaction by adding a product term, x1 x2 , into the model. For example,

y = α + β1 x1 + β2 x2 + β12 x1 x2 .

The same procedure works with dummy variables, as the following example
illustrates.
Example 8.3 (cont.)

We have found that a mother’s race is relevant to the weight of her baby,
and that whether she smoked during pregnancy also appears to affect the
weight. We now wonder whether the effect of smoking might be more se-
rious for some races than others. Continuing to combine the ‘Black’ and
‘Other’ categories, a preliminary investigation involves calculating the mean
baby weights for the four race-smoking combinations. The results, which are
shown in Table 8.7 certainly suggest that there may be important differences.
However, since these results do not take account of the differing mother’s
weights, a full regression model is required.

Table 8.7 The mean baby weights (and the numbers of mothers) for
the four combinations of smoking and race.
White Non-white
Smoker 2827 (52) 2642 (22)
Non-smoker 3429 (44) 2824 (71)
Difference 602 182

Using the subscripts S, and R, for ‘Smoker’, and ‘Race’, we now have the
model MI given by

b = α + γS xS + γR xR + γSR xS xR + βm, (8.3)

where b and m refer to the weights of babies and mothers, and xS and xR are
dummy variables. Table 8.8 shows how this works for the four race-smoker
combinations (using 0 for 'White' and for 'Non-smoker', and 1 otherwise).

Table 8.8 The outcome of the model for the four combinations of
smoking and race.
White non-smoker b = α + βm
White smoker b = α + γS + βm
Non-white non-smoker b = α + γR + βm
Non-white smoker b = α + γS + γR + γSR + βm

The AIC value for the model MI (Equation 8.3) is 3006.5,
which is marginally less than the value for the model without interactions
(3007.3). The difference between these AIC values is much less than the
other differences recorded between the competing models. This suggests that
the interaction (which complicates the interpretation of the model) is of minor
importance.
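In R's model formulas an interaction is written x1:x2, and x1*x2 is shorthand for x1 + x2 + x1:x2. A sketch of the interaction model MI, with the combined 'non-white' dummy created by hand:

    birthwt$nonwhite <- as.numeric(birthwt$race != 1)        # 1 for 'black' or 'other'

    m.I <- lm(bwt ~ lwt + smoke * nonwhite, data = birthwt)  # includes the interaction term
    AIC(m.I)    # should be close to the value 3006.5 quoted above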

8.2.7 Residuals
Remember that the analysis of a data set does not stop with the choice of a
model. Any model is a simplification of reality. We always need to examine the
residuals. A residual is the difference between the value observed and the value
predicted by the model. Some will be positive, some will be negative, and,
hopefully, most will be relatively small in magnitude. A box plot (Section 2.3)
is a good way of identifying any unusual values. If there are several similarly
unusual values that share some trait not accounted for by the current model,
then this points to the need to include an additional explanatory variable or
an interaction in the current model (or try a completely different model).
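A sketch of this residual check, applied to the interaction model m.I fitted above:

    res <- resid(m.I)
    boxplot(res, horizontal = TRUE)     # cf. Figure 8.5
    birthwt[head(order(res), 2), ]      # inspect the two most extreme negative residuals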
Example 8.3 (cont.)

Figure 8.5 shows the boxplot for model MI (Equation (8.3)). There are two
obvious outlier values. Examination of these cases shows that both mothers
had a condition described as ‘uterine irritability’. There were 26 other mothers
with this condition and their babies were, on average, 600 g lighter than the
babies of mothers without that condition. Inclusion of this extra variable in
the model reduces the AIC value to 2996, which is a considerable improvement
on all previous values.

Figure 8.5 Boxplot of the residuals from the model MI . The two outliers
correspond to babies of mothers who had experienced uterine irritability.

8.3 Cross-validation

The principal purposes for fitting a model to a set of data are these:

1. To understand the relations between the variables forming the data set.

2. To summarise the data set in a simple fashion.

3. To predict future values.

The previous emphasis has been on understanding and summarising the
current data. We turn now to considering how useful our present summary
model will be for predicting and describing future observations.
The idea of cross-validation is simple. We divide the available data into
two parts: a training set and a validation set (also called a hold-out
set). The model is estimated (using only the data in the training set) and is
then applied to the validation set to hopefully discover that it fits well. The
process may be referred to as out-of-sample validation.
Suppose that the observed values of the response variable in the validation
set are y1, y2, . . . , yn, with the corresponding estimated values using the
training set being ŷ1, ŷ2, . . . , ŷn. The key quantity is the mean squared
error (MSE) given by

MSE = (1/n) Σ (yi − ŷi)², (8.4)

with the sum running from i = 1 to n.

As an alternative the root mean squared error (RM SE) may be quoted.
This is simply the square root of the mean squared error and has the advantage
of being in the same units as the original values. The M SE and RM SE are
effectively estimates of the error variance and standard deviation that will
accompany future estimates based on the selected model.
The aims of cross-validation are to estimate the error variance, and to
be assured that a reasonable choice of model has been made. With that
reassurance, the parameter estimates subsequently used are those based on
fitting the selected model to the entire data set. There are several similar
approaches.

8.3.1 k-fold cross-validation


In this approach the complete data set is divided into k subsets of approxi-
mately the same size. Each of these subsets takes on, in turn, the role of the
hold-out set, with the remainder in each case being used as the training set.
The statistic used in such a case is the average of the k M SE calculations
over the k choices for the hold-out set.
The choice of value for k is not critical, though the values 5 and 10 are often
recommended. However, it may be more useful to choose a smaller value, such

as 2 (when each half of the data set is used alternately as the training set and
the testing set), or 3 (when there will be a 50% overlap between any pair
of training sets). The advantage of these small values of k is that it will be
possible to examine the extent to which parameter estimates are affected by
the particular data used for training. If there is close agreement in the various
parameter estimates, then this will provide reassurance that the model will
be appropriate for use with future data.
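A minimal sketch of k-fold cross-validation for one of the baby weight models (the model formula is illustrative, and the RMSE values will depend on the random fold assignment):

    set.seed(1)
    k     <- 3
    n     <- nrow(birthwt)
    folds <- sample(rep(1:k, length.out = n))   # random assignment of rows to folds

    rmse <- numeric(k)
    for (j in 1:k) {
      train   <- birthwt[folds != j, ]
      test    <- birthwt[folds == j, ]
      fit     <- lm(bwt ~ lwt + smoke + factor(race), data = train)
      pred    <- predict(fit, newdata = test)
      rmse[j] <- sqrt(mean((test$bwt - pred)^2))
    }
    rmse          # one RMSE per hold-out fold
    mean(rmse)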

As ever, there will be a need for experimentation. For example,
if we are unsure as to which of rival models to use (because of
similar AIC values), then a study of their comparative MSE
(or RMSE) values for each test set may be useful.
If we find that there are a few observations that are persistently
poorly estimated, then this might suggest re-estimating the models
with those observations removed, or with the addition of a
relevant variable that had previously been omitted.

Example 8.3 (cont.)

Starting with a model that uses almost all the variables available, together
with some interactions, stepwise selection (Section 8.2.5) leads to a model, MS ,
that includes the combined ‘nonwhite’ dummy variable. The model includes
several other dummy variables. One of these, which indicates whether the
mother has experienced uterine irritability, is associated with the parameter
γu . Using the entire data set this model has the AIC value 2988. Replacing
the single ‘nonwhite’ dummy variable by the two dummies ‘black’ and ‘other’
(model MS2 ) results in a slightly larger AIC value (2990).
Using k = 3, we can examine whether MS has a lower RMSE or AIC
value than MS2 using each of the three test sets, and also whether there is
much variation in the parameter estimates from one trial set to another. A
selection of the results are given in Table 8.9.

Table 8.9 With k = 3, the AIC value and the value of the parameter γu
(see text) are shown for each test set in the 3-fold cross-validation. The
RMSE values are shown for each model.

Model First third Second third Third third


RM SE AIC γu RM SE AIC γu RM SE AIC γu
MS 628 2000 -704 664 1994 -487 682 1992 -521
MS2 630 2002 -709 672 1994 -485 695 1993 -518

Comparing the two models, we can see that the AIC value for model MS2
is never smaller than that for model MS, confirming that the latter
is the preferred model. Similarly the RM SE values (over the remaining two-
thirds of the data set, for each choice of a third) show that MS gives slightly
more accurate predictions.
However, when we compare the estimates across the three cases, there are
considerable variations that are exemplified by the estimates for γu using
model MS ranging from −487 to −704. The conclusion is that there is a good
deal of uncertainty concerning the precise importance of 'uterine irritability',
though there is no doubt (because of the lower AIC value found previously)
that this parameter is important.

8.3.2 Leave-one-out cross-validation (LOOCV)

In the previous section we advocated a small value of k as a means of exploring
the reliability of the chosen model. However, the principal purpose of cross-
validation is to estimate the uncertainty when the chosen model is applied to
future observations. For this purpose, with n observations, the choice k = n is
ideal: each training set consists of all the observations bar one (the test
observation). For large n, the model based on n − 1 of the observations will be
well nigh identical to the model that will be used in future (the model based on
all n observations). As a consequence, the difference between the predicted and
observed values for the omitted observation will be a very accurate
representation of a typical error.
Example 8.3 (cont.)

For the model MS , the LOOCV estimate of RM SE (averaged over all the
cases) is 657.

8.4 Reconciling bias and variability

Suppose that we have arrived at the model

ŷ = α̂ + β̂1 x1 + β̂2 x2 + · · · + β̂m xm,

where α̂, β̂1, etc. are all the least squares estimates (Section 7.4) of their
respective parameters.
Good news: A property of any least squares estimate is that it is unbiased:
if we repeat the procedure many times, then the long-term average of the
estimate will be precisely correct. That being the case, assuming the model
is correct, the long-term average value of yb will exactly equal the true value.

Bad news: Least squares estimates are only estimates. Each new set of
data taken from the same population will give a new estimate. Yes, on average
the results are correct, but, associated with every β̂-value, there is variability.
The more β-values that require estimation, the greater the uncertainty in the
estimate of y.
Using the least squares estimates gives an estimate that is, on average,
absolutely correct. However, because of the inherent uncertainty we may have
been unlucky and the estimate that we have got may be far from correct. It
might be better to use instead a procedure that leads to biased estimates that
are less variable.
A convenient measure is again the MSE (the mean squared error), which is
the average value of (y − ŷ)². This is equal to the variance of the estimate added
to the square of the bias. Figure 8.6, illustrates a case where the unbiased
procedure has variance 400, whereas the biased procedure has a variance of
100 and a bias of 5.
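For the case illustrated in Figure 8.6, the biased procedure has MSE = 100 + 5² = 125, whereas the unbiased procedure has MSE = 400 + 0² = 400; by this criterion the biased procedure is preferable.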

Figure 8.6 A less variable, but biased, procedure can have a smaller
mean squared error than an unbiased procedure.

8.5 Shrinkage

Shrinkage refers to any procedure that reduces the variability of the model
parameter estimates and hence the variability of the predictions.
The standard least squares procedure focuses on minimising the sum of
squared errors, D, given by:

D = Σ (yi − ŷi)², (8.5)

where yi is the ith of n observed values, ŷi is the corresponding estimated
value, and the sum runs from i = 1 to n. By contrast, shrinkage estimates are formed by minimising

D + g(β̂1, β̂2, . . . , β̂m), (8.6)

where g is some positive function. One choice for the function g is the sum of
squares of the coefficients, so that one is minimizing R given by:

R = D + λ Σ β̂i², (8.7)

where the sum is taken over the m coefficients.

This is known as ridge regression.
An alternative is to minimize L, given by:

L = D + λ Σ |β̂i|, (8.8)

where |x| means the absolute value of x. This is called the lasso (which
stands for ‘least absolute shrinkage and selection operator’). Using the lasso
often results in some of the βs being given the value zero, which means that
the corresponding explanatory variables are eliminated from the model. The
result is a model that is easier to interpret, and gives more precise (but possibly
slightly biased) estimates.
In either case, the parameter λ is known as a tuning constant. The
choice λ = 0 gives the usual least squares model, with increasing λ placing
ever more attention on the β-values. The best value for λ is chosen using
cross-validation.
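One widely used implementation (an assumption: the text does not name a package) is glmnet, in which alpha = 0 gives ridge regression, alpha = 1 gives the lasso, and cv.glmnet chooses λ by cross-validation. A sketch for the baby weight data:

    library(glmnet)

    X <- model.matrix(bwt ~ age + lwt + smoke + factor(race) + ptl + ht + ui + ftv,
                      data = birthwt)[, -1]    # matrix of explanatory variables
    y <- birthwt$bwt

    cv.ridge <- cv.glmnet(X, y, alpha = 0)     # ridge regression
    cv.lasso <- cv.glmnet(X, y, alpha = 1)     # lasso

    coef(cv.lasso, s = "lambda.min")           # some coefficients may be exactly zero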

8.5.1 Standardisation

Suppose that y is the wear of a car tire (measured in mm) and the single
explanatory variable is x, the distance travelled. If x is measured in thousands
of miles, then maybe β = 1. However, if x is measured in miles, then β =
0.001. The fitted values will be the same, whichever units are used for x, so
that the value of D in Equations (8.7) or (8.8) will be unaltered. The value
of g(β) will be greatly changed, however.
The solution is standardisation. Rather than working with x, we work with
the standardised value, x′, given by

x′ = (x − x̄)/s, (8.9)
where x̄ is the mean of the x-values and s is their standard deviation. This is
applied separately to each x-variable, and also to the y-values. The result is
that every standardised variable has mean 0 and variance 1.
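In R this standardisation is carried out by scale(); a quick check using the mothers' weights:

    x0 <- as.numeric(scale(birthwt$lwt))   # subtract the mean, divide by the standard deviation
    c(mean(x0), sd(x0))                    # approximately 0 and exactly 1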
Both ridge regression and the lasso work with standardised values. When
a group of explanatory variables are highly correlated with one another, ridge
regression tends to give each of them similar importance (i. e. roughly equal β
values) whereas the lasso will give zero weight to all bar one. Some shrinkage
routines include standardisation automatically.

Example 8.3 (cont.)

For the baby weight data, the model MS2 uses 8 explanatory variables. We
assess the advantage of shrinkage by using LOOCV. Results are summarised
in Table 8.10.

Table 8.10 The values of the root mean square error (RM SE) resulting
from using LOOCV. Also given is the number of non-zero parameters in
the model.
Model No. of non-zero parameters RMSE
Ordinary least squares (λ = 0) 8 656.7
Ridge regression 8 656.5
Lasso 6 655.1

We see that using ridge regression does slightly reduce the RM SE , but
not as effectively as the lasso. The latter also simplifies the model by setting
two parameters to zero.

8.6 Generalised linear models (GLMs)

In previous sections, the response variable Y was a continuous variable (such
as length or weight). Although we did not discuss it, the implicit assumption
in our modelling was that the underlying randomness was the same for all
y-values and could be described by a normal distribution.
However, linear models can also be used when the response variable has
many possible distributions. For example, when the response variable is cat-
egorical with two categories (e.g. Yes and No) the model used is a logistic
regression model. When a categorical response variable has many categories,
a slightly different approach may be used and the model is called a log-linear
model. These are briefly described in the next sections.

8.6.1 Logistic regression


We now consider the situation where the quantity of interest is p, the proba-
bility of a ‘success’. We are investigating whether p is dependent on the values
taken by one or more explanatory variables.
However, we cannot work with p directly. To see why, consider the following
situation in which p depends on a single explanatory variable, x. Samples have
been taken at five values of x, and the table gives the number of successes
within each sample:

Value of x 1 2 3 4 5
Sample size (n) 25 40 30 48 50
Number of successes 15 26 21 36 40
Success rate (p) 60% 65% 70% 75% 80%
The simple linear relation
p = 0.55 + 0.05x
provides a perfect explanation for the data. So, what is the problem? Well,
try putting x = 10 into this relation. The result is apparently p = 1.05. But
probability is limited to the range (0, 1).
Of course, extrapolation can always give foolish results, but we should be
using a model that would provide a feasible estimate of p for every value of x.
Rather than modelling the variation in p with its limited range of (0,1), we
need to model the variation in some function of p that takes the full (−∞, ∞)
range. The answer is the logit:
$$\ln\!\left(\frac{p}{1-p}\right).$$
The values 0, 0.5, and 1 for p correspond to the logit values −∞, 0, and ∞.
The relation between p and the logit is illustrated in Figure 8.7.

Figure 8.7 The relation between p and the logit is nearly linear for most
of the range of p.

Note that, for p in the range (0.1, 0.9), the logit is very nearly a straight
line. As a consequence, if the relation between p and some variable x appears
to be approximately linear in this range, then the same will be true for the
logit. Thus, instead of using p = α + βx, we use
$$\ln\!\left(\frac{p}{1-p}\right) = \alpha + \beta x, \qquad (8.10)$$

with x being the observed value of X. A model of this type is described as a
logistic regression model⁵.
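As a quick numerical illustration (our own sketch), the logit transform and its inverse can be computed directly; note how probabilities near 0 or 1 are mapped to large negative or positive values:

```python
# The logit transform and its inverse (the logistic function). Values of p
# near 0 or 1 map to large negative or positive logits, so the transformed
# scale covers the whole of (-inf, inf).
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(z):
    return np.exp(z) / (1 + np.exp(z))

for p in [0.1, 0.5, 0.6, 0.8, 0.9]:
    z = logit(p)
    print(f"p = {p:.1f}  logit = {z:+.3f}  back-transformed = {inv_logit(z):.1f}")
```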
Example 8.4

We again use the baby birthweight data. Babies weighing less than 2.5
kg were classified as 'low'. We now examine how the probability of a baby
being classified as low (scored as 1), rather than normal (scored as 0), depends
on other variables.
The principal explanatory variable is certainly the mother’s weight. In
addition we consider two other variables: the number of previous premature
labours (0 to 3), and whether or not the mother had a history of hypertension.
The three explanatory variables were chosen to illustrate the fact that, as for
multiple regression (Section 8.2), any type of explanatory variable can be used
(continuous, discrete, or categorical).
Including the number of previous premature labours implies that each pre-
mature labour has the same additional effect. An alternative would have been
to reduce the number of categories to two: none, or at least one.
The results for a variety of models are given in Table 8.11.

Table 8.11 Some models explaining variations in the logit of the
probability of a baby being described as having low weight.

Model                        α        γHypert   βPremature   βMumWt    AIC
M0                          -0.790                                     237
MMumWt                       0.998                           -0.0141   233
MHypert+MumWt                1.451     1.855                 -0.0187   227
MLabours+MumWt               0.641                0.722      -0.0125   229
MHypert+Labours+MumWt        1.093     1.856      0.726      -0.0171   224

Considering only these five models, the lowest AIC is given by the most
complex. To see how this model works, consider a mother of average weight
(130 lb). If she had had no premature labours and no history of hypertension,
then the logit would be equal to

1.093 − 0.0171 × 130 = −1.13

which corresponds to a probability of exp(−1.13)/(1 + exp(−1.13)) = 0.244.


By contrast, the logit for a mother of the same weight, suffering from
hypertension and having had two premature labours, would be

1.093 + 1.856 + 2 × 0.726 − 0.0171 × 130 = 2.18.

⁵ Any value for x can be converted into a value for p by using p = exp(α + βx)/{1 + exp(α + βx)}.

This corresponds to a probability of exp(2.18)/(1 + exp(2.18)) = 0.898, so the
model predicts that a mother of average weight who has had two premature
labours and suffers from hypertension has roughly a 90% chance of having a
baby of low weight.
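The arithmetic above is easily checked. The following sketch (ours, using the coefficients reported in Table 8.11) converts the two fitted logits into probabilities:

```python
# Checking the worked example: converting fitted logits into probabilities.
import math

def prob_from_logit(z):
    return math.exp(z) / (1 + math.exp(z))

alpha, g_hypert, b_labours, b_mumwt = 1.093, 1.856, 0.726, -0.0171

# Mother of average weight (130 lb), no hypertension, no premature labours
z1 = alpha + b_mumwt * 130
# Same weight, hypertension, and two premature labours
z2 = alpha + g_hypert + 2 * b_labours + b_mumwt * 130

print(round(z1, 2), round(prob_from_logit(z1), 3))   # about -1.13 and 0.244
print(round(z2, 2), round(prob_from_logit(z2), 3))   # about  2.18 and 0.898
```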

8.6.2 Loglinear models

When all the variables are categorical with several categories, the interest
in the data may simply be to understand the connections between variables,
rather than making predictions (an approach resembling correlation rather
than regression). Table 6.5 provided a hypothetical table in which the two
variables were exactly independent. In that table each cell count was given by
$$\text{Cell count} = \frac{\text{Row total} \times \text{Column total}}{\text{Grand total}}.$$
This multiplicative model can be transformed into an equivalent linear model
by taking logarithms:

log(Cell count) = log(Row total) + log(Column total) − log(Grand total).

This simple independence model can be extended to include the effects of
several variables, and also the interactions between two or more variables, as
the next example demonstrates.
Example 8.5

The following data are abstracted, with permission of the UK Data Service,
from the 2012 Forty-Two Year Follow-Up of the 1970 British Cohort Study
(ref. SN7473). Table 8.12 summarises the answers to questions concerning
belief in God and belief in an afterlife. To simplify the presentation, all replies
that expressed any doubts have been classed as ‘Maybe’.

Table 8.12 Belief in God and belief in an afterlife as expressed by UK
42-year-olds in 2012.

                          Male                           Female
             Afterlife   Maybe   No afterlife   Afterlife   Maybe   No afterlife
No God            27      412        747             66      377        248
Maybe            206     1992        266            567     2405        139
God              233       90         23            471      186         24

For simplicity, we will use the suffices G, A, and S for the variables ‘Belief
in God’, ‘Belief in an afterlife’, and ‘Gender’. In the model definitions we will
indicate independence between quantities using a + sign and an interaction
using a * sign. We assume that if, for example, the parameter βAG is in

the model, then βA and βG must also be in the model (this is, in fact, an
inevitable consequence of the model fitting procedure).
The simplest model of any interest is the model MA+G+S which states that
all three variables are mutually independent.

Table 8.13 Alternative models, their components, and their AIC values
when applied to the data of Table 8.12.

Model             α    βA   βG   βS   βAG   βAS   βGS   βAGS   AIC value
MA+G+S            X    X    X    X                                3979
MA∗G+S            X    X    X    X    X                            773
MA∗S+G            X    X    X    X          X                     3424
MA+G∗S            X    X    X    X                X               3693
MA∗G+A∗S          X    X    X    X    X     X                      218
MA∗G+G∗S          X    X    X    X    X           X                487
MA∗S+G∗S          X    X    X    X          X     X               3138
MA∗G+A∗S+G∗S      X    X    X    X    X     X     X                191
MA∗G∗S            X    X    X    X    X     X     X     X          166

A glance at the AIC values reveals that the models that include the AG
interaction should be greatly preferred to the other models: belief in God
tends to go hand-in-hand with belief in an afterlife. Comparing the AIC
values for the models MA∗G+A∗S and MA∗G+G∗S , it is apparent that there
is a stronger relation between a person's gender and their belief in an afterlife
than between their gender and their belief in God.
The model with the lowest AIC value is the model MA∗G∗S . This model
states that the association between any two variables is dependent on the third
variable. In simple terms, it is the model that says that this is complicated!

The example makes clear that, even with only three variables, there are
many models that might be considered. This is not simply a problem with
loglinear models; it is true for any modelling that involves many explanatory
variables. The solution is to use stepwise selection (Section 8.2.5) together
with a careful examination of the results. In the example above, one interaction
between variables was far more important than all the others; that is the fact
that the data scientist should report in bold type.
CHAPTER 9

LAST WORDS

The intention of this book has been to introduce the reader to methods for
data analysis without dwelling on the underlying mathematics. We have in-
troduced a great many methods that may help with the data scientist’s task of
forming testable explanations of the data. With any new data set it is always
difficult to know where to start. Here is some advice:

1. Go slowly!

2. Learn where the data have come from (sample, questionnaire, . . . ) and
how they were obtained (by a machine, by interview, by a person taking
measurements, . . . )

3. If feasible, consider taking some more data yourself.

4. Remember that most data sets are dirty: they are likely to have errors
(e.g. some measurements in metres, others in kilometres)

5. Plot the data for each variable (use a bar chart, histogram, box plot, . . . )
to look for that dirty data.

6. Perhaps calculate correlations between pairs of numerical variables, and
plot scatter diagrams in order to get a feel for the relationships between
the measured variables.

7. Use box plots to compare numerical values when one of the pair of
variables is categorical.
Remember that curiosities in the data may have an explanation that re-
quires none of the methods of the last few chapters.
Example 9.1

Here is a tale, possibly apocryphal, about some time-series data (measurements
of some quantity taken at regular intervals in time). The industry
concerned had hired a consultant to analyse the data because there was concern
that the early values were on average larger than the later values. The
consultant asked to be shown how the data were obtained. The answer was
that the daily values were taken by an observer noting the reading of a pointer
on a dial placed high on the wall. Further questioning revealed that there had
been a change in the staff member responsible for taking the reading: a short
person had been replaced by a tall person. The apparent change in the levels
recorded was simply due to the change in the angle at which the observer read
the dial.

Example 9.2

Earlier, in Example 7.2, we examined the relation between the eruption
lengths and the times between eruptions for the Old Faithful geyser in
Yellowstone National Park. The first few records of the eruption lengths are
apparently recorded in minutes, correct to three decimal places:

3.600 1.800 3.333 2.283 4.533 2.883.

What do you notice?

Always be suspicious of your data! Study it closely.

In this case it is evident that the data were not originally recorded in
thousandths of a minute! They were recorded in minutes and seconds.

It is often worth recreating the original data, since this may
highlight shortcomings in the recording procedure.

We might wonder whether there is evidence of unconscious 'round-number
bias' in the number of seconds recorded. In order to select the seconds
component we subtract from each data item the number of minutes and multiply
the remainder by 60:

36 48 19.98 16.98 31.98 52.98.

It is easy to see that the computer has introduced round-off errors: for exam-
ple, 19.98 should obviously be 20.
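The extraction of the seconds component is easily reproduced; the sketch below (ours) also rounds away the computer's round-off error:

```python
# Recreating the seconds component of the first few Old Faithful eruption
# lengths: subtract the whole minutes, multiply the remainder by 60, and
# round to undo the round-off error.
import math

durations = [3.600, 1.800, 3.333, 2.283, 4.533, 2.883]   # minutes

seconds = [round((d - math.floor(d)) * 60) for d in durations]
print(seconds)   # [36, 48, 20, 17, 32, 53]
```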

Figure 9.1 Bar chart illustrating the frequencies of the numbers of
seconds recorded for Old Faithful eruptions.

Since there were 272 eruption durations recorded, and (after rounding)
there are 60 possible values that can be recorded for the number of seconds,
we might expect each value to occur on about 272/60 ≈ 4.5 occasions.
Figure 9.1 presents a bar chart of the frequencies of the numbers of seconds
recorded. The white line indicates the average frequency.
There is definitely evidence of ‘round-number’ bias, since all of 0, 10, 20,
30, and 50 have larger than average frequencies. This is also true, to some
extent, for 5, 15, 25, 45, and 55.

Here is another example where the recorded data are not entirely accurate.
Example 9.3

For each of 25 quadrats (square survey regions) on Dartmoor, students¹
were asked to report the percentage of a quadrat occupied by various species
of plant. For Heath Bedstraw (Galium saxatile) their reported cover values
were:

0, 0, 10, 0, 1, 5, 35, 40, 5, 10, 0, 2, 10, 0, 20, 0, 5, 10, 25, 0, 0, 0, 50, 25, 0

There are only 10 different values being reported:

0, 1, 2, 5, 10, 20, 25, 35, 40, 50

This is all quite natural. If a student reported, after a visual inspection, that
there was 23% cover in a quadrat, then there would be hoots of laughter!

Although the inaccuracies uncovered in these last two examples are unlikely
to affect any conclusions, it is always wise to bear in mind any data
shortcomings . . . and noticing them shows that the data scientist has been
looking at the data!

When there is a large amount of data, it is a good idea to look
first at a more manageable small section, hoping that this is
representative.
Expect to need to try several different approaches before you
arrive at something that appears to make sense of the data.
You should always expect a data set to contain the unexpected:
missing information, implausible values, and the like.
If the data have been taken over a period of years, or from sev-
eral different regions, then you should look for inconsistencies
such as measurements in different units.
The key to good data analysis is to have no preconceptions and
to be prepared for a good deal of trial and error.
The data analyst often needs to try many diagrams before find-
ing the one that makes the most useful display of the data.

And finally just two things remain: to hope that you have found our tiny
tome helpful, and to wish you good fortune in your career as a data scientist.
Index

AIC, 97
BIC, 97
k-fold cross-validation, 108
t-distribution, 59
t-test, 59
2x2 table, 67

Addition rule, 24
Akaike information criterion, 97
Average, 10

Bayesian information criterion, 97
Bimodal, 9
Bootstrap, 48
Box-whisker diagram, 13
Boxplot, 13

Central limit theorem, 44
Chi-squared distribution, 54
Cluster sampling, 5
Coefficient of variation, 50
Collinearity, 99
Conditional probability, 26
Confidence interval, 42
    for regression line, 86
Contingency table, 67
Continuous data, 14
Correlation, 77
Cross-validation, 108
    k-fold, 108
    leave-one-out, 110
Cumulative distribution function, 38
Cumulative frequency polygon, 16

Data
    categorical, 2
    continuous, 2, 14
    discrete, 2
    multivariate, 11
    qualitative, 2
Deviance, 95
Deviation
    standard, 19
Distribution
    t-, 59
    chi-squared, 54
    exponential, 53
    extreme value, 54
    Gaussian, 42
    normal, 42
    Poisson, 51
    rectangular, 39
    standard normal, 43
    uniform, 38
    Weibull, 54
Dummy variable, 100

Estimate
    point, 41
Estimator, 41
Event, 22
    mutually exclusive, 24
Exponential distribution, 53
Extreme value distribution, 54

False discovery rate, 70
False negatives, 69
False positives, 69
Feature selection, 95
Function
    cumulative distribution, 38
    probability density, 38

Gaussian distribution, 42
Generalised linear model, 113
GLM, 113

Histogram, 14
Hold-out set, 108
Hypothesis
    null, 58

Independence, 66
Interaction, 105
Intercept, 82
Interquartile range, 12
Intersection, 23
Interval
    prediction, 86

Jitter, 92

Kolmogorov-Smirnov test, 62
KS test, 62

Least squares, 82
Leave-one-out cross-validation, 110
Likelihood, 42
Logistic regression, 113
Logistic regression model, 115
Logit, 114
Loglinear model, 116
LOOCV, 110

Maximum likelihood, 42
Maximum likelihood estimation, 42
Mean, 10
    trimmed, 11
    Winsorized, 11
Mean squared error, 108, 111
Median, 12
Mle, 42
Modal class, 10
Mode, 9
Model
    general linear, 113
    logistic regression, 115
    loglinear, 116
    saturated, 95
MSE, 108
Multimodal, 9
Multiple regression, 97
Multiplication rule, 25
Multivariate data, 11
Mutually exclusive events, 24

Negative predictive value, 70
Nominal variable, 2
Normal distribution, 42
Null hypothesis, 58

Observation, 3
Ockham's razor, 96
Odds, 68
Odds-ratio, 68
Ogive, 16
Ordinal variable, 2
Out-of-sample validation, 108
Outlier, 11, 89
Overfitting, 95
Oversampling, 7

Pearson residual, 72
Percentile, 13
Permutation test, 81
Plot
    q-q, 61
    quantile-quantile, 61
Poisson distribution, 51
Poisson process, 50
Positive predictive value, 70
Prediction interval, 86
Principle of parsimony, 96
Probability, 21
    conditional, 26
Probability density function, 38

Q-q plot, 61
Quantile, 13
Quantile-quantile plot, 61

Random variable, 3
Range, 12
    interquartile, 12
Rectangular distribution, 39
Regression
    logistic, 113
    multiple, 97
    ridge, 112
Regression line
    confidence interval, 86
Relative risk, 69
Resampling, 49
Residual, 83, 107
    Pearson, 72
    standardised, 72
Ridge regression, 112
Rounding error, 39

Sample
    simple random, 5
Sample space, 22
Sampling
    cluster, 5
    stratified, 6
    systematic, 6
Sampling frame, 5
Saturated model, 95
Scatter diagram, 76
Schwarz criterion, 97
Sensitivity, 69
Shrinkage, 111
Simple random sample, 5
Skewed
    negatively, 20
    positively, 20
Slope, 82
Specificity, 69
Standard deviation, 19
Standard normal distribution, 43
Standardisation, 112
Standardised residual, 72
Step diagram, 18
Stepwise selection, 105
Stratified sampling, 6
Systematic sampling, 6

Test
    t-, 59
    Kolmogorov-Smirnov, 62
    KS, 62
    permutation, 81
Theorem
    central limit, 44
    total probability, 29
Total probability theorem, 29
Training set, 108
Transformations, 87
Trimmed mean, 11
True negatives, 69
True positives, 69
Tuning constant, 112

Unbiased, 41
Uniform distribution, 38
Union, 23

Validation set, 108
Variable
    dependent, 83
    dummy, 100
    explanatory, 83
    nominal, 2
    ordinal, 2
    random, 3
    response, 83
Venn diagram, 23

Weibull distribution, 54
Winsorized mean, 11

Youden's index, 69
