
Undergraduate study in Economics,

Management, Finance and the Social Sciences

Statistics 2

J.S. Abdey

ST104B
2023

Undergraduate study in
Economics, Management,
Finance and the Social Sciences

This subject guide is for a 100 course offered as part of the University of London’s
undergraduate study in Economics, Management, Finance and the Social
Sciences. This is equivalent to Level 4 within the Framework for Higher Education
Qualifications in England, Wales and Northern Ireland (FHEQ).
For more information see: london.ac.uk
This guide was prepared for the University of London by:
James S. Abdey, BA (Hons), MSc, PGCertHE, PhD, Department of Statistics, London
School of Economics and Political Science.
This is one of a series of subject guides published by the University. We regret that
due to pressure of work the author is unable to enter into any correspondence
relating to, or arising from, the guide. If you have any comments on this subject
guide, please communicate these through the discussion forum on the virtual
learning environment.

University of London
Publications office
Stewart House
32 Russell Square
London WC1B 5DN
United Kingdom
london.ac.uk

Published by: University of London


© University of London 2023
The University of London asserts copyright over all material in this subject guide
except where otherwise indicated. All rights reserved. No part of this work may
be reproduced in any form, or by any means, without permission in writing from
the publisher. We make every effort to respect copyright. If you think we have
inadvertently used your copyright material, please let us know.

Contents

0 Preface 1
0.1 Route map to the subject guide . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 Introduction to the subject area . . . . . . . . . . . . . . . . . . . . . . . 1
0.3 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.4 Aims and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.5 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
0.6 Employability outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7 Overview of learning resources . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.1 The subject guide . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.7.2 Essential reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.7.3 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.4 Online study resources . . . . . . . . . . . . . . . . . . . . . . . . 5
0.7.5 The VLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
0.7.6 Making use of the Online Library . . . . . . . . . . . . . . . . . . 6
0.8 Examination advice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1 Probability theory 9
1.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Set theory: the basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Axiomatic definition of probability . . . . . . . . . . . . . . . . . . . . . 17
1.5.1 Basic properties of probability . . . . . . . . . . . . . . . . . . . . 18
1.6 Classical probability and counting rules . . . . . . . . . . . . . . . . . . . 21
1.6.1 Brute force: listing and counting . . . . . . . . . . . . . . . . . . . 23
1.6.2 Combinatorial counting methods . . . . . . . . . . . . . . . . . . 23
1.7 Conditional probability and Bayes’ theorem . . . . . . . . . . . . . . . . 28
1.7.1 Independence of multiple events . . . . . . . . . . . . . . . . . . . 29
1.7.2 Independent versus mutually exclusive events . . . . . . . . . . . 29
1.7.3 Conditional probability of independent events . . . . . . . . . . . 31


1.7.4 Chain rule of conditional probabilities . . . . . . . . . . . . . . . . 32


1.7.5 Total probability formula . . . . . . . . . . . . . . . . . . . . . . . 33
1.7.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.8 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.9 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.10 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 37
1.11 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 38

2 Discrete probability distributions 39


2.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Binomial distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Cumulative distribution functions . . . . . . . . . . . . . . . . . . . . . . 45
2.6.1 Cumulative distribution functions – another point of view . . . . 46
2.7 Poisson distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Poisson approximation to the binomial . . . . . . . . . . . . . . . . . . . 48
2.9 Expected value of a discrete random variable . . . . . . . . . . . . . . . . 48
2.9.1 New random variables . . . . . . . . . . . . . . . . . . . . . . . . 50
2.10 Variance of a discrete random variable . . . . . . . . . . . . . . . . . . . 50
2.10.1 Alternative expression for the variance . . . . . . . . . . . . . . . 52
2.10.2 Limits and special cases . . . . . . . . . . . . . . . . . . . . . . . 52
2.10.3 New random variables (again) . . . . . . . . . . . . . . . . . . . . 53
2.11 Distributions related to the binomial distribution . . . . . . . . . . . . . 53
2.11.1 Geometric distribution . . . . . . . . . . . . . . . . . . . . . . . . 54
2.11.2 Negative binomial distribution . . . . . . . . . . . . . . . . . . . . 54
2.12 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.13 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.14 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 55
2.15 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 56

3 Continuous probability distributions 57


3.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


3.3.1 A formal definition . . . . . . . . . . . . . . . . . . . . . . . . . . 58


3.4 Probability density function and cumulative distribution function . . . . 59
3.4.1 Attributes of a continuous random variable . . . . . . . . . . . . . 60
3.4.2 The cumulative distribution function (cdf) . . . . . . . . . . . . . 61
3.5 Continuous uniform distribution . . . . . . . . . . . . . . . . . . . . . . . 63
3.6 Exponential distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.7 Normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7.1 Relevance of the normal distribution . . . . . . . . . . . . . . . . 67
3.7.2 Consequences of the central limit theorem . . . . . . . . . . . . . 67
3.7.3 Characteristics of the normal distribution . . . . . . . . . . . . . . 68
3.7.4 Standard normal tables . . . . . . . . . . . . . . . . . . . . . . . . 68
3.7.5 The general normal distribution . . . . . . . . . . . . . . . . . . . 70
3.7.6 Linear functions of normal random variables . . . . . . . . . . . . 71
3.7.7 Transforming non-normal random variables . . . . . . . . . . . . . 71
3.8 Normal approximation to the binomial . . . . . . . . . . . . . . . . . . . 72
3.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 74
3.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 74

4 Multivariate random variables 77


4.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Joint probability functions . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 Marginal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.6.1 Properties of conditional distributions . . . . . . . . . . . . . . . . 82
4.6.2 Conditional mean and variance . . . . . . . . . . . . . . . . . . . 83
4.7 Covariance and correlation . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.7.1 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.7.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.8 Independent random variables . . . . . . . . . . . . . . . . . . . . . . . . 86
4.8.1 Joint distribution of independent random variables . . . . . . . . 87
4.9 Sums of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.9.1 Expected values and variances of sums of random variables . . . . 89


4.9.2 Distributions of sums of random variables . . . . . . . . . . . . . 90


4.10 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.11 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.12 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 92
4.13 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 93

5 Sampling distributions of statistics 97


5.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4.1 Joint distribution of a random sample . . . . . . . . . . . . . . . . 98
5.5 Statistics and their sampling distributions . . . . . . . . . . . . . . . . . 99
5.5.1 Sampling distribution of a statistic . . . . . . . . . . . . . . . . . 100
5.6 Sample mean from a normal population . . . . . . . . . . . . . . . . . . . 102
5.7 The central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8 Some common sampling distributions . . . . . . . . . . . . . . . . . . . . 108
5.8.1 The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.8.2 (Student’s) t distribution . . . . . . . . . . . . . . . . . . . . . . . 111
5.8.3 The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 114
5.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 115

6 Estimator properties 117


6.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4 Estimation criteria – bias, variance and mean squared error . . . . . . . . 118
6.5 Unbiased estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.6 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.7 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.8 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 123
6.9 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 123


7 Point estimation 125


7.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.4 Method of moments (MM) estimation . . . . . . . . . . . . . . . . . . . . 126
7.5 Least squares (LS) estimation . . . . . . . . . . . . . . . . . . . . . . . . 128
7.6 Maximum likelihood (ML) estimation . . . . . . . . . . . . . . . . . . . . 131
7.7 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.8 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.9 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 136
7.10 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 137

8 Analysis of variance (ANOVA) 139


8.1 Synopsis of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.2 Learning outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.3 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
8.4 Testing for equality of three population means . . . . . . . . . . . . . . . 139
8.5 One-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.6 From one-way to two-way ANOVA . . . . . . . . . . . . . . . . . . . . . 150
8.7 Two-way analysis of variance . . . . . . . . . . . . . . . . . . . . . . . . 150
8.8 Residuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
8.9 Overview of chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.10 Key terms and concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.11 Sample examination questions . . . . . . . . . . . . . . . . . . . . . . . . 157
8.12 Solutions to Sample examination questions . . . . . . . . . . . . . . . . . 157

A Probability theory 159


A.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
A.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

B Discrete probability distributions 169


B.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

C Continuous probability distributions 179


C.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
C.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186


D Multivariate random variables 189


D.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
D.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

E Sampling distributions of statistics 197


E.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
E.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

F Estimator properties 205


F.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
F.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

G Point estimation 209


G.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
G.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

H Analysis of variance (ANOVA) 215


H.1 Worked examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
H.2 Practice questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

I Solutions to Practice questions 223


I.1 Appendix A – Probability theory . . . . . . . . . . . . . . . . . . . . . . 223
I.2 Appendix B – Discrete probability distributions . . . . . . . . . . . . . . 224
I.3 Appendix C – Continuous probability distributions . . . . . . . . . . . . 226
I.4 Appendix D – Multivariate random variables . . . . . . . . . . . . . . . . 229
I.5 Appendix E – Sampling distributions of statistics . . . . . . . . . . . . . 230
I.6 Appendix F – Estimator properties . . . . . . . . . . . . . . . . . . . . . 232
I.7 Appendix G – Point estimation . . . . . . . . . . . . . . . . . . . . . . . 233
I.8 Appendix H – Analysis of variance (ANOVA) . . . . . . . . . . . . . . . 235

J Formula sheet in the examination 237

K Sample examination paper 239

L Sample examination paper – Solutions 245

Chapter 0
Preface

0.1 Route map to the subject guide


This subject guide provides you with a framework for covering the syllabus of the
ST104B Statistics 2 course and directs you to additional resources such as readings
and the virtual learning environment (VLE).
The following chapters will cover important aspects of mathematical statistics, upon
which many applications in EC2020 Elements of econometrics, as well as ST2133
Advanced statistics: distribution theory and ST2134 Advanced statistics:
statistical inference draw heavily (among other courses). The chapters are not a
series of self-contained topics; rather, they build on each other sequentially. As such, you
are strongly advised to follow the subject guide in chapter order. There is little point in
rushing past material which you have only partially understood in order to reach the
final chapter. Once you have completed your work on all of the chapters, you will be
ready for examination revision. A good place to start is the sample examination paper
which you will find at the end of the subject guide.
Colour has been included in places to emphasise important items. Formulae in the main
body of chapters are in blue – these exclude formulae used in examples. Key terms and
concepts when introduced are shown mainly in red, with a few in blue to avoid
repetition. References to other courses and half courses are shown in purple (such as
above). Terms in italics are shown in purple for emphasis. References to chapters,
sections, figures and tables are shown in teal.

0.2 Introduction to the subject area


Why study statistics?

By successfully completing this course, you will understand the ideas of randomness and
variability, and the way in which they link to probability theory. This will allow the use
of a systematic and logical collection of statistical techniques of great practical
importance in many applied areas. The examples in this subject guide will concentrate
on the social sciences, but the methods are important for the physical sciences too. This
subject aims to provide a grounding in probability theory, point estimation and analysis
of variance.
The material in ST104B Statistics 2 is necessary as preparation for other subjects
you may study later on in your degree. The full details of the ideas discussed in this
subject guide will not always be required in these other subjects, but you will need to
have a solid understanding of the main concepts. This can only be achieved by seeing
how the ideas emerge in detail.

How to study statistics

For statistics, you need some familiarity with abstract mathematical ideas, as well as
the ability and common sense to apply these to real-life problems. The concepts you will
encounter in probability and statistical inference are hard to absorb by just reading
about them in a book. You need to read, then think a little, then try some problems,
and then read and think some more. This procedure should be repeated until the
problems are easy to do; you should not spend a long time reading and forget about
solving problems.

0.3 Syllabus
The up-to-date course syllabus for ST104B Statistics 2 can be found in the course
information sheet, which is available on the course VLE (virtual learning environment)
page.

0.4 Aims and objectives


The aim of this half course is to develop students’ knowledge of elementary statistical
theory. The emphasis is on topics that are of importance in applications to
econometrics, finance and the social sciences. Concepts and methods that provide the
foundation for more specialised courses in statistics are introduced.

0.5 Learning outcomes


At the end of this half course, and having completed the Recommended reading and
activities, students should be able to:

compute probabilities of events, including for univariate and multivariate random
variables

apply and be competent users of standard statistical operators and be able to recall
a variety of well-known distributions

derive estimators of unknown parameters using method of moments, least squares
and maximum likelihood estimation techniques, and analyse the statistical
properties of estimators

be familiar with the fundamental concepts of statistical modelling, with an
emphasis on analysis of variance models.


0.6 Employability outcomes


Below are the three most relevant skill outcomes for students undertaking this course
which can be conveyed to future prospective employers:

1. complex problem-solving
2. decision making
3. communication.

0.7 Overview of learning resources

0.7.1 The subject guide


The subject guide is a self-contained resource, i.e. the content provided here is sufficient
to prepare for the examination. All examinable topics are discussed in detail with
numerous activities and practice problems. Studying extensively using the subject guide
is essential to perform well in the final examination. As such, there is no necessity to
purchase a textbook, although some students may wish to consult other resources to
read about the same topics through an alternative tutorial voice – please see the
suggested ‘Further reading’ below.
The subject guide provides a range of activities that will enable you to test your
understanding of the basic ideas and concepts. We want to encourage you to try the
exercises you encounter throughout the material before working through the solutions.
With statistics, the motto has to be ‘practise, practise, practise. . .’. It is the best way to
learn the material and prepare for examinations. The course is rigorous and demanding,
but the skills you will be developing will be rewarding and well recognised by future
employers.
A suggested approach for students studying ST104B Statistics 2 is to split the
material into 10 weeks as follows.

Week Chapter
1&2 Chapter 1: Probability theory
3 Chapter 2: Discrete probability distributions
4 Chapter 3: Continuous probability distributions
5 Chapter 4: Multivariate random variables
6 Chapter 5: Sampling distributions of statistics
7 Chapter 6: Estimator properties
8 Chapter 7: Point estimation
9 & 10 Chapter 8: Analysis of variance (ANOVA)

The following procedure is recommended:

1. Read the introductory comments.


2. Study the chapter content, worked examples and practice questions.


3. Go through the learning outcomes carefully.

4. Refer back to this subject guide, or to supplementary texts, to improve your


understanding until you are able to work through the problems confidently.

The last step is the most important. It is easy to think that you have understood the
material after reading it, but working through problems is the crucial test of
understanding. Problem-solving should take up most of your study time.
To prepare for the examination, you will only need to read the material in the subject
guide, but it may be helpful from time to time to look at the suggested ‘Further
reading’ below.

Basic notation

We often use the symbol □ to denote the end of a proof, where we have finished
explaining why a particular result is true. This is just to make it clear where the proof
ends and the following text begins.

Calculators

A calculator may be used when answering questions on the examination paper for
ST104B Statistics 2. It must comply in all respects with the specification given in the
Regulations. You should also refer to the admission notice you will receive when
entering the examination and the ‘Notice on permitted materials’.

Computers

If you are aiming to carry out serious statistical analysis (which is beyond the level of
this course) you will probably want to use some statistical software package, such as R.
It is not necessary for this course to have such software available, but if you do have
access to it you may benefit from using it in your study of the material. On a few
occasions in this subject guide R will be used for illustrative purposes only. You will not
be examined on R.

0.7.2 Essential reading

This subject guide is ‘self-contained’ meaning that this is the only resource which is
essential reading for ST104B Statistics 2. Throughout the subject guide there are
many worked examples, practice problems and sample examination questions replicating
resources typically provided in statistical textbooks. You may, however, feel you could
benefit from reading textbooks, and a suggested list of these is provided below.


Statistical tables

In the examination you will be provided with relevant extracts of:

Dougherty, C. Introduction to Econometrics. (Oxford: Oxford University Press,
2016) fifth edition [ISBN 9780199676828].

Lindley, D.V. and W.F. Scott New Cambridge Statistical Tables. (Cambridge:
Cambridge University Press, 1995) second edition [ISBN 9780521484855].

These relevant extracts can be found at the end of this subject guide, and are the same
as those distributed for use in the examination. It is advisable that you become familiar
with them, rather than those at the end of a textbook which may differ in presentation.

0.7.3 Further reading


As mentioned above, this subject guide is sufficient for study of ST104B Statistics 2.
Of course, you are free to read around the subject area in any text, paper or online
resource. You should support your learning by reading as widely as possible; this will help
you to think about how these principles apply in the real world. To help you read
extensively, you have free access to the virtual learning environment (VLE) and
University of London Online Library (see below).
Other useful texts for this course include:

Freedman, D., R. Pisani and R. Purves Statistics. (New York: W.W. Norton &
Company, 2007) fourth edition [ISBN 9780393930436].

Johnson, R.A. and G.K. Bhattacharyya Statistics: Principles and Methods. (New
York: John Wiley and Sons, 2010) sixth edition [ISBN 9780470505779].

Larsen, R.J. and M.J. Marx An Introduction to Mathematical Statistics and Its
Applications. (London: Pearson, 2017) sixth edition [ISBN 9780134114217].

Newbold, P., W.L. Carlson and B.M. Thorne Statistics for Business and
Economics. (London: Pearson, 2012) eighth edition [ISBN 9780273767060].

0.7.4 Online study resources


You can access the VLE, the Online Library and your University of London email
account via the Student Portal at: http://mylondon.ac.uk
You should have received your login details for the Student Portal with your official
offer, which was emailed to the address that you gave on your application form. You
have probably already logged into the Student Portal in order to register! As soon as
you registered, you will automatically have been granted access to the VLE, Online
Library and your fully functional University of London email account.
If you have forgotten these login details, please click on the ‘Forgot Password’ link on
the login page.


0.7.5 The VLE


The VLE, which complements this subject guide, has been designed to enhance your
learning experience, providing additional support and a sense of community. It forms an
important part of your study experience with the University of London and you should
access it regularly.
The VLE provides a range of resources for EMFSS courses:

Course materials: Subject guides and other course materials available for
download. In some courses, the content of the subject guide is transferred into the
VLE and additional resources and activities are integrated with the text.
Readings: Direct links, wherever possible, to essential readings in the Online
Library, including journal articles and ebooks.
Video content: Including introductions to courses and topics within courses,
interviews, lessons and debates.
Screencasts: Videos of PowerPoint presentations, animated podcasts and
on-screen worked examples.
External material: Links out to carefully selected third-party resources.
Self-test activities: Multiple-choice, numerical and algebraic quizzes to check
your understanding.
Collaborative activities: Work with fellow students to build a body of
knowledge.
Discussion forums: A space where you can share your thoughts and questions
with fellow students. Many forums will be supported by a ‘course moderator’, a
subject expert employed by LSE to facilitate the discussion and clarify difficult
topics.
Past examination papers: We provide up to three years of past examinations
alongside Examiners’ commentaries that provide guidance on how to approach the
questions.
Study skills: Expert advice on getting started with your studies, preparing for
examinations and developing your digital literacy skills.

Some of these resources are available for certain courses only, but we are expanding our
provision all the time and you should check the VLE regularly for updates.

0.7.6 Making use of the Online Library


The Online Library (http://onlinelibrary.london.ac.uk) contains a huge array of journal
articles and other resources to help you read widely and extensively.
To access the majority of resources via the Online Library you will either need to use
your University of London Student Portal login details, or you will be required to
register and use an Athens login.


The easiest way to locate relevant content and journal articles in the Online Library is
to use the Summon search engine.
If you are having trouble finding an article listed in a reading list, try removing any
punctuation from the title, such as single quotation marks, question marks and colons.
For further advice, please use the online help pages
(http://onlinelibrary.london.ac.uk/resources/summon) or contact the Online Library
team using the ‘Chat with us’ function.

0.8 Examination advice


Important: The information and advice given here are based on the examination
structure used at the time this subject guide was written. Please note that subject
guides may be used for several years. Because of this we strongly advise you to always
check both the current Programme regulations for relevant information about the
examination, and the VLE where you should be advised of any forthcoming changes.
You should also carefully check the rubric/instructions on the paper you actually sit
and follow those instructions.
The examination is by a two-hour unseen question paper. No books may be taken into
the examination, but the use of calculators is permitted, and statistical tables and a
formula sheet are provided (the formula sheet can be found at the end of the subject
guide).
The examination paper has a variety of questions, some quite short and others longer.
All questions must be answered correctly for full marks. You may use your calculator
whenever you feel it is appropriate, always remembering that the examiners can give
marks only for what appears on the examination script. Therefore, it is important to
always show your working.
In terms of the examination, as always, it is important to manage your time carefully
and not to dwell on one question for too long – move on and focus on solving the easier
questions, coming back to harder ones later.
Remember, it is important to check the VLE for:

up-to-date information on examination and assessment arrangements for this course

where available, past examination papers and Examiners’ commentaries for the
course which give advice on how each question might best be answered.

Chapter 1
Probability theory

1.1 Synopsis of chapter


Probability theory is very important for statistics because it provides the rules which
allow us to reason about uncertainty and randomness, which is the basis of statistics.
Independence and conditional probability are profound ideas, but they must be fully
understood in order to think clearly about any statistical investigation.

1.2 Learning outcomes


After completing this chapter, you should be able to:

explain the fundamental ideas of random experiments, sample spaces and events
list the axioms of probability and be able to derive all the common probability
rules from them
list the formulae for the number of combinations and permutations of k objects out
of n, and be able to routinely use such results in problems
explain conditional probability and the concept of independent events
prove the law of total probability and apply it to problems where there is a
partition of the sample space
prove Bayes’ theorem and apply it to find conditional probabilities.

1.3 Introduction
Consider the following hypothetical example. A country will soon hold a referendum
about whether it should leave the European Union (EU). An opinion poll of a random
sample of people in the country is carried out.
950 respondents say that they plan to vote in the referendum. They answer the question
‘Will you vote ‘Yes’ or ‘No’ to leaving the EU?’ as follows:

                  Answer
          Yes      No      Total
Count     513      437       950
%         54%      46%      100%


However, we are not interested in just this sample of 950 respondents, but in the
population which they represent, that is, all likely voters.
Statistical inference will allow us to say things like the following about the
population.

‘A 95% confidence interval for the population proportion, π, of ‘Yes’ voters is
(0.5083, 0.5717).’

‘The null hypothesis that π = 0.50, against the alternative hypothesis that
π > 0.50, is rejected at the 5% significance level.’

In short, the opinion poll gives statistically significant evidence that ‘Yes’ voters are in
the majority among likely voters. Such methods of statistical inference will be discussed
later in the course.
The inferential statements about the opinion poll rely on the following assumptions and
results.

Each response Xi is a realisation of a random variable from a Bernoulli
distribution with probability parameter π.

The responses X1 , X2 , . . . , Xn are independent of each other.

The sampling distribution of the sample mean (proportion) X̄ has expected
value π and variance π(1 − π)/n.

By use of the central limit theorem, the sampling distribution is approximately
a normal distribution.

In the next few chapters, we will learn about the terms in bold, among others.

The need for probability in statistics

In statistical inference, the data we have observed are regarded as a sample from a
broader population, selected with a random process.

Values in a sample are variable. If we collected a different sample we would not
observe exactly the same values again.

Values in a sample are also random. We cannot predict the precise values which
will be observed before we actually collect the sample.

Probability theory is the branch of mathematics which deals with randomness. So we
need to study this first.

A preview of probability

The first basic concepts in probability are the following.


Experiment: for example, rolling a single die and recording the outcome.

Outcome of the experiment: for example, rolling a 3.

Sample space S: the set of all possible outcomes, here {1, 2, 3, 4, 5, 6}.

Event: any subset A of the sample space, for example A = {4, 5, 6}.

Probability of an event A, P (A), will be defined as a function which assigns
probabilities (real numbers) to events (sets). This uses the language and concepts of set
theory. So we need to study the basics of set theory first.

1.4 Set theory: the basics


A set is a collection of elements (also known as ‘members’ of the set).

Example 1.1 The following are all examples of sets.

A = {Amy, Bob, Sam}.

B = {1, 2, 3, 4, 5}.

C = {x | x is a prime number} = {2, 3, 5, 7, 11, . . .}.

D = {x | x ≥ 0} (that is, the set of all non-negative real numbers).

Membership of sets and the empty set

x ∈ A means that object x is an element of set A.

x ∉ A means that object x is not an element of set A.

The empty set, denoted ∅, is the set with no elements, i.e. x ∉ ∅ is true for every
object x, and x ∈ ∅ is not true for any object x.

Example 1.2 If A = {1, 2, 3, 4, 5}, then:

1 ∈ A and 2 ∈ A

6 ∉ A and 1.5 ∉ A.

The familiar Venn diagrams help to visualise statements about sets. However, Venn
diagrams are not formal proofs of results in set theory.

Example 1.3 In Figure 1.1, the darkest area in the middle is A ∩ B, the total
shaded area is A ∪ B, and the white area is (A ∪ B)c = Ac ∩ B c .


Figure 1.1: Venn diagram depicting A ∪ B (the total shaded area).

Subsets and equality of sets

A ⊂ B means that set A is a subset of set B, defined as:

A ⊂ B when x ∈ A ⇒ x ∈ B.

Hence A is a subset of B if every element of A is also an element of B. An example
is shown in Figure 1.2.

Figure 1.2: Venn diagram depicting a subset, where A ⊂ B.

Example 1.4 An example of the distinction between subsets and non-subsets is:

{1, 2, 3} ⊂ {1, 2, 3, 4}, because all elements appear in the larger set

{1, 2, 5} ⊄ {1, 2, 3, 4}, because the element 5 does not appear in the larger set.

Two sets A and B are equal (A = B) if they have exactly the same elements. This
implies that A ⊂ B and B ⊂ A.

Unions of sets (‘or’)

The union, denoted ∪, of two sets is:

A ∪ B = {x | x ∈ A or x ∈ B}.

That is, the set of those elements which belong to A or B (or both). An example is
shown in Figure 1.3.


Figure 1.3: Venn diagram depicting the union of two sets.

Example 1.5 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∪ B = {1, 2, 3, 4}

A ∪ C = {1, 2, 3, 4, 5, 6}

B ∪ C = {2, 3, 4, 5, 6}.

Intersections of sets (‘and’)

The intersection, denoted ∩, of two sets is:

A ∩ B = {x | x ∈ A and x ∈ B}.

That is, the set of those elements which belong to both A and B. An example is
shown in Figure 1.4.

Figure 1.4: Venn diagram depicting the intersection of two sets.

Example 1.6 If A = {1, 2, 3, 4}, B = {2, 3} and C = {4, 5, 6}, then:

A ∩ B = {2, 3}

A ∩ C = {4}

B ∩ C = ∅.
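
If you have R available, the base functions union and intersect reproduce the results of
Examples 1.5 and 1.6 directly. A small illustrative sketch follows (R is used in this guide
for illustration only and is not examinable):

# Sets from Examples 1.5 and 1.6
A <- c(1, 2, 3, 4)
B <- c(2, 3)
C <- c(4, 5, 6)

union(A, C)        # 1 2 3 4 5 6
intersect(A, B)    # 2 3
intersect(B, C)    # numeric(0), i.e. the empty set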


Unions and intersections of many sets

Both set operators can also be applied to more than two sets, such as A ∩ B ∩ C.
Concise notation for the unions and intersections of sets A1 , A2 , . . . , An is:

⋃_{i=1}^{n} Ai = A1 ∪ A2 ∪ · · · ∪ An

and:

⋂_{i=1}^{n} Ai = A1 ∩ A2 ∩ · · · ∩ An .

These can also be used for an infinite number of sets, i.e. when n is replaced by ∞.

Complement (‘not’)

Suppose S is the set of all possible elements which are under consideration. In
probability, S will be referred to as the sample space.
It follows that A ⊂ S for every set A we may consider. The complement of A with
respect to S is:
Ac = {x | x ∈ S and x ∉ A}.
That is, the set of those elements of S that are not in A. An example is shown in
Figure 1.5.

Figure 1.5: Venn diagram depicting the complement of a set.

We now consider some useful properties of set operators. In proofs and derivations
about sets, you can use the following results without proof.


Properties of set operators

Commutativity:

A ∩ B = B ∩ A and A ∪ B = B ∪ A.

Associativity:

A ∩ (B ∩ C) = (A ∩ B) ∩ C and A ∪ (B ∪ C) = (A ∪ B) ∪ C.

Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

and:
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).

De Morgan’s laws:

(A ∩ B)c = Ac ∪ B c and (A ∪ B)c = Ac ∩ B c .

Further properties of set operators

If S is the sample space and A and B are any sets in S, you can also use the following
results without proof:

∅c = S.

∅ ⊂ A, A ⊂ A and A ⊂ S.

A ∩ A = A and A ∪ A = A.

A ∩ Ac = ∅ and A ∪ Ac = S.

If B ⊂ A, A ∩ B = B and A ∪ B = A.

A ∩ ∅ = ∅ and A ∪ ∅ = A.

A ∩ S = A and A ∪ S = S.

∅ ∩ ∅ = ∅ and ∅ ∪ ∅ = ∅.


Mutually exclusive events

Two sets A and B are disjoint or mutually exclusive if:

A ∩ B = ∅.

Sets A1 , A2 , . . . , An are pairwise disjoint if all pairs of sets from them are disjoint,
i.e. Ai ∩ Aj = ∅ for all i ≠ j.

Partition

The sets A1 , A2 , . . . , An form a partition of the set A if they are pairwise disjoint
and if ⋃_{i=1}^{n} Ai = A, that is, A1 , A2 , . . . , An are collectively exhaustive of A.
Therefore, a partition divides the entire set A into non-overlapping pieces Ai , as
shown in Figure 1.6 for n = 3. Similarly, an infinite collection of sets A1 , A2 , . . . form
a partition of A if they are pairwise disjoint and ⋃_{i=1}^{∞} Ai = A.


Figure 1.6: The partition of the set A into A1 , A2 and A3 .

Example 1.7 Suppose that A ⊂ B. Show that A and B ∩ Ac form a partition of B.

We have:
A ∩ (B ∩ Ac ) = (A ∩ Ac ) ∩ B = ∅ ∩ B = ∅
and:
A ∪ (B ∩ Ac ) = (A ∪ B) ∩ (A ∪ Ac ) = B ∩ S = B.
Hence A and B ∩ Ac are mutually exclusive and collectively exhaustive of B, and so
they form a partition of B.


1.5 Axiomatic definition of probability


First, we consider four basic concepts in probability.
An experiment is a process which produces outcomes and which can have several
different outcomes. The sample space S is the set of all possible outcomes of the
experiment. An event is any subset A of the sample space such that A ⊂ S.

Example 1.8 If the experiment is ‘select a trading day at random and record the
% change in the FTSE 100 index from the previous trading day’, then the outcome
is the % change in the FTSE 100 index.
S = [−100, +∞) for the % change in the FTSE 100 index (in principle).
An event of interest might be A = {x | x > 0} – the event that the daily change is
positive, i.e. the FTSE 100 index gains value from the previous trading day.

The sample space and events are represented as sets. For two events A and B, set
operations are then interpreted as follows.

A ∩ B: both A and B happen.


A ∪ B: either A or B happens (or both happen).
Ac : A does not happen, i.e. something other than A happens.

Once we introduce probabilities of events, we can also say that:

the sample space, S, is a certain event


the empty set, ∅, is an impossible event.

Axioms of probability

‘Probability’ is formally defined as a function P (·) from subsets (events) of the sample
space S onto real numbers.1 Such a function is a probability function if it satisfies
the following axioms (‘self-evident truths’).

Axiom 1: P (A) ≥ 0 for all events A.

Axiom 2: P (S) = 1.

Axiom 3: If events A1 , A2 , . . . are pairwise disjoint (i.e. Ai ∩ Aj = ∅ for all
i ≠ j), then:

P (⋃_{i=1}^{∞} Ai) = ∑_{i=1}^{∞} P (Ai).

The axioms require that a probability function must always satisfy these requirements.
1
The precise definition also requires a careful statement of which subsets of S are allowed as events,
which we can skip on this course.


Axiom 1 requires that probabilities are always non-negative.

Axiom 2 requires that the outcome is some element from the sample space with
certainty (that is, with probability 1). In other words, the experiment must have
some outcome.

Axiom 3 states that if events A1 , A2 , . . . are mutually exclusive, the probability of
their union is simply the sum of their individual probabilities.
All other properties of the probability function can be derived from the axioms. We
begin by showing that a result like Axiom 3 also holds for finite collections of mutually
exclusive sets.

1.5.1 Basic properties of probability

Probability property

For the empty set, ∅, we have:


P (∅) = 0. (1.1)

Probability property (finite additivity)

If A1 , A2 , . . . , An are pairwise disjoint, then:


P (⋃_{i=1}^{n} Ai) = ∑_{i=1}^{n} P (Ai).

In pictures, the previous result means that in a situation like the one shown in Figure
1.7, the probability of the combined event A = A1 ∪ A2 ∪ A3 is simply the sum of the
probabilities of the individual events:

P (A) = P (A1 ) + P (A2 ) + P (A3 ).

That is, we can simply sum probabilities of mutually exclusive sets. This is very useful
for deriving further results.

Probability property

For any event A, we have:


P (Ac ) = 1 − P (A).

Proof: We have that A ∪ Ac = S and A ∩ Ac = ∅. Therefore:

1 = P (S) = P (A ∪ Ac ) = P (A) + P (Ac )

using the previous result, with n = 2, A1 = A and A2 = Ac .






Figure 1.7: Venn diagram depicting three mutually exclusive sets, A1 , A2 and A3 . Note
although A2 and A3 have touching boundaries, there is no actual intersection and hence
they are (pairwise) mutually exclusive.

Probability property

For any event A, we have:


P (A) ≤ 1.

Probability property

For any two events A and B, if A ⊂ B, then P (A) ≤ P (B).

Proof: We proved in Example 1.7 that we can partition B as B = A ∪ (B ∩ Ac ) where
the two sets in the union are disjoint. Therefore:
P (B) = P (A ∪ (B ∩ Ac )) = P (A) + P (B ∩ Ac ) ≥ P (A)
since P (B ∩ Ac ) ≥ 0.


Probability property

For any two events A and B, we have:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Proof: Using partitions:


P (A ∪ B) = P (A ∩ B c ) + P (A ∩ B) + P (Ac ∩ B)

P (A) = P (A ∩ B c ) + P (A ∩ B)

P (B) = P (Ac ∩ B) + P (A ∩ B)
and hence:
P (A ∪ B) = (P (A) − P (A ∩ B)) + P (A ∩ B) + (P (B) − P (A ∩ B))
= P (A) + P (B) − P (A ∩ B).



In summary, the probability function has the following properties.

P (S) = 1 and P (∅) = 0.

0 ≤ P (A) ≤ 1 for all events A.

If A ⊂ B, then P (A) ≤ P (B).

These show that the probability function has the kinds of values we expect of something
called a ‘probability’.

P (Ac ) = 1 − P (A).

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

These are useful for deriving probabilities of new events.

Example 1.9 Suppose that, on an average weekday, of all adults in a country:

86% spend at least 1 hour watching television (event A, with P (A) = 0.86)

19% spend at least 1 hour reading newspapers (event B, with P (B) = 0.19)

15% spend at least 1 hour watching television and at least 1 hour reading
newspapers (P (A ∩ B) = 0.15).

We select a member of the population for an interview at random. For example, we
then have:

P (Ac ) = 1 − P (A) = 1 − 0.86 = 0.14, which is the probability that the
respondent watches less than 1 hour of television

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = 0.86 + 0.19 − 0.15 = 0.90, which is the
probability that the respondent spends at least 1 hour watching television or
reading newspapers (or both).
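
The arithmetic in Example 1.9 can, of course, be mirrored in R (an illustrative sketch only):

# Example 1.9: television (A) and newspapers (B)
p_A <- 0.86; p_B <- 0.19; p_AB <- 0.15
1 - p_A              # P(A^c) = 0.14
p_A + p_B - p_AB     # P(A ∪ B) = 0.90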

What does ‘probability’ mean?

Probability theory tells us how to work with the probability function and derive
‘probabilities of events’ from it. However, it does not tell us what ‘probability’ really
means.
There are several alternative interpretations of the real-world meaning of ‘probability’
in this sense. One of them is outlined below. The mathematical theory of probability
and calculations on probabilities are the same whichever interpretation we assign to
‘probability’. So, in this course, we do not need to discuss the matter further.


Frequency interpretation of probability

This states that the probability of an outcome A of an experiment is the proportion
(relative frequency) of trials in which A would be the outcome if the experiment was
repeated a very large number of times under similar conditions.

Example 1.10 How should we interpret the following, as statements about the real
world of coins and babies?

‘The probability that a tossed coin comes up heads is 0.5.’ If we tossed a coin a
large number of times, and the proportion of heads out of those tosses was 0.5,
the ‘probability of heads’ could be said to be 0.5, for that coin.

‘The probability is 0.51 that a child born in the UK today is a boy.’ If the
proportion of boys among a large number of live births was 0.51, the
‘probability of a boy’ could be said to be 0.51.

How to find probabilities

A key question is how to determine appropriate numerical values of P (A) for the
probabilities of particular events.
This is usually done empirically, by observing actual realisations of the experiment and
using them to estimate probabilities. In the simplest cases, this basically applies the
frequency definition to observed data.

Example 1.11 Consider the following.

If I toss a coin 10,000 times, and 5,023 of the tosses come up heads, it seems
that, approximately, P (heads) = 0.5, for that coin.

Of the 7,098,667 live births in England and Wales in the period 1999–2009,
51.26% were boys. So we could assign the value of about 0.51 to the probability
of a boy in this population.

The estimation of probabilities of events from observed data is an important part of
statistics.
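
As a simple illustration of the frequency interpretation, the coin-tossing experiment in
Example 1.11 can be simulated in R (an illustrative sketch only; the seed value is arbitrary):

# Estimate P(heads) by the relative frequency in 10,000 simulated tosses of a fair coin
set.seed(104)                                     # arbitrary seed, for reproducibility
tosses <- sample(c("H", "T"), size = 10000, replace = TRUE)
mean(tosses == "H")                               # proportion of heads, close to 0.5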

1.6 Classical probability and counting rules


Classical probability is a simple special case where values of probabilities can be
found by just counting outcomes. This requires that:

the sample space contains only a finite number of outcomes

all of the outcomes are equally likely.


Standard illustrations of classical probability are devices used in games of chance, such
as:

tossing a coin (heads or tails) one or more times

rolling one or more dice (each scored 1, 2, 3, 4, 5 or 6)

drawing one or more playing cards from a deck of 52 cards.

We will use these often, not because they are particularly important but because they
provide simple examples for illustrating various results in probability.
Suppose that the sample space, S, contains m equally likely outcomes, and that event A
consists of k ≤ m of these outcomes. Therefore:

P (A) = k/m = (number of outcomes in A) / (total number of outcomes in the sample space, S).

That is, the probability of A is the proportion of outcomes which belong to A out of all
possible outcomes.
In the classical case, the probability of any event can be determined by counting the
number of outcomes which belong to the event, and the total number of possible
outcomes.

Example 1.12 Rolling two dice, what is the probability that the sum of the two
scores is 5?

The sample space is the 36 ordered pairs:

S = {(1, 1), (1, 2), (1, 3), (1, 4) , (1, 5), (1, 6),
(2, 1), (2, 2), (2, 3) , (2, 4), (2, 5), (2, 6),
(3, 1), (3, 2) , (3, 3), (3, 4), (3, 5), (3, 6),
(4, 1) , (4, 2), (4, 3), (4, 4), (4, 5), (4, 6),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}.

The event of interest is A = {(1, 4), (2, 3), (3, 2), (4, 1)}.

The probability is P (A) = 4/36 = 1/9.
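
In the classical case, the counting can also be done by brute-force enumeration. A short R
sketch of Example 1.12 (illustrative only, not examinable):

# All 36 equally likely outcomes when rolling two dice
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)
A <- outcomes$die1 + outcomes$die2 == 5    # event A: the sum of the two scores is 5
sum(A) / nrow(outcomes)                    # 4/36 = 1/9 ≈ 0.111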

Now that we have a way of obtaining probabilities for events in the classical case, we
can use it together with the rules of probability.
The formula P (A) = 1 − P (Ac ) is convenient when we want P (A) but the probability of
the complementary event Ac , i.e. P (Ac ), is easier to find.


Example 1.13 When rolling two fair dice, what is the probability that the sum of
the dice is greater than 3?

The complement is that the sum is at most 3, i.e. the complementary event is
Ac = {(1, 1), (1, 2), (2, 1)}.

Therefore, P (A) = 1 − 3/36 = 33/36 = 11/12.

The formula:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

says that the probability that A or B happens (or both happen) is the sum of the
probabilities of A and B, minus the probability that both A and B happen.

Example 1.14 When rolling two fair dice, what is the probability that the two
scores are equal (event A) or that the total score is greater than 10 (event B)?

P (A) = 6/36, P (B) = 3/36 and P (A ∩ B) = 1/36.

So P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = (6 + 3 − 1)/36 = 8/36 = 2/9.
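
The same enumeration idea can be used to check the addition rule in Example 1.14; a short
R sketch (illustrative only):

# Two fair dice: check P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)
A <- outcomes$die1 == outcomes$die2         # the two scores are equal
B <- outcomes$die1 + outcomes$die2 > 10     # the total score is greater than 10
mean(A | B)                                 # 8/36 = 2/9 ≈ 0.222
mean(A) + mean(B) - mean(A & B)             # the same value, via the addition rule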

How to count the outcomes

In general, it is useful to know about three ways of counting.

Listing and counting all outcomes.

Combinatorial methods: choosing k objects out of n objects.

Combining different methods: rules of sum and product.

1.6.1 Brute force: listing and counting


In small problems, just listing all possibilities is often quickest.

Example 1.15 Consider a group of four people, where each pair of people is either
connected (= friends) or not. How many different patterns of connections are there
(ignoring the identities of who is friends with whom)?
The answer is 11. See the patterns in Figure 1.8.

1.6.2 Combinatorial counting methods


A powerful set of counting methods answers the following question: how many ways are
there to select k objects out of n distinct objects?


Figure 1.8: Friendship patterns in a four-person network.

The answer will depend on:

whether the selection is with replacement (an object can be selected more than
once) or without replacement (an object can be selected only once)
whether the selected set is treated as ordered or unordered.

Ordered sets, with replacement

Suppose that the selection of k objects out of n needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.

with replacement, so that each of the n objects may appear several times in the
selection.

Therefore:

n objects are available for selection for the 1st object in the sequence
n objects are available for selection for the 2nd object in the sequence
. . . and so on, until n objects are available for selection for the kth object in the
sequence.

Therefore, the number of possible ordered sequences of k objects selected with
replacement from n objects is:

n × n × · · · × n = nᵏ   (k factors of n).


Ordered sets, without replacement

Suppose that the selection of k objects out of n now needs to be:

ordered, so that the selection is an ordered sequence where we distinguish between
the 1st object, 2nd, 3rd etc.

without replacement, so that if an object is selected once, it cannot be selected
again.

Now:

n objects are available for selection for the 1st object in the sequence

n − 1 objects are available for selection for the 2nd object

n − 2 objects are available for selection for the 3rd object

. . . and so on, until n − k + 1 objects are available for selection for the kth object.

Therefore, the number of possible ordered sequences of k objects selected without
replacement from n objects is:

n × (n − 1) × · · · × (n − k + 1). (1.2)

An important special case is when k = n.

Factorials

The number of ordered sets of n objects, selected without replacement from n objects,
is:
n! = n × (n − 1) × · · · × 2 × 1.
The number n! (read ‘n factorial’) is the total number of different ways in which
n objects can be arranged in an ordered sequence. This is known as the number of
permutations of n objects.
We also define 0! = 1.

Using factorials, (1.2) can be written as:

n × (n − 1) × · · · × (n − k + 1) = n!/(n − k)!.

Unordered sets, without replacement

Suppose now that the identities of the objects in the selection matter, but the order
does not.


For example, the sequences (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1) are
now all treated as the same, because they all contain the elements 1, 2 and 3.

The number of such unordered subsets (combinations) of k out of n objects is
determined as follows.

The number of ordered sequences is n!/(n − k)!.


Among these, every different combination of k distinct elements appears k! times,
in different orders.
Ignoring the ordering, there are:

(n choose k) = n!/(k! (n − k)!)

different combinations, for each k = 0, 1, 2, . . . , n.

The number (n choose k) is known as the binomial coefficient. Note that because 0! = 1,
(n choose 0) = (n choose n) = 1, so there is only 1 way of selecting 0 or n out of n objects.

Summary of the combinatorial counting rules

The number of ways of selecting k outcomes from n distinct possible outcomes can be
summarised as follows:

                 With replacement        Without replacement
  Ordered        n^k                     n!/(n − k)!
  Unordered      \binom{n+k−1}{k}        \binom{n}{k} = n!/(k! (n−k)!)

We have not discussed the unordered, with replacement case which is non-examinable.
It is provided here only for completeness.
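These rules are easy to check numerically. The following is a minimal Python sketch (an illustration only, assuming Python 3.8+ so that math.comb and math.perm are available), evaluated for the illustrative values n = 6 and k = 3.

from math import comb, perm

n, k = 6, 3   # illustrative values only

ordered_with_repl = n ** k                  # n^k = 216
ordered_without_repl = perm(n, k)           # n!/(n-k)! = 120
unordered_without_repl = comb(n, k)         # binomial coefficient = 20
unordered_with_repl = comb(n + k - 1, k)    # non-examinable case = 56

print(ordered_with_repl, ordered_without_repl,
      unordered_without_repl, unordered_with_repl)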

Example 1.16 Suppose we have k = 3 people (Amy, Bob and Sam). How many
different sets of birthdays can they have (day and month, ignoring the year, and
pretending 29 February does not exist, so that n = 365) in the following cases?

1. It makes a difference who has which birthday (ordered), i.e. Amy (1 January),
Bob (5 May) and Sam (5 December) is different from Amy (5 May), Bob (5
December) and Sam (1 January), and different people can have the same
birthday (with replacement). The number of different sets of birthdays is:

(365)^3 = 48,627,125.

2. It makes a difference who has which birthday (ordered), and different people
must have different birthdays (without replacement). The number of different
sets of birthdays is:
365! / (365 − 3)! = 365 × 364 × 363 = 48,228,180.


3. Only the dates matter, but not who has which one (unordered), i.e. Amy (1
January), Bob (5 May) and Sam (5 December) is treated as the same as Amy (5
May), Bob (5 December) and Sam (1 January), and different people must have
different birthdays (without replacement). The number of different sets of
birthdays is:

\binom{365}{3} = 365! / (3! (365 − 3)!) = (365 × 364 × 363) / (3 × 2 × 1) = 8,038,030.

Example 1.17 Consider a room with r people in it. What is the probability that
at least two of them have the same birthday (call this event A)? In particular, what
is the smallest r for which P (A) > 1/2?
Assume that all days are equally likely.
Label the people 1 to r, so that we can treat them as an ordered list and talk about
person 1, person 2 etc. We want to know how many ways there are to assign
birthdays to this list of people. We note the following.

1. The number of all possible sequences of birthdays, allowing repeats (i.e. with
replacement), is (365)^r.

2. The number of sequences where all birthdays are different (i.e. without
replacement) is 365!/(365 − r)!.

Here ‘1.’ is the size of the sample space, and ‘2.’ is the number of outcomes which
satisfy Ac , the complement of the case in which we are interested.
Therefore:

P(A^c) = (365!/(365 − r)!) / (365)^r = (365 × 364 × · · · × (365 − r + 1)) / (365)^r

and:

P(A) = 1 − P(A^c) = 1 − (365 × 364 × · · · × (365 − r + 1)) / (365)^r.

Probabilities P(A) of at least two people sharing a birthday, for different values
of the number of people r, are given in the following table:

r P (A) r P (A) r P (A) r P (A)


2 0.003 12 0.167 22 0.476 32 0.753
3 0.008 13 0.194 23 0.507 33 0.775
4 0.016 14 0.223 24 0.538 34 0.795
5 0.027 15 0.253 25 0.569 35 0.814
6 0.040 16 0.284 26 0.598 36 0.832
7 0.056 17 0.315 27 0.627 37 0.849
8 0.074 18 0.347 28 0.654 38 0.864
9 0.095 19 0.379 29 0.681 39 0.878
10 0.117 20 0.411 30 0.706 40 0.891
11 0.141 21 0.444 31 0.730 41 0.903
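The figures in this table can be reproduced in a few lines of code. The following Python sketch (illustrative only; the function name is chosen here for convenience) computes P(A) via the complement and searches for the smallest r with P(A) > 1/2, which turns out to be r = 23.

def p_shared_birthday(r, days=365):
    """P(at least two of r people share a birthday), assuming all days are equally likely."""
    p_all_distinct = 1.0
    for i in range(r):
        p_all_distinct *= (days - i) / days   # each new person avoids the i birthdays already used
    return 1 - p_all_distinct

r = 1
while p_shared_birthday(r) <= 0.5:
    r += 1
print(r, round(p_shared_birthday(r), 3))   # 23 0.507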


1.7 Conditional probability and Bayes’ theorem


Next we introduce some of the most important concepts in probability:

independence

conditional probability

Bayes’ theorem.

These give us powerful tools for:

deriving probabilities of combinations of events

updating probabilities of events, after we learn that some other event has happened.

Independence

Two events A and B are (statistically) independent if:

P (A ∩ B) = P (A) P (B).

Independence is sometimes denoted A ⊥⊥ B. Intuitively, independence means that:

if A happens, this does not affect the probability of B happening (and vice versa)

if you are told that A has happened, this does not give you any new information
about the value of P (B) (and vice versa).

For example, independence is often a reasonable assumption when A and B
correspond to physically separate experiments.

Example 1.18 Suppose we roll two dice. We assume that all combinations of their
values are equally likely. Define the events:

A = ‘Score of die 1 is not 6’

B = ‘Score of die 2 is not 6’.

Therefore:

P (A) = 30/36 = 5/6

P (B) = 30/36 = 5/6

P (A ∩ B) = 25/36 = 5/6 × 5/6 = P (A) P (B), so A and B are independent.
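Because the sample space here is small, these probabilities can also be verified by brute-force enumeration. A minimal Python sketch (illustrative only):

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely (die 1, die 2) pairs

A = {o for o in outcomes if o[0] != 6}            # score of die 1 is not 6
B = {o for o in outcomes if o[1] != 6}            # score of die 2 is not 6

P = lambda event: Fraction(len(event), len(outcomes))
print(P(A), P(B), P(A & B))                       # 5/6 5/6 25/36
print(P(A & B) == P(A) * P(B))                    # True, so A and B are independent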


1.7.1 Independence of multiple events


Events A1 , A2 , . . . , An are independent if the probability of the intersection of any subset
of these events is the product of the individual probabilities of the events in the subset.
This implies the important result that if events A1 , A2 , . . . , An are independent, then:
P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 ) · · · P (An ).
Note that there is a difference between pairwise independence and full independence.
The following example illustrates.

Example 1.19 It can be cold in London. Four impoverished teachers dress to feel
warm. Teacher A has a hat and a scarf and gloves, Teacher B only has a hat, Teacher
C only has a scarf and Teacher D only has gloves. One teacher out of the four is
selected at random. It is shown that although each pair of events H = ‘the teacher
selected has a hat’, S = ‘the teacher selected has a scarf’, and G = ‘the teacher
selected has gloves’ are independent, all three of these events are not independent.
Two teachers have a hat, two teachers have a scarf, and two teachers have gloves, so:
P(H) = 2/4 = 1/2,   P(S) = 2/4 = 1/2   and   P(G) = 2/4 = 1/2.

Only one teacher has both a hat and a scarf, so:

P(H ∩ S) = 1/4

and similarly:

P(H ∩ G) = 1/4   and   P(S ∩ G) = 1/4.
From these results, we can verify that:

P (H ∩ S) = P (H) P (S)
P (H ∩ G) = P (H) P (G)
P (S ∩ G) = P (S) P (G)

and so the events are pairwise independent. However, one teacher has a hat, a scarf
and gloves, so:
P(H ∩ S ∩ G) = 1/4 ≠ P(H) P(S) P(G) = 1/8.
Hence the three events are not independent. If the selected teacher has a hat and a
scarf, then we know that the teacher has gloves. There is no independence for all
three events together.

1.7.2 Independent versus mutually exclusive events


The idea of independent events is quite different from that of mutually exclusive
(disjoint) events, as shown in Figure 1.9.


Figure 1.9: Venn diagram depicting mutually exclusive events.

For mutually exclusive events A ∩ B = ∅, and so, from (1.1), P(A ∩ B) = 0. For
independent events, P(A ∩ B) = P(A) P(B). Since P(A ∩ B) = 0 ≠ P(A) P(B) in
general (except in the uninteresting case when P(A) = 0 or P(B) = 0), mutually
exclusive events and independent events are different.
In fact, mutually exclusive events are extremely non-independent (i.e. dependent). For
example, if you know that A has happened, you know for certain that B has not
happened. There is no particularly helpful way to represent independent events using a
Venn diagram.

Conditional probability

Consider two events A and B. Suppose you are told that B has occurred. How does
this affect the probability of event A?

The answer is given by the conditional probability of A given that B has occurred,
or the conditional probability of A given B for short, defined as:

P(A | B) = P(A ∩ B) / P(B)

assuming that P (B) > 0. The conditional probability is not defined if P (B) = 0.

Example 1.20 Suppose we roll two independent fair dice again. Consider the
following events.

A = ‘at least one of the scores is 2’.

B = ‘the sum of the scores is greater than 7’.

These are shown in Figure 1.10. Now P (A) = 11/36 ≈ 0.31, P (B) = 15/36 and
P (A ∩ B) = 2/36. Therefore, the conditional probability of A given B is:

P(A | B) = P(A ∩ B) / P(B) = (2/36) / (15/36) = 2/15 ≈ 0.13.


Learning that B has occurred causes us to revise (update) the probability of A
downward, from 0.31 to 0.13.

[Figure 1.10: Events A, B and A ∩ B for Example 1.20, shown on the grid of all 36 equally likely outcomes (die 1, die 2).]

One way to think about conditional probability is that when we condition on B, we


redefine the sample space to be B.

Example 1.21 In Example 1.20, when we are told that the conditioning event B
has occurred, we know we are within the solid green line in Figure 1.10. So the 15
outcomes within it become the new sample space. There are 2 outcomes which
satisfy A and which are inside this new sample space, so:
P(A | B) = 2/15 = (number of cases of A within B) / (number of cases of B).
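This 'new sample space' view of conditioning can be checked by the same kind of enumeration. A minimal Python sketch (illustrative only):

from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes

A = {o for o in outcomes if 2 in o}               # at least one of the scores is 2
B = {o for o in outcomes if sum(o) > 7}           # the sum of the scores is greater than 7

p_A_given_B = Fraction(len(A & B), len(B))        # count cases of A within the new sample space B
print(p_A_given_B, float(p_A_given_B))            # 2/15 0.1333...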

1.7.3 Conditional probability of independent events


If A ⊥⊥ B, i.e. P (A ∩ B) = P (A) P (B), and P (B) > 0 and P (A) > 0, then:
P(A | B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A)

and:

P(B | A) = P(A ∩ B) / P(A) = P(A) P(B) / P(A) = P(B).
In other words, if A and B are independent, learning that B has occurred does not
change the probability of A, and learning that A has occurred does not change the
probability of B. This is exactly what we would expect under independence.


1.7.4 Chain rule of conditional probabilities


Since P (A | B) = P (A ∩ B)/P (B), then:

P (A ∩ B) = P (A | B) P (B).

That is, the probability that both A and B occur is the probability that A occurs given
that B has occurred multiplied by the probability that B occurs. An intuitive graphical
version of this is a simple path diagram:

start → B → A

The path to A is to get first to B, and then from B to A.


It is also true that:
P (A ∩ B) = P (B | A) P (A)
and you can use whichever is more convenient. Very often some version of this chain
rule is much easier than calculating P (A ∩ B) directly.
The chain rule generalises to multiple events:

P (A1 ∩ A2 ∩ · · · ∩ An ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 ) · · · P (An | A1 , A2 , . . . , An−1 )

where, for example, P (A3 | A1 , A2 ) is shorthand for P (A3 | A1 ∩ A2 ). The events can be
taken in any order, as shown in Example 1.22.

Example 1.22 For n = 3, we have:

P (A1 ∩ A2 ∩ A3 ) = P (A1 ) P (A2 | A1 ) P (A3 | A1 , A2 )


= P (A1 ) P (A3 | A1 ) P (A2 | A1 , A3 )
= P (A2 ) P (A1 | A2 ) P (A3 | A1 , A2 )
= P (A2 ) P (A3 | A2 ) P (A1 | A2 , A3 )
= P (A3 ) P (A1 | A3 ) P (A2 | A1 , A3 )
= P (A3 ) P (A2 | A3 ) P (A1 | A2 , A3 ).

Example 1.23 Suppose you draw 4 cards from a deck of 52 playing cards. What is
the probability of A = ‘the cards are the 4 aces (cards of rank 1)’ ?
We could calculate this using counting rules. There are \binom{52}{4} = 270,725 possible
subsets of 4 different cards, and only 1 of these consists of the 4 aces. Therefore,
P(A) = 1/270,725.
Let us try with conditional probabilities. Define Ai as ‘the ith card is an ace’, so
that A = A1 ∩ A2 ∩ A3 ∩ A4 . The necessary probabilities are:

P (A1 ) = 4/52 since there are initially 4 aces in the deck of 52 playing cards


P (A2 | A1 ) = 3/51. If the first card is an ace, 3 aces remain in the deck of 51
playing cards from which the second card will be drawn

P (A3 | A1 , A2 ) = 2/50

P (A4 | A1 , A2 , A3 ) = 1/49.
Putting these together with the chain rule gives:

P(A) = P(A1) P(A2 | A1) P(A3 | A1, A2) P(A4 | A1, A2, A3)
     = 4/52 × 3/51 × 2/50 × 1/49 = 24/6,497,400 = 1/270,725.
Here we could obtain the result in two ways. However, there are very many situations
where classical probability and counting rules are not usable, whereas conditional
probabilities and the chain rule are completely general and always applicable.
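Both routes to this answer are easy to reproduce in code. The sketch below (illustrative only) evaluates the chain-rule product with exact fractions and compares it with the counting-rule answer.

from fractions import Fraction
from math import comb

# chain rule: P(A1) P(A2 | A1) P(A3 | A1, A2) P(A4 | A1, A2, A3)
p_chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50) * Fraction(1, 49)

# counting-rule answer: 1 favourable subset out of all 4-card subsets
p_count = Fraction(1, comb(52, 4))

print(p_chain, p_count, p_chain == p_count)   # 1/270725 1/270725 True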

More methods for summing probabilities

We now return to probabilities of partitions like the situation shown in Figure 1.11.

[Figure 1.11: On the left, a Venn diagram depicting A = A1 ∪ A2 ∪ A3, and on the right the 'paths' to A.]

Both diagrams in Figure 1.11 represent the partition A = A1 ∪ A2 ∪ A3 . For the next
results, it will be convenient to use diagrams like the one on the right in Figure 1.11,
where A1 , A2 and A3 are symbolised as different ‘paths’ to A.
We now develop powerful methods of calculating sums like:
P (A) = P (A1 ) + P (A2 ) + P (A3 ).

1.7.5 Total probability formula


Suppose B1 , B2 , . . . , BK form a partition of the sample space. Therefore, A ∩ B1 ,
A ∩ B2 , . . ., A ∩ BK form a partition of A, as shown in Figure 1.12.
In other words, think of event A as the union of all the A ∩ Bi s, i.e. of ‘all the paths to
A via different intervening events Bi ’.
To get the probability of A, we now:

1. apply the chain rule to each of the paths:


P (A ∩ Bi ) = P (A | Bi ) P (Bi )


[Figure 1.12: On the left, a Venn diagram depicting the set A and the partition of S, and on the right the 'paths' to A.]

2. add up the probabilities of the paths:

P(A) = Σ_{i=1}^{K} P(A ∩ Bi) = Σ_{i=1}^{K} P(A | Bi) P(Bi).

This is known as the formula of total probability. It looks complicated, but it is
actually often far easier to use than trying to find P(A) directly.

Example 1.24 Any event B has the property that B and its complement B c
partition the sample space. So if we take K = 2, B1 = B and B2 = B c in the formula
of total probability, we get:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= P (A | B) P (B) + P (A | B c )(1 − P (B)).

[Diagram: the two 'paths' to A, one via B and the other via B^c.]

Example 1.25 Suppose that 1 in 10,000 people (0.01%) has a particular disease. A
diagnostic test for the disease has 99% sensitivity. If a person has the disease, the
test will give a positive result with a probability of 0.99. The test has 99% specificity.
If a person does not have the disease, the test will give a negative result with a
probability of 0.99.
Let B denote the presence of the disease, and B c denote no disease. Let A denote a
positive test result. We want to calculate P (A).


The probabilities we need are P(B) = 0.0001, P(B^c) = 0.9999, P(A | B) = 0.99 and
P(A | B^c) = 0.01. Therefore:

P (A) = P (A | B) P (B) + P (A | B c ) P (B c )
= 0.99 × 0.0001 + 0.01 × 0.9999
= 0.010098.
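The same calculation is easy to express as a small function; this is only an illustrative sketch, and the function and argument names are chosen here for convenience.

def total_probability(priors, likelihoods):
    """P(A) = sum over i of P(A | B_i) P(B_i), for a partition B_1, ..., B_K."""
    return sum(p * l for p, l in zip(priors, likelihoods))

# partition: {disease, no disease}; event A = positive test result
print(total_probability(priors=[0.0001, 0.9999], likelihoods=[0.99, 0.01]))   # ≈ 0.010098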

1.7.6 Bayes’ theorem


So far we have considered how to calculate P (A) for an event A which can happen in
different ways, ‘via’ different events B1 , B2 , . . . , BK .
Now we reverse the question. Suppose we know that A has occurred, as shown in Figure
1.13.

Figure 1.13: Paths to A indicating that A has occurred.

What is the probability that we got there via, say, B1 ? In other words, what is the
conditional probability P (B1 | A)? This situation is depicted in Figure 1.14.

Figure 1.14: A being achieved via B1 .

So we need:

P(Bj | A) = P(A ∩ Bj) / P(A)

and we already know how to get this.

P(A ∩ Bj) = P(A | Bj) P(Bj) from the chain rule.

P(A) = Σ_{i=1}^{K} P(A | Bi) P(Bi) from the total probability formula.


Bayes’ theorem

Using the chain rule and the total probability formula, we have:

P(Bj | A) = P(A | Bj) P(Bj) / Σ_{i=1}^{K} P(A | Bi) P(Bi)

which holds for each Bj , j = 1, 2, . . . , K. This is known as Bayes’ theorem.

Example 1.26 Continuing with Example 1.25, let B denote the presence of the
disease, B c denote no disease, and A denote a positive test result.
We want to calculate P (B | A), i.e. the probability that a person has the disease,
given that the person has received a positive test result.
The probabilities we need are:

P(B) = 0.0001,   P(B^c) = 0.9999,   P(A | B) = 0.99   and   P(A | B^c) = 0.01.

Therefore:

P(B | A) = P(A | B) P(B) / (P(A | B) P(B) + P(A | B^c) P(B^c)) = (0.99 × 0.0001) / 0.010098 ≈ 0.0098.

Why is this so small? The reason is because most people do not have the disease and
the test has a small, but non-zero, false positive rate P (A | B c ). Therefore, most
positive test results are actually false positives.
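A direct translation of Bayes' theorem into code makes it easy to explore how the posterior changes with the prevalence; this is only an illustrative sketch, with hypothetical function and argument names.

def bayes(priors, likelihoods, j):
    """P(B_j | A) via Bayes' theorem, for a partition B_1, ..., B_K with P(A | B_i) given."""
    p_A = sum(p * l for p, l in zip(priors, likelihoods))   # total probability formula
    return likelihoods[j] * priors[j] / p_A

# P(disease | positive test): prior 0.0001, sensitivity 0.99, false positive rate 0.01
print(round(bayes([0.0001, 0.9999], [0.99, 0.01], j=0), 4))   # 0.0098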

1.8 Overview of chapter


This chapter introduced some formal terminology related to probability theory. The
axioms of probability were introduced, from which various other probability results were
derived. There followed a brief discussion of counting rules (using permutations and
combinations). The important concepts of independence and conditional probability
were discussed, and Bayes’ theorem was derived.

1.9 Key terms and concepts


Axiom Bayes’ theorem
Binomial coefficient Chain rule
Classical probability Collectively exhaustive
Combination Complement
Conditional probability Counting
Disjoint Element


Empty set Experiment


Event Factorial
Independence Intersection
Mutually exclusive Outcome
Pairwise disjoint Partition
Permutation Probability (theory)
Relative frequency Sample space
Set Subset
Total probability Union
Venn diagram With(out) replacement

1.10 Sample examination questions


1. A box contains 12 light bulbs, of which two are defective. If a person selects 5 bulbs
at random, without replacement, what is the probability that both defective bulbs
will be selected?

2. A and B are independent events such that:

P ((A ∪ B)c ) = π1 and P (A) = π2 .

Determine P (B) as a function of π1 and π2 .

3. A county is made up of three (mutually exclusive) communities A, B and C, with


proportions of people living in them given by the following table:

Community A B C
Proportion 0.20 0.50 0.30

Given a person belongs to a certain community, the probability of that person


being vaccinated is given by the following table:

Community given A B C
Probability of being vaccinated 0.80 0.70 0.60

(a) We choose a person from the county at random. What is the probability that
the person is not vaccinated?

(b) We choose a person from the county at random. Find the probability that the
person is in community A, given the person is vaccinated.

(c) In words, briefly explain how the ‘probability of being vaccinated’ for each
community would be known in practice.


1.11 Solutions to Sample examination questions


1. The sample space consists of all (unordered) subsets of 5 out of the 12 light bulbs
   in the box. There are \binom{12}{5} such subsets. The number of subsets which contain the
   two defective bulbs is the number of subsets of size 3 out of the other 10 bulbs,
   \binom{10}{3}, so the probability we want is:

   \binom{10}{3} / \binom{12}{5} = (5 × 4) / (12 × 11) = 0.1515.

2. We are given that P ((A ∪ B)c ) = π1 , P (A) = π2 , and that A and B are
independent. Hence:

P (A ∪ B) = 1 − π1 and P (A ∩ B) = P (A) P (B) = π2 P (B).

Therefore:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) = π2 + P (B) − π2 P (B) = 1 − π1 .

Solving for P(B), we have:

P(B) = (1 − π1 − π2) / (1 − π2).

3. (a) Denote acceptance of vaccination by V , and not accepting by V c . By the law


of total probability, we have:

P (V c ) = P (V c | A) P (A) + P (V c | B) P (B) + P (V c | C) P (C)


= (1 − 0.8) × 0.2 + (1 − 0.7) × 0.5 + (1 − 0.6) × 0.3
= 0.31.

(b) By Bayes’ theorem, we have:

P(A | V) = P(V | A) P(A) / P(V) = (0.8 × 0.2) / (1 − 0.31) = 16/69 = 0.2319.

(c) Any reasonable answer accepted, such as relative frequency estimate, or from
health records.

There are lies, damned lies and statistics.


(Mark Twain)

Chapter 2
Discrete probability distributions

2.1 Synopsis of chapter


This chapter introduces the concept of random variables and discrete probability
distributions. These distributions are univariate, which means that they are used to
model a single numerical quantity. The concepts of expected value and variance of
discrete random variables are also discussed.

2.2 Learning outcomes


After completing this chapter, you should be able to:

formally define a random variable and distinguish it from the values which it takes

explain the difference between discrete and continuous random variables

summarise basic discrete distributions such as the uniform, Bernoulli, binomial,


Poisson, geometric and negative binomial.

2.3 Introduction
A random variable is a ‘mapping’ of the elementary outcomes in the sample space to
real numbers. This allows us to attach probabilities to the experimental outcomes.
Hence the concept of a random variable is that of a measurement which takes a
particular value for each possible trial (experiment). Frequently, this will be a numerical
value.

Example 2.1 Suppose we sample five people and measure their heights, hence
‘height’ is the random variable and the five (observed) values of this random variable
are the realised measurements for the heights of these five people.

Example 2.2 Suppose a fair die is thrown four times and we observe two 6s, a 3
and a 1. The random variable is the ‘score on the die’, and for these four trials it
takes the values 6, 6, 3 and 1. (In this case, since we do not know the true order in
which the values occurred, we could also say that the results were 1, 6, 3 and 6, or 1,
3, 6 and 6, or . . ..)


An example of an experiment with non-numerical outcomes would be a coin toss, for
which recall S = {H, T}. We can use a random variable, X, to convert the sample space
elements to real numbers:

X = 1 if heads, and X = 0 if tails.
The value of any of the above variables will typically vary from sample to sample, hence
the name ‘random variable’.
So each experimental random variable has a collection of possible outcomes, and a
numerical value associated with each outcome. We have already encountered the term
‘sample space’ which here is the set of all possible numerical values of the random
variable.

Example 2.3 Examples of random variables include the following:

Experiment Random variable Sample space


Die is thrown Value on top face {1, 2, 3, 4, 5, 6}
Coin is tossed five times Number of heads {0, 1, 2, 3, 4, 5}
Twenty people sampled Number with blue eyes {0, 1, 2, . . . , 19, 20}
Machine operates for a day Number of breakdowns {0, 1, 2, . . .}
One adult sampled Height in cm {[100cm, 200cm]} (roughly)

2.4 Probability distribution


A natural question to ask is ‘what is the probability of any of these values?’. That is, we
are interested in the probability distribution of the experimental random variable.
Be aware that random variables come in two varieties – discrete and continuous.1

Discrete and continuous random variables

Discrete: Synonymous with ‘count data’, that is, as far as this course is
concerned, random variables which take non-negative integer values, such as
0, 1, 2, . . .. For example, the number of heads in n coin tosses.

Continuous: Synonymous with ‘measured data’ such as the real line, R =


(−∞, ∞), or some subset of R, for example the unit interval [0, 1]. For example,
the height of adults in centimetres.

The mathematical treatment of probability distributions depends on whether we are


dealing with discrete or continuous random variables. This chapter will explore the
former, while Chapter 3 will explore the latter.
In Example 2.3, the sample spaces of various experiments are shown. In most cases
there will be a higher chance of the random variable taking some sample space values
1
For completeness, be aware that mixture distributions (with discrete and continuous components)
exist, although they will not be considered in this course.


relative to others. Our objective is to express these chances using an associated


probability distribution. In the discrete case, we can associate with each ‘point’ in the
sample space a probability which represents the chance of the random variable being
equal to that particular value. (The probability is typically non-zero, although
sometimes we need to use a probability of zero to identify impossible events.)
To summarise, a probability distribution is the complete set of sample space values with
their associated probabilities which, by axiom 2, must sum to 1 for discrete random
variables.2 The probability distribution can be represented diagrammatically by plotting
the probabilities against sample space values.
Finally, before we proceed, let us spend a moment to briefly discuss some important
issues with regard to the notation associated with random variables. For notational
efficiency reasons, we often use a capital letter to represent the random variable. The
letter X is often adopted, but it is perfectly legitimate to use any other letter: Y , Z etc.
In contrast, a lower case letter denotes a particular value of the random variable.

Example 2.4 Let X = ‘the score of a fair die’. If the die results in a 3, then this is
written as x = 3.
The probability distribution of X is:

X=x 1 2 3 4 5 6
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6

This is an example of the (discrete) uniform distribution.3 For discrete random


variables, we talk about a mass of probability at each respective sample space value.
In the discrete uniform case this mass is the same, i.e. 1/6, and this is plotted to
show the probability distribution of X, as shown in Figure 2.1.

Discrete uniform distribution

A random variable X has a discrete uniform distribution if it has k possible


outcomes, all of which are equally likely. If the distinct outcomes are 1, 2, . . . , k, then
the fact that k is finite means the distribution is ‘discrete’, and it is ‘uniform’ because
all the probabilities are equal. The probability function is:
P(X = x) = p(x) = 1/k   for x = 1, 2, . . . , k,  and 0 otherwise.

For a fair die, k = 6.

2
When dealing with continuous random variables the analogous condition is integrating, rather than
summing, to 1. More of this in Chapter 3.
3
At school, ‘uniforms’ are worn, i.e. all pupils wear the same clothes (possibly slight differences across
genders), hence when the term ‘uniform’ is applied to a probability distribution, we have the same
probability of occurrence for each sample space value.



Figure 2.1: Probability distribution for the score on a fair die in Example 2.4.

Example 2.5 Let X = ‘the number of heads when five fair coins are tossed’. The
probability distribution of X is:

X = x        0          1            2             3             4            5
P(X = x)     0.03       0.16         0.31          0.31          0.16         0.03
             = (0.5)^5  = 5×(0.5)^5  = 10×(0.5)^5  = 10×(0.5)^5  = 5×(0.5)^5  = (0.5)^5

This is an example of the binomial distribution (discussed shortly) and can be
represented as:

p(x) = \binom{5}{x} × (0.5)^5   for x = 0, 1, 2, . . . , 5
and 0 otherwise. The probability distribution of X is shown in Figure 2.2.

A probability distribution has a natural frequency interpretation – if the experiment is


repeated a very large number of times, then the probability of any particular value of
the random variable is equal to the limit of its relative frequency as the number of
experiments becomes infinitely large.
There are many important probability distributions which describe the chances of
real-life events, and these form the basis of statistical inference and data analysis. The
binomial and Poisson distributions (both about counting) are discussed in this chapter,
while the normal and other important continuous distributions are covered in the
following chapter.



Figure 2.2: Probability distribution for the number of heads when five fair coins are
tossed.

2.5 Binomial distribution


The binomial distribution is a series of n independent Bernoulli trials. Hence it
makes sense to define a Bernoulli trial first of all. In fact, we have already seen an
example of this – the single coin toss. Key features of a Bernoulli trial are as follows.

A Bernoulli trial has only two possible outcomes (i.e. it is dichotomous) which are
typically called ‘success’ and ‘failure’ – such as ‘heads’ and ‘tails’. We usually code
a success as 1 and a failure as 0.

There is a fixed probability of success, π, and, therefore, a fixed probability of


failure, 1 − π. So, for a fair coin, π = 0.50 – repeatedly tossing the same coin will
not change π.

Consequently, given a constant π, then successive Bernoulli trials are independent.

Bernoulli distribution

The probability distribution for a Bernoulli trial is:


X = x        0       1
P(X = x)     1 − π   π

and the Bernoulli distribution can be expressed with the following probability
function:

P(X = x) = π^x (1 − π)^(1−x)   for x = 0, 1,  and 0 otherwise.


Example 2.6 Other potential examples of Bernoulli trials are: (i.) the sex of
new-born babies (male or female), (ii.) the classification of factory output (defective
or not defective), and (iii.) voters supporting a candidate (support or not support).

In fact, many sampling situations become Bernoulli trials if we are only interested in
classifying the result categorically in one of two ways – for example, heights of people if
we are only interested in whether or not each person is taller than 180 cm, say.
Extending this idea, if we have n successive Bernoulli trials, then we define the binomial
distribution.

Binomial distribution

Let X = ‘the number of successes’ in a sequence of n independent and identically


distributed Bernoulli trials, then:4

X ∼ Bin(n, π)

where the terms n and π are called parameters, since the values of these define
which specific binomial distribution we have. Its probability function is:
( 
n
x
π x (1 − π)n−x for x = 0, 1, 2, . . . , n
P (X = x) = (2.1)
0 otherwise.

n is the number of Bernoulli trials, π is the (constant) probability of success for each
trial, P (X = x) is the probability that the number of successes in the n trials is equal
to x. That is, we are seeking to count the number of successes, and each P (X = x)
is the probability that the discrete (count) random variable X takes the value x.

(2.1) can be used to calculate probabilities for any binomial distribution, provided n
and π are both specified. Note that a binomial random variable can take n + 1 different
values, not n, since the variable measures the number of successes. The smallest number
of successes in n trials is zero (i.e. if all trials resulted in failure); the largest number of
successes is n (i.e. if all trials resulted in success); with the intervening number of
successes being 1, 2, . . . , n − 1. Therefore, there are n + 1 different values in total.

Necessary conditions to apply the binomial distribution

Each trial has only two possible outcomes – success and failure.

Fixed probability of success, π.

Fixed number of trials, n.

All trials are statistically independent.

4
Read ‘∼’ as ‘is distributed as’.


2.6 Cumulative distribution functions


A probability function can be used to compute p(x) = P (X = x), i.e. the probability of
a single value x of the random variable. Of course, we may wish to know the probability
that the random variable X is less than or equal to x. We call such a probability a
cumulative probability, denoted by the cumulative distribution function (cdf ):

F (x) = P (X ≤ x). (2.2)

Cumulative distribution function for discrete random variables

For discrete random variables taking non-negative integer values, the cumulative
distribution function (cdf) is:5

F (x) = P (X = 0) + P (X = 1) + P (X = 2) + · · · + P (X = x)
= p(0) + p(1) + p(2) + · · · + p(x).

It follows that we can easily find the probability function from the cumulative
distribution function, or vice versa, using this relationship. Specifically, note that:

P (X = x) = F (x) − F (x − 1).

Example 2.7 Consider ten test tubes of bacterial solution and let us suppose that
the probability of any single test tube showing bacterial growth is 0.2. Let X denote
the number of test tubes showing bacterial growth. Hence:

P(exactly 4 show growth) = P(X = 4) = \binom{10}{4} × (0.2)^4 × (0.8)^6

and:

P(more than 1 show growth) = 1 − F(1)
                           = 1 − P(X = 0) − P(X = 1)
                           = 1 − (0.8)^10 − \binom{10}{1} × (0.2)^1 × (0.8)^9
                           = 1 − 0.1074 − 0.2684
                           = 0.6242.

Note this technique also illustrates the advantage of computing the probability of an
event by calculating the probability of it not happening and subtracting this from 1.6
5
Note you can use either form of notation p(x) or P (X = x), whichever you prefer.
6
Recall P (A) = 1 − P (Ac ).
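Probabilities such as these are straightforward to reproduce with a short function; the sketch below (illustrative only, assuming Python 3.8+ for math.comb) recomputes the two probabilities from Example 2.7.

from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 10, 0.2
print(round(binom_pmf(4, n, pi), 4))                             # P(exactly 4 show growth)
print(round(1 - binom_pmf(0, n, pi) - binom_pmf(1, n, pi), 4))   # P(more than 1 show growth) = 0.6242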


2.6.1 Cumulative distribution functions – another point of view


X ∼ Bin(n, π) is one of many distributions which take only non-negative integer values,
and are typically about counting. It does not normally make sense to ask questions like:
‘What is P (X = −3), P (X = 1.5) or P (X = π)?’ for the obvious reason that all such
values are impossible,7 so the relevant probabilities are all zero.
However, it makes a lot more sense to ask questions like ‘What is P (X ≤ −3),
P (X ≤ 1.5) or P (X ≤ π)?’. The answers are 0, F (1) and F (0), respectively (using the
notation in (2.2)). This illustrates why we can define the cdf of a counting-style random
variable even for values of a continuous argument x as:8

F (x) = P (X ≤ x).

With this convention, F (x) is a function whose graph is a step function.

Example 2.8 If X ∼ Bin(2, π), then each time x reaches an integer value in the
range [0, 2], the cdf ‘jumps’ by P (X = x), until the sum of the P (X = x)s reaches 1.
This is shown in Figure 2.3.

[Step-function plot of F(x) against x: F(x) equals (1 − π)^2 on [0, 1), equals 2π(1 − π) + (1 − π)^2 on [1, 2), and equals 1 for x ≥ 2.]

Figure 2.3: Step function showing the cdf of X ∼ Bin(2, π) in Example 2.8.

The same pattern shown in Figure 2.3 applies to any version of Bin(n, π), or indeed to
any other distribution for which there is a largest possible integer value. For
distributions like the Poisson (discussed next) which count, but which do not have a
largest possible value, the pattern is similar, but the value 1 is never reached.

7
Technically speaking for n = 1, if π = 0 or 1 then P (X = π) = 1, but this means a success is
impossible or certain, respectively. Hence we no longer have two possible outcomes, but one certain
outcome, i.e. a failure or success, respectively.
8
Note that the argument x is not the same as the random variable X, nor is it the same as a realisation
or value of X. This is because X is not continuous, but takes only (selected) integer values. The x value
simply tells us the range of values in which we are interested.


2.7 Poisson distribution


The Poisson distribution applies to random points occurring in a ‘continuous’
medium such as time, distance, area or volume. The discussion here will concentrate
mainly on one-dimensional cases, such as time or distance. In all cases, we are dealing
with random points which have the following properties.

Properties of random points in a Poisson process

Each point is equally likely to occur anywhere in the medium.

The position taken by each point is completely independent of the occurrence


or non-occurrence of all the other points.

In this situation the random variable X is the number of points in a particular unit of
the medium.

Poisson probability function

The probability function for the Poisson distribution is:

P(X = x) = e^{−λ} λ^x / x!   for x = 0, 1, 2, . . .,  and 0 otherwise

where λ is the average number of points per unit of the medium, and is known as
the rate parameter. Note that, unlike the binomial distribution, there is no upper
bound on the value of x.

Example 2.9 Examples of a Poisson process include (i) machine breakdowns per
unit of time, (ii) arrivals at an airport per unit of time, and (iii) flaws along a rope
per unit of length.

Example 2.10 Consider a machine which breaks down, on average, 3.2 times per
week, hence λ = 3.2 per week. The probability that it will break down exactly once
next week is:
P(X = 1) = e^{−3.2} (3.2)^1 / 1! = 0.1304.

The probability that it will break down exactly four times in the next two weeks
(hence λ is now 6.4) is:

P(X = 4) = e^{−6.4} (6.4)^4 / 4! = 0.1162.
Note that if we know λ for one unit of time (here, per week) and we want to look at
k units of time (in this example, k = 2), then we need to proportionally change λ to
reflect this, i.e. the revised rate parameter is k × λ (hence in this example the revised
λ for a two-week period is 2 × 3.2 = 6.4).
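Both probabilities can be checked with a few lines of Python (an illustrative sketch only):

from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Pois(lam)."""
    return exp(-lam) * lam**x / factorial(x)

print(round(poisson_pmf(1, 3.2), 4))   # exactly one breakdown next week: 0.1304
print(round(poisson_pmf(4, 6.4), 4))   # exactly four breakdowns over two weeks: 0.1162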


2.8 Poisson approximation to the binomial


Try entering 70! in your calculator. I suspect you will encounter a ‘computer says no’
moment. (If not, try 71!, 72! etc.) In such cases we will have difficulty in computing
binomial probabilities for large values of n, due to the use of factorials in the \binom{n}{x}
component of the probability function.


Hence it would be useful to have a suitable approximation to the binomial distribution
when direct computation of binomial probabilities proves problematic.9 There are two
commonly-used such approximations to the binomial – one using the normal
distribution (covered in Chapter 3), and another using the Poisson distribution. Clearly,
any approximation is just that, an approximation, so we should only use this
approximating procedure when we obtain a good (i.e. close) approximation.

Conditions for using a Poisson approximation to the binomial

To justify use of the Poisson as an approximating distribution to the binomial, the


following conditions should hold.

n greater than 30.

π sufficiently extreme such that nπ < 10.

The approximation is only good for small values of x, relative to n.

Set the Poisson rate parameter λ = nπ.

Example 2.11 Suppose we sample 100 items at random from a production line
which is providing, on average, 2% defective items. What is the probability of
exactly 3 defective items in our random sample?
First, we have to check that the relevant criteria for using the Poisson approximation
are satisfied. Indeed they are. n = 100 > 30, π = 0.02 is sufficiently small such that
nπ = 2 < 10 and x = 3 is small relative to n. Hence:
P(X = 3) = \binom{100}{3} × (0.02)^3 × (0.98)^97 = 0.1823 (the true binomial probability) ≈ e^{−2} 2^3 / 3! = 0.1804.
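The quality of the approximation can be checked directly; the following Python sketch (illustrative only) compares the exact binomial probability with the Poisson value using λ = nπ.

from math import comb, exp, factorial

n, pi, x = 100, 0.02, 3
lam = n * pi                                      # Poisson rate parameter

exact = comb(n, x) * pi**x * (1 - pi)**(n - x)    # true binomial probability
approx = exp(-lam) * lam**x / factorial(x)        # Poisson approximation
print(round(exact, 4), round(approx, 4))          # 0.1823 0.1804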

2.9 Expected value of a discrete random variable


Certain important properties of distributions arise if we consider probability-weighted
averages of random variables, and of functions of random variables.10 For example, we
might want to know the ‘average’ value of a random variable.
9
The Windows scientific calculator can handle large factorials, though I suspect this uses an
approximation such as Stirling's formula: n! ≈ √(2πn) n^n e^{−n}.
10
A function, f (X), of a random variable X is, of course, a new random variable, say Y = f (X).


It would be foolish to simply take the arithmetic average of all the values taken by the
random variable, as this would mean that very unlikely values (those with small
probabilities of occurrence) would receive the same weighting as very likely values
(those with large probabilities of occurrence). The obvious approach is to use the
probability-weighted average of the sample space values, known as the expected
value of X.

Expectation of a discrete random variable

If x1, x2, . . . , xN are the possible values of the random variable X, with corresponding
probabilities p1, p2, . . . , pN, then:

E(X) = µ = Σ_{i=1}^{N} xi pi = x1 p1 + x2 p2 + · · · + xN pN.

Note that the expected value is also referred to as the population mean, which can be
written as E(X) (in words ‘the expectation of the random variable X’), or µ (in words
‘the (population) mean of X’). Also, note the distinction between the sample mean, x̄,
(introduced in ST104a Statistics 1) based on observed sample values, and the
population mean, µ, based on the theoretical probability distribution.

Example 2.12 If the ‘random variable’ X happens to be a constant, k, then


x1 = k, and p1 = 1, so trivially E(X) = k × 1 = k.

Example 2.13 If X ∼ Bin(n, π), then:


E(X) = Σ_{x=0}^{n} x P(X = x)
     = 0 × (1 − π)^n + 1 × \binom{n}{1} π^1 (1 − π)^(n−1) + 2 × \binom{n}{2} π^2 (1 − π)^(n−2) + · · · + n × π^n
     = nπ.

Why this reduces to nπ is beyond the scope of this course, but the fact that
E(X) = nπ for the binomial distribution is a useful result!

Example 2.14 If X ∼ Pois(λ), then:



E(X) = Σ_{x=0}^{∞} x P(X = x)
     = 0 × e^{−λ} λ^0/0! + 1 × e^{−λ} λ^1/1! + 2 × e^{−λ} λ^2/2! + · · · + k × e^{−λ} λ^k/k! + · · ·
     = λ.

Again, why this reduces to λ is beyond the scope of this course, but again is a useful
result.


2.9.1 New random variables

Above we have labelled the population mean as the 'expectation' of the random
variable and introduced the expectation operator, E(·). This operator, like the
summation operator Σ, is a linear operator and hence this property can be used to find
the expectation of a new random variable, be it a transformation of a single random
variable or a linear combination of two (or more) random variables.

Example 2.15 Suppose X is a random variable and α is a non-zero constant.


Define W = αX to be a new random variable. What is the mean of W ? We have:
E(W) = E(αX) = Σ_{i=1}^{N} (α xi) pi
     = α x1 p1 + α x2 p2 + · · · + α xN pN
     = α(x1 p1 + x2 p2 + · · · + xN pN)
     = α E(X).

That is, E(αX) = α E(X).

Example 2.16 Suppose X and Y are random variables. Let Z = X ± Y be a new,


but clearly related, random variable. What is the mean of Z? To obtain this, simply
exploit the linear property of the expectation operator. Hence:

E(Z) = E(X ± Y ) = E(X) ± E(Y ).

We can combine these two approaches.

Expectation of linear combinations of random variables

Given random variables X and Y , and constants α and β (both non-zero), define
T = αX ± βY . It follows that:

E(T ) = E(αX ± βY ) = α E(X) ± β E(Y ).

2.10 Variance of a discrete random variable


The concept of a probability-weighted average (or expected value) can be extended to
functions of the random variable.


Example 2.17 If X takes the values x1, x2, . . . , xN with corresponding
probabilities p1, p2, . . . , pN, then:

E(1/X) = Σ_{i=1}^{N} (1/xi) pi   for all xi ≠ 0.

E(ln(X)) = Σ_{i=1}^{N} ln(xi) pi   for all xi > 0.

E(X^2) = Σ_{i=1}^{N} xi^2 pi.

One very important average associated with a distribution is the expected value of the
square of the deviation11 of the random variable from its mean, µ. This can be seen to
be a measure – not the only one, but the most widely used by far – of the dispersion of
the distribution and is known as the (population) variance of the random variable.

Variance of a discrete random variable

If X takes the values x1, x2, . . . , xN with corresponding probabilities p1, p2, . . . , pN,
then the (population) variance of a discrete random variable is:

σ^2 = E((X − µ)^2) = Σ_{i=1}^{N} (xi − µ)^2 pi.

The (positive) square root of the variance is known as the standard deviation and,
given the variance is typically denoted by σ 2 , is denoted by σ.

Example 2.18 Let X represent the value shown when a fair die is thrown once.
We now compute the mean and variance of X as follows.

X = x                   1       2       3       4       5       6       Total
P(X = x)                1/6     1/6     1/6     1/6     1/6     1/6     1
x P(X = x)              1/6     2/6     3/6     4/6     5/6     6/6     21/6 = 3.5 = µ
(x − µ)^2               25/4    9/4     1/4     1/4     9/4     25/4
(x − µ)^2 P(X = x)      25/24   9/24    1/24    1/24    9/24    25/24   70/24 = 2.92

Hence µ = E(X) = 3.5, σ^2 = E((X − µ)^2) = 2.92 and σ = √2.92 = 1.71. This
tabular format has some advantages. Specifically, note the following.
11
Which roughly means ‘distance with sign’.


It helps to have (and to calculate) the ‘Total’ column since, for example, if a
probability, P (X = x), has been miscalculated or miscopied then the row total
will not be 1 (recall axiom 2). Therefore, this would highlight an error so, with a
little work, could be identified.

It is often useful to do a group of calculations as fractions over the same


denominator (as here in the final row of the table), rather than to cancel or to
work with them as decimals, because important patterns can be more obvious,
and calculations can be easier.

2.10.1 Alternative expression for the variance


What follows is an extremely useful expression for the variance – worth remembering!

σ 2 = E((X − µ)2 ) = E(X 2 − 2µX + µ2 )


= E(X 2 ) − 2µ E(X) + µ2
= E(X 2 ) − 2µ2 + µ2
= E(X 2 ) − µ2 .

In words, ‘the (population) variance is equal to the mean of the square minus the square
of the mean’. Rearranging gives:

E(X 2 ) = σ 2 + µ2 .

This representation is useful since we often want to know E(X 2 ), but start by knowing
the usual details of a distribution, i.e. µ and σ 2 .

Example 2.19 Continuing with Example 2.18:

X=x 1 2 3 4 5 6 Total

P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6 1

x P (X = x) 1/6 2/6 3/6 4/6 5/6 6/6 21/6 = 3.5 = µ

x2 P (X = x) 1/6 4/6 9/6 16/6 25/6 36/6 91/6

Hence µ = E(X) = 3.5, E(X 2 ) = 91/6, so the variance is 91/6 − (3.5)2 = 2.92, as
before. However, this method is usually easier.
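Working with exact fractions, as recommended above, is also convenient in code; the sketch below (illustrative only) reproduces the die calculations using Python's fractions module.

from fractions import Fraction

values = range(1, 7)
p = Fraction(1, 6)                      # P(X = x) for each face of a fair die

mu = sum(x * p for x in values)         # E(X)
ex2 = sum(x**2 * p for x in values)     # E(X^2)
var = ex2 - mu**2                       # E(X^2) - mu^2

print(mu, ex2, var, float(var))         # 7/2 91/6 35/12 2.9166...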

2.10.2 Limits and special cases


A useful interpretation of population properties is to think of them as the limiting
equivalents of the corresponding sample statistics (as were introduced in ST104a
Statistics 1). Suppose we sample n values of random variable X and get x1 , x2 , . . . , xn ,
then, as n → ∞:


the sample mean, x̄, tends to the population mean, µ, i.e. x̄ → µ

the sample variance, s2 , tends to the population variance, σ 2 , i.e. s2 → σ 2 .

Variance of binomial and Poisson distributions

If X ∼ Bin(n, π), then:

Var(X) = nπ(1 − π).

If X ∼ Pois(λ), then:
Var(X) = λ.

Note that for the Poisson distribution the mean and variance are equal.

2.10.3 New random variables (again)

As in the case of the expected value, we might want to look at linear combinations of
random variables.

Variance of functions of random variables

Given random variables X and Y and non-zero constants α and β, by defining two
new random variables U = αX and T = αX ± βY , then:12

Var(U ) = Var(αX) = α2 Var(X).

Assuming independence of X and Y , then:13

Var(T ) = Var(αX ± βY ) = α2 Var(X) + β 2 Var(Y ).

2.11 Distributions related to the binomial distribution


There are many useful distributions which are related to the binomial distribution. Two
of these are summarised below. Each arises in similar contexts to the binomial
distribution which, of course, is when we have a fixed number of independent Bernoulli
trials with constant probability of success π.

12
One way to remember this is to think of ‘Var’ as a homogeneous function of degree 2, like the
Cobb–Douglas utility and production functions which crop up in economics.
13
We have already met independent events, but we do not yet know what it means for random variables
to be independent. This will be covered later.


2.11.1 Geometric distribution


The geometric distribution is used when we perform a series of Bernoulli trials until
we get the first success. The random variable X is the trial number on which we obtain
this first success. Hence x = 1, 2, . . .. So x = 1 corresponds to the first success on the
first trial, x = 2 corresponds to the first success on the second trial etc.
If the first success occurs on the xth trial, then there must have been x − 1 failures prior
to this, each with probability of occurrence of 1 − π. Given independence of the
Bernoulli trials, we can derive the probability function.

Probability function of the geometric distribution

If:

P(X = x) = (1 − π)^(x−1) π   for x = 1, 2, . . .,  and 0 otherwise

then X has a geometric distribution, denoted X ∼ Geo(π). It can be shown that
for the geometric distribution E(X) = 1/π and Var(X) = (1 − π)/π^2.

2.11.2 Negative binomial distribution


The negative binomial distribution extends the geometric distribution in that the
Bernoulli trials are continued until the rth success is achieved. Hence the geometric
distribution is a special case of this, i.e. when r = 1.
Define X to be the trial number of the rth success, then the smallest number of trials is
r, that is we obtain r consecutive successes from the very beginning. Of course, if x is
the trial number of the rth success, then this means we have previously incurred r − 1
successes, S, and x − r failures, F , which could have occurred in any order. Again, given
independence of the Bernoulli trials, we can derive the probability function, noting that:
P(X = x) = \binom{x−1}{r−1} π^(r−1) (1 − π)^(x−r) × π

where the first factor, \binom{x−1}{r−1} π^(r−1) (1 − π)^(x−r), is the probability of r − 1 successes (S) and x − r failures (F) in the first x − 1 trials.

Probability function of the negative binomial distribution

If:

P(X = x) = \binom{x−1}{r−1} π^r (1 − π)^(x−r)   for x = r, r + 1, r + 2, . . .,  and 0 otherwise

then X has a negative binomial distribution, denoted X ∼ Neg. Bin(r, π). It
can be shown that for the negative binomial distribution E(X) = r/π and
Var(X) = r(1 − π)/π^2.


Example 2.20 Suppose we are conducting independent Bernoulli trials with
success probability π = 1/6 (for example, we might be rolling a fair die and need to
throw a 5 for 'success'). If we want to know the probabilities for it to take k trials
(throws) to get seven successes, then the negative binomial distribution gives us:

P(X = 7) = (1/6)^7

P(X = 8) = \binom{7}{6} × (1/6)^7 × (5/6)^1

P(X = 9) = \binom{8}{6} × (1/6)^7 × (5/6)^2

...

P(X = k) = \binom{k−1}{6} × (1/6)^7 × (5/6)^(k−7).
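These probabilities follow directly from the probability function; the sketch below (illustrative only) evaluates them for the first few values of k, and shows the geometric distribution as the special case r = 1.

from math import comb

def neg_binom_pmf(x, r, pi):
    """P(the rth success occurs on trial x), for x = r, r + 1, ..."""
    return comb(x - 1, r - 1) * pi**r * (1 - pi)**(x - r)

pi = 1 / 6
for k in (7, 8, 9, 10):
    print(k, neg_binom_pmf(k, r=7, pi=pi))

# geometric distribution: special case r = 1 (first success on the xth trial)
print(round(neg_binom_pmf(3, r=1, pi=pi), 4))   # (5/6)^2 * (1/6) ≈ 0.1157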

2.12 Overview of chapter


This chapter has introduced discrete random variables. In particular, some common
families of probability distributions have been presented. In addition to the functional
form of each of these distributions, important properties (such as the expected value
and variance) have been studied.

2.13 Key terms and concepts


Bernoulli trial Binomial distribution
Cumulative distribution function Discrete
Expected value Geometric distribution
Negative binomial distribution Parameter
Poisson distribution Population mean
Population variance Probability distribution
Probability function Random variable
Step function Uniform distribution
Variance

2.14 Sample examination questions


1. Find P (X ≥ 2) when X follows a binomial distribution with parameters n = 10
and π = 0.25.


2. Suppose that a particle starts at the origin of the real line and moves along the line
in jumps of one unit. For each jump, the probability is π (where 0 ≤ π ≤ 1) that
the particle will jump one unit to the left, and hence the probability is 1 − π that
the particle will jump one unit to the right. Find the expected value of the position
of the particle after n jumps.

3. Suppose X follows a Poisson distribution such that P (X = 0) = 1/3. Calculate


P (X ≥ 2).

2.15 Solutions to Sample examination questions


1. Since X ∼ Bin(10, 0.25), then we have:

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − (0.75)^10 − 10 × 0.25 × (0.75)^9 = 0.7560.

2. Let Xi = 1 if the ith jump of the particle is one unit to the right, and let Xi = −1
if the ith jump is one unit to the left. Therefore, for i = 1, 2, . . . , n, we have:

E(Xi ) = −1 × π + 1 × (1 − π) = 1 − 2π.

The position of the particle after n jumps is X1 + X2 + · · · + Xn, hence:

E(X1 + X2 + · · · + Xn) = Σ_{i=1}^{n} E(Xi) = n(1 − 2π).

3. We have that:

P(X = 0) = e^{−λ} λ^0 / 0! = e^{−λ} = 1/3   ⇒   λ ≈ 1.10.

Therefore:

P(X ≥ 2) = 1 − P(X ≤ 1) = 1 − e^{−1.10} (1.10)^1 / 1! − 1/3 = 0.3010.
1! 3

Torture numbers, and they’ll confess to anything.


(Gregg Easterbrook)

Chapter 3
Continuous probability distributions

3.1 Synopsis of chapter


This chapter continues with probability distributions. We will look at the most
important distributions for continuous data. These distributions are univariate, which
means they are used to model a single numerical quantity.

3.2 Learning outcomes


After completing this chapter, you should be able to:

summarise basic continuous distributions such as the uniform, exponential and


normal

work with linear functions of normal random variables.

3.3 Introduction
So far, we have considered discrete distributions such as the binomial and Poisson.
These have dealt with count (or frequency) data, giving sample space values which are
non-negative integers. In such cases, as with other discrete distributions, there is always
a ‘gap’ between two possible values within which there is no other possible value. Hence
the probability of an event involving several possible values is just the sum of the
relevant probabilities of interest.
In contrast, a continuous-valued random variable, say X, can take any value over some
continuous range, or interval. So suppose x1 and x2 are distinct possible values of X,
then there is another possible value between them, for example the mid-point
(x1 + x2 )/2. Although in practice our measurements will obviously only have so many
decimal places (a consequence of limits to measurement accuracy), in principle it is
possible to measure continuous variables to infinitely many decimal places. Hence it is
mathematically convenient to use functions which can take any value over some defined
continuous interval.

Example 3.1 Possible examples of continuous random variables include:

in economics, measuring values of inputs or outputs, workforce productivity or


consumption


in sociology, measuring the proportion of people in a population with a


particular preference

in engineering, measuring the electrical resistance of materials.

Note the recurring appearance of measuring in the above examples. Hence
continuous random variables deal with measured data, while discrete random
variables deal with count data.

The main transition from the discrete world of thinking in Chapter 2 is that in the
continuous world only intervals are of interest, not single (point) values.

3.3.1 A formal definition


The possible values of a continuous random variable lie in the domain of real numbers,
R. Sometimes a random variable X may take any real value, i.e. −∞ < x < ∞, while
sometimes it may take just a subset of the real line, such as 0 < x < ∞ or 0 < x < 1.
Other subsets of R are possible, for example x ∈ (0, 1) ∪ (2, 3), but we will restrict
ourselves to ranges of real numbers for which there are no ‘gaps’.1 A range without gaps
is known as an interval, with either finite or infinite length, for example [0, 1] and
(−∞, ∞), respectively.

Properties of continuous distributions

The random variable X has a continuous distribution if:

there is an interval S ∈ R such that the possible values of X are the point values
in S

for any individual point value x ∈ S, we have:

P (X = x) = p(x) = 0 (3.1)

for any pair of individual point values u and v, say, in S with u < v we can
always work out P (u < X < v).

A consequence of (3.1), i.e. that in the continuous world the probability of a single point
value is zero, is that we can be somewhat blasé about our use of < and ≤ since:

P (X ≤ x) = P (X < x) + P (X = x) = P (X < x) + 0 = P (X < x).

Hence for any constants a and b, such that a < b, all of the following are equivalent:

P (a ≤ X ≤ b) = P (a < X ≤ b) = P (a ≤ X < b) = P (a < X < b).

It is useful to have a broad understanding of this definition of a continuous random


variable, although you will not need to reproduce this in detail. It is more important to
1
We could, of course, refer to x ∈ (0, 1) ∪ (2, 3) by the single, continuous interval 0 < x < 3, and treat
the probability that the value is between 1 and 2 (inclusive) as being zero.


be able to actually calculate various probabilities such as P (a < X < b). This is possible
using either the probability density function or the cumulative distribution
function.

3.4 Probability density function and cumulative


distribution function
The probability properties of a continuous random variable, X, can be described by a
non-negative function, f (x), which is defined over the relevant interval S. f (x) is called
the probability density function (pdf ). A pdf itself does not represent a probability
(hence f (x) is not bounded by 1), instead it is a density of probability at a point x,
with probability itself corresponding to the area under the (graph of the) function f (x).

Example 3.2 If we wanted P (1 < X < 3), say, then we would compute the area
under the curve defined by f (x) and above the x-axis interval (1, 3). This is
illustrated in Figure 3.1.


Figure 3.1: For an arbitrary pdf, P (1 < X < 3) is shown as the area under the pdf and
above the x-axis interval (1, 3).

In this way the pdf will give us the probabilities associated with any interval of interest,
but there is never any interest in wanting the probability for a point value of a
continuous random variable (which, remember, is zero). With this in mind, it is clear
that integration is very important in the theory of continuous random variables, because
of its role in determining areas. Hence the following properties for Example 3.2 should
be readily apparent:
P(1 < X < 3) = the area under f(x) above the x-axis interval (1, 3) = ∫_1^3 f(x) dx

the total area under the curve is 1, since this represents the probability of X taking
any possible value.


Formal properties of a pdf

Any function f (x) defined on an interval S ∈ R can be the pdf for the probability
distribution of a (continuous) random variable X, provided that it satisfies the
following two criteria.

1. f (x) ≥ 0 for all x ∈ S (since you cannot have negative probabilities – recall
   axiom 1).

2. ∫_S f (x) dx = 1, where S represents the sample space of x values. Hence the
   total area under the curve (i.e. the total probability) is 1 – recall axiom 2.

So, if we want to calculate P (a < X < b), for any constants a and b in S such that
a < b, then:

    P (a < X < b) = ∫_a^b f (x) dx.

Therefore, this integration/area approach helps explain why, for any single point u ∈ S,
we have P (X = u) = 0. We can think of this probability as being the area of the
(vertical) line segment from the x-axis to f (u), which is ∫_u^u f (x) dx = 0.

3.4.1 Attributes of a continuous random variable

Just as with a discrete random variable, for a continuous random variable we want to
describe key features of the distribution. We now define various measures of location
and dispersion. The main difference from Chapter 2 is the use of integrals instead of
summations in the definitions.

Attributes of a continuous random variable

The relevant definitions are as follows.

The mean of X is:

    E(X) = µ = ∫_S x f (x) dx

where, as before, S denotes the sample space of X.

The variance of X is:

    Var(X) = σ² = E((X − µ)²) = ∫_S (x − µ)² f (x) dx.

However, recall we can also express the variance as:

    σ² = E(X²) − µ²

where:

    E(X²) = ∫_S x² f (x) dx.

The standard deviation of X is:

    √Var(X) = √σ² = σ.

The median of X is the value m in S such that P (X ≤ m) = P (X ≥ m) = 0.5.
Therefore, in general, we can compute m using either:

    ∫_{−∞}^{m} f (x) dx = 0.5    or    ∫_{m}^{∞} f (x) dx = 0.5.

If the sample space of X has a lower bound of a then substitute a for −∞, and
if the sample space of X has an upper bound of b, then substitute b for ∞.

The mode of X is the value of X (if any) at which f (x) achieves a maximum.
Note the pdf could be multimodal.

Occasionally statisticians need to look at the expected value of higher-order powers of


(X − µ) than the second. For example, E((X − µ)3 ) is used to determine skewness,
which recall is a measure of a distribution’s departure from symmetry.

Example 3.3 Suppose that X is a random variable with probability density


function: (
kx(1 − x) for 0 ≤ x ≤ 1
f (x) =
0 otherwise.
Find k. Evaluate E(X) and Var(X).
Since the total probability under the curve is 1, we find that k = 6 because:
1 1 1
x2 x3
Z Z 
2 k
kx(1 − x) dx = k (x − x ) dx = k − = = 1.
0 0 2 3 0 6

It follows that the mean is:


Z 1 Z 1  3 1
2 2 3 x x4
E(X) = 6x (1 − x) dx = 6 (x − x ) dx = 6 − = 0.50.
0 0 3 4 0

We also have:
1 1 1
x4 x5
Z Z 
2 3 3 4
E(X ) = 6x (1 − x) dx = 6 (x − x ) dx = 6 − = 0.30
0 0 4 5 0

hence:
Var(X) = E(X 2 ) − (E(X))2 = 0.30 − (0.50)2 = 0.05.
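Integrals such as these can also be checked numerically. The short Python sketch below is purely illustrative (programming is not part of this course, and it assumes the scipy library is available); it integrates the pdf of Example 3.3 and reproduces k = 6, E(X) = 0.50 and Var(X) = 0.05.

    from scipy.integrate import quad

    f = lambda x: 6 * x * (1 - x)                  # pdf of Example 3.3 with k = 6

    total, _ = quad(f, 0, 1)                       # integrates to 1, confirming k = 6
    mean, _ = quad(lambda x: x * f(x), 0, 1)       # E(X) = 0.50
    ex2, _ = quad(lambda x: x**2 * f(x), 0, 1)     # E(X^2) = 0.30
    print(total, mean, ex2 - mean**2)              # 1.0  0.5  0.05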

3.4.2 The cumulative distribution function (cdf)


Just as for discrete distributions, we can think about cumulative probabilities in the
continuous setting.


Cumulative distribution function of a continuous distribution

If X is a continuous random variable and x ∈ R, then the cdf for X is the probability
that X is less than or equal to x, such that:
    F (x) = P (X ≤ x) = ∫_{−∞}^{x} f (t) dt.     (3.2)

(Note the cdf is written F (x), i.e. with a capital ‘F ’, while the pdf is written f (x),
i.e. with a lower case ‘f ’.)

Hence we see an important relationship between the cdf and the pdf, that is, we obtain
the cdf by integrating the pdf from the lower bound of X (−∞ in (3.2)) to x. Therefore,
this implies we can obtain the pdf by differentiating the cdf with respect to x.

Relationship between pdf and cdf

Let F (x) be the cdf of a continuous random variable. It follows that:

    f (x) = d/dx F (x) = F ′(x).

It is very important to remember these methods for obtaining the pdf (cdf) from the
cdf (pdf). For instance, if we needed to work out E(X), but only had the cdf, then we
would need to differentiate this first to obtain f (x) for use in ∫_{−∞}^{∞} x f (x) dx.
The following points are worth noting.

It is sometimes useful to think of any continuous-valued random variable X as
taking values on the whole of the real line, R, i.e. as if the sample space of
X were −∞ < x < ∞.
So, if the original sample space of definition, S, is smaller than R, it is always
possible to ‘extend’ this sample space by defining the (extended) pdf to be the
same as the original for values in S, and 0 everywhere else. To do so leaves all the
integrals we are interested in completely unchanged, and would mean that we could
always write −∞ as the lower end of the integration when defining F (x), instead of
worrying about two separate cases.

The events X = x and X < x are mutually exclusive, so by the additive law we
know that, if X is continuous:

F (x) = P (X ≤ x) = P (X < x) + P (X = x)
= P (X < x) + 0
= P (X < x).

It is important to realise that the equality F (x) = P (X < x) only holds for
continuous distributions.


3.5 Continuous uniform distribution


Suppose X has a continuous² uniform distribution from a to b, such that X is
equally likely to be in any of the length-one intervals between a and b, and also that it is
impossible for X to take a value outside this interval. We now define the pdf of X.

Continuous uniform pdf

If X has a uniform distribution over the continuous interval [a, b], then:

    f (x) = 1/(b − a)  for a ≤ x ≤ b, and f (x) = 0 otherwise.     (3.3)

We can easily check that (3.3) is a valid pdf since, clearly, f (x) ≥ 0 for all values of x,
and it integrates to 1 since:

    ∫_{−∞}^{∞} f (x) dx = ∫_{−∞}^{a} 0 dx + ∫_{a}^{b} 1/(b − a) dx + ∫_{b}^{∞} 0 dx = 0 + [x/(b − a)]_{a}^{b} + 0 = 1.

² We have previously encountered discrete uniform distributions in Chapter 2. For example, the score
when rolling a fair die.

Figure 3.2: The pdf of X when X has a uniform distribution over [0, 10], showing the
region denoting P (2 < X < 6).

Example 3.4 Suppose X has a continuous uniform distribution over [0, 10], such
that 1/(b − a) = 1/(10 − 0) = 0.1. We have:

    P (2 < X < 6) = ∫_2^6 0.1 dx = [0.1x]_2^6 = 0.4

and:

    P (X < 8) = ∫_0^8 0.1 dx = [0.1x]_0^8 = 0.8.

Of course, for this distribution these probabilities can simply be found geometrically
as areas of appropriate rectangles, as illustrated in Figure 3.2 for P (2 < X < 6).
Also, note that geometrically we can determine the median to be 5.

For the distribution function we have:

    F (x) = ∫_0^x f (t) dt = ∫_0^x 0.1 dt = [0.1t]_0^x = 0.1x    for 0 ≤ x ≤ 10.

In full we write this as:

    F (x) = 0       for x < 0
    F (x) = 0.1x    for 0 ≤ x ≤ 10
    F (x) = 1       for x > 10.

Figure 3.3 displays the cdf of X.

The (population/theoretical) mean is:

    µ = E(X) = ∫_0^{10} x f (x) dx = ∫_0^{10} 0.1x dx = [0.1x²/2]_0^{10} = 5.

The (population/theoretical) variance is:

    σ² = E(X²) − µ² = ∫_0^{10} 0.1x² dx − 5² = [0.1x³/3]_0^{10} − 25 = 8.33.

Figure 3.3: The cdf of X when X has a uniform distribution over [0, 10].
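As an optional, illustrative check (assuming Python with the scipy library is available; simple geometry gives the same answers), the probabilities and moments in Example 3.4 can be reproduced as follows.

    from scipy.stats import uniform

    X = uniform(loc=0, scale=10)      # uniform on [0, 10]: loc = a, scale = b - a

    print(X.cdf(6) - X.cdf(2))        # P(2 < X < 6) = 0.4
    print(X.cdf(8))                   # P(X < 8) = 0.8
    print(X.median(), X.mean())       # both 5.0
    print(X.var())                    # 8.333...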

A few points to note.

We can only have a continuous uniform distribution for an interval of finite length,
since any rectangle of infinite length would have infinite area, not an area of 1.

The ‘full’ version of the cdf in Example 3.4 is an example of a definition by cases,
which is an important technique where we use different rules for a function
depending on which variable values we are talking about.

The graph of F (x) has a minimum of 0 (for x ≤ a, and a = 0 in Example 3.4) and
a maximum of 1 (for x ≥ b, and b = 10 in Example 3.4), which it must have since it
is a (cumulative) probability so must satisfy axioms 1 and 2. Note the cdf is
non-decreasing, as it must be; otherwise this would suggest the possibility, for
u < v say, that P (X ≤ u) > P (X ≤ v), which implies that P (u < X ≤ v) < 0, i.e.
a negative probability. This is ‘illegal’ since it would violate axiom 1! Hence all cdfs
are non-decreasing functions bounded by 0 and 1.

3.6 Exponential distribution


The exponential distribution arises in reliability theory and queuing theory. For
example, in queuing theory we can model the distribution of interarrival times (if, as is
often assumed, arrivals are treated as having a Poisson distribution with a rate
parameter of λ). In this case, X is a positive-valued random variable following an
exponential distribution.

Exponential pdf

Let X be a non-negative continuous random variable. It follows an exponential
distribution if:

    f (x) = λe^{−λx}  for x ≥ 0 and λ > 0, and f (x) = 0 otherwise.

Example 3.5 Suppose λ = 3, then:

    P (3 < X < 5) = ∫_3^5 3e^{−3x} dx = [−e^{−3x}]_3^5 = e^{−9} − e^{−15} = 0.00012

and:

    P (X < 6) = ∫_0^6 3e^{−3x} dx = [−e^{−3x}]_0^6 = e^0 − e^{−18} = 1 − e^{−18} ≈ 1.

The distribution function can be calculated as follows:

    F (x) = ∫_0^x 3e^{−3t} dt = [−e^{−3t}]_0^x = e^0 − e^{−3x} = 1 − e^{−3x}    for x ≥ 0.

In full we write this as:

    F (x) = 0   for x < 0,    and    F (x) = 1 − e^{−3x}   for x ≥ 0.

The (population/theoretical) mean is:

    µ = E(X) = ∫_0^∞ x · 3e^{−3x} dx = [−xe^{−3x} − e^{−3x}/3]_0^∞ = 1/3.

The (population/theoretical) variance is:

    σ² = E(X²) − µ² = ∫_0^∞ x² · 3e^{−3x} dx − (1/3)² = 1/9.

Note these last two results were obtained using integration by parts (a
non-examinable technique).
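The values in Example 3.5 can be verified with a short sketch like the following (illustrative only, assuming scipy is available; note that scipy parameterises the exponential distribution by the scale 1/λ rather than by the rate λ).

    from scipy.stats import expon

    lam = 3
    X = expon(scale=1/lam)            # exponential with rate lambda = 3

    print(X.cdf(5) - X.cdf(3))        # P(3 < X < 5), about 0.00012
    print(X.cdf(6))                   # P(X < 6), approximately 1
    print(X.mean(), X.var())          # 1/3 and 1/9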


3.7 Normal distribution


The normal distribution³ is the most important distribution in statistics and it is
essential to much statistical theory and reasoning. It is, in a sense, the ‘parent’
distribution of all the sampling distributions which we shall meet later.
In order to get some feel for the normal distribution, let us consider the exercise of
constructing a histogram of people’s heights (assumed to be normally distributed).
Suppose we start with 100 people, chosen at random, and construct a histogram using
sufficient class intervals such that the histogram gives some representation of the data’s
distribution. This will be a fairly ‘ragged’ diagram, but useful nonetheless.
Now, suppose we increase our random sample to 500 and construct an appropriate
histogram for these observations, but using more class intervals now that we have more
data. This histogram will be smoother than the first, peaked in the centre, and roughly
symmetric about the centre. The normal distribution is emerging! If we continue this
exercise to random samples of 5,000, or even 50,000, then we will eventually arrive at a
very smooth bell-shaped curve as shown in Figure 3.4. Hence we can view the normal
distribution as the smooth limit of the basic histogram as the sample size becomes very
large.
Figure 3.4 represents the distribution of the population. It is conventional to adjust the
vertical scale so that the total area under the curve is 1, and so it is easy to view the
area under the curve as probability. The mathematical form of this curve is well-known
and can be used to compute areas, and hence probabilities. In due course we shall make
use of the New Cambridge Statistical Tables for this purpose.

Figure 3.4: The (standard) normal distribution.

³ Also referred to as the Gaussian distribution, after Carl Friedrich Gauss (1777–1855).


3.7.1 Relevance of the normal distribution


The normal distribution is relevant to the application of statistics for many reasons. A
few of these follow.

Many naturally-occurring phenomena can be modelled as following a normal
distribution. Examples include heights of people, diameters of bolts, weights of
animals etc.⁴

A very important point is that averages of sampled variables (discussed later),


indeed any functions of sampled variables, also have probability distributions. It
can be demonstrated, theoretically and empirically, that, providing the sample size
is reasonably large, the distribution of the sample mean, X̄, will be
(approximately) normal regardless of the distribution of the original variable. This
is known as the central limit theorem (CLT), which we will return to in greater
depth in Chapter 5.

The normal distribution is often used as the distribution of the error term in
standard statistical and econometric models such as linear regression. This
assumption can be, and should be, checked. We will see the distributional
assumption of normality applied in the context of analysis of variance (ANOVA) in
Chapter 8.

3.7.2 Consequences of the central limit theorem


The consequences of the CLT are two-fold.

A number of statistical methods which we use have a robustness property, i.e. it


does not matter for their validity what the true population distribution of the
variable being sampled is.

We are justified in assuming normality for statistics which are sample means, or
linear transformations of them.

The CLT was introduced above ‘providing the sample size is reasonably large’. In
practice a sample size of 30 or more is usually sufficient (and can be used as a
rule-of-thumb), although the distribution of X̄ may be normal for n much less than 30.
This depends on the distribution of the original (population) variable. If this population
distribution is in fact normal, then all sample means computed from it will be normal.
However, if the population distribution is very non-normal, such as the exponential,
then a sample size of (at least) 30 would be needed to justify normality.
⁴ Note the use of the word ‘modelled’. This is due to the distributional assumption of normality. A
normal random variable X is defined over the entire real line, i.e. −∞ < x < ∞, but we know a person
cannot have a negative height, even though the normal distribution has positive, non-zero probability
over negative values. Also, nobody is of infinite height (the world’s tallest man ever, Robert Wadlow,
was 272 cm), so clearly there is a finite upper bound to height, rather than ∞. Therefore, height does
not follow a true normal distribution, but it is a good enough approximation for modelling purposes.


3.7.3 Characteristics of the normal distribution


The pdf of the normal distribution takes the general form:

    f (x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).
The shape of this function is the bell-shaped curve as shown in Figure 3.4. Do be aware
that it involves two parameters – the mean, µ, and the variance, σ 2 .

Since the normal distribution is symmetric about µ, the distribution is centred at


µ. As a consequence of this symmetry, the mean is equal to the median. Also, since
the distribution peaks at µ, it is also equal to the mode. In principle, −∞ < µ < ∞.
The variance is σ 2 , hence the larger σ 2 , the larger the dispersion of the distribution.
Note that σ 2 > 0.

If X has a normal distribution with parameters µ and σ 2 , we denote this as


X ∼ N (µ, σ 2 ). Given the infinitely-many possible values for µ and σ 2 , and given that
a normal distribution is uniquely defined by these two parameters, there is an infinite
number of normal distributions due to the infinite combinations of values for µ and σ 2 .
The most important normal distribution is the special case when µ = 0 and σ 2 = 1. We
call this the standard normal distribution, denoted by Z, i.e. Z ∼ N (0, 1).
Tabulated probabilities which appear in statistical tables are for the standard normal
distribution.

3.7.4 Standard normal tables


We now discuss the determination of normal probabilities using standard statistical
tables. Relevant extracts of the New Cambridge Statistical Tables will be provided in
the examination. Here we focus on Table 4.

Standard normal probabilities

Table 4 of the New Cambridge Statistical Tables lists cumulative probabilities, which
can be represented as:

Φ(z) = F (z) = P (Z ≤ z) for z ≥ 0

using the conventional Z notation for a standard normal random variable.


Note that Table 4 uses the notation Φ(x). However, we will denote this as Φ(z) (for
z-score). Φ(x) and Φ(z) mean the same thing, of course.

We now consider some examples of working out probabilities from Z ∼ N (0, 1).

Example 3.6 If Z ∼ N (0, 1), what is P (Z > 1.2)?


When computing probabilities, it is useful to draw a quick sketch to visualise the
specific area of probability which we are after.


So, for P (Z > 1.2), we require the upper-tail probability shaded in red in Figure 3.5.
This is simply 1 − Φ(1.2), which is 0.1151 from Table 4.

Figure 3.5: The standard normal distribution with the total shaded area depicting the
value of P (Z > 1.2).

Example 3.7 If Z ∼ N (0, 1), what is P (−1.24 < Z < 1.86)?


Again, begin by producing a sketch.
The probability we require is the sum of the blue and red areas in Figure 3.6. Using
Table 4, which note only covers z ≥ 0, we proceed as follows.
The red area is given by:
P (0 ≤ Z ≤ 1.86) = P (Z ≤ 1.86) − P (Z ≤ 0)
= Φ(1.86) − Φ(0)
= 0.9686 − 0.50
= 0.4686.

Remember that Table 4 gives cumulative probabilities, so we have that


P (Z ≤ 1.86) = Φ(1.86). Also, since Z is symmetric about 0, Φ(0) = 0.50.
The blue area is given by:
P (−1.24 ≤ Z ≤ 0) = P (Z ≤ 0) − P (Z ≤ −1.24)
= Φ(0) − Φ(−1.24)
= 0.50 − (1 − 0.8925)
= 0.3925.
Note by symmetry of Z about µ = 0, P (Z ≤ −1.24) = P (Z ≥ 1.24) = 1 − Φ(1.24).
Hence P (−1.24 < Z < 1.86) = 0.4686 + 0.3925 = 0.8611.
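In the examination you must work from Table 4, but when studying it can be reassuring to confirm such values with software. A minimal sketch (assuming Python with scipy) for Examples 3.6 and 3.7 is:

    from scipy.stats import norm

    print(norm.sf(1.2))                        # P(Z > 1.2) = 0.1151 (sf is 1 - cdf)
    print(norm.cdf(1.86) - norm.cdf(-1.24))    # P(-1.24 < Z < 1.86) = 0.8611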


Figure 3.6: The standard normal distribution with the total shaded area depicting the
value of P (−1.24 < Z < 1.86).

3.7.5 The general normal distribution


We have already discussed that there exists an infinite number of different normal
distributions due to the infinite pairs of parameter values since −∞ < µ < ∞ and
σ 2 > 0. The good news is that Table 4 can be used to determine probabilities for any
normal random variable X, such that X ∼ N (µ, σ 2 ).
To do so, we need a little bit of magic – standardisation. This is a special (linear)
transformation which converts X ∼ N (µ, σ 2 ) into Z ∼ N (0, 1).

The transformation formula for standardisation

If X ∼ N (µ, σ 2 ), then the transformation:

    Z = (X − µ)/σ
creates a standard normal random variable, i.e. Z ∼ N (0, 1). So to standardise X
we subtract its mean and divide by its standard deviation.

To see why, first note that any linear transformation of a normal random variable is also
normally distributed. Therefore, as X is normal, so too is Z, since the standardisation
transformation is linear in X. It remains to show that standardisation results in a
random variable with a zero mean and a unit variance. This is easy to show and is
worth remembering.
Since X ∼ N (µ, σ²), then:

    E(Z) = E((X − µ)/σ) = (1/σ) E(X − µ) = (1/σ)(E(X) − µ) = (1/σ)(µ − µ) = 0.

This result exploits the fact that σ is a constant, hence it can be taken outside the
expectation operator. Turning to the variance:

    Var(Z) = Var((X − µ)/σ) = (1/σ²) Var(X − µ) = (1/σ²) Var(X) = (1/σ²) × σ² = 1.

This result uses the fact that we must square a constant when taking it outside the
‘Var’ operator.

Example 3.8 Suppose X ∼ N (5, 4). What is P (5.8 < X ≤ 7.0)?


We have:

    P (5.8 < X ≤ 7.0) = P ((5.8 − 5)/√4 < (X − 5)/√4 ≤ (7.0 − 5)/√4)
                      = P (0.40 < Z ≤ 1.00)
                      = Φ(1.00) − Φ(0.40)
                      = 0.8413 − 0.6554   (from Table 4)
                      = 0.1859.
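The same probability can be obtained either by standardising, as above, or directly from the N (5, 4) distribution. A brief illustrative sketch (assuming scipy is available):

    from scipy.stats import norm

    mu, sigma = 5, 2                                              # X ~ N(5, 4), so sigma = 2
    print(norm.cdf(7.0, mu, sigma) - norm.cdf(5.8, mu, sigma))    # 0.1859 (direct)
    print(norm.cdf(1.00) - norm.cdf(0.40))                        # 0.1859 (after standardising)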

3.7.6 Linear functions of normal random variables


We are often interested in linear functions of normal random variables. The following
basic results are useful in dealing with such situations.

Linear transformations of normal random variables

If X ∼ N (µ, σ²) and a and b are constants (with b ≠ 0), then:

    a + bX ∼ N (a + bµ, b²σ²).

Linear combinations of normal random variables

If X1 and X2 are independent normal random variables, such that X1 ∼ N (µ1, σ1²)
and X2 ∼ N (µ2, σ2²), then:

    X1 ± X2 ∼ N (µ1 ± µ2, σ1² + σ2²).

Note that the variances are added even when dealing with the difference between
independent random variables.

3.7.7 Transforming non-normal random variables


Even when the distribution of a random variable is clearly non-normal, a
transformation can frequently be found such that the transformed random variable is
normally distributed. This can be used, via the standard normal distribution, to
calculate probabilities for the original random variable.

Example 3.9 For asbestos fibres, if Y represents the length of a fibre, then the
distribution of Y is very positively-skewed and is definitely non-normal. However, if
we transform this to X = ln(Y ), then X is found to have a normal distribution. Any
Y for which such a transformation leads to a normal random variable is said to have
a log-normal distribution. Another common transformation which converts some
non-normal distributions into normal distributions is taking the square root.

The following example illustrates the ways in which, with an appropriate


transformation, we can calculate probabilities for the original distribution.

Example 3.10 Suppose that X = ln(Y ) and that X ∼ N (0.5, 0.16) and we seek
P (Y > 1). It follows that, using Table 4:

    P (Y > 1) = P (X > ln(1)) = P (X > 0) = P (Z > (0 − 0.5)/√0.16)
              = P (Z > −1.25)
              = Φ(1.25)
              = 0.8944.

3.8 Normal approximation to the binomial


In Chapter 2 we discussed using a Poisson approximation to a Bin(n, π) distribution,
since n! is hard to calculate for large n. Recall the necessary conditions for this to be a
‘good’ approximation were:

n should be greater than 30


π should be sufficiently extreme such that nπ < 10.

If these criteria are satisfied, then we use a Pois(nπ) approximating distribution.


An alternative approach to approximating desired binomial probabilities is the normal
approximation to the binomial. As with the Poisson approximation, we require n to be
‘large’, greater than 30, say. However, the normal approximation is more suitable for
non-extreme π. It can be shown (not here) that for large n and moderate π we have:

Bin(n, π) ≈ N (nπ, nπ(1 − π)).

Unfortunately, there is one small caveat. The binomial distribution is discrete, but the
normal distribution is continuous. To see why this is problematic, consider the following.
Suppose X ∼ Bin(40, 0.4). Since X is discrete, such that x = 0, 1, 2, . . . , 40, then:

P (X ≤ 4) = P (X ≤ 4.5) = P (X < 5)


since P (4 < X ≤ 4.5) = 0 and P (4.5 < X < 5) = 0 due to the ‘gaps’ in the probability
mass for this distribution. In contrast, if Y ∼ N (16, 9.6), then:

P (Y ≤ 4) < P (Y ≤ 4.5) < P (Y < 5)

since P (4 < Y < 4.5) > 0 and P (4.5 < Y < 5) > 0 because this is a continuous
distribution.
The accepted way to circumvent this problem is to use a continuity correction which
corrects for the effects of the transition from a discrete Bin(n, π) distribution to a
continuous N (nπ, nπ(1 − π)) distribution.

Continuity correction

This technique involves representing each discrete binomial value x, for 0 ≤ x ≤ n, by


the continuous interval (x − 0.5, x + 0.5). Great care is needed to determine which
x values are included in the required probability. Suppose we are approximating
X ∼ Bin(n, π) with Y ∼ N (nπ, nπ(1 − π)), then:

P (X < 4) = P (X ≤ 3) ⇒ P (Y < 3.5) (since 4 is excluded)


P (X ≤ 4) = P (X < 5) ⇒ P (Y < 4.5) (since 4 is included)
P (1 ≤ X < 6) = P (1 ≤ X ≤ 5) ⇒ P (0.5 < Y < 5.5) (since 1 to 5 are included).

Example 3.11 A fair coin is tossed 100 times. What is the probability of getting
more than 60 heads?
Let X be the number of heads, hence X ∼ Bin(100, 0.5). Here n > 30 and π is
moderate, hence a normal approximation to the binomial is appropriate. We use
Y ∼ N (50, 25) as the approximating distribution. So:

    P (X > 60) ≈ P (Y > 60.5) = P (Z > (60.5 − 50)/√25) = P (Z > 2.10) = 0.01786.
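For interest, the quality of the approximation can be judged against the exact binomial probability. A short sketch (assuming scipy is available) gives an exact value of about 0.0176 against the approximation of about 0.0179.

    from scipy.stats import binom, norm

    n, p = 100, 0.5
    exact = 1 - binom.cdf(60, n, p)                           # exact P(X > 60)
    approx = 1 - norm.cdf(60.5, n*p, (n*p*(1 - p))**0.5)      # continuity-corrected normal
    print(exact, approx)                                      # ~0.0176 vs ~0.0179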

3.9 Overview of chapter


This chapter has introduced continuous random variables. In particular, some common
families of continuous probability distributions have been presented. In addition to the
functional form of each of these distributions, important properties (such as the
expected value and variance) have been studied.

3.10 Key terms and concepts


Central limit theorem Continuity correction
Continuous variable Exponential distribution


Interval Normal distribution


Probability density function Standardisation
Transformation Uniform distribution

3.11 Sample examination questions


1. The random variable X has the probability density function given by:

       f (x) = kx²  for 0 < x < 1, and f (x) = 0 otherwise

where k > 0 is a constant.


(a) Find the value of k.

(b) Compute E(X) and Var(X).

2. The amount of coffee, C, dispensed into a coffee cup by a coffee machine follows a
normal distribution with mean 150 ml and standard deviation 10 ml. The coffee is
sold at the price of £1 per cup. However, the coffee cups are marked at the 137 ml
level, and any cup with coffee below this level will be given away free of charge.
The amounts of coffee dispensed in different cups are independent of each other.
(a) Find the probability that the total amount of coffee in 2 cups exceeds 280 ml.

(b) Find the probability that one cup is filled below the level of 137 ml.

(c) Find the expected income from selling one cup of coffee.

3. A random sample of n = 20 observations is drawn from a distribution with the


following probability density function:

       f (x) = 3x²  for 0 ≤ x ≤ 1, and f (x) = 0 otherwise.

Let Y denote the number of the 20 observations which are in the interval (0.5, 1).
Calculate E(Y ) and Var(Y ).

3.12 Solutions to Sample examination questions


1. (a) Since ∫ f (x) dx = 1, we have:

           ∫_0^1 kx² dx = [kx³/3]_0^1 = k/3 = 1

       and so k = 3.


   (b) We have:

           E(X) = ∫_0^1 x f (x) dx = ∫_0^1 3x³ dx = [3x⁴/4]_0^1 = 3/4

       and:

           E(X²) = ∫_0^1 x² f (x) dx = ∫_0^1 3x⁴ dx = [3x⁵/5]_0^1 = 3/5.

       Hence:

           Var(X) = E(X²) − (E(X))² = 3/5 − (3/4)² = 3/80 = 0.0375.

2. (a) The total amount of coffee in 2 cups, T , follows a normal distribution with a
mean of:
E(T ) = E(X1 + X2 ) = E(X1 ) + E(X2 ) = 150 + 150 = 300
and, due to independence, a variance of:
Var(T ) = Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) = (10)2 + (10)2 = 200.
       Hence T ∼ N (300, 200). Therefore:

           P (T > 280) = P (Z > (280 − 300)/√200) = P (Z > −1.41) = Φ(1.41) = 0.9207.

   (b) Since C ∼ N (150, 100), one cup to be filled below the level of 137 ml has
       probability:

           P (C < 137) = P (Z < (137 − 150)/√100) = P (Z < −1.30) = 1 − Φ(1.30) = 0.0968.

   (c) Let X denote the income from selling one cup of coffee. This is £0 if the cup is
       filled below the level of 137 ml (with probability 0.0968), and is £1 otherwise
       (with probability 1 − 0.0968 = 0.9032). Hence:

           X = x          0        1
           P (X = x)   0.0968   0.9032

       Hence:

           E(X) = Σ_{x=0}^{1} x p(x) = 0 × 0.0968 + 1 × 0.9032 = £0.9032.

3. The probability that an observation of X lies in (0.5, 1) is:

       P (0.5 < X < 1) = ∫_{0.5}^{1} 3x² dx = [x³]_{0.5}^{1} = 7/8 = 0.875.

Therefore, Y ∼ Bin(20, 0.875). Hence E(Y ) = 20 × 0.875 = 17.5 and:
Var(Y ) = nπ(1 − π) = 20 × 0.875 × 0.125 = 2.1875.

There are two kinds of statistics, the kind you look up and the kind you make
up.
(Rex Stout)

Chapter 4
Multivariate random variables

4.1 Synopsis of chapter


Almost all applications of statistical methods deal with several measurements on the
same, or connected, items. To think statistically about several measurements on a
randomly selected item, you must understand some of the concepts for joint
distributions of random variables.

4.2 Learning outcomes


After completing this chapter, you should be able to:

arrange the probabilities for a discrete bivariate distribution in tabular form


define marginal and conditional distributions, and determine them for a discrete
bivariate distribution
recall how to define and determine independence for two random variables
define and compute expected values for functions of two random variables and
demonstrate how to prove simple properties of expected values
provide the definition of covariance and correlation for two random variables and
calculate these.

4.3 Introduction
So far, we have considered univariate situations, that is one random variable at a time.
Now we will consider multivariate situations, that is two or more random variables at
once, and together.
In particular, we consider two somewhat different types of multivariate situations.

1. Several different variables – such as the height and weight of a person.


2. Several observations of the same variable, considered together – such as the heights
of all n people in a sample.

Suppose that X1 , X2 , . . . , Xn are random variables, then the vector:

    X = (X1, X2, . . . , Xn)′

is a multivariate random variable (here n-variate), also known as a random
vector. Its possible values are the vectors:

    x = (x1, x2, . . . , xn)′
where each xi is a possible value of the random variable Xi , for i = 1, 2, . . . , n.
The joint probability distribution of a multivariate random variable X is defined by
the possible values x, and their probabilities.
For now, we consider just the simplest multivariate case, a bivariate random variable
where n = 2. This is sufficient for introducing most of the concepts of multivariate
random variables.
For notational simplicity, we will use X and Y instead of X1 and X2 . A bivariate
random variable is then the pair (X, Y ).

Example 4.1 In this chapter, we consider the following example.


Discrete bivariate example – for a football match:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

4.4 Joint probability functions


When the random variables in (X1 , X2 , . . . , Xn ) are either all discrete or all continuous,
we also call the multivariate random variable either discrete or continuous, respectively.
For a discrete multivariate random variable, the joint probability distribution is
described by the joint probability function, defined as:
p(x1 , x2 , . . . , xn ) = P (X1 = x1 , X2 = x2 , . . . , Xn = xn )
for all vectors (x1 , x2 , . . . , xn ) of n real numbers. The value p(x1 , x2 , . . . , xn ) of the joint
probability function is itself a single number, not a vector.
In the bivariate case, this is:
p(x, y) = P (X = x, Y = y)
which we sometimes write as pX,Y (x, y) to make the random variables clear.

Example 4.2 Consider a randomly selected football match in the English Premier
League (EPL), and the two random variables:

X = the number of goals scored by the home team

Y = the number of goals scored by the visiting (away) team.

Suppose both variables have possible values 0, 1, 2 and 3 (to keep this example
simple, we have recorded the small number of scores of 4 or greater also as 3).


Consider the joint distribution of (X, Y ). We use probabilities based on data from
the 2009–10 EPL season.
Suppose the values of pX,Y (x, y) = p(x, y) = P (X = x, Y = y) are the following:

Y =y
X=x 0 1 2 3
0 0.100 0.031 0.039 0.031
1 0.100 0.146 0.092 0.015
2 0.085 0.108 0.092 0.023
3 0.062 0.031 0.039 0.006

and p(x, y) = 0 for all other (x, y).


Note that this satisfies the conditions for a probability function.

1. p(x, y) ≥ 0 for all (x, y).

2. Σ_{x=0}^{3} Σ_{y=0}^{3} p(x, y) = 0.100 + 0.031 + · · · + 0.006 = 1.000.

The joint probability function gives probabilities of values of (X, Y ), for example:

A 1–1 draw, which is the most probable single result, has probability

P (X = 1, Y = 1) = p(1, 1) = 0.146.

The match is a draw with probability:

P (X = Y ) = p(0, 0) + p(1, 1) + p(2, 2) + p(3, 3) = 0.344.

The match is won by the home team with probability:

P (X > Y ) = p(1, 0) + p(2, 0) + p(2, 1) + p(3, 0) + p(3, 1) + p(3, 2) = 0.425.

More than 4 goals are scored in the match with probability:

P (X + Y > 4) = p(2, 3) + p(3, 2) + p(3, 3) = 0.068.
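Each of these probabilities is simply a sum of selected entries of the table. If you wish to experiment, a minimal sketch using Python with numpy (assumed available; not required for this course) is:

    import numpy as np

    # joint pf p(x, y): rows are x = 0, 1, 2, 3 (home goals), columns are y (away goals)
    p = np.array([[0.100, 0.031, 0.039, 0.031],
                  [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023],
                  [0.062, 0.031, 0.039, 0.006]])
    x = np.arange(4)[:, None]          # x values as a column
    y = np.arange(4)[None, :]          # y values as a row

    print(p.sum())                     # 1.0, so this is a valid pf
    print(p[x == y].sum())             # P(X = Y)     = 0.344
    print(p[x > y].sum())              # P(X > Y)     = 0.425
    print(p[x + y > 4].sum())          # P(X + Y > 4) = 0.068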

4.5 Marginal distributions


Consider a multivariate discrete random variable X = (X1 , X2 , . . . , Xn ).
The marginal distribution of a subset of the variables in X is the (joint) distribution
of this subset. The joint pf of these variables (the marginal pf) is obtained by
summing the joint pf of X over the variables which are not included in the subset.


Example 4.3 Consider X = (X1, X2, X3, X4), and the marginal distribution of the
subset (X1, X2). The marginal pf of (X1, X2) is:

    p1,2(x1, x2) = P (X1 = x1, X2 = x2) = Σ_{x3} Σ_{x4} p(x1, x2, x3, x4)

where the sum is of the values of the joint pf of (X1 , X2 , X3 , X4 ) over all possible
values of X3 and X4 .

The simplest marginal distributions are those of individual variables in the multivariate
random variable.
The marginal pf is then obtained by summing the joint pf over all the other variables.
The resulting marginal distribution is univariate, and its pf is a univariate pf.

Marginal distributions for discrete bivariate distributions

For the bivariate distribution of (X, Y ) the univariate marginal distributions are
those of X and Y individually. Their marginal pfs are:
    pX(x) = Σ_y p(x, y)    and    pY (y) = Σ_x p(x, y).

Example 4.4 Continuing with the football example introduced in Example 4.2, the
joint and marginal probability functions are:

Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

and p(x, y) = pX (x) = pY (y) = 0 for all other (x, y).


For example:

    pX(0) = Σ_{y=0}^{3} p(0, y) = p(0, 0) + p(0, 1) + p(0, 2) + p(0, 3)
          = 0.100 + 0.031 + 0.039 + 0.031
          = 0.201.

Even for a multivariate random variable, expected values E(Xi ), variances Var(Xi ) and
medians of individual variables are obtained from the univariate (marginal)
distributions of Xi , as defined in Chapter 2.


Example 4.5 Consider again the football example.

The expected number of goals scored by the home team is:

    E(X) = Σ_x x pX(x) = 0 × 0.201 + 1 × 0.353 + 2 × 0.308 + 3 × 0.138 = 1.383.

The expected number of goals scored by the visiting team is:

    E(Y ) = Σ_y y pY (y) = 0 × 0.347 + 1 × 0.316 + 2 × 0.262 + 3 × 0.075 = 1.065.
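Continuing the illustrative numpy sketch from Example 4.2 (the array p of joint probabilities is repeated so that the snippet is self-contained), the marginal pfs and expected values follow by summing rows and columns:

    import numpy as np

    p = np.array([[0.100, 0.031, 0.039, 0.031], [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023], [0.062, 0.031, 0.039, 0.006]])
    vals = np.arange(4)

    p_x = p.sum(axis=1)        # marginal pf of X: [0.201, 0.353, 0.308, 0.138]
    p_y = p.sum(axis=0)        # marginal pf of Y: [0.347, 0.316, 0.262, 0.075]
    print(vals @ p_x)          # E(X) = 1.383
    print(vals @ p_y)          # E(Y) = 1.065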

4.6 Conditional distributions


Consider discrete variables X and Y , with joint pf p(x, y) = pX,Y (x, y) and marginal pfs
pX (x) and pY (y), respectively.

Conditional distributions of discrete bivariate distributions

Let x be one possible value of X, for which pX (x) > 0. The conditional
distribution of Y given that X = x is the discrete probability distribution with
the pf:

    pY|X(y | x) = P (Y = y | X = x) = P (X = x and Y = y) / P (X = x) = pX,Y (x, y) / pX(x)

for any value y.


This is the conditional probability function of Y given X = x.

Example 4.6 Recall that in the football example the joint and marginal pfs were:

Y =y
X=x 0 1 2 3 pX (x)
0 0.100 0.031 0.039 0.031 0.201
1 0.100 0.146 0.092 0.015 0.353
2 0.085 0.108 0.092 0.023 0.308
3 0.062 0.031 0.039 0.006 0.138
pY (y) 0.347 0.316 0.262 0.075 1.000

We can now calculate the conditional pf of Y given X = x for each x, i.e. of away
goals given home goals. For example:

    pY|X(y | 0) = pY|X(y | X = 0) = pX,Y (0, y) / pX(0) = pX,Y (0, y) / 0.201.

So, for example, pY |X (1 | 0) = pX,Y (0, 1)/0.201 = 0.031/0.201 = 0.154.


Calculating these for each value of x gives:

pY |X (y | x) when y is:
X=x 0 1 2 3 Sum
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.00
2 0.276 0.351 0.299 0.075 1.00
3 0.449 0.225 0.283 0.043 1.00

So, for example:

if the home team scores 0 goals, the probability that the visiting team scores 1
goal is pY |X (1 | 0) = 0.154

if the home team scores 1 goal, the probability that the visiting team wins the
match is pY |X (2 | 1) + pY |X (3 | 1) = 0.261 + 0.042 = 0.303.

4.6.1 Properties of conditional distributions


Each different value of x defines a different conditional distribution and conditional pf
pY |X (y | x). Each value of pY |X (y | x) is a conditional probability of the kind previously
defined. Defining events A = {Y = y} and B = {X = x}, then:

    P (A | B) = P (A ∩ B) / P (B) = P (Y = y and X = x) / P (X = x)
              = P (Y = y | X = x)
              = pX,Y (x, y) / pX(x)
              = pY|X(y | x).

A conditional distribution is itself a probability distribution, and a conditional pf is a
pf. Clearly, pY|X(y | x) ≥ 0 for all y, and:

    Σ_y pY|X(y | x) = (Σ_y pX,Y (x, y)) / pX(x) = pX(x) / pX(x) = 1.

The conditional distribution and pf of X given Y = y (for any y such that pY (y) > 0) is
defined similarly, with the roles of X and Y reversed:

    pX|Y (x | y) = pX,Y (x, y) / pY (y)

for any value x.

Conditional distributions are general and are not limited to the bivariate case. If X
and/or Y are vectors of random variables, the conditional pf of Y given X = x is:

    pY|X(y | x) = pX,Y(x, y) / pX(x)

where pX,Y(x, y) is the joint pf of the random vector (X, Y), and pX(x) is the marginal
pf of the random vector X.

4.6.2 Conditional mean and variance


Since a conditional distribution is a probability distribution, it also has a mean
(expected value) and variance (and median etc.).
These are known as the conditional mean and conditional variance, and are
denoted, respectively, by:

EY |X (Y | x) and VarY |X (Y | x).

Example 4.7 In the football example, we have:

    EY|X(Y | 0) = Σ_y y pY|X(y | 0) = 0 × 0.498 + 1 × 0.154 + 2 × 0.194 + 3 × 0.154 = 1.00.

So, if the home team scores 0 goals, the expected number of goals by the visiting
team is EY |X (Y | 0) = 1.00.
EY |X (Y | x) for x = 1, 2 and 3 are obtained similarly.
Here X is the number of goals by the home team, and Y is the number of goals by
the visiting team:

pY |X (y | x) when y is:
X=x 0 1 2 3 EY |X (Y | x)
0 0.498 0.154 0.194 0.154 1.00
1 0.283 0.414 0.261 0.042 1.06
2 0.276 0.351 0.299 0.075 1.17
3 0.449 0.225 0.283 0.043 0.92

Plots of the conditional means are shown in Figure 4.1.
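The conditional pfs and conditional means above are obtained by dividing each row of the joint pf table by its row total. An illustrative sketch (again assuming numpy is available) is:

    import numpy as np

    p = np.array([[0.100, 0.031, 0.039, 0.031], [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023], [0.062, 0.031, 0.039, 0.006]])

    p_y_given_x = p / p.sum(axis=1)[:, None]   # row x holds the conditional pf of Y given X = x
    print(p_y_given_x.round(3))                # matches the table above

    y = np.arange(4)
    print(p_y_given_x @ y)                     # E(Y | X = x): about [1.00, 1.06, 1.17, 0.92]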

4.7 Covariance and correlation


Suppose that the conditional distributions pY |X (y | x) of a random variable Y given
different values x of a random variable X are not all the same, i.e. the conditional
distribution of Y ‘depends on’ the value of X.
Therefore, there is said to be an association (or dependence) between X and Y .
If two random variables are associated (dependent), knowing the value of one (for
example, X) will help to predict the likely value of the other (for example, Y ).
We next consider two measures of association which are used to summarise the
strength of an association in a single number: covariance and correlation (scaled
covariance).

Figure 4.1: Conditional means for Example 4.7 (expected away goals E(Y | x) plotted
against home goals x).

4.7.1 Covariance

Definition of covariance

The covariance of two random variables X and Y is defined as:

Cov(X, Y ) = Cov(Y, X) = E((X − E(X))(Y − E(Y ))).

This can also be expressed as the more convenient formula:

Cov(X, Y ) = E(XY ) − E(X) E(Y ).

(Note that these involve expected values of products of two random variables, which
have not been defined yet.)

Properties of covariance

Suppose X and Y are random variables, and a, b, c and d are constants.

The covariance of a random variable with itself is the variance of the random
variable:
Cov(X, X) = E(XX) − E(X) E(X) = E(X 2 ) − (E(X))2 = Var(X).

The covariance of a random variable and a constant is 0:


Cov(a, X) = E(aX) − E(a) E(X) = a E(X) − a E(X) = 0.

The covariance of linear transformations of random variables is:


Cov(aX + b, cY + d) = ac Cov(X, Y ).


4.7.2 Correlation

Definition of correlation

The correlation of two random variables X and Y is defined as:


    Corr(X, Y ) = Corr(Y, X) = Cov(X, Y ) / √(Var(X) Var(Y )) = Cov(X, Y ) / (sd(X) sd(Y )).

When Cov(X, Y ) = 0, then Corr(X, Y ) = 0. When this is the case, we say that X
and Y are uncorrelated.

Correlation and covariance are measures of the strength of the linear (‘straight-line’)
association between X and Y .
The further the correlation is from 0, the stronger is the linear association. The most
extreme possible values of correlation are −1 and +1, which are obtained when Y is an
exact linear function of X.
Corr(X, Y ) = +1 when Y = aX + b with a > 0.
Corr(X, Y ) = −1 when Y = aX + b with a < 0.

If Corr(X, Y ) > 0, we say that X and Y are positively correlated.


If Corr(X, Y ) < 0, we say that X and Y are negatively correlated.

Example 4.8 Recall the joint pf pX,Y (x, y) in the football example:

Y =y
X=x 0 1 2 3
0 0 0 0 0
0.100 0.031 0.039 0.031
1 0 1 2 3
0.100 0.146 0.092 0.015
2 0 2 4 6
0.085 0.108 0.092 0.023
3 0 3 6 9
0.062 0.031 0.039 0.006

Here, for each combination of x and y, the upper entry of each cell is the value of the product xy and the lower entry is its probability pX,Y (x, y).
From these and their probabilities, we can derive the probability distribution of XY .
For example:

P (XY = 2) = pX,Y (1, 2) + pX,Y (2, 1) = 0.092 + 0.108 = 0.200.

The pf of the product XY is:

XY = xy 0 1 2 3 4 6 9
P (XY = xy) 0.448 0.146 0.200 0.046 0.092 0.062 0.006


Hence:

E(XY ) = 0 × 0.448 + 1 × 0.146 + 2 × 0.200 + · · · + 9 × 0.006 = 1.478.

From the marginal pfs pX (x) and pY (y) we get:

E(X) = 1.383, E(Y ) = 1.065

also:
E(X 2 ) = 2.827, E(Y 2 ) = 2.039
hence:
Var(X) = 2.827 − (1.383)2 = 0.9143
and:
Var(Y ) = 2.039 − (1.065)2 = 0.9048.
Therefore, the covariance of X and Y is:

    Cov(X, Y ) = E(XY ) − E(X) E(Y ) = 1.478 − 1.383 × 1.065 = 0.00511

and the correlation is:

    Corr(X, Y ) = Cov(X, Y ) / √(Var(X) Var(Y )) = 0.00511 / √(0.9143 × 0.9048) = 0.00562.

The numbers of goals scored by the home and visiting teams are very nearly
uncorrelated (i.e. not linearly associated).

4.8 Independent random variables


Two discrete random variables X and Y are associated if pY |X (y | x) depends on x.
What if it does not, i.e. what if:

    pY|X(y | x) = pX,Y (x, y) / pX(x) = pY (y)    for all x and y

so that knowing the value of X does not help to predict Y ? This implies that:

pX,Y (x, y) = pX (x) pY (y) for all x, y. (4.1)

X and Y are independent of each other if and only if (4.1) is true.


Independent random variables

In general, suppose that X1 , X2 , . . . , Xn are discrete random variables. These are


independent if and only if their joint pf is:

p(x1 , x2 , . . . , xn ) = p1 (x1 ) p2 (x2 ) · · · pn (xn )

for all numbers x1 , x2 , . . . , xn , where p1 (x1 ), p2 (x2 ), . . . , pn (xn ) are the univariate
marginal pfs of X1 , X2 , . . . , Xn , respectively.
Similarly, continuous random variables X1 , X2 , . . . , Xn are independent if and only
if their joint pdf is:

f (x1 , x2 , . . . , xn ) = f1 (x1 ) f2 (x2 ) · · · fn (xn )

for all x1 , x2 , . . . , xn , where f1 (x1 ), f2 (x2 ), . . . , fn (xn ) are the univariate marginal pdfs
of X1 , X2 , . . . , Xn , respectively.

If two random variables are independent, they are also uncorrelated, i.e. we have:

Cov(X, Y ) = 0 and Corr(X, Y ) = 0.

The reverse is not true, i.e. two random variables can be dependent even when their
correlation is 0. This can happen when the dependence is non-linear.

Example 4.9 The football example is an instance of this. The conditional


distributions pY |X (y | x) are clearly not all the same, but the correlation is very
nearly 0 (see Example 4.8).

4.8.1 Joint distribution of independent random variables


When random variables are independent, we can easily derive their joint pf or pdf as
the product of their univariate marginal distributions. This is particularly simple if all
the marginal distributions are the same.

Example 4.10 Suppose that X1 , X2 , . . . , Xn are independent, and each of them


follows the Poisson distribution with the same mean λ. Therefore, the marginal pf of
each Xi is:

    p(xi) = e^{−λ} λ^{xi} / xi!

and the joint pf of the random variables is:

    p(x1, x2, . . . , xn) = p(x1) p(x2) · · · p(xn) = ∏_{i=1}^{n} p(xi) = ∏_{i=1}^{n} e^{−λ} λ^{xi} / xi! = e^{−nλ} λ^{Σ_i xi} / ∏_i xi!.


Example 4.11 For a continuous example, suppose that X1 , X2 , . . . , Xn are


independent, and each of them follows a normal distribution with the same mean µ
and same variance σ 2 . Therefore, the marginal pdf of each Xi is:

    f (xi) = (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))

and the joint pdf of the variables is:

    f (x1, x2, . . . , xn) = f (x1) f (x2) · · · f (xn) = ∏_{i=1}^{n} f (xi)
                          = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(xi − µ)²/(2σ²))
                          = (1/(√(2πσ²))^n) exp(−(1/(2σ²)) Σ_{i=1}^{n} (xi − µ)²).

4.9 Sums of random variables


Suppose X1 , X2 , . . . , Xn are random variables. We now go from the multivariate setting
back to the univariate setting, by considering univariate functions of X1 , X2 , . . . , Xn . In
particular, we consider sums like:
    Σ_{i=1}^{n} ai Xi + b = a1 X1 + a2 X2 + · · · + an Xn + b     (4.2)

where a1 , a2 , . . . , an and b are constants.


Each such sum is itself a univariate random variable. The probability distribution of
such a function depends on the joint distribution of X1 , X2 , . . . , Xn .

Example 4.12 In the football example, the sum Z = X + Y is the total number of
goals scored in a match.
Its probability function is obtained from the joint pf pX,Y (x, y), that is:

Z=z 0 1 2 3 4 5 6
pZ (z) 0.100 0.131 0.270 0.293 0.138 0.062 0.006

For example, pZ(1) = pX,Y (0, 1) + pX,Y (1, 0) = 0.031 + 0.100 = 0.131. The mean of Z
is then E(Z) = Σ_z z pZ(z) = 2.448.
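The pf of Z = X + Y is found by adding the joint probabilities over all cells with the same total. As a sketch (assuming numpy is available):

    import numpy as np

    p = np.array([[0.100, 0.031, 0.039, 0.031], [0.100, 0.146, 0.092, 0.015],
                  [0.085, 0.108, 0.092, 0.023], [0.062, 0.031, 0.039, 0.006]])
    x = np.arange(4)[:, None]
    y = np.arange(4)[None, :]

    p_z = np.array([p[x + y == z].sum() for z in range(7)])
    print(p_z.round(3))                 # [0.100, 0.131, 0.270, 0.293, 0.138, 0.062, 0.006]
    print(np.arange(7) @ p_z)           # E(Z) = 2.448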

However, what can we say about such distributions in general, in cases where we cannot
derive them as easily?


4.9.1 Expected values and variances of sums of random variables

We state, without proof, the following important result.
If X1 , X2 , . . . , Xn are random variables with means E(X1 ), E(X2 ), . . . , E(Xn ),
respectively, and a1 , a2 , . . . , an and b are constants, then:
    E(Σ_{i=1}^{n} ai Xi + b) = E(a1 X1 + a2 X2 + · · · + an Xn + b)
                             = a1 E(X1) + a2 E(X2) + · · · + an E(Xn) + b
                             = Σ_{i=1}^{n} ai E(Xi) + b.     (4.3)

Two simple special cases of this, when n = 2, are:

E(X + Y ) = E(X) + E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = a2 = 1 and b = 0

E(X − Y ) = E(X) − E(Y ), obtained by choosing X1 = X, X2 = Y , a1 = 1, a2 = −1 and b = 0.

Example 4.13 In the football example, we have previously shown that


E(X) = 1.383, E(Y ) = 1.065 and E(X + Y ) = 2.448. So E(X + Y ) = E(X) + E(Y ),
as the theorem claims.

If X1, X2, . . . , Xn are random variables with variances Var(X1), Var(X2), . . . , Var(Xn),
respectively, and covariances Cov(Xi, Xj) for i ≠ j, and a1, a2, . . . , an and b are
constants, then:

    Var(Σ_{i=1}^{n} ai Xi + b) = Σ_{i=1}^{n} ai² Var(Xi) + 2 ΣΣ_{i<j} ai aj Cov(Xi, Xj).     (4.4)

In particular, for n = 2:

Var(X + Y ) = Var(X) + Var(Y ) + 2 × Cov(X, Y )


Var(X − Y ) = Var(X) + Var(Y ) − 2 × Cov(X, Y ).

If X1, X2, . . . , Xn are independent random variables, then Cov(Xi, Xj) = 0 for all i ≠ j,
and so (4.4) simplifies to:

    Var(Σ_{i=1}^{n} ai Xi) = Σ_{i=1}^{n} ai² Var(Xi).     (4.5)

In particular, for n = 2, when X and Y are independent:

Var(X + Y ) = Var(X) + Var(Y )


Var(X − Y ) = Var(X) + Var(Y ).


These results also hold whenever Cov(Xi , Xj ) = 0 for all i 6= j, even if the random
variables are not independent.

4.9.2 Distributions of sums of random variables


We now know the expected value and variance of the sum:
a1 X1 + a2 X2 + · · · + an Xn + b
whatever the joint distribution of X1 , X2 , . . . , Xn . This is usually all we can say about
the distribution of this sum.
In particular, the form of the distribution of the sum (i.e. its pf/pdf) depends on the
joint distribution of X1 , X2 , . . . , Xn , and there are no simple general results about that.
For example, even if X and Y have distributions from the same family, the distribution
of X + Y is often not from that same family. However, such results are available for a
few special cases.

Sums of independent binomial and Poisson random variables

Suppose X1, X2, . . . , Xn are random variables, and we consider the unweighted sum:

    Σ_{i=1}^{n} Xi = X1 + X2 + · · · + Xn.

That is, the general sum given by (4.2), with a1 = a2 = · · · = an = 1 and b = 0.

The following results hold when the random variables X1, X2, . . . , Xn are independent,
but not otherwise.

If Xi ∼ Bin(ni, π), then Σ_i Xi ∼ Bin(Σ_i ni, π).

If Xi ∼ Pois(λi), then Σ_i Xi ∼ Pois(Σ_i λi).

Application to the binomial distribution

An easy proof that the mean and variance of X ∼ Bin(n, π) are E(X) = nπ and
Var(X) = nπ(1 − π) is as follows.

1. Let Z1 , Z2 , . . . , Zn be independent random variables, each distributed as


Zi ∼ Bernoulli(π) = Bin(1, π).
2. It is easy to show that E(Zi ) = π and Var(Zi ) = π(1 − π) for each i = 1, 2, . . . , n.
3. Also, Σ_{i=1}^{n} Zi = X ∼ Bin(n, π) by the result above for sums of independent binomial
   random variables.

4. Therefore, using the results (4.3) and (4.5), we have:

    E(X) = Σ_{i=1}^{n} E(Zi) = nπ    and    Var(X) = Σ_{i=1}^{n} Var(Zi) = nπ(1 − π).


Sums of normally distributed random variables

All sums (linear combinations) of normally distributed random variables are also
normally distributed.
Suppose X1, X2, . . . , Xn are normally distributed random variables, with Xi ∼ N(µi, σi²)
for i = 1, 2, . . . , n, and a1, a2, . . . , an and b are constants, then:

    Σ_{i=1}^{n} ai Xi + b ∼ N(µ, σ²)

where:

    µ = Σ_{i=1}^{n} ai µi + b    and    σ² = Σ_{i=1}^{n} ai² σi² + 2 ΣΣ_{i<j} ai aj Cov(Xi, Xj).

If the Xi s are independent (or just uncorrelated), i.e. if Cov(Xi, Xj) = 0 for all i ≠ j,
the variance simplifies to σ² = Σ_{i=1}^{n} ai² σi².

Example 4.14 Suppose that in the population of English people aged 16 or over:

the heights of men (in cm) follow a normal distribution with mean 174.9 and
standard deviation 7.39
the heights of women (in cm) follow a normal distribution with mean 161.3 and
standard deviation 6.85.

Suppose we select one man and one woman at random and independently of each
other. Denote the man’s height by X and the woman’s height by Y . What is the
probability that the man is at most 10 cm taller than the woman?
In other words, what is the probability that the difference between X and Y is at
most 10?
Since X and Y are independent we have:

    D = X − Y ∼ N(µX − µY , σX² + σY²) = N(174.9 − 161.3, (7.39)² + (6.85)²) = N(13.6, (10.08)²).

The probability we need is:

    P (D ≤ 10) = P ((D − 13.6)/10.08 ≤ (10 − 13.6)/10.08)
               = P (Z ≤ −0.36)
               = P (Z ≥ 0.36)
               = 0.3594
using Table 4 of the New Cambridge Statistical Tables.
The probability that a randomly selected man is at most 10 cm taller than a
randomly selected woman is about 0.3594.
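The same probability can be obtained without rounding the z-value. An illustrative sketch (assuming scipy is available) gives approximately 0.36, in line with the tabulated 0.3594:

    from scipy.stats import norm

    mu_d = 174.9 - 161.3                     # mean of D = X - Y
    sd_d = (7.39**2 + 6.85**2) ** 0.5        # standard deviation of D, about 10.08
    print(norm.cdf(10, mu_d, sd_d))          # P(D <= 10), about 0.36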


4.10 Overview of chapter


This chapter has introduced how to deal with more than one random variable at a time.
Focusing mainly on discrete bivariate distributions, the relationships between joint,
marginal and conditional distributions were explored. Sums of random variables
concluded the chapter.

4.11 Key terms and concepts


Association Bivariate
Conditional distribution Conditional mean
Conditional variance Correlation
Covariance Dependence
Independence Joint probability distribution
Joint probability (density) function Marginal distribution
Multivariate Random vector
Uncorrelated Univariate

4.12 Sample examination questions


1. Consider two random variables, X and Y . They both take the values −1, 0 and 1.
The joint probabilities for each pair of values, (x, y), are given in the following
table.

X = −1 X=0 X=1
Y = −1 0.09 0.16 0.15
Y =0 0.09 0.08 0.03
Y =1 0.12 0.16 0.12

(a) Determine the marginal distributions and calculate the expected values of X
and Y , respectively.

(b) Calculate the covariance of the random variables X and Y .

(c) Calculate E(X | Y = 0) and E(X | X + Y = 1).

(d) Define U = |X| and V = Y . Calculate E(U ) and the covariance of U and V .
Are U and V correlated?

2. Suppose X and Y are two independent random variables with the following
probability distributions:

X = x        −1     0     1             Y = y        −1     0     1
P (X = x)   0.30  0.40  0.30    and     P (Y = y)   0.40  0.20  0.40


The random variables S and T are defined as:

S = X² + Y²    and    T = X + Y.

(a) Construct the table of the joint probability distribution of S and T .

(b) Calculate the following quantities:


i. Var(T ), given that E(T ) = 0.

ii. Cov(S, T ).

iii. E(S | T = 0).

(c) Are S and T uncorrelated? Are S and T independent? Justify your answers.

4.13 Solutions to Sample examination questions


1. (a) The marginal distribution of X is:

X −1 0 1
pX (x) 0.30 0.40 0.30
The marginal distribution of Y is:
Y −1 0 1
pY (y) 0.40 0.20 0.40
Hence:

    E(X) = Σ_x x pX(x) = (−1 × 0.30) + (0 × 0.40) + (1 × 0.30) = 0

and:

    E(Y ) = Σ_y y pY (y) = (−1 × 0.40) + (0 × 0.20) + (1 × 0.40) = 0.

(b) We have:

E(XY ) = (−1 × −1 × 0.09) + (−1 × 1 × 0.12) + (1 × −1 × 0.15) + (1 × 1 × 0.12)


= 0.09 − 0.12 − 0.15 + 0.12
= −0.06.

Therefore:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = −0.06 − 0 × 0 = −0.06.


(c) We have P (Y = 0) = 0.09 + 0.08 + 0.03 = 0.20, hence:

    P (X = −1 | Y = 0) = 0.09/0.20 = 0.45
    P (X = 0 | Y = 0) = 0.08/0.20 = 0.40
    P (X = 1 | Y = 0) = 0.03/0.20 = 0.15

and therefore:

    E(X | Y = 0) = −1 × 0.45 + 0 × 0.40 + 1 × 0.15 = −0.30.

We also have P (X + Y = 1) = 0.16 + 0.03 = 0.19, hence:

    P (X = 0 | X + Y = 1) = 0.16/0.19 = 16/19
    P (X = 1 | X + Y = 1) = 0.03/0.19 = 3/19

and therefore:

    E(X | X + Y = 1) = 0 × 16/19 + 1 × 3/19 = 3/19 = 0.1579.

(d) Here is the table of joint probabilities:

U =0 U =1
V = −1 0.16 0.24
V =0 0.08 0.12
V =1 0.16 0.24

We then have that P (U = 0) = 0.16 + 0.08 + 0.16 = 0.40 and also that
P (U = 1) = 1 − P (U = 0) = 0.60. Also, we have that P (V = −1) = 0.40,
P (V = 0) = 0.20 and P (V = 1) = 0.40. So:

E(U ) = 0 × 0.40 + 1 × 0.60 = 0.60

E(V ) = −1 × 0.40 + 0 × 0.20 + 1 × 0.40 = 0

and:
E(U V ) = −1 × 1 × 0.24 + 1 × 1 × 0.24 = 0.

Hence Cov(U, V ) = E(U V ) − E(U )E(V ) = 0 − 0.60 × 0 = 0. Since the


covariance is zero, so is the correlation coefficient, therefore U and V are
uncorrelated.


2. (a) The joint probability distribution of S and T is:


S
0 1 2
−2 0 0 0.12
−1 0 0.22 0
T 0 0.08 0 0.24
1 0 0.22 0
2 0 0 0.12

(b) i. Since E(T ) = 0, we have:

        Var(T ) = E(T²) = Σ_{t=−2}^{2} t² p(t)
                = (−2)² × 0.12 + (−1)² × 0.22 + 0² × 0.32 + 1² × 0.22 + 2² × 0.12
                = 1.4.

    ii. We have that:

        E(ST ) = Σ_{s=0}^{2} Σ_{t=−2}^{2} st p(s, t) = (−4 × 0.12) + (−1 × 0.22) + (1 × 0.22) + (4 × 0.12) = 0.

        Since E(T ) = 0, then:

        Cov(S, T ) = E(ST ) − E(S) E(T ) = E(ST ) = 0.

    iii. We have:

        E(S | T = 0) = Σ_{s=0}^{2} s pS|T(s | t = 0) = 0 × 0.08/0.32 + 2 × 0.24/0.32 = 1.5.

(c) The random variables S and T are uncorrelated, since Cov(S, T ) = 0. However:

P (T = −2) = 0.12 and P (S = 0) = 0.08 ⇒ P (T = −2) P (S = 0) = 0.0096

but:
P ({T = −2} ∩ {S = 0}) = 0 ≠ P (T = −2) P (S = 0)
which is sufficient to show that S and T are not independent.

Statistics: the mathematical theory of ignorance.


(Morris Kline)

Chapter 5
Sampling distributions of statistics

5.1 Synopsis of chapter


This chapter considers the idea of sampling and the concept of a sampling distribution
for a statistic (such as a sample mean) which must be understood by all users of
statistics.

5.2 Learning outcomes


After completing this chapter, you should be able to:

demonstrate how sampling from a population results in a sampling distribution for


a statistic

prove and apply the results for the mean and variance of the sampling distribution
of the sample mean when a random sample is drawn with replacement

state the central limit theorem and recall when the limit is likely to provide a good
approximation to the distribution of the sample mean.

5.3 Introduction
Suppose we have a sample of n observations of a random variable X:

{X1 , X2 , . . . , Xn }.

We have already stated that in statistical inference each individual observation Xi is


regarded as a value of a random variable X, with some probability distribution (that is,
the population distribution).
In this chapter we discuss how we define and work with:

the joint distribution of the whole sample {X1 , X2 , . . . , Xn }, treated as a


multivariate random variable

distributions of univariate functions of {X1 , X2 , . . . , Xn } (statistics).


5.4 Random samples


Many of the results discussed here hold for many (or even all) probability distributions,
not just for some specific distributions.
It is then convenient to use generic notation.

We use f (x) to denote both the pdf of a continuous random variable, and the pf of
a discrete random variable.

The parameter(s) of a distribution are generally denoted as θ. For example, for the
Poisson distribution θ stands for λ, and for the normal distribution θ stands for
(µ, σ 2 ).

Parameters are often included in the notation: f (x; θ) denotes the pf/pdf of a
distribution with parameter(s) θ, and F (x; θ) is its cdf.

For simplicity, we may often use phrases like ‘distribution f (x; θ)’ or ‘distribution
F (x; θ)’ when we mean ‘distribution with the pf/pdf f (x; θ)’ and ‘distribution with the
cdf F (x; θ)’, respectively.
The simplest assumptions about the joint distribution of the sample are as follows.

1. {X1 , X2 , . . . , Xn } are independent random variables.

2. {X1 , X2 , . . . , Xn } are identically distributed random variables. Each Xi has the


same distribution f (x; θ), with the same value of the parameter(s) θ.

The random variables {X1 , X2 , . . . , Xn } are then called:

independent and identically distributed (IID) random variables from the


distribution (population) f (x; θ)

a random sample of size n from the distribution (population) f (x; θ).

We will assume this most of the time from now. So you will see many examples and
questions which begin something like:

‘Let {X1 , X2 , . . . , Xn } be a random sample from a normal distribution with


mean µ and variance σ 2 . . .’.

5.4.1 Joint distribution of a random sample


The joint probability distribution of the random variables in a random sample is an
important quantity in statistical inference. It is known as the likelihood function.
You will hear more about it in the chapter on point estimation.
For a random sample the joint distribution is easy to derive, because the Xi s are
independent.


The joint pf/pdf of a random sample is:


f(x1, x2, . . . , xn) = f(x1; θ) f(x2; θ) · · · f(xn; θ) = ∏_{i=1}^{n} f(xi; θ).

Other assumptions about random samples

Not all problems can be seen as IID random samples of a single random variable. There
are other possibilities, which you will see more of in the future.

IID samples from multivariate population distributions. For example, a sample of pairs (Xi, Yi), with the joint distribution ∏_{i=1}^{n} f(xi, yi).

Independent but not identically distributed observations. For example, observations (Xi, Yi) where Yi (the 'response variable') is treated as random, but Xi (the 'explanatory variable') is not. Hence the joint distribution of the Yi s is ∏_{i=1}^{n} fY|X(yi | xi; θ), where fY|X(y | x; θ) is the conditional distribution of Y given X.
This is the starting point of regression modelling (covered in EC2020 Elements
of econometrics).

Non-independent observations. For example, a time series {Y1 , Y2 , . . . , YT } where


i = 1, 2, . . . , T are successive time points. The joint distribution of the series is, in
general:

f (y1 ; θ) f (y2 | y1 ; θ) f (y3 | y1 , y2 ; θ) · · · f (yT | y1 , y2 , . . . , yT −1 ; θ).

Random samples and their observed values

Here we treat {X1 , X2 , . . . , Xn } as random variables. Therefore, we consider what values


{X1 , X2 , . . . , Xn } might have in different samples.
Once a real sample is actually observed, the values of {X1 , X2 , . . . , Xn } in that specific
sample are no longer random variables, but realised values of random variables, i.e.
known numbers.
Sometimes this distinction is emphasised in the notation by using:

X1 , X2 , . . . , Xn for the random variables

x1 , x2 , . . . , xn for the observed values.

5.5 Statistics and their sampling distributions


A statistic is a known function of the random variables {X1 , X2 , . . . , Xn } in a random
sample.


Example 5.1 All of the following are statistics:


the sample mean X̄ = ∑_{i=1}^{n} Xi/n

the sample variance S² = ∑_{i=1}^{n} (Xi − X̄)²/(n − 1) and standard deviation S = √S²

the sample median, quartiles, minimum, maximum etc.

quantities such as:

∑_{i=1}^{n} Xi²   and   X̄/(S/√n).

Here we focus on single (univariate) statistics. More generally, we could also consider
vectors of statistics, i.e. multivariate statistics.

5.5.1 Sampling distribution of a statistic

A (simple) random sample is modelled as a sequence of IID random variables. A


statistic is a function of these random variables, so it is also a random variable, with a
distribution of its own.
In other words, if we collected several random samples from the same population, the
values of a statistic would not be the same from one sample to the next, but would vary
according to some probability distribution.
The sampling distribution is the probability distribution of the values which the
statistic would have in a large number of samples collected (independently) from the
same population.

Example 5.2 Suppose we collect a random sample of size n = 20 from a normal


population (distribution) X ∼ N (5, 1).
Consider the following statistics:

sample mean X̄, sample variance S 2 , and maxX = max(X1 , X2 , . . . , Xn ).

Here is one such random sample (with values rounded to 2 decimal places):
6.28 5.22 4.19 3.56 4.15 4.11 4.03 5.81 5.43 6.09
4.98 4.11 5.55 3.95 4.97 5.68 5.66 3.37 4.98 6.58
For this random sample, the values of our statistics are:

x̄ = 4.94

s2 = 0.90

maxx = 6.58.


Here is another such random sample (with values rounded to 2 decimal places):
5.44 6.14 4.91 5.63 3.89 4.17 5.79 5.33 5.09 3.90
5.47 6.62 6.43 5.84 6.19 5.63 3.61 5.49 4.55 4.27
For this sample, the values of our statistics are:

x̄ = 5.22 (the first sample had x̄ = 4.94)

s2 = 0.80 (the first sample had s2 = 0.90)

maxx = 6.62 (the first sample had maxx = 6.58).

How to derive a sampling distribution?

The sampling distribution of a statistic is the distribution of the values of the statistic
in (infinitely) many repeated samples. However, typically we only have one sample
which was actually observed. Therefore, the sampling distribution seems like an
essentially hypothetical concept.
Nevertheless, it is possible to derive the forms of sampling distributions of statistics
under different assumptions about the sampling schemes and population distribution
f (x; θ).
There are two main ways of doing this.

Exactly or approximately through mathematical derivation. This is the most


convenient way for subsequent use, but is not always easy.
With simulation, i.e. by using a computer to generate (artificial) random samples
from a population distribution of a known form.

Example 5.3 Consider again a random sample of size n = 20 from the population
X ∼ N (5, 1), and the statistics X̄, S 2 and maxX .

We first consider deriving the sampling distributions of these by approximation


through simulation.

Here a computer was used to draw 10,000 independent random samples of


n = 20 from N (5, 1), and the values of X̄, S 2 and maxX for each of these
random samples were recorded.

Figures 5.1, 5.2 and 5.3 show histograms of the statistics for these 10,000
random samples.

We now consider deriving the exact sampling distribution. Here this is possible. For
a random sample of size n from N (µ, σ 2 ) we have:

(a) X̄ ∼ N (µ, σ 2 /n)

(b) (n − 1)S²/σ² ∼ χ²_{n−1}


(c) the sampling distribution of Y = maxX has the following pdf:

fY (y) = n(FX (y))n−1 fX (y)

where FX (x) and fX (x) are the cdf and pdf of X ∼ N (µ, σ 2 ), respectively.

Curves of the densities of these distributions are also shown in Figures 5.1, 5.2 and
5.3.


Figure 5.1: Simulation-generated sampling distribution of X̄ to accompany Example 5.3.
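The simulation in Example 5.3 is straightforward to reproduce. The following R sketch (not part of the original example; the seed and plotting choices are arbitrary) draws 10,000 samples of size n = 20 from N(5, 1), records the three statistics, and compares the simulated distribution of X̄ with the exact N(5, 1/20) density.

# Sketch: approximate the sampling distributions of Example 5.3 by simulation
set.seed(123)                        # arbitrary seed for reproducibility
n <- 20
reps <- 10000
xbar <- s2 <- maxx <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n, mean = 5, sd = 1)    # one random sample from N(5, 1)
  xbar[r] <- mean(x)
  s2[r]   <- var(x)
  maxx[r] <- max(x)
}
hist(xbar, freq = FALSE)                               # simulated sampling distribution of the mean
curve(dnorm(x, mean = 5, sd = sqrt(1/20)), add = TRUE) # exact N(5, 1/20) density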

5.6 Sample mean from a normal population


Consider one very common statistic, the sample mean:
X̄ = (1/n) ∑_{i=1}^{n} Xi = (1/n)X1 + (1/n)X2 + · · · + (1/n)Xn.

What is the sampling distribution of X̄?


We know from Section 4.9.1 that for independent {X1, X2, . . . , Xn} from any distribution:

E(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai E(Xi)

and:

Var(∑_{i=1}^{n} ai Xi) = ∑_{i=1}^{n} ai² Var(Xi).

Figure 5.2: Simulation-generated sampling distribution of S 2 to accompany Example 5.3.

Figure 5.3: Simulation-generated sampling distribution of maxX to accompany Example 5.3.


For a random sample, all Xi s are independent and E(Xi) = E(X) is the same for all of them, since the Xi s are identically distributed. X̄ = ∑_i Xi/n is of the form ∑_i ai Xi, with ai = 1/n for all i = 1, 2, . . . , n.
Therefore:

E(X̄) = ∑_{i=1}^{n} (1/n) E(X) = n × (1/n) E(X) = E(X)

and:

Var(X̄) = ∑_{i=1}^{n} (1/n²) Var(X) = n × (1/n²) Var(X) = Var(X)/n.

So the mean and variance of X̄ are E(X) and Var(X)/n, respectively, for a random
sample from any population distribution of X. What about the form of the sampling
distribution of X̄?
This depends on the distribution of X, and is not generally known. However, when the
distribution of X is normal, we do know that the sampling distribution of X̄ is also
normal.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a normal distribution with
mean µ and variance σ 2 , then:

X̄ ∼ N(µ, σ²/n).

For example, the pdf drawn on the histogram in Figure 5.1 is that of N (5, 1/20).
We have E(X̄) = E(X) = µ.

In an individual sample, x̄ is not usually equal to µ, the expected value of the


population.

However, over repeated samples the values of X̄ are centred at µ.


We also have Var(X̄) = Var(X)/n = σ²/n, and hence also sd(X̄) = σ/√n.

The variation of the values of X̄ in different samples (the sampling variance) is


large when the population variance of X is large.

More interestingly, the sampling variance gets smaller when the sample size n
increases.

In other words, when n is large the distribution of X̄ is more tightly concentrated


around µ than when n is small.

Figure 5.4 shows sampling distributions of X̄ from N (5, 1) for different n.

Example 5.4 Suppose that the heights (in cm) of men (aged over 16) in a
population follow a normal distribution with some unknown mean µ and a known
standard deviation of 7.39.

Figure 5.4: Sampling distributions of X̄ from N (5, 1) for different n.

We plan to select a random sample of n men from the population, and measure their
heights. How large should n be so that there is a probability of at least 0.95 that the
sample mean X̄ will be within 1 cm of the population mean µ?

Here X ∼ N(µ, (7.39)²), so X̄ ∼ N(µ, (7.39/√n)²). What we need is the smallest n
such that:
P (|X̄ − µ| ≤ 1) ≥ 0.95.
So:

P(|X̄ − µ| ≤ 1) ≥ 0.95

P(−1 ≤ X̄ − µ ≤ 1) ≥ 0.95

P(−1/(7.39/√n) ≤ (X̄ − µ)/(7.39/√n) ≤ 1/(7.39/√n)) ≥ 0.95

P(−√n/7.39 ≤ Z ≤ √n/7.39) ≥ 0.95

P(Z > √n/7.39) < 0.05/2 = 0.025

where Z ∼ N (0, 1). From Table 4 of the New Cambridge Statistical Tables, we see
that the smallest z which satisfies P (Z > z) < 0.025 is z = 1.97. Therefore:

√n/7.39 ≥ 1.97  ⇔  n ≥ (7.39 × 1.97)² = 211.9.
Therefore, n should be at least 212.
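A quick R check of this arithmetic is sketched below (using the tabulated value z = 1.97 from the example; the exact quantile qnorm(0.975) ≈ 1.96 would give a slightly smaller sample size).

# Sketch: minimum n so that P(|Xbar - mu| <= 1) >= 0.95 when sigma = 7.39
sigma <- 7.39
z <- 1.97                 # tabulated value with P(Z > z) < 0.025
n_min <- (sigma * z)^2    # approximately 211.9
ceiling(n_min)            # round up to 212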


5.7 The central limit theorem


We have discussed the very convenient result that if a random sample comes from a
normally-distributed population, the sampling distribution of X̄ is also normal. How
about sampling distributions of X̄ from other populations?
For this, we can use a remarkable mathematical result, the central limit theorem
(CLT). In essence, the CLT states that the normal sampling distribution of X̄ which
holds exactly for random samples from a normal distribution, also holds approximately
for random samples from nearly any distribution.
The CLT applies to ‘nearly any’ distribution because it requires that the variance of the
population distribution is finite. If it is not, the CLT does not hold. However, such
distributions are not common.
Suppose that {X1 , X2 , . . . , Xn } is a random sample from a population distribution
which has mean E(Xi ) = µ < ∞ and variance Var(Xi ) = σ 2 < ∞, that is with a finite
mean and finite variance. Let X̄n denote the sample mean calculated from a random
sample of size n, then:
lim_{n→∞} P((X̄n − µ)/(σ/√n) ≤ z) = Φ(z)
for any z, where Φ(z) denotes the cdf of the standard normal distribution.
The ‘ lim ’ indicates that this is an asymptotic result, i.e. one which holds increasingly
n→∞
well as n increases, and exactly when the sample size is infinite.
The full proof of the CLT is not straightforward and beyond the scope of this course.
In less formal language, the CLT says that for a random sample from nearly any
distribution with mean µ and variance σ 2 then:
X̄ ∼ N(µ, σ²/n)
approximately, when n is sufficiently large. We can then say that X̄ is asymptotically
normally distributed with mean µ and variance σ 2 /n.

The wide reach of the CLT

It may appear that the CLT is still somewhat limited, in that it applies only to sample
means calculated from random (IID) samples. However, this is not really true, for two
main reasons.

There are more general versions of the CLT which do not require the observations
Xi to be IID.

Even the basic version applies very widely, when we realise that the ‘X’ can also be
a function of the original variables in the data. For example, if X and Y are
random variables in the sample, we can also apply the CLT to:
∑_{i=1}^{n} ln(Xi)/n   or   ∑_{i=1}^{n} Xi Yi/n.


Therefore, the CLT can also be used to derive sampling distributions for many statistics
which do not initially look at all like X̄ for a single random variable in an IID sample.
You may get to do this in future courses.

How large is ‘large n’?

The larger the sample size n, the better the normal approximation provided by the CLT
is. In practice, we have various rules-of-thumb for what is ‘large enough’ for the
approximation to be ‘accurate enough’. This also depends on the population
distribution of Xi . For example:

for symmetric distributions, even small n is enough

for very skewed distributions, larger n is required.

For many distributions, n > 30 is sufficient for the approximation to be reasonably


accurate.

Example 5.5 In the first case, we simulate random samples of sizes:

n = 1, 5, 10, 30, 100 and 1,000

from the Exp(0.25) distribution (for which µ = 4 and σ 2 = 16). This is clearly a
skewed distribution, as shown by the histogram for n = 1 in Figure 5.5.
10,000 independent random samples of each size were generated. Histograms of the
values of X̄ in these random samples are shown in Figure 5.5. Each plot also shows
the pdf of the approximating normal distribution, N (4, 16/n). The normal
approximation is reasonably good already for n = 30, very good for n = 100, and
practically perfect for n = 1,000.
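A simulation along the lines of Example 5.5 can be coded as follows (a sketch; the sample size n and number of replications are illustrative choices, not part of the original example).

# Sketch: sampling distribution of Xbar for samples from Exp(0.25)
set.seed(1)
n <- 30
xbar <- replicate(10000, mean(rexp(n, rate = 0.25)))
hist(xbar, freq = FALSE)                                  # simulated sampling distribution
curve(dnorm(x, mean = 4, sd = sqrt(16/n)), add = TRUE)    # CLT approximation N(4, 16/n)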

Example 5.6 In the second case, we simulate 10,000 independent random samples
of sizes:
n = 1, 10, 30, 50, 100 and 1,000
from the Bernoulli(0.2) distribution (for which µ = 0.2 and σ 2 = 0.16).
Here the distribution of Xi itself is not even continuous, and has only two possible
values, 0 and 1. Nevertheless, the sampling distribution of X̄ can be very
well-approximated by the normal distribution, when n is large enough.
Note that since here Xi = 1 or Xi = 0 for all i, X̄ = ∑_{i=1}^{n} Xi/n = m/n, where m is the number of observations for which Xi = 1. In other words, X̄ is the sample proportion of the value X = 1.
The normal approximation is clearly very bad for small n, but reasonably good
already for n = 50, as shown by the histograms in Figure 5.6.

Figure 5.5: Sampling distributions of X̄ for various n when sampling from the Exp(0.25)
distribution.

5.8 Some common sampling distributions


We now introduce three very important families of probability distributions: the χ2
(‘chi-squared’) distribution, the t distribution, and the F distribution.
These are not often used as distributions of individual variables. Instead, they are used
as sampling distributions for various statistics. Each of them arises from the normal
distribution in a particular way. We will now briefly introduce their main properties.
This is in preparation for statistical inference – for example, in EC2020 Elements of
econometrics and ST2134 Advanced statistics: statistical inference.

5.8.1 The χ2 distribution

Definition of the χ2 distribution

Let Z1 , Z2 , . . . , Zk be independent N (0, 1) random variables. If:


X = Z1² + Z2² + · · · + Zk² = ∑_{i=1}^{k} Zi²

the distribution of X is the χ2 distribution with k degrees of freedom. This is


denoted by X ∼ χ2 (k) or X ∼ χ2k .

The χ2k distribution is a continuous distribution, which can take values of x ≥ 0. Its
mean and variance are:

Figure 5.6: Sampling distributions of X̄ for various n when sampling from the
Bernoulli(0.2) distribution.

E(X) = k

Var(X) = 2k.

For reference, the probability density function of X ∼ χ2k is:


f(x) = (2^{k/2} Γ(k/2))^{−1} x^{k/2−1} e^{−x/2}  for x ≥ 0, and 0 otherwise

where:

Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx

is the gamma function, which is defined for all α > 0. (Note the formula of the pdf of
X ∼ χ2k is not examinable.)
The shape of the pdf depends on the degrees of freedom k, as illustrated in Figure 5.7.
In most applications of the χ2 distribution the appropriate value of k is known, in which
case it does not need to be estimated from data.
If X1, X2, . . . , Xm are independent random variables and Xi ∼ χ²_{ki}, then their sum is also χ²-distributed where the individual degrees of freedom are added, such that:

X1 + X2 + · · · + Xm ∼ χ²_{k1+k2+···+km}.

Use of the χ2 distribution will be seen in courses such as ST2134 Advanced


statistics: statistical inference. One example though is if {X1 , X2 , . . . , Xn } is a


Figure 5.7: χ2 pdfs for various degrees of freedom.

random sample from the population N (µ, σ 2 ), and S 2 is the sample variance, then:

(n − 1)S²/σ² ∼ χ²_{n−1}.
This result is used to derive basic tools of statistical inference for both µ and σ 2 for the
normal distribution.

Tables of the χ2 distribution

In exercises and the examination, you will need a table of some probabilities for the χ2
distribution. Table 8 of the New Cambridge Statistical Tables shows the following
information.

The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 100.

The columns correspond to the right-tail probability P (X > x) = α, where X ∼ χ2k ,


for different values of α. The first page contains α = 0.9995, 0.999, . . . , 0.60, and the
second page contains α = 0.50, 0.25, . . . , 0.0005. (The table presents α in units of
percentage points, P , so, for example, α = 0.60 corresponds to P = 60.)

The numbers in the table are values of x such that P (X > x) = α for the k and α
in that row and column.

Example 5.7 Consider two numbers in the ‘ν = 5’ row, the 9.236 in the ‘α = 0.10
(P = 10)’ column and the 11.07 in the ‘α = 0.05 (P = 5)’ column. These mean that
for X ∼ χ25 we have:

P (X > 9.236) = 0.10 (and hence P (X ≤ 9.236) = 0.90)


P (X > 11.07) = 0.05 (and hence P (X ≤ 11.07) = 0.95).

These also provide bounds for probabilities of other values. For example, since 10.0
is between 9.236 and 11.07, we can conclude that:

0.05 < P (X > 10.0) < 0.10.
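If you have access to R, these tabulated values can be checked with the qchisq and pchisq functions (a sketch for self-checking only; in the examination you must use the tables).

qchisq(0.90, df = 5)                      # approximately 9.236, i.e. P(X > 9.236) = 0.10
qchisq(0.95, df = 5)                      # approximately 11.07, i.e. P(X > 11.07) = 0.05
pchisq(10.0, df = 5, lower.tail = FALSE)  # P(X > 10.0), which lies between 0.05 and 0.10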

5.8.2 (Student’s) t distribution

Definition of Student’s t distribution

Suppose Z ∼ N (0, 1), X ∼ χ2k , and Z and X are independent. The distribution of
the random variable:
T = Z / √(X/k)
is the t distribution with k degrees of freedom. This is denoted T ∼ tk or
T ∼ t(k). The distribution is also known as ‘Student’s t distribution’.

The tk distribution is continuous with the pdf:


f(x) = [Γ((k + 1)/2) / (√(kπ) Γ(k/2))] (1 + x²/k)^{−(k+1)/2}
for all −∞ < x < ∞. Examples of f (x) for different k are shown in Figure 5.8. (Note
the formula of the pdf of tk is not examinable.)
Figure 5.8: Student’s t pdfs for various degrees of freedom.

From Figure 5.8, we see the following.

The distribution is symmetric around 0.


As k → ∞, the tk distribution tends to the standard normal distribution, so tk with
large k is very similar to N (0, 1).


For any finite value of k, the tk distribution has heavier tails than the standard
normal distribution, i.e. tk places more probability on values far from 0 than
N (0, 1) does.

For T ∼ tk , the mean and variance of the distribution are:

E(T ) = 0 for k > 1

and:
Var(T) = k/(k − 2)   for k > 2.
This means that for t1 neither E(T ) nor Var(T ) exist, and for t2 , Var(T ) does not exist.

Tables of the t distribution

In exercises and the examination, you will need a table of some probabilities for the t
distribution. Table 10 of the New Cambridge Statistical Tables shows the following
information.

The rows correspond to different degrees of freedom k (denoted in the table by ν).
The table shows values of k up to 120, and then ‘∞’, which is N (0, 1).

If you need a tk distribution for which k is not in the table, use the nearest value or
use interpolation.

The columns correspond to the right-tail probability P (T > t) = α, where T ∼ tk ,


for α = 0.40, 0.05, . . . , 0.0005. (The table presents α in units of percentage points,
P , so, for example, α = 0.40 corresponds to P = 40.)

The numbers in the table are values of t such that P (T > t) = α for the k and α in
that row and column.

Example 5.8 Consider the number 2.132 in the ‘ν = 4’ row, and the ‘α = 0.05
(P = 5)’ column. This means that for T ∼ t4 we have:

P (T > 2.132) = 0.05 (and hence P (T ≤ 2.132) = 0.95).

The table also provides bounds for other probabilities. For example, the number in
the ‘α = 0.025 (P = 2.5)’ column is 2.776, so P (T > 2.776) = 0.025. Since
2.132 < 2.5 < 2.776, we know that 0.025 < P (T > 2.5) < 0.05.
Results for left-tail probabilities P (T < t) = α can also be obtained, because the t
distribution is symmetric around 0. This means that P (T < t) = P (T > −t). For
example:
P (T < −2.132) = P (T > 2.132) = 0.05
and P (T < −2.5) < 0.05 since P (T > 2.5) < 0.05.
This is the same trick we used for the standard normal distribution.
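Again, the table values can be verified in R using qt and pt (a sketch for checking your own work; the values quoted in the comments are those given in the example).

qt(0.95, df = 4)                      # approximately 2.132, so P(T > 2.132) = 0.05
qt(0.975, df = 4)                     # approximately 2.776, so P(T > 2.776) = 0.025
pt(2.5, df = 4, lower.tail = FALSE)   # P(T > 2.5), between 0.025 and 0.05
pt(-2.132, df = 4)                    # P(T < -2.132) = 0.05, by symmetry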


5.8.3 The F distribution

Definition of the F distribution

Let U and V be two independent random variables, where U ∼ χ2p and V ∼ χ2k .
The distribution of:
F = (U/p) / (V/k)
is the F distribution with degrees of freedom (p, k), denoted F ∼ Fp, k or
F ∼ F (p, k).

The F distribution is a continuous distribution, with non-zero probabilities for x > 0.


The general shape of its pdf is shown in Figure 5.9.


Figure 5.9: F pdfs for various degrees of freedom.

For F ∼ Fp, k , E(F ) = k/(k − 2), for k > 2. If F ∼ Fp, k , then 1/F ∼ Fk, p . If T ∼ tk ,
then T 2 ∼ F1, k .
Tables of F distributions will be needed for some purposes. They will be available in the
examination. We now consider how to use them in the following example.

Example 5.9 Here we practise use of Table A.3 of the Dougherty Statistical Tables
to obtain critical values for the F distribution.
Table A.3 can be used to find the top 100αth percentile of the Fν1 , ν2 distribution for
α = 0.05, 0.01 and 0.001.
For example, for ν1 = 3 and ν2 = 5, then:

P (F3, 5 > 5.41) = 0.05

P (F3, 5 > 12.06) = 0.01


and:
P (F3, 5 > 33.20) = 0.001.
To find the bottom 100αth percentile, we note that F1−α, ν1 , ν2 = 1/Fα, ν2 , ν1 . So, for
ν1 = 3 and ν2 = 5, we have:

P(F3, 5 < 1/F0.05, 5, 3) = P(F3, 5 < 1/9.01) = P(F3, 5 < 0.111) = 0.05

P(F3, 5 < 1/F0.01, 5, 3) = P(F3, 5 < 1/28.24) = P(F3, 5 < 0.035) = 0.01

and:

P(F3, 5 < 1/F0.001, 5, 3) = P(F3, 5 < 1/134.58) = P(F3, 5 < 0.007) = 0.001.
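The same critical values can be obtained in R with qf (a sketch, useful for checking Table A.3; the comments repeat the values used above).

qf(0.95, df1 = 3, df2 = 5)     # approximately 5.41
qf(0.99, df1 = 3, df2 = 5)     # approximately 12.06
qf(0.999, df1 = 3, df2 = 5)    # approximately 33.20
1/qf(0.95, df1 = 5, df2 = 3)   # bottom 5th percentile of F(3, 5), approximately 0.111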

5.9 Overview of chapter


This chapter introduced sampling distributions of statistics which are the foundations
to statistical inference. The sampling distribution of the sample mean was derived
exactly when sampling from normal populations and also approximately for more
general distributions using the central limit theorem. Three new families of distributions
(χ2 , t and F ) were defined.

5.10 Key terms and concepts


Central limit theorem Chi-squared (χ2 ) distribution
F distribution IID random variables
Random sample Sampling distribution
Sampling variance Statistic
(Student’s) t distribution

5.11 Sample examination questions


1. Suppose that on a certain statistics examination, students from university X
achieve scores which are normally distributed with a mean of 62 and a variance of
10, while students from university Y achieve scores which are normally distributed
with a mean of 60 and a variance of 15. If two students from university X and three
students from university Y, selected at random, sit this examination, what is the
probability that the average of the scores of the two students from university X will
be greater than the average of the scores of the three students from university Y?

2. Suppose Xi ∼ N (0, 9), for i = 1, 2, 3, 4. Assume all these random variables are
independent. Derive the value of k in each of the following.


(a) P(X1 + 6X2 < k) = 0.3974.

(b) P(∑_{i=1}^{4} Xi² < k) = 0.90.

(c) P(X1 > (k(X2² + X3²))^{1/2}) = 0.10.

5.12 Solutions to Sample examination questions


1. With obvious notation, we have:
X̄ ∼ N(µX, σX²/nX) = N(62, 5)   and   Ȳ ∼ N(µY, σY²/nY) = N(60, 5).

Therefore, by independence:

X̄ − Ȳ ∼ N(2, 10)

so:

P(X̄ − Ȳ > 0) = P(Z > −2/√10) = P(Z > −0.63) = 0.7357.

2. (a) Xi ∼ N (0, 9), for i = 1, 2, 3, 4. We have:

X1 + 6X2 ∼ N (0, 333).

Hence:

P(X1 + 6X2 < k) = P(Z < k/√333) = 0.3974.

Since, using Table 4 of the New Cambridge Statistical Tables, we can deduce that Φ(−0.26) = 0.3974, we have:

k/√333 = −0.26  ⇒  k = −4.7446.

(b) Xi/√9 ∼ N(0, 1), and so Xi²/9 ∼ χ²_1. Hence we have that ∑_{i=1}^{4} Xi²/9 ∼ χ²_4.
Therefore, using Table 8 of the New Cambridge Statistical Tables, we have:

P(∑_{i=1}^{4} Xi² < k) = P(X < k/9) = 0.90  ⇒  k/9 = 7.779  ⇒  k = 70.011

where X ∼ χ²_4.


(c) We have:

P(X1 > (k(X2² + X3²))^{1/2}) = P( (X1/√9) / √((X2² + X3²)/(9 × 2)) > √2 × √k )
                             = P(T > √2 × √k)
                             = 0.10

where T ∼ t2. From Table 10 of the New Cambridge Statistical Tables, √2 × √k = 1.886, hence k = 1.7785.

Did you hear the one about the statistician? Probably.


(Anon)

Chapter 6
Estimator properties

6.1 Synopsis of chapter


This chapter introduces the concept of point estimation and examines the desirable properties of estimators, where an estimator is a statistic (i.e. a known function of sample data) which can be used to estimate an unknown parameter. When faced with
more than one estimator of a specific parameter, we will see how to compare the
performance of competing estimators.

6.2 Learning outcomes


After completing this chapter, you should be able to:

summarise the performance of an estimator with reference to its sampling


distribution

use the concepts of bias and variance of an estimator

define mean squared error and calculate it for simple estimators.

6.3 Introduction
Reiterating previous discussions, one of the main uses of statistics, and of sampling, is
to estimate the value of some unknown population characteristic (parameter).1 Given a
relevant set of data (a random sample) drawn from the population, the problem is to
perform a calculation using these data values in order to arrive at a value which, in some
sense, comes ‘close’ to the unknown population parameter which we wish to estimate.
Whatever it is that we are trying to estimate, this (numerical) statistic which we
calculate is known as a point estimate, and the general name for making such value
estimates is ‘point estimation’. Note that, in principle, we might want to estimate any
characteristic of the (true) population distribution, but some will be far more common
than others.
The statistic used to obtain a point estimate is known as an estimator, and Chapter 5
derived the sampling distribution of arguably the most common estimator, i.e. the
sample mean, X̄, which is used as our preferred estimator of the population mean, µ.
1
Clearly, if the population parameter is known, then there is no need to estimate it!


Of course, in practice we do not realistically expect to achieve a point estimate which is


exactly equal to the unknown population parameter, due to sample data being subject
to randomness. One can think of a point estimate as our ‘best guess’ of the parameter
based on the available sample data. We would hope, of course, that this ‘best guess’ is
reasonably close to the truth, i.e. that our sampling error (the difference between the
point estimate and the true parameter value) is small. This chapter will establish
well-defined criteria for selecting a desirable estimator, as we may be faced with several
competing estimators for a parameter and need to be able to identify the ‘best’ one.

6.4 Estimation criteria – bias, variance and mean squared error
Estimators are random variables and, therefore, have probability distributions, known
as sampling distributions. As we know, two important properties of probability
distributions are the mean and variance. Our objective is to create a formal criterion
which combines both of these properties to assess the relative performance of different
estimators.

Bias of an estimator

Let θ̂ be an estimator of the population parameter θ.² We define the bias of an estimator θ̂ as:

Bias(θ̂) = E(θ̂) − θ.     (6.1)

An estimator is:

positively biased if E(θ̂) − θ > 0

unbiased if E(θ̂) − θ = 0

negatively biased if E(θ̂) − θ < 0.

A positively-biased estimator means the estimator would systematically overestimate


the parameter by the size of the bias, on average. An unbiased estimator means the
estimator would estimate the parameter correctly, on average. A negatively-biased
estimator means the estimator would systematically underestimate the parameter by
the size of the bias, on average.
In words, the bias of an estimator is the difference between the expected (average) value
of the estimator and the true parameter being estimated. Intuitively, it would be
desirable, other things being equal, to have an estimator with zero bias, called an
unbiased estimator. Given the definition of bias in (6.1), an unbiased estimator of θ
would satisfy:
E(θ̂) = θ.

² The ˆ (hat) notation is frequently deployed by statisticians to denote an estimator of the symbol beneath the hat. So, for example, λ̂ denotes an estimator of the Poisson rate parameter, λ.


In words, the expected value of the estimator is the true parameter being estimated, i.e.
on average, under repeated sampling, the estimator correctly estimates θ.
We view bias as a ‘bad’ thing, so, other things being equal, the smaller an estimator’s
bias the better.

Variance of common estimators

The variance of an estimator, denoted Var(θ̂), is obtained directly from the estimator's sampling distribution. For example, in Chapter 5, we had the following estimator variances:

Var(X̄) = σ²/n     (6.2)

and:

Var(P) = π(1 − π)/n     (6.3)

where P = X̄ when sampling from the Bernoulli distribution (see Example 5.6), i.e. by the central limit theorem as n → ∞ then:

P = X̄ ∼ N(µ, σ²/n) = N(π, π(1 − π)/n)

since E(X) = π and Var(X) = π(1 − π) when X ∼ Bernoulli(π).

It is clear that in both (6.2) and (6.3) increasing the sample size n decreases the
estimator’s variance (and hence the standard error), so in turn increases the precision of
the estimator.3 We conclude that variance is also a ‘bad’ thing so, other things being
equal, the smaller an estimator’s variance the better.
A popular metric for assessing how 'good' an estimator is, based on the estimator's average squared error, is the 'mean squared error'.

Mean squared error (MSE)

The mean squared error (MSE) of an estimator is the average squared error.
Formally, this is defined as:
MSE(θ̂) = E((θ̂ − θ)²).     (6.4)

It is possible to decompose this into components involving both the bias and the
variance of an estimator. Recall that:

Var(X) = E(X 2 ) − (E(X))2 ⇒ E(X 2 ) = Var(X) + (E(X))2 .

Also, note that for any constant k, Var(X ± k) = Var(X), hence adding or subtracting a
constant has no effect on the variance of a random variable. Noting that the true
parameter θ is some (unknown) constant,4 it immediately follows, by setting
3
Remember, however, that this increased precision comes at a cost – namely the increased expenditure
on data collection.
4
Even though θ is an unknown constant, it is known to be a constant!


X = (θ̂ − θ), that:

MSE(θ̂) = E((θ̂ − θ)²)
        = Var(θ̂ − θ) + (E(θ̂ − θ))²
        = Var(θ̂) + (Bias(θ̂))².     (6.5)

It is the form of the MSE given by (6.5), rather than (6.4), which we will use in practice.
We have already established that both the bias and the variance of an estimator are
‘bad’ things, so the MSE (being the sum of a bad thing and a bad thing squared) can
also be viewed as a ‘bad’ thing.5 Therefore, when faced with several competing
estimators, we prefer the estimator with the smallest MSE.
So, although an unbiased estimator is intuitively appealing, it is perfectly possible that
a biased estimator might be preferred if the ‘cost’ of the bias is offset by a substantial
reduction in the variance. Hence the MSE provides us with a formal criterion to assess
the trade-off between the bias and variance of different estimators.

Example 6.1 A population is known to be normally distributed, i.e. X ∼ N (µ, σ 2 ).


Suppose we wish to estimate the population mean, µ. We draw a random sample
{X1 , X2 , . . . , Xn }, such that these random variables are independent and identically
distributed (IID). We have three candidate estimators – T1 , T2 and T3 – where:
T1 = X̄ = ∑_{i=1}^{n} Xi/n,   T2 = (X1 + Xn)/2   and   T3 = X̄ + 3.
Which estimator should we choose?
We begin by computing the MSE for T1 , noting:

E(T1 ) = E(X̄) = µ

hence T1 is an unbiased estimator of µ, and:

Var(T1) = Var(X̄) = σ²/n.
So MSE(T1) = σ²/n. Moving to T2, note:

E(T2) = E((X1 + Xn)/2) = (E(X1) + E(Xn))/2 = (µ + µ)/2 = µ

and:

Var(T2) = Var((X1 + Xn)/2) = (Var(X1) + Var(Xn))/2² = 2σ²/4 = σ²/2.
So T2 is also an unbiased estimator of µ, hence MSE(T2 ) = σ 2 /2. Finally, consider
T3 , noting:
E(T3 ) = E(X̄ + 3) = E(X̄) + 3 = µ + 3
5
Or, for that matter, a ‘very bad’ thing!


and:

Var(T3) = Var(X̄ + 3) = Var(X̄) = σ²/n.
So T3 is a biased estimator of µ, with a bias of:

Bias(T3 ) = E(T3 ) − µ = µ + 3 − µ = 3

hence MSE(T3) = σ²/n + 3² = σ²/n + 9.
We seek the estimator with the smallest MSE. Clearly, MSE(T1 ) < MSE(T3 ) so we
can eliminate T3 . Now comparing T1 with T2 , we note that:

for n = 2, MSE(T1 ) = MSE(T2 ), since the estimators are identical

for n > 2, MSE(T1 ) < MSE(T2 ), so T1 is preferred.

So T1 = X̄ is our preferred estimator of µ. Intuitively, this should make sense. Note


for n > 2, T1 uses all the information in the sample (i.e. all observations are used),
unlike T2 which uses the first and last observations only. Of course, for n = 2, these
two estimators are identical.
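The comparison in Example 6.1 can also be explored by simulation. The following R sketch (the values of µ, σ and n are arbitrary choices for illustration, not part of the example) estimates the MSE of each estimator over repeated samples.

# Sketch: Monte Carlo estimates of the MSEs of T1, T2 and T3 in Example 6.1
set.seed(42)
mu <- 10; sigma <- 2; n <- 20
reps <- 10000
t1 <- t2 <- t3 <- numeric(reps)
for (r in 1:reps) {
  x <- rnorm(n, mean = mu, sd = sigma)
  t1[r] <- mean(x)              # T1 = sample mean
  t2[r] <- (x[1] + x[n])/2      # T2 = average of first and last observations
  t3[r] <- mean(x) + 3          # T3 = sample mean plus 3
}
c(mean((t1 - mu)^2),            # should be close to sigma^2/n = 0.2
  mean((t2 - mu)^2),            # should be close to sigma^2/2 = 2
  mean((t3 - mu)^2))            # should be close to sigma^2/n + 9 = 9.2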

6.5 Unbiased estimators


Suppose unbiasedness is considered such a desirable estimator property that we decide to restrict attention to unbiased estimators only. Note that for an unbiased estimator θ̂, Bias(θ̂) = 0, by definition. Hence:

MSE(θ̂) = Var(θ̂) + (Bias(θ̂))² = Var(θ̂) + 0² = Var(θ̂).

So, minimising the MSE for unbiased estimators is the same as choosing the estimator
with the smallest variance, hence we term such an estimator the minimum variance
unbiased estimator. Therefore, if we had two unbiased estimators of θ, say θ̂1 and θ̂2, then we prefer θ̂1 if Var(θ̂1) < Var(θ̂2). If this is the case, then θ̂1 is called the more
efficient estimator.
Examples of unbiased estimators of parameters include:

the sample mean, X̄, as an estimator of µ for data from N (µ, σ 2 )


the sample proportion, P , as an estimator of π for Bernoulli sampling with success
parameter π
the sample variance, S 2 , as an estimator of σ 2 for data from N (µ, σ 2 ).6

It should be noted that unbiasedness is not an invariant property, by which we mean that if θ̂ is an unbiased estimator of θ there are important functions of θ̂ which are not unbiased. For example, if E(θ̂) = θ, it follows that:

E(θ̂²) = Var(θ̂) + θ² > θ².
6
This justifies use of the n − 1 divisor when computing the sample variance, since this results in an
unbiased estimator of σ 2 .


Unbiased estimators of functions of parameters, when the parameter is drawn from the
population distribution, can, however, often be found by an appropriate adjustment.

Example 6.2 Suppose we have a random sample of n values from N (µ, σ 2 ), and we
wish to find an unbiased estimator of µ2 . If we try X̄ 2 , then:

E(X̄²) = Var(X̄) + (E(X̄))² = σ²/n + µ² ≠ µ².

However, we know E(S²) = σ², so by combining this with the above, it follows that:

E(X̄² − S²/n) = E(X̄²) − E(S²/n) = σ²/n + µ² − σ²/n = µ².

Hence X̄ 2 − S 2 /n is an unbiased estimator of µ2 .
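A quick simulation check of this result is sketched below (the values of µ, σ and n are arbitrary choices, not part of the example).

# Sketch: Xbar^2 - S^2/n should be unbiased for mu^2 (here mu^2 = 25)
set.seed(7)
mu <- 5; sigma <- 1; n <- 20
est <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  mean(x)^2 - var(x)/n
})
mean(est)    # should be close to mu^2 = 25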

Finding unbiased estimators

Example 6.2 illustrates a somewhat rough-and-ready, but effective, method for


finding unbiased estimators of parameters.

1. Guess at a sensible estimator.

2. Derive its expectation.

3. Adjust the estimator accordingly to remove the bias.

Unbiasedness is a very desirable property for estimators used in survey sampling,


where the ‘fairness’ represented by the unbiased concept is regarded as being
extremely important.

6.6 Overview of chapter


This chapter has examined the desirable properties of estimators. The mean squared
error criterion was discussed as a way of choosing between competing estimators, taking
into account the bias and variance of an estimator.

6.7 Key terms and concepts


Bias Efficiency
Information Mean squared error
Point estimation Precision
Sampling error Unbiased estimator
Variance


6.8 Sample examination questions


1. A random variable X can take the values 0, 1 and 2 with P (X = 0) = 1 − 3α/4,
P (X = 1) = α/2 and P (X = 2) = α/4 for some parameter α > 0. One observation
is taken and we would like to estimate α.

(a) Consider the estimators T1 = X and T2 = 2X 2 /3. Show that they are both
unbiased estimators of α.

(b) Which of these estimators do you prefer and why?

2. For λ > 0, let X be a random variable following a Poisson distribution with


parameter λ and let Y be a random variable following a Poisson distribution with
parameter 3λ. Suppose that X and Y are independent. In the following you may
use without proof results listed on the formula sheet. Consider the following three
estimators of λ:
Y X +Y
X, and .
3 4

(a) For each of the three estimators above, determine whether or not they are
unbiased estimators of λ.

(b) Which of these estimators would you choose and why?

6.9 Solutions to Sample examination questions


1. (a) We calculate:
E(T1) = ∑_{x=0}^{2} x p(x) = 0 × (1 − 3α/4) + 1 × (α/2) + 2 × (α/4) = α

and:

E(T2) = ∑_{x=0}^{2} (2x²/3) p(x) = 0 × (1 − 3α/4) + (2/3) × (α/2) + (8/3) × (α/4) = 2α/6 + 8α/12 = α

hence both are unbiased estimators of α.

(b) Since T1 and T2 are unbiased estimators we prefer the one with the smallest variance. Since E(T1) = E(T2) = α, this is equivalent to choosing the estimator with the smaller value of E(T1²) and E(T2²). We have:

E(T1²) = ∑_{x=0}^{2} x² p(x) = 0 × (1 − 3α/4) + 1² × (α/2) + 2² × (α/4) = α/2 + 4α/4 = 3α/2


and:
E(T2²) = ∑_{x=0}^{2} (4x⁴/9) p(x) = 0 × (1 − 3α/4) + (4/9) × (α/2) + (64/9) × (α/4) = 4α/18 + 64α/36 = 2α

so we prefer T1 .

2. (a) Using the formula sheet, we have that:

E(X) = λ,   E(Y/3) = E(Y)/3 = 3λ/3 = λ

and:

E((X + Y)/4) = (E(X) + E(Y))/4 = (λ + 3λ)/4 = λ
hence these are all unbiased estimators of λ.

(b) Because each estimator is unbiased, minimising the mean squared error is
equivalent to minimising the variance. Again, using the formula sheet, due to
independence of X and Y we have that:

Var(X) = λ,   Var(Y/3) = Var(Y)/9 = 3λ/9 = λ/3

and:

Var((X + Y)/4) = (Var(X) + Var(Y))/16 = (λ + 3λ)/16 = λ/4.
Of these three unbiased estimators we prefer the one with the smallest
variance, hence we prefer (X + Y )/4.

Facts are stubborn, but statistics are more pliable.


(Mark Twain)

Chapter 7
Point estimation

7.1 Synopsis of chapter


This chapter covers point estimation. Specifically, three alternative techniques for
deriving estimators of unknown parameters are presented: method of moments
estimation, least squares estimation, and maximum likelihood estimation.

7.2 Learning outcomes


After completing this chapter, you should be able to:
derive estimators using the method of moments, least squares and maximum
likelihood approaches
investigate the statistical properties of such estimators, when requested to do so.

7.3 Introduction
We have previously seen a selection of families of theoretical probability distributions.
Some of these were discrete distributions, such as the Bernoulli, binomial and Poisson
distributions (seen in Chapter 2), while others were continuous, such as the exponential
and normal distributions (seen in Chapter 3). These are ‘families’ of distributions in
that the different members of these families vary in terms of the values of the
parameter(s) of the distribution.

Example 7.1 X ∼ Bernoulli(π) denotes a Bernoulli random variable with


parameter 0 < π < 1.
X ∼ Pois(λ) denotes a Poisson random variable with parameter λ > 0.
X ∼ N (µ, σ 2 ) denotes a normal random variable with parameters −∞ < µ < ∞ and
σ 2 > 0.

Let θ be the general notation for the parameter of a probability distribution. If its value
is unknown, we need to estimate it using an estimator. In general, how should we find
an estimator of θ in a practical situation? There are three conventional methods:
method of moments estimation
least squares estimation
maximum likelihood estimation.


7.4 Method of moments (MM) estimation

Method of moments estimation

Let {X1 , X2 , . . . , Xn } be a random sample from a population with cdf F (x; θ), i.e.
a cdf which depends on θ. Suppose θ has p components (for example, for a normal
population N (µ, σ 2 ), p = 2; for a Poisson population with parameter λ, p = 1).
Let:
µk = µk (θ) = E(X k )
denote the kth population moment, for k = 1, 2, . . .. Therefore, µk depends on the
unknown parameter θ, as everything else about the distribution F (x; θ) is known.
Denote the kth sample moment by:
Mk = (1/n) ∑_{i=1}^{n} Xi^k = (X1^k + X2^k + · · · + Xn^k)/n.

The MM estimator (MME) θ̂ of θ is the solution of the p equations:

µk(θ̂) = Mk   for k = 1, 2, . . . , p.

Example 7.2 Let {X1 , X2 , . . . , Xn } be a random sample from a population with


mean µ and variance σ 2 < ∞. Find the MM estimator of (µ, σ 2 ).
There are two unknown parameters. Let:

µ̂ = µ̂1 = M1   and   µ̂2 = M2 = (1/n) ∑_{i=1}^{n} Xi².

This gives us µ̂ = M1 = X̄.

Since σ² = µ2 − µ1² = E(X²) − (E(X))², we have:

σ̂² = M2 − M1² = (1/n) ∑_{i=1}^{n} Xi² − X̄² = (1/n) ∑_{i=1}^{n} (Xi − X̄)².

Note we have:

E(σ̂²) = E((1/n) ∑_{i=1}^{n} Xi² − X̄²)
       = (1/n) ∑_{i=1}^{n} E(Xi²) − E(X̄²)
       = E(X²) − E(X̄²)
       = σ² + µ² − (σ²/n + µ²)
       = (n − 1)σ²/n.


Since:

E(σ̂²) − σ² = −σ²/n < 0

σ̂² is a negatively-biased estimator of σ².

The sample variance, defined as:

S² = (1/(n − 1)) ∑_{i=1}^{n} (Xi − X̄)²

is a more frequently-used estimator of σ² as it has zero bias, i.e. it is an unbiased estimator since E(S²) = σ². This is why we use the n − 1 divisor when calculating the sample variance.

Note the MME does not use any information on F (x; θ) beyond the moments.
The idea is that Mk should be pretty close to µk when n is sufficiently large. In fact:
Mk = (1/n) ∑_{i=1}^{n} Xi^k

converges to:

µk = E(X^k)
as n → ∞. This is due to the law of large numbers (LLN). We illustrate this
phenomenon by simulation using R.

Example 7.3 For N (2, 4), we have µ1 = 2 and µ2 = 8. We use the sample moments
M1 and M2 as estimators of µ1 and µ2 , respectively. Note how the sample moments
converge to the population moments as the sample size increases.
For a sample of size n = 10, we obtained m1 = 0.5145838 and m2 = 2.171881.

> x <- rnorm(10,2,2)


> x
[1] 0.70709403 -1.38416864 -0.01692815 2.51837989 -0.28518898 1.96998829
[7] -1.53308559 -0.42573724 1.76006933 1.83541490
> mean(x)
[1] 0.5145838
> x2 <- x^2
> mean(x2)
[1] 2.171881

For a sample of size n = 100, we obtained m1 = 2.261542 and m2 = 8.973033.

> x <- rnorm(100,2,2)


> mean(x)
[1] 2.261542
> x2 <- x^2
> mean(x2)
[1] 8.973033

For a sample of size n = 500, we obtained m1 = 1.912112 and m2 = 7.456353.


> x <- rnorm(500,2,2)


> mean(x)
[1] 1.912112
> x2 <- x^2
> mean(x2)
[1] 7.456353

Example 7.4 For a Poisson distribution with λ = 1, we have µ1 = 1 and µ2 = 2.


With a sample of size n = 500, we obtained m1 = 1.09 and m2 = 2.198.

> x <- rpois(500,1)


> mean(x)
[1] 1.09
> x2 <- x^2
> mean(x2)
[1] 2.198
> x
[1] 1 2 2 1 0 0 0 0 0 0 2 2 1 2 1 1 1 2 ...

7.5 Least squares (LS) estimation


Given a random sample {X1 , X2 , . . . , Xn } from a population with mean µ and variance
σ 2 , how can we estimate µ?
The MME of µ is the sample mean X̄ = ∑_{i=1}^{n} Xi/n.

Least squares estimator of µ

The estimator X̄ is also the least squares estimator (LSE) of µ, defined as the value of a which minimises ∑_{i=1}^{n} (Xi − a)², i.e.:

µ̂ = X̄ = arg min_a ∑_{i=1}^{n} (Xi − a)².

Proof: Given that:


S = ∑_{i=1}^{n} (Xi − a)² = ∑_{i=1}^{n} ((Xi − X̄) + (X̄ − a))²
  = ∑_{i=1}^{n} (Xi − X̄)² + ∑_{i=1}^{n} (X̄ − a)² + 2 ∑_{i=1}^{n} (Xi − X̄)(X̄ − a)
  = ∑_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)² + 2(X̄ − a) ∑_{i=1}^{n} (Xi − X̄)
  = ∑_{i=1}^{n} (Xi − X̄)² + n(X̄ − a)²


where all terms are non-negative, then the value of a for which S is minimised is when
n(X̄ − a)2 = 0, i.e. a = X̄.


Estimator accuracy

In order to assess the accuracy of µ̂ = X̄ as an estimator of µ we calculate its MSE:

MSE(µ̂) = E((µ̂ − µ)²) = σ²/n.
In order to determine the distribution of µ̂ we require knowledge of the underlying distribution. Even if the relevant knowledge is available, one may only compute the exact distribution of µ̂ explicitly for a limited number of cases.
By the central limit theorem, as n → ∞, we have:

P((X̄ − µ)/(σ/√n) ≤ z) → Φ(z)

for any z, where Φ(z) is the cdf of N (0, 1), i.e. when n is large, X̄ ∼ N (µ, σ 2 /n)
approximately.
Some remarks are the following.

i. The LSE is a geometric solution – it minimises the sum of squared distances


between the estimated value and each observation. It makes no use of any
information about the underlying distribution.
ii. Taking the derivative of ∑_{i=1}^{n} (Xi − a)² with respect to a, and equating it to 0, we obtain (after dividing through by −2):

∑_{i=1}^{n} (Xi − a) = ∑_{i=1}^{n} Xi − na = 0.

Hence the solution is µ̂ = â = X̄. This is another way to derive the LSE of µ.

Example 7.5 Suppose that you are given independent observations y1 , y2 and y3
such that:

y1 = 3α + β + ε1

y2 = α − 2β + ε2

y3 = −α + 2β + ε3 .

The random variables εi , for i = 1, 2, 3, are normally distributed with a mean of 0


and a variance of 4. We will derive the least squares estimators of the parameters α
and β, and verify that they are unbiased estimators. We will also calculate the
variance of the estimator of α.


We have to minimise:
S = ∑_{i=1}^{3} εi² = (y1 − 3α − β)² + (y2 − α + 2β)² + (y3 + α − 2β)².

We have:
∂S/∂α = −6(y1 − 3α − β) − 2(y2 − α + 2β) + 2(y3 + α − 2β) = 22α − 2β − 2(3y1 + y2 − y3)

and:

∂S/∂β = −2(y1 − 3α − β) + 4(y2 − α + 2β) − 4(y3 + α − 2β) = −2α + 18β − 2(y1 − 2y2 + 2y3).

The estimators α̂ and β̂ are the solutions of the equations ∂S/∂α = 0 and ∂S/∂β = 0. Hence:

22α̂ − 2β̂ = 6y1 + 2y2 − 2y3

and:

−2α̂ + 18β̂ = 2y1 − 4y2 + 4y3.

Solving yields:

α̂ = (4y1 + y2 − y3)/14   and   β̂ = (2y1 − 3y2 + 3y3)/14.
They are unbiased estimators since:

E(α̂) = E((4y1 + y2 − y3)/14) = (12α + 4β + α − 2β + α − 2β)/14 = α

and:

E(β̂) = E((2y1 − 3y2 + 3y3)/14) = (6α + 2β − 3α + 6β − 3α + 6β)/14 = β.

Due to independence, we have (noting that Var(yi) = Var(εi) = 4 for i = 1, 2, 3):

Var(α̂) = Var((4y1 + y2 − y3)/14)
        = (4/14)² Var(y1) + (1/14)² Var(y2) + (1/14)² Var(y3)
        = (4/14)² × 4 + (1/14)² × 4 + (1/14)² × 4
        = 18/49.
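Example 7.5 can equivalently be set up in matrix form, y = Xγ + ε with γ = (α, β)′, and solved via the normal equations. The R sketch below is hypothetical code, not part of the example; it recovers the same weights (4, 1, −1)/14 and (2, −3, 3)/14 and the same variance 18/49.

# Sketch: least squares for Example 7.5 via the design matrix
X <- rbind(c(3, 1),
           c(1, -2),
           c(-1, 2))               # rows correspond to y1, y2, y3
W <- solve(t(X) %*% X) %*% t(X)    # rows of W are the LS weights for alpha-hat and beta-hat
W * 14                             # should reproduce (4, 1, -1) and (2, -3, 3)
sum(W[1, ]^2) * 4                  # Var(alpha-hat) = 18/49, since Var(y_i) = 4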


7.6 Maximum likelihood (ML) estimation


We begin with an illustrative example. Maximum likelihood (ML) estimation
generalises the reasoning in the following example to arbitrary settings.

Example 7.6 Suppose we toss a coin 10 times, and record the number of ‘heads’ as
a random variable X. Therefore:

X ∼ Bin(10, π)

where π = P (heads) ∈ (0, 1) is the unknown parameter.

If x = 8, what is your best guess (i.e. estimate) of π? Obviously 0.8!

Is π = 0.1 possible? Yes, but very unlikely.

Is π = 0.5 possible? Yes, but not very likely.

Is π = 0.7 or 0.9 possible? Yes, very likely.

Nevertheless, π = 0.8 is the most likely, or ‘maximally’ likely value of the parameter.
Why do we think ‘π = 0.8’ is most likely?
Let:
L(π) = P(X = 8) = (10!/(8! 2!)) π⁸ (1 − π)².
Since x = 8 is the event which occurred in the experiment, this probability would be
very large. Figure 7.1 shows a plot of L(π) as a function of π.
The most likely value of π should make this probability as large as possible. This
value is taken as the maximum likelihood estimate of π.
Maximising L(π) is equivalent to maximising:

l(π) = ln(L(π)) = 8 ln π + 2 ln(1 − π) + c

where c is the constant ln(10!/(8! 2!)). Setting:

d l(π)/dπ = 0

we obtain the maximum likelihood estimate π̂ = 0.8.
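The same estimate can be found numerically in R (a sketch; numerical optimisation is not needed here since the closed-form solution is available, but it illustrates the idea of maximising the likelihood).

# Sketch: numerically maximise the binomial likelihood of Example 7.6
L <- function(p) dbinom(8, size = 10, prob = p)
optimise(L, interval = c(0, 1), maximum = TRUE)$maximum   # approximately 0.8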

Maximum likelihood definition

Let f (x1 , x2 , . . . , xn ; θ) be the joint probability density function (or probability


function) for random variables (X1 , X2 , . . . , Xn ). The maximum likelihood estimator
(MLE) of θ based on the observations {X1 , X2 , . . . , Xn } is defined as:

θ̂ = arg max_θ f(X1, X2, . . . , Xn; θ).


Figure 7.1: Plot of the likelihood function in Example 7.6.

Some remarks are the following.

i. The MLE depends only on the observations {X1 , X2 , . . . , Xn }, such that:

θ̂ = θ̂(X1, X2, . . . , Xn).

Therefore, θ̂ is a statistic (as it must be for an estimator of θ).

ii. If {X1 , X2 , . . . , Xn } is a random sample from a population with probability density


function f (x; θ), the joint probability density function for (X1 , X2 , . . . , Xn ) is:
∏_{i=1}^{n} f(xi; θ).

The joint pdf is a function of (X1 , X2 , . . . , Xn ), while θ is a parameter.

The joint pdf describes the probability distribution of {X1 , X2 , . . . , Xn }.

The likelihood function is defined as:


L(θ) = ∏_{i=1}^{n} f(Xi; θ).     (7.1)

The likelihood function is a function of θ, while {X1 , X2 , . . . , Xn } are treated as


constants (as given observations).

The likelihood function reflects the information about the unknown parameter θ in
the data {X1 , X2 , . . . , Xn }.


Some remarks are the following.

i. The likelihood function is a function of the parameter. It is defined up to positive


constant factors. A likelihood function is not a probability density function. It
contains all the information about the unknown parameter from the observations.

ii. The MLE is θ̂ = arg max_θ L(θ).

iii. It is often more convenient to use the log-likelihood function1 denoted as:

l(θ) = ln L(θ) = ∑_{i=1}^{n} ln f(Xi; θ)

as it transforms the product in (7.1) into a sum. Note that:

θ̂ = arg max_θ l(θ).

iv. For a smooth likelihood function, the MLE is often the solution of the equation:

d l(θ)/dθ = 0.

v. If θ̂ is the MLE and φ = g(θ) is a function of θ, then φ̂ = g(θ̂) is the MLE of φ (which is known as the invariance principle of the MLE).

vi. Unlike the MME or LSE, the MLE uses all the information about the population
distribution. It is often more efficient (i.e. more accurate) than the MME or LSE.

vii. In practice, ML estimation should be used whenever possible.

Example 7.7 Let {X1 , X2 , . . . , Xn } be a random sample from a distribution with


pdf:

f(x; λ) = λ² x e^{−λx}  for x > 0, and 0 otherwise
where λ > 0 is unknown. Find the MLE of λ.
The joint pdf is f(x1, x2, . . . , xn; λ) = ∏_{i=1}^{n} (λ² xi e^{−λxi}) if all xi > 0, and 0 otherwise.

The likelihood function is:


L(λ) = λ^{2n} exp(−λ ∑_{i=1}^{n} Xi) ∏_{i=1}^{n} Xi = λ^{2n} exp(−nλX̄) ∏_{i=1}^{n} Xi.

1
Throughout where ‘log’ is used in log-likelihood functions, it will be assumed to be the logarithm to
the base e, i.e. the natural logarithm.


The log-likelihood function is l(λ) = 2n ln λ − nλX̄ + c, where c = ln(∏_{i=1}^{n} Xi) is a constant.

Setting:

d l(λ)/dλ = 2n/λ̂ − nX̄ = 0

we obtain λ̂ = 2/X̄.

Note the MLE λ̂ may be obtained from maximising L(λ) directly. However, it is much easier to work with l(λ) instead.
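The pdf in Example 7.7 is that of a Gamma distribution with shape 2 and rate λ, so the result λ̂ = 2/X̄ can be checked by simulation (a sketch with an arbitrary true value λ = 1.5, not part of the example).

# Sketch: check the MLE of Example 7.7 on simulated data
set.seed(3)
lambda <- 1.5
x <- rgamma(1000, shape = 2, rate = lambda)   # density lambda^2 * x * exp(-lambda*x)
2/mean(x)                                     # MLE, should be close to 1.5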

Example 7.8 Let {X1, X2, . . . , Xn} be a random sample from N(µ, σ²).

The joint pdf is (2πσ²)^{−n/2} exp(−∑_{i=1}^{n} (xi − µ)²/(2σ²)).

Case I: σ 2 is known.
The likelihood function is:
L(µ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (Xi − µ)²)
     = (2πσ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (Xi − X̄)²) exp(−(n/(2σ²))(X̄ − µ)²).

Hence the log-likelihood function is:

l(µ) = ln((2πσ²)^{−n/2}) − (1/(2σ²)) ∑_{i=1}^{n} (Xi − X̄)² − (n/(2σ²))(X̄ − µ)².

Maximising l(µ) with respect to µ gives µ̂ = X̄.
Case II: σ 2 is unknown.
The likelihood function is:
L(µ, σ²) = (2π)^{−n/2} (σ²)^{−n/2} exp(−(1/(2σ²)) ∑_{i=1}^{n} (Xi − µ)²).

Hence the log-likelihood function is:


l(µ, σ²) = −(n/2) ln(σ²) − (1/(2σ²)) ∑_{i=1}^{n} (Xi − µ)² + c

where c = −(n/2) ln(2π). Regardless of the value of σ², l(X̄, σ²) ≥ l(µ, σ²). Hence µ̂ = X̄.

The MLE of σ² should maximise:

l(X̄, σ²) = −(n/2) ln(σ²) − (1/(2σ²)) ∑_{i=1}^{n} (Xi − X̄)² + c.


It follows from the lemma below that σ̂² = ∑_{i=1}^{n} (Xi − X̄)²/n.

Lemma: Let g(x) = −a ln(x) − b/x, where a, b > 0, then:

g(b/a) = max_{x>0} g(x).

Proof: Letting g′(x) = −a/x + b/x² = 0 leads to the solution x = b/a.



Now suppose we wanted to find the MLE of γ = σ/µ.
Since γ = γ(µ, σ), by the invariance principle the MLE of γ is:
γ̂ = γ(µ̂, σ̂) = σ̂/µ̂ = √(∑_{i=1}^{n} (Xi − X̄)²/n) / (∑_{i=1}^{n} Xi/n).

Example 7.9 Consider a population with three types of individuals labelled 1, 2


and 3, and occurring according to the Hardy–Weinberg proportions:

p(1; θ) = θ2 , p(2; θ) = 2θ(1 − θ) and p(3; θ) = (1 − θ)2

where 0 < θ < 1. Note that p(1; θ) + p(2; θ) + p(3; θ) = 1.


A random sample of size n is drawn from this population with n1 observed values
equal to 1 and n2 observed values equal to 2 (therefore, there are n − n1 − n2 values
equal to 3). Find the MLE of θ.
Let us assume {X1 , X2 , . . . , Xn } is the sample (i.e. n observed values). Among them,
there are n1 ‘1’s, n2 ‘2’s, and n − n1 − n2 ‘3’s. The likelihood function is (where ∝
means ‘proportional to’):
L(θ) = ∏_{i=1}^{n} p(Xi; θ) = p(1; θ)^{n1} p(2; θ)^{n2} p(3; θ)^{n−n1−n2}
     = θ^{2n1} (2θ(1 − θ))^{n2} (1 − θ)^{2(n−n1−n2)}
     ∝ θ^{2n1+n2} (1 − θ)^{2n−2n1−n2}.

The log-likelihood is l(θ) ∝ (2n1 + n2 ) ln θ + (2n − 2n1 − n2 ) ln(1 − θ).


Setting:
d l(θ)/dθ = (2n1 + n2)/θ̂ − (2n − 2n1 − n2)/(1 − θ̂) = 0

that is:

(1 − θ̂)(2n1 + n2) = θ̂(2n − 2n1 − n2)

leads to the MLE:

θ̂ = (2n1 + n2)/(2n).


For example, for a sample with n = 4, n1 = 1 and n2 = 2, we obtain a point estimate of θ̂ = (2 × 1 + 2)/(2 × 4) = 0.5.

7.7 Overview of chapter


This chapter introduced three point estimation techniques. Method of moments, least
squares and maximum likelihood estimation have been presented.

7.8 Key terms and concepts


Invariance principle Law of large numbers (LLN)
Least squares estimation Likelihood function
Log-likelihood function Maximum likelihood estimation
Method of moments estimation Parameter
Point estimator Population moment
Sample moment

7.9 Sample examination questions


1. Let {X1 , X2 , . . . , Xn } be a random sample from the probability distribution with
the probability density function:
f(x; θ) = (1 + θx)/2  for −1 ≤ x ≤ 1, and 0 otherwise

where −1 ≤ θ ≤ 1 is an unknown parameter.


(a) Derive the method of moments estimator of θ.

(b) Is the estimator of θ derived in part (a) biased or unbiased? Justify your
answer.

(c) Determine the variance of the estimator derived in part (a) and check whether
it is a consistent estimator of θ.

(d) Suppose n = 5, resulting in the sample:

x1 = 0.68, x2 = 0.05, x3 = 0.77, x4 = −0.65 and x5 = 0.35.

Use this sample to calculate the method of moments estimate of θ using the
estimator derived in part (a), and sketch the above probability density
function based on this estimate.


2. Let $\{X_1, X_2, \ldots, X_n\}$ be a random sample of size $n$ from the following probability density function:
$$
f(x; \alpha, \theta) = \frac{1}{(\alpha-1)!\,\theta^{\alpha}}\, x^{\alpha-1} e^{-x/\theta}
$$
for $x > 0$, and 0 otherwise, where $\alpha > 0$ is known, and $\theta > 0$.
(a) Derive the maximum likelihood estimator of θ. (You do not need to verify the
solution is a maximum.)

(b) Show that the estimator derived in part (a) is mean square consistent for θ.
Hint: You may use the fact that E(X) = αθ and Var(X) = αθ2 .

7.10 Solutions to Sample examination questions


1. (a) The first population moment is:
$$
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{-1}^{1} x\,\frac{1+\theta x}{2}\,dx = \int_{-1}^{1}\frac{x + \theta x^2}{2}\,dx = \left[\frac{x^2}{4} + \frac{\theta x^3}{6}\right]_{-1}^{1} = \frac{\theta}{3}.
$$
Estimating the first population moment using the first sample moment, the method of moments estimator of $\theta$ is:
$$
\frac{\hat\theta}{3} = \bar X \quad\Rightarrow\quad \hat\theta = 3\bar X.
$$
(b) Noting that $E(\bar X) = E(X) = \theta/3$, we have:
$$
E(\hat\theta) = E(3\bar X) = 3\,E(\bar X) = 3 \times \frac{\theta}{3} = \theta
$$
hence $\hat\theta$ is an unbiased estimator of $\theta$.
(c) As $\hat\theta$ is an unbiased estimator of $\theta$, we simply check whether $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$. We have:
$$
\mathrm{Var}(\hat\theta) = \mathrm{Var}(3\bar X) = 9\,\mathrm{Var}(\bar X) = \frac{9\,\mathrm{Var}(X)}{n}.
$$
Now:
$$
E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_{-1}^{1} x^2\,\frac{1+\theta x}{2}\,dx = \int_{-1}^{1}\frac{x^2 + \theta x^3}{2}\,dx = \left[\frac{x^3}{6} + \frac{\theta x^4}{8}\right]_{-1}^{1} = \frac{1}{3}.
$$
Hence:
$$
\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \frac{1}{3} - \frac{\theta^2}{9} \quad\Rightarrow\quad \mathrm{Var}(\hat\theta) = \frac{3 - \theta^2}{n}.
$$
The mean squared error of $\hat\theta$ is:
$$
\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + (\mathrm{Bias}(\hat\theta))^2 = \frac{3 - \theta^2}{n} + 0^2 \to 0
$$
as $n \to \infty$, hence $\hat\theta$ is a consistent estimator of $\theta$.


(d) The sample mean is $\bar x = 0.24$, hence $\hat\theta = 3\bar x = 3 \times 0.24 = 0.72$. Therefore:
$$
f(x; \hat\theta) = \begin{cases} 0.50 + 0.36x & \text{for } -1 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}
$$
A sketch of $f(x; \hat\theta)$ is a straight line over $[-1, 1]$, rising from $f(-1) = 0.14$ to $f(1) = 0.86$, with the density equal to zero outside this interval.
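The estimate and the sketch can also be produced in R (a minimal sketch; the object name dat is ours):

dat <- c(0.68, 0.05, 0.77, -0.65, 0.35)
theta_hat <- 3 * mean(dat)                       # method of moments estimate, 0.72
curve((1 + theta_hat * x) / 2, from = -1, to = 1,
      ylab = "f(x)")                             # rises from 0.14 at x = -1 to 0.86 at x = 1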

2. (a) For $\alpha > 0$ known, due to independence the likelihood function is:
$$
L(\theta) = \prod_{i=1}^{n} f(x_i; \alpha, \theta) = \frac{1}{((\alpha-1)!)^n\,\theta^{n\alpha}}\left(\prod_{i=1}^{n} x_i\right)^{\alpha-1}\exp\left(-\frac{1}{\theta}\sum_{i=1}^{n} x_i\right).
$$
Hence the log-likelihood function is:
$$
l(\theta) = -n\log((\alpha-1)!) - n\alpha\log\theta + (\alpha-1)\log\left(\prod_{i=1}^{n} x_i\right) - \frac{1}{\theta}\sum_{i=1}^{n} x_i
$$
such that:
$$
\frac{d}{d\theta}\, l(\theta) = -\frac{n\alpha}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n} x_i.
$$
Equating to zero and solving for $\hat\theta$, the maximum likelihood estimator of $\theta$ is:
$$
\hat\theta = \frac{1}{n\alpha}\sum_{i=1}^{n} X_i = \frac{\bar X}{\alpha}.
$$
(b) Noting the hint, we have:
$$
E(\hat\theta) = E\left(\frac{\bar X}{\alpha}\right) = \frac{1}{\alpha}\,E(\bar X) = \frac{1}{\alpha}\,E(X) = \frac{\alpha\theta}{\alpha} = \theta
$$
hence $\hat\theta$ is an unbiased estimator of $\theta$. Also:
$$
\mathrm{Var}(\hat\theta) = \mathrm{Var}\left(\frac{\bar X}{\alpha}\right) = \frac{1}{\alpha^2}\,\mathrm{Var}(\bar X) = \frac{1}{n\alpha^2}\,\mathrm{Var}(X) = \frac{\alpha\theta^2}{n\alpha^2} = \frac{\theta^2}{n\alpha}.
$$
Since $\hat\theta$ is unbiased and noting that $\mathrm{Var}(\hat\theta) \to 0$ as $n \to \infty$, then $\hat\theta$ is a consistent estimator of $\theta$.

The group was alarmed to find that if you are a labourer, cleaner or dock
worker, you are twice as likely to die than a member of the professional classes.
(The Sunday Times, 31 August 1980)

Chapter 8
Analysis of variance (ANOVA)

8.1 Synopsis of chapter


This chapter introduces analysis of variance (ANOVA) which is a widely-used technique
for detecting differences between groups based on continuous dependent variables. This
chapter employs hypothesis testing and confidence intervals, the principal mechanics of
which were covered in ST104a Statistics 1.

8.2 Learning outcomes


After completing this chapter, you should be able to:

explain the purpose of analysis of variance

restate and interpret the models for one-way and two-way analysis of variance

conduct small examples of one-way and two-way analysis of variance with a


calculator, reporting the results in an ANOVA table

perform hypothesis tests and construct confidence intervals for one-way and
two-way analysis of variance

explain how to interpret residuals from an analysis of variance.

8.3 Introduction
Analysis of variance (ANOVA) is a popular tool which has an applicability and power
which we can only start to appreciate in this course. The idea of analysis of variance is
to investigate how variation in structured data can be split into pieces associated with
components of that structure. We look only at one-way and two-way classifications,
providing tests and confidence intervals which are widely used in practice.

8.4 Testing for equality of three population means


We begin with an illustrative example to test the hypothesis that three population
means are equal.


Example 8.1 To assess the teaching quality of class teachers, a random sample of
6 examination marks was selected from each of three classes. The examination marks
for each class are listed in the table below.
Can we infer from these data that there is no significant difference in the
examination marks among all three classes?

Class 1 Class 2 Class 3


85 71 59
75 75 64
82 73 62
76 74 69
71 69 75
85 82 67

Suppose examination marks from Class j follow the distribution N (µj , σ 2 ), for
j = 1, 2, 3. So we assume examination marks are normally distributed with the same
variance in each class, but possibly different means.
We need to test the hypothesis:

H0 : µ1 = µ2 = µ3 .

The data form a 6 × 3 array. Denote the data point at the (i, j)th position as Xij .
We compute the column means first where the jth column mean is:
$$\bar X_{\cdot j} = \frac{X_{1j} + X_{2j} + \cdots + X_{n_j j}}{n_j}$$

where nj is the sample size of group j (here nj = 6 for all j).


This leads to x̄·1 = 79, x̄·2 = 74 and x̄·3 = 66. Transposing the table, we get:

Observation
1 2 3 4 5 6 Mean
Class 1 85 75 82 76 71 85 79
Class 2 71 75 73 74 69 82 74
Class 3 59 64 62 69 75 67 66

Note that similar problems arise from other practical situations. For example:

comparing the returns of three stocks

comparing sales using three advertising strategies

comparing the effectiveness of three medicines.

If H0 is true, the three observed sample means x̄·1 , x̄·2 and x̄·3 should be very close to
each other, i.e. all of them should be close to the overall sample mean, x̄, which is:
$$\bar x = \frac{\bar x_{\cdot 1} + \bar x_{\cdot 2} + \bar x_{\cdot 3}}{3} = \frac{79 + 74 + 66}{3} = 73$$


i.e. the mean value of all 18 observations.


So we wish to perform a hypothesis test based on the variation in the sample means
such that the greater the variation, the more likely we are to reject H0 . One possible
measure for the variation in the sample means X̄·j about the overall sample mean X̄,
for j = 1, 2, 3, is:
$$\sum_{j=1}^{3}(\bar X_{\cdot j} - \bar X)^2. \tag{8.1}$$

However, (8.1) is not scale-invariant, so it would be difficult to judge whether the


realised value is large enough to warrant rejection of H0 due to the magnitude being
dependent on the units of measurement of the data. So we seek a scale-invariant test
statistic.
Just as we scaled the covariance between two random variables to give the
scale-invariant correlation coefficient, we can similarly scale (8.1) to give the
following possible test statistic:
$$T = \frac{\sum_{j=1}^{3}(\bar X_{\cdot j} - \bar X)^2}{\text{sum of the three sample variances}}.$$

Hence we would reject H0 for large values of T . (Note t = 0 if x̄·1 = x̄·2 = x̄·3 which
would mean that there is no variation at all between the sample means. In this case
all the sample means would equal x̄.)
It remains to determine the distribution of T under H0 .

8.5 One-way analysis of variance


We now extend Example 8.1 to consider a general setting where there are k
independent random samples available from k normal distributions N (µj , σ 2 ), for
j = 1, 2, . . . , k. (Example 8.1 corresponds to k = 3.)
Denote by X1j , X2j , . . . , Xnj j the random sample with sample size nj from N (µj , σ 2 ), for
j = 1, 2, . . . , k.
Our goal is to test:
H0 : µ1 = µ2 = · · · = µk
vs.
H1 : not all µj s are the same.
One-way analysis of variance (one-way ANOVA) involves a continuous dependent
variable and one categorical independent variable (sometimes called a factor, or
treatment), where the k different levels of the categorical variable are the k different
groups.
We now introduce statistics associated with one-way ANOVA.


Statistics associated with one-way ANOVA

The jth sample mean is:
$$\bar X_{\cdot j} = \frac{1}{n_j}\sum_{i=1}^{n_j} X_{ij}.$$
The overall sample mean is:
$$\bar X = \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij} = \frac{1}{n}\sum_{j=1}^{k} n_j \bar X_{\cdot j}$$
where $n = \sum_{j=1}^{k} n_j$ is the total number of observations across all $k$ groups.
The total variation is:
$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X)^2$$
with $n - 1$ degrees of freedom.
The between-groups variation is:
$$B = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2$$
with $k - 1$ degrees of freedom.
The within-groups variation is:
$$W = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X_{\cdot j})^2$$
with $n - k = \sum_{j=1}^{k}(n_j - 1)$ degrees of freedom.
The ANOVA decomposition is:
$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X)^2 = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2 + \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X_{\cdot j})^2.$$

We have already discussed the jth sample mean and overall sample mean. The total
variation is a measure of the overall (total) variability in the data from all k groups
about the overall sample mean. The ANOVA decomposition decomposes this into two
components: between-groups variation (which is attributable to the factor level) and
within-groups variation (which is attributable to the variation within each group and is
assumed to be the same σ 2 for each group).
Some remarks are the following.

i. B and W are also called, respectively, between-treatments variation and within-treatments variation. In fact W is effectively a residual (error) sum of squares, representing the variation which cannot be explained by the treatment or group factor.
ii. The ANOVA decomposition follows from the identity:
$$\sum_{i=1}^{m}(a_i - b)^2 = \sum_{i=1}^{m}(a_i - \bar a)^2 + m(\bar a - b)^2.$$

However, the actual derivation is not required for this course.


iii. The following are some useful formulae for manual computations.
• $n = \sum_{j=1}^{k} n_j$.
• $\bar X_{\cdot j} = \sum_{i=1}^{n_j} X_{ij}/n_j$ and $\bar X = \sum_{j=1}^{k} n_j \bar X_{\cdot j}/n$.
• Total variation = Total SS = $B + W = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - n\bar X^2$.
• $B = \sum_{j=1}^{k} n_j \bar X_{\cdot j}^2 - n\bar X^2$.
• Residual (Error) SS = $W = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - \sum_{j=1}^{k} n_j \bar X_{\cdot j}^2 = \sum_{j=1}^{k}(n_j - 1)S_j^2$, where $S_j^2$ is the jth sample variance.

We now note, without proof, the following results.


i. $B = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2$ and $W = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X_{\cdot j})^2$ are independent of each other.

ii. $W/\sigma^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X_{\cdot j})^2/\sigma^2 \sim \chi^2_{n-k}$.

iii. Under $H_0: \mu_1 = \cdots = \mu_k$, then $B/\sigma^2 = \sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2/\sigma^2 \sim \chi^2_{k-1}$.

In order to test H0 : µ1 = µ2 = · · · = µk , we define the following test statistic:


$$F = \frac{\sum_{j=1}^{k} n_j(\bar X_{\cdot j} - \bar X)^2/(k-1)}{\sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar X_{\cdot j})^2/(n-k)} = \frac{B/(k-1)}{W/(n-k)}.$$

Under H0 , F ∼ Fk−1, n−k . We reject H0 at the 100α% significance level if:

f > Fα, k−1, n−k

where Fα, k−1, n−k is the top 100αth percentile of the Fk−1, n−k distribution, i.e.
P (F > Fα, k−1, n−k ) = α, and f is the observed test statistic value.


The p-value of the test is:

p-value = P (F > f ).

It is clear that f > Fα, k−1, n−k if and only if the p-value < α, as we must reach the same
conclusion regardless of whether we use the critical value approach or the p-value
approach to hypothesis testing.

One-way ANOVA table

Typically, one-way ANOVA results are presented in a table as follows:


Source    DF       SS       MS            F                           p-value
Factor    k − 1    B        B/(k − 1)     [B/(k − 1)]/[W/(n − k)]     p
Error     n − k    W        W/(n − k)
Total     n − 1    B + W

Example 8.2 Continuing with Example 8.1, for the given data, k = 3,
n1 = n2 = n3 = 6, n = n1 + n2 + n3 = 18, x̄·1 = 79, x̄·2 = 74, x̄·3 = 66 and x̄ = 73.
The sample variances are calculated to be $s_1^2 = 34$, $s_2^2 = 20$ and $s_3^2 = 32$. Therefore:
$$b = \sum_{j=1}^{3} 6(\bar x_{\cdot j} - \bar x)^2 = 6 \times ((79-73)^2 + (74-73)^2 + (66-73)^2) = 516$$
and:
$$w = \sum_{j=1}^{3}\sum_{i=1}^{6}(x_{ij} - \bar x_{\cdot j})^2 = \sum_{j=1}^{3}\sum_{i=1}^{6} x_{ij}^2 - 6\sum_{j=1}^{3}\bar x_{\cdot j}^2 = \sum_{j=1}^{3} 5s_j^2 = 5 \times (34 + 20 + 32) = 430.$$
Hence:
$$f = \frac{b/(k-1)}{w/(n-k)} = \frac{516/2}{430/15} = 9.$$
Under H0 : µ1 = µ2 = µ3 , F ∼ Fk−1, n−k = F2, 15 . Since F0.01, 2, 15 = 6.36 < 9, using
Table A.3 of the Dougherty Statistical Tables, we reject H0 at the 1% significance
level. In fact the p-value (using a computer) is P (F > 9) = 0.003. Therefore, we
conclude that there is a significant difference among the mean examination marks
across the three classes.


The one-way ANOVA table is as follows:

Source DF SS MS F p-value
Class 2 516 258 9 0.003
Error 15 430 28.67
Total 17 946
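The same table can be reproduced in R by entering the raw marks from Example 8.1 (a minimal sketch; the variable names marks and group are ours):

marks <- c(85, 75, 82, 76, 71, 85,   # Class 1
           71, 75, 73, 74, 69, 82,   # Class 2
           59, 64, 62, 69, 75, 67)   # Class 3
group <- factor(rep(c("Class 1", "Class 2", "Class 3"), each = 6))
anova(lm(marks ~ group))             # F = 9 on (2, 15) degrees of freedom, p-value = 0.003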

Example 8.3 A study performed by a Columbia University professor counted the


number of times per minute professors from three different departments said ‘uh’ or
‘ah’ during lectures to fill gaps between words. The data listed in ‘UhAh.csv’
(available on the VLE) were derived from observing 100 minutes from each of the
three departments. If we assume that the more frequent use of ‘uh’ or ‘ah’ results in
more boring lectures, can we conclude that some departments’ professors are more
boring than others?
The counts for English, Mathematics and Political Science departments are stored.
As always in statistical analysis, we first look at the summary (descriptive) statistics
of these data.

> attach(UhAh)
> summary(UhAh)
Frequency Department
Min. : 0.00 English :100
1st Qu.: 4.00 Mathematics :100
Median : 5.00 Political Science:100
Mean : 5.48
3rd Qu.: 7.00
Max. :11.00
> xbar <- tapply(Frequency, Department, mean)
> s <- tapply(Frequency, Department, sd)
> n <- tapply(Frequency, Department, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
English Mathematics Political Science
5.81 5.30 5.33

[[2]]
English Mathematics Political Science
2.493203 2.012587 1.974867

[[3]]
English Mathematics Political Science
100 100 100

[[4]]
English Mathematics Political Science
0.2493203 0.2012587 0.1974867


Surprisingly, professors in English say ‘uh’ or ‘ah’ more on average than those in
Mathematics and Political Science (compare the sample means of 5.81, 5.30 and
5.33), but the difference seems small. However, we need to formally test whether the
(seemingly small) differences are statistically significant.
Using the data, R produces the following one-way ANOVA table:

> anova(lm(Frequency ~ Department))


Analysis of Variance Table

Response: Frequency
Df Sum Sq Mean Sq F value Pr(>F)
Department 2 16.38 8.1900 1.7344 0.1783
Residuals 297 1402.50 4.7222
Since the p-value for the F test is 0.1783, we cannot reject the following hypothesis:

H0 : µ1 = µ2 = µ3 .

Therefore, there is no evidence of a difference in the mean number of ‘uh’s or ‘ah’s


said by professors across the three departments.

In addition to a one-way ANOVA table, we can also obtain the following.

An estimator of σ is:
$$\hat\sigma = S = \sqrt{\frac{W}{n-k}}.$$
95% confidence intervals for $\mu_j$ are given by:
$$\bar X_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} \quad \text{for } j = 1, 2, \ldots, k$$

where t0.025, n−k is the top 2.5th percentile of the Student’s tn−k distribution, which
can be obtained from Table 10 of the New Cambridge Statistical Tables.

Example 8.4 Assuming a common variance for each group, from the preceding
output in Example 8.3 we see that:

$$\hat\sigma = s = \sqrt{\frac{1{,}402.50}{297}} = \sqrt{4.72} = 2.173.$$
Since $t_{0.025,\,297} \approx t_{0.025,\,\infty} = 1.96$, using Table 10 of the New Cambridge Statistical Tables, we obtain the following 95% confidence intervals for $\mu_1$, $\mu_2$ and $\mu_3$, respectively:
$$j = 1: \quad 5.81 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (5.38, 6.24)$$
$$j = 2: \quad 5.30 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.87, 5.73)$$
$$j = 3: \quad 5.33 \pm 1.96 \times \frac{2.173}{\sqrt{100}} \;\Rightarrow\; (4.90, 5.76).$$

R can produce the following:

> stripchart(Frequency ~ Department,pch=16,vert=T)


> arrows(1:3,xbar+1.96*2.173/sqrt(n),1:3,xbar-1.96*2.173/sqrt(n),
angle=90,code=3,length=0.1)
> lines(1:3,xbar,pch=4,type="b",cex=2)
These 95% confidence intervals can be seen plotted in the R output below. Note that
these confidence intervals all overlap, which is consistent with our failure to reject
the null hypothesis that all population means are equal.
Figure 8.1: Overlapping confidence intervals (Frequency plotted for each Department: English, Mathematics and Political Science, with 95% confidence intervals).

Example 8.5 In early 2001, the American economy was slowing down and
companies were laying off workers. A poll conducted during February 2001 asked a
random sample of workers how long (in months) it would be before they faced
significant financial hardship if they lost their jobs, with the data available in the file
‘GallupPoll.csv’ (available on the VLE). They are classified into four groups
according to their incomes. Below is part of the R output of the descriptive statistics


of the classified data. Can we infer that income group has a significant impact on the
mean length of time before facing financial hardship?

Hardship Income.group
Min. : 0.00 $20 to 30K: 81
1st Qu.: 8.00 $30 to 50K:114
Median :15.00 Over $50K : 39
Mean :16.11 Under $20K: 67
3rd Qu.:22.00
Max. :50.00

> xbar <- tapply(Hardship, Income.group, mean)


> s <- tapply(Hardship, Income.group, sd)
> n <- tapply(Hardship, Income.group, length)
> sem <- s/sqrt(n)
> list(xbar,s,n,sem)
[[1]]
$20 to 30K $30 to 50K Over $50K Under $20K
15.493827 18.456140 22.205128 9.313433

[[2]]
$20 to 30K $30 to 50K Over $50K Under $20K
9.233260 9.507464 11.029099 8.087043

[[3]]
$20 to 30K $30 to 50K Over $50K Under $20K
81 114 39 67

[[4]]
$20 to 30K $30 to 50K Over $50K Under $20K
1.0259178 0.8904556 1.7660693 0.9879896
Inspection of the sample means suggests that there is a difference between income
groups, but we need to conduct a one-way ANOVA test to see whether the
differences are statistically significant.
We apply one-way ANOVA to test whether the means in the k = 4 groups are equal,
i.e. H0 : µ1 = µ2 = µ3 = µ4 , from highest to lowest income groups.
We have $n_1 = 39$, $n_2 = 114$, $n_3 = 81$ and $n_4 = 67$, hence:
$$n = \sum_{j=1}^{k} n_j = 39 + 114 + 81 + 67 = 301.$$
Also $\bar x_{\cdot 1} = 22.21$, $\bar x_{\cdot 2} = 18.456$, $\bar x_{\cdot 3} = 15.49$, $\bar x_{\cdot 4} = 9.313$ and:
$$\bar x = \frac{1}{n}\sum_{j=1}^{k} n_j \bar x_{\cdot j} = \frac{39 \times 22.21 + 114 \times 18.456 + 81 \times 15.49 + 67 \times 9.313}{301} = 16.109.$$


Now:
$$b = \sum_{j=1}^{k} n_j(\bar x_{\cdot j} - \bar x)^2 = 39 \times (22.21 - 16.109)^2 + 114 \times (18.456 - 16.109)^2 + 81 \times (15.49 - 16.109)^2 + 67 \times (9.313 - 16.109)^2 = 5{,}205.097.$$
We have $s_1^2 = (11.03)^2 = 121.661$, $s_2^2 = (9.507)^2 = 90.383$, $s_3^2 = (9.23)^2 = 85.193$ and $s_4^2 = (8.087)^2 = 65.400$, hence:
$$w = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar x_{\cdot j})^2 = \sum_{j=1}^{k}(n_j - 1)s_j^2 = 38 \times 121.661 + 113 \times 90.383 + 80 \times 85.193 + 66 \times 65.400 = 25{,}968.24.$$
Consequently:
$$f = \frac{b/(k-1)}{w/(n-k)} = \frac{5{,}205.097/3}{25{,}968.24/(301-4)} = 19.84.$$
Under H0 , F ∼ Fk−1, n−k = F3, 297 . Since F0.01, 3, 297 ≈ 3.85 < 19.84, we reject H0 at
the 1% significance level, i.e. there is strong evidence that income group has a
significant impact on the mean length of time before facing financial hardship.
The pooled estimate of σ is:
$$s = \sqrt{\frac{w}{n-k}} = \sqrt{\frac{25{,}968.24}{301-4}} = 9.351.$$

A 95% confidence interval for $\mu_j$ is:
$$\bar x_{\cdot j} \pm t_{0.025,\,297} \times \frac{s}{\sqrt{n_j}} = \bar x_{\cdot j} \pm 1.96 \times \frac{9.351}{\sqrt{n_j}} = \bar x_{\cdot j} \pm \frac{18.328}{\sqrt{n_j}}.$$
Hence, for example, a 95% confidence interval for $\mu_1$ is:
$$22.21 \pm \frac{18.328}{\sqrt{39}} \;\Rightarrow\; (19.28, 25.14)$$
and a 95% confidence interval for $\mu_4$ is:
$$9.313 \pm \frac{18.328}{\sqrt{67}} \;\Rightarrow\; (7.07, 11.55).$$

Notice that these two confidence intervals do not overlap, which is consistent with
our conclusion that there is a difference between the group means.
R output for the data is:


> anova(lm(Hardship ~ Income.group))


Analysis of Variance Table

Response: Hardship
Df Sum Sq Mean Sq F value Pr(>F)
Income.group 3 5202.1 1734.03 19.828 9.636e-12 ***
Residuals 297 25973.3 87.45
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that minor differences are due to rounding errors in calculations.

8.6 From one-way to two-way ANOVA


One-way ANOVA: a review
We have independent observations Xij ∼ N (µj , σ 2 ) for i = 1, 2, . . . , nj and
j = 1, 2, . . . , k. We are interested in testing:
H0 : µ1 = µ2 = · · · = µk .
The variation of the Xij s is driven by a factor at different levels µ1 , µ2 , . . . , µk , in
addition to random fluctuations (i.e. random errors). We test whether such a factor
effect exists or not. We can model a one-way ANOVA problem as follows:
Xij = µ + βj + εij for i = 1, 2, . . . , nj and j = 1, 2, . . . , k
where $\varepsilon_{ij} \sim N(0, \sigma^2)$ and the $\varepsilon_{ij}$s are independent. $\mu$ is the average effect and $\beta_j$ is the factor (or treatment) effect at the jth level. Note that $\sum_{j=1}^{k}\beta_j = 0$. The null hypothesis (i.e. that the group means are all equal) can also be expressed as:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0.$$

8.7 Two-way analysis of variance


Two-way analysis of variance (two-way ANOVA) involves a continuous dependent
variable and two categorical independent variables (factors). Two-way ANOVA models
the observations as:
Xij = µ + γi + βj + εij for i = 1, 2, . . . , r and j = 1, 2, . . . , c
where:

µ represents the average effect


β1 , β2 , . . . , βc represent c different treatment (column) levels
γ1 , γ2 , . . . , γr represent r different block (row) levels
εij ∼ N (0, σ 2 ) and the εij s are independent.


In total, there are n = r × c observations. We now consider the conditions to make the
parameters µ, γi and βj identifiable for i = 1, 2, . . . , r and j = 1, 2, . . . , c. The conditions
are:

γ1 + γ2 + · · · + γr = 0 and β1 + β2 + · · · + βc = 0.

We will be interested in testing the following hypotheses.

The ‘no treatment (column) effect’ hypothesis of H0 : β1 = β2 = · · · = βc = 0.

The ‘no block (row) effect’ hypothesis of H0 : γ1 = γ2 = · · · = γr = 0.

We now introduce statistics associated with two-way ANOVA.

Statistics associated with two-way ANOVA

The sample mean at the ith block level is:
$$\bar X_{i\cdot} = \frac{1}{c}\sum_{j=1}^{c} X_{ij} \quad \text{for } i = 1, 2, \ldots, r.$$
The sample mean at the jth treatment level is:
$$\bar X_{\cdot j} = \frac{1}{r}\sum_{i=1}^{r} X_{ij} \quad \text{for } j = 1, 2, \ldots, c.$$
The overall sample mean is:
$$\bar X = \bar X_{\cdot\cdot} = \frac{1}{n}\sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}.$$
The total variation (with $rc - 1$ degrees of freedom) is:
$$\text{Total SS} = \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar X)^2.$$
The between-blocks (rows) variation (with $r - 1$ degrees of freedom) is:
$$B_{\text{row}} = c\sum_{i=1}^{r}(\bar X_{i\cdot} - \bar X)^2.$$
The between-treatments (columns) variation (with $c - 1$ degrees of freedom) is:
$$B_{\text{col}} = r\sum_{j=1}^{c}(\bar X_{\cdot j} - \bar X)^2.$$
The residual (error) variation (with $(r-1)(c-1)$ degrees of freedom) is:
$$\text{Residual SS} = \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X)^2.$$
The (two-way) ANOVA decomposition is:
$$\sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar X)^2 = c\sum_{i=1}^{r}(\bar X_{i\cdot} - \bar X)^2 + r\sum_{j=1}^{c}(\bar X_{\cdot j} - \bar X)^2 + \sum_{i=1}^{r}\sum_{j=1}^{c}(X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X)^2.$$

The total variation is a measure of the overall (total) variability in the data and the
(two-way) ANOVA decomposition decomposes this into three components:
between-blocks variation (which is attributable to the row factor level),
between-treatments variation (which is attributable to the column factor level) and
residual variation (which is attributable to the variation not explained by the row and
column factors).
The following are some useful formulae for manual computations.

Row sample means: $\bar X_{i\cdot} = \sum_{j=1}^{c} X_{ij}/c$, for $i = 1, 2, \ldots, r$.

Column sample means: $\bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r$, for $j = 1, 2, \ldots, c$.

Overall sample mean: $\bar X = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}/n = \sum_{i=1}^{r}\bar X_{i\cdot}/r = \sum_{j=1}^{c}\bar X_{\cdot j}/c$.

Total SS $= \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - rc\bar X^2$.

Between-blocks (rows) variation: $B_{\text{row}} = c\sum_{i=1}^{r}\bar X_{i\cdot}^2 - rc\bar X^2$.

Between-treatments (columns) variation: $B_{\text{col}} = r\sum_{j=1}^{c}\bar X_{\cdot j}^2 - rc\bar X^2$.

Residual SS $=$ (Total SS) $- B_{\text{row}} - B_{\text{col}} = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - c\sum_{i=1}^{r}\bar X_{i\cdot}^2 - r\sum_{j=1}^{c}\bar X_{\cdot j}^2 + rc\bar X^2$.


In order to test the ‘no block (row) effect’ hypothesis of $H_0: \gamma_1 = \gamma_2 = \cdots = \gamma_r = 0$, the test statistic is defined as:
$$F = \frac{B_{\text{row}}/(r-1)}{\text{Residual SS}/((r-1)(c-1))} = \frac{(c-1)B_{\text{row}}}{\text{Residual SS}}.$$
Under $H_0$, $F \sim F_{r-1,\,(r-1)(c-1)}$. We reject $H_0$ at the 100α% significance level if:
$$f > F_{\alpha,\, r-1,\,(r-1)(c-1)}$$
where $F_{\alpha,\, r-1,\,(r-1)(c-1)}$ is the top 100αth percentile of the $F_{r-1,\,(r-1)(c-1)}$ distribution, i.e. $P(F > F_{\alpha,\, r-1,\,(r-1)(c-1)}) = \alpha$, and $f$ is the observed test statistic value.
The p-value of the test is:
$$\text{p-value} = P(F > f).$$
In order to test the ‘no treatment (column) effect’ hypothesis of $H_0: \beta_1 = \beta_2 = \cdots = \beta_c = 0$, the test statistic is defined as:
$$F = \frac{B_{\text{col}}/(c-1)}{\text{Residual SS}/((r-1)(c-1))} = \frac{(r-1)B_{\text{col}}}{\text{Residual SS}}.$$
Under $H_0$, $F \sim F_{c-1,\,(r-1)(c-1)}$. We reject $H_0$ at the 100α% significance level if:
$$f > F_{\alpha,\, c-1,\,(r-1)(c-1)}.$$
The p-value of the test is defined in the usual way.

Two-way ANOVA table

As with one-way ANOVA, two-way ANOVA results are presented in a table as follows:
Source           DF              SS             MS                            F                             p-value
Row factor       r − 1           Brow           Brow/(r − 1)                  (c − 1)Brow/Residual SS       p
Column factor    c − 1           Bcol           Bcol/(c − 1)                  (r − 1)Bcol/Residual SS       p
Residual         (r − 1)(c − 1)  Residual SS    Residual SS/((r − 1)(c − 1))
Total            rc − 1          Total SS

8.8 Residuals
Before considering an example of two-way ANOVA, we briefly consider residuals.
Recall the original two-way ANOVA model:
Xij = µ + γi + βj + εij .
We now decompose the observations as follows:
Xij = X̄ + (X̄i· − X̄) + (X̄·j − X̄) + (Xij − X̄i· − X̄·j + X̄)
for i = 1, 2, . . . , r and j = 1, 2, . . . , c, where we have the following point estimators.


$\hat\mu = \bar X$ is the point estimator of $\mu$.

$\hat\gamma_i = \bar X_{i\cdot} - \bar X$ is the point estimator of $\gamma_i$, for $i = 1, 2, \ldots, r$.

$\hat\beta_j = \bar X_{\cdot j} - \bar X$ is the point estimator of $\beta_j$, for $j = 1, 2, \ldots, c$.

It follows that the residual, i.e. the estimator of $\varepsilon_{ij}$, is:
$$\hat\varepsilon_{ij} = X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X$$
for $i = 1, 2, \ldots, r$ and $j = 1, 2, \ldots, c$.
The two-way ANOVA model assumes $\varepsilon_{ij} \sim N(0, \sigma^2)$ and so, if the model structure is correct, then the $\hat\varepsilon_{ij}$s should behave like independent $N(0, \sigma^2)$ random variables.

Example 8.6 The following table lists the percentage annual returns (calculated
four times per annum) of the Common Stock Index at the New York Stock Exchange
during 1981–85, available in the data file ‘NYSE.csv’ (available on the VLE).

1st quarter 2nd quarter 3rd quarter 4th quarter


1981 5.7 6.0 7.1 6.7
1982 7.2 7.0 6.1 5.2
1983 4.9 4.1 4.2 4.4
1984 4.5 4.9 4.5 4.5
1985 4.4 4.2 4.2 3.6

(a) Is the variability in returns from year to year statistically significant?

(b) Are returns affected by the quarter of the year?


Using two-way ANOVA, we test the no row effect hypothesis to answer (a), and test
the no column effect hypothesis to answer (b). We have r = 5 and c = 4.
The row sample means are calculated using $\bar X_{i\cdot} = \sum_{j=1}^{c} X_{ij}/c$, which gives 6.375, 6.375, 4.4, 4.6 and 4.1, for $i = 1, 2, \ldots, 5$, respectively.
The column sample means are calculated using $\bar X_{\cdot j} = \sum_{i=1}^{r} X_{ij}/r$, which gives 5.34, 5.24, 5.22 and 4.88, for $j = 1, 2, 3, 4$, respectively.
The overall sample mean is $\bar x = \sum_{i=1}^{r} \bar x_{i\cdot}/r = 5.17$.
The sum of the squared observations is $\sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij}^2 = 559.06$.


Hence we have the following.
$$\text{Total SS} = \sum_{i=1}^{r}\sum_{j=1}^{c} x_{ij}^2 - rc\bar x^2 = 559.06 - 20 \times (5.17)^2 = 559.06 - 534.578 = 24.482.$$
$$b_{\text{row}} = c\sum_{i=1}^{r}\bar x_{i\cdot}^2 - rc\bar x^2 = 4 \times 138.6112 - 534.578 = 19.867.$$
$$b_{\text{col}} = r\sum_{j=1}^{c}\bar x_{\cdot j}^2 - rc\bar x^2 = 5 \times 107.036 - 534.578 = 0.602.$$
$$\text{Residual SS} = (\text{Total SS}) - b_{\text{row}} - b_{\text{col}} = 24.482 - 19.867 - 0.602 = 4.013.$$

To test the no row effect hypothesis $H_0: \gamma_1 = \gamma_2 = \cdots = \gamma_5 = 0$, the test statistic value is:
$$f = \frac{(c-1)b_{\text{row}}}{\text{Residual SS}} = \frac{3 \times 19.867}{4.013} = 14.852.$$
Under $H_0$, $F \sim F_{r-1,\,(r-1)(c-1)} = F_{4,\,12}$. Using Table A.3 of the Dougherty Statistical Tables, since $F_{0.01,\,4,\,12} = 5.41 < 14.852$, we reject $H_0$ at the 1% significance level. We conclude that there is strong evidence that the return does depend on the year.

To test the no column effect hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$, the test statistic value is:
$$f = \frac{(r-1)b_{\text{col}}}{\text{Residual SS}} = \frac{4 \times 0.602}{4.013} = 0.600.$$
Under H0 , F ∼ Fc−1, (r−1)(c−1) = F3, 12 . Since F0.05, 3, 12 = 3.49 > 0.600, we cannot
reject H0 even at the 5% significance level. Therefore, there is no significant evidence
indicating that the return depends on the quarter.
The results may be summarised in a two-way ANOVA table as follows:

Source DF SS MS F p-value
Year 4 19.867 4.967 14.852 < 0.01
Quarter 3 0.602 0.201 0.600 > 0.10
Residual 12 4.013 0.334
Total 19 24.482

We could also provide 95% confidence interval estimates for each block and
treatment level by using the pooled estimator of σ 2 , which is:

$$S^2 = \frac{\text{Residual SS}}{(r-1)(c-1)} = \text{Residual MS}.$$
For the given data, $s^2 = 0.334$.
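If the ‘NYSE.csv’ file is not to hand, the same analysis can be run by entering the data from the table directly (a sketch; we reuse the variable names Return, Year and Quarter that appear in the R output below):

Return  <- c(5.7, 6.0, 7.1, 6.7,   # 1981
             7.2, 7.0, 6.1, 5.2,   # 1982
             4.9, 4.1, 4.2, 4.4,   # 1983
             4.5, 4.9, 4.5, 4.5,   # 1984
             4.4, 4.2, 4.2, 3.6)   # 1985
Year    <- factor(rep(1981:1985, each = 4))
Quarter <- factor(rep(paste0("Q", 1:4), times = 5))
anova(lm(Return ~ Year + Quarter))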


R produces the following output:

> anova(lm(Return ~ Year + Quarter))


Analysis of Variance Table

Response: Return
Df Sum Sq Mean Sq F value Pr(>F)
Year 4 19.867 4.9667 14.852 0.0001349 ***
Quarter 3 0.602 0.2007 0.600 0.6271918
Residuals 12 4.013 0.3344
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the confidence intervals for years 1 and 2 (corresponding to 1981 and
1982) are separated from those for years 3 to 5 (that is, 1983 to 1985), which is
consistent with rejection of H0 in the no row effect test. In contrast, the confidence
intervals for each quarter all overlap, which is consistent with our failure to reject H0
in the no column effect test.
Finally, we may also look at the residuals:
$$\hat\varepsilon_{ij} = X_{ij} - \hat\mu - \hat\gamma_i - \hat\beta_j \quad \text{for } i = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, c.$$
If the assumed normal model (structure) is correct, the $\hat\varepsilon_{ij}$s should behave like independent $N(0, \sigma^2)$ random variables.
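In R the residuals can be extracted from the fitted model and checked graphically (a minimal sketch using the model fitted in the output above; the object names fit and res are ours):

fit <- lm(Return ~ Year + Quarter)   # the two-way ANOVA model fitted above
res <- residuals(fit)                # the estimated error terms
qqnorm(res); qqline(res)             # rough check of the normality assumption
plot(fitted(fit), res)               # residuals should show no systematic pattern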

8.9 Overview of chapter

This chapter introduced analysis of variance as a statistical tool to detect differences


between group means. One-way and two-way analysis of variance frameworks were
presented depending on whether one or two independent variables were modelled,
respectively. Statistical inference in the form of hypothesis tests and confidence intervals
was conducted.

8.10 Key terms and concepts

ANOVA decomposition
Between-blocks variation
Between-groups variation
Between-treatments variation
One-way ANOVA
Random errors
Residual
Sample mean
Total variation
Two-way ANOVA
Within-groups variation


8.11 Sample examination questions


1. An indicator of the value of a stock relative to its earnings is its price-earnings
ratio. The following table provides the summary statistics of the price-earnings
ratios for a random sample of 36 stocks, 12 each from the financial, industrial and
pharmaceutical sectors.

Sector Sample mean Sample variance Sample size


Financial 32.50 3.86 12
Industrial 29.91 3.26 12
Pharmaceutical 29.31 3.14 12

You are also given that:
$$\sum_{j=1}^{3}\sum_{i=1}^{12} x_{ij}^2 = 33{,}829.70.$$

Test at the 5% significance level whether the true mean price-earnings ratios for
the three market sectors are the same. Use the ANOVA table format to summarise
your calculations. You may exclude the p-value.

2. The audience shares (in %) of three major television networks’ evening news
broadcasts in four major cities were examined. The average audience share for the
three networks (A, B and C) were 21.35%, 17.28% and 20.18%, respectively. The
following is the calculated ANOVA table with some entries missing.

Source     Degrees of freedom    Sum of squares    Mean square    F-value
City                                               1.95
Network
Error
Total                            51.52

(a) Complete the table using the information provided above.

(b) Test, at the 5% significance level, whether there is evidence of a difference in


audience shares between networks.

8.12 Solutions to Sample examination questions


1. For these n = 36 observations and k = 3 groups, we are given that x̄·1 = 32.50,
x̄·2 = 29.91 and x̄·3 = 29.31. Hence:
$$\bar x = \frac{32.50 + 29.91 + 29.31}{3} = 30.57.$$
Hence the total variation is:
$$\sum_{j=1}^{3}\sum_{i=1}^{12} x_{ij}^2 - n\bar x^2 = 33{,}829.70 - 36 \times (30.57)^2 = 186.80.$$
The between-groups variation is:
$$b = \sum_{j=1}^{3} n_j \bar x_{\cdot j}^2 - n\bar x^2 = 12 \times ((32.50)^2 + (29.91)^2 + (29.31)^2) - 36 \times (30.57)^2 = 76.31.$$
Therefore, w = 186.80 − 76.31 = 110.49. Hence the ANOVA table is:
Source DF SS MS F
Sector 2 76.31 38.16 11.39
Error 33 110.49 3.35
Total 35 186.80
We test:
H0 : PE ratio means are equal vs. H1 : PE ratio means are not equal
and we reject H0 if:
f > F0.05, 2, 33 ≈ 3.30.
Since 3.30 < 11.39, we reject H0 and conclude that there is evidence of a difference
in the mean price-earnings ratios across the sectors.

2. (a) The average audience share of all networks is:


$$\frac{21.35 + 17.28 + 20.18}{3} = 19.60.$$
Hence the sum of squares (SS) due to networks is:
$$4 \times ((21.35 - 19.60)^2 + (17.28 - 19.60)^2 + (20.18 - 19.60)^2) = 35.13$$
and the mean sum of squares (MS) due to networks is 35.13/(3 − 1) = 17.57.
The degrees of freedom are 4 − 1 = 3, 3 − 1 = 2, (4 − 1)(3 − 1) = 6 and
4 × 3 − 1 = 11 for cities, networks, error and total sum of squares, respectively.
The SS for cities is 3 × 1.95 = 5.85. We have that the SS due to residuals is
given by 51.52 − 5.85 − 35.13 = 10.54 and the MS is 10.54/6 = 1.76. The
F -values are 1.95/1.76 = 1.11 and 17.57/1.76 = 9.98 for cities and networks,
respectively.
Source Degrees of freedom Sum of squares Mean square F -value
City 3 5.85 1.95 1.11
Network 2 35.13 17.57 9.98
Error 6 10.54 1.76
Total 11 51.52

(b) We test H0 : There is no difference between networks against H1 : There is a


difference between networks. The F -value is 9.98 and at a 5% significance level
the critical value is F0.05, 2, 6 = 5.14, hence we reject H0 and conclude that there
is evidence of a difference between networks.

A total of 4,000 cans are opened around the world every second. Ten babies are
conceived around the world every second. Each time you open a can, you stand
a 1-in-400 chance of falling pregnant.
(True or false?)

Appendix A
Probability theory

A.1 Worked examples


1. A and B are independent events. Suppose that P (A) = 2π, P (B) = π and
P (A ∪ B) = 0.8. Evaluate π.
Solution:
We have:
$$P(A \cup B) = 0.8 = P(A) + P(B) - P(A \cap B) = P(A) + P(B) - P(A)\,P(B) = 2\pi + \pi - 2\pi^2.$$
Therefore:
$$2\pi^2 - 3\pi + 0.8 = 0 \quad\Rightarrow\quad \pi = \frac{3 \pm \sqrt{9 - 6.4}}{4}.$$
Hence π = 0.346887, since the other root is > 1!

2. A and B are events such that P (A | B) > P (A). Prove that:

P (Ac | B c ) > P (Ac )

where Ac and B c are the complements of A and B, respectively, and P (B c ) > 0.


Solution:
From the definition of conditional probability:
$$P(A^c \mid B^c) = \frac{P(A^c \cap B^c)}{P(B^c)} = \frac{P((A \cup B)^c)}{P(B^c)} = \frac{1 - P(A) - P(B) + P(A \cap B)}{1 - P(B)}.$$
However:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} > P(A) \quad \text{i.e.} \quad P(A \cap B) > P(A)\,P(B).$$
Hence:
$$P(A^c \mid B^c) > \frac{1 - P(A) - P(B) + P(A)\,P(B)}{1 - P(B)} = 1 - P(A) = P(A^c).$$


3. A, B and C are independent events. Prove that A and (B ∪ C) are independent.


Solution:
Using the distributive law:
P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C))
= P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C)
= P (A) P (B) + P (A) P (C) − P (A) P (B) P (C)
= P (A) (P (B) + P (C) − P (B) P (C))
= P (A) P (B ∪ C).

4. A and B are any two events in the sample space S. The binary set operator ∨
denotes an exclusive union, such that:
A ∨ B = (A ∪ B) ∩ (A ∩ B)ᶜ = {s | s ∈ A or B, and s ∉ (A ∩ B)}.
Show, from the axioms of probability, that:
(a) P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B)
(b) P (A ∨ B | A) = 1 − P (B | A).

Solution:
(a) We have:
A ∨ B = (A ∩ B c ) ∪ (B ∩ Ac ).
By axiom 3, noting that (A ∩ B c ) and (B ∩ Ac ) are disjoint:
P (A ∨ B) = P (A ∩ B c ) + P (B ∩ Ac ).
We can write A = (A ∩ B) ∪ (A ∩ B c ), hence (using axiom 3):
P (A ∩ B c ) = P (A) − P (A ∩ B).
Similarly, P (B ∩ Ac ) = P (B) − P (A ∩ B), hence:
P (A ∨ B) = P (A) + P (B) − 2 × P (A ∩ B).

(b) We have:
$$P(A \vee B \mid A) = \frac{P((A \vee B) \cap A)}{P(A)} = \frac{P(A \cap B^c)}{P(A)} = \frac{P(A) - P(A \cap B)}{P(A)} = \frac{P(A)}{P(A)} - \frac{P(A \cap B)}{P(A)} = 1 - P(B \mid A).$$


5. State and prove Bayes’ theorem.

Solution:
Bayes’ theorem is:
$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{K} P(A \mid B_i)\,P(B_i)}.$$
By definition:
$$P(B_j \mid A) = \frac{P(B_j \cap A)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{P(A)}.$$
If $\{B_i\}$, for $i = 1, 2, \ldots, K$, is a partition of the sample space S, then:
$$P(A) = \sum_{i=1}^{K} P(A \cap B_i) = \sum_{i=1}^{K} P(A \mid B_i)\,P(B_i).$$

Hence the result.

6. A man has two bags. Bag A contains five keys and bag B contains seven keys. Only
one of the twelve keys fits the lock which he is trying to open. The man selects a
bag at random, picks out a key from the bag at random and tries that key in the
lock. What is the probability that the key he has chosen fits the lock?

Solution:
Define a partition {Ci }, such that:

$$C_1 = \text{key in bag A and bag A chosen} \;\Rightarrow\; P(C_1) = \frac{5}{12} \times \frac{1}{2} = \frac{5}{24}$$
$$C_2 = \text{key in bag B and bag A chosen} \;\Rightarrow\; P(C_2) = \frac{7}{12} \times \frac{1}{2} = \frac{7}{24}$$
$$C_3 = \text{key in bag A and bag B chosen} \;\Rightarrow\; P(C_3) = \frac{5}{12} \times \frac{1}{2} = \frac{5}{24}$$
$$C_4 = \text{key in bag B and bag B chosen} \;\Rightarrow\; P(C_4) = \frac{7}{12} \times \frac{1}{2} = \frac{7}{24}.$$
Hence we require, defining the event F = ‘key fits’:
$$P(F) = \frac{1}{5} \times P(C_1) + \frac{1}{7} \times P(C_4) = \frac{1}{5} \times \frac{5}{24} + \frac{1}{7} \times \frac{7}{24} = \frac{1}{12}.$$
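A quick simulation in R supports this result (a sketch; we label the keys 1 to 12 and place keys 1–5 in bag A):

set.seed(1)
sim_one <- function() {
  correct <- sample(1:12, 1)                # which of the 12 keys fits the lock
  bag     <- sample(c("A", "B"), 1)         # bag chosen at random
  keys    <- if (bag == "A") 1:5 else 6:12  # keys 1-5 in bag A, keys 6-12 in bag B
  sample(keys, 1) == correct                # one key picked at random from that bag
}
mean(replicate(1e5, sim_one()))             # close to 1/12 = 0.0833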

7. Continuing with Question 6, suppose the first key chosen does not fit the lock.
What is the probability that the bag chosen:
(a) is bag A?
(b) contains the required key?


Solution:

(a) We require $P(\text{bag A} \mid F^c)$ which is:
$$P(\text{bag A} \mid F^c) = \frac{P(F^c \mid C_1)\,P(C_1) + P(F^c \mid C_2)\,P(C_2)}{\sum_{i=1}^{4} P(F^c \mid C_i)\,P(C_i)}.$$
The conditional probabilities are:
$$P(F^c \mid C_1) = \frac{4}{5}, \quad P(F^c \mid C_2) = 1, \quad P(F^c \mid C_3) = 1 \quad \text{and} \quad P(F^c \mid C_4) = \frac{6}{7}.$$
Hence:
$$P(\text{bag A} \mid F^c) = \frac{4/5 \times 5/24 + 1 \times 7/24}{4/5 \times 5/24 + 1 \times 7/24 + 1 \times 5/24 + 6/7 \times 7/24} = \frac{1}{2}.$$

(b) We require $P(\text{right bag} \mid F^c)$ which is:
$$P(\text{right bag} \mid F^c) = \frac{P(F^c \mid C_1)\,P(C_1) + P(F^c \mid C_4)\,P(C_4)}{\sum_{i=1}^{4} P(F^c \mid C_i)\,P(C_i)} = \frac{4/5 \times 5/24 + 6/7 \times 7/24}{4/5 \times 5/24 + 1 \times 7/24 + 1 \times 5/24 + 6/7 \times 7/24} = \frac{5}{11}.$$

8. Assume that a calculator has a ‘random number’ key and that when the key is
pressed an integer between 0 and 999 inclusive is generated at random, all numbers
being generated independently of one another.
(a) What is the probability that the number generated is less than 300?
(b) If two numbers are generated, what is the probability that both are less than
300?
(c) If two numbers are generated, what is the probability that the first number
exceeds the second number?
(d) If two numbers are generated, what is the probability that the first number
exceeds the second number, and their sum is exactly 300?
(e) If five numbers are generated, what is the probability that at least one number
occurs more than once?

Solution:

(a) Simply 300/1,000 = 0.3.


(b) Simply 0.3 × 0.3 = 0.09.


(c) Suppose $P(\text{first greater}) = x$, then by symmetry we have that $P(\text{second greater}) = x$. However, the probability that both are equal is (by counting):
$$\frac{|\{0,0\}, \{1,1\}, \ldots, \{999,999\}|}{1{,}000{,}000} = \frac{1{,}000}{1{,}000{,}000} = 0.001.$$
Hence $x + x + 0.001 = 1$, so $x = 0.4995$.


(d) The following cases apply: {300, 0}, {299, 1}, . . . , {151, 149}, i.e. there are 150 possibilities from $10^6$. So the required probability is:
$$\frac{150}{1{,}000{,}000} = 0.00015.$$

(e) The probability that they are all different is:
$$1 \times \frac{999}{1{,}000} \times \frac{998}{1{,}000} \times \frac{997}{1{,}000} \times \frac{996}{1{,}000}.$$
Note that the first number can be any number (with probability 1). Subtracting from 1 gives the required probability, i.e. 0.009965.
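The product is quickly evaluated in R:

1 - prod((999:996) / 1000)   # probability of at least one repeat among five numbers, 0.009965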

9. Suppose that three components numbered 1, 2 and 3 have probabilities of failure


π1 , π2 and π3 , respectively. Determine the probability of a system failure in each of
the following cases where component failures are assumed to be independent.
(a) Parallel system – the system fails if all components fail.
(b) Series system – the system fails unless all components do not fail.
(c) Mixed system – the system fails if component 1 fails or if both component 2
and component 3 fail.

Solution:
(a) Since the component failures are independent, the probability of system failure
is π1 π2 π3 .
(b) The probability that component i does not fail is 1 − πi , hence the probability
that the system does not fail is (1 − π1 )(1 − π2 )(1 − π3 ), and so the probability
that the system fails is:

1 − (1 − π1 )(1 − π2 )(1 − π3 ).

(c) Components 2 and 3 may be combined to form a notional component 4 with


failure probability π2 π3 . So the system is equivalent to a component with
failure probability π1 and another component with failure probability π2 π3 ,
these being connected in series. Therefore, the failure probability is:

1 − (1 − π1 )(1 − π2 π3 ) = π1 + π2 π3 − π1 π2 π3 .


10. Why is S = {1, 1, 2}, not a sensible way to try to define a sample space?
Solution:
Because there is no need to list the elementary outcome ‘1’ twice. It is much clearer
to write S = {1, 2}.

11. Write out all the events for the sample space S = {a, b, c}. (There are eight of
them.)
Solution:
The possible events are {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c} (the sample
space S) and ∅.

12. For an event A, work out a simpler way to express the events A ∩ S, A ∪ S, A ∩ ∅
and A ∪ ∅.
Solution:
We have:

A ∩ S = A, A ∪ S = S, A ∩ ∅ = ∅ and A ∪ ∅ = A.

13. If all elementary outcomes are equally likely, S = {a, b, c, d}, A = {a, b, c} and
B = {c, d}, find P (A | B) and P (B | A).
Solution:
S has 4 elementary outcomes which are equally likely, so each elementary outcome
has probability 1/4.
We have:
P (A ∩ B) P ({c}) 1/4 1
P (A | B) = = = =
P (B) P ({c, d}) 1/4 + 1/4 2
and:
P (B ∩ A) P ({c}) 1/4 1
P (B | A) = = = = .
P (A) P ({a, b, c}) 1/4 + 1/4 + 1/4 3

14. Suppose that we toss a fair coin twice. The sample space is given by
S = {HH, HT, T H, T T }, where the elementary outcomes are defined in the
obvious way – for instance HT is heads on the first toss and tails on the second
toss. Show that if all four elementary outcomes are equally likely, then the events
‘heads on the first toss’ and ‘heads on the second toss’ are independent.
Solution:
Note carefully here that we have equally likely elementary outcomes (due to the
coin being fair), so that each has probability 1/4, and the independence follows.
The event ‘heads on the first toss’ is A = {HH, HT } and has probability 1/2,
because it is specified by two elementary outcomes. The event ‘heads on the second
toss’ is B = {HH, T H} and has probability 1/2. The event ‘heads on the first toss
and the second toss’ is A ∩ B = {HH} and has probability 1/4. So the


multiplication property P (A ∩ B) = 1/4 = 1/2 × 1/2 = P (A) P (B) is satisfied, and


the two events are independent.

15. Show that if A and B are disjoint events, and are also independent, then P (A) = 0 or P (B) = 0.¹
Solution:
It is important to get the logical flow in the right direction here. We are told that
A and B are disjoint events, that is:

A ∩ B = ∅.

So:
P (A ∩ B) = 0.
We are also told that A and B are independent, that is:

P (A ∩ B) = P (A) P (B).

It follows that:
0 = P (A) P (B)
and so either P (A) = 0 or P (B) = 0.

16. Write down the condition for three events A, B and C to be independent.
Solution:
Applying the product rule, we must have:

P (A ∩ B ∩ C) = P (A) P (B) P (C).

Therefore, since all subsets of two events from A, B and C must be independent,
we must also have:

P (A ∩ B) = P (A) P (B)
P (A ∩ C) = P (A) P (C)

and:
P (B ∩ C) = P (B) P (C).
One must check that all four conditions hold to verify independence of A, B and C.

17. Prove the simplest version of Bayes’ theorem from first principles.
Solution:
Applying the definition of conditional probability, we have:
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}.$$

1
Note that independence and disjointness are not similar ideas.


18. A statistics teacher knows from past experience that a student who does their
homework consistently has a probability of 0.95 of passing the examination,
whereas a student who does not do their homework has a probability of 0.30 of
passing.

(a) If 25% of students do their homework consistently, what percentage of all


students can expect to pass?

(b) If a student chosen at random from the group gets a pass, what is the
probability that the student has done their homework consistently?

Solution:

Here the random experiment is to choose a student at random, and to record


whether the student passes (P ) or fails (F ), and whether the student has done
their homework consistently (C) or has not (N ).2 The sample space is
S = {P C, P N, F C, F N }. We use the events Pass = {P C, P N }, and Fail
= {F C, F N }. We consider the sample space partitioned by Homework
= {P C, F C}, and No Homework = {P N, F N }.

(a) The first part of the example asks for the denominator of Bayes’ theorem:

P (Pass) = P (Pass | Homework) P (Homework)


+ P (Pass | No Homework) P (No Homework)
= 0.95 × 0.25 + 0.30 × (1 − 0.25)
= 0.2375 + 0.225
= 0.4625.

(b) Now applying Bayes’ theorem:
$$P(\text{Homework} \mid \text{Pass}) = \frac{P(\text{Homework} \cap \text{Pass})}{P(\text{Pass})} = \frac{P(\text{Pass} \mid \text{Homework})\,P(\text{Homework})}{P(\text{Pass})} = \frac{0.95 \times 0.25}{0.4625} = 0.5135.$$
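Both calculations can be checked in R (a sketch; the object names are ours):

p_pass_hw <- 0.95; p_pass_nohw <- 0.30; p_hw <- 0.25
p_pass <- p_pass_hw * p_hw + p_pass_nohw * (1 - p_hw)   # 0.4625
p_pass_hw * p_hw / p_pass                                # P(Homework | Pass) = 0.5135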

Alternatively, we could arrange the calculations in a tree diagram, branching first on Homework/No Homework and then on Pass/Fail within each branch.

2
Notice that F = P c and N = C c .


19. Plagiarism is a serious problem for assessors of coursework. One check on


plagiarism is to compare the coursework with a standard text. If the coursework
has plagiarised the text, then there will be a 95% chance of finding exactly two
phrases which are the same in both coursework and text, and a 5% chance of
finding three or more phrases. If the work is not plagiarised, then these
probabilities are both 50%.
Suppose that 5% of coursework is plagiarised. An assessor chooses some coursework
at random. What is the probability that it has been plagiarised if it has exactly two
phrases in the text?3
What if there are three or more phrases? Did you manage to get a roughly correct
guess of these results before calculating?
Solution:
Suppose that two phrases are the same. We use Bayes’ theorem:
$$P(\text{plagiarised} \mid \text{two the same}) = \frac{0.95 \times 0.05}{0.95 \times 0.05 + 0.5 \times 0.95} = 0.0909.$$
Finding two phrases has increased the chance the work is plagiarised from 5% to 9.1%. Did you get anywhere near 9% when guessing? Now suppose that we find three or more phrases:
$$P(\text{plagiarised} \mid \text{three or more the same}) = \frac{0.05 \times 0.05}{0.05 \times 0.05 + 0.5 \times 0.95} = 0.0052.$$
It seems that no plagiariser is silly enough to keep three or more phrases the same,
so if we find three or more, the chance of the work being plagiarised falls from 5%
to 0.5%! How close did you get by guessing?
3
Try making a guess before doing the calculation!


A.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. (a) A, B and C are any three events in the sample space S. Prove that:

P (A∪B∪C) = P (A)+P (B)+P (C)−P (A∩B)−P (B∩C)−P (A∩C)+P (A∩B∩C).

(b) A and B are events in a sample space S. Show that:

$$P(A \cap B) \le \frac{P(A) + P(B)}{2} \le P(A \cup B).$$

2. Suppose A and B are events with P (A) = p, P (B) = 2p and P (A ∪ B) = 0.75.


(a) Evaluate p and P (A | B) if A and B are independent events.
(b) Evaluate p and P (A | B) if A and B are mutually exclusive events.

3. (a) Show that if A and B are independent events in a sample space, then Ac and
B c are also independent.
(b) Show that if X and Y are mutually exclusive events in a sample space, then
X c and Y c are not in general mutually exclusive.

Appendix B
Discrete probability distributions

B.1 Worked examples


1. A fair die is thrown once. Determine the probability distribution of the value of the
upturned face, X, and find its mean and variance.

Solution:
Each face has an equal chance of 1/6 of turning up, so we get the following table:

X=x 1 2 3 4 5 6 Total
P (X = x) 1/6 1/6 1/6 1/6 1/6 1/6 1
x P (X = x) 1/6 2/6 3/6 4/6 5/6 6/6 21/6 = 3.5
x2 P (X = x) 1/6 4/6 9/6 16/6 25/6 36/6 91/6

Hence the mean is E(X) = 3.5. The variance is E(X 2 ) − µ2 = 91/6 − (3.5)2 = 2.92.

2. Two fair dice are thrown.


(a) Determine the probability distribution of the sum of the two dice, X, and find
its mean and variance.

(b) Determine the probability distribution of the absolute difference of the two
dice, Y , and find its mean and variance.

Solution:

(a) The pattern is made clearer by using the same denominator (i.e. 36) below.

X=x 2 3 4 5 6 7
P (X = x) 1/36 2/36 3/36 4/36 5/36 6/36
x P (X = x) 2/36 6/36 12/36 20/36 30/36 42/36
x2 P (X = x) 4/36 18/36 48/36 100/36 180/36 294/36

X=x 8 9 10 11 12 Total
P (X = x) 5/36 4/36 3/36 2/36 1/36 1
x P (X = x) 40/36 36/36 30/36 22/36 12/36 252/36
x2 P (X = x) 320/36 324/36 300/36 242/36 144/36 1,974/36

This yields a mean of E(X) = 252/36 = 7, and a variance of E(X 2 ) − µ2


= (1,974/36) − 72 = 5.83. Although not required, a plot of the distribution is:


(Figure: probability distribution of the sum of two dice — probability plotted against the value of the sum.)

(b) Again, for clarity, we use the same denominator.


Y =y 0 1 2 3 4 5 Total
P (Y = y) 6/36 10/36 8/36 6/36 4/36 2/36 1
y P (Y = y) 0 10/36 16/36 18/36 16/36 10/36 70/36
y 2 P (Y = y) 0 10/36 32/36 54/36 64/36 50/36 210/36
This yields a mean of E(Y) = 70/36 = 1.94, while the variance is:
$$E(Y^2) - \mu^2 = \frac{210}{36} - (1.94)^2 = 2.05.$$
Again, although not required, a plot of the distribution is:

(Figure: probability distribution of the absolute difference of two dice — probability plotted against the absolute value of the difference.)

3. An examination consists of four multiple choice questions, each with a choice of


three answers. Let X be the number of questions answered correctly when a
student resorts to pure guesswork for each answer.


(a) Draw the probability distribution of X, and find its mean and variance.
(b) The examiner calculates a rescaled mark using the formula Y = 10 + 22.5X.
Find the mean and variance of Y .
Solution:
(a) The distribution of X is binomial with n = 4 and π = 1/3.
X=x 0 1 2 3 4 Total
P (X = x) 0.1975 0.3951 0.2963 0.0988 0.0123 1
x P (X = x) 0 0.3951 0.5926 0.2964 0.0492 1.33
x2 P (X = x) 0 0.3951 1.1852 0.8892 0.1968 2.67

We have E(X) = 1.33 and:


Var(X) = E(X 2 ) − µ2 = 2.67 − (1.33)2 = 0.89.
(Figure: probability distribution of the number of correct answers — probability plotted against the number of correct guesses.)

(b) We have:
E(Y ) = 10 + 22.5 × E(X) = 10 + 22.5 × 1.33 = 39.93
and:
Var(Y ) = (22.5)2 × Var(X) = (22.5)2 × 0.89 = 450.6.

4. In a game show each contestant has two chances out of three of winning a prize,
independently of other contestants. If six contestants take part, determine the
probability distribution of the number of winners. Find the mean and variance of
the number of winners.
Solution:
The number of winners, X, is binomial with n = 6 and π = 2/3.
X=x 0 1 2 3 4 5 6 Total
P (X = x) 0.0014 0.0165 0.0823 0.2195 0.3292 0.2634 0.0878 1
x P (X = x) 0 0.0165 0.1646 0.6585 1.3168 1.3170 0.5268 4
x2 P (X = x) 0 0.0165 0.3292 1.9755 5.2672 6.5850 3.1608 17.33


This yields a mean of E(X) = 4, and a variance of E(X²) − µ² = 17.33 − 4² = 1.33.

(Figure: probability distribution of the number of winners — probability plotted against the number of winners.)

5. Suppose that the probability of a warship hitting a target with any shot is 0.2.
(a) What are the probabilities that in six shots it will hit the target:
i. exactly twice
ii. at least three times
iii. at most twice?
(b) What assumptions are you implicitly making in (a)? Are the assumptions
reasonable?
Solution:
(a) Using the binomial distribution with n = 6 and π = 0.2 we obtain the
following:
i. P (X = 2) = 15 × (0.2)2 × (0.8)4 = 0.2458.
ii. P (X = 0) = (0.8)6 = 0.2621 and P (X = 1) = 6 × (0.2)1 × (0.8)5 = 0.3932.
So:

P (at most 2) = P (X ≤ 2) = P (X = 0) + P (X = 1) + P (X = 2)
= 0.2621 + 0.3932 + 0.2458
= 0.9011.

Therefore:

P (at least 3) = P (X ≥ 3) = 1 − 0.9011 = 0.0989.

iii. Already computed in (ii.), that is P (X ≤ 2) = 0.9011.


(b) This assumes independence and that the probability of a hit is constant at 0.2.
In practice these assumptions might not be valid. For example, there might be
an improvement in accuracy due to experience – naval and artillery gunners
often use ‘ranging shots’ (so that, for instance, if the shell has landed too far
to the left, then they can aim a bit more to the right next time). These
answers might, nevertheless, be reasonable approximations.
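These binomial probabilities can be confirmed with R's built-in functions:

dbinom(2, size = 6, prob = 0.2)       # P(X = 2)  = 0.2458
1 - pbinom(2, size = 6, prob = 0.2)   # P(X >= 3) = 0.0989
pbinom(2, size = 6, prob = 0.2)       # P(X <= 2) = 0.9011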

6. Components for assembly are delivered in batches of 100 and experience shows that
5% of each batch are defective. On arrival, five pieces are selected at random from
each batch and tested. If two or more of the five are found to be faulty, the entire
batch is rejected. What is the probability that a 5% defective batch will be
accepted?

Solution:
Since this question involves sampling without replacement, we ought in principle to
use the ‘hypergeometric distribution’ (not covered in this course). However, we
frequently want to save effort in these calculations (with a nod to pre-computer
approaches), and here n = 5 (both the sample size and the upper limit of the
values of interest) and N = 100, so that π = n/N = 5/100 = 0.05 is small.
Moreover, N π = n = 5 is also ‘small’.
Therefore, we can use a binomial approximation to these preferable, but more
complicated, methods, with n = 5 and π = 0.05. We find that:

P (X = 0) = (0.95)5 = 0.7738 and P (X = 1) = 5 × (0.05)1 × (0.95)4 = 0.2036.

Hence P (batch accepted) = 0.7738 + 0.2036 = 0.9774.

7. One in ten of the new cars leaving a factory has minor faults of one kind or
another.
(a) Assuming that a batch of ten cars delivered to a dealer represents a random
sample of the output, what is the probability that:
i. at least 1 will be faulty

ii. more than 3 will be faulty?

On receiving a delivery of ten new cars from the manufacturer, the dealer checks
out four of these, chosen at random, before they are delivered to customers.

(b) If, in fact, two of the cars are faulty, what is the probability that both faults
will be discovered?

Solution:

(a) Using the binomial distribution with n = 10 and π = 0.1 we have the
following:
i. P (at least 1 faulty) = 1 − P (X = 0) = 1 − (0.9)10 = 0.6513.


ii. We have:
P (more than 3 faulty) = 1 − P (X = 0) − P (X = 1) − P (X = 2) − P (X = 3)
= 1 − 0.3487 − 0.3874 − 0.1937 − 0.0574
= 0.0128.

(b) We can consider this as involving a population of 10 cars. A sample of 4 cars is


taken without replacement. There are 2 ‘defectives’ in the population – what is
the probability that both are in the sample?
There are a total of 10 C4 = 210 ways of choosing 4 cars from 10 cars. If the two
defectives are included in the sample, there are 8 candidates for the remaining
2 places, so there are 8 C2 = 28 samples which contain the faulty cars. Hence
the required probability is 28/210 = 0.1333.

8. Over a period of time the number of break-ins per month in a given district has
been observed to follow a Poisson distribution with mean 2.
(a) For a given month, find the probability that the number of break-ins is:
i. fewer than 2
ii. more than 4
iii. at least 1, but no more than 3.
(b) What is the probability that there will be fewer than ten break-ins in a
six-month period?
Solution:
(a) The first few entries in the table of the probability function are given below.
They all follow from the Poisson probability function:
$$P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}$$
with $\lambda = 2$.
X=x 0 1 2 3 4 5
P (X = x) 0.1353 0.2707 0.2707 0.1804 0.0902 ...
i. We have:
P (X < 2) = P (X = 0) + P (X = 1)
= 0.1353 + 0.2707
= 0.4060.

ii. We have:
P (X > 4) = 1 − P (X = 0) − P (X = 1) − P (X = 2) − P (X = 3) − P (X = 4)
= 1 − 0.1353 − 0.2707 − 0.2707 − 0.1804 − 0.0902
= 0.0527.


iii. We have:

P (1 ≤ X ≤ 3) = P (X = 1) + P (X = 2) + P (X = 3)
= 0.2707 + 0.2707 + 0.1804
= 0.7218.

(b) If there are an average of 2 break-ins per month, there will be an average of 12
break-ins in a 6-month period. Therefore, the number of break-ins will have a
Poisson distribution with λ = 12. We need to calculate P (X < 10). This is
time-consuming (though not difficult) by hand, resulting in a probability of
0.2424.
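The Poisson probabilities above, including the tedious part (b), are immediate with R's built-in functions:

ppois(1, lambda = 2)                          # P(X < 2) = P(X <= 1) = 0.4060
1 - ppois(4, lambda = 2)                      # P(X > 4) = 0.0527
ppois(3, lambda = 2) - ppois(0, lambda = 2)   # P(1 <= X <= 3) = 0.7218
ppois(9, lambda = 12)                         # P(X < 10) over six months = 0.2424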

9. Two per cent of the videotapes produced by a company are known to be defective.
If a random sample of 100 videotapes is selected for inspection, calculate the
probability of getting no defectives by using:
(a) the binomial distribution
(b) the Poisson distribution.
Solution:

(a) Using the binomial with n = 100 and π = 0.02, P (X = 0) = (0.98)100 = 0.1326.
(b) Using the Poisson with λ = 100 × 0.02 = 2, P (X = 0) = exp(−2) = 0.1353.
Note that the answers are almost equal. This is because n is large and π is small.
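The closeness of the two answers is easily seen in R:

dbinom(0, size = 100, prob = 0.02)   # exact binomial:        0.1326
dpois(0, lambda = 2)                 # Poisson approximation: 0.1353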

10. The probability that a marksman hits the bullseye in target practice is 0.7. Assume
that successive shots are independent.
(a) What is the probability that, out of seven shots, he hits the target:
i. exactly once
ii. every time
iii. at least five times
iv. at least five times in succession (with all hits being in succession)?
(b) Suppose he bets £1 that he can hit the target at least five times out of seven
(i.e. he gains £1 if he succeeds and loses £1 if he fails). What are his expected
winnings?
Solution:

(a) Use the binomial distribution such that X ∼ Bin(7, 0.7).


i. P (X = 1) = 7 × 0.7 × (0.3)6 = 0.00357.
ii. P (every time) = P (X = 7) = (0.7)7 = 0.0824.


iii. We have:
$$P(X \ge 5) = P(X = 5) + P(X = 6) + P(X = 7) = \binom{7}{5}(0.7)^5(0.3)^2 + \binom{7}{6}(0.7)^6(0.3) + (0.7)^7 = 0.6471.$$

iv. We have:

P (at least 5 in succession) = P (every time) + P (exactly 6 in succession)


+ P (exactly 5 in succession)
= 0.0824 + P (missed exactly 1st or last time)
+ P (missed exactly 1st or last two times)
+ P (missed exactly 1st and last times)
= 0.08235 + (2 × (0.7)^6 × 0.3) + (3 × (0.7)^5 × (0.3)^2)
= 0.1983.

(b) The expected winnings are:

0.6471 × £1 + (1 − 0.6471) × −£1 = £0.2942.

B.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. On average a fire brigade in a large town receives 1.6 emergency calls per hour.
Assuming a suitable probability distribution for this variable, find the probability
that the number of calls is:
(a) none in 1 hour
(b) exactly 5 in 1 hour
(c) more than 4 in 2 hours.

2. A glacier in Greenland ‘calves’ (lets fall off into the sea) an iceberg on average
twice every five weeks. (Seasonal effects can be ignored for this question.)
(a) Explain which distribution you would use to estimate the probabilities of
different numbers of icebergs being calved in different periods, justifying your
selection.
(b) What is the probability that no iceberg is calved in the next two weeks?


(c) What is the probability that no iceberg is calved in the two weeks after the
next two weeks?
(d) What is the probability that exactly three icebergs are calved in the next four
weeks?
(e) If exactly three icebergs are calved in the next four weeks, what is the
probability that exactly three more icebergs will be calved in the four-week
period after the next four weeks?
(f) Comment on the relationship between your answers to (d) and (e).

3. A random variable R has a binomial distribution with n = 4 and π = 0.2. Evaluate


the probability distribution of T = (R − 2)^2, and find the mean and the standard
deviation of T .

Appendix C
Continuous probability distributions

C.1 Worked examples


1. A continuous random variable, X, has the following cumulative distribution
function (cdf): 
F(x) = 0 for x < 0,  F(x) = x^2 for 0 ≤ x ≤ 1,  and F(x) = 1 for x > 1.

Determine the probability density function (pdf) of X, and find its mean and
standard deviation.
Solution:
We obtain the pdf by differentiating the cdf with respect to x. Hence:

f(x) = 2x for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

By definition the mean, µ, is:

µ = E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 2x^2 dx = [2x^3/3]_0^1 = 2/3 = 0.6667.

We obtain the variance using Var(X) = E(X^2) − (E(X))^2, so we also need E(X^2):

E(X^2) = ∫_{−∞}^{∞} x^2 f(x) dx = ∫_0^1 2x^3 dx = [2x^4/4]_0^1 = 1/2 = 0.5.

Hence Var(X) = σ^2 = 1/2 − (2/3)^2 = 1/18 ≈ 0.0556. Therefore, the standard deviation is σ = √0.0556 = 0.2357.
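If you want to verify such integrals numerically, a short check with scipy.integrate is possible (a sketch only, assuming scipy is available; the names are mine):

from scipy.integrate import quad

f = lambda x: 2*x                                # pdf on [0, 1]
mean, _ = quad(lambda x: x*f(x), 0, 1)           # approx. 0.6667
ex2, _  = quad(lambda x: x**2*f(x), 0, 1)        # approx. 0.5
var = ex2 - mean**2                              # approx. 0.0556
print(mean, var, var**0.5)                       # sd approx. 0.2357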

2. A random variable, X, has a pdf for some constant c given by:


f(x) = cx^2 for 0 ≤ x ≤ 1, and f(x) = 0 otherwise.

(a) Find the value of c.


(b) Determine:
i. P (X ≤ 1/2)
ii. P (1/4 ≤ X ≤ 3/4).
(c) Find the mean and standard deviation of X.


Solution:

(a) The constant c must be such that the area between the curve representing the
pdf and the (horizontal) x-axis equals 1. Hence the value c satisfies:

1 = ∫_{−∞}^{∞} f(x) dx = c ∫_0^1 x^2 dx = c [x^3/3]_0^1 = c/3  ⇒  c = 3.

(b) i. We have:

P(X ≤ 1/2) = ∫_{−∞}^{1/2} f(x) dx = ∫_0^{1/2} 3x^2 dx = [x^3]_0^{1/2} = 1/8 = 0.125.

ii. We have:

P(1/4 ≤ X ≤ 3/4) = ∫_{1/4}^{3/4} f(x) dx = ∫_{1/4}^{3/4} 3x^2 dx = [x^3]_{1/4}^{3/4} = 13/32 = 0.4063.

(c) To determine the mean and variance, we compute E(X) and E(X^2):

E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 3x^3 dx = [3x^4/4]_0^1 = 3/4 = 0.75

and:

E(X^2) = ∫_{−∞}^{∞} x^2 f(x) dx = ∫_0^1 3x^4 dx = [3x^5/5]_0^1 = 3/5 = 0.6.

Hence the mean is 0.75, and the variance is:

σ^2 = E(X^2) − (E(X))^2 = 0.6 − (0.75)^2 = 0.0375.

Therefore, the standard deviation is σ = √0.0375 = 0.1936.

3. A continuous random variable, X, has the following cdf:


F(x) = 1 − e^{−x} for x ≥ 0, and F(x) = 0 otherwise.

(a) Calculate P (0.5 < X < 1).

(b) Find x, such that P (X > x) = 0.05.

(c) Determine E(X) and Var(X).


Solution:
(a) For the continuous random variable X, we know that:

P(0.5 < X < 1) = F(1) − F(0.5) = (1 − e^{−1}) − (1 − e^{−0.5}) = e^{−0.5} − e^{−1} = 0.2387.

(b) P(X > x) = 0.05 means F(x) = 0.95, i.e. 1 − e^{−x} = 0.95. That is:

e^{−x} = 0.05  ⇒  e^{x} = 20.

Taking logs yields the solution: x = ln(20) = 2.9957 ≈ 3.


(c) The question asks for E(X), the mean value, when the known information
about the random variable X is its cdf, F(x), as defined.
So to work out the mean, we need to use the basic relationship that gives us
the pdf, f(x), from the cdf, F(x), i.e. we differentiate the cdf:

f(x) = d F(x)/dx

and then we use the definition of the mean, i.e. µ = ∫_S x f(x) dx, where S ⊆ R
is the interval on which f(x) is defined. Here S = [0, ∞). Using these ideas:

f(x) = d(1 − e^{−x})/dx = e^{−x}

for x ≥ 0, and 0 otherwise. So:

µ = ∫_0^{∞} x e^{−x} dx = 1

(integrating by parts). In fact X is exponentially distributed with a rate parameter of λ = 1, so the mean is 1/λ = 1 and the variance is 1/λ^2 = 1.

4. A random variable, X, has the following pdf:



f(x) = 2x/3 for 0 ≤ x < 1,  f(x) = (3 − x)/3 for 1 ≤ x ≤ 3,  and f(x) = 0 otherwise.

(a) Derive the cdf of X.


(b) Find the mean and the standard deviation of X.
Solution:
(a) We determine the cdf by integrating the pdf over the appropriate range, hence:



F(x) = 0 for x < 0,
F(x) = x^2/3 for 0 ≤ x < 1,
F(x) = x − x^2/6 − 1/2 for 1 ≤ x ≤ 3,
F(x) = 1 for x > 3.


This results from the following calculations. Firstly, for x < 0, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{x} 0 dt = 0.

For 0 ≤ x < 1, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_0^x 2t/3 dt = [t^2/3]_0^x = x^2/3.

For 1 ≤ x ≤ 3, we have:

F(x) = ∫_{−∞}^{x} f(t) dt = ∫_{−∞}^{0} 0 dt + ∫_0^1 2t/3 dt + ∫_1^x (3 − t)/3 dt
     = 0 + 1/3 + [t − t^2/6]_1^x = 1/3 + (x − x^2/6) − (1 − 1/6) = x − x^2/6 − 1/2.

(b) To find the mean we proceed as follows:


µ = E(X) = ∫_{−∞}^{∞} x f(x) dx = ∫_0^1 2x^2/3 dx + ∫_1^3 (3x − x^2)/3 dx
  = [2x^3/9]_0^1 + [x^2/2 − x^3/9]_1^3
  = 2/9 + (9/2 − 27/9) − (1/2 − 1/9)
  = 4/3.

Similarly:

E(X^2) = ∫_{−∞}^{∞} x^2 f(x) dx = ∫_0^1 2x^3/3 dx + ∫_1^3 (3x^2 − x^3)/3 dx = 13/6 = 2.1667.

Hence the variance is:

σ^2 = E(X^2) − (E(X))^2 = 13/6 − (4/3)^2 = 39/18 − 32/18 = 7/18 ≈ 0.3889.

Therefore, the standard deviation is σ = √0.3889 = 0.6236.

5. If Z ∼ N (0, 1), find:


(a) P (0 < Z < 1.2)
(b) P (−0.68 < Z < 0)
(c) P (−0.46 < Z < 2.21)
(d) P (0.81 < Z < 1.94)


(e) P (Z < −0.6)


(f) P (Z > −1.28)
(g) P (Z > 2.05).
Solution:
(a) P(0 < Z < 1.2) = P(Z > 0) − P(Z > 1.2) = 0.5 − 0.1151 = 0.3849.
(b) P(−0.68 < Z < 0) = P(Z < 0) − P(Z < −0.68) = 0.5 − P(Z > 0.68) = 0.5 − 0.2483 = 0.2517.
(c) P(−0.46 < Z < 2.21) = 1 − P(Z > 2.21) − P(Z > 0.46) = 1 − 0.0136 − 0.3228 = 0.6636.
(d) P(0.81 < Z < 1.94) = P(Z > 0.81) − P(Z > 1.94) = 0.2090 − 0.0262 = 0.1828.
(e) P(Z < −0.6) = P(Z > 0.6) = 0.2743.
(f) P(Z > −1.28) = 1 − P(Z < −1.28) = 1 − P(Z > 1.28) = 1 − 0.1003 = 0.8997.
(g) P(Z > 2.05) = 0.0202.
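All of these values can be reproduced from the standard normal cdf Φ; for instance, in Python (a sketch only, not part of the official solution; scipy is assumed):

from scipy.stats import norm

Phi = norm.cdf
print(Phi(1.2) - Phi(0))        # (a) approx. 0.3849
print(Phi(0) - Phi(-0.68))      # (b) approx. 0.2517
print(Phi(2.21) - Phi(-0.46))   # (c) approx. 0.6636
print(Phi(1.94) - Phi(0.81))    # (d) approx. 0.1828
print(Phi(-0.6))                # (e) approx. 0.2743
print(1 - Phi(-1.28))           # (f) approx. 0.8997
print(1 - Phi(2.05))            # (g) approx. 0.0202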

6. Suppose X ∼ N (10, 4).


(a) Find:
i. P (X > 13.4)
ii. P (8 < X < 9).
(b) Find the value a such that P (10 − a < X < 10 + a) = 0.95.
(c) Find the value b such that P (10 − b < X < 10 + b) = 0.99.
(d) How far above the mean of the standard normal distribution must we go such
that only 1% of the probability remains in the right-hand tail?
(e) How far below the mean of the standard normal distribution must we go such
that only 5% of the probability remains in the left-hand tail?
Solution:
Since X ∼ N(10, 4), we use the transformation Z = (X − µ)/σ with the values µ = 10 and σ = √4 = 2.
(a) i. We have P (X > 13.4) = P (Z > (13.4 − 10)/2) = P (Z > 1.7) = 0.0446.
ii. We have:

P(8 < X < 9) = P((8 − 10)/2 < Z < (9 − 10)/2)
             = P(−1 < Z < −0.5)
             = P(Z < −0.5) − P(Z < −1)
             = P(Z > 0.5) − P(Z > 1)
             = 0.3085 − 0.1587
             = 0.1498.


(b) We want to find the value a such that P(10 − a < X < 10 + a) = 0.95, that is:

0.95 = P(((10 − a) − 10)/2 < Z < ((10 + a) − 10)/2)
     = P(−a/2 < Z < a/2)
     = 1 − P(Z > a/2) − P(Z < −a/2)
     = 1 − 2 × P(Z > a/2).

This is the same as 2 × P(Z > a/2) = 0.05, i.e. P(Z > a/2) = 0.025. Hence,
from Table 4, a/2 = 1.96, and so a = 3.92.

(c) We want to find the value b such that P (10 − b < X < 10 + b) = 0.99. Similar
reasoning shows that P (Z > b/2) = 0.005. Hence, from Table 4, b/2 = 2.58, so
that b = 5.16.

(d) We want k such that P (Z > k) = 0.01. From Table 4, k = 2.33.

(e) We want x such that P (Z < x) = 0.05. This means that x < 0 and
P (Z > |x|) = 0.05, so, from Table 4, |x| = 1.65 and hence x = −1.65.

7. Your company requires a special type of light bulb which is available from only two
suppliers. Supplier A’s light bulbs have a mean lifetime of 2,000 hours with a
standard deviation of 180 hours. Supplier B’s light bulbs have a mean lifetime of
1,850 hours with a standard deviation of 100 hours. The distribution of the
lifetimes of each type of light bulb is normal. Your company requires that the
lifetime of a light bulb be not less than 1,500 hours. All other things being equal,
which type of bulb should you buy, and why?

Solution:
Let A and B be the random variables representing the lifetimes (in hours) of light
bulbs from supplier A and supplier B, respectively. We are told that:

A ∼ N(2,000, (180)^2) and B ∼ N(1,850, (100)^2).

Since the relevant criterion is that light bulbs last at least 1,500 hours, the
company should choose the supplier whose light bulbs have a greater probability of
doing so. We find that:
 
P(A > 1,500) = P(Z > (1,500 − 2,000)/180) = P(Z > −2.78) = 1 − P(Z > 2.78) = 0.9973

and:

P(B > 1,500) = P(Z > (1,500 − 1,850)/100) = P(Z > −3.50) = 1 − P(Z > 3.50) = 0.9998.

Therefore, the company should buy light bulbs from supplier B, since they have a
greater probability of lasting the required time.
Note it is good practice to define notation and any units of measurement and to
state the distributions of the random variables. Note also that here it is not essential
to compute the probability values in order to determine what the company should
do, since −2.78 > −3.5 implies that P (Z > −2.78) < P (Z > −3.5).

8. The life, in hours, of a light bulb is normally distributed with a mean of 200 hours.
If a consumer requires at least 90% of the light bulbs to have lives exceeding 150
hours, what is the largest value that the standard deviation can have?

Solution:
Let X be the random variable representing the lifetime of a light bulb (in hours),
so that for some value σ we have X ∼ N(200, σ^2). We want P(X > 150) = 0.9, such that:

P(X > 150) = P(Z > (150 − 200)/σ) = P(Z > −50/σ) = 0.9.

Note that this is the same as P (Z > 50/σ) = 1 − 0.9 = 0.1, so 50/σ = 1.28, giving
σ = 39.06.

9. A company manufactures rods whose diameters are normally distributed with a


mean of 5 mm and a standard deviation of 0.05 mm. It also drills holes to receive
the rods and the diameters of these holes are normally distributed with a mean of
5.2 mm and a standard deviation of 0.07 mm. The rods are allocated to the holes
at random. What proportion of rods will fit into the holes?

Solution:
Let X and Y, respectively, denote the random variables for the diameters of the
rods and of the holes (in millimetres), so that:

X ∼ N(5, (0.05)^2) and Y ∼ N(5.2, (0.07)^2).

Therefore, P(rod fits hole) = P(X < Y) = P(Y − X > 0). Now note that:

Y − X ∼ N(5.2 − 5, (0.05)^2 + (0.07)^2)  ⇒  Y − X ∼ N(0.2, 0.0074).

So:

P(rod fits hole) = P(Z > (0 − 0.2)/√0.0074) = P(Z > −2.33) = 1 − P(Z > 2.33) = 0.9901.
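The same calculation can be set up directly from the distribution of Y − X (a sketch only; scipy is assumed and the names are mine):

from math import sqrt
from scipy.stats import norm

mu = 5.2 - 5.0                               # mean of Y - X
sd = sqrt(0.05**2 + 0.07**2)                 # sd of Y - X (independence assumed)
print(1 - norm.cdf(0, loc=mu, scale=sd))     # P(Y - X > 0) approx. 0.990 (0.9901 with the rounded z = 2.33)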

10. An investor has the choice of two out of four investments: X1 , X2 , X3 and X4 . The
profits (in £000s per annum) from these may be assumed to be independently
distributed and:

profit from X1 ∼ N (2, 1), profit from X2 ∼ N (3, 3),

profit from X3 ∼ N (1, 0.25), profit from X4 ∼ N (2.5, 4).

Which pair of investments should the investor choose in order to maximise the
probability of making a total profit of at least £2,000? What is this maximum
probability?
Solution:
Let A1 , A2 , A3 and A4 be the profits from the investments X1 , X2 , X3 and X4 ,
respectively. There are 4 C2 = 6 possible pairs, and we simply have to work out the
probabilities for each. From the information in the question, we see that:
 
A1 + A2 ∼ N(5, 4)      ⇒ P(A1 + A2 > 2) = P(Z > (2 − 5)/√4) = 0.9332
A1 + A3 ∼ N(3, 1.25)   ⇒ P(A1 + A3 > 2) = P(Z > (2 − 3)/√1.25) = 0.8133
A1 + A4 ∼ N(4.5, 5)    ⇒ P(A1 + A4 > 2) = P(Z > (2 − 4.5)/√5) = 0.8686
A2 + A3 ∼ N(4, 3.25)   ⇒ P(A2 + A3 > 2) = P(Z > (2 − 4)/√3.25) = 0.8665
A2 + A4 ∼ N(5.5, 7)    ⇒ P(A2 + A4 > 2) = P(Z > (2 − 5.5)/√7) = 0.9066
A3 + A4 ∼ N(3.5, 4.25) ⇒ P(A3 + A4 > 2) = P(Z > (2 − 3.5)/√4.25) = 0.7673.
Therefore, the investor should choose X1 and X2 , for which the maximum
probability is 0.9332.
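A small loop makes it easy to check all six pairs at once (a sketch only; scipy is assumed and the variable names are my own):

from itertools import combinations
from math import sqrt
from scipy.stats import norm

profits = {"X1": (2, 1), "X2": (3, 3), "X3": (1, 0.25), "X4": (2.5, 4)}   # (mean, variance)

for (a, (m1, v1)), (b, (m2, v2)) in combinations(profits.items(), 2):
    p = 1 - norm.cdf(2, loc=m1 + m2, scale=sqrt(v1 + v2))    # P(total profit > 2)
    print(a, b, round(p, 4))          # X1 and X2 give the largest value, approx. 0.9332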

C.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.


1. An advertising agency claims that 40% of all television viewers watch a particular
programme. In a random sample of 500 viewers, what is the probability that fewer
than 170 will be watching the programme if the agency’s claim is correct?

2. This question is about calculations based on random sample surveys of people.


(a) It is believed that 40% of the 20 adults in a village are supporters of EMFSS
FC. If this belief is correct, and four different people are picked at random and
asked whether they support EMFSS FC, what is the probability that exactly
three will be supporters?
(b) It is believed that 40% of the many thousands of adults in Holborn are
supporters of EMFSS FC. If this belief is correct, and 40 people are picked at
random and asked about their allegiance, what is the probability that exactly
twenty will be supporters?
(c) It is believed that 40% of the many thousands of adults in Holborn are
supporters of EMFSS FC. If this belief is correct, and 100 people are picked at
random and asked about their allegiance, what is the probability that at least
thirty will be supporters?
(d) If you have used a suitable approximation in any of the previous parts, explain
why it is appropriate in each case.
(e) Comment on the differences, if any, in the assumptions and methods you have
used when calculating the probabilities obtained above.

3. The number of newspapers sold daily at a kiosk is normally distributed with a


mean of 350 and a standard deviation of 30.
(a) Find the probability that fewer than 300 newspapers are sold on Monday.
(b) Find the probability that fewer newspapers are sold on Tuesday than on
Monday.
(c) Find the probability that fewer than 1,700 newspapers are sold in a (five-day)
week. What assumption have you made in order to answer this?
(d) How many newspapers should the newsagent stock each day such that the
probability of running out on any particular day is 10%?

Appendix D
Multivariate random variables

D.1 Worked examples


1. X and Y are independent random variables with distributions as follows:
X=x 0 1 2 Y =y 1 2
pX (x) 0.4 0.2 0.4 pY (y) 0.4 0.6
The random variables W and Z are defined by W = 2X and Z = Y − X,
respectively.
(a) Compute the joint distribution of W and Z.
(b) Evaluate P (W = 2 | Z = 1), E(W | Z = 0) and Cov(W, Z).

Solution:
(a) The joint distribution (with marginal probabilities) is:
W =w
0 2 4 pZ (z)
−1 0.00 0.00 0.16 0.16
Z=z 0 0.00 0.08 0.24 0.32
1 0.16 0.12 0.00 0.28
2 0.24 0.00 0.00 0.24
pW (w) 0.40 0.20 0.40 1.00
(b) It is straightforward to see that:

P(W = 2 | Z = 1) = P(W = 2 ∩ Z = 1) / P(Z = 1) = 0.12/0.28 = 3/7.

For E(W | Z = 0), we have:

E(W | Z = 0) = Σ_w w P(W = w | Z = 0) = 0 × 0.00/0.32 + 2 × 0.08/0.32 + 4 × 0.24/0.32 = 3.5.

We see E(W) = 2 (by symmetry), and:

E(Z) = −1 × 0.16 + 0 × 0.32 + 1 × 0.28 + 2 × 0.24 = 0.6.

Also:

E(WZ) = Σ_w Σ_z wz p(w, z) = −4 × 0.16 + 2 × 0.12 = −0.4

hence:

Cov(W, Z) = E(WZ) − E(W) E(Z) = −0.4 − 2 × 0.6 = −1.6.
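Because X and Y take only a few values, the whole joint distribution can be enumerated by brute force. The sketch below (the names are mine, not part of the course material) rebuilds the joint table and the covariance:

pX = {0: 0.4, 1: 0.2, 2: 0.4}
pY = {1: 0.4, 2: 0.6}

joint = {}                                   # keys (w, z), values P(W = w, Z = z)
for x, px in pX.items():
    for y, py in pY.items():
        w, z = 2*x, y - x
        joint[(w, z)] = joint.get((w, z), 0) + px*py

EW  = sum(w*p for (w, z), p in joint.items())        # 2.0
EZ  = sum(z*p for (w, z), p in joint.items())        # 0.6
EWZ = sum(w*z*p for (w, z), p in joint.items())      # -0.4
print(EWZ - EW*EZ)                                   # Cov(W, Z) = -1.6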


2. The joint probability distribution of the random variables X and Y is:

X=x
−1 0 1
−1 0.05 0.15 0.10
Y =y 0 0.10 0.05 0.25
1 0.10 0.05 0.15

(a) Identify the marginal distributions of X and Y and the conditional


distribution of X given Y = 1.
(b) Evaluate E(X | Y = 1) and the correlation coefficient of X and Y .
(c) Are X and Y independent random variables?

Solution:

(a) The marginal and conditional distributions are, respectively:


X=x −1 0 1 Y =y −1 0 1
pX (x) 0.25 0.25 0.50 pY (y) 0.30 0.40 0.30

X = x|Y = 1 −1 0 1
pX|Y =1 (x | Y = 1) 1/3 1/6 1/2
(b) From the conditional distribution we see:

E(X | Y = 1) = −1 × 1/3 + 0 × 1/6 + 1 × 1/2 = 1/6.

E(Y) = 0 (by symmetry), and so Var(Y) = E(Y^2) = 0.6.
E(X) = 0.25 and:

Var(X) = E(X^2) − (E(X))^2 = 0.75 − (0.25)^2 = 0.6875.

(Note that Var(X) and Var(Y) are not strictly necessary here!)
Next:

E(XY) = Σ_x Σ_y xy p(x, y) = (−1)(−1)(0.05) + (1)(−1)(0.1) + (−1)(1)(0.1) + (1)(1)(0.15) = 0.

So:

Cov(X, Y) = E(XY) − E(X) E(Y) = 0  ⇒  Corr(X, Y) = 0.

(c) X and Y are not independent random variables since, for example:

P(X = 1, Y = −1) = 0.1 ≠ P(X = 1) P(Y = −1) = 0.5 × 0.3 = 0.15.


3. X1 , X2 , . . . , Xn are independent random variables with the common probability


density function: f(x) = λ^2 x e^{−λx} for x ≥ 0, and f(x) = 0 otherwise.
Derive the joint probability density function, f (x1 , x2 , . . . , xn ).
Solution:
Since the Xi s are independent (and identically distributed) random variables, we have:

f(x1, x2, . . . , xn) = ∏_{i=1}^{n} f(xi).

So, the joint probability density function is:

f(x1, x2, . . . , xn) = ∏_{i=1}^{n} λ^2 xi e^{−λxi} = λ^{2n} (∏_{i=1}^{n} xi) e^{−λx1 − λx2 − ··· − λxn} = λ^{2n} (∏_{i=1}^{n} xi) e^{−λ Σ_{i=1}^{n} xi}.

4. The random variables X1 and X2 are independent and have the common
distribution given in the table below:

X=x 0 1 2 3
pX (x) 0.2 0.4 0.3 0.1

The random variables W and Y are defined by W = max(X1 , X2 ) and


Y = min(X1 , X2 ).
(a) Calculate the table of probabilities which defines the joint distribution of W
and Y .
(b) Find:
i. the marginal distribution of W
ii. the conditional distribution of Y given W = 2
iii. E(Y | W = 2) and Var(Y | W = 2)
iv. Cov(W, Y ).

Solution:

(a) The joint distribution of W and Y is:


W =w
0 1 2 3
0 (0.2)2 2(0.2)(0.4) 2(0.2)(0.3) 2(0.2)(0.1)
Y =y 1 0 (0.4)(0.4) 2(0.4)(0.3) 2(0.4)(0.1)
2 0 0 (0.3)(0.3) 2(0.3)(0.1)
3 0 0 0 (0.1)(0.1)
(0.2)2 (0.8)(0.4) (1.5)(0.3) (1.9)(0.1)


which is:
W =w
0 1 2 3
0 0.04 0.16 0.12 0.04
Y =y 1 0.00 0.16 0.24 0.08
2 0.00 0.00 0.09 0.06
3 0.00 0.00 0.00 0.01
0.04 0.32 0.45 0.19

(b) i. Hence the marginal distribution of W is:


W =w 0 1 2 3
pW (w) 0.04 0.32 0.45 0.19
ii. The conditional distribution of Y | W = 2 is:
Y = y|W = 2 0 1 2 3
pY |W =2 (y | W = 2) 4/15 8/15 2/10 0
= 0.26̇ = 0.53̇ = 0.2 0
iii. We have:

E(Y | W = 2) = 0 × 4/15 + 1 × 8/15 + 2 × 2/10 + 3 × 0 = 0.93̇

and:

Var(Y | W = 2) = E(Y^2 | W = 2) − (E(Y | W = 2))^2 = 1.3̇ − (0.93̇)^2 = 0.4622.

iv. E(W Y ) = 1.69, E(W ) = 1.79 and E(Y ) = 0.81, therefore:

Cov(W, Y ) = E(W Y ) − E(W ) E(Y ) = 1.69 − 1.79 × 0.81 = 0.2401.

5. Consider two random variables X and Y . X can take the values −1, 0 and 1, and
Y can take the values 0, 1 and 2. The joint probabilities for each pair are given by
the following table:
X = −1 X = 0 X = 1
Y =0 0.10 0.20 0.10
Y =1 0.10 0.05 0.10
Y =2 0.10 0.05 0.20

(a) Calculate the marginal distributions and expected values of X and Y .


(b) Calculate the covariance of the random variables U and V , where U = X + Y
and V = X − Y .
(c) Calculate E(V | U = 1).

Solution:
(a) The marginal distribution of X is:
X=x −1 0 1
pX (x) 0.3 0.3 0.4


The marginal distribution of Y is:


Y =y 0 1 2
pY (y) 0.40 0.25 0.35
Hence:
E(X) = −1 × 0.3 + 0 × 0.3 + 1 × 0.4 = 0.1
and:
E(Y ) = 0 × 0.40 + 1 × 0.25 + 2 × 0.35 = 0.95.

(b) We have:

Cov(U, V) = Cov(X + Y, X − Y)
          = E((X + Y)(X − Y)) − E(X + Y) E(X − Y)
          = E(X^2 − Y^2) − (E(X) + E(Y))(E(X) − E(Y)).

E(X^2) = ((−1)^2 × 0.3) + (0^2 × 0.3) + (1^2 × 0.4) = 0.7

E(Y^2) = (0^2 × 0.4) + (1^2 × 0.25) + (2^2 × 0.35) = 1.65

hence:

Cov(U, V) = (0.7 − 1.65) − (0.1 + 0.95)(0.1 − 0.95) = −0.0575.

(c) U = 1 is achieved for (X, Y) pairs (−1, 2), (0, 1) or (1, 0). The corresponding
values of V are −3, −1 and 1. We have:

P(U = 1) = 0.1 + 0.05 + 0.1 = 0.25

P(V = −3 | U = 1) = 0.1/0.25 = 2/5
P(V = −1 | U = 1) = 0.05/0.25 = 1/5
P(V = 1 | U = 1) = 0.1/0.25 = 2/5

hence:

E(V | U = 1) = (−3 × 2/5) + (−1 × 1/5) + (1 × 2/5) = −1.
5 5 5

6. Two refills for a ballpoint pen are selected at random from a box containing three
blue refills, two red refills and three green refills. Define the following random
variables:
X = the number of blue refills selected
Y = the number of red refills selected.
(a) Show that P (X = 1, Y = 1) = 3/14.


(b) Form the table showing the joint probability distribution of X and Y .
(c) Calculate E(X), E(Y ) and E(X | Y = 1).
(d) Find the covariance between X and Y .
(e) Are X and Y independent random variables? Give a reason for your answer.
Solution:
(a) With the obvious notation B = blue and R = red:

P(X = 1, Y = 1) = P(BR) + P(RB) = 3/8 × 2/7 + 2/8 × 3/7 = 3/14.

(b) We have:
X=x
0 1 2
0 3/28 9/28 3/28
Y =y 1 3/14 3/14 0
2 1/28 0 0
(c) The marginal distribution of X is:
X = x     0      1      2
pX(x)     10/28  15/28  3/28
Hence:
E(X) = 0 × 10/28 + 1 × 15/28 + 2 × 3/28 = 3/4.
The marginal distribution of Y is:
Y = y     0      1      2
pY(y)     15/28  12/28  1/28
Hence:
E(Y) = 0 × 15/28 + 1 × 12/28 + 2 × 1/28 = 1/2.
The conditional distribution of X given Y = 1 is:
X = x | Y = 1          0    1
pX|Y=1(x | y = 1)      1/2  1/2
Hence:
E(X | Y = 1) = 0 × 1/2 + 1 × 1/2 = 1/2.
(d) The distribution of XY is:
XY = xy     0      1
pXY(xy)     22/28  6/28
Hence:
E(XY) = 0 × 22/28 + 1 × 6/28 = 3/14
and:
Cov(X, Y) = E(XY) − E(X) E(Y) = 3/14 − 3/4 × 1/2 = −9/56.
(e) Since Cov(X, Y) ≠ 0, a necessary condition for independence fails to hold. The
random variables are not independent.


D.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. X and Y are discrete random variables which can assume values 0, 1 and 2 only.

P (X = x, Y = y) = A(x + y) for some constant A and x, y ∈ {0, 1, 2}.

(a) Draw up a table to describe the joint distribution of X and Y and find the
value of the constant A.
(b) Describe the marginal distributions of X and Y .
(c) Give the conditional distribution of X | Y = 1 and find E(X | Y = 1).
(d) Are X and Y independent? Give a reason for your answer.

2. Consider two random variables X and Y which both take the values 0, 2 and 4.
The joint probabilities for each pair are given by the following table:

X=0 X=2 X=4


Y =0 0.14 0.10 0.06
Y =2 0.06 0.16 0.08
Y =4 0.12 0.20 0.08

(a) Calculate P (X = 2 | Y < 3).

(b) Define U = |X − 2| and V = Y . Calculate the covariance of U and V .

Appendix E
Sampling distributions of statistics

E.1 Worked examples


1. Suppose A, B and C are independent chi-squared random variables with 5, 7 and
10 degrees of freedom, respectively. Calculate:
(a) P (B < 12)
(b) P (A + B + C < 14)
(c) P (A − B − C < 0)
(d) P (A3 + B 3 + C 3 < 0).
In this question, you should use the closest value given in the New Cambridge
Statistical Tables or the Dougherty Statistical Tables. Further approximation is not
required.

Solution:

(a) P(B < 12) ≈ 0.90, directly from Table 8, where B ∼ χ^2_7.

(b) A + B + C ∼ χ^2_{5+7+10} = χ^2_{22}, so P(A + B + C < 14) is the probability that such
a random variable is less than 14, which is approximately 0.10 from Table 8.

(c) Transforming and rearranging the probability, we need:

P(A < B + C) = P(A/5 < (B + C)/17 × 17/5) = P((A/5) / ((B + C)/17) < 3.4) = P(F < 3.4) ≈ 0.975

where F ∼ F5, 17, using Table A.3 of the Dougherty Statistical Tables (practice
of which will be covered later in the course¹).

(d) A chi-squared random variable only assumes non-negative values. Hence each
of A, B and C is non-negative, so A^3 + B^3 + C^3 ≥ 0, and:

P(A^3 + B^3 + C^3 < 0) = 0.

¹ Although we have yet to ‘formally’ introduce Table A.3 of the Dougherty Statistical Tables, you
should be able to see how this works.


2. Suppose {Zi}, for i = 1, 2, . . . , k, are independent and identically distributed
standard normal random variables, i.e. Zi ∼ N(0, 1), for i = 1, 2, . . . , k.
State the distribution of:
(a) Z1^2
(b) Z1^2 / Z2^2
(c) Z1 / √(Z2^2)
(d) Σ_{i=1}^{k} Zi / k
(e) Σ_{i=1}^{k} Zi^2
(f) (3/2) × (Z1^2 + Z2^2)/(Z3^2 + Z4^2 + Z5^2).

Solution:
(a) Z1^2 ∼ χ^2_1
(b) Z1^2 / Z2^2 ∼ F1, 1
(c) Z1 / √(Z2^2) ∼ t1
(d) Σ_{i=1}^{k} Zi / k ∼ N(0, 1/k)
(e) Σ_{i=1}^{k} Zi^2 ∼ χ^2_k
(f) (3/2) × (Z1^2 + Z2^2)/(Z3^2 + Z4^2 + Z5^2) ∼ F2, 3.

3. X1 , X2 , X3 and X4 are independent normally distributed random variables each


with a mean of 0 and a standard deviation of 3. Find:
(a) P(X1 + 2X2 > 9)
(b) P(X1^2 + X2^2 > 54)
(c) P((X1^2 + X2^2) > 99(X3^2 + X4^2)).

Solution:
(a) We have X1 ∼ N(0, 9) and X2 ∼ N(0, 9). Hence 2X2 ∼ N(0, 36) and
X1 + 2X2 ∼ N(0, 45). So:

P(X1 + 2X2 > 9) = P(Z > 9/√45) = P(Z > 1.34) = 0.0901.

(b) We have X1/3 ∼ N(0, 1) and X2/3 ∼ N(0, 1). Hence X1^2/9 ∼ χ^2_1 and
X2^2/9 ∼ χ^2_1. Therefore, X1^2/9 + X2^2/9 ∼ χ^2_2. So:

P(X1^2 + X2^2 > 54) = P(Y > 6) = 0.05

where Y ∼ χ^2_2.

(c) We have X1^2/9 + X2^2/9 ∼ χ^2_2 and also X3^2/9 + X4^2/9 ∼ χ^2_2. So:

(X1^2 + X2^2)/(X3^2 + X4^2) = ((X1^2 + X2^2)/18) / ((X3^2 + X4^2)/18) ∼ F2, 2.

Hence:

P((X1^2 + X2^2) > 99(X3^2 + X4^2)) = P(Y > 99) = 0.01

where Y ∼ F2, 2.

4. The independent random variables X1 , X2 and X3 are each normally distributed


with a mean of 0 and a variance of 4. Find:
(a) P(X1 > X2 + X3)
(b) P(X1^2 > 9.25(X2^2 + X3^2))
(c) P(X1 > 5(X2^2 + X3^2)^{1/2}).

Solution:
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, hence:

X1 − X2 − X3 ∼ N(0, 12).

So:

P(X1 > X2 + X3) = P(X1 − X2 − X3 > 0) = P(Z > 0) = 0.5.

(b) We have Xi/2 ∼ N(0, 1), so Xi^2/4 ∼ χ^2_1 for i = 1, 2, 3. Hence:

2X1^2/(X2^2 + X3^2) = ((X1^2/4)/1) / (((X2^2 + X3^2)/4)/2) ∼ F1, 2.

So:

P(X1^2 > 9.25(X2^2 + X3^2)) = P(2X1^2/(X2^2 + X3^2) > 9.25 × 2) = P(Y > 18.5) = 0.05

where Y ∼ F1, 2.

(c) We have:

P(X1 > 5(X2^2 + X3^2)^{1/2}) = P(X1/2 > 5(X2^2/4 + X3^2/4)^{1/2})
                             = P(X1/2 > 5√2 ((X2^2/4 + X3^2/4)/2)^{1/2})

i.e. P(Y1 > 5√2 √Y2), where Y1 ∼ N(0, 1) and Y2 ∼ χ^2_2/2, or P(Y3 > 7.07),
where Y3 ∼ t2. From Table 10 of the New Cambridge Statistical Tables, this is
approximately 0.01.


5. The independent random variables X1 , X2 , X3 and X4 are each normally


distributed with a mean of 0 and a variance of 4. Using the New Cambridge
Statistical Tables or the Dougherty Statistical Tables, derive values for k in each of
the following cases:
(a) P(3X1 + 4X2 > 5) = k
(b) P(X1 > k√(X3^2 + X4^2)) = 0.025
(c) P(X1^2 + X2^2 + X3^2 < k) = 0.9
(d) P(X2^2 + X3^2 + X4^2 > 19X1^2 + 20X3^2) = k.

Solution:
(a) We have Xi ∼ N(0, 4), for i = 1, 2, 3, 4, hence 3X1 ∼ N(0, 36) and
4X2 ∼ N(0, 64). Therefore:

(3X1 + 4X2)/10 = Z ∼ N(0, 1).

So, P(3X1 + 4X2 > 5) = k = P(Z > 0.5) = 0.3085.

(b) We have Xi/2 ∼ N(0, 1), for i = 1, 2, 3, 4, hence (X3^2 + X4^2)/4 ∼ χ^2_2. So:

P(X1 > k√(X3^2 + X4^2)) = 0.025 = P(T > k√2)

where T ∼ t2 and hence k√2 = 4.303, so k = 3.04268.

(c) We have (X1^2 + X2^2 + X3^2)/4 ∼ χ^2_3, so:

P(X1^2 + X2^2 + X3^2 < k) = 0.9 = P(X < k/4)

where X ∼ χ^2_3. Therefore, k/4 = 6.251. Hence k = 25.004.

(d) P(X2^2 + X3^2 + X4^2 > 19X1^2 + 20X3^2) = k simplifies to:

P(X2^2 + X4^2 > 19(X1^2 + X3^2)) = k

and:

(X2^2 + X4^2)/(X1^2 + X3^2) ∼ F2, 2.

So, from Table A.3 of the Dougherty Statistical Tables, k = 0.05.

6. Suppose that the heights of students are normally distributed with a mean of 68.5
inches and a standard deviation of 2.7 inches. If 200 random samples of size 25 are
drawn from this population with means recorded to the nearest 0.1 inch, find:
(a) the expected mean and standard deviation of the sampling distribution of the
mean
(b) the expected number of recorded sample means which fall between 67.9 and
69.2 inclusive
(c) the expected number of recorded sample means falling below 67.0.


Solution:

(a) The sampling distribution of the mean of 25 observations has the same mean
as the population, which is 68.5 inches. The standard deviation (standard
error) of the sample mean is 2.7/√25 = 0.54.
(b) Notice that the samples are random, so we cannot be sure exactly how many
will have means between 67.9 and 69.2 inches. We can work out the probability
that the sample mean will lie in this interval using the sampling distribution:

X̄ ∼ N(68.5, (0.54)^2).

We need to make a continuity correction, to account for the fact that the
recorded means are rounded to the nearest 0.1 inch. For example, the
probability that the recorded mean is ≥ 67.9 inches is the same as the
probability that the sample mean is > 67.85. Therefore, the probability we
want is:
 
P(67.85 < X̄ < 69.25) = P((67.85 − 68.5)/0.54 < Z < (69.25 − 68.5)/0.54)
                      = P(−1.20 < Z < 1.39)
                      = Φ(1.39) − Φ(−1.20)
                      = 0.9177 − (1 − 0.1151)
                      = 0.8026.

As usual, the values of Φ(1.39) and Φ(−1.20) can be found from Table 4 of the
New Cambridge Statistical Tables. Since there are 200 independent random
samples drawn, we can now think of each as a single trial. The recorded mean
lies between 67.9 and 69.2 with probability 0.8026 at each trial. We are dealing
with a binomial distribution with n = 200 trials and probability of success
π = 0.8026. The expected number of successes is:

nπ = 200 × 0.8026 = 160.52.

(c) The probability that the recorded mean is < 67.0 inches is:
 
P(X̄ < 66.95) = P(Z < (66.95 − 68.5)/0.54) = P(Z < −2.87) = Φ(−2.87) = 0.00205

so the expected number of recorded means below 67.0 out of a sample of 200 is:

200 × 0.00205 = 0.41.
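A numerical check of parts (b) and (c) is straightforward (a sketch only, assuming scipy is available; the variable names are my own):

from math import sqrt
from scipy.stats import norm

se = 2.7 / sqrt(25)                               # standard error = 0.54
xbar = norm(loc=68.5, scale=se)

p_b = xbar.cdf(69.25) - xbar.cdf(67.85)           # continuity-corrected, approx. 0.803
p_c = xbar.cdf(66.95)                             # approx. 0.0021
print(200*p_b, 200*p_c)                           # expected counts approx. 160.6 and 0.41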

7. If Z is a random variable with a standard normal distribution, what is


P(Z^2 < 3.841)?


Solution:
We can compute the probability in two different ways. Working with the standard
normal distribution, we have:

P(Z^2 < 3.841) = P(−√3.841 < Z < √3.841)
               = P(−1.96 < Z < 1.96)
               = Φ(1.96) − Φ(−1.96)
               = 0.9750 − (1 − 0.9750) = 0.95.

Alternatively, we can use the fact that Z^2 follows a χ^2_1 distribution. From Table 8
of the New Cambridge Statistical Tables we can see that 3.841 is the 5% right-tail
value for this distribution, and so P(Z^2 < 3.841) = 0.95, as before.

8. Suppose that X1 and X2 are independent N (0, 4) random variables. Compute


P(X1^2 < 36.84 − X2^2).
Solution:
Rearrange the inequality to obtain:

P(X1^2 < 36.84 − X2^2) = P(X1^2 + X2^2 < 36.84)
                       = P((X1^2 + X2^2)/4 < 36.84/4)
                       = P((X1/2)^2 + (X2/2)^2 < 9.21).

Since X1/2 and X2/2 are independent N(0, 1) random variables, the sum of their
squares will follow a χ^2_2 distribution. Using Table 8 of the New Cambridge
Statistical Tables, we see that 9.210 is the 1% right-tail value, so the probability we
are looking for is 0.99.

9. Suppose that X1 , X2 and X3 are independent N (0, 1) random variables, while Y


(independently) follows a χ^2_5 distribution. Compute P(X1^2 + X2^2 < 7.236Y − X3^2).
Solution:
Rearranging the inequality gives:

P(X1^2 + X2^2 < 7.236Y − X3^2) = P(X1^2 + X2^2 + X3^2 < 7.236Y)
                               = P((X1^2 + X2^2 + X3^2)/Y < 7.236)
                               = P(((X1^2 + X2^2 + X3^2)/3)/(Y/5) < (5/3) × 7.236)
                               = P(((X1^2 + X2^2 + X3^2)/3)/(Y/5) < 12.060).

Since X1^2 + X2^2 + X3^2 ∼ χ^2_3, we have a ratio of independent χ^2_3 and χ^2_5 random
variables, each divided by its degrees of freedom. By definition, this follows an F3, 5
distribution. From Table A.3 of the Dougherty Statistical Tables, we see that 12.06
is the 1% upper-tail value for this distribution, so the probability we want is equal
to 0.99.

E.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. (a) Suppose {X1, X2, X3, X4} is a random sample of size n = 4 from the
Bernoulli(0.2) distribution. What is the distribution of Σ_{i=1}^{n} Xi in this case?
(b) Write down the sampling distribution of X̄ = Σ_{i=1}^{n} Xi/n for the sample
considered in (a). In other words, write down the possible values of X̄ and
their probabilities.
Hint: what are the possible values of Σ_i Xi, and their probabilities?
(c) Suppose we have a random sample of size n = 100 from the Bernoulli(0.2)
distribution. What is the approximate sampling distribution of X̄ suggested by
the central limit theorem in this case? Use this distribution to calculate an
approximate value for the probability that X̄ > 0.3. (The true value of this
probability is 0.0061.)

2. Suppose that we plan to take a random sample of size n from a normal distribution
with mean µ and standard deviation σ = 2.
(a) Suppose µ = 4 and n = 20.
i. What is the probability that the mean X̄ of the sample is greater than 5?
ii. What is the probability that X̄ is smaller than 3?
iii. What is P (|X̄ − µ| ≤ 1) in this case?
(b) How large should n be in order that P (|X̄ − µ| ≤ 0.5) ≥ 0.95 for every possible
value of µ?
(c) It is claimed that the true value of µ is 5 in a population. A random sample of
size n = 100 is collected from this population, and the mean for this sample is
x̄ = 5.8. Based on the result in (b), what would you conclude from this value
of X̄?

3. A random sample of 25 audits is to be taken from a company’s total audits, and


the average value of these audits is to be calculated.
(a) Explain what you understand by the sampling distribution of this average and
discuss its relationship to the population mean.


(b) Is it reasonable to assume that this sampling distribution is normal?


(c) If the population of all audits has a mean of £54 and a standard deviation of
£10, find the probability that:
i. the sample mean will be greater than £60
ii. the sample mean will be within 5% of the population mean.

Appendix F
Estimator properties

F.1 Worked examples


1. Let {X1, X2, . . . , Xn} be a random sample from the distribution N(µ, σ^2). X̄ is an
estimator of the population mean µ, where:

X̄ = (1/n) Σ_{i=1}^{n} Xi.

(a) Is X̄ an unbiased estimator of µ? Show why or why not.


(b) Show that Var(X̄) = σ 2 /n.
(c) State the sampling distribution of X̄.
(d) What is the mean squared error (MSE) of X̄?
(e) Is X̄ a desirable estimator of µ for large n? Explain why or why not.
Solution:
(a) X̄ is an unbiased estimator of µ because E(X̄) = µ. We have:

E(X̄) = E((1/n) Σ_{i=1}^{n} Xi) = (1/n) E(Σ_{i=1}^{n} Xi) = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n) nµ = µ.

(b) We have:

Var(X̄) = Var((1/n) Σ_{i=1}^{n} Xi) = (1/n^2) Var(X1 + X2 + · · · + Xn) = (1/n^2) nσ^2 = σ^2/n.

(c) X̄ ∼ N(µ, σ^2/n).

(d) X̄ is an unbiased estimator of µ, hence MSE(X̄) = Var(X̄) = σ^2/n.

(e) Yes, X̄ is an unbiased estimator of µ and Var(X̄) → 0 as n → ∞.

2. Let {X1, X2, . . . , Xn} be a random sample from the distribution N(µ, σ^2), for
µ > 0. µ̂ is an estimator of the population mean µ, where:

µ̂ = (1/(n + 1)) Σ_{i=1}^{n} Xi.


Show that µ̂ is a biased estimator of µ and comment briefly on the nature of the
bias. Determine the bias in terms of n and µ, and suggest how the bias can be
removed to create an unbiased estimator of µ.

Solution:
We have:

E(µ̂) = E((1/(n + 1)) Σ_{i=1}^{n} Xi) = (1/(n + 1)) E(Σ_{i=1}^{n} Xi) = (1/(n + 1)) Σ_{i=1}^{n} E(Xi) = nµ/(n + 1) < µ

hence µ̂ is a negatively-biased estimator of µ. The bias is −µ/(n + 1), so we could
remove the bias by multiplying µ̂ by (n + 1)/n, giving X̄.

3. Given the mean squared error of an estimator θ̂ is defined as E((θ̂ − θ)^2), show that
this can also be expressed as:

Var(θ̂) + (Bias(θ̂))^2.

Solution:
Since Var(X) = E(X^2) − (E(X))^2, E(X^2) = Var(X) + (E(X))^2. Let X = θ̂ − θ.
Substituting in we get:

Var(θ̂ − θ) = Var(θ̂) and (E(θ̂ − θ))^2 = (Bias(θ̂))^2.

Hence:

MSE(θ̂) = E((θ̂ − θ)^2) = Var(θ̂) + (Bias(θ̂))^2.

4. Let T1 be an unbiased estimator of the parameter θ, and T2 be an unbiased


estimator of the parameter φ. Is T1 T2 an unbiased estimator of θφ?

Solution:
No, not in general. For example, let θ = φ and T1 = T2 = T with Var(T) > 0. Then E(T1 T2) = E(T^2) = Var(T) + (E(T))^2 > (E(T))^2 = θ^2 = θφ.

5. A random variable X can take the values 0, 1 and 2. We know that:

3α α α
P (X = 0) = 1 − , P (X = 1) = and P (X = 2) =
4 2 4

such that 0 < α < 4/3. One observation is taken and we want to estimate α.
Consider the estimators T1 = X and T2 = 2X(X − 1) of α.
(a) Show that T1 and T2 are both unbiased estimators of α.

(b) Would you prefer estimator T1 or T2 ? Justify your choice.


Solution:
(a) We have:

E(T1) = E(X) = Σ_x x p(x) = 0 + α/2 + 2α/4 = α

and:

E(T2) = Σ_x 2x(x − 1) p(x) = 0 + 0 + 4α/4 = α.

Hence both T1 and T2 are unbiased estimators of α.

(b) Since both estimators are unbiased, we prefer the minimum variance
estimator. We have:

Var(T1) = Var(X) = E(X^2) − α^2 = 3α/2 − α^2

and:

Var(T2) = E(T2^2) − α^2 = 4α − α^2.

Hence Var(T1) < Var(T2), so we choose T1.

F.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. Let T1 and T2 be two unbiased estimators of the parameter θ. T1 and T2 have the
same variance and they are independent. Consider the following estimators of θ:
S = (T1 + T2)/2  and  R = 2T1 − T2.

(a) Show that S and R are unbiased estimators of θ.


(b) Which one of the three estimators of θ, i.e. T1 , S or R, is the best, and which
one is the worst? Explain your answer.

2. Suppose we have a random sample of n values from a N (µ, σ 2 ) population. The


sample mean, X̄, is known to be an unbiased estimator of µ, and the sample
variance S 2 is an unbiased estimator of σ 2 . We seek an unbiased estimator of µ2 .
(a) Show that X̄ 2 is a biased estimator of µ2 .
(b) Derive an unbiased estimator of µ2 which uses the sample mean and sample
variance.

3. Based on a sample of two independent observations from a population with mean µ


and standard deviation σ, consider the following two estimators of µ:
X = X1/2 + X2/2  and  Y = X1/3 + 2X2/3.
Are they unbiased estimators of µ?

Appendix G
Point estimation

G.1 Worked examples


1. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform
distribution such that X ∼ Uniform[0, θ], where θ > 0. Find the method of
moments estimator (MME) of θ.
Solution:
The pdf of Xi is:

f(xi; θ) = θ^{−1} for 0 ≤ xi ≤ θ, and 0 otherwise.

Therefore:

E(Xi) = (1/θ) ∫_0^θ xi dxi = (1/θ) [xi^2/2]_0^θ = θ/2.

Therefore, setting µ̂1 = M1, we have:

θ̂/2 = X̄  ⇒  θ̂ = 2X̄ = 2 Σ_{i=1}^{n} Xi/n.

2. Let {X1 , X2 , . . . , Xn } be a random sample from the distribution N (µ, 1). Find the
maximum likelihood estimator (MLE) of µ.
Solution:
The joint pdf of the observations is:

f(x1, x2, . . . , xn; µ) = ∏_{i=1}^{n} (1/√(2π)) exp(−(xi − µ)^2/2) = (1/(2π)^{n/2}) exp(−(1/2) Σ_{i=1}^{n} (xi − µ)^2).

We write the above as a function of µ only:

L(µ) = C exp(−(1/2) Σ_{i=1}^{n} (Xi − µ)^2)

where C > 0 is a constant. The MLE µ̂ maximises this function, and also
maximises the function:

l(µ) = ln L(µ) = −(1/2) Σ_{i=1}^{n} (Xi − µ)^2 + ln(C).

Therefore, the MLE effectively minimises Σ_{i=1}^{n} (Xi − µ)^2, i.e. the MLE is also the
least squares estimator (LSE), i.e. µ̂ = X̄.


3. Let {X1 , X2 , . . . , Xn } be a random sample from a Poisson distribution with mean


λ > 0. Find the maximum likelihood estimator (MLE) of λ.
Solution:
The probability function is:

P(X = x) = e^{−λ} λ^x / x!.

The likelihood and log-likelihood functions are, respectively:

L(λ) = ∏_{i=1}^{n} e^{−λ} λ^{Xi} / Xi! = e^{−nλ} λ^{nX̄} / ∏_{i=1}^{n} Xi!

and:

l(λ) = ln L(λ) = nX̄ ln(λ) − nλ + C = n(X̄ ln(λ) − λ) + C

where C is a constant (i.e. it may depend on Xi but cannot depend on the
parameter). Setting:

d l(λ)/dλ = n(X̄/λ̂ − 1) = 0

we obtain the MLE λ̂ = X̄, which is also the MME.

4. Let {X1 , X2 , . . . , Xn } be a random sample from the (continuous) uniform


distribution Uniform[0, θ], where θ > 0 is unknown.
(a) Find the maximum likelihood estimator (MLE) of θ.
(b) If n = 3, x1 = 0.2, x2 = 3.6 and x3 = 1.1, what is the maximum likelihood
estimate of θ?

Solution:
f(x; θ) = θ^{−1} for 0 ≤ x ≤ θ, and 0 otherwise.

The joint pdf is:

f(x1, x2, . . . , xn; θ) = θ^{−n} for 0 ≤ x1, x2, . . . , xn ≤ θ, and 0 otherwise.

In fact f(x1, x2, . . . , xn; θ), as a function of θ, is the likelihood function, L(θ).
The maximum likelihood estimator of θ is the value at which the likelihood
function L(θ) achieves its maximum. Note:

L(θ) = θ^{−n} for X(n) ≤ θ, and 0 otherwise
where:

X(n) = max_i Xi.

Hence the MLE is θ̂ = X(n), which is different from the MME. (L(θ) is zero for
θ < x(n) and decreasing for θ ≥ x(n), so the likelihood is maximised at the sample maximum.)

(b) For the given data, the maximum observation is x(3) = 3.6. Therefore, the
maximum likelihood estimate is θ̂ = 3.6.

5. Use the observed random sample x1 = 8.2, x2 = 10.6, x3 = 9.1 and x4 = 4.9 to
calculate the maximum likelihood estimate of λ in the exponential pdf:
f(x; λ) = λe^{−λx} for x ≥ 0, and 0 otherwise.

Solution:
We derive a general formula with a random sample {X1, X2, . . . , Xn} first. The
joint pdf is:

f(x1, x2, . . . , xn; λ) = λ^n e^{−λnx̄} for x1, x2, . . . , xn ≥ 0, and 0 otherwise.

With all xi ≥ 0, L(λ) = λ^n e^{−λnX̄}, hence the log-likelihood function is:

l(λ) = ln L(λ) = n ln(λ) − λnX̄.

Setting:

d l(λ)/dλ = n/λ̂ − nX̄ = 0  ⇒  λ̂ = 1/X̄.

For the given sample, x̄ = (8.2 + 10.6 + 9.1 + 4.9)/4 = 8.2. Therefore, λ̂ = 0.1220.
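The closed-form answer λ̂ = 1/x̄ can be confirmed by maximising the log-likelihood numerically (a sketch only; scipy is assumed and the function names are mine):

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([8.2, 10.6, 9.1, 4.9])

def neg_loglik(lam):
    # negative of l(lambda) = n ln(lambda) - lambda * sum(x)
    return -(len(x)*np.log(lam) - lam*x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10), method="bounded")
print(res.x, 1/x.mean())      # both approx. 0.1220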


6. The following data show the number of occupants in passenger cars observed
during one hour at a busy junction. It is assumed that these data follow a
geometric distribution with pf:
p(x; π) = (1 − π)^{x−1} π for x = 1, 2, . . ., and 0 otherwise.

Number of occupants 1 2 3 4 5 ≥6 Total


Frequency 678 227 56 28 8 14 1,011

Find the maximum likelihood estimate of π.


Solution:
The sample size is n = 1,011. If we know all the 1,011 observations, the joint
probability function for x1, x2, . . . , x1,011 is:

L(π) = ∏_{i=1}^{1,011} p(xi; π).

However, we only know that there are 678 xi s equal to 1, 227 xi s equal to 2, . . .,
and 14 xi s equal to some integers not smaller than 6.
Note that:

P(Xi ≥ 6) = Σ_{x=6}^{∞} p(x; π) = π(1 − π)^5 (1 + (1 − π) + (1 − π)^2 + · · · ) = π(1 − π)^5 × 1/π = (1 − π)^5.

Hence we may only use:

L(π) = p(1; π)^{678} p(2; π)^{227} p(3; π)^{56} p(4; π)^{28} p(5; π)^{8} ((1 − π)^5)^{14}
     = π^{1,011−14} (1 − π)^{227+56×2+28×3+8×4+14×5}
     = π^{997} (1 − π)^{525}

hence:

l(π) = ln L(π) = 997 ln(π) + 525 ln(1 − π).

Setting:

d l(π)/dπ = 997/π̂ − 525/(1 − π̂) = 0  ⇒  π̂ = 997/(997 + 525) = 0.655.
Remark: Since P (Xi = 1) = π, πb = 0.655 indicates that about 2/3 of cars have only
one occupant. Note E(Xi ) = 1/π. In order to ensure that the average number of
occupants is not smaller than k, we require π < 1/k.
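Again, the analytical estimate can be double-checked by maximising the log-likelihood numerically (a sketch only; scipy is assumed, the grouped data are entered as counts, and the names are my own):

import numpy as np
from scipy.optimize import minimize_scalar

counts = {1: 678, 2: 227, 3: 56, 4: 28, 5: 8}    # fully observed categories
censored = 14                                     # observations with x >= 6

def neg_loglik(p):
    ll = sum(n*((x - 1)*np.log(1 - p) + np.log(p)) for x, n in counts.items())
    ll += censored*5*np.log(1 - p)                # each censored obs contributes P(X >= 6) = (1 - p)^5
    return -ll

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)                                      # approx. 0.655 = 997/1522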


7. Suppose that we have a random sample {X1 , X2 , . . . , Xn } from a Uniform[−θ, θ]


distribution. Find the method of moments estimator of θ.

Solution:
The mean of the Uniform[a, b] distribution is (a + b)/2. In our case, this gives
E(X) = (−θ + θ)/2 = 0. The first population moment does not depend on θ, so we
need to move to the next (i.e. second) population moment.
Recall that the variance of the Uniform[a, b] distribution is (b − a)2 /12. Hence the
second population moment is:

E(X 2 ) = Var(X) + E(X)2


(θ − (−θ))2
= + 02
12
θ2
= .
3

We set this equal to the second sample moment to obtain:


n
1 X 2 θb2
X = .
n i=1 i 3

Therefore, the method of moments estimator of θ is:


v
u n
u3 X
θbM M =t Xi2 .
n i=1

8. Consider again the Uniform[−θ, θ] distribution from Question 7. Suppose that we


observe the following data:

1.8, 0.7, −0.2, −1.8, 2.8, 0.6, −1.3 and − 0.1.

Estimate θ using the method of moments.

Solution:
The point estimate is:
v
u 8
u3 X
θbM M =t x2 ≈ 2.518
8 i=1 i

which implies that the data came from a Uniform[−2.518, 2.518] distribution.
However, this clearly cannot be true since the observation x5 = 2.8 falls outside this
range! The method of moments does not take into account that all of the
observations need to lie in the interval [−θ, θ], and so it fails to produce a useful
estimate.


G.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. Let {X1 , X2 , . . . , Xn } be a random sample from an Exp(λ) distribution. Find the


MLE of λ.

2. Suppose that you are given observations y1 , y2 , y3 and y4 such that:

y1 = α + β + ε1
y2 = −α + β + ε2
y3 = α − β + ε3
y4 = −α − β + ε4.

The random variables εi , for i = 1, 2, 3, 4, are independent and normally distributed


with mean 0 and variance σ 2 .
(a) Find the least squares estimators of the parameters α and β.
(b) Verify that the least squares estimators in (a) are unbiased estimators of their
respective parameters.
(c) Find the variance of the least squares estimator of α.

Appendix H
Analysis of variance (ANOVA)

H.1 Worked examples


1. Three trainee salespeople were working on a trial basis. Salesperson A went in the
field for 5 days and made a total of 440 sales. Salesperson B was tried for 7 days
and made a total of 630 sales. Salesperson C was tried for 10 days and made a total
of 690 sales. Note that these figures are total sales, not daily averages. The sum of
the squares of all 22 daily sales (Σ xi^2) is 146,840.
(a) Construct a one-way analysis of variance table.
(b) Would you say there is a difference between the mean daily sales of the three
salespeople? Justify your answer.
(c) Construct a 95% confidence interval for the mean difference between
salesperson B and salesperson C. Would you say there is a difference?

Solution:

(a) The means are 440/5 = 88, 630/7 = 90 and 690/10 = 69. We will perform a
one-way ANOVA. First, we calculate the overall mean. This is:

(440 + 630 + 690)/22 = 80.
We can now calculate the sum of squares between salespeople. This is:

5 × (88 − 80)^2 + 7 × (90 − 80)^2 + 10 × (69 − 80)^2 = 2,230.

The total sum of squares is:

146,840 − 22 × (80)^2 = 6,040.

Here is the one-way ANOVA table:


Source DF SS MS F p-value
Salesperson 2 2,230 1,115 5.56 ≈ 0.01
Error 19 3,810 200.53
Total 21 6,040

(b) As 5.56 > 3.52 = F0.05, 2, 19 , which is the top 5th percentile of the F2, 19
distribution (interpolated from Table A.3 of the Dougherty Statistical Tables),
we reject H0 : µ1 = µ2 = µ3 and conclude that there is evidence that the means
are not equal.


(c) We have:
90 − 69 ± 2.093 × √(200.53 × (1/7 + 1/10)) = 21 ± 14.61.

Here 2.093 is the top 2.5th percentile point of the t distribution with 19
degrees of freedom. A 95% confidence interval is (6.39, 35.61). As zero is not
included, there is evidence of a difference.
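The whole ANOVA table can be reproduced from the summary figures alone (a sketch only; scipy is used just for the p-value and the variable names are mine):

from scipy.stats import f

n = [5, 7, 10]
totals = [440, 630, 690]
sum_sq = 146840                                   # sum of squared daily sales

N = sum(n)
grand_mean = sum(totals)/N                        # 80
between = sum(ni*(ti/ni - grand_mean)**2 for ni, ti in zip(n, totals))   # 2,230
total_ss = sum_sq - N*grand_mean**2               # 6,040
within = total_ss - between                       # 3,810

F = (between/2) / (within/(N - 3))                # approx. 5.56
print(F, 1 - f.cdf(F, 2, N - 3))                  # p-value approx. 0.012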

2. The total times spent by three basketball players on court were recorded. Player A
was recorded on three occasions and the times were 29, 25 and 33 minutes. Player
B was recorded twice and the times were 16 and 30 minutes. Player C was recorded
on three occasions and the times were 12, 14 and 16 minutes. Use analysis of
variance to test whether there is any difference in the average times the three
players spend on court.
Solution:
We have x̄·A = 29, x̄·B = 23, x̄·C = 14 and x̄ = 21.875. Hence:

3 × (29 − 21.875)2 + 2 × (23 − 21.875)2 + 3 × (14 − 21.875)2 = 340.875.

The total sum of squares is:

4,307 − 8 × (21.875)2 = 478.875.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Players 2 340.875 170.4375 6.175 ≈ 0.045
Error 5 138 27.6
Total 7 478.875

We test H0 : µ1 = µ2 = µ3 (i.e. the average times they play are the same) vs. H1 :
The average times they play are not the same.

As 6.175 > 5.79 = F0.05, 2, 5 , which is the top 5th percentile of the F2, 5 distribution,
we reject H0 and conclude that there is evidence of a difference between the means.

3. Three independent random samples were taken. Sample A consists of 4


observations taken from a normal distribution with mean µA and variance σ 2 ,
sample B consists of 6 observations taken from a normal distribution with mean µB
and variance σ 2 , and sample C consists of 5 observations taken from a normal
distribution with mean µC and variance σ 2 .
The average value of the first sample was 24, the average value of the second
sample was 20, and the average value of the third sample was 18. The sum of the
squared observations (all of them) was 6,722.4. Test the hypothesis:

H 0 : µA = µB = µC

against the alternative that this is not so.


Solution:
We will perform a one-way ANOVA. First we calculate the overall mean:
(4 × 24 + 6 × 20 + 5 × 18)/15 = 20.4.
We can now calculate the sum of squares between groups:

4 × (24 − 20.4)^2 + 6 × (20 − 20.4)^2 + 5 × (18 − 20.4)^2 = 81.6.

The total sum of squares is:

6,722.4 − 15 × (20.4)^2 = 480.

Here is the one-way ANOVA table:

Source DF SS MS F p-value
Sample 2 81.6 40.8 1.229 ≈ 0.327
Error 12 398.4 33.2
Total 14 480

As 1.229 < 3.89 = F0.05, 2, 12 , which is the top 5th percentile of the F2, 12
distribution, we see that there is no evidence that the means are not equal.

4. Four suppliers were asked to quote prices for seven different building materials. The
average quote of supplier A was 1,315.8. The average quotes of suppliers B, C and
D were 1,238.4, 1,225.8 and 1,200.0, respectively. The following is the calculated
two-way ANOVA table with some entries missing.

Source DF SS MS F p-value
Materials 17,800
Suppliers
Error
Total 358,700
(a) Complete the table using the information provided above.
(b) Is there a significant difference between the quotes of different suppliers?
Explain your answer.
(c) Construct a 90% confidence interval for the difference between suppliers A and
D. Would you say there is a difference?
Solution:
(a) The average quote of all suppliers is:
(1,315.8 + 1,238.4 + 1,225.8 + 1,200.0)/4 = 1,245.
Hence the sum of squares (SS) due to suppliers is:

7 × ((1,315.8 − 1,245)^2 + (1,238.4 − 1,245)^2 + (1,225.8 − 1,245)^2 + (1,200.0 − 1,245)^2) = 52,148.88


and the MS due to suppliers is 52,148.88/(4 − 1) = 17,382.96.


The degrees of freedom are 7 − 1 = 6, 4 − 1 = 3, (7 − 1)(4 − 1) = 18 and
7 × 4 − 1 = 27 for materials, suppliers, error and total sum of squares,
respectively.
The SS for materials is 6 × 17,800 = 106,800. We have that the SS due to the
error is given by 358,700 − 52,148.88 − 106,800 = 199,751.12 and the MS is
199,751.12/18 = 11,097.28. The F values are:

17,800 17,382.96
= 1.604 and = 1.567
11,097.28 11,097.28

for materials and suppliers, respectively. The two-way ANOVA table is:
Source DF SS MS F p-value
Materials 6 106,800 17,800 1.604 ≈ 0.203
Suppliers 3 52,148.88 17,382.96 1.567 ≈ 0.232
Error 18 199,751.12 11,097.28
Total 27 358,700
(b) We test H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no difference between suppliers)
vs. H1 : There is a difference between suppliers. The F value is 1.567 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables (degrees of freedom 3 and 18) is 3.16, hence we do not reject
H0 and conclude that there is not enough evidence that there is a difference.
(c) The top 5th percentile of the t distribution with 18 degrees of freedom is 1.734
and the MS value is 11,097.28. So a 90% confidence interval is:
s  
1 1
1,315.8 − 1,200 ± 1.734 × 11,097.28 + = 115.8 ± 97.64
7 7

giving (18.16, 213.44). Since zero is not in the interval, there appears to be a
difference between suppliers A and D.

5. Blood alcohol content (BAC) is measured in milligrams per decilitre of blood


(mg/dL). A researcher is looking into the effects of alcoholic drinks. Four different
individuals tried five different brands of strong beer (A, B, C, D and E) on different
days, of course! Each individual consumed 1L of beer over a 30-minute period and
their BAC was measured one hour later. The average BAC for beers A, C, D and E
were 83.25, 95.75, 79.25 and 99.25, respectively. The value for beer B is not given.
The following information is provided as well.

Source DF SS MS F p-value
Drinker 1.56
Beer 303.5
Error 695.6
Total


(a) Complete the table using the information provided above.


(b) Is there a significant difference between the effects of different beers? What
about different drinkers?
(c) Construct a 90% confidence interval for the difference between the effects of
beers C and D. Would you say there is a difference?
Solution:
(a) We have:
Source DF SS MS F p-value
Drinker 3 271.284 90.428 1.56 ≈ 0.250
Beer 4 1214 303.5 5.236 ≈ 0.011
Error 12 695.6 57.967
Total 19 2,180.884
(b) We test the hypothesis H0 : µ1 = µ2 = · · · = µ5 (i.e. there is no difference
between the effects of different beers) vs. the alternative H1 : There is a
difference between the effects of different beers. The F value is 5.236 and at a
5% significance level the critical value from Table A.3 of the Dougherty
Statistical Tables is F0.05, 4, 12 = 3.26, so since 5.236 > 3.26 we reject H0 and
conclude that there is evidence of a difference.
For drinkers, we test the hypothesis H0 : µ1 = µ2 = µ3 = µ4 (i.e. there is no
difference between the effects on different drinkers) vs. the alternative H1 :
There is a difference between the effects on different drinkers. The F value is
1.56 and at a 5% significance level the critical value from Table A.3 of the
Dougherty Statistical Tables is F0.05, 3, 12 = 3.49, so since 1.56 < 3.49 we fail to
reject H0 and conclude that there is no evidence of a difference.
(c) The top 5th percentile of the t distribution with 12 degrees of freedom is 1.782.
So a 90% confidence interval is:
95.75 − 79.25 ± 1.782 × √(57.967 × (1/4 + 1/4)) = 16.5 ± 9.59

giving (6.91, 26.09). As the interval does not contain zero, there is evidence of
a difference between the effects of beers C and D.

6. A motor manufacturer operates five continuous-production plants: A, B, C, D and


E. The average rate of production has been calculated for the three shifts of each
plant and recorded in the table below. Does there appear to be a difference in
production rates in different plants or by different shifts?

A B C D E
Early shift 102 93 85 110 72
Late shift 85 87 71 92 73
Night shift 75 80 75 77 76

Solution:
Here r = 3 and c = 5. We may obtain the two-way ANOVA table as follows:


Source DF SS MS F
Shift 2 652.13 326.07 5.62
Plant 4 761.73 190.43 3.28
Error 8 463.87 57.98
Total 14 1,877.73

Under the null hypothesis of no shift effect, F ∼ F2, 8 . Since F0.05, 2, 8 = 4.46 < 5.62,
we can reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.030.)
Under the null hypothesis of no plant effect, F ∼ F4, 8 . Since F0.05, 4, 8 = 3.84 > 3.28,
we cannot reject the null hypothesis at the 5% significance level. (Note the p-value
= 0.072.)
Overall, the data collected show some evidence of a shift effect but little evidence
of a plant effect.
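The two-way ANOVA table can be verified directly from the 3 × 5 data (a sketch only, using numpy; the variable names are my own):

import numpy as np

data = np.array([[102, 93, 85, 110, 72],
                 [ 85, 87, 71,  92, 73],
                 [ 75, 80, 75,  77, 76]])        # rows = shifts, columns = plants

r, c = data.shape
grand = data.mean()
ss_row = c*((data.mean(axis=1) - grand)**2).sum()        # shift SS, approx. 652.13
ss_col = r*((data.mean(axis=0) - grand)**2).sum()        # plant SS, approx. 761.73
ss_tot = ((data - grand)**2).sum()                       # total SS, approx. 1,877.73
ss_err = ss_tot - ss_row - ss_col                        # error SS, approx. 463.87

F_row = (ss_row/(r - 1)) / (ss_err/((r - 1)*(c - 1)))    # approx. 5.62
F_col = (ss_col/(c - 1)) / (ss_err/((r - 1)*(c - 1)))    # approx. 3.28
print(F_row, F_col)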

7. Complete the two-way ANOVA table below. In the places of p-values, indicate in
the form such as ‘< 0.01’ appropriately and use the closest value which you may
find from the Dougherty Statistical Tables.

Source DF SS MS F p-value
Row factor 4 ? 234.23 ? ?
Column factor 6 270.84 45.14 1.53 ?
Error ? 708.00 ?
Total 34 1,915.76

Solution:
First, C2 SS = (C2 MS)×4 = 936.92.
The degrees of freedom for Error is 34 − 4 − 6 = 24. Therefore, Error MS
= 708.00/24 = 29.5.
Hence the F statistic for testing no C2 effect is 234.23/29.5 = 7.94. From Table A.3
of the Dougherty Statistical Tables, F0.001, 4, 24 = 6.59 < 7.94. Therefore, the
corresponding p-value is smaller than 0.001.
Since F0.05, 6, 24 = 2.51 > 1.53, the p-value for testing the C3 effect is greater than
0.05.
The complete ANOVA table is as follows:

Two-way ANOVA: C1 versus C2, C3

Source DF SS MS F P
C2 4 936.92 234.23 7.94 <0.001
C3 6 270.84 45.14 1.53 >0.05
Error 24 708.00 29.5
Total 34 1,915.76


H.2 Practice questions


Try to solve the questions before looking at the solutions – promise?! Solutions are
located in Appendix I.

1. An executive of a prepared frozen meals company is interested in the amounts of


money spent on such products by families in different income ranges. The table
below lists the monthly expenditures (in dollars) on prepared frozen meals from 15
randomly selected families divided into three groups according to their incomes.

Under $15,000 $15,000 – $30,000 Over $30,000


45.2 53.2 52.7
60.1 56.6 73.6
52.8 68.7 63.3
31.7 51.8 51.8
33.6 54.2
39.4

(a) Based on these data, can we infer at the 5% significance level that the
population mean expenditures on prepared frozen meals are the same for the
three different income groups?
(b) Produce a one-way ANOVA table.
(c) Construct 95% confidence intervals for the mean expenditures of the first
(under $15,000) and the third (over $30,000) income groups.

2. Does the level of success of publicly-traded companies affect the way their board
members are paid? The annual payments (in $000s) of randomly selected
publicly-traded companies to their board members were recorded. The companies
were divided into four quarters according to the returns in their stocks, and the
payments from each quarter were grouped together. Some summary statistics are
provided below.

Descriptive Statistics: 1st quarter, 2nd quarter, 3rd quarter, 4th quarter

Variable N Mean SE Mean StDev


1st quarter 30 74.10 2.89 15.81
2nd quarter 30 75.67 2.48 13.57
3rd quarter 30 78.50 2.79 15.28
4th quarter 30 81.30 2.85 15.59
(a) Can we infer that the amount of payment differs significantly across the four
groups of companies?
(b) Construct 95% confidence intervals for the mean payment of the 1st quarter
companies and the 4th quarter companies.


Appendix I
Solutions to Practice questions

I.1 Appendix A – Probability theory


1. (a) We know P (E ∪ F ) = P (E) + P (F ) − P (E ∩ F ).
Consider A ∪ B ∪ C as (A ∪ B) ∪ C (i.e. as the union of the two sets A ∪ B and
C) and then apply the result above to obtain:

P (A ∪ B ∪ C) = P ((A ∪ B) ∪ C) = P (A ∪ B) + P (C) − P ((A ∪ B) ∩ C).

Now (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C) – a Venn diagram can be drawn to


check this.
So:

P (A∪B ∪C) = P (A∪B)+P (C)−(P (A∩C)+P (B ∩C)−P ((A∩C)∩(B ∩C)))

using the earlier result again for A ∩ C and B ∩ C.


Now (A ∩ C) ∩ (B ∩ C) = A ∩ B ∩ C and if we apply the earlier result once
more for A and B, we obtain:

P (A∪B∪C) = P (A)+P (B)−P (A∩B)+P (C)−P (A∩C)−P (B∩C)+P (A∩B∩C)

which is the required result.

(b) Use the result that if X ⊂ Y then P (X) ≤ P (Y ) for events X and Y .
Since A ⊂ A ∪ B and B ⊂ A ∪ B, we have P (A) ≤ P (A ∪ B) and
P (B) ≤ P (A ∪ B).
Adding these inequalities, P (A) + P (B) ≤ 2P (A ∪ B) so:

$$\frac{P(A) + P(B)}{2} \le P(A \cup B).$$

Similarly, A ∩ B ⊂ A and A ∩ B ⊂ B, so P (A ∩ B) ≤ P (A) and


P (A ∩ B) ≤ P (B).
Adding, 2P (A ∩ B) ≤ P (A) + P (B) so:

$$P(A \cap B) \le \frac{P(A) + P(B)}{2}.$$


2. (a) We know that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). For independent events


A and B, P (A ∩ B) = P (A) P (B), so P (A ∪ B) = P (A) + P (B) − P (A) P (B)
gives $0.75 = p + 2p - 2p^2$, or $2p^2 - 3p + 0.75 = 0$.
Solving the quadratic equation gives:
$$p = \frac{3 - \sqrt{3}}{4} \approx 0.317$$
discarding the irrelevant root for which $p > 1$.
Since A and B are independent, P (A | B) = P (A) = p = 0.317.
(b) For mutually exclusive events, P (A ∪ B) = P (A) + P (B), so 0.75 = p + 2p,
leading to p = 0.25.
Here P (A ∩ B) = 0, so P (A | B) = P (A ∩ B)/P (B) = 0.

3. (a) We are given that A and B are independent, so P (A ∩ B) = P (A) P (B). We


need to show a similar result for Ac and B c , namely we need to show that
P (Ac ∩ B c ) = P (Ac ) P (B c ).
Now Ac ∩ B c = (A ∪ B)c from basic set theory (draw a Venn diagram), hence:

P (Ac ∩ B c ) = P ((A ∪ B)c )


= 1 − P (A ∪ B)
= 1 − (P (A) + P (B) − P (A ∩ B))
= 1 − P (A) − P (B) + P (A ∩ B)
= 1 − P (A) − P (B) + P (A) P (B) (independence assumption)
= (1 − P (A))(1 − P (B)) (factorising)
= P (Ac ) P (B c ) (as required).

(b) To show that X c and Y c are not necessarily mutually exclusive when X and Y
are mutually exclusive, the best approach is to find a counterexample.
Attempts to ‘prove’ the result directly are likely to be logically flawed.
Look for a simple example. Suppose we roll a die. Let X = {6} be the event of
obtaining a 6, and let Y = {5} be the event of obtaining a 5. Obviously X and
Y are mutually exclusive, but X c = {1, 2, 3, 4, 5} and Y c = {1, 2, 3, 4, 6} have
$X^c \cap Y^c \neq \emptyset$, so $X^c$ and $Y^c$ are not mutually exclusive.

I.2 Appendix B – Discrete probability distributions


1. For one hour, we use the Poisson distribution with λ = 1.6.
(a) We have $P(X = 0) = e^{-1.6} \times (1.6)^0/0! = e^{-1.6} = 0.2019$.
(b) $P(X = 5) = e^{-1.6} \times (1.6)^5/5! = 0.0176$.
For two hours, we use the Poisson distribution with λ = 3.2.


(c) We have:
$$P(X > 4) = 1 - (P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4))$$
$$= 1 - e^{-3.2}\left(\frac{(3.2)^0}{0!} + \frac{(3.2)^1}{1!} + \frac{(3.2)^2}{2!} + \frac{(3.2)^3}{3!} + \frac{(3.2)^4}{4!}\right) = 0.2194.$$
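These Poisson probabilities are straightforward to verify numerically; a minimal sketch in Python (assuming scipy is available):

    from scipy.stats import poisson

    print(poisson.pmf(0, 1.6))    # P(X = 0) over one hour,  ~0.2019
    print(poisson.pmf(5, 1.6))    # P(X = 5) over one hour,  ~0.0176
    print(poisson.sf(4, 3.2))     # P(X > 4) over two hours, ~0.2194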

2. (a) If we assume that the calving process is random (as the remark about
seasonality hints) then we are counting events over periods of time (with, in
particular, no obvious upper maximum), and hence the appropriate probability
distribution is the Poisson distribution.
(b) The rate parameter for one week is 0.4, so for two weeks we use λ = 0.8, hence:
$$P(X = 0) = \frac{e^{-0.8} \times (0.8)^0}{0!} = e^{-0.8} = 0.4493.$$
(c) If it is correct to use the Poisson distribution then events are independent, and
so:
P (none in weeks 1 & 2) = P (none in weeks 3 & 4) = · · · = 0.4493.

(d) The rate parameter for four weeks is λ = 1.6, hence:


$$P(X = 3) = \frac{e^{-1.6} \times (1.6)^3}{3!} = 0.1378.$$
(e) Bayes’ formula tells us that:
P (3 in weeks 5 to 8 | 3 in weeks 1 to 4) × P (3 in weeks 1 to 4)
= P (3 in weeks 5 to 8 ∩ 3 in weeks 1 to 4).
If it is correct to use the Poisson distribution then events are independent, and
so:
P (3 in weeks 5 to 8 ∩ 3 in weeks 1 to 4)
= P (3 in weeks 5 to 8) × P (3 in weeks 1 to 4).
Therefore, cancelling, we get:
P (3 in weeks 5 to 8 | 3 in weeks 1 to 4) = P (3 in weeks 5 to 8)
= P (3 in weeks 1 to 4)
= 0.1378.

(f) The fact that the results are identical in the two cases is a consequence of the
independence built into the assumption that the Poisson distribution is the
appropriate one to use. A Poisson process does not ‘remember’ what happened
before the start of a period under consideration.


3. It is useful to show the distribution of R first.

R=r 0 1 2 3 4 Total
P (R = r) 0.4096 0.4096 0.1536 0.0256 0.0016 1

So the distribution of T = (R − 2)2 is:

T =t 0 1 4 Total
P (T = t) 0.1536 0.4352 0.4112 1

since T = 1 means R = 1 or R = 3, so we add the probabilities. Similarly, T = 4


means R = 0 or R = 4. Therefore:
$$E(T) = \sum_t t\, P(T = t) = (0 \times 0.1536) + (1 \times 0.4352) + (4 \times 0.4112) = 2.08.$$
Note $E(T^2) = 7.0144$, hence $\text{Var}(T) = E(T^2) - \mu_T^2 = 7.0144 - (2.08)^2 = 2.6880$. Hence the standard deviation is $\sigma = \sqrt{2.6880} = 1.6395$.
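The expectation and standard deviation of T can also be checked by direct enumeration; a short sketch in Python:

    # P(R = r) for R ~ Bin(4, 0.2), taken from the table above
    p_r = {0: 0.4096, 1: 0.4096, 2: 0.1536, 3: 0.0256, 4: 0.0016}

    e_t  = sum((r - 2) ** 2 * p for r, p in p_r.items())   # E(T)   = 2.08
    e_t2 = sum((r - 2) ** 4 * p for r, p in p_r.items())   # E(T^2) = 7.0144
    print(e_t, (e_t2 - e_t ** 2) ** 0.5)                   # 2.08, ~1.6395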

I.3 Appendix C – Continuous probability distributions


1. Let X be the number of television viewers in a random sample of 500, such that if
the advertising agency’s claim is true, we would have X ∼ Bin(500, 0.4). Given the
sample size, we want to use the normal approximation to X, with µ = 500 × 0.4
= 200 and σ 2 = 500 × 0.4 × (1 − 0.4) = 120. Therefore, we can assume that
X ∼ N (200, 120).
We might use the continuity correction, but in this case (chiefly because the sample
size is so much larger than 30) it makes little difference. So we want:
$$P(X < 170) = P\left(Z < \frac{170 - 200}{\sqrt{120}}\right) = P(Z < -2.74) = P(Z > 2.74) = 0.0031.$$
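The approximation can be compared with the exact binomial probability; a sketch in Python, assuming scipy is available:

    from scipy.stats import binom, norm

    print(binom.cdf(169, 500, 0.4))             # exact P(X < 170) = P(X <= 169)
    print(norm.cdf((170 - 200) / 120 ** 0.5))   # normal approximation, ~0.0031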

2. (a) With a population of just 20, and 40% of them, i.e. 8, being EMFSS FC
supporters, the probability of success (a ‘yes’ response) changes each time a
respondent is asked, and the change depends on the respondent’s answer. For
instance:
• The probability is π1 = 0.4 that the first respondent supports EMFSS FC.
• If the first respondent supports EMFSS FC, the probability is
π2|1=‘yes’ = 7/19 = 0.3684 that the second respondent supports EMFSS FC
too.


• If the first respondent does not support EMFSS FC, the probability is
π2|1=‘no’ = 8/19 = 0.4211 that the second respondent supports EMFSS FC.
This means that we cannot assume that the probability of a ‘yes’ is (virtually)
the same from one respondent to the next. Therefore, we cannot regard the
successive questions to different people as being successive identical Bernoulli
trials. Equivalently, we must treat the problem as if we were using sampling
without replacement.
So, if ‘S’ means ‘supports EMFSS FC’ and ‘S c ’ means ‘does not support
EMFSS FC’, there are four possible sequences of responses which give exactly
three Ss, and we want the sum of their separate probabilities. So we have:

$$P(SSSS^c) = \frac{8}{20} \times \frac{7}{19} \times \frac{6}{18} \times \frac{12}{17} = \frac{12 \times 8 \times 7 \times 6}{20 \times 19 \times 18 \times 17}$$
$$P(SSS^cS) = \frac{8}{20} \times \frac{7}{19} \times \frac{12}{18} \times \frac{6}{17} = \frac{12 \times 8 \times 7 \times 6}{20 \times 19 \times 18 \times 17}$$
$$P(SS^cSS) = \frac{8}{20} \times \frac{12}{19} \times \frac{7}{18} \times \frac{6}{17} = \frac{12 \times 8 \times 7 \times 6}{20 \times 19 \times 18 \times 17}$$
$$P(S^cSSS) = \frac{12}{20} \times \frac{8}{19} \times \frac{7}{18} \times \frac{6}{17} = \frac{12 \times 8 \times 7 \times 6}{20 \times 19 \times 18 \times 17}.$$

Hence P (exactly 3 Ss) = 4 × (12 × 8 × 7 × 6)/(20 × 19 × 18 × 17) = 0.1387.
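This is the hypergeometric probability of 3 supporters in a sample of 4 drawn without replacement from a population of 20 containing 8 supporters; a one-line numerical check in Python, assuming scipy is available:

    from scipy.stats import hypergeom

    # arguments: k successes, population size M, successes in population n, sample size N
    print(hypergeom.pmf(3, 20, 8, 4))   # ~0.1387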

(b) In this case we can assume that the probability of an ‘S’ is (for practical
purposes) constant, hence we have a sequence of independent and identical
Bernoulli trials (and/or that we are using sampling with replacement).
Therefore, we can assume that, if X is the random variable which counts the
number of Ss out of forty people surveyed, then $X \sim \text{Bin}(40, 0.4)$. Hence:
$$P(X = 20) = \binom{40}{20} \times (0.4)^{20} \times (0.6)^{20} = 0.0554.$$

Alternatively, we could use a normal approximation to the binomial where


Y ∼ N (16, 9.6), since E(X) = 16 and Var(X) = 9.6. Hence (using the
continuity correction):
$$P(X = 20) = P(19.5 \le Y \le 20.5) = P\left(\frac{19.5 - 16}{\sqrt{9.6}} \le Z \le \frac{20.5 - 16}{\sqrt{9.6}}\right)$$
$$= P(1.13 \le Z \le 1.45) = 1 - P(Z > 1.45) - (1 - P(Z > 1.13)) = 1 - 0.0735 - (1 - 0.1292) = 0.0557.$$

(c) With n = 100 (and a ‘large’ population) we know that we can use the normal
approximation to X ∼ Bin(100, 0.4). Since E(X) = 40 and Var(X) = 24, we
can approximate X with Y , where Y ∼ N (40, 24). Hence (using the continuity


correction):
$$P(X \ge 30) = P(Y \ge 29.5) = P\left(Z \ge \frac{29.5 - 40}{\sqrt{24}}\right) = P(Z \ge -2.14) = 1 - P(Z \ge 2.14) = 0.98382.$$

(d) The only approximation used here is the normal approximation to the
binomial, used in (b) and (c), and it is justified because:
• n is ‘large’ (although whether n = 40 in (b) can be considered large is
debatable)
• the population is large enough to justify using the binomial in the first
place
• nπ > 5 and n(1 − π) > 5 in each case.
(e) Three possible comments are the following.
• For (a), if we had (wrongly) used Bin(4, 0.4) then we would have obtained
P (X = 3) = 0.1536, which is quite a long way from the true value (roughly
11%, proportionally) – we might have reached the wrong conclusion.
• For (b) and (c) it was important, to justify using the binomial, that the
population was ‘large’.
• The true value (to 4 decimal places) for (c) is 0.9852, so the
approximation obtained is pretty good – it is very unlikely that we might
have reached the wrong conclusion.

3. Let X be the number of newspapers sold, hence X ∼ N (350, 900).


(a) P (X < 300) = P (Z < (300 − 350)/30) = P (Z < −1.67) = 0.0475.
(b) If XM and XT are the random variables for newspaper sales on Mondays and
on Tuesdays, respectively, the mean of XM − XT is zero, and we want to find
P (XM − XT > 0). Since we can clearly suppose that the distribution of
XM − XT is left-right symmetric about its mean, 0, P (XM − XT > 0) = 0.5.
(c) If T is the number of newspapers sold in a five-day week then we have
T ∼ N (1,750, 4,500) (since it is the sum of five copies of X, which may be
assumed to be independent). Therefore:
$$P(T < 1{,}700) = P\left(Z < \frac{1{,}700 - 1{,}750}{\sqrt{4{,}500}}\right) = P(Z < -0.75) = 0.2266.$$

Note that since we have assumed that daily newspaper sales are independent,
which we were not told explicitly in the question, we need to (i.) think if it is
reasonable – in the given question – to assume independence, and then (ii.) say
we have assumed it.


(d) Let $s$ be the required stock, then we require $P(X > s) = 0.1$. Hence:
$$P\left(Z > \frac{s - 350}{30}\right) = 0.1 \quad \Rightarrow \quad \frac{s - 350}{30} \ge 1.28 \quad \Rightarrow \quad s \ge 350 + 1.28 \times 30 = 388.4.$$

Rounding up, the required stock is 389.
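The same stock level can be read off directly from the normal quantile function; a sketch in Python, assuming scipy is available:

    import math
    from scipy.stats import norm

    s = 350 + norm.ppf(0.9) * 30   # norm.ppf(0.9) ~ 1.2816
    print(math.ceil(s))            # 389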

I.4 Appendix D – Multivariate random variables

1. (a) The joint distribution table is:


X=x
0 1 2
0 0 A 2A
Y =y 1 A 2A 3A
2 2A 3A 4A
Since $\sum_{\forall x}\sum_{\forall y} p_{X,Y}(x, y) = 1$, we have $A = 1/18$.

(b) The marginal distribution of X (similarly of Y ) is:


X=x 0 1 2
P (X = x) 3A = 1/6 6A = 1/3 9A = 1/2

(c) The distribution of X | Y = 1 is:


X = x|y = 1 0 1 2
PX|Y =1 (X = x | y = 1) A/6A = 1/6 2A/6A = 1/3 3A/6A = 1/2
Hence E(X | Y = 1) = (0 × 1/6) + (1 × 1/3) + (2 × 1/2) = 4/3.

(d) Even though the distributions of X and X | Y = 1 are the same, X and Y are
not independent. For example, P (X = 0, Y = 0) = 0 although P (X = 0) 6= 0
and P (Y = 0) 6= 0.

2. (a) We have:

$$P(X = 2 \mid Y < 3) = \frac{P(X = 2, Y < 3)}{P(Y < 3)} = \frac{P(X = 2, Y = 0) + P(X = 2, Y = 2)}{P(Y = 0) + P(Y = 2)} = \frac{0.10 + 0.16}{0.30 + 0.30} = 0.4333.$$


(b) Here is the table of joint probabilities:

U =0 U =2
V =0 0.10 0.20
V =2 0.16 0.14
V =4 0.20 0.20

We then have:

P (U = 0) = 0.10 + 0.16 + 0.20 = 0.46 and P (U = 2) = 1 − 0.46 = 0.54.

Also, P (V = 0) = 0.30, P (V = 2) = 0.30 and P (V = 4) = 0.40. So:

E(U ) = 0 × 0.46 + 2 × 0.54 = 1.08

E(V ) = 0 × 0.30 + 2 × 0.30 + 4 × 0.40 = 2.20

and:
E(U V ) = 2 × 2 × 0.14 + 2 × 4 × 0.20 = 2.16
Hence:

Cov(U, V ) = E(U V ) − E(U ) E(V ) = 2.16 − 1.08 × 2.20 = −0.216.
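The same covariance can be recovered by enumerating the joint probability table above; a short sketch in Python:

    # joint probabilities p(u, v) from the table above
    p = {(0, 0): 0.10, (2, 0): 0.20,
         (0, 2): 0.16, (2, 2): 0.14,
         (0, 4): 0.20, (2, 4): 0.20}

    e_u  = sum(u * prob for (u, v), prob in p.items())       # 1.08
    e_v  = sum(v * prob for (u, v), prob in p.items())       # 2.20
    e_uv = sum(u * v * prob for (u, v), prob in p.items())   # 2.16
    print(e_uv - e_u * e_v)                                  # Cov(U, V) = -0.216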

I.5 Appendix E – Sampling distributions of statistics


1. (a) The sum of $n$ independent Bernoulli random variables, each with success probability $\pi$, is $\text{Bin}(n, \pi)$. Here $n = 4$ and $\pi = 0.2$, so $\sum_{i=1}^{4} X_i \sim \text{Bin}(4, 0.2)$.
(b) The possible values of $\sum X_i$ are 0, 1, 2, 3 and 4, and their probabilities can be calculated from the binomial distribution. For example:
$$P\left(\sum_{i=1}^{4} X_i = 1\right) = \binom{4}{1} (0.2)^1 (0.8)^3 = 4 \times 0.2 \times 0.512 = 0.4096.$$

The other probabilities are shown in the table below.
Since $\bar{X} = \sum X_i/4$, the possible values of $\bar{X}$ are 0, 0.25, 0.5, 0.75 and 1. Their probabilities are the same as those of the corresponding values of $\sum X_i$. For example, $P(\bar{X} = 0.25) = P(\sum X_i = 1) = 0.4096$. The values and their probabilities are:
X̄ = x̄ 0.0 0.25 0.50 0.75 1.0
P (X̄ = x̄) 0.4096 0.4096 0.1536 0.0256 0.0016
(c) For Xi ∼ Bernoulli(π), E(Xi ) = π and Var(Xi ) = π(1 − π). Therefore, the
approximate normal sampling distribution of X̄, derived from the central limit
theorem, is $N(\pi, \pi(1 - \pi)/n)$. Here this is:
$$N\left(0.2, \frac{0.2 \times 0.8}{100}\right) = N(0.2, 0.0016) = N(0.2, (0.04)^2).$$


Therefore, the probability requested by the question is approximately:
$$P(\bar{X} > 0.3) = P\left(\frac{\bar{X} - 0.2}{0.04} > \frac{0.3 - 0.2}{0.04}\right) = P(Z > 2.5) = 0.0062$$
using Table 4 of the New Cambridge Statistical Tables. This is very close to the
probability obtained from the exact sampling distribution, which is about
0.0061.
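The closeness of the CLT approximation to the exact sampling distribution can be checked numerically; a sketch in Python, assuming scipy is available:

    from scipy.stats import binom, norm

    n, pi = 100, 0.2
    print(binom.sf(30, n, pi))                                # exact P(X-bar > 0.3), ~0.0061
    print(norm.sf((0.3 - pi) / (pi * (1 - pi) / n) ** 0.5))   # CLT approximation, ~0.0062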

2. (a) Let {X1 , X2 , . . . , Xn } denote the random sample. We know that the sampling
distribution of $\bar{X}$ is $N(\mu, \sigma^2/n)$, here $N(4, 2^2/20) = N(4, 0.2)$.
i. The probability we need is:
$$P(\bar{X} > 5) = P\left(\frac{\bar{X} - 4}{\sqrt{0.2}} > \frac{5 - 4}{\sqrt{0.2}}\right) = P(Z > 2.24) = 0.0126$$
where, as usual, Z ∼ N (0, 1).
ii. P (X̄ < 3) is obtained similarly. Note that this leads to
P (Z < −2.24) = 0.0126, which is equal to the P (X̄ > 5) = P (Z > 2.24)
result obtained above. This is because 5 is one unit above the mean µ = 4,
and 3 is one unit below the mean, and because the normal distribution is
symmetric around its mean.
iii. One way of expressing this is:
P (X̄ − µ > 1) = P (X̄ − µ < −1) = 0.0126
for µ = 4. This also shows that:
P (X̄ − µ > 1) + P (X̄ − µ < −1) = P (|X̄ − µ| > 1) = 2 × 0.0126 = 0.0252
and hence:
P (|X̄ − µ| ≤ 1) = 1 − 2 × 0.0126 = 0.9748.
In other words, the probability is 0.9748 that the sample mean is within
one unit of the true population mean, µ = 4.
(b) We can use the same ideas as in (a). Since X̄ ∼ N (µ, 4/n) we have:
$$P(|\bar{X} - \mu| \le 0.5) = 1 - 2 \times P(\bar{X} - \mu > 0.5) = 1 - 2 \times P\left(\frac{\bar{X} - \mu}{\sqrt{4/n}} > \frac{0.5}{\sqrt{4/n}}\right) = 1 - 2 \times P(Z > 0.25\sqrt{n}) \ge 0.95$$
which holds if:
$$P(Z > 0.25\sqrt{n}) \le \frac{0.05}{2} = 0.025.$$
From Table 4 of the New Cambridge Statistical Tables, we see that this is true when $0.25\sqrt{n} \ge 1.96$, i.e. when $n \ge (1.96/0.25)^2 = 61.5$. Rounding up to the nearest integer, we get $n \ge 62$. The sample size should be at least 62 for us to
be 95% confident that the sample mean will be within 0.5 units of the true
mean, µ.
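The sample size calculation generalises readily; a minimal sketch in Python (assuming scipy is available), with the half-width and confidence level as illustrative inputs:

    import math
    from scipy.stats import norm

    sigma, half_width, conf = 2, 0.5, 0.95
    z = norm.ppf(1 - (1 - conf) / 2)               # ~1.96
    n = math.ceil((z * sigma / half_width) ** 2)
    print(n)                                       # 62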


(c) Here n > 62, yet x̄ is further than 0.5 units from the claimed mean of µ = 5.
Based on the result in (b), this would be quite unlikely if µ is really 5. One
explanation of this apparent contradiction is that µ is not really equal to 5.
This kind of reasoning will be the basis of statistical hypothesis testing, which
will be discussed later in the course.

3. (a) The sample average is composed of 25 randomly sampled data which are
subject to sampling variability, hence the average is also subject to this
variability. Its sampling distribution describes its probability properties. If a
large number of such averages were independently sampled, then their
histogram would be the sampling distribution.
(b) It is reasonable to assume that this sampling distribution is normal due to the
CLT, although the sample size is rather small. If n = 25 and µ = 54 and
σ = 10, then the CLT says that:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(54, \frac{100}{25}\right).$$

(c) i. We have:
$$P(\bar{X} > 60) = P\left(Z > \frac{60 - 54}{\sqrt{100/25}}\right) = P(Z > 3) = 0.0013$$

using Table 4 of the New Cambridge Statistical Tables.


ii. We are asked for:
$$P(0.95 \times 54 < \bar{X} < 1.05 \times 54) = P\left(\frac{-0.05 \times 54}{2} < Z < \frac{0.05 \times 54}{2}\right) = P(-1.35 < Z < 1.35) = 0.8230$$

using Table 4 of the New Cambridge Statistical Tables.

I.6 Appendix F – Estimator properties


1. (a) To show both $S$ and $R$ are unbiased estimators of $\theta$, we have:
$$E(S) = E\left(\frac{T_1 + T_2}{2}\right) = \frac{E(T_1) + E(T_2)}{2} = \frac{\theta + \theta}{2} = \theta$$
and:
$$E(R) = E(2T_1 - T_2) = 2E(T_1) - E(T_2) = 2\theta - \theta = \theta.$$

(b) Let σ 2 be the variance of T1 . We have:

$$\text{Var}(S) = \frac{\text{Var}(T_1) + \text{Var}(T_2)}{4} = \frac{\sigma^2}{2}$$

and:
$$\text{Var}(R) = 4\,\text{Var}(T_1) + \text{Var}(T_2) = 5\sigma^2.$$
Since Var(S) < Var(T1 ) < Var(R), and given that they are all unbiased
estimators, then S is the best estimator and R is the worst estimator due to
the equivalent ranking of their MSEs.

2. (a) We have:

$$E(\bar{X}^2) = (E(\bar{X}))^2 + \text{Var}(\bar{X}) = \mu^2 + \frac{\sigma^2}{n} \neq \mu^2.$$
(b) We know $E(S^2) = \sigma^2$, so that, by combining this with the above, it follows that:
$$E\left(\bar{X}^2 - \frac{S^2}{n}\right) = E(\bar{X}^2) - E\left(\frac{S^2}{n}\right) = \mu^2 + \frac{\sigma^2}{n} - \frac{\sigma^2}{n} = \mu^2.$$
Hence $\bar{X}^2 - S^2/n$ is an unbiased estimator of $\mu^2$.

3. We have:
$$E(X) = E\left(\frac{X_1}{2} + \frac{X_2}{2}\right) = \frac{1}{2} \times E(X_1) + \frac{1}{2} \times E(X_2) = \frac{1}{2} \times \mu + \frac{1}{2} \times \mu = \mu$$
and:
$$E(Y) = E\left(\frac{X_1}{3} + \frac{2X_2}{3}\right) = \frac{1}{3} \times E(X_1) + \frac{2}{3} \times E(X_2) = \frac{1}{3} \times \mu + \frac{2}{3} \times \mu = \mu.$$

It follows that both estimators are unbiased estimators of µ.

I.7 Appendix G – Point estimation


1. The likelihood function is:
$$L(\lambda) = \prod_{i=1}^{n} f(x_i; \lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda X_i} = \lambda^n e^{-\lambda \sum_i X_i} = \lambda^n e^{-\lambda n\bar{X}}$$
so the log-likelihood function is:
$$l(\lambda) = \ln(\lambda^n e^{-\lambda n\bar{X}}) = n\ln(\lambda) - \lambda n\bar{X}.$$
Differentiating and setting equal to zero gives:
$$\frac{d}{d\lambda}\, l(\lambda) = \frac{n}{\lambda} - n\bar{X} = 0 \quad \Rightarrow \quad \hat{\lambda} = \frac{1}{\bar{X}}.$$


The second derivative of the log-likelihood function is:
$$\frac{d^2}{d\lambda^2}\, l(\lambda) = -\frac{n}{\lambda^2}$$
which is always negative, hence the MLE $\hat{\lambda} = 1/\bar{X}$ is indeed a maximum. This happens to be the same as the method of moments estimator of $\lambda$.
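As a numerical illustration (not part of the derivation), maximising the log-likelihood on simulated exponential data reproduces the closed-form estimator $1/\bar{x}$; a sketch in Python, assuming numpy and scipy are available:

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1 / 2.5, size=1000)   # simulated data, true lambda = 2.5

    def neg_loglik(lam):
        # negative of l(lambda) = n*ln(lambda) - lambda*sum(x)
        return -(len(x) * np.log(lam) - lam * x.sum())

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method='bounded')
    print(res.x, 1 / x.mean())   # numerical MLE vs 1/x-bar (they agree)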

2. (a) We start off with the sum of squares function:

$$S = \sum_{i=1}^{4} \varepsilon_i^2 = (y_1 - \alpha - \beta)^2 + (y_2 + \alpha - \beta)^2 + (y_3 - \alpha + \beta)^2 + (y_4 + \alpha + \beta)^2.$$

Now take the partial derivatives:

$$\frac{\partial S}{\partial \alpha} = -2(y_1 - \alpha - \beta) + 2(y_2 + \alpha - \beta) - 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 - y_2 + y_3 - y_4) + 8\alpha$$
and:
$$\frac{\partial S}{\partial \beta} = -2(y_1 - \alpha - \beta) - 2(y_2 + \alpha - \beta) + 2(y_3 - \alpha + \beta) + 2(y_4 + \alpha + \beta) = -2(y_1 + y_2 - y_3 - y_4) + 8\beta.$$

The least squares estimators $\hat{\alpha}$ and $\hat{\beta}$ are the solutions to $\partial S/\partial\alpha = 0$ and $\partial S/\partial\beta = 0$. Hence:
$$\hat{\alpha} = \frac{y_1 - y_2 + y_3 - y_4}{4} \quad \text{and} \quad \hat{\beta} = \frac{y_1 + y_2 - y_3 - y_4}{4}.$$

(b) $\hat{\alpha}$ is an unbiased estimator of $\alpha$ since:
$$E(\hat{\alpha}) = E\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{\alpha + \beta + \alpha - \beta + \alpha - \beta + \alpha + \beta}{4} = \alpha.$$
$\hat{\beta}$ is an unbiased estimator of $\beta$ since:
$$E(\hat{\beta}) = E\left(\frac{y_1 + y_2 - y_3 - y_4}{4}\right) = \frac{\alpha + \beta - \alpha + \beta - \alpha + \beta + \alpha + \beta}{4} = \beta.$$

(c) We have:
$$\text{Var}(\hat{\alpha}) = \text{Var}\left(\frac{y_1 - y_2 + y_3 - y_4}{4}\right) = \frac{4\sigma^2}{16} = \frac{\sigma^2}{4}.$$


I.8 Appendix H – Analysis of variance (ANOVA)


1. (a) For this example, k = 3, n1 = 6, n2 = 5, n3 = 4 and n = n1 + n2 + n3 = 15.
We have x̄·1 = 43.8, x̄·2 = 56.9, x̄·3 = 60.35 and x̄ = 52.58.
Also, $\sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 = 43{,}387.85$.
Total SS $= \sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 - n\bar{x}^2 = 43{,}387.85 - 41{,}469.85 = 1{,}918$.
$w = \sum_{j=1}^{3}\sum_{i=1}^{n_j} x_{ij}^2 - \sum_{j=1}^{3} n_j\bar{x}_{\cdot j}^2 = 43{,}387.85 - 42{,}267.18 = 1{,}120.67$.

Therefore, b = Total SS − w = 1,918 − 1,120.67 = 797.33.


To test H0 : µ1 = µ2 = µ3 , the test statistic value is:
b/(k − 1) 797.33/2
f= = = 4.269.
w/(n − k) 1,120.67/12
Under H0 , F ∼ F2, 12 . Since F0.05, 2, 12 = 3.89 < 4.269, we reject H0 at the 5%
significance level, i.e. there exists evidence indicating that the population mean
expenditures on frozen meals are not the same for the three different income
groups.
(b) The ANOVA table is as follows:
Source DF SS MS F P
Income 2 797.33 398.67 4.269 <0.05
Error 12 1,120.67 93.39
Total 14 1,918.00
(c) A 95% confidence interval for $\mu_j$ is of the form:
$$\bar{X}_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm t_{0.025,\, 12} \times \frac{\sqrt{93.39}}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm \frac{21.056}{\sqrt{n_j}}.$$
For $j = 1$, a 95% confidence interval is $43.8 \pm 21.056/\sqrt{6} \Rightarrow (35.20, 52.40)$.
For $j = 3$, a 95% confidence interval is $60.35 \pm 21.056/\sqrt{4} \Rightarrow (49.82, 70.88)$.
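The F statistic and p-value for part (a) can also be reproduced from the raw data of the question; a minimal sketch in Python, assuming numpy and scipy are available:

    import numpy as np
    from scipy import stats

    groups = [np.array([45.2, 60.1, 52.8, 31.7, 33.6, 39.4]),   # under $15,000
              np.array([53.2, 56.6, 68.7, 51.8, 54.2]),         # $15,000 - $30,000
              np.array([52.7, 73.6, 63.3, 51.8])]               # over $30,000

    f, p = stats.f_oneway(*groups)
    print(f, p)   # f ~ 4.269, p < 0.05, consistent with rejecting H0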

2. (a) Here k = 4 and n1 = n2 = n3 = n4 = 30. We have x̄·1 = 74.10, x̄·2 = 75.67,


x̄·3 = 78.50, x̄·4 = 81.30, b = 909, w = 26,408 and the pooled estimate of σ is
s = 15.09.
Hence the test statistic value is:
$$f = \frac{b/(k - 1)}{w/(n - k)} = 1.33.$$
Under H0 : µ1 = µ2 = µ3 = µ4 , F ∼ Fk−1, n−k = F3, 116 . Since
F0.05, 3, 116 = 2.68 > 1.33, we cannot reject H0 at the 5% significance level.
Hence there is no evidence to support the claim that payments among the four
groups are significantly different.


(b) A 95% confidence interval for µj is of the form:

$$\bar{X}_{\cdot j} \pm t_{0.025,\, n-k} \times \frac{S}{\sqrt{n_j}} = \bar{X}_{\cdot j} \pm t_{0.025,\, 116} \times \frac{15.09}{\sqrt{30}} = \bar{X}_{\cdot j} \pm 5.46.$$

For j = 1, a 95% confidence interval is 74.10 ± 5.46 ⇒ (68.64, 79.56).


For j = 4, a 95% confidence interval is 81.30 ± 5.46 ⇒ (75.84, 86.76).

Appendix J
Formula sheet in the examination

Discrete distributions:

Distribution; $p(x)$; $E(X)$; $\text{Var}(X)$:
Uniform: $p(x) = \dfrac{1}{k}$ for all $x = 1, 2, \ldots, k$; $E(X) = \dfrac{k+1}{2}$; $\text{Var}(X) = \dfrac{k^2 - 1}{12}$
Bernoulli: $p(x) = \pi^x (1-\pi)^{1-x}$ for $x = 0, 1$; $E(X) = \pi$; $\text{Var}(X) = \pi(1-\pi)$
Binomial: $p(x) = \dbinom{n}{x} \pi^x (1-\pi)^{n-x}$ for $x = 0, 1, 2, \ldots, n$; $E(X) = n\pi$; $\text{Var}(X) = n\pi(1-\pi)$
Geometric: $p(x) = (1-\pi)^{x-1}\pi$ for $x = 1, 2, \ldots$; $E(X) = \dfrac{1}{\pi}$; $\text{Var}(X) = \dfrac{1-\pi}{\pi^2}$
Poisson: $p(x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$ for $x = 0, 1, 2, \ldots$; $E(X) = \lambda$; $\text{Var}(X) = \lambda$

Continuous distributions:

Distribution; $f(x)$; $F(x)$; $E(X)$; $\text{Var}(X)$:
Uniform: $f(x) = \dfrac{1}{b-a}$ for $a \le x \le b$; $F(x) = \dfrac{x-a}{b-a}$ for $a \le x \le b$; $E(X) = \dfrac{a+b}{2}$; $\text{Var}(X) = \dfrac{(b-a)^2}{12}$
Exponential: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$; $F(x) = 1 - e^{-\lambda x}$ for $x \ge 0$; $E(X) = \dfrac{1}{\lambda}$; $\text{Var}(X) = \dfrac{1}{\lambda^2}$
Normal: $f(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}$ for all $x$; $E(X) = \mu$; $\text{Var}(X) = \sigma^2$


One-way ANOVA:
Total variation: $\sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - n\bar{X}^2$.
Between-treatments variation: $B = \sum_{j=1}^{k} n_j(\bar{X}_{\cdot j} - \bar{X})^2 = \sum_{j=1}^{k} n_j\bar{X}_{\cdot j}^2 - n\bar{X}^2$.
Within-treatments variation: $W = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (X_{ij} - \bar{X}_{\cdot j})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ij}^2 - \sum_{j=1}^{k} n_j\bar{X}_{\cdot j}^2$.

Two-way ANOVA:
Total variation: $\sum_{i=1}^{r}\sum_{j=1}^{c} (X_{ij} - \bar{X})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - rc\bar{X}^2$.
Between-blocks (rows) variation: $B_{\text{row}} = c\sum_{i=1}^{r} (\bar{X}_{i\cdot} - \bar{X})^2 = c\sum_{i=1}^{r} \bar{X}_{i\cdot}^2 - rc\bar{X}^2$.
Between-treatments (columns) variation: $B_{\text{col}} = r\sum_{j=1}^{c} (\bar{X}_{\cdot j} - \bar{X})^2 = r\sum_{j=1}^{c} \bar{X}_{\cdot j}^2 - rc\bar{X}^2$.
Residual (error) variation: $\sum_{i=1}^{r}\sum_{j=1}^{c} (X_{ij} - \bar{X}_{i\cdot} - \bar{X}_{\cdot j} + \bar{X})^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} X_{ij}^2 - c\sum_{i=1}^{r} \bar{X}_{i\cdot}^2 - r\sum_{j=1}^{c} \bar{X}_{\cdot j}^2 + rc\bar{X}^2$.

Appendix K
Sample examination paper

Time allowed: 2 hours.


Candidates should answer FOUR of the following questions: Question 1 of Section A
(40 marks) and all questions from Section B (60 marks in total). Candidates are
strongly advised to divide their time accordingly.

SECTION A
Answer all parts of Question 1 (40 marks).

1. (a) For each one of the statements below say whether the statement is true or
false, explaining your answer. Note that A and B (and its complement B c ) are
events, while X and Y are random variables.

i. If A and B are mutually exclusive events, then P (A ∩ B) = ∅.

ii. If A and B are two events such that:

P (A | B c ) > P (A)

then P (B | A) < P (B).

iii. If X ∼ Bin(n = 20, π = 0.50), then E(X) is less than the median of X.

iv. If $X \sim \chi^2_7$, then $P(X^3 < 0) = 1$.

v. The covariance between X and Y is different from the covariance between


Y and X.
(10 marks)

(b) A box contains 12 light bulbs, of which two are defective. If a person selects 7
light bulbs at random, without replacement, what is the probability that both
defective light bulbs will be selected?
(5 marks)

(c) Suppose that X is a normal random variable with mean µ = 8. If it is such


that P (X < 4) = 0.40, approximately what is the value of Var(X)?
(5 marks)


(d) A random sample of n = 60 observations is drawn from a distribution with the


following probability density function:

$$f_X(x) = \begin{cases} x^2/9 & \text{for } 0 \le x \le 3 \\ 0 & \text{otherwise.} \end{cases}$$

Let Y denote the number out of the 60 observations which are in the interval
(0.75, 2.75). Calculate E(Y ) and Var(Y ).
(7 marks)

(e) A continuous random variable X has the probability density function:



$$f(x) = \begin{cases} x/8 & \text{for } 0 \le x < 2 \\ (8 - x)/24 & \text{for } 2 \le x \le 8 \\ 0 & \text{otherwise.} \end{cases}$$

i. Sketch the above probability density function. The sketch can be drawn
on ordinary paper – no graph paper needed.
(3 marks)
ii. Derive the cumulative distribution function of X.
(6 marks)
iii. Calculate the lower quartile of X, denoted Q1 .
(2 marks)
iv. What is the mode of X? Briefly justify your answer.
(2 marks)

SECTION B
Answer all three questions from this section (60 marks in total).

2. (a) A random variable X can take the values −1, 0 and 1. We know that

P (X = −1) = 2β, P (X = 0) = 1 − 5β and P (X = 1) = 3β.

One observation is taken and we want to estimate β. Consider the estimators:

$$\hat{\beta}_1 = X \quad \text{and} \quad \hat{\beta}_2 = \frac{X^2}{5}.$$

i. Check whether $\hat{\beta}_1$ and $\hat{\beta}_2$ are unbiased estimators of β.


(4 marks)
ii. Which one of the two above estimators would you prefer? Justify your
answer.
(6 marks)

(b) X1 , X2 , X3 and X4 are independent normal random variables with mean 0 and
variance 36. Using the nearest values in the statistical tables provided,
calculate the following probabilities:

i. P (X1 > −X2 − X3 + 22).


(3 marks)
ii. $P\left(\sum_{i=1}^{4} X_i^2 > 342\right)$.
(3 marks)
iii. $P\left(X_1 < 7.02\sqrt{X_2^2 + X_3^2}\right)$.
(4 marks)


3. (a) Suppose that you are given observations y1 , y2 and y3 such that:

y 1 = α − β + ε1

y2 = α + 2β + ε2

y3 = −α − β + ε3 .

The random variables εi , for i = 1, 2, 3, are independent and normally


distributed each with a mean of 0 and a variance of 3, i.e. εi ∼ N (0, 3).

i. Derive the least squares estimators of the parameters α and β.


(7 marks)
ii. Determine the variance of the estimator of α.
(3 marks)

(b) Let {X1 , X2 , . . . , Xn } be a random sample from the probability distribution


with probability density function:
$$f(x; \theta) = \begin{cases} \theta x^{\theta-1} & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases}$$

where θ > 0 is an unknown parameter.

i. Derive the method of moments estimator of θ.


(4 marks)
ii. Derive the maximum likelihood estimator of θ. You do not need to verify
the solution is a maximum.
(6 marks)

4. (a) A cinema chain decided to analyse their sales in five countries over a period of
twelve months. Annual sales (i.e. in total across the year, in $ millions) for the
five countries were 71.7, 78.6, 80.1, 81.9 and 89.7. The following is the
calculated ANOVA table with some entries missing.
Source Degrees of freedom Sum of squares Mean square F -value
Country
Month 0.915
Error 42.236
Total
Write out the full two-way ANOVA table using the information provided
above, showing your working.
(8 marks)

(b) Consider two random variables X and Y , where X can take the values −1, 0
and 1, and Y can take the values 0 and 1. You are provided with the following
information:
P (X = 0) = 0.2, P (X = 1) = 0.5
also:

P (Y = 1 | X = −1) = 0.8

P (Y = 1 | X = 0) = 0.5

P (Y = 1 | X = 1) = 0.6.

i. Calculate P (Y = 0) and E(Y ).


(6 marks)
ii. Calculate P (Y = 0 | X + Y ≥ 0) to four decimal places.
(6 marks)

[END OF PAPER]

Appendix L
Sample examination paper –
Solutions

Section A

1. (a) i. False. Either A ∩ B = ∅ or P (A ∩ B) = 0.


ii. True. We have:
$$P(A \mid B^c) > P(A) \quad \Rightarrow \quad \frac{P(B^c \mid A)\, P(A)}{P(B^c)} > P(A).$$
So $P(B^c \mid A) > P(B^c)$, hence $1 - P(B \mid A) > 1 - P(B)$, i.e. $P(B \mid A) < P(B)$, and the result follows.


iii. False. Since π = 0.50, then X is symmetric, hence E(X) = median of X.
iv. False. As $X \sim \chi^2_7$, $X \ge 0$, hence $P(X^3 < 0) = 0$.
v. False. Covariance is symmetric:

Cov(X, Y ) = E(XY ) − E(X) E(Y ) = E(Y X) − E(Y ) E(X) = Cov(Y, X).

(b) The sample space consists of all (unordered) subsets of 7 out of the 12 light bulbs in the box. There are $\binom{12}{7}$ such subsets. The number of subsets which contain the two defective bulbs is the number of subsets of size 5 out of the other 10 bulbs, $\binom{10}{5}$, so the probability we want is:
$$\frac{\binom{10}{5}}{\binom{12}{7}} = \frac{7 \times 6}{12 \times 11} = 0.3182.$$
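A one-line numerical check of this combinatorial probability in Python, using math.comb:

    from math import comb

    print(comb(10, 5) / comb(12, 7))   # 0.3182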

(c) We have:
$$0.40 = P(X < 4) = P\left(Z < \frac{4 - 8}{\sigma}\right) = P\left(Z < -\frac{4}{\sigma}\right)$$
where $Z \sim N(0, 1)$. Since $P(Z > 0.255) = P(Z < -0.255) \approx 0.40$, we have:
$$-0.255 = -\frac{4}{\sigma} \quad \Rightarrow \quad \sigma = 15.686$$
so, approximately, $\text{Var}(X) = (15.686)^2 = 246.05$.
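Using the exact normal quantile instead of the tabulated value 0.255 gives a slightly different figure, which serves as a numerical check; a sketch in Python, assuming scipy is available:

    from scipy.stats import norm

    z = norm.ppf(0.40)       # ~ -0.2533
    sigma = (4 - 8) / z
    print(sigma ** 2)        # ~249, close to the 246.05 obtained from the tables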


(d) The probability that an observation of X lies in (0.75, 2.75) is:
$$P(0.75 < X < 2.75) = \int_{0.75}^{2.75} \frac{x^2}{9}\, dx = \left[\frac{x^3}{27}\right]_{0.75}^{2.75} = 0.7546.$$
Therefore, $Y \sim \text{Bin}(60, 0.7546)$. Hence $E(Y) = 60 \times 0.7546 = 45.276$ and:
$$\text{Var}(Y) = n\pi(1-\pi) = 60 \times 0.7546 \times 0.2454 = 11.111.$$

(e) i. We have the following sketch: the pdf rises linearly from 0 at $x = 0$ to a peak of 0.25 at $x = 2$, then falls linearly back to 0 at $x = 8$, giving a triangular shape.
ii. We determine the cdf by integrating the pdf over the appropriate range, hence:
$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ x^2/16 & \text{for } 0 \le x < 2 \\ (x/3) - (x^2/48) - 1/3 & \text{for } 2 \le x \le 8 \\ 1 & \text{for } x > 8. \end{cases}$$
This results from the following calculations. Firstly, for $x < 0$, we have:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt = \int_{-\infty}^{x} 0\, dt = 0.$$
For $0 \le x < 2$, we have:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt = \int_{-\infty}^{0} 0\, dt + \int_{0}^{x} \frac{t}{8}\, dt = \left[\frac{t^2}{16}\right]_0^x = \frac{x^2}{16}.$$
For $2 \le x \le 8$, we have:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt = \int_{-\infty}^{0} 0\, dt + \int_{0}^{2} \frac{t}{8}\, dt + \int_{2}^{x} \frac{8 - t}{24}\, dt$$
$$= 0 + \frac{1}{4} + \left[\frac{t}{3} - \frac{t^2}{48}\right]_2^x = \frac{1}{4} + \left(\frac{x}{3} - \frac{x^2}{48}\right) - \left(\frac{2}{3} - \frac{1}{12}\right) = \frac{x}{3} - \frac{x^2}{48} - \frac{1}{3}.$$

iii. The lower quartile solves F (Q1 ) = 0.25, hence Q1 = 2.


iv. The mode is 2, i.e. where the probability density function reaches a
maximum.

Section B

2. (a) i. We have:
$$E(\hat{\beta}_1) = E(X) = \sum_{\forall x} x\, p(x) = -2\beta + 0 + 3\beta = \beta$$
and:
$$E(\hat{\beta}_2) = \sum_{\forall x} \frac{x^2}{5}\, p(x) = \frac{2\beta}{5} + 0 + \frac{3\beta}{5} = \beta.$$
Therefore, both estimators are unbiased estimators of $\beta$.
ii. Since both estimators are unbiased, we prefer the estimator with the smaller variance (and hence the smaller mean squared error). We have:
$$\text{Var}(\hat{\beta}_1) = \text{Var}(X) = E(X^2) - \beta^2 = 5\beta - \beta^2$$
and:
$$\text{Var}(\hat{\beta}_2) = E(\hat{\beta}_2^2) - \beta^2 = \frac{\beta}{5} - \beta^2.$$
Since $\text{Var}(\hat{\beta}_2) < \text{Var}(\hat{\beta}_1)$, we prefer $\hat{\beta}_2$.

(b) i. $X_i \sim N(0, 36)$, for $i = 1, 2, 3, 4$. We have:
$$X_1 + X_2 + X_3 \sim N(0, 108).$$
Hence:
$$P(X_1 > -X_2 - X_3 + 22) = P(X_1 + X_2 + X_3 > 22) = P\left(Z > \frac{22}{\sqrt{108}}\right) \approx P(Z > 2.12) = 0.0170.$$
ii. $X_i/6 \sim N(0, 1)$, and so $X_i^2/36 \sim \chi^2_1$. Hence $\sum_{i=1}^{4} X_i^2/36 \sim \chi^2_4$. Therefore:
$$P\left(\sum_{i=1}^{4} X_i^2 > 342\right) = P\left(\sum_{i=1}^{4} \frac{X_i^2}{36} > \frac{342}{36}\right) = P(X > 9.5) \approx 0.05$$
where $X \sim \chi^2_4$.
iii. We have:
$$P\left(X_1 < 7.02\sqrt{X_2^2 + X_3^2}\right) = P\left(\frac{X_1/6}{\sqrt{(X_2^2 + X_3^2)/(36 \times 2)}} < \sqrt{2} \times 7.02\right) = P(T < 9.928) \approx 0.995$$
where $T \sim t_2$.
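These three probabilities can be checked against exact distribution functions rather than tables; a sketch in Python, assuming scipy is available:

    from scipy.stats import norm, chi2, t

    print(norm.sf(22 / 108 ** 0.5))         # (i)   ~0.017
    print(chi2.sf(342 / 36, df=4))          # (ii)  ~0.05
    print(t.cdf(7.02 * 2 ** 0.5, df=2))     # (iii) ~0.995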


3. (a) i. We have to minimise:
$$S = \sum_{i=1}^{3} \varepsilon_i^2 = (y_1 - \alpha + \beta)^2 + (y_2 - \alpha - 2\beta)^2 + (y_3 + \alpha + \beta)^2.$$
The first-order conditions are (remembering the chain rule):
$$\frac{\partial S}{\partial \alpha} = -2(y_1 - \alpha + \beta) - 2(y_2 - \alpha - 2\beta) + 2(y_3 + \alpha + \beta) = 2(3\alpha + 2\beta + (-y_1 - y_2 + y_3))$$
and:
$$\frac{\partial S}{\partial \beta} = 2(y_1 - \alpha + \beta) - 4(y_2 - \alpha - 2\beta) + 2(y_3 + \alpha + \beta) = 2(2\alpha + 6\beta + (y_1 - 2y_2 + y_3)).$$
The estimators $\hat{\alpha}$ and $\hat{\beta}$ are the solutions of the equations $\partial S/\partial\alpha = 0$ and $\partial S/\partial\beta = 0$. Hence:
$$3\hat{\alpha} + 2\hat{\beta} = y_1 + y_2 - y_3 \quad \text{and} \quad 2\hat{\alpha} + 6\hat{\beta} = -y_1 + 2y_2 - y_3.$$
Solving yields:
$$\hat{\alpha} = \frac{4y_1 + y_2 - 2y_3}{7} \quad \text{and} \quad \hat{\beta} = \frac{-5y_1 + 4y_2 - y_3}{14}.$$
ii. We determine the variance of $\hat{\alpha}$. Note that $\text{Var}(y_i) = \text{Var}(\varepsilon_i) = 3$ for all $i$. Due to independence:
$$\text{Var}(\hat{\alpha}) = \left(\frac{16}{49} + \frac{1}{49} + \frac{4}{49}\right) \times 3 = \frac{63}{49} = 1.2857.$$

(b) i. The first population moment is:
$$E(X) = \int_{-\infty}^{\infty} x f(x)\, dx = \int_0^1 x \cdot \theta x^{\theta - 1}\, dx = \int_0^1 \theta x^{\theta}\, dx = \left[\frac{\theta x^{\theta+1}}{\theta+1}\right]_0^1 = \frac{\theta}{\theta+1}.$$
Estimating the first population moment with the first sample moment, we get:
$$\frac{\hat{\theta}}{\hat{\theta} + 1} = \bar{X} \quad \Rightarrow \quad \hat{\theta} = \frac{\bar{X}}{1 - \bar{X}}.$$
ii. Due to independence, the likelihood function is:
$$L(\theta) = \prod_{i=1}^{n} \theta X_i^{\theta-1} = \theta^n \prod_{i=1}^{n} X_i^{\theta-1}.$$

The log-likelihood function is:
$$l(\theta) = \log L(\theta) = n\log\theta + (\theta - 1)\sum_{i=1}^{n} \log X_i.$$
Differentiating with respect to $\theta$ gives us:
$$\frac{d}{d\theta}\, l(\theta) = \frac{n}{\theta} + \sum_{i=1}^{n} \log X_i.$$
Setting to zero and solving for $\hat{\theta}$ returns the maximum likelihood estimator:
$$\frac{n}{\hat{\theta}} + \sum_{i=1}^{n} \log X_i = 0 \quad \Rightarrow \quad \hat{\theta} = -\frac{n}{\sum_{i=1}^{n} \log X_i}.$$

4. (a) The average monthly sales in each country was 71.7/12 = 5.975,
78.6/12 = 6.550, 80.1/12 = 6.675, 81.9/12 = 6.825 and 89.7/12 = 7.475. The
average of these values is 6.70. Hence the SS due to country is:

$$12 \times \left[(5.975 - 6.70)^2 + (6.550 - 6.70)^2 + (6.675 - 6.70)^2 + (6.825 - 6.70)^2 + (7.475 - 6.70)^2\right] = 13.980$$
and MS due to country is 13.98/(5 − 1) = 3.495. Degrees of freedom are


5 − 1 = 4, 12 − 1 = 11, (12 − 1)(5 − 1) = 44 and 12 × 5 − 1 = 59 for country,
months, error and total sum of squares, respectively.
We have that the MS due to residuals is given by 42.236/44 = 0.960. The
F -value for month of the year is 0.915, from which it follows that the MS due
to months is 0.915 × 0.960 = 0.878 and SS is 11 × 0.878 = 9.658. The F -value
for country is 3.495/0.960 = 3.641. The total sum of squares is
13.980 + 9.658 + 42.236 = 65.874. To summarise:
Source Degrees of freedom Sum of squares Mean square F -value
Country 4 13.98 3.495 3.641
Month 11 9.658 0.878 0.915
Error 44 42.236 0.960
Total 59 65.874
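The arithmetic used to complete the table can be scripted as a check; a short sketch in Python:

    # completing the two-way ANOVA table from the information given in the question
    totals = [71.7, 78.6, 80.1, 81.9, 89.7]   # annual sales per country
    months, countries = 12, 5

    means = [tot / months for tot in totals]
    grand = sum(means) / countries
    ss_country = months * sum((m - grand) ** 2 for m in means)   # ~13.980

    df_country, df_month = countries - 1, months - 1
    df_error = df_country * df_month                             # 44
    ms_error = 42.236 / df_error                                 # ~0.960
    ms_month = 0.915 * ms_error                                  # ~0.878 (from the given F-value)
    ss_month = df_month * ms_month                               # ~9.658

    f_country = (ss_country / df_country) / ms_error             # ~3.641
    print(ss_country, ss_month, f_country)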

(b) i. We have that


P (X = −1) = 1 − P (X = 0) − P (X = 1) = 1 − 0.2 − 0.5 = 0.3. Hence:

P (Y = 0) = P (Y = 0, X = −1) + P (Y = 0, X = 0) + P (Y = 0, X = 1)
= P (Y = 0 | X = −1) P (X = −1) + P (Y = 0 | X = 0) P (X = 0)
+ P (Y = 0 | X = 1) P (X = 1)
= (1 − 0.8) × 0.3 + (1 − 0.5) × 0.2 + (1 − 0.6) × 0.5
= 0.36


and:
$$E(Y) = \sum_{y=0}^{1} y\, P(Y = y) = 0 \times 0.36 + 1 \times 0.64 = 0.64.$$

ii. We find:
$$P(Y = 0 \mid X + Y \ge 0) = \frac{P(Y = 0, X + Y \ge 0)}{P(X + Y \ge 0)} = \frac{P(Y = 0, X \ge 0)}{1 - P(X + Y = -1)}$$
$$= \frac{P(Y = 0 \mid X = 0)\, P(X = 0) + P(Y = 0 \mid X = 1)\, P(X = 1)}{1 - P(Y = 0 \mid X = -1)\, P(X = -1)} = \frac{0.5 \times 0.2 + 0.4 \times 0.5}{1 - 0.2 \times 0.3} = 0.3191.$$

STATISTICAL TABLES
Cumulative normal distribution
Critical values of the t distribution
Critical values of the F distribution
Critical values of the chi-squared distribution

New Cambridge Statistical Tables pages 17-29

© C. Dougherty 2001, 2002 ([email protected]). These tables have been computed to accompany the text C. Dougherty Introduction to
Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.

TABLE A.1

Cumulative Standardized Normal Distribution

A(z) is the integral of the standardized normal distribution from −∞ to z (in other words, the
area under the curve to the left of z). It gives the probability of a normal random variable not
being more than z standard deviations above its mean. Values of z of particular importance:

z A(z)
1.645 0.9500 Lower limit of right 5% tail
1.960 0.9750 Lower limit of right 2.5% tail
2.326 0.9900 Lower limit of right 1% tail
2.576 0.9950 Lower limit of right 0.5% tail
3.090 0.9990 Lower limit of right 0.1% tail
3.291 0.9995 Lower limit of right 0.05% tail

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
3.5 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998 0.9998
3.6 0.9998 0.9998 0.9999


TABLE A.2
t Distribution: Critical Values of t

Significance level
Degrees of Two-tailed test: 10% 5% 2% 1% 0.2% 0.1%
freedom One-tailed test: 5% 2.5% 1% 0.5% 0.1% 0.05%
1 6.314 12.706 31.821 63.657 318.309 636.619
2 2.920 4.303 6.965 9.925 22.327 31.599
3 2.353 3.182 4.541 5.841 10.215 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.894 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.768
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
32 1.694 2.037 2.449 2.738 3.365 3.622
34 1.691 2.032 2.441 2.728 3.348 3.601
36 1.688 2.028 2.434 2.719 3.333 3.582
38 1.686 2.024 2.429 2.712 3.319 3.566
40 1.684 2.021 2.423 2.704 3.307 3.551
42 1.682 2.018 2.418 2.698 3.296 3.538
44 1.680 2.015 2.414 2.692 3.286 3.526
46 1.679 2.013 2.410 2.687 3.277 3.515
48 1.677 2.011 2.407 2.682 3.269 3.505
50 1.676 2.009 2.403 2.678 3.261 3.496
60 1.671 2.000 2.390 2.660 3.232 3.460
70 1.667 1.994 2.381 2.648 3.211 3.435
80 1.664 1.990 2.374 2.639 3.195 3.416
90 1.662 1.987 2.368 2.632 3.183 3.402
100 1.660 1.984 2.364 2.626 3.174 3.390
120 1.658 1.980 2.358 2.617 3.160 3.373
150 1.655 1.976 2.351 2.609 3.145 3.357
200 1.653 1.972 2.345 2.601 3.131 3.340
300 1.650 1.968 2.339 2.592 3.118 3.323
400 1.649 1.966 2.336 2.588 3.111 3.315
500 1.648 1.965 2.334 2.586 3.107 3.310
600 1.647 1.964 2.333 2.584 3.104 3.307
∞ 1.645 1.960 2.326 2.576 3.090 3.291


TABLE A.3

F Distribution: Critical Values of F (5% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 161.45 199.50 215.71 224.58 230.16 233.99 236.77 238.88 240.54 241.88 243.91 245.36 246.46 247.32 248.01
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38 19.40 19.41 19.42 19.43 19.44 19.45
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79 8.74 8.71 8.69 8.67 8.66
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96 5.91 5.87 5.84 5.82 5.80
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74 4.68 4.64 4.60 4.58 4.56
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06 4.00 3.96 3.92 3.90 3.87
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64 3.57 3.53 3.49 3.47 3.44
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35 3.28 3.24 3.20 3.17 3.15
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14 3.07 3.03 2.99 2.96 2.94
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98 2.91 2.86 2.83 2.80 2.77
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85 2.79 2.74 2.70 2.67 2.65
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75 2.69 2.64 2.60 2.57 2.54
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67 2.60 2.55 2.51 2.48 2.46
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60 2.53 2.48 2.44 2.41 2.39
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54 2.48 2.42 2.38 2.35 2.33
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49 2.42 2.37 2.33 2.30 2.28
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45 2.38 2.33 2.29 2.26 2.23
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41 2.34 2.29 2.25 2.22 2.19
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38 2.31 2.26 2.21 2.18 2.16
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35 2.28 2.22 2.18 2.15 2.12
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32 2.25 2.20 2.16 2.12 2.10
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30 2.23 2.17 2.13 2.10 2.07
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27 2.20 2.15 2.11 2.08 2.05
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25 2.18 2.13 2.09 2.05 2.03
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24 2.16 2.11 2.07 2.04 2.01
26 4.22 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22 2.15 2.09 2.05 2.02 1.99
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20 2.13 2.08 2.04 2.00 1.97
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19 2.12 2.06 2.02 1.99 1.96
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18 2.10 2.05 2.01 1.97 1.94
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16 2.09 2.04 1.99 1.96 1.93
35 4.12 3.27 2.87 2.64 2.49 2.37 2.29 2.22 2.16 2.11 2.04 1.99 1.94 1.91 1.88
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08 2.00 1.95 1.90 1.87 1.84
50 4.03 3.18 2.79 2.56 2.40 2.29 2.20 2.13 2.07 2.03 1.95 1.89 1.85 1.81 1.78
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99 1.92 1.86 1.82 1.78 1.75
70 3.98 3.13 2.74 2.50 2.35 2.23 2.14 2.07 2.02 1.97 1.89 1.84 1.79 1.75 1.72
80 3.96 3.11 2.72 2.49 2.33 2.21 2.13 2.06 2.00 1.95 1.88 1.82 1.77 1.73 1.70
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94 1.86 1.80 1.76 1.72 1.69
100 3.94 3.09 2.70 2.46 2.31 2.19 2.10 2.03 1.97 1.93 1.85 1.79 1.75 1.71 1.68
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91 1.83 1.78 1.73 1.69 1.66
150 3.90 3.06 2.66 2.43 2.27 2.16 2.07 2.00 1.94 1.89 1.82 1.76 1.71 1.67 1.64
200 3.89 3.04 2.65 2.42 2.26 2.14 2.06 1.98 1.93 1.88 1.80 1.74 1.69 1.66 1.62
250 3.88 3.03 2.64 2.41 2.25 2.13 2.05 1.98 1.92 1.87 1.79 1.73 1.68 1.65 1.61
300 3.87 3.03 2.63 2.40 2.24 2.13 2.04 1.97 1.91 1.86 1.78 1.72 1.68 1.64 1.61
400 3.86 3.02 2.63 2.39 2.24 2.12 2.03 1.96 1.90 1.85 1.78 1.72 1.67 1.63 1.60
500 3.86 3.01 2.62 2.39 2.23 2.12 2.03 1.96 1.90 1.85 1.77 1.71 1.66 1.62 1.59
600 3.86 3.01 2.62 2.39 2.23 2.11 2.02 1.95 1.90 1.85 1.77 1.71 1.66 1.62 1.59
750 3.85 3.01 2.62 2.38 2.23 2.11 2.02 1.95 1.89 1.84 1.77 1.70 1.66 1.62 1.58
1000 3.85 3.00 2.61 2.38 2.22 2.11 2.02 1.95 1.89 1.84 1.76 1.70 1.65 1.61 1.58


TABLE A.3 (continued)

F Distribution: Critical Values of F (5% significance level)

v1 25 30 35 40 50 60 75 100 150 200


v2
1 249.26 250.10 250.69 251.14 251.77 252.20 252.62 253.04 253.46 253.68
2 19.46 19.46 19.47 19.47 19.48 19.48 19.48 19.49 19.49 19.49
3 8.63 8.62 8.60 8.59 8.58 8.57 8.56 8.55 8.54 8.54
4 5.77 5.75 5.73 5.72 5.70 5.69 5.68 5.66 5.65 5.65
5 4.52 4.50 4.48 4.46 4.44 4.43 4.42 4.41 4.39 4.39
6 3.83 3.81 3.79 3.77 3.75 3.74 3.73 3.71 3.70 3.69
7 3.40 3.38 3.36 3.34 3.32 3.30 3.29 3.27 3.26 3.25
8 3.11 3.08 3.06 3.04 3.02 3.01 2.99 2.97 2.96 2.95
9 2.89 2.86 2.84 2.83 2.80 2.79 2.77 2.76 2.74 2.73
10 2.73 2.70 2.68 2.66 2.64 2.62 2.60 2.59 2.57 2.56
11 2.60 2.57 2.55 2.53 2.51 2.49 2.47 2.46 2.44 2.43
12 2.50 2.47 2.44 2.43 2.40 2.38 2.37 2.35 2.33 2.32
13 2.41 2.38 2.36 2.34 2.31 2.30 2.28 2.26 2.24 2.23
14 2.34 2.31 2.28 2.27 2.24 2.22 2.21 2.19 2.17 2.16
15 2.28 2.25 2.22 2.20 2.18 2.16 2.14 2.12 2.10 2.10
16 2.23 2.19 2.17 2.15 2.12 2.11 2.09 2.07 2.05 2.04
17 2.18 2.15 2.12 2.10 2.08 2.06 2.04 2.02 2.00 1.99
18 2.14 2.11 2.08 2.06 2.04 2.02 2.00 1.98 1.96 1.95
19 2.11 2.07 2.05 2.03 2.00 1.98 1.96 1.94 1.92 1.91
20 2.07 2.04 2.01 1.99 1.97 1.95 1.93 1.91 1.89 1.88
21 2.05 2.01 1.98 1.96 1.94 1.92 1.90 1.88 1.86 1.84
22 2.02 1.98 1.96 1.94 1.91 1.89 1.87 1.85 1.83 1.82
23 2.00 1.96 1.93 1.91 1.88 1.86 1.84 1.82 1.80 1.79
24 1.97 1.94 1.91 1.89 1.86 1.84 1.82 1.80 1.78 1.77
25 1.96 1.92 1.89 1.87 1.84 1.82 1.80 1.78 1.76 1.75
26 1.94 1.90 1.87 1.85 1.82 1.80 1.78 1.76 1.74 1.73
27 1.92 1.88 1.86 1.84 1.81 1.79 1.76 1.74 1.72 1.71
28 1.91 1.87 1.84 1.82 1.79 1.77 1.75 1.73 1.70 1.69
29 1.89 1.85 1.83 1.81 1.77 1.75 1.73 1.71 1.69 1.67
30 1.88 1.84 1.81 1.79 1.76 1.74 1.72 1.70 1.67 1.66
35 1.82 1.79 1.76 1.74 1.70 1.68 1.66 1.63 1.61 1.60
40 1.78 1.74 1.72 1.69 1.66 1.64 1.61 1.59 1.56 1.55
50 1.73 1.69 1.66 1.63 1.60 1.58 1.55 1.52 1.50 1.48
60 1.69 1.65 1.62 1.59 1.56 1.53 1.51 1.48 1.45 1.44
70 1.66 1.62 1.59 1.57 1.53 1.50 1.48 1.45 1.42 1.40
80 1.64 1.60 1.57 1.54 1.51 1.48 1.45 1.43 1.39 1.38
90 1.63 1.59 1.55 1.53 1.49 1.46 1.44 1.41 1.38 1.36
100 1.62 1.57 1.54 1.52 1.48 1.45 1.42 1.39 1.36 1.34
120 1.60 1.55 1.52 1.50 1.46 1.43 1.40 1.37 1.33 1.32
150 1.58 1.54 1.50 1.48 1.44 1.41 1.38 1.34 1.31 1.29
200 1.56 1.52 1.48 1.46 1.41 1.39 1.35 1.32 1.28 1.26
250 1.55 1.50 1.47 1.44 1.40 1.37 1.34 1.31 1.27 1.25
300 1.54 1.50 1.46 1.43 1.39 1.36 1.33 1.30 1.26 1.23
400 1.53 1.49 1.45 1.42 1.38 1.35 1.32 1.28 1.24 1.22
500 1.53 1.48 1.45 1.42 1.38 1.35 1.31 1.28 1.23 1.21
600 1.52 1.48 1.44 1.41 1.37 1.34 1.31 1.27 1.23 1.20
750 1.52 1.47 1.44 1.41 1.37 1.34 1.30 1.26 1.22 1.20
1000 1.52 1.47 1.43 1.41 1.36 1.33 1.30 1.26 1.22 1.19


TABLE A.3 (continued)

F Distribution: Critical Values of F (1% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4052.18 4999.50 5403.35 5624.58 5763.65 5858.99 5928.36 5981.07 6022.47 6055.85 6106.32 6142.67 6170.10 6191.53 6208.73
2 98.50 99.00 99.17 99.25 99.30 99.33 99.36 99.37 99.39 99.40 99.42 99.43 99.44 99.44 99.45
3 34.12 30.82 29.46 28.71 28.24 27.91 27.67 27.49 27.35 27.23 27.05 26.92 26.83 26.75 26.69
4 21.20 18.00 16.69 15.98 15.52 15.21 14.98 14.80 14.66 14.55 14.37 14.25 14.15 14.08 14.02
5 16.26 13.27 12.06 11.39 10.97 10.67 10.46 10.29 10.16 10.05 9.89 9.77 9.68 9.61 9.55
6 13.75 10.92 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87 7.72 7.60 7.52 7.45 7.40
7 12.25 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62 6.47 6.36 6.28 6.21 6.16
8 11.26 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81 5.67 5.56 5.48 5.41 5.36
9 10.56 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26 5.11 5.01 4.92 4.86 4.81
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85 4.71 4.60 4.52 4.46 4.41
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54 4.40 4.29 4.21 4.15 4.10
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30 4.16 4.05 3.97 3.91 3.86
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10 3.96 3.86 3.78 3.72 3.66
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94 3.80 3.70 3.62 3.56 3.51
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80 3.67 3.56 3.49 3.42 3.37
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69 3.55 3.45 3.37 3.31 3.26
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59 3.46 3.35 3.27 3.21 3.16
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51 3.37 3.27 3.19 3.13 3.08
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43 3.30 3.19 3.12 3.05 3.00
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37 3.23 3.13 3.05 2.99 2.94
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31 3.17 3.07 2.99 2.93 2.88
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26 3.12 3.02 2.94 2.88 2.83
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21 3.07 2.97 2.89 2.83 2.78
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17 3.03 2.93 2.85 2.79 2.74
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13 2.99 2.89 2.81 2.75 2.70
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09 2.96 2.86 2.78 2.72 2.66
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06 2.93 2.82 2.75 2.68 2.63
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03 2.90 2.79 2.72 2.65 2.60
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00 2.87 2.77 2.69 2.63 2.57
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98 2.84 2.74 2.66 2.60 2.55
35 7.42 5.27 4.40 3.91 3.59 3.37 3.20 3.07 2.96 2.88 2.74 2.64 2.56 2.50 2.44
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80 2.66 2.56 2.48 2.42 2.37
50 7.17 5.06 4.20 3.72 3.41 3.19 3.02 2.89 2.78 2.70 2.56 2.46 2.38 2.32 2.27
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63 2.50 2.39 2.31 2.25 2.20
70 7.01 4.92 4.07 3.60 3.29 3.07 2.91 2.78 2.67 2.59 2.45 2.35 2.27 2.20 2.15
80 6.96 4.88 4.04 3.56 3.26 3.04 2.87 2.74 2.64 2.55 2.42 2.31 2.23 2.17 2.12
90 6.93 4.85 4.01 3.53 3.23 3.01 2.84 2.72 2.61 2.52 2.39 2.29 2.21 2.14 2.09
100 6.90 4.82 3.98 3.51 3.21 2.99 2.82 2.69 2.59 2.50 2.37 2.27 2.19 2.12 2.07
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47 2.34 2.23 2.15 2.09 2.03
150 6.81 4.75 3.91 3.45 3.14 2.92 2.76 2.63 2.53 2.44 2.31 2.20 2.12 2.06 2.00
200 6.76 4.71 3.88 3.41 3.11 2.89 2.73 2.60 2.50 2.41 2.27 2.17 2.09 2.03 1.97
250 6.74 4.69 3.86 3.40 3.09 2.87 2.71 2.58 2.48 2.39 2.26 2.15 2.07 2.01 1.95
300 6.72 4.68 3.85 3.38 3.08 2.86 2.70 2.57 2.47 2.38 2.24 2.14 2.06 1.99 1.94
400 6.70 4.66 3.83 3.37 3.06 2.85 2.68 2.56 2.45 2.37 2.23 2.13 2.05 1.98 1.92
500 6.69 4.65 3.82 3.36 3.05 2.84 2.68 2.55 2.44 2.36 2.22 2.12 2.04 1.97 1.92
600 6.68 4.64 3.81 3.35 3.05 2.83 2.67 2.54 2.44 2.35 2.21 2.11 2.03 1.96 1.91
750 6.67 4.63 3.81 3.34 3.04 2.83 2.66 2.53 2.43 2.34 2.21 2.11 2.02 1.96 1.90
1000 6.66 4.63 3.80 3.34 3.04 2.82 2.66 2.53 2.43 2.34 2.20 2.10 2.02 1.95 1.90


TABLE A.3 (continued)

F Distribution: Critical Values of F (1% significance level)

v1 25 30 35 40 50 60 75 100 150 200


v2
1 6239.83 6260.65 6275.57 6286.78 6302.52 6313.03 6323.56 6334.11 6344.68 6349.97
2 99.46 99.47 99.47 99.47 99.48 99.48 99.49 99.49 99.49 99.49
3 26.58 26.50 26.45 26.41 26.35 26.32 26.28 26.24 26.20 26.18
4 13.91 13.84 13.79 13.75 13.69 13.65 13.61 13.58 13.54 13.52
5 9.45 9.38 9.33 9.29 9.24 9.20 9.17 9.13 9.09 9.08
6 7.30 7.23 7.18 7.14 7.09 7.06 7.02 6.99 6.95 6.93
7 6.06 5.99 5.94 5.91 5.86 5.82 5.79 5.75 5.72 5.70
8 5.26 5.20 5.15 5.12 5.07 5.03 5.00 4.96 4.93 4.91
9 4.71 4.65 4.60 4.57 4.52 4.48 4.45 4.41 4.38 4.36
10 4.31 4.25 4.20 4.17 4.12 4.08 4.05 4.01 3.98 3.96
11 4.01 3.94 3.89 3.86 3.81 3.78 3.74 3.71 3.67 3.66
12 3.76 3.70 3.65 3.62 3.57 3.54 3.50 3.47 3.43 3.41
13 3.57 3.51 3.46 3.43 3.38 3.34 3.31 3.27 3.24 3.22
14 3.41 3.35 3.30 3.27 3.22 3.18 3.15 3.11 3.08 3.06
15 3.28 3.21 3.17 3.13 3.08 3.05 3.01 2.98 2.94 2.92
16 3.16 3.10 3.05 3.02 2.97 2.93 2.90 2.86 2.83 2.81
17 3.07 3.00 2.96 2.92 2.87 2.83 2.80 2.76 2.73 2.71
18 2.98 2.92 2.87 2.84 2.78 2.75 2.71 2.68 2.64 2.62
19 2.91 2.84 2.80 2.76 2.71 2.67 2.64 2.60 2.57 2.55
20 2.84 2.78 2.73 2.69 2.64 2.61 2.57 2.54 2.50 2.48
21 2.79 2.72 2.67 2.64 2.58 2.55 2.51 2.48 2.44 2.42
22 2.73 2.67 2.62 2.58 2.53 2.50 2.46 2.42 2.38 2.36
23 2.69 2.62 2.57 2.54 2.48 2.45 2.41 2.37 2.34 2.32
24 2.64 2.58 2.53 2.49 2.44 2.40 2.37 2.33 2.29 2.27
25 2.60 2.54 2.49 2.45 2.40 2.36 2.33 2.29 2.25 2.23
26 2.57 2.50 2.45 2.42 2.36 2.33 2.29 2.25 2.21 2.19
27 2.54 2.47 2.42 2.38 2.33 2.29 2.26 2.22 2.18 2.16
28 2.51 2.44 2.39 2.35 2.30 2.26 2.23 2.19 2.15 2.13
29 2.48 2.41 2.36 2.33 2.27 2.23 2.20 2.16 2.12 2.10
30 2.45 2.39 2.34 2.30 2.25 2.21 2.17 2.13 2.09 2.07
35 2.35 2.28 2.23 2.19 2.14 2.10 2.06 2.02 1.98 1.96
40 2.27 2.20 2.15 2.11 2.06 2.02 1.98 1.94 1.90 1.87
50 2.17 2.10 2.05 2.01 1.95 1.91 1.87 1.82 1.78 1.76
60 2.10 2.03 1.98 1.94 1.88 1.84 1.79 1.75 1.70 1.68
70 2.05 1.98 1.93 1.89 1.83 1.78 1.74 1.70 1.65 1.62
80 2.01 1.94 1.89 1.85 1.79 1.75 1.70 1.65 1.61 1.58
90 1.99 1.92 1.86 1.82 1.76 1.72 1.67 1.62 1.57 1.55
100 1.97 1.89 1.84 1.80 1.74 1.69 1.65 1.60 1.55 1.52
120 1.93 1.86 1.81 1.76 1.70 1.66 1.61 1.56 1.51 1.48
150 1.90 1.83 1.77 1.73 1.66 1.62 1.57 1.52 1.46 1.43
200 1.87 1.79 1.74 1.69 1.63 1.58 1.53 1.48 1.42 1.39
250 1.85 1.77 1.72 1.67 1.61 1.56 1.51 1.46 1.40 1.36
300 1.84 1.76 1.70 1.66 1.59 1.55 1.50 1.44 1.38 1.35
400 1.82 1.75 1.69 1.64 1.58 1.53 1.48 1.42 1.36 1.32
500 1.81 1.74 1.68 1.63 1.57 1.52 1.47 1.41 1.34 1.31
600 1.80 1.73 1.67 1.63 1.56 1.51 1.46 1.40 1.34 1.30
750 1.80 1.72 1.66 1.62 1.55 1.50 1.45 1.39 1.33 1.29
1000 1.79 1.72 1.66 1.61 1.54 1.50 1.44 1.38 1.32 1.28


TABLE A.3 (continued)

F Distribution: Critical Values of F (0.1% significance level)

v1 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20
v2
1 4.05e05 5.00e05 5.40e05 5.62e05 5.76e05 5.86e05 5.93e05 5.98e05 6.02e05 6.06e05 6.11e05 6.14e05 6.17e05 6.19e05 6.21e05
2 998.50 999.00 999.17 999.25 999.30 999.33 999.36 999.37 999.39 999.40 999.42 999.43 999.44 999.44 999.45
3 167.03 148.50 141.11 137.10 134.58 132.85 131.58 130.62 129.86 129.25 128.32 127.64 127.14 126.74 126.42
4 74.14 61.25 56.18 53.44 51.71 50.53 49.66 49.00 48.47 48.05 47.41 46.95 46.60 46.32 46.10
5 47.18 37.12 33.20 31.09 29.75 28.83 28.16 27.65 27.24 26.92 26.42 26.06 25.78 25.57 25.39
6 35.51 27.00 23.70 21.92 20.80 20.03 19.46 19.03 18.69 18.41 17.99 17.68 17.45 17.27 17.12
7 29.25 21.69 18.77 17.20 16.21 15.52 15.02 14.63 14.33 14.08 13.71 13.43 13.23 13.06 12.93
8 25.41 18.49 15.83 14.39 13.48 12.86 12.40 12.05 11.77 11.54 11.19 10.94 10.75 10.60 10.48
9 22.86 16.39 13.90 12.56 11.71 11.13 10.70 10.37 10.11 9.89 9.57 9.33 9.15 9.01 8.90
10 21.04 14.91 12.55 11.28 10.48 9.93 9.52 9.20 8.96 8.75 8.45 8.22 8.05 7.91 7.80
11 19.69 13.81 11.56 10.35 9.58 9.05 8.66 8.35 8.12 7.92 7.63 7.41 7.24 7.11 7.01
12 18.64 12.97 10.80 9.63 8.89 8.38 8.00 7.71 7.48 7.29 7.00 6.79 6.63 6.51 6.40
13 17.82 12.31 10.21 9.07 8.35 7.86 7.49 7.21 6.98 6.80 6.52 6.31 6.16 6.03 5.93
14 17.14 11.78 9.73 8.62 7.92 7.44 7.08 6.80 6.58 6.40 6.13 5.93 5.78 5.66 5.56
15 16.59 11.34 9.34 8.25 7.57 7.09 6.74 6.47 6.26 6.08 5.81 5.62 5.46 5.35 5.25
16 16.12 10.97 9.01 7.94 7.27 6.80 6.46 6.19 5.98 5.81 5.55 5.35 5.20 5.09 4.99
17 15.72 10.66 8.73 7.68 7.02 6.56 6.22 5.96 5.75 5.58 5.32 5.13 4.99 4.87 4.78
18 15.38 10.39 8.49 7.46 6.81 6.35 6.02 5.76 5.56 5.39 5.13 4.94 4.80 4.68 4.59
19 15.08 10.16 8.28 7.27 6.62 6.18 5.85 5.59 5.39 5.22 4.97 4.78 4.64 4.52 4.43
20 14.82 9.95 8.10 7.10 6.46 6.02 5.69 5.44 5.24 5.08 4.82 4.64 4.49 4.38 4.29
21 14.59 9.77 7.94 6.95 6.32 5.88 5.56 5.31 5.11 4.95 4.70 4.51 4.37 4.26 4.17
22 14.38 9.61 7.80 6.81 6.19 5.76 5.44 5.19 4.99 4.83 4.58 4.40 4.26 4.15 4.06
23 14.20 9.47 7.67 6.70 6.08 5.65 5.33 5.09 4.89 4.73 4.48 4.30 4.16 4.05 3.96
24 14.03 9.34 7.55 6.59 5.98 5.55 5.23 4.99 4.80 4.64 4.39 4.21 4.07 3.96 3.87
25 13.88 9.22 7.45 6.49 5.89 5.46 5.15 4.91 4.71 4.56 4.31 4.13 3.99 3.88 3.79
26 13.74 9.12 7.36 6.41 5.80 5.38 5.07 4.83 4.64 4.48 4.24 4.06 3.92 3.81 3.72
27 13.61 9.02 7.27 6.33 5.73 5.31 5.00 4.76 4.57 4.41 4.17 3.99 3.86 3.75 3.66
28 13.50 8.93 7.19 6.25 5.66 5.24 4.93 4.69 4.50 4.35 4.11 3.93 3.80 3.69 3.60
29 13.39 8.85 7.12 6.19 5.59 5.18 4.87 4.64 4.45 4.29 4.05 3.88 3.74 3.63 3.54
30 13.29 8.77 7.05 6.12 5.53 5.12 4.82 4.58 4.39 4.24 4.00 3.82 3.69 3.58 3.49
35 12.90 8.47 6.79 5.88 5.30 4.89 4.59 4.36 4.18 4.03 3.79 3.62 3.48 3.38 3.29
40 12.61 8.25 6.59 5.70 5.13 4.73 4.44 4.21 4.02 3.87 3.64 3.47 3.34 3.23 3.14
50 12.22 7.96 6.34 5.46 4.90 4.51 4.22 4.00 3.82 3.67 3.44 3.27 3.41 3.04 2.95
60 11.97 7.77 6.17 5.31 4.76 4.37 4.09 3.86 3.69 3.54 3.32 3.15 3.02 2.91 2.83
70 11.80 7.64 6.06 5.20 4.66 4.28 3.99 3.77 3.60 3.45 3.23 3.06 2.93 2.83 2.74
80 11.67 7.54 5.97 5.12 4.58 4.20 3.92 3.70 3.53 3.39 3.16 3.00 2.87 2.76 2.68
90 11.57 7.47 5.91 5.06 4.53 4.15 3.87 3.65 3.48 3.34 3.11 2.95 2.82 2.71 2.63
100 11.50 7.41 5.86 5.02 4.48 4.11 3.83 3.61 3.44 3.30 3.07 2.91 2.78 2.68 2.59
120 11.38 7.32 5.78 4.95 4.42 4.04 3.77 3.55 3.38 3.24 3.02 2.85 2.72 2.62 2.53
150 11.27 7.24 5.71 4.88 4.35 3.98 3.71 3.49 3.32 3.18 2.96 2.80 2.67 2.56 2.48
200 11.15 7.15 5.63 4.81 4.29 3.92 3.65 3.43 3.26 3.12 2.90 2.74 2.61 2.51 2.42
250 11.09 7.10 5.59 4.77 4.25 3.88 3.61 3.40 3.23 3.09 2.87 2.71 2.58 2.48 2.39
300 11.04 7.07 5.56 4.75 4.22 3.86 3.59 3.38 3.21 3.07 2.85 2.69 2.56 2.46 2.37
400 10.99 7.03 5.53 4.71 4.19 3.83 3.56 3.35 3.18 3.04 2.82 2.66 2.53 2.43 2.34
500 10.96 7.00 5.51 4.69 4.18 3.81 3.54 3.33 3.16 3.02 2.81 2.64 2.52 2.41 2.33
600 10.94 6.99 5.49 4.68 4.16 3.80 3.53 3.32 3.15 3.01 2.80 2.63 2.51 2.40 2.32
750 10.91 6.97 5.48 4.67 4.15 3.79 3.52 3.31 3.14 3.00 2.78 2.62 2.49 2.39 2.31
1000 10.89 6.96 5.46 4.65 4.14 3.78 3.51 3.30 3.13 2.99 2.77 2.61 2.48 2.38 2.30


TABLE A.3 (continued)

F Distribution: Critical Values of F (0.1% significance level)

v1 25 30 35 40 50 60 75 100 150 200
v2
1 6.24e05 6.26e05 6.28e05 6.29e05 6.30e05 6.31e05 6.32e05 6.33e05 6.35e05 6.35e05
2 999.46 999.47 999.47 999.47 999.48 999.48 999.49 999.49 999.49 999.49
3 125.84 125.45 125.17 124.96 124.66 124.47 124.27 124.07 123.87 123.77
4 45.70 45.43 45.23 45.09 44.88 44.75 44.61 44.47 44.33 44.26
5 25.08 24.87 24.72 24.60 24.44 24.33 24.22 24.12 24.01 23.95
6 16.85 16.67 16.54 16.44 16.31 16.21 16.12 16.03 15.93 15.89
7 12.69 12.53 12.41 12.33 12.20 12.12 12.04 11.95 11.87 11.82
8 10.26 10.11 10.00 9.92 9.80 9.73 9.65 9.57 9.49 9.45
9 8.69 8.55 8.46 8.37 8.26 8.19 8.11 8.04 7.96 7.93
10 7.60 7.47 7.37 7.30 7.19 7.12 7.05 6.98 6.91 6.87
11 6.81 6.68 6.59 6.52 6.42 6.35 6.28 6.21 6.14 6.10
12 6.22 6.09 6.00 5.93 5.83 5.76 5.70 5.63 5.56 5.52
13 5.75 5.63 5.54 5.47 5.37 5.30 5.24 5.17 5.10 5.07
14 5.38 5.25 5.17 5.10 5.00 4.94 4.87 4.81 4.74 4.71
15 5.07 4.95 4.86 4.80 4.70 4.64 4.57 4.51 4.44 4.41
16 4.82 4.70 4.61 4.54 4.45 4.39 4.32 4.26 4.19 4.16
17 4.60 4.48 4.40 4.33 4.24 4.18 4.11 4.05 3.98 3.95
18 4.42 4.30 4.22 4.15 4.06 4.00 3.93 3.87 3.80 3.77
19 4.26 4.14 4.06 3.99 3.90 3.84 3.78 3.71 3.65 3.61
20 4.12 4.00 3.92 3.86 3.77 3.70 3.64 3.58 3.51 3.48
21 4.00 3.88 3.80 3.74 3.64 3.58 3.52 3.46 3.39 3.36
22 3.89 3.78 3.70 3.63 3.54 3.48 3.41 3.35 3.28 3.25
23 3.79 3.68 3.60 3.53 3.44 3.38 3.32 3.25 3.19 3.16
24 3.71 3.59 3.51 3.45 3.36 3.29 3.23 3.17 3.10 3.07
25 3.63 3.52 3.43 3.37 3.28 3.22 3.15 3.09 3.03 2.99
26 3.56 3.44 3.36 3.30 3.21 3.15 3.08 3.02 2.95 2.92
27 3.49 3.38 3.30 3.23 3.14 3.08 3.02 2.96 2.89 2.86
28 3.43 3.32 3.24 3.18 3.09 3.02 2.96 2.90 2.83 2.80
29 3.38 3.27 3.18 3.12 3.03 2.97 2.91 2.84 2.78 2.74
30 3.33 3.22 3.13 3.07 2.98 2.92 2.86 2.79 2.73 2.69
35 3.13 3.02 2.93 2.87 2.78 2.72 2.66 2.59 2.52 2.49
40 2.98 2.87 2.79 2.73 2.64 2.57 2.51 2.44 2.38 2.34
50 2.79 2.68 2.60 2.53 2.44 2.38 2.31 2.25 2.18 2.14
60 2.67 2.55 2.47 2.41 2.32 2.25 2.19 2.12 2.05 2.01
70 2.58 2.47 2.39 2.32 2.23 2.16 2.10 2.03 1.95 1.92
80 2.52 2.41 2.32 2.26 2.16 2.10 2.03 1.96 1.89 1.85
90 2.47 2.36 2.27 2.21 2.11 2.05 1.98 1.91 1.83 1.79
100 2.43 2.32 2.24 2.17 2.08 2.01 1.94 1.87 1.79 1.75
120 2.37 2.26 2.18 2.11 2.02 1.95 1.88 1.81 1.73 1.68
150 2.32 2.21 2.12 2.06 1.96 1.89 1.82 1.74 1.66 1.62
200 2.26 2.15 2.07 2.00 1.90 1.83 1.76 1.68 1.60 1.55
250 2.23 2.12 2.03 1.97 1.87 1.80 1.72 1.65 1.56 1.51
300 2.21 2.10 2.01 1.94 1.85 1.78 1.70 1.62 1.53 1.48
400 2.18 2.07 1.98 1.92 1.82 1.75 1.67 1.59 1.50 1.45
500 2.17 2.05 1.97 1.90 1.80 1.73 1.65 1.57 1.48 1.43
600 2.16 2.04 1.96 1.89 1.79 1.72 1.64 1.56 1.46 1.41
750 2.15 2.03 1.95 1.88 1.78 1.71 1.63 1.55 1.45 1.40
1000 2.14 2.02 1.94 1.87 1.77 1.69 1.62 1.53 1.44 1.38

Dennis V. Lindley and William F. Scott, New Cambridge Statistical Tables (1995) © Cambridge University Press, reproduced with permission.
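
The tabulated critical values above can also be reproduced numerically. The short sketch below is illustrative only (it is not part of the New Cambridge Statistical Tables) and assumes Python with the scipy library is available; it computes the upper 0.1% point of the F distribution for one example pair of degrees of freedom and compares it with the table.

    # Illustrative sketch (assumes Python with scipy installed): reproduce a
    # 0.1% (upper-tail) critical value of the F distribution from the table above.
    from scipy.stats import f

    alpha = 0.001    # 0.1% significance level (upper tail)
    v1, v2 = 5, 10   # numerator and denominator degrees of freedom

    # The critical value leaves probability alpha in the upper tail, i.e. it is
    # the (1 - alpha) quantile of the F(v1, v2) distribution.
    critical_value = f.ppf(1 - alpha, dfn=v1, dfd=v2)

    print(round(critical_value, 2))  # approximately 10.48, matching the table entry

Other entries in the table can be checked in the same way by changing v1 and v2, or by changing alpha to match the significance level of the relevant table.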
